<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
    <title>2. Getting started: an introduction to machine learning with scikits.learn &mdash; scikits.learn v0.6.0 documentation</title>
    <link rel="stylesheet" href="_static/nature.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '',
        VERSION:     '0.6.0',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>
    <script type="text/javascript" src="_static/jquery.js"></script>
    <script type="text/javascript" src="_static/underscore.js"></script>
    <script type="text/javascript" src="_static/doctools.js"></script>
    <link rel="shortcut icon" href="_static/favicon.ico"/>
    <link rel="author" title="About these documents" href="about.html" />
    <link rel="top" title="scikits.learn v0.6.0 documentation" href="index.html" />
    <link rel="up" title="&lt;no title&gt;" href="contents.html" />
    <link rel="next" title="3. Supervised learning" href="supervised_learning.html" />
    <link rel="prev" title="1. Installing scikits.learn" href="install.html" /> 
  </head>
  <body>
    <div class="header-wrapper">
      <div class="header">
          <p class="logo"><a href="index.html">
            <img src="_static/scikit-learn-logo-small.png" alt="Logo"/>
          </a>
          </p><div class="navbar">
          <ul>
            <li><a href="install.html">Download</a></li>
            <li><a href="support.html">Support</a></li>
            <li><a href="user_guide.html">User Guide</a></li>
            <li><a href="auto_examples/index.html">Examples</a></li>
            <li><a href="developers/index.html">Development</a></li>
       </ul>

<div class="search_form">

<div id="cse" style="width: 100%;"></div>
<script src="http://www.google.com/jsapi" type="text/javascript"></script>
<script type="text/javascript">
  google.load('search', '1', {language : 'en'});
  google.setOnLoadCallback(function() {
    var customSearchControl = new google.search.CustomSearchControl('016639176250731907682:tjtqbvtvij0');
    customSearchControl.setResultSetSize(google.search.Search.FILTERED_CSE_RESULTSET);
    var options = new google.search.DrawOptions();
    options.setAutoComplete(true);
    customSearchControl.draw('cse', options);
  }, true);
</script>

</div>

          </div> <!-- end navbar --></div>
    </div>

    <div class="content-wrapper">

    <!-- <div id="blue_tile"></div> -->

        <div class="sphinxsidebar">
        <div class="rel">
          <a href="install.html" title="1. Installing scikits.learn"
             accesskey="P">previous</a> |
          <a href="supervised_learning.html" title="3. Supervised learning"
             accesskey="N">next</a> |
          <a href="genindex.html" title="General Index"
             accesskey="I">index</a>
        </div>
        

        <h3>Contents</h3>
         <ul>
<li><a class="reference internal" href="#">2. Getting started: an introduction to machine learning with scikits.learn</a><ul>
<li><a class="reference internal" href="#machine-learning-the-problem-setting">2.1. Machine learning: the problem setting</a></li>
<li><a class="reference internal" href="#loading-an-example-dataset">2.2. Loading an example dataset</a></li>
<li><a class="reference internal" href="#learning-and-predicting">2.3. Learning and Predicting</a></li>
</ul>
</li>
</ul>


        

        </div>

      <div class="content">
            
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body">
            
  <div class="section" id="getting-started-an-introduction-to-machine-learning-with-scikits-learn">
<h1>2. Getting started: an introduction to machine learning with scikits.learn<a class="headerlink" href="#getting-started-an-introduction-to-machine-learning-with-scikits-learn" title="Permalink to this headline">¶</a></h1>
<div class="topic">
<p class="topic-title first">Section contents</p>
<p>In this section, we introduce the machine learning vocabulary that we
use throughout <cite>scikits.learn</cite> and give a simple learning example.</p>
</div>
<div class="section" id="machine-learning-the-problem-setting">
<h2>2.1. Machine learning: the problem setting<a class="headerlink" href="#machine-learning-the-problem-setting" title="Permalink to this headline">¶</a></h2>
<p>In general, a learning problem considers a set of n <em>samples</em> of data and
tries to predict properties of unknown data. If each sample is more than a
single number, for instance a multi-dimensional entry (aka
<em>multivariate</em> data), it is said to have several attributes, or
<em>features</em>.</p>
<p>We can separate learning problems into a few large categories:</p>
<blockquote>
<ul>
<li><p class="first"><strong>supervised learning</strong>, in which the data comes with additional
attributes that we want to predict. This problem can be either:</p>
<blockquote>
<ul>
<li><p class="first"><strong>classification</strong>: samples belong to two or more classes and we
want to learn from already labeled data how to predict the class
of unlabeled data. An example of a classification problem would
be the digit recognition example, in which the aim is to assign
each input vector to one of a finite number of discrete
categories.</p>
</li>
<li><dl class="first docutils">
<dt><strong>regression</strong>: if the desired output consists of one or more</dt>
<dd><p class="first last">continuous variables, then the task is called <em>regression</em>. An
example of a regression problem would be the prediction of the
length of a salmon as a function of its age and weight.</p>
</dd>
</dl>
</li>
</ul>
</blockquote>
</li>
<li><dl class="first docutils">
<dt><strong>unsupervised learning</strong>, in which the training data consists of a</dt>
<dd><p class="first last">set of input vectors x without any corresponding target
values. The goal in such problems may be to discover groups of
similar examples within the data, where it is called
<em>clustering</em>, or to determine the distribution of data within the
input space, known as <em>density estimation</em>, or to project the data
from a high-dimensional space down to two or three dimensions for
the purpose of <em>visualization</em>.</p>
</dd>
</dl>
</li>
</ul>
</blockquote>
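<p>To make the regression case above more concrete, here is a minimal
sketch of an ordinary least-squares fit written in plain Python. It is
deliberately simplified to a single feature (age only, rather than age and
weight), the function name <cite>fit_line</cite> is hypothetical, and the numbers
are made up for illustration; it is not how <cite>scikits.learn</cite> implements
regression.</p>

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up numbers for illustration only: ages (years) and lengths (cm).
ages = [1.0, 2.0, 3.0, 4.0]
lengths = [30.0, 50.0, 70.0, 90.0]

slope, intercept = fit_line(ages, lengths)
# The output is a continuous value, which is what makes this regression.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```

<p>The point of the sketch is only that the predicted quantity is a
continuous number, in contrast with classification, where the prediction is
one of a finite set of labels.</p>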
<div class="topic">
<p class="topic-title first">Training set and testing set</p>
<p>Machine learning is about learning some properties of a data set and
applying them to new data. This is why a common practice in machine
learning to evaluate an algorithm is to split the data at hand into two
sets, one that we call a <em>training set</em> on which we learn data
properties, and one that we call a <em>testing set</em>, on which we test
these properties.</p>
</div>
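<p>The split described above can be sketched in a few lines of plain
Python. The helper <cite>split_train_test</cite>, its <cite>test_fraction</cite> and
<cite>seed</cite> parameters, and the toy data are all assumptions made for this
illustration; they are not part of <cite>scikits.learn</cite>.</p>

```python
import random


def split_train_test(samples, targets, test_fraction=0.25, seed=0):
    """Shuffle the sample indices, then split them into training and testing parts."""
    if len(samples) != len(targets):
        raise ValueError("samples and targets must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # fixed seed gives a reproducible split
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [samples[i] for i in train_idx]
    y_train = [targets[i] for i in train_idx]
    X_test = [samples[i] for i in test_idx]
    y_test = [targets[i] for i in test_idx]
    return X_train, y_train, X_test, y_test


# Toy data, made up for illustration: 8 samples with 2 features each.
samples = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
targets = [0, 0, 0, 0, 1, 1, 1, 1]

X_train, y_train, X_test, y_test = split_train_test(samples, targets)
print(len(X_train), len(X_test))  # 6 2
```

<p>Shuffling before splitting avoids a testing set made only of the
last samples, which matters when the data is stored sorted by class.</p>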
</div>
<div class="section" id="loading-an-example-dataset">
<h2>2.2. Loading an example dataset<a class="headerlink" href="#loading-an-example-dataset" title="Permalink to this headline">¶</a></h2>
<p><cite>scikits.learn</cite> comes with a few standard datasets, for instance the
<a class="reference external" href="http://en.wikipedia.org/wiki/Iris_flower_data_set">iris dataset</a>, or
the <a class="reference external" href="http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits">digits dataset</a>:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">scikits.learn</span> <span class="kn">import</span> <span class="n">datasets</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">iris</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">load_iris</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">digits</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">load_digits</span><span class="p">()</span>
</pre></div>
</div>
<p>A dataset is a dictionary-like object that holds all the data and some
metadata about the data. This data is stored in the <cite>.data</cite> member, which
is an <cite>n_samples, n_features</cite> array. In the case of a supervised problem,
the response variables (the values to predict) are stored in the <cite>.target</cite> member.</p>
<p>For instance, in the case of the digits dataset, <cite>digits.data</cite> gives
access to the features that can be used to classify the digits samples:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="k">print</span> <span class="n">digits</span><span class="o">.</span><span class="n">data</span>
<span class="go">[[  0.   0.   5. ...,   0.   0.   0.]</span>
<span class="go"> [  0.   0.   0. ...,  10.   0.   0.]</span>
<span class="go"> [  0.   0.   0. ...,  16.   9.   0.]</span>
<span class="go"> ...,</span>
<span class="go"> [  0.   0.   1. ...,   6.   0.   0.]</span>
<span class="go"> [  0.   0.   2. ...,  12.   0.   0.]</span>
<span class="go"> [  0.   0.  10. ...,  12.   1.   0.]]</span>
</pre></div>
</div>
<p>and <cite>digits.target</cite> gives the ground truth for the digits dataset, that
is, the number corresponding to each digit image that we are trying to
learn:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="n">digits</span><span class="o">.</span><span class="n">target</span>
<span class="go">array([0, 1, 2, ..., 8, 9, 8])</span>
</pre></div>
</div>
<div class="topic">
<p class="topic-title first">Shape of the data arrays</p>
<p>The data is always a 2D array, <cite>n_samples, n_features</cite>, although
the original data may have had a different shape. In the case of the
digits, each original sample is an image of shape <cite>8, 8</cite> and can be
accessed using:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="n">digits</span><span class="o">.</span><span class="n">images</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="go">array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],</span>
<span class="go">       [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],</span>
<span class="go">       [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],</span>
<span class="go">       [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],</span>
<span class="go">       [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],</span>
<span class="go">       [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],</span>
<span class="go">       [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],</span>
<span class="go">       [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])</span>
</pre></div>
</div>
<p>The <a class="reference internal" href="auto_examples/plot_digits_classification.html#example-plot-digits-classification-py"><em>simple example on this dataset</em></a>
illustrates how, starting from the original problem, one can shape the
data for consumption in <cite>scikits.learn</cite>.</p>
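<p>As a rough illustration of this reshaping step, the sketch below
flattens a stack of <cite>8, 8</cite> images into a 2D <cite>n_samples, n_features</cite>
array with NumPy. The <cite>images</cite> array is a made-up stand-in for
<cite>digits.images</cite>, not the real data.</p>

```python
import numpy as np

# Stand-in for digits.images: 3 made-up "images" of shape 8 x 8.
images = np.arange(3 * 8 * 8, dtype=float).reshape(3, 8, 8)

n_samples = images.shape[0]
# Flatten each 8 x 8 image into one row of 64 features;
# -1 lets NumPy infer the second dimension (8 * 8 = 64).
data = images.reshape(n_samples, -1)

print(images.shape)  # (3, 8, 8)
print(data.shape)    # (3, 64)
```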
</div>
<p><tt class="docutils literal"><span class="pre">scikits.learn</span></tt> also offers the possibility to reuse external datasets coming
from the <a class="reference external" href="http://mlcomp.org">http://mlcomp.org</a> online service that provides a repository of public
datasets for various tasks (binary &amp; multi label classification, regression,
document classification, ...) along with a runtime environment to compare
program performance on those datasets. Please refer to the following example
for instructions on the <tt class="docutils literal"><span class="pre">mlcomp</span></tt> dataset loader:
<em class="xref std std-ref">example_mlcomp_document_classification.py</em>.</p>
</div>
<div class="section" id="learning-and-predicting">
<h2>2.3. Learning and Predicting<a class="headerlink" href="#learning-and-predicting" title="Permalink to this headline">¶</a></h2>
<p>In the case of the digits dataset, the task is to predict the value of a
hand-written digit from an image. We are given samples of each of the 10
possible classes on which we <em>fit</em> an <cite>estimator</cite> to be able to <em>predict</em>
the labels corresponding to new data.</p>
<p>In <cite>scikits.learn</cite>, an <em>estimator</em> is just a plain Python class that
implements the methods <cite>fit(X, Y)</cite> and <cite>predict(T)</cite>.</p>
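<p>To show what such a class can look like, here is a toy estimator
written from scratch. The class name <cite>NearestNeighborClassifier</cite> and the
data are hypothetical and chosen only for illustration; this is a sketch of
the <cite>fit(X, Y)</cite> / <cite>predict(T)</cite> convention, not code taken from
<cite>scikits.learn</cite>.</p>

```python
class NearestNeighborClassifier(object):
    """Toy estimator: predicts the label of the closest training sample."""

    def fit(self, X, Y):
        # "Learning" here simply means storing the training set.
        self.X_ = [list(x) for x in X]
        self.Y_ = list(Y)
        return self  # returning self allows clf.fit(X, Y).predict(T)

    def predict(self, T):
        predictions = []
        for t in T:
            # Squared Euclidean distance from t to every training sample.
            distances = [sum((a - b) ** 2 for a, b in zip(x, t))
                         for x in self.X_]
            closest = distances.index(min(distances))
            predictions.append(self.Y_[closest])
        return predictions


# Toy data, made up for illustration.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
Y = [0, 0, 1, 1]

clf = NearestNeighborClassifier().fit(X, Y)
print(clf.predict([[0.5, 0.5], [5.5, 4.0]]))  # [0, 1]
```

<p>Any object exposing these two methods can be used in the same way,
whatever the learning algorithm behind it.</p>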
<p>An example of an estimator is the class <tt class="docutils literal"><span class="pre">scikits.learn.svm.SVC</span></tt> that
implements <a class="reference external" href="http://en.wikipedia.org/wiki/Support_vector_machine">Support Vector Classification</a>. The
constructor of an estimator takes as arguments the parameters of the
model, but for the time being, we will consider the estimator as a black
box and not worry about these:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">scikits.learn</span> <span class="kn">import</span> <span class="n">svm</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">clf</span> <span class="o">=</span> <span class="n">svm</span><span class="o">.</span><span class="n">SVC</span><span class="p">()</span>
</pre></div>
</div>
<p>We call our estimator instance <cite>clf</cite> as it is a classifier. It now must
be fitted to the data, that is, it must <cite>learn</cite> from the data. This is
done by passing our training set to the <tt class="docutils literal"><span class="pre">fit</span></tt> method. As a training
set, let us use all the images of our dataset apart from the last
one:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="n">clf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">digits</span><span class="o">.</span><span class="n">data</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">digits</span><span class="o">.</span><span class="n">target</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="go">SVC(kernel=&#39;rbf&#39;, C=1.0, probability=False, degree=3, coef0=0.0, eps=0.001,</span>
<span class="go">  cache_size=100.0, shrinking=True, gamma=0.000556792873051)</span>
</pre></div>
</div>
<p>Now you can predict new values. In particular, we can ask the
classifier which digit is shown in the last image of the <cite>digits</cite> dataset,
which we have not used to train the classifier:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">digits</span><span class="o">.</span><span class="n">data</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="go">array([ 8.])</span>
</pre></div>
</div>
<p>The corresponding image is the following:</p>
<a class="reference internal image-reference" href="_images/last_digit.png"><img alt="_images/last_digit.png" class="align-center" src="_images/last_digit.png" style="width: 60.0px; height: 60.0px;" /></a>
<p>As you can see, it is a challenging task: the images are of poor
resolution. Do you agree with the classifier?</p>
<p>A complete example of this classification problem is available
for you to run and study:
<a class="reference internal" href="auto_examples/plot_digits_classification.html#example-plot-digits-classification-py"><em>Recognizing hand-written digits</em></a>.</p>
</div>
</div>


          </div>
        </div>
      </div>
        <div class="clearer"></div>
      </div>
    </div>

    <div class="footer">
        <p style="text-align: center">This documentation is for
        scikits.learn version 0.6.0</p>
        &copy; 2010, scikits.learn developers (BSD License).
      Created using <a href="http://sphinx.pocoo.org/">Sphinx</a> 1.0.5. Design by <a href="http://webylimonada.com">Web y Limonada</a>.
    </div>
  </body>
</html>