Sophie: python-scikits-learn-0.6-1mdv2010.2 i586

python-scikits-learn-0.6-1mdv2010.2.i586.rpm

Getting started: an introduction to machine learning with scikits.learn
=======================================================================

.. topic:: Section contents

    In this section, we introduce the machine learning vocabulary that we
    use through-out `scikits.learn` and give a simple learning example.


Machine learning: the problem setting
---------------------------------------

In general, a learning problem considers a set of n *samples* of data and
try to predict properties of unknown data. If each sample is more than a
single number, and for instance a multi-dimensional entry (aka
*multivariate* data), is it said to have several attributes, or
*features*.

We can separate learning problems in a few large categories:

 * **supervised learning**, in which the data comes with additional
   attributes that we want to predict. This problem can be either:

    * **classification**: samples belong to two or more classes and we
      want to learn from already labeled data how to predict the class
      of un-labeled data. An example of classification problem would
      be the digit recognition example, in which the aim is to assign
      each input vector to one of a finite number of discrete
      categories.

    * **regression**: if the desired output consists of one or more
        continuous variables, then the task is called *regression*. An
        example of a regression problem would be the prediction of the
        length of a salmon as a function of its age and weight.

 * **unsupervised learning**, in which the training data consists of a
     set of input vectors x without any corresponding target
     values. The goal in such problems may be to discover groups of
     similar examples within the data, where it is called
     *clustering*, or to determine the distribution of data within the
     input space, known as *density estimation*, or to project the data
     from a high-dimensional space down to two or thee dimensions for
     the purpose of *visualization*.

.. topic:: Training set and testing set

    Machine learning is about learning some properties of a data set and
    applying them to new data. This is why a common practice in machine
    learning to evaluate an algorithm is to split the data at hand in two
    sets, one that we call a *training set* on which we learn data
    properties, and one that we call a *testing set*, on which we test
    these properties.


Loading an example dataset
--------------------------

`scikits.learn` comes with a few standard datasets, for instance the
`iris dataset <http://en.wikipedia.org/wiki/Iris_flower_data_set>`_, or
the `digits dataset
<http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits>`_::

    >>> from scikits.learn import datasets
    >>> iris = datasets.load_iris()
    >>> digits = datasets.load_digits()

A dataset is a dictionary-like object that holds all the data and some
metadata about the data. This data is stored in the `.data` member, which
is a `n_samples, n_features` array. In the case of supervised problem,
explanatory variables are stored in the `.target` member.

For instance, in the case of the digits dataset, `digits.data` gives
access to the features that can be used to classify the digits samples::

    >>> print digits.data
    [[  0.   0.   5. ...,   0.   0.   0.]
     [  0.   0.   0. ...,  10.   0.   0.]
     [  0.   0.   0. ...,  16.   9.   0.]
     ..., 
     [  0.   0.   1. ...,   6.   0.   0.]
     [  0.   0.   2. ...,  12.   0.   0.]
     [  0.   0.  10. ...,  12.   1.   0.]]

and `digits.target` gives the ground truth for the digit dataset, that
is the number corresponding to each digit image that we are trying to
learn:

>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])

.. topic:: Shape of the data arrays

    The data is always a 2D array, `n_samples, n_features`, although
    the original data may have had a different shape. In the case of the
    digits, each original sample is an image of shape `8, 8` and can be
    accessed using:

    >>> digits.images[0]
    array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
           [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
           [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
           [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
           [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
           [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
           [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
           [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])

    The :ref:`simple example on this dataset <example_plot_digits_classification.py>`
    illustrates how starting from the original problem one can shape the
    data for consumption in the `scikit.learn`.


``scikits.learn`` also offers the possibility to reuse external datasets coming
from the http://mlcomp.org online service that provides a repository of public
datasets for various tasks (binary & multi label classification, regression,
document classification, ...) along with a runtime environment to compare
program performance on those datasets. Please refer to the following example for
for instructions on the ``mlcomp`` dataset loader:
:ref:`example_mlcomp_document_classification.py`.


Learning and Predicting
------------------------

In the case of the digits dataset, the task is to predict the value of a
hand-written digit from an image. We are given samples of each of the 10
possible classes on which we *fit* an `estimator` to be able to *predict*
the labels corresponding to new data.

In `scikit.learn`, an *estimator* is just a plain Python class that
implements the methods `fit(X, Y)` and `predict(T)`.

An example of estimator is the class ``scikits.learn.svm.SVC`` that
implements `Support Vector Classification
<http://en.wikipedia.org/wiki/Support_vector_machine>`_. The
constructor of an estimator takes as arguments the parameters of the
model, but for the time being, we will consider the estimator as a black
box and not worry about these:

>>> from scikits.learn import svm
>>> clf = svm.SVC()

We call our estimator instance `clf` as it is a classifier. It now must
be fitted to the model, that is, it must `learn` from the model. This is
done by passing our training set to the ``fit`` method. As a training
set, let us use the all the images of our dataset appart from the last
one:

>>> clf.fit(digits.data[:-1], digits.target[:-1])
SVC(kernel='rbf', C=1.0, probability=False, degree=3, coef0=0.0, eps=0.001,
  cache_size=100.0, shrinking=True, gamma=0.000556792873051)

Now you can predict new values, in particular, we can ask to the
classifier what is the digit of our last image in the `digits` dataset,
which we have not used to train the classifier:

>>> clf.predict(digits.data[-1])
array([ 8.])

The corresponding image is the following:

.. image:: images/last_digit.png
    :align: center
    :scale: 50

As you can see, it is a challenging task: the images are of poor
resolution. Do you agree with the classifier?

A complete example of this classification problem is available as an
example that you can run and study:
:ref:`example_plot_digits_classification.py`.