<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title>2. Getting started: an introduction to machine learning with scikits.learn — scikits.learn v0.6.0 documentation</title> <link rel="stylesheet" href="_static/nature.css" type="text/css" /> <link rel="stylesheet" href="_static/pygments.css" type="text/css" /> <script type="text/javascript"> var DOCUMENTATION_OPTIONS = { URL_ROOT: '', VERSION: '0.6.0', COLLAPSE_INDEX: false, FILE_SUFFIX: '.html', HAS_SOURCE: true }; </script> <script type="text/javascript" src="_static/jquery.js"></script> <script type="text/javascript" src="_static/underscore.js"></script> <script type="text/javascript" src="_static/doctools.js"></script> <link rel="shortcut icon" href="_static/favicon.ico"/> <link rel="author" title="About these documents" href="about.html" /> <link rel="top" title="scikits.learn v0.6.0 documentation" href="index.html" /> <link rel="up" title="<no title>" href="contents.html" /> <link rel="next" title="3. Supervised learning" href="supervised_learning.html" /> <link rel="prev" title="1. Installing scikits.learn" href="install.html" /> </head> <body> <div class="header-wrapper"> <div class="header"> <p class="logo"><a href="index.html"> <img src="_static/scikit-learn-logo-small.png" alt="Logo"/> </a> </p><div class="navbar"> <ul> <li><a href="install.html">Download</a></li> <li><a href="support.html">Support</a></li> <li><a href="user_guide.html">User Guide</a></li> <li><a href="auto_examples/index.html">Examples</a></li> <li><a href="developers/index.html">Development</a></li> </ul> <div class="search_form"> <div id="cse" style="width: 100%;"></div> <script src="http://www.google.com/jsapi" type="text/javascript"></script> <script type="text/javascript"> google.load('search', '1', {language : 'en'}); google.setOnLoadCallback(function() { var customSearchControl = new google.search.CustomSearchControl('016639176250731907682:tjtqbvtvij0'); customSearchControl.setResultSetSize(google.search.Search.FILTERED_CSE_RESULTSET); var options = new google.search.DrawOptions(); options.setAutoComplete(true); customSearchControl.draw('cse', options); }, true); </script> </div> </div> <!-- end navbar --></div> </div> <div class="content-wrapper"> <!-- <div id="blue_tile"></div> --> <div class="sphinxsidebar"> <div class="rel"> <a href="install.html" title="1. Installing scikits.learn" accesskey="P">previous</a> | <a href="supervised_learning.html" title="3. Supervised learning" accesskey="N">next</a> | <a href="genindex.html" title="General Index" accesskey="I">index</a> </div> <h3>Contents</h3> <ul> <li><a class="reference internal" href="#">2. Getting started: an introduction to machine learning with scikits.learn</a><ul> <li><a class="reference internal" href="#machine-learning-the-problem-setting">2.1. Machine learning: the problem setting</a></li> <li><a class="reference internal" href="#loading-an-example-dataset">2.2. Loading an example dataset</a></li> <li><a class="reference internal" href="#learning-and-predicting">2.3. Learning and Predicting</a></li> </ul> </li> </ul> </div> <div class="content"> <div class="documentwrapper"> <div class="bodywrapper"> <div class="body"> <div class="section" id="getting-started-an-introduction-to-machine-learning-with-scikits-learn"> <h1>2. Getting started: an introduction to machine learning with scikits.learn<a class="headerlink" href="#getting-started-an-introduction-to-machine-learning-with-scikits-learn" title="Permalink to this headline">¶</a></h1> <div class="topic"> <p class="topic-title first">Section contents</p> <p>In this section, we introduce the machine learning vocabulary that we use through-out <cite>scikits.learn</cite> and give a simple learning example.</p> </div> <div class="section" id="machine-learning-the-problem-setting"> <h2>2.1. Machine learning: the problem setting<a class="headerlink" href="#machine-learning-the-problem-setting" title="Permalink to this headline">¶</a></h2> <p>In general, a learning problem considers a set of n <em>samples</em> of data and try to predict properties of unknown data. If each sample is more than a single number, and for instance a multi-dimensional entry (aka <em>multivariate</em> data), is it said to have several attributes, or <em>features</em>.</p> <p>We can separate learning problems in a few large categories:</p> <blockquote> <ul> <li><p class="first"><strong>supervised learning</strong>, in which the data comes with additional attributes that we want to predict. This problem can be either:</p> <blockquote> <ul> <li><p class="first"><strong>classification</strong>: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of un-labeled data. An example of classification problem would be the digit recognition example, in which the aim is to assign each input vector to one of a finite number of discrete categories.</p> </li> <li><dl class="first docutils"> <dt><strong>regression</strong>: if the desired output consists of one or more</dt> <dd><p class="first last">continuous variables, then the task is called <em>regression</em>. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.</p> </dd> </dl> </li> </ul> </blockquote> </li> <li><dl class="first docutils"> <dt><strong>unsupervised learning</strong>, in which the training data consists of a</dt> <dd><p class="first last">set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data, where it is called <em>clustering</em>, or to determine the distribution of data within the input space, known as <em>density estimation</em>, or to project the data from a high-dimensional space down to two or thee dimensions for the purpose of <em>visualization</em>.</p> </dd> </dl> </li> </ul> </blockquote> <div class="topic"> <p class="topic-title first">Training set and testing set</p> <p>Machine learning is about learning some properties of a data set and applying them to new data. This is why a common practice in machine learning to evaluate an algorithm is to split the data at hand in two sets, one that we call a <em>training set</em> on which we learn data properties, and one that we call a <em>testing set</em>, on which we test these properties.</p> </div> </div> <div class="section" id="loading-an-example-dataset"> <h2>2.2. Loading an example dataset<a class="headerlink" href="#loading-an-example-dataset" title="Permalink to this headline">¶</a></h2> <p><cite>scikits.learn</cite> comes with a few standard datasets, for instance the <a class="reference external" href="http://en.wikipedia.org/wiki/Iris_flower_data_set">iris dataset</a>, or the <a class="reference external" href="http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits">digits dataset</a>:</p> <div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">scikits.learn</span> <span class="kn">import</span> <span class="n">datasets</span> <span class="gp">>>> </span><span class="n">iris</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">load_iris</span><span class="p">()</span> <span class="gp">>>> </span><span class="n">digits</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">load_digits</span><span class="p">()</span> </pre></div> </div> <p>A dataset is a dictionary-like object that holds all the data and some metadata about the data. This data is stored in the <cite>.data</cite> member, which is a <cite>n_samples, n_features</cite> array. In the case of supervised problem, explanatory variables are stored in the <cite>.target</cite> member.</p> <p>For instance, in the case of the digits dataset, <cite>digits.data</cite> gives access to the features that can be used to classify the digits samples:</p> <div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="k">print</span> <span class="n">digits</span><span class="o">.</span><span class="n">data</span> <span class="go">[[ 0. 0. 5. ..., 0. 0. 0.]</span> <span class="go"> [ 0. 0. 0. ..., 10. 0. 0.]</span> <span class="go"> [ 0. 0. 0. ..., 16. 9. 0.]</span> <span class="go"> ...,</span> <span class="go"> [ 0. 0. 1. ..., 6. 0. 0.]</span> <span class="go"> [ 0. 0. 2. ..., 12. 0. 0.]</span> <span class="go"> [ 0. 0. 10. ..., 12. 1. 0.]]</span> </pre></div> </div> <p>and <cite>digits.target</cite> gives the ground truth for the digit dataset, that is the number corresponding to each digit image that we are trying to learn:</p> <div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">digits</span><span class="o">.</span><span class="n">target</span> <span class="go">array([0, 1, 2, ..., 8, 9, 8])</span> </pre></div> </div> <div class="topic"> <p class="topic-title first">Shape of the data arrays</p> <p>The data is always a 2D array, <cite>n_samples, n_features</cite>, although the original data may have had a different shape. In the case of the digits, each original sample is an image of shape <cite>8, 8</cite> and can be accessed using:</p> <div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">digits</span><span class="o">.</span><span class="n">images</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="go">array([[ 0., 0., 5., 13., 9., 1., 0., 0.],</span> <span class="go"> [ 0., 0., 13., 15., 10., 15., 5., 0.],</span> <span class="go"> [ 0., 3., 15., 2., 0., 11., 8., 0.],</span> <span class="go"> [ 0., 4., 12., 0., 0., 8., 8., 0.],</span> <span class="go"> [ 0., 5., 8., 0., 0., 9., 8., 0.],</span> <span class="go"> [ 0., 4., 11., 0., 1., 12., 7., 0.],</span> <span class="go"> [ 0., 2., 14., 5., 10., 12., 0., 0.],</span> <span class="go"> [ 0., 0., 6., 13., 10., 0., 0., 0.]])</span> </pre></div> </div> <p>The <a class="reference internal" href="auto_examples/plot_digits_classification.html#example-plot-digits-classification-py"><em>simple example on this dataset</em></a> illustrates how starting from the original problem one can shape the data for consumption in the <cite>scikit.learn</cite>.</p> </div> <p><tt class="docutils literal"><span class="pre">scikits.learn</span></tt> also offers the possibility to reuse external datasets coming from the <a class="reference external" href="http://mlcomp.org">http://mlcomp.org</a> online service that provides a repository of public datasets for various tasks (binary & multi label classification, regression, document classification, ...) along with a runtime environment to compare program performance on those datasets. Please refer to the following example for for instructions on the <tt class="docutils literal"><span class="pre">mlcomp</span></tt> dataset loader: <em class="xref std std-ref">example_mlcomp_document_classification.py</em>.</p> </div> <div class="section" id="learning-and-predicting"> <h2>2.3. Learning and Predicting<a class="headerlink" href="#learning-and-predicting" title="Permalink to this headline">¶</a></h2> <p>In the case of the digits dataset, the task is to predict the value of a hand-written digit from an image. We are given samples of each of the 10 possible classes on which we <em>fit</em> an <cite>estimator</cite> to be able to <em>predict</em> the labels corresponding to new data.</p> <p>In <cite>scikit.learn</cite>, an <em>estimator</em> is just a plain Python class that implements the methods <cite>fit(X, Y)</cite> and <cite>predict(T)</cite>.</p> <p>An example of estimator is the class <tt class="docutils literal"><span class="pre">scikits.learn.svm.SVC</span></tt> that implements <a class="reference external" href="http://en.wikipedia.org/wiki/Support_vector_machine">Support Vector Classification</a>. The constructor of an estimator takes as arguments the parameters of the model, but for the time being, we will consider the estimator as a black box and not worry about these:</p> <div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">scikits.learn</span> <span class="kn">import</span> <span class="n">svm</span> <span class="gp">>>> </span><span class="n">clf</span> <span class="o">=</span> <span class="n">svm</span><span class="o">.</span><span class="n">SVC</span><span class="p">()</span> </pre></div> </div> <p>We call our estimator instance <cite>clf</cite> as it is a classifier. It now must be fitted to the model, that is, it must <cite>learn</cite> from the model. This is done by passing our training set to the <tt class="docutils literal"><span class="pre">fit</span></tt> method. As a training set, let us use the all the images of our dataset appart from the last one:</p> <div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">clf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">digits</span><span class="o">.</span><span class="n">data</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">digits</span><span class="o">.</span><span class="n">target</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="go">SVC(kernel='rbf', C=1.0, probability=False, degree=3, coef0=0.0, eps=0.001,</span> <span class="go"> cache_size=100.0, shrinking=True, gamma=0.000556792873051)</span> </pre></div> </div> <p>Now you can predict new values, in particular, we can ask to the classifier what is the digit of our last image in the <cite>digits</cite> dataset, which we have not used to train the classifier:</p> <div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">digits</span><span class="o">.</span><span class="n">data</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="go">array([ 8.])</span> </pre></div> </div> <p>The corresponding image is the following:</p> <a class="reference internal image-reference" href="_images/last_digit.png"><img alt="_images/last_digit.png" class="align-center" src="_images/last_digit.png" style="width: 60.0px; height: 60.0px;" /></a> <p>As you can see, it is a challenging task: the images are of poor resolution. Do you agree with the classifier?</p> <p>A complete example of this classification problem is available as an example that you can run and study: <a class="reference internal" href="auto_examples/plot_digits_classification.html#example-plot-digits-classification-py"><em>Recognizing hand-written digits</em></a>.</p> </div> </div> </div> </div> </div> <div class="clearer"></div> </div> </div> <div class="footer"> <p style="text-align: center">This documentation is relative to scikits.learn version 0.6.0<p> © 2010, scikits.learn developers (BSD Lincense). Created using <a href="http://sphinx.pocoo.org/">Sphinx</a> 1.0.5. Design by <a href="http://webylimonada.com">Web y Limonada</a>. </div> </body> </html>