Sophie

Sophie

distrib > Mandriva > 2010.1 > i586 > by-pkgid > e578866d55cd81fdb23827cdf3cec911 > files > 709

python-scikits-learn-0.6-1mdv2010.2.i586.rpm



<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
    <title>3.6. Gaussian Processes &mdash; scikits.learn v0.6.0 documentation</title>
    <link rel="stylesheet" href="../_static/nature.css" type="text/css" />
    <link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '0.6.0',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>
    <script type="text/javascript" src="../_static/jquery.js"></script>
    <script type="text/javascript" src="../_static/underscore.js"></script>
    <script type="text/javascript" src="../_static/doctools.js"></script>
    <link rel="shortcut icon" href="../_static/favicon.ico"/>
    <link rel="author" title="About these documents" href="../about.html" />
    <link rel="top" title="scikits.learn v0.6.0 documentation" href="../index.html" />
    <link rel="up" title="3. Supervised learning" href="../supervised_learning.html" />
    <link rel="next" title="4. Unsupervised learning" href="../unsupervised_learning.html" />
    <link rel="prev" title="3.5. Feature selection" href="feature_selection.html" /> 
  </head>
  <body>
    <div class="header-wrapper">
      <div class="header">
          <p class="logo"><a href="../index.html">
            <img src="../_static/scikit-learn-logo-small.png" alt="Logo"/>
          </a>
          </p><div class="navbar">
          <ul>
            <li><a href="../install.html">Download</a></li>
            <li><a href="../support.html">Support</a></li>
            <li><a href="../user_guide.html">User Guide</a></li>
            <li><a href="../auto_examples/index.html">Examples</a></li>
            <li><a href="../developers/index.html">Development</a></li>
       </ul>

<div class="search_form">

<div id="cse" style="width: 100%;"></div>
<script src="http://www.google.com/jsapi" type="text/javascript"></script>
<script type="text/javascript">
  google.load('search', '1', {language : 'en'});
  google.setOnLoadCallback(function() {
    var customSearchControl = new google.search.CustomSearchControl('016639176250731907682:tjtqbvtvij0');
    customSearchControl.setResultSetSize(google.search.Search.FILTERED_CSE_RESULTSET);
    var options = new google.search.DrawOptions();
    options.setAutoComplete(true);
    customSearchControl.draw('cse', options);
  }, true);
</script>

</div>

          </div> <!-- end navbar --></div>
    </div>

    <div class="content-wrapper">

    <!-- <div id="blue_tile"></div> -->

        <div class="sphinxsidebar">
        <div class="rel">
          <a href="feature_selection.html" title="3.5. Feature selection"
             accesskey="P">previous</a> |
          <a href="../unsupervised_learning.html" title="4. Unsupervised learning"
             accesskey="N">next</a> |
          <a href="../genindex.html" title="General Index"
             accesskey="I">index</a>
        </div>
        

        <h3>Contents</h3>
         <ul>
<li><a class="reference internal" href="#">3.6. Gaussian Processes</a><ul>
<li><a class="reference internal" href="#an-introductory-regression-example">3.6.1. An introductory regression example</a></li>
<li><a class="reference internal" href="#mathematical-formulation">3.6.2. Mathematical formulation</a><ul>
<li><a class="reference internal" href="#the-initial-assumption">3.6.2.1. The initial assumption</a></li>
<li><a class="reference internal" href="#the-best-linear-unbiased-prediction-blup">3.6.2.2. The best linear unbiased prediction (BLUP)</a></li>
<li><a class="reference internal" href="#the-empirical-best-linear-unbiased-predictor-eblup">3.6.2.3. The empirical best linear unbiased predictor (EBLUP)</a></li>
</ul>
</li>
<li><a class="reference internal" href="#correlation-models">3.6.3. Correlation Models</a></li>
<li><a class="reference internal" href="#regression-models">3.6.4. Regression Models</a></li>
<li><a class="reference internal" href="#implementation-details">3.6.5. Implementation details</a></li>
</ul>
</li>
</ul>


        

        </div>

      <div class="content">
            
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body">
            
  <div class="section" id="gaussian-processes">
<span id="gaussian-process"></span><h1>3.6. Gaussian Processes<a class="headerlink" href="#gaussian-processes" title="Permalink to this headline">¶</a></h1>
<p><strong>Gaussian Processes for Machine Learning (GPML)</strong> is a generic supervised
learning method primarily designed to solve <em>regression</em> problems. It has also
been extended to <em>probabilistic classification</em>, but in the present
implementation, this is only a post-processing of the <em>regression</em> exercise.</p>
<p>The advantages of Gaussian Processes for Machine Learning are:</p>
<blockquote>
<ul class="simple">
<li>The prediction interpolates the observations (at least for regular
correlation models).</li>
<li>The prediction is probabilistic (Gaussian) so that one can compute
empirical confidence intervals and exceedence probabilities that might be
used to refit (online fitting, adaptive fitting) the prediction in some
region of interest.</li>
<li>Versatile: different <em class="xref std std-ref">linear regression models</em> and <em class="xref std std-ref">correlation models</em> can be specified. Common models are provided, but
it is also possible to specify custom models provided they are
stationary.</li>
</ul>
</blockquote>
<p>The disadvantages of Gaussian Processes for Machine Learning include:</p>
<blockquote>
<ul class="simple">
<li>It is not sparse. It uses the whole samples/features information to
perform the prediction.</li>
<li>It loses efficiency in high dimensional spaces &#8211; namely when the number
of features exceeds a few dozens. It might indeed give poor performance
and it loses computational efficiency.</li>
<li>Classification is only a post-processing, meaning that one first need
to solve a regression problem by providing the complete scalar float
precision output <span class="math">y</span> of the experiment one attempt to model.</li>
</ul>
</blockquote>
<p>Thanks to the Gaussian property of the prediction, it has been given varied
applications: e.g. for global optimization, probabilistic classification.</p>
<div class="section" id="an-introductory-regression-example">
<h2>3.6.1. An introductory regression example<a class="headerlink" href="#an-introductory-regression-example" title="Permalink to this headline">¶</a></h2>
<p>Say we want to surrogate the function <span class="math">g(x) = x \sin(x)</span>. To do so,
the function is evaluated onto a design of experiments. Then, we define a
GaussianProcess model whose regression and correlation models might be
specified using additional kwargs, and ask for the model to be fitted to the
data. Depending on the number of parameters provided at instanciation, the
fitting procedure may recourse to maximum likelihood estimation for the
parameters or alternatively it uses the given parameters.</p>
<div class="figure align-center">
<a class="reference external image-reference" href="../auto_examples/gaussian_process/plot_gp_regression.html"><img alt="auto_examples/gaussian_process/images/plot_gp_regression.png" src="auto_examples/gaussian_process/images/plot_gp_regression.png" /></a>
</div>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">scikits.learn</span> <span class="kn">import</span> <span class="n">gaussian_process</span>
<span class="gp">&gt;&gt;&gt; </span><span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="gp">... </span>    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">atleast_2d</span><span class="p">([</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">3.</span><span class="p">,</span> <span class="mf">5.</span><span class="p">,</span> <span class="mf">6.</span><span class="p">,</span> <span class="mf">7.</span><span class="p">,</span> <span class="mf">8.</span><span class="p">])</span><span class="o">.</span><span class="n">T</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">y</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="o">.</span><span class="n">ravel</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">atleast_2d</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">1000</span><span class="p">))</span><span class="o">.</span><span class="n">T</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">gp</span> <span class="o">=</span> <span class="n">gaussian_process</span><span class="o">.</span><span class="n">GaussianProcess</span><span class="p">(</span><span class="n">theta0</span><span class="o">=</span><span class="mf">1e-2</span><span class="p">,</span> <span class="n">thetaL</span><span class="o">=</span><span class="mf">1e-4</span><span class="p">,</span> <span class="n">thetaU</span><span class="o">=</span><span class="mf">1e-1</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">gp</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> 
<span class="go">GaussianProcess(normalize=True, theta0=array([[ 0.01]]),</span>
<span class="go">        optimizer=&#39;fmin_cobyla&#39;, verbose=False, storage_mode=&#39;full&#39;,</span>
<span class="go">        nugget=2.2204460492503131e-15, thetaU=array([[ 0.1]]),</span>
<span class="go">        regr=&lt;function constant at 0x...&gt;, random_start=1,</span>
<span class="go">        corr=&lt;function squared_exponential at 0x...&gt;, beta0=None,</span>
<span class="go">        thetaL=array([[ 0.0001]]))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">y_pred</span><span class="p">,</span> <span class="n">sigma2_pred</span> <span class="o">=</span> <span class="n">gp</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">eval_MSE</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</div>
<div class="topic">
<p class="topic-title first">Other examples</p>
<ul class="simple">
<li><a class="reference internal" href="../auto_examples/gaussian_process/plot_gp_probabilistic_classification_after_regression.html#example-gaussian-process-plot-gp-probabilistic-classification-after-regression-py"><em>Gaussian Processes classification example: exploiting the probabilistic output</em></a></li>
<li><em class="xref std std-ref">example_gaussian_process_plot_gp_diabetes_dataset.py</em></li>
</ul>
</div>
</div>
<div class="section" id="mathematical-formulation">
<h2>3.6.2. Mathematical formulation<a class="headerlink" href="#mathematical-formulation" title="Permalink to this headline">¶</a></h2>
<div class="section" id="the-initial-assumption">
<h3>3.6.2.1. The initial assumption<a class="headerlink" href="#the-initial-assumption" title="Permalink to this headline">¶</a></h3>
<p>Suppose one wants to model the output of a computer experiment, say a
mathematical function:</p>
<div class="math">
<p><span class="math">g: &amp; \mathbb{R}^{n_{\rm features}} \rightarrow \mathbb{R} \\
   &amp; X \mapsto y = g(X)</span></p>
</div><p>GPML starts with the assumption that this function is <em>a</em> conditional sample
path of <em>a</em> Gaussian process <span class="math">G</span> which is additionally assumed to read as
follows:</p>
<div class="math">
<p><span class="math">G(X) = f(X)^T \beta + Z(X)</span></p>
</div><p>where <span class="math">f(X)^T \beta</span> is a linear regression model and <span class="math">Z(X)</span> is a
zero-mean Gaussian process with a fully stationary covariance function:</p>
<div class="math">
<p><span class="math">C(X, X') = \sigma^2 R(|X - X'|)</span></p>
</div><p><span class="math">\sigma^2</span> being its variance and <span class="math">R</span> being the correlation
function which solely depends on the absolute relative distance between each
sample, possibly featurewise (this is the stationarity assumption).</p>
<p>From this basic formulation, note that GPML is nothing but an extension of a
basic least squares linear regression problem:</p>
<div class="math">
<p><span class="math">g(X) \approx f(X)^T \beta</span></p>
</div><p>Except we additionaly assume some spatial coherence (correlation) between the
samples dictated by the correlation function. Indeed, ordinary least squares
assumes the correlation model <span class="math">R(|X - X'|)</span> is one when <span class="math">X = X'</span>
and zero otherwise : a <em>dirac</em> correlation model &#8211; sometimes referred to as a
<em>nugget</em> correlation model in the kriging literature.</p>
</div>
<div class="section" id="the-best-linear-unbiased-prediction-blup">
<h3>3.6.2.2. The best linear unbiased prediction (BLUP)<a class="headerlink" href="#the-best-linear-unbiased-prediction-blup" title="Permalink to this headline">¶</a></h3>
<p>We now derive the <em>best linear unbiased prediction</em> of the sample path
<span class="math">g</span> conditioned on the observations:</p>
<div class="math">
<p><span class="math">\hat{G}(X) = G(X | y_1 = g(X_1), ...,
                            y_{n_{\rm samples}} = g(X_{n_{\rm samples}}))</span></p>
</div><p>It is derived from its <em>given properties</em>:</p>
<ul class="simple">
<li>It is linear (a linear combination of the observations)</li>
</ul>
<div class="math">
<p><span class="math">\hat{G}(X) \equiv a(X)^T y</span></p>
</div><ul class="simple">
<li>It is unbiased</li>
</ul>
<div class="math">
<p><span class="math">\mathbb{E}[G(X) - \hat{G}(X)] = 0</span></p>
</div><ul class="simple">
<li>It is the best (in the Mean Squared Error sense)</li>
</ul>
<div class="math">
<p><span class="math">\hat{G}(X)^* = \arg \min\limits_{\hat{G}(X)} \;
                                        \mathbb{E}[(G(X) - \hat{G}(X))^2]</span></p>
</div><p>So that the optimal weight vector <span class="math">a(X)</span> is solution of the following
equality constrained optimization problem:</p>
<div class="math">
<p><span class="math">a(X)^* = \arg \min\limits_{a(X)} &amp; \; \mathbb{E}[(G(X) - a(X)^T y)^2] \\
                   {\rm s. t.} &amp; \; \mathbb{E}[G(X) - a(X)^T y] = 0</span></p>
</div><p>Rewriting this constrained optimization problem in the form of a Lagrangian and
looking further for the first order optimality conditions to be satisfied, one
ends up with a closed form expression for the sought predictor &#8211; see
references for the complete proof.</p>
<p>In the end, the BLUP is shown to be a Gaussian random variate with mean:</p>
<div class="math">
<p><span class="math">\mu_{\hat{Y}}(X) = f(X)^T\,\hat{\beta} + r(X)^T\,\gamma</span></p>
</div><p>and variance:</p>
<div class="math">
<p><span class="math">\sigma_{\hat{Y}}^2(X) = \sigma_{Y}^2\,
( 1
- r(X)^T\,R^{-1}\,r(X)
+ u(X)^T\,(F^T\,R^{-1}\,F)^{-1}\,u(X)
)</span></p>
</div><p>where we have introduced:</p>
<ul class="simple">
<li>the correlation matrix whose terms are defined wrt the autocorrelation
function and its built-in parameters <span class="math">\theta</span>:</li>
</ul>
<div class="math">
<p><span class="math">R_{i\,j} = R(|X_i - X_j|, \theta), \; i,\,j = 1, ..., m</span></p>
</div><ul class="simple">
<li>the vector of cross-correlations between the point where the prediction is</li>
</ul>
<p>made and the points in the DOE:</p>
<div class="math">
<p><span class="math">r_i = R(|X - X_i|, \theta), \; i = 1, ..., m</span></p>
</div><ul class="simple">
<li>the regression matrix (eg the Vandermonde matrix if <span class="math">f</span> is a polynomial
basis):</li>
</ul>
<div class="math">
<p><span class="math">F_{i\,j} = f_i(X_j), \; i = 1, ..., p, \, j = 1, ..., m</span></p>
</div><ul class="simple">
<li>the generalized least square regression weights:</li>
</ul>
<div class="math">
<p><span class="math">\hat{\beta} =(F^T\,R^{-1}\,F)^{-1}\,F^T\,R^{-1}\,Y</span></p>
</div><ul class="simple">
<li>and the vectors:</li>
</ul>
<div class="math">
<p><span class="math">\gamma &amp; = R^{-1}(Y - F\,\hat{\beta}) \\
u(X) &amp; = F^T\,R^{-1}\,r(X) - f(X)</span></p>
</div><p>It is important to notice that the probabilistic response of a Gaussian Process
predictor is fully analytic and mostly relies on basic linear algebra
operations. More precisely the mean prediction is the sum of two simple linear
combinations (dot products), and the variance requires two matrix inversions,
but the correlation matrix can be decomposed only once using a Cholesky
decomposition algorithm.</p>
</div>
<div class="section" id="the-empirical-best-linear-unbiased-predictor-eblup">
<h3>3.6.2.3. The empirical best linear unbiased predictor (EBLUP)<a class="headerlink" href="#the-empirical-best-linear-unbiased-predictor-eblup" title="Permalink to this headline">¶</a></h3>
<p>Until now, both the autocorrelation and regression models were assumed given.
In practice however they are never known in advance so that one has to make
(motivated) empirical choices for these models <em class="xref std std-ref">correlation_models</em>.</p>
<p>Provided these choices are made, one should estimate the remaining unknown
parameters involved in the BLUP. To do so, one uses the set of provided
observations in conjunction with some inference technique. The present
implementation, which is based on the DACE&#8217;s Matlab toolbox uses the <em>maximum
likelihood estimation</em> technique &#8211; see DACE manual in references for the
complete equations. This maximum likelihood estimation problem is turned into
a global optimization problem onto the autocorrelation parameters. In the
present implementation, this global optimization is solved by means of the
fmin_cobyla optimization function from scipy.optimize. In the case of
anisotropy however, we provide an implementation of Welch&#8217;s componentwise
optimization algorithm &#8211; see references.</p>
<p>For a more comprehensive description of the theoretical aspects of Gaussian
Processes for Machine Learning, please refer to the references below:</p>
<div class="topic">
<p class="topic-title first">References:</p>
<ul class="simple">
<li><a class="reference external" href="http://www2.imm.dtu.dk/~hbn/dace/">DACE, A Matlab Kriging Toolbox</a> S Lophaven, HB Nielsen, J
Sondergaard 2002</li>
<li><a class="reference external" href="http://www.jstor.org/pss/1269548">Screening, predicting, and computer experiments</a> WJ Welch, RJ Buck, J Sacks,
HP Wynn, TJ Mitchell, and MD Morris Technometrics 34(1) 15&#8211;25,
1992</li>
<li><a class="reference external" href="http://www.gaussianprocess.org/gpml/chapters/RW.pdf">Gaussian Processes for Machine Learning</a> CE
Rasmussen, CKI Williams MIT Press, 2006 (Ed. T Diettrich)</li>
<li><a class="reference external" href="http://www.stat.osu.edu/~comp_exp/book.html">The design and analysis of computer experiments</a> TJ Santner, BJ
Williams, W Notz Springer, 2003</li>
</ul>
</div>
</div>
</div>
<div class="section" id="correlation-models">
<h2>3.6.3. Correlation Models<a class="headerlink" href="#correlation-models" title="Permalink to this headline">¶</a></h2>
<p>Common correlation models matches some famous SVM&#8217;s kernels because they are
mostly built on equivalent assumptions. They must fulfill Mercer&#8217;s conditions
and should additionaly remain stationary. Note however, that the choice of the
correlation model should be made in agreement with the known properties of the
original experiment from which the observations come. For instance:</p>
<ul class="simple">
<li>If the original experiment is known to be infinitely differentiable (smooth),
then one should use the <em>squared-exponential correlation model</em>.</li>
<li>If it&#8217;s not, then one should rather use the <em>exponential correlation model</em>.</li>
<li>Note also that there exists a correlation model that takes the degree of
derivability as input: this is the Matern correlation model, but it&#8217;s not
implemented here (TODO).</li>
</ul>
<p>For a more detailed discussion on the selection of appropriate correlation
models, see the book by Rasmussen &amp; Williams in references.</p>
</div>
<div class="section" id="regression-models">
<h2>3.6.4. Regression Models<a class="headerlink" href="#regression-models" title="Permalink to this headline">¶</a></h2>
<p>Common linear regression models involve zero- (constant), first- and
second-order polynomials. But one may specify its own in the form of a Python
function that takes the features X as input and that returns a vector
containing the values of the functional set. The only constraint is that the
number of functions must not exceed the number of available observations so
that the underlying regression problem is not <em>underdetermined</em>.</p>
</div>
<div class="section" id="implementation-details">
<h2>3.6.5. Implementation details<a class="headerlink" href="#implementation-details" title="Permalink to this headline">¶</a></h2>
<p>The present implementation is based on a translation of the DACE Matlab
toolbox.</p>
<div class="topic">
<p class="topic-title first">References:</p>
<ul class="simple">
<li><a class="reference external" href="http://www2.imm.dtu.dk/~hbn/dace/">DACE, A Matlab Kriging Toolbox</a> S Lophaven, HB Nielsen, J
Sondergaard 2002,</li>
</ul>
</div>
</div>
</div>


          </div>
        </div>
      </div>
        <div class="clearer"></div>
      </div>
    </div>

    <div class="footer">
        <p style="text-align: center">This documentation is relative
        to scikits.learn version 0.6.0<p>
        &copy; 2010, scikits.learn developers (BSD Lincense).
      Created using <a href="http://sphinx.pocoo.org/">Sphinx</a> 1.0.5. Design by <a href="http://webylimonada.com">Web y Limonada</a>.
    </div>
  </body>
</html>