<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title>Sample pipeline for text feature extraction and evaluation — scikits.learn v0.6.0 documentation</title> <link rel="stylesheet" href="../_static/nature.css" type="text/css" /> <link rel="stylesheet" href="../_static/pygments.css" type="text/css" /> <script type="text/javascript"> var DOCUMENTATION_OPTIONS = { URL_ROOT: '../', VERSION: '0.6.0', COLLAPSE_INDEX: false, FILE_SUFFIX: '.html', HAS_SOURCE: true }; </script> <script type="text/javascript" src="../_static/jquery.js"></script> <script type="text/javascript" src="../_static/underscore.js"></script> <script type="text/javascript" src="../_static/doctools.js"></script> <link rel="shortcut icon" href="../_static/favicon.ico"/> <link rel="author" title="About these documents" href="../about.html" /> <link rel="top" title="scikits.learn v0.6.0 documentation" href="../index.html" /> <link rel="up" title="Examples" href="index.html" /> <link rel="next" title="Logistic Regression" href="logistic_l1_l2_coef.html" /> <link rel="prev" title="Parameter estimation using grid search with a nested cross-validation" href="grid_search_digits.html" /> </head> <body> <div class="header-wrapper"> <div class="header"> <p class="logo"><a href="../index.html"> <img src="../_static/scikit-learn-logo-small.png" alt="Logo"/> </a> </p><div class="navbar"> <ul> <li><a href="../install.html">Download</a></li> <li><a href="../support.html">Support</a></li> <li><a href="../user_guide.html">User Guide</a></li> <li><a href="index.html">Examples</a></li> <li><a href="../developers/index.html">Development</a></li> </ul> <div class="search_form"> <div id="cse" style="width: 100%;"></div> <script src="http://www.google.com/jsapi" type="text/javascript"></script> <script type="text/javascript"> 
google.load('search', '1', {language : 'en'}); google.setOnLoadCallback(function() { var customSearchControl = new google.search.CustomSearchControl('016639176250731907682:tjtqbvtvij0'); customSearchControl.setResultSetSize(google.search.Search.FILTERED_CSE_RESULTSET); var options = new google.search.DrawOptions(); options.setAutoComplete(true); customSearchControl.draw('cse', options); }, true); </script> </div> </div> <!-- end navbar --></div> </div> <div class="content-wrapper"> <!-- <div id="blue_tile"></div> --> <div class="sphinxsidebar"> <div class="rel"> <a href="grid_search_digits.html" title="Parameter estimation using grid search with a nested cross-validation" accesskey="P">previous</a> | <a href="logistic_l1_l2_coef.html" title="Logistic Regression" accesskey="N">next</a> | <a href="../genindex.html" title="General Index" accesskey="I">index</a> </div> <h3>Contents</h3> <ul> <li><a class="reference internal" href="#">Sample pipeline for text feature extraction and evaluation</a></li> </ul> </div> <div class="content"> <div class="documentwrapper"> <div class="bodywrapper"> <div class="body"> <div class="section" id="sample-pipeline-for-text-feature-extraction-and-evaluation"> <span id="example-grid-search-text-feature-extraction-py"></span><h1>Sample pipeline for text feature extraction and evaluation<a class="headerlink" href="#sample-pipeline-for-text-feature-extraction-and-evaluation" title="Permalink to this headline">¶</a></h1> <p>The dataset used in this example is the 20 newsgroups dataset, which will be automatically downloaded, cached, and reused for the document classification example.</p> <p>You can adjust the number of categories by giving their names to the dataset loader, or set them to None to get all 20 of them.</p> <p>Here is a sample output of a run on a quad-core machine:</p> <div class="highlight-python"><pre>Loading 20 newsgroups dataset for categories: ['alt.atheism', 'talk.religion.misc'] 1427 documents 2 categories
Performing grid search... pipeline: ['vect', 'tfidf', 'clf'] parameters: {'clf__alpha': (1.0000000000000001e-05, 9.9999999999999995e-07), 'clf__n_iter': (10, 50, 80), 'clf__penalty': ('l2', 'elasticnet'), 'tfidf__use_idf': (True, False), 'vect__analyzer__max_n': (1, 2), 'vect__max_df': (0.5, 0.75, 1.0), 'vect__max_features': (None, 5000, 10000, 50000)} done in 1737.030s Best score: 0.940 Best parameters set: clf__alpha: 9.9999999999999995e-07 clf__n_iter: 50 clf__penalty: 'elasticnet' tfidf__use_idf: True vect__analyzer__max_n: 2 vect__max_df: 0.75 vect__max_features: 50000</pre> </div> <p><strong>Python source code:</strong> <a class="reference download internal" href="../_downloads/grid_search_text_feature_extraction.py"><tt class="xref download docutils literal"><span class="pre">grid_search_text_feature_extraction.py</span></tt></a></p> <div class="highlight-python"><div class="highlight"><pre><span class="k">print</span> <span class="n">__doc__</span> <span class="c"># Author: Olivier Grisel <olivier.grisel@ensta.org></span> <span class="c"># Peter Prettenhofer <peter.prettenhofer@gmail.com></span> <span class="c"># Mathieu Blondel <mathieu@mblondel.org></span> <span class="c"># License: Simplified BSD</span> <span class="kn">from</span> <span class="nn">pprint</span> <span class="kn">import</span> <span class="n">pprint</span> <span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span> <span class="kn">import</span> <span class="nn">os</span> <span class="kn">from</span> <span class="nn">scikits.learn.datasets</span> <span class="kn">import</span> <span class="n">load_files</span> <span class="kn">from</span> <span class="nn">scikits.learn.feature_extraction.text.sparse</span> <span class="kn">import</span> <span class="n">CountVectorizer</span> <span class="kn">from</span> <span class="nn">scikits.learn.feature_extraction.text.sparse</span> <span class="kn">import</span> <span 
class="n">TfidfTransformer</span> <span class="kn">from</span> <span class="nn">scikits.learn.linear_model.sparse</span> <span class="kn">import</span> <span class="n">SGDClassifier</span> <span class="kn">from</span> <span class="nn">scikits.learn.grid_search</span> <span class="kn">import</span> <span class="n">GridSearchCV</span> <span class="kn">from</span> <span class="nn">scikits.learn.pipeline</span> <span class="kn">import</span> <span class="n">Pipeline</span> <span class="c">################################################################################</span> <span class="c"># Download the data, if not already on disk</span> <span class="n">url</span> <span class="o">=</span> <span class="s">"http://people.csail.mit.edu/jrennie/20Newsgroups/20news-18828.tar.gz"</span> <span class="n">archive_name</span> <span class="o">=</span> <span class="s">"20news-18828.tar.gz"</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">archive_name</span><span class="p">[:</span><span class="o">-</span><span class="mi">7</span><span class="p">]):</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">archive_name</span><span class="p">):</span> <span class="kn">import</span> <span class="nn">urllib</span> <span class="k">print</span> <span class="s">"Downloading data, please wait (14MB)..."</span> <span class="k">print</span> <span class="n">url</span> <span class="n">opener</span> <span class="o">=</span> <span class="n">urllib</span><span class="o">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">url</span><span class="p">)</span> <span class="nb">open</span><span class="p">(</span><span
class="n">archive_name</span><span class="p">,</span> <span class="s">'wb'</span><span class="p">)</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">opener</span><span class="o">.</span><span class="n">read</span><span class="p">())</span> <span class="k">print</span> <span class="kn">import</span> <span class="nn">tarfile</span> <span class="k">print</span> <span class="s">"Decompressing the archive: "</span> <span class="o">+</span> <span class="n">archive_name</span> <span class="n">tarfile</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">archive_name</span><span class="p">,</span> <span class="s">"r:gz"</span><span class="p">)</span><span class="o">.</span><span class="n">extractall</span><span class="p">()</span> <span class="k">print</span> <span class="c">################################################################################</span> <span class="c"># Load some categories from the training set</span> <span class="n">categories</span> <span class="o">=</span> <span class="p">[</span> <span class="s">'alt.atheism'</span><span class="p">,</span> <span class="s">'talk.religion.misc'</span><span class="p">,</span> <span class="p">]</span> <span class="c"># Uncomment the following to do the analysis on all the categories</span> <span class="c">#categories = None</span> <span class="k">print</span> <span class="s">"Loading 20 newsgroups dataset for categories:"</span> <span class="k">print</span> <span class="n">categories</span> <span class="n">data</span> <span class="o">=</span> <span class="n">load_files</span><span class="p">(</span><span class="s">'20news-18828'</span><span class="p">,</span> <span class="n">categories</span><span class="o">=</span><span class="n">categories</span><span class="p">)</span> <span class="k">print</span> <span class="s">"</span><span class="si">%d</span><span class="s"> documents"</span> <span class="o">%</span> <span
class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">filenames</span><span class="p">)</span> <span class="k">print</span> <span class="s">"</span><span class="si">%d</span><span class="s"> categories"</span> <span class="o">%</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">target_names</span><span class="p">)</span> <span class="k">print</span> <span class="c">################################################################################</span> <span class="c"># define a pipeline combining a text feature extractor with a simple</span> <span class="c"># classifier</span> <span class="n">pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([</span> <span class="p">(</span><span class="s">'vect'</span><span class="p">,</span> <span class="n">CountVectorizer</span><span class="p">()),</span> <span class="p">(</span><span class="s">'tfidf'</span><span class="p">,</span> <span class="n">TfidfTransformer</span><span class="p">()),</span> <span class="p">(</span><span class="s">'clf'</span><span class="p">,</span> <span class="n">SGDClassifier</span><span class="p">()),</span> <span class="p">])</span> <span class="n">parameters</span> <span class="o">=</span> <span class="p">{</span> <span class="c"># uncommenting more parameters will give better exploring power but will</span> <span class="c"># increase processing time in a combinatorial way</span> <span class="s">'vect__max_df'</span><span class="p">:</span> <span class="p">(</span><span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span> <span class="c"># 'vect__max_features': (None, 5000, 10000, 50000),</span> <span class="s">'vect__analyzer__max_n'</span><span class="p">:</span> <span class="p">(</span><span class="mi">1</span><span 
class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="c"># words or bigrams</span> <span class="c"># 'tfidf__use_idf': (True, False),</span> <span class="s">'clf__alpha'</span><span class="p">:</span> <span class="p">(</span><span class="mf">0.00001</span><span class="p">,</span> <span class="mf">0.000001</span><span class="p">),</span> <span class="s">'clf__penalty'</span><span class="p">:</span> <span class="p">(</span><span class="s">'l2'</span><span class="p">,</span> <span class="s">'elasticnet'</span><span class="p">),</span> <span class="c"># 'clf__n_iter': (10, 50, 80),</span> <span class="p">}</span> <span class="c"># find the best parameters for both the feature extraction and the</span> <span class="c"># classifier</span> <span class="n">grid_search</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">pipeline</span><span class="p">,</span> <span class="n">parameters</span><span class="p">,</span> <span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span> <span class="c"># cross-validation doesn't work if the length of the data is not known,</span> <span class="c"># hence use lists instead of iterators</span> <span class="n">text_docs</span> <span class="o">=</span> <span class="p">[</span><span class="nb">file</span><span class="p">(</span><span class="n">f</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">data</span><span class="o">.</span><span class="n">filenames</span><span class="p">]</span> <span class="k">print</span> <span class="s">"Performing grid search..."</span> <span class="k">print</span> <span class="s">"pipeline:"</span><span class="p">,</span> <span class="p">[</span><span class="n">name</span> <span class="k">for</span> <span class="n">name</span><span 
class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">pipeline</span><span class="o">.</span><span class="n">steps</span><span class="p">]</span> <span class="k">print</span> <span class="s">"parameters:"</span> <span class="n">pprint</span><span class="p">(</span><span class="n">parameters</span><span class="p">)</span> <span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span> <span class="n">grid_search</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">text_docs</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">target</span><span class="p">)</span> <span class="k">print</span> <span class="s">"done in </span><span class="si">%0.3f</span><span class="s">s"</span> <span class="o">%</span> <span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0</span><span class="p">)</span> <span class="k">print</span> <span class="k">print</span> <span class="s">"Best score: </span><span class="si">%0.3f</span><span class="s">"</span> <span class="o">%</span> <span class="n">grid_search</span><span class="o">.</span><span class="n">best_score</span> <span class="k">print</span> <span class="s">"Best parameters set:"</span> <span class="n">best_parameters</span> <span class="o">=</span> <span class="n">grid_search</span><span class="o">.</span><span class="n">best_estimator</span><span class="o">.</span><span class="n">_get_params</span><span class="p">()</span> <span class="k">for</span> <span class="n">param_name</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">parameters</span><span class="o">.</span><span class="n">keys</span><span class="p">()):</span> <span class="k">print</span> <span class="s">"</span><span class="se">\t</span><span class="si">%s</span><span class="s">: </span><span 
class="si">%r</span><span class="s">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">param_name</span><span class="p">,</span> <span class="n">best_parameters</span><span class="p">[</span><span class="n">param_name</span><span class="p">])</span> </pre></div> </div> </div> </div> </div> </div> <div class="clearer"></div> </div> </div> <div class="footer"> <p style="text-align: center">This documentation is for scikits.learn version 0.6.0.</p> © 2010, scikits.learn developers (BSD License). Created using <a href="http://sphinx.pocoo.org/">Sphinx</a> 1.0.5. Design by <a href="http://webylimonada.com">Web y Limonada</a>. </div> </body> </html>