<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title>Classification of text documents: using a MLComp dataset — scikits.learn v0.6.0 documentation</title> <link rel="stylesheet" href="../_static/nature.css" type="text/css" /> <link rel="stylesheet" href="../_static/pygments.css" type="text/css" /> <script type="text/javascript"> var DOCUMENTATION_OPTIONS = { URL_ROOT: '../', VERSION: '0.6.0', COLLAPSE_INDEX: false, FILE_SUFFIX: '.html', HAS_SOURCE: true }; </script> <script type="text/javascript" src="../_static/jquery.js"></script> <script type="text/javascript" src="../_static/underscore.js"></script> <script type="text/javascript" src="../_static/doctools.js"></script> <link rel="shortcut icon" href="../_static/favicon.ico"/> <link rel="author" title="About these documents" href="../about.html" /> <link rel="top" title="scikits.learn v0.6.0 documentation" href="../index.html" /> <link rel="up" title="Examples" href="index.html" /> <link rel="next" title="Gaussian Naive Bayes" href="naive_bayes.html" /> <link rel="prev" title="Logistic Regression" href="logistic_l1_l2_coef.html" /> </head> <body> <div class="header-wrapper"> <div class="header"> <p class="logo"><a href="../index.html"> <img src="../_static/scikit-learn-logo-small.png" alt="Logo"/> </a> </p><div class="navbar"> <ul> <li><a href="../install.html">Download</a></li> <li><a href="../support.html">Support</a></li> <li><a href="../user_guide.html">User Guide</a></li> <li><a href="index.html">Examples</a></li> <li><a href="../developers/index.html">Development</a></li> </ul> <div class="search_form"> <div id="cse" style="width: 100%;"></div> <script src="http://www.google.com/jsapi" type="text/javascript"></script> <script type="text/javascript"> google.load('search', '1', {language : 'en'}); google.setOnLoadCallback(function() { var customSearchControl = new google.search.CustomSearchControl('016639176250731907682:tjtqbvtvij0'); customSearchControl.setResultSetSize(google.search.Search.FILTERED_CSE_RESULTSET); var options = new google.search.DrawOptions(); options.setAutoComplete(true); customSearchControl.draw('cse', options); }, true); </script> </div> </div> <!-- end navbar --></div> </div> <div class="content-wrapper"> <!-- <div id="blue_tile"></div> --> <div class="sphinxsidebar"> <div class="rel"> <a href="logistic_l1_l2_coef.html" title="Logistic Regression" accesskey="P">previous</a> | <a href="naive_bayes.html" title="Gaussian Naive Bayes" accesskey="N">next</a> | <a href="../genindex.html" title="General Index" accesskey="I">index</a> </div> <h3>Contents</h3> <ul> <li><a class="reference internal" href="#">Classification of text documents: using a MLComp dataset</a></li> </ul> </div> <div class="content"> <div class="documentwrapper"> <div class="bodywrapper"> <div class="body"> <div class="section" id="classification-of-text-documents-using-a-mlcomp-dataset"> <span id="example-mlcomp-sparse-document-classification-py"></span><h1>Classification of text documents: using a MLComp dataset<a class="headerlink" href="#classification-of-text-documents-using-a-mlcomp-dataset" title="Permalink to this headline">ΒΆ</a></h1> <p>This is an example showing how the scikit-learn can be used to classify documents by topics using a bag-of-words approach. This example uses a scipy.sparse matrix to store the features instead of standard numpy arrays.</p> <p>The dataset used in this example is the 20 newsgroups dataset and should be downloaded from the <a class="reference external" href="http://mlcomp.org">http://mlcomp.org</a> (free registration required):</p> <blockquote> <a class="reference external" href="http://mlcomp.org/datasets/379">http://mlcomp.org/datasets/379</a></blockquote> <p>Once downloaded unzip the arhive somewhere on your filesystem. For instance in:</p> <div class="highlight-python"><pre>% mkdir -p ~/data/mlcomp % cd ~/data/mlcomp % unzip /path/to/dataset-379-20news-18828_XXXXX.zip</pre> </div> <p>You should get a folder <tt class="docutils literal"><span class="pre">~/data/mlcomp/379</span></tt> with a file named <tt class="docutils literal"><span class="pre">metadata</span></tt> and subfolders <tt class="docutils literal"><span class="pre">raw</span></tt>, <tt class="docutils literal"><span class="pre">train</span></tt> and <tt class="docutils literal"><span class="pre">test</span></tt> holding the text documents organized by newsgroups.</p> <p>Then set the <tt class="docutils literal"><span class="pre">MLCOMP_DATASETS_HOME</span></tt> environment variable pointing to the root folder holding the uncompressed archive:</p> <div class="highlight-python"><pre>% export MLCOMP_DATASETS_HOME="~/data/mlcomp"</pre> </div> <p>Then you are ready to run this example using your favorite python shell:</p> <div class="highlight-python"><pre>% ipython examples/mlcomp_sparse_document_classification.py</pre> </div> <p><strong>Python source code:</strong> <a class="reference download internal" href="../_downloads/mlcomp_sparse_document_classification.py"><tt class="xref download docutils literal"><span class="pre">mlcomp_sparse_document_classification.py</span></tt></a></p> <div class="highlight-python"><div class="highlight"><pre><span class="k">print</span> <span class="n">__doc__</span> <span class="c"># Author: Olivier Grisel <olivier.grisel@ensta.org></span> <span class="c"># License: Simplified BSD</span> <span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span> <span class="kn">import</span> <span class="nn">sys</span> <span class="kn">import</span> <span class="nn">os</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span> <span class="kn">import</span> <span class="nn">scipy.sparse</span> <span class="kn">as</span> <span class="nn">sp</span> <span class="kn">import</span> <span class="nn">pylab</span> <span class="kn">as</span> <span class="nn">pl</span> <span class="kn">from</span> <span class="nn">scikits.learn.datasets</span> <span class="kn">import</span> <span class="n">load_mlcomp</span> <span class="kn">from</span> <span class="nn">scikits.learn.feature_extraction.text.sparse</span> <span class="kn">import</span> <span class="n">Vectorizer</span> <span class="kn">from</span> <span class="nn">scikits.learn.linear_model.sparse</span> <span class="kn">import</span> <span class="n">SGDClassifier</span> <span class="kn">from</span> <span class="nn">scikits.learn.metrics</span> <span class="kn">import</span> <span class="n">confusion_matrix</span> <span class="kn">from</span> <span class="nn">scikits.learn.metrics</span> <span class="kn">import</span> <span class="n">classification_report</span> <span class="k">if</span> <span class="s">'MLCOMP_DATASETS_HOME'</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">:</span> <span class="k">print</span> <span class="s">"Please follow those instructions to get started:"</span> <span class="n">sys</span><span class="o">.</span><span class="n">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="c"># Load the training set</span> <span class="k">print</span> <span class="s">"Loading 20 newsgroups training set... "</span> <span class="n">news_train</span> <span class="o">=</span> <span class="n">load_mlcomp</span><span class="p">(</span><span class="s">'20news-18828'</span><span class="p">,</span> <span class="s">'train'</span><span class="p">)</span> <span class="k">print</span> <span class="n">news_train</span><span class="o">.</span><span class="n">DESCR</span> <span class="k">print</span> <span class="s">"</span><span class="si">%d</span><span class="s"> documents"</span> <span class="o">%</span> <span class="nb">len</span><span class="p">(</span><span class="n">news_train</span><span class="o">.</span><span class="n">filenames</span><span class="p">)</span> <span class="k">print</span> <span class="s">"</span><span class="si">%d</span><span class="s"> categories"</span> <span class="o">%</span> <span class="nb">len</span><span class="p">(</span><span class="n">news_train</span><span class="o">.</span><span class="n">target_names</span><span class="p">)</span> <span class="k">print</span> <span class="s">"Extracting features from the dataset using a sparse vectorizer"</span> <span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span> <span class="n">vectorizer</span> <span class="o">=</span> <span class="n">Vectorizer</span><span class="p">()</span> <span class="n">X_train</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">((</span><span class="nb">open</span><span class="p">(</span><span class="n">f</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">news_train</span><span class="o">.</span><span class="n">filenames</span><span class="p">))</span> <span class="k">print</span> <span class="s">"done in </span><span class="si">%f</span><span class="s">s"</span> <span class="o">%</span> <span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0</span><span class="p">)</span> <span class="k">print</span> <span class="s">"n_samples: </span><span class="si">%d</span><span class="s">, n_features: </span><span class="si">%d</span><span class="s">"</span> <span class="o">%</span> <span class="n">X_train</span><span class="o">.</span><span class="n">shape</span> <span class="k">assert</span> <span class="n">sp</span><span class="o">.</span><span class="n">issparse</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span> <span class="n">y_train</span> <span class="o">=</span> <span class="n">news_train</span><span class="o">.</span><span class="n">target</span> <span class="k">print</span> <span class="s">"Training a linear classifier..."</span> <span class="n">parameters</span> <span class="o">=</span> <span class="p">{</span> <span class="s">'loss'</span><span class="p">:</span> <span class="s">'hinge'</span><span class="p">,</span> <span class="s">'penalty'</span><span class="p">:</span> <span class="s">'l2'</span><span class="p">,</span> <span class="s">'n_iter'</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span> <span class="s">'alpha'</span><span class="p">:</span> <span class="mf">0.00001</span><span class="p">,</span> <span class="s">'fit_intercept'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span> <span class="p">}</span> <span class="k">print</span> <span class="s">"parameters:"</span><span class="p">,</span> <span class="n">parameters</span> <span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span> <span class="n">clf</span> <span class="o">=</span> <span class="n">SGDClassifier</span><span class="p">(</span><span class="o">**</span><span class="n">parameters</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span> <span class="k">print</span> <span class="s">"done in </span><span class="si">%f</span><span class="s">s"</span> <span class="o">%</span> <span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0</span><span class="p">)</span> <span class="k">print</span> <span class="s">"Percentage of non zeros coef: </span><span class="si">%f</span><span class="s">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">clf</span><span class="o">.</span><span class="n">coef_</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="o">*</span> <span class="mi">100</span><span class="p">)</span> <span class="k">print</span> <span class="s">"Loading 20 newsgroups test set... "</span> <span class="n">news_test</span> <span class="o">=</span> <span class="n">load_mlcomp</span><span class="p">(</span><span class="s">'20news-18828'</span><span class="p">,</span> <span class="s">'test'</span><span class="p">)</span> <span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span> <span class="k">print</span> <span class="s">"done in </span><span class="si">%f</span><span class="s">s"</span> <span class="o">%</span> <span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0</span><span class="p">)</span> <span class="k">print</span> <span class="s">"Predicting the labels of the test set..."</span> <span class="k">print</span> <span class="s">"</span><span class="si">%d</span><span class="s"> documents"</span> <span class="o">%</span> <span class="nb">len</span><span class="p">(</span><span class="n">news_test</span><span class="o">.</span><span class="n">filenames</span><span class="p">)</span> <span class="k">print</span> <span class="s">"</span><span class="si">%d</span><span class="s"> categories"</span> <span class="o">%</span> <span class="nb">len</span><span class="p">(</span><span class="n">news_test</span><span class="o">.</span><span class="n">target_names</span><span class="p">)</span> <span class="k">print</span> <span class="s">"Extracting features from the dataset using the same vectorizer"</span> <span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span> <span class="n">X_test</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="o">.</span><span class="n">transform</span><span class="p">((</span><span class="nb">open</span><span class="p">(</span><span class="n">f</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">news_test</span><span class="o">.</span><span class="n">filenames</span><span class="p">))</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">news_test</span><span class="o">.</span><span class="n">target</span> <span class="k">print</span> <span class="s">"done in </span><span class="si">%f</span><span class="s">s"</span> <span class="o">%</span> <span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0</span><span class="p">)</span> <span class="k">print</span> <span class="s">"n_samples: </span><span class="si">%d</span><span class="s">, n_features: </span><span class="si">%d</span><span class="s">"</span> <span class="o">%</span> <span class="n">X_test</span><span class="o">.</span><span class="n">shape</span> <span class="k">print</span> <span class="s">"Predicting the outcomes of the testing set"</span> <span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span> <span class="n">pred</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span> <span class="k">print</span> <span class="s">"done in </span><span class="si">%f</span><span class="s">s"</span> <span class="o">%</span> <span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0</span><span class="p">)</span> <span class="k">print</span> <span class="s">"Classification report on test set for classifier:"</span> <span class="k">print</span> <span class="n">clf</span> <span class="k">print</span> <span class="k">print</span> <span class="n">classification_report</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">pred</span><span class="p">,</span> <span class="n">class_names</span><span class="o">=</span><span class="n">news_test</span><span class="o">.</span><span class="n">target_names</span><span class="p">)</span> <span class="n">cm</span> <span class="o">=</span> <span class="n">confusion_matrix</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">pred</span><span class="p">)</span> <span class="k">print</span> <span class="s">"Confusion matrix:"</span> <span class="k">print</span> <span class="n">cm</span> <span class="c"># Show confusion matrix</span> <span class="n">pl</span><span class="o">.</span><span class="n">matshow</span><span class="p">(</span><span class="n">cm</span><span class="p">)</span> <span class="n">pl</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Confusion matrix'</span><span class="p">)</span> <span class="n">pl</span><span class="o">.</span><span class="n">colorbar</span><span class="p">()</span> <span class="n">pl</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> </pre></div> </div> </div> </div> </div> </div> <div class="clearer"></div> </div> </div> <div class="footer"> <p style="text-align: center">This documentation is relative to scikits.learn version 0.6.0<p> © 2010, scikits.learn developers (BSD Lincense). Created using <a href="http://sphinx.pocoo.org/">Sphinx</a> 1.0.5. Design by <a href="http://webylimonada.com">Web y Limonada</a>. </div> </body> </html>