<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <html> <head> <link rel="STYLESHEET" href="lib.css" type='text/css' /> <link rel="SHORTCUT ICON" href="../icons/pyfav.gif" /> <link rel='start' href='../index.html' title='Python Documentation Index' /> <link rel="first" href="lib.html" title='Python Library Reference' /> <link rel='contents' href='contents.html' title="Contents" /> <link rel='index' href='genindex.html' title='Index' /> <link rel='last' href='about.html' title='About this document...' /> <link rel='help' href='about.html' title='About this document...' /> <LINK rel="next" href="module-csv.html"> <LINK rel="prev" href="module-netrc.html"> <LINK rel="parent" href="netdata.html"> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> <meta name='aesop' content='information' /> <META name="description" content="robotparser -- Parser for robots.txt"> <META name="keywords" content="lib"> <META name="resource-type" content="document"> <META name="distribution" content="global"> <title>12.19 robotparser -- Parser for robots.txt</title> </head> <body> <DIV CLASS="navigation"> <div id='top-navigation-panel'> <table align="center" width="100%" cellpadding="0" cellspacing="2"> <tr> <td class='online-navigation'><a rel="prev" title="12.18.1 netrc Objects" href="netrc-objects.html"><img src='../icons/previous.png' border='0' height='32' alt='Previous Page' width='32' /></A></td> <td class='online-navigation'><a rel="parent" title="12. 
Internet Data Handling" href="netdata.html"><img src='../icons/up.png' border='0' height='32' alt='Up One Level' width='32' /></A></td> <td class='online-navigation'><a rel="next" title="12.20 csv " href="module-csv.html"><img src='../icons/next.png' border='0' height='32' alt='Next Page' width='32' /></A></td> <td align="center" width="100%">Python Library Reference</td> <td class='online-navigation'><a rel="contents" title="Table of Contents" href="contents.html"><img src='../icons/contents.png' border='0' height='32' alt='Contents' width='32' /></A></td> <td class='online-navigation'><a href="modindex.html" title="Module Index"><img src='../icons/modules.png' border='0' height='32' alt='Module Index' width='32' /></a></td> <td class='online-navigation'><a rel="index" title="Index" href="genindex.html"><img src='../icons/index.png' border='0' height='32' alt='Index' width='32' /></A></td> </tr></table> <div class='online-navigation'> <b class="navlabel">Previous:</b> <a class="sectref" rel="prev" href="netrc-objects.html">12.18.1 netrc Objects</A> <b class="navlabel">Up:</b> <a class="sectref" rel="parent" href="netdata.html">12. Internet Data Handling</A> <b class="navlabel">Next:</b> <a class="sectref" rel="next" href="module-csv.html">12.20 csv </A> </div> <hr /></div> </DIV> <!--End of Navigation Panel--> <H1><A NAME="SECTION00141900000000000000000"> 12.19 <tt class="module">robotparser</tt> -- Parser for robots.txt</A> </H1> <P> <A NAME="module-robotparser"><!--z--></A> <P> <a id='l2h-3909'><!--x--></a> <P> This module provides a single class, <tt class="class">RobotFileParser</tt>, which answers questions about whether or not a particular user agent can fetch a URL on the Web site that published the <span class="file">robots.txt</span> file. For more details on the structure of <span class="file">robots.txt</span> files, see <a class="url" href="http://www.robotstxt.org/wc/norobots.html">http://www.robotstxt.org/wc/norobots.html</a>. 
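<P>
As a rough illustration of the kind of question this module answers, the sketch below feeds a small, made-up <span class="file">robots.txt</span> to the parser without touching the network. The host name <code>www.example.com</code> and the rule list are hypothetical, and the fallback import assumes a newer Python in which the same module is found under <code>urllib</code>.
<div class="verbatim"><pre>
```python
try:
    import robotparser                 # module name used in this document
except ImportError:
    from urllib import robotparser     # assumption: newer Python, same module

# Hypothetical robots.txt content, supplied as a list of lines.
lines = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
]

rp = robotparser.RobotFileParser()
rp.parse(lines)                        # parse in-memory lines, no fetch needed

# Ask whether any user agent ("*") may fetch two hypothetical URLs.
allowed_home = rp.can_fetch("*", "http://www.example.com/")
allowed_cgi = rp.can_fetch("*", "http://www.example.com/cgi-bin/search")
```
</pre></div>
<P>
Under these assumed rules, the first URL should be allowed and the second refused, because only paths beginning with <code>/cgi-bin/</code> are disallowed.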
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> <td><nobr><b><span class="typelabel">class</span> <tt id='l2h-3902' class="class">RobotFileParser</tt></b>(</nobr></td> <td>)</td></tr></table></dt>
<dd>
<P> This class provides methods to read and parse a single <span class="file">robots.txt</span> file and to answer questions about it.
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> <td><nobr><b><tt id='l2h-3903' class="method">set_url</tt></b>(</nobr></td> <td><var>url</var>)</td></tr></table></dt>
<dd> Sets the URL referring to a <span class="file">robots.txt</span> file. </dl>
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> <td><nobr><b><tt id='l2h-3904' class="method">read</tt></b>(</nobr></td> <td>)</td></tr></table></dt>
<dd> Fetches the <span class="file">robots.txt</span> file from the URL given to <tt class="method">set_url()</tt> and feeds its contents to the parser. </dl>
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> <td><nobr><b><tt id='l2h-3905' class="method">parse</tt></b>(</nobr></td> <td><var>lines</var>)</td></tr></table></dt>
<dd> Parses <var>lines</var>, a sequence of lines taken from a <span class="file">robots.txt</span> file. </dl>
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> <td><nobr><b><tt id='l2h-3906' class="method">can_fetch</tt></b>(</nobr></td> <td><var>useragent, url</var>)</td></tr></table></dt>
<dd> Returns <code>True</code> if the <var>useragent</var> is allowed to fetch the <var>url</var> according to the rules contained in the parsed <span class="file">robots.txt</span> file. </dl>
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> <td><nobr><b><tt id='l2h-3907' class="method">mtime</tt></b>(</nobr></td> <td>)</td></tr></table></dt>
<dd> Returns the time the <code>robots.txt</code> file was last fetched. This is useful for long-running web spiders that need to check for new <code>robots.txt</code> files periodically.
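<P>
One way a long-running spider might use this value is sketched below. The helper <code>is_stale</code> and the one-hour <code>MAX_AGE</code> threshold are hypothetical names chosen for illustration, and the sketch assumes that a freshly created parser reports a fetch time of zero until <tt class="method">modified()</tt> records one.
<div class="verbatim"><pre>
```python
import time

try:
    import robotparser                 # module name used in this document
except ImportError:
    from urllib import robotparser     # assumption: newer Python, same module

MAX_AGE = 3600  # hypothetical refresh interval, in seconds


def is_stale(rp, max_age=MAX_AGE):
    """Return True if the recorded fetch time is older than max_age seconds."""
    return time.time() - rp.mtime() > max_age


rp = robotparser.RobotFileParser()
stale_before = is_stale(rp)   # assumed: nothing recorded yet, so mtime() is 0
rp.modified()                 # record "now" as the time of the last fetch
stale_after = is_stale(rp)    # the recorded time is fresh again
```
</pre></div>
<P>
In a real spider, a stale result would presumably trigger another call to <tt class="method">read()</tt> before the fetch time is recorded again.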
</dl>
<P>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline"> <td><nobr><b><tt id='l2h-3908' class="method">modified</tt></b>(</nobr></td> <td>)</td></tr></table></dt>
<dd> Sets the time the <code>robots.txt</code> file was last fetched to the current time. </dl>
<P> </dl>
<P> The following example demonstrates basic use of the <tt class="class">RobotFileParser</tt> class.
<P>
<div class="verbatim"><pre>
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True
</pre></div>
<DIV CLASS="navigation">
<hr />
<span class="release-info">Release 2.3.4, documentation updated on May 20, 2004.</span>
</DIV>
<!--End of Navigation Panel-->
<ADDRESS> See <i><a href="about.html">About this document...</a></i> for information on suggesting changes. </ADDRESS>
</BODY>
</HTML>