Sophie

Sophie

distrib > Mandriva > current > i586 > media > contrib-release > by-pkgid > cf077afebcf285bb3102fc395321c7f3 > files > 72

htmlparser-1.6-5mdv2010.0.noarch.rpm

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<!-- $Id: head.tmpl,v 1.5 2002/12/15 01:30:47 carstenklapp Exp $ -->
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta name="robots" content="index,follow" />
<meta name="keywords" content="Searching For Data, PhpWiki" />
<meta name="description" content="Searching for data is one of the most challenging tasks in a web page due to its seemingly unstructured (or badly structured) form. Complex searches are now possible with the parser in a simple to use API. Here's an example :" />
<meta name="language" content="" />
<meta name="document-type" content="Public" />
<meta name="document-rating" content="General" />
<meta name="generator" content="phpWiki" />
<meta name="PHPWIKI_VERSION" content="1.3.4" />

<link rel="shortcut icon" href="/wiki/themes/default/images/favicon.ico" />
<link rel="home" title="HomePage" href="HomePage" />
<link rel="help" title="HowToUseWiki" href="HowToUseWiki" />
<link rel="copyright" title="GNU General Public License" href="http://www.gnu.org/copyleft/gpl.html#SEC1" />
<link rel="author" title="The PhpWiki Programming Team" href="http://phpwiki.sourceforge.net/phpwiki/ThePhpWikiProgrammingTeam" />
<link rel="search" title="FindPage" href="FindPage" />
<link rel="alternate" title="View Source: SearchingForData" href="SearchingForData?action=viewsource&amp;version=4" />
<link rel="alternate" type="application/rss+xml" title="RSS" href="RecentChanges?format=rss" />

<link rel="bookmark" title="SandBox" href="SandBox" />
<link rel="bookmark" title="WikiWikiWeb" href="WikiWikiWeb" />



<link rel="stylesheet" title="MacOSX" type="text/css" charset="iso-8859-1" href="/wiki/themes/MacOSX/MacOSX.css" /><link rel="alternate stylesheet" title="Printer" type="text/css" charset="iso-8859-1" href="/wiki/themes/default/phpwiki-printer.css" media="print, screen" /><link rel="alternate stylesheet" title="Modern" type="text/css" charset="iso-8859-1" href="/wiki/themes/default/phpwiki-modern.css" /><style type="text/css">
<!--
body {background-image: url(/wiki/themes/MacOSX/images/bgpaper8.png);}
-->
</style>
<title>PhpWiki - Searching For Data</title>
</head>
<!-- End head -->
<!-- Begin body -->
<!-- $Id: body.tmpl,v 1.30 2002/09/02 14:36:58 rurban Exp $ -->
<body>
<!-- Begin top -->
<!-- $Id: top.tmpl,v 1.20 2002/12/15 01:30:47 carstenklapp Exp $ -->

<!-- End top -->
<!-- Begin browse -->
<!-- $Id: browse.tmpl,v 1.22 2002/02/19 23:00:26 carstenklapp Exp $ -->


<div class="wikitext"><p>Searching for data is one of the most challenging tasks in a web page due to its seemingly unstructured (or badly structured) form. Complex searches are now possible with the parser in a simple to use API. Here's an example :</p>
<p>We are looking at a page which has the following html:</p>
<pre>
&lt;html&gt;
...
&lt;body&gt;
   &lt;table&gt;
     &lt;tr&gt;
       &lt;td&gt;&lt;font size="-1"&gt;Name:&lt;b&gt;&lt;i&gt;John Doe&lt;/i&gt;&lt;/b&gt;&lt;/font&gt;&lt;/td&gt;
       ..
     &lt;/tr&gt;
     &lt;tr&gt;
       ..
     &lt;/tr&gt;
   &lt;/table&gt;
&lt;/body&gt;
&lt;/html&gt;</pre>
<p>We'd like to extract the information corresponding to the field "Name". This is possible if we make use of the fact that the name appears two tags after "Name".</p>
<p>Code to achieve this would look like:</p>
<pre>
Node nodes [] = parser.extractAllNodesThatAre(TableTag.class);
// Get the first table found
TableTag table = (TableTag)nodes[0];

// Find the position of Name.
StringNode [] stringNodes = table.digupStringNode("Name");
StringNode name = stringNodes[0];

// We assume that the first node that matched is the one we want. We
// navigate to its parent, the column tag &lt;td&gt;
CompositeTag td = name.getParent();

// From the parent, we shall find out the position of "Name"
int posOfName = td.findPositionOf(name);

// Its easy now to navigate to John Doe, as we know it is 3 positions away
Node expectedName = td.childAt(posOfName + 3);
</pre>
<hr /><p>You can move up the parent tree - e.g. when the data is in seperate columns,</p>
<pre>
&lt;html&gt;
...
&lt;body&gt;
   &lt;table&gt;
     &lt;tr&gt;
       &lt;td&gt;&lt;font size="-1"&gt;Name:&lt;/font&gt;&lt;/td&gt;
       &lt;td&gt;&lt;font size="-1"&gt;John Doe&lt;/font&gt;&lt;/td&gt;
     &lt;/tr&gt;
     &lt;tr&gt;
       ..
     &lt;/tr&gt;
   &lt;/table&gt;
&lt;/body&gt;
&lt;/html&gt;</pre>
<p>We'd like to perform the same search on "Name".</p>
<p>Code to achieve this would look like:</p>
<pre>
Node nodes [] = parser.extractAllNodesThatAre(TableTag.class);
// Get the first table found
TableTag table = (TableTag)nodes[0];

// Find the position of Name.
StringNode [] stringNodes = table.digupStringNode("Name");

// We assume that the first node that matched is the one we want. We
// navigate to its parent (column &lt;td&gt;)
CompositeTag td = stringNodes[0].getParent();

// Navigate to its parent (row &lt;tr&gt;)
CompositeTag tr = parentOfName.getParent();

// From the parent, we shall find out the position of the column
int columnNo = tr.findPositionOf(td);

// Its easy now to navigate to John Doe, as we know it is in the next column
TableColumn nextColumn = (TableColumn)tr.childAt(columnNo+1);

// The name is the second item in the column tag
Node expectedName = nextColumn.childAt(1);</pre>
</div>


<!-- End browse -->
<!-- Begin bottom -->
<!-- $Id: bottom.tmpl,v 1.3 2002/09/15 20:21:16 rurban Exp $ -->
<!-- Add your Disclaimer here -->
<!-- Begin debug -->
<!-- $Id: debug.tmpl,v 1.9 2002/09/17 02:10:33 dairiki Exp $ -->
<table width="%100" border="0" cellpadding="0" cellspacing="0">
<tr><td>

</td><td>
<span class="debug">Page Execution took 0.261 seconds</span>
</td></tr></table>
<!-- This keeps the valid XHTML! icons from "hanging off the bottom of the scree" -->
<br style="clear: both;" />
<!-- End debug -->
<!-- End bottom -->
</body>
<!-- End body -->
<!-- phpwiki source:
$Id: prepend.php,v 1.13 2002/09/18 19:23:25 dairiki Exp $
$Id: ErrorManager.php,v 1.16 2002/09/14 22:23:36 dairiki Exp $
$Id: HtmlElement.php,v 1.27 2002/10/31 03:28:30 carstenklapp Exp $
$Id: XmlElement.php,v 1.17 2002/08/17 15:52:51 rurban Exp $
$Id: WikiCallback.php,v 1.2 2001/11/21 20:01:52 dairiki Exp $
$Id: index.php,v 1.99 2002/12/31 01:13:14 wainstead Exp $
$Id: main.php,v 1.90 2002/11/19 07:07:37 carstenklapp Exp $
$Id: config.php,v 1.68 2002/11/14 22:28:03 carstenklapp Exp $
$Id: FileFinder.php,v 1.11 2002/09/18 18:34:13 dairiki Exp $
$Id: Request.php,v 1.24 2002/12/14 16:21:46 dairiki Exp $
$Id: WikiUser.php,v 1.29 2002/11/19 07:07:38 carstenklapp Exp $
$Id: WikiDB.php,v 1.17 2002/09/15 03:56:22 dairiki Exp $
$Id: SQL.php,v 1.2 2001/09/19 03:24:36 wainstead Exp $
$Id: mysql.php,v 1.3 2001/12/08 16:02:35 dairiki Exp $
$Id: PearDB.php,v 1.28 2002/09/12 11:45:33 rurban Exp $
$Id: backend.php,v 1.3 2002/01/10 23:32:04 carstenklapp Exp $
$Id: DB.php,v 1.2 2002/09/12 11:45:33 rurban Exp $
From Pear CVS: Id: DB.php,v 1.13 2002/07/02 15:19:49 cox Exp
$Id: PEAR.php,v 1.1 2002/01/28 04:01:56 dairiki Exp $
From Pear CVS: Id: PEAR.php,v 1.29 2001/12/15 15:01:35 mj Exp
$Id: mysql.php,v 1.2 2002/09/12 11:45:33 rurban Exp $
From Pear CVS: Id: mysql.php,v 1.5 2002/06/19 00:41:06 cox Exp
$Id: common.php,v 1.2 2002/09/12 11:45:33 rurban Exp $
From Pear CVS: Id: common.php,v 1.8 2002/06/12 15:03:16 fab Exp
$Id: themeinfo.php,v 1.46 2002/03/08 20:31:14 carstenklapp Exp $
$Id: Theme.php,v 1.58 2002/10/12 08:55:03 carstenklapp Exp $
$Id: display.php,v 1.38 2002/09/15 20:17:58 rurban Exp $
$Id: Template.php,v 1.46 2002/09/15 15:05:47 rurban Exp $
$Id: WikiPlugin.php,v 1.27 2002/11/04 03:15:59 carstenklapp Exp $
$Id: BlockParser.php,v 1.29 2002/11/25 22:25:49 dairiki Exp $
$Id: InlineParser.php,v 1.19 2002/11/25 22:51:37 dairiki Exp $
$Id: interwiki.php,v 1.23 2002/10/06 16:45:10 dairiki Exp $
$Id: PageType.php,v 1.13 2002/09/04 20:39:47 dairiki Exp $
-->
</html>