webmake-2.4-2mdk.noarch.rpm

<wmmeta name="Title" value="Scraped Templates" />
<wmmeta name="Section" value="03-proc_logic" />
<wmmeta name="Score" value="50" />
<wmmeta name="Abstract">
How to use scrape_xml() to generate multiple templates from a single HTML file
</wmmeta>

This is a very neat trick.  A common problem with templating systems, such as
WebMake, is that they **don't actually help at all** in certain areas.

Here's one of the problems.  When an HTML Guy edits a page template, he's
typically going to edit __an entire page__, not just small snippets;
he has to see what the overall page looks like, align the items correctly,
make sure that this font looks OK with that font, this bgcolor with that
bgcolor, etc.

However, as **Talin** mentions in <a
href="http://www.advogato.org/article/350.html#4">this thread on Advogato</a>,
there's a problem: most large web sites use the notion of ''components'' -
that is, re-usable fragments of dynamic HTML which are assembled to form a
complete page.

So once the HTML Guy has designed a good-looking page to display ''a
list of top 10 selling movies on a site that sells VHS tapes'', as the example
in the Advo article suggests, the page now contains the following templates:

	- overall page template

	- top-10 page content

	- top-10 list table template

	- one-row-of-the-table template (which could in turn be broken down
	  into 2 templates: one for odd rows, one for even, etc.)

So someone has to go and cut up the page the HTML Guy has created into
components (template and content items, in WebMake terminology).  What a pain.

How do we deal with this problem?

Scraping
========

WebMake has some features which help here:

    - **Content ''src'' attribute**: templates can be loaded from a named
      file (or even a remote webpage).  Multiple templates or content
      items can be loaded from the same file.

    - **Pre-processing**: Using the __preproc__ attribute, you can specify
      a block of perl code to execute over each content item's text.

    - **Scraping**: The ##scrape_xml()## and ##scrape_out_xml()## perl code
      library functions allow you to easily cut out the bits of the page you
      want, based on patterns in the page text or HTML.

What you need to do is isolate -- or specify to the HTML Guy -- some patterns
in the text that delimit the areas of the page you will be turning into
templates.  You then set up WebMake commands which will scrape the
templates from the designer-provided page.
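
To make the idea concrete, here is a minimal standalone Perl sketch of the two
operations involved: pulling out the text between a pair of marker patterns,
and cutting that text out of the page.  The helper names ##extract_between##
and ##strip_between## are hypothetical stand-ins written for illustration.
They are not WebMake's implementation, and the real ##scrape_xml()## and
##scrape_out_xml()## may treat the delimiters differently (for example, in
whether the marker text itself is kept).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WebMake): return the text between the
# first match of $start and the next match of $end, or '' if not found.
sub extract_between {
    my ($text, $start, $end) = @_;
    return $text =~ /$start(.*?)$end/s ? $1 : '';
}

# Hypothetical helper (not part of WebMake): return $text with everything
# from $start to $end, markers included, removed.
sub strip_between {
    my ($text, $start, $end) = @_;
    $text =~ s/$start.*?$end//s;
    return $text;
}

# A made-up one-line page with a single marked zone.
my $page = '<body>nav <!-- start of zone -->Hello<!-- end of zone --> footer</body>';

my $start = qr/<!--\s*start of zone\s*-->/i;
my $end   = qr/<!--\s*end of zone\s*-->/i;

my $zone = extract_between($page, $start, $end);   # the zone's contents
my $rest = strip_between($page, $start, $end);     # the page minus the zone

print "zone: [$zone]\n";
print "rest: [$rest]\n";
```

The non-greedy ##.*?## stops at the first end marker after the start marker,
which is why each zone needs its own distinct pair of delimiters.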

Let's go with the 'top-10 videos on VHS' list page example from the Advogato
thread.  That contains the following templates:

	- overall page template

	- top-10 page content (text, images maybe etc.)

	- top-10 list table template

	- one-row-of-the-table template (which could in turn be broken down
	  into 2 templates: one for odd rows, one for even, etc.)

Let's say the designer has provided you with this page, called ''top10.htm''
(hopefully he's filled in the ... bits, of course!):

<safe>

    <html>
      <head>
      <title>Top 10 Movies on VHS</title>
      </head><body>

      .... blah blah navigation, other generic-page-template stuff ...

      <!-- start of top-10 page content -->

      Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
      eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim
      ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut
      aliquip ex ea commodo consequat. ...

      <!-- start of top-10 table -->
      <table bgcolor=nice etc.>

	<!-- start of even row -->
	<tr>
	  <td>....</td> <td>....</td> <td>....</td>
	</tr>
	<!-- end of even row -->

	<!-- start of odd row -->
	<tr>
	  <td>....</td> <td>....</td> <td>....</td>
	</tr>
	<!-- end of odd row -->

      </table>
      <!-- end of top-10 table -->

      <!-- end of top-10 page content -->

      .... blah blah more generic-page-template stuff ....
      </body>
    </html>

</safe>

We can see that the following content or template items can be scraped
out:

	- overall page template: everything between the ##html## tags, but with
	  text from ##start of top-10 page content## to ##end of top-10 page
	  content## stripped out

	- top-10 page content: ##start of top-10 page content## to ##end of
	  top-10 page content##, strip out ##top-10 table## section

	- top-10 list template: ##top-10 table##, strip out ##even row##
	  and ##odd row## sections

	- even-table-row template: ##even row##

	- odd-table-row template: ##odd row##

That translates into this WebMake code:

<safe>
  <{perl        # define the scraping functions we will use.

  sub scrape_page_template {
    return scrape_out_xml (shift,
        qr/start of top-10 page content/i, qr/end of top-10 page content/i);
  }

  sub scrape_top10_content {
    my $text = scrape_xml (shift,
        qr/start of top-10 page content/i, qr/end of top-10 page content/i);
    return scrape_out_xml ($text,
        qr/start of top-10 table/i, qr/end of top-10 table/i);
  }

  sub scrape_top10_list_template {
    my $text = scrape_xml (shift,
        qr/start of top-10 table/i, qr/end of top-10 table/i);
    $text = scrape_out_xml ($text,
        qr/start of even row/i, qr/end of even row/i);
    return scrape_out_xml ($text,
        qr/start of odd row/i, qr/end of odd row/i);
  }

  sub scrape_top10_even_row_template {
    return scrape_xml (shift, qr/start of even row/i, qr/end of even row/i);
  }

  sub scrape_top10_odd_row_template {
    return scrape_xml (shift, qr/start of odd row/i, qr/end of odd row/i);
  }

  # (Note that the qr// search patterns use the 'i' modifier;
  # non-programmers love to mess with capitalisation ;)

  '';           # replace this perl block with an empty string

  }>

  <!-- and now define the templates, using those functions: -->
  <template name="page_template" src="top10.htm"
                          preproc=scrape_page_template></template>
  <content name="top10_content" src="top10.htm"
                          preproc=scrape_top10_content></content>
  <template name="top10_list_template" src="top10.htm"
                          preproc=scrape_top10_list_template></template>
  <template name="top10_even_row_template" src="top10.htm"
                          preproc=scrape_top10_even_row_template></template>
  <template name="top10_odd_row_template" src="top10.htm"
                          preproc=scrape_top10_odd_row_template></template>

</safe>

That's it.  Those templates can now be used safely in the site logic,
and will work as long as the page designer doesn't muck about with
the comments too much.
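
Since everything hinges on those comments surviving, it can be worth checking
a new version of the designer's page before building.  The following is a
rough standalone Perl sketch of such a check, not a WebMake feature: the
##check_markers## helper is hypothetical, and the zone names are simply the
ones used in the example page above.  It runs against a small inline sample
here; in practice you would read ##top10.htm## into ##$html## instead.

```perl
use strict;
use warnings;

# Zone names taken from the comments in the example page.
my @zones = ('top-10 page content', 'top-10 table', 'even row', 'odd row');

# Hypothetical helper (not part of WebMake): return a list of problems
# found in $html.  An empty list means every "start of X" marker is
# present and is followed somewhere by its "end of X" marker.
sub check_markers {
    my ($html, @names) = @_;
    my @problems;
    for my $name (@names) {
        # Case-insensitive, like the scraping patterns themselves.
        my $start_re = qr/start of \Q$name\E/i;
        my $end_re   = qr/end of \Q$name\E/i;
        if ($html !~ $start_re) {
            push @problems, "missing 'start of $name'";
        }
        elsif ($html !~ /$start_re.*?$end_re/s) {
            push @problems, "no 'end of $name' after its start marker";
        }
    }
    return @problems;
}

# Inline sample standing in for the designer's file: only the even-row
# zone is marked, with creative capitalisation on the end marker.
my $html = '<!-- start of even row --><tr></tr><!-- End Of Even Row -->';

my @problems_found = check_markers($html, @zones);
print "$_\n" for @problems_found;
```

A check like this only confirms that the markers are present and paired; it
says nothing about whether the HTML between them is still sensible.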

You don't have to use comments, by the way; if your HTML Guy's editor allows
him to mark out ''zones'' of a page in some way, then just use whatever zone
markers it provides instead, or even just use patterns in the HTML tags or
text.
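
As an illustration of that last option, here is a small standalone Perl
sketch that uses the HTML tags themselves as delimiters.  The
##extract_between## helper is a hypothetical stand-in, not WebMake's
##scrape_xml()##, and the sample page is made up.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WebMake): text between two patterns,
# or '' if the pair is not found.
sub extract_between {
    my ($text, $start, $end) = @_;
    return $text =~ /$start(.*?)$end/s ? $1 : '';
}

# Made-up sample page with no marker comments at all.
my $page = '<p>intro</p><TABLE bgcolor="white"><tr><td>1</td></tr></table><p>outro</p>';

# Match the tags themselves, whatever their attributes or capitalisation.
my $rows = extract_between($page, qr/<table\b[^>]*>/i, qr/<\/table\s*>/i);

print "$rows\n";
```

Tag-based patterns like these are more fragile than dedicated markers: a
second or nested table on the page would need a more specific pattern, such
as one that also matches a particular attribute value.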