<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <head>
	<title>ht://Dig Frequently Asked Questions</title>
        <link rel="stylesheet" href="css/htdig.css">
  </head>
  <body bgcolor="#eef7ff">
	<h1>Frequently Asked Questions</h1>
	<p>
	  ht://Dig Copyright &copy; 1995-2004 <a href="THANKS.html">The ht://Dig Group</a><br>
	  Please see the file <a href="COPYING">COPYING</a> for
	  license information.
	</p>
	  <hr noshade size=4>
	  <p class="main">This FAQ is compiled by the ht://Dig developers and the
	  most recent version is available at &lt;<a
	  href="http://www.htdig.org/FAQ.html">http://www.htdig.org/FAQ.html</a>&gt;.
	  Questions (and answers!) are greatly appreciated.
	  Please send questions and/or answers to the ht://Dig user
	  mailing list at: &lt;<a href="mailto:htdig-general@lists.sourceforge.net">htdig-general@lists.sourceforge.net</a>&gt;.
	  </p>
	  <h2>Questions</h2>

	  <h3>1. General</h3>
	  1.1. <a href="#q1.1">Can I search the internet with ht://Dig?</a><br>
	  1.2. <a href="#q1.2">Can I index the internet with ht://Dig?</a><br>
	  1.3. <a href="#q1.3">What's the difference between htdig and
	  ht://Dig?</a><br>
	  1.4. <a href="#q1.4">I sent mail to Andrew or Geoff or
	  Gilles, but I never got a response!</a><br>
	  1.5. <a href="#q1.5">I sent a question to the mailing list but I
	  never got a response!</a><br>
	  1.6. <a href="#q1.6">I have a great idea/patch for ht://Dig!</a><br>
	  1.7. <a href="#q1.7">Is ht://Dig Y2K compliant?</a><br>
	  1.8. <a href="#q1.8">I think I found a bug. What should I do?</a><br>
	  1.9. <a href="#q1.9">Does ht://Dig support phrase or near
	  matching?</a><br>
	  1.10. <a href="#q1.10">What are the practical and/or theoretical
	  limits of ht://Dig?</a><br>
	  1.11. <a href="#q1.11">Do any ISPs offer ht://Dig as part of
	  their web hosting services?</a><br>
	  1.12. <a href="#q1.12">Can I use ht://Dig on a commercial website?</a><br>
	  1.13. <a href="#q1.13">Why do you use a non-free product to
	  index PDF files?</a><br>
	  1.14. <a href="#q1.14">Why do you have all those SourceForge
	  logos on your website?</a><br>
	  1.15. <a href="#q1.15">My question isn't answered here. Where should I
	  go for help?</a><br>
	  1.16. <a href="#q1.16">Why do the developers get annoyed when
	  I e-mail questions directly to them rather than the mailing list?</a><br>
	  1.17. <a href="#q1.17">Why do replies to messages on the
	  mailing list only go to the sender and not to the list?</a><br>
	  1.18. <a href="#q1.18">Can I use ht://Dig to index and search
	  an SQL database?</a><br>

	  <hr noshade size=2>

	  <h3>2. Getting ht://Dig</h3>
	  2.1. <a href="#q2.1">What's the latest version of ht://Dig?</a><br>
	  2.2. <a href="#q2.2">Are there binary distributions of ht://Dig?</a><br>
	  2.3. <a href="#q2.3">Are there mirror sites for ht://Dig?</a><br>
	  2.4. <a href="#q2.4">Is ht://Dig available by ftp?</a><br>
	  2.5. <a href="#q2.5">Are patches around to upgrade between
	  versions?</a><br>
	  2.6. <a href="#q2.6">Is there a Windows 95/98/2000/NT
	  version of ht://Dig?</a><br>
	  2.7. <a href="#q2.7">Where can I find the documentation for my
	  version of ht://Dig?</a><br>

	  <hr noshade size=2>

	  <h3>3. Compiling</h3>
	  3.1. <a href="#q3.1">When I compile ht://Dig I get an error
	  about libht.a.</a><br>
	  3.2. <a href="#q3.2">I get an error about -lg</a><br>
	  3.3. <a href="#q3.3">I'm compiling on Digital Unix and I get
	  messages about "unresolved" and "db_open."</a><br>
	  3.4. <a href="#q3.4">I'm compiling on FreeBSD and I get lots
	  of messages about '___error' being unresolved.</a><br>
	  3.5. <a href="#q3.5">I'm compiling on HP/UX and I get a complaint about
	  "Large Files not supported."</a><br>
	  3.6. <a href="#q3.6">I'm compiling on Solaris and when I run the 
	  programs I get complaints about not finding libstdc++.</a><br>
	  3.7. <a href="#q3.7">I'm compiling on IRIX and I'm having
	  database problems when I run the program.</a><br>
	  3.8. <a href="#q3.8">I'm compiling with gcc 3.2 and getting
	  all sorts of warnings/errors about ostream and such.</a><br>

	  <hr noshade size=2>

	  <h3>4. Configuration</h3>
	  4.1. <a href="#q4.1">How come I can't index my site?</a><br>
	  4.2. <a href="#q4.2">How can I change the output format of
	  htsearch?</a><br>
	  4.3. <a href="#q4.3">How do I index pages that start with '~'?</a><br>
	  4.4. <a href="#q4.4">Can I use multiple databases?</a><br>
	  4.5. <a href="#q4.5">OK, I can use multiple databases. Can I
	  merge them into one?</a><br>
	  4.6. <a href="#q4.6">Wow, ht://Dig eats up a lot of disk
	  space. How can I cut down?</a><br>
	  4.7. <a href="#q4.7">Can I use SSI or other CGIs in my
	  htsearch results?</a><br>
	  4.8. <a href="#q4.8">How do I index Word, Excel, PowerPoint
	  or PostScript documents?</a><br>
	  4.9. <a href="#q4.9">How do I index PDF files?</a><br>
	  4.10. <a href="#q4.10">How do I index documents in other
	  languages?</a><br>
	  4.11. <a href="#q4.11">How do I get rotating banner ads in
	  search results?</a><br>
	  4.12. <a href="#q4.12">How do I index numbers in documents?</a><br>
	  4.13. <a href="#q4.13">How can I call htsearch from a hypertext
	  link, rather than from a search form?</a><br>
	  4.14. <a href="#q4.14">How do I restrict a search to only meta
	  keywords entries in documents?</a><br>
	  4.15. <a href="#q4.15">Can I use meta tags to prevent htdig from
	  indexing certain files?</a><br>
	  4.16. <a href="#q4.16">How do I get htsearch to use the star image
	  in a different directory than the default /htdig?</a><br>
	  4.17. <a href="#q4.17">How do I get htdig or htsearch to rewrite
	  URLs in the search results?</a><br>
	  4.18. <a href="#q4.18">What are all the options in
	  htdig.conf, and are there others?</a><br>
	  4.19. <a href="#q4.19">How do I get more than 10 pages of
	  10 search results from htsearch?</a><br>
	  4.20. <a href="#q4.20">How do I restrict a search to only
	  certain subdirectories or documents?</a><br>
	  4.21. <a href="#q4.21">How can I allow people to search
	  while the index is updating?</a><br>
	  4.22. <a href="#q4.22">How can I get htdig to ignore the
	  robots.txt file or meta robots tags?</a><br>
	  4.23. <a href="#q4.23">How can I get htdig not to index
	  some directories, but still follow links?</a><br>
	  4.24. <a href="#q4.24">How can I get rid of duplicates in
	  search results?</a><br>
	  4.25. <a href="#q4.25">How can I change the scores in
	  search results, and what are the defaults?</a><br>
	  4.26. <a href="#q4.26">How can I get htdig not to index
	  JavaScript code or CSS?</a><br>

	  <hr noshade size=2>

	  <h3>5. Troubleshooting</h3>
	  5.1. <a href="#q5.1">I can't seem to index more than X documents
	  in a directory.</a><br>
	  5.2. <a href="#q5.2">I can't index PDF files.</a><br>
	  5.3. <a href="#q5.3">When I run "rundig," I get a message about
	  "DATABASE_DIR" not being found.</a><br>
	  5.4. <a href="#q5.4">When I run htmerge, it stops with an "out
	  of diskspace" message.</a><br>
	  5.5. <a href="#q5.5">I have problems running rundig from cron
	  under Linux.</a><br>
	  5.6. <a href="#q5.6">When I run htmerge, it stops with an
	  "Unexpected file type" message.</a><br>
	  5.7. <a href="#q5.7">When I run htsearch, I get lots of Internal
	  Server Errors (#500).</a><br>
	  5.8. <a href="#q5.8">I'm having problems with indexing words
	  with accented characters.</a><br>
	  5.9. <a href="#q5.9">When I run htmerge, it stops with a
	  "Word sort failed" message.</a><br>
	  5.10. <a href="#q5.10">When htsearch has a lot of matches, it runs
	  extremely slowly.</a><br>
	  5.11. <a href="#q5.11">When I run htsearch, it gives me a count of
	  matches, but doesn't list the matching documents.</a><br>
	  5.12. <a href="#q5.12">I can't seem to index documents with names
	  like left_index.html with htdig.</a><br>
	  5.13. <a href="#q5.13">I get Premature End of Script Headers errors
	  when running htsearch.</a><br>
	  5.14. <a href="#q5.14">I get Segmentation faults when running
	  htdig, htsearch or htfuzzy.</a><br>
	  5.15. <a href="#q5.15">Why does htdig 3.1.3 mangle URL parameters
	  that contain bare "&amp;" characters?</a><br>
	  5.16. <a href="#q5.16">When I run htmerge, it stops with an
	  "Unable to open word list file '.../db.wordlist'" message.</a><br>
	  5.17. <a href="#q5.17">When using Netscape, htsearch always returns the
	  "No match" page.</a><br>
	  5.18. <a href="#q5.18">Why doesn't htdig follow links to other
	  pages in JavaScript code?</a><br>
	  5.19. <a href="#q5.19">When I run htsearch from the web server,
	  it returns a bunch of binary data.</a><br>
	  5.20. <a href="#q5.20">Why are the betas of 3.2 so slow at indexing?</a><br>
	  5.21. <a href="#q5.21">Why does htsearch use ";" instead of
	  "&amp;" to separate URL parameters for the page buttons?</a><br>
	  5.22. <a href="#q5.22">Why does htsearch show the
	  "&amp;" character as "&amp;amp;" in search results?</a><br>
	  5.23. <a href="#q5.23">I get Internal Server or Unrecognized
	  character errors when running htsearch.</a><br>
	  5.24. <a href="#q5.24">I took some settings out of
	  my htdig.conf but they're still set.</a><br>
	  5.25. <a href="#q5.25">When I run htdig on my site,
	  it misses entire directories.</a><br>
	  5.26. <a href="#q5.26">What do all the numbers and symbols
	  in the htdig -v output mean?</a><br>
	  5.27. <a href="#q5.27">Why is htdig rejecting some of the
	  links in my documents?</a><br>
	  5.28. <a href="#q5.28">When I run htdig or htmerge, I get a
	  "DB2 problem...: missing or empty key value specified" message.</a><br>
	  5.29. <a href="#q5.29">When I run htdig on my site,
	  it seems to go on and on without ending.</a><br>
	  5.30. <a href="#q5.30">Why does htsearch no longer recognize
	  the -c option when run from the web server?</a><br>
	  5.31. <a href="#q5.31">I've set a config attribute exactly
	  as documented but it seems to have no effect.</a><br>
	  5.32. <a href="#q5.32">When I run htsearch, it gives a page
	  with an "Unable to read configuration file" message.</a><br>
	  5.33. <a href="#q5.33">How can I find out which version
	  of ht://Dig I have installed?</a><br>
	  5.34. <a href="#q5.34">When running htdig, I get "Error (0):
	  PDF file is damaged - attempting to reconstruct xref table..."</a><br>
	  5.35. <a href="#q5.35">When running htdig on Mandrake Linux,
	  I get "host not found" and "no server running" errors.</a><br>
	  5.36. <a href="#q5.36">When I run htsearch, it gives me the
	  list of matching documents, but no header or footer.</a><br>
	  5.37. <a href="#q5.37">When I index files with doc2html.pl,
	  it fails with the "UNABLE to convert" error.</a><br>
	  5.38. <a href="#q5.38">Why do my searches find search terms
	  in pathnames, or how do I prevent matching filenames?</a><br>
	  5.39. <a href="#q5.39">I set up an external parser but I still
	  can't index Word/Excel/PowerPoint/PDF documents.</a><br>

	  <hr noshade size=4>
	  <h2>Answers</h2>

	  <h3>1. General</h3>
	  <strong>1.1. <a name="q1.1">Can I search the internet with
	  ht://Dig?</a></strong><br>
	  <p>No, ht://Dig is a system for indexing and searching a
	  finite (not necessarily small) set of sites or an intranet. It
	  is not meant to replace any of the many internet-wide search
	  engines.</p>

	  <strong>1.2. <a name="q1.2">Can I index the internet with
	  ht://Dig?</a></strong><br>
	  <p>No, as above, ht://Dig is not meant as an
	  internet-wide search engine. While there is
	  <em>theoretically</em> nothing to stop you from indexing as
	  much as you wish, practical considerations (e.g. time, disk
	  space, memory, etc.) will limit this.</p>

	  <strong>1.3. <a name="q1.3">What's the difference between htdig and
	  ht://Dig?</a></strong><br>
	  <p>The complete ht://Dig package consists of several programs, one of
	  which is called "htdig." This program performs the "digging" or
	  indexing of the web pages. Of course, an index doesn't do you much good
	  without programs to sort it, search through it, etc., which is what the
	  other programs in the package, such as htmerge and htsearch, are for.</p>

	  <strong>1.4. <a name="q1.4">I sent mail to Andrew or Geoff
	  or Gilles, but I never got a response!</a></strong><br>
	  <p>Andrew no longer does much work on ht://Dig. He has started a
	  company, called <a href="http://www.contigo.com/">Contigo
	  Software</a> and is quite busy with that. To contact any of the
	  current developers, send mail to &lt;<a
	  href="mailto:htdig-dev@lists.sourceforge.net">htdig-dev</a>&gt;.
	  This list is intended primarily for the discussion of current
	  and future development of the software.</p>

	  <p>Geoff and Gilles are currently the maintainers of
	  ht://Dig, but they are both volunteers. So while they do
	  read all the e-mail they receive, they may not respond
	  immediately. Questions about ht://Dig in general, and especially
	  questions or requests for help in configuring the software,
	  should be posted to the &lt;<a
	  href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a>&gt;
	  mailing list. When posting a followup to a message on the
	  list, you should use the "reply to all" or "group reply"
	  feature of your mail program, to make sure the mailing list
	  address is included in the reply, rather than replying only
	  to the author of the message.
	  See also question <a href="#q1.16">1.16</a> and the
	  <a href="http://www.htdig.org/mailarchive.html">mailing list</a>
	  page.</p>

	  <strong>1.5. <a name="q1.5">I sent a question to the mailing list but I
	  never got a response!</a></strong><br>
	  <p>Development of ht://Dig is done by volunteers. Since we all
	  have other jobs, it may take a while before someone gets back
	  to you. Please be patient and don't hound the volunteers with
	  direct or repeated requests. If you don't get a response after
	  3 or 4 days, then a reminder may help.
	  See also question <a href="#q1.16">1.16</a>.</p>

	  <strong>1.6. <a name="q1.6">I have a great idea/patch for
	  ht://Dig!</a></strong><br>
	  <p>Great! Development of ht://Dig continues through suggestions
	  and improvements from users. If you have an idea (or even better,
	  a patch), please send it to the ht://Dig mailing list so others
	  can use it. For suggestions on how to submit patches, please check
	  the <a href="dev/patches.html">Guidelines for
	  Patch Submissions</a>. If you'd like to make a feature request,
	  you can do so through the <a href="bugs.html">ht://Dig bug
	  database</a>.</p>

	  <strong>1.7. <a name="q1.7">Is ht://Dig Y2K compliant?</a></strong><br>
	  <p>
	  ht://Dig should be Y2K compliant since it never <em>stores</em> dates as
	  two-digit years. Under ht://Dig's copyright (GPL), there is no warranty
	  whatsoever, to the extent permitted by law. If you would like an iron-clad,
	  legally-binding guarantee, feel free to check the source code
	  itself. Versions prior to 3.1.2 did have a problem with the parsing
	  of the Last-Modified header returned by the HTTP server, which
	  caused incorrect dates to be stored for documents modified after
	  February 28, 2000 (yes, it didn't recognize 2000 as a leap year).
	  Versions prior to 3.1.5 didn't correctly handle servers that return
	  two digit years in the Last-Modified header, for years after 99.
	  These problems are fixed in the current release.
	  If you discover something else, please let us know!
	  </p>
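	  <p>The two date problems described above can be illustrated with a
	  minimal Python sketch. It is not taken from the ht://Dig source: the
	  function names and the pivot year of 70 are assumptions made purely
	  for this example.</p>

```python
def is_leap_year(year):
    """Gregorian rule: every 4th year, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def expand_year(value, pivot=70):
    """Turn a possibly two-digit year into a four-digit one.

    The pivot of 70 is an assumption for this sketch: 70-99 map to
    19xx and 0-69 map to 20xx. Values of 100 or more are treated as
    already being full years and are returned unchanged.
    """
    if value >= 100:
        return value
    return 1900 + value if value >= pivot else 2000 + value


if __name__ == "__main__":
    # 2000 is divisible by 400, so February 29, 2000 is a valid date.
    print(is_leap_year(2000))
    # A server sending the two-digit year "00" is taken to mean 2000.
    print(expand_year(0))
```

	  <p>A parser that only checks divisibility by 100 would wrongly reject
	  February 29, 2000, which matches the symptom described for versions
	  prior to 3.1.2.</p>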

	  <strong>1.8. <a name="q1.8">I think I found a bug. What should I
	  do?</a></strong><br>
	  <p>Well, there are probably bugs out there. You have two options
	  for bug-reporting. You can either mail the ht://Dig mailing list
	  at &lt;<a href="mailto:htdig-general@lists.sourceforge.net">htdig-general@lists.sourceforge.net</a>&gt; or
	  better yet, report it to the <a href="bugs.html">bug
	  database</a>, which ensures it won't
	  become lost amongst all of the other mail on the list.
	  Please try to include as much information as possible, including
	  the version of ht://Dig (see question <a href="#q5.33">5.33</a>),
	  the OS, and anything else that might be helpful.
	  Often, running the programs with one "-v" or more
	  (e.g. "-vvv") gives useful debugging information.
	  If you are unsure whether the problem is a bug or a configuration
	  problem, you should discuss the problem on
	  &lt;<a href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a>&gt;
	  (after carefully reading the FAQ and searching the
	  <a href="http://www.htdig.org/mailarchive.html">mail archive</a>
	  and <a href="#q2.5">patch archive</a>,
	  of course)
	  to sort out what it is. The mailing list has a wider audience, so
	  you're more likely to get help with configuration problems there
	  than by reporting them to the bug database.
	  </p>

	  <p>Whether reporting problems to the bug database or mailing
	  list, we cannot stress enough the importance of
	  <strong>always</strong> indicating <strong>which version of
	  ht://Dig you are running</strong>.
	  See question <a href="#q5.33">5.33</a>. There
	  are still a lot of users, ISPs and software distributors using
	  older versions, and there have been a lot of bug fixes and
	  new features added in recent versions.  Knowing which version
	  you're running is absolutely essential in helping to find a
	  solution. If you're unsure if your version is current, or what
	  fixes and features have been added in more recent versions,
	  please see the <a href="RELEASE.html">
	  release notes</a>. See also question <a href="#q2.1">2.1</a>.</p>

	  <strong>1.9. <a name="q1.9">Does ht://Dig support phrase or near
	  matching?</a></strong><br>
	  <p>Phrase searching has been added for the 3.2 release,
	  which is currently in the beta phase
	 (<a href="http://www.htdig.org/files/htdig-3.2.0b6.tar.gz">3.2.0b6</a>
	  as of this writing). Near or proximity matching will probably be added
	  in a future beta.
	  </p>

	  <strong>1.10. <a name="q1.10">What are the practical and/or theoretical
	  limits of ht://Dig?</a></strong><br>
	  <p>The code itself doesn't put any real limit on the number of
	  pages. There are several sites in the hundreds of thousands
	  of pages. As for practical limits, it depends a lot on how
	  many pages you plan on indexing. Some operating systems limit
	  files to 2 GB in size, which can become a problem with a large
	  database. There are also slightly different limits to each of
	  the programs. Right now htmerge performs a sort on the words
	  indexed. Most sort programs use a fair amount of RAM and
	  temporary disk space as they assemble the sorted list. The
	  htdig program stores a fair amount of information about the
	  URLs it visits, in part to only index a page once. This takes
	  a fair amount of RAM. With cheap RAM, it never hurts to throw
	  more memory at indexing larger sites. In a pinch, swap will
	  work, but it obviously really slows things down.</p>

	  <p>The 3.2 development code helps with many of these
	  limitations. In particular, it generates the databases on the
	  fly, which means you don't have to sort them before
	  searching. Additionally, the new databases are compressed
	  significantly, making them usually around 50% the size of
	  those in previous versions.</p>
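	  <p>As a rough aid for the 2 GB file-size limit mentioned above, the
	  following Python sketch lists files in a directory that are approaching
	  a given size. It is not part of ht://Dig: the function name, the 90%
	  threshold, and the directory you pass in are all assumptions for
	  illustration.</p>

```python
import os

TWO_GIB = 2**31  # assumed per-file limit on the affected operating systems


def files_near_limit(directory, limit=TWO_GIB, threshold=0.9):
    """Return sorted (name, size) pairs for regular files whose size is
    at least threshold * limit bytes."""
    flagged = []
    for entry in os.scandir(directory):
        if entry.is_file():
            size = entry.stat().st_size
            if size >= threshold * limit:
                flagged.append((entry.name, size))
    return sorted(flagged)


if __name__ == "__main__":
    # "." stands in for your own database directory.
    for name, size in files_near_limit("."):
        print(name, size)
```

	  <p>Running such a check against your database directory after each
	  indexing run gives some warning before a database file reaches the
	  limit.</p>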

	  <strong>1.11. <a name="q1.11">Do any ISPs offer ht://Dig as part of
	  their web hosting services?</a></strong><br>
	  <p>Yes. A list of such ISPs is <a href="isp.html">available
	  here</a>.
	  </p>

	  <strong>1.12. <a name="q1.12">Can I use ht://Dig on a
	  commercial website?</a></strong><br>
	  <p>Sure! The <a href="COPYING">GNU Library General Public License (LGPL)</a> has no
	  restrictions on use. So you are free to use ht://Dig however you
	  want on your website, personal files, etc. The license only
	  restricts distribution. So if you're planning on a
	  commercial software product that includes ht://Dig, you will
	  have to provide source code including any modifications upon
	  request.
	  </p>

	  <strong>1.13. <a name="q1.13">Why do you use a non-free
	  product to index PDF files?</a></strong><br>
	  <p>
	  We don't. You <em>can</em> use the &quot;acroread&quot;
	  program to index PDF files, but this is no longer
	  recommended. Initially this program was the only reliable
	  way to extract data from PDF files. However, the <a
	  href="http://www.foolabs.com/xpdf/">xpdf package</a> is a
	  reliable, free software package for indexing and viewing PDF
	  files. See question <a href="#q4.9">4.9</a> for details on
	  using xpdf to index PDF files. We do not advocate using
	  acroread any longer because it is a proprietary product.
	  Additionally it is no longer reliable at extracting data.
	  </p>

	  <strong>1.14. <a name="q1.14">Why do you have all those SourceForge
	  logos on your website?</a></strong><br>
	  <p><a href="http://sourceforge.net/">SourceForge</a> is a
	  new service for open source software. You can host your
	  project on SourceForge servers and use many of their
	  services like bug-tracking and the like. The ht://Dig
	  project currently uses SourceForge for a mirror of the main
	  website at <a
	  href="http://htdig.sourceforge.net/">htdig.sourceforge.net</a>
	  as well as a mirror of ht://Dig releases and contributed
	  work.
	  </p>
	  
	  <strong>1.15. <a name="q1.15">My question isn't answered here. 
	  Where should I go for help?</a></strong><br>
	  <p>
	  Before you go anywhere else, think of other ways of phrasing your 
	  question. People often have questions that are very similar to
	  entries already in the FAQ, and while we try to phrase the FAQ entries
	  close to the most common wording, we obviously can't cover them all! The next
	  place to check is the documentation itself. In particular, take a 
	  look at the list of configuration attributes, particularly the list <a 
	  href="cf_byname.html">by name</a> and <a 
	  href="cf_byprog.html">by program</a>. There are a 
	  lot of them, but chances are there's something that might fit your needs.
	  You should also take a close look at all of
	  <a href="htsearch.html">htsearch</a>'s
	  documentation, especially the section "HTML form" which describes
	  all the CGI input parameters available for controlling the search,
	  including limiting the search to certain subdirectories.
	  You can find the answer yourself to almost all "how can I..."
	  questions by exploring what the various configuration attributes
	  and search form input parameters can do.
	  Also have a look at our collection of
	  <a href="http://www.htdig.org/contrib/guides.html">Contributed Guides</a>
	  for help on things like
	  <a href="http://www.htdig.org/files/contrib/guides/htmlhelp.html">HTML
	  forms</a> and CGI, tutorials on installing, configuring, using, and
	  internationalizing ht://Dig, as well as using PHP with htsearch.
	  </p>
	  <p>
	  Finally, if you've exhausted all the online documentation, there's the 
	  <a href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a> mailing list. 
	  There are hundreds of users subscribed and chances are good that someone 
	  has had a similar problem before or can suggest a solution.
	  </p>

	  <strong>1.16. <a name="q1.16">Why do the developers get annoyed when
	  I e-mail questions directly to them rather than the mailing list?</a></strong><br>
	  <p>The <a href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a>
	  mailing list exists for dealing with questions about the
	  software, its installation, configuration, and problems with
	  it. E-mailing the developers directly circumvents this forum
	  and its benefits. Most annoyingly, it puts the onus on an
	  individual to answer, even if that individual is not the best or
	  most qualified person to answer. This is not a one-man show. It
	  also circumvents the <a href="http://www.htdig.org/mailarchive.html">archiving
	  mechanism</a> of the mailing list,
	  so not only do subscribers not see these private messages
	  and replies, but future users who may run into the exact same
	  problems won't see them. Remember that the developers are all
	  volunteers, and they don't work for free for your benefit alone.
	  They volunteer for the benefit of the whole ht://Dig user
	  community, so don't expect extra support from them outside of
	  that community. See also questions <a href="#q1.4">1.4</a>
	  and <a href="#q1.5">1.5</a>.</p>

	  <p>Note also that when you reply to a message on the list, you
	  should make sure the reply gets on the list as well, provided your
	  reply is still on-topic.  See question <a href="#q1.17">1.17</a>
	  below.</p>

	  <strong>1.17. <a name="q1.17">Why do replies to messages on the
	  mailing list only go to the sender and not to the list?</a></strong><br>
	  <p>The simple answer is that, unlike some mailing lists, the
	  lists on SourceForge don't force replies back on the list. This
	  is actually a good thing, because you can reply to the sender
	  directly if you want to, or you can use your mail program's
	  "reply to all" capability (sometimes called "group reply")
	  to reply to the mailing list as well. It does mean you have to
	  think before you post a reply, but some would argue that this
	  is a good thing too. There are some compelling reasons to try to
	  keep on-topic discussions on the list, though (see questions
	  <a href="#q1.16">1.16</a> and <a href="#q1.4">1.4</a> above).</p>

	  <p>The technical answer is
	  <a href="http://sourceforge.net/docman/display_doc.php?docid=6693&amp;group_id=1">
	  SourceForge's policy on Reply-To: munging</a>, where you'll
	  find all the gory details about the pros and cons of the two
	  common ways of setting up a mailing list, and why SourceForge
	  turns off Reply-To munging. It so happens that the ht://Dig
	  maintainers agree with SourceForge's policy on this, and would
	  keep it even if we did have a say in the matter. So, counterarguments to this
	  policy are rather moot, and it would be better not to waste
	  any more mailing list bandwidth debating them. (We've heard
	  all the arguments anyway.)</p>

	  <strong>1.18. <a name="q1.18">Can I use ht://Dig to index and search
	  an SQL database?</a></strong><br>
	  <p>You can if your database has a web-based front end that can
	  be "spidered" by ht://Dig. The requirement is that every search
	  result must resolve to a unique URL which can be accessed via
	  HTTP. The htdig program uses these URLs, which you feed it via
	  the <a href="attrs.html#start_url">start_url</a> attribute, to
	  fetch and index each page of information. The search results
	  will then give a list of URLs for all pages that match the
	  search terms. If you don't have such a front end to your
	  database, or the search results must be given as something
	  other than URLs, then ht://Dig is probably not the best way of
	  dealing with this problem: you may be better off using an SQL
	  query engine that works directly on your own database, rather
	  than building a separate ht://Dig database for searching.</p>

	  <p>Ted Stresen-Reuter had the following tips: "In my case,
	  because I like htdig's ability to rank results (and that
	  ranking can be modified), I created an index page that simply
	  walks through each record and indexes each record (with
	  <em>next</em> and <em>previous</em> links so the spider can
	  read all the records). And then I do one other thing: I make
	  the <code>&lt;title&gt;</code> tag start with the unique ID
	  of each record. Then, when I'm parsing the search results, I
	  do a lookup on the database using the title tag as the key."</p>
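	  <p>The lookup step in the tip above might be sketched in Python as
	  follows. This is only an illustration under stated assumptions: the
	  "ID: text" title format, the <code>records</code> table and its
	  columns, and both function names are hypothetical, and SQLite stands in
	  for whatever database you actually use.</p>

```python
import sqlite3


def record_id_from_title(title):
    """Return the leading integer ID from a title such as '1042: Widget spec'."""
    head = title.strip().split(":", 1)[0]
    return int(head)


def lookup_record(conn, title):
    """Fetch the row whose key matches the ID at the start of the title."""
    record_id = record_id_from_title(title)
    cur = conn.execute(
        "SELECT id, name FROM records WHERE id = ?", (record_id,)
    )
    return cur.fetchone()


# Small in-memory demo with one hypothetical record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO records VALUES (1042, 'Widget spec')")
print(lookup_record(conn, "1042: Widget spec"))
```

	  <p>Putting the ID at the start of the title keeps the parsing trivial,
	  since the title of each match is the text you get back from the search
	  results.</p>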

	  <hr noshade size=2>

	  <h3>2. Getting ht://Dig</h3>
	  <strong>2.1. <a name="q2.1">What's the latest version of ht://Dig?</a></strong><br>
	  <p>The latest version is 3.1.6 as of this writing. A beta
	  version of the 3.2 code,
	 <a href="http://www.htdig.org/files/htdig-3.2.0b6.tar.gz">3.2.0b6</a>,
	  is also available, for those who wish to test it.
	  You can find out about the latest version by reading the
	  <a href="RELEASE.html">release
	  notes</a>.</p>

	  <p><strong>Note</strong> that if you're running any version
	  older than 3.1.5 (including 3.2.0b1) on a public web site,
	  you should upgrade immediately, as older versions have a
	  rather serious security hole which is explained in detail in
	  this <a
	  href="http://www.htdig.org/htdig-dev/2000/02/0272.html">advisory</a>
	  which was sent to the Bugtraq mailing list.
	  Another slightly less serious, but still troubling security hole
	  exists in 3.1.5 and older (including 3.2.0b3 and older), so you
	  should upgrade if you're running one of these. You can view details
	  on this vulnerability from the
	  <a href="http://www.securityfocus.com/bid/3410">Bugtraq mailing list</a>.
	  If you're unsure of which version you're running, see question
	  <a href="#q5.33">5.33</a>.</p>

	  <strong>2.2. <a name="q2.2">Are there binary distributions of
	  ht://Dig?</a></strong><br>
	  <p>We're trying to get consistent binary distributions for
	  popular platforms. Contributed binary releases will go in <a
	  href="http://www.htdig.org/files/contrib/binaries/">
	  the contributed binaries section</a>
	  and contributions should be mentioned to the <a
	  href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a>
	  mailing list.</p>

	  <p>Anyone who would like to make consistent binary
	  distributions of ht://Dig should at least sign up for the <a
	  href="mailing.html">htdig-announce mailing list</a>.</p>

	  <strong>2.3. <a name="q2.3">Are there mirror sites for ht://Dig?</a></strong><br>
	  <p>Yes, see our <a href="mirrors.html">mirrors
	  listing</a>. If you'd like to mirror the site, please see
	  the <a href="howto-mirror.html">mirroring guide</a>.</p>

	  <strong>2.4. <a name="q2.4">Is ht://Dig available by ftp?</a></strong><br>
	  <p>Yes. You can find the current versions and several older
	  versions at various &lt;<a
	  href="mirrors.html">mirror sites</a>&gt;
	  as well as the other locations mentioned in the <a
	  href="where.html">download page</a>.</p>

	  <strong>2.5. <a name="q2.5">Are patches around to upgrade between
	  versions?</a></strong><br>
	  <p>Most versions are also distributed as a patch to the previous
	  version's source code. The most recent exception to this was
	  version 3.1.0b1. Since this version switched from the GDBM
	  database to DB2, the new database package needed to be shipped
	  with the distribution. This made the potential patch almost as large
	  as the regular distribution. Update patches resumed with version
	  3.1.0b2. You can also find archives of patches submitted to
	  the htdig mailing lists, to fix specific bugs or add features,
	  at Joe Jah's <a href="ftp://ftp.ccsf.org/htdig-patches/">
	  htdig-patches ftp site</a>.</p>

	  <strong>2.6. <a name="q2.6">Is there a Windows 95/98/2000/NT
	  version of ht://Dig?</a></strong><br>
	  <p>The ht://Dig package can be built on the Win32 platform when
	  using the Cygwin package. For details, see the contributed guide,
	  <a href="http://www.htdig.org/files/contrib/guides/Installing_on_Win32.html">
	  <em>Idiot's Guide to Installing ht://Dig on Win32</em></a>.
	  </p>
	  <p>
	  As of the <a href="http://www.htdig.org/files/htdig-3.2.0b5.tar.gz">3.2.0b5</a>
	  beta release, there is also native Win32 support, thanks to
	  Neal Richter.  (Installation docs will be written soon...)
	  </p>

	  <strong>2.7. <a name="q2.7">Where can I find the documentation for my
	  version of ht://Dig?</a></strong><br>
	  <p>The documentation for the most recent stable release is always
	  posted at <a href="http://www.htdig.org/">www.htdig.org</a>.
	  The documentation for the latest beta release can be found at
	  <a href="http://www.htdig.org/dev/htdig-3.2/">http://www.htdig.org/dev/htdig-3.2/</a>.
	  In all releases, the documentation is included in the
	  <strong>htdoc</strong> subdirectory of the source distribution, so
	  you always have access to the documentation for your current version.
	  </p>

	  <hr noshade size=2>

	  <h3>3. Compiling</h3>
	  <strong>3.1. <a name="q3.1">When I compile ht://Dig I get an error about
	  libht.a</a></strong><br>
	  <p>This usually indicates that either libstdc++ is not installed or
	  is installed incorrectly. To get libstdc++ or any other GNU tool,
	  check
	  <a
	  href="ftp://ftp.gnu.org/gnu/">ftp://ftp.gnu.org/gnu/</a>.
	  Note that the most recent versions of gcc come with
	  libstdc++ included and are available from <a
	  href="http://gcc.gnu.org/">http://gcc.gnu.org/</a>.</p>

	  <strong>3.2. <a name="q3.2">I get an error about -lg</a></strong><br>
	  <p>This is due to a bug in the Makefile.config.in of version
	  3.1.0b1. Remove all flags "-ggdb" in Makefile.config.in. Then
	  type "./config.status" to rebuild the Makefiles and
	  recompile. This bug is fixed in version 3.1.0b2.</p>

	  <strong>3.3. <a name="q3.3">I'm compiling on Digital Unix and I get
	  messages about "unresolved" and "db_open."</a></strong><br>
	  <p>Answer contributed by George Adams
	  &lt;learningapache@my-dejanews.com&gt;</p>

	  <p>What you're seeing are problems related to the Berkeley DB
	  library.  htdig needs a fairly modern version of db, which is
	  why it ships with one that works. (see that -L../db-2.4.14/dist
	  line?  That's where htdig's db library is).<br>

	  The solution is to modify the c++ command so it explicitly
	  references the correct libdb.a.  You can do this by replacing
	  the "-ldb" directive in the c++ command with
	  "../db-2.4.14/dist/libdb.a". This problem has been resolved as of
	  version 3.1.0.</p>

	  <strong>3.4. <a name="q3.4">I'm compiling on FreeBSD and I get lots
          of messages about '___error' being unresolved.</a></strong><br>
	  <p>Answer contributed by Laura Wingerd &lt;laura@perforce.com&gt;<br>
	  I got a clean build of htdig-3.1.2 on FreeBSD 2.2.8 by taking
	  -D_THREAD_SAFE out of CPPFLAGS, and setting LIBS to null, in
	  db/dist/configure.</p>

	  <strong>3.5. <a name="q3.5">I'm compiling on HP/UX and I get a complaint about
	  "Large Files not supported."</a></strong><br>
	  <p>The db/ package included with ht://Dig seems to be unable to complete
	  compilation on HP/UX 10.20 in particular. After running the top-level
	  configure script, cd into db/dist and type:</p>
	  <code>./configure --disable-bigfile</code>
	  <p>Then continue with the normal compilation.</p>
	  
	  <strong>3.6. <a name="q3.6">I'm compiling on Solaris and when I run the 
	  programs I get complaints about not finding libstdc++.</a></strong><br>
	  <p>Answer contributed by Adam Rice &lt;adam@newsquest.co.uk&gt;</p>
	  <p>The problem is that the Solaris loader can't find the library. The 
	  best thing to do is set the LD_RUN_PATH environment variable <em>during compile</em>
	  to the directory where libstdc++.so.2.8.1.1 lives. This tells the linker 
	  to search that directory at runtime.
	  </p>

	  <p>Note that LD_RUN_PATH is not to be confused with LD_LIBRARY_PATH.
	  The latter is parsed at run-time, while LD_RUN_PATH essentially
	  compiles a library path into the executable, so that it doesn't
	  need a LD_LIBRARY_PATH setting to find its libraries. This allows
	  you to avoid all the complexities of setting an environment
	  variable for a CGI program run from the server. If all else fails,
	  you can always run your programs from wrapper shell scripts that
	  set the LD_LIBRARY_PATH environment variable appropriately.</p>

	  <p>Note also that while this answer is specific to Solaris, it may
	  work for other OSes too, so you may want to give it a try. However,
	  not all versions of the <code>ld</code> program on all OSes support
	  the LD_RUN_PATH environment variable, even if these systems support
	  shared libraries. Try "<code>man&nbsp;ld</code>" on your system to
	  find out the best way of setting the runtime search path for shared
	  libraries. If <code>ld</code> doesn't support LD_RUN_PATH, but does
	  support the <code>-R</code> option, you can add one or more of these
	  options to LIBDIRS in Makefile.config before running make on a 3.1.x
	  release. (For a 3.2 beta release, you can add these options to the
	  LDFLAGS environment variable before you run ./configure.)</p>
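	  <p>As a rough sketch of the two approaches above, assuming
	  libstdc++ is installed in <code>/usr/local/lib</code> (a
	  hypothetical location; substitute the directory on your own
	  system), the commands might look like this:</p>
<pre>
# Assumed library directory; adjust to where libstdc++.so really lives.
LD_RUN_PATH=/usr/local/lib
export LD_RUN_PATH
./configure
make

# Alternative for a 3.2 beta if your ld supports -R but not LD_RUN_PATH:
#   LDFLAGS=-R/usr/local/lib ./configure
</pre>
	  <p>On systems that provide it, running
	  "<code>ldd&nbsp;htsearch</code>" on the resulting binary should
	  show whether libstdc++ is now found without any LD_LIBRARY_PATH
	  setting.</p>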
	  
	  <strong>3.7. <a name="q3.7">I'm compiling on IRIX and I'm having 
	  database problems when I run the program.</a></strong><br>
	  <p>
	  It is not entirely clear why these problems occur, though
	  they seem to only happen when older compilers are
	  used. Several people have reported that the problems go away
	  when using the latest version of <a href="http://gcc.gnu.org/">gcc</a>.
	  </p>
	  
	  <strong>3.8. <a name="q3.8">I'm compiling with gcc 3.2 and getting
	  all sorts of warnings/errors about ostream and such.</a></strong><br>
	  <p>
	  With versions before 3.2.0b5,
	  you should use the following command to configure the ht://Dig
	  package so it can be built with gcc 3.2:
<pre>
CXXFLAGS=-Wno-deprecated CPPFLAGS=-Wno-deprecated ./configure
</pre>
	  </p>

	  <hr noshade size=2>

	  <h3>4. Configuration</h3>
	  <strong>4.1. <a name="q4.1">How come I can't index my site?</a></strong><br>
	  <p>There are a variety of reasons ht://Dig won't index a
	  site. To get to the bottom of things, it's advisable to turn on
	  some debugging output from the htdig program. When running from
	  the command-line, try "-vvv"  in addition to any other
	  flags. This will add debugging output, including the responses
	  from the server.</p>
	  <p>See also questions <a href="#q5.25">5.25</a>,
	  <a href="#q5.27">5.27</a>, <a href="#q5.16">5.16</a> and
	  <a href="#q5.18">5.18</a>.</p>

	  <strong>4.2. <a name="q4.2">How can I change the output format of htsearch?</a></strong><br>
<p>Answer contributed by: Malki Cymbalista &lt;Malki.Cymbalista@weizmann.ac.il&gt;</p>

<p>You can change the output format of htsearch by creating different
header, footer and result files that specify how you want the output
to look. You then create a configuration file that specifies which
files to use. In the html document that links to the search, you
specify which configuration file to use.</p>

<p>So the configuration file would have the lines:</p>
<pre>
search_results_header: ${common_dir}/ccheader.html
search_results_footer: ${common_dir}/ccfooter.html
template_map:  Long long builtin-long \
               Short short builtin-short \
               Default default ${common_dir}/ccresult.html
template_name: Default
</pre>
<p>You would also put into the configuration file any other lines from the
default configuration file that apply to htsearch.</p>

<p>The files ${common_dir}/ccheader.html and
${common_dir}/ccfooter.html and ${common_dir}/ccresult.html would be
tailored to give the output in the desired format.</p>

<p>Assuming your configuration file is called cc.conf, the html file that
links to the search has to set the config parameter equal to cc. The
following line would do it:<br>
<code>&lt;input type="hidden" name="config" value="cc"&gt;</code></p>

	  <p><strong>Note:</strong> Don't just add the line above to your
	  <a href="hts_form.html">search form</a>
	  without checking if there isn't already a similar
	  line giving the config attribute a different value. The sample
	  search.html form that comes with the package includes a line
	  like this already, giving "config" the default value of "htdig".
	  If it's there, modify it instead of adding another definition.
	  The config input parameter doesn't need to be hidden either, and
	  you may want to define it as a pull-down list to select different
	  databases (see question <a href="#q4.4">4.4</a>).</p>

	  <strong>4.3. <a name="q4.3">How do I index pages that start with '~'?</a></strong><br>
	  <p>
	  ht://Dig requests pages starting with '~' the same way a web
	  browser does, so it should index them without special settings.
	  If you are having problems with this, check your server
	  log files to see what file the server is attempting to return.
	  </p>

	  <strong>4.4. <a name="q4.4">Can I use multiple databases?</a></strong><br>
	  <p>Yes, though you may find it easier to have one larger
	  database and use restrict or exclude fields on searches. To use
	  multiple databases, you will need a config file for each
	  database. Then each file will set the
	  <a href="attrs.html#database_dir">database_dir</a> or
	  <a href="attrs.html#database_base">database_base</a> attribute to
	  change the name of the databases. The config file is selected
	  by the <strong>config</strong> input field in the search form.
	  <br>See also questions <a href="#q4.2">4.2</a> and
	  <a href="#q4.20">4.20</a>.</p>

	  <strong>4.5. <a name="q4.5">OK, I can use multiple databases. Can I
	  merge them into one?</a></strong><br>
	  <p>As of version 3.1.0, you can do this with the -m option to
	  <a href="htmerge.html">htmerge</a>.</p>

	  <strong>4.6. <a name="q4.6">Wow, ht://Dig eats up a lot of disk
	  space. How can I cut down?</a></strong><br>
	  <p>There are several ways to cut down on disk space. One is
	  not to use the "-a" option, which creates work copies of the
	  databases. Naturally this essentially doubles the disk
	  usage. If you don't need to index and search at the same time, you can
	  ignore this flag.</p>

	  <p>If you are running 3.2.0b5 or higher and don't have
	  <a href="dev/htdig-3.2/attrs.html#wordlist_compress_zlib">compression</a>
	  turned on, then turning that on will also save considerable space.</p>

	  <p>Changing configuration variables can also help cut
	  down on disk usage. Decreasing
	  <a href="attrs.html#max_head_length">max_head_length</a> and
	  <a href="attrs.html#max_meta_description_length">max_meta_description_length</a>
	  will cut down on the size of the excerpts stored (in fact, if you
	  don't have
	  <a href="attrs.html#use_meta_description">use_meta_description</a>
	  set, you can set
	  max_meta_description_length to 0!).</p>
	  
	  <p>If you are running 3.2.0b6 or higher, you can turn off
	  <a href="dev/htdig-3.2/attrs.html#store_phrases">store_phrases</a>.  This cuts the
	  database size by about 60%, at the expense of severely limiting
	  the effectiveness of phrase searches.  It also reduces digging time
	  slightly.</p>

	  <p>Other techniques include removing the db.wordlist file and adding
	  more words to the <a href="attrs.html#bad_words">bad_words</a>
	  file.</p>

	  <p>The University of Leipzig has published
	  <a href="http://wortschatz.uni-leipzig.de/html/wliste.html">
	  word lists</a> containing the 100, 1000 and 10000 most often used
	  words in English, German, French and Dutch. No copyrights or
	  restrictions seem to be applied to the downloadable files. These
	  can be very handy when putting together a bad_words file. Thanks
	  to Peter Asemann for this tip.</p>

	  <strong>4.7. <a name="q4.7">Can I use SSI or other CGIs in my
	  htsearch results?</a></strong><br>
	  <p>Not really. Apache will not parse CGI output for SSI
	  statements (See the <a
	  href="http://www.apache.org/docs/misc/FAQ.html#ssi-part-iii">Apache
	  FAQ</a>). Thus, the htsearch CGI does not understand SSI
	  markup and therefore cannot include other
	  CGIs. However, it is possible to do it the other way round:
	  you can have the htsearch results included in your dynamic
	  page.
	  </p>
	  <p>
	  The Apache project has mentioned that this will be a
	  feature added to the Apache 2.0 version, currently in development.
	  </p>

	  <p>The easiest approach in the meantime is using SSI with
	  the help of the <a
	  href="attrs.html#script_name">script_name</a> configuration
	  file attribute. See the <code>contrib/scriptname</code>
	  directory for a small example using SSI.</p>

	  <p>For CGI and PHP, you need a &quot;wrapper&quot; script to
	  do that. For perl script examples, see the files in
	  <code>contrib/ewswrap</code>. The PHP guide (see <a
	  href="http://www.htdig.org/contrib/guides.html">contributed
	  guides</a>) not only describes a wrapper script for PHP, but
	  also offers a step by step tutorial to the basics of
	  ht://dig and is well worth reading.
	  For other alternatives, see question <a href="#q4.11">4.11</a>.
	  </p>

	  <strong>4.8. <a name="q4.8">How do I index Word, Excel, PowerPoint
	  or PostScript documents?</a></strong><br>
	  <p>This must be done with an
	  <a href="attrs.html#external_parsers">external parser or converter</a>.
	  A sample of such an external converter is the
	  contrib/doc2html/doc2html.pl Perl script.
	  It will parse Word, PostScript, PDF and other documents, when used
	  with the appropriate document to text converters. It uses catdoc to
	  parse Word documents, and ps2ascii to parse PostScript files. The
	  comments in the Perl script and accompanying documentation
	  indicate where you can obtain these converters.</p>

	  <p>Versions of htdig before 3.1.4 don't support external converters,
	  so you have to use an external parser script such as
	  contrib/parse_doc.pl (or better yet, upgrade htdig if you can).
	  External converter scripts are simpler to write and maintain than a
	  full external parser, as they just convert input documents to
	  text/plain or text/html, and pass that back to htdig to be parsed.
	  Parsing is more consistent across document types with external
	  converters, because the final work is done by htdig's internal
	  parsers.  External parser scripts tend to be hacks that don't
	  recognize a lot of the parsing attributes in your htdig.conf, so
	  they have to be hacked some more when you change your attributes.</p>

	  <p>The most recent versions of parse_doc.pl, conv_doc.pl and
	  the doc2html package are available on our <a
	  href="http://www.htdig.org/files/contrib/parsers/">web site</a>.<br>
	   See below for an example of doc2html.pl, or see the comments in
	  conv_doc.pl and parse_doc.pl, or the documentation for doc2html
	  for examples of their usage.
	  For help with troubleshooting, see questions
	  <a href="#q5.37">5.37</a> and <a href="#q5.39">5.39</a>.</p>

	  <strong>4.9. <a name="q4.9">How do I index PDF files?</a></strong><br>
	  <p>This too can be done with an
	  <a href="attrs.html#external_parsers">external parser or converter</a>,
	  in combination with the pdftotext program that is part of the
	  <a href="http://www.foolabs.com/xpdf/">xpdf</a> 0.90 package. A
	  sample of such a converter is the doc2html.pl Perl
	  script. It uses pdftotext to parse PDF documents, then processes
	  the text into external parser records.
	  The most recent version of doc2html.pl is available on our <a
	  href="http://www.htdig.org/files/contrib/parsers/">web
	  site</a>.</p>

	  <p>For example, you could put this in your configuration file:</p>
<pre>
<a href="attrs.html#external_parsers">external_parsers</a>: application/msword-&gt;text/html /usr/local/bin/doc2html.pl \
                  application/postscript-&gt;text/html /usr/local/bin/doc2html.pl \
                  application/pdf-&gt;text/html /usr/local/bin/doc2html.pl
</pre>
	  <p>You would also need to configure the script to indicate where all
	  of the document to text converters are installed. See the DETAILS
	  file that comes with doc2html for more information.</p>

	  <p>Versions of htdig before 3.1.4 don't support external converters,
	  so you have to use an external parser script such as
	  contrib/parse_doc.pl (or better yet, upgrade htdig if you can).
	  See question <a href="#q4.8">4.8</a> above.</p>

	  <p>Whether you use this external parser or converter, or acroread
	  with the <a href="attrs.html#pdf_parser">pdf_parser</a> attribute,
	  to successfully index PDF files be sure to set the <a
	  href="attrs.html#max_doc_size">max_doc_size</a> attribute to
	  a value larger than the size of your largest PDF file. PDF
	  documents cannot be parsed if they are truncated.</p>

	  <p>This also raises the questions of why two different
	  methods of indexing PDFs are supported, and which method
	  is preferred.  The built-in PDF support, which uses acroread
	  to convert the PDF to PostScript, was the first method which
	  was provided. It had a few problems with it: acroread is not
	  open source, it is not supported on all systems on which
	  ht://Dig can run, and for some PDFs, the PostScript that
	  acroread generated was very difficult to parse into indexable
	  text. Also, the built-in PDF support expected PDF documents to
	  use the same character encoding as is defined in your current
	  <a href="attrs.html#locale">locale</a>, which isn't always the
	  case. The external converters, which use pdftotext, were developed
	  to overcome these problems. xpdf 0.90 is free software, and its
	  pdftotext utility works very well as an indexing tool.
	  It also converts various PDF encodings to the Latin 1 set.
	  It is the opinion of the developers that this is the
	  preferred method. However, some users still prefer to stick
	  with acroread, as it works well for them, and is a little
	  easier to set up if you've already installed Acrobat.</p>

	  <p>Also, pdftotext still has some difficulty handling text in
	  landscape orientation, even with its new -raw option in 0.90,
	  so if you need to index such text in PDFs, you may still get
	  better results with acroread. The pdf_parser attribute has been
	  removed from the 3.2 beta releases of htdig, so to use acroread
	  with htdig 3.2.0b5 or other 3.2 betas, use the acroconv.pl
	  external converter script from our <a
	  href="http://www.htdig.org/files/contrib/parsers/">web site</a>.</p>

	  <p>See also question <a href="#q5.2">5.2</a> below and
	  question <a href="#q1.13">1.13</a> above.
	  See questions <a href="#q5.37">5.37</a> and <a href="#q5.39">5.39</a>
	  for troubleshooting tips.</p>

	  <strong>4.10. <a name="q4.10">How do I index documents in other
	  languages?</a></strong><br>
	  <p>The first and most important thing you must do,
	  to allow ht://Dig to properly support international
	  characters, is to define the correct locale for the
	  language and country you wish to support.  This is done
	  by setting the <a href="attrs.html#locale">locale</a>
	  attribute (see question <a href="#q5.8">5.8</a>). The
	  next step is to configure ht://Dig to use dictionary and
	  affix files for the language of your choice. These can
	  be the same dictionary and affix files as are used by the
	  ispell software.  A collection of these is available from
	  Geoff Kuenning's
	  <a href="http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html">
	  International Ispell Dictionaries page</a>, and we're slowly
	  building a collection of word lists on our <a
	  href="http://www.htdig.org/files/contrib/wordlists/">web site</a>.</p>
	  <p>For example, if you install German dictionaries in common/german,
	  you could use these lines in your configuration file:</p>
<pre>
<a href="attrs.html#locale">locale</a>:               de_DE
lang_dir:             ${<a href="attrs.html#common_dir">common_dir</a>}/german
<a href="attrs.html#bad_word_list">bad_word_list</a>:        ${lang_dir}/bad_words
<a href="attrs.html#endings_affix_file">endings_affix_file</a>:   ${lang_dir}/german.aff
<a href="attrs.html#endings_dictionary">endings_dictionary</a>:   ${lang_dir}/german.0
<a href="attrs.html#endings_root2word_db">endings_root2word_db</a>: ${lang_dir}/root2word.db
<a href="attrs.html#endings_word2root_db">endings_word2root_db</a>: ${lang_dir}/word2root.db
</pre>
	  <p>
	  You can build the endings database with <code>htfuzzy endings</code>.
	  (This command may actually take days to complete, for
	  releases older than 3.1.2. Current releases use faster regular
	  expression matching, which will speed this up by a few orders
	  of magnitude.) Note that the "*.0" files are not part of
	  the ispell dictionary distributions, but are easily made by
	  concatenating the partial dictionaries and sorting to remove
	  duplicates (e.g.: "<code>cat * | sort | uniq &gt; lang.0</code>"
	  in most cases). You will also need to redefine the synonyms
	  file if you wish to use the synonyms search algorithm. This
	  file is not included with most of the dictionaries, nor is the
	  <a href="attrs.html#bad_words">bad_words</a> file.</p>

	  <p>If you put all the language-specific
	  dictionaries and configuration files in separate directories,
	  and set all the attribute definitions accordingly in each
	  search config file to access the appropriate files, you can
	  have a multilingual setup where the user selects the language
	  by selecting the "config" input parameter value. In addition
	  to the attributes given in the example above, you may also
	  want custom settings for these language-specific attributes:
	  <a href="attrs.html#date_format">date_format</a>,
	  <a href="attrs.html#iso_8601">iso_8601</a>,
	  <a href="attrs.html#method_names">method_names</a>,
	  <a href="attrs.html#no_excerpt_text">no_excerpt_text</a>,
	  <a href="attrs.html#no_next_page_text">no_next_page_text</a>,
	  <a href="attrs.html#no_prev_page_text">no_prev_page_text</a>,
	  <a href="attrs.html#nothing_found_file">nothing_found_file</a>,
	  <a href="attrs.html#page_list_header">page_list_header</a>,
	  <a href="attrs.html#prev_page_text">prev_page_text</a>,
	  <a href="attrs.html#search_results_wrapper">search_results_wrapper</a>
	  (or <a href="attrs.html#search_results_header">search_results_header</a>
	  and <a href="attrs.html#search_results_footer">search_results_footer</a>),
	  <a href="attrs.html#sort_names">sort_names</a>,
	  <a href="attrs.html#synonym_db">synonym_db</a>,
	  <a href="attrs.html#synonym_dictionary">synonym_dictionary</a>,
	  <a href="attrs.html#syntax_error_file">syntax_error_file</a>,
	  <a href="attrs.html#template_map">template_map</a>, and of course
	  <a href="attrs.html#database_dir">database_dir</a> or
	  <a href="attrs.html#database_base">database_base</a> if you
	  maintain multiple databases for sites of different languages.
	  You could also change the definition of
	  <a href="attrs.html#common_dir">common_dir</a>, rather than
	  making up a lang_dir attribute as above, as many language-specific
	  files are defined relative to the common_dir setting.</p>

	  <p>If you're running version 3.1.6 of ht://Dig, you may also
	  be interested in the <strong>accents</strong> fuzzy match
	  algorithm in the
	  <a href="attrs.html#search_algorithm">search_algorithm</a>
	  attribute, which lets you treat accented and unaccented letters
	  as equivalent in words. Note that if you use the accents algorithm,
	  you need to rebuild the accents database each time you update your
	  word database, using <code>"htfuzzy accents"</code>. This command
	  isn't in the default rundig script, so you may want to add it there.
	  The accents fuzzy match algorithm is also in the 3.2 beta releases.
	  There are also the
	  <a href="attrs.html#boolean_keywords">boolean_keywords</a> and
	  <a href="attrs.html#boolean_syntax_errors">boolean_syntax_errors</a>
	  attributes in 3.1.6 for changing other language-specific messages
	  in htsearch.</p>

	  <p>Current versions of ht://Dig only support 8-bit
	  characters, so languages such as Chinese and Japanese, which
	  require 16-bit characters, are not currently supported.</p>

	  <p>Didier Lebrun has written a guide for configuring htdig to
	  support French, entitled
	  <a href="http://www.quartier-rural.org/dl/elucu/htdig-vf/lisezmoi.html">
	  Comment installer et configurer HtDig pour la langue fran&ccedil;aise</a>.
	  His "kit de francisation" is also available on
	  <a
	  href="http://www.htdig.org/files/contrib/wordlists/">our
	  web site</a>.</p>

	  <p>See also question <a href="#q4.2">4.2</a> for tips on customizing
	  htsearch, and question <a href="#q4.6">4.6</a> for tips on where to find
	  bad_words files.</p>

	  <strong>4.11. <a name="q4.11">How do I get rotating banner ads in
	  search results?</a></strong><br>
	  <p>While htsearch doesn't currently provide a means of doing
	  SSI on its output, or calling other CGI scripts, it does have
	  the capability of using environment variables in templates.</p>

	  <p>The easiest way to get rotating banners in htsearch is
	  to replace htsearch with a wrapper script that sets an
	  environment variable to the banner content, or whatever
	  dynamically generated content you want. Your script can then
	  call the real htsearch to do the work. The wrapper script can be
	  written as a shell script, or in Perl, C, C++, or whatever you
	  like. You'd then need to reference that environment variable
	  in header.html (or wrapper.html if that's what you're using),
	  to indicate where the dynamic content should be placed.</p>
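	  <p>A minimal sketch of such a wrapper, written as a shell script,
	  is shown here. Everything specific in it is an assumption made for
	  illustration: the location of the real htsearch binary, the
	  directory of banner fragments, and the variable name
	  <code>BANNER_HTML</code>. The banner is chosen by rotating on the
	  current second, which is only one of many possible schemes.</p>
<pre>
#!/bin/sh
# Sketch of a wrapper installed as the CGI in place of htsearch.
# All paths and the BANNER_HTML name are hypothetical.

# Real htsearch binary, moved out of the way of this wrapper.
HTSEARCH_REAL=${HTSEARCH_REAL:-/usr/local/libexec/htsearch-real}
# Directory of small HTML fragments, one banner per file.
BANNER_DIR=${BANNER_DIR:-/usr/local/share/htdig/banners}

# Print the name of one banner file, rotating on the current second.
pick_banner() {
    count=$(ls "$BANNER_DIR" | wc -l | tr -d ' ')
    [ "$count" -gt 0 ] || return 1
    # Strip a leading zero so "08" is not read as an octal number.
    second=$(date +%S | sed 's/^0//')
    index=$(( second % count + 1 ))
    ls "$BANNER_DIR" | sed -n "${index}p"
}

# Export the banner, then run htsearch with its arguments and
# QUERY_STRING left untouched.
run_htsearch() {
    BANNER_HTML=
    if file=$(pick_banner); then
        BANNER_HTML=$(cat "$BANNER_DIR/$file")
    fi
    export BANNER_HTML
    "$HTSEARCH_REAL" "$@"
}
</pre>
	  <p>The installed script would end with a call to
	  <code>run_htsearch "$@"</code>. The template would then refer to
	  the variable, for instance as <code>$(BANNER_HTML)</code> if your
	  version expands environment variables with the same syntax as
	  other template variables.</p>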

	  <p>If the dynamic content is generated by a CGI script, your new
	  wrapper script which calls this CGI would then have to strip out
	  the parts that you don't want embedded in the output (headers,
	  some tags) so that only the relevant content gets put into the
	  environment variable you want.  You'd also have to make sure
	  this CGI script doesn't grab the POST data or get confused by
	  the QUERY_STRING contents intended for htsearch. Your script
	  should not take anything out of, or add anything to, the
	  QUERY_STRING environment variable.</p>
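	  <p>Stripping the HTTP headers from a CGI script's output could
	  look like the following sketch. The function name is made up for
	  illustration, and it assumes the CGI prints its headers, then a
	  blank line, then the body. Removing unwanted tags from the body
	  is left out, as that depends entirely on the CGI in question.</p>
<pre>
# Read CGI output on standard input and print only the body:
# drop carriage returns, then delete everything up to and
# including the first blank line (the end of the headers).
strip_cgi_headers() {
    tr -d '\r' | sed '1,/^$/d'
}
</pre>
	  <p>A wrapper might then capture the body with something like
	  <code>BANNER_HTML=$(QUERY_STRING= /path/to/banner.cgi | strip_cgi_headers)</code>,
	  where the path is a placeholder. Setting QUERY_STRING on that one
	  command line empties it for the banner CGI only, and leaves the
	  wrapper's own copy intact for htsearch.</p>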

	  <p>An alternative approach is to have a cron job that periodically
	  regenerates a different header.html or wrapper.html with the
	  new banner ad, or changes a link to a different pre-generated
	  header.html or wrapper.html file. For other alternatives, see
	  question <a href="#q4.7">4.7</a>.</p>

	  <strong>4.12. <a name="q4.12">How do I index numbers in documents?</a></strong><br>
	  <p>By default, htdig doesn't treat numbers without letters
	  as words, so it doesn't index them.
	  To change this behavior, you must set the
	  <a href="attrs.html#allow_numbers">allow_numbers</a>
	  attribute to true, and rebuild your index from scratch using
	  rundig or htdig with the -i option, so that bare numbers get
	  added to the index.</p>

	  <strong>4.13. <a name="q4.13">How can I call htsearch from a hypertext
	  link, rather than from a search form?</a></strong><br>
	  <p>If you change the search.html form to use the GET method
	  rather than POST, you can see the URLs complete with all the
	  arguments that htsearch needs for a query. Here is an example:<br>
<code>
http://www.grommetsRus.com/cgi-bin/htsearch?config=htdig&amp;restrict=&amp;exclude=&amp;method=and&amp;format=builtin-long&amp;words=grapple+grommets
</code>
	  which can actually be simplified to:<br>
<code>
http://www.grommetsRus.com/cgi-bin/htsearch?method=and&amp;words=grapple+grommets
</code>
	  with the current defaults. The "&amp;" character acts as a
	  separator for the input parameters, while the "+" character
	  acts as a space character within an input parameter.
	  In versions 3.1.5 or 3.2.0b2, or later, you can use a semicolon
	  character ";" as a parameter separator, rather than "&amp;", for
	  HTML 4.0 compliance.
	  Most non-alphanumeric characters should be hex-encoded following
	  the convention for URL encoding (e.g. "%" becomes "%25", "+"
	  becomes "%2B", etc). Any htsearch input parameter that you'd
	  use in a search form can be added to the URL in this way.
	  This can be embedded into an &lt;a href="..."&gt; tag.
	   <br>See also question <a href="#q5.21">5.21</a>.</p>
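	  <p>If you generate such links from a script, the encoding rules
	  above can be sketched as a small shell function. The function
	  name is hypothetical, and the sketch only handles plain ASCII
	  input: it leaves letters, digits, ".", "_" and "-" alone, turns
	  spaces into "+", and hex-encodes everything else.</p>
<pre>
# Encode one htsearch parameter value for use in a URL (ASCII only).
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [a-zA-Z0-9._-]) out=$out$c ;;
            ' ') out=$out+ ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}
</pre>
	  <p>With this, <code>urlencode 'grapple grommets'</code> gives the
	  <code>grapple+grommets</code> value used in the example URLs
	  above, ready to be appended after <code>words=</code>.</p>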

	  <strong>4.14. <a name="q4.14">How do I restrict a search to only meta
	  keywords entries in documents?</a></strong><br>
	  <p>First of all, you do <strong>not</strong> do this by using the
	  "keywords" field in the search form. This seems to be a
	  frequent cause of confusion.	The "keywords" input parameter
	  to htsearch has absolutely nothing to do with searching meta
	  keywords fields.  It actually predates the addition of meta
	  keyword support in 3.1.x.  A better choice of name for the
	  parameter would have been "requiredwords", because that's what
	  it really means - a list of words that are all required to be
	  found somewhere in the document, in addition to the words the
	  user specifies in the search form.</p>

 	  <p>As of 3.2.0b5, the most direct way to search for a particular
 	  meta keyword is to specify the word as "keyword:&lt;word&gt;".
 	  Similarly, "title:", "heading:", and "author:" restrict searches
 	  to the respective fields.  To search for words in the body of the
 	  text, use "text:".</p>
 
 	  <p>To restrict all search terms to meta keywords only, you can set all
 	  <a href="attrs.html#heading_factor">factors</a> other than
 	  keywords_factor to 0, and for 3.1.x, you
 	  must then reindex your documents.  In the 3.2 betas, you can
	  change factors at search time without needing to reindex.
	  As of 3.2.0b5, it is possible to restrict
	  the search in the query itself.  Note that changing the scoring
	  factors in this way will only alter the scoring of search results,
	  and shift the low or zero scores to the end of the results when
	  sorting by score (as is done by default). For versions before
	  3.2.0b5, the results with scores
	  of zero aren't actually removed from the search results.</p>

	  <strong>4.15. <a name="q4.15">Can I use meta tags to prevent htdig from
	  indexing certain files?</a></strong><br>
	  <p>Yes, in each HTML file you want to exclude, add the following
	  between the &lt;HEAD&gt; and &lt;/HEAD&gt; tags:</p>
		<blockquote>
		   &lt;META NAME="robots" CONTENT="noindex, follow"&gt;
		</blockquote>
	  <p>Doing so will allow htdig to still follow links to other documents,
	  but will prevent this document from being put into the index itself.
	  You can also use "nofollow" to prevent following of links. See
	  the section on <a href="meta.html">Recognized META information</a>
	  for more details. For documents produced automatically by MhonArc,
	  you can have that line inserted automatically by putting it in the
	  MhonArc resource file, in the sections IDXPGBEGIN and TIDXPGBEGIN.</p>

	  <p>You can also use the
	  <a href="attrs.html#noindex_start">noindex_start</a> and
	  <a href="attrs.html#noindex_end">noindex_end</a> attributes to
	  define one set of tags which will mark sections to be stripped out
	  of documents, so they don't get indexed, or you can mark sections
	  with the non-DTD &lt;noindex&gt; and &lt;/noindex&gt; tags.
	  The noindex_start and noindex_end attributes can also be used to
	  suppress in-line JavaScript code that wasn't properly enclosed in
	  HTML comment tags (see question <a href="#q4.26">4.26</a>).
	  In 3.1.6, you can also put a section between &lt;noindex follow&gt;
	  and &lt;/noindex&gt; tags to turn off indexing of text but still
	  allow htdig to follow links.</p>

	  <p>If you require much more elaborate schemes for avoiding indexing
	  certain parts of your HTML files, especially if you don't have
	  control over these files and can't add tags to them, you can
	  set up htdig's
	  <a href="attrs.html#external_parsers">external_parsers</a> attribute
	  with an external converter that will preprocess the HTML before
	  it's parsed and indexed by htdig. Examples of this are the
	  unhypermail.sh script in our
	  <a href="http://www.htdig.org/files/contrib/parsers/">contributed parsers</a>
	  and the ungeoify.sh script in our
	  <a href="http://www.htdig.org/files/contrib/scripts/">contributed scripts</a>.
	  By preprocessing the HTML, you can strip out parts you don't want, or
	  you can add or change tags wherever they're needed, if you're willing
	  to put in the effort to learn awk/sed/perl enough to do the job.</p>

	  <strong>4.16. <a name="q4.16">How do I get htsearch to use the star image
	  in a different directory than the default /htdig?</a></strong><br>
	  <p>You must set either the
	  <a href="attrs.html#image_url_prefix">image_url_prefix</a> attribute,
	  or both <a href="attrs.html#star_blank">star_blank</a> and
	  <a href="attrs.html#star_image">star_image</a> in your
	  htdig.conf, to refer to the URL path for these files. You should
	  also set this URL path similarly in common/header.html and
	  common/wrapper.html, as they also refer to the star.gif file.
	  If you want to relocate other graphics, such as the buttons or
	  the ht://Dig logo, you should change all references to these
	  in htdig.conf and common/*.html.</p>
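
	  <p>For instance, assuming the images were copied to a hypothetical
	  /images/htdig URL path, the settings might look something like the
	  sketch below. Only one of the two alternatives is needed. The image
	  file names are assumed to be the ones shipped with the package.</p>
<pre>
# htdig.conf, alternative 1 (the /images/htdig path is an assumption)
image_url_prefix: /images/htdig

# htdig.conf, alternative 2: point at both star images directly
star_image: /images/htdig/star.gif
star_blank: /images/htdig/star_blank.gif
</pre>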

	  <strong>4.17. <a name="q4.17">How do I get htdig or htsearch to rewrite
	  URLs in the search results?</a></strong><br>
	  <p>This can be done by using the <a
	  href="attrs.html#url_part_aliases">url_part_aliases</a>
	  configuration file attribute. You have to set up different
	  configuration files for htdig and htsearch, to define a
	  different setting of this attribute for each one.</p>

	  <p>A large number of users insist on ignoring that last point
	  and try to make do with just one definition, either for htdig
	  or htsearch, or sometimes for both. This seems to stem from
	  a fundamental misunderstanding of how this attribute works,
	  so perhaps a clarification is needed. The url_part_aliases
	  attribute uses a two stage process. In the first stage, htdig
	  encodes the URLs as they go into the database, by using the
	  pairs in url_part_aliases going from left to right. In the
	  second stage, htsearch decodes the encoded URLs taken from the
	  database, by using the pairs in url_part_aliases going from
	  right to left. If you have the same value for url_part_aliases
	  in htdig and htsearch, you end up with the same URLs in the
	  end. If you modify the first string (the from string) in
	  the pairs listed in url_part_aliases for htsearch, then when
	  htsearch decodes the URLs it ends up rewriting part of them.</p>
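
	  <p>A minimal sketch of such a pair of definitions might look like
	  this. The host names are hypothetical. The point is that the
	  right-hand code (*site here) stays identical in both files while
	  the left-hand string differs, so a URL encoded by htdig under one
	  prefix is decoded by htsearch under another.</p>
<pre>
# In the configuration file used by htdig (hypothetical host name):
url_part_aliases: http://internal.example.com/ *site

# In the separate configuration file used by htsearch:
url_part_aliases: http://www.example.com/ *site
</pre>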

	  <p>While you might think that if you don't use url_part_aliases
	  in htdig, then you can use it in htsearch to alter unencoded
	  URLs, the reality is that if you don't encode parts of URLs
	  using url_part_aliases, they still get encoded automatically
	  by the <a href="attrs.html#common_url_parts">common_url_parts</a>
	  attribute. This helps to reduce the size of your databases. So,
	  trying to use url_part_aliases only in htsearch doesn't work
	  because there are no unencoded URLs in the database, so the
	  right hand strings in the pairs you define won't match anything.</p>

	  <p>You also can't put two different definitions of the
	  url_part_aliases attribute in a single configuration file, as
	  some users have attempted. When you define an attribute twice,
	  the second definition merely overrides the first. Pay close
	  attention to the description and examples for
	  <a href="attrs.html#url_part_aliases">url_part_aliases</a>.
	  You must put one definition of this attribute in your
	  configuration file for htdig, htmerge (or htpurge) and htnotify,
	  and a different definition of it in your configuration file
	  for htsearch.</p>

	  <strong>4.18. <a name="q4.18">What are all the options in
	  htdig.conf, and are there others?</a></strong><br>
	  <p>In ht://Dig's terminology, the settings in its configuration
	  files are called <a href="attrs.html">configuration attributes</a>,
	  to distinguish them from <a href="htdig.html">command line
	  options</a>, <a href="hts_form.html">CGI input parameters</a>
	  and <a href="hts_templates.html">template variables</a>. There are
	  many, many attributes that can be set to control almost all
	  aspects of indexing, searching, customization of output and
	  internationalization. All attributes have a built-in default
	  setting, and only a subset of these appear in the sample htdig.conf
	  file. See the documentation for all default values for attributes
	  not overridden in the configuration file, and for help on using
	  any of them.
	  See also question <a href="#q1.15">1.15</a>.</p>

	  <strong>4.19. <a name="q4.19">How do I get more than 10 pages of
	  10 search results from htsearch?</a></strong><br>
	  <p>There are two attributes that control the number of matches per
	  page and the total number of pages. The number of matches per page
	  can be set in your configuration file, using the
	  <a href="attrs.html#matches_per_page">matches_per_page</a> attribute,
	  or in your <a href="hts_form.html">search form</a>, using the
	  <strong>matchesperpage</strong> input parameter.</p>

	  <p>The number of pages is controlled by the
	  <a href="attrs.html#maximum_pages">maximum_pages</a> attribute in
	  your search configuration file.
	  The current default for maximum_pages is 10 because the ht://Dig
	  package comes with 10 images, with numbers 1 through 10, for
	  use as page list buttons. If we increased the limit, we'd have
	  to field a whole lot more questions from users irate because
	  only the first 10 buttons are graphics, and the rest are text.
	  If you want more than 10 pages of results, change maximum_pages,
	  but you may also want to set the
	  <a href="attrs.html#page_number_text">page_number_text</a> and
	  <a href="attrs.html#no_page_number_text">no_page_number_text</a>
	  attributes in your search configuration file to nothing, or
	  remove them, to use text rather than images for the links to
	  other pages.</p>

	  <p>In versions of htsearch before 3.1.4, maximum_pages
	  limited only the number of page list buttons, and not the
	  actual number of pages. This was changed because there was no
	  means of limiting the total number of pages, but this ended up
	  frustrating users who wanted the ability to have more pages than
	  buttons. In 3.2.0b3 and 3.1.6 we introduced a
	  <a href="attrs.html#maximum_page_buttons">maximum_page_buttons</a>
	  attribute for this purpose.</p>
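
	  <p>Putting these attributes together, a search configuration that
	  allows more than 10 pages might look roughly like the following.
	  The numbers are arbitrary illustrations, not recommended values.</p>
<pre>
# search configuration file (illustrative values)
matches_per_page: 20
maximum_pages: 30

# Set to nothing so that text is used instead of the numbered images
page_number_text:
no_page_number_text:

# Available only in 3.1.6 and in 3.2.0b3 or later
maximum_page_buttons: 10
</pre>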

	  <strong>4.20. <a name="q4.20">How do I restrict a search to only
	  certain subdirectories or documents?</a></strong><br>
	  <p>That depends on whether you want to protect certain parts of
	  your site from prying eyes, or just limit the scope of search
	  results to certain relevant areas. For the latter, you just need
	  to set the <strong>restrict</strong> or <strong>exclude</strong>
	  input parameter in the <a href="hts_form.html">search form</a>.
	  This can be done using hidden input fields containing preset
	  values, text input fields, select lists, radio buttons or
	  checkboxes, as you see fit. If you use select lists, you can
	  propagate the choices to select lists in the follow-up search
	  forms using the
	  <a href="attrs.html#build_select_lists">build_select_lists</a>
	  configuration attribute.
	  The University at Albany has a good description of how to use
	  the <strong>restrict</strong> or <strong>exclude</strong> input
	  parameters: <a href="http://www.albany.edu/its/web/search/">
	  Constructing a local search using ht://Dig Search forms</a>.
	  <br>To include a hex encoded character (such as a %20 for a space)
	  in a restrict or exclude string, the '%' must again be encoded.
	  For example, to match a filename containing a space, the URL must
	  contain %20, and so the CGI parameter passed to htsearch must
	  contain %2520. The %25 encodes the '%'. (Note that this is only
	  necessary for CGI input parameters, not for the corresponding
	  configuration attributes in your htdig.conf file, as attributes
	  aren't subjected to the same hex decoding step as parameters are.)
	  <br>See also question <a href="#q4.4">4.4</a>.</p>
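
	  <p>To illustrate the difference in encoding, here is a sketch that
	  uses the <strong>restrict</strong> configuration attribute with a
	  hypothetical URL containing a space. The comment shows what the
	  same restriction would look like when passed as a CGI parameter.</p>
<pre>
# search configuration file: attribute values are not hex-decoded,
# so the space in the directory name is written as a single %20.
restrict: http://www.example.com/docs/annual%20report/

# Passed as a CGI input parameter instead, the '%' itself is encoded:
#   restrict=http://www.example.com/docs/annual%2520report/
</pre>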

	  <p>If you wish to keep secure and non-secure areas on
	  your site separate, and avoid having unauthorized users
	  seeing documents from secure areas in their search results,
	  that takes a bit more effort. You certainly can't rely on
	  the <strong>restrict</strong> and <strong>exclude</strong>
	  parameters, or even the <strong>config</strong> parameter,
	  as any parameter in a search form can also be overridden
	  by the user in a URL with CGI parameters. The safest
	  option would be to host the secure and non-secure areas on
	  separate servers with independent installations of htsearch,
	  each with its own ht://Dig database, but that is often too
	  costly or impractical an option. The next best thing is to
	  host them on the same site, but make sure that everything
	  is very clearly separated to prevent any leakage of secure
	  data. You should maintain separate databases for the secure
	  and public areas of your site, by setting up different htdig
	  configuration files for each area. Use different settings
	  of the <a href="attrs.html#start_url">start_url</a>,
	  <a href="attrs.html#limit_urls_to">limit_urls_to</a>
	  and <a href="attrs.html#database_dir">database_dir</a>
	  configuration attributes, and possibly even different
	  <a href="attrs.html#common_dir">common_dir</a> settings as well.
	  Make sure your database_dir, and even your common_dir, are not
	  in any directories accessible from the web server. Run htdig
	  and htmerge (or rundig) with each separate configuration file,
	  to build your two databases.</p>
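
	  <p>As a sketch of what the two configuration files might contain,
	  assuming a hypothetical site whose protected documents all live
	  under /secure/, and database directories outside the web server's
	  document tree:</p>
<pre>
# Configuration file for the public area (all paths and URLs are assumptions)
database_dir: /var/lib/htdig/public
start_url: http://www.example.com/
limit_urls_to: ${start_url}
exclude_urls: /cgi-bin/ .cgi /secure/

# Separate configuration file for the secure area
database_dir: /var/lib/htdig/secure
start_url: http://www.example.com/secure/
limit_urls_to: ${start_url}
</pre>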

	  <p>The tricky part is to make sure your htsearch program is
	  secure. You don't want to use the same htsearch for the secure
	  and public sites, because otherwise the public site could
	  access the configuration for the secure database, making its
	  data publicly accessible. You must either compile two separate
	  versions of htsearch, with different settings of the CONFIG_DIR
	  <em>make</em> variable, or you must make a simple wrapper
	  script for htsearch that overrides the compiled-in CONFIG_DIR
	  setting by a different setting of the CONFIG_DIR environment
	  variable. Make sure the CONFIG_DIR for the secure area is
	  not a subdirectory of the CONFIG_DIR for the public area.
	  In this way, you can maintain separate directories of config
	  files for the public and secure sites, so that the secure
	  config files are not accessible from the public htsearch.</p>

	  <p>Put the htsearch binary or wrapper script for the secure site
	  in a different ScriptAlias'ed cgi-bin directory than the public
	  one, and protect the secure cgi-bin with a .htaccess file or
	  in your server configuration. Alternatively, you can put the
	  secure program, let's call it htssearch, in the same cgi-bin,
	  but protect that one CGI program in your server configuration,
	  e.g.:</p>
<pre>
&lt;Location /cgi-bin/htssearch&gt;
AuthType Basic
AuthName ....
AuthUserFile ...
AuthGroupFile ...
&lt;Limit GET POST&gt;
require group foo
&lt;/Limit&gt;
&lt;/Location&gt;
</pre>
	  <p>This describes the setup for an Apache server. You'd need to
	  work out an equivalent configuration for your server if you're
	  not running Apache.</p>

	  <strong>4.21. <a name="q4.21">How can I allow people to search
	  while the index is updating?</a></strong><br>
	  <p>Answer contributed by Avi Rappoport &lt;avirr@searchtools.com&gt;</p>
	  <p>If you have enough disk space for two copies of the index
	  database, use -a with the htdig and htmerge processes. This will
	  make use of a copy of the index database with the extension
	  ".work", and update the copy instead of the originals.
	  This way, htsearch can use those originals while the update is
	  going on. When it's done, you can move the .work versions to
	  replace the originals, and htsearch will use them. The current
	  rundig script will do this for you if you supply the -a flag
	  to it. However, rundig builds the database from scratch each
	  time you run it. If you want to update an alternate copy of
	  the database, see the
	  <a href="http://www.htdig.org/files/contrib/scripts/rundig.sh">contributed
	  rundig.sh script</a>.</p>

	  <strong>4.22. <a name="q4.22">How can I get htdig to ignore the
	  robots.txt file or meta robots tags?</a></strong><br>
	  <p>You can't, and you shouldn't. The
	  <a href="http://www.robotstxt.org/wc/norobots.html">
	  Standard for Robot Exclusion</a> exists for a very good reason,
	  and any well behaved indexing engine or spider should conform to it.
	  If you have a problem with a robots.txt file, you really should
	  take it up with the site's webmaster. If they don't have a problem
	  with you indexing their site, they shouldn't mind setting up a
	  User-agent entry in their robots.txt file with a name you both
	  agree on. The user agent setting that htdig uses for matching
	  entries in robots.txt can be changed via the
	  <a href="attrs.html#robotstxt_name">robotstxt_name</a> attribute in
	  your config file.</p>
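
	  <p>For example, if you and the webmaster agreed on the hypothetical
	  name "examplebot", the two sides of the arrangement might look like
	  this sketch. The robots.txt lines are shown as comments because
	  they belong in the remote site's file, not in your configuration.</p>
<pre>
# htdig.conf (the name is a hypothetical example)
robotstxt_name: examplebot

# Matching entry the webmaster would add to the site's robots.txt:
#   User-agent: examplebot
#   Disallow:
</pre>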

	  <p>If you have a problem with a robots meta tag in a document
	  (see question <a href="#q4.15">4.15</a>) you should take it up
	  with the author or maintainer of that page. These tags are an
	  all or nothing deal, as they can't be set up to allow some engines
	  and disallow others. If htdig encounters them, it has to give the
	  page's creator the benefit of the doubt and honour them. If
	  exceptions to the rule are wanted, this should be done with a
	  robots.txt file rather than a meta tag.</p>

	  <strong>4.23. <a name="q4.23">How can I get htdig not to index
	  some directories, but still follow links?</a></strong><br>
	  <p>You can simply add the directory name to your robots.txt file
	  or to the <a href="attrs.html#exclude_urls">exclude_urls</a>
	  attribute in your configuration, but that will exclude all files
	  under that directory. If you want the files in that directory to
	  be indexed, you have a couple of options. You can add an index.html
	  file to the directory, that will include a robots meta tag
	  (see question <a href="#q4.15">4.15</a>) to prevent indexing,
	  and will contain links to all your files in this directory.
	  The drawback of this is that you must maintain the index.html
	  file yourself, as it won't be automatically updated as new
	  files are added to the directory.</p>

	  <p>The other technique you can use, if you want the directory
	  index to be made by the web server, is to get the server to
	  insert the robots meta tag into the index page it generates.
	  In Apache, this is done using the
	  <a href="http://httpd.apache.org/docs/mod/mod_autoindex.html#headername">HeaderName</a>
	  and <a href="http://httpd.apache.org/docs/mod/mod_autoindex.html#indexoptions">IndexOptions</a>
	  directives in the directory's <strong>.htaccess</strong> file.
	  For example:</p>
<pre>   HeaderName .htrobots 
   IndexOptions FancyIndexing SuppressHTMLPreamble
</pre>
	  <p>and in the .htrobots file:</p>
<pre>&lt;HTML&gt;&lt;head&gt;
&lt;META NAME="robots" CONTENT="noindex, follow"&gt;
&lt;title&gt;Index of /this/dir&lt;/title&gt;
&lt;/head&gt;
</pre>

	  <p>If you don't mind getting just one copy of each directory,
	  but want to suppress the multiple copies generated by Apache's
	  FancyIndexing option, you can either turn off FancyIndexing or
	  you can add "?D=A ?D=D ?M=A ?M=D ?N=A ?N=D ?S=A ?S=D" to
          the <a href="attrs.html#bad_querystr">bad_querystr</a> attribute
	  (without the quotes) to suppress the alternately sorted views of
	  the directory. For Apache 2.x, you'd use "C=D C=M C=N C=S O=A O=D"
	  instead in your bad_querystr setting.</p>
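
	  <p>Written out as configuration, the two variants described above
	  would look like this. Use whichever one matches the Apache version
	  that serves the directory listings.</p>
<pre>
# htdig.conf, for the older Apache FancyIndexing query strings
bad_querystr: ?D=A ?D=D ?M=A ?M=D ?N=A ?N=D ?S=A ?S=D

# htdig.conf, for Apache 2.x
bad_querystr: C=D C=M C=N C=S O=A O=D
</pre>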

	  <strong>4.24. <a name="q4.24">How can I get rid of duplicates in
	  search results?</a></strong><br>
	  <p>This depends on the cause of the duplicate documents. htdig
	  does keep track of the URLs it visits, so it never puts the
	  same URL more than once in the database. So, if you have
	  duplicate documents in your search results, it's because the
	  same document appears under different URLs. Sometimes the
	  URLs vary only slightly, and in subtle ways, so you may have
	  to look hard to find out what the variation is. Here are some
	  common reasons, each requiring a different solution.</p>

	  <ul>
	  <li>You're indexing a case insensitive web
	  server (e.g. an NT based server), but the
          <a href="attrs.html#case_sensitive">case_sensitive</a> attribute is
	  still set to true. In this case, if htdig encounters two URLs
	  pointing to the same document, but the case of the letters in
	  one is different than the other (even if it's only 1 letter),
	  it will not treat them as the same URL. Setting case_sensitive
	  to false makes htdig treat such URLs as identical.<br><br>
	  <li>You have symbolic links (or hard links) to some of
	  these documents, so they can be reached by several URLs.
	  The solution here is to build an exclude list of URLs that
	  are actually symbolic links, and to put these in
	  <a href="attrs.html#exclude_urls">exclude_urls</a>
	  (or in your robots.txt file). You can automate this using a
	  technique similar to the find command in question
	  <a href="#q5.25">5.25</a> which builds the start_url list, but
	  adding a -type l to find symbolic links.<br><br>
	  <li>You have copies of the same documents in different
	  locations. This is similar to the symbolic link problem above,
	  but harder to fix automatically.<br><br>
	  <li>The duplicate URLs result from CGI, SSI or other dynamic pages
	  that give the same content even though there may be variations in
	  the query string or other parts of the URL. The approach to
	  fix this is similar to the fix above, but may be less easy
	  to automate, depending on what the variations are. You can
	  add patterns to exclude_urls or bad_querystr to get rid of
	  unwanted variations. These are especially important to bring
	  under control, because in some cases, if left unchecked, they
	  can result in an <em>infinite virtual hierarchy</em> which htdig
	  will never be able to finish indexing. For example, in a CGI-based
	  calendar, htdig could go on following next month or next
	  year links to infinity, but this can be stopped by adding a
	  stop year to <a href="attrs.html#bad_querystr">bad_querystr</a>.
	  <br><br>Another common example happens when htdig hits a link
	  to an SSI page and the URL has an extra trailing slash. This
	  can happen with either .shtml pages or .html pages that use
	  the XBitHack. The trailing slash causes the URL to be misinterpreted
	  as a directory URL, and any relative URLs in the document are added
	  to the URL, creating longer and longer URLs that still lead to the
	  same SSI document. There are two things you can do:<ol>
		<li>hunt down the pages with the incorrect links, i.e.
		search for ".shtml/" or ".html/" in URLs in your documents,
		and fix these links; or
		<li>add .shtml/ and .html/ to your
		<a href="attrs.html#exclude_urls">exclude_urls</a>
		setting to get htdig to ignore these defective links.
	  </ol>The second option is easier, but you run the risk that htdig
	  will miss some SSI pages if the only links to them have the trailing
	  slash, so you may want to try hunting down the links anyway.
	  <br><br>See also question <a href="#q5.29">5.29</a>.<br><br>
	  <li>The duplicates result from session IDs in PHP or other dynamic
	  pages that give the same content even though the ID changes during
	  the indexing process. This can lead not only to duplicates, but
	  also to URLs that become unusable because of expired session IDs.
	  Session IDs are the bane of search engines, and you should avoid
	  using them if at all possible. If getting rid of them altogether
	  isn't an option, then you can at least remove them while indexing,
	  using the <a href="attrs.html#url_rewrite_rules">url_rewrite_rules</a>
	  attribute. This will only work if htdig can access the documents
	  without a session ID, as htdig rewrites the URL before fetching the
	  document, and htsearch presents the rewritten URL (without session
	  ID) in search results.
          </ul>
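
	  <p>As a sketch, the first and fourth fixes above might translate
	  into settings like the following. The /cgi-bin/ and .cgi patterns
	  are kept on the assumption that they are already part of your
	  exclude_urls setting. Adjust the list to whatever you currently
	  use.</p>
<pre>
# htdig.conf: the indexed server ignores case in URLs
case_sensitive: false

# htdig.conf: ignore SSI links that carry a stray trailing slash
exclude_urls: /cgi-bin/ .cgi .shtml/ .html/
</pre>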

	  <strong>4.25. <a name="q4.25">How can I change the scores in
	  search results, and what are the defaults?</a></strong><br>
	  <p>The scores are calculated mostly by htdig at indexing time,
	  with some tweaking done by htsearch at search time. There are
	  a number of <a href="attrs.html">configuration attributes</a>,
	  all called <em>&lt;something&gt;</em><strong>_factor</strong>,
	  which can control the scoring calculations. In addition, the
	  location of words within the document has an effect on score,
	  as word scores are also multiplied by a varying location
	  factor somewhere in between 1000 for words near the start
	  and 1 for words near the end of the document. As of yet,
	  there is no way to change this factor. For any of the scoring
	  factors you can configure, and which are used by htdig, you
	  will have to reindex your documents so the new factors take
	  effect. The default values for these scoring factors, as well as
	  information about whether they're used by htdig or htsearch,
	  are all listed in the <a href="attrs.html">configuration
	  attributes documentation</a>. Malcolm Austen has written some
	  <a href="http://wwwsearch.ox.ac.uk/scores.html">notes on page
	  scores</a> for 3.1.x which you may find helpful.</p>

	  <p>Note that the above applies to the 3.1.x releases, while
	  in the 3.2 beta releases, all scores are calculated at search
	  time with no weight being put on the location of words within
	  the document.</p>
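
	  <p>As an illustration only, a few of these factors could be
	  overridden as shown below. The values are arbitrary and are not
	  the defaults. Check the attributes documentation for the actual
	  defaults and for which program uses each factor in your version.</p>
<pre>
# htdig.conf (arbitrary example values, not recommendations)
title_factor: 200
keywords_factor: 50
meta_description_factor: 25
</pre>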

	  <strong>4.26. <a name="q4.26">How can I get htdig not to index
	  JavaScript code or CSS?</a></strong><br>
	  <p>The HTML parser in htdig recognizes and parses only HTML,
	  which is all there should be within an HTML file. If your HTML
	  files contain in-line JavaScript code or Cascading Style Sheets
	  (CSS), these in-line codes, which are clearly not HTML, should
	  be enclosed within an HTML comment tag so they are hidden
	  from the HTML parser, or for that matter from any
	  web client that is not JavaScript-aware or CSS-aware. See
	  <a href="http://www.mcli.dist.maricopa.edu/show/interact/js_b.html">
	  Behind the Scenes with JavaScript</a> for a description of the
	  technique, which applies equally well to in-line style sheets.
	  If fixing up all non-HTML compliant JavaScript or CSS code in
	  your HTML files is not an option, then see question
	  <a href="#q4.15">4.15</a> for an alternate technique.</p>

	  <p>The HTML parser in htdig 3.1.6 tries skipping over bare
	  in-line JavaScript code in HTML, unlike previous versions,
	  but a small bug in the parser causes it to be thrown off by a
	  "&lt;" sign in the JavaScript, and it may then miss the closing
	  &lt;/script&gt; tag. This can be fixed by applying this
	  <a href="ftp://ftp.ccsf.org/htdig-patches/3.1.6/JavaScript.0">
	  patch</a>.</p>

	  <hr noshade size=2>

	  <h3>5. Troubleshooting</h3>
	  <strong>5.1. <a name="q5.1">I can't seem to index more than X documents
	  in a directory.</a></strong><br>
	  <p>This usually has to do with the default document size
	  limit. If you set <a href="attrs.html#max_doc_size">
	  max_doc_size</a> in your config file to
	  a value large enough to read in the directory index (try 100000 for
	  100K) this should fix this problem. Of course this will require
	  more memory to read the larger file. Don't set it to a value
	  larger than the amount of memory you have, and never more than
	  about 2 billion, the maximum value of a 32-bit integer.
	  If htdig is missing entire directories, see question
	  <a href="#q5.25">5.25</a>.</p>
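
	  <p>Using the value suggested above, the setting would be written
	  like this. Larger directory listings may need a larger number.</p>
<pre>
# htdig.conf: allow documents of up to about 100K to be read
max_doc_size: 100000
</pre>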

	  <strong>5.2. <a name="q5.2">I can't index PDF files.</a></strong><br>
	  <p>As above, this usually has to do with the default document
	  size. What happens is ht://Dig will read in part of a PDF file
	  and try to index it. This usually fails. Try setting
	  <a href="attrs.html#max_doc_size">max_doc_size</a>
	  in your config file to a larger value than the
	  size of your largest PDF file. Don't go overboard, though, as
	  you don't want to overflow a 32-bit integer (about 2 billion),
	  and you don't want to allocate much more memory than you need
	  to store the largest document.</p>

	  <p>There is a bug in Adobe Acrobat Reader version 4, in its
	  handling of the -pairs option, which causes a segmentation
	  violation when using it with htdig 3.1.2 or earlier. There is
	  a workaround for this as of version 3.1.3: you must remove
	  the -pairs option from your pdf_parser definition, if it's
	  there.  However, acroread version 4 is still very unstable (on
	  Linux, anyway) so it is not recommended as a PDF parser. An
	  alternative is to use an external converter with the xpdf 0.90
	  package installed on your system, as described in question <a
	  href="#q4.9">4.9</a> above.</p>

	  <strong>5.3. <a name="q5.3">When I run "rundig," I get a message about
	  "DATABASE_DIR" not being found.</a></strong><br>
	  <p>This is due to a bug in the Makefile.in file in version
	  3.1.0b1. The easiest fix is to edit the rundig file and change
	  the line "TMPDIR=@DATABASE_DIR@" to set TMPDIR to a directory
	  with a large amount of temporary disk space for htmerge. This
	  bug is fixed in version 3.1.0b2.</p>

	  <strong>5.4. <a name="q5.4">When I run htmerge, it stops with an "out
	  of diskspace" message.</a></strong><br>
	  <p>This means that htmerge has run out of temporary disk space
	  for sorting. Either in your "rundig" script (if you run htmerge
	  through that) or before you run htmerge, set the variable TMPDIR
	  to a temp directory with lots of space.</p>

	  <strong>5.5. <a name="q5.5">I have problems running rundig from cron
	  under Linux.</a></strong><br>
	  <p>This problem commonly occurs on Red Hat Linux 5.0 and 5.1,
	  because of a bug in vixie-cron. It causes htmerge to fail with a
	  "Word sort failed" error. It's fixed in Red Hat 5.2.
	  You can install vixie-cron-3.0.1-26.{arch}.rpm from a 5.2
	  distribution to fix the problem on 5.0 or 5.1. A quick fix for
	  the problem is to change the first line of rundig to "#!/bin/ash"
	  which will run the script through the ash shell, but this doesn't
	  solve the underlying problem.</p>

	  <strong>5.6. <a name="q5.6">When I run htmerge, it stops with an
	  "Unexpected file type" message.</a></strong><br>
	  <p>Often this is because the databases are corrupt. Try removing
	  them and rebuilding. If this doesn't work, some have found that
	  the solution for question <a href="#q3.2">3.2</a> works for this
	  as well. This should be fixed in the 3.1.x versions and later.</p>

	  <strong>5.7. <a name="q5.7">When I run htsearch, I get lots of Internal
	  Server Errors (#500).</a></strong><br>
	  <p>If you are running under Solaris, see <a href="#q3.6">3.6</a>.
	  The solution for Solaris may also work for other OSes that use shared
	  libraries in non-standard locations, so refer to question 3.6 if
	  you suspect a shared library problem. In any case, check your web
	  server error logs to see the cause of the internal server errors.
	  If it's not a problem with shared libraries, there's a good chance
	  that the error logs will still contain useful error messages that
	  will help you figure out what the problem is.
	  <br>See also questions <a href="#q5.13">5.13</a> and
	  <a href="#q5.23">5.23</a>.</p>

	  <strong>5.8. <a name="q5.8">I'm having problems with indexing words
	  with accented characters.</a></strong><br>
	  <p>
	  Most of the time, this is caused by either not setting or
	  incorrectly setting the <a
	  href="attrs.html#locale">locale</a> attribute. The default locale
	  for most systems is the "portable" locale, which strips
	  everything down to standard ASCII. Most systems expect
	  something like <code>locale: en_US</code> or
	  <code>locale: fr_FR</code>. Locale files are often found in
	  <code>/usr/share/locale</code> or the <tt>$LANGUAGE</tt>
	  environment variable. See also question <a href="#q4.10">4.10</a>.
	  </p>

	  <p>Setting the locale correctly seems to be a frequent source of
	  frustration for ht://Dig users, so here are a few pointers which
	  some have found useful. First of all, if you don't have any luck
	  with the settings of the <a href="attrs.html#locale">locale</a>
	  attribute that you try, make sure you use a locale that is
	  defined on your system. As mentioned above, these are usually
	  installed in <code>/usr/share/locale</code>, so look there
	  for a directory named for the locale you want to use. If
	  you don't find it, but find something close, try that locale
	  name. Note that the locale may not have to be specific to the
	  language you're indexing, as long as it uses the same character
	  set. E.g. most western European languages use the ISO-8859-1
	  Latin 1 character set, so on most systems the locales for
	  all these languages define the same character types table
	  and can be used interchangeably. Some systems, however,
	  define only the accented letters used for a given language,
	  so "your mileage may vary." The important thing is that the
	  directory for your locale definition <strong>must</strong>
	  have a file named <code>LC_CTYPE</code> in it. For example,
	  on many Linux distributions, a language-specific locale like
	  <code>fr</code> won't contain this file, but country-specific
	  locales like <code>fr_FR</code> or <code>fr_CA</code> will. If
	  you don't find any appropriate locales installed on your system,
	  try obtaining and installing the locale definition files from
	  your OS distribution. Also, once you've set your locale, you need
	  to reindex all your documents in order for the locale to take
	  effect in the word database. This means rerunning the "rundig"
	  script, or running "htdig -i" and htmerge (or htpurge in the 3.2
	  betas).</p>
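
	  <p>For example, on a system where a hypothetical
	  <code>/usr/share/locale/fr_FR</code> directory exists and contains
	  an <code>LC_CTYPE</code> file, the setting might be as follows.
	  Remember to reindex afterwards.</p>
<pre>
# htdig.conf (assumes the fr_FR locale is installed on this system)
locale: fr_FR
</pre>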

	  <p>Note also that some UNIX systems and libc5-based Linux
	  systems just don't have a working implementation of locales,
	  so you may not be able to get locales working at all on certain
	  systems. The
	  <a href="http://www.htdig.org/files/contrib/other/testlocale.c">testlocale.c</a>
	  program on our web site can let you see the LC_CTYPE tables
	  for any locale, to aid in finding one that works. Carefully
	  follow the directions in the program's comments to know how to
	  use it and what to look for in its output.</p>

	  <strong>5.9. <a name="q5.9">When I run htmerge, it stops with a
	  "Word sort failed" message.</a></strong><br>
	  <p>There are three common causes of this. First of all, the sort
	  program may be running out of temporary file space. Fix this
	  by freeing up some space where sort puts its temporary files,
	  or change the setting of the TMPDIR environment variable to a
	  directory on a volume with more space. A second common problem
	  is on systems with a BSD version of the sort program (such as
	  FreeBSD or NetBSD). This program uses the -T option as a record
	  separator rather than an alternate temporary directory. On these
	  systems, you must remove the TMPDIR environment variable from
	  rundig, or change the code in htmerge/words.cc not to use the
	  -T option. A third cause is the cron program on Red Hat Linux
	  5.0 or 5.1. (See question <a href="#q5.5">5.5</a> above.)</p>

	  <strong>5.10. <a name="q5.10">When htsearch has a lot of matches, it runs
	  extremely slowly.</a></strong><br>
	  <p>When you run htsearch with no customization, on a
	  large database, and it gets a lot of hits, it tends to
	  take a long time to process those hits. Some users with
	  large databases have reported much higher performance,
	  for searches that yield lots of hits, by setting the <a
	  href="attrs.html#backlink_factor">backlink_factor</a> attribute
	  in htdig.conf to 0, and sorting by score. The scores calculated
	  this way aren't quite as good, but htsearch can process hits
	  much faster when it doesn't need to look up the db.docdb record
	  for each hit, just to get the backlink count, date or title,
	  either for scoring or for sorting. This affects versions
	  3.1.0b3 and up. In version 3.2, currently under development,
	  the databases will be structured differently, so it should
	  perform searches more quickly.</p>

	  <p>In version 3.1.6, the date range selection code also slows
	  down htsearch for the same reason. Unfortunately, a small bug
	  crept into the code so that even if you don't set any of the
	  date range input parameters (startyear, endyear, etc.), and
	  you set backlink_factor and date_factor to 0, htsearch still
	  looks at the date in the db.docdb record for each hit. You can
	  avoid this either by setting startyear to 1969 and endyear to
	  2038 in your config file, or by applying this
	  <a href="ftp://ftp.ccsf.org/htdig-patches/3.1.6/timet_enddate.1">
	  patch</a>.</p>
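
	  <p>Taken together, the settings discussed in this answer might be
	  sketched as follows. The startyear and endyear lines are the
	  workaround for the 3.1.6 bug and are only needed if you haven't
	  applied the patch.</p>
<pre>
# htdig.conf: trade some scoring quality for faster handling of many hits
backlink_factor: 0
date_factor: 0
sort: score

# 3.1.6 only: workaround for the date range bug
startyear: 1969
endyear: 2038
</pre>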

	  <strong>5.11. <a name="q5.11">When I run htsearch, it gives me a count of
	  matches, but doesn't list the matching documents.</a></strong><br>
	  <p>This most commonly happens when you run htsearch while the
	  database is being rebuilt or updated by htdig.
	  If htdig and htmerge have run to completion, and the problem still
	  occurs, this is usually an indication of a corrupted database. If
	  it's finding matches, it's because it found the matching
	  words in db.words.db.  However, it isn't finding the document
	  records themselves in db.docdb, which would suggest that either
	  db.docdb, or db.docs.index (which maps document IDs used in
	  db.words.db to URLs used to look up records in db.docdb), is
	  incomplete or messed up.  You'll likely need to rebuild your
	  database from scratch if it's corrupted. Older versions of
	  ht://Dig were susceptible to database corruption of this
	  sort. Versions 3.1.2 and later are much more stable.</p>

	  <p>Another possible cause of this problem is unreadable result
	  template files. If you define external template files via the
	  <a href="attrs.html#template_map">template_map</a> attribute,
	  rather than using the builtin-short or builtin-long templates,
	  and the file names are incorrect or the files do not have
	  read permission for the user ID under which htsearch runs,
	  then htsearch won't be able to display the results. Also,
	  all directories leading up to these template files must be
	  searchable (i.e. executable) by htsearch, or it won't be able
	  to open the files. This is the opposite problem of that described
	  in question <a href="#q5.36">5.36</a>. If htsearch displays
	  nothing at all, you may have both problems.</p>

	  <strong>5.12. <a name="q5.12">I can't seem to index documents with names
	  like left_index.html with htdig.</a></strong><br>
	  <p>There is a bug in the implementation of the <a
	  href="attrs.html#remove_default_doc">remove_default_doc</a>
	  attribute in htdig versions 3.1.0, 3.1.1 and 3.1.2, which causes
	  it to match more than it should. The default value for this
	  attribute is "index.html", so any URL in which the filename ends
	  with this string (rather than matches it entirely) will have
	  the filename stripped off. This is fixed in version 3.1.3.</p>

	  <strong>5.13. <a name="q5.13">I get Premature End of Script Headers errors
	  when running htsearch.</a></strong><br>
	  <p>This happens when htsearch dies before putting out a
	  "Content-Type" header. If you are running Apache under Solaris,
	  or another system that may be using shared libraries in non-standard
	  locations,
	  first try the solution described in question <a href="#q3.6">3.6</a>.
	  If that doesn't work, or you're running on another system, try
	  running "htsearch -vvv" directly from the command line to see where
	  and why it's failing. It should prompt you for the search words,
	  as well as the format.
	  <br>If it works from the command line, but not from the web
	  server, it's almost certainly a web server configuration problem.
	  Check your web server's error log for any information related to
	  htsearch's failure. One increasingly common problem is Apache
	  configurations which expect all CGI scripts to be Perl,
	  rather than binary executables or other scripts, so they use
	  the "perl-script" handler rather than "cgi-script".
	  <br>See also questions <a href="#q5.7">5.7</a>,
	  <a href="#q5.14">5.14</a> and <a href="#q5.23">5.23</a>.</p>

	  <strong>5.14. <a name="q5.14">I get Segmentation faults when running
	  htdig, htsearch or htfuzzy.</a></strong><br>
	  <p>Despite a great deal of debugging of these programs, we haven't
	  been able to completely eliminate all such problems on all platforms.
	  If you're running htsearch or htfuzzy on a BSDI system, a common
	  cause of core dumps is a conflict between the GNU regex
	  code bundled in htdig 3.1.2 and later, and the BSD C or C++ library.
	  The solution is to use the BSD library's own rx code instead,
	  using version 3.1.6 or newer as summarized by Joe Jah:</p>
		<ul>
		<li> ./configure --with-rx
		<li> make
		</ul>
	  <p>This solution may work on some other platforms as well (we haven't
	  heard one way or the other), but will definitely not work on some
	  platforms. For instance, on libc5-based Linux systems, the bundled
	  regex code works fine by default, but using libc5's regex code
	  causes core dumps.</p>

	  <p>Users of Cobalt Raq or Qube servers have complained of
	  segmentation faults in htdig. Apparently this is due to problems
	  in their C++ libraries, which are fixed in their experimental
	  compiler and libraries. The following commands should install
	  the packages you need:</p>
		<blockquote>
		 rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/binutils-2.8.1-3C1.mips.rpm<br>
		 rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-1.0.2-9.mips.rpm<br>
		 rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-c++-1.0.2-9.mips.rpm<br>
		 rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-g77-1.0.2-9.mips.rpm<br>
		 rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-objc-1.0.2-9.mips.rpm<br>
		 rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/libstdc++-2.8.0-9.mips.rpm<br>
		 rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/libstdc++-devel-2.8.0-9.mips.rpm<br>
		 rpm -Uvh --force ftp://ftp.cobaltnet.com/pub/products/current/RPMS/gcc-2.7.2-C2.mips.rpm
		</blockquote>
	  <p>If you have the libg++ package installed, you may have to remove
	  it before installing libstdc++, because the two packages conflict.
	  Be sure to do a "make clean" before a "make", to remove any object
	  files compiled with the old compiler and headers.</p>

	  <p>For other causes of segmentation faults, or in other programs,
	  getting a stack backtrace after the fault can be useful in narrowing
	  down the problem. E.g.: try "gdb /path/to/htsearch /path/to/core",
	  then enter the command "bt". You can also try running the program
	  directly under the debugger, rather than attempting a post-mortem
	  analysis of the core dump. Options to the program can be given on
	  gdb's "run" command, and after the program is suspended on fault,
	  you can use the "bt" command. This may give you enough information
	  to find and fix the problem yourself, or at least it may help others
	  on the htdig mailing list to point out what to do next.</p>
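	  <p>The post-mortem steps above can also be run non-interactively.
	  This is only a minimal sketch, assuming a gdb recent enough to
	  support the -batch and -ex options. The function name
	  htdig_backtrace is made up for illustration, and the paths in the
	  usage note are placeholders:</p>

```shell
#!/bin/sh
# Print a stack backtrace from a core dump without an interactive session.
# htdig_backtrace is a hypothetical helper name, not part of ht://Dig.
htdig_backtrace() {
    program=$1   # path to the binary that crashed, e.g. htsearch
    core=$2      # path to the core file it left behind
    if ! command -v gdb >/dev/null 2>&1; then
        echo "gdb not found in PATH" >&2
        return 1
    fi
    # -batch exits once the commands have run; -ex bt runs the backtrace.
    gdb -batch -ex bt "$program" "$core"
}
```

	  <p>Calling <code>htdig_backtrace /path/to/htsearch /path/to/core</code>
	  should print the same output as typing "bt" by hand, which makes it
	  easier to paste into a mailing list report.</p>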

	  <strong>5.15. <a name="q5.15">Why does htdig 3.1.3 mangle URL parameters
	  that contain bare "&amp;" characters?</a></strong><br>
	  <p>This is a known bug in 3.1.3, and is fixed with this
	  <a href="ftp://ftp.ccsf.org/htdig-patches/3.1.3/HTML.cc.0">
	  patch</a>. You can apply the patch by changing to the main
	  source directory for htdig-3.1.3, and using the command
	  "patch -p0 &lt; /path/to/HTML.cc.0". This is
	  also fixed as of version 3.1.4.</p>

	  <strong>5.16. <a name="q5.16">When I run htmerge, it stops with an
	  "Unable to open word list file '.../db.wordlist'" message.</a></strong><br>
	  <p>The most common cause of this error is that htdig did not
	  manage to index any documents, and so it did not create a word
	  list. You should repeat the htdig or rundig command with the
	  -vvv option to see where and why it is failing.
	  See question <a href="#q4.1">4.1</a>.</p>

	  <strong>5.17. <a name="q5.17">When using Netscape, htsearch always returns the
	  "No match" page.</a></strong><br>
	  <p>Check your search form. Chances are there is a hidden input 
	  field with no value defined. For example, one user had<br>
	  <code>&lt;input type=hidden name=restrict&gt;</code>

	  in his search form, instead of<br>

         <code>&lt;input type=hidden name=restrict value=""&gt;</code>

	 The problem is that Netscape sets the missing value to a default of "  "
	 (two spaces), rather than an empty string. For the restrict parameter,
	 this is a problem, because htsearch is unlikely to find any URLs with two
	 spaces in them. Other input parameters may pose a similar problem.
	  </p>
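	  <p>One way to look for such fields is to scan the search form for
	  hidden inputs that have no value attribute. The following is a rough
	  sketch, not a full HTML parser: it assumes each &lt;input&gt; tag sits
	  on a single line, and it relies on the -o option of grep, which GNU
	  and BSD grep provide. The function name is made up for
	  illustration:</p>

```shell
#!/bin/sh
# List hidden input tags that lack a value attribute, with line numbers.
# Assumes one tag per line; find_valueless_hidden_inputs is a hypothetical name.
find_valueless_hidden_inputs() {
    # First grep: keep only the text of hidden input tags (up to the closing
    # bracket). Second grep: drop the ones that do set a value.
    grep -i -n -o 'input[^>]*type=[^ >]*hidden[^>]*' "$1" |
        grep -i -v 'value *='
}
```

	  <p>Run it as <code>find_valueless_hidden_inputs /path/to/search.html</code>
	  (a placeholder path). Every line it prints is a candidate for adding
	  <code>value=""</code>.</p>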

	  <p>Another possibility, if you're running 3.2.0b1 or 3.2.0b2, is
	  that you need to make the db.words.db_weakcmpr file writeable by
	  the user ID under which the web server runs. This is a bug, and
	  is fixed in the 3.2.0b5 beta.</p>


	  <strong>5.18. <a name="q5.18">Why doesn't htdig follow links to other
	  pages in JavaScript code?</a></strong><br>
	  <p>There probably isn't any indexing tool in existence
	  that follows JavaScript links, because such tools don't know how
	  to initiate JavaScript events. Realistically, it would take a
	  full JavaScript parser in order to be able to figure out all the
	  possible URLs that the code could generate, something that's way
	  beyond the means of any search engine. You have a few options:</p>
	  <ul>
	  <li>Add "backup" links using plain HTML &lt;a href=...&gt; tags to
	  all the pages that could be accessed through JavaScript,
	  <li>Add &lt;link&gt; tags to point to all these pages (see
	  <a href="http://www.w3.org/TR/html4/struct/links.html#h-12.3.3">Links
	  and search engines</a> in W3C's HTML 4.0 Specification - requires
	  htdig 3.1.3 or greater, but then <em>everyone</em> should be running
	  3.1.6 or greater anyway),
	  <li>Compose a list of all the unreachable documents, or write
	  a program to do so, and feed that list as part of htdig's
	  <a href="attrs.html#start_url">start_url</a> attribute.
	  See also question <a href="#q5.25">5.25</a>.
	  </ul>

	  <strong>5.19. <a name="q5.19">When I run htsearch from the web server,
	  it returns a bunch of binary data.</a></strong><br>
	  <p>Your server is returning the contents of the htsearch binary.
	  Common causes of this are:</p>
	  <ul>
	  <li>no execute permission on the htsearch binary,
	  <li>the binary won't run on this system (it may be compiled
	  for the wrong system type), or
	  <li>the web server doesn't recognize the file as a CGI
	  (for Apache, you must have a ScriptAlias directive for the
	  program or the directory in which it's installed, or define
	  a cgi-script handler for some suffix, e.g. .cgi, and add that
	  suffix to the program file name).
	  </ul>
	  <p>By default, Apache is usually configured with one cgi-bin
	  directory as ScriptAlias, so all your CGI programs must go in
	  there, or have a .cgi suffix on them. Your configuration may
	  differ, however.</p>

	  <strong>5.20. <a name="q5.20">Why are the betas of 3.2 so
	  slow at indexing?</a></strong><br>
	  <p>
	  As the release notes for these versions suggest, they are
	  somewhat unoptimized and are made available for testing.
	  Since the 3.2 code indexes all locations of words to support
	  phrase searching and other advanced methods, this additional
	  data slows down the indexer. To compensate, the code has a
	  cache configured by the
	  <a href="dev/htdig-3.2/attrs.html#wordlist_cache_size">wordlist_cache_size</a>
	  attribute.
	  As of this writing, the word database code will slow down
	  considerably when the cache fills up. Setting the cache as
	  large as possible provides considerable performance
	  improvement. Development is in progress to improve cache
	  performance.
	  For 3.2.0b6 and higher, see also the
	  <a href="dev/htdig-3.2/attrs.html#store_phrases">store_phrases</a> attribute,
	  which can turn off support for phrase searches, improving the speed.
	  </p>

	  <strong>5.21. <a name="q5.21">Why does htsearch use ";" instead of
	  "&amp;" to separate URL parameters for the page buttons?</a></strong><br>
	  <p>In versions 3.1.5 and 3.2.0b2, and later, htsearch was
	  changed to use a semicolon character ";" as a parameter
	  separator for page button URLs, rather than "&amp;", for HTML
	  4.0 compliance. It now allows both the "&amp;" and the ";" as
	  separators for input parameters, because the CGI specification
	  still uses the "&amp;". This change may cause some PHP or CGI
	  wrapper scripts to stop working, but these scripts should be
	  similarly changed to recognize both separator characters.
	  For the definitive reference on this issue, please refer to
	  section B.2.2 of W3C's HTML 4.0 Specification,
	  <a href="http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.2">
	  Ampersands in URI attribute values</a>. We're all a little
	  tired of arguing about it. If you don't like the standard, you
	  can change the Display::createURL() code yourself to ignore it.
	   <br>See also question <a href="#q4.13">4.13</a>.</p>

	  <p>If you want to try working within the new standard, you may
	  find it helpful to know that recent versions of CGI.pm will
	  allow either the ampersand or semicolon as a parameter separator,
	  which should fix any Perl scripts that use this library.
	  In PHP, you can simply set the following in your php.ini file
	  to allow either separator:</p>
<pre>arg_separator.input = ";&amp;"
</pre>
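	  <p>A shell wrapper script can accept both separators in much the
	  same way. This is a minimal sketch that only splits the query string.
	  It does not URL-decode the values, and the function name and the
	  sample parameter values are chosen for illustration:</p>

```shell
#!/bin/sh
# Split a CGI query string into one name=value pair per line,
# treating both "&" and ";" as parameter separators.
# No URL-decoding is done; split_query_string is a hypothetical name.
split_query_string() {
    printf '%s\n' "$1" | tr '&;' '\n\n'
}

# Prints three lines: words=digital, page=2 and config=htdig.
split_query_string 'words=digital;page=2&config=htdig'
```

	  <p>Because the split happens before any decoding, a semicolon or
	  ampersand that is part of a value (encoded as %3B or %26) is left
	  alone.</p>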

	  <strong>5.22. <a name="q5.22">Why does htsearch show the
	  "&amp;" character as "&amp;amp;" in search results?</a></strong><br>
	  <p>In version 3.1.5, htsearch was fixed to properly
	  re-encode the characters &amp;, &lt;, &gt;, and &quot;
	  into SGML entities.  However, the default value for the
	  <a href="attrs.html#translate_amp">translate_amp</a>,
	  <a href="attrs.html#translate_lt_gt">translate_lt_gt</a>
	  and <a href="attrs.html#translate_quot">translate_quot</a>
	  attributes is still false, so these entities don't get converted
	  by htdig. If you set these three attributes to true in your
	  htdig.conf and reindex, the problem will go away.</p>

	  <p>In the 3.2 betas there was a bug in the HTML parser that
	  caused it to fail when attempting to translate the "&amp;amp;"
	  entity. This has been fixed in 3.2.0b3. The translate_* attributes
	  are gone as of 3.2.0b2.</p>

	  <strong>5.23. <a name="q5.23">I get Internal Server or Unrecognized
	  character errors when running htsearch.</a></strong><br>
	  <p>An increasingly common problem is Apache configurations
	  which expect all CGI scripts to be Perl, rather than binary
	  executables or other scripts, so they use the "perl-script" handler
	  rather than "cgi-script". The fix is to create a separate
	  directory for non-Perl CGI scripts, and define it as such in
	  your httpd.conf file. You should define it the same way as your
	  existing cgi-bin directory, but use "cgi-script" instead of
	  "perl-script". In any case, you should check your web server's
	  error log for any information related to htsearch's failure.
	  <br>See also questions <a href="#q5.7">5.7</a>,
	  <a href="#q5.14">5.14</a> and <a href="#q5.13">5.13</a>.</p>

	  <strong>5.24. <a name="q5.24">I took some settings out of
	  my htdig.conf but they're still set.</a></strong><br>
	  <p>All configuration file attributes have compiled-in, default
	  values. Taking an attribute out of the file is not the same
	  thing as setting it to an empty string, a 0, or a value of
	  false. See question <a href="#q4.18">4.18</a>.</p>

	  <strong>5.25. <a name="q5.25">When I run htdig on my site,
	  it misses entire directories.</a></strong><br>
	  <p>First of all, htdig doesn't look at directories itself. It
	  is a spider, and it follows hypertext links in HTML documents.
	  If htdig seems to be missing some documents or entire directory
	  sub-trees of your site, it is most likely because there are
	  no HTML links to these documents or directories. (See also
	  question <a href="#q5.18">5.18</a>.) If htdig does
	  not come across at least one hypertext link to a document
	  or directory, and it's not explicitly listed in the
	  <a href="attrs.html#start_url">start_url</a> attribute, then
	  this document or directory is essentially hidden from view
	  to htdig, or to any web browser or spider for that matter.
	  You can only get htdig to index directories, without providing
	  your own files with links to the contents of these directories,
	  by using your web server's automatic index generation feature.
	  In Apache, this is done with the mod_autoindex module, which
	  is usually compiled-in by default, and is enabled with the
	  "Indexes" option for a given directory hierarchy. For example,
	  you can put these directives in your Apache configuration:</p>
<pre>
&lt;Directory "/path/to/your/document/root"&gt;
    Options Indexes FollowSymLinks Includes ExecCGI
&lt;/Directory&gt;
</pre>
	  <p>This will cause Apache to automatically generate an index
	  for any directory that does not have an index.html or other
	  "DirectoryIndex" file in it. Other web servers will have
	  similar features, which you should look for in your server
	  documentation.</p>

	  <p>As an alternative to relying on the web server's autoindex
	  feature, you can compose a list of all the unreachable
	  documents, or write a program to do so, and feed that list as
	  part of htdig's <a href="attrs.html#start_url">start_url</a>
	  attribute. Here is an example of a simple shell script to make
	  a file of URLs you can use with a configuration entry like
	  <code>start_url: `/path/to/your/file`</code>:</p>
<pre>
find /path/to/your/document/root -type f -name \*.html -print | \
    sed -e 's|/path/to/your/document/root/|http://www.yourdomain.com/|' > \
        /path/to/your/file
</pre>
	  <p>Other reasons why htdig might be missing portions of your
	  site might be that they fall outside the bounds specified
	  by the <a href="attrs.html#limit_urls_to">limit_urls_to</a>
	  attribute (which takes on the value of start_url by default),
	  they are explicitly excluded using the
	  <a href="attrs.html#exclude_urls">exclude_urls</a> attribute,
	  or they are disallowed by a robots.txt file (see the
	  <a href="htdig.html">htdig</a> documentation for notes about
	  robot exclusion) or by a robots meta tag (see question
	  <a href="#q4.15">4.15</a>). If htdig seems to be missing the
	  last part of a large directory or document, see question
	  <a href="#q5.1">5.1</a>. For reasons why htdig may be rejecting
	  some links to parts of your site, see question
	  <a href="#q5.27">5.27</a>.</p>

	  <strong>5.26. <a name="q5.26">What do all the numbers and symbols
	  in the htdig -v output mean?</a></strong><br>
	  <p>Output from htdig -v typically looks like this:</p>
<pre>
23000:35506:2:http://xxx.yyy.zz/index.html: ***-+****--++***+ size = 4056
</pre>
	  <p>The first number is the number of documents parsed so far,
	  the second is the DocID for this document, and the third is
	  the hop count of the document (number of hops from one of the
	  start_url documents). After the URL, it shows a "*" for a link
	  in the document that it already visited (or at least queued
	  for retrieval), a "+" for a new link it just queued, and a
	  "-" for a link it rejected for any of a number of reasons.
	  To find out what those reasons are, you need to run htdig
	  with at least 3 "v" options, i.e. -vvv. If there are no "*",
	  "+" or "-" symbols after the URL, it doesn't mean the document
	  was not parsed or was empty, but only that no links to other
	  documents were found within it.</p>
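	  <p>To tally these symbols for a line of output, a small awk filter
	  is enough. This sketch assumes the line layout shown in the sample
	  above (counters, URL, a colon and space, the symbols, then the size
	  field). The function name is made up for illustration:</p>

```shell
#!/bin/sh
# Count already-visited (*), newly queued (+) and rejected (-) links
# in one line of "htdig -v" output. Assumes the layout of the sample line;
# summarize_dig_line is a hypothetical name.
summarize_dig_line() {
    printf '%s\n' "$1" | awk '
    {
        # The symbols sit between the last ": " and " size".
        sym = $0
        sub(/^.*: /, "", sym)       # drop the counters and the URL
        sub(/ size.*$/, "", sym)    # drop the trailing size field
        visited = gsub(/\*/, "", sym)
        queued = gsub(/\+/, "", sym)
        rejected = gsub(/-/, "", sym)
        printf "visited=%d queued=%d rejected=%d\n", visited, queued, rejected
    }'
}
```

	  <p>For the sample line above, the function prints
	  <code>visited=10 queued=4 rejected=3</code>. A high rejected count is
	  a hint to rerun htdig with -vvv to see the reasons.</p>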

	  <strong>5.27. <a name="q5.27">Why is htdig rejecting some of the
	  links in my documents?</a></strong><br>
	  <p>When htdig parses documents and finds hypertext links to
	  other documents (hrefs), it may reject them for any of several
	  reasons. To find out what those reasons are, you need to run
	  htdig with at least 3 "v" options, i.e. -vvv. Here are the
	  meanings of some of the messages you might see at this verbosity
	  level.</p>
	  <dl>
	   <dt>Not an http or relative link!</dt>
	   <dd>In versions 3.1.5 and earlier, only "http://" URLs, or
		URLs relative to those, are allowed.</dd>
	   <dt>Item in the exclude list: item # <em>n</em></dt>
	   <dd>A substring of the URL matches one of the items in the
		<a href="attrs.html#exclude_urls">exclude_urls</a>
		attribute. The given item number will indicate which
		pattern matched, starting at 1. The 3.2.0 betas do not
		give the item number.</dd>
	   <dt>Extension is invalid!</dt>
	   <dd>The file name extension or suffix matches one of those
		listed in the
		<a href="attrs.html#bad_extensions">bad_extensions</a>
		attribute.</dd>
	   <dt>Extension is not valid!</dt>
	   <dd>The file name extension or suffix does not match one of those
		listed in the
		<a href="attrs.html#valid_extensions">valid_extensions</a>
		attribute, if any are specified.</dd>
	   <dt>Invalid Querystring! <em>or</em><br>item in bad query list</dt>
	   <dd>The URL contains a query string which matches one of those
		listed in the
		<a href="attrs.html#bad_querystr">bad_querystr</a>
		attribute.</dd>
	   <dt>URL not in the limits!</dt>
	   <dd>No substring of the URL entirely matches one of the items in the
		<a href="attrs.html#limit_urls_to">limit_urls_to</a>
		attribute. The purpose of this attribute is to keep htdig
		from attempting to index the entire World Wide Web.</dd>
	   <dt>forbidden by server robots.txt!</dt>
	   <dd>A substring of the URL matches one of the items disallowed
		in the server's robots.txt file. See
		<a href="http://www.robotstxt.org/wc/norobots.html">
		A Standard for Robot Exclusion</a>. This message exists
		only in the 3.2.0 betas. In 3.1.5 and earlier, this condition
		is only caught later, resulting in the message
		"robots.txt: discarding '<em>URL</em>'" from htdig, and a
		later "Deleted: no excerpt" message from htmerge.</dd>
	   <dt>url rejected: (level 2)</dt>
	   <dd>No substring of the URL entirely matches one of the items in the
		<a href="attrs.html#limit_normalized">limit_normalized</a>
		attribute. All the other rejections above will be indicated
		as level 1. The 3.2.0 betas give the much more meaningful
		message 'not in "limit_normalized" list!'</dd>
	  </dl>

	  <p>Another possibility, if none of the error messages above appear
	  for some of the links you think htdig should be accepting, is that
	  htdig isn't even finding the links at all. First, make sure you're
	  not making false assumptions about how htdig finds these. It only
	  reads links in HTML code, and not JavaScript, and it doesn't read
	  directories unless the HTTP server is feeding it directory listings.
	  You will need to take a close look at the htdig -vvv (or -vvvv)
	  output to see what htdig is finding, in and around the areas where
	  the desired links are supposed to be found in your HTML code, to see
	  if it's actually finding them.
	  See also question <a href="#q5.25">5.25</a>.</p>

	  <strong>5.28. <a name="q5.28">When I run htdig or htmerge, I get a
	  "DB2 problem...: missing or empty key value specified" message.</a></strong><br>
	  <p>The most common cause of this error is that htdig or
	  htmerge rejected all the documents that had been put in the
	  database, leaving an empty database. You need to find out the
	  reasons for the rejection of these documents. See questions
	  <a href="#q4.1">4.1</a>, <a href="#q5.25">5.25</a> and
	  <a href="#q5.27">5.27</a>.</p>

	  <strong>5.29. <a name="q5.29">When I run htdig on my site,
	  it seems to go on and on without ending.</a></strong><br>
	  <p>There are some things that can cause htdig to run on without
	  ending, especially when indexing dynamic content (ASP, PHP,
	  SSI or CGI pages). This usually involves htdig getting caught
	  in an <em>infinite virtual hierarchy</em>. A sure sign of
	  this is if the current size of your database is much larger
	  than the total size of the site you are indexing, or if in the
	  verbose output of htdig (see question <a href="#q4.1">4.1</a>)
	  you see the same URLs come up again and again with only subtle
	  variations. In any case, you must figure out the reason htdig
	  keeps revisiting the same documents using different URLs, as
	  explained in question <a href="#q4.24">4.24</a>, and set your
	  <a href="attrs.html#exclude_urls">exclude_urls</a> and
	  <a href="attrs.html#bad_querystr">bad_querystr</a> attributes
	  appropriately to stop htdig from going down those paths.
	  </p>

	  <strong>5.30. <a name="q5.30">Why does htsearch no longer recognize
	  the -c option when run from the web server?</a></strong><br>
	  <p>This was a security hole in 3.1.5 and older, and 3.2.0b3 and
	  older releases of ht://Dig. (See question <a href="#q2.1">2.1</a>.)
	  There's a compile-time macro you can set in htsearch.cc to disable
	  this security fix, but that's a bad idea because it reopens the hole.
	  This should only be done as a last resort, when all other avenues
	  fail. The -c option was only intended for testing htsearch from the
	  command line, and not for use when calling htsearch on the web server.
	  Unfortunately, far too many users have needlessly latched onto this
	  option for CGI scripts. The preferred ways of specifying the config
	  file are as follows, in order of preference:</p>
	  <ol>
	  <li>use the "config" input parameter in your
	  <a href="hts_form.html">search form</a>
	  (see question <a href="#q4.2">4.2</a>).
	  <li>if you need to get at files outside the default CONFIG_DIR, use a
	  wrapper script that redefines the CONFIG_DIR environment variable,
	  then use the config input parameter as above
	  (see question <a href="#q4.20">4.20</a>).
	  <li>use a wrapper script to force htsearch to use a specific config
	  file using the -c option. This is especially for cases where you
	  want to prevent the user from selecting other config files in your
	  CONFIG_DIR using the config input parameter. This should
	  be done by using the GET method to call the wrapper script, and in
	  this script you must unset the REQUEST_METHOD environment variable
	  and pass "$QUERY_STRING" as a single argument to htsearch.
	  (This safely gets around htsearch's test which disables -c.)
	  <li>configure and compile different htsearch binaries with different
	  compile-time definitions of CONFIG_DIR, so you can avoid wrapper
	  scripts altogether.
	  <li>define ALLOW_INSECURE_CGI_CONFIG in htsearch.cc and recompile
	  htsearch if all other approaches above fail for you.
	  </ol>
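	  <p>The third approach in the list above, a wrapper that forces one
	  config file, might look something like the following. Treat it as a
	  sketch under stated assumptions rather than a tested recipe: both
	  paths are placeholders, the function name is made up, and the search
	  form is assumed to call the wrapper with the GET method so that
	  QUERY_STRING is set:</p>

```shell
#!/bin/sh
# CGI wrapper that pins htsearch to a single configuration file.
# HTSEARCH and CONFIG_FILE are placeholder paths; set them for your site.
HTSEARCH=${HTSEARCH:-/path/to/cgi-bin/htsearch}
CONFIG_FILE=${CONFIG_FILE:-/path/to/conf/htdig.conf}

# run_htsearch is a hypothetical helper name.
run_htsearch() {
    # htsearch disables -c when REQUEST_METHOD is set, so unset it first.
    unset REQUEST_METHOD
    # Pass the whole query string as one quoted argument.
    "$HTSEARCH" -c "$CONFIG_FILE" "$QUERY_STRING"
}

# Run only under a web server, which sets GATEWAY_INTERFACE for CGI programs.
if [ -n "${GATEWAY_INTERFACE:-}" ]; then
    run_htsearch
fi
```

	  <p>Quoting "$QUERY_STRING" matters here: it keeps the query as a
	  single argument even when it contains spaces or shell
	  metacharacters.</p>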

	  <strong>5.31. <a name="q5.31">I've set a config attribute exactly
	  as documented but it seems to have no effect.</a></strong><br>
	  <p>There are a few fairly common reasons why this might happen:</p>
	  <ol>
	  <li>You may have a typo. Spelling matters, so make sure the attribute
	  name is spelled exactly as it is in the
	  <a href="attrs.html">documentation</a>. Misspelled attribute
	  definitions are silently ignored. This is because you're allowed
	  to make up your own attribute definitions for use by other attribute
	  definitions, as <strong>${myownattribute}</strong>. Also remember
	  to put the colon ("<strong>:</strong>") separator between the
	  attribute name and value in your definition.
	  <li>The attribute isn't supported in your version of the software.
	  The <a href="attrs.html">documented configuration attributes</a>
	  on the www.htdig.org web site are for the most recent
	  <strong>stable</strong> release. See questions
	  <a href="#q2.1">2.1</a> and <a href="#q2.7">2.7</a> for details.
	  If you're running an older version, or even a more recent beta
	  release, you may not have the same set of attributes to work with.
	  Consult the appropriate documentation, or upgrade to the current
	  release.
	  <li>You're not modifying the right configuration file. The default
	  configuration file is specified when you first configure ht://Dig
	  before compiling, but other configuration files can be specified
	  at run time, using the -c command-line option for most programs,
	  or the <strong>config</strong> input parameter for htsearch
	  (see question <a href="#q4.2">4.2</a>).
	  <li>You've got more than one definition of the attribute. Only the
	  last occurrence of an attribute in the configuration file is the
	  definition that's used for that attribute, overriding earlier
	  definitions. This also applies for nested configuration files that
	  are loaded in via the <a href="attrs.html#include">include</a>
	  directive, so check for other definitions in all included files.
	  Similarly for htsearch, look out for multiple definitions of input
	  parameters in your search forms, as mentioned in question
	  <a href="#q4.2">4.2</a> - these don't override each other but they
	  get combined with a Ctrl-A as separator, which may not be what you
	  want either.
	  <li>Your attribute definition is being "swallowed up" by an
	  incomplete multi-line definition above it. Remember that when a line
	  of an attribute definition ends with a single backslash
	  ("<strong>\</strong>") before the end of the line (without any
	  space after the backslash), then the following line is appended to
	  it as a continuation of the same attribute definition. For an
	  attribute definition that spans several lines, all lines but the
	  last must end with a backslash. If you want a backslash to go into
	  the attribute definition literally, it must be doubled-up, as
	  <strong>\\</strong>.
	  <li>On a similar note, make sure your attribute definitions are all
	  terminated by a newline character. Beware of text editors that do
	  word wrapping. It may look like two separate lines on the screen,
	  when in fact you've got two attribute definitions on the same long
	  line, so the second is swallowed up as part of the first.
	  <li>Your attribute definition is being overridden by an htsearch
	  <a href="hts_form.html">CGI input parameter</a>. For example,
	  <a href="attrs.html#template_name">template_name</a> is ignored
	  if the <strong>format</strong> input parameter is defined. The
	  <a href="attrs.html#allow_in_form">allow_in_form</a> attribute
	  can define any number of new CGI input parameters that override
	  the attributes of the same name in your config file.
	  <li>Your attribute definition is being ignored or overridden
	  by a related attribute.  Watch out for unexpected interactions
	  between different attributes.  For instance, characters in
	  <a href="attrs.html#valid_punctuation">valid_punctuation</a>
	  are stripped out of words, so those characters may
	  not have the effect you want if you've added them to
	  <a href="attrs.html#extra_word_characters">extra_word_characters</a>
	  or
	  <a href="attrs.html#prefix_match_character">prefix_match_character</a>.
	  Also,
	  <a href="attrs.html#search_results_wrapper">search_results_wrapper</a>
	  will override
	  <a href="attrs.html#search_results_header">search_results_header</a>
	  and
	  <a href="attrs.html#search_results_footer">search_results_footer</a>,
	  but only if you've set up the wrapper file correctly.
	  <li>Watch out for possible "latent effects" of some attributes. For
	  example, when you change attributes used by htdig, they won't have
	  an immediate effect on entries already in the database, so you would
	  have to reindex your site before they take effect. Similarly,
	  attributes that affect how htfuzzy builds some of its databases
	  don't take effect until those databases are rebuilt. Another, more
	  subtle latent effect occurs with releases 3.1.6 and 3.2 betas:
	  when you interrupt htdig (i.e. with Control-C or a kill command),
	  it stores the list of currently queued URLs in db.log, in your
	  database directory, so that the next time you invoke htdig it can
	  resume the interrupted dig. A side-effect of this file is that if
	  you change some attributes like limit_urls_to or exclude_urls before
	  restarting, the URLs in the file are still taken as-is, having been
	  checked against the old settings of limit_urls_to or exclude_urls
	  before being queued. This might explain one reason htdig seems to
	  ignore your new settings of these.
	  </ol>
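	  <p>Duplicate definitions (the fourth item above) are easy to miss
	  in a long file. The following sketch lists attribute names that are
	  defined more than once. It assumes "name: value" definitions, comment
	  lines starting with "#", and a trailing backslash for continuation
	  lines. It does not follow include directives, and the function name
	  is made up for illustration:</p>

```shell
#!/bin/sh
# List attribute names that are defined more than once in a config file.
# Included files are not followed; find_duplicate_attributes is a
# hypothetical name.
find_duplicate_attributes() {
    awk '
    continued {                        # inside a multi-line definition
        continued = ($0 ~ /\\$/)
        next
    }
    /^[ \t]*#/ || /^[ \t]*$/ { next }  # skip comments and blank lines
    {
        continued = ($0 ~ /\\$/)       # does this definition continue?
        pos = index($0, ":")
        if (pos > 1) {
            name = substr($0, 1, pos - 1)
            gsub(/[ \t]/, "", name)
            if (++count[name] == 2) print name
        }
    }' "$1"
}
```

	  <p>Run it as <code>find_duplicate_attributes /path/to/htdig.conf</code>
	  (a placeholder path), and repeat for each included file. Only the last
	  definition of each name it reports is the one in effect.</p>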

	  <strong>5.32. <a name="q5.32">When I run htsearch, it gives a page
	  with an "Unable to read configuration file" message.</a></strong><br>
	  <p>The most common causes of this error are:</p>
	  <ul>
	  <li>Your configuration file name is misspelled in the "config"
	  input parameter of your search form, or you have two definitions
	  of this parameter (see question <a href="#q4.2">4.2</a>).
	  <li>You didn't install your configuration file in the directory
	  defined by the CONFIG_DIR compile-time Makefile variable
	  (see also question <a href="#q4.20">4.20</a>). This is where
	  htsearch will look for the configuration file specified by the
	  "config" input parameter.
	  <li>The configuration file is not readable by the user ID under
	  which your web server, and thus htsearch, runs. Similarly,
	  if the directories from CONFIG_DIR up to the root directory
	  are not executable by this same user ID, htsearch won't be
	  able to access the configuration files.
	  </ul>
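
	  <p>The permission checks above can be sketched with a couple of
	  commands. The path below is an assumption (it depends on the
	  CONFIG_DIR your copy was built with), so substitute your own:</p>
<pre># Assumed location; substitute your CONFIG_DIR and configuration file name.
CONF=/etc/htdig/htdig.conf
# The file itself must be readable by the web server's user ID.
ls -l "$CONF"
# Every directory from the root down to CONFIG_DIR must be
# executable (searchable) by that same user ID.
ls -ld / /etc /etc/htdig</pre>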

	  <strong>5.33. <a name="q5.33">How can I find out which version
	  of ht://Dig I have installed?</a></strong><br>
	  <p>You should always check which version of ht://Dig you're
	  running, before you report any problems, or even if you
	  suspect a problem. You can find out the version number of an
	  installed ht://Dig package by running the command:</p>
	  <blockquote>
		<code>htdig -\? | head</code>
	  </blockquote>
	  <p>(or use "more" if you don't have a "head" command). The
	  full version number appears on the third line of output,
	  after "This program is part of ht://Dig", and it should also
	  include the snapshot date if you're running a pre-release
	  snapshot. Always include this full version number with any
	  bug report or problem report on a mailing list. You can save
	  yourself and others a lot of grief by being certain of which
	  version you're running, especially if you've installed more than
	  one. If you're running ht://Dig from an RPM package, you should
	  also report the package version and release number, which you
	  can determine with the command "<code>rpm -q htdig</code>",
	  and mention where you obtained the package. This will alert
	  us to the idiosyncrasies and/or patches in a particular RPM
	  package. Also, if you've applied any patches yourself (see
	  question <a href="#q2.5">2.5</a>) please mention which ones.
	  See also question <a href="#q1.8">1.8</a>, on reporting bugs
	  or configuration problems.</p>

	  <strong>5.34. <a name="q5.34">When running htdig, I get "Error (0):
	  PDF file is damaged - attempting to reconstruct xref table..."</a></strong><br>
	  <p>This message comes from the pdftotext utility, when a PDF file
	  has been truncated. Find the largest PDF file on the site you're
	  indexing, and set max_doc_size to at least that size (see question
	  <a href="#q5.2">5.2</a>). If you need to track down which PDF is
	  causing the error, try running "htdig -i -v &gt; log.txt 2&gt;&amp;1" so you
	  can see which URL is being indexed when the error occurs. The output
	  redirects in that command combine stdout (where htdig's output goes)
	  and stderr (where pdftotext's error messages go) into one output
	  stream. If you're using acroread to index PDF files, the error
	  message for a truncated PDF file is simply "Could not repair file."
	  It's also possible to get errors like this from PDF files that are
	  smaller than max_doc_size, if they're already truncated or corrupted
	  on the server.</p>
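
	  <p>If the site's files are also available on a local filesystem,
	  one way to find the largest PDF is sketched below. The document
	  root is an assumption, and the -printf option requires GNU find:</p>
<pre># Assumed document root; substitute the directory your site is served from.
DOCROOT=/var/www/html
# Print the size in bytes and the path of the largest PDF under DOCROOT.
find "$DOCROOT" -type f -name '*.pdf' -printf '%s %p\n' | sort -n | tail -1</pre>
	  <p>If that reported, say, a file of about 4.2 million bytes (a
	  made-up figure for illustration), a setting like this in your
	  configuration file would cover it:</p>
<pre>max_doc_size: 5000000</pre>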

	  <strong>5.35. <a name="q5.35">When running htdig on Mandrake Linux,
	  I get "host not found" and "no server running" errors.</a></strong><br>
	  <p>The default htdig.conf configuration in Mandrake's RPM package
	  of htdig very stupidly enables the
	  <a href="attrs.html#local_urls_only">local_urls_only</a> attribute
	  by default, which means you can only index a limited set of files
	  on the local server. Anything else, where htdig would normally fall
	  back to using HTTP, will fail. To make matters worse, they put a very
	  misleading comment above that attribute setting, which throws users
	  off track. This attribute is useful in certain circumstances where
	  you never want htdig to fall back to HTTP, but enabling it by default
	  was a very bad judgement call on Mandrake's part.</p>
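
	  <p>To check whether your configuration is affected, you can look
	  for the attribute in your configuration file. The path below is an
	  assumption; use wherever your package installed htdig.conf:</p>
<pre># Assumed configuration path; adjust to match your installation.
grep -n 'local_urls_only' /etc/htdig/htdig.conf</pre>
	  <p>If the attribute turns out to be enabled, either remove that
	  line or set it explicitly to false:</p>
<pre>local_urls_only: false</pre>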

	  <strong>5.36. <a name="q5.36">When I run htsearch, it gives me the
	  list of matching documents, but no header or footer.</a></strong><br>
	  <p>The header and footer typically contain the followup search
	  form, an indication of the total number of matches, and buttons
	  to other pages of matches if the results don't fit on one
	  page. If these don't show up, it could be that in attempting
	  to customize these (see question <a href="#q4.2">4.2</a>),
	  you removed them or rendered them unusable. Even if you didn't
	  customize them, make sure you installed the
	  <a href="attrs.html#search_results_header">search_results_header</a>
	  and
	  <a href="attrs.html#search_results_footer">search_results_footer</a>
	  files (or the
	  <a href="attrs.html#search_results_wrapper">search_results_wrapper</a>
	  file) in the correct location (where you told ht://Dig they'd be
	  when you configured prior to compiling). Also make sure they
	  have read permission for the user ID under which htsearch runs,
	  and all directories leading up to these template files are
	  searchable (i.e. executable) by htsearch, or it won't be able
	  to open the files.</p>

	  <p>This is the opposite of the problem described in question
	  <a href="#q5.11">5.11</a>. If htsearch displays nothing at
	  all, you may have both problems or you may have no matches or
	  a boolean query syntax error and the
	  <a href="attrs.html#nothing_found_file">nothing_found_file</a>
	  or <a href="attrs.html#syntax_error_file">syntax_error_file</a>
	  is missing or unreadable.</p>

	  <strong>5.37. <a name="q5.37">When I index files with doc2html.pl,
	  it fails with the "UNABLE to convert" error.</a></strong><br>
	  <p>This is an indication that doc2html.pl wasn't configured
	  properly. Carefully follow all the directions for installation
	  in the DETAILS file that comes with the script. In addition to
	  installing doc2html.pl, you must:</p>
	  <ul>
	  <li>Install xpdf and check that pdftotext and pdfinfo work from
	   the command line,
	  <li>Configure pdf2html.pl to use pdftotext and pdfinfo and check
	   that it works from the command line,
	  <li>Configure doc2html.pl to use pdf2html.pl and check that it
	   works from the command line:
<pre>doc2html.pl /full/path/to/sample/filename.pdf "application/pdf" url</pre>
	  </ul>
	  <p>You should repeat a similar set of steps to configure and test
	  doc2html.pl for other document types, such as Word, RTF and
	  Excel. See also questions <a href="#q4.8">4.8</a>,
	  <a href="#q4.9">4.9</a> and <a href="#q5.39">5.39</a>.</p>
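
	  <p>The first check in the list above, that pdftotext and pdfinfo
	  work from the command line, might look something like this. The
	  file name is only a placeholder:</p>
<pre># Placeholder file name; use a real PDF from your site.
PDF=/full/path/to/sample/filename.pdf
# Should print the start of the document's text ("-" sends output to stdout).
pdftotext "$PDF" - | head
# Should print metadata such as the title and page count.
pdfinfo "$PDF"</pre>
	  <p>If either command fails or prints nothing useful, fix that
	  before moving on to pdf2html.pl and doc2html.pl.</p>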
	  
	  <strong>5.38. <a name="q5.38">Why do my searches find search terms
	  in pathnames, or how do I prevent matching filenames?</a></strong><br>
	  <p>htdig doesn't normally add the URL components to the index
	  itself, but when you index a directory where the filenames are
	  used as link description text (such as an automatic DirectoryIndex
	  created by Apache's mod_autoindex) then these link descriptions
	  get indexed, carrying the weight assigned to them by the
	  <a href="attrs.html#description_factor">description_factor</a>
	  attribute. Thus, a search for a filename will match this link
	  description, and the file will show up in search results.
	  To avoid that, make sure your DirectoryIndexes don't get indexed
	  as detailed in question <a href="#q4.23">4.23</a>.</p>

	  <p>Conversely, there is no way to force htdig to index URL
	  components so that a search for a file name will yield a match
	  on that file, unless you index an HTML file (or several) containing
	  links to all the files you want, where the link description text
	  does contain the full URL or the pathname components you want.</p>
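
	  <p>As a rough sketch of that workaround, a small shell script can
	  generate such an HTML file, using each file name as the link
	  description text. The directory and the name filelist.html are
	  assumptions for illustration, and file names containing characters
	  that are special in HTML or URLs would need extra escaping:</p>
<pre># Assumed directory of files to list; filelist.html is a hypothetical name.
cd /var/www/html/docs || exit 1
{
  echo '&lt;html&gt;&lt;body&gt;'
  for f in *.pdf; do
    # Use the file name as both the link target and the link description.
    printf '&lt;a href="%s"&gt;%s&lt;/a&gt;&lt;br&gt;\n' "$f" "$f"
  done
  echo '&lt;/body&gt;&lt;/html&gt;'
} &gt; filelist.html</pre>
	  <p>You would then make sure htdig can reach the generated file,
	  for example by linking to it from a page that is already indexed.</p>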
	  
	  <strong>5.39. <a name="q5.39">I set up an external parser but I still
	  can't index Word/Excel/PowerPoint/PDF documents.</a></strong><br>
	  <p>You probably need to carefully re-read and follow questions
	  <a href="#q4.8">4.8</a>, <a href="#q4.9">4.9</a>,
	  <a href="#q5.25">5.25</a> and <a href="#q5.27">5.27</a>.
	  When you can't index documents with an external parser or converter,
	  there are three main stages at which the process can fail. You
	  need to figure out at which of the three stages it is failing,
	  and focus on that stage to get to the bottom of why it's not
	  working. Run htdig with
	  anywhere from 1 to 4 -v options, to get the debugging output you
	  need to see where it's failing and why. This may be an iterative
	  process, if htdig is failing at more than one stage: you might fix
	  one problem only to run into another.</p>

	  <ol>
	  <li>Is htdig actually finding links to the PDF, Word, etc. documents
	    you want to index? Make sure you're not making false assumptions
	    about how htdig finds these (questions <a href="#q5.25">5.25</a>
	    and <a href="#q5.18">5.18</a>), and then find out how htdig is
	    looking at the links in your HTML files to see if it's ignoring
	    or rejecting links to your externally parsed documents (questions
	    <a href="#q4.1">4.1</a> and <a href="#q5.27">5.27</a>).<br><br>
	  <li>If it is finding and accepting the links to these documents, is
	    it correctly fetching them and passing them on to the appropriate
	    external converter to be able to index them? Look at htdig -vvv
	    output, around the time it tries to fetch one of these, and see
	    what it does next. Does the file size look right? Are there any
	    error messages around there? If the external converter isn't even
	    being called, take a close look at your
	    <a href="attrs.html#external_parsers">external_parsers</a>
	    attribute setting to make sure it's correct (see question
	    <a href="#q5.31">5.31</a>).<br><br>
	  <li>If it is attempting to convert them, is the external converter
	    doing what it should, to feed some indexable text back into htdig's
	    parser? You can also try htdig -vvvv (4 -v options) to see if it's
	    actually parsing individual words from any of these. If this is
	    too much output to wade through, try setting
	    <a href="attrs.html#start_url">start_url</a> to the URL
	    of a single document that you want to test, so you can look in
	    detail at what htdig does with it. You can also try running the
	    external converter manually on one of these documents to see
	    what it spits out. See question <a href="#q5.37">5.37</a>.
	    Make sure your documents actually contain indexable text. Some
	    PDFs are nothing but scanned images of pages, so it looks like
	    text but it's just images with no computer-readable text.
	  </ol>
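
	  <p>A debugging run along the lines described above might be
	  sketched as follows. Here test.conf is a hypothetical copy of your
	  configuration file, with
	  <a href="attrs.html#start_url">start_url</a> set to a single test
	  document and the database directory pointed at a scratch location,
	  because the -i option starts a fresh database for that
	  configuration:</p>
<pre># test.conf is a hypothetical scratch copy of your configuration file.
htdig -i -vvv -c /etc/htdig/test.conf &gt; log.txt 2&gt;&amp;1
# Look for the test document's URL and any messages near it.
grep -n -i 'pdf' log.txt | head -n 40</pre>
	  <p>Adjust the grep pattern to match the document type or URL you
	  are testing.</p>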

	  <br>

	  <hr noshade size=4>
	  	Last modified: $Date: 2004/05/28 13:15:16 $
<br> 
    <a href="http://sourceforge.net/"> 
          <img src="http://sourceforge.net/sflogo.php?group_id=4593&amp;type=1" width="88" height="31" border="0" alt="SourceForge Logo"></a>
  </body>
</html>