swish-e-2.0.5-2mdk.i586.rpm

Here is an html page from Rainer Scherg to show you how to write filters

More info on filters is available on Rainer's original page:

http://www.bnmsp.de/home/rainer.scherg/swish/filters.html




How does the filter option work?
================================

 There are two configuration directives:
    - FilterDir
    - FileFilter

 "FilterDir" tells swish-e where the filter scripts reside.
 "FileFilter" tells swish-e which file type is handled by which
 filter.

 A filter may be any executable program (C, shell, Perl, Tcl, etc.).
 The filter receives the path of the file to be converted as
 argument 1 on the command line. This may be a tmp file.

 The converted text or HTML output has to be sent to stdout.
 
 swish-e itself doesn't provide any filtering capabilities. You have
 to use specialized programs to convert a document type to plain text
 or HTML.

 E.g. use "pdftotext" from the xpdf package to enable swish-e to index
 PDF files. Ghostscript may also be a good program for filtering
 purposes.



Syntax
======

  FilterDir  <path-to-filterprog/>

  FileFilter  <file-ext> <filter-program>

  e.g.:
      FilterDir   /usr/local/apache/swish-e/filters-bin/
      FileFilter  .pdf   pdf-filter.sh
      FileFilter  .gz    gzip-filter.sh
      FileFilter  .doc   wword-filter.sh
      FileFilter  .dot   wword-filter.sh
      FileFilter  .ps    ghostscript-filter.sh


  Filter program syntax:

    <filter-program> <filepath>  [<url-or-filepath>] 

      filepath:         Arg 1, the file to be opened and converted.
      url-or-filepath:  Arg 2, the URL or real file path of the
                        document, in case filepath is a tmp file
                        (normally not needed, but useful if you want
                        to check file extensions, etc. in the filter
                        program).

      Output:           on standard output as (ISO/ASCII) text or HTML.


     This means:

      swish-e passes these arguments to your filter script.
      Your filter has to read <filepath>, process the file and
      send the converted output to STDOUT.
     



Simple Sample Filter Files
==========================

 pdf-filter.sh:
 #!/bin/sh
 /usr/local/bin/pdftotext "$1" - 2>/dev/null


 test-filter.sh:
 #!/bin/sh
 echo "Filter-Call: "
 echo "  File to process: $1"
 echo "  File real location: $2"
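
 Before adding a filter to the swish-e configuration it can help to
 call it by hand with the same arguments swish-e would pass. The
 following is only a sketch: the helper name "run_filter" and the
 script name "try-filter.sh" are made up for this example and are not
 part of swish-e.

```shell
#!/bin/sh
# try-filter.sh (hypothetical helper, not part of swish-e)
# Calls a filter the way the syntax section describes:
#   <filter-program> <filepath> [<url-or-filepath>]

run_filter() {
    filter="$1"
    shift

    # A filter has to be an executable program.
    if [ ! -x "$filter" ]; then
        echo "not executable: $filter" >&2
        return 1
    fi

    # Remaining arguments: <filepath> and the optional <url-or-filepath>.
    "$filter" "$@"
}

# Run only when arguments are given on the command line.
if [ "$#" -gt 0 ]; then
    run_filter "$@"
fi
```

 Example call (the paths are placeholders):
   sh try-filter.sh ./test-filter.sh /tmp/swish1234 /docs/manual.pdf
 The converted text should then appear on the terminal.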



Complex Filters
===============


 You can also write filter scripts that do more complex file
 handling.

   E.g.:
      FileFilter    .gz    gz-filter

   Script (pseudo code):
      - uncompress the .gz file into a tmp file
      - check the new extension
      - call the filter script for this extension

 In this way you can handle tar, tar.gz, doc.gz, html.Z or
 any other file type you need.
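
 The pseudo code above could be sketched in shell roughly as follows.
 This is an illustration under assumptions, not a tested swish-e
 filter: the function name "gz_filter" is made up, FILTER_DIR is
 assumed to point at the directory from the FilterDir example, the
 other filter script names are taken from the Syntax example, and
 "mktemp" is assumed to be available on the system.

```shell
#!/bin/sh
# gz-filter (sketch): uncompress, look at the inner extension,
# then hand the result to the filter for that extension.

# Assumed location of the other filter scripts (see FilterDir).
FILTER_DIR="${FILTER_DIR:-/usr/local/apache/swish-e/filters-bin}"

gz_filter() {
    src="$1"            # file to convert (may be a tmp file)
    name="${2:-$1}"     # real path or URL, used to read the extension

    inner="${name%.gz}"         # e.g. /docs/manual.pdf.gz -> /docs/manual.pdf
    base="${inner##*/}"         # file name without directories
    case "$base" in
        *.*) ext="${base##*.}" ;;
        *)   ext="" ;;
    esac

    # Uncompress into a tmp file.
    tmp="$(mktemp "${TMPDIR:-/tmp}/gzfilter.XXXXXX")" || return 1
    if ! gzip -dc "$src" > "$tmp" 2>/dev/null; then
        rm -f "$tmp"
        return 1
    fi

    # Call the filter script for the inner extension.
    case "$ext" in
        pdf)     "$FILTER_DIR/pdf-filter.sh" "$tmp" "$inner" ;;
        doc|dot) "$FILTER_DIR/wword-filter.sh" "$tmp" "$inner" ;;
        ps)      "$FILTER_DIR/ghostscript-filter.sh" "$tmp" "$inner" ;;
        *)       cat "$tmp" ;;   # text, HTML or unknown: pass through
    esac
    status=$?

    rm -f "$tmp"
    return $status
}

# Run only when called with arguments, as swish-e would call it.
if [ "$#" -gt 0 ]; then
    gz_filter "$@"
fi
```

 The second argument is passed on without its ".gz" suffix, so the
 inner filter sees the name of the uncompressed document.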



Optimization of the filtering process
=====================================

 Converting files can take a lot of CPU time and put a heavy load on
 the system, especially with masses of files to index. Your filter
 script can therefore check whether a converted copy already exists
 in a hidden directory tree and, if so, simply send it with "cat".

 If the converted copy is missing or out of date (e.g. because the
 PDF file has been updated), the filter script can create it on the
 fly.

 This makes indexing much faster (depending on the speed of the
 converter programs), but uses a lot of disk space. IMO it is not
 really necessary, because today's computers are fast enough.
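
 A caching filter along these lines might look like the sketch below.
 The names "cached_pdf_filter", CACHE_DIR and CONVERTER are
 assumptions made for this example, as is the cache location. The
 time stamp check only makes sense when argument 1 is the real file
 and not a freshly written tmp file.

```shell
#!/bin/sh
# cached-pdf-filter (sketch): reuse an already converted copy if it
# is still up to date, otherwise convert and store the result.

# Hypothetical cache directory and converter; adjust as needed.
CACHE_DIR="${CACHE_DIR:-/var/tmp/swish-filter-cache}"
CONVERTER="${CONVERTER:-pdftotext}"

cached_pdf_filter() {
    src="$1"            # file to convert
    name="${2:-$1}"     # real path or URL, used as the cache key

    # Flatten the name into a single file name for the cache.
    key="$(printf '%s' "$name" | tr '/:' '__')"
    cached="$CACHE_DIR/$key.txt"

    mkdir -p "$CACHE_DIR" || return 1

    # Convert if there is no cached copy or the source is newer.
    if [ ! -f "$cached" ] ||
       [ -n "$(find "$src" -newer "$cached" 2>/dev/null)" ]; then
        if "$CONVERTER" "$src" - > "$cached.$$" 2>/dev/null; then
            mv "$cached.$$" "$cached" || return 1
        else
            rm -f "$cached.$$"
            return 1
        fi
    fi

    # Send the converted copy to stdout.
    cat "$cached"
}

# Run only when called with arguments, as swish-e would call it.
if [ "$#" -gt 0 ]; then
    cached_pdf_filter "$@"
fi
```

 The converter writes to a temporary name first, so an interrupted
 conversion does not leave a truncated file in the cache.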




 
Bug-Reports
===========

 Please report bugs to the swish-e discussion group.
 Feel free to improve or enhance this feature.
 

New Filter Modules
==================

 If you have developed a filter module, please make this known to the swish-e
 discussion group.


Some URLs
=========

 Apache:     http://www.apache.org
             http://modules.apache.org
              -- Apache Webserver resource

 SWISH-E:    http://sunsite.berkeley.edu/SWISH-E/
              -- swish-e search engine

 XPDF:       http://www.foolabs.com/xpdf/
              -- The xpdf software and documentation
                 by Derek B. Noonburg.


Aug 1998 & May 1999
Rainer.Scherg@t-online.de
[Rainer.Scherg@rexroth.de]