$Date: 2000/08/13 18:19:40 $

Some features of modlogan are special and need more in-depth
documentation.

if this documentation is incomplete, unclear and/or complete nonsense,
please tell me: send a mail to <jan@kneschke.de>.

Table Of Contents
=================
0. modular design
1. input stream divider (logfile splitter)
2. running modlogan on a non-closing pipe
3. input stream sorter
4. searchengines
5. default config file
6. userdefined headers/footers
7. userdefined logfile definition

0. modular design
=================
to understand what is described below you have to understand some of the
internals of modlogan. 

modlogan consists of 3 + 1 parts. 
- input plugin
- processor plugin
- output plugin
and
- gluecode + mainloop

the input plugin has only one important function, which provides a record.
if this record belongs to the next month, a report is generated by the
output plugin. afterwards the record is processed by the processor plugin.
this is repeated until the last record is parsed.

some features change this workflow in some ways. they are described
below.
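
the basic workflow described above can be sketched in a few lines of
Python. this is only an illustration, not modlogan's actual gluecode: the
names mainloop, process and generate_report and the record layout (a dict
with a 'date' key) are assumptions made for this sketch.

```python
from datetime import date


def mainloop(input_records, process, generate_report):
    """Hypothetical mainloop: pull records, report on month change, process."""
    current_month = None
    for record in input_records:              # input plugin provides a record
        month = (record["date"].year, record["date"].month)
        if current_month is not None and month != current_month:
            generate_report(current_month)    # output plugin
        current_month = month
        process(record)                       # processor plugin
    if current_month is not None:
        generate_report(current_month)        # report at the end of the run


log = []
records = [
    {"date": date(2000, 7, 31), "url": "/a"},
    {"date": date(2000, 8, 1), "url": "/b"},
]
mainloop(
    iter(records),
    process=lambda r: log.append(("process", r["url"])),
    generate_report=lambda m: log.append(("report", m)),
)
```

after the run, log holds the order of the calls: the July report is
generated when the first August record arrives, before that record is
processed.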

1. logfile splitter 
===================
plugin: processor/web
option: splitby

if you want to split one input stream into different output streams, you
need a logfile splitter. you can use an external script as a preprocessor or
use the splitter support of the web processor plugin.

                         input stream (input plugin)
                                      |
                           +-- ... ---+--- ... ---+ (processor plugin)
                           |          |           |   
                           |          |           |
                       output streams (output plugins)

to enable this feature add a splitby definition to the processor_web 
section of your config-file. a splitby definition is the following string:

splitby=<field>,"<regex>",<name>

where <field> is:
srvhost - for the host which served the request
srvport - for the port on which the host listened
requser - for the authenticated user
requrl  - for the requested url
reqhost - for the requesting host
refurl  - for the referring url
default - wildcard ('joker') which matches everything.

<regex> is a regular expression which has to match successfully and
<name> is a name under which the split records are grouped again. <name> is
also used as the name of the subdirectory where the reports are placed.

if you specify multiple splitby definitions they are checked from the first
to the last. if one check is successful the generated name is used as <name>.
NOTE: the splitter has to return a name and it's your job to make sure that
a name is available
- by using an always-matching definition
  e.g. splitby=srvhost,"(.*)",$1
- by specifying a 'default' definition.
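
the first-match-wins rule can be illustrated with a small Python sketch.
it is not the code of the web processor plugin: parse_splitby and
split_name are hypothetical helpers, a record is assumed to be a dict keyed
by the field names listed above, and Python's re module stands in for
whatever regex library modlogan uses.

```python
import re


def parse_splitby(value):
    """Parse '<field>,"<regex>",<name>' into (field, compiled regex, name).

    Assumes that <name> itself contains no comma.
    """
    field, rest = value.split(",", 1)
    regex, name = rest.rsplit(",", 1)
    return field.strip(), re.compile(regex.strip().strip('"')), name.strip()


def split_name(rules, record):
    """Return the name of the first matching rule, or None if none matches."""
    for field, regex, template in rules:
        # 'default' is treated as a wildcard: it is matched against "".
        value = "" if field == "default" else record.get(field, "")
        match = regex.search(value)
        if match:
            # replace $1, $2, ... by the captured groups
            return re.sub(
                r"\$(\d+)",
                lambda m: match.group(int(m.group(1))) or "",
                template,
            )
    return None  # the caller would warn and ignore the record


rules = [
    parse_splitby(r'requrl,"^/users/~(.*?)/.*$",host_$1'),
    parse_splitby(r'srvhost,"^(.*?)\..*$",host_$1'),
]
by_url = split_name(rules, {"requrl": "/users/~jan/index.html",
                            "srvhost": "www.domain.com"})
by_host = split_name(rules, {"requrl": "/index.html",
                             "srvhost": "jan.domain.com"})
```

both records end up under the same name, host_jan: the first one through
the requrl rule, the second one through the srvhost rule.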

Example 1:
----------

let's assume that we have the following directory structure:
/users/~j.kneschke/index.html
/users/~project.modlogan/index.html

the definition

--
splitby=requrl,"^/users/~(.*?)/.*$",host_$1
--

will divide the records according to the string between the '~' and the '/'.
the string is captured as $1 and inserted into the name:
/users/~j.kneschke/index.html        -> host_j.kneschke
/users/~project.modlogan/index.html  -> host_project.modlogan

the directories host_j.kneschke and host_project.modlogan will be created
(if they don't exist) in the directory which you specify with
global:outputdir.

each directory contains the reports for the respective split logs.

Example 2:
----------
/users/~<username>/index.html is an alias for <username>.domain.com and
modlogan shall treat it as such.

as the following two definitions generate the same name they result in the
same report. as the first match wins you can switch between the two setups
as you like.

--
splitby=requrl,"^/users/~(.*?)/.*$",host_$1
splitby=srvhost,"^(.*?)\..*$",host_$1
--

Possible Errors
---------------

if nothing matches and no 'default' stream is available, the splitter will
spit out a warning and ignore the record.

--
# split by all non-anonymous users
splitby=requser,"^([^-]+)$",user_$1
--

if a page is requested by a non-authenticated user the logfile will contain
a dash ('-'). as no splitby definition matches, the record will be ignored.

Support
-------
only a few plugins are ported so far:

input: every plugin works (they don't have to be changed)
processor: web
output: modlogan


2. non-closing pipes
====================
plugin: global
option: gen_report_threshold

by default modlogan generates a report at the end of each parsed month 
and at the end of each run. 
  This isn't enough if you run modlogan on a non-closing pipe like:

$ tail -f access.log | modlogan -c <config-file>

In this scenario modlogan generates a report only once a month, which is
usually not often enough. To circumvent this problem you can specify a
number of records after which a report is generated.

if you set a threshold of 1000 modlogan will generate a report 
- after each 1000 records
- at the end of each month and 
- at the end of the run. 

the last case never occurs in this scenario because modlogan 
won't stop by itself.
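
the three triggers can be sketched as a Python generator. this is an
illustration only: report_triggers is a hypothetical name, records are
reduced to their month, and the sketch assumes that the threshold counts all
records seen so far (whether modlogan resets its counter after a report is
not described here).

```python
def report_triggers(record_months, threshold=0):
    """Yield (reason, records_seen) each time a report would be generated."""
    seen = 0
    current = None
    for month in record_months:
        if current is not None and month != current:
            yield ("month-end", seen)       # the previous month is complete
        current = month
        seen += 1
        if threshold > 0 and seen % threshold == 0:
            yield ("threshold", seen)       # assumed: every <threshold> records
    # only reached if the input ends, which never happens on a non-closing pipe
    yield ("end-of-run", seen)


months = ["2000-07"] * 5 + ["2000-08"] * 2
with_threshold = list(report_triggers(months, threshold=3))
without_threshold = list(report_triggers(months))
```

with a threshold of 3 the seven records produce four reports, without a
threshold only two.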

3. input stream sorter
======================
plugin: input/clf
option: readaheadlimit

if multiple servers are writing into one logfile the records are normally
not in order, but modlogan expects them to be in order. they have to be
sorted either by an external run of sort on the whole logfile or by the
internal logfile sorter.

By enabling the internal input sorter the input plugin will read the 
specified number of records, parse them and put them into a sorted
list. the oldest record is taken from the list and returned to the mainloop
for further processing. as the internal list is then no longer full, the
next call of the input parser adds another line to the sorted list of parsed
records. the oldest record is taken from the list again and the game repeats.

you can control this feature by specifying the maximum number of lines which
have to be read before a record is returned.

NOTE: as the specified number of lines is kept in memory you shouldn't
increase the number too much.
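
the read-ahead idea can be sketched in Python with a heap in place of the
sorted list. this is not the code of the clf input plugin: readahead_sorted
is a hypothetical name and records are assumed to be (timestamp, line)
tuples.

```python
import heapq
from itertools import count


def readahead_sorted(records, limit):
    """Yield records oldest-first using a window of at most `limit` records."""
    heap = []
    order = count()  # tie-breaker keeps equal timestamps in input order
    for record in records:
        heapq.heappush(heap, (record[0], next(order), record))
        if len(heap) >= limit:
            # window is full: hand the oldest record to the mainloop
            yield heapq.heappop(heap)[2]
    while heap:
        # input exhausted: drain the remaining records in order
        yield heapq.heappop(heap)[2]


mixed = [(3, "c"), (1, "a"), (2, "b"), (5, "e"), (4, "d")]
sorted_ok = [line for _, line in readahead_sorted(mixed, limit=3)]

# a record that arrives later than the window can absorb stays out of order
too_late = [ts for ts, _ in readahead_sorted([(2, "b"), (3, "c"), (1, "a")],
                                             limit=2)]
```

the second call shows the limit of the approach: a record which is
displaced by more lines than the window holds is still returned out of
order, so the limit has to cover the worst displacement in your logfile.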

Example
-------

[input_clf]
readaheadlimit = 100

4. searchengines
================
plugin: processor/web
option: searchengines
related options: debug_searchengines

The file modlogan.searchengines contains a list of known searchengines. This
list is used by the 'web'-processor-module for searchstring and searchengine
detection.

The list is far from complete and has to be maintained by you, the user. If
you've found a new searchengine in your logfile (enabling 
'debug_searchengines=1' in the processor section of your config-file will 
help you a lot) you have to add it to modlogan.searchengines so that it is
recognized in the next run.

Adding SearchEngines
--------------------

An entry in the searchengines description file consists of the host string
(url part) and the getvars part. the getvars part is used as a group string
and is enclosed in brackets '[]'. every url that uses a getvar which is
already listed in one of those brackets has to be placed below that group
string.

the url part is a regex string.  
The number after the url should be set to zero and will be used for a
searchstring rating in the future.

Lines starting with '#' are comments.

Example:
--------
You've got the following output from modlogan:

o SK: ?? http://www.searchalot.com/texis/open/meta/main.htm ->
q=free+catalogue

This means that a known key was detected ('q') but the url is not listed in
the modlogan.searchengines file. 

Now we have to choose the right section in the searchengines-definition
file. As the section name is the detected key we have to go to the following
section:

[q]

after we've written the regex for the URL the new searchengine definition
looks like this:

[q]
# ...
# in the first attempt we try to be as accurate as possible
# if other pages from the same site are hitting this section 
# we'll shorten the match to 
# searchalot.com/texis/,0
searchalot.com/texis/open/meta/main.htm$,0
# ...
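
to check a new entry before the next run, the file format described above
can be read with a few lines of Python. this is a sketch based only on the
description in this section, not modlogan's parser: parse_searchengines and
is_known are hypothetical helpers, and matching the regex against the whole
referring url is an assumption.

```python
import re

SAMPLE = """\
[q]
# in the first attempt we try to be as accurate as possible
searchalot.com/texis/open/meta/main.htm$,0
"""


def parse_searchengines(text):
    """Return {getvar: [(compiled url regex, rating), ...]}."""
    engines = {}
    key = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                          # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            key = line[1:-1]                  # group string, e.g. [q]
            engines.setdefault(key, [])
            continue
        pattern, sep, rating = line.rpartition(",")
        if key is None or not sep:
            continue                          # entry outside a group or no number
        engines[key].append((re.compile(pattern), int(rating)))
    return engines


def is_known(engines, key, url):
    """True if `url` matches an entry listed below the group `key`."""
    return any(regex.search(url) for regex, _ in engines.get(key, []))


engines = parse_searchengines(SAMPLE)
known = is_known(engines, "q",
                 "http://www.searchalot.com/texis/open/meta/main.htm")
```

with the sample entry, known is True for the url from the debug output,
while other pages below /texis/ are not matched until the regex is
shortened.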

5. Default configfile
=====================

if you are generating reports for multiple sites you often don't want to
rewrite the whole configfile for each server. using a default-configfile
which contains all the default values eases the update process a lot. the
server specific config-file is small and only contains the options that vary
between the servers: servername, grouped hosts, hidden referrer...

you have to understand how the options from the config file interact with
the options of the default-configfile.

there are three possibilities:
- the value will be overwritten by the next occurrence of the same key
- the value is appended to the list
- only the first occurrence is used

how each option reacts is specified in ./doc/plugin-option.txt or
./doc/plugin-options.html.

the third possibility is only used by the 4 keys of the global section:
- inputplugin
- outputplugin
- processorplugin
- defaultconfigfile

as these options can only be set once you have to specify the plugins
BEFORE you define the default-configfile, as the default-configfile is parsed
right at the occurrence of the default-configfile option.

Example
-------

[global]
inputplugin=null
outputplugin=modlogan
processorplugin=web
default_configfile=modlogan.def.conf
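
the effect of the three possibilities and of the parse order can be
sketched in Python. this is an illustration, not modlogan's config parser:
config files are replaced by in-memory lists of (key, value) pairs, sections
are ignored, the key is spelled defaultconfigfile as in the list above, and
treating splitby as an appending option and outputdir as an overwriting one
is an assumption (the real behaviour of each option is documented in the
plugin-options files).

```python
FIRST_WINS = {"inputplugin", "outputplugin", "processorplugin",
              "defaultconfigfile"}
APPEND = {"splitby"}  # assumed for this sketch


def apply_option(config, key, value):
    """Store one option according to its merge behaviour."""
    if key in FIRST_WINS:
        config.setdefault(key, value)             # only the first occurrence
    elif key in APPEND:
        config.setdefault(key, []).append(value)  # appended to the list
    else:
        config[key] = value                       # overwritten by the next one


def load(name, files, config=None):
    """Read the pairs of `name`; a default-configfile is parsed on the spot."""
    config = {} if config is None else config
    for key, value in files[name]:
        first_time = key not in config
        apply_option(config, key, value)
        if key == "defaultconfigfile" and first_time:
            load(value, files, config)
    return config


files = {
    "site.conf": [
        ("inputplugin", "clf"),
        ("defaultconfigfile", "modlogan.def.conf"),
        ("outputdir", "/var/www/site"),
        ("splitby", "site-rule"),
    ],
    "modlogan.def.conf": [
        ("inputplugin", "null"),
        ("outputplugin", "modlogan"),
        ("outputdir", "/tmp/reports"),
        ("splitby", "default-rule"),
    ],
}
config = load("site.conf", files)
```

inputplugin stays clf because it was set before the default-configfile was
parsed; had it been placed after the defaultconfigfile line, the value from
the default-configfile would have won.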

6. userdefined headers/footers
==============================
plugin: output/modlogan
option: htmlheader, htmlfooter

by default modlogan will generate complete HTML-pages with a header and
footer. 
  the header contains the full HTML-header (DOCTYPE, TITLE, ...),
the starting BODY-tag and the standard header ("Statistics for ..."). 
  the default footer consists of a horizontal line (<HR>), a link which
points to the home of modlogan and the two images which indicate that this is
valid HTML 4.0.

in some circumstances these parts are not wanted. in these cases you can
supply two filenames which are used instead of the default header/footer.

Example:
--------

[output_modlogan]
# the page is embedded into a surrounding HTML-page via Server-Side-Includes
# blank.ihtml is an empty file.
htmlfooter=blank.ihtml
htmlheader=blank.ihtml

7. userdefined logfile definition
=================================
plugin: input/clf
option: format

by default the clf input plugin tries to find out the type of the logfile.
it checks for:
- common logfile (Apache: CustomLog common)
- combined logfile (Apache: CustomLog combined)
- squid logfile

if your logfile definition is different from these three definitions you can
provide the logfile definition in the config-file. 

you can copy the CustomLog string directly from the httpd.conf of your
Apache.  

for the definition of the different options read the documentation of
'mod_log_config' [-> Apache Manual]; for the availability of these options
in combination with modlogan see ./src/input/clf/plugin_config.h.

Example:
--------
[input_clf]
format=%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %v %p %T