$Date: 2000/08/13 18:19:40 $

Some features of modlogan are special and need some more in-depth
documentation.

if this documentation is insufficient, unclear and/or complete nonsense,
please tell me. send a mail to <jan@kneschke.de>

Table Of Contents
=================

0. modular design
1. input stream divider (logfile splitter)
2. running modlogan on non-closing pipe
3. input stream sorter
4. searchengines
5. default config file
6. userdefined headers/footers
7. userdefined logfile definition

0. modular design
=================

to understand what is described below you have to understand some of the
internals of modlogan.

modlogan consists of 3 + 1 parts:
- input plugin
- processor plugin
- output plugin
and
- gluecode + mainloop

the input plugin has only one important function, which provides a record.
if this record belongs to the next month, a report is generated by the
output plugin. afterwards the record is processed by the processor plugin.
this is repeated until the last record is parsed.

some features change this work flow in some ways. they are described
below.

1. logfile splitter
===================

plugin: processor/web
option: splitby

if you want to split one input stream into different output streams, you
need a logfile splitter. you can use an external script as a preprocessor
or use the splitter support of the web processor plugin.

          input stream          (input plugin)
              |
   +-- ... ---+--- ... ---+     (processor plugin)
   |          |           |
   |          |           |
        output streams          (output plugins)

to enable this feature add a splitby definition to the processor_web
section of your config-file.

a splitby definition is the following string:

splitby=<field>,"<regex>",<name>

where <field> is:
  srvhost - for the host which served the request
  srvport - for the port the host listened on
  requser - for the authenticated user
  requrl  - for the requested url
  reqhost - for the requesting host
  refurl  - for the referring url
  default - 'joker' which matches everything.
<regex> is a regular expression which has to match successfully, and
<name> is a name used to group the split records again. <name> is also
used as the name of the subdirectory where the reports are placed.

if you specify multiple splitby definitions they are checked from first
to last. if one check is successful the generated name is used as <name>.

NOTE: the splitter has to return a name and it's your job to make sure
      that a name is available
      - by using an always-matching definition, e.g.
        splitby=srvhost,"(.*)",$1
      - by specifying a 'default' definition.

Example 1:
----------

let's assume that we have the following directory structure:

/users/~j.kneschke/index.html
/users/~project.modlogan/index.html

the definition

--
splitby=requrl,"^/users/~(.*?)/.*$",host_$1
--

will divide the records according to the string between the '~' and the
'/'. the string is taken and added to the name.

/users/~j.kneschke/index.html       -> host_j.kneschke
/users/~project.modlogan/index.html -> host_project.modlogan

the directories host_j.kneschke and host_project.modlogan will be created
(if they don't exist) in the directory which you specify with
global:outputdir. each directory contains the reports for the respective
split logs.

Example 2:
----------

/users/~<username>/index.html is an alias for <username>.domain.com and
modlogan shall treat it as such.

as the following two definitions generate the same name, they result in
the same report. as the first match wins you can switch between your
installations as you like.

--
splitby=requrl,"^/users/~(.*?)/.*$",host_$1
splitby=srvhost,"^(.*?)\..*$",host_$1
--

Possible Errors
---------------

the splitter will notice that nothing matched, spit out a warning and
ignore the record if no 'default' stream is available.

--
# split by all non-anonymous users
splitby=user,"^([^-]+)$",user_$1
--

if a page is requested by a non-authenticated user the logfile will
contain a dash ('-').
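
the matching rules of this section (definitions are checked in order, the
first match wins, a record without a match gets no name) can be sketched
in a few lines of Python. this is only an illustration and not modlogan's
actual code: the tuple layout, the record dictionary and the function name
split_name are assumptions made for this sketch.

```python
import re

# Hypothetical in-memory form of splitby definitions: (field, regex, name).
# "$1" in the name stands for the first capture group, as in the examples.
SPLITBY = [
    ("requrl", r"^/users/~(.*?)/.*$", "host_$1"),
    ("srvhost", r"^(.*?)\..*$", "host_$1"),
]


def split_name(record, definitions):
    """Return the group name for a record (a dict), or None if nothing matches."""
    for field, regex, name in definitions:
        if field == "default":
            # the 'joker' matches every record
            return name
        m = re.search(regex, record.get(field, ""))
        if m:
            groups = m.groups(default="")
            first = groups[0] if groups else ""
            # substitute $1 with the first captured group
            return name.replace("$1", first)
    # nothing matched: the caller would warn and ignore the record
    return None


by_url = {"requrl": "/users/~j.kneschke/index.html", "srvhost": "www.domain.com"}
by_host = {"requrl": "/index.html", "srvhost": "jan.domain.com"}
print(split_name(by_url, SPLITBY))
print(split_name(by_host, SPLITBY))

# the definition from 'Possible Errors' returns no name for the user '-'
USERS = [("user", r"^([^-]+)$", "user_$1")]
print(split_name({"user": "-"}, USERS))
```

appending a ("default", "", "other") entry to USERS in this sketch makes
the last call return "other" instead of None.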
as no splitby definition matches such a record, it will be ignored.

Support
-------

only a few plugins are ported now:

input:     every plugin works (they don't have to be changed)
processor: web
output:    modlogan

2. non-closing pipes
====================

plugin: global
option: gen_report_threshold

by default modlogan generates a report at the end of each parsed month and
at the end of each run. This isn't enough if you run modlogan on a
non-closing pipe like:

$ tail -f access.log | modlogan -c <config-file>

In this scenario modlogan generates a report only once a month, which is
far from ideal. To circumvent this problem you can specify a number of
records after which a report is generated.

if you set a threshold of 1000, modlogan will generate a report
- after each 1000 records
- at the end of each month and
- at the end of the run.

the last case can't occur in this scenario because modlogan won't stop by
itself.

3. input stream sorter
======================

plugin: input/clf
option: readaheadlimit

if multiple servers are writing into one logfile the records are normally
not in order, but modlogan expects them to be in order. they have to be
sorted either by an external run of sort on the whole logfile or by the
internal logfile sorter.

By enabling the internal input sorter the input plugin will read the
specified number of records, parse them and put them into a sorted list.
the oldest record is taken from the list and returned to the mainloop for
further processing. as the internal list of parsed records is no longer
full, the next call of the input parser adds another line to the sorted
list of parsed records. the oldest record is again taken from the list and
the game repeats.

you can control this feature by specifying the maximum number of lines
which has to be reached before a record is returned.

NOTE: as the specified number of lines is kept in memory you shouldn't
      increase the number too much.

Example
-------

[input_clf]
readaheadlimit = 100

4. searchengines
================

plugin: processor/web
option: searchengines
related options: debug_searchengines

The file modlogan.searchengines contains a list of known searchengines.
This list is used by the 'web'-processor-module for searchstring and
searchengine detection.

The list is far from complete and has to be maintained by you, the user.
If you've found a new searchengine in your logfile (enabling
'debug_searchengines=1' in the processor section of your config-file will
help you a lot) you have to add it to modlogan.searchengines so that it is
used in the next run.

Adding SearchEngines
--------------------

An entry in the searchengines description file consists of the host string
and the getvars part. the getvars part is used as a group string and is
enclosed in brackets '[]'. every url that uses a getvar which is already
listed in one of those brackets has to be placed below this group string.
the url part is a regex string. The number after the url should be set to
zero and will be used for a searchstring rating in the future.

Lines starting with '#' are comments.

Example:
--------

You've got the following output from modlogan:

o SK: ?? http://www.searchalot.com/texis/open/meta/main.htm -> q=free+catalogue

This means that a known key was detected ('q') but the url is not listed
in the modlogan.searchengines file.

Now we have to choose the right section in the searchengines-definition
file. As the section name is the detected key we have to go to the
following section:

[q]

after we've written the regex for the URL, the new searchengine definition
looks like this:

[q]
# ...
# in the first attempt we try to be as accurate as possible
# if other pages from the same site are hitting this section
# we'll shorten the match to
#   searchalot.com/texis/,0
searchalot.com/texis/open/meta/main.htm$,0
# ...

5. Default configfile
=====================

if you are generating reports for multiple sites you often don't want to
rewrite the whole configfile for each server.
using a default-configfile which contains all the default values eases the
update process a lot. the server specific config-file is small and only
contains the options that vary between the servers: servername, grouped
hosts, hidden referrer...

you have to understand how the options from the config file interact with
the options of the default-configfile. there are three possibilities:
- the value is overwritten by the next occurrence of the same key
- the value is appended to the list
- only the first occurrence is written

how each option reacts is specified in ./doc/plugin-option.txt or
./doc/plugin-options.html.

the third possibility is only used by the 4 keys of the global section:
- inputplugin
- outputplugin
- processorplugin
- defaultconfigfile

as these options can only be written once, you have to specify the plugins
BEFORE you define the default-configfile, because the default-configfile
is parsed right at the occurrence of the default-configfile option.

Example
-------

[global]
inputplugin=null
outputplugin=modlogan
processorplugin=web
default_configfile=modlogan.def.conf

6. userdefined headers/footers
==============================

plugin: output/modlogan
option: htmlheader, htmlfooter

by default modlogan will generate complete HTML-pages with a header and
footer. the header contains the full HTML-header (DOCTYPE, TITLE, ...),
the opening BODY-tag and the standard header ("Statistics for ..."). the
default footer consists of a horizontal line (<HR>), a link which points
to the home of modlogan and the two images which certify that this is
valid HTML 4.0.

in some circumstances you don't want these parts to be generated. in these
cases you can supply two filenames which are used instead of the default
header/footer.

Example:
--------

[output_modlogan]
# the page is embedded into a surrounding HTML-page via Server-Side-Includes
# blank.ihtml is an empty file.
htmlfooter=blank.ihtml
htmlheader=blank.ihtml

7. userdefined logfile definition
=================================

plugin: input/clf
option: format

by default the clf input plugin tries to find out the type of the logfile.
it checks for:
- common logfile (Apache: CustomLog common)
- combined logfile (Apache: CustomLog combined)
- squid logfile

if your logfile format differs from these three definitions you can
provide the logfile definition in the config-file. you can copy the
CustomLog string directly from the httpd.conf of your apache.

for the definition of the different options you can specify, read the
documentation of 'mod_log_config' [-> Apache Manual]. for the availability
of these options in combination with modlogan see
./src/input/clf/plugin_config.h.

Example:
--------

[input_clf]
format=%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %v %p %T
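
to get a feeling for what such a format string describes, the following
Python sketch turns the format from the example above into a regular
expression and applies it to one log line. this is not how modlogan parses
the format internally; the directive table, the group names and the
function name format_to_regex are assumptions made for this illustration,
and only the directives of the example are covered.

```python
import re

# Hypothetical mapping from a few CustomLog directives to named regex groups.
# Only the directives used in the example format are listed here.
DIRECTIVES = {
    "%h": r'(?P<host>\S+)',
    "%l": r'(?P<ident>\S+)',
    "%u": r'(?P<user>\S+)',
    "%t": r'\[(?P<time>[^\]]+)\]',
    "%r": r'(?P<request>[^"]*)',
    "%>s": r'(?P<status>\d{3})',
    "%b": r'(?P<bytes>\d+|-)',
    "%v": r'(?P<vhost>\S+)',
    "%p": r'(?P<port>\d+)',
    "%T": r'(?P<seconds>\d+)',
}

# matches %{Header}i as well as plain directives like %h or %>s
TOKEN = re.compile(r'%\{([^}]+)\}i|%>?[a-zA-Z]')


def format_to_regex(fmt):
    """Translate a CustomLog-style format string into a compiled regex."""
    fmt = fmt.replace('\\"', '"')  # httpd.conf writes quotes as \"
    parts, pos = [], 0
    for m in TOKEN.finditer(fmt):
        # literal text between two directives is matched verbatim
        parts.append(re.escape(fmt[pos:m.start()]))
        if m.group(1):
            # request header: use the lowercased header name as group name
            name = m.group(1).lower().replace("-", "_")
            parts.append(r'(?P<%s>[^"]*)' % name)
        else:
            # raises KeyError for directives missing from the table above
            parts.append(DIRECTIVES[m.group(0)])
        pos = m.end()
    parts.append(re.escape(fmt[pos:]))
    return re.compile("^" + "".join(parts) + "$")


fmt = r'%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %v %p %T'
line = ('192.168.0.1 - jan [13/Aug/2000:18:19:40 +0200] '
        '"GET /index.html HTTP/1.0" 200 1234 '
        '"http://example.com/" "Mozilla/4.0" www.example.com 80 0')
match = format_to_regex(fmt).match(line)
print(match.groupdict() if match else "line does not match the format")
```

a line that doesn't fit the format simply yields no match, which is
roughly the situation you are in when the autodetection fails and no
format is given.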