TODO:

Get someone to make spiffy graphics and a logo for the web page. :-)

Do whatever else is needed to make it usable as an Apache logging coprocess?

Profiling
=========

From gprof, significant time is being spent in:
5.7%		adns__timeouts()
3.8%		find() (mostly less(), thus strcmp())
2.3%		sendto()
1.0%		read_ipaddr()
1.0%		domptr()
1.0%		memchr() in fgetln()

Cache efficiency
================

On an arbitrarily chosen day, a reports system reporting on 81
different web servers got a 64% DB file cache hit rate.  That is
higher than might be expected.  The common use of web proxies like
AOL's may be responsible for the high hit rate, though that's just a
guess.

It would be helpful to do a histogram of the age of cache entries that
were used, to help determine the optimum cache entry lifetime for
expire-ip-db.  dns-terror would need more instrumentation (probably #ifdef'd)
to track that.  Is it a normal distribution, or fairly flat with no dropoff?
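
A rough sketch of what that instrumentation might look like (the
DNS_TERROR_STATS macro, the bucket width, and the function names are
all invented here, not anything dns-terror currently has):

    #ifdef DNS_TERROR_STATS
    #include <cstdio>
    #include <ctime>

    // One bucket per day of cache-entry age, up to 30 days.
    static long age_histogram[31];

    // Call on every cache hit with the entry's stored timestamp.
    static void record_cache_age(time_t stored, time_t now)
    {
        long days = (now - stored) / 86400;
        if (days < 0) days = 0;
        if (days > 30) days = 30;
        age_histogram[days]++;
    }

    // Call at exit; the shape of this output answers the
    // normal-vs-flat question above.
    static void dump_cache_ages(void)
    {
        for (int d = 0; d <= 30; d++)
            std::fprintf(stderr, "age %2d days: %ld\n", d, age_histogram[d]);
    }
    #endif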

Parallelism
===========

dns-terror -oz and analog each fully utilize a single CPU; rcp does
not use much CPU.  How to take advantage of N CPUs?

Generating a single report is hard to parallelize without incurring
enough coordination and system call overhead to cancel out much of
the speed gain.  If multiple reports are being generated, though,
several processes could read from a work queue.

This could be approached in two main ways.  One way would be to have
each reader process go through the whole cycle of programs to produce
a report: rcp, dns-terror, optionally getdominfo, analog.  dns-terror
and getdominfo would use file locking (flock, man DB_File) on the DB files
to prevent corruption from multiple simultaneous writers.  Running
more reader processes than CPUs would improve CPU utilization when
several rcp's happen to be in flight at once, by increasing the
chance that some other process is meanwhile running dns-terror or
analog.  There would be some redundant fetching of external data if
multiple dns-terror or getdominfo processes ran simultaneously, but
the increase in parallelism would probably more than offset that
waste.
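
The locking could look something like this minimal sketch (flock is
the real call; the helper name and the elided DB update are
placeholders):

    #include <sys/file.h>
    #include <fcntl.h>
    #include <unistd.h>

    // Serialize writers on the DB file itself.  Readers could take
    // LOCK_SH instead if partially written entries are a concern.
    void locked_db_update(const char *db_path)
    {
        int fd = open(db_path, O_RDWR);
        if (fd < 0) return;                 // real code would report this
        if (flock(fd, LOCK_EX) == 0) {      // blocks until we own the lock
            // ... read, modify, and sync the DB here ...
            flock(fd, LOCK_UN);
        }
        close(fd);
    }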

The other way would be to have each process do only one stage of
report generation.  There might be a first-stage control process that
generates the initial queue of log files to rcp, from a database.  One
process loops rcp'ing log files.  When it finishes each one, it adds
it to a queue that is read by another process that runs dns-terror on
each log file.  When it finishes one, it adds that log file name to a
queue that is read by a process that repeatedly runs analog.  With
this approach, the queues could even be plain text files written with
line buffering; since each queue has only one reader, the reader's
input buffering can't swallow records meant for another process.
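
A minimal sketch of such a single-reader text-file queue (the helper
names and one-name-per-line record format are invented for
illustration):

    #include <cstdio>
    #include <string>

    // Writer side: open line-buffered so the reader never sees a
    // partial record (setvbuf must precede any other I/O on the stream).
    FILE *open_queue_for_append(const char *path)
    {
        FILE *q = std::fopen(path, "a");
        if (q) std::setvbuf(q, NULL, _IOLBF, 0);
        return q;
    }

    void enqueue(FILE *q, const std::string &logfile)
    {
        std::fprintf(q, "%s\n", logfile.c_str());
    }

    // Reader side: fgets() returning NULL just means the writer hasn't
    // produced the next name yet; the caller can clearerr(), sleep,
    // and retry.
    bool dequeue(FILE *q, std::string &logfile)
    {
        char buf[4096];
        if (!std::fgets(buf, sizeof buf, q))
            return false;
        logfile.assign(buf);
        size_t n = logfile.size();
        if (n && logfile[n - 1] == '\n')
            logfile.erase(n - 1);
        return true;
    }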

But this approach doesn't parallelize beyond 2-4 CPUs.  A refinement
of this second approach is to have multiple processes at each stage,
with DB file locking, and making sure that the queue items are only
written and read at record boundaries.  There would be the same
inefficiency as with the first approach regarding some redundant
fetching of data by multiple dns-terror and getdominfo processes.

Having multiple processes reading and/or writing log files
simultaneously shouldn't be a problem, as the I/O bandwidth of
an ultra-wide SCSI RAID can handle that.

Another issue is how and when gzip is run.  A log file can be zipped
before being fetched from the remote machine, or at the start, middle,
or end of the report generation process, or the log file can be
discarded without ever zipping it (though we wouldn't do that).
Zipping it last, for archiving, minimizes the amount of CPU time spent
zipping and unzipping, at the cost of more disk I/O.  If a log file is
not zipped on the remote machine prior to the rcp, it could be
transferred with rsync -z or scp -C to reduce transfer times.
However, that is effectively gzipping the file on the remote machine,
then unzipping it and rezipping it locally.  If rsync or scp had an
option to gzip the file on the fly and keep it in gzipped form on the
destination machine, that might be desirable: the log files would be
ready to fetch at a predictable time, without having to allow for
gzipping time that varies with log file size.

On multiprocessor machines, running gzip in a pipe as a separate
process might be a win over using zlib, unless the system call
overhead outweighs the gain of multiprocessing.  If we are using the
pipelining approaches outlined above, we might want to use zlib
anyway.
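
If we went the zlib route, the in-process read side might look like
this minimal sketch (gzopen, gzgets, and gzclose are the real zlib
calls; the parsing is left as a stub):

    #include <zlib.h>

    // Read a gzipped log file line by line in-process, avoiding a
    // separate gzip process and the pipe traffic it would cost.
    void process_gzipped_log(const char *path)
    {
        gzFile f = gzopen(path, "rb");
        if (!f) return;                     // real code would report this
        char line[8192];
        while (gzgets(f, line, sizeof line))
            ;                               // ... parse the log line here ...
        gzclose(f);
    }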

Possible ways to parallelize dns-terror:

Parallelize and partition the work with fork() to increase CPU
utilization from 25-50% to closer to 100%, so we can be doing
processing while waiting for I/O (e.g., a cache lookup).  Or, more
simply, we could start several copies at once, processing different
log files.  But they would have to either use different DB files (and
hence duplicate effort) or else use DB syncing and locking.

Here's a parallel design to consider:

For each N-line chunk of logs (which could be the whole file), the
parent builds an in-core map (key=ipaddr, value=exists) of the
distinct IP addresses it has read.  When it has read N (or all)
lines, it hands 1/C of them to each of the C children it has forked,
via shared memory or Unix domain sockets perhaps, and signals them.
Each child checks whether its addresses are already resolved in the
on-disk DB file and resolves the ones that aren't.  The children
write either to that file, locking it, or each to its own DB file,
which the parent combines at the end.  Or they could just append the
results to a stack or socket in memory, and the parent writes them
out to the DB file.  But remember, most of the addresses have already
been seen and never go to DNS at all, so the cache lookups are
probably what we most need to parallelize.
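
A rough sketch of that parent/child handoff, using plain pipes
instead of shared memory for simplicity (resolve_in_db stands in for
the real DB-file lookup, and error handling is mostly elided):

    #include <unistd.h>
    #include <sys/wait.h>
    #include <cstdio>
    #include <set>
    #include <string>
    #include <vector>

    // Hypothetical stand-in for the DB-file check / DNS resolution.
    void resolve_in_db(const std::string &ipaddr);

    void resolve_chunk(const std::set<std::string> &distinct_ips, int C)
    {
        std::vector<int> fds;               // write ends, one per child
        for (int c = 0; c < C; c++) {
            int p[2];
            if (pipe(p) < 0) return;
            if (fork() == 0) {              // child: read its share of IPs
                for (size_t j = 0; j < fds.size(); j++)
                    close(fds[j]);          // drop earlier pipes' write ends
                close(p[1]);
                FILE *in = fdopen(p[0], "r");
                char buf[64];
                while (fgets(buf, sizeof buf, in)) {
                    std::string ip(buf);
                    if (!ip.empty() && ip[ip.size() - 1] == '\n')
                        ip.erase(ip.size() - 1);
                    resolve_in_db(ip);
                }
                _exit(0);
            }
            close(p[0]);
            fds.push_back(p[1]);
        }
        // Parent: deal the distinct addresses out round-robin, 1/C each.
        size_t i = 0;
        for (std::set<std::string>::const_iterator it = distinct_ips.begin();
             it != distinct_ips.end(); ++it, ++i) {
            std::string line = *it + "\n";
            write(fds[i % C], line.data(), line.size());
        }
        for (int c = 0; c < C; c++)
            close(fds[c]);                  // EOF tells each child to finish
        while (wait(NULL) > 0)
            ;                               // reap the children
    }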

Here's another parallelizing idea:

Split the work into N buckets, each handled by one process, according
to the last octet of the IP address; taking that octet modulo N
should spread the load evenly enough, I think.  For -o, getting the
lines output in order would require some sort of coordination, shared
memory or semaphores perhaps.
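
A minimal sketch of that partitioning (inet_addr and ntohl are the
real libc calls; error handling for malformed addresses is elided):

    #include <arpa/inet.h>

    // Assign an address to one of N worker processes by its last
    // octet, so every process sees a disjoint set of IPs and needs
    // no locking on its own DB file.
    int bucket_for(const char *dotted_quad, int N)
    {
        unsigned long a = ntohl(inet_addr(dotted_quad));
        return (int)((a & 0xff) % N);       // last octet, mod N
    }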

Are TCP DNS connections faster than UDP?
========================================

It's hard to know:

Try adns_qf_usevc (TCP) in the query flags.  Unfortunately, after a
few dozen queries, I get this:
adns warning: TCP connection lost: read: Connection reset by peer (NS=127.0.0.1)
And then nothing happens....
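
For reference, the flag goes into adns_submit() like this minimal
sketch (adns_init, adns_submit, and adns_wait are the real adns entry
points; the PTR query setup is simplified):

    #include <adns.h>
    #include <cstdlib>

    // Submit a reverse lookup over TCP by adding adns_qf_usevc to the
    // query flags; this is the experiment that stalled after a few
    // dozen queries.
    int lookup_ptr_tcp(const char *inaddr_arpa_name)
    {
        adns_state ads;
        adns_query q;
        adns_answer *ans;
        if (adns_init(&ads, adns_if_none, 0)) return -1;
        if (adns_submit(ads, inaddr_arpa_name, adns_r_ptr,
                        adns_qf_usevc, NULL, &q)) return -1;
        adns_wait(ads, &q, &ans, NULL);     // blocks until the answer
        int ok = (ans->status == adns_s_ok);
        free(ans);
        adns_finish(ads);
        return ok ? 0 : -1;
    }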

Sample IIS 4 log
================

#Software: Microsoft Internet Information Server 4.0
#Version: 1.0
#Date: 1999-08-16 00:02:07
#Fields: date time c-ip cs-username s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-win32-status sc-bytes cs-bytes time-taken s-port cs-version cs(User-Agent) cs(Cookie) cs(Referer)
1999-08-16 00:02:07 208.206.40.191 - W3SVC3 FLEXNET17 208.192.104.93 HEAD /default.htm - 200 0 280 19 0 80 HTTP/1.0 - - -
1999-08-16 00:07:06 208.206.40.191 - W3SVC3 FLEXNET17 208.192.104.93 HEAD /default.htm - 200 0 280 19 0 80 HTTP/1.0 - - -