-implement trap delivery for "redistribute" in the mon server itself as an
 option. retain the "call script" behavior, but maybe specify internal
 trap delivery via "redistribute -h hostname [hostname...]". also allow
 multiple redistribute lines to build a list of scripts to call

-deliver traps with acknowledgement via tcp

-add protocol commands to dump entire status + configuration in one operation
 to reduce latency (not so many serialized get/response operations just to
 get status)

-no alerts for n mins

-better cookbook of examples, including some pre-fab m4 defines for templates
 with focus on the ability to quickly configure mon out-of-the-box for
 the most common setups

-period "templates"
    > like I have to repeat my period definitions all 260 times, one for
    > each watch.  we should have templates in the Mon config file for any
    > kind of object so it can be reused.

    so do you mean a way to define a "template" for a period so that
    you don't need to keep rewriting "wd {Sun-Sat}", or so that it'll use
    some default period if you don't specify one, or what? i can see this
    working a bunch of different ways.


    like this?

    define period-template xyz
	period wd {Sun-Sat}
		 alert mail.alert mis@domain.com
		 alert page.alert mis-pagers@domain.com
		 alertevery 1h


    watch something
	 service something
	    period template(xyz)

    watch somethingelse
	 service something
	    period template(xyz)
		# override the 1h
		alertevery 2h


-my recent thoughts on config management are that the parsing should be
 all modularized (keeping the config parsing code in a separate
 perl module so it can be reused by other apps),
 and there should be a way to turn the resulting data
 structure into xml and to import the same back, not so you can write
 your config by hand in xml, but so you can use some generic xml editing
 tool to mess around with the config, to get one type of gui.
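A minimal sketch in Python of that xml round-trip idea. The element names
and the simplified watch/service/option shape are made up for illustration;
mon's real config structure is richer:

```python
import xml.etree.ElementTree as ET

def config_to_xml(cfg):
    # cfg: {watch: {service: {key: value}}} -- a simplified stand-in
    # for whatever the real config-parsing module would produce.
    root = ET.Element("mon-config")
    for watch, services in cfg.items():
        w = ET.SubElement(root, "watch", name=watch)
        for svc, opts in services.items():
            s = ET.SubElement(w, "service", name=svc)
            for key, value in opts.items():
                ET.SubElement(s, "opt", key=key).text = value
    return ET.tostring(root, encoding="unicode")

def xml_to_config(xml_text):
    # Import the same structure back, e.g. after a generic xml
    # editor was used to mess with it.
    root = ET.fromstring(xml_text)
    return {
        w.get("name"): {
            s.get("name"): {o.get("key"): o.text for o in s.findall("opt")}
            for s in w.findall("service")
        }
        for w in root.findall("watch")
    }
```

A gui built on any off-the-shelf xml editor then only has to respect the
round trip: xml_to_config(config_to_xml(cfg)) == cfg.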

-the most common things should be easiest to do, regardless of
 a gui or text file config. that is what makes stuff "easy". however,
 i don't think more complicated setups lend themselves to guis as much,
 and in complicated setups you have to invest a lot of time to learn how
 the tool works, and a fancy gui in that case is less of a payoff.
 this is for configuration, i mean. fancy guis for reporting and stuff
 are good, no doubt.

-global alert definitions with their own squelches (alertevery, etc.)
 > also, alarms need to be collated so pagers and cell phones don't get
 > buried with large numbers of alerts.  I have a custom solution that I
 > wrote for this, but it's a lousy solution since it essentially implements
 > its own paging system.

 i could see how it would be good to be able to define some alert
 destinations *outside* of the period definitions, then refer to them
 in the period definitions, then you can do "collation" that way. like
 this:

    define global-alert xyz mail.alert xyz@lmnop.com
	 alertevery 1h

    watch
       service
	 period
	   globalalert xyz     <---collated globally

    watch
       service
	 period
	   globalalert xyz     <---collated globally
	   alert mail.alert pdq@lmnop.com   <---not collated


that would be quite easy to do and i think very useful. you could
apply all the same squelch knobs (alertevery, etc.) to the global ones.
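A rough Python sketch of that collation idea (class and method names are
illustrative, not mon internals): one global alert keeps one shared
alertevery timestamp, no matter how many services refer to it.

```python
import time

class GlobalAlert:
    """Shared squelch state for a globally-defined alert destination."""
    def __init__(self, name, alert_cmd, alertevery_secs):
        self.name = name
        self.alert_cmd = alert_cmd            # e.g. "mail.alert xyz@lmnop.com"
        self.alertevery = alertevery_secs
        self.last_sent = None                 # shared by every referring service

    def trigger(self, now=None):
        """Return True if the alert should actually fire now."""
        now = time.time() if now is None else now
        if self.last_sent is not None and now - self.last_sent < self.alertevery:
            return False                      # squelched: collated globally
        self.last_sent = now
        return True
```

The caller would run alert_cmd only when trigger() returns True; per-service
"alert" lines bypass this object entirely, so they stay uncollated.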

-----
(from mon-1.2.0)
$Id: TODO,v 1.2.2.1 2007/06/27 11:51:17 trockij Exp $

-add a short "radius howto" to the doc/ directory.

-make traps authenticate via the same scheme used to obscure
 the password in RADIUS packets

-descriptions defined in mon.cf should be 'quoted'

-document command section and trap section in authfile

-finish support for receiving snmp traps

-output to client should be buffered and incorporated into the I/O loop.
 There is the danger that a sock_write to a client will block the server.
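A Python stand-in sketch of what buffering client output inside the I/O loop
could look like (the real server is perl; names here are illustrative):
writes go into a per-client buffer, and the loop flushes only sockets that
select() reports as writable, so a slow client can never block the server.

```python
import select
import socket

out_buffers = {}   # client socket -> bytes still awaiting write

def queue_output(sock, data):
    # Never write to the client directly: append to its buffer and
    # let the I/O loop flush it when the socket is writable.
    out_buffers[sock] = out_buffers.get(sock, b"") + data

def io_loop_once(timeout=1.0):
    pending = [s for s, buf in out_buffers.items() if buf]
    _, writable, _ = select.select([], pending, [], timeout)
    for sock in writable:
        sent = sock.send(out_buffers[sock])      # may be a partial write
        out_buffers[sock] = out_buffers[sock][sent:]
```

The same select() call would of course also carry the read descriptors for
monitors and clients; only the write side is shown here.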

-finish muxpect

-make "chainable" alerts
 ?? i don't recall who asked for this or how it would work

-make alerts nonblocking, and handle them in a similar fashion to
 monitors. i.e., serialize per-service (or per-period) alerts.

-document "clear" client command

-Document trap authentication.

-Document traps.

-Make monitors parallelize their tasks, similar to fping.monitor. This
 is an important scalability problem.

-re-vamp the host disabling. 1) store them in a table with a timeout
 on each so that they can automatically re-enable themselves so
 people don't forget to re-enable them manually. 2) don't do
 the disabling by "commenting" them out of the host groups.
 We still want them to be tested for failure, but just disable
 alerts that have to do with the disabled hosts.
 When a host is commented out, accept a "reason" field that
 is later accessible so that you can tell why someone disabled
 the host.
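A small Python sketch of that disabled-hosts table (illustrative only):
entries time out and re-enable themselves, and each carries a reason.
Hosts stay in their groups and keep being tested; callers would consult
is_disabled() only when deciding whether to alert.

```python
import time

class DisabledHosts:
    """Host-disable table with per-entry timeout and reason."""
    def __init__(self):
        self.entries = {}                 # host -> (expires_at, reason)

    def disable(self, host, timeout_secs, reason, now=None):
        now = time.time() if now is None else now
        self.entries[host] = (now + timeout_secs, reason)

    def is_disabled(self, host, now=None):
        now = time.time() if now is None else now
        entry = self.entries.get(host)
        if entry is None:
            return False
        if now >= entry[0]:               # timed out: auto re-enable
            del self.entries[host]
            return False
        return True

    def reason(self, host):
        entry = self.entries.get(host)
        return entry[1] if entry else None
```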

-allow checking a service at a particular time of day, maybe using
 inPeriod.

-maybe make a command that will disable an alert for a certain amount
 of time

-make it possible to disable just one of multiple alarms in a service

-make a logging facility which forks and execs external logging
 daemons and writes to them via some ipc such as unix domain socket.
 mon should be sure that one of each type of these loggers is running
 at all times. configure the logging either globally or for each
 service. write both the success and failure status to the log in
 some "list opstatus" type format. each logger can do as it wishes
 with the data (e.g. stuff it into rrdtool, mysql, cat it to a file, etc.)


    # global setting
    logger = file

    watch stuff
	service http
	    logger file -p _LOGDIR_
	    ...
	service fping
	    # this will use the global logger setting
	    ...
	service
	    # this will override the global logger setting
	    logger none
	    ...


 common options to logger:
    -d dir	path to logging dir
    -f file	name of log file
    -g, -s	group, service
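A Python sketch of the fork/ipc part of that logging facility (the real
loggers would be external daemons that mon execs and supervises; this
stand-in just forks a child that appends lines from a unix socketpair):

```python
import os
import socket

def spawn_logger(log_path):
    """Fork a logger child connected by a unix-domain socketpair.
    The parent writes "list opstatus"-style lines to the returned
    socket; the child appends them to log_path. A real version would
    exec an external logger and restart it if it dies."""
    parent_sock, child_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:                          # child: the logger
        parent_sock.close()
        with open(log_path, "ab") as log, child_sock.makefile("rb") as lines:
            for line in lines:            # one opstatus record per line
                log.write(line)
                log.flush()
        os._exit(0)
    child_sock.close()
    return pid, parent_sock
```

mon would keep one such pid/socket pair per configured logger type and
re-spawn on SIGCHLD, so one of each logger is always running.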

-----------
notes on a v2 protocol redesign from trockij

- Configuring on a hostgroup scheme works very well. In the beginning, mon was
  never intended to get this complex(tm), it was intended to be a tool
  where it was easy to whip up custom monitoring scripts and alert scripts
  and plug them into a framework which allowed them all to connect to each
  other, and to have a way to easily build custom clients and report
  generators as well.

- However, per host status is needed now.

- This requires changes to both mon itself and also the monitors / alerts.
  
  Backward compatibility is important, and KISS is very important to
  retain the ease with which one can whip up a new monitor or alert or reporting
  client.

- There will be a new protocol for communicating with the monitors / alerts,
  which will be masked by a Mon::Monitor / Mon::Alert module in Perl.
  Appropriate shell functions will be provided by the first one who asks.
  See below for the protocol.

- We still want to retain the benefits of the old behaviour, but extend
  some alert management features, such as the ability to liberate
  alert definitions from the service periods so they can be used globally.

- The server code might be broken up into multiple files (I/O routines, config
  parser, related parts, etc)

- monitors can communicate better with the alerts (see below). For example,
  the monitor might hint to mail.alert (using "a_mail_list") about where else
  to send a warning that a user dir goes over quota.
  (Attention should be paid to privacy so that we don't accidentally inform
  all users that /home/foo/how-i-will-destroy-western-civilization/
  is consuming 1GB too much space ;)

- Associations: these allow monitors to communicate details
  about failures back to the server which can be used to specify who
  to alert.

  The associations are based on key/value pairs specified in the
  association config file, and are expanded on the alert command line
  (or possibly within the alert protocol) if "@assoc-*" is in the
  configuration. If a host assoc. is needed, an alert spec will look like:

    alert mail.alert admin@xyz.com @assoc-host

  There are two association types (possibly more in the future): host
  associations, and user-defined associations.  Host associations use the
  "assoc-host" specifier, and map one or more usernames to an individual
  host. User-defined associations are just that, and begin with the
  "assoc-u-" specifier.

  Monitors return associations via the "assoc-*" key in the monitor
  protocol.

  Alerts obtain association information either via command-line arguments
  which were expanded by the server from "@assoc-*" in the config file,
  or via the "assoc-*" key in the alert protocol.
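  A Python sketch of the command-line expansion side (the exact expansion
  semantics aren't pinned down above, so this is one plausible reading:
  "@assoc-host" expands to the usernames associated with each failed host):

```python
def expand_assocs(alert_args, failed_hosts, host_assocs):
    """Expand "@assoc-host" in an alert's argument list.
    host_assocs maps host -> list of usernames, as loaded from the
    association config file."""
    expanded = []
    for arg in alert_args:
        if arg == "@assoc-host":
            for host in failed_hosts:
                expanded.extend(host_assocs.get(host, []))
        else:
            expanded.append(arg)          # ordinary argument, untouched
    return expanded
```

  So "alert mail.alert admin@xyz.com @assoc-host" with one failed host
  mapped to user lmb would invoke mail.alert with admin@xyz.com and lmb.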

- Metrics are only passed to the mon server for "monitoring" purposes, but can
  be marked up in such a way that they could be easily piped to a logging
  utility, one which is not part of the mon process itself.
  monitors are _encouraged_ to collect and report performance data.

  "Failures" are basically just a conclusion based upon performance data and
  it makes no sense to collect the data twice, e.g. if you have mon polling
  ifInOctets.0 on a system, why should mrtg have to poll on its own?

  It may be desirable to propose a "unified logging system" which all
  monitors can easily use, something which is pluggable and extensible.

- The hostgroup syntax is going to be extended to add per-host options (which
  will be passed to the monitors / alerts using the new protocol). For example,
  ns1.teuto.net( fs="/(80%,90%)",mail_list="lmb@teuto.net" )
  would be passed as h_fs="/(80%,90%)" and h_mail_list="lmb@teuto.net".
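  A Python sketch of parsing that extended entry into its h_ keys (the
  grammar isn't formally specified above, so this is a guess: quoted
  values, commas between options, values may contain commas/parens):

```python
import re

def parse_hostgroup_entry(entry):
    """Split 'host( key="value",key="value" )' into (host, h_ options).
    A plain hostname with no parenthesized options is also accepted."""
    m = re.match(r'\s*([\w.-]+)\s*(?:\(\s*(.*?)\s*\))?\s*$', entry)
    host, opts_str = m.group(1), m.group(2)
    opts = {}
    if opts_str:
        # Quoted values may contain commas and parens, e.g. "/(80%,90%)",
        # so match key="..." pairs rather than splitting on commas.
        for key, value in re.findall(r'(\w+)="([^"]*)"', opts_str):
            opts["h_" + key] = value
    return host, opts
```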
  
FLOATING MONITORS

A floating monitor is started by mon and remains running for the entire time.
If it dies, it is automatically restarted.

The server forks off a separate process for fping and communicates with
it via some IPC, like a named pipe or a socket or something. The floating
monitor sits there waiting for a message from the server that says "start
checking now". The server then adds this descriptor to %fhandles and %running
and treats it similarly to other forked monitors. When the floating monitor is
done, it spits its output back to the server and then goes dormant again,
awaiting another message from the server. Floating monitors are started
when mon starts, and are restarted if mon notices that they go away. This
is a way to save on fork() overhead.
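A Python stand-in sketch of the floating-monitor idea (the real server and
fping glue are perl; the "start" message format here is invented): the
child is forked once, sleeps on the socketpair, runs one check per "start"
message, and goes dormant again.

```python
import os
import socket

def start_floating_monitor(monitor_fn):
    """Fork a long-lived monitor child. It waits for "start\n" from the
    server, runs one check via monitor_fn, writes the result back, and
    goes dormant again -- one fork for the monitor's whole lifetime."""
    server_sock, monitor_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:                              # child: the floating monitor
        server_sock.close()
        f = monitor_sock.makefile("rwb", buffering=0)
        for line in f:                        # dormant until a message arrives
            if line.strip() == b"start":
                f.write(monitor_fn().encode() + b"\n")
        os._exit(0)                           # server closed: shut down
    monitor_sock.close()
    return pid, server_sock                   # server adds this fd to its loop
```

The server would put the returned descriptor into %fhandles and %running
while a check is outstanding, and restart the child if waitpid reaps it.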

PROTOCOL

The protocol will be simple and ASCII based, in the form of "key=value". Line
continuation will be provided by prefixing following lines with a ">". A "\n"
on a line by itself indicates the start of a new block.

The order of the keys should not be important.

The first block will always contain metadata further defining the following
blocks. The "version" key is always present.

The current protocol version is "1".

(In the examples, everything after a "#" is a comment and should be cut out)

KEY CONVENTIONS

Keys only private to monitors will be prefixed with an "m_". In the same
vein, keys private to alerts will be prefixed with an "a_", and additional
host option keys specified in the mon.cf file will be prefixed with a "h_"
before being passed to monitors/alerts.

By convention, flags only pertaining to a specific alert will embed that name
in the key name too - i.e. keys only pertaining to "mail.alert" will start with
"a_mail_".

The key/value pairs will be passed to all processes for a specific service.
"h_" keys are static between invocations as they come from the mon.cf file. "m_"
keys will be preserved between multiple monitor executions. "a_" keys will be
passed from the monitor to the alert script.


MONITOR PROTOCOL (monitor -> mon)

The metadata block is followed by a block describing the overall hostgroup
status, followed by a detailed status for each host.

The following keys are defined for the blocks:
"summary" = contains a one line short summary of the status.
"status"  = up, fail, ignore
"metric_1"  = an opaque floating point number which can be referenced for
            triggering alerts. May try to give an "operational percentage".
	    More than one metric may be returned.
	    (Ping rtt, packet loss, disk space etc)
"description" = longer elaborate description of the current status.
"host"        = hostgroup member to which this status applies. The overall
                hostgroup status does not include this field.
"assoc-host"  = host association
"assoc-u-*"   = user-defined association

Here is an example for a hypothetical hostgroup with 2 hosts and the ping
service.

###
version=1

summary=Still alive.
metric_1=50 # Packetloss
metric_2=20.23 # rtt times
description=1 out of 2 hosts still responding.
> Whatever else one might want to say about the status. It is difficult to
> come up with a good text here so I will just babble.
status=up

host=foo.bar.com
metric_1=100
metric_2=0 # 100% packet loss make rtt measurements difficult ;)
summary=ICMP unreachable from 2.2.2.2
status=fail
description=PING 2.2.2.2 (2.2.2.2): 56 data bytes
>
>--- 2.2.2.2 ping statistics ---
>23 packets transmitted, 0 packets received, 100% packet loss

metric_1=0
metric_2=52.1
summary=ICMP echo reply received ok
status=up
description=64 bytes from 212.8.197.2: icmp_seq=0 ttl=60 time=110.0 ms
>64 bytes from 212.8.197.2: icmp_seq=1 ttl=60 time=32.3 ms
>64 bytes from 212.8.197.2: icmp_seq=2 ttl=60 time=32.8 ms
>64 bytes from 212.8.197.2: icmp_seq=3 ttl=60 time=33.4 ms
>
>--- ns1.teuto.net ping statistics ---
>4 packets transmitted, 4 packets received, 0% packet loss
>round-trip min/avg/max = 32.3/52.1/110.0 ms
host=baz.bar.com
######
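A Python sketch of a parser for the block format above (the comment
stripping is only there to handle the "#" comments used in the example;
a real parser would omit it):

```python
def parse_blocks(text):
    """Parse key=value blocks: a blank line starts a new block, a line
    beginning with ">" continues the previous key's value, and the key
    order within a block does not matter."""
    blocks, current, last_key = [], {}, None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()   # cut example comments
        if not line:                           # blank line: next block
            if current:
                blocks.append(current)
                current, last_key = {}, None
            continue
        if line.startswith(">"):               # continuation line
            current[last_key] += "\n" + line[1:]
        else:
            key, _, value = line.partition("=")
            current[key] = value
            last_key = key
    if current:
        blocks.append(current)
    return blocks
```

The first block returned is the metadata block (so it must carry
"version"); the rest are the overall-hostgroup block and the per-host
blocks, distinguished by the presence of the "host" key.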


Points still open:
- mon -> monitor communication

- mon <-> alert communication

- the new trap protocol

- muxpect

- a unified logging proposal