Sophie: heartbeat-2.1.3-7.3mdv2010.0 x86

heartbeat-2.1.3-7.3mdv2010.0.x86_64.rpm

Heartbeat reporting
===================
Dejan Muhamedagic <dmuhamedagic@suse.de>
v1.0

`hb_report` is a utility to collect all information relevant to
Heartbeat over the given period of time.

Quick start
-----------

Run `hb_report` on one of the nodes or on the host which serves as
a central log server. Run `hb_report` without parameters to see usage.

A few examples:

1. Last night during the backup there were several warnings
encountered (logserver is the log host):
+
	logserver# hb_report -f 3:00 -t 4:00 /tmp/report
+
collects everything from all nodes from 3am to 4am last night.
The files are stored in /tmp/report and compressed to a tarball
/tmp/report.tar.gz.

2. Just found a problem during testing:

	node1# date : note the current time
	node1# /etc/init.d/heartbeat start
	node1# nasty_command_that_breaks_things
	node1# sleep 120 : wait for the cluster to settle
	node1# hb_report -f time /tmp/hb1

Introduction
------------

Managing clusters is cumbersome. Heartbeat v2 with its numerous
configuration files and multi-node clusters just adds to the
complexity. No wonder then that most problem reports were less
than optimal. This is an attempt to rectify that situation and
make life easier for both the users and the developers.

On security
-----------

`hb_report` is a fairly complex program. As some of you are
probably going to run it as `root` let us state a few important
things you should keep in mind:

1. Don't run `hb_report` as `root`! It is fairly simple to setup
things in such a way that root access is not needed. I won't go
into details, just to stress that all information collected
should be readable by accounts belonging the haclient group.

2. If you still have to run this as root. Well, don't use the
`-C` option.

3. Of course, every possible precaution has been taken not to
disturb processes, or touch or remove files out of the given
destination directory. If you (by mistake) specify an existing
directory, `hb_report` will bail out soon. If you specify a
relative path, it won't work either.

The final product of `hb_report` is a tarball. However, the
destination directory is not removed on any node, unless the user
specifies `-C`. If you're too lazy to cleanup the previous run,
do yourself a favour and just supply a new destination directory.
You've been warned. If you worry about the space used, just put
all your directories under `/tmp` and setup a cronjob to remove
those directories once a week:
..........
	for d in /tmp/*; do
		test -d $d ||
			continue
		test -f $d/description.txt || test -f $d/.env ||
			continue
		grep -qs 'By: hb_report' $d/description.txt ||
			grep -qs '^UNIQUE_MSG=Mark' $d/.env ||
			continue
		rm -r $d
	done
..........

Mode of operation
-----------------

Cluster data collection is straightforward: just run the same
procedure on all nodes and collect the reports. There is,
apart from many small ones, one large complication: central
syslog destination. So, in order to allow this to be fully
automated, we should sometimes run the procedure on the log host
too. Actually, if there is a log host, then the best way is to
run `hb_report` there.

We use `ssh` for the remote program invocation. Even though it is
possible to run `hb_report` without ssh by doing a more menial job,
the overall user experience is much better if ssh works. Anyway,
how else do you manage your cluster?

Another ssh related point: In case your security policy
proscribes loghost-to-cluster-over-ssh communications, then
you'll have to copy the log file to one of the nodes and point
`hb_report` to it.

Prerequisites
-------------

1. ssh
+
This is not strictly required, but you won't regret having a
password-less ssh. It is not too difficult to setup and will save
you a lot of time. If you can't have it, for example because your
security policy does not allow such a thing, or you just prefer
menial work, then you will have to resort to the semi-manual
semi-automated report generation. See below for instructions.
+
If you need to supply a password for your passphrase/login, then
please use the `-u` option.

2. Times
+
In order to find files and messages in the given period and to
parse the `-f` and `-t` options, `hb_report` uses perl and one of the
`Date::Parse` or `Date::Manip` perl modules. Note that you need
only one of these. Furthermore, on nodes which have no logs and
where you don't run `hb_report` directly, no date parsing is
necessary. In other words, if you run this on a loghost then you
don't need these perl modules on the cluster nodes.
+
On rpm based distributions, you can find `Date::Parse` in
`perl-TimeDate` and on Debian and its derivatives in
`libtimedate-perl`.

3. Core dumps
+
To backtrace core dumps `gdb` is needed and the Heartbeat packages
with the debugging info. The debug info packages may be installed
at the time the report is created. Let's hope that you will need
this really seldom.

What is in the report
---------------------

1. Heartbeat related
- heartbeat version/release information
- heartbeat configuration (CIB, ha.cf, logd.cf)
- heartbeat status (output from crm_mon, crm_verify, ccm_tool)
- pengine transition graphs (if any)
- backtraces of core dumps (if any)
- heartbeat logs (if any)
2. System related
- general platform information (`uname`, `arch`, `distribution`)
- system statistics (`uptime`, `top`, `ps`, `netstat -i`, `arp`)
3. User created :)
- problem description (template to be edited)
4. Generated
- problem analysis (generated)

It is preferred that the Heartbeat is running at the time of the
report, but not absolutely required. `hb_report` will also do a
quick analysis of the collected information.

Times
-----

Specifying times can at times be a nuisance. That is why we have
chosen to use one of the perl modules--they do allow certain
freedom when talking dates. You can either read the instructions
at the
http://search.cpan.org/dist/TimeDate/lib/Date/Parse.pm#EXAMPLE_DATES[Date::Parse
examples page].

or just rely on common sense and try stuff like:

	3:00          (today at 3am)
	15:00         (today at 3pm)
	2007/9/1 2pm  (September 1st at 2pm)

`hb_report` will (probably) complain if it can't figure out what do
you mean.

Try to delimit the event as close as possible in order to reduce
the size of the report, but still leaving a minute or two around
for good measure.

Note that `-f` is not an optional option. And don't forget to quote
dates when they contain spaces.

It is also possible to extract a CTS test. Just prefix the test
number with `cts:` in the `-f` option.

Should I send all this to the rest of Internet?
-----------------------------------------------

We make an effort to remove sensitive data from the Heartbeat
configuration (CIB, ha.cf, and transition graphs). However, you
_have_ to tell us what is sensitive! Use the `-p` option to specify
additional regular expressions to match variable names which may
contain information you don't want to leak. For example:

	# hb_report -f 18:00 -p "user.*" -p "secret.*" /var/tmp/report

We look by default for variable names matching "pass.*" and the
stonith_host ha.cf directive.

Logs and other files are not filtered. Please filter them
yourself if necessary.

Logs
----

It may be tricky to find syslog logs. The scheme used is to log a
unique message on all nodes and then look it up in the usual
syslog locations. This procedure is not foolproof, in particular
if the syslog files are in a non-standard directory. We look in
/var/log /var/logs /var/syslog /var/adm /var/log/ha
/var/log/cluster. In case we can't find the logs, please supply
their location:

	# hb_report -f 5pm -l /var/log/cluster1/ha-log -S /tmp/report_node1

If you have different log locations on different nodes, well,
perhaps you'd like to make them the same and make life easier for
everybody.

The log files are collected from all hosts where found. In case
your syslog is configured to log to both the log server and local
files and `hb_report` is run on the log server you will end up with
multiple logs with same content.

Files starting with "ha-" are preferred. In case syslog sends
messages to more than one file, if one of them is named ha-log or
ha-debug those will be favoured to syslog or messages.

If there is no separate log for Heartbeat, possibly unrelated
messages from other programs are included. We don't filter logs,
just pick a segment for the period you specified.

NB: Don't have a central log host? Read the CTS README and setup
one.

Manual report collection
------------------------

So, your ssh doesn't work. In that case, you will have to run
this procedure on all nodes. Use `-S` so that we don't bother with
ssh:

	# hb_report -f 5:20pm -t 5:30pm -S /tmp/report_node1

If you also have a log host which is not in the cluster, then
you'll have to copy the log to one of the nodes and tell us where
it is:

	# hb_report -f 5:20pm -t 5:30pm -l /var/tmp/ha-log -S /tmp/report_node1

Furthermore, to prevent `hb_report` from asking you to edit the
report to describe the problem on every node use `-D` on all but
one:

	# hb_report -f 5:20pm -t 5:30pm -DS /tmp/report_node1

If you reconsider and want the ssh setup, take a look at the CTS
README file for instructions.

Analysis
--------

The point of analysis is to get out the most important
information from probably several thousand lines worth of text.
Perhaps this should be more properly named as report review as it
is rather simple, but let's pretend that we are doing something
utterly sophisticated.

The analysis consists of the following:

- compare files coming from different nodes; if they are equal,
  make one copy in the top level directory, remove duplicates,
  and create soft links instead
- print errors, warnings, and lines matching `-L` patterns from logs
- report if there were coredumps and by whom
- report crm_verify results

The goods
---------

1. Common
+
- ha-log (if found on the log host)
- description.txt (template and user report)
- analysis.txt

2. Per node
+
- ha.cf
- logd.cf
- ha-log (if found)
- cib.xml (`cibadmin -Ql` or `cp` if Heartbeat is not running)
- ccm_tool.txt (`ccm_tool -p`)
- crm_mon.txt (`crm_mon -1`)
- crm_verify.txt (`crm_verify -V`)
- pengine/ (only on DC, directory with pengine transitions)
- sysinfo.txt (static info)
- sysstats.txt (dynamic info)
- backtraces.txt (if coredumps found)
- DC (well...)
- RUNNING or STOPPED