<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.2.1" />
<style type="text/css">
/* Debug borders */
p, li, dt, dd, div, pre, h1, h2, h3, h4, h5, h6 {
/*
  border: 1px solid red;
*/
}

body {
  margin: 1em 5% 1em 5%;
}

a {
  color: blue;
  text-decoration: underline;
}
a:visited {
  color: fuchsia;
}

em {
  font-style: italic;
}

strong {
  font-weight: bold;
}

tt {
  color: navy;
}

h1, h2, h3, h4, h5, h6 {
  color: #527bbd;
  font-family: sans-serif;
  margin-top: 1.2em;
  margin-bottom: 0.5em;
  line-height: 1.3;
}

h1 {
  border-bottom: 2px solid silver;
}
h2 {
  border-bottom: 2px solid silver;
  padding-top: 0.5em;
}

div.sectionbody {
  font-family: serif;
  margin-left: 0;
}

hr {
  border: 1px solid silver;
}

p {
  margin-top: 0.5em;
  margin-bottom: 0.5em;
}

pre {
  padding: 0;
  margin: 0;
}

span#author {
  color: #527bbd;
  font-family: sans-serif;
  font-weight: bold;
  font-size: 1.1em;
}
span#email {
}
span#revision {
  font-family: sans-serif;
}

div#footer {
  font-family: sans-serif;
  font-size: small;
  border-top: 2px solid silver;
  padding-top: 0.5em;
  margin-top: 4.0em;
}
div#footer-text {
  float: left;
  padding-bottom: 0.5em;
}
div#footer-badges {
  float: right;
  padding-bottom: 0.5em;
}

div#preamble,
div.tableblock, div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
  margin-right: 10%;
  margin-top: 1.5em;
  margin-bottom: 1.5em;
}
div.admonitionblock {
  margin-top: 2.5em;
  margin-bottom: 2.5em;
}

div.content { /* Block element content. */
  padding: 0;
}

/* Block element titles. */
div.title, caption.title {
  font-family: sans-serif;
  font-weight: bold;
  text-align: left;
  margin-top: 1.0em;
  margin-bottom: 0.5em;
}
div.title + * {
  margin-top: 0;
}

td div.title:first-child {
  margin-top: 0.0em;
}
div.content div.title:first-child {
  margin-top: 0.0em;
}
div.content + div.title {
  margin-top: 0.0em;
}

div.sidebarblock > div.content {
  background: #ffffee;
  border: 1px solid silver;
  padding: 0.5em;
}

div.listingblock {
  margin-right: 0%;
}
div.listingblock > div.content {
  border: 1px solid silver;
  background: #f4f4f4;
  padding: 0.5em;
}

div.quoteblock > div.content {
  padding-left: 2.0em;
}

div.attribution {
  text-align: right;
}
div.verseblock + div.attribution {
  text-align: left;
}

div.admonitionblock .icon {
  vertical-align: top;
  font-size: 1.1em;
  font-weight: bold;
  text-decoration: underline;
  color: #527bbd;
  padding-right: 0.5em;
}
div.admonitionblock td.content {
  padding-left: 0.5em;
  border-left: 2px solid silver;
}

div.exampleblock > div.content {
  border-left: 2px solid silver;
  padding: 0.5em;
}

div.verseblock div.content {
  white-space: pre;
}

div.imageblock div.content { padding-left: 0; }
div.imageblock img { border: 1px solid silver; }
span.image img { border-style: none; }

dl {
  margin-top: 0.8em;
  margin-bottom: 0.8em;
}
dt {
  margin-top: 0.5em;
  margin-bottom: 0;
  font-style: italic;
}
dd > *:first-child {
  margin-top: 0;
}

ul, ol {
    list-style-position: outside;
}
ol.olist2 {
  list-style-type: lower-alpha;
}

div.tableblock > table {
  border: 3px solid #527bbd;
}
thead {
  font-family: sans-serif;
  font-weight: bold;
}
tfoot {
  font-weight: bold;
}

div.hlist {
  margin-top: 0.8em;
  margin-bottom: 0.8em;
}
div.hlist td {
  padding-bottom: 5px;
}
td.hlist1 {
  vertical-align: top;
  font-style: italic;
  padding-right: 0.8em;
}
td.hlist2 {
  vertical-align: top;
}

@media print {
  div#footer-badges { display: none; }
}

div#toctitle {
  color: #527bbd;
  font-family: sans-serif;
  font-size: 1.1em;
  font-weight: bold;
  margin-top: 1.0em;
  margin-bottom: 0.1em;
}

div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
  margin-top: 0;
  margin-bottom: 0;
}
div.toclevel2 {
  margin-left: 2em;
  font-size: 0.9em;
}
div.toclevel3 {
  margin-left: 4em;
  font-size: 0.9em;
}
div.toclevel4 {
  margin-left: 6em;
  font-size: 0.9em;
}
/* Workarounds for IE6's broken and incomplete CSS2. */

div.sidebar-content {
  background: #ffffee;
  border: 1px solid silver;
  padding: 0.5em;
}
div.sidebar-title, div.image-title {
  font-family: sans-serif;
  font-weight: bold;
  margin-top: 0.0em;
  margin-bottom: 0.5em;
}

div.listingblock div.content {
  border: 1px solid silver;
  background: #f4f4f4;
  padding: 0.5em;
}

div.quoteblock-content {
  padding-left: 2.0em;
}

div.exampleblock-content {
  border-left: 2px solid silver;
  padding-left: 0.5em;
}

/* IE6 sets dynamically generated links as visited. */
div#toc a:visited { color: blue; }
</style>
<title>Heartbeat reporting</title>
</head>
<body>
<div id="header">
<h1>Heartbeat reporting</h1>
<span id="author">Dejan Muhamedagic</span><br />
<span id="email"><tt>&lt;<a href="mailto:dmuhamedagic@suse.de">dmuhamedagic@suse.de</a>&gt;</tt></span><br />
v1.0
</div>
<div id="preamble">
<div class="sectionbody">
<p><tt>hb_report</tt> is a utility to collect all information relevant to
Heartbeat over the given period of time.</p>
</div>
</div>
<h2>Quick start</h2>
<div class="sectionbody">
<p>Run <tt>hb_report</tt> on one of the nodes or on the host which serves as
a central log server. Run <tt>hb_report</tt> without parameters to see usage.</p>
<p>A few examples:</p>
<ol>
<li>
<p>
Last night during the backup several warnings were
encountered (logserver is the log host):
</p>
<div class="literalblock">
<div class="content">
<pre><tt>logserver# hb_report -f 3:00 -t 4:00 /tmp/report</tt></pre>
</div></div>
<p>This collects everything from all nodes between 3am and 4am last
night. The files are stored in <tt>/tmp/report</tt> and compressed into
the tarball <tt>/tmp/report.tar.gz</tt>.</p>
</li>
<li>
<p>
Just found a problem during testing:
</p>
<div class="literalblock">
<div class="content">
<pre><tt>node1# date : note the current time
node1# /etc/init.d/heartbeat start
node1# nasty_command_that_breaks_things
node1# sleep 120 : wait for the cluster to settle
node1# hb_report -f time /tmp/hb1</tt></pre>
</div></div>
</li>
</ol>
</div>
<h2>Introduction</h2>
<div class="sectionbody">
<p>Managing clusters is cumbersome. Heartbeat v2 with its numerous
configuration files and multi-node clusters just adds to the
complexity. No wonder then that most problem reports were less
than optimal. This is an attempt to rectify that situation and
make life easier for both the users and the developers.</p>
</div>
<h2>On security</h2>
<div class="sectionbody">
<p><tt>hb_report</tt> is a fairly complex program. As some of you are
probably going to run it as <tt>root</tt>, let us state a few important
things you should keep in mind:</p>
<ol>
<li>
<p>
Don't run <tt>hb_report</tt> as <tt>root</tt>! It is fairly simple to set
things up in such a way that root access is not needed. I won't go
into details, other than to stress that all information collected
should be readable by accounts belonging to the <tt>haclient</tt> group.
</li>
<li>
<p>
If you still have to run this as <tt>root</tt>, then at least don't use
the <tt>-C</tt> option.
</li>
<li>
<p>
Of course, every possible precaution has been taken not to
disturb processes, or to touch or remove files outside the given
destination directory. If you (by mistake) specify an existing
directory, <tt>hb_report</tt> will bail out early. If you specify a
relative path, it won't work either.
</p>
</li>
</ol>
<p>The final product of <tt>hb_report</tt> is a tarball. However, the
destination directory is not removed on any node unless the user
specifies <tt>-C</tt>. If you're too lazy to clean up after the previous
run, do yourself a favour and just supply a new destination directory.
You've been warned. If you worry about the space used, just put
all your directories under <tt>/tmp</tt> and set up a cron job to remove
them once a week:</p>
<div class="literalblock">
<div class="content">
<pre><tt>        for d in /tmp/*; do
                test -d "$d" ||
                        continue
                test -f "$d"/description.txt || test -f "$d"/.env ||
                        continue
                grep -qs 'By: hb_report' "$d"/description.txt ||
                        grep -qs '^UNIQUE_MSG=Mark' "$d"/.env ||
                        continue
                rm -r "$d"
        done</tt></pre>
</div></div>
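<p>A crontab entry to drive such a cleanup once a week could look like
the following (the script path is a hypothetical name for the loop
above saved as an executable file):</p>

```crontab
# Remove stale hb_report directories every Sunday at 4:00
0 4 * * 0  /usr/local/bin/cleanup_hb_reports
```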
</div>
<h2>Mode of operation</h2>
<div class="sectionbody">
<p>Cluster data collection is straightforward: just run the same
procedure on all nodes and collect the reports. There is,
apart from many small complications, one large one: a central
syslog destination. So, in order for collection to be fully
automated, we sometimes have to run the procedure on the log host
too. In fact, if there is a log host, the best approach is to
run <tt>hb_report</tt> there.</p>
<p>We use <tt>ssh</tt> for the remote program invocation. Even though it is
possible to run <tt>hb_report</tt> without ssh by doing a more menial job,
the overall user experience is much better if ssh works. Anyway,
how else do you manage your cluster?</p>
<p>Another ssh-related point: in case your security policy forbids
ssh communication from the log host to the cluster nodes,
you'll have to copy the log file to one of the nodes and point
<tt>hb_report</tt> to it.</p>
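<p>If you decide to set up password-less ssh, a minimal sketch follows
(assuming OpenSSH; the key path and the node name <tt>node2</tt> are
placeholders for your environment, not anything <tt>hb_report</tt>
requires):</p>

```shell
# Generate a key without a passphrase in a scratch directory.
KEYDIR=$(mktemp -d)
KEY="$KEYDIR/id_rsa_hbreport"
ssh-keygen -t rsa -N "" -f "$KEY" -q
# Distribute the public key to every cluster node, e.g.:
# ssh-copy-id -i "$KEY.pub" node2
```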
</div>
<h2>Prerequisites</h2>
<div class="sectionbody">
<ol>
<li>
<p>
ssh
</p>
<p>This is not strictly required, but you won't regret having
password-less ssh. It is not too difficult to set up and will save
you a lot of time. If you can't have it, for example because your
security policy does not allow such a thing, or you just prefer
menial work, then you will have to resort to the semi-manual,
semi-automated report generation. See below for instructions.</p>
<p>If you need to supply a password for your passphrase/login, then
please use the <tt>-u</tt> option.</p>
</li>
<li>
<p>
Times
</p>
<p>In order to find files and messages in the given period and to
parse the <tt>-f</tt> and <tt>-t</tt> options, <tt>hb_report</tt> uses perl and one of the
<tt>Date::Parse</tt> or <tt>Date::Manip</tt> perl modules. Note that you need
only one of these. Furthermore, on nodes which have no logs and
where you don't run <tt>hb_report</tt> directly, no date parsing is
necessary. In other words, if you run this on a loghost then you
don't need these perl modules on the cluster nodes.</p>
<p>On rpm based distributions, you can find <tt>Date::Parse</tt> in
<tt>perl-TimeDate</tt> and on Debian and its derivatives in
<tt>libtimedate-perl</tt>.</p>
</li>
<li>
<p>
Core dumps
</p>
<p>To produce backtraces of core dumps, <tt>gdb</tt> is needed along with
the Heartbeat debug info packages. The debug info packages may be
installed at the time the report is created. Let's hope that you will
need this only seldom.</p>
</li>
</ol>
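<p>A quick way to check whether one of the required date-parsing perl
modules is present (a convenience sketch, not part of
<tt>hb_report</tt> itself):</p>

```shell
# Report availability of the perl date modules hb_report can use.
check_date_modules() {
    for mod in Date::Parse Date::Manip; do
        if perl -M"$mod" -e 'exit 0' 2>/dev/null; then
            echo "$mod: available"
        else
            echo "$mod: missing"
        fi
    done
}
check_date_modules
```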
</div>
<h2>What is in the report</h2>
<div class="sectionbody">
<ol>
<li>
<p>
Heartbeat related
</p>
<ul>
<li>
<p>
heartbeat version/release information
</p>
</li>
<li>
<p>
heartbeat configuration (CIB, ha.cf, logd.cf)
</p>
</li>
<li>
<p>
heartbeat status (output from crm_mon, crm_verify, ccm_tool)
</p>
</li>
<li>
<p>
pengine transition graphs (if any)
</p>
</li>
<li>
<p>
backtraces of core dumps (if any)
</p>
</li>
<li>
<p>
heartbeat logs (if any)
</p>
</li>
</ul>
</li>
<li>
<p>
System related
</p>
<ul>
<li>
<p>
general platform information (<tt>uname</tt>, <tt>arch</tt>, <tt>distribution</tt>)
</p>
</li>
<li>
<p>
system statistics (<tt>uptime</tt>, <tt>top</tt>, <tt>ps</tt>, <tt>netstat -i</tt>, <tt>arp</tt>)
</p>
</li>
</ul>
</li>
<li>
<p>
User created :)
</p>
<ul>
<li>
<p>
problem description (template to be edited)
</p>
</li>
</ul>
</li>
<li>
<p>
Generated
</p>
<ul>
<li>
<p>
problem analysis (generated)
</p>
</li>
</ul>
</li>
</ol>
<p>It is preferred that Heartbeat is running at the time of the
report, but this is not absolutely required. <tt>hb_report</tt> will
also do a quick analysis of the collected information.</p>
</div>
<h2>Times</h2>
<div class="sectionbody">
<p>Specifying times can at times be a nuisance. That is why we have
chosen to use one of the perl modules&#8212;they do allow certain
freedom when talking dates. You can either read the instructions
at the
<a href="http://search.cpan.org/dist/TimeDate/lib/Date/Parse.pm#EXAMPLE_DATES">Date::Parse
examples page</a> or just rely on common sense and try stuff like:</p>
<div class="literalblock">
<div class="content">
<pre><tt>3:00          (today at 3am)
15:00         (today at 3pm)
2007/9/1 2pm  (September 1st at 2pm)</tt></pre>
</div></div>
<p><tt>hb_report</tt> will (probably) complain if it can't figure out
what you mean.</p>
<p>Try to delimit the event as closely as possible in order to reduce
the size of the report, while still leaving a minute or two around
it for good measure.</p>
<p>Note that the <tt>-f</tt> option is required. And don't forget to
quote dates when they contain spaces.</p>
<p>It is also possible to extract a CTS test. Just prefix the test
number with <tt>cts:</tt> in the <tt>-f</tt> option.</p>
</div>
<h2>Should I send all this to the rest of Internet?</h2>
<div class="sectionbody">
<p>We make an effort to remove sensitive data from the Heartbeat
configuration (CIB, ha.cf, and transition graphs). However, you
<em>have</em> to tell us what is sensitive! Use the <tt>-p</tt> option to specify
additional regular expressions to match variable names which may
contain information you don't want to leak. For example:</p>
<div class="literalblock">
<div class="content">
<pre><tt># hb_report -f 18:00 -p "user.*" -p "secret.*" /var/tmp/report</tt></pre>
</div></div>
<p>By default we look for variable names matching "pass.*" and for the
<tt>stonith_host</tt> <tt>ha.cf</tt> directive.</p>
<p>Logs and other files are not filtered. Please filter them
yourself if necessary.</p>
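<p>To illustrate the kind of substitution involved (a simplified
sketch, not <tt>hb_report</tt>'s actual filter), values of variable
names matching a pattern can be blanked out like this:</p>

```shell
# Replace the value of any variable whose name matches pass.* with asterisks.
sanitize() {
    sed 's/\(pass[A-Za-z0-9_]*\)=[^ ]*/\1=****/g'
}
echo 'user=joe password=secret41 port=22' | sanitize
# → user=joe password=**** port=22
```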
</div>
<h2>Logs</h2>
<div class="sectionbody">
<p>It may be tricky to find syslog logs. The scheme used is to log a
unique message on all nodes and then look it up in the usual
syslog locations. This procedure is not foolproof, in particular
if the syslog files are in a non-standard directory. We look in
<tt>/var/log</tt>, <tt>/var/logs</tt>, <tt>/var/syslog</tt>,
<tt>/var/adm</tt>, <tt>/var/log/ha</tt>, and <tt>/var/log/cluster</tt>.
In case we can't find the logs, please supply their location:</p>
<div class="literalblock">
<div class="content">
<pre><tt># hb_report -f 5pm -l /var/log/cluster1/ha-log -S /tmp/report_node1</tt></pre>
</div></div>
<p>If you have different log locations on different nodes, well,
perhaps you'd like to make them the same and make life easier for
everybody.</p>
<p>The log files are collected from all hosts where they are found. If
your syslog is configured to log both to the log server and to local
files, and <tt>hb_report</tt> is run on the log server, you will end up
with multiple logs with the same content.</p>
<p>Files whose names start with "ha-" are preferred. If syslog sends
messages to more than one file and one of them is named <tt>ha-log</tt>
or <tt>ha-debug</tt>, those will be favoured over <tt>syslog</tt> or
<tt>messages</tt>.</p>
<p>If there is no separate log for Heartbeat, possibly unrelated
messages from other programs are included. We don't filter logs,
just pick a segment for the period you specified.</p>
<p>NB: Don't have a central log host? Read the CTS README and setup
one.</p>
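<p>The lookup can be pictured like this (a simplified sketch of the
unique-message scheme, not the actual <tt>hb_report</tt> code;
<tt>find_logs</tt> is a hypothetical helper name):</p>

```shell
# List files under the given directories that contain the unique marker.
find_logs() {
    mark=$1; shift
    for d in "$@"; do
        [ -d "$d" ] && grep -ls "$mark" "$d"/* 2>/dev/null
    done
    return 0
}
# Usage, with the standard search path:
# find_logs "$unique_msg" /var/log /var/logs /var/syslog /var/adm \
#     /var/log/ha /var/log/cluster
```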
</div>
<h2>Manual report collection</h2>
<div class="sectionbody">
<p>So, your ssh doesn't work. In that case, you will have to run
this procedure on all nodes. Use <tt>-S</tt> so that we don't bother with
ssh:</p>
<div class="literalblock">
<div class="content">
<pre><tt># hb_report -f 5:20pm -t 5:30pm -S /tmp/report_node1</tt></pre>
</div></div>
<p>If you also have a log host which is not in the cluster, then
you'll have to copy the log to one of the nodes and tell us where
it is:</p>
<div class="literalblock">
<div class="content">
<pre><tt># hb_report -f 5:20pm -t 5:30pm -l /var/tmp/ha-log -S /tmp/report_node1</tt></pre>
</div></div>
<p>Furthermore, to prevent <tt>hb_report</tt> from asking you to edit the
report to describe the problem on every node use <tt>-D</tt> on all but
one:</p>
<div class="literalblock">
<div class="content">
<pre><tt># hb_report -f 5:20pm -t 5:30pm -DS /tmp/report_node1</tt></pre>
</div></div>
<p>If you reconsider and want the ssh setup, take a look at the CTS
README file for instructions.</p>
</div>
<h2>Analysis</h2>
<div class="sectionbody">
<p>The point of the analysis is to extract the most important
information from what is probably several thousand lines of text.
Perhaps this would more properly be called a report review, as it
is rather simple, but let's pretend that we are doing something
utterly sophisticated.</p>
<p>The analysis consists of the following:</p>
<ul>
<li>
<p>
compare files coming from different nodes; if they are equal,
  make one copy in the top level directory, remove duplicates,
  and create soft links instead
</p>
</li>
<li>
<p>
print errors, warnings, and lines matching <tt>-L</tt> patterns from logs
</p>
</li>
<li>
<p>
report if there were coredumps and by whom
</p>
</li>
<li>
<p>
report crm_verify results
</p>
</li>
</ul>
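<p>The duplicate-collapsing step can be sketched as follows (an
illustrative assumption of how it might work, not the actual
implementation; <tt>dedup</tt> is a hypothetical helper name):</p>

```shell
# If a file is identical in every node directory, keep a single copy at
# the top level and turn the per-node copies into relative symlinks.
dedup() {
    top=$1 file=$2; shift 2
    ref="$top/$1/$file"
    for node in "$@"; do
        cmp -s "$ref" "$top/$node/$file" || return 0   # differs: keep all
    done
    mv "$ref" "$top/$file"
    for node in "$@"; do
        rm -f "$top/$node/$file"
        ln -s "../$file" "$top/$node/$file"
    done
}
# Usage: dedup /tmp/report ha.cf node1 node2
```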
</div>
<h2>The goods</h2>
<div class="sectionbody">
<ol>
<li>
<p>
Common
</p>
<ul>
<li>
<p>
ha-log (if found on the log host)
</p>
</li>
<li>
<p>
description.txt (template and user report)
</p>
</li>
<li>
<p>
analysis.txt
</p>
</li>
</ul>
</li>
<li>
<p>
Per node
</p>
<ul>
<li>
<p>
ha.cf
</p>
</li>
<li>
<p>
logd.cf
</p>
</li>
<li>
<p>
ha-log (if found)
</p>
</li>
<li>
<p>
cib.xml (<tt>cibadmin -Ql</tt> or <tt>cp</tt> if Heartbeat is not running)
</p>
</li>
<li>
<p>
ccm_tool.txt (<tt>ccm_tool -p</tt>)
</p>
</li>
<li>
<p>
crm_mon.txt (<tt>crm_mon -1</tt>)
</p>
</li>
<li>
<p>
crm_verify.txt (<tt>crm_verify -V</tt>)
</p>
</li>
<li>
<p>
pengine/ (only on DC, directory with pengine transitions)
</p>
</li>
<li>
<p>
sysinfo.txt (static info)
</p>
</li>
<li>
<p>
sysstats.txt (dynamic info)
</p>
</li>
<li>
<p>
backtraces.txt (if coredumps found)
</p>
</li>
<li>
<p>
DC (well&#8230;)
</p>
</li>
<li>
<p>
RUNNING or STOPPED
</p>
</li>
</ul>
</li>
</ol>
</div>
<div id="footer">
<div id="footer-text">
Last updated 29-Nov-2007 16:12:02 CEST
</div>
</div>
</body>
</html>