<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <title>Preliminary Linux HA Hardware Installation Guide</title>
</head>
<body>

<h1>
Linux-HA Hardware Installation Guideline</h1>
<i><font size=-1>This document (c) 1999 Volker Wiegand
<a href="mailto:Volker.Wiegand@suse.de">&lt;Volker.Wiegand@suse.de&gt;</a></font></i>
<p>This document serves as the starting point to plan, execute, and verify
your hardware setup for a High Availability (HA) environment.
<h2>Contents</h2>
<OL>
	<LI><a href="#Introduction">Introduction</a>
	<LI><a href="#Hardware Requirements">Hardware Requirements</a>
	<OL>
		<LI><a href="#Minimum Installation">Minimum Installation</a>
		<LI><a href="#More Advanced Installation">More Advanced Installation</a>
		<LI><a href="#Fully Redundant Installation">Fully Redundant Installation</a>
	</OL>
	<LI><a href="#Hardware Setup and Test">Hardware Setup and Test</a>
	<OL>
		<LI><a href="#Serial Ports">Serial Ports</a>
		<LI><a href="#LAN Interfaces">LAN Interfaces</a>
		<LI><a href="#Other Devices">Other Devices</a>
	</OL>
	<LI><a href="#Troubleshooting">Troubleshooting</a>
	<LI><a href="#References">References</a>
</OL>
<h2>
<a NAME="Introduction"></a>Introduction</h2>
With the high stability Linux has reached, this Operating System is well
suited to be used for HA purposes. The Linux-HA project, based upon
Harald Milz's
<a href="http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html">HOWTO</a>
and Alan Robertson's Heartbeat code, provides the building
blocks for a professional solution.
<p>This document provides some advice on the initial planning, the
installation and cabling, and the test and verification of the overall
setup. We use the word takeover to mean transferring some kind of
server functionality from a broken entity to a sane one. In this context,
entities can be network adapters, computers, or something else. Our current
focus is to provide HA capabilities among PCs, which we will call "nodes"
from now on.
<h2>
<a NAME="Hardware Requirements"></a>Hardware Requirements</h2>
Since we want to be able to provide failover capabilities on the machine
level, we need at least two computers. Obvious, isn't it? In our current
setup, all we require is that they are running Linux.
No particular distribution is preferred (although most tests have been
carried out on RedHat and SuSE systems).
The minimum kernel version is
<B>[TODO: which one is it ???]</B>, although the software makes fairly minimal
demands on the OS.
<p>These two nodes have to be connected in some way to exchange status
information and to monitor each other. The more channels our nodes have
to talk to each other, the better it is. We will use the term "medium"
for such a communication channel.
<p>In general we work from the assumption that we use standard hardware
wherever possible. This means that we do not modify our PCs other than
to expand them with off-the-shelf components. And we use only cabling that
can be bought without "special orders" such as split serial cables or the
like. After all, we want solutions that can be installed and used by everyone,
not just by a few experts.
<h3>
<a NAME="Minimum Installation"></a>Minimum Installation</h3>
In order for the takeover to work, we need at least one medium to exchange
messages. Given that we use TCP/IP as the basis for our service, some kind
of LAN is certainly available. Of course having only the LAN provides poor
monitoring capabilities, but on the other hand this is the minimum chapter
anyway :-)
<p>So how will the hardware be planned? Well, it is straightforward.
<pre>
    +-------------------+  +-------------------+
    |                   |  |                   |
    |      Node A       |  |      Node B       |
    |                   |  |                   |
    |       eth0        |  |       eth0        |
    +---------+---------+  +---------+---------+
              |                      |
              |                      |
    |---------+----------------------+---------| LAN (Ethernet, etc.)</pre>
As mentioned before, this design does not provide sufficiently reliable
monitoring. A LAN has many different points of failure, at several levels,
and it is a public medium.
So we would be well advised to look for more robust options to
use in addition to the LAN for heartbeats.
<h3>
<a NAME="More Advanced Installation"></a>More Advanced Installation</h3>
So let's see what we can do to provide sound monitoring and good takeover
capabilities without having to purchase excessive hardware or software
add-ons.
<p>The main idea is to have a simple private medium like one or more serial
cables. We can use the standard serial ports, provided they are not already
occupied by modems, mice, or other vermin.
If you have a server with a PS/2 mouse, it probably has two such
ports available.
<p>So here's what this configuration looks like.
<pre>
              +----------------------+           (Nullmodem Cable)
              |                      |
    +---------+---------+  +---------+---------+
    |       ttyS0       |  |       ttyS0       |
    |                   |  |                   |
    |      Node A       |  |      Node B       |
    |                   |  |                   |
    |       eth0        |  |       eth0        |
    +---------+---------+  +---------+---------+
              |                      |
              |                      |
    |---------+----------------------+---------| LAN (Ethernet, etc.)</pre>
What do we gain? We now have two media to exchange the heartbeat,
which provides greater reliability in case of a failure.
Of course
the restriction with the LAN still holds true, but now Node A could use
the serial line to initiate a takeback of the service. And if just the
serial connection should fail, we still have the LAN.
Reliable intracluster communication is very important, and this design
is a low-cost improvement over the previous one.
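<p>The benefit of a second medium can be sketched in a few lines of Python.
This is only an illustration, not the logic of the <i>heartbeat</i> package:
the class <tt>PeerMonitor</tt>, its methods, and the <tt>deadtime</tt> value
are hypothetical names chosen for this example. The idea is that the peer is
only considered dead once every medium has been silent for too long.</p>

```python
class PeerMonitor:
    """Track the last heartbeat seen on each medium (hypothetical helper)."""

    def __init__(self, media, deadtime):
        self.deadtime = deadtime
        # No heartbeat has been seen yet on any medium.
        self.last_seen = {medium: None for medium in media}

    def heartbeat(self, medium, now):
        """Record a heartbeat received on one medium at time `now` (seconds)."""
        self.last_seen[medium] = now

    def failed_media(self, now):
        """Media that have been silent for longer than the deadtime."""
        return sorted(
            medium for medium, seen in self.last_seen.items()
            if seen is None or now - seen > self.deadtime
        )

    def peer_alive(self, now):
        """The peer counts as alive while at least one medium still works."""
        return len(self.failed_media(now)) != len(self.last_seen)


monitor = PeerMonitor(["eth0", "ttyS0"], deadtime=10)
monitor.heartbeat("eth0", now=0)
monitor.heartbeat("ttyS0", now=0)
monitor.heartbeat("ttyS0", now=12)   # the LAN went quiet, serial still works
print(monitor.failed_media(now=15))  # media that look broken
print(monitor.peer_alive(now=15))    # peer is still reachable over serial
```

<p>With only the LAN, the silence on <tt>eth0</tt> alone would be
indistinguishable from a dead peer. With the serial line as a second medium,
the same silence merely points to a LAN problem.</p>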
<h3>
<a NAME="Fully Redundant Installation"></a>Fully Redundant Installation</h3>
The point of the addition of the serial links to the system is that a single
failure cannot cause the nodes to become confused about the overall
system configuration.  This is vitally important for many HA systems, because
the cost of this confusion can be scrambled disks, and other problems
which are often worse than the cost of an outage.
With more resources, the
following provides a general guideline for setting things up.
To illustrate
the principle, a third node has been included, but we can install any number
of nodes in this way. Well, almost any.
<B>Note:</B> The takeover code which is
part of the <i>heartbeat</i> package will not yet correctly manage takeovers
for more than two nodes.
<p>The serial lines are now arranged in a ring structure.
As you will have noticed,
this occupies two serial ports on each node as per our discussion in the
previous chapter. But on the other hand we now have a general setup
that can easily be extended and also provides a good level of redundancy.
We can now send our heartbeat in both directions over the
ring, thus reaching every other node even in case of a (single) cable defect
(or down system) anywhere on the ring.
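<p>The claim about a single cable defect can be checked with a small model.
The following Python sketch is an illustration only; the function names
<tt>ring_links</tt> and <tt>reachable</tt> are made up for this example and
are not part of any Linux-HA software. It builds the ring for three nodes,
removes each cable in turn, and verifies that every node can still be reached.</p>

```python
def ring_links(nodes):
    """Serial links of a ring: each node is cabled to its successor."""
    return [(nodes[i], nodes[(i + 1) % len(nodes)]) for i in range(len(nodes))]


def reachable(start, links):
    """Nodes reachable from `start` over the given bidirectional links."""
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for a, b in links:
            # A cable carries heartbeats in both directions.
            for src, dst in ((a, b), (b, a)):
                if src == node and dst not in seen:
                    seen.add(dst)
                    frontier.append(dst)
    return seen


nodes = ["A", "B", "C"]
links = ring_links(nodes)

# Remove each cable in turn: every node should still reach all others.
for broken in links:
    remaining = [link for link in links if link != broken]
    assert reachable("A", remaining) == set(nodes)

# Two broken cables, however, can isolate a node.
print(sorted(reachable("A", links[2:])))
```

<p>The last line shows the limit of the design: the ring tolerates one broken
cable, but not two.</p>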
<p>Another facet of our high-end design covers the LAN access. Having two
adapters connected to the wire allows us to provide intra-node failover
capabilities in case of an interface or LAN cable breakdown. Plus it gives
us the chance to take over the IP address of Node A, eth0 onto Node B,
eth1 while keeping Node B, eth0 as it is. In fact, this is the primary operation
mode of several professional systems, including IBM's HACMP and HP's
MC/ServiceGuard. Which doesn't imply that we are not professional, of course
:-)
<p>So, here is the block diagram for this third design.
<pre>
                                                  (Nullmodem Cables)
          +-----------------------------------------------------+
          |       +--------------+        +-------------+       |
          |       |              |        |             |       |
    +-----+-------+-----+  +-----+--------+----+  +-----+-------+-----+
    |   ttyS0   ttyS1   |  |   ttyS0   ttyS1   |  |   ttyS0   ttyS1   |
    |                   |  |                   |  |                   |
    |      Node A       |  |      Node B       |  |      Node C       |
    |                   |  |                   |  |                   |
    |   eth0     eth1   |  |   eth0     eth1   |  |   eth0     eth1   |
    +-----+--------+----+  +-----+--------+----+  +-----+--------+----+
          |        |             |        |             |        |
          |        |             |        |             |        |
    |-----+--------+-------------+--------+-------------+--------+----|
                                                  LAN (Ethernet, etc.)
</pre>
Future releases of the <i>heartbeat</i> software will support such a
configuration, but current takeover code restricts the configuration
to a single interface and two nodes in the network.
Of course we could also use other media for the heartbeat exchange. Recent
suggestions include SCSI buses in target mode and IrDA ports "connected"
with a mirror. Another candidate that comes to mind is the USB found in
many modern PCs. As I said before, the more media (and the more diverse), the better.
<h2>
<a NAME="Hardware Setup and Test"></a>Hardware Setup and Test</h2>
This chapter deals with the installation and verification of the
various components within the nodes.
<h3>
<a NAME="Serial Ports"></a>Serial Ports</h3>
First of all, let's recap how a Nullmodem Cable is wired. The annoying thing is that
you certainly have the pin assignment written down a thousand times, but you don't
have it handy when you need it. So here it is ...
<pre>
    25-pin        9-pin                          9-pin        25-pin

      2     TxD     3  --------------------------  2     RxD     3
      3     RxD     2  --------------------------  3     TxD     2
      4     RTS     7  --------------------------  8     CTS     5
      5     CTS     8  --------------------------  7     RTS     4
      7     GND     5  --------------------------  5     GND     7
      6     DSR     6  ---+----------------------  4     DTR    20
      8     DCD     1  ---+                  +---  1     DCD     8
     20     DTR     4  ----------------------+---  6     DSR     6
</pre>
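<p>For those who prefer to keep the wiring in machine-readable form, the
9-pin columns of the table can be written down as a small Python mapping.
This is just a sketch; the names <tt>PIN_NAMES</tt>, <tt>NULLMODEM_9PIN</tt>,
<tt>is_symmetric</tt>, and <tt>describe</tt> are made up for this example.
Since a Nullmodem Cable looks the same from both ends, the mapping can be
checked for symmetry.</p>

```python
# Signal names of the 9-pin connector, as in the table above.
PIN_NAMES = {1: "DCD", 2: "RxD", 3: "TxD", 4: "DTR", 5: "GND",
             6: "DSR", 7: "RTS", 8: "CTS"}

# Pin on one 9-pin connector -> pins it is wired to on the other one.
NULLMODEM_9PIN = {
    3: [2],     # TxD -> RxD
    2: [3],     # RxD -> TxD
    7: [8],     # RTS -> CTS
    8: [7],     # CTS -> RTS
    5: [5],     # GND -> GND
    6: [4],     # DSR -> DTR
    1: [4],     # DCD -> DTR (bridged with DSR)
    4: [1, 6],  # DTR -> DCD and DSR
}


def is_symmetric(wiring):
    """True if every connection also exists when seen from the other end."""
    return all(left in wiring.get(right, [])
               for left, rights in wiring.items()
               for right in rights)


def describe(pin):
    """Human-readable description of where one pin is wired to."""
    targets = ", ".join(PIN_NAMES[p] for p in NULLMODEM_9PIN[pin])
    return f"{PIN_NAMES[pin]} -> {targets}"


print(is_symmetric(NULLMODEM_9PIN))
for pin in sorted(NULLMODEM_9PIN):
    print(pin, describe(pin))
```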

<p>Once you have these cable(s) in place you will want to test them. This
is fairly easy since the serial ports are usually configured with decent
default values. On a freshly booted Linux system we can assume the ports
to be in a "sane" state, with the speed set to 9600 baud. If not, you can
do a "<tt><b>stty&nbsp;sane 9600&nbsp;&lt;/dev/ttyS0</b></tt>"
with <i>ttyS0</i> replaced by the actual device.  Please note the input
redirection which selects the device.

</P>
<p>Then you can set up one node as receiver
("<tt><b>cat&nbsp;&lt;/dev/ttyS0</b></tt>") and
the other one as transmitter ("<tt><b>echo&nbsp;hello&nbsp;&gt;/dev/ttyS0</b></tt>").
Voila! You should see the "hello" printed out at the receiver.
Pressing Ctrl-C on the receiver's keyboard will return you to the prompt.
Then repeat the test with the roles swapped.
<h3>
<a NAME="LAN Interfaces"></a>LAN Interfaces</h3>
Rumor has it that there is work in progress to provide some level of diagnostic
capabilities for Ethernet adapters and wiring. I don't know the actual
status, and can only suggest using a plain "ping", provided that the interfaces
are set up correctly with "ifconfig". For more information on Linux Ethernet,
please check the
<a href="http://metalab.unc.edu/LDP/HOWTO/Ethernet-HOWTO.html">Ethernet
HOWTO</a>.
<p>If you are planning to use more than one adapter per node (usually called
"Standby Adapters"), please make sure to connect them to the same physical
medium as the primary adapters. Otherwise you will of course not be able
to take over the IP address. Having them in a different subnet is perfectly
okay. More than that: it's preferred. <b>[TODO: this is what I learned with
HACMP. Can anyone please give the *correct* rationale --- or rephrase the
whole paragraph?]</b>
<p><b>Note:</b> The <i>heartbeat</i> software does not yet support this
kind of configuration.
<h3>
<a NAME="Other Devices"></a>Other Devices</h3>
<b>[TODO: well, to do]</b>
<h2>
<a NAME="Troubleshooting"></a>Troubleshooting</h2>
If things don't work at first -- don't panic! Usually it's just
a trifle. Things to check include:
<ul>
<li>
Check the startup messages of the kernel, e.g. using "dmesg". Is the serial
driver (either the standard one or the special one for your hardware) compiled
in or available as a module?</li>

<li>
Check the serial port(s) and cable(s). Do your modem and mouse still work?
Using a battery, a light bulb or buzzer and some wire you can easily verify
that all pins are connected and there are no short circuits.
Inexpensive <i>breakout boxes</i> are available for diagnosing such conditions
as well.  They contain the light bulbs, the connectors and the wire in one
handy little unit.</li>

<li>
The file <tt>/proc/tty/driver/serial</tt> can be very
helpful for diagnosing serial port problems in Linux.
It contains lines of this form:
<PRE><tt>1: uart:16550A port:2F8 irq:3 baud:19200 tx:24423 rx:4680 RTS|CTS|DTR|DSR|CD</tt></PRE>
This particular line corresponds to a working "raw" serial port,
/dev/ttyS<b>1</b>
with both sides cabled up correctly, and heartbeat active on both sides.
The first number on the line is the port number.  The built-in serial ports
on PCs are numbered 0 and 1.
With heartbeat only active on the local side (and not the far side), it
looks like this instead:
<PRE><tt>1: uart:16550A port:2F8 irq:3 baud:19200 tx:43558 rx:12277 RTS|DTR</tt></PRE>
Note the lack of the
<b>CTS</b> (Clear To Send),
<b>DSR</b> (Data Set Ready), and
<b>CD</b> (Carrier Detect)
bits on the interface.
When heartbeat is only running on the far side interface, it looks
like this:
<PRE><tt>1: uart:16550A port:2F8 irq:3 baud:19200 tx:55039 rx:12277 CTS|DSR|CD</tt></PRE>
Note that when the local port isn't active, the
<b>RTS</b> (Request To Send) and
<b>DTR</b> (Data Terminal Ready) bits aren't set.

When heartbeat isn't running on either interface, the line looks like this:
<PRE><tt>1: uart:16550A port:2F8 irq:3 baud:19200 tx:55039 rx:12277</tt></PRE>
This is essentially a software breakout box.
</li>

<li>
Check that the cables are properly plugged into their sockets.
For a production
High-Availability system, it is a very good idea to fasten the screws
in order to avoid loose contacts.</li>

<li>
For more information on diagnosing Ethernet problems, consult the
<a href="http://metalab.unc.edu/LDP/HOWTO/Ethernet-HOWTO.html">Ethernet
HOWTO</a>.</li>

<li>
For more information on diagnosing serial port problems, consult the
<a href="http://metalab.unc.edu/LDP/HOWTO/Serial-HOWTO.html">Serial
Port HOWTO</a>.</li>
</ul>
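<p>The four <tt>/proc/tty/driver/serial</tt> cases described in the list above
can also be told apart by a short script. The following Python sketch is an
illustration under assumptions: the function names
<tt>parse_serial_line</tt> and <tt>diagnose</tt> are made up here, and the
parser only handles port lines of exactly the form shown above (it does not
handle a header line or any other layout the file may contain).</p>

```python
def parse_serial_line(line):
    """Split one port line into port number, key:value fields, and flags."""
    port, _, rest = line.partition(":")
    fields, flags = {}, set()
    for token in rest.split():
        if ":" in token:
            key, _, value = token.partition(":")
            fields[key] = value
        else:
            # The trailing token lists the active modem-control lines.
            flags.update(token.split("|"))
    return int(port), fields, flags


def diagnose(flags):
    """Map the active lines to the four cases described in the list above."""
    local = {"RTS", "DTR"}.issubset(flags)
    remote = {"CTS", "DSR", "CD"}.issubset(flags)
    if local and remote:
        return "active on both sides"
    if local:
        return "active on the local side only"
    if remote:
        return "active on the far side only"
    return "not active on either side"


line = ("1: uart:16550A port:2F8 irq:3 baud:19200 "
        "tx:24423 rx:4680 RTS|CTS|DTR|DSR|CD")
port, fields, flags = parse_serial_line(line)
print(f"/dev/ttyS{port} at {fields['baud']} baud: {diagnose(flags)}")
```

<p>To use it on a live system, you would feed it the port lines read from
<tt>/proc/tty/driver/serial</tt>, which usually requires root privileges.</p>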

<h2>
<a NAME="References"></a>References</h2>
The Linux-HA homepage on the internet is:
<a href="http://linux-ha.org/">http://linux-ha.org/</a>
<p>Harald Milz's Linux-HA HOWTO that started the whole thing can be found
at:
<a href="http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html">http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html</a>
<p>A comprehensive survey of professional HA solutions is here:
<a href="http://www.sun.com/clusters/dh.brown.pdf">http://www.sun.com/clusters/dh.brown.pdf</a>
<br><b>[TODO: should we include links to HACMP, Veritas, Wizard, ... ???]</b>
<hr>
</body>
</html>