                                 Faq'n Tips

    1. [1]Hey!  This doesn't look like a FAQ!  What gives?
    2. [2]Are there mailing lists for Linux-HA?
    3. [3]What is a cluster?
    4. [4]What is a resource script?
    5. [5]How do I monitor various resources?  If one of my resources
       stops working, heartbeat doesn't do anything unless the server
       crashes.  How do I monitor resources with heartbeat?
    6. [6]If one of my ethernet connections goes away (cable severance,
       NIC failure, locusts), but my current primary node (the one with
       the services) is otherwise fine, no one can get to my services and
       I want to fail them over to my other cluster node.  Is there a way
       to do this?
    7. [7]Every time my machine releases an IP alias, it loses the whole
       interface (i.e. eth0)!  How do I fix this?
    8. [8]I want a lot of IP addresses as resources (more than 8).  What's
       the best way?
    9. [9]The documentation indicates that a serial line is a good idea,
       is there really a drawback to using two ethernet connections?
   10. [10]How do I use heartbeat with an ipchains firewall?
   11. [11]I got the message "ERROR: No local heartbeat. Forcing shutdown"
        and then heartbeat shut itself down for no reason at all!
   12. [12]How do I tune heartbeat on a heavily loaded system to avoid
        split-brain?
   13. [13]When heartbeat starts up I get this error message in my logs:
         WARN: process_clustermsg: node [<hostname>] failed authentication
       [14]What does this mean?
   14. [15]When I try to start heartbeat I receive the message: [16]"Starting
        High-Availability services: Heartbeat failure [rc=1]. Failed."
        [17]and there is nothing in any of the log files and no messages.
        What is wrong?
   15. [18]How do I run multiple clusters on the same network segment?
   16. [19]How do I get the latest CVS version of heartbeat?
   17. [20]Heartbeat on other OSs.
   18. [21]When I try to install the linux-ha.org heartbeat RPMs, they
        complain about dependencies on packages I already have
        installed!  Now what?
   19. [22]I don't want heartbeat to fail over the cluster automatically.
        How can I require human confirmation before failing over?
   20. [23]What is STONITH?  And why might I need it?
   21. [24]How do I figure out what STONITH devices are available, and how
       to configure them?
   22. [25]I want to use a shared disk, but I don't want to use STONITH.
       Any recommendations?
   23. [26]Can heartbeat be configured in an active/active configuration?
        If so, how do I do this, given that the haresources file is
        supposed to be the same on each box?
   24. [27]Why are my interface names getting truncated when they're
       brought up and down?
   25. [28]What is this auto_failback parameter? What happened to the old
       nice_failback parameter?
   26. [29]I am upgrading from a version of Linux-HA which supported
        nice_failback to one that supports auto_failback. How do I avoid
        a flash cut in this upgrade?
   27. [30]If nothing helps, what should I do?
   28. [31]I want to submit a patch, how do I do that?
   _______________________________________________________________________


    1. Quit your bellyachin'!  We needed a "catch-all" document to supply
       useful information in a way that was easily referenced and would
       grow without a lot of work.  It's closer to a FAQ than anything
       else.
    2. Yes!  There are two public mailing lists for Linux-HA.  You can
       find out about them by visiting [32]http://linux-ha.org/contact/.
    3. HA (High Availability) cluster - a cluster that allows a host (or
       hosts) to become highly available. This means that if one node
       goes down (or a service on that node goes down), another node can
       pick up the service and take over from the failed machine.
       [33]http://linux-ha.org
       Computing cluster - this is what a Beowulf cluster is. It allows
       distributed computing over off-the-shelf components, usually
       cheap IA32 machines. [34]http://www.beowulf.org/
       Load-balancing cluster - this is what the Linux Virtual Server
       project does. In this scenario one machine balances requests for
       a certain service (apache, for example) across a farm of servers.
       [35]www.linuxvirtualserver.org
       All of these sites have HOWTOs etc. on them. For a general
       overview of clustering under Linux, see the Clustering HOWTO.
    4. Resource scripts are basically (extended) System V init scripts.
       They must support the stop, start, and status operations.  In the
       future we will also add support for a "monitor" operation for
       monitoring services. The IPaddr script already implements this
       new "monitor" operation (but heartbeat doesn't use that function
       of it). For more info see the Resource HOWTO.
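As a minimal sketch of such a script, the following shows the usual start/stop/status dispatch (the service name "myservice" and its pid file are placeholders, not anything heartbeat ships; by convention the status output contains the word "running" when the service is up):

```shell
#!/bin/sh
# Minimal heartbeat resource script sketch.  Heartbeat invokes it with one
# argument: start, stop, or status.  "myservice" is a hypothetical service.
myservice_resource() {
    case "$1" in
        start)
            echo "Starting myservice"
            # e.g. /etc/init.d/myservice start
            ;;
        stop)
            echo "Stopping myservice"
            # e.g. /etc/init.d/myservice stop
            ;;
        status)
            # By convention the output should contain "running" when up.
            if [ -f /var/run/myservice.pid ]; then
                echo "myservice is running"
            else
                echo "myservice is stopped"
            fi
            ;;
        *)
            echo "Usage: $0 {start|stop|status}" >&2
            return 1
            ;;
    esac
}

myservice_resource "${1:-status}"
```

Such a script would go in /etc/ha.d/resource.d/ (or /etc/init.d/) so that it can be named in haresources.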
    5. Heartbeat itself was not designed for monitoring various resources.
       If you need to monitor some resources (for example, availability of
       WWW server) you need some third party software. Mon is a reasonable
       solution.
         A. Get Mon from [36]http://kernel.org/software/mon/.
          B. Get all the required modules listed. You can find them at
             the nearest mirror or in the CPAN archive (www.cpan.org). I
             am not very familiar with Perl, so I downloaded them from
             the CPAN archive as .tar.gz packages and installed them in
             the usual way (perl Makefile.PL && make && make test &&
             make install).
          C. Mon is software for monitoring different network resources.
             It can ping computers, connect to various ports, monitor
             WWW, MySQL, etc. When some resource malfunctions, it
             triggers alert scripts.
         D. Unpack mon in some directory. Best starting point is README
            file. Complete documentation is in the <dir>/doc, where <dir>
            is the place you unpacked mon package.
          E. For a fast start, do the following steps:
              a. copy all subdirs found in <dir> to /usr/lib/mon
              b. create dir /etc/mon
              c. copy auth.cf from <dir>/etc to /etc/mon
             Now mon is prepared to work. You need to create your own
             mon.cf file, which points to the resources mon should
             watch and the actions mon will take when a resource fails
             and when it becomes available again.  All monitoring
             scripts are in /usr/lib/mon/mon.d/. At the beginning of
             every script you can find an explanation of how to use it.
             All alert scripts are placed in /usr/lib/mon/alert.d/.
             These are the scripts triggered when something goes wrong.
             If you are using ipvs, you can find scripts for adding and
             removing servers from an ipvs list on its homepage
             (www.linuxvirtualserver.org).
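The fast-start steps above can be sketched as a small shell helper; the paths are parameters because your unpack location will differ (the function name is made up for illustration, and it copies the mon.d and alert.d script subdirectories referred to above):

```shell
# Sketch of the mon fast-start steps.
#   $1 = directory where the mon tarball was unpacked
#   $2 = target library dir (normally /usr/lib/mon)
#   $3 = config dir (normally /etc/mon)
install_mon() {
    mkdir -p "$2" "$3" &&
    cp -r "$1"/mon.d "$1"/alert.d "$2"/ &&   # monitoring and alert scripts
    cp "$1"/etc/auth.cf "$3"/                # authorization config
}
```

For example (as root): `install_mon /usr/src/mon /usr/lib/mon /etc/mon`. You still have to write /etc/mon/mon.cf yourself, as described above.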
    6. Yes!  Use the ipfail plug-in.  For each interface you wish to
       monitor, specify one or more "ping" nodes or "ping groups" in your
       configuration.  Each node in your cluster will monitor these ping
       nodes or groups.  Should one node detect a failure in one of these
       ping nodes, it will contact the other node in order to determine
       whether it or the ping node has the problem.  If the cluster node
       has the problem, it will try to failover its resources (if it has
       any).
       To use ipfail, you will need to add the following to your
       /etc/ha.d/ha.cf files:
               respawn hacluster /usr/lib/heartbeat/ipfail
               ping <IPaddr1> <IPaddr2> ... <IPaddrN>
       See [37]Kevin's documentation for more details on the concepts.
       IPaddr1..N are your ping nodes.  NOTE:  ipfail requires the
       auto_failback option to be set to on or off (not legacy).
    7. This isn't a problem with heartbeat, but rather is caused by
       various versions of net-tools.  Upgrade to the most recent version
       of net-tools and it will go away.  You can test it with ifconfig
       manually.
    8. Instead of failing over many IP addresses, just fail over one
       router address.  On your router, do the equivalent of "route add
       -net x.x.x.0/24 gw x.x.x.2", where x.x.x.2 is the cluster IP
       address controlled by heartbeat.  Then, make every address within
       x.x.x.0/24 that you wish to failover a permanent alias of lo0 on
       BOTH cluster nodes.  This is done via "ifconfig lo:2 x.x.x.3
       netmask 255.255.255.255 -arp" etc...
    9. If anything makes your ethernet / IP stack fail, you may lose
       both connections. You should definitely route the two cables
       differently, depending on how important your data is...
   10. To make heartbeat work with ipchains, you must accept incoming
        and outgoing traffic on UDP port 694. Add something like:
        /sbin/ipchains -A output -i ethN -p udp -s <source_IP> -d
        <dest_IP> -j ACCEPT
        /sbin/ipchains -A input -i ethN -p udp -s <source_IP> -d
        <dest_IP> -j ACCEPT
   11. This can be caused by one of two things:
          + System under heavy I/O load, or
          + Kernel bug.
       For how to deal with the first occurrence (heavy load), please read
       the answer to the [38]next FAQ item.
       If your system was not under moderate to heavy load when it got
       this message, you probably have the kernel bug. The 2.4.18 Linux
       kernel had a bug in it which would cause it to not schedule
       heartbeat for very long periods of time when the system was idle,
       or nearly so. If this is the case, you need to get a kernel that
       isn't broken.
   12. "No local heartbeat" or "Cluster node returning after partition"
        under heavy load is typically caused by too small a deadtime
        interval. Here is a suggestion for how to tune deadtime:
          + Set deadtime to 60 seconds or higher
          + Set warntime to whatever you *want* your deadtime to be.
          + Run your system under heavy load for a few weeks.
          + Look at your logs for the longest time either system went
            without hearing a heartbeat.
          + Set your deadtime to 1.5-2 times that amount.
          + Set warntime to a little less than that amount.
          + Continue to monitor logs for warnings about long heartbeat
            times. If you don't do this, you may get "Cluster node ...
            returning after partition" which will cause heartbeat to
            restart on all machines in the cluster. This will almost
            certainly annoy you.
        Adding memory to the machine generally helps. Limiting workload
        on the machine generally helps. Newer versions of heartbeat are
        better about this than pre-1.0 versions.
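As a concrete starting point for the procedure above, the relevant ha.cf timing directives look like this (the numbers are only illustrative, not recommendations for your hardware):

```
keepalive 2      # seconds between heartbeats
warntime 30      # log late-heartbeat warnings after this many seconds
deadtime 60      # declare the peer dead after this many seconds
initdead 120     # extra allowance while the cluster first boots
```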
   13. It's common to get a single mangled packet on your serial interface
       when heartbeat starts up.  This message is an indication that we
       received a mangled packet.  It's harmless in this scenario. If it
       happens continually, there is probably something else going on.
   14. It's probably a permissions problem on authkeys.  Heartbeat wants
        it to be readable only by its owner (mode 400, 600 or 700).
        Depending on where and when it discovers the problem, the
        message will wind up in different places.
       But, it tends to be in
         1. stdout/stderr
         2. wherever you specified in your setup
         3. /var/log/messages
       Newer releases are better about also putting out startup messages
       to stderr in addition to wherever you have configured them to go.
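A sketch of checking and fixing this (the function name is made up for illustration, and `stat -c` is the GNU coreutils form):

```shell
# Check that an authkeys file ($1, normally /etc/ha.d/authkeys) has one of
# the modes heartbeat accepts (400, 600, 700); tighten it to 600 otherwise.
check_authkeys() {
    perms=$(stat -c %a "$1" 2>/dev/null) || { echo "missing: $1" >&2; return 1; }
    case "$perms" in
        400|600|700)
            echo "ok: $1 is mode $perms" ;;
        *)
            echo "fixing: $1 mode $perms -> 600"
            chmod 600 "$1" ;;
    esac
}
```

For example: `check_authkeys /etc/ha.d/authkeys` (as root), then restart heartbeat.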
   15. Use multicast and give each cluster its own multicast group. If
        you need or want to use broadcast, then run each cluster on a
        different port number.  An example of a configuration using
        multicast would be to have the following line in your ha.cf
        file:
            mcast eth0 224.1.2.3 694 1 0
        This sets eth0 as the interface over which to send the multicast,
        224.1.2.3 as the multicast group (the same on each node in the
        same cluster), UDP port 694 (the heartbeat default), a time to
        live of 1 (limits the multicast to the local network segment so
        it does not propagate through routers), and multicast loopback
        disabled (typical).
   16. There is a CVS repository for Linux-HA. You can find it at
       cvs.linux-ha.org.  Read-only access is via login guest, password
       guest, module name linux-ha. More details are to be found in the
       [39]announcement email.  It is also available through the web using
       viewcvs at
       [40]http://cvs.linux-ha.org/viewcvs/viewcvs.cgi/linux-ha/
   17. Heartbeat now uses automake and is generally quite portable at
        this point. Join the Linux-HA-dev mailing list if you want to
        help port it to your favorite platform.
   18. Due to distribution RPM package name differences, this was
       unavoidable.  If you're not using STONITH, use the "--nodeps"
       option with rpm.  Otherwise, use the heartbeat source to build your
       own RPMs.  You'll have the added dependencies of autoconf >= 2.53
       and libnet (get it from [41]http://www.packetfactory.net/libnet).
       Use the heartbeat source RPM (preferred) or unpack the heartbeat
       source and from the top directory, run "./ConfigureMe rpm".  This
       will build RPMS and place them where it's customary for your
       particular distro.  It may even tell you if you are missing some
       other required packages!
   19. You configure a "meatware" STONITH device into the ha.cf file.  The
       meatware STONITH device asks the operator to go power reset the
       machine which has gone down.  When the operator has reset the
       machine he or she then issues a command to tell the meatware
       STONITH plug-in that the reset has taken place.  Heartbeat will
       wait indefinitely until the operator acknowledges the reset has
       occurred.  During this time, the resources will not be taken over,
       and nothing will happen.
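In ha.cf this looks something like the following sketch ("node1" and "node2" are placeholder hostnames; run `stonith -h` as described below for the exact parameters your version expects):

```
# Each node may "shoot" the other via the operator:
stonith_host node1 meatware node2
stonith_host node2 meatware node1
```

After physically resetting the dead machine, the operator acknowledges it with the meatclient command, e.g. `meatclient -c node2`.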
   20. STONITH is a form of fencing, and is an acronym standing for Shoot
       The Other Node In The Head.  It allows one node in the cluster to
       reset the other.  Fencing is essential if you're using shared
       disks, in order to protect the integrity of the disk data.
       Heartbeat supports STONITH fencing, and resources which are
       self-fencing.  You need to configure some kind of fencing whenever
       you have a cluster resource which might be permanently damaged if
       both machines tried to make it active at the same time.  When in
       doubt check with the Linux-HA mailing list.
   21. To get the list of supported STONITH devices, issue this command:
       stonith -L
       To get all the gory details on exactly what these STONITH device
       names mean, and how to configure them, issue this command:
       stonith -h
   22. This is not something which heartbeat supports directly, however,
       there are a few kinds of resources which are "self-fencing".  This
       means that activating the resource causes it to fence itself off
       from the other node naturally.  Since this fencing happens in the
       resource agent, heartbeat doesn't know (and doesn't have to know)
       about it.  Two possible hardware candidates are IBM's ServeRAID-4
       RAID controllers and ICP Vortex RAID controllers - but do your
       homework!!!   When in doubt check with the mailing list.
   23. Yes, heartbeat has supported active/active configurations since its
       first release. The key to configuring active/active clusters is to
       understand that each resource group in the haresources file is
       preceded by the name of the server which is normally supposed to
        run that service. In an "auto_failback yes" (or "legacy", or
        old-style "nice_failback off") configuration, when a cluster
        node comes up, it will take over any resources for which it is
        listed as
       the "normal master" in the haresources file. Below is an example of
       how to do this for an apache/mysql configuration.
server1 10.10.10.1 mysql
server2 10.10.10.2 apache

       In this case, the IP address 10.10.10.1 should be replaced with the
       IP address you want to contact the mysql server at, and 10.10.10.2
       should be replaced with the IP address you want people to use to
       contact the web server. Any time server1 is up, it will run the
       mysql service. Any time server2 is up, it will run the apache
        service. If both server1 and server2 are up, both servers will
        be active. Note that this is incompatible with the old
        nice_failback on behavior. With the newer releases, which
        support hb_standby foreign, you can manually fail back into an
        active/active configuration even when auto_failback is off.
        This gives administrators the flexibility to fail back in a
        more customized way, at safer or more convenient times.
   24. Heartbeat was written to use ifconfig to manage its interfaces.
        That's nice for portability to other platforms, but for some
        reason ifconfig truncates interface names.  If you want fewer
        than 10 aliases, you need to limit your interface names to 7
        characters, and to 6 characters for fewer than 100 aliases.
   25. The auto_failback parameter is a replacement for the old
       nice_failback parameter. The old value nice_failback on is replaced
       by auto_failback off. The old value nice_failback off is logically
       replaced by the new auto_failback on parameter. Unlike the old
       nice_failback off behavior, the new auto_failback on allows the use
       of the ipfail and hb_standby facilities.
       During upgrades from nice_failback to auto_failback, it is
       sometimes necessary to set auto_failback to legacy, as described in
       the [42]upgrade procedure below.
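In ha.cf terms, the mapping is:

```
# old parameter            # replacement
nice_failback on      ->   auto_failback off
nice_failback off     ->   auto_failback on
                           (auto_failback legacy is a transitional value
                            for mixed-version upgrades)
```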
   26. To upgrade from a pre-auto_failback version of heartbeat to one
       which supports auto_failback, the following procedures are
       recommended to avoid a flash cut on the whole cluster.
         1. Stop heartbeat on one node in the cluster.
          2. Upgrade this node. If the other node has nice_failback on in
             ha.cf then set auto_failback off in the new ha.cf file. If the
             other node in the cluster has nice_failback off then set
             auto_failback legacy in the new ha.cf file.
         3. Start the new version of heartbeat on this node.
         4. Stop heartbeat on the other node in the cluster.
          5. Upgrade this second node in the cluster with the new version
             of heartbeat. Set auto_failback the same as it was set in the
             previous step.
         6. Start heartbeat on this second node in the cluster.
         7. If you set auto_failback to on or off, then you are done.
            Congratulations!
         8. If you set auto_failback legacy in your ha.cf file, then
            continue as described below...
         9. Schedule a time to shut down the entire cluster for a few
            seconds.
        10. At the scheduled time, stop both nodes in the cluster, and
            then change the value of auto_failback to on in the ha.cf file
            on both sides.
        11. Restart both nodes on the cluster at about the same time.
        12. Congratulations, you're done! You can now use ipfail, and can
            also use the hb_standby command to cause manual resource
            moves.
   27. Please be sure that you have read all the documentation and
        searched the mailing list archives. If you still can't find a
        solution, you can post questions to the mailing list. Please
        include the following:
           + What OS you are running.
           + What version (distro/kernel).
           + How you installed heartbeat (tar.gz, rpm, src.rpm or
             manual installation).
          + Include your configuration files from BOTH machines. You can
            omit authkeys.
          + Include the parts of your logs which describe the errors.
            Send them as text/plain attachments.
            Please don't send "cleaned up" logs.  The real logs have more
            information in them than cleaned up versions.  Always include
            at least a little irrelevant data before and after the events
            in question so that we know nothing was missed.  Don't edit
            the logs unless you really have some super-secret
            high-security reason for doing so.
            This means you need to attach 6 or 8 files. Include 6 if your
            debug output goes into the same file as your normal output and
            8 otherwise. For each machine you need to send:
               o ha.cf
               o haresources
               o normal logs
               o debug logs (perhaps)
   28. We love to get good patches.  Here's the preferred way:
          + If you have any questions about the patch, please check with
            the linux-ha-dev mailing list for answers before starting.
          + Make your changes against the current CVS source
          + Test them, and make sure they work ;-)
          + Produce the patch this way:
                   cvs -q diff -u >patchname.txt
          + Send an email to the linux-ha-dev mailing list with the patch
            as a [text/plain] attachment. If your mailer wants to zip it
            up for you, please fix it.
   _______________________________________________________________________

   Rev 0.0.8
   (c) 2000 Rudy Pawul [43]rpawul@iso-ne.com
   (c) 2001 Dusan Djordjevic [44]dj.dule@linux.org.yu
   (c) 2003 IBM (Author: Alan Robertson [45]alanr@unix.sh)

References

   1. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#FAQ
   2. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#mailinglists
   3. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#what_is_it
   4. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#res_scr
   5. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#mon
   6. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#ipfail
   7. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#nettools
   8. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#manyIPs
   9. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#serial
  10. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#firewall
  11. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#nolocalheartbeat
  12. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#heavy_load
  13. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#serialerr
  14. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#serialerr
  15. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#authkeys
  16. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#authkeys
  17. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#authkeys
  18. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#multiple_clusters
  19. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#CVS
  20. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#other_os
  21. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#RPM
  22. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#meatware
  23. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#STONITH
  24. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#config_stonith
  25. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#self_fence
  26. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#active_active
  27. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#iftrunc
  28. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#why_auto_failback
  29. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#auto_failback_upgrade
  30. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#last_hope
  31. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#patches
  32. http://linux-ha.org/contact/
  33. http://www.linux-ha.org/
  34. http://www.beowulf.org/
  35. http://www.linuxvirtualserver.org/
  36. http://kernel.org/software/mon/
  37. http://pheared.net/devel/c/ipfail/
  38. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#heavy_load
  39. http://lists.community.tummy.com/pipermail/linux-ha-dev/1999-October/000212.html
  40. http://cvs.linux-ha.org/viewcvs/viewcvs.cgi/linux-ha/
  41. http://www.packetfactory.net/libnet
  42. file://localhost/home/mandrake/rpm/BUILD/heartbeat-2.1.4/doc/faqntips.html#auto_failback_upgrade
  43. mailto:rpawul@iso-ne.com
  44. mailto:dj.dule@linux.org.yu
  45. mailto:alanr@unix.sh