Instructions for using nettee with SystemImager - method 2

Method 1 described in the other document is adequate for a small
cluster.  As pointed out to me by Greg Kurtzer, Sean Dague,
and Brian Elliot Finley, for a large cluster one must consider
the possibility of node failures.  Here are some suggested
modifications to the preceding method to work around that problem.

1.  Do not load nettee values from DHCP.  Instead have each node
boot with a script like this:

   ARGSFORNEXT=`nettee -next _EOC_`      # receive this node's parameters from the master (see 2)
   nettee $ARGSFORNEXT -out /dev/hda     # receive the image onto the local disk
   
2.  Have the master node scan the nodes with nettee, attempting
to download its parameters to each one.  That is:

  echo "ARGS PROPERLY ESCAPED" | nettee -next $TARGETNODE -t 1 -in -

(Note that there is no -w, so nettee will fail immediately
if the target node isn't up or isn't running nettee.)
If the return status of one of these nettee probes indicates failure,
that node's NEXTNODE value is reused for the next good node on the list.
For instance, if A is distributing to the chain B C D E and node C
is bad, something like this would happen:

node   nextnode   probed              probe order
B      C          ok, loaded D        4
C      D          bad (down)          3
D      E          ok, loaded E        2
E      _EOC_      ok, loaded _EOC_    1

Then A uses B for its -next value.  Writing the scanning script is left
as an exercise for the reader.
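
That said, a minimal sketch of such a script, probing in reverse chain
order as in the table above, might look like the following.  The node
names in NODELIST and the exact argument string pushed to each node are
only examples (the -colwf and -conwf options are explained in (3) below);
adjust both to your own cluster.

   #!/bin/sh
   # Sketch only: probe the chain in reverse order (last node first).
   # NODELIST (reverse chain order) and the argument string pushed to
   # each node are hypothetical examples.
   NODELIST="nodee noded nodec nodeb"
   NEXT="_EOC_"
   for NODE in $NODELIST
   do
      # No -w, so this fails at once if $NODE is down or not running nettee.
      if echo "-next $NEXT -colwf -conwf" | nettee -next $NODE -t 1 -in -
      then
         NEXT=$NODE   # good node; the node above it in the chain will point here
      fi
      # On failure NEXT is unchanged, so this node's value is reused for
      # the next good node up the list.
   done
   echo "the source node should use: -next $NEXT"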

3.  It's possible that a node may fail DURING the transfer.
This is much harder to deal with than a node that never managed
to get to the point in the script where nettee runs, which
is handled under (2) above.  However, by having the
nodes in the chain (but not the source node) employ -colwf
(Continue On Local Write Failure) and -conwf
(Continue On Network Write Failure), as much of the distribution
chain as possible can be salvaged.  If either
(or both) of those errors is encountered, the top node will emit an
error message indicating that this type of problem occurred somewhere
in the chain.  It cannot say where in the chain, though.  A
subsequent post mortem scan of the nodes should be able to
determine where the problem lay.  To aid in this post mortem
analysis, it might be a good idea for the automatic load
script on each node to look something like this:

   echo "Before first nettee" > /etc/postmortem.txt
   $ARGSFORNEXT=`nettee -next _EOC_`
   echo "Exit Status of first nettee  $? " >> /etc/postmortem.txt
   nettee $ARGSFORNEXT -out /dev/hda
   echo "Exit Status of second nettee $? " >> /etc/postmortem.txt

4.  My gut feeling is that nodes that don't (re)boot, case 2 above,
are more common than nodes that fail once they have booted normally
and managed to run nettee once.  In my own experience with 20 nodes,
a nonrebooting node isn't that uncommon; however, a node that fails
during a download was probably iffy to start with, so you probably
already knew that it might fail.  If you _must_ reload known
iffy nodes along with the good ones, put the bad ones as near to
the end of the chain as possible and employ the -conwf and -colwf options.
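
As a sketch, with two made-up suspect machines nodec and noded, the
chain could be arranged with the good nodes first and the iffy ones last:

   # Sketch only: chain order A -> nodeb -> nodee -> nodec -> noded,
   # with the iffy machines (nodec, noded) last; all names are placeholders.
   # Scan it in reverse order, as in (2), pushing -colwf -conwf to the
   # chain nodes; the source node itself does not use those options.
   NODELIST="noded nodec nodee nodeb"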