Instructions for using nettee with SystemImager - method 2

Method 1, described in the other document, is adequate for a small
cluster.  As pointed out to me by Greg Kurtzer, Sean Dague, and Brian
Elliot Finley, for a large cluster one must consider the possibility
of node failures.  Here are some suggested modifications to the
preceding method to work around that problem.

1.  Do not load nettee values from DHCP.  Instead have each node boot
    with a script like this:

       ARGSFORNEXT=`nettee -next _EOC_`
       nettee $ARGSFORNEXT -out /dev/hda

2.  Have the master node scan the nodes with nettee, attempting to
    download to each its parameters.  That is:

       echo "ARGS PROPERLY ESCAPED" | nettee -next $TARGETNODE -t 1 -in -

    (Note that there is no -w, so nettee will fail immediately if the
    target node isn't up or isn't running nettee.)  If the return
    status of one nettee is a failure, then its NEXTNODE value is
    reused for the next good node on the list.  For instance, if A is
    distributing to the chain B C D E, something like this would
    happen if node C is bad:

       node   nextnode probed    Probe order
       B      ok, loaded D       4
       C      (down) bad         3
       D      ok, loaded E       2
       E      ok, loaded _EOC_   1

    Then A uses B for its -next value.  Writing the scanning script is
    left as an exercise for the reader.

3.  It's possible that a node may fail DURING the transfer.  This is
    much harder to deal with than a node that never managed to reach
    the point in the script where nettee runs, which is handled under
    (2) above.  However, by having the nodes in the chain (but not the
    source node) employ -colwf (Continue On Local Write Failure) and
    -conwf (Continue On Network Write Failure), it is possible to
    salvage as much of the distribution chain as possible.  If either
    (or both) of those errors is encountered, the top node will emit
    an error message indicating that this type of problem occurred
    somewhere in the chain.  It cannot say where in the chain, though.
    A subsequent post mortem scan of the nodes should be able to
    determine where the problem lay.
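As an aside, the scanning script left as an exercise in (2) might be
sketched as follows.  This is only an illustration of the chain logic:
the node list, the DOWN_NODES variable, and the probe_node function are
hypothetical stand-ins, and probe_node would in real use run the
`echo "ARGS" | nettee -next $TARGETNODE -t 1 -in -` command from (2)
rather than simulate it.

```shell
#!/bin/sh
# Sketch of the master's scanning pass (assumptions: node names and
# probe_node are placeholders; a real probe would invoke nettee).

NODES="B C D E"       # the chain, in distribution order
DOWN_NODES="C"        # hypothetical: nodes that will fail the probe

probe_node() {
    # $1 = target node, $2 = the -next value it should be loaded with.
    # Real version: echo "ARGS PROPERLY ESCAPED" | nettee -next "$1" -t 1 -in -
    case " $DOWN_NODES " in
        *" $1 "*) return 1 ;;         # simulate a dead node
    esac
    echo "loaded $1 with -next $2" >&2
    return 0
}

# Probe the chain back to front, so each good node learns the nearest
# good node (or _EOC_) below it.
NEXT=_EOC_
REVERSED=""
for n in $NODES; do REVERSED="$n $REVERSED"; done

for node in $REVERSED; do
    if probe_node "$node" "$NEXT"; then
        NEXT=$node          # a good node becomes the upstream -next
    fi                      # on failure, $NEXT is reused for the next
done                        # good node up the chain, as in the table

echo "master should use: nettee -next $NEXT ..."
```

Run against the B C D E example with C down, this ends with NEXT=B,
matching the table above: A uses B for its -next value.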
    To aid in this post mortem analysis it might be a good idea for
    the automatic load script on each node to look something like
    this:

       echo "Before first nettee" > /etc/postmortem.txt
       ARGSFORNEXT=`nettee -next _EOC_`
       echo "Exit Status of first nettee $? " >> /etc/postmortem.txt
       nettee $ARGSFORNEXT -out /dev/hda
       echo "Exit Status of second nettee $? " >> /etc/postmortem.txt

4.  My gut feeling is that nodes that don't (re)boot, case 2 above,
    are more common than nodes that fail once they have booted
    normally and managed to run nettee once.  In my own experience
    with 20 nodes a nonrebooting node isn't that uncommon; however, a
    node that fails during a download was probably iffy to start with,
    so you probably already knew that it might fail.  If you _must_
    reload known iffy nodes along with the good ones, put the bad ones
    as near to the end of the chain as possible and employ the -conwf
    and -colwf options.
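The post mortem scan mentioned in (3) could then classify each node
from a copy of its /etc/postmortem.txt.  The sketch below is
hypothetical: the classify_postmortem function and the message strings
are inventions for illustration (and fetching the file from each node,
e.g. over rsh/ssh, is not shown); only the file format comes from the
load script above.

```shell
#!/bin/sh
# Sketch: classify a node's state from a local copy of its
# /etc/postmortem.txt (file format as written by the load script).

classify_postmortem() {
    # $1 = path to a copy of the node's postmortem.txt
    if [ ! -s "$1" ]; then
        echo "never booted far enough to run the load script"
        return
    fi
    if ! grep -q 'second nettee' "$1"; then
        echo "failed before or during parameter pickup"
        return
    fi
    if grep -q 'second nettee 0' "$1"; then
        echo "transfer completed"
    else
        echo "failed during transfer"
    fi
}

# Example with a fabricated file representing a successful node:
TMP=`mktemp`
cat > "$TMP" <<'EOF'
Before first nettee
Exit Status of first nettee 0 
Exit Status of second nettee 0 
EOF
classify_postmortem "$TMP"      # transfer completed
rm -f "$TMP"
```

A node that died mid-download would show a nonzero second exit status,
or no second line at all, which is exactly the distinction (3) needs.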