%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% %W pargap6.tex ParGAP documentation Gene Cooperman %% %H $Id: pargap6.tex,v 1.5 2001/07/28 23:49:06 gap Exp $ %% %Y Copyright (C) 1999-2001 Gene Cooperman %Y See included file, COPYING, for conditions for copying %% \Chapter{MPI commands and UNIX system calls in ParGAP} This chapter can be safely ignored on a first reading, and maybe permanently. It is for application programmers who wish to develop their own low-level message-based parallel application. The additional UNIX system calls in {\ParGAP} may also be useful in some applications. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \Section{Tutorial introduction to the MPI C library} \atindex{MPI}{@MPI} \atindex{Message Passing Interface}{@Message Passing Interface} \atindex{MPI model}{@MPI model}\atindex{tutorial!MPI}{@tutorial!MPI} This section lists some of the more common message passing commands, followed by a short MPI example. The next section ("Other low level commands") contains more (but by no means all) of the MPI commands and some UNIX system calls. The {\ParGAP} binding provides a simplified form that makes interactive usage easier. This section describes the original MPI binding in~C, with some comments about the interactive versions provided in {\ParGAP}. (The MPI standard includes a binding both to~C and to~FORTRAN.) Even if your ultimate goal is a standalone C-based application, it is useful to prototype your application with equivalent commands executed interactively within {\ParGAP}. Note that this distribution includes a subdirectory `mpinu', which provides a subset MPI implementation in~C with a small <footprint>. It consists of a C~library, `libmpi.a', and an include file `mpi.h'. The library is approximately 150~KB. The subdirectory can be consulted for further details. We first briefly explain some MPI concepts. \beginitems *rank*:& The *rank* of an MPI process is a unique ID number associated with the process. By convention, the console (master) process has rank~0. The ranks of the process are guaranteed by MPI to form a consecutive, ascending sequence of integers, starting with~0. *tag*:& Each message has associated with it a non-negative integer *tag* specified by the application. Our interface allows you to ignore tags by letting them take on default values. Typical application uses for tags are either to choose consecutive integers in order to guarantee that all messages can be re-assembled in sequence, or to choose a fixed set of constant tags, each constant associated with another type of message. In the latter case, one might have integers for a `QUIT_TAG', an `INTEGER_ARRAY_TAG', `START_TASK2_TAG', etc. In fact, our implementation of the Slave Listener and `MasterSlave()' specifically uses certain tags of value 1000 and higher for these purposes. Hence, application routines that do use tags should restrict themselves to tags `[0..999]'. *communicator*:& A *communicator* in MPI serves the purpose of a namespace. Most MPI commands require a communicator argument to specify the namespace. MPI starts up with a default namespace, `MPI_COMM_WORLD'. The {\ParGAP} implementation always assumes that single namespace. A namespace is important in MPI to build modules and library routines, so that a thread may distinguish messages meant for itself, or to catch errors of cross-communication between two modules. *message*:& Each message in MPI is typically implemented to include fields for the source rank, destination rank (optional), tag, communicator, count, and an array of data. The <count> field specifies the length of the array. MPI guarantees that messages are non-overtaking, in the sense that if two messages are sent from a single source process to the same destination process, then it is guaranteed that the first process sent will be the first one to arrive, and will be received or probed first from the queue. *other*:& MPI also has concepts of <datatype>, <derived datatype>, <group>, <topology>, etc. This implementation defaults those values, so that <datatype> is always a character (hence the use of strings in {\ParGAP}), no <derived datatypes> are implemented, <group> is always consistent with `MPI_COMM_WORLD', and <topology> is the fully connected topology. *communication*:& This implementation implements only point-to-point communication (always blocking receives, except for `MPI_Iprobe', and sends can be blocking or not, according to the default underlying sockets). *collective communication*:& The MPI standard also provides for *collective communication*, which sets up a barrier in which all process within the named communicator must participate. One process is distinguished as the *root* process in cases of asymmetric usage. {\ParGAP} does not implement any collective communication (although you can easily emulate it using a sequence of point-to-point commands). The MPI subset distribution (in {\ParGAP}'s `mpinu' directory) does provide some commands for collective communication. Examples of MPI collective communication commands are `MPI_Bcast' (broadcast), `MPI_Gather' (place an entry from each process in an array residing on the root process), `MPI_Scatter' (inverse of gather), `MPI_Reduce' (execute a commutative, associative function with an entry from each process and store on root; example functions are `sum', `and', `xor', etc. *dynamic processes*:& The newer MPI-2 standard allows for the dynamic creation of new processes on new processors in an ongoing MPI computation. The standard is silent on whether an MPI session should be aborted if one of its member processes dies, and the MPI standard provides no mechanism to recognize such a dead process. Part of the reason for this silence is that much of the ancestry of MPI lies in dedicated parallel computers for which it would be unusual for one process or processor to die. \enditems Here is a short extract of MPI code to illustrate its flavor. It illustrates the C equivalents of the following {\ParGAP} commands. Note that the {\ParGAP} versions noted here take fewer parameters than their C-based cousins, and {\ParGAP} includes defaults for some optional parameters. \>MPI_Init()!{example} % [ called for you automatically when ParGAP is loaded ] F \>MPI_Finalize()!{example} [ called for you automatically when GAP quits ] F \>MPI_Comm_rank()!{example} F \>MPI_Get_count()!{example} F \>MPI_Get_source()!{example} F \>MPI_Get_tag()!{example} F \>MPI_Comm_size()!{example} F \>MPI_Send( <string buf>, <int dest>[, <int tag = 0> ] )!{example} F \>MPI_Recv( <string buf> [, <int source = MPI_ANY_SOURCE>[, % <int tag = MPI_ANY_TAG> ] ] )!{example} F \>MPI_Probe( [ <int source = MPI_ANY_SOURCE>[, % <int tag = MPI_ANY_TAG> ] ] )!{example} F Many of the above commands have analogues at a higher level in section~"Slave Listener Commands" as `GetLastMsgSource()', `GetLastMsgTag()', `MPI_Comm_size() = TOPCnumSlaves + 1', `SendMsg()', `RecvMsg()' and `ProbeMsg()'. \begintt #include <stdlib.h> #include <mpi.h> #define MYCOUNT 5 #define INT_TAG 1 main( int argc, char *argv[] ) { int myrank; MPI_Init( &argc, &argv ); MPI_Comm_rank( MPI_COMM_WORLD, &myrank ); if ( myrank == 0 ) { int mysize, dest, i; int buf; printf("My rank (master): %d\n", myrank); for ( i=0; i<MYCOUNT; i++ ) buf = 5; MPI_Comm_size( MPI_COMM_WORLD, &mysize ); printf("Size: %d\n", mysize); for ( dest=1; dest< mysize; dest++ ) MPI_Send( &buf, MYCOUNT, MPI_INT, dest, INT_TAG, MPI_COMM_WORLD ); } else { int i; MPI_Status status; int source; int count; int *buf; printf("My rank (slave): %d\n", myrank); MPI_Probe( MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status ); printf( "Message pending with rank %d and tag %d.\n", status.MPI_SOURCE, status.MPI_TAG ); if ( status.MPI_TAG != INT_TAG ) printf("Error: Bad tag.\n"); exit(1); MPI_Get_count( &status, MPI_INT, &count ); printf( "The count of how many data units (MPI_INT) is: %d.\n", count ); buf = (int *)malloc( count * sizeof(int) ); source = status.MPI_SOURCE; MPI_Recv( buf, count, MPI_INT, source, INT_TAG, MPI_COMM_WORLD, &status ); for ( i=0; i<MYCOUNT; i++ ) if ( *buf != 5 ) printf("error: buf[%d] != 5\n", i); printf("slave %d done.\n", myrank); } MPI_Finalize(); exit(0); } \endtt Even in this simplistic example, it was important to specify \begintt MPI_Recv( buf, count, MPI_INT, source, INT_TAG, MPI_COMM_WORLD, &status ); \endtt and not to use `MPI_ANY_SOURCE' instead of the known source. Although this alternative would often work, there is a danger that there might be a second incoming message from a different source that arrives between the calls to `MPI_Probe()' and `MPI_Recv()'. In such an event, MPI would be free to receive the second message in `MPI_Recv()', even though the appropriate count of the second message is likely to be different, thus risking an overflow of the `buf' buffer. Other typical bugs in MPI programs are: \beginlist \item{$\bullet$} Incorrectly matching corresponding sends and receives or having more or fewer sends than receives due to the logic of multiple sends and receives within distinct loops. \item{$\bullet$} Reaching deadlock because all active processes have blocking calls to `MPI_Recv()' while no process has yet reached code that executes `MPI_Send()'. \item{$\bullet$} Incorrect use of barriers in collective communication, whereby one process might execute: \begintt MPI_Send( buf, count, datatype, dest, tag, COMM_1 ); MPI_Bcast( buffer, count, datatype, root, COMM_2 ); \endtt \item{} and a second executes \begintt MPI_Bcast( buffer, count, datatype, root, COMM_2 ); MPI_Recv( buf, count, datatype, dest, tag, COMM_1, status ); \endtt \item{} If the call to `MPI_Send()' is blocking (as is the case for long messages in the case of many implementations), then the first process will block at `MPI_Send()' while the second blocks at 'MPI_Bcast()'. This happens even though they use distinct communicators, and the send-receive communication would not normally interact with the broadcast communication. \endlist Much of the TOP-C method in {\ParGAP} (see chapters~"Basic Concepts for the TOP-C model (MasterSlave)" and~"MasterSlave Tutorial") was developed precisely to make errors like those above syntactically impossible. The slave listener layer also does some additional work to keep track of the `status' that was last received and other bookkeeping. Additionally, the TOP-C method was designed to provide a higher level, task-oriented ``language'', which would naturally lead the application programmer into designing an efficient high level algorithm. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \Section{Other low level commands} \atindex{MPI commands!All ParGAP bindings}{@MPI commands!All ParGAP bindings} \atindex{UNIX system calls!All ParGAP bindings}% {@UNIX system calls!All ParGAP bindings} Here is a complete listing of the low level commands available in {\ParGAP}. Some of these commands were documented elsewhere. The remaining ones are not recommended for most users. Nevertheless, it may be useful to others for more sophisticated applications. For most of these commands, the source code is the ultimat documentation. However, you may be able to guess at the meaning of many of them based on their names and their similarity UNIX system calls (in the case of `UNIX_...') or MPI commands (in the case of `MPI...'). Some of the commands will also show you their calling parameters if called with the wrong number of arguments. Many of the MPI commands have simplified calling parameters with certain arguments optional or set to defaults, making them easier for interactive use. \atindex{UNIX functions}{@UNIX functions} \atindex{functions!UNIX}{@functions!UNIX} \>UNIX_MakeString( <len> ) F \>UNIX_DirectoryCurrent() [ Defined in `pkg/pargap/lib/slavelist.g' ] F \>UNIX_Chdir( <string> ) F \>UNIX_FflushStdout() F \>UNIX_Catch( <function>, <return_val> ) F \>UNIX_Throw() F \>UNIX_Getpid() F \>UNIX_Hostname() F \>UNIX_Alarm( <seconds> ) F \>UNIX_Realtime() F \>UNIX_Nice( <priority> ) F \>UNIX_LimitRss( <bytes_of_ram> ) [ = setrlimit(RLIMIT_RSS, ...) ] F \atindex{MPI functions}{@MPI functions} \atindex{functions!MPI}{@functions!MPI} \>MPI_Init() F \>MPI_Initialized() F \>MPI_Finalize() F \>MPI_Comm_rank() F \>MPI_Get_count() F \>MPI_Get_source() F \>MPI_Get_tag() F \>MPI_Comm_size() F \>MPI_World_size() F \>MPI_Error_string( <errorcode> ) F \>MPI_Get_processor_name() F \>MPI_Attr_get( <keyval> ) F \>MPI_Abort( <errorcode> ) F \>MPI_Send( <string buf>, <int dest>[, <int tag = 0> ] ) F \>MPI_Recv( <string buf> [, <int source = MPI_ANY_SOURCE>[, <int tag = MPI_ANY_TAG> ] ] ) F \>MPI_Probe( [ <int source = MPI_ANY_SOURCE>[, <int tag = MPI_ANY_TAG> ] ] ) F \>MPI_Iprobe( [ <int source = MPI_ANY_SOURCE>[, <int tag = MPI_ANY_TAG> ] ] ) F \atindex{MPI global constants}{@MPI global constants} \atindex{constants!MPI, global}{@constants!MPI, global} \>`MPI_ANY_SOURCE' V \>`MPI_ANY_TAG' V \>`MPI_COMM_WORLD' V \>`MPI_TAG_UB' V \>`MPI_HOST' V \>`MPI_IO' V %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% %E