<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> <HTML> <HEAD> <META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9"> <TITLE> RPC2 User Guide and Reference Manual: SFTP Internals</TITLE> <LINK HREF="rpc2_manual-17.html" REL=next> <LINK HREF="rpc2_manual-15.html" REL=previous> <LINK HREF="rpc2_manual.html#toc16" REL=contents> </HEAD> <BODY> <A HREF="rpc2_manual-17.html">Next</A> <A HREF="rpc2_manual-15.html">Previous</A> <A HREF="rpc2_manual.html#toc16">Contents</A> <HR> <H2><A NAME="s16">16. SFTP Internals</A></H2> <P> <A NAME="SFTPInternals"></A> <P> <H2><A NAME="ss16.1">16.1 Background</A> </H2> <P> <P>An SFTP file transfer can take place from either an RPC2 server or an RPC2 client. To avoid confusion we will refer to the transmitting entity as the <EM>source</EM> and the receiving entity as the <EM>sink</EM>. RPC2 clients and servers are not regarded as peers. While an RPC2 client might be someone's personal workstation, an RPC2 server could be serving a large user community. To improve scalability as more clients are added to the system, the servers handle all SFTP flow control, irrespective of whether they are the source or the sink. <P>An RPC2 client can use SFTP to transfer a file simultaneously to more than one RPC2 server using IP multicasting. Multicast file transfers are only possible when the source is an RPC2 client. The sinks will send flow control information to the source, and it will adapt to the requirements of the slowest sink. <P>An SFTP file transfer is basically a cyclic exchange of data and acknowledgements. At the beginning of each cycle, the source will send a block of data packets. It will then wait for an acknowledgement to arrive. The acknowledgement will specify which packets the sink has received. The cycle then repeats: the source retransmits any packets that it knows the sink did not receive, followed by a block of new packets. 
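<P>The cycle described above can be sketched in C as follows. This is only an illustration and not the SFTP code itself: the function <EM>plan_cycle</EM> and its parameters are hypothetical names, and the acknowledgement is simplified to an array of flags. It shows the ordering rule only: packets reported missing are queued first, followed by a block of new packets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of SFTP: plan the packets for one cycle.
 * sent[0..n_sent) holds the sequence numbers sent in the previous cycle,
 * and acked[i] is nonzero if the acknowledgement reported sent[i] as
 * received. Unacknowledged packets are queued first, followed by
 * block_size new packets numbered from next_new. Returns the number of
 * entries written to out, which must have room for n_sent + block_size. */
size_t plan_cycle(const uint32_t *sent, const int *acked, size_t n_sent,
                  uint32_t next_new, size_t block_size, uint32_t *out)
{
    size_t n = 0;

    for (size_t i = 0; i < n_sent; i++) {
        if (!acked[i])
            out[n++] = sent[i];             /* retransmissions go first */
    }
    for (size_t i = 0; i < block_size; i++)
        out[n++] = next_new + (uint32_t)i;  /* then the block of new packets */

    return n;
}
```

<P>For example, if packets 1 to 4 were sent and the acknowledgement reports only 1 and 3 as received, a call with <EM>next_new</EM> = 5 and <EM>block_size</EM> = 2 queues packets 2, 4, 5 and 6, in that order.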
<P>When the source has transmitted a block of data packets, it will wait for the arrival of an acknowledgement. If the source is an RPC2 server and the acknowledgement does not arrive after a predetermined time, the source will retransmit the block of data packets. It basically acts as if it had received an acknowledgement indicating that the entire block of data packets had been lost. <P>If the source is an RPC2 client, however, it will wait passively for an acknowledgement to arrive. If the sink does not receive more data packets after a predetermined period of time, it will conclude that the acknowledgement was lost in transit and retransmit it. <P> <H2><A NAME="ss16.2">16.2 SFTP Code Structure</A> </H2> <P>In this section we describe the SFTP code structure. Our description assumes that the reader is already familiar with the description of basic RPC2 internals in Chapter <A HREF="rpc2_manual-14.html#RPC2Int">XXX</A>. <P> <P> <H3>Thread creation and Initialization</H3> <P>In the base RPC2, the RPC2 client and server communicate via Internet sockets. At both the client and the server, the socket is created during initialization by calling RPC2_Init. A SocketListener thread is present at both ends to monitor these sockets. <P>When SFTP is used, a second set of sockets is created in addition to the above: one at the client and another at the server. These sockets are monitored by the sftp_Listener. Both the sockets and the listener are created during the initialization of the SFTP package. <P>Note that there are two independent channels of communication between the client and the server. The first channel, which we will refer to as the <EM>RPC2 channel</EM>, is associated with the base RPC2. It is used for simple RPCs, for RPCs that request file transfers, and for retransmissions and BUSYs. All other exchanges related to the file transfer are handled by the second channel, which we will refer to as the <EM>sftp channel</EM>. 
<P>As previously mentioned, a file can be transferred from a client to a server or from a server to a client. In the first case, the server is the sink. In the second case, the client is the sink. The code, however, is not symmetric: the code executed when a client is the sink is slightly different from the code executed when a server is the sink. We describe both cases below. <P> <H3>Data Structures used in SFTP</H3> <P>In addition to the data structures used in RPC2 (on the RPC2 channel), SFTP uses a data structure called SFTP_Entry, which is defined in sftp.h. It contains fields relevant to the sftp channel, such as the LocalHandle. Other fields include the state of the file transfer, the packet size, the window size and a number of others. It is created by the call sftp_AllocSEntry. <P> <H3>File transfer from server to client</H3> <P>This is the case in which the server is the source and the client is the sink. The client makes a request for the file by doing an <EM>RPC2_MakeRPC</EM> on the RPC2 channel. This is received by the server's SocketListener, which wakes up a suitable LWP blocked on an <EM>RPC2_GetRequest</EM>. The LWP then calls the routine that is meant to handle this request. This routine contains calls to two routines, <EM>RPC2_InitSE</EM> and <EM>RPC2_CheckSE</EM>. <EM>RPC2_InitSE</EM> initializes certain internal data structures. <EM>RPC2_CheckSE</EM> handles the actual file transfer. The main procedure in <EM>RPC2_CheckSE</EM> which deals with the file transfer from SERVERTOCLIENT is the <EM>PutFile</EM> routine. <P>The <EM>PutFile</EM> routine sets some of the fields of the data structure SEntry, sets the transfer state of SEntry to <EM>XferInProgress</EM> (transfer in progress) and calls <EM>sftp_SendStrategy</EM>. This routine sends a set of packets, using a strategy described in the next section. 
After sending the first set of packets, a <EM>while</EM> loop is entered and executed as long as the transfer state of SEntry is still <EM>XferInProgress</EM>. In the <EM>while</EM> loop, <EM>AwaitPacket</EM> and <EM>sftp_SendStrategy</EM> are called alternately. The <EM>AwaitPacket</EM> routine waits for an ACK, a NAK or a timeout. If a timeout occurs, the packets are retransmitted using <EM>sftp_SendStrategy</EM>. If an ACK is received, the <EM>sftp_AckArrived</EM> routine is called. This routine advances the transmission window and checks to see if the transfer is complete. If so, it sets the SEntry's transfer state to <EM>XferCompleted</EM>, and the <EM>while</EM> loop is exited. Otherwise, the next set of packets is transmitted, after which control is yielded. Note that all these packets are sent on the sftp channel, not the main RPC2 channel. <P>At the client end, the sftp_Listener detects a packet in the socket, receives it and processes it by calling <EM>sftp_ProcessPacket</EM>. After receiving the packet, this routine calls the <EM>ExaminePacket</EM> routine, which sanity-checks the packet and identifies it as a DATA packet. It then calls <EM>sftp_DataArrived</EM>, which sends the requested ACKs and writes the data to disk by calling the <EM>WriteStrategy</EM> routine. The sftp_Listener yields after each packet it processes. <P>The packet from the client is received by the sftp_Listener at the server, which then calls the <EM>ServerPacket</EM> routine to modify the appropriate SEntry. It then does an <EM>IOMGR_Select</EM>, and yields control. Control is then transferred to the LWP waiting on this packet, and the cycle continues. <P> <H3>File transfer from client to server</H3> <P>This is the case in which the server is the sink and the client is the source. As in the previous case, the client makes a request for the file by doing an <EM>RPC2_MakeRPC</EM> on the RPC2 channel. 
This is received by the server's SocketListener, which wakes up a suitable LWP blocked on an <EM>RPC2_GetRequest</EM>. The LWP then calls the routine that is meant to handle this request. This routine contains calls to the two routines <EM>RPC2_InitSE</EM> and <EM>RPC2_CheckSE</EM>. <EM>RPC2_InitSE</EM> initializes some of the fields of the data structure. The main routine in <EM>RPC2_CheckSE</EM> which deals with the file transfer from CLIENTTOSERVER is the <EM>GetFile</EM> routine. <P> <P>The <EM>GetFile</EM> routine sets some of the fields of the data structure SEntry, sets the transfer state of SEntry to <EM>XferInProgress</EM> and sends a <EM>START</EM> packet to the client to tell the client that the server is ready to receive the file. It then enters a <EM>while</EM> loop which is executed as long as the transfer state of SEntry is still <EM>XferInProgress</EM>. In the <EM>while</EM> loop, <EM>AwaitPacket</EM> and <EM>sftp_DataArrived</EM> are called alternately. The <EM>AwaitPacket</EM> routine waits for either a packet to arrive or a timeout. If a timeout occurs, the ACK is retransmitted. If a DATA packet is received, the <EM>sftp_DataArrived</EM> routine is called. This routine in turn calls <EM>sftp_WriteStrategy</EM>. When the file transfer is eventually completed, the transfer state of the SEntry is set to <EM>XferCompleted</EM>, and the loop is exited. <P>The sftp_Listener at the client end receives the packet, decodes it, and calls the <EM>ClientPacket</EM> routine, which in turn identifies the packet as an <EM>SFTP_START</EM> packet. It then calls <EM>sftp_StartArrived</EM>, which sets some of the fields in the SEntry data structure and calls the <EM>sftp_SendStrategy</EM> routine described above. The sftp_Listener then blocks on an IOMGR_Select. Note that it patiently waits for an ACK from the server, and does not retransmit if it does not receive an ACK within a given time. 
What prevents the client from waiting forever is that communication exists between the client and the server on the RPC2 channel in the form of retransmissions and BUSYs. When an ACK arrives, it transmits the next set of packets. <P>The sftp_Listener at the server end receives the packet and calls the <EM>ServerPacket</EM> routine. This routine wakes up the appropriate LWP (which is blocked in the <EM>AwaitPacket</EM> call). (Note that although the client sends a number of packets, the sftp_Listener receives and processes them one at a time, yielding control after each one. The same applies at the server end.) <P>Note that the role of the sftp_Listener is different at the client end and at the server end. At the client end, the whole sftp transfer is handled by the sftp_Listener. At the server end, the sftp_Listener only receives and decodes the packet; most of the sftp transfer is handled by the LWP thread. <P> <H2><A NAME="ss16.3">16.3 Packet formats</A> </H2> <P> <P>All packets carry 32-bit sequence numbers. Data packets and control packets have independent sequence numbers. The sequence number series of the source and sink(s) are also independent of each other. <P>There are thus at least 4 sequences in a connection: <UL> <LI>Source to Sink, Data</LI> <LI>Source to Sink, Control</LI> <LI>Sink to Source, Data</LI> <LI>Sink to Source, Control</LI> </UL> The Sink to Source sequence space is currently never used. When doing a multicast file transfer, each sink will have an independent sequence number series. <P>The sequence number for a particular packet type is incremented by one for each new packet of its type that is sent. <P>The <EM>MOREDATA</EM> flag will be set in each data packet except for the very last one. This is to facilitate end of file detection. If the <EM>ACKME</EM> flag is set on a data packet, it requests an acknowledgement, <EM>ACK</EM>, from all of the servers. <P>Each <EM>ACK</EM> packet describes which packets have been received by the particular server. 
There should be little or no need to transmit an acknowledgement packet for each data packet. It is of particular benefit to limit the number of <EM>ACK</EM> packets given our single channel operating environment, because the acknowledgement packets contend with data packets going in the other direction. <P>Each acknowledgement packet has a 64-bit wide bitmask and an offset counter, <EM>GotEmAll</EM>. This counter is the highest sequence number of a data packet such that it and all preceding data packets have been received. The bitmask indicates which of the data packets with sequence numbers greater than <EM>GotEmAll</EM> have been received. Each bit in the bitmask represents a single packet. <P> <H3>Protocol details</H3> <P> <P>If the source is an RPC2 client, it must first wait for permission from the sink(s) before it can transmit. This permission is granted by a special <EM>START</EM> packet. <P>The following counters are of relevance to the SFTP source protocol machine: <EM>SendLastContig</EM>, which is the sequence number of the latest packet to be moved out of the transmission window, and <EM>SendMostRecent</EM>, which is the sequence number of the data packet last sent. There are also three important transmission parameters: the transmission window size, <EM>AckPoint</EM>, and the size of the <EM>SendAhead</EM> set. <P>When an SFTP source begins the transfer, <EM>SendLastContig</EM> and <EM>SendMostRecent</EM> will be equal. The packets in the <EM>SendAhead</EM> set are transmitted, and <EM>SendMostRecent</EM> is increased by the size of <EM>SendAhead</EM>. Only one of these packets will have the <EM>ACKME</EM> flag set. The relative position of this packet in the <EM>SendAhead</EM> set is given by <EM>AckPoint</EM>. <EM>AckPoint</EM> must thus be less than or equal to the size of the <EM>SendAhead</EM> set. 
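<P>The way a <EM>SendAhead</EM> set is numbered and flagged can be sketched as follows. This is a minimal illustration under stated assumptions, not the SFTP implementation: the names <EM>build_send_ahead</EM>, <EM>sketch_pkt</EM> and <EM>SKETCH_ACKME</EM> are hypothetical, the flag value is arbitrary, and <EM>AckPoint</EM> is assumed to be a 1-based position within the set.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_ACKME 0x1u   /* hypothetical flag bit, not the real SFTP value */

struct sketch_pkt {
    uint32_t seq;    /* data sequence number */
    uint32_t flags;  /* SKETCH_ACKME or 0 */
};

/* Fill out[0..send_ahead) with the next SendAhead set and return the new
 * value of SendMostRecent. ack_point is assumed to be a 1-based position
 * within the set, so 1 <= ack_point <= send_ahead. */
uint32_t build_send_ahead(uint32_t send_most_recent, uint32_t send_ahead,
                          uint32_t ack_point, struct sketch_pkt *out)
{
    assert(ack_point >= 1 && ack_point <= send_ahead);

    for (uint32_t i = 0; i < send_ahead; i++) {
        out[i].seq = send_most_recent + 1 + i;
        /* exactly one packet in the set asks for an acknowledgement */
        out[i].flags = (i + 1 == ack_point) ? SKETCH_ACKME : 0;
    }
    return send_most_recent + send_ahead;
}
```

<P>With <EM>SendMostRecent</EM> = 0, a <EM>SendAhead</EM> size of 8 and an <EM>AckPoint</EM> of 4, the sketch numbers the packets 1 to 8, sets the flag only on packet 4, and returns 8 as the new <EM>SendMostRecent</EM>.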
<P>Packets which have been sent and for which an <EM>ACK</EM> has been requested but not yet received fall into two categories: the <EM>NeedAck</EM> set and the <EM>Worried</EM> set. They are distinguished by whether or not a retransmission timeout has occurred since they were sent. Packets in the <EM>NeedAck</EM> set have been sent and an <EM>ACK</EM> has been requested, but not enough time has passed to be worried about the fact that an <EM>ACK</EM> has not been received. Packets which have been sent but for which an <EM>ACK</EM> has not yet been requested, if any, are called the <EM>InTransit</EM> set. The <EM>InTransit</EM> set will always be empty if <EM>AckPoint</EM> equals the <EM>SendAhead</EM> size. <P>The source then waits for an <EM>ACK</EM> packet from the sink. Our implementation uses the waiting time to prefetch more data from the disk. Under ideal conditions the source will proceed only after having received the <EM>ACK</EM> it is waiting for. In practice, however, it may time out and retransmit data packets if it is operating as an RPC2 server. <P>At this point, the source will revise the <EM>Worried</EM> set. Any packets that have been acknowledged will be taken off the <EM>Worried</EM> set. The transmit window is shifted by increasing the <EM>SendLastContig</EM> counter. It will be set to one less than the smallest sequence number of a member in the <EM>Worried</EM> set. <P>All the packets that are in the <EM>Worried</EM> set are retransmitted, followed by <EM>SendAhead</EM> new packets. No packets will have the <EM>ACKME</EM> flag set, except for the member of the <EM>SendAhead</EM> set whose index is given by <EM>AckPoint</EM>. The packets in the new <EM>SendAhead</EM> set are then added either to the <EM>NeedAck</EM> set or to the <EM>InTransit</EM> set, depending upon the <EM>AckPoint</EM> value, as described above. Packets are placed in the <EM>Worried</EM> set only after a retransmission interval has expired. 
The procedure is repeated until the file has been completely transferred. At no point, however, will the protocol have more packets outstanding than what is given by the transmit window size. Whenever the sum of the number of packets in the various sets is greater than the transmit window size, only the first packet in the <EM>Worried</EM> set will be sent. <P> <H3>Sink side operation</H3> <P> <P>The sink keeps two counters, <EM>RecvLastContig</EM> and <EM>RecvMostRecent</EM>, which are similar to their counterparts at the source side. <EM>RecvLastContig</EM> is the sequence number of the last data packet such that it and all previous data packets have been received. It is used as the <EM>GotEmAll</EM> counter when sending an <EM>ACK</EM> packet. <EM>RecvMostRecent</EM> is the highest sequence number of a packet received so far. <P>Before the file transfer takes place, the source will inform the sink about the parameters <EM>RetryInterval</EM>, <EM>RetryCount</EM> and <EM>DupThreshold</EM>. If the sink is an RPC2 server it will grant the source permission to transmit data packets by sending a <EM>START</EM> packet. It will then start waiting for data packets. If the sink is an RPC2 server and no data packet has arrived after the time specified by <EM>RetryInterval</EM>, it will send an <EM>ACK</EM> (or <EM>START</EM>) packet, trying to cause the source to retransmit its data. If this fails <EM>RetryCount</EM> times, without any valid data packet being received, the sink will consider the connection unusable. <P>The sink keeps track of the number of duplicate data packets that have arrived since the last time an <EM>ACK</EM> was sent. If that number exceeds <EM>DupThreshold</EM>, the sink will send an <EM>ACK</EM> in an attempt to inform the source about the situation. 
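<P>How a sink might derive the contents of an <EM>ACK</EM> from what it has received can be sketched as follows. This is an illustration only: the names <EM>sketch_ack</EM>, <EM>build_ack</EM> and <EM>ack_reports</EM> are hypothetical, data sequence numbers are assumed to start at 1, and the bit layout (bit <EM>i</EM>, counting from the least significant bit, stands for packet <EM>GotEmAll</EM> + 1 + <EM>i</EM>) is an assumption that may differ from the layout used on the wire.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_ack {
    uint32_t got_em_all;  /* highest seq such that it and all earlier ones arrived */
    uint64_t bitmask;     /* assumed layout: bit i stands for got_em_all + 1 + i */
};

/* received[s] is nonzero if data packet s has arrived. Index 0 is unused
 * because sequence numbers are assumed to start at 1. recv_most_recent is
 * the highest sequence number seen so far (RecvMostRecent). */
struct sketch_ack build_ack(const unsigned char *received,
                            uint32_t recv_most_recent)
{
    struct sketch_ack ack = {0, 0};

    /* advance over the contiguous prefix: this is RecvLastContig */
    while (ack.got_em_all < recv_most_recent && received[ack.got_em_all + 1])
        ack.got_em_all++;

    /* record packets received beyond the first gap, up to 64 of them */
    for (uint32_t s = ack.got_em_all + 1; s <= recv_most_recent; s++) {
        uint32_t bit = s - ack.got_em_all - 1;
        if (bit < 64 && received[s])
            ack.bitmask |= (uint64_t)1 << bit;
    }
    return ack;
}

/* Source-side check: does this ACK report packet seq as received? */
int ack_reports(const struct sketch_ack *ack, uint32_t seq)
{
    if (seq <= ack->got_em_all)
        return 1;
    uint32_t bit = seq - ack->got_em_all - 1;
    return bit < 64 && ((ack->bitmask >> bit) & 1u);
}
```

<P>If packets 1, 2, 4 and 6 have arrived, the sketch yields a <EM>GotEmAll</EM> of 2 and a bitmask with the bits for packets 4 and 6 set, so the source can tell that packets 3 and 5 are missing.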
<P> <H3>Client and server invariants</H3> <P> <P>The state of the counters at the source and sink can be summarized by the following invariant relations, where SendAckLimit and SendWorriedLimit are upper bounds of the NeedAck and Worried sets, respectively. <P> <P> <H3>Invariants when transfer is in progress </H3> <P> <OL> <LI>SendLastContig &lt;= SendWorriedLimit &lt;= SendAckLimit &lt;= SendMostRecent<BR> (SendMostRecent - SendLastContig) &lt;= WindowSize<BR> (SendMostRecent - SendAckLimit) &lt;= SendAhead</LI> <LI>RecvLastContig &lt;= RecvMostRecent<BR> (RecvMostRecent - RecvLastContig) &lt;= WindowSize</LI> </OL> <P> <H3>Invariants when transfer is completed, aborted or not started </H3> <P> <OL> <LI>SendLastContig (at source) = SendMostRecent (at source) </LI> <LI>RecvLastContig (at sink) = RecvMostRecent (at sink) </LI> <LI>SendLastContig (at source) = RecvLastContig (at sink)</LI> </OL> <P> <H2><A NAME="ss16.4">16.4 Adjusting the Retransmission Interval</A> </H2> <P> <P>SFTP uses the retransmission interval to determine when it should be worried about packets for which it has not received an <EM>ACK</EM>. Initially, the retransmission interval is set to <EM>SFTP_RetryInterval</EM> (2 seconds), but it varies depending on RTT observations collected during file transfers. When a timeout occurs, the retransmission timer is backed off. The backed off timer is independent of the RTT state in the <EM>sEntry</EM>. <P>Like RPC2, SFTP collects RTT observations by using packet timestamps. Timestamping and the storage of RTT state are the same as presented in chapter <A HREF="rpc2_manual-15.html#RetryChapter">XXX</A>. In SFTP, timestamping is <EM>two-way</EM>: both source and sink collect RTT observations during a transfer. Both timestamp fields in the packet header are used: one for the current timestamp, and one for the timestamp being echoed back to the other side. <P>The source collects observations as follows: it timestamps outgoing <EM>DATA</EM> packets. 
The sink echoes a timestamp back on the <EM>ACK</EM> packet. When the <EM>ACK</EM> arrives at the source, the source computes the RTT for that send-ahead set, and updates its RTO accordingly. <P>The sink collects observations as follows: it timestamps <EM>START</EM>, <EM>ACK</EM>, and <EM>TRIGGER</EM> packets. (A trigger packet is an <EM>ACK</EM> that the server sends because it has timed out waiting for the client.) The source echoes the timestamp back on the first <EM>DATA</EM> packet sent in response to such a packet. When that <EM>DATA</EM> packet arrives at the sink, the sink computes the RTT and updates its RTO. If the first packet gets lost, no update is performed. If it is delayed, the update is performed when it arrives. <P>All that is needed for state in the <EM>sEntry</EM> is a single word, <EM>TimeEcho</EM>, to hold the timestamp that will next be echoed on a packet. Each packet may carry up to two timestamps: one is the time at which the sender sent it, and the other is the echoed timestamp. (Only <EM>ACK</EM> packets and certain <EM>DATA</EM> packets actually use both fields.) The spare2 and spare3 fields of the packet header are used for these fields, called <EM>TimeStamp</EM> and <EM>TimeEcho</EM>. These fields were previously reserved for bitmask fields, but were not being used. <P>Packets are timestamped as they are sent out in the following routines: <UL> <LI> sftp_SendSendAhead, sftp_ResendWorried, sftp_SendFirstUnacked (data) </LI> <LI> sftp_SendAck </LI> <LI> sftp_SendStart</LI> </UL> <P>The packet TimeStamp field is stashed in sEntry->TimeEcho as appropriate when a packet with a timestamp is received. This is the timestamp that will eventually be echoed back to the other side. This occurs in: <UL> <LI>sftp_DataArrived, on the sink, if the packet advances the left edge of the window (Header.SeqNumber == sEntry->RecvLastContig+1). </LI> <LI>sftp_StartArrived, on the source, whether the transfer has started or not. 
Data will be sent in response to the <EM>START</EM> packet either way. </LI> <LI>sftp_AckArrived, on the source. If there is more data to send, the source will send it in response to this packet.</LI> </UL> <P>The value in sEntry->TimeEcho is then placed in the Header.TimeEcho field in the following routines: <UL> <LI>sftp_SendAck, on the sink. </LI> <LI>sftp_SendSendAhead, sftp_ResendWorried, or sftp_SendFirstUnacked, on the source. The timestamp is echoed on the <EM>first</EM> packet sent out by these collectively (the one corresponding to sEntry->SendLastContig+1). All other <EM>DATA</EM> packets carry a TimeEcho of 0. A special case occurs in the first set of <EM>DATA</EM> packets on a server-to-client transfer, from PutFile. In this case there is no timestamp to echo, because the source does not hear from the sink before sending data. In this case, sEntry->TimeEcho is set to 0 at the top of PutFile. A second special case also occurs in PutFile, when the server times out. Again, there is no timestamp to echo, because the data is not being sent in response to a packet from the sink.</LI> </UL> <P>RTT measurements are computed from Header.TimeEcho in the following routines: <UL> <LI>sftp_AckArrived, on the source, if the <EM>ACK</EM> is not a trigger. Triggers are sent when the server times out during a client-to-server transfer. They do not represent real observations because there was no transmission from the source that caused them. Triggers are marked so that the source can distinguish them from real <EM>ACK</EM>s. </LI> <LI>sftp_DataArrived, on the sink.</LI> </UL> <P>Any zero TimeStamp or TimeEcho is ignored, and the RTO remains unchanged. This is chiefly for compatibility with versions of SFTP that do not use packet timestamps. RTT state in the sEntry is initialized on the client in <EM>SFTP_Bind2</EM>, using the BindTime supplied by RPC2. 
On the server, it is initialized in <EM>SFTP_GetRequest</EM>, using the same BindTime shipped to the server on the first request on the connection. <P> <H2><A NAME="ss16.5">16.5 Performance</A> </H2> <P> <P>RPC2 and SFTP perform well over a wide range of network speeds. Figure RPC2TableXXX compares the performance of SFTP and TCP over three different networks: Ethernet, a WaveLan wireless network, and a modem over a phone line. In almost all cases, SFTP's performance equals or exceeds that of TCP. <P> <HR> <A HREF="rpc2_manual-17.html">Next</A> <A HREF="rpc2_manual-15.html">Previous</A> <A HREF="rpc2_manual.html#toc16">Contents</A> </BODY> </HTML>