<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> <HTML> <HEAD> <META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9"> <TITLE>RVM: Recoverable Virtual Memory, Release 1.3: RVM Internals</TITLE> <LINK HREF="rvm_manual-8.html" REL=next> <LINK HREF="rvm_manual-6.html" REL=previous> <LINK HREF="rvm_manual.html#toc7" REL=contents> </HEAD> <BODY> <A HREF="rvm_manual-8.html">Next</A> <A HREF="rvm_manual-6.html">Previous</A> <A HREF="rvm_manual.html#toc7">Contents</A> <HR> <H2><A NAME="RVMinternals"></A> <A NAME="s7">7. RVM Internals</A></H2> <P><EM>By Yui Wah LEE</EM> <P> <H2><A NAME="ss7.1">7.1 Overview of RVM internal structure</A> </H2> <P>RVM is designed as a <EM>run-time library</EM> to be linked with an <EM>external application</EM>. It allows the application to benefit from <EM>transactional</EM> support. In particular, it allows operations on virtual memory to be <EM>atomic</EM> and <EM>persistent</EM>, provided that the application declares its intentions through the RVM application programming interface. <P>Figure <A HREF="#fig_overview">fig_overview</A> shows the three important components of RVM: <EM>virtual memory</EM>, the <EM>data segment</EM> and the <EM>log</EM>. <P>Virtual memory is where the application manipulates its data. RVM allows the application to declare <EM>regions</EM> of virtual memory to be associated with regions in external data segments. The data segments are the persistent backing store of the virtual memory regions: the latest committed image of each virtual memory region has a copy in the data segments. Since a data segment resides on stable storage (disk), it persists across program crashes. Should an application crash, it can always recover those memory regions from the data segment. The association of virtual memory regions with segment regions is called <EM>mapping</EM>. <P>For performance reasons, RVM does not write the committed image of virtual memory regions <EM>directly</EM> back to the data segment.
Doing so would generate excessive seek operations on the magnetic head of the disk, since the access pattern can be random. Disk head seeks are expensive operations and should be minimized. RVM achieves this by writing the <EM>new values</EM> of memory ranges as <CODE>records</CODE> in an <EM>append-only</EM>, <EM>sequential</EM> log; this is called a <EM>flush</EM> operation. The log is also on stable storage (disk or battery-backed ramdisk). Once the new values are written to the log (flushed), they are persistent across program crashes. From time to time, records in the log are processed and applied to the data segment; after that, the log space used by the records can be reclaimed for further use. This is called <EM>log truncation</EM>. Log truncation is necessary: without it, the log would grow without bound. <P> <FIGURE> <EPS FILE="overview"> <CAPTION> <A NAME="fig_overview"></A> Overview</CAPTION> </FIGURE> <P> <H2><A NAME="ss7.2">7.2 RVM application programming interface</A> </H2> <P>To facilitate our discussion, let us first briefly review the RVM application programming interface. Details of the interfaces can be found in the chapter <A HREF="rvm_manual-4.html#rvmAPI">on RVM Library Functions</A>.<P> <HR> <PRE>
rvm_options_t options;
char *version;
rvm_region_t region;
rvm_tid_t *tid;
char *dest, *src;
unsigned long length;

rvm_initialize(version, options)
rvm_map(region, options);
rvm_begin_transaction(tid, restore);
rvm_begin_transaction(tid, no_restore);
rvm_set_range(tid, dest, length);
rvm_modify_bytes(tid, dest, src, length);
rvm_end_transaction(tid, flush);
rvm_end_transaction(tid, no_flush);
rvm_abort_transaction(tid);
rvm_flush();
rvm_truncate();
rvm_unmap(region);
rvm_terminate();
</PRE> <HR> The application declares the regions of VM for which it wants a persistent image by calling <CODE>rvm_map()</CODE>. There are two possible modes for <CODE>rvm_map()</CODE>: <CODE>copy</CODE> and <CODE>no_copy</CODE>.
The former is the default and causes the corresponding region of the segment device to be read into virtual memory. The latter is used when the application is sure the image on the segment device is useless (e.g., the application is going to replace it immediately). <P>The application declares the beginning of a transaction by calling <CODE>rvm_begin_transaction()</CODE>. The transaction can be committed by calling <CODE>rvm_end_transaction()</CODE>, or aborted by calling <CODE>rvm_abort_transaction()</CODE>. In addition, the application must inform RVM of the memory locations that will be affected by the transaction. This is done by calling <CODE>rvm_set_range()</CODE> or <CODE>rvm_modify_bytes()</CODE>. RVM guarantees that if the transaction commits, all the modifications made on the declared ranges take effect atomically; conversely, if the transaction aborts, none of the modifications take effect. <P>Normally, when a transaction is aborted, the <EM>old values</EM> in the affected memory ranges are <EM>restored</EM>. The default option for <CODE>rvm_begin_transaction</CODE> is therefore <CODE>restore</CODE>. However, in some cases the application can be sure that it will never abort the transaction. For example, if there are no concurrent transactions, the application may know that it will never abort the transaction explicitly. In these cases, the application can start the transaction in <CODE>no_restore</CODE> mode. <P>If the application wants a committed transaction to be recorded on the log device (stable storage) <EM>immediately</EM>, it should commit the transaction in <CODE>flush</CODE> mode. In this mode, the <CODE>rvm_end_transaction()</CODE> call returns only after the transaction records have been written to disk. However, there are cases where the application can accept a slight chance of loss of permanence. In those cases, it can end the transaction in <CODE>no_flush</CODE> mode.
There is a big payoff in performance, as <CODE>rvm_end_transaction()</CODE> returns immediately, without waiting for the slow disk I/O to complete. RVM puts the committed transactions in a queue and writes them out to the log device later. This is called <EM>lazy commit</EM> and it provides <EM>bounded persistence</EM>: the committed transactions may be lost if the program crashes before the queued transactions get a chance to be written out to the log. Note that although a transaction may be lost, it is lost atomically. Upon the next restart of the application, RVM will restore an older but consistent view of the virtual memory. <P>Committing transactions lazily also gives more opportunity for inter-transaction optimization. <P>To force all previously committed transaction records to be written out to the log device, the application can call <CODE>rvm_flush()</CODE>. Also, whenever the application commits a new transaction in <CODE>flush</CODE> mode, there is a side effect that all previous transactions on the queue are also flushed. <P>The log device is only temporary stable storage for committed transactions: its primary purpose is to allow records to be written to disk sequentially. Eventually, these records need to be applied to the data segment. The process of applying transaction log records to the data segment is called <EM>log truncation</EM>. RVM carries out log truncation automatically when the log becomes too full; users can specify the option <CODE>truncate</CODE> to set the threshold. Alternatively, the application can initiate a truncation by calling <CODE>rvm_truncate()</CODE>. Also note that RVM performs a log truncation upon start-up and upon <CODE>rvm_map()</CODE> operations. This ensures that the data segment has the latest updates before it is mapped into virtual memory.
<P> <P> <H2><A NAME="ss7.3">7.3 Structure of log</A> </H2> <H3>Log pointers and limits</H3> <P>The log device can be a <UL> <LI> regular Unix file</LI> <LI> raw partition</LI> <LI> raw partition on a dedicated disk.</LI> </UL> The third option, a raw partition on a dedicated disk, is recommended. As discussed in section 7.1, the purpose of the log device is to make disk I/O as sequential as possible; other simultaneous disk I/O on the same disk would defeat this purpose. <P>Metadata of the log are collectively recorded in the <EM>log status block</EM>, which is located at the beginning of the log. It is offset by either <CODE>FILE_STATUS_OFFSET</CODE> or <CODE>RAW_STATUS_OFFSET</CODE>, depending on the type of log device. The two constants are currently 0 and 8192 bytes (16 sectors), respectively. <P>Log records are appended sequentially in the log area after the log status block. The area starts at <CODE>log_start</CODE> and is of size <CODE>log_size</CODE>. (Fig <A HREF="#fig_log_struct">fig_log_struct</A> ) <P> <FIGURE> <EPS FILE="log_struct"> <CAPTION> <A NAME="fig_log_struct"></A> Structure of the log</CAPTION> </FIGURE> <P>As new records are appended, the log pointer <CODE>log_tail</CODE> is advanced. <CODE>log_head</CODE> points to the first record in the log that is not yet subject to truncation (the <EM>current epoch</EM>). If no truncation is in progress, both of the pointers <CODE>prev_log_head</CODE> and <CODE>prev_log_tail</CODE> have the value zero. (Fig <A HREF="#fig_log_trunc">fig_log_trunc</A> b) <P>Currently, the log is truncated in a manner called <EM>epoch truncation</EM>. In one atomic operation, the function <CODE>new_epoch()</CODE> brackets all the current log records with <CODE>prev_log_head</CODE> and <CODE>prev_log_tail</CODE>; those records are said to be in the <EM>truncation epoch</EM> and are subject to truncation.
At the same time, <CODE>log_head</CODE> is advanced to <CODE>log_tail</CODE>, effectively resetting the <EM>current epoch</EM> to be empty. New log records may be appended by concurrent threads while truncation is in progress, and <CODE>log_tail</CODE> will advance to its new position. (Fig <A HREF="#fig_log_trunc">fig_log_trunc</A> a) <P>It is important to use the log pointers to divide the log into a truncation epoch and a current epoch, as we do not want the log truncation process to interfere with the log flushing process. <P> <FIGURE> <EPS FILE="log_trunc"> <CAPTION>Truncation of the log <A NAME="fig_log_trunc"></A> </CAPTION> </FIGURE> <P>If the truncation goes well and all the records are successfully applied to the data segment, the log pointers <CODE>prev_log_head</CODE> and <CODE>prev_log_tail</CODE> are reset to zero. The area that was occupied by the truncated records is then reclaimed. <P>However, it is possible that the program crashes while a truncation is in progress. RVM detects this upon the next restart by finding the two log pointers <CODE>prev_log_head</CODE> and <CODE>prev_log_tail</CODE> non-zero. It then redoes the truncation. Truncations can always be redone, since redoing one just writes the same values to the same locations again. <P> <H3>Log records</H3> <P>There are five different types of log records. Each of them has a corresponding in-core structure defined.
They are listed in the table <A HREF="#record-type">record-type</A> <CENTER><TABLE BORDER><TR><TD> Generic record </TD><TD> <CODE>rec_hdr_t</CODE> </TD><TD> 20 bytes</TD></TR><TR><TD> Transaction header </TD><TD> <CODE>trans_hdr_t</CODE> </TD><TD> 48 bytes</TD></TR><TR><TD> New value record </TD><TD> <CODE>nv_range_t</CODE> </TD><TD> 56 + length + pad bytes</TD></TR><TR><TD> Record end marker </TD><TD> <CODE>rec_end_t</CODE> </TD><TD> 28 bytes</TD></TR><TR><TD> Wrap marker </TD><TD> <CODE>log_wrap_t</CODE> </TD><TD> 24 bytes</TD></TR><TR><TD> Special log record </TD><TD> <CODE>log_special_t</CODE> </TD><TD> 24 + special + pad bytes</TD></TR><TR><TD> <CAPTION> <A NAME="record-type"></A> Six record types and their in-core structures</CAPTION> </TD></TR></TABLE></CENTER> <P>The generic record (<CODE>rec_hdr_t</CODE>) does not actually exist in the log. It contains the same first four fields as the other five 'real' record types. When RVM scans the log and finds a record, it may not yet know the type of the record. It can always cast the record to a generic record and then interpret the first four fields to determine the appropriate action; for example, it can learn the type of the record by looking at the first field. (The common four fields are <CODE>struct_id</CODE>, <CODE>rec_length</CODE>, <CODE>timestamp</CODE> and <CODE>rec_num</CODE>.) <P>In the five 'real' record types, type-specific fields follow the common four fields. As an example, the following shows the declarations of <CODE>rec_hdr_t</CODE> and <CODE>trans_hdr_t</CODE>.
<P> <HR> <PRE>
typedef struct
    {
    struct_id_t     struct_id;      /* type of entry */
    rvm_length_t    rec_length;     /* record length */
    struct timeval  timestamp;      /* timestamp of record entry */
    long            rec_num;        /* record number of entry */
    }
rec_hdr_t;

typedef struct
    {
    struct_id_t     struct_id;      /* self-identifier */
    rvm_length_t    rec_length;     /* log record length, displacement to end mark */
    struct timeval  timestamp;      /* timestamp of log entry */
    rvm_length_t    rec_num;        /* record number of log entry */
    rvm_length_t    num_ranges;     /* number of ranges in record */
    struct timeval  uname;          /* uname of transaction */
    struct timeval  commit_stamp;   /* timestamp of commit */
    rvm_length_t    n_coalesced;    /* count of coalesced transactions */
    rvm_length_t    flags;          /* mode and optimization flags */
    }
trans_hdr_t;
</PRE> <HR> <P><CODE>trans_hdr_t</CODE>, <CODE>nv_range_t</CODE> and <CODE>rec_end_t</CODE> represent the <EM>transaction header</EM>, the <EM>new value range header</EM> and the <EM>record end marker</EM>. They make up the majority of the records in the log. Each transaction is written as a sequence of records: it begins with a transaction header, is followed by one or more sub-records (the new value range records), and ends with a record end marker (Fig <A HREF="#fig_log_rec">fig_log_rec</A> ). This is the simple case, in which the transaction is not split by an intervening wrap marker. If the transaction is split by a wrap marker, it will have two sequences of records, each beginning with a transaction header and ending with an end marker. Details of log wrapping are discussed in section <A HREF="#log-wrap">log-wrap</A> . <P> <FIGURE> <EPS FILE="log_records"> <CAPTION>Transaction records <A NAME="fig_log_rec"></A> </CAPTION> </FIGURE> <P>New value records have variable lengths; each begins with a new value range header followed by the actual data. The actual data is the image of the virtual memory range that the transaction has committed.
The header is of type <CODE>nv_range_t</CODE> and is 56 bytes long; the length of the actual data varies. Note that RVM may pad some bytes after the actual data so that the next sub-record is word-aligned. <P>To allow RVM to scan the log in both the forward and backward directions, forward and backward links are provided. Generally speaking, <CODE>rec_length</CODE> is a forward link. For a transaction header, it is the forward link to the end marker; for a new value record, it is the forward link to the next new value record (or to the record end marker). However, for the record end marker, this value is the <EM>backward</EM> link to the transaction header (RVM is somewhat inconsistent in its naming here). On the other hand, <CODE>sub_rec_len</CODE> is a <EM>backward</EM> link to the previous sub-record. In the new value range header there is a third length field, <CODE>length</CODE>: it is the length of the actual data. <CODE>length</CODE> may not be equal to <CODE>rec_length</CODE> because of the padding. See Fig <A HREF="#fig_rec_link">fig_rec_link</A> for the details of these links. <P> <FIGURE> <EPS FILE="record_links"> <CAPTION>Forward and backward links of records <A NAME="fig_rec_link"></A> </CAPTION> </FIGURE> <P>The fourth possible record type that can be found in the log is the wrap marker. Note that wrapping may not happen exactly at the end of the log; it may happen a little earlier. In that case, the bytes following the wrap marker are all written with the value <CODE>0xff</CODE>. <P>The fifth possible record type is the special log record. Currently, the only special log record is the segment mapping descriptor (of type <CODE>log_seg_t</CODE>). RVM allows more than one data segment per log. To support this, every new value range is tagged with a <CODE>seg_code</CODE>, which indicates the segment to which that range should be applied. <CODE>seg_code</CODE> is only a shorthand for the segment.
Details of the segment (<CODE>num_bytes</CODE> and <CODE>name</CODE>) are recorded in the segment mapping descriptor. Segment mapping descriptors are inserted into the log whenever a new data segment is mapped for the first time. At log truncation time, RVM uses these segment mapping descriptors to reconstruct the information necessary to open the segment device. <P> <H3><A NAME="log-wrap"></A> Log wrapping</H3> <P>When <CODE>log_tail</CODE> approaches the end of the log, RVM chooses an appropriate point at which to do a <EM>wrapping</EM>. This is done by writing out a wrap marker. After that, RVM starts appending new records from <CODE>log_start</CODE>. There is no risk of overwriting useful records, because the old records should have been truncated long before; even if they have not been truncated yet, RVM will call an internal function, <CODE>wait_for_space()</CODE>, to force a truncation. <P>Log wrapping can happen <EM>between</EM> transactions. In this case, RVM simply leaves a wrap marker (of type <CODE>log_wrap_t</CODE>) and starts a brand new transaction at the beginning of the log, <CODE>log_start</CODE> (Fig <A HREF="#fig_wrap">fig_wrap</A> a). <P> <FIGURE> <EPS FILE="wrapping"> <CAPTION>Three different possible wrappings <A NAME="fig_wrap"></A> </CAPTION> </FIGURE> <P>Wrapping can also happen <EM>within</EM> a transaction. There are two possibilities: between different new value records, or <EM>splitting</EM> a new value record. Since a range can be very long, splitting new value records avoids leaving too much space unused by inserting the wrap marker too early. <P>In the first case, after the current new value records are written, a record-end marker is inserted into the log, followed by a log-wrap marker. Another transaction header is then inserted at <CODE>log_start</CODE> before the remaining new value records are written.
Effectively, the same transaction has two transaction headers: the first is written before the log wrap, the second after it. The <CODE>flags</CODE> field of the transaction header indicates this situation: the first transaction header will have only <CODE>FIRST_ENTRY_FLAG</CODE> set, while the second transaction header will have only <CODE>LAST_ENTRY_FLAG</CODE> set. (If the whole sequence of records for the transaction is not interrupted by a log wrap, both <CODE>FIRST_ENTRY_FLAG</CODE> and <CODE>LAST_ENTRY_FLAG</CODE> of the transaction header are set.) (Fig <A HREF="#fig_wrap">fig_wrap</A> b) <P>In the latter case, where a new value range is split, each of the two portions of the data has its own new value range header. The first portion of the split range is written, followed by a record-end marker and a wrap marker. The second transaction header is then written at <CODE>log_start</CODE>, followed by the second portion of the split range; the remaining new value records then follow. A flag, <CODE>is_split</CODE>, in the new value record header indicates that this has happened (Fig <A HREF="#fig_wrap">fig_wrap</A> c). The flags <CODE>FIRST_ENTRY_FLAG</CODE> and <CODE>LAST_ENTRY_FLAG</CODE> are also set as in the previous case. <P> <P> <H2><A NAME="ss7.4">7.4 Internal Data Structure</A> </H2> <P>Before we describe the important data structures used by RVM, we will first discuss some of their building blocks. In the next section we discuss the generic list and generic tree data structures. In the section after that we discuss the synchronization primitives that enable multi-threading in RVM. <P> <H3>Generic data structure</H3> <P>Many of the RVM data structures are organized as lists or trees. To support this, RVM internally defines a generic list type as well as a generic tree type.
<P>The generic list node is declared as <CODE>list_entry_t</CODE>, while the generic tree node is declared as <CODE>tree_node_t</CODE>. Supporting routines for them are <CODE>move_list_entry()</CODE>, <CODE>tree_insert()</CODE>, etc. <P>Each of these generic list/tree nodes contains enough topological information for it to be manipulated as a list node or tree node. Generic list and tree nodes are typically not allocated on their own; rather, they appear as the first member of the specific data structures that need to be organized. For example, the region descriptor <CODE>region_t</CODE> is declared with <CODE>list_entry_t</CODE> as the first member of the structure. That way, the region descriptor can be manipulated as a list element. We can do the following: <PRE>
seg_t    *seg;
region_t *region;

move_list_entry(NULL, &seg->map_list, &region->links);
</PRE> to move the region descriptor <CODE>region</CODE> onto the mapped-region list (<CODE>map_list</CODE>) of the segment descriptor <CODE>seg</CODE>. <P> <H3>Support for multi-threading</H3> <P>RVM is designed to be thread-safe: the application is responsible for protecting access to its own data structures by concurrent threads, but can otherwise use RVM functions freely. RVM synchronizes its internal operations itself, using the following three primitives: <P> <H3>Mutex</H3> <P>Internal data that may be accessed concurrently is protected by a mutex of type <CODE>RVM_MUTEX</CODE>. Routines accessing such data are required to call <CODE>mutex_lock()</CODE> before, and <CODE>mutex_unlock()</CODE> after, the access. Alternatively, they can use the macro <CODE>CRITICAL()</CODE> to bracket their access to critical data. <P> <HR> <PRE>
RVM_MUTEX lck;
mutex_init(&lck);        /* initialization */
...
mutex_lock(&lck);        /* begin lck crit. sec. */
/* body */;              /* code in lck crit. sec. */
mutex_unlock(&lck);      /* end lck crit. sec. */

/* alternatively, use the macro */
CRITICAL(lck,            /* begin lck crit. sec. */
    /* body */;          /* code in lck crit. sec. */
    );                   /* end lck crit. sec. */
</PRE> <HR> <P> <H3>Read/Write Lock</H3> <P>Sometimes data can be read by many threads simultaneously, provided none of them attempts to update the data. This kind of concurrency is supported by the read/write lock of type <CODE>rw_lock_t</CODE>. Readers access the data with <CODE>rw_lock(...,r)</CODE>; they are allowed to proceed immediately if there is no writer, otherwise they block until the writer releases the lock. Writers access the data with <CODE>rw_lock(...,w)</CODE>; they are allowed to proceed immediately if there are no other readers or writers, otherwise they block until all of them release the lock. Both readers and writers must call <CODE>rw_unlock()</CODE> when they are done accessing the protected data. There is also a macro <CODE>RW_CRITICAL()</CODE> for bracketing accesses to critical data protected by a read/write lock. <P> <HR> <PRE>
rw_lock_t rwl;
init_rw_lock(&rwl);      /* initialization */
...
/* reader */
rw_lock(&rwl,r);         /* begin rwl crit. sec. */
/* body */;              /* read immediately if no other */
/* body */;              /* writer; otherwise block until */
/* body */;              /* it releases the lock */
rw_unlock(&rwl,r);       /* end rwl crit. sec. */
...
/* writer */
rw_lock(&rwl,w);         /* begin rwl crit. sec. */
/* body */;              /* write immediately if no other */
/* body */;              /* writers/readers; otherwise block */
/* body */;              /* until they release the lock */
rw_unlock(&rwl,w);       /* end rwl crit. sec. */
...
RW_CRITICAL(rwl, mode,   /* begin rwl crit. sec., mode=r|w */
    /* body */;
    );                   /* end rwl crit. sec. */
</PRE> <HR> <H3>Condition variable</H3> <P>Two concurrent threads can synchronize their actions by using a condition variable of type <CODE>RVM_CONDITION</CODE>. Thread A can sleep by calling <CODE>condition_wait()</CODE>, waiting to be woken up when thread B calls <CODE>condition_signal()</CODE>. <P>This technique is used for asynchronous log truncation.
A separate thread executes the code of <CODE>log_daemon()</CODE> if there is multi-threading support (either Mach's cthreads, or Unix LWP or pthreads). Normally this thread blocks waiting while the other thread(s) run. From time to time, as transactions are committed, the other thread(s) check whether the log is too full; if it is, they wake up the <CODE>log_daemon()</CODE>, which then truncates the log asynchronously. <P> <HR> <PRE>
RVM_CONDITION code;

/* thread of log_daemon() */
CRITICAL(lock,
    {
    while (state == rvm_idle)
        condition_wait(&code, &lock);
    });
/* async. log truncation, or terminate
 * log daemon here */

/* thread of log_tid() */
CRITICAL(lock,
    {
    if (need_truncation)
        {
        if (state == rvm_idle)
            {
            state = truncating;
            condition_signal(&code);
            }
        }
    });
</PRE> <HR> <H3>Important Data Structures</H3> <P>The structures below are all declared in the file <CODE>rvm_private.h</CODE>. The important data structures are the following. <P> <H3>Segment Descriptor (<CODE>seg_t</CODE>)</H3> <P>There can be more than one segment per process. Each segment has a segment descriptor. These segment descriptors are linked together in a list, and the list header is the global variable <CODE>seg_root</CODE>. <P>The segment descriptor contains the header <CODE>map_list</CODE> for an important list, the list of mapped regions, which is discussed in more detail in section <A HREF="#sect2-region">sect2-region</A> <P>There can be more than one segment per log. In fact, each individual new value record can be applied to a different segment. This is achieved by tagging each new value record with a short name for its segment, <CODE>seg_code</CODE>. The short name of each segment is also stored in the segment's own descriptor. <P> <H3>Log descriptor (<CODE>log_t</CODE>)</H3> <P>Like the segment descriptors, the log descriptors are connected in a list (the header is the global variable <CODE>log_root</CODE>).
However, RVM currently supports only one log per process; it is pointed to by the global pointer <CODE>default_log</CODE>. <P><CODE>log_status</CODE> is the in-core copy of the log status block; it is explained in more detail in the next section. Similarly, <CODE>trans_hdr</CODE>, <CODE>rec_end</CODE> and <CODE>log_wrap</CODE> are the areas for preparing the corresponding records for the log (c.f. section <A HREF="#sect2-log-record">sect2-log-record</A> ). <P><CODE>log_buf</CODE> is the log recovery buffer descriptor; it is explained in section <A HREF="#log-buf">log-buf</A> . <P>The log descriptor contains the headers for three important lists: <CODE>tid_list</CODE>, <CODE>flush_list</CODE> and <CODE>special_list</CODE>. The first links up transactions that have begun but not yet ended. The second links up transactions that have been committed but not yet flushed. <P> <H3>Log status area (<CODE>log_status</CODE>)</H3> <P>The log status has an in-core copy as well as an on-disk copy. The in-core copy is declared as type <CODE>log_status_t</CODE> and is stored as a sub-structure of the log descriptor; the on-disk copy is written in the log status block. Ideally, we would want the in-core copy to be written to disk every time the log status changes. However, doing so would require many disk head seeks between the log status block and the log tail, which would defeat the purpose of the log, so RVM updates the on-disk log status block only every UPDATE_STATUS (currently 100) status changes. This is tracked by the counter <CODE>update_cnt</CODE>. Because of this, RVM does not completely trust the values in the log status block. For example, when doing log truncation, RVM trusts the value <CODE>prev_log_head</CODE> but not the value <CODE>log_tail</CODE>, so it performs a <CODE>locate_tail()</CODE> to make sure it finds the last records written in the log.
<P>The log status area records the important log pointers <CODE>log_head</CODE>, <CODE>log_tail</CODE>, <CODE>prev_log_head</CODE> and <CODE>prev_log_tail</CODE>, and the log limits <CODE>log_start</CODE> and <CODE>log_size</CODE>. Besides these, it also records a large number of consistency-check fields, transaction statistics and log statistics. <P> <H3><A NAME="sect2-log-record"></A> Area for preparing log records</H3> <P>Each of the log record types has a corresponding in-core data structure, used to prepare the records before they are written out to the log. There is only one copy each of <CODE>trans_hdr_t</CODE>, <CODE>rec_end_t</CODE> and <CODE>log_wrap_t</CODE> per log, and they are part of the log descriptor. For new value records, areas are required for the headers and the actual data, and these are per-range. The headers are of type <CODE>nv_range_t</CODE> and are stored in the range descriptors (described in a following section); the actual data is stored in a separately allocated buffer pointed to by the field <CODE>data</CODE> in the range descriptor. Special log records are associated with the log; they are dynamically allocated and organized as a list (header: <CODE>log_special_list</CODE>) in the log descriptor. <P> <H3><A NAME="log-buf"></A> Log recovery buffer (<CODE>log_buf->buf</CODE>) and its descriptor (<CODE>log_buf</CODE>) </H3> <P>The log recovery buffer (also called the log buffer) is used in log truncation, to keep an image of log records for processing. The descriptor of the buffer is called the log recovery buffer descriptor (type <CODE>log_buf_t</CODE>); it is a sub-structure of the log descriptor. The actual buffer is pointed to by the field <CODE>buf</CODE>, and the length of the buffer is stored in the field <CODE>buf_len</CODE>.
<P>Two routines can be used to fill and refill the log recovery buffer: <UL> <LI><CODE>init_buffer(log, offset, direction, ...)</CODE></LI> <LI><CODE>refill_buffer(log, direction, ...)</CODE></LI> </UL> The log can be scanned in two directions; this is indicated by the argument <CODE>direction</CODE> of the two routines. They try to read as much data as possible from the log device, subject to the length of the log buffer and other constraints. <P>Upon return from these two routines, <CODE>log_buf->ptr</CODE> is the index into the log buffer that corresponds to the current log position, and <CODE>log_buf->offset</CODE> is the log offset of byte 0 of the log buffer. <CODE>log_buf->offset</CODE> is sector aligned. (Figure <A HREF="#fig_log_buf">fig_log_buf</A> ) <P> <FIGURE> <EPS FILE="log_buf"> <CAPTION>log buffer and init_buffer() <A NAME="fig_log_buf"></A> </CAPTION> </FIGURE> <P>The following is a typical code sequence in log truncation: <P> <HR> <PRE>
...
r_length = init_buffer(log, off, REVERSE, ...);
log_buf->ptr -= sizeof(rec_end_t);
rec_end = (rec_end_t *) &log_buf->buf[log_buf->ptr];
if (rec_end->struct_id == rec_end_id)
    {
    ...
</PRE> <HR> <P>In the code, <CODE>off</CODE> is the current log position. The routine <CODE>init_buffer()</CODE> is instructed to initialize the buffer by scanning the log in the reverse direction. When it returns, <CODE>r_length</CODE> bytes have been read into the log recovery buffer (pointed to by <CODE>log_buf->buf</CODE>), and <CODE>log_buf->ptr</CODE> is the index of the byte in the log buffer that corresponds to <CODE>off</CODE>. Assuming the routine here expects an end record in the log, it decrements <CODE>log_buf->ptr</CODE> by the known size of <CODE>rec_end_t</CODE> so as to point to the end record. After the decrement, the address of <CODE>log_buf->buf[log_buf->ptr]</CODE> should be the beginning of the in-core image of the end record.
The code type-casts this address as <CODE>(rec_end_t *)</CODE> and stores it in <CODE>rec_end</CODE>. After that it can manipulate the record as it likes; in the example, it examines whether the <CODE>struct_id</CODE> of the end record is really <CODE>rec_end_id</CODE>. <P> <H3><A NAME="sect2-region"></A> Region descriptors (<CODE>region_t</CODE>)</H3> <P>Each segment has a list of the regions that are mapped, so in each segment descriptor (type <CODE>seg_t</CODE>) there is a member <CODE>map_list</CODE> which is the head of this list. Each node in the list is a region descriptor of type <CODE>region_t</CODE>. <P>It records the <CODE>offset</CODE> and <CODE>end_offset</CODE> of the region in the data segment, as well as the <CODE>vmaddr</CODE> and <CODE>length</CODE> of the virtual memory region it is mapped to. <P>The member <CODE>n_uncommit</CODE> records the number of uncommitted ranges in this region; it is incremented every time the application does an <CODE>rvm_set_range()</CODE>, and decremented when a transaction is committed or aborted. When the application tries to do an <CODE>rvm_unmap()</CODE>, this value must be zero, otherwise RVM_EUNCOMMIT is returned. <P><CODE>region_t</CODE> may be confused with <CODE>rvm_region_t</CODE> and <CODE>mem_region_t</CODE>. We explain the former in the following and the latter in the next section. <CODE>rvm_region_t</CODE> is part of the API; it is the structure used by the application to specify the mapping parameters: <CODE>data_dev</CODE>, <CODE>offset</CODE>, <CODE>length</CODE>, <CODE>vmaddr</CODE>, <CODE>no_copy</CODE>, etc. RVM does not use this structure internally once it has obtained all the parameters from the application and transferred them to a <CODE>region_t</CODE>. <P> <H3>Memory region descriptors (<CODE>mem_region_t</CODE>)</H3> <P>In addition to the per-segment region list described above, there is also a per-process memory region tree.
It records the regions of the virtual memory space that are mapped, arranged in order of VM address. <P>When the application calls <CODE>rvm_map()</CODE>, this tree is consulted to make sure the requested virtual memory region is not already mapped. If it is, RVM_EVM_OVERLAP will be returned. Also, when the application calls <CODE>rvm_set_range()</CODE>, this tree is consulted to make sure the requested range is totally contained within a single virtual memory region. If it is not, RVM_ENOT_MAPPED will be returned. <P>The root of the tree is <CODE>region_tree</CODE>. Each node of the tree is of type <CODE>mem_region_t</CODE>. <P> <H3>Transaction descriptor (<CODE>int_tid_t</CODE>)</H3> <P>Whenever the application calls <CODE>rvm_begin_transaction()</CODE>, RVM allocates a transaction descriptor of type <CODE>int_tid_t</CODE>. These descriptors are linked into lists. Before the transaction commits, the descriptor is in the list headed by <CODE>log->tid_list</CODE>. When the transaction commits, it is moved to another list headed by <CODE>log->flush_list</CODE>. When the transaction record is flushed, it is removed from the flush list. <P>Typically, the application informs RVM of one or more memory ranges that will be affected by the transaction. This is done by issuing one or more <CODE>rvm_set_range()</CODE> or <CODE>rvm_modify_bytes()</CODE> calls after <CODE>rvm_begin_transaction()</CODE>. Internally, these memory ranges are organized as a tree rooted at <CODE>range_tree</CODE> (of type <CODE>tree_root_t</CODE>), a field of the transaction descriptor. <P><CODE>int_tid_t</CODE> may be confused with <CODE>rvm_tid_t</CODE>. The latter is the transaction identifier used by applications while the former is used internally by RVM. Although <CODE>rvm_tid_t</CODE> contains a pointer to an <CODE>int_tid_t</CODE>, applications are not expected to use or touch that pointer. 
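<P>The movement of a transaction descriptor between the two lists can be sketched as follows. This is a minimal illustration using hypothetical, much-simplified types; the real <CODE>int_tid_t</CODE> and log descriptor carry many more fields (the range tree, timestamps, flags, and so on), and the real lists use RVM's generic list package.

```c
#include <stddef.h>

/* Hypothetical, much-simplified transaction descriptor. */
typedef struct int_tid_s {
    struct int_tid_s *next;
    int committed;
} int_tid_t;

/* Hypothetical, much-simplified log descriptor. */
typedef struct {
    int_tid_t *tid_list;    /* uncommitted transactions */
    int_tid_t *flush_list;  /* committed, not yet flushed */
} log_t;

/* Unlink a descriptor from a singly-linked list. */
static void unlink_tid(int_tid_t **head, int_tid_t *tid)
{
    while (*head != tid)
        head = &(*head)->next;
    *head = tid->next;
}

/* On commit, the descriptor moves from tid_list to flush_list;
   flush_log() later drains flush_list to the log device. */
static void commit_tid(log_t *log, int_tid_t *tid)
{
    unlink_tid(&log->tid_list, tid);
    tid->committed = 1;
    tid->next = log->flush_list;
    log->flush_list = tid;
}
```

The essential invariant is that a descriptor lives on exactly one of the two lists until the flush removes it entirely.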
<P> <H3>Modification range descriptor (<CODE>range_t</CODE>)</H3> <P>When the application specifies a modification range with <CODE>rvm_set_range()</CODE> or <CODE>rvm_modify_bytes()</CODE>, the VM address (<CODE>range->nv.vmaddr</CODE>) and the length (<CODE>data_len</CODE>) of the range are stored internally in a range descriptor (<CODE>range_t</CODE>). When the transaction commits, the content of the virtual memory pointed to by <CODE>range->nv.vmaddr</CODE> is copied into the buffer pointed to by <CODE>range->data</CODE>. The same buffer has another purpose: if the transaction is in <CODE>restore</CODE> mode, the buffer is also used to save the old image of the VM range, so that if the transaction is aborted, the old image can be restored. <P>Range descriptors are connected together as a tree, with the tree root in the corresponding transaction descriptor. If the optimization flag <CODE>RVM_COALESCE_RANGES</CODE> is specified, the tree is sorted by VM address, and ranges are coalesced before they are inserted into the tree. If <CODE>RVM_COALESCE_RANGES</CODE> is not specified, the tree is simply sorted by range number. <P>There is also a field of type <CODE>nv_range_t</CODE> called <CODE>nv</CODE> which is used as the in-memory image of the new value range record on the log. <P> <H3>Storage device region node (<CODE>dev_region_t</CODE>)</H3> <P>This is not to be confused with <CODE>region_t</CODE> and <CODE>mem_region_t</CODE>. <CODE>dev_region_t</CODE> is used only in log truncation. During log truncation, all sub-records of all transactions in the truncation epoch are processed, and a tree called <CODE>mod_tree</CODE> is built. Each node of the tree is of type <CODE>dev_region_t</CODE>. The tree is sorted by segment offset. The purpose of the tree is to make the disk write operations on the segment device sequential. 
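<P>Why ordering by segment offset matters can be sketched as follows. This is not the actual RVM code: the real <CODE>dev_region_t</CODE> lives in a balanced tree and also carries a pointer to the new value data; here a plain array and <CODE>qsort()</CODE> stand in for the tree, purely to show the seek-minimizing idea.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-in for dev_region_t. */
typedef struct {
    long offset;   /* segment offset of the modified range */
    long length;   /* length of the range */
} dev_region_t;

static int by_offset(const void *a, const void *b)
{
    long x = ((const dev_region_t *)a)->offset;
    long y = ((const dev_region_t *)b)->offset;
    return (x > y) - (x < y);
}

/* Visiting the ranges in ascending segment-offset order means the
   write-back pass moves across the data device strictly front to
   back, so the disk head seeks in one direction only. */
static void order_for_writeback(dev_region_t *mods, size_t n)
{
    qsort(mods, n, sizeof mods[0], by_offset);
}
```

In RVM itself the sorted order falls out of an in-order walk of <CODE>mod_tree</CODE> rather than an explicit sort.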
<P> <H3>Device descriptor (<CODE>device_t</CODE>)</H3> <P>The device descriptor (of type <CODE>device_t</CODE>) contains the details of the physical device: name, handle, size. It also contains pointers to the gather-write I/O vectors. <P> <H2><A NAME="ss7.5">7.5 RVM in action</A> </H2> <P> <H3><CODE>rvm_initialize()</CODE></H3> <P>Global lists and trees are initialized (<CODE>seg_root</CODE>, <CODE>log_root</CODE>, <CODE>region_tree</CODE>). A log truncation (<CODE>log_recover()</CODE>) is performed. <P> <H3><CODE>rvm_map()</CODE></H3> <P>There is always a log truncation at the beginning of mapping. This is necessary to ensure the mapping gets the latest committed image from the data segment. <CODE>rvm_map()</CODE> also checks whether the application has specified a new segment device; if so, it creates a special record of type <CODE>log_seg_t</CODE> and queues it on <CODE>log->special_list</CODE>. This record will be flushed to the log before the next transaction. <P>The application may or may not specify the mapped-to address for the region. If it is not specified, <CODE>rvm_map()</CODE> does a <CODE>page_alloc()</CODE> to find a VM address for the requested region. In either case, the mapped-to address is stored in <CODE>region->vmaddr</CODE> as well as <CODE>mem_region->vmaddr</CODE>. <P>Before carrying out the mapping, <CODE>rvm_map()</CODE> also needs to check that the requested mapping does not overlap an existing mapping, in either the VM address space or the segment offset space. The former is detected with the aid of <CODE>region_tree</CODE> while the latter is detected with the aid of <CODE>seg->map_list</CODE>. If there is overlap, RVM_EVM_OVERLAP will be returned in the former case and RVM_EOVERLAP in the latter. <P>Finally, the content of the data segment at the specified offset is copied into the VM space (unless <CODE>no_copy</CODE> mode is specified). 
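<P>The overlap test at the heart of both checks is the standard interval test, which can be sketched as a hypothetical helper (this is an illustration, not the actual RVM source):

```c
/* Two half-open ranges [a, a+alen) and [b, b+blen) overlap iff each
   one starts before the other ends.  Conceptually, rvm_map() applies
   a test like this twice: against the VM ranges recorded in
   region_tree (failure: RVM_EVM_OVERLAP) and against the segment
   offsets recorded in seg->map_list (failure: RVM_EOVERLAP). */
static int ranges_overlap(unsigned long a, unsigned long alen,
                          unsigned long b, unsigned long blen)
{
    return a < b + blen && b < a + alen;
}
```

Note that ranges that merely touch end to end do not overlap under this test, which is why adjacent mappings are legal.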
<P> <H3><CODE>rvm_begin_transaction()</CODE></H3> <P>A transaction descriptor (of type <CODE>int_tid_t</CODE>) is allocated and inserted in <CODE>log->tid_list</CODE>. The address of that descriptor is also noted in <CODE>rvm_tid->tid</CODE>. <P> <H3><CODE>rvm_set_range()</CODE> and <CODE>rvm_modify_bytes()</CODE></H3> <P>The range declared must be totally contained inside a mapped region, otherwise RVM_ENOT_MAPPED will be returned. This is checked by consulting <CODE>region_tree</CODE>. <P>A new range descriptor (of type <CODE>range_t</CODE>) is allocated and inserted into <CODE>tid->range_tree</CODE>. Depending on the optimization flags of the transaction, the range tree may be sorted by VM address or by range number. <P>If the transaction is in <CODE>restore</CODE> mode, the current image of VM in the range is first saved to a buffer pointed to by <CODE>range->data</CODE>. <P>The VM address of the declared range is stored in <CODE>range->nv.vmaddr</CODE>. <P> <H3><CODE>rvm_end_transaction()</CODE></H3> <P>When the application calls <CODE>rvm_end_transaction()</CODE>, for each of the ranges in <CODE>tid->range_tree</CODE>, the VM area pointed to by <CODE>range->nv.vmaddr</CODE> is copied to <CODE>range->data</CODE>. This ensures the atomicity of the transaction: when <CODE>rvm_end_transaction()</CODE> returns, all declared ranges have been copied to a separate location. If the program crashes before <CODE>rvm_end_transaction()</CODE> returns, the data segment will contain the old values of the declared ranges. Applications are free to write new data into the ranges after <CODE>rvm_end_transaction()</CODE> returns. <P>The transaction descriptor <CODE>tid</CODE> is dequeued from <CODE>log->tid_list</CODE> and queued to <CODE>log->flush_list</CODE>. <P>If the RVM_COALESCE_TRANS flag is set for the transaction, inter-transaction coalescing is carried out. 
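<P>The commit-time copy described above can be sketched as follows. The types are hypothetical simplifications (the real <CODE>range_t</CODE> keeps the address inside its embedded <CODE>nv_range_t</CODE> header, as <CODE>range->nv.vmaddr</CODE>, and the ranges live in a tree rather than a list); the point is only that each declared range is snapshotted into a private buffer, decoupling the log image from subsequent stores by the application.

```c
#include <string.h>
#include <stdlib.h>

/* Hypothetical, simplified range descriptor. */
typedef struct range_s {
    struct range_s *next;
    char *vmaddr;      /* start of the declared VM range */
    size_t data_len;   /* length of the range */
    char *data;        /* commit-time snapshot buffer */
} range_t;

/* At end-transaction time, every declared range is copied out of
   live VM into its own buffer, so later modifications of the same
   VM area cannot affect what gets flushed to the log. */
static void snapshot_ranges(range_t *head)
{
    for (range_t *r = head; r != NULL; r = r->next) {
        r->data = malloc(r->data_len);
        memcpy(r->data, r->vmaddr, r->data_len);
    }
}
```

This is why the application may safely overwrite the ranges as soon as <CODE>rvm_end_transaction()</CODE> returns, even before the flush happens.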
<P>If the transaction is committed in <CODE>flush</CODE> mode, <CODE>flush_log()</CODE> is called. <P> <H3><CODE>flush_log()</CODE></H3> <P>Every transaction descriptor in <CODE>log->flush_list</CODE> is written out to the log; this is done by the routine <CODE>log_tid()</CODE>. First, the transaction header, of type <CODE>trans_hdr</CODE>, is built and added to the gather-write vector <CODE>dev->iov</CODE>. Then, for every range in <CODE>tid->range_tree</CODE>, a new value range header of type <CODE>nv_range_t</CODE> is built and added to the gather-write vector, followed by the actual new value data. After all the new value records are written, a record end marker (of type <CODE>rec_end_t</CODE>) is built and added to the gather-write vector. <P>If log wrapping is needed, an intervening record end marker, log wrap marker and transaction header record are inserted, as described in section <A HREF="#log-wrap">log-wrap</A> . <P> <H3><CODE>rvm_truncate()</CODE></H3> <P>Log truncation is carried out by <CODE>log_recover()</CODE>. For each segment, a modification tree <CODE>mod_tree</CODE> is built. Each node is of type <CODE>dev_region_t</CODE>, and the tree is arranged in order of segment offset. For each new value record of each transaction on the log, a node of type <CODE>dev_region_t</CODE> is built and inserted into the modification tree. If there are overlaps in segment offset, the overlapped portion is not inserted into the modification tree again. Such overlaps are not an error; for example, the same memory range can be updated by different transactions. Also note that since the log is truncated in last-in-first-out order, i.e. the new value records that were flushed last are inserted into the tree first, the version of each range inserted into the tree is always the most up-to-date one. <P>In this way, all the log records bracketed by the log pointers <CODE>prev_log_head</CODE> and <CODE>prev_log_tail</CODE> are scanned and inserted into the modification tree. 
The whole tree is then written to the data device (<CODE>apply_mods()</CODE>). <P> <H2><A NAME="ss7.6">7.6 RVMUTL</A> </H2> <P>The program <CODE>rvmutl</CODE> is a utility program that allows: <UL> <LI> initialization of a log</LI> <LI> debugging via the log</LI> </UL> The following is an incomplete list of commands available in <CODE>rvmutl</CODE>: <UL> <LI> <CODE>init_log</CODE>: initialize a log</LI> <LI> <CODE>open_log</CODE>: open an existing log</LI> <LI> <CODE>status</CODE> : display information from the log status block</LI> <LI> <CODE>head</CODE> : locate and display the first record of the current epoch</LI> <LI> <CODE>tail</CODE> : locate and display the last record of the current epoch</LI> <LI> <CODE>prev (n)</CODE>: locate and display the previous (n) record(s)</LI> <LI> <CODE>next (n)</CODE>: locate and display the next (n) record(s)</LI> </UL> <P>In the RVM source code, you will often see the global variable <CODE>rvm_utlsw</CODE>. RVM is designed with two possible types of users of its routines in mind: ordinary applications and the program <CODE>rvmutl</CODE>. This variable allows differential treatment of exception conditions. For example, in the routine <CODE>scan_forward()</CODE>, if the record type is found not to be one that the routine is expecting (which may mean a corrupted log), the program will: <UL> <LI> abort, if it is called by a general application</LI> <LI> continue with an error return, if it is called by <CODE>rvmutl</CODE></LI> </UL> The latter treatment allows manual handling of log corruption via <CODE>rvmutl</CODE>. <P> <HR> <A HREF="rvm_manual-8.html">Next</A> <A HREF="rvm_manual-6.html">Previous</A> <A HREF="rvm_manual.html#toc7">Contents</A> </BODY> </HTML>