<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> <HTML> <HEAD> <META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9"> <TITLE>RVM: Recoverable Virtual Memory, Release 1.3: RVM Internals</TITLE> <LINK HREF="rvm_manual-8.html" REL=next> <LINK HREF="rvm_manual-6.html" REL=previous> <LINK HREF="rvm_manual.html#toc7" REL=contents> </HEAD> <BODY> <A HREF="rvm_manual-8.html">Next</A> <A HREF="rvm_manual-6.html">Previous</A> <A HREF="rvm_manual.html#toc7">Contents</A> <HR> <H2><A NAME="RVMinternals"></A> <A NAME="s7">7. RVM Internals</A></H2> <P><EM>By Yui Wah LEE</EM> <P> <H2><A NAME="ss7.1">7.1 Overview of RVM internal structure</A> </H2> <P>RVM is designed as a <EM>run-time library</EM> to be linked with an <EM>external application</EM>. It allows the application to benefit from <EM>transactional</EM> support. In particular, it allows operations on virtual memory to be <EM>atomic</EM> and <EM>persistent</EM>, provided that the application declares its intentions through the RVM application programming interface. <P>Figure <A HREF="#fig_overview">fig_overview</A> shows the three important components of RVM: <EM>virtual memory</EM>, the <EM>data segment</EM> and the <EM>log</EM>. <P>Virtual memory is where the application manipulates its data. RVM allows the application to declare <EM>regions</EM> of virtual memory to be associated with regions in external data segments. The data segments are the persistent backing store of the virtual memory regions: the latest committed image of each virtual memory region has a copy in the data segments. Since a data segment resides on stable storage (disk), it persists across program crashes. Should an application crash, it can always recover those memory regions from the data segment. The association of virtual memory regions with segment regions is called <EM>mapping</EM>. <P>For performance reasons, RVM does not write the committed image of virtual memory regions <EM>directly</EM> back to the data segment.
Doing so would generate excessive seek operations on the magnetic head of the disk, since the access pattern can be random. Disk head seeks are expensive operations and should be minimized. RVM achieves this by writing the <EM>new values</EM> of memory ranges as <CODE>records</CODE> in an <EM>append-only</EM>, <EM>sequential</EM> log; this is called a <EM>flush</EM> operation. The log is also on stable storage (disk or battery-backed ramdisk). Once the new values are written to the log (flushed), they are persistent across program crashes. From time to time, records in the log are processed and applied to the data segment; after that, the log space used by the records can be reclaimed for further use. This is called <EM>log truncation</EM>. Log truncation is necessary: without it, the log would grow without bound. <P> <FIGURE> <EPS FILE="overview"> <CAPTION> <A NAME="fig_overview"></A> Overview</CAPTION> </FIGURE> <P> <H2><A NAME="ss7.2">7.2 RVM application programming interface</A> </H2> <P>To facilitate our discussion, let us first briefly review the RVM application programming interface. Details of the interfaces can be found in the chapter <A HREF="rvm_manual-4.html#rvmAPI">on RVM Library Functions</A>.<P> <HR> <PRE>
rvm_options_t options;
char *version;
rvm_region_t region;
rvm_tid_t *tid;
char *dest, *src;
unsigned long length;

rvm_initialize(version, options)
rvm_map(region, options);
rvm_begin_transaction(tid, restore);
rvm_begin_transaction(tid, no_restore);
rvm_set_range(tid, dest, length);
rvm_modify_bytes(tid, dest, src, length);
rvm_end_transaction(tid, flush);
rvm_end_transaction(tid, no_flush);
rvm_abort_transaction(tid);
rvm_flush();
rvm_truncate();
rvm_unmap(region);
rvm_terminate();
</PRE> <HR> The application declares the regions of VM for which it wants a persistent image by calling <CODE>rvm_map()</CODE>. There are two possible modes for <CODE>rvm_map()</CODE>: <CODE>copy</CODE> and <CODE>no_copy</CODE>.
The former is the default and causes the corresponding region of the segment device to be read into virtual memory. The latter is used when the application is sure the image on the segment device is useless (e.g., the application is going to replace it immediately). <P>The application declares the beginning of a transaction by calling <CODE>rvm_begin_transaction()</CODE>. The transaction can be committed by calling <CODE>rvm_end_transaction()</CODE>, or aborted by calling <CODE>rvm_abort_transaction()</CODE>. In addition, the application must inform RVM of the memory locations that will be affected by the transaction. This is done by calling <CODE>rvm_set_range()</CODE> or <CODE>rvm_modify_bytes()</CODE>. RVM guarantees that if the transaction commits, all the modifications made on the declared ranges take effect atomically; conversely, if the transaction aborts, none of the modifications take effect. <P>Normally, when a transaction is aborted, the <EM>old values</EM> in the affected memory ranges are <EM>restored</EM>. The default option for <CODE>rvm_begin_transaction</CODE> is therefore <CODE>restore</CODE>. However, in some cases the application can be sure that it will never abort the transaction. For example, if there are no concurrent transactions, the application may know that it will never abort the transaction explicitly. In these cases, the application can start the transaction in <CODE>no_restore</CODE> mode. <P>If the application wants a committed transaction to be recorded on the log device (stable storage) <EM>immediately</EM>, it should commit the transaction in <CODE>flush</CODE> mode. In this mode, the <CODE>rvm_end_transaction()</CODE> call returns only after the transaction records have been written to disk. However, there are cases where the application can accept a slight chance of loss of permanence. In those cases, it can end the transaction in <CODE>no_flush</CODE> mode.
There is a big payoff in performance, as <CODE>rvm_end_transaction()</CODE> returns immediately, without waiting for the slow disk I/O to complete. RVM puts the committed transactions in a queue and writes them out to the log device later. This is called <EM>lazy commit</EM> and it provides <EM>bounded persistence</EM>: the committed transactions may be lost if the program crashes before the queued transactions get a chance to be written out to the log. Note that although a transaction may be lost, it is lost atomically. Upon the next restart of the application, RVM will restore an older but consistent view of the virtual memory. <P>Committing transactions lazily also gives more opportunity for inter-transaction optimization. <P>To force all previously committed transaction records to be written out to the log device, the application can call <CODE>rvm_flush()</CODE>. Also, whenever the application commits a new transaction in <CODE>flush</CODE> mode, there is a side effect that all previous transactions on the queue are also flushed. <P>The log device is only temporary stable storage for committed transactions: its primary purpose is to allow records to be written to disk sequentially. Eventually, these records need to be applied to the data segment. The process of applying transaction log records to the data segment is called <EM>log truncation</EM>. RVM carries out log truncation automatically when the log becomes too full; users can specify the option <CODE>truncate</CODE> to set the threshold. Alternatively, the application can initiate a truncation by calling <CODE>rvm_truncate()</CODE>. Also note that RVM performs a log truncation upon start-up and upon <CODE>rvm_map()</CODE> operations. This ensures that the data segment has the latest updates before it is mapped into virtual memory.
<P> <P> <H2><A NAME="ss7.3">7.3 Structure of log</A> </H2> <H3>Log pointers and limits</H3> <P>The log device can be a <UL> <LI> regular Unix file</LI> <LI> raw partition</LI> <LI> raw partition on a dedicated disk.</LI> </UL> The third option, a raw partition on a dedicated disk, is recommended. As discussed in section 7.1, the purpose of the log device is to make disk I/O as sequential as possible; other simultaneous disk I/O on the same disk would defeat this purpose. <P>Metadata of the log are collectively recorded in the <EM>log status block</EM>, which is located at the beginning of the log. It is offset by either <CODE>FILE_STATUS_OFFSET</CODE> or <CODE>RAW_STATUS_OFFSET</CODE>, depending on the type of log device. The two constants are currently 0 and 8192 bytes (16 sectors), respectively. <P>Log records are appended sequentially in the log area after the log status block. The area starts at <CODE>log_start</CODE> and is of size <CODE>log_size</CODE>. (Fig <A HREF="#fig_log_struct">fig_log_struct</A> ) <P> <FIGURE> <EPS FILE="log_struct"> <CAPTION> <A NAME="fig_log_struct"></A> Structure of the log</CAPTION> </FIGURE> <P>As new records are appended, the log pointer <CODE>log_tail</CODE> is advanced. <CODE>log_head</CODE> points to the first record in the log that is not yet subject to truncation (the <EM>current epoch</EM>). If no truncation is in progress, both of the pointers <CODE>prev_log_head</CODE> and <CODE>prev_log_tail</CODE> have the value zero. (Fig <A HREF="#fig_log_trunc">fig_log_trunc</A> b) <P>Currently, the log is truncated in a manner called <EM>epoch truncation</EM>. In one atomic operation, the function <CODE>new_epoch()</CODE> brackets all the current log records with <CODE>prev_log_head</CODE> and <CODE>prev_log_tail</CODE>; those records are said to be in the <EM>truncation epoch</EM> and are subject to truncation.
At the same time, <CODE>log_head</CODE> is advanced to <CODE>log_tail</CODE>, effectively resetting the <EM>current epoch</EM> to be empty. New log records may be appended by concurrent threads while truncation is in progress, and <CODE>log_tail</CODE> will advance to its new position. (Fig <A HREF="#fig_log_trunc">fig_log_trunc</A> a) <P>It is important to use the log pointers to divide the log into a truncation epoch and a current epoch, as we do not want the log truncation process to interfere with the log flushing process. <P> <FIGURE> <EPS FILE="log_trunc"> <CAPTION>Truncation of the log <A NAME="fig_log_trunc"></A> </CAPTION> </FIGURE> <P>If the truncation goes well and all the records are successfully applied to the data segment, the log pointers <CODE>prev_log_head</CODE> and <CODE>prev_log_tail</CODE> are reset to zero. The area that was occupied by the truncated records is then reclaimed. <P>However, it is possible that the program crashes while a truncation is in progress. RVM detects this upon the next restart by finding the two log pointers <CODE>prev_log_head</CODE> and <CODE>prev_log_tail</CODE> non-zero. It then redoes the truncation. Truncations can always be redone, since redoing one just writes the same values to the same locations again. <P> <H3>Log records</H3> <P>There are five different types of log records. Each of them has a corresponding in-core structure defined.
They are listed in the table <A HREF="#record-type">record-type</A> <CENTER><TABLE BORDER><TR><TD> Generic record </TD><TD> <CODE>rec_hdr_t</CODE> </TD><TD> 20 bytes</TD></TR><TR><TD> Transaction header </TD><TD> <CODE>trans_hdr_t</CODE> </TD><TD> 48 bytes</TD></TR><TR><TD> New value record </TD><TD> <CODE>nv_range_t</CODE> </TD><TD> 56 + length + pad bytes</TD></TR><TR><TD> Record end marker </TD><TD> <CODE>rec_end_t</CODE> </TD><TD> 28 bytes</TD></TR><TR><TD> Wrap marker </TD><TD> <CODE>log_wrap_t</CODE> </TD><TD> 24 bytes</TD></TR><TR><TD> Special log record </TD><TD> <CODE>log_special_t</CODE> </TD><TD> 24 + special + pad bytes</TD></TR><TR><TD> <CAPTION> <A NAME="record-type"></A> Six record types and their in-core structures</CAPTION> </TD></TR></TABLE></CENTER> <P>The generic record (<CODE>rec_hdr_t</CODE>) does not actually exist in the log. It contains the same first four fields as the other five 'real' record types. When RVM scans the log and finds a record, it may not yet know the type of the record. It can always cast the record to a generic record and then interpret the first four fields to determine the appropriate action; for example, it can learn the type of the record by looking at the first field. (The common four fields are <CODE>struct_id</CODE>, <CODE>rec_length</CODE>, <CODE>timestamp</CODE> and <CODE>rec_num</CODE>.) <P>In the five 'real' record types, type-specific fields follow the common four fields. As an example, the following shows the declarations of <CODE>rec_hdr_t</CODE> and <CODE>trans_hdr_t</CODE>.
<P> <HR> <PRE>
typedef struct
    {
    struct_id_t     struct_id;      /* type of entry */
    rvm_length_t    rec_length;     /* record length */
    struct timeval  timestamp;      /* timestamp of record entry */
    long            rec_num;        /* record number of entry */
    }
rec_hdr_t;

typedef struct
    {
    struct_id_t     struct_id;      /* self-identifier */
    rvm_length_t    rec_length;     /* log record length, displacement to end mark */
    struct timeval  timestamp;      /* timestamp of log entry */
    rvm_length_t    rec_num;        /* record number of log entry */
    rvm_length_t    num_ranges;     /* number of ranges in record */
    struct timeval  uname;          /* uname of transaction */
    struct timeval  commit_stamp;   /* timestamp of commit */
    rvm_length_t    n_coalesced;    /* count of coalesced transactions */
    rvm_length_t    flags;          /* mode and optimization flags */
    }
trans_hdr_t;
</PRE> <HR> <P><CODE>trans_hdr_t</CODE>, <CODE>nv_range_t</CODE> and <CODE>rec_end_t</CODE> represent the <EM>transaction header</EM>, the <EM>new value range header</EM> and the <EM>record end marker</EM>. They make up the majority of the records in the log. Each transaction is written as a sequence of records: it begins with a transaction header, is followed by one or more sub-records (the new value range records), and ends with a record end marker (Fig <A HREF="#fig_log_rec">fig_log_rec</A> ). This is the simple case, in which the transaction is not split by an intervening wrap marker. If the transaction is split by a wrap marker, it will have two sequences of records, each beginning with a transaction header and ending with an end marker. Details of log wrapping are discussed in section <A HREF="#log-wrap">log-wrap</A> . <P> <FIGURE> <EPS FILE="log_records"> <CAPTION>Transaction records <A NAME="fig_log_rec"></A> </CAPTION> </FIGURE> <P>New value records have variable lengths; each begins with a new value range header followed by the actual data. The actual data is the image of the virtual memory range that the transaction has committed.
The header is of type <CODE>nv_range_t</CODE> and is 56 bytes long; the length of the actual data varies. Note that RVM may pad some bytes after the actual data so that the next sub-record is word-aligned. <P>To allow RVM to scan the log in both the forward and backward directions, forward and backward links are provided. Generally speaking, <CODE>rec_length</CODE> is a forward link. For a transaction header, it is the forward link to the end marker; for a new value record, it is the forward link to the next new value record (or to the record end marker). However, for the record end marker, this value is the <EM>backward</EM> link to the transaction header (RVM is somewhat inconsistent in its naming here). On the other hand, <CODE>sub_rec_len</CODE> is a <EM>backward</EM> link to the previous sub-record. In the new value range header there is a third length field, <CODE>length</CODE>: it is the length of the actual data. <CODE>length</CODE> may not be equal to <CODE>rec_length</CODE> because of the padding. See Fig <A HREF="#fig_rec_link">fig_rec_link</A> for the details of these links. <P> <FIGURE> <EPS FILE="record_links"> <CAPTION>Forward and backward links of records <A NAME="fig_rec_link"></A> </CAPTION> </FIGURE> <P>The fourth possible record type that can be found in the log is the wrap marker. Note that wrapping may not happen exactly at the end of the log; it may happen a little earlier. In that case, the bytes following the wrap marker are all written with the value <CODE>0xff</CODE>. <P>The fifth possible record type is the special log record. Currently, the only special log record is the segment mapping descriptor (of type <CODE>log_seg_t</CODE>). RVM allows more than one data segment per log. To support this, every new value range is tagged with a <CODE>seg_code</CODE>, which indicates the segment to which that range should be applied. <CODE>seg_code</CODE> is only a shorthand for the segment.
Details of the segment (<CODE>num_bytes</CODE> and <CODE>name</CODE>) are recorded in the segment mapping descriptor. Segment mapping descriptors are inserted into the log whenever a new data segment is mapped for the first time. At log truncation time, RVM uses these segment mapping descriptors to reconstruct the information necessary to open the segment device. <P> <H3><A NAME="log-wrap"></A> Log wrapping</H3> <P>When <CODE>log_tail</CODE> approaches the end of the log, RVM chooses an appropriate point at which to do a <EM>wrapping</EM>. This is done by writing out a wrap marker. After that, RVM starts appending new records from <CODE>log_start</CODE>. There is no risk of overwriting useful records, because the old records should have been truncated long before; even if they have not been truncated yet, RVM will call an internal function, <CODE>wait_for_space()</CODE>, to force a truncation. <P>Log wrapping can happen <EM>between</EM> transactions. In this case, RVM simply leaves a wrap marker (of type <CODE>log_wrap_t</CODE>) and starts a brand new transaction at the beginning of the log, <CODE>log_start</CODE> (Fig <A HREF="#fig_wrap">fig_wrap</A> a). <P> <FIGURE> <EPS FILE="wrapping"> <CAPTION>Three different possible wrappings <A NAME="fig_wrap"></A> </CAPTION> </FIGURE> <P>Wrapping can also happen <EM>within</EM> a transaction. There are two possibilities: between different new value records, or <EM>splitting</EM> a new value record. Since a range can be very long, splitting new value records avoids leaving too much space unused by inserting the wrap marker too early. <P>In the first case, after the current new value records are written, a record-end marker is inserted into the log, followed by a log-wrap marker. Another transaction header is then inserted at <CODE>log_start</CODE> before the remaining new value records are written.
Effectively, the same transaction has two transaction headers: the first is written before the log wrap, the second after it. The <CODE>flags</CODE> field of the transaction header indicates this situation: the first transaction header will have only <CODE>FIRST_ENTRY_FLAG</CODE> set, while the second transaction header will have only <CODE>LAST_ENTRY_FLAG</CODE> set. (If the whole sequence of records for the transaction is not interrupted by a log wrap, both <CODE>FIRST_ENTRY_FLAG</CODE> and <CODE>LAST_ENTRY_FLAG</CODE> of the transaction header are set.) (Fig <A HREF="#fig_wrap">fig_wrap</A> b) <P>In the latter case, where a new value range is split, each of the two portions of the data has its own new value range header. The first portion of the split range is written, followed by a record-end marker and a wrap marker. The second transaction header is then written at <CODE>log_start</CODE>, followed by the second portion of the split range; the remaining new value records then follow. A flag, <CODE>is_split</CODE>, in the new value record header indicates that this has happened (Fig <A HREF="#fig_wrap">fig_wrap</A> c). The flags <CODE>FIRST_ENTRY_FLAG</CODE> and <CODE>LAST_ENTRY_FLAG</CODE> are also set as in the previous case. <P> <P> <H2><A NAME="ss7.4">7.4 Internal Data Structure</A> </H2> <P>Before we describe the important data structures used by RVM, we will first discuss some of their building blocks. In the next section we discuss the generic list and generic tree data structures. In the section after that we discuss the synchronization primitives that enable multi-threading in RVM. <P> <H3>Generic data structure</H3> <P>Many of the RVM data structures are organized as lists or trees. To support this, RVM internally defines a generic list type as well as a generic tree type.
<P>The generic list node is declared as <CODE>list_entry_t</CODE>, while the generic tree node is declared as <CODE>tree_node_t</CODE>. Supporting routines for them are <CODE>move_list_entry()</CODE>, <CODE>tree_insert()</CODE>, etc. <P>Each of these generic list/tree nodes contains enough topological information for it to be manipulated as a list node or tree node. Generic list and tree nodes are typically not allocated on their own; rather, they appear as the first member of the specific data structures that need to be organized. For example, the region descriptor <CODE>region_t</CODE> is declared with <CODE>list_entry_t</CODE> as the first member of the structure. That way, the region descriptor can be manipulated as a list element. We can do the following: <PRE>
seg_t    *seg;
region_t *region;

move_list_entry(NULL, &seg->map_list, &region->links);
</PRE> to move the region descriptor <CODE>region</CODE> onto the mapped-region list (<CODE>map_list</CODE>) of the segment descriptor <CODE>seg</CODE>. <P> <H3>Support for multi-threading</H3> <P>RVM is designed to be thread-safe: the application is responsible for protecting access to its own data structures by concurrent threads, but can otherwise use RVM functions freely. RVM synchronizes its internal operations itself, using the following three primitives: <P> <H3>Mutex</H3> <P>Internal data that may be accessed concurrently is protected by a mutex of type <CODE>RVM_MUTEX</CODE>. Routines accessing such data are required to call <CODE>mutex_lock()</CODE> before, and <CODE>mutex_unlock()</CODE> after, the access. Alternatively, they can use the macro <CODE>CRITICAL()</CODE> to bracket their access to critical data. <P> <HR> <PRE>
RVM_MUTEX lck;
mutex_init(&lck);        /* initialization */
...
mutex_lock(&lck);        /* begin lck crit. sec. */
/* body */;              /* code in lck crit. sec. */
mutex_unlock(&lck);      /* end lck crit. sec. */

/* alternatively, use the macro */
CRITICAL(lck,            /* begin lck crit. sec. */
    /* body */;          /* code in lck crit. sec. */
    );                   /* end lck crit. sec. */
</PRE> <HR> <P> <H3>Read/Write Lock</H3> <P>Sometimes data can be read by many threads simultaneously, provided none of them attempts to update the data. This kind of concurrency is supported by the read/write lock of type <CODE>rw_lock_t</CODE>. Readers access the data with <CODE>rw_lock(...,r)</CODE>; they are allowed to proceed immediately if there is no writer, otherwise they block until the writer releases the lock. Writers access the data with <CODE>rw_lock(...,w)</CODE>; they are allowed to proceed immediately if there are no other readers or writers, otherwise they block until all of them release the lock. Both readers and writers must call <CODE>rw_unlock()</CODE> when they are done accessing the protected data. There is also a macro <CODE>RW_CRITICAL()</CODE> for bracketing accesses to critical data protected by a read/write lock. <P> <HR> <PRE>
rw_lock_t rwl;
init_rw_lock(&rwl);      /* initialization */
...
/* reader */
rw_lock(&rwl,r);         /* begin rwl crit. sec. */
/* body */;              /* read immediately if no other */
/* body */;              /* writer; otherwise block until */
/* body */;              /* it releases the lock */
rw_unlock(&rwl,r);       /* end rwl crit. sec. */
...
/* writer */
rw_lock(&rwl,w);         /* begin rwl crit. sec. */
/* body */;              /* write immediately if no other */
/* body */;              /* writers/readers; otherwise block */
/* body */;              /* until they release the lock */
rw_unlock(&rwl,w);       /* end rwl crit. sec. */
...
RW_CRITICAL(rwl, mode,   /* begin rwl crit. sec., mode=r|w */
    /* body */;
    );                   /* end rwl crit. sec. */
</PRE> <HR> <H3>Condition variable</H3> <P>Two concurrent threads can synchronize their actions by using a condition variable of type <CODE>RVM_CONDITION</CODE>. Thread A can sleep by calling <CODE>condition_wait()</CODE>, waiting to be woken up when thread B calls <CODE>condition_signal()</CODE>. <P>This technique is used for asynchronous log truncation.
A separate thread executes the code of <CODE>log_daemon()</CODE> if there is multi-threading support (either Mach's cthreads, or Unix LWP or pthreads). Normally this thread blocks waiting while the other thread(s) run. From time to time, as transactions are committed, the other thread(s) check whether the log is too full; if it is, they wake up the <CODE>log_daemon()</CODE>, which then truncates the log asynchronously. <P> <HR> <PRE>
RVM_CONDITION code;

/* thread of log_daemon() */
CRITICAL(lock,
    {
    while (state == rvm_idle)
        condition_wait(&code, &lock);
    });
/* async. log truncation, or terminate
 * log daemon here */

/* thread of log_tid() */
CRITICAL(lock,
    {
    if (need_truncation)
        {
        if (state == rvm_idle)
            {
            state = truncating;
            condition_signal(&code);
            }
        }
    });
</PRE> <HR> <H3>Important Data Structures</H3> <P>The structures below are all declared in the file <CODE>rvm_private.h</CODE>. The important data structures are the following. <P> <H3>Segment Descriptor (<CODE>seg_t</CODE>)</H3> <P>There can be more than one segment per process. Each segment has a segment descriptor. These segment descriptors are linked together in a list, and the list header is the global variable <CODE>seg_root</CODE>. <P>The segment descriptor contains the header <CODE>map_list</CODE> for an important list, the list of mapped regions, which is discussed in more detail in section <A HREF="#sect2-region">sect2-region</A> <P>There can be more than one segment per log. In fact, each individual new value record can be applied to a different segment. This is achieved by tagging each new value record with a short name for its segment, <CODE>seg_code</CODE>. The short name of each segment is also stored in the segment's own descriptor. <P> <H3>Log descriptor (<CODE>log_t</CODE>)</H3> <P>Like the segment descriptors, the log descriptors are connected in a list (the header is the global variable <CODE>log_root</CODE>).
However, RVM currently supports only one log per process; it is pointed to by the global pointer <CODE>default_log</CODE>. <P><CODE>log_status</CODE> is the in-core copy of the log status block; it is explained in more detail in the next section. Similarly, <CODE>trans_hdr</CODE>, <CODE>rec_end</CODE> and <CODE>log_wrap</CODE> are the areas for preparing the corresponding records for the log (c.f. section <A HREF="#sect2-log-record">sect2-log-record</A> ). <P><CODE>log_buf</CODE> is the log recovery buffer descriptor; it is explained in section <A HREF="#log-buf">log-buf</A> . <P>The log descriptor contains the headers for three important lists: <CODE>tid_list</CODE>, <CODE>flush_list</CODE> and <CODE>special_list</CODE>. The first links up transactions that have begun but not yet ended. The second links up transactions that have been committed but not yet flushed. <P> <H3>Log status area (<CODE>log_status</CODE>)</H3> <P>The log status has an in-core copy as well as an on-disk copy. The in-core copy is declared as type <CODE>log_status_t</CODE> and is stored as a sub-structure of the log descriptor; the on-disk copy is written in the log status block. Ideally, we would want the in-core copy to be written to disk every time the log status changes. However, doing so would require many disk head seeks between the log status block and the log tail, which would defeat the purpose of the log, so RVM updates the on-disk log status block only every UPDATE_STATUS (currently 100) status changes. This is tracked by the counter <CODE>update_cnt</CODE>. Because of this, RVM does not completely trust the values in the log status block. For example, when doing log truncation, RVM trusts the value <CODE>prev_log_head</CODE> but not the value <CODE>log_tail</CODE>, so it performs a <CODE>locate_tail()</CODE> to make sure it finds the last records written in the log.
<P>The log status area records the important log pointers <CODE>log_head</CODE>, <CODE>log_tail</CODE>, <CODE>prev_log_head</CODE> and <CODE>prev_log_tail</CODE>, and the log limits <CODE>log_start</CODE> and <CODE>log_size</CODE>. Besides these, it also records a large number of consistency-check fields, transaction statistics and log statistics. <P> <H3><A NAME="sect2-log-record"></A> Area for preparing log records</H3> <P>Each of the log record types has a corresponding in-core data structure, used to prepare the records before they are written out to the log. There is only one copy each of <CODE>trans_hdr_t</CODE>, <CODE>rec_end_t</CODE> and <CODE>log_wrap_t</CODE> per log, and they are part of the log descriptor. For new value records, areas are required for the headers and the actual data, and these are per-range. The headers are of type <CODE>nv_range_t</CODE> and are stored in the range descriptors (described in a following section); the actual data is stored in a separately allocated buffer pointed to by the field <CODE>data</CODE> in the range descriptor. Special log records are associated with the log; they are dynamically allocated and organized as a list (header: <CODE>log_special_list</CODE>) in the log descriptor. <P> <H3><A NAME="log-buf"></A> Log recovery buffer (<CODE>log_buf->buf</CODE>) and its descriptor (<CODE>log_buf</CODE>) </H3> <P>The log recovery buffer (also called the log buffer) is used in log truncation, to keep an image of log records for processing. The descriptor of the buffer is called the log recovery buffer descriptor (type <CODE>log_buf_t</CODE>); it is a sub-structure of the log descriptor. The actual buffer is pointed to by the field <CODE>buf</CODE>, and the length of the buffer is stored in the field <CODE>buf_len</CODE>.
<P>Two routines can be used to fill and refill the log recovery buffer: <UL> <LI><CODE>init_buffer(log, offset, direction, ...)</CODE></LI> <LI><CODE>refill_buffer(log, direction, ...)</CODE></LI> </UL> The log can be scanned in two directions; this is indicated by the argument <CODE>direction</CODE> of the two routines. They try to read as much data as possible from the log device, subject to the length of the log buffer and other constraints. <P>Upon return from these two routines, <CODE>log_buf->ptr</CODE> is the index into the log buffer that corresponds to the current log position, and <CODE>log_buf->offset</CODE> is the log offset of byte 0 of the log buffer. <CODE>log_buf->offset</CODE> is sector aligned. (Figure <A HREF="#fig_log_buf">fig_log_buf</A> ) <P> <FIGURE> <EPS FILE="log_buf"> <CAPTION>log buffer and init_buffer() <A NAME="fig_log_buf"></A> </CAPTION> </FIGURE> <P>The following is a typical code sequence in log truncation: <P> <HR> <PRE>
...
r_length = init_buffer(log, off, REVERSE, ...);
log_buf->ptr -= sizeof(rec_end_t);
rec_end = (rec_end_t *) &log_buf->buf[log_buf->ptr];
if (rec_end->struct_id == rec_end_id)
    {
    ...
</PRE> <HR> <P>In the code, <CODE>off</CODE> is the current log position. The routine <CODE>init_buffer()</CODE> is instructed to initialize the buffer by scanning the log in the reverse direction. When it returns, <CODE>r_length</CODE> bytes have been read into the log recovery buffer (pointed to by <CODE>log_buf->buf</CODE>), and <CODE>log_buf->ptr</CODE> is the index of the byte in the log buffer that corresponds to <CODE>off</CODE>. Assuming the routine here expects an end record in the log, it decrements <CODE>log_buf->ptr</CODE> by the known size of <CODE>rec_end_t</CODE> so as to point to the end record. After the decrement, the address of <CODE>log_buf->buf[log_buf->ptr]</CODE> should be the beginning of the in-core image of the end record.
The code type-casts this address as <CODE>(rec_end_t *)</CODE> and stores it in <CODE>rec_end</CODE>. After that it can manipulate the record as it likes; in the example, it examines whether the <CODE>struct_id</CODE> of the end record is really <CODE>rec_end_id</CODE>. <P> <H3><A NAME="sect2-region"></A> Region descriptors (<CODE>region_t</CODE>)</H3> <P>Each segment has a list of the regions that are mapped, so in each segment descriptor (type <CODE>seg_t</CODE>) there is a member <CODE>map_list</CODE> which is the head of this list. Each node in the list is a region descriptor of type <CODE>region_t</CODE>. <P>It records the <CODE>offset</CODE> and <CODE>end_offset</CODE> of the region in the data segment, as well as the <CODE>vmaddr</CODE> and <CODE>length</CODE> of the virtual memory region it is mapped to. <P>The member <CODE>n_uncommit</CODE> records the number of uncommitted ranges in this region; it is incremented every time the application does an <CODE>rvm_set_range()</CODE>, and decremented when a transaction is committed or aborted. When the application tries to do an <CODE>rvm_unmap()</CODE>, this value must be zero, otherwise RVM_EUNCOMMIT is returned. <P><CODE>region_t</CODE> may be confused with <CODE>rvm_region_t</CODE> and <CODE>mem_region_t</CODE>. We explain the former in the following and the latter in the next section. <CODE>rvm_region_t</CODE> is part of the API; it is the structure used by the application to specify the mapping parameters: <CODE>data_dev</CODE>, <CODE>offset</CODE>, <CODE>length</CODE>, <CODE>vmaddr</CODE>, <CODE>no_copy</CODE>, etc. RVM does not use this structure internally once it has obtained all the parameters from the application and transferred them to a <CODE>region_t</CODE>. <P> <H3>Memory region descriptors (<CODE>mem_region_t</CODE>)</H3> <P>In addition to the per-segment region list described above, there is also a per-process memory region tree.
It records the regions of the virtual memory space that are mapped, arranged in order of VM address. <P>When the application calls <CODE>rvm_map()</CODE>, this tree is consulted to make sure the requested virtual memory region is not already mapped. If it is, RVM_EVM_OVERLAP will be returned. Also, when the application calls <CODE>rvm_set_range()</CODE>, this tree is consulted to make sure the requested range is totally contained within a single virtual memory region. If it is not, RVM_ENOT_MAPPED will be returned. <P>The root of the tree is <CODE>region_tree</CODE>. Each node of the tree is of type <CODE>mem_region_t</CODE>. <P> <H3>Transaction descriptor (<CODE>int_tid_t</CODE>)</H3> <P>Whenever the application calls <CODE>rvm_begin_transaction()</CODE>, RVM allocates a transaction descriptor of type <CODE>int_tid_t</CODE>. These descriptors are linked into lists. Before the transaction commits, the descriptor is in the list headed by <CODE>log->tid_list</CODE>. When the transaction commits, it is moved to another list headed by <CODE>log->flush_list</CODE>. When the transaction record is flushed, it is removed from the flush list. <P>Typically, the application informs RVM of one or more memory ranges that will be affected by the transaction. This is done by issuing one or more <CODE>rvm_set_range()</CODE> or <CODE>rvm_modify_bytes()</CODE> calls after <CODE>rvm_begin_transaction()</CODE>. Internally, these memory ranges are organized as a tree rooted at <CODE>range_tree</CODE> (of type <CODE>tree_root_t</CODE>), a field of the transaction descriptor. <P><CODE>int_tid_t</CODE> may be confused with <CODE>rvm_tid_t</CODE>. The latter is the transaction identifier used by applications while the former is used internally by RVM. Although <CODE>rvm_tid_t</CODE> contains a pointer to an <CODE>int_tid_t</CODE>, applications are not expected to use or touch that pointer. 
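<P>The movement of a transaction descriptor between the two lists can be sketched as follows. This is a minimal illustration using hypothetical, much-simplified types; the real <CODE>int_tid_t</CODE> and log descriptor carry many more fields (the range tree, timestamps, flags, and so on), and the real lists use RVM's generic list package.

```c
#include <stddef.h>

/* Hypothetical, much-simplified transaction descriptor. */
typedef struct int_tid_s {
    struct int_tid_s *next;
    int committed;
} int_tid_t;

/* Hypothetical, much-simplified log descriptor. */
typedef struct {
    int_tid_t *tid_list;    /* uncommitted transactions */
    int_tid_t *flush_list;  /* committed, not yet flushed */
} log_t;

/* Unlink a descriptor from a singly-linked list. */
static void unlink_tid(int_tid_t **head, int_tid_t *tid)
{
    while (*head != tid)
        head = &(*head)->next;
    *head = tid->next;
}

/* On commit, the descriptor moves from tid_list to flush_list;
   flush_log() later drains flush_list to the log device. */
static void commit_tid(log_t *log, int_tid_t *tid)
{
    unlink_tid(&log->tid_list, tid);
    tid->committed = 1;
    tid->next = log->flush_list;
    log->flush_list = tid;
}
```

The essential invariant is that a descriptor lives on exactly one of the two lists until the flush removes it entirely.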
<P> <H3>Modification range descriptor (<CODE>range_t</CODE>)</H3> <P>When the application specifies a modification range with <CODE>rvm_set_range()</CODE> or <CODE>rvm_modify_bytes()</CODE>, the VM address (<CODE>range->nv.vmaddr</CODE>) and the length (<CODE>data_len</CODE>) of the range are stored internally in a range descriptor (<CODE>range_t</CODE>). When the transaction commits, the content of the virtual memory pointed to by <CODE>range->nv.vmaddr</CODE> is copied into the buffer pointed to by <CODE>range->data</CODE>. The same buffer has another purpose: if the transaction is in <CODE>restore</CODE> mode, the buffer is also used to save the old image of the VM range, so that if the transaction is aborted, the old image can be restored. <P>Range descriptors are connected together as a tree, with the tree root in the corresponding transaction descriptor. If the optimization flag <CODE>RVM_COALESCE_RANGES</CODE> is specified, the tree is sorted by VM address, and ranges are coalesced before they are inserted into the tree. If <CODE>RVM_COALESCE_RANGES</CODE> is not specified, the tree is simply sorted by range number. <P>There is also a field of type <CODE>nv_range_t</CODE> called <CODE>nv</CODE> which is used as the in-memory image of the new value range record on the log. <P> <H3>Storage device region node (<CODE>dev_region_t</CODE>)</H3> <P>This is not to be confused with <CODE>region_t</CODE> and <CODE>mem_region_t</CODE>. <CODE>dev_region_t</CODE> is used only in log truncation. During log truncation, all sub-records of all transactions in the truncation epoch are processed, and a tree called <CODE>mod_tree</CODE> is built. Each node of the tree is of type <CODE>dev_region_t</CODE>. The tree is sorted by segment offset. The purpose of the tree is to make the disk write operations on the segment device sequential. 
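<P>Why ordering by segment offset matters can be sketched as follows. This is not the actual RVM code: the real <CODE>dev_region_t</CODE> lives in a balanced tree and also carries a pointer to the new value data; here a plain array and <CODE>qsort()</CODE> stand in for the tree, purely to show the seek-minimizing idea.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-in for dev_region_t. */
typedef struct {
    long offset;   /* segment offset of the modified range */
    long length;   /* length of the range */
} dev_region_t;

static int by_offset(const void *a, const void *b)
{
    long x = ((const dev_region_t *)a)->offset;
    long y = ((const dev_region_t *)b)->offset;
    return (x > y) - (x < y);
}

/* Visiting the ranges in ascending segment-offset order means the
   write-back pass moves across the data device strictly front to
   back, so the disk head seeks in one direction only. */
static void order_for_writeback(dev_region_t *mods, size_t n)
{
    qsort(mods, n, sizeof mods[0], by_offset);
}
```

In RVM itself the sorted order falls out of an in-order walk of <CODE>mod_tree</CODE> rather than an explicit sort.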
<P> <H3>Device descriptor (<CODE>device_t</CODE>)</H3> <P>The device descriptor (of type <CODE>device_t</CODE>) contains the details of the physical device: name, handle, size. It also contains pointers to the gather-write I/O vectors. <P> <H2><A NAME="ss7.5">7.5 RVM in action</A> </H2> <P> <H3><CODE>rvm_initialize()</CODE></H3> <P>Global lists and trees are initialized (<CODE>seg_root</CODE>, <CODE>log_root</CODE>, <CODE>region_tree</CODE>). A log truncation (<CODE>log_recover()</CODE>) is performed. <P> <H3><CODE>rvm_map()</CODE></H3> <P>There is always a log truncation at the beginning of mapping. This is necessary to ensure the mapping gets the latest committed image from the data segment. <CODE>rvm_map()</CODE> also checks whether the application has specified a new segment device; if so, it creates a special record of type <CODE>log_seg_t</CODE> and queues it on <CODE>log->special_list</CODE>. This record will be flushed to the log before the next transaction. <P>The application may or may not specify the mapped-to address for the region. If it is not specified, <CODE>rvm_map()</CODE> does a <CODE>page_alloc()</CODE> to find a VM address for the requested region. In either case, the mapped-to address is stored in <CODE>region->vmaddr</CODE> as well as <CODE>mem_region->vmaddr</CODE>. <P>Before carrying out the mapping, <CODE>rvm_map()</CODE> also needs to check that the requested mapping does not overlap an existing mapping, in either the VM address space or the segment offset space. The former is detected with the aid of <CODE>region_tree</CODE> while the latter is detected with the aid of <CODE>seg->map_list</CODE>. If there is overlap, RVM_EVM_OVERLAP will be returned in the former case and RVM_EOVERLAP in the latter. <P>Finally, the content of the data segment at the specified offset is copied into the VM space (unless <CODE>no_copy</CODE> mode is specified). 
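<P>The overlap test at the heart of both checks is the standard interval test, which can be sketched as a hypothetical helper (this is an illustration, not the actual RVM source):

```c
/* Two half-open ranges [a, a+alen) and [b, b+blen) overlap iff each
   one starts before the other ends.  Conceptually, rvm_map() applies
   a test like this twice: against the VM ranges recorded in
   region_tree (failure: RVM_EVM_OVERLAP) and against the segment
   offsets recorded in seg->map_list (failure: RVM_EOVERLAP). */
static int ranges_overlap(unsigned long a, unsigned long alen,
                          unsigned long b, unsigned long blen)
{
    return a < b + blen && b < a + alen;
}
```

Note that ranges that merely touch end to end do not overlap under this test, which is why adjacent mappings are legal.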
<P> <H3><CODE>rvm_begin_transaction()</CODE></H3> <P>A transaction descriptor (of type <CODE>int_tid_t</CODE>) is allocated and inserted in <CODE>log->tid_list</CODE>. The address of that descriptor is also noted in <CODE>rvm_tid->tid</CODE>. <P> <H3><CODE>rvm_set_range()</CODE> and <CODE>rvm_modify_bytes()</CODE></H3> <P>The range declared must be totally contained inside a mapped region, otherwise RVM_ENOT_MAPPED will be returned. This is checked by consulting <CODE>region_tree</CODE>. <P>A new range descriptor (of type <CODE>range_t</CODE>) is allocated and inserted into <CODE>tid->range_tree</CODE>. Depending on the optimization flags of the transaction, the range tree may be sorted by VM address or by range number. <P>If the transaction is in <CODE>restore</CODE> mode, the current image of VM in the range is first saved to a buffer pointed to by <CODE>range->data</CODE>. <P>The VM address of the declared range is stored in <CODE>range->nv.vmaddr</CODE>. <P> <H3><CODE>rvm_end_transaction()</CODE></H3> <P>When the application calls <CODE>rvm_end_transaction()</CODE>, for each of the ranges in <CODE>tid->range_tree</CODE>, the VM area pointed to by <CODE>range->nv.vmaddr</CODE> is copied to <CODE>range->data</CODE>. This ensures the atomicity of the transaction: when <CODE>rvm_end_transaction()</CODE> returns, all declared ranges have been copied to a separate location. If the program crashes before <CODE>rvm_end_transaction()</CODE> returns, the data segment will contain the old values of the declared ranges. Applications are free to write new data into the ranges after <CODE>rvm_end_transaction()</CODE> returns. <P>The transaction descriptor <CODE>tid</CODE> is dequeued from <CODE>log->tid_list</CODE> and queued to <CODE>log->flush_list</CODE>. <P>If the RVM_COALESCE_TRANS flag is set for the transaction, inter-transaction coalescing is carried out. 
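<P>The commit-time copy described above can be sketched as follows. The types are hypothetical simplifications (the real <CODE>range_t</CODE> keeps the address inside its embedded <CODE>nv_range_t</CODE> header, as <CODE>range->nv.vmaddr</CODE>, and the ranges live in a tree rather than a list); the point is only that each declared range is snapshotted into a private buffer, decoupling the log image from subsequent stores by the application.

```c
#include <string.h>
#include <stdlib.h>

/* Hypothetical, simplified range descriptor. */
typedef struct range_s {
    struct range_s *next;
    char *vmaddr;      /* start of the declared VM range */
    size_t data_len;   /* length of the range */
    char *data;        /* commit-time snapshot buffer */
} range_t;

/* At end-transaction time, every declared range is copied out of
   live VM into its own buffer, so later modifications of the same
   VM area cannot affect what gets flushed to the log. */
static void snapshot_ranges(range_t *head)
{
    for (range_t *r = head; r != NULL; r = r->next) {
        r->data = malloc(r->data_len);
        memcpy(r->data, r->vmaddr, r->data_len);
    }
}
```

This is why the application may safely overwrite the ranges as soon as <CODE>rvm_end_transaction()</CODE> returns, even before the flush happens.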
<P>If the transaction is committed in <CODE>flush</CODE> mode, <CODE>flush_log()</CODE> is called. <P> <H3><CODE>flush_log()</CODE></H3> <P>Every transaction descriptor in <CODE>log->flush_list</CODE> is written out to the log; this is done by the routine <CODE>log_tid()</CODE>. First, the transaction header, of type <CODE>trans_hdr</CODE>, is built and added to the gather-write vector <CODE>dev->iov</CODE>. Then, for every range in <CODE>tid->range_tree</CODE>, a new value range header of type <CODE>nv_range_t</CODE> is built and added to the gather-write vector, followed by the actual new value data. After all the new value records are written, a record end marker (of type <CODE>rec_end_t</CODE>) is built and added to the gather-write vector. <P>If log wrapping is needed, an intervening record end marker, log wrap marker and transaction header record are inserted, as described in section <A HREF="#log-wrap">log-wrap</A> . <P> <H3><CODE>rvm_truncate()</CODE></H3> <P>Log truncation is carried out by <CODE>log_recover()</CODE>. For each segment, a modification tree <CODE>mod_tree</CODE> is built. Each node is of type <CODE>dev_region_t</CODE>, and the tree is arranged in order of segment offset. For each new value record of each transaction on the log, a node of type <CODE>dev_region_t</CODE> is built and inserted into the modification tree. If there are overlaps in segment offset, the overlapped portion is not inserted into the modification tree again. Such overlaps are not an error; for example, the same memory range can be updated by different transactions. Also note that since the log is truncated in last-in-first-out order, i.e. the new value records that were flushed last are inserted into the tree first, the version of each range inserted into the tree is always the most up-to-date one. <P>In this way, all the log records bracketed by the log pointers <CODE>prev_log_head</CODE> and <CODE>prev_log_tail</CODE> are scanned and inserted into the modification tree. 
The whole tree is then written to the data device (<CODE>apply_mods()</CODE>). <P> <H2><A NAME="ss7.6">7.6 RVMUTL</A> </H2> <P>The program <CODE>rvmutl</CODE> is a utility program that allows: <UL> <LI> initialization of a log</LI> <LI> debugging via the log</LI> </UL> The following is an incomplete list of commands available in <CODE>rvmutl</CODE>: <UL> <LI> <CODE>init_log</CODE>: initialize a log</LI> <LI> <CODE>open_log</CODE>: open an existing log</LI> <LI> <CODE>status</CODE> : display information from the log status block</LI> <LI> <CODE>head</CODE> : locate and display the first record of the current epoch</LI> <LI> <CODE>tail</CODE> : locate and display the last record of the current epoch</LI> <LI> <CODE>prev (n)</CODE>: locate and display the previous (n) record(s)</LI> <LI> <CODE>next (n)</CODE>: locate and display the next (n) record(s)</LI> </UL> <P>In the RVM source code, you will often see the global variable <CODE>rvm_utlsw</CODE>. RVM is designed with two possible types of users of its routines in mind: ordinary applications and the program <CODE>rvmutl</CODE>. This variable allows differential treatment of exception conditions. For example, in the routine <CODE>scan_forward()</CODE>, if the record type is found not to be one that the routine is expecting (which may mean a corrupted log), the program will: <UL> <LI> abort, if it is called by a general application</LI> <LI> continue with an error return, if it is called by <CODE>rvmutl</CODE></LI> </UL> The latter treatment allows manual handling of log corruption via <CODE>rvmutl</CODE>. <P> <HR> <A HREF="rvm_manual-8.html">Next</A> <A HREF="rvm_manual-6.html">Previous</A> <A HREF="rvm_manual.html#toc7">Contents</A> </BODY> </HTML>