Sophie

Sophie

distrib > Fedora > 14 > x86_64 > by-pkgid > 9902e69a44dd756e5d778e8dd9bd5eef > files > 9

mdbtools-devel-0.6-0.7.cvs20051109.fc12.1.i686.rpm

This file documents the Microsoft MDB file format for Jet3 and Jet4 databases.

General Notes
-------------

Access (Jet) does not in general initialize pages to zero before writing them,
so the file will contains a lot of unititialized data.  This makes the task of
figuring out the format a bit more difficult than it otherwise would be.

This document will, generally speaking, provide all offsets and constants in 
hex format.  

Most multibyte pointer and integers are stored in little endian (LSB-MSB) order.
There is an exception in the case of indexes, see the section on index pages for
details.

Terminology
-----------

This section contains a mix of information about data structures used in the MDB
file format along with general database terminology needed to explain these 
structures.

Page          - A fixed size region within the file on a 2 or 4K boundry. All 
                data in the file exists inside pages.
System Table  - Tables in Access generally starting with "MSys".  The 'Flags'
                field in the table's Catalog Entry will contain a flag in one
                of two positions (0x80000000 or 0x00000002).  See also the TDEF
		(table definition) pages for "System Table" field.
Catalog Entry - A row from the MSysObjects table describing another database
                object.  The MSysObjects table definition page is always at 
                page 2 of the database, and a phony tdef structure is
                bootstrapped to initially read the database.
Page Split    - A process in which a row is added to a page with no space left.
                A second page is allocated and rows on the original page are 
		split between the two pages and then indexes are updated. Pages
                can use a variety of algorithms for splitting the rows, the 
                most popular being a 50/50 split in which rows are divided 
                evenly between pages.
Overflow Page - Instead of doing a full page split with associated index writes,
                a pointer to an "overflow" page can be stored at the original
                row's location. Compacting a database would normally rewrite
                overflow pages back into regular pages.
Leaf Page     - The lowest page on an index tree.  In Access, leaf pages are of
                a different type than other index pages.
UCS-2         - a two byte unicode encoding used in Jet4 files.
Covered Query - a query that can be satisfied by reading only index pages.  For
                instance if the query 
		"SELECT count(*) from Table1 where Column3 = 4" were run and 
                Column3 was indexed, the query could be satisfied by reading
                only indexes.  Because of the way Access hashes text columns
                in indexes, covered queries on text columns are not possible.

Pages
-----

At its topmost level, a MDB file is organized into a series of fixed-size
pages.  These are 2K in size for Jet3 (Access 97) and 4K for Jet4 (Access
2000/2002).  All data in MDB files exists within pages, of which there are 
a number of types.

The first byte of each page identifies the page type as follows.

0x00 Database definition page.  (Always page 0)
0x01 Data page
0x02 Table definition
0x03 Intermediate Index pages
0x04 Leaf Index pages 
0x05 Page Usage Bitmaps (extended page usage)
0x08 ??


Database Definition Page
------------------------

Each MDB database has a single definition page located at beginning of the file.
Not a lot is known about this page, and it is one of the least documented page
types.  However, it contains things like Jet version, encryption keys, and name
of the creating program.

Offset 0x14 contains the Jet version of this database: 0x00 for 3, 0x01 for 4.
This is used by the mdb-ver utility to determine the Jet version.


Data Pages
----------

Data rows are all stored in data pages.

The header of a Jet3 data page looks like this:

+--------------------------------------------------------------------------+
| Jet3 Data Page Definition                                                |
+------+---------+---------------------------------------------------------+
| data | length  | name       | description                                |
+------+---------+---------------------------------------------------------+
| 0x01 | 1 byte  | page_type  | 0x01 indicates a data page.                |
| 0x01 | 1 byte  | unknown    |                                            |
| ???? | 2 bytes | free_space | Free space in this page                    |
| ???? | 4 bytes | tdef_pg    | Page pointer to table definition           |
| ???? | 2 bytes | num_rows   | number of records on this page             |
+--------------------------------------------------------------------------+
| Iterate for the number of records                                        |
+--------------------------------------------------------------------------+
| ???? | 2 bytes | offset_row | The record's location on this page         |
+--------------------------------------------------------------------------+

Notes:
. In Jet4, an additional four-byte field was added after tdef_pg.  Its purpose
  is currently unknown.
. Offsets that have 0x40 in the high order byte point to a location within the
  page where a Data Pointer (4 bytes) to another data page (also known as an
  overflow page) is stored.  Called 'lookupflag' in source code.
. Offsets that have 0x80 in the high order byte are deleted rows.  Called
  'delflag' in source code.


Rows are stored from the end of the page to the top of the page.  So, the first
row stored runs from the row's offset to page_size - 1.  The next row runs from
its offset to the previous row's offset - 1, and so on.

Decoding a row requires knowing the number and types of columns from its TDEF
page. Decoding is handled by the routine mdb_crack_row().

+--------------------------------------------------------------------------+
| Jet3 Row Definition                                                      |
+------+---------+---------------------------------------------------------+
| data | length  | name       | description                                |
+------+---------+---------------------------------------------------------+
| ???? | 1 byte  | num_cols   | Number of columns stored on this row.      |
| ???? | n bytes | fixed_cols | Fixed length columns                       |
| ???? | n bytes | var_cols   | Variable length columns                    |
| ???? | 1 byte  | eod        | length of data from begining of record     |
| ???? | n bytes | var_table[]| offset from start of row for each var_col  |
| ???? | n bytes | jump_table | Jump table (see description below)         |
| ???? | 1 byte  | var_len    | number of variable length columns          |
| ???? | n bytes | null_mask  | Null indicator.  See notes.                |
+--------------------------------------------------------------------------+

+--------------------------------------------------------------------------+
| Jet4 Row Definition                                                      |
+------+---------+---------------------------------------------------------+
| data | length  | name       | description                                |
+------+---------+---------------------------------------------------------+
| ???? | 2 bytes | num_cols   | Number of columns stored on this row.      |
| ???? | n bytes | fixed_cols | Fixed length columns                       |
| ???? | n bytes | var_cols   | Variable length columns                    |
| ???? | 2 bytes | eod        | length of data from begining of record     |
| ???? | n bytes | var_table[]| offset from start of row for each var_col  |
| ???? | 2 bytes | var_len    | number of variable length columns          |
| ???? | n bytes | null_mask  | Null indicator.  See notes.                |
+--------------------------------------------------------------------------+

Notes:

. A row will always have the number of fixed columns as specified in the table
  definition, but may have fewer variable columns, as rows are not updated when
  columns are added.
. All fixed-length columns are stored first to last, followed by non-null
  variable-length columns stored first to last.
. If the number of variable columns, as given in the TDEF, is 0, then the
  only items in the row are num_cols, fixed_cols, and null_mask.
. The var_len field indicates the number of entries in the var_table[].
. The var_table[] and jump_table[] are stored in reverse order.
. The eod field points at the first byte after the var_cols field.  It is used
  to determine where the last var_col ends.
. The size of the null mask is computed by (num_cols + 7)/8.
. Fixed columns can be null (unlike some other databases).
. The null mask stores one bit for each column, starting with the
  least-significant bit of the first byte.
. In the null mask, 0 represents null, and 1 represents not null.
. Values for boolean fixed columns are in the null mask: 0 - false, 1 - true.

In Jet3, offsets are stored as 1-byte fields yielding a maximum of 256 bytes.
To get around this, offsets are computed using a jump table.  The jump table
stores the number of the first column in each jump segment.  If the size of the
row is less than 256 then the jump table will not be present.  Also, eod is
treated as an additional entry of the var_table[].

For example, if the row contains 45 columns and the 15th column is the first
with an offset of 256 or greater, then the first entry in the jump table will be
0xe (14).  If the 24th column is the first one at offset >= 512, the second
entry of the jump table would be 0x17 (23).  If eod is the first entry >= 768,
the last entry in this case will be 0x2d (45).

The number of jump table entries is calculated based on the size of the row,
rather than the location of eod.  As a result, there may be a dummy entry that
contains 0xff.  In this case, and using the example above, the values in the
jump table would be 0x2d 0x17 0x0e 0xff.

In Jet4 all offsets are stored as 2 byte fields, including the var_table
entries.  Thus, the jump table was (thankfully) ditched in Jet4.


Each memo column (or other long binary data) in a row

+-------------------------------------------------------------------------+
| Memo Field Definition (12 bytes)                                        |
+------+---------+-------------+------------------------------------------+
| data | length  | name        | description                              |
+------+---------+-------------+------------------------------------------+
| ???? | 3 bytes | memo_len    | Total length of the memo                 |
| ???? | 1 bytes | bitmask     | See values                               |
| ???? | 4 bytes | lval_dp     | Data pointer to LVAL page (if needed)    |
| 0x00 | 4 bytes | unknown     |                                          |
+------+---------+-------------+------------------------------------------+
Values for the bitmask:

0x80 = the memo is in a string at the end of this header (memo_len bytes)
0x40 = the memo is in a unique LVAL page in a record type 1
0x00 = the memo is in n LVAL pages in a record type 2

If the memo is in a LVAL page, we use row_id of lval_dp to find the row.
offset_start of memo = (int16*) LVAL_page[offset_num_rows + (row_id * 2) + 2]
if (row_id = 0)
     offset_stop of memo = 2048(jet3) or 4096(jet4)
else
     offset_stop of memo = (int16*) LVAL_page[offset_num_row + (row_id * 2)]

The length (partial if type 2) for the memo is:
memo_page_len = offset_stop - offset_start


LVAL (Long Value) Pages
-----------------------

The header of a LVAL page is just like that of a regular data page,
except that in place of the tdef_pg is the word 'LVAL'.

Each memo record type 1 looks like this:
+------+---------+-------------+------------------------------------------+
| data | length  | name        | description                              |
+------+---------+-------------+------------------------------------------+
| ???? | n bytes | memo_value  | A string which is the memo               |
+-------------------------------------------------------------------------+

Each memo record type 2 looks like this:
+------+---------+-------------+------------------------------------------+
| data | length  | name        | description                              |
+------+---------+-------------+------------------------------------------+
| ???? | 4 bytes | lval_dp     | Next page LVAL type 2 if memo is too long|
| ???? | n bytes | memo_value  | A string which is the memo (partial)     |
+-------------------------------------------------------------------------+

In a LVAL type 2 data page, you have 
   10 or 14 bytes for the header of the data page, 
    2 bytes for an offset,
    4 bytes for the next lval_pg

So there is a block of 2048 - (10+2+4) = 2032(jet3) 
or 4096 - (14+2+4) = 4076(jet4) bytes max in a page.


TDEF (Table Definition) Pages
-----------------------------

Every table in the database has a TDEF page.  It contains a definition of 
the columns, types, sizes, indexes, and similar information.

+-------------------------------------------------------------------------+
| Jet3/Jet4 TDEF Header
+------+---------+-------------+------------------------------------------+
| data | length  | name        | description                              |
+------+---------+-------------+------------------------------------------+
| 0x02 | 1 bytes | page_type   | 0x02 indicate a tabledef page            |
| 0x01 | 1 bytes | unknown     |                                          |
| ???? | 2 bytes | tdef_id     | (jet3) The word 'VC'                     |
|      |         |             | (jet4) Free space in this page minus 8   |
| 0x00 | 4 bytes | next_pg     | Next tdef page pointer (0 if none)       |
+------+---------+-------------+------------------------------------------+

TDEFs can span multiple pages for large tables, this is accomplished using the
next_pg field.

+-------------------------------------------------------------------------+
| Jet3 Table Definition Block (35 bytes)                                  |
+------+---------+-------------+------------------------------------------+
| data | length  | name        | description                              |
+------+---------+-------------+------------------------------------------+
| ???? | 4 bytes | tdef_len    | Length of the data for this page         |
| ???? | 4 bytes | num_rows    | Number of records in this table          |
| 0x00 | 4 bytes | autonumber  | value for the next value of the          |
|      |         |             | autonumber column, if any. 0 otherwise   |
| 0x4e | 1 byte  | table_type  | 0x4e: user table, 0x53: system table     |
| ???? | 2 bytes | max_cols    | Max columns a row will have (deletions)  |
| ???? | 2 bytes | num_var_cols| Number of variable columns in table      |
| ???? | 2 bytes | num_cols    | Number of columns in table (repeat)      |
| ???? | 4 bytes | num_idx     | Number of indexes in table               |
| ???? | 4 bytes | num_real_idx| Number of indexes in table (repeat)      |
| ???? | 4 bytes | used_pages  | Points to a record containing the        |
|      |         |             | usage bitmask for this table.            |
| ???? | 4 bytes | free_pages  | Points to a similar record as above,     |
|      |         |             | listing pages which contain free space.  |
+-------------------------------------------------------------------------+
| Iterate for the number of num_real_idx (8 bytes per idxs)               |
+-------------------------------------------------------------------------+
| 0x00 | 4 bytes | ???         |                                          |
| ???? | 4 bytes | num_idx_rows| (not sure)                               |
+-------------------------------------------------------------------------+
| Iterate for the number of num_cols (18 bytes per column)                |
+-------------------------------------------------------------------------+
| ???? | 1 byte  | col_type    | Column Type (see table below)            |
| ???? | 2 bytes | col_num     | Column Number (includes deleted columns) |
| ???? | 2 bytes | offset_V    | Offset for variable length columns       |
| ???? | 2 bytes | col_num     | Column Number                            |
| ???? | 2 bytes | ???         |                                          |
| ???? | 1 byte  | precision   | precision if numeric column              |
| ???? | 1 byte  | scale       | scale if numeric column                  |
| ???? | 2 bytes | ???         |                                          |
| ???? | 1 byte  | bitmask     | low order bit indicates variable columns |
| ???? | 2 bytes | offset_F    | Offset for fixed length columns          |
| ???? | 2 bytes | col_len     | Length of the column (0 if memo)         |
+-------------------------------------------------------------------------+
| Iterate for the number of num_cols (n bytes per column)                 |
+-------------------------------------------------------------------------+
| ???? | 1 byte  | col_name_len| len of the name of the column            |
| ???? | n bytes | col_name    | Name of the column                       |
+-------------------------------------------------------------------------+
| Iterate for the number of num_real_idx (30+9 = 39 bytes)                |
+-------------------------------------------------------------------------+
|     Iterate 10 times for 10 possible columns (10*3 = 30 bytes)          |
+-------------------------------------------------------------------------+
| ???? | 2 bytes | col_num     | number of a column (0xFFFF= none)        |
| ???? | 1 byte  | col_order   | 0x01 =  ascendency order                 |
+-------------------------------------------------------------------------+
| ???? | 4 bytes | unknown     |                                          |
| ???? | 4 bytes | first_dp    | Data pointer of the index page           |
| ???? | 1 byte  | flags       | See flags table for indexes              |
+-------------------------------------------------------------------------+
| Iterate for the number of num_real_idx (20 bytes)                       |
+-------------------------------------------------------------------------+
| ???? | 4 bytes | index_num   | Number of the index                      |
|      |         |             |(warn: not always in the sequential order)|
| ???? | 4 bytes | index_num2  | Number of the index (repeat)             |
| 0x00 | 1 byte  | ???         |                                          |
| 0xFF | 4 bytes | ???         |                                          |
| 0x00 | 4 bytes | ???         |                                          |
| 0x04 | 2 bytes | ???         |                                          |
| ???? | 1 byte  | primary_key | 0x01 if this index is primary            |
+-------------------------------------------------------------------------+
| Iterate for the number of num_real_idx                                  |
+-------------------------------------------------------------------------+
| ???? | 1 byte  | idx_name_len| len of the name of the index             |
| ???? | n bytes | idx_name    | Name of the index                        |
+-------------------------------------------------------------------------+
| ???? | n bytes | ???         |                                          |
| 0xFF | 2 bytes | ???         | End of the tableDef  ?                   |
+-------------------------------------------------------------------------+

+-------------------------------------------------------------------------+
| Jet4 Table Definition Block (55 bytes)                                  |
+------+---------+-------------+------------------------------------------+
| data | length  | name        | description                              |
+------+---------+-------------+------------------------------------------+
| ???? | 4 bytes | tdef_len    | Length of the data for this page         |
| ???? | 4 bytes | unknown     | unknown                                  |
| ???? | 4 bytes | num_rows    | Number of records in this table          |
| 0x00 | 4 bytes | autonumber  | value for the next value of the          |
|      |         |             | autonumber column, if any. 0 otherwise   |
| ???? |16 bytes | unknown     | unknown                                  |
| 0x4e | 1 byte  | table_type  | 0x4e: user table, 0x53: system table     |
| ???? | 2 bytes | max_cols    | Max columns a row will have (deletions)  |
| ???? | 2 bytes | num_var_cols| Number of variable columns in table      |
| ???? | 2 bytes | num_cols    | Number of columns in table (repeat)      |
| ???? | 4 bytes | num_idx     | Number of indexes in table               |
| ???? | 4 bytes | num_real_idx| Number of indexes in table (repeat)      |
| ???? | 4 bytes | used_pages  | Points to a record containing the        |
|      |         |             | usage bitmask for this table.            |
| ???? | 4 bytes | free_pages  | Points to a similar record as above,     |
|      |         |             | listing pages which contain free space.  |
+-------------------------------------------------------------------------+
| Iterate for the number of num_real_idx (12 bytes per idxs)              |
+-------------------------------------------------------------------------+
| 0x00 | 4 bytes | ???         |                                          |
| ???? | 4 bytes | num_idx_rows| (not sure)                               |
| 0x00 | 4 bytes | ???         |                                          |
+-------------------------------------------------------------------------+
| Iterate for the number of num_cols (25 bytes per column)                |
+-------------------------------------------------------------------------+
| ???? | 1 byte  | col_type    | Column Type (see table below)            |
| ???? | 4 bytes | unknown     | matches first unknown definition block   |
| ???? | 2 bytes | col_num     | Column Number (includes deleted columns) |
| ???? | 2 bytes | offset_V    | Offset for variable length columns       |
| ???? | 2 bytes | col_num     | Column Number                            |
| ???? | 4 bytes | ???         | prec/scale? verify                       |
| ???? | 1 byte  | bitmask     | low order bit indicates variable columns |
| ???? | 1 byte  | ???         | seems to be 1 when variable len          |
| 0000 | 4 bytes | ???         |                                          |
| ???? | 2 bytes | offset_F    | Offset for fixed length columns          |
| ???? | 2 bytes | col_len     | Length of the column (0 if memo)         |
+-------------------------------------------------------------------------+
| Iterate for the number of num_cols (n*2 bytes per column)               |
+-------------------------------------------------------------------------+
| ???? | 2 bytes | col_name_len| len of the name of the column            |
| ???? | n bytes | col_name    | Name of the column (UCS-2 format)        |
+-------------------------------------------------------------------------+
| Iterate for the number of num_real_idx (30+22 = 52 bytes)               |
+-------------------------------------------------------------------------+
| ???? | 4 bytes | ???         |                                          |
+-------------------------------------------------------------------------+
| Iterate 10 times for 10 possible columns (10*3 = 30 bytes)              |
+-------------------------------------------------------------------------+
| ???? | 2 bytes | col_num     | number of a column (0xFFFF= none)        |
| ???? | 1 byte  | col_order   | 0x01 =  ascendency order                 |
+-------------------------------------------------------------------------+
| ???? | 4 bytes | unknown     |                                          |
| ???? | 4 bytes | first_dp    | Data pointer of the index page           |
| ???? | 1 byte  | flags       | See flags table for indexes              |
| ???? | 9 bytes | unknown     |                                          |
+-------------------------------------------------------------------------+
| Iterate for the number of num_real_idx (28 bytes)                       |
+-------------------------------------------------------------------------+
| ???? | 4 bytes | unknown     | matches first unknown definition block   |
| ???? | 4 bytes | index_num   | Number of the index                      |
|      |         |             |(warn: not always in the sequential order)|
| ???? | 4 bytes | index_num2  | Number of the index (repeat)             |
| 0x00 | 1 byte  | ???         |                                          |
| 0xFF | 4 bytes | ???         |                                          |
| 0x00 | 4 bytes | ???         |                                          |
| 0x04 | 2 bytes | ???         |                                          |
| ???? | 1 byte  | primary_key | 0x01 if this index is primary            |
| ???? | 4 bytes | unknown     |                                          |
+-------------------------------------------------------------------------+
| Iterate for the number of num_real_idx                                  |
+-------------------------------------------------------------------------+
| ???? | 2 bytes | idx_name_len| len of the name of the index             |
| ???? | n bytes | idx_name    | Name of the index (UCS-2)                |
+-------------------------------------------------------------------------+
| ???? | n bytes | ???         |                                          |
| 0xFF | 2 bytes | ???         | End of the tableDef  ?                   |
+-------------------------------------------------------------------------+

Index flags (not complete):
0x01 Unique     
0x02 IgnoreNuls
0x08 Required

Column Type may be one of the following (not complete):

    BOOL            = 0x01 /* Boolean         ( 1 bit ) */
    BYTE            = 0x02 /* Byte            ( 8 bits) */
    INT             = 0x03 /* Integer         (16 bits) */
    LONGINT         = 0x04 /* Long Integer    (32 bits) */
    MONEY           = 0x05 /* Currency        (64 bits) */
    FLOAT           = 0x06 /* Single          (32 bits) */
    DOUBLE          = 0x07 /* Double          (64 bits) */
    SDATETIME       = 0x08 /* Short Date/Time (64 bits) */
    BINARY          = 0x09 /* Binary        (255 bytes) */
    TEXT            = 0x0A /* Text          (255 bytes) */
    OLE             = 0x0B /* OLE */
    MEMO            = 0x0C /* Memo, Hyperlink */
    UNKNOWN_0D      = 0x0D
    UNKNOWN_0E      = 0x0E
    REPID           = 0x0F /* GUID */
    NUMERIC         = 0x10 /* Scaled decimal  (17 bytes) */

Notes on deleted and added columns: (sort of Jet4 specific)

If a fixed length column is deleted the offset_F field will contain the offsets 
of the original row definition.  Thus if the number of columns on the row does 
not match the number in the tdef, the offset_F field could be used to return 
the proper data. Columns are never really deleted in the row data.  The deleted
column will forever exist and be set to null for new rows. 

A row may have less than max_cols columns but will never have more, as max_cols
is never decremented.  If you have a table with 6 columns, delete one, and add 
one, then max_cols will be 7.

For variable length columns, offset_V will hold the position in the offset table
of that column.  Missing columns are set to null for new rows.


Page Usage Maps
---------------

There are three uses for the page usage bitmaps.  There is a global page usage 
stored on page 1 which tracks allocated pages throughout the database.  

Tables store two page usage bitmaps.  One is a straight map of which pages are 
owned by the table.  The second is a map of the pages owned by the table which 
have free space on them (used for inserting data).

The table bitmaps appear to be of a fixed size for both Jet 3 and 4 (128 and 64
bytes respectively).  The first byte of the map is a type field.

+--------------------------------------------------------------------------+
| Type 0 Page Usage Map                                                    |
+------+---------+---------------------------------------------------------+
| data | length  | name       | description                                |
+------+---------+---------------------------------------------------------+
| 0x00 | 1 byte  | map_type   | 0x00 indicates map stored within.          |
| ???? | 4 byte  | page_start | first page for which this map applies      |
+------+---------+---------------------------------------------------------+
| Iterate for the length of map                                            |
+--------------------------------------------------------------------------+
| ???? | 1 byte  | bitmap     | each bit encodes the allocation status of a|
|      |         |            | page. 1 indicates allocated to this table. |
|      |         |            | Pages are stored starting with the low     |
|      |         |            | order bit of the first byte.               |
+--------------------------------------------------------------------------+

If you're paying attention then you'll realize that the relatively small size of
the map (128*8*2048 or 64*8*4096 = 2 Meg) means that this scheme won't work with
larger database files although the initial start page helps a bit.  To overcome
this there is a second page usage map scheme with the map_type of 0x01.

+--------------------------------------------------------------------------+
| Type 1 Page Usage Map                                                    |
+------+---------+---------------------------------------------------------+
| data | length  | name       | description                                |
+------+---------+---------------------------------------------------------+
| 0x01 | 1 byte  | map_type   | 0x01 indicates this is a indirection list. |
+------+---------+---------------------------------------------------------+
| Iterate for the length of map                                            |
+--------------------------------------------------------------------------+
| ???? | 4 bytes | map_page   | pointer to page type 0x05 containing map   |
+--------------------------------------------------------------------------+

Note that the initial start page is gone and is reused for the first page 
indirection.  The 0x05 type page header looks like:

+--------------------------------------------------------------------------+
| Usage Map Page (type 0x05)                                               |
+------+---------+---------------------------------------------------------+
| data | length  | name       | description                                |
+------+---------+---------------------------------------------------------+
| 0x05 | 1 byte  | page_type  | allocation map page                        |
| 0x01 | 1 byte  | unknown    | always 1 as with other page types          |
| 0x00 | 2 bytes | unknown    |                                            |
+------+---------+---------------------------------------------------------+

The rest of the page is the allocation bitmap following the same scheme (lsb
to msb order, 1 bit per page) as a type 0 map.  This yields a maximum of
2044*8=16352 (jet3) or 4092*8 = 32736 (jet4) pages mapped per type 0x05 page.
Given 128/4+1 = 33 or 64/4+1 = 17 page pointers per indirection row (remember
the start page field is reused, thus the +1), this yields 33*16352*2048 = 1053
Meg (jet3) or 17*32736*4096 = 2173 Meg (jet4) or enough to cover the maximum
size of each of the database formats comfortably, so there is no reason to
believe any other page map schemes exist.


Indices
-------

Indices are not completely understood but here is what we know.

+-------------------------------------------------------------------------+
| Index Page (type 0x03)                                                  |
+------+---------+-------------+------------------------------------------+
| data | length  | name        | description                              |
+------+---------+-------------+------------------------------------------+
| 0x01 | 1 bytes | page_type   | 0x03 indicate an index page              |
| 0x01 | 1 bytes | unknown     |                                          |
| ???? | 2 bytes | free_space  | The free space at the end this page      |
| ???? | 4 bytes | parent_page | The page number of the TDEF for this idx |
| ???? | 4 bytes | prev_page   | Previous page at this index level        |
| ???? | 4 bytes | next_page   | Next page at this index level            |
| ???? | 4 bytes | leaf_page   | Pointer to leaf page, purpose unknown    |
+-------------------------------------------------------------------------+

Index pages come in two flavors.

0x04 pages are leaf pages which contain one entry for each row in the table.  
Each entry is composed of a flag, the indexed column values and a page/row 
pointer to the data.

0x03 index pages make up the rest of the index tree and contain a flag, the 
indexed columns, the page/row that contains this entry, and the leaf page or 
intermediate (another 0x03 page) page pointer for which this is the first 
entry on.

Both index types have a bitmask starting at 0x16(jet3) or 0x1b(jet4) which
identifies the starting location of each index entry on this page.  The first
entry begins at offset 0xf8(jet3) or 0x1e0(jet4), and is not explicitly
indicated in the bitmask.  Note that the count in each byte begins with the
low order bit.  For example take the data:

00 20 00 04 80 00 ...

Convert the bytes to binary starting with the low order bit in each byte.  v's
mark where each entry begins:

v                v                 v                v
0000 0000  0000 0100  0000 0000  0010 0000  0000 0001  0000 0000
-- 00 ---  -- 20 ---  -- 00 ---  -- 04 ---  -- 80 ---  -- 00 ---

As noted earlier, the first entry is implicit.  The second entry begins at an
offset of 13 (0xd) bytes from the first.  The third entry 26 (0x1a) bytes from
the first.  The final entry starts at an offset of 39 (0x27) bytes from the
first.  In this example the rest of the mask (up to offset 0xf8/0x1e0) would be
zero-filled and thus this last entry isn't an actual entry, but the stopping
point of the data.

For Jet3, (0xf8 - 0x16) * 8 = 0x710 and 0x800 - 0xf8 = 0x708.
For Jet4, (0x1e0 - 0x1b) * 8 = 0xe28 and 0x1000 - 0x1e0 = 0xe20.
So the mask just covers the page, including space to indicate if the last entry
goes to the end of the page.  One wonders why MS didn't use a row offset table
like they did on data pages.  It seems like it would have been easier and more
flexible.

So now we come to the index entries for type 0x03 pages which look like this:

+-------------------------------------------------------------------------+
| Index Record                                                            |
+------+---------+-------------+------------------------------------------+
| data | length  | name        | description                              |
+------+---------+-------------+------------------------------------------+
| 0x7f | 1 byte  | flags       | 0x80 LSB, 0x7f MSB, 0x00 null?           |
| ???? | variable| indexed cols| indexed column data                      |
| ???? | 3 bytes | data page   | page containing row referred to by this  |
|      |         |             | index entry                              |
| ???? | 1 byte  | data row    | row number on that page of this entry    |
| ???? | 4 bytes | child page  | next level index page containing this    |
|      |         |             | entry as first entry.  Could be a leaf   |
|      |         |             | node.                                    |
+-------------------------------------------------------------------------+

The flag field is generally either 0x00, 0x7f, 0x80.  0x80 is the one's 
complement of 0x7f and all text data in the index would then need to be negated.
The reason for this negation is unknown, although I suspect it has to do with 
descending order.  The 0x00 flag indicates that the key column is null, and no 
data will follow, only the page pointer.  In multicolumn indexes the flag field 
plus data is repeated for the number of columns participating in the key.

Note, there is a compression scheme utilized on leaf pages.  Normally an index
entry with an integer primary key would be 9 bytes (1 for the flags field, 4 for
the integer, 4 for page/row).  The entry can be shorter than 9, containing only
5 bytes, where the first byte is the last octet of the encoded primary key field
(integer) and the last four are the page/row pointer.  Thus if the first key
value on the page is 1 and it points to page 261 (00 01 05) row 3, it becomes:

7f 00 00 00 01 00 01 05 03

and the next index entry can be:

02 00 01 05 04

That is, the key value is 2 (the last octet changes to 02) page 261 row 4.

Access stores an 'alphabetic sort order' version of the text key columns in the
index.  Here is the encoding as we know it:

0-9: 0x56-0x5f
A-Z: 0x60-0x79
a-z: 0x60-0x79

Once converted into this (non-ascii) character set, the text value can be
sorted in 'alphabetic' order.  A text column will end with a NULL (0x00 or 0xff
if negated).  

The leaf page entries store the key column and the 3 byte page and 1 byte row
number.

The value of the index root page in the index definition may be an index page
(type 0x03), an index leaf page (type 0x04) if there is only one index page, 
or (in the case of tables small enough to fit on one page) a data page 
(type 0x01).

So to search the index, you need to convert your value into the alphabetic 
character set, compare against each index entry, and on successful comparison
follow the page and row number to the data.  Because text data is managled 
during this conversion there is no 'covered querys' possible on text columns.

To conserve on frequent index updates, Jet also does something special when 
creating new leaf pages at the end of a primary key (maybe others as well) index.  
The next leaf page pointer of the last leaf node points to the new leaf page but 
the index tree is not otherwise updated.  In src/libmdb/index.c, the last leaf 
read is stored, once the index search has been exhausted by the normal search 
routine, it enters a "clean up mode" and reads the next leaf page pointer until 
it's null.
 
KKD Records
-----------

Design View table definitions appear to be stored in 'KKD' records (my name for 
them...they always start with 'KKD\0'). Again these reside on pages, packed to 
the end of the page. 

Update: The KKD records are stored in LvProp column of MSysObjects so they are 
stored as other OLE/Memo fields are.

They look a little like this: (this needs work...see the kkd.c)

'K' 'K' 'D' 0x00
16 bit length value    (this includes the length)
0x00 0x00
0x80 0x00              (0x80 seems to indicate a header)
Then one or more of: 16 bit length field and a value of that size.
For instance: 
0x0d 0x00 and 'AccessVersion' (AccessVersion is 13 bytes, 0x0d 0x00 intel order)

Next comes one of more rows of data. (column names, descriptions, etc...)
16 bit length value    (this includes the length)
0x00 0x00
0x00 0x00
   16bit length field (this include the length itself)
   4 bytes of unknown purpose
      16 bit length field (non-inclusive)
      value (07.53 for the AccessVersion example above)

See kkd.c for an example, although it needs cleanup.


Text Data Type
--------------

In Jet3, the encoding of text depends on the machine on which it was created.
So for databases created on U.S. English systems, it can be expected that text
is encoded in CP1252.  This is the default used by mdbtools.  If you know that
another encoding has been used, you can over-ride the default by setting the
environment variable MDB_JET3_CHARSET.  To find out what encodings will work on
your system, run 'iconv -l'.

In Jet4, the encoding can be either little-endian UCS-2, or a special
compressed form of it.  This compressed format begins with 0xff 0xfe.
The string then starts in compressed mode, where characters with 0x00 for the
most-significant byte do not encode it.  In the compressed format, a 0x00 byte
signals a change from compressed mode to uncompressed mode, or from
uncompressed mode back to compressed mode.  The string may end in either mode.
Note that a string containing any character 0x##00 (UCS-2) will not be
compressed.  Also, the string will only be compressed if it really does make
the string shorter as compared to uncompressed UCS-2.

Programs that use mdbtools libraries will receive strings encoded in UTF-8 by
default.  This default can by over-ridden by setting the environment variable
MDBICONV to the desired encoding.