Sophie: maradns-1.4.13-2.mga4 i586

maradns-1.4.13-2.mga4.i586.rpm

<HEAD>
<TITLE>MaraDNS Data Structures</TITLE>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
</HEAD>
<body>
<H1>Summary</H1>
This document describes some of the data structures that the MaraDNS 
program contains; in particular, the data structures which contain the DNS 
resource records.
<H1>DNS records</H1>

The cached and authoritative DNS records are separate data structures in 
MaraDNS.  This is to minimize the interference between the cached data 
and the authoritative data; it should be impossible for any changes to 
data in the recursive DNS cache to change authoritative DNS data.

<p>
Despite these, the two data structures are very similar in form.  In fact, 
the <tt>udpsuccess()</tt> function is used by both the authoritative and 
recursive code to send out a UDP packet with the answer to the DNS 
question they have answered.

<p>
There are, however, some differences between the two data structures for 
the following reasons:

<ul>

<li>The authoritative data structure has no need to store "this record 
does not exist" records; the authoritative data structure is a, well, 
authoritative copy of the data in question; if the data is not there, the 
record does not exist.  

<p>
The caching data structure, however, needs to store "this record does not 
exist" records; the cache is, by definition, an incomplete image of the 
DNS space which we only set up so that multiple queries for the same name 
are faster after the first query.

<li>The authoritative data structure stores DNS referrals ("I do not have 
this information, but the information can be found here") in a different 
format than the recursive data structure.  Again, this is because the 
authoritative data is guaranteed to be complete, and the cache is 
essentially guaranteed only to be a snapshot of a few records in the DNS 
space.

<p>
If recursive DNS referrals were stored in the same format as authoritative 
DNS referrals, MaraDNS would be vulnerable to a security attack known as 
<i>DNS cache poisoning</i>.  

<li>The authoritative dns structure can "cross-link" between different DNS 
records.  For example, if there is a CNAME in the authoritative DNS space, 
and an IP which corresponds to that CNAME elsewhere in that space, MaraDNS 
can have a pointer in the CNAME record which points directly to the memory 
address for the IP corresponding to the CNAME.  This is because 
authoritative data is static in MaraDNS; the data never changes until 
MaraDNS is restarted.

<p> 
With recursive data, on the other hand, the data is very dynamic.  
Memory pointers are in a constant state of flux.  As a result, it is
impossible to have one resource record directly point to the memory
address of another resource record; direct memory pointers, while
extremely fast, can do very nasty things if the expected data is not
there.  As a result, each record is self-contained; MaraDNS literally 
copies the IP for a CNAME in to the CNAME resource record; if the IP 
resource record is removed later on, but the CNAME lingers, the IP in the 
CNAME will still point to valid data.  

</ul>

Now that I have summarized the differences between the two data 
structures, I will describe, in great technical detail, the DNS data 
structure.

<H1>The DNS hash</H1>

The data structure that MaraDNS uses is a <i>hash</i>; the program looks 
for an element in the array by string reference.  The process is, roughly 
speaking, as follows:

<ul>

<li>The string, or <i>hash index</i> is converted in to a small number.  
This is the <i>hashing</i> process.  This process allows the array to be
relatively small, while allowing fast searches for the data element that
corresponds to, say "Awww.maradns.org.".

<li>If more than one string exists which has the same <i>hash value</i> 
(turns in to the same number), the array index which the hash value points 
to will be a list of elements (a linked list, as it turns out), where each 
element points to the actual string we are trying to find a reference for, 
and the data which corresponds to that string.

<li>Once we find the <i>hash element</i> for the string we are looking for 
data for, we return a pointer to the memory address which has the data we 
are looking for.

</ul>

A hash is very efficient for quickly accessing a data element when we know 
that exact name for what we are looking for.  

<p>

The strings that MaraDNS' hash structure uses are not ordinary strings; 
instead they are special <i>JsStr</i> objects.  JsStr strings are strings 
which are generated by a special library, which I wrote myself (this is, 
in fact, the oldest code in the MaraDNS code base; I wrote most of the 
JsStr code back in the summer of 2000), and has the following features:

<ul>
<li>The strings are resistant to buffer overflows.
<li>It is possible to put ASCII nulls or any other characters in the 
strings.
<li>It is possible to have different encodings for different strings, such 
as binary, ASCII, or ISO 8859-1.
</ul>

The main feature we take advantage of is the fact that we can put ASCII 
nulls in the strings; raw "over the wire" DNS queries use binary data 
which always contains ASCII nulls.  

<p>

The strings which MaraDNS uses as hash indexes are raw binary strings, 
based on the raw format of "Domains names" specified in RFC 1035 
<hibit alt="section ">§</hibit>3.1.  A 
domain name is a series of "labels", where each label is a one-byte (or 
"octet") length field followed by a string of characters <i>length</i> 
long.  The domain name is terminated by a label with a length field of 
zero.

<p>

For example, what is in human readable form as "www.maradns.org" is 
in the following form in DNS "over the wire" packets:

<p>

<i> [binary 3] </i> www <i> [binary 7] </i> maradns <i> [binary 
3] </i> org <i> [binary 0] </i>

<p>

Where [binary <i>number</i>] is a single byte with a value of
<i>number</i>.  If one were to look at a DNS packet requesting
"www.maradns.org" with a hex dump utility, one would see something like
this:

<pre>
03 77 77 77 07 6D 61 72 61 64 6E 73 03 6F 72 67 00
 ?  w  w  w  ?  m  a  r  a  d  n  s  ?  o  r  g  ?
</pre>

In addition to have the <i>domain label</i> for www.maradns.org, the 
hash index string also has a two-byte indicator what what kind of resource 
records one wishes.

<h1>What is a resource record?</h1>

A <i>resource record</i> is the kind of information one wishes to have for 
a given name, such as www.maradns.org.  Usually, one simply wishes to know 
the IP for www.maradns.org so that one's web browser can contact the web 
page.  This type of request is handled by the A record.

<p>

However, it is also possible that one wishes to send email to maradns.org; 
perhaps one wishes to post to the maradns mailing list.  In this case, one 
would request the <i>mail exchanger</i> for maradns.org; the name server 
then replies with a list of machines which accept mail for maradns.org.  
This type of request is handled by the MX record.

<p>

In many cases, notably aol.com, the machines which accept incoming mail
are different machines than the machines which a person would connect to
with a web browser.  As a result, programs which send mail send out DNS
requests for the mail exchanger (or MX record) for a given domain name
label, and most other programs (including web browsers) simply as for the
IP (or A record) for the machine in question.

<p>

DNS was designed to be extensible.  As a result, DNS can potentially have
over 65000 different record types; less than 100 different record types
for DNS have been proposed in the 22 years DNS has been around.  
Commonly, only six different record types are used: A, MX, SOA, NS, PTR,
and CNAME.  A and MX are described above.  SOA and NS
records are used for, respectively, "this record does not exist" and
"Please go here to find the answer to your question" replies.  CNAME
records, while in common use, are never used by responsible DNS
administrators.  Should ipv6 ever become a reality, AAAA records will be
used by responsible DNS administrators; irresponsible DNS administrators
may, unfortunately, be foolish enough to use A6 or DNAME records.

<p>

Each resource record type has a numeric value associated with it.  "A" 
records have a numeric value of 1; "MX" records have a numeric value of 
15.

<H1>The hash index</H1>

In a raw DNS query, a resource record type is a big endian two byte value
which is appended to the domain label for the query.  MaraDNS' hash uses,
as a hash index, a raw binary domain label followed by a 2-byte resource
record type.  As a result, an A query for www.maradns.org is indexed
thusly (using our hypothetical hex dump program again):

<pre>
03 77 77 77 07 6D 61 72 61 64 6E 73 03 6F 72 67 00 00 01
 ?  w  w  w  ?  m  a  r  a  d  n  s  ?  o  r  g  ?  ?  ?
</pre>

A MX request for maradns.org, which a mailing program would send in order
to find out which mail server to send email with a suffix of @maradns.org 
to, is indexed thusly:

<pre>
07 6D 61 72 61 64 6E 73 03 6F 72 67 00 00 0F
 ?  m  a  r  a  d  n  s  ?  o  r  g  ?  ?  ?
</pre>

<H1>The hash data</H1>

MaraDNS' hash structure, in addition to having a pointer to the memory 
address with the hash data in question, has an integer which indicates 
the kind of data this hash elements has.  The authoritative hash (yes, 
the authoritative and cache data use separate hashes) only uses one 
datatype for its data: MARA_DNSRR, or a DNS resource record which uses an 
<tt>rr</tt> data structure.

<p>

The rr data structure is as follows (from MaraDns.h):

<pre>
/* Structure which contains the data for a DNS record pointed to from 
   the hash */
typedef struct rr {
    qual_timestamp expire; /* When this RR expires (0=authoritative/never) */
    uint32 ttl; /* The TTL for the record in question */
    uint32 authoritative; /* Are we authoritative for this zone */
    struct rr *next; /* Either the first NS for this zone _or_ the next copy
                        of the same record for the same data type */
    struct rr *ip; /* If this is a PTR, MX, or NS record, pointer to the
                      ip the rr domain name points to (if applicable) */
    js_string *ptr; /* If this is a PTR request pointing to a CNAME,
                       we want to give them the PTR record along
                       with the CNAME record. */
    uint16 rr_type; /* Type of resource record this entry is */
    js_string *query; /* Pointer to the query one asks to get this answer */
    js_string *data; /* The actual raw binary data for this RR */
    fila *zap; /* Pointer to a structure used for deleting elements from
                  the cache when the cache starts to fill up */
    char seen; /* This is used by udpsuccess() to insure that a given
                  element in the hash is only visited once */
    perm_t perms; /* What IPs can view this element; making this zero
                     means that all IPs can view the element in question */
    struct rr_list *list; /* rr_list of all records this CNAME refers to */
    char rcode; /* This is used to store the rcode for a given DNS answer */
    } rr;
</pre>

As can be seen, this can point to other records in two places: From the 
"next" element of the structure, and from the "ip" element of the 
structure.  

<p>

This structure stores a single resource records; a single DNS name can 
have multiple resource records attached to it.  For example, a single 
computer may not be able to handle all of the traffic our web site is 
receiving; so we have eight different computers, with different IPs, which 
distribute the traffic, hopefully evenly, between the eight computers.

<p>

DNS can distribute the load among the eight computers; an A query for, 
say, www.example.com, will return a list of eight IPs.  The client chooses 
one of the IPs at random, and each computer does not need to process as 
many queries.

<p>

The "next" part of this data structure is a pointer to the next DNS 
resource record in cases where a single name has multiple resource 
records.

<p>

In addition, many DNS records, for better or for worse, point to DNS names 
instead of, say, IPs (while this makes DNS more flexible, it also make DNS 
harder to implement).  Since a person really wants an IP instead of a 
name, DNS servers usually supply the IP which corresponds to the name in 
question.  This is why the rr structure as an "ip" pointer; so that 
MaraDNS, when run in authoritative mode, can give out the IP (if it has 
it) for any names it makes available in NS or MX records. 

<p>

DNS can provide various types of data in the answer to a DNS request,
depending on the request one sends to a DNS server.  All of these replies
share one thing in common: The data is a RR header, as described in RFC
1035 <hibit alt="section ">§</hibit>3.2.1, 
followed by a binary stream of data (called RDDATA in
RFC1035).  That "data" js_string object is a binary string containing the
raw DNS RDDATA for the data in question; MaraDNS creates the RR header
from the hash index (which becomes "name" and "type" in the header); this
index is pointed to in the "query" element of the RR structure, the CLASS
is always 1 (internet), the TTL field of the RR data structure is the TTL
in the RR header, and the RDLENGTH in the RR header is obtained from the
length value for the js_string object.

<p>

In addition, authoritative DNS replies include a listing of the name
servers which are the name servers for the query in question.  In other
words, a DNS question for "where is www.example.com." will get a reply
which would translate in to English as "www.example.com is at 10.1.2.3;  
and the name servers which give the right answers for this question are
named ns1.example.com, which has an IP of 10.1.2.4, and ns2.example.com,
which has an IP of 10.1.2.5".

<p>

The way MaraDNS handles this is somewhat kludgy.  After the last answer
for a question, the "next" element points to the memory address for the
first name server record for the record list in question.  When MaraDNS is
formulating a reply, it knows that, as soon as the record type changes
from a non-NS record type to a NS record type, that we have gone from
giving out direct answers to their question to giving out the list of name
servers.

<H1>The structure of cached data</H1>

Due to the complexities added by caching data, the structure for cache 
data is somewhat more complicated; there are several cases which we need 
to deal with:

<ul>

<li>The case of a RR being the direct answer to our query.

<li>The case of a RR being a CNAME answer; in this case, our query
contains both the existence of a CNAME; and the A record which the CNAME
points to (if it exists).

<li>The case where we are caching a "this record does not" exist 
entry. 

<li>The case were we are caching a "the name servers which handle 
*.example.com. are located at 10.1.2.3 and 10.1.2.4"

<li>The case where we are caching a "the name servers which handle
*.example.com. have the names ns1.example.af.mil. and ns2.example.af.mil."

</ul>

In all cases, unlike the authoritative data, the "ip" and "next" pointers 
can not point to other RRs; each RR needs to be a self-contained unit.  
(The reasons for this are explained previously in this document).

<p>

In the first case (a straight RR), the data is the same as the data for an
authoritative answer.  MaraDNS creates an RR structure in the same format
as an RR structure for an authoritative record.  The only difference is
that cache records do not include a section where the name servers for the
record in question are listed; and that the data never points to IPs (e.g.  
A cache MX record does not have the IPs which correspond to the MX record;
this is a non-issue since most mailer programs are smart enough to do
separate queries, one to get the list of MX records, and other queries to
get the A records which the MX records point to).

<p>

This data is indexed in the hash as being an answer for the query asked.

<p>

In the second case (the RR being a CNAME answer), the data is the same
format for the first, case, with one addition.  The one addition is this:  
the IP data structure element points to a RR structure (especially made
for this data) which contains the IP for this CNAME record.  The reason we 
need to do this is because stub resolvers will not perform a second query 
to find out what the A record which corresponds to a given CNAME record 
is; if the data is not in the DNS packet which contains the CNAME, the 
host name in question will not resolve.

<p>

This data is indexed in the hash with the same domain name as the 
question, but where the record type has been changed to CNAME.

<p>

In the third case, where we cache a "this record does not exist" answer, 
the data is almost identical to a positive RR.  The only difference is 
that the data is a SOA record instead of a reply for the question 
that they asked; and that the "datatype" for the hash element is a 
"MARA_DNS_NEG" datatype instead of a standard "MARA_DNSRR" datatype.

<p>

The hash index for "this does not exit" replies is the same as the query 
being asked; the only thing different is the hash datatype.

<p>

<h1>How referrals are stored</h1>

A referral is a record in the form "all *.example.com records can be found 
at the name servers with the IPs 10.1.2.3 and 10.1.2.4" or "all 
*.example.net records can be found at the name servers with the names 
ns1.example.org and ns2.example.org".

<p>

In the case of caching NS referrals, we use a different data structure 
for storing the data, as follows:

<pre>
typedef struct closer {
    unsigned int num_elements; /* Number of elements in the linked list */
    time_t ttd; /* Time for this record to die */
    int datatype; /* This is the RR type of the data (RR_A [in which
                     case, the data is a pointer to a uint32] or RR_NS 
                     [which makes the data point to a js_string object]) 
                   */
    void *data;
    fila *zap; /* So that we can determine in which order to zap RRs */
    struct closer *masked; /* If another record with a ttd after the
                              ttd for this record exists, we want to
                              access that data when this record expires */
    struct closer *next; } closer;
</pre>

The only difference is whether the server which gave us the referral gave
us glue records.  

<p>

If we have glue records, we convert the glue records in
to the IPs which are glued to them, and store the data in the hash so that
the "closer" structure has a datatype of RR_A, and data points to a
unsigned 32-bit integer (an IP).

<p> 

If we were not given glue record, then the closer struct has a datatype of 
RR_NS, and data points to a js_string which is a domain label with a 
2-byte "type" suffix; the suffix always has a value of 1.

<p>

In both cases, the data is indexed in the hash with a domain label
followed by a bogus "type" of 252; this is the data type used for an AXFR
request, which means that this kind of data will never be cached as an
ordinary RR.  The data type for the hash element is a MARA_DNS_NS element.

<p>

<h1>Removing elements from the cache</h1>

<p>

The cache, to be effective, only needs to be a small subset of all of the 
DNS information available on the internet.  In order to minimize the 
memory resources that MaraDNS needs, after the cache expands to a given 
size, MaraDNS will remove the elements from the cache.

<p>

Elements need to be removed from the cache in such a way that:

<ul>
<li>Elements removed from the cache have not been recently accessed
<li>There is not needless overhead for determining what elements to remove
    from the cache.
<li>We do not attempt to remove already-removed elements from the cache
</ul>

The way MaraDNS accomplishes this is by having a special <i>fila</i> 
data structure.  The <i>fila</i> code was developed when I was down in
México; <i>fila</i> is the Spanish word for what American English 
speakers call a "line" and what the rest of the English-speaking world 
more accurately calls a "queue".  

<p>

The fila data structure is a circular linked list, is works as follows:

<ul>

<li> Every time we create a new resource record in the cache, it is 
  placed at the top of the list.
<li> Every time we access a resource record (or a nameserver referral),
  we move that record to the top of the list.
<li> When we remove records once the cache starts to fill up, we remove 
  the records at the bottom of the list.

</ul>

The reason why the resource records in the cache need a pointer to the
corresponding fila element is because, when a record's TTL expires, we
need to remove the element from the cache, and the corresponding fila
element.

<H1>Some other information about the rr data structure</H1>

<H2>The seen element</H2>

The <i>seen</i> element in the rr data structure is used by the 
<tt>udpsuccess()</tt> routine.  <tt>udpsuccess()</tt> is the routine 
which takes a resource record, and sends it over UDP to a client which
wants a given resource record.

<p>

Since a resource record has several parts, each of which is its own
data structure, it is important to guarantee that a given resource
record is only displayed once; the seen element is given a value of
one when it is shown.

<p>

udpsuccess() also clears out all of the values of seen back to zero 
after showing all of the elements in question.

<H2>An ip of 255.255.255.255 or 0.0.0.0</H2>

Some of MaraDNS' routines allow a query to be performed, and to return an
IP based on the query performed.  In some cases, we could not get an IP;
perhaps there is no such resource record (and we received a "no such 
record" response when asking for the IP); or perhaps there was an error
condition when trying to process the IP.  The bogus IP of 0.0.0.0 
indicates an error condition; the bogus IP of 255.255.255.255 indicates
that a RR for the IP, in fact, does not exist.

</body>