<HEAD> <TITLE>MaraDNS Data Structures</TITLE> <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8"> </HEAD> <body> <H1>Summary</H1> This document describes some of the data structures that the MaraDNS program contains; in particular, the data structures which contain the DNS resource records. <H1>DNS records</H1> The cached and authoritative DNS records are separate data structures in MaraDNS. This is to minimize the interference between the cached data and the authoritative data; it should be impossible for any changes to data in the recursive DNS cache to change authoritative DNS data. <p> Despite these, the two data structures are very similar in form. In fact, the <tt>udpsuccess()</tt> function is used by both the authoritative and recursive code to send out a UDP packet with the answer to the DNS question they have answered. <p> There are, however, some differences between the two data structures for the following reasons: <ul> <li>The authoritative data structure has no need to store "this record does not exist" records; the authoritative data structure is a, well, authoritative copy of the data in question; if the data is not there, the record does not exist. <p> The caching data structure, however, needs to store "this record does not exist" records; the cache is, by definition, an incomplete image of the DNS space which we only set up so that multiple queries for the same name are faster after the first query. <li>The authoritative data structure stores DNS referrals ("I do not have this information, but the information can be found here") in a different format than the recursive data structure. Again, this is because the authoritative data is guaranteed to be complete, and the cache is essentially guaranteed only to be a snapshot of a few records in the DNS space. <p> If recursive DNS referrals were stored in the same format as authoritative DNS referrals, MaraDNS would be vulnerable to a security attack known as <i>DNS cache poisoning</i>. <li>The authoritative dns structure can "cross-link" between different DNS records. For example, if there is a CNAME in the authoritative DNS space, and an IP which corresponds to that CNAME elsewhere in that space, MaraDNS can have a pointer in the CNAME record which points directly to the memory address for the IP corresponding to the CNAME. This is because authoritative data is static in MaraDNS; the data never changes until MaraDNS is restarted. <p> With recursive data, on the other hand, the data is very dynamic. Memory pointers are in a constant state of flux. As a result, it is impossible to have one resource record directly point to the memory address of another resource record; direct memory pointers, while extremely fast, can do very nasty things if the expected data is not there. As a result, each record is self-contained; MaraDNS literally copies the IP for a CNAME in to the CNAME resource record; if the IP resource record is removed later on, but the CNAME lingers, the IP in the CNAME will still point to valid data. </ul> Now that I have summarized the differences between the two data structures, I will describe, in great technical detail, the DNS data structure. <H1>The DNS hash</H1> The data structure that MaraDNS uses is a <i>hash</i>; the program looks for an element in the array by string reference. The process is, roughly speaking, as follows: <ul> <li>The string, or <i>hash index</i> is converted in to a small number. This is the <i>hashing</i> process. This process allows the array to be relatively small, while allowing fast searches for the data element that corresponds to, say "Awww.maradns.org.". <li>If more than one string exists which has the same <i>hash value</i> (turns in to the same number), the array index which the hash value points to will be a list of elements (a linked list, as it turns out), where each element points to the actual string we are trying to find a reference for, and the data which corresponds to that string. <li>Once we find the <i>hash element</i> for the string we are looking for data for, we return a pointer to the memory address which has the data we are looking for. </ul> A hash is very efficient for quickly accessing a data element when we know that exact name for what we are looking for. <p> The strings that MaraDNS' hash structure uses are not ordinary strings; instead they are special <i>JsStr</i> objects. JsStr strings are strings which are generated by a special library, which I wrote myself (this is, in fact, the oldest code in the MaraDNS code base; I wrote most of the JsStr code back in the summer of 2000), and has the following features: <ul> <li>The strings are resistant to buffer overflows. <li>It is possible to put ASCII nulls or any other characters in the strings. <li>It is possible to have different encodings for different strings, such as binary, ASCII, or ISO 8859-1. </ul> The main feature we take advantage of is the fact that we can put ASCII nulls in the strings; raw "over the wire" DNS queries use binary data which always contains ASCII nulls. <p> The strings which MaraDNS uses as hash indexes are raw binary strings, based on the raw format of "Domains names" specified in RFC 1035 <hibit alt="section ">§</hibit>3.1. A domain name is a series of "labels", where each label is a one-byte (or "octet") length field followed by a string of characters <i>length</i> long. The domain name is terminated by a label with a length field of zero. <p> For example, what is in human readable form as "www.maradns.org" is in the following form in DNS "over the wire" packets: <p> <i> [binary 3] </i> www <i> [binary 7] </i> maradns <i> [binary 3] </i> org <i> [binary 0] </i> <p> Where [binary <i>number</i>] is a single byte with a value of <i>number</i>. If one were to look at a DNS packet requesting "www.maradns.org" with a hex dump utility, one would see something like this: <pre> 03 77 77 77 07 6D 61 72 61 64 6E 73 03 6F 72 67 00 ? w w w ? m a r a d n s ? o r g ? </pre> In addition to have the <i>domain label</i> for www.maradns.org, the hash index string also has a two-byte indicator what what kind of resource records one wishes. <h1>What is a resource record?</h1> A <i>resource record</i> is the kind of information one wishes to have for a given name, such as www.maradns.org. Usually, one simply wishes to know the IP for www.maradns.org so that one's web browser can contact the web page. This type of request is handled by the A record. <p> However, it is also possible that one wishes to send email to maradns.org; perhaps one wishes to post to the maradns mailing list. In this case, one would request the <i>mail exchanger</i> for maradns.org; the name server then replies with a list of machines which accept mail for maradns.org. This type of request is handled by the MX record. <p> In many cases, notably aol.com, the machines which accept incoming mail are different machines than the machines which a person would connect to with a web browser. As a result, programs which send mail send out DNS requests for the mail exchanger (or MX record) for a given domain name label, and most other programs (including web browsers) simply as for the IP (or A record) for the machine in question. <p> DNS was designed to be extensible. As a result, DNS can potentially have over 65000 different record types; less than 100 different record types for DNS have been proposed in the 22 years DNS has been around. Commonly, only six different record types are used: A, MX, SOA, NS, PTR, and CNAME. A and MX are described above. SOA and NS records are used for, respectively, "this record does not exist" and "Please go here to find the answer to your question" replies. CNAME records, while in common use, are never used by responsible DNS administrators. Should ipv6 ever become a reality, AAAA records will be used by responsible DNS administrators; irresponsible DNS administrators may, unfortunately, be foolish enough to use A6 or DNAME records. <p> Each resource record type has a numeric value associated with it. "A" records have a numeric value of 1; "MX" records have a numeric value of 15. <H1>The hash index</H1> In a raw DNS query, a resource record type is a big endian two byte value which is appended to the domain label for the query. MaraDNS' hash uses, as a hash index, a raw binary domain label followed by a 2-byte resource record type. As a result, an A query for www.maradns.org is indexed thusly (using our hypothetical hex dump program again): <pre> 03 77 77 77 07 6D 61 72 61 64 6E 73 03 6F 72 67 00 00 01 ? w w w ? m a r a d n s ? o r g ? ? ? </pre> A MX request for maradns.org, which a mailing program would send in order to find out which mail server to send email with a suffix of @maradns.org to, is indexed thusly: <pre> 07 6D 61 72 61 64 6E 73 03 6F 72 67 00 00 0F ? m a r a d n s ? o r g ? ? ? </pre> <H1>The hash data</H1> MaraDNS' hash structure, in addition to having a pointer to the memory address with the hash data in question, has an integer which indicates the kind of data this hash elements has. The authoritative hash (yes, the authoritative and cache data use separate hashes) only uses one datatype for its data: MARA_DNSRR, or a DNS resource record which uses an <tt>rr</tt> data structure. <p> The rr data structure is as follows (from MaraDns.h): <pre> /* Structure which contains the data for a DNS record pointed to from the hash */ typedef struct rr { qual_timestamp expire; /* When this RR expires (0=authoritative/never) */ uint32 ttl; /* The TTL for the record in question */ uint32 authoritative; /* Are we authoritative for this zone */ struct rr *next; /* Either the first NS for this zone _or_ the next copy of the same record for the same data type */ struct rr *ip; /* If this is a PTR, MX, or NS record, pointer to the ip the rr domain name points to (if applicable) */ js_string *ptr; /* If this is a PTR request pointing to a CNAME, we want to give them the PTR record along with the CNAME record. */ uint16 rr_type; /* Type of resource record this entry is */ js_string *query; /* Pointer to the query one asks to get this answer */ js_string *data; /* The actual raw binary data for this RR */ fila *zap; /* Pointer to a structure used for deleting elements from the cache when the cache starts to fill up */ char seen; /* This is used by udpsuccess() to insure that a given element in the hash is only visited once */ perm_t perms; /* What IPs can view this element; making this zero means that all IPs can view the element in question */ struct rr_list *list; /* rr_list of all records this CNAME refers to */ char rcode; /* This is used to store the rcode for a given DNS answer */ } rr; </pre> As can be seen, this can point to other records in two places: From the "next" element of the structure, and from the "ip" element of the structure. <p> This structure stores a single resource records; a single DNS name can have multiple resource records attached to it. For example, a single computer may not be able to handle all of the traffic our web site is receiving; so we have eight different computers, with different IPs, which distribute the traffic, hopefully evenly, between the eight computers. <p> DNS can distribute the load among the eight computers; an A query for, say, www.example.com, will return a list of eight IPs. The client chooses one of the IPs at random, and each computer does not need to process as many queries. <p> The "next" part of this data structure is a pointer to the next DNS resource record in cases where a single name has multiple resource records. <p> In addition, many DNS records, for better or for worse, point to DNS names instead of, say, IPs (while this makes DNS more flexible, it also make DNS harder to implement). Since a person really wants an IP instead of a name, DNS servers usually supply the IP which corresponds to the name in question. This is why the rr structure as an "ip" pointer; so that MaraDNS, when run in authoritative mode, can give out the IP (if it has it) for any names it makes available in NS or MX records. <p> DNS can provide various types of data in the answer to a DNS request, depending on the request one sends to a DNS server. All of these replies share one thing in common: The data is a RR header, as described in RFC 1035 <hibit alt="section ">§</hibit>3.2.1, followed by a binary stream of data (called RDDATA in RFC1035). That "data" js_string object is a binary string containing the raw DNS RDDATA for the data in question; MaraDNS creates the RR header from the hash index (which becomes "name" and "type" in the header); this index is pointed to in the "query" element of the RR structure, the CLASS is always 1 (internet), the TTL field of the RR data structure is the TTL in the RR header, and the RDLENGTH in the RR header is obtained from the length value for the js_string object. <p> In addition, authoritative DNS replies include a listing of the name servers which are the name servers for the query in question. In other words, a DNS question for "where is www.example.com." will get a reply which would translate in to English as "www.example.com is at 10.1.2.3; and the name servers which give the right answers for this question are named ns1.example.com, which has an IP of 10.1.2.4, and ns2.example.com, which has an IP of 10.1.2.5". <p> The way MaraDNS handles this is somewhat kludgy. After the last answer for a question, the "next" element points to the memory address for the first name server record for the record list in question. When MaraDNS is formulating a reply, it knows that, as soon as the record type changes from a non-NS record type to a NS record type, that we have gone from giving out direct answers to their question to giving out the list of name servers. <H1>The structure of cached data</H1> Due to the complexities added by caching data, the structure for cache data is somewhat more complicated; there are several cases which we need to deal with: <ul> <li>The case of a RR being the direct answer to our query. <li>The case of a RR being a CNAME answer; in this case, our query contains both the existence of a CNAME; and the A record which the CNAME points to (if it exists). <li>The case where we are caching a "this record does not" exist entry. <li>The case were we are caching a "the name servers which handle *.example.com. are located at 10.1.2.3 and 10.1.2.4" <li>The case where we are caching a "the name servers which handle *.example.com. have the names ns1.example.af.mil. and ns2.example.af.mil." </ul> In all cases, unlike the authoritative data, the "ip" and "next" pointers can not point to other RRs; each RR needs to be a self-contained unit. (The reasons for this are explained previously in this document). <p> In the first case (a straight RR), the data is the same as the data for an authoritative answer. MaraDNS creates an RR structure in the same format as an RR structure for an authoritative record. The only difference is that cache records do not include a section where the name servers for the record in question are listed; and that the data never points to IPs (e.g. A cache MX record does not have the IPs which correspond to the MX record; this is a non-issue since most mailer programs are smart enough to do separate queries, one to get the list of MX records, and other queries to get the A records which the MX records point to). <p> This data is indexed in the hash as being an answer for the query asked. <p> In the second case (the RR being a CNAME answer), the data is the same format for the first, case, with one addition. The one addition is this: the IP data structure element points to a RR structure (especially made for this data) which contains the IP for this CNAME record. The reason we need to do this is because stub resolvers will not perform a second query to find out what the A record which corresponds to a given CNAME record is; if the data is not in the DNS packet which contains the CNAME, the host name in question will not resolve. <p> This data is indexed in the hash with the same domain name as the question, but where the record type has been changed to CNAME. <p> In the third case, where we cache a "this record does not exist" answer, the data is almost identical to a positive RR. The only difference is that the data is a SOA record instead of a reply for the question that they asked; and that the "datatype" for the hash element is a "MARA_DNS_NEG" datatype instead of a standard "MARA_DNSRR" datatype. <p> The hash index for "this does not exit" replies is the same as the query being asked; the only thing different is the hash datatype. <p> <h1>How referrals are stored</h1> A referral is a record in the form "all *.example.com records can be found at the name servers with the IPs 10.1.2.3 and 10.1.2.4" or "all *.example.net records can be found at the name servers with the names ns1.example.org and ns2.example.org". <p> In the case of caching NS referrals, we use a different data structure for storing the data, as follows: <pre> typedef struct closer { unsigned int num_elements; /* Number of elements in the linked list */ time_t ttd; /* Time for this record to die */ int datatype; /* This is the RR type of the data (RR_A [in which case, the data is a pointer to a uint32] or RR_NS [which makes the data point to a js_string object]) */ void *data; fila *zap; /* So that we can determine in which order to zap RRs */ struct closer *masked; /* If another record with a ttd after the ttd for this record exists, we want to access that data when this record expires */ struct closer *next; } closer; </pre> The only difference is whether the server which gave us the referral gave us glue records. <p> If we have glue records, we convert the glue records in to the IPs which are glued to them, and store the data in the hash so that the "closer" structure has a datatype of RR_A, and data points to a unsigned 32-bit integer (an IP). <p> If we were not given glue record, then the closer struct has a datatype of RR_NS, and data points to a js_string which is a domain label with a 2-byte "type" suffix; the suffix always has a value of 1. <p> In both cases, the data is indexed in the hash with a domain label followed by a bogus "type" of 252; this is the data type used for an AXFR request, which means that this kind of data will never be cached as an ordinary RR. The data type for the hash element is a MARA_DNS_NS element. <p> <h1>Removing elements from the cache</h1> <p> The cache, to be effective, only needs to be a small subset of all of the DNS information available on the internet. In order to minimize the memory resources that MaraDNS needs, after the cache expands to a given size, MaraDNS will remove the elements from the cache. <p> Elements need to be removed from the cache in such a way that: <ul> <li>Elements removed from the cache have not been recently accessed <li>There is not needless overhead for determining what elements to remove from the cache. <li>We do not attempt to remove already-removed elements from the cache </ul> The way MaraDNS accomplishes this is by having a special <i>fila</i> data structure. The <i>fila</i> code was developed when I was down in México; <i>fila</i> is the Spanish word for what American English speakers call a "line" and what the rest of the English-speaking world more accurately calls a "queue". <p> The fila data structure is a circular linked list, is works as follows: <ul> <li> Every time we create a new resource record in the cache, it is placed at the top of the list. <li> Every time we access a resource record (or a nameserver referral), we move that record to the top of the list. <li> When we remove records once the cache starts to fill up, we remove the records at the bottom of the list. </ul> The reason why the resource records in the cache need a pointer to the corresponding fila element is because, when a record's TTL expires, we need to remove the element from the cache, and the corresponding fila element. <H1>Some other information about the rr data structure</H1> <H2>The seen element</H2> The <i>seen</i> element in the rr data structure is used by the <tt>udpsuccess()</tt> routine. <tt>udpsuccess()</tt> is the routine which takes a resource record, and sends it over UDP to a client which wants a given resource record. <p> Since a resource record has several parts, each of which is its own data structure, it is important to guarantee that a given resource record is only displayed once; the seen element is given a value of one when it is shown. <p> udpsuccess() also clears out all of the values of seen back to zero after showing all of the elements in question. <H2>An ip of 255.255.255.255 or 0.0.0.0</H2> Some of MaraDNS' routines allow a query to be performed, and to return an IP based on the query performed. In some cases, we could not get an IP; perhaps there is no such resource record (and we received a "no such record" response when asking for the IP); or perhaps there was an error condition when trying to process the IP. The bogus IP of 0.0.0.0 indicates an error condition; the bogus IP of 255.255.255.255 indicates that a RR for the IP, in fact, does not exist. </body>