Riak Architecture Overview ------ This document describes the design of Riak. It is meant to give both a description of the shape of the system and also some background in the terminology used throughout. Summary and High-level View --- Riak is a distributed key-value store, strongly influenced by the Dynamo Paper[1] and the CAP Theorem[2]. It supports high availability by allowing tunable levels of guarantees for durability and eventual consistency. A Riak cluster is generally run on a set of well-connected physical hosts. Each host in the cluster runs one Riak node. Each Riak node runs a set of virtual nodes, or "vnodes", that are each responsible for storing a separate portion of the key space. Nodes are not clones of each other, nor do they all participate in fulfilling every request. The extent to which data is replicated, and when, and with what merge strategy and failure model, is configurable at runtime. The Ring --- (Much of this section is discussed in the Dynamo paper, but it's a good summary of how Riak implements the necessities.) Riak's client interface speaks of "buckets" and "keys". Internally, Riak computes a 160-bit binary hash of the bucket/key pair, and maps this value to a position on an ordered "ring" of all such values. This ring is divided into partitions. Each Riak vnode is responsible for a partition (we say that it "claims" that partition). The nodes of a Riak cluster each attempt to run roughly an equal number of vnodes. In the general case, this means that each node in the cluster is responsible for 1/(number of nodes) of the ring, or (number of partitions)/(number of nodes) vnodes. For example, if two nodes define a 16-partition cluster, then each node will run 8 vnodes. Nodes claim their partitions at random intervals around the ring, in an attempt at even distribution. When a value is being stored in the cluster, any node may participate as the coordinator for the request. The coordinating node consults the ring state to determine which vnode owns the partition in which the value's key belongs, then sends the "put" request to that vnode, as well as the vnodes responsible for the next N-1 partitions in the ring, where N is a bucket-configurable parameter that describes how many copies of the value to store. The put request may also specify that at least W (<= N) of those vnodes reply with success, and that DW (<= W) reply with success only after durably storing the value. A fetch, or "get", request operates similarly, sending requests to the vnode that "claims" the partition in which the key resides, as well as to the next N-1 partitions. The request also specifies R (<= N), the number of vnodes that must reply before a response is returned. The riak node startup are defined by the riak_core_sup and riak_kv_sup modules. The riak ring is defined by the riak_core_ring module. Vnodes are defined by the riak_kv_vnode module. Gets and Puts are driven by the riak_kv_get_fsm and riak_kv_put_fsm modules, respectively. Gossiping --- The ring state is shared around the cluster by means of a "gossip protocol". Whenever a node changes its claim on the ring, it announces its change via this protocol. It also periodically re-announces what it knows about the ring, in case any nodes missed previous updates. This gossip protocol is defined in the riak_core_gossip and riak_core_ring_manager modules. Vclocks --- With any node able to drive any request, and not all nodes needing to participate in each request, it is necessary to have a method for keeping track of which version of a value is current. This is where vclocks come in. The vclocks used in Riak are based on the work of Leslie Lamport.[3] When a value is stored in Riak, it is tagged with a vclock, establishing its initial version. For each update, the vclock is extended in such a way that Riak can later compare two versions of the object and determine: 1. Whether one object is a direct descendant of the other. 2. Whether the objects are direct descendants of a common parent. 3. Whether the objects are unrelated in recent heritage. Using this knowledge, Riak can possibly auto-repair out-of-sync data, or at least provide a client with an opportunity to reconcile divergent changesets in an application specific manner. Riak's vclock usage is defined by the vclock module in the riak source directory. Riak attempts to move data toward a consistent state across nodes, but it doesn't do so by comparing each and every object on each node. Instead, nodes gossip a "merkle tree"[4], which allows them to quickly decide which values need comparing. Riak's merkle trees are defined by the merkerl module in the riak source directory. Backends --- Sharing data among nodes, on rings, etc. is all well and good, but at some point, it has to actually be stored somewhere - like on disk! Because Riak is relevant to a wide variety of applications, its "backend" storage system is a pluggable one. Each node may be configured with a different Erlang module for doing the simple storage, at the vnode level, below all of the interconnected cluster details. At the backend level, a module only needs to define "get", "put", "delete", and "keys list" functions that receive a key and value which are both Erlang binaries. The backend can consider these binaries completely opaque data, or examine them to make decisions about how best to store them. Several backends are packaged with Riak: 1. riak_kv_bitcask_backend, which stores data in the Bitcask key/value store. 2. riak_kv_cache_backend, which behaves as an LRU-with-timed-expiry cache 3. riak_kv_dets_backend, which stores data on-disk in DETS tables 4. riak_kv_ets_backend, which stores data in ETS tables (which makes it volatile storage, but great for debugging) 5. riak_kv_fs_backend, which stores data directly to files in a nested directory structure on disk 6. riak_kv_gb_trees_backend, which stores data in Erlang gb_trees. [1] http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf [2] http://portal.acm.org/citation.cfm?doid=564585.564601 [3] http://portal.acm.org/citation.cfm?id=359563 [4] http://portal.acm.org/citation.cfm?id=704751