Part 1: Introducing riak_core

Posted: July 30th, 2010 | Author: kevin | Filed under: Erlang, riak_core | View Comments
  • What is riak_core?

    riak_core is a single OTP application which provides all the services necessary to write a modern, well-behaved distributed application. riak_core began as part of Riak. Since the code was generally useful in building all kinds of distributed applications we decided to refactor and separate the core bits into their own codebase to make it easier to use.

    Distributed systems are complex and some of that complexity shows in the amount of features available in riak_core. Rather than dive deeply into code, I’m going to separate the features into broad categories and give an overview of each.

    Note: If you’re the impatient type and want to skip ahead and start reading code, you can check out the source to riak_core via hg or git.

    Node Liveness & Membership

    riak_core_node_watcher is the process responsible for tracking the status of nodes within a riak_core cluster. It uses net_kernel to efficiently monitor many nodes. riak_core_node_watcher also has the capability to take a node out of the cluster programmatically. This is useful in situations where a brief node outage is necessary but you don’t want to stop the server software completely.

    riak_core_node_watcher also provides an API for advertising and locating services around the cluster. This is useful in clusters where nodes provide a specialized service, like a CUDA compute node, which is used by other nodes in the cluster.

    riak_core_node_watch_events cooperates with riak_core_node_watcher to generate events based on node activity, i.e. joining or leaving the cluster, etc. Interested parties can register callback functions which will be called as events occur.

    Partitioning & Distributing Work

    riak_core uses a master/worker configuration on each node to manage the execution of work units. Consistent hashing is used to determine which target node(s) to send the request and the master process on each node farms out the request to the actual workers. riak_core calls worker processes vnodes. The coordinating process is the vnode_master.

    The partitioning and distribution logic inside riak_core also handles hinted handoff when required. Hinted handoff occurs as a result of a node failure or outage. In order to assure availability, most clustered systems will use operational nodes in place of down nodes. When the down node comes back the cluster needs to migrate the data from its temporary home on the substitute nodes to the data’s permanent home on the restored node. This process is called hinted handoff and is managed by components inside riak_core. riak_core also handles migrating partitions to new nodes when they join the cluster such that all work continues to be evenly partitioned to all cluster members.

    riak_core_vnode_master starts all the worker vnodes on a given node and routes requests to
    the vnodes as the cluster runs.

    riak_core_vnode is an OTP behavior wrapping all the boilerplate logic required to implement a vnode. Application-specific vnodes need to implement a handful of callback functions in order to participate in handoff sessions and receive work units from the master.

    Cluster State

    A riak_core cluster stores global state in a ring structure. The state information is transferred between nodes in the cluster in a controlled manner to keep all cluster members in sync. This process is referred to as “gossiping”.

    riak_core_ring is the module used to create and manipulate the ring state data shared by all nodes in the cluster. Ring state data includes items like partition ownership and cluster-specific ring metadata. Riak KV stores bucket metadata in the ring metadata, for example.

    riak_core_ring_manager manages the cluster ring for a node. It is the main entry point for application code accessing the ring, via riak_core_ring_manager:get_my_ring/1, and also keeps a persistent snapshot of the ring in sync with the current ring state.

    riak_core_gossip manages the ring gossip process and insures the ring is generally consistent across the cluster.

  • What’s the plan?

    Over the next several months I’m going to cover the process of building a real application in a series of posts to this blog where each post covers some aspect of system building with riak_core. All of the source to the application will be published under the Apache2 licensed and shared via a public repo on github.

    And what type of application will we build? Since the goal of this series is to illustrate how to build distributed systems using riak_core and also satisfy my own technical curiosity I’ve decided to build a distributed graph database. A graph database should provide enough use cases to really exercise riak_core while at the same time not obscuring the core learning experience in tons of complexity.

    Thanks to Sean Cribbs and Andy Gross for providing helpful review and feedback.


Erlang Pattern: The Router

Posted: October 19th, 2009 | Author: kevin | Filed under: Erlang | View Comments

There are three components to this pattern. The client, the router, and the target. The client begins the process by calling the router with a message destined for a target. The router uses the message, and possibly other metadata, to locate a desirable target. Then the router hands off the message to the selected target. The target replies directly to the client and completes the process.

I’ve written this code a number of times in several different languages. What makes the Erlang implementation so nice is it’s brevity and simplicity. Using the gen_server behavior I can implement the core logic in just a few lines:

  1. Client calls the router with a message destined for a target:
    gen_server:call(?ROUTER, {invoke, Target, Msg})

  2. The router looks up the target and forwards the message on:

    handle_call({invoke, Target, Msg}, From, State) ->
    %% Target selection code goes here
    gen_server:cast(TargetPidOrName, {From, Msg}),
    {noreply, State};

  3. The target replies directly to the waiting client:

    handle_cast({Originator, Msg}, State) ->
    %% Server logic goes here
    gen_server:reply(Originator, Reply),
    {noreply, State};

The pretty bit is that all this plumbing can be hidden away inside a function. So the original call to the router in the client winds up looking like this: router:invoke_target(Target, Msg). The rest happens behind the scenes and appears as just another gen_server call.

I’m pretty sure there are no bugs in using this approach. I’ve benchmarked a recent implementation of this pattern at over 6000 requests/sec without a hiccup.


Tomorrow Night

Posted: September 9th, 2009 | Author: kevin | Filed under: Erlang, Talks | View Comments

I’ll be speaking tomorrow night at a joint meeting of DC ALT.NET, Novalang, and Arlington Erlang Users’ Group. I’ve got two full hours to fill so I’m planning on covering a lot of ground. I’ll be leading off with a gentle but speedy introduction to Erlang followed by blatant evangelism on why webmachine is the finest framework for building REST resources EVAH.

If you’re in the area, please consider coming out. You can reserve a spot here.

Thanks to Matt, Luc, and Chris, for organizing the meeting and having me up to speak. I’m really looking forward to meeting everyone and spending an evening talking about one of my favorite subjects.


CouchDB Benchmarks: gcc vs. clang

Posted: September 6th, 2009 | Author: kevin | Filed under: Erlang | Comments Off

Since the release of Snow Leopard, I’ve been itching to run some benchmarks. Jan Lehnhardt suggested I give the benchmarking script from this bug a few runs. So that’s what I did.

At first glance testing seemed pretty simple. Compile one version of Erlang with gcc and another with clang. Then use each of those versions to compile CouchDB and run benchmarks against each Erlang/CouchDB pair. Since CouchDB also has at least one bit of native code, couchjs, I’d also used gcc and clang to compile that, too.

Then reality showed up and ruined my day. Compiling CouchDB on OS X is, well, a pain in the ass involving MacPorts and lots of dependencies if you do it wrong. Doing it right means grabbing Jan’s couchdbx-core project from Github and building new guts for CouchDBX. This is the way to go, IMHO. 10 – 15 minutes’ of compiling gave me a new Erlang, SpiderMonkey, and CouchDB I could drop into the Contents/Resources/couchdbx-core directory of my previously installed CouchDBX app.

Compiling with clang is similarly easy. All you need to do is set the environment variable CC to point to /Developer/usr/bin/clang. I found I also needed to set CFLAGS to -Qunused-arguments to get Erlang to compile.

The benchmark itself consists of three phases. The first phase creates a database and populates it with 40,000 documents. The second phase creates and executes a complex view named 'megaview' to really exercise the view collation code. The last phase creates and executes a simple view named 'simpleview'.

Let me quickly describe how the performance numbers were collected. The benchmark was executed 5 times against the clang and gcc versions of CouchDB. I've computed a raw average across all runs as well as a "normalized" average which throws out the highest and lowest times. Finally, I computed the standard deviation as a indicator of overall variability.

Enough talk. Here are the numbers:

Note: Document insertion times are given in seconds while view collation numbers are given in minutes

Operation Average Normalized Average Standard Deviation
Insert 40k docs (clang) 41.27 41.28 0.048
Insert 40k docs (gcc) 41.02 41.02 0.132
megaview (clang) 37.68 37.68 0.664
megaview (gcc) 17.97 17.95 0.463
simpleview (clang) 1.59 1.4 0.442
simpleview (gcc) 0.75 0.74 0.020

The megaview test really highlights the disparity between clang and gcc. It's worth noting the test also showcases a bottleneck in CouchDB and only used a single core on my laptop. If I had to guess I'd say the performance disparity is a function of the age of each compiler. gcc has been around forever so it makes sense it's had more time to implement more/better optimizations than clang.


September’s Speaking Gigs

Posted: August 24th, 2009 | Author: kevin | Filed under: Erlang, Talks | Comments Off

I’m speaking at two events in September. First, I’m making the trek to DC to speak at the 9/10 combinedmeeting of the Erlang Users of Arlington/DC, Alt.NET, and NovaLang groups to talk about web development with webmachine and Erlang. I’d appreciate any tips on decent and inexpensive hotels in the area and reasonable places to co-work for the day. Of course, if anyone would like to have a resident Erlang developer at their offices for the day I’m up for that, too :-)

Then, on 9/12, I’ll be introducing Erlang to open source hackers at the ominously named Evil Robot Conference, organized by my friend and former Red Hat colleague Michael DeHaan. Evil Robot is shaping up to be a kick-ass eclectic mix of technologies and speakers. I’m personally looking forward to hearing about btrfs and UnCommon Web.