Since RDF technology has its origins in academia, you have tooling written in Java, and then you have everything else. The problem with Java is that until relatively recently, it hasn't especially lent itself to casual programming. You can't just doodle up a little toy prototype in an afternoon; everything in Java has to be an Actual Serious Software Project™. Even when you use one of the new dynamic languages that back onto the JVM, you still have to lug around, well, the JVM. So that pretty much rules out snappy little command-line programs or persistent services with a modest memory footprint.

The primary problem with everything else is that there are a lot of everything-elses. Unless you're targeting Windows—or maybe even if—you're probably going to want to do a lot of your sundry programming in a dynamic, object-oriented, interpreted language, like Perl, Python, Ruby, or JavaScript. Each of these languages has at least one RDF framework.

Language     Framework
Perl         RDF::Trine, Attean
Python       rdflib
Ruby         rdf.rb
JavaScript   A bunch

The problem with these frameworks is that no single one of them is in possession of the entire brain. Even controlling for the need to interface with the data using a particular programming language, there are still noticeable gaps in functionality. Your system will eventually employ more than one of them. Moreover, these frameworks are closer to an object-relational mapper for SQL than anything else: they provide a unified interface to the data, but abstract the storage to various drivers for various back-ends.
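
To make the ORM analogy concrete, here is roughly what that abstraction looks like in rdflib; note the store names are plugin identifiers that have shifted between versions (rdflib 6.x calls the Berkeley DB plugin "BerkeleyDB"; older releases called it "Sleepycat"), and the path is arbitrary:

    # Illustrative: rdflib's plugin-store abstraction. The graph API stays
    # the same no matter which back-end driver is plugged in underneath.
    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")

    # Default store: in-memory.
    g = Graph()
    g.add((EX.alice, EX.knows, EX.bob))

    # Same interface, different back-end: a persistent Berkeley DB store.
    pg = Graph(store="BerkeleyDB")
    pg.open("/tmp/rdf-demo", create=True)
    pg.add((EX.alice, EX.knows, EX.bob))
    pg.close()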

The ultimate problem, the one that I'm writing about, is that these are a mess. There is a standard query protocol called SPARQL which serves as a front-end to a handful of robust database implementations, and you can write your N-tier application on top of that, but then you're writing an N-tier application. You'll have to think about optimizing queries for network round trips, intermediate caching, and all that jazz. In other words, it becomes a whole Thing™. Conversely, you can use one of the lighter-weight storage mechanisms built on top of SQL or key-value stores, which can be directly attached.
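
For a flavor of the N-tier version, here it is in rdflib terms, with a hypothetical endpoint URL; every pattern you match against the graph gets translated into a SPARQL query and shipped over HTTP:

    # Illustrative: a graph backed by a remote SPARQL endpoint. The API
    # looks local, but each triples() call becomes an HTTP round trip.
    from rdflib import Graph
    from rdflib.plugins.stores.sparqlstore import SPARQLStore

    store = SPARQLStore(query_endpoint="https://example.org/sparql")
    g = Graph(store=store)

    for s, p, o in g.triples((None, None, None)):
        print(s, p, o)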

Now: the problem with these is that they tend to suck. There are two major issues:

  1. They tend not to be terribly efficient, certainly not nearly as efficient as they could be; and
  2. They are completely ad hoc and mutually incompatible, despite all basically doing the same thing.

RDF, being a graph and therefore structurally very simple, is probably the worst-shaped thing to represent in SQL. A normal RDF application will generate orders of magnitude more SQL queries than a native SQL application, and they will be the kinds of queries that don't lend themselves to the indexing and fancy joins that SQL is actually good at. In my experience, if you're going to be putting that much traffic through the database, it's better to use a key-value store.
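
As a sketch of why, consider the naive one-big-table mapping in SQLite; a two-hop traversal already needs a self-join, and every additional hop compounds it:

    # Illustrative: the naive "triples table" SQL mapping. Each hop in a
    # graph traversal costs another self-join (or another round trip).
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    db.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
        ("alice", "knows", "bob"),
        ("bob",   "knows", "carol"),
    ])

    # Friends-of-friends: one self-join per hop.
    rows = db.execute("""
        SELECT t2.o FROM triples t1
        JOIN triples t2 ON t1.o = t2.s
        WHERE t1.s = 'alice' AND t1.p = 'knows' AND t2.p = 'knows'
    """).fetchall()
    print(rows)  # [('carol',)]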

Now, all key-value stores are pretty much made equal, at least as far as their interface is concerned: you plug in an arbitrary string of data and, if it's in there, you get another arbitrary string of data back out. The idea is that this process happens very quickly, and you can use it to string together more complex constructs. The basic requirements for a key-value store, then, are that it be fast, that it be embeddable, and that it tolerate concurrent access, at least from multiple threads.
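
That contract is small enough to show in full; here it is through the py-lmdb binding, with an arbitrary path and map size:

    # Illustrative: the entire key-value interface. Byte string in,
    # byte string (or None) back out.
    import lmdb

    env = lmdb.open("/tmp/kv-demo", map_size=64 * 1024 * 1024)

    with env.begin(write=True) as txn:
        txn.put(b"some key", b"some value")

    with env.begin() as txn:
        print(txn.get(b"some key"))  # b'some value'
        print(txn.get(b"missing"))   # None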

Here are most, if not all, of your options:

Product              Remarks
Berkeley DB          Oracle, lol
LevelDB              Google, lol
RocksDB              Facebook, lol
Kyoto/Tokyo Cabinet  No concurrent writes
LMDB                 My first target

Some embedded key-value databases that exhibit these desirable properties.

I will add that concurrent writes from different processes, not just threads, are also pretty important, though you may not care about that for your application. My other inclination is to focus on directly-attached stores, since putting a network service in the middle tends to add a lot of overhead. Put another way: if you're considering writing an RDF store, the likely plan is that the network layer, if any, will be implemented by you.
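
This, for what it's worth, is why LMDB is my first target: any number of processes can attach the same environment directly, readers never block, and writes are serialized through a single cross-process lock. A minimal sketch, again assuming the py-lmdb binding and an arbitrary path:

    # Illustrative: two unrelated processes can run this same code against
    # the same files. Read transactions proceed concurrently; the write
    # transaction takes the (cross-process) write lock.
    import lmdb

    env = lmdb.open("/tmp/shared-db", map_size=64 * 1024 * 1024)

    with env.begin(write=True) as txn:
        txn.put(b"who", b"some process")

    with env.begin() as txn:
        print(txn.get(b"who"))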

There Is No SQLite for RDF

I have surveyed a number of RDF storage implementations, and my conclusion is that they are …varied. Framework implementers seem to just do whatever is convenient for them and their own highly-localized problem of including a toy storage driver in their software package. The problem, again, is that there are about a zillion ways to implement an RDF quad store on top of a key-value database, which themselves are functionally identical. Meaning that, if one were so inclined, one could come up with an architecture that would hit all key-value stores and all frameworks at once.
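
To see how arbitrary these decisions are, here are two hypothetical key encodings for the same quad, neither of them RDF::LMDB's; both answer the same lookups, and a driver written against one is useless against the other:

    # Illustrative: two functionally equivalent, mutually unreadable ways
    # to key a quad (s, p, o, g) in a key-value store. Hypothetical.
    import hashlib

    def layout_a(s, p, o, g):
        # One index; terms joined verbatim with a separator byte.
        return {b"spog": b"\x00".join([s, p, o, g])}

    def layout_b(s, p, o, g):
        # Three permuted indexes over fixed-width hashed terms.
        h = lambda t: hashlib.sha256(t).digest()
        return {
            b"spo": h(s) + h(p) + h(o) + h(g),
            b"pos": h(p) + h(o) + h(s) + h(g),
            b"osp": h(o) + h(s) + h(p) + h(g),
        }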

Consider SQLite. One of the things that makes it so popular—aside from being totally unencumbered by licensing constraints—is that it's just so easy to get up and running. Moreover, it has bindings in every conceivable programming language, so two completely different programs can connect to the same SQLite database and manipulate the exact same data. SPARQL notwithstanding, RDF developers do not have a fast and robust embedded database that can be readily shared between different programming languages and frameworks.
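
The whole pitch fits in a few lines; this happens to be Python, but the equivalent incantation exists in more or less every language with a C FFI:

    # Illustrative: SQLite's "getting up and running". No server, no
    # configuration; the database is just a file that any binding can open.
    import sqlite3  # ships with the standard library

    db = sqlite3.connect("shared.db")
    db.execute("CREATE TABLE IF NOT EXISTS notes (body TEXT)")
    db.execute("INSERT INTO notes VALUES ('written from Python')")
    db.commit()
    # A Perl, Ruby, or C program can now open shared.db and read this row.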

An Interim Solution

The answer, I believe, is to come up with a pattern—a specification: encode a set of design decisions about the structure of the underlying database such that it successfully balances efficiency and portability, so that appropriate drivers can be written in any language, against any low-level back end, in only a few hundred lines of code. This is what I have begun to do with RDF::LMDB.

So far, the prototype seems to perform pretty well. There are plenty of valid criticisms of the design as it currently stands, such as the copious use of full-length SHA-256 hashes all over the place. The design is likely to change: I'm targeting ease of implementation first, and maximizing efficiency second. These are corners that can be sanded down. Once that's happened, I'll spin out a few LMDB drivers for different languages and frameworks, and then I will write a spec.
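
For a flavor of what such a driver might look like, here is a minimal sketch, assuming only the one design detail mentioned above (full-length SHA-256 hashes as term keys); the database names, layout, and interning scheme are my own illustration, not the actual RDF::LMDB schema:

    # Illustrative reconstruction, not the real RDF::LMDB layout: intern
    # terms under their SHA-256 hashes, then index statements as
    # concatenations of those hashes.
    import hashlib
    import lmdb

    env = lmdb.open("/tmp/rdf-sketch", max_dbs=2, map_size=64 * 1024 * 1024)
    terms = env.open_db(b"terms")            # hash -> serialized term
    spo = env.open_db(b"spo", dupsort=True)  # s hash -> (p hash + o hash)

    def intern(txn, term: bytes) -> bytes:
        key = hashlib.sha256(term).digest()  # the full-length hash in question
        txn.put(key, term, db=terms)
        return key

    with env.begin(write=True) as txn:
        s = intern(txn, b"<http://example.org/alice>")
        p = intern(txn, b"<http://example.org/knows>")
        o = intern(txn, b"<http://example.org/bob>")
        txn.put(s, p + o, db=spo)            # one of several possible indexes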