<?xml version="1.0"?>
<?xml-stylesheet href="/transform" type="text/xsl"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:bs="http://purl.org/ontology/bibo/status/" xmlns:ci="https://vocab.methodandstructure.com/content-inventory#" xmlns:dct="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:xhv="http://www.w3.org/1999/xhtml/vocab#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" lang="en" prefix="bibo: http://purl.org/ontology/bibo/ bs: http://purl.org/ontology/bibo/status/ ci: https://vocab.methodandstructure.com/content-inventory# dct: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# xhv: http://www.w3.org/1999/xhtml/vocab# xsd: http://www.w3.org/2001/XMLSchema#" vocab="http://www.w3.org/1999/xhtml/vocab#" xml:lang="en">
  <head>
    <title property="dct:title">There Is No SQLite for RDF</title>
    <base href="https://doriantaylor.com/there-is-no-sqlite-for-rdf"/>
    <link href="document-stats#Ear4ZYtiM7DZhJdBc_HvVK" rev="ci:document"/>
    <link href="elsewhere" rel="alternate bookmark" title="Elsewhere"/>
    <link href="this-site" rel="alternate index" title="This Site"/>
    <link href="http://purl.org/ontology/bibo/status/published" rel="bibo:status"/>
    <link href="" rel="ci:canonical" title="There Is No SQLite for RDF"/>
    <link href="lexicon/#EzqXIsriaILFcWjXdS7FbI" rel="dct:audience" title="Software Developer"/>
    <link href="person/dorian-taylor#me" rel="dct:creator" title="Dorian Taylor"/>
    <link href="file/shrug" rel="foaf:depiction"/>
    <link href="person/dorian-taylor" rel="meta" title="Who I Am"/>
    <link about="./" href="3f36c30c-6096-454a-8a22-c062100ae41f" rel="alternate" type="application/atom+xml"/>
    <link about="./" href="f07f5044-01bc-472d-9079-9b07771b731c" rel="alternate" type="application/atom+xml"/>
    <link about="./" href="this-site" rel="alternate"/>
    <link about="./" href="elsewhere" rel="alternate"/>
    <link about="./" href="e341ca62-0387-4cea-b69a-cdabc7656871" rel="alternate" type="application/atom+xml"/>
    <link about="verso/" href="3f36c30c-6096-454a-8a22-c062100ae41f" rel="alternate" type="application/atom+xml"/>
    <link about="verso/" href="this-site" rel="alternate"/>
    <link about="verso/" href="elsewhere" rel="alternate"/>
    <meta content="The lack of a lightweight, efficient, directly-attached persistent storage mechanism, that can be readily shared between programming languages and frameworks, has frustrated Semantic Web development. I intend to do something about it." name="description" property="dct:abstract"/>
    <meta content="2019-11-28T02:12:11+00:00" datatype="xsd:dateTime" property="dct:created"/>
    <meta content="2019-11-28T02:12:32+00:00" datatype="xsd:dateTime" property="dct:modified"/>
    <meta content="2021-01-29T17:26:33+00:00" datatype="xsd:dateTime" property="dct:modified"/>
    <meta content="2022-05-31T15:10:50+00:00" datatype="xsd:dateTime" property="dct:modified"/>
    <meta about="person/dorian-taylor#me" content="Dorian Taylor" name="author" property="foaf:name"/>
    <meta content="summary_large_image" name="twitter:card"/>
    <meta content="@doriantaylor" name="twitter:site"/>
    <meta content="There Is No SQLite for RDF" name="twitter:title"/>
    <meta content="The lack of a lightweight, efficient, directly-attached persistent storage mechanism, that can be readily shared between programming languages and frameworks, has frustrated Semantic Web development. I intend to do something about it." name="twitter:description"/>
    <meta content="https://doriantaylor.com/file/shrug" name="twitter:image"/>
    <object>
      <nav>
        <ul>
          <li>
            <a href="document-stats#Ear4ZYtiM7DZhJdBc_HvVK" rev="ci:document" typeof="qb:Observation">
              <span>urn:uuid:6abe1962-d88c-4ec3-a661-25d05cfc7bd5</span>
            </a>
          </li>
        </ul>
      </nav>
    </object>
  </head>
  <body about="" id="E34MUb53L-FhGaM-lONmRK" typeof="bibo:Article">
    
    <aside role="note" id="E1mFLGkoeXOdMfqkKKSWHI">
      <h2>Update: There <em>is</em> an SQLite for <abbr>RDF</abbr>!</h2>
      <p><a href="https://github.com/oxigraph/oxigraph" rel="dct:references">It's called Oxigraph</a>, and predictably it's written in Rust.</p>
    </aside>
    <section id="E7W2sTtSdOvccYak1F6HmI">
      <p>Since <abbr>RDF</abbr> technology has its origins in academia, you have tooling written in <dfn>Java</dfn>, and then you have everything else. The problem with Java is that <span class="parenthesis" title="with Clojure/Scala/Groovy/Jython/JRuby">until relatively recently</span>, it hasn't especially lent itself to <q>casual</q> programming. You can't just doodle up a little toy prototoype in an afternoon&#x2014;everything in Java has to be an Actual Serious Software Project&#x2122;. Even when you use one the new dynamic languages that back onto the <abbr>JVM</abbr>, you still have to lug around, well, the <abbr>JVM</abbr>. So that pretty much rules out snappy little command-line programs or persistent services with a modest memory footprint.</p>
      <p>The primary problem with <em>everything else</em> is that there are a <em>lot</em> of everything-elses. Unless you're targeting Windows&#x2014;or maybe even <em>if</em>&#x2014;you're probably going to want to do a lot of your sundry programming in a dynamic, object-oriented, interpreted language, like <dfn>Perl</dfn>, <dfn>Python</dfn>, <dfn>Ruby</dfn>, or <dfn>JavaScript</dfn>. Each of these languages has at least one <abbr>RDF</abbr> framework.</p>
      <figure id="EpmeIr5y04SmZTgp5TXa2L">
        <table>
          <thead>
            <tr>
              <th>Language</th>
              <th>Framework</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Perl</td>
              <td><span class="parenthesis" title="Both of these were written by the same guy."><a href="https://metacpan.org/pod/RDF::Trine" rel="dct:references">RDF::Trine</a>, <a href="https://metacpan.org/pod/Attean" rel="dct:references">Attean</a></span></td>
            </tr>
            <tr>
              <td>Python</td>
              <td><a href="https://rdflib.readthedocs.io/en/stable/" rel="dct:references">rdflib</a></td>
            </tr>
            <tr>
              <td>Ruby</td>
              <td><a href="http://ruby-rdf.github.io/" rel="dct:references">rdf.rb</a></td>
            </tr>
            <tr>
              <td>JavaScript</td>
              <td><a href="https://www.w3.org/community/rdfjs/wiki/Comparison_of_RDFJS_libraries" rel="dct:references">A bunch</a></td>
            </tr>
          </tbody>
        </table>
      </figure>
      <p>The problem with these frameworks is that no single one of them is in possession of the entire brain. Even controlling for the need to interface with the data using a particular programming language, there are still noticeable gaps in functionality. Your system <a href="intelligent-heterogeneity" rel="dct:references" title="Intelligent Heterogeneity">will eventually employ more than one of them</a>. Moreover, these frameworks are closer to an <dfn>object-relational mapper</dfn> for <abbr>SQL</abbr> than anything else: they provide a unified interface to the data, but abstract the storage to various drivers for various back-ends.</p>
      <p>The <em>ultimate</em> problem, the one that I'm writing about, is that these are a mess. There is a standard query protocol called <abbr>SPARQL</abbr> which serves as a front-end to a handful of robust database implementations, and you can write your N-tier application on top of that, <em>but then you're writing an N-tier application</em>. You'll have to think about optimizing queries for network round trips, intermediate caching, and all that jazz. In other words, it becomes a whole Thing&#x2122;. Conversely, you can use one of the lighter-weight storage mechanisms built on top of <abbr>SQL</abbr> or key-value stores, which can be directly attached.</p>
      <p>Now: the problem with <em>these</em>, is that they tend to suck. There are two major issues:</p>
      <ol>
        <li>They tend not to be terribly efficient, certainly not nearly as efficient as they could be,</li>
        <li>They are completely ad-hoc and mutually incompatible, despite all basically doing the same thing.</li>
      </ol>
      <p><abbr>RDF</abbr>, being a graph and therefore structurally actually very simple, is probably the worst-shaped thing to be represented in <abbr>SQL</abbr>. A <q>normal</q> <abbr>RDF</abbr> application will generate orders of magnitude more <abbr>SQL</abbr> queries than a native <abbr>SQL</abbr> application, and they will be the kinds of things that do not lend themselves very well to indexing and fancy joins and the kinds of things that <abbr>SQL</abbr> is very good at. In my experience, if you're going to be putting that much traffic through the database, <span class="parenthesis" title="I suppose you could use a graph database, but what do you think those are made out of?">it's better to use a key-value store.</span></p>
      <aside role="note" id="EvvSdhGYaoTIJGkIWwnO0I">
        <p>I say this because I wrote one and attempted to rehabilitate another <abbr>SQL</abbr>-based <abbr>RDF</abbr> back-end, and gave up because past <span class="parenthesis" title="the typical unit of measure for graph databases">a few thousand statements in storage</span>, the query overhead makes it basically unusable.</p>
      </aside>
      <p>Now, all key-value stores <span class="parenthesis" title="although some are more equal than others">are pretty much made equal</span>. At least as far as their interface is concerned: you plug in an arbitrary string of data and if it's in there, you get another arbitrary string of data back out. The idea is that this process happens very quickly, and you can use it to string together more complex constructs. The basic requirements for a key-value store are:</p>
      <ul>
        <li>More than one key-value dictionary per environment,</li>
        <li>Duplicate keys (non-unique indexes),</li>
        <li>Transactions,</li>
        <li>Some kind of non-brain-damaged cursor mechanism.</li>
      </ul>
      <p>Here are most, if not all of your options:</p>
      <figure id="ExPoxNmE9PNE2Vpc6CTfZJ">
        <table class="rag-end">
          <thead>
            <tr>
              <th>Product</th>
              
              <th>Remarks</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td><a href="https://www.oracle.com/database/berkeley-db/" rel="dct:references">Berkeley DB</a></td>
              
              <td>Oracle, lol</td>
            </tr>
            <tr>
              <td><a href="https://github.com/google/leveldb" rel="dct:references">LevelDB</a></td>
              
              <td>Google, lol</td>
            </tr>
            <tr>
              <td><a href="https://rocksdb.org/" rel="dct:references">RocksDB</a></td>
              
              <td>Facebook, lol</td>
            </tr>
            <tr>
              <td><a href="https://fallabs.com/kyotocabinet/" rel="dct:references">Kyoto/Tokyo Cabinet</a></td>
              
              <td>No concurrent writes</td>
            </tr>
            <tr>
              <td><a href="https://symas.com/lmdb/" rel="dct:references">LMDB</a></td>
              
              <td>My first target</td>
            </tr>
          </tbody>
        </table>
        <figcaption>
          <p>Some embedded key-value databases that exhibit these desirable properties.</p>
        </figcaption>
      </figure>
      <p>I will also add that <em>concurrent writes</em> from different <em>processes</em>, not just threads, are also pretty important, <span class="parenthesis" title="SQLite, after all, does not do this.">but you may not care about that for your application.</span> My other inclination is to focus on directly-attached stores, as introducing a network service also tends to introduce a lot of overhead. Put another way: if you're considering writing an <abbr>RDF</abbr> store, the likely plan is that the network layer, if any, will be implemented by <em>you</em>.</p>
    </section>
    <section id="E8JMBn6kLNvrfXM2RoKRUL">
      <h2>There Is No <dfn>SQLite</dfn> for <abbr>RDF</abbr></h2>
      <p>I have surveyed a number of <abbr>RDF</abbr> storage implementations, and my conclusion is that they are &#x2026;varied. Framework implementers seem to just do whatever is convenient for them and their own highly-localized problem of including a toy storage driver in their software package. The problem, again, is that there about a zillion ways to implement an <abbr>RDF</abbr> quad store on top of a key-value database, which <em>themselves</em> are functionally identical. Meaning that, if one were so inclined, one could come up with an architecture that would hit all key-value stores <em>and</em> all frameworks at once.</p>
      <aside role="note" id="EasQluWWWxVYbWg0R1GnLK">
        <p>There is also no reason why the architecture couldn't be adapted for <abbr>SQL</abbr> and actually provide a sane database layout that performed better than what one normally encounters.</p>
      </aside>
      <p>Consider <dfn>SQLite</dfn>. One of the things that makes it so popular&#x2014;aside from being totally unencumbered by licensing constraints&#x2014;is that it's <em>just so easy</em> to get up and running. Moreover, it has bindings in every conceivable programming language, so two completely different programs can connect to the same <dfn>SQLite</dfn> database and manipulate the exact same data. <abbr>SPARQL</abbr> notwithstanding, <abbr>RDF</abbr> developers do not have a fast and robust embedded database that can be readily shared between different programming languages and frameworks.</p>
      <aside role="note" id="ExWg9IyJMWbJ6Ou1QksWBL">
        <p>This would of course include Java frameworks like <a href="https://jena.apache.org/" rel="dct:references"><dfn>Jena</dfn></a> and <a href="https://rdf4j.org/" rel="dct:references"><dfn>Sesame</dfn>/<dfn>RDF4J</dfn></a>.</p>
      </aside>
    </section>
    <section id="E7fKXBOcPcK2LqgCPS73mL">
      <h2>An Interim Solution</h2>
      <p>The answer, I believe, is to come up with a <em>pattern</em>&#x2014;a <em>specification</em>: encode a set of design decisions about the structure of the underlying database such that it successfully balances efficiency and portability, so that appropriate drivers can be written in any language, against any low-level back end, in only a few hundred lines of code. This is what I have begun to do with <a href="https://github.com/doriantaylor/rb-rdf-lmdb" rel="dct:references"><code>RDF::LMDB</code></a>.</p>
      <p>So far, the prototype seems to perform pretty well. There are plenty of valid criticisms of the design as it currently stands, such as the copious use of full-length <dfn><abbr title="Secure Hash Algorithm">SHA</abbr>-256</dfn> hashes all over the place. The design is likely to change: I'm targeting ease of implementation first, and maximizing efficiency second. These are corners that can be sanded down. Once that's happened, I'll spin out a few <abbr title="Lightning Memory-Mapped Database">LMDB</abbr> drivers for different languages and frameworks, and then I will write a spec.</p>
    </section>
  </body>
</html>
