I get asked sometimes why I cling so stubbornly to the Semantic Web. Before I answer this question, I have to deal with the fact that it is a totally loaded one.

The phrase semantic web, loosely speaking, refers to an overlapping Venn diagram of three distinct concepts, each of which can theoretically be considered without invoking the other two: the grand vision of a machine-comprehensible Web, the narrower technical practice of running reasoners over machine-readable assertions, and the comparatively mundane architectural pattern of Linked Data.

So if somebody utters the phrase semantic web without further qualification—especially to criticize it—they're probably talking about the most audacious and easily straw-mannable of the three.

There is a more technical interpretation of the term Semantic Web, which refers to the use of a reasoner to infer latent information from a set of existing assertions. Reasoners have plenty of practical applications, which I will discuss, but their use also marks the precipice of the abyss into all manner of occult formal logic.

More mundane is Linked Data, which is simply the architectural constraint of giving machine-readable data objects URLs in order to make them directly accessible over the Web, and moreover that said machine-readable data objects themselves contain links to other data objects. This is a perfectly sensible pattern that we can see all over the place, particularly in REST APIs, and need not have anything to do with RDF or the Semantic Web.

The central problem of Linked Data is that without some kind of protocol or appropriate context, you can't necessarily tell:

  1. what is even a link in the first place, versus, say, a text field that merely happens to contain something that looks like a URL, and
  2. what the relation to the link (or to a literal data member, for that matter) actually represents.

To remedy this situation, we need some kind of schema: For (1), we need something that marks a piece of syntax as denoting that this thing here is definitely a link. For (2), we need something that specifies the semantics of the relation: what the thing means. HTML has a reasonable capability for the first, and an extremely limited vocabulary for the second. HTML likewise can't help you much if the data you want to represent is something other than a document. Cue the XML boom of the early 2000s. I can say from experience that whether you're writing a schema in DTD, XML Schema, or RELAX NG, it is not a trivial undertaking. Other people clearly felt similarly, and we got things like Microformats, Microdata, and most recently, JSON Schema.
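
Here is a minimal sketch of the distinction. In plain JSON, a URL-valued field is indistinguishable from a string that merely looks like one; the URLs below are invented for illustration:

    { "author": "https://example.com/people/alice" }

A JSON-LD @context, to pick one of the schema mechanisms in this family, can mark the same field as a genuine link and pin down what the relation means:

    {
      "@context": {
        "author": { "@id": "http://purl.org/dc/terms/creator", "@type": "@id" }
      },
      "@id": "https://example.com/essays/1",
      "author": "https://example.com/people/alice"
    }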

Developing alongside all of this business—sometimes quietly and sometimes not so quietly—is RDF. I first encountered it in , when it had been well underway for many years, but the perception, at least, was that it was still wedded to XML. Indeed, as late as , I got the opportunity to ask a preeminent taxonomist what they thought about RDF, only to get a pooh-pooh response: some silly XML thing.

RDF is not an XML thing. What RDF is, is a URI thing. The Resource Description Framework, being a framework for describing resources, has to reference those resources somehow, so it naturally uses Uniform Resource Identifiers, and it uses them absolutely everywhere.

This is the genius of RDF: Everything is a URI, except when it isn't. And if the URI in question is a dereferenceable URL, then what you automatically get is Linked Data. Slap a reasoner onto a big enough concentration of this material and you get the Semantic Web.
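
By way of a concrete sketch (the example.com URIs are made up), here are a few RDF statements in Turtle. Every subject, predicate, and class is a URI; the one exception is the literal at the end:

    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    <https://example.com/essays/1>
        a dct:BibliographicResource ;
        dct:creator <https://example.com/people/alice> .

    <https://example.com/people/alice>
        a foaf:Person ;
        foaf:name "Alice" .       # a literal: the "except when it isn't" part

If those URIs dereference to documents containing more statements like these, the Linked Data part comes along for free.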

Schemas—or what in the biz are called vocabularies—the specifications that tell both you and the machine what means what, are as readily available as any other open-source software product. Indeed, a number of de facto core vocabularies interact with and build off each other, since the inheritance model is not too different from that of conventional object-oriented programming languages. If you can't find the one you need, RDF vocabularies are much easier to write than something like an XML vocabulary, because you're never defining syntax, only semantics. In other words, you only have to specify classes and properties, never sequences of elements.
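
To give a sense of what writing a vocabulary amounts to, here is a toy example in Turtle (the ex: namespace and its two terms are invented). There is nothing here about element order or document structure, only classes and properties, and it leans on an existing vocabulary the same way a subclass leans on a parent class:

    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix ex:   <https://example.com/vocab#> .

    ex:Employee a rdfs:Class ;
        rdfs:label "Employee" ;
        rdfs:subClassOf foaf:Person .     # build off somebody else's class

    ex:reportsTo a rdf:Property ;
        rdfs:label "reports to" ;
        rdfs:domain ex:Employee ;
        rdfs:range  ex:Employee .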

Speaking of syntax, this is taken care of for you. While its original serialization was XML, RDF has since grown a solid dozen alternative syntaxes; of chief interest are the easily-typed Turtle, the stealthy JSON-LD, and RDFa, which embeds RDF data into other markup languages like HTML, Atom, or SVG.
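
For instance, RDFa lets statements like the ones above ride along inside ordinary HTML attributes (again with invented URLs):

    <p vocab="http://purl.org/dc/terms/"
       resource="https://example.com/essays/1" typeof="BibliographicResource">
      <span property="title">An essay</span> by
      <a property="creator" href="https://example.com/people/alice">Alice</a>.
    </p>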

Why I'm Still Here

Perhaps now, then, after several paragraphs, I can finally articulate why I use this technology. It comes down to something I have come to call the Symbol Management Problem:

In software, and especially in Web development, you will find yourself dealing with a number of symbols, tokens, slugs: identifiers intended to pick out, demarcate, and differentiate different pieces of content for different kinds of processing.

Pretty much every piece of software has these. On the Web we also have a pile of our own: CSS class names, HTML IDs, URL slugs and query keys, form field names, and so on.

Web development is particularly rife with symbols, because at the end of the day, you're just schlepping text. A number of these symbols—CSS class names and HTML IDs, URL query keys and form keys—straddle multiple technical specifications because they are meant to serve as junctions that connect the different technologies together. On a more organizational level, many of these objects correspond to entities and relations in internal databases; classes, properties, and methods in object-oriented code; or objects in legacy or third-party information systems. A significant chunk of the work of Web application development reduces to mapping these disparate objects to one another, usually in an ad-hoc way.

The more symbol dictionaries you have to maintain—assuming you maintain them at all—the more overhead goes into maintaining them and/or dealing with the fallout of sub-par maintenance, and the more effort, and ultimately code, goes into translating between them. In other words, the entropy generated by the proliferation of symbols can actually foreclose on certain opportunities, because it simply becomes too costly to wrangle.

The whole point of using human-readable symbols, and not, say, random strings or numbers, is to have a mnemonic or associative device such that a human being can look at a given symbol and infer to some extent what the thing is supposed to mean. The tendency, therefore, is to make them contain recognizable words. Here we can see how the Symbol Management Problem decomposes into two parts:

  1. Redundancy: when you have two or more terms that mean the same thing.
  2. Collision: when you have the same term that means two or more things.

Both these situations arise when people, teams, organizations, etc. need a word for a distinct concept, and don't sufficiently consult with others in their orbit about what terms are already in use. This is a fundamental information-sharing problem that will occur any time it's easier to make something up than to look something up, and will persist to some degree no matter how good the communication gets. Nevertheless, it can be palliated.

The collision problem is solved through namespaces, which, when they are fully-qualified URIs, cannot collide by design. The redundancy problem can be solved through term reconciliation: essentially denoting, in a machine-readable form, that a certain term in one vocabulary means the same thing as a certain term in another. The general communication problem can be greatly ameliorated by making these terms, which are fully-qualified URIs, actually point to webpages containing their own dual machine/human-readable documentation. These can be published, indexed, and made discoverable. Indeed, in most cases, we can skip over the process of minting our own symbol vocabulary entirely and directly use vocabularies authored by other people.
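
Term reconciliation in particular can be as small as a statement or two; OWL, for example, provides properties for exactly this purpose (the teama: and teamb: vocabularies below are invented stand-ins for two teams' terms):

    @prefix owl:   <http://www.w3.org/2002/07/owl#> .
    @prefix foaf:  <http://xmlns.com/foaf/0.1/> .
    @prefix teama: <https://example.com/vocab#> .
    @prefix teamb: <https://example.org/vocab#> .

    # redundancy, reconciled: three spellings, one meaning
    teama:surname   owl:equivalentProperty foaf:familyName .
    teamb:last_name owl:equivalentProperty foaf:familyName .

    # the namespaces themselves take care of collision: teama:status and
    # teamb:status are different URIs, so they can never be confused.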

Symbol management is all the more important in the age of APIs, when arbitrary data objects are continually being slung across administrative boundaries. The state of the art is that every website with an API also has a documentation section that tells the programmer which field means what, which fields are mandatory, which are optional, which are conditional on others, and what the valid range of values is for each. The programmer then takes this information and writes an adapter, and this process is typically repeated—in the best-case scenario—for every programming language that needs an interface. If the programmer is tying together five APIs, they could easily be doing an ad-hoc five-way reconciliation of slightly different representations of, for example, a user. That seems like a huge waste of effort to me.
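
One way to write that reconciliation down once, in data rather than in five adapters, is to attach a JSON-LD context that maps each API's field names onto a shared vocabulary such as FOAF. The field names and URLs below are invented for illustration:

    {
      "@context": {
        "foaf":       "http://xmlns.com/foaf/0.1/",
        "user_name":  "foaf:name",
        "homepage":   { "@id": "foaf:homepage", "@type": "@id" },
        "avatar_url": { "@id": "foaf:img",      "@type": "@id" }
      },
      "@id": "https://api.example.com/users/42",
      "user_name":  "Alice",
      "homepage":   "https://alice.example.org/",
      "avatar_url": "https://api.example.com/users/42/avatar.png"
    }

A second API with different spellings gets a different context that maps onto the same terms, and the five-way reconciliation collapses into five small mapping documents.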

The standard sales pitch for both the Semantic Web and Linked Data goes something like: you should use it, because once everybody uses it, it will be awesome. That appeal entails a herculean feat of human cooperation and skates over all sorts of vested interests. I submit instead that there needs to be a motivation to use this technology even if nobody else in the world ever adopted it, and I believe the Symbol Management Problem to be just that.

So what have I actually made?

My job from about mid- to mid- involved designing, implementing, and running an XML content pipeline that eventually ended up managing about 120 websites in 15 languages, along with all the mailouts that could be julienned by a dozen different demographic parameters. I then went to work at one of the nascent federated identity providers where I did a lot of API and protocol work. By I had a pretty solid grasp of what XML was good for, and where it fell short.

Early Experiments, Lofty Ambitions

A central theme in my work is to begin with a bulk quantity of raw material and apply successive structure-preserving transformations. By this point, I had already been working with the Web for a decade, and had noticed that most of the desired behaviour could be satisfied with only a handful of operations. If I could design a substrate, I figured, then one would only need to write custom code for the minority of behaviours that the substrate didn't already cover.

Substrate-like frameworks indeed already existed, albeit coupled almost always to Java and always, always, to XML. As I already implied, anything that requires you to scratch-write an XML vocabulary is a non-starter. As for Java, it's something of a Rubicon that a lot of Web developers—myself included—would rather not cross. My idea, after a close read of Roy Fielding's PhD dissertation, was to make a sort of meta-framework that could theoretically be implemented in any language, even mixed and matched between multiple systems.

Instead of XML, the system would speak RDF, and even use content negotiation to select between syntaxes. This was actually a pretty solid plan, except for the fact that unlike an XML document, any RDF serialization, at least at the time, was just a set of statements. There was no way to indicate an initial subject—that is, to connect the content you just downloaded with the location you just downloaded it from: it would be mixed in with all the other data, and there would be no way to tell which URI was the topmost one. I put my master plan on hold and went in search of more tractable problems to solve.
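
To make the problem concrete: suppose you dereference https://example.com/data/alice (an invented URL) and get back the Turtle below. The serialization is just a pile of statements; nothing in it designates which subject the document is about.

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    <https://example.com/people/alice>
        foaf:name  "Alice" ;
        foaf:knows <https://example.com/people/bob> .

    <https://example.com/people/bob>
        foaf:name  "Bob" .

    # Which of the two subjects is the "topmost" one? The syntax won't tell you.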

In I used RDF to record the results of a data analysis process. A packet of raw telemetry data would be injected into a pipeline of tests, whereby the outcome of one test might or might not cause the data to be subjected to subsequent tests. As such, the output was irregularly shaped, but still needed to be structured. It would have been incredibly difficult to pull off using SQL. The process extracted subjects—URIs—from the packet, along with facts about them, which eventually built up a graph. I did this for an employer, so it was never recognized as anything more than an experiment.
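
Roughly, the shape of that output might look like the following, rendered here with a wholly invented vocabulary: two packets run through the same pipeline can come out with different sets of facts attached, which a graph absorbs without complaint.

    @prefix ex: <https://example.com/analysis#> .

    <https://example.com/packet/123>
        ex:passed  ex:ChecksumTest ;
        ex:failed  ex:RangeTest ;
        ex:anomaly [ ex:field "temperature" ; ex:deviation 4.2 ] .  # only exists because RangeTest failed

    <https://example.com/packet/124>
        ex:passed  ex:ChecksumTest , ex:RangeTest .                 # nothing further to record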

Content Robo-Inventory

Around 2009, I wrote a wrapper around the Mercurial version control system as a sort of first crack at an automated content inventory: it would scan the history of a repository to include things like modification dates and naming histories. This work eventually matured into a content inventory vocabulary. The idea was to create a format for recording and exchanging Web content inventories—which of course could be performed programmatically—along with the outcomes of their subsequent audits. The inventory aspect is pretty mature by this point; the audit somewhat less so. This is still an active area of development that I believe has strong implications for the discipline of content strategy.
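
To give the flavour of what such an inventory record might look like, here is a sketch using generic Dublin Core terms rather than the content inventory vocabulary's own; the URLs and dates are invented:

    @prefix dct: <http://purl.org/dc/terms/> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    <https://example.com/products/widget>
        dct:created  "2009-03-02"^^xsd:date ;
        dct:modified "2011-08-17"^^xsd:date ;
        dct:replaces <https://example.com/widget.html> .   # one hop in the page's naming history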

Structured Argumentation

Also in 2009, I had a chance encounter with a salon presentation featuring Douglas Engelbart. In it, he rather nonchalantly tossed out a mention of a thing called structured argumentation, which sounded a lot like it could serve as the basis for the fitness variables depicted in Christopher Alexander's Notes on the Synthesis of Form.

Minimalist graph of Indian village from Notes on the Synthesis of Form

I had also modeled the Indian village in Appendix 2 of the book in order to try to reconstruct Alexander's HIDECS algorithm from its description in Appendix 1.

Structured argumentation—or at least the particular flavour of it that I had alighted on—is a sort of organizational protocol for constraining rhetorical moves in order to do things like, as the authors put it, solve wicked problems. Alexander was also trying to solve complex problems: to compute an architectural program—that is to say, a project plan—through a topological analysis of the hairball of concerns, so these two ideas fit together quite naturally. The Engelbart connection is of course the use of an interactive hypermedia system to manipulate the thing.

An RDF vocabulary for the strain of structured argumentation called IBIS (Issue-Based Information System) had already been written, but then one day it disappeared. So in , after vacillating for years, I decided to replace it.
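
For orientation, the IBIS model itself is tiny: issues, positions that respond to them, and arguments that support or oppose positions. A sketch in Turtle, with a placeholder namespace and term spellings that are not necessarily the vocabulary's own:

    @prefix ibis: <https://example.com/ibis#> .
    @prefix dct:  <http://purl.org/dc/terms/> .

    <#issue-1>    a ibis:Issue ;    dct:title "How should the data be stored?" .
    <#position-1> a ibis:Position ; dct:title "Keep it in an RDF graph." ;
                  ibis:respondsTo <#issue-1> .
    <#argument-1> a ibis:Argument ; dct:title "Irregular structures are painful in SQL." ;
                  ibis:supports <#position-1> .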

RDF-KV and the IBIS Tool

A year later, in , I was working on a project where I planned to use an RDF graph as the main database. The idea, hearkening back to my substrate plan, was that I could greatly abridge basic CRUD development by speaking what are effectively RDF diffs—sets of statements to add or remove—directly to the server. Thus I designed a protocol and reference implementation I called RDF-KV.

The protocol works by embedding commands into the keys of HTML forms, such that the values, when supplied by the user, complete RDF statements, with a flag to indicate whether the statement should be added or removed. The protocol is dead simple by design, and can be implemented with regular expressions. The net effect is you can put a single catch-all POST handler on the server, and manipulate your CRUD behaviour just by changing your HTML.
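
I won't reproduce the actual key grammar here; the following is only a hypothetical illustration of the general shape, with a made-up key syntax rather than RDF-KV's real one. A predicate rides in the key, the user's input completes the statement, and a flag marks a removal:

    <!-- Hypothetical sketch: the key syntax is invented, not RDF-KV's actual grammar. -->
    <form method="POST" action="">
      <!-- adds: <this page> dct:title "whatever the user types" -->
      <input type="text" name="dct:title" value="">

      <!-- removes: <this page> dct:creator <https://example.com/people/alice> -->
      <input type="hidden" name="- dct:creator" value="https://example.com/people/alice">

      <button>Save</button>
    </form>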

Once I had the protocol, I needed to test it. I had yet to produce any vocabularies or instance data for the client, because part of the plan was that I would make a tool using the protocol to construct that data. I needed a complete vocabulary to write an app against, so I dragged my IBIS vocabulary out of mothballs and, in a couple of weeks spent mostly fiddling with the UI, I had a reasonably serviceable structured argumentation tool. The original project for which I had designed the protocol eventually became a casualty of intraorganizational politics, but the IBIS prototype remains. Here it is: