I get asked sometimes why I cling so stubbornly to the Semantic Web. Before I answer this question, I have to deal with the fact that it is a totally loaded one.
The phrase semantic web, loosely speaking, refers to an overlapping Venn diagram of three distinct concepts, each of which can theoretically be considered without invoking the other two.
So if somebody utters the phrase semantic web without further qualification—especially to criticize it—they're probably talking about the most audacious and easily straw-mannable of the three.
There is a more technical interpretation of the term Semantic Web, which refers to the use of a reasoner to infer latent information from a set of existing assertions. Reasoners have plenty of practical applications, which I will discuss, but the use of a reasoner is also what marks the precipice of the abyss into all manner of occult formal logic.
More mundane is Linked Data, which is simply the architectural constraint of giving machine-readable data objects URLs, so as to make them directly accessible over the Web, and of having those data objects themselves contain links to other data objects. This is a perfectly sensible pattern that we can see all over the place, particularly in REST APIs, and it need not have anything to do with RDF or the Semantic Web.
The central problem of Linked Data is that without some kind of protocol or appropriate context, you can't necessarily tell: (a) whether a given piece of data is a link at all, and (b) what the relation that link represents actually means.
To remedy this situation, we need some kind of schema. For (a), we need something that marks a piece of syntax as denoting "this thing here is definitely a link". For (b), we need something that specifies the semantics of the relation: what the thing means. HTML has a reasonable capability for the first, and an extremely limited vocabulary for the second. HTML likewise can't help you much if the data you want to represent is something other than a document. Cue the XML boom of the early . I can say from experience that whether you're writing a schema in DTD, XML Schema, or RELAX NG, it is not a trivial undertaking. Other people clearly felt similarly, and we got things like Microformats, Microdata, and, most recently, JSON Schema.
Developing alongside all of this business—sometimes quietly and sometimes not so quietly—is RDF. I first encountered it in , when it had been well underway for many years, but the perception, at least, was that it was still wedded to XML. Indeed, as late as , I got the opportunity to ask a preeminent taxonomist what they thought about RDF, only to get a pooh-pooh response: "some silly XML thing".
RDF is not an XML thing. What RDF is, is a URI thing. The Resource Description Framework, being a framework for describing resources, has to reference those resources somehow, so it naturally uses Uniform Resource Identifiers, and it uses them absolutely everywhere.
This is the genius of RDF: Everything is a URI, except when it isn't. And if the URI in question is a dereferenceable URL, then what you automatically get is Linked Data. Slap a reasoner onto a big enough concentration of this material and you get the Semantic Web.
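The idea can be sketched in a few lines of Python. Every URI below is invented for illustration except the FOAF property URIs, which are real: a statement is just a triple of terms, and any term that happens to be a dereferenceable URL is a link you could follow to fetch more data.

```python
# A statement is a (subject, predicate, object) triple; in RDF proper,
# subjects and predicates are URIs, and objects are URIs or literals.
# The example.com/example.org URIs are made up; the FOAF URIs are real.
triples = {
    ("https://example.com/alice", "http://xmlns.com/foaf/0.1/knows",
     "https://example.org/bob"),
    ("https://example.com/alice", "http://xmlns.com/foaf/0.1/name",
     "Alice"),  # a literal value, not a URI
}

def is_url(term):
    """Crude check for a dereferenceable URL."""
    return term.startswith(("http://", "https://"))

# Every URL-valued object is Linked Data: a place to go fetch more triples.
links = {o for (s, p, o) in triples if is_url(o)}
print(links)  # {'https://example.org/bob'}
```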
Schemas—or what in the biz are called vocabularies, the specifications that tell both you and the machine what means what—are as readily available as any other open-source software product. Indeed, a number of de facto core vocabularies interact with and build off each other, as the inheritance model is not too different from that of conventional object-oriented programming languages. If you can't find the one you need, RDF vocabularies are much easier to write than something like an XML vocabulary, because you're never defining syntax, only semantics. In other words, you only have to specify classes and properties, never sequences of elements.
Speaking of syntax, this is taken care of for you. While it originated in XML, RDF has since grown a solid dozen alternative syntaxes; of chief interest are the easily-typed Turtle, the stealthy JSON-LD, and RDFa, which embeds RDF data into other markup languages like HTML, Atom, or SVG.
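To make that concrete, here is the same single statement in two of those syntaxes. The subject URI is invented; the FOAF property is real. The JSON-LD version is plain JSON, which is what makes it stealthy: ordinary JSON tooling can carry it around with no RDF machinery in sight.

```python
import json

# One statement, written in Turtle...
turtle = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<https://example.com/alice> foaf:name "Alice" .
"""

# ...and in JSON-LD, which parses with a stock JSON parser.
jsonld = json.loads("""{
  "@id": "https://example.com/alice",
  "http://xmlns.com/foaf/0.1/name": "Alice"
}""")
print(jsonld["@id"])  # https://example.com/alice
```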
Perhaps now, then, after several paragraphs, I will finally articulate why I use this technology. It comes down to something I have come to call the Symbol Management Problem:
In software, and especially in Web development, you will find yourself dealing with a number of symbols, tokens, slugs—identifiers intended to pick out, demarcate, and differentiate pieces of content for different kinds of processing.
Pretty much everything has that. On the Web we also have, among other things, data-* attributes.

Web development is particularly rife with symbols, because at the end of the day, you're just schlepping text. A number of these symbols—CSS class names and HTML IDs, URL query keys and form keys—straddle multiple technical specifications because they are meant to serve as junctions connecting the different technologies together. On a more organizational level, many of these objects correspond to entities and relations in internal databases; classes, properties, and methods in object-oriented code; or objects in legacy or third-party information systems. A significant chunk of the work of Web application development reduces to mapping these disparate objects to one another, usually in an ad-hoc way.
The more symbol dictionaries you have to maintain—assuming you maintain them at all—the more overhead goes into maintaining them and/or dealing with the fallout of sub-par maintenance, and the more effort, and ultimately code, goes into translating between them. In other words, the entropy generated by the proliferation of symbols can actually foreclose on certain opportunities, because it simply becomes too costly to wrangle.
The whole point of using human-readable symbols, and not, say, random strings or numbers, is to have a mnemonic or associative device such that a human being can look at a given symbol and infer to some extent what the thing is supposed to mean. The tendency, therefore, is to make them contain recognizable words. Here we can see how the Symbol Management Problem decomposes into two parts: collisions, where the same term gets used to mean different things, and redundancy, where different terms get used to mean the same thing.
Both these situations arise when people, teams, organizations, etc. need a word for a distinct concept, and don't sufficiently consult with others in their orbit about what terms are already in use. This is a fundamental information-sharing problem that will occur any time it's easier to make something up than to look something up, and will persist to some degree no matter how good the communication gets. Nevertheless, it can be palliated.
The collision problem is solved through namespaces, which, when they are fully-qualified URIs, cannot collide by design. The redundancy problem can be solved through term reconciliation: essentially denoting, in a machine-readable form, that a certain term in one vocabulary means the same thing as a certain term in another. The general communication problem can be greatly ameliorated by making these terms—which are fully-qualified URIs—actually point to webpages containing their own dual machine/human-readable documentation. These can be published, indexed, and made discoverable. Indeed, in most cases, we can skip the process of minting our own symbol vocabulary entirely and directly use vocabularies authored by other people.
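Here is a toy sketch of both remedies. The example namespaces are invented; only the OWL property URI is real (owl:equivalentProperty is the standard way to declare two properties synonymous).

```python
# Two teams independently coin a "title" term. As full URIs, the terms
# cannot collide; these namespaces are invented for illustration.
ACME = "https://vocab.acme.example/"
WIDGETCO = "https://terms.widgetco.example/"
acme_title = ACME + "title"
widgetco_title = WIDGETCO + "title"

# Redundancy is handled by a machine-readable reconciliation statement;
# owl:equivalentProperty is the real-world property for this job.
EQUIV = "http://www.w3.org/2002/07/owl#equivalentProperty"
reconciliation = {(widgetco_title, EQUIV, acme_title)}

def canonical(term, recon):
    """Map a term to its canonical equivalent (one hop, for brevity)."""
    for s, p, o in recon:
        if p == EQUIV and s == term:
            return o
    return term

print(canonical(widgetco_title, reconciliation) == acme_title)  # True
```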
Symbol management is all the more important in the age of APIs, when arbitrary data objects are continually being slung across administrative boundaries. The state of the art is that every website with an API also has a documentation section telling the programmer which field means what: which fields are mandatory, which are optional, which are conditional on others, and what the valid range of values is for each. The programmer then takes this information and writes an adapter, and this process is typically repeated—in the best-case scenario—for every programming language that needs an interface. If the programmer is tying together five APIs, they could easily be doing an ad-hoc five-way reconciliation of slightly different representations of, say, a user. That seems like a huge waste of effort to me.
The standard sales pitch for both the Semantic Web and Linked Data goes something like "you should use it because once everybody uses it, it will be awesome". That appeal entails a herculean feat of human cooperation and skates over all sorts of vested interests. I submit instead that there needs to be a motivation to use this technology even if nobody else in the world ever adopted it, and I believe the Symbol Management Problem to be just that.
My job from about mid- to mid- involved designing, implementing, and running an XML content pipeline that eventually ended up managing about 120 websites in 15 languages, along with all the mailouts that could be julienned by a dozen different demographic parameters. I then went to work at one of the nascent federated identity providers where I did a lot of API and protocol work. By I had a pretty solid grasp of what XML was good for, and where it fell short.
A central theme in my work is to begin with a bulk quantity of raw material and apply successive structure-preserving transformations. By this point, I had already been working with the Web for a decade, and had by then noticed that most of the desired behaviour can be satisfied with only a handful of operations. If I could design a substrate, I figured, then one would only need to write custom code for the minority of behaviours that the substrate didn't already cover.
Substrate-like frameworks indeed already existed, albeit coupled almost always to Java and always to XML. As I already implied, anything that requires you to scratch-write an XML vocabulary is a non-starter. As for Java, it's something of a Rubicon that a lot of Web developers—myself included—would rather not cross. My idea, after a close read of Roy Fielding's PhD dissertation, was to make a sort of meta-framework that could theoretically be implemented in any language, even mixed and matched between multiple systems.
Instead of XML, the system would speak RDF, and even use content negotiation to select between syntaxes. This was actually a pretty solid plan, except for the fact that unlike an XML document, any RDF serialization, at least at the time, was just a set of statements. There was no way to indicate an initial subject—that is, no way to connect the content you just downloaded to the location you just downloaded it from: it would be mixed in with all the other data, and there would be no way to tell which URI was the topmost one. I put my master plan on hold and went in search of more tractable problems to solve.
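For the record, the problem looks something like this (all URIs invented): given nothing but a set of statements, even the obvious heuristic of picking the subject that never appears as an object can leave you with several candidates.

```python
EX = "https://example.com/"  # invented namespace

# A downloaded serialization is just a set of statements; nothing in the
# data itself marks the "initial" subject.
triples = {
    (EX + "a", EX + "links-to", EX + "b"),
    (EX + "b", EX + "links-to", EX + "c"),
    (EX + "d", EX + "links-to", EX + "c"),
}

subjects = {s for s, p, o in triples}
objects = {o for s, p, o in triples}
roots = subjects - objects  # two candidates: which one is "topmost"?
print(sorted(roots))  # ['https://example.com/a', 'https://example.com/d']
```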
In I used RDF to record the results of a data analysis process. A packet of raw telemetry data would be injected into a pipeline of tests, whereby the outcome of one test may or may not cause the data to be subjected to subsequent tests. As such, the output was irregularly-shaped but still needed to be structured. It would have been incredibly difficult to pull off using SQL. The process extracted subjects—URIs—from the packet along with facts about them which eventually built up a graph. I did this for an employer so it was never recognized as anything more than an experiment.
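Nothing of the actual system survives to show, but the pattern itself is simple enough to sketch (all names and URIs here are invented): each test can both assert facts about a subject and decide which tests run next, so the resulting graph is exactly as irregular as the data warrants.

```python
EX = "https://example.com/ns/"  # invented namespace

def check_origin(packet, graph):
    graph.add((packet["source"], EX + "seen", "true"))
    # Only packets carrying a payload are routed to the deeper test.
    return [check_payload] if packet.get("payload") else []

def check_payload(packet, graph):
    graph.add((packet["source"], EX + "payload-size",
               str(len(packet["payload"]))))
    return []

def run(packet):
    """Push a packet through the test pipeline, accumulating a graph."""
    graph, queue = set(), [check_origin]
    while queue:
        queue.extend(queue.pop(0)(packet, graph))
    return graph

# A packet with a payload yields two facts; one without yields only one.
print(len(run({"source": "https://example.com/sensor/1", "payload": "abc"})))  # 2
```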
Around , I wrote a wrapper around the Mercurial version control system as a sort of first crack at an automated content inventory: it would scan the history of a repository to capture things like modification dates and naming histories. This work eventually matured into a content inventory vocabulary. The idea was to create a format for recording and exchanging Web content inventories—which of course could be performed programmatically—along with the outcomes of their subsequent audits. The inventory aspect is pretty mature by this point; the audit somewhat less so. This is still an active area of development that I believe has strong implications for the discipline of content strategy.
Also in 2009, I chanced upon a salon presentation featuring Douglas Engelbart. In it, he rather nonchalantly tossed out a mention of a thing called structured argumentation, which sounded a lot like it could serve as the basis for the fitness variables depicted in Christopher Alexander's Notes on the Synthesis of Form.
Structured argumentation—or at least the particular flavour of it that I had alighted on—is a sort of organizational protocol of constraining rhetorical moves in order to do things like, as the authors put it, solve wicked problems. Alexander was also trying to solve complex problems: compute an architectural program—that is to say a project plan—through a topological analysis of the hairball of concerns, so these two ideas fit together quite naturally. The Engelbart connection is of course the use of an interactive hypermedia system to manipulate the thing.
An RDF vocabulary for the strain of structured argumentation called IBIS—Issue-Based Information System—had already been written, until one day it disappeared. So in , after vacillating for years, I decided to replace it.
A year later, in , I was working on a project where I planned to use an RDF graph as the main database. The idea, hearkening back to my substrate plan, was that I could greatly abridge basic CRUD development by speaking what are effectively RDF diffs—sets of statements to add or remove—directly to the server. Thus I designed a protocol and reference implementation I called RDF-KV.
The protocol works by embedding commands into the keys of HTML forms, such that the values, when supplied by the user, complete RDF statements, with a flag to indicate whether the statement should be added or removed. The protocol is dead simple by design, and can be implemented with regular expressions. The net effect is that you can put a single catch-all POST handler on the server, and manipulate your CRUD behaviour just by changing your HTML.
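The actual RDF-KV specification defines its own key grammar, which I won't reproduce from memory; the following is a deliberately simplified toy that only illustrates the shape of the idea. A flag plus a subject and predicate live in the form key, and the user-supplied value completes the statement.

```python
import re

# Toy key grammar (NOT the real RDF-KV syntax): "<flag> <subject> <predicate>",
# where the flag is + to add the completed statement or - to remove it.
KEY = re.compile(r"^([+-])\s+(\S+)\s+(\S+)$")

def apply_form(graph, form):
    """Apply {key: value} form data to a set of (s, p, o) triples."""
    for key, value in form.items():
        m = KEY.match(key)
        if not m:
            continue  # not a command key; ignore it
        flag, s, p = m.groups()
        (graph.add if flag == "+" else graph.discard)((s, p, value))
    return graph

g = set()
apply_form(g, {"+ https://example.com/issue/1 https://example.com/ns/label": "Fix it"})
apply_form(g, {"- https://example.com/issue/1 https://example.com/ns/label": "Fix it"})
print(len(g))  # 0 -- the second form undid the first
```

A real implementation would also have to cope with repeated keys and blank nodes, but the single catch-all POST handler falls out exactly as described: it never needs to know what any particular form is for.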
Once I had the protocol, I needed to test it. I had yet to produce any vocabularies or instance data for the client, because part of the plan was that I would make a tool using the protocol to construct that data. I needed a complete vocabulary to write an app against, so I dragged my IBIS vocabulary out of mothballs and in a couple of weeks, spent mostly fiddling with UI, I had a reasonably serviceable structured argumentation tool. The original project for which I had designed the protocol eventually became a casualty of intraorganizational politics, but the IBIS prototype remains.
The IBIS tool is a rather crude demonstration of a through-and-through RDF Web application. Graph statements that come in through the RDF-KV protocol go directly into a triple store, and when they come back out, they are rendered as RDFa. I call the demonstration crude because it is incapable of handling arbitrary data objects—that will have to wait for the inevitable rewrite. Nevertheless, we can see in this prototype a significant dent in the Symbol Management Problem.
In particular, the tool demonstrates the use of embedded RDFa as CSS selectors: RDFa naturally identifies a subtree of an (X)HTML document with a subject, and/or one or more predicates, and/or one or more classes or datatypes. This is almost always enough information to attach styling directives directly, through attribute selectors, and it is what affords the tool its wild palette.
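A hypothetical fragment shows the trick; the prefix URI and property names here are illustrative rather than quoted from the actual tool. The RDFa attributes that carry the data double as styling hooks, so no extra class names need inventing:

```html
<div prefix="ibis: https://vocab.example/ibis#"
     typeof="ibis:Issue" resource="https://example.com/issue/1">
  <h2 property="rdfs:label">Fix the thing</h2>
</div>

<style>
  /* Attribute selectors target the RDFa directly. */
  [typeof~="ibis:Issue"]   { border-left: 4px solid crimson; }
  [property~="rdfs:label"] { font-variant: small-caps; }
</style>
```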
A bare-bones (X)HTML+RDFa document is at once an extremely well-defined development target and a terrifically versatile piece of raw material. When you write a piece of server-side code, you write it for consumption by downstream processes. You aren't creating a page so much as a patch of the graph, originating at the request-URI and featuring its immediate topological neighbours. The document's markup structure is heavily constrained by the statements you're trying to render, and for the reasons aforementioned, there aren't a lot of other decisions to make about things like CSS class names. When you're finished fashioning one of these resources—or perhaps a function that generates them according to supplied parameters—it goes into the Lego pile where it can be consumed by and composed into other resources. I made an entire Web app this way.
I have an ongoing project developing an intranet for a long-term client in the nonprofit sector. I add a little bit more to it at every conjunction of budget and availability, an arrangement they seem to be happy with. Indeed, it's part of the reason why I came up with the pattern: they don't have—or at least wouldn't be prudent to spend—the resources to develop software the conventional way, and I need a simple design that won't go obsolete between when I put it down and when I pick it back up again.
The project mainly consists of a set of tools for comprehending a whack of HR data: lists, charts, and individual members. The former two share a control panel the client uses to filter the data. The control panel is constructed from a repurposed OWL ontology and SKOS concept scheme that together describe all the idiosyncratic terms and coded properties peculiar to the organization. The chart generator takes HTML tables with an embedded Data Cube structure which it uses to negotiate the appropriate transformation into SVG.
To reiterate, the technical innovations of this project are born mainly of resource constraints. It is the way it is because it would be too much of a mess for a single person to manage otherwise. And as much as I would love to show this thing off, it's an intranet that browses through and visualizes reams of confidential personal information, so you don't get to see it. I'll have to show you something else.
My personal website is not only a place to write, but also a fairly large body of content that can't refuse my Frankenstein experiments. Historically it has not been very sophisticated because I am spectacularly lazy, though as a byproduct of this laziness I stumbled across a useful technique: Every Web browser going back to Microsoft Internet Explorer 5.5 has an embedded XSLT 1.0 transformation engine. XSLT is not the slightest bit picky about the markup it consumes, and will happily transform (X)HTML into itself. It therefore makes a perfectly good, fast, and perfectly lazy page composition and template processor.
Most websites tend to have ancillary content that is repeated on every page, and so does mine. Going back as far as or , I solved this by putting the ancillary content on its own page, and then putting a <link> to it in the <head> of each document, using HTML's limited set of semantic relations to disambiguate those links from any others. I would then transclude the linked content using the document() function in XSLT.
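The gist of that arrangement can be sketched in XSLT 1.0. The rel value below is illustrative (in practice I leaned on HTML's own link relations), and the template is a fragment, not a complete stylesheet:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <!-- Prepend the body of whatever document the <link> in the <head>
       points to; document() fetches and parses it on the fly. -->
  <xsl:template match="xhtml:body">
    <xsl:copy>
      <xsl:apply-templates select="document(
          /xhtml:html/xhtml:head/xhtml:link[@rel='contents']/@href
        )/xhtml:html/xhtml:body/node()"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```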
The problem with this approach is that both the method of resolving the links and that of how to insert them into the document are brittle and ad-hoc. Since I was using the technique on the aforementioned intranet project and the little extranets I make to share materials with my clients, I felt it was important to generalize it. I made two XSLT libraries: one to query an RDFa document, and another to do the transclusion.
These libraries worked great for my other projects, but my own site itself is just plain handwritten XHTML. And the content inventory metadata I mentioned? I needed a sustainable way to reinject that data back into the markup. And so RDF::SAK was born.
The Swiss Army Knife is a library whose purpose is mainly to act as a breadboard prototype for an agglomerate of desirable operations. For the moment it handles weaving RDF data back into plain Web pages, mapping resources from durable URIs to more evanescent Web URLs and handling their naming histories, generating Atom feeds and various indexes, and a few other mundane chores. It currently takes the form of a static website generator. Its proximate goal is to generate output I can use to make websites—beginning with my own—more hypertext-y.
As somebody who writes a lot for work and reads for it even more, I am growing increasingly dissatisfied with the sparsity—the clunkiness—of digital text. Technical manuals are cluttered with preamble and exposition, while their jargon glossaries, if they exist, are tucked out of sight. News articles still don't let you pivot by person, organization, or macro-event—social network analyses and multi-story timelines only seem to appear as special features. Academic papers still require you to dig out their references by hand. It's often easier to write a passage a second time than it is to locate where you had written it previously—and even if you did, your only option is to duplicate it, rather than just reference the original in-line. Documents are continually lapsing out of date with no connection to the most recent version. Quantitative arguments are still heavily rhetorically leveraged when they could simply be interactively demonstrated in situ.
I consider myself to be in the comprehension business. My professional objective is to remove the obstacles that slow down the uptake of knowledge. To remove the obstacles to knowledge, we must increase the paths through information. The more paths—links—the more complexity. That complexity needs to be managed, and for this job I have yet to encounter a candidate more effective than RDF.