I get asked sometimes why I cling so stubbornly to the Semantic Web. Before I answer this question, I have to deal with the fact that it is a totally loaded one.
The phrase semantic web, loosely speaking, refers to an overlapping Venn diagram of three distinct concepts, each of which can theoretically be considered without invoking the other two.
So if somebody utters the phrase semantic web without further qualification—especially to criticize it—they're probably talking about the most audacious and easily straw-mannable of the three.
There is a more technical interpretation of the term Semantic Web, which refers to the use of a reasoner to infer latent information from a set of existing assertions. Reasoners have plenty of practical applications, which I will discuss, but the use of a reasoner is also what marks the precipice of the abyss into all manner of occult formal logic.
More mundane is Linked Data, which is simply the architectural constraint of giving machine-readable data objects URLs, so as to make them directly accessible over the Web, and of having those data objects themselves contain links to other data objects. This is a perfectly sensible pattern that we can see all over the place, particularly in REST APIs, and it need not have anything to do with RDF or the Semantic Web.
The central problem of Linked Data is that without some kind of protocol or appropriate context, you can't necessarily tell: (a) whether a given piece of data is a link at all, and (b) what the relation that link represents actually means.
To remedy this situation, we need some kind of schema. For (a), we need something that marks a piece of syntax as denoting "this thing here is definitely a link". For (b), we need something that specifies the semantics of the relation: what the thing means. HTML has a reasonable capability for the first, and an extremely limited vocabulary for the second. HTML likewise can't help you much if the data you want to represent is something other than a document. Cue the XML boom of the early . I can say from experience that whether you're writing a schema in DTD, XML Schema, or RELAX NG, it is not a trivial undertaking. Other people clearly felt similarly, and we got things like Microformats, Microdata, and, most recently, JSON Schema.
Developing alongside all of this business—sometimes quietly and sometimes not so quietly—is RDF. I first encountered it in , when it had been well underway for many years, but the perception, at least, was that it was still wedded to XML. Indeed, as late as , I got the opportunity to ask a preeminent taxonomist what they thought about RDF, only to get a pooh-pooh response: "some silly XML thing".
RDF is not an XML thing. What RDF is, is a URI thing. The Resource Description Framework, being a framework for describing resources, has to reference those resources somehow, so it naturally uses Uniform Resource Identifiers, and it uses them absolutely everywhere.
This is the genius of RDF: Everything is a URI, except when it isn't. And if the URI in question is a dereferenceable URL, then what you automatically get is Linked Data. Slap a reasoner onto a big enough concentration of this material and you get the Semantic Web.
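The idea can be sketched in a few lines of Python. Every URI below is invented for illustration except the FOAF property URIs, which are real: a statement is just a triple of terms, and any term that happens to be a dereferenceable URL is a link you could follow to fetch more data.

```python
# A statement is a (subject, predicate, object) triple; in RDF proper,
# subjects and predicates are URIs, and objects are URIs or literals.
# The example.com/example.org URIs are made up; the FOAF URIs are real.
triples = {
    ("https://example.com/alice", "http://xmlns.com/foaf/0.1/knows",
     "https://example.org/bob"),
    ("https://example.com/alice", "http://xmlns.com/foaf/0.1/name",
     "Alice"),  # a literal value, not a URI
}

def is_url(term):
    """Crude check for a dereferenceable URL."""
    return term.startswith(("http://", "https://"))

# Every URL-valued object is Linked Data: a place to go fetch more triples.
links = {o for (s, p, o) in triples if is_url(o)}
print(links)  # {'https://example.org/bob'}
```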
Schemas—or what in the biz are called vocabularies, the specifications that tell both you and the machine what means what—are as readily available as any other open-source software product. Indeed, a number of de facto core vocabularies interact with and build off each other, as the inheritance model is not too different from that of conventional object-oriented programming languages. If you can't find the one you need, RDF vocabularies are much easier to write than something like an XML vocabulary, because you're never defining syntax, only semantics. In other words, you only have to specify classes and properties, never sequences of elements.
Speaking of syntax, this is taken care of for you. While it originated in XML, RDF has since grown a solid dozen alternative syntaxes; of chief interest are the easily-typed Turtle, the stealthy JSON-LD, and RDFa, which embeds RDF data into other markup languages like HTML, Atom, or SVG.
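To make that concrete, here is the same single statement in two of those syntaxes. The subject URI is invented; the FOAF property is real. The JSON-LD version is plain JSON, which is what makes it stealthy: ordinary JSON tooling can carry it around with no RDF machinery in sight.

```python
import json

# One statement, written in Turtle...
turtle = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<https://example.com/alice> foaf:name "Alice" .
"""

# ...and in JSON-LD, which parses with a stock JSON parser.
jsonld = json.loads("""{
  "@id": "https://example.com/alice",
  "http://xmlns.com/foaf/0.1/name": "Alice"
}""")
print(jsonld["@id"])  # https://example.com/alice
```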
Perhaps now, then, after several paragraphs, I will finally articulate why I use this technology. It comes down to something I have come to call the Symbol Management Problem:
In software, and especially in Web development, you will find yourself dealing with a number of symbols, tokens, slugs—identifiers intended to pick out, demarcate, and differentiate pieces of content for different kinds of processing.
Pretty much everything has that. On the Web we also have, among other things, data-* attributes.

Web development is particularly rife with symbols, because at the end of the day, you're just schlepping text. A number of these symbols—CSS class names and HTML IDs, URL query keys and form keys—straddle multiple technical specifications because they are meant to serve as junctions connecting the different technologies together. On a more organizational level, many of these objects correspond to entities and relations in internal databases; classes, properties, and methods in object-oriented code; or objects in legacy or third-party information systems. A significant chunk of the work of Web application development reduces to mapping these disparate objects to one another, usually in an ad-hoc way.
The more symbol dictionaries you have to maintain—assuming you maintain them at all—the more overhead goes into maintaining them and/or dealing with the fallout of sub-par maintenance, and the more effort, and ultimately code, goes into translating between them. In other words, the entropy generated by the proliferation of symbols can actually foreclose on certain opportunities, because it simply becomes too costly to wrangle.
The whole point of using human-readable symbols, and not, say, random strings or numbers, is to have a mnemonic or associative device such that a human being can look at a given symbol and infer to some extent what the thing is supposed to mean. The tendency, therefore, is to make them contain recognizable words. Here we can see how the Symbol Management Problem decomposes into two parts: collisions, where the same term gets used to mean different things, and redundancy, where different terms get used to mean the same thing.
Both these situations arise when people, teams, organizations, etc. need a word for a distinct concept, and don't sufficiently consult with others in their orbit about what terms are already in use. This is a fundamental information-sharing problem that will occur any time it's easier to make something up than to look something up, and will persist to some degree no matter how good the communication gets. Nevertheless, it can be palliated.
The collision problem is solved through namespaces, which, when they are fully-qualified URIs, cannot collide by design. The redundancy problem can be solved through term reconciliation: essentially denoting, in a machine-readable form, that a certain term in one vocabulary means the same thing as a certain term in another. The general communication problem can be greatly ameliorated by making these terms—which are fully-qualified URIs—actually point to webpages containing their own dual machine/human-readable documentation. These can be published, indexed, and made discoverable. Indeed, in most cases, we can skip the process of minting our own symbol vocabulary entirely and directly use vocabularies authored by other people.
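Here is a toy sketch of both remedies. The example namespaces are invented; only the OWL property URI is real (owl:equivalentProperty is the standard way to declare two properties synonymous).

```python
# Two teams independently coin a "title" term. As full URIs, the terms
# cannot collide; these namespaces are invented for illustration.
ACME = "https://vocab.acme.example/"
WIDGETCO = "https://terms.widgetco.example/"
acme_title = ACME + "title"
widgetco_title = WIDGETCO + "title"

# Redundancy is handled by a machine-readable reconciliation statement;
# owl:equivalentProperty is the real-world property for this job.
EQUIV = "http://www.w3.org/2002/07/owl#equivalentProperty"
reconciliation = {(widgetco_title, EQUIV, acme_title)}

def canonical(term, recon):
    """Map a term to its canonical equivalent (one hop, for brevity)."""
    for s, p, o in recon:
        if p == EQUIV and s == term:
            return o
    return term

print(canonical(widgetco_title, reconciliation) == acme_title)  # True
```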
Symbol management is all the more important in the age of APIs, when arbitrary data objects are continually being slung across administrative boundaries. The state of the art is that every website with an API also has a documentation section telling the programmer which field means what: which fields are mandatory, which are optional, which are conditional on others, and what the valid range of values is for each. The programmer then takes this information and writes an adapter, and this process is typically repeated—in the best-case scenario—for every programming language that needs an interface. If the programmer is tying together five APIs, they could easily be doing an ad-hoc five-way reconciliation of slightly different representations of, say, a user. That seems like a huge waste of effort to me.
The standard sales pitch for both the Semantic Web and Linked Data goes something like "you should use it because once everybody uses it, it will be awesome". That appeal entails a herculean feat of human cooperation and skates over all sorts of vested interests. I submit instead that there needs to be a motivation to use this technology even if nobody else in the world ever adopted it, and I believe the Symbol Management Problem to be just that.
My job from about mid- to mid- involved designing, implementing, and running an XML content pipeline that eventually ended up managing about 120 websites in 15 languages, along with all the mailouts that could be julienned by a dozen different demographic parameters. I then went to work at one of the nascent federated identity providers where I did a lot of API and protocol work. By I had a pretty solid grasp of what XML was good for, and where it fell short.
A central theme in my work is to begin with a bulk quantity of raw material and apply successive structure-preserving transformations. By this point, I had already been working with the Web for a decade, and had by then noticed that most of the desired behaviour can be satisfied with only a handful of operations. If I could design a substrate, I figured, then one would only need to write custom code for the minority of behaviours that the substrate didn't already cover.
Substrate-like frameworks indeed already existed, albeit coupled almost always to Java and always to XML. As I already implied, anything that requires you to scratch-write an XML vocabulary is a non-starter. As for Java, it's something of a Rubicon that a lot of Web developers—myself included—would rather not cross. My idea, after a close read of Roy Fielding's PhD dissertation, was to make a sort of meta-framework that could theoretically be implemented in any language, even mixed and matched between multiple systems.
Instead of XML, the system would speak RDF, and even use content negotiation to select between syntaxes. This was actually a pretty solid plan, except for the fact that unlike an XML document, any RDF serialization, at least at the time, was just a set of statements. There was no way to indicate an initial subject—that is, no way to connect the content you just downloaded to the location you just downloaded it from: it would be mixed in with all the other data, and there would be no way to tell which URI was the topmost one. I put my master plan on hold and went in search of more tractable problems to solve.
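For the record, the problem looks something like this (all URIs invented): given nothing but a set of statements, even the obvious heuristic of picking the subject that never appears as an object can leave you with several candidates.

```python
EX = "https://example.com/"  # invented namespace

# A downloaded serialization is just a set of statements; nothing in the
# data itself marks the "initial" subject.
triples = {
    (EX + "a", EX + "links-to", EX + "b"),
    (EX + "b", EX + "links-to", EX + "c"),
    (EX + "d", EX + "links-to", EX + "c"),
}

subjects = {s for s, p, o in triples}
objects = {o for s, p, o in triples}
roots = subjects - objects  # two candidates: which one is "topmost"?
print(sorted(roots))  # ['https://example.com/a', 'https://example.com/d']
```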
In I used RDF to record the results of a data analysis process. A packet of raw telemetry data would be injected into a pipeline of tests, whereby the outcome of one test may or may not cause the data to be subjected to subsequent tests. As such, the output was irregularly-shaped but still needed to be structured. It would have been incredibly difficult to pull off using SQL. The process extracted subjects—URIs—from the packet along with facts about them which eventually built up a graph. I did this for an employer so it was never recognized as anything more than an experiment.
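Nothing of the actual system survives to show, but the pattern itself is simple enough to sketch (all names and URIs here are invented): each test can both assert facts about a subject and decide which tests run next, so the resulting graph is exactly as irregular as the data warrants.

```python
EX = "https://example.com/ns/"  # invented namespace

def check_origin(packet, graph):
    graph.add((packet["source"], EX + "seen", "true"))
    # Only packets carrying a payload are routed to the deeper test.
    return [check_payload] if packet.get("payload") else []

def check_payload(packet, graph):
    graph.add((packet["source"], EX + "payload-size",
               str(len(packet["payload"]))))
    return []

def run(packet):
    """Push a packet through the test pipeline, accumulating a graph."""
    graph, queue = set(), [check_origin]
    while queue:
        queue.extend(queue.pop(0)(packet, graph))
    return graph

# A packet with a payload yields two facts; one without yields only one.
print(len(run({"source": "https://example.com/sensor/1", "payload": "abc"})))  # 2
```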
Around , I wrote a wrapper around the Mercurial version control system as a sort of first crack at an automated content inventory: it would scan the history of a repository to capture things like modification dates and naming histories. This work eventually matured into a content inventory vocabulary. The idea was to create a format for recording and exchanging Web content inventories—which of course could be performed programmatically—along with the outcomes of their subsequent audits. The inventory aspect is pretty mature by this point; the audit somewhat less so. This is still an active area of development that I believe has strong implications for the discipline of content strategy.
Also in 2009, I chanced upon a salon presentation featuring Douglas Engelbart. In it, he rather nonchalantly tossed out a mention of a thing called structured argumentation, which sounded a lot like it could serve as the basis for the fitness variables depicted in Christopher Alexander's Notes on the Synthesis of Form.
Structured argumentation—or at least the particular flavour of it that I had alighted on—is a sort of organizational protocol of constraining rhetorical moves in order to do things like, as the authors put it, solve wicked problems. Alexander was also trying to solve complex problems: compute an architectural program—that is to say a project plan—through a topological analysis of the hairball of concerns, so these two ideas fit together quite naturally. The Engelbart connection is of course the use of an interactive hypermedia system to manipulate the thing.
An RDF vocabulary for the strain of structured argumentation called IBIS—Issue-Based Information System—had already been written, until one day it disappeared. So in , after vacillating for years, I decided to replace it.
A year later, in , I was working on a project where I planned to use an RDF graph as the main database. The idea, hearkening back to my substrate plan, was that I could greatly abridge basic CRUD development by speaking what are effectively RDF diffs—sets of statements to add or remove—directly to the server. Thus I designed a protocol and reference implementation I called RDF-KV.
The protocol works by embedding commands into the keys of HTML forms, such that the values, when supplied by the user, complete RDF statements, with a flag to indicate whether the statement should be added or removed. The protocol is dead simple by design, and can be implemented with regular expressions. The net effect is that you can put a single catch-all POST handler on the server, and manipulate your CRUD behaviour just by changing your HTML.
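The actual RDF-KV specification defines its own key grammar, which I won't reproduce from memory; the following is a deliberately simplified toy that only illustrates the shape of the idea. A flag plus a subject and predicate live in the form key, and the user-supplied value completes the statement.

```python
import re

# Toy key grammar (NOT the real RDF-KV syntax): "<flag> <subject> <predicate>",
# where the flag is + to add the completed statement or - to remove it.
KEY = re.compile(r"^([+-])\s+(\S+)\s+(\S+)$")

def apply_form(graph, form):
    """Apply {key: value} form data to a set of (s, p, o) triples."""
    for key, value in form.items():
        m = KEY.match(key)
        if not m:
            continue  # not a command key; ignore it
        flag, s, p = m.groups()
        (graph.add if flag == "+" else graph.discard)((s, p, value))
    return graph

g = set()
apply_form(g, {"+ https://example.com/issue/1 https://example.com/ns/label": "Fix it"})
apply_form(g, {"- https://example.com/issue/1 https://example.com/ns/label": "Fix it"})
print(len(g))  # 0 -- the second form undid the first
```

A real implementation would also have to cope with repeated keys and blank nodes, but the single catch-all POST handler falls out exactly as described: it never needs to know what any particular form is for.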
Once I had the protocol, I needed to test it. I had yet to produce any vocabularies or instance data for the client, because part of the plan was that I would make a tool using the protocol to construct that data. I needed a complete vocabulary to write an app against, so I dragged my IBIS vocabulary out of mothballs and in a couple of weeks, spent mostly fiddling with UI, I had a reasonably serviceable structured argumentation tool. The original project for which I had designed the protocol eventually became a casualty of intraorganizational politics, but the IBIS prototype remains.
The IBIS tool is a rather crude demonstration of a through-and-through RDF Web application. Graph statements that come in through the RDF-KV protocol go directly into a triple store, and when they come back out, they are rendered as RDFa. I call the demonstration crude because it is incapable of handling arbitrary data objects—that will have to wait for the inevitable rewrite. Nevertheless, we can see in this prototype a significant dent in the Symbol Management Problem.
In particular, the tool demonstrates the use of embedded RDFa as CSS selectors: RDFa naturally identifies a subtree of an (X)HTML document with a subject, and/or one or more predicates, and/or one or more classes or datatypes. This is almost always enough information to attach styling directives directly, through attribute selectors, and it is what affords the tool its wild palette.
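A hypothetical fragment shows the trick; the prefix URI and property names here are illustrative rather than quoted from the actual tool. The RDFa attributes that carry the data double as styling hooks, so no extra class names need inventing:

```html
<div prefix="ibis: https://vocab.example/ibis#"
     typeof="ibis:Issue" resource="https://example.com/issue/1">
  <h2 property="rdfs:label">Fix the thing</h2>
</div>

<style>
  /* Attribute selectors target the RDFa directly. */
  [typeof~="ibis:Issue"]   { border-left: 4px solid crimson; }
  [property~="rdfs:label"] { font-variant: small-caps; }
</style>
```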
A bare-bones (X)HTML+RDFa document is at once an extremely well-defined development target and a terrifically versatile piece of raw material. When you write a piece of server-side code, you write it for consumption by downstream processes. You aren't creating a page so much as a patch of the graph, originating at the request-URI and featuring its immediate topological neighbours. The document's markup structure is heavily constrained by the statements you're trying to render, and for the reasons aforementioned, there aren't a lot of other decisions to make about things like CSS class names. When you're finished fashioning one of these resources—or perhaps a function that generates them according to supplied parameters—it goes into the Lego pile where it can be consumed by and composed into other resources. I made an entire Web app this way.
I have an ongoing project developing an intranet for a long-term client in the nonprofit sector. I add a little bit more to it at every conjunction of budget and availability, an arrangement they seem to be happy with. Indeed, it's part of the reason why I came up with the pattern: they don't have—or at least wouldn't be prudent to spend—the resources to develop software the conventional way, and I need a simple design that won't go obsolete between when I put it down and when I pick it back up again.
The project mainly consists of a set of tools for comprehending a whack of HR data: lists, charts, and individual members. The former two share a control panel the client uses to filter the data. The control panel is constructed from a repurposed OWL ontology and SKOS concept scheme that together describe all the idiosyncratic terms and coded properties peculiar to the organization. The chart generator takes HTML tables with an embedded Data Cube structure which it uses to negotiate the appropriate transformation into SVG.
To reiterate, the technical innovations of this project are born mainly of resource constraints. It is the way it is because it would be too much of a mess for a single person to manage otherwise. And as much as I would love to show this thing off, it's an intranet that browses through and visualizes reams of confidential personal information, so you don't get to see it. I'll have to show you something else.
My personal website is not only a place to write, but also a fairly large body of content that can't refuse my Frankenstein experiments. Historically it has not been very sophisticated because I am spectacularly lazy, though as a byproduct of this laziness I stumbled across a useful technique: Every Web browser going back to Microsoft Internet Explorer 5.5 has an embedded XSLT 1.0 transformation engine. XSLT is not the slightest bit picky about the markup it consumes, and will happily transform (X)HTML into itself. It therefore makes a perfectly good, fast, and perfectly lazy page composition and template processor.
Most websites tend to have ancillary content that is repeated on every page, and so does mine. Going back as far as or , I solved this by putting the ancillary content on its own page, and then putting a <link> to it in the <head> of each document, using HTML's limited set of semantic relations to disambiguate those links from any others. I would then transclude the linked content using the document() function in XSLT.
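The gist of that arrangement can be sketched in XSLT 1.0. The rel value below is illustrative (in practice I leaned on HTML's own link relations), and the template is a fragment, not a complete stylesheet:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <!-- Prepend the body of whatever document the <link> in the <head>
       points to; document() fetches and parses it on the fly. -->
  <xsl:template match="xhtml:body">
    <xsl:copy>
      <xsl:apply-templates select="document(
          /xhtml:html/xhtml:head/xhtml:link[@rel='contents']/@href
        )/xhtml:html/xhtml:body/node()"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```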
The problem with this approach is that both the method of resolving the links and that of how to insert them into the document are brittle and ad-hoc. Since I was using the technique on the aforementioned intranet project and the little extranets I make to share materials with my clients, I felt it was important to generalize it. I made two XSLT libraries: one to query an RDFa document, and another to do the transclusion.
These libraries worked great for my other projects, but my own site itself is just plain handwritten XHTML. And the content inventory metadata I mentioned? I needed a sustainable way to reinject that data back into the markup. And so RDF::SAK was born.
The Swiss Army Knife is a library whose purpose is mainly to act as a breadboard prototype for an agglomerate of desirable operations. For the moment it handles weaving RDF data back into plain Web pages, mapping resources from durable URIs to more evanescent Web URLs and handling their naming histories, generating Atom feeds and various indexes, and a few other mundane chores. It currently takes the form of a static website generator. Its proximate goal is to generate output I can use to make websites—beginning with my own—more hypertext-y.
As somebody who writes a lot for work and reads for it even more, I am growing increasingly dissatisfied with the sparsity—the clunkiness—of digital text. Technical manuals are cluttered with preamble and exposition, while their jargon glossaries, if they exist, are tucked out of sight. News articles still don't let you pivot by person, organization, or macro-event—social network analyses and multi-story timelines only seem to appear as special features. Academic papers still require you to dig out their references by hand. It's often easier to write a passage a second time than it is to locate where you had written it previously—and even if you did, your only option is to duplicate it, rather than just reference the original in-line. Documents are continually lapsing out of date with no connection to the most recent version. Quantitative arguments are still heavily rhetorically leveraged when they could simply be interactively demonstrated in situ.
I consider myself to be in the comprehension business. My professional objective is to remove the obstacles that slow down the uptake of knowledge. To remove the obstacles to knowledge, we must increase the paths through information. The more paths—links—the more complexity. That complexity needs to be managed, and for this job I have yet to encounter a candidate more effective than RDF.