Back in 2008 I decided to put up a website again, after not having one for something like 6 years. This was going to be super budget, just an interim static site until I could circle back around with the full-fledged CMS later.

Design Constraints

I had two constraints in mind when I got started this time:

  1. Dense Hypermedia: I wanted to make the experience linky. I wanted the content of any one document to fit on a standard laptop screen, and I wanted it to link if it needed more space than that. The idea was to replicate old-school hypertext: whether the goal was didactic, argumentative, or narrative, I wanted the reader to be able to choose what they read next.
  2. Cool URIs Don't Change: This was at least as much of a technical challenge as it was a stylistic one. The idea was that if you mint a Web URI—I'm talking about the actual string, not the document it points to—you lose control of it somewhat. Other people, companies, machines, services become aware of it, and they use it to return and fetch the resource—or at least some resource—identified by it. The state of URI preservation in was bad, and I wanted to see if I could do something about it. Put more generically, I wanted to have a website where no legitimate user would ever encounter a 404.

I bailed on the first constraint pretty early on. I found very quickly that writing hypertext is hard: the scope has a tendency to explode to what I estimate to be proportional to the square of the amount of writing initially expected. Publication would be constipated waiting for some parenthetical Pandora's box or other to be wrangled to satisfaction. I found this could be palliated somewhat by publishing subgraphs that only linked to each other or documents that had already been published, but the no-404 stipulation meant that I was on the hook for an increasingly unmanageable hairball until all paths through it had terminated. I wanted links in the documents to remind myself that there was something there to expand on, but I didn't want those links in the hands of users, and I certainly didn't want them in the databases of indexers either, at least not until there was something on the other end.

Without adequate instrumentation, writing dense hypertext turned out to be just too hard. Within a year I had reverted to writing essays.

The second constraint, unbreakable URIs, turned out to be easier to maintain. As a byproduct of my first tech job in , I had gotten familiar with mod_perl, which gives you full access to the guts of Apache without wasting your life writing C. Working at that level meant your app shipped as a unitary module, bypassing the clunkery of contemporaneous Web application development techniques like CGI scripts, or code-interpolated documents such as PHP. This taught me an important lesson: from the point of view of the standard and the Web server, what is called the Request-URI (the combination of the /path and ?query, the part between the ://host and the #fragment) may as well be a flat dictionary key. It is only by convention that it represents some location on the server's file system. If you can get into this pipeline early enough, you can make the Request-URI represent whatever you want.

Put another way: /path/hierarchies/are/not/necessary. The only thing that matters, to the server, is that the Request-URI unambiguously picks out a resource. The slug is easy enough: just do a sensible transformation of the title. Throw away the idea of sections and plunk everything in the root. If anything threatens a collision, that's when you start adding /path/segments. And when an address does have to move, just put in a redirect.
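To make the flat-dictionary idea concrete, here is a minimal sketch in Python rather than mod_perl. The `slugify` rules, the `FlatSite` class, and its method names are all hypothetical illustrations, not the code that runs the site; the only things taken from the text are the title-to-slug transformation, the root-level placement, and the redirect on a move.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Derive a URI slug from a title: ASCII-fold, lowercase, hyphenate."""
    folded = unicodedata.normalize("NFKD", title)
    ascii_only = folded.encode("ascii", "ignore").decode("ascii")
    # Drop apostrophes so "Don't" becomes "dont" rather than "don-t".
    words = re.findall(r"[a-z0-9]+", ascii_only.lower().replace("'", ""))
    return "-".join(words)


class FlatSite:
    """Hypothetical resolver that treats the Request-URI as a plain key."""

    def __init__(self) -> None:
        self.documents: dict[str, str] = {}  # live Request-URI -> document id
        self.redirects: dict[str, str] = {}  # retired Request-URI -> live one

    def publish(self, title: str, doc_id: str) -> str:
        uri = "/" + slugify(title)  # everything lands in the root
        if uri in self.documents or uri in self.redirects:
            raise ValueError(f"{uri} is taken; add a path segment instead")
        self.documents[uri] = doc_id
        return uri

    def move(self, old_uri: str, new_uri: str) -> None:
        if new_uri in self.documents:
            raise ValueError(f"{new_uri} is already in use")
        self.documents[new_uri] = self.documents.pop(old_uri)
        # Re-point older redirects so nothing is more than one hop away.
        for source, target in list(self.redirects.items()):
            if target == old_uri:
                self.redirects[source] = new_uri
        self.redirects.pop(new_uri, None)  # a retired URI may come back to life
        self.redirects[old_uri] = new_uri

    def resolve(self, uri: str) -> tuple[int, str]:
        if uri in self.documents:
            return (200, self.documents[uri])
        if uri in self.redirects:
            return (301, self.redirects[uri])
        return (404, "")
```

Because a retired key is never deleted, only re-pointed, any URI that was ever handed out keeps resolving to something.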

The Semantic Connection

Around this time is also when I was really ramping up my work with RDF, the lingua franca of the Semantic Web. What you find very quickly when you start working with RDF is that it is ravenously hungry for URIs. Combined with the notion that HTTP(S) URIs ought to point somewhere, this quickly escalates into a microcontent curation nightmare.

Of course, RDF doesn't specify what kind of URI can go in its elements, and there are far more species in the world than just http:. Take, for example, an identifier like:


The UUID: Standard, spat out of a random number generator, big enough to enumerate every atom in the universe, and nobody is going to confuse it for the address of a Web page. Unless it is the address of a Web page, in which case you transform it like so:

Once you come up with a clever title, you can derive it into a slug, and provided it's unique, that can be the new address. If you expose the UUID to the public for any reason, you can just redirect that too.
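As a rough illustration of that lifecycle, the sketch below uses Python's standard `uuid` module. The `urn:uuid:` spelling comes from the standard library; the root-level `/<uuid>` path shape and the helper names `provisional_path` and `redirect_for` are assumptions made for the example, not the site's actual scheme.

```python
import uuid
from typing import Optional


def provisional_path(doc_id: uuid.UUID) -> str:
    """Assumed convention: an untitled document lives at /<uuid>."""
    return "/" + str(doc_id)


def redirect_for(path: str, slugs: dict[uuid.UUID, str]) -> Optional[str]:
    """Return the slug a UUID path should redirect to, if one exists yet."""
    try:
        doc_id = uuid.UUID(path.lstrip("/"))
    except ValueError:
        return None  # not a UUID path at all
    return slugs.get(doc_id)


doc_id = uuid.UUID("e8f61587-bb56-4e5c-b7dd-2954b76a84b9")
print(doc_id.urn)                # the identifier as a URN
print(provisional_path(doc_id))  # the same identifier as a Web address
```

Once a title exists, the `slugs` table gains an entry for the document and the UUID path becomes a redirect source instead of a dead end.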

So now it's about 2010 and I have a version-controlled folder on my computer that contains an ocean of files that look like e8f61587-bb56-4e5c-b7dd-2954b76a84b9.xml. The majority are missives I started writing and promptly forgot existed, only to start writing anew. Wouldn't it be great to have a little program that just generates a content inventory so I can get this under control?

The program that did the mining is about a thousand lines of Python, which just zips through the designated folder and concomitant versioning database, constructs a graph, and serializes it to a file. At this stage I had been using all third-party RDF vocabularies to represent this metadata: The Bibliographic Ontology to represent the various types of documents, and Dublin Core for many of the relations between them. To represent people and organizations, such as authors and publishers, I used FOAF.

It didn't take long for needs to emerge that weren't expressed by these third-party vocabularies. For one, I wanted to be able to ascribe editorial destinies to these documents that were clear, machine-readable, database-selectable entities:

This was the impetus for writing my own content inventory vocabulary, which I started around the cusp of 2012.

Expanding the Vocabulary

Here is an old test render of some published documents on my site organized in reverse chronological order. The documents are visualized as their bounding boxes, taken by dividing the number of characters by the paragraph width (33 em), times a weighted average of each character's tendency to fill the width of the em square. The result is a good approximation of the actual geometry of each rendered document, juxtaposed against one another.
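The arithmetic can be sketched as follows. The 33 em paragraph width is from the text; the per-character width table, the fallback width, and the 1.5 em line height are placeholder values invented for the example, since real figures depend on the typeface.

```python
import math
from collections import Counter

PARAGRAPH_WIDTH_EM = 33.0

# Hypothetical advance widths as fractions of the em square.
WIDTH_EM = {" ": 0.25, "i": 0.28, "l": 0.28, "m": 0.83, "w": 0.72}
DEFAULT_WIDTH_EM = 0.5  # assumed fallback for every other character


def mean_char_width(text: str) -> float:
    """Average character width in em, weighted by how often each occurs."""
    counts = Counter(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    weighted = sum(WIDTH_EM.get(ch, DEFAULT_WIDTH_EM) * n for ch, n in counts.items())
    return weighted / total


def estimated_lines(text: str) -> float:
    """Characters times mean width, divided by the paragraph width."""
    return len(text) * mean_char_width(text) / PARAGRAPH_WIDTH_EM


def bounding_box(text: str, line_height_em: float = 1.5) -> tuple[float, float]:
    """Approximate (width, height) of the rendered text, in em."""
    return (PARAGRAPH_WIDTH_EM, math.ceil(estimated_lines(text)) * line_height_em)
```

The estimate ignores word wrapping and block margins, so it is only good for comparing documents against one another, which is all the render needs.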

Since I was always driving toward short documents with lots of links, I wanted an easy way to pick out of the inventory the documents that most need rehabilitation. This entailed some way of measuring them. Raw word count is, in my opinion, inadequate: I want a sense of the anatomy of the document as well, without having to look at it. Of course, at the time, HTML had no unambiguous concept of a chapter or section, so it occurred to me to count blocks: paragraphs, lists, tables, block quotes, <div>s, etc., and ratios like sentences and words per block. I hacked a little statistics gatherer into my inventory generator program and amended my vocabulary to provide for descriptive statistics for each document, which could power sparklines that would tell me the shape of each at a glance.

This is an early test for a stylized box-whisker diagram of a bunch of documents ordered by descending document length. Each vertical bar represents descriptive statistics for words per block: Dark points represent medians, and the interquartile range is around the median in the darker purple, while the extrema are grey. The lighter purple bars are the ±standard deviation—clipped at zero—and the mean is in turquoise. You can see the original test here, which was on a black backdrop with the palette inverted.
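The numbers behind one such bar can be sketched with Python's standard `statistics` module. The function names are hypothetical, blocks are assumed to arrive as already-extracted strings, and the choice of population standard deviation and the "inclusive" quartile method is mine; the text does not say which variants the original gatherer used.

```python
import statistics


def words_per_block(blocks: list[str]) -> list[int]:
    """Count whitespace-separated words in each block of text."""
    return [len(block.split()) for block in blocks]


def describe(values: list[int]) -> dict[str, float]:
    """Descriptive statistics for one document; needs at least two blocks."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return {
        "min": float(min(values)),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": float(max(values)),
        "mean": mean,
        "sd_low": max(0.0, mean - sd),  # the band is clipped at zero
        "sd_high": mean + sd,
    }
```

Each key maps onto one mark in the diagram: the extrema, the interquartile range around the median, the mean, and the clipped standard deviation band.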

I also consternated over what to do about carving up the corpus. Barring two or three collections I started before instituting the policy, I was adamant about not creating any disjoint sections. It was important that every resource could exist in more than one category at once. Besides, the inverse of a category is just a predicate, and a resource can have arbitrarily many of those.

I gave this talk at the 2017 Information Architecture Summit about why it might be a good thing to organize information primarily by semantic relation, from which conventional categories could be derived.

A crude type of predicate is the tag, but a tag is just a string of text. At the very least, there are logistical problems with coalescing minor variants of said strings that were all intended to mean the same thing. Beyond that, there is no way to ascribe a general conceptual domain for what kind of thing a tag is supposed to be, and no well-defined way to relate tags to one another.

Enter SKOS: a way to represent concepts as distinct, identifiable entities, from each of which every conceivable label dangles like a Christmas ornament, and garlanded in semantic relations. In contrast to a puddle of text strings, a SKOS concept scheme is a fully-featured taxonomical structure.

Next came the task of connecting the concepts to the documents. Dublin Core provides a subject relation, which is a good start, but only really useful for conveying what a document is about. A document can be about a concept and not mention it, while it can mention a concept and not strictly be about it. Thus, I added the following relations to my vocabulary:

  • Mentions: the document explicitly invokes the concept by name,
  • Introduces: in addition to mentioning the concept, the document defines, describes, or otherwise explains it for an audience who may not already know what it is,
  • Assumes: the document may or may not explicitly mention the concept, but it is written as if the audience is already familiar with it.

As I will expound in a moment, I was keenly interested in sparing my audience jargon, notwithstanding the content that actually treated the jargon-y topics first hand. I figured these relations would form the raw material for performing operations to that end, or at the very least for hinting at which documents treated the right content but for the wrong audience, and needed to be brought to heel.
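One operation these relations could support, offered purely as an illustration, is flagging documents that assume a concept which nothing in the corpus introduces. The triple layout, the lowercase relation names, and both function names are assumptions for this sketch; the only rule taken from the definitions above is that introducing a concept also counts as mentioning it.

```python
from collections import defaultdict

Triple = tuple[str, str, str]  # (document, relation, concept)


def by_relation(triples: list[Triple]) -> dict[str, dict[str, set[str]]]:
    """Index triples as relation -> document -> concepts."""
    table: dict[str, dict[str, set[str]]] = defaultdict(lambda: defaultdict(set))
    for doc, relation, concept in triples:
        table[relation][doc].add(concept)
        if relation == "introduces":
            # Introducing a concept entails mentioning it.
            table["mentions"][doc].add(concept)
    return table


def unexplained_assumptions(triples: list[Triple]) -> dict[str, set[str]]:
    """Per document, the assumed concepts that no document introduces."""
    table = by_relation(triples)
    introduced: set[str] = set()
    for concepts in table["introduces"].values():
        introduced |= concepts
    gaps: dict[str, set[str]] = {}
    for doc, assumed in table["assumes"].items():
        missing = assumed - introduced
        if missing:
            gaps[doc] = missing
    return gaps
```

A non-empty result points at jargon with no on-site explanation, which is either a writing task or a sign the document is aimed at a narrower audience than intended.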

Constructing an Audience

In my professional life I consider myself to be something of a liminal character, straddling the boundary between those with technical proclivities and those who actively eschew them. I had been vacillating for some time about partitioning my content into two sections, the recto for the bulk of humanity, and the verso for the techies—a move I ultimately made in . But I don't want the split to be too stark: I am one person writing for two audiences, and the works naturally mingle—they interact with one another. If I wanted them to be truly separate, I'd put them on different websites. What I want is subtler control—a way to signal to people what side of the fence they're currently on, and when they're about to cross over.

This cleavage plane manifested initially in the feeds, which, in true lazy fashion, I reingest for indexes on the home and Verso pages. Up until very recently—like —I wrote them by hand. In order to make them amenable to being generated, I had to design some way for an algorithm to reliably pick which article went where.

The current partition is simple enough: any article that talks about computers—and moreover how to do things with computers—goes into one bucket, and in the other bucket goes everything else. One could imagine though, eventually, a kind of onion-skin gradient of obscurity: the concepts at the centre are things everybody understands, and from the centre radiate little archipelagoes of specialist knowledge. Thankfully, a structure like this is precisely what a system like SKOS is designed to represent.

If you think about it, an audience is a conceptual entity in its own right, denoting a group of people who share the same values and understand the same concepts. I added an Audience class to my content inventory vocabulary, which inherits properties from both SKOS Concept and the Dublin Core AgentClass, making it compatible with the audience relation of the same. To solve my partitioning problem, I created a non-audience relation to complement the audience relation from Dublin Core, and this gave me the expressivity I needed to compute the partition. Roughly:

If the index's non-audience matches the document's audience,
  and the document has no other audiences, discard it from this index.
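Read literally, that rule is a subset test. The sketch below assumes audiences are plain string identifiers and that a document with no declared audience stays in every index; the function name is hypothetical.

```python
def keep_in_index(doc_audiences: set[str], index_non_audiences: set[str]) -> bool:
    """Keep a document unless all of its audiences are excluded by the index."""
    if not doc_audiences:
        return True  # untagged documents are never discarded
    # Discard only when the document has no audience outside the excluded set.
    return not doc_audiences <= index_non_audiences


print(keep_in_index({"programmer"}, {"programmer"}))              # False
print(keep_in_index({"programmer", "designer"}, {"programmer"}))  # True
print(keep_in_index(set(), {"programmer"}))                       # True
```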

This small addition kept me from having to explicitly tag every document with an audience—a set of concepts I haven't finished yet—and the main index with every conceivable audience. Though I wouldn't have to do that, exactly: since my Audience class inherits from SKOS, it gets the full set of hierarchical semantic relations, which I use to derive, for example, whether or not a Python programmer is a Programmer, and therefore that content directed at Python programmers belongs in the techie feed.
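The derivation amounts to walking the broader relation transitively, which is what SKOS calls skos:broaderTransitive. A real setup would leave this to an RDF reasoner; the hand-rolled version below is only a sketch, with the broader links held in an ordinary dictionary and all identifiers invented for the example.

```python
def broader_closure(concept: str, broader: dict[str, set[str]]) -> set[str]:
    """Every concept reachable from `concept` by following broader links."""
    seen: set[str] = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in broader.get(current, ()):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


def is_a(concept: str, ancestor: str, broader: dict[str, set[str]]) -> bool:
    """True if `concept` is `ancestor` or falls somewhere beneath it."""
    return concept == ancestor or ancestor in broader_closure(concept, broader)


# Hypothetical audience scheme: direct broader links only.
BROADER = {
    "python-programmer": {"programmer"},
    "programmer": {"techie"},
}

print(is_a("python-programmer", "programmer", BROADER))  # True
print(is_a("programmer", "python-programmer", BROADER))  # False
```

With this in hand, an index only needs to name the broadest audience it excludes, and every narrower audience is caught by the closure.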

Here is the hairball of concepts and audiences I mined from my site a few years ago. Purple objects are plain concepts, blue are audience classes, and lime green are audiences which are also organizational roles. Orange lines indicate a has broader relation, where the arrowhead denotes the broader term. Green lines denote a symmetric relation. The faint lines merely connect the objects to the large green entity in the middle that stands for the concept scheme. Note that while SKOS can express subordination and superordination, its structure is not strictly hierarchical. Working with it is actually more set-theoretic.

What's really exciting, though, is the notion of using the corpus and its attendant concepts to construct the set of audiences. I currently have about fifty concepts and a dozen audiences, which I just dashed off informed by nothing but a little introspection. If it weren't a mere personal website created with play labour, I would probably base this structure on some ethnographic research in an attempt to close the gap between who I've already written for, who I want to be writing for, and who actually reads my work.


This odyssey took over a decade to get to this state, in part because it isn't anything close to my main gig, but also because, in true Gibsonian fashion, the future isn't evenly distributed.

I wrote the original tooling in Python because the version control system I use for my website is also written in Python: it was therefore easy to hook directly into it and pump out the metadata. It turns out, however, that the software needed to consume all this wonderful data, and make effective use of it, is considerably more sophisticated than that which is needed to produce it. The key piece that performs all the highly convenient and time-saving inference generation is missing from the Python toolkit, and making one from scratch is about three notches above my paygrade.

As such, the path of least resistance was nothing short of a complete retooling. While the proximate code could be rewritten in just a few days, there are invariably a bunch of gaps and missing dependencies when moving from one platform to another. Not something I could afford without even a de facto sponsor. To be sure, the amount of time this project has spent idle versus in motion is easily a hundred to one.

It was only last year, in , that I got the first opportunity in five years to review it. A client, as a byproduct of my project with them, had me looking at Ruby, which happens to have the missing piece! A not-very-good one albeit, but at least it works. Moreover, I managed to achieve a good chunk of the retooling effort just through ordinary project offgassing. As such, I have a prototype coalescing for a Swiss-Army knife of sorts, to do all the basic operations of the original Python code, and then some.

Coda for the Coda

I piloted this technique in a couple of other places aside from my own site, including an attempt to overhaul the website of the Information Architecture Institute. Even with professionals involved, it was a hard sell, and despite the pitfalls and caveats I've mentioned, I'm not entirely sure why.

I want to reiterate that I wrote this content inventory vocabulary with the idea that it would be an exchange format: Some tool would generate this data, and potentially some other tool would consume it. Heck, it could even get woven straight into a content management system. It could facilitate the migration from one content management system to another. The data could be repurposed for entirely other ends. Endless tools and infrastructure could be built on top of it.

The tools that I wrote for my own site, especially the most recent one, may be usable for other websites, but they're frankly kinda disposable. The important thing, to me at least, is the general technique embodied in this and other metadata vocabularies.

In a way I'm surprised, because the technique is extraordinarily powerful; nothing even really comes close to touching it. I'm also not surprised, because it's also really hard. Twenty years in, the Semantic Web is still missing key elements to make it, if not easy to use, at least worth the pain. But it's demand that drives the building of better tools, and the chamfering of their sharp corners.

I hope what I have shared today ignites some interest.