Back in 2008 I decided to put up a website again, after not having one for something like 6 years. This was going to be super budget, just an interim static site until I could circle back around with the full-fledged CMS later.

Design Constraints

I had two constraints in mind when I got started this time:

  1. Dense Hypermedia: I wanted to make the experience linky. I wanted the content of any one document to fit on a standard laptop screen, and I wanted it to link if it needed more space than that. The idea was to replicate old-school hypertext: whether the goal was didactic, argumentative, or narrative, I wanted the reader to be able to choose what they read next.
  2. Cool URIs Don't Change: This was at least as much of a technical challenge as it was a stylistic one. The idea was that if you mint a Web URI—I'm talking about the actual string, not the document it points to—you lose control of it somewhat. Other people, companies, machines, and services become aware of it, and they use it to return and fetch the resource—or at least some resource—identified by it. The state of URI preservation was bad, and I wanted to see if I could do something about it. Put more generically, I wanted to have a website where no legitimate user would ever encounter a 404.

I bailed on the first constraint pretty early on. I found very quickly that writing hypertext is hard: the scope has a tendency to explode to what I estimate to be proportional to the square of the amount of writing initially expected. Publication would be constipated waiting for some parenthetical Pandora's box or other to be wrangled to satisfaction. I found this could be palliated somewhat by publishing subgraphs that only linked to each other or documents that had already been published, but the no-404 stipulation meant that I was on the hook for an increasingly unmanageable hairball until all paths through it had terminated. I wanted links in the documents to remind myself that there was something there to expand on, but I didn't want those links in the hands of users, and I certainly didn't want them in the databases of indexers either, at least not until there was something on the other end.

Without adequate instrumentation, writing dense hypertext turned out to be just too hard. Within a year I had reverted to writing essays.

The second constraint—unbreakable URIs—turned out to be easier to maintain. As a byproduct of my first tech job, I had gotten familiar with mod_perl, which gives you full access to the guts of Apache without wasting your life writing C. Working at that level meant your app shipped as a unitary module, bypassing the clunkery of contemporaneous Web application development techniques like CGI scripts, or code-interpolated documents such as PHP. This taught me an important lesson: what is called the Request-URI, the combination of the /path and ?query, the part between the ://host and the #fragment, may as well be, from the point of view of the standard and the Web server, a flat dictionary key. It is only by convention that it represents some location on the server's file system. If you can get into this pipeline early enough, you can make the Request-URI represent whatever you want.

Put another way: /path/hierarchies/are/not/necessary. The only thing that matters—to the server—is that the Request-URI unambiguously picks out a resource. The slug is easy enough: just do a sensible transformation of the title. Throw away the idea of sections and plunk everything in the root. Only when something threatens a collision do you start adding /path/segments, and when you do, you put in a redirect from the old address.
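A sketch of what that flat namespace looks like in Python; the slug rules and the table contents are my own stand-ins, not the site's actual machinery:

```python
import re

def slugify(title: str) -> str:
    """A sensible transformation of a title into a slug."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# The Request-URI as a flat dictionary key: no hierarchy, everything in the root.
routes = {
    "/" + slugify("Content Management Meta-System"): "doc-e8f61587",
}

# When a title collides, a /path/segment disambiguates the newcomer, and any
# URI the server has ever minted gets a redirect rather than a 404.
redirects = {
    "/essays/content-management-meta-system": "/content-management-meta-system",
}
```

The point of the two dictionaries is that lookup never depends on the file system: a path either resolves, redirects, or was never minted in the first place.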

The Semantic Connection

Around this time is also when I was really ramping up my work with RDF, the lingua franca of the Semantic Web. What you find very quickly when you start working with RDF is that it is ravenously hungry for URIs. Combined with the notion that HTTP(S) URIs ought to point somewhere, this quickly escalates into a microcontent curation nightmare.

Of course, RDF doesn't specify what kind of URI can go in its elements, and there are far more species in the world than just http:. Take, for example, an identifier like:

urn:uuid:e8f61587-bb56-4e5c-b7dd-2954b76a84b9

The UUID: standardized, spat out of a random number generator, drawn from a space big enough that collisions are a practical impossibility, and nobody is going to confuse it for the address of a Web page. Unless it is the address of a Web page, in which case you transform it like so:

https://doriantaylor.com/e8f61587-bb56-4e5c-b7dd-2954b76a84b9
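Minting and rewriting are both mechanical; a minimal sketch, assuming the authority is simply the site's hostname:

```python
import uuid

URN_PREFIX = "urn:uuid:"

def mint() -> str:
    """Mint a fresh urn:uuid identifier from a random (version 4) UUID."""
    return uuid.uuid4().urn

def to_https(urn: str, authority: str = "doriantaylor.com") -> str:
    """Rewrite a urn:uuid into an https: URI under the given authority."""
    if not urn.startswith(URN_PREFIX):
        raise ValueError("not a urn:uuid")
    return f"https://{authority}/{urn[len(URN_PREFIX):]}"
```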

Once you come up with a clever title, you can derive a slug from it, and provided it's unique, that can become the new address. If you expose the UUID to the public for any reason, you can just redirect it too:

https://doriantaylor.com/e8f61587-bb56-4e5c-b7dd-2954b76a84b9
    -> https://doriantaylor.com/content-management-meta-system
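The lookup behind that arrow can be sketched like so; the in-memory table and the choice of a 301 are assumptions, standing in for whatever the server actually consults:

```python
# Canonical slug keyed by UUID path segment; the UUID address keeps working.
canonical = {
    "e8f61587-bb56-4e5c-b7dd-2954b76a84b9": "content-management-meta-system",
}

def resolve(path: str) -> tuple:
    """Return (status, location): permanent redirect for a UUID, else serve as-is."""
    segment = path.lstrip("/")
    if segment in canonical:
        return 301, "/" + canonical[segment]
    return 200, path
```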

So now it's about 2010 and I have a version-controlled folder on my computer that contains an ocean of files that look like e8f61587-bb56-4e5c-b7dd-2954b76a84b9.xml. The majority are missives I started writing and promptly forgot existed, only to start writing anew. Wouldn't it be great to have a little program that just generates a content inventory so I can get this under control?

The program that did the mining is about a thousand lines of Python, which just zips through the designated folder and concomitant versioning database, constructs a graph, and serializes it to a file. At this stage I was using exclusively third-party RDF vocabularies to represent this metadata: the Bibliographic Ontology for the various types of documents, and Dublin Core for many of the relations between them. To represent people and organizations, such as authors and publishers, I used FOAF.
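The shape of that output can be sketched with plain tuples; the real program serializes an RDF graph, and the base URI plus the choice of bibo:Document and dcterms:identifier below are illustrative guesses, not the actual modeling:

```python
from pathlib import Path

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BIBO_DOCUMENT = "http://purl.org/ontology/bibo/Document"
DCT_IDENTIFIER = "http://purl.org/dc/terms/identifier"

def triples_for(stem: str, base: str = "https://doriantaylor.com/"):
    """Triples describing one UUID-named source file."""
    subject = base + stem
    return [
        (subject, RDF_TYPE, BIBO_DOCUMENT),
        (subject, DCT_IDENTIFIER, stem),
    ]

def inventory(folder: Path) -> list:
    """Zip through the designated folder and accumulate a graph of triples."""
    graph = []
    for f in sorted(folder.glob("*.xml")):
        graph.extend(triples_for(f.stem))
    return graph
```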

It didn't take long for needs to emerge that weren't expressed by these third-party vocabularies. For one, I wanted to be able to ascribe editorial destinies to these documents that were clear, machine-readable, database-selectable entities.

This was the impetus for writing my own content inventory vocabulary, which I started around the cusp of 2012.

Expanding the Vocabulary

Here is an old test render of some published documents on my site, organized in reverse chronological order. The documents are visualized as their bounding boxes, obtained by dividing the number of characters by the paragraph width (33 ems), times a weighted average of each character's tendency to fill the width of the em square. The result is a good approximation of the actual geometry of each rendered document, juxtaposed against one another.
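The estimate is cheap to compute without rendering anything; a sketch, where the per-character widths are invented stand-ins for measured weights:

```python
# Hypothetical per-character advance widths in ems; a real table would be
# measured from the font. Unlisted characters get an average width.
CHAR_WIDTH_EM = {"i": 0.28, "l": 0.30, "m": 0.89, "w": 0.92, " ": 0.25}
AVERAGE_WIDTH_EM = 0.5

def estimated_lines(text: str, measure_em: float = 33.0) -> float:
    """Total advance width divided by the paragraph measure (33 ems)."""
    advance = sum(CHAR_WIDTH_EM.get(c, AVERAGE_WIDTH_EM) for c in text)
    return advance / measure_em
```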

Since I was always driving toward short documents with lots of links, I wanted an easy way to pick documents out of the inventory which most need rehabilitation. This entailed some way of measuring them. Raw word count is, in my opinion, inadequate—I want a sense of the anatomy of the document as well, without having to look at it. Of course, at the time, HTML had no unambiguous concept of a chapter or section, so it occurred to me to count blocks: paragraphs, lists, tables, block quotes, <div>s, etc., and ratios like sentences and words per block. I hacked in a little statistics gatherer to my inventory generator program and amended my vocabulary to provide for descriptive statistics for each document, which could power sparklines that would tell me the shape of each at a glance.
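A toy version of that statistics gatherer, using the standard library's HTML parser; the list of what counts as a block is my own approximation of the one described above:

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "ul", "ol", "table", "blockquote", "div", "pre"}

class BlockCounter(HTMLParser):
    """Count block-level elements and words as the document streams past."""
    def __init__(self):
        super().__init__()
        self.blocks = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.blocks += 1

    def handle_data(self, data):
        self.words += len(data.split())

def words_per_block(html: str) -> float:
    counter = BlockCounter()
    counter.feed(html)
    return counter.words / counter.blocks if counter.blocks else 0.0
```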

This is an early test of a stylized box-whisker diagram of a bunch of documents, ordered by descending document length. Each vertical bar represents descriptive statistics for words per block: dark points mark the medians, the darker purple around each median is the interquartile range, and the extrema are grey. The lighter purple bars span one standard deviation either side of the mean (clipped at zero), and the mean itself is in turquoise. You can see the original test here, which was on a black backdrop with the palette inverted.
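The numbers behind each bar are ordinary descriptive statistics; a sketch with the standard library, assuming the input is one document's words-per-block series:

```python
import statistics

def describe(series: list) -> dict:
    """Descriptive statistics behind one bar of the box-whisker sparkline."""
    q1, median, q3 = statistics.quantiles(series, n=4)
    mean = statistics.mean(series)
    sd = statistics.stdev(series)
    return {
        "min": min(series), "q1": q1, "median": median, "q3": q3,
        "max": max(series), "mean": mean,
        "sd_lo": max(0.0, mean - sd),  # clipped at zero, as in the diagram
        "sd_hi": mean + sd,
    }
```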

I also agonized over what to do about carving up the corpus. Barring two or three collections I started before instituting the policy, I was adamant about not creating any disjoint sections. It was important that every resource could exist in more than one category at once. Besides, the inverse of a category is just a predicate, and a resource can have arbitrarily many of those.