Content Robo-Inventory

Define the Damn Thing™

It's this kind of tool that makes any attempt at a clear-cut distinction between information architecture, content strategy, business intelligence, SEO and probably four or five other nascent disciplines start to look a bit silly. One reason is that it makes use of a data structure which, after working with it for the past five years or so, I'm pretty confident is the natural home for every single one of these concerns. Moreover, the data has to be sourced from several different organizational silos in order to be brought together in one place and made useful. Who should own that problem? Who even has the authority?

Well, So Far I Do

I took an interest in this idea because I wanted to get a handle on my own site, which is a scintillating paragon of neglected content, so much so that only about a fifth of it is even visible, of which maybe half is current. The rest is squirreled away in various stages of incompleteness. I wanted a meaningful way of putting the whole corpus front and centre so that I could quickly expose and prioritize the material that needs attention. I had a nagging feeling, however, that there was something insufficiently real-world about my own site, which, while admittedly a feat of extreme laziness, is at least informed by working with the web virtually every day for half a lifetime. As such, the content inventory engine I wrote just didn't feel right until I could turn it toward a corpus that was a bit more representative of what I was likely to find in the wild.

Good thing you folks elected me to the IA Institute board! Its site is perfect. Just what I need to finish my masterpiece, which I started finishing in November 2010, and continued at a pace consistent with a volunteer directorship of a non-profit organization*. Well, I'm happy to say that I'm at it again. Let's consider some of the data points I'm interested in.

Stuff I'd Like to See in My Content Inventory

You know, because I inventory like a boss.

Data About Individual Resources

The canonical URI of the resource
Any other URIs the resource has or might have had in the past
Who wrote the damn thing or has otherwise touched it
When it was written
How many times it's been revised and when each revision happened
How much traffic it gets relative to the rest of the corpus
How long it is in words, paragraphs and sections
How long that is relative to the rest of the corpus
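
To make the per-resource data points above a little more concrete, here is a minimal sketch of a record type plus the "relative to the rest of the corpus" comparisons. Every name in it (ResourceRecord, share_at_or_below and so on) is made up for illustration, and the word and paragraph counting is deliberately naive; it is not the inventory engine's actual schema.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ResourceRecord:
    """One inventory row. All field names are hypothetical."""
    canonical_uri: str
    other_uris: List[str] = field(default_factory=list)
    contributors: List[str] = field(default_factory=list)
    created: Optional[datetime] = None
    revisions: List[datetime] = field(default_factory=list)  # one per revision
    hits: int = 0   # raw traffic count from whatever analytics is on hand
    text: str = ""  # plain-text body, markup already stripped

    @property
    def word_count(self) -> int:
        # Naive: any run of word characters counts as a word.
        return len(re.findall(r"\w+", self.text))

    @property
    def paragraph_count(self) -> int:
        # Naive: paragraphs are separated by blank lines.
        return len([p for p in re.split(r"\n\s*\n", self.text) if p.strip()])


def share_at_or_below(value, corpus_values):
    """Fraction of the corpus whose value is at or below this one."""
    if not corpus_values:
        return 0.0
    return sum(1 for v in corpus_values if v <= value) / len(corpus_values)


def traffic_share(record, corpus):
    """This resource's hits as a fraction of all hits in the corpus."""
    total = sum(r.hits for r in corpus)
    return record.hits / total if total else 0.0
```

Ranking against the corpus rather than reporting raw numbers is the point: a 300-word page means one thing on a site of essays and another on a site of release notes.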

Actual Metadata of Individual Resources

Title
Short title, for use in links with constrained real estate
Abstract or description or whatever you feel like calling it
Intended audience as a FOAF agent or Dublin Core AgentClass
Subject as some kind of resource, at least a SKOS concept; not to be confused with…
Relevant concepts, also SKOS concepts, from which keywords, tags, whatever can be derived
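
As a rough sketch of how the metadata above might land in a triple structure, the function below emits plain (subject, predicate, object) tuples using only the standard library. The Dublin Core and BIBO namespace URIs are the real ones; the choice of dcterms:references for "relevant concepts" is purely my assumption, as are the function name and the flattening of literals and URIs into plain strings.

```python
DCT = "http://purl.org/dc/terms/"
BIBO = "http://purl.org/ontology/bibo/"


def metadata_triples(uri, title=None, short_title=None, abstract=None,
                     audience=None, subject=None, concepts=()):
    """Return (subject, predicate, object) tuples for one resource.

    audience, subject and concepts are expected to be URIs (a FOAF agent
    or Dublin Core AgentClass, and SKOS concepts); the rest are literals.
    This sketch does not distinguish the two kinds of object.
    """
    candidates = [
        (DCT + "title", title),
        (BIBO + "shortTitle", short_title),
        (DCT + "abstract", abstract),
        (DCT + "audience", audience),
        (DCT + "subject", subject),
    ]
    # Skip anything that has not been filled in yet.
    triples = [(uri, p, o) for p, o in candidates if o is not None]
    # ASSUMPTION: dcterms:references stands in for "relevant concepts".
    triples.extend((uri, DCT + "references", c) for c in concepts)
    return triples
```

Keeping the subject separate from the looser set of relevant concepts mirrors the distinction in the list: one says what the resource is about, the other feeds keywords and tags.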

Information About Links

Links from this resource to other places within the site
Links from this resource to other sites
Links which are hidden, e.g. with the <link> element
Links which are part of the navigation or other chrome
Links which are part of a widget or other ancillary content
Links which are part of the actual content
Links which reference a form
Media assets which are embedded, including but not limited to images
Scripts, stylesheets and other utilities referenced in the resource
Inbound links from within the site
Inbound links from other sites
A huge angry beacon if the resource is an orphan, i.e. it has no inbound links from within the site
Any well-trodden paths through the site this resource may belong to
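
Several of the link categories above can be read straight off the markup. The sketch below, using Python's standard html.parser, tags each reference by the element it came from and then flags orphans. The kind labels, class name and function names are all hypothetical. It makes no attempt to tell navigation from widgets from actual content, since that depends on how a given site's templates are put together.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect (kind, absolute URI) pairs from one HTML document."""

    # tag -> (attribute holding the reference, hypothetical kind label)
    ATTRS = {
        "a": ("href", "anchor"),
        "link": ("href", "hidden"),
        "form": ("action", "form"),
        "img": ("src", "media"),
        "script": ("src", "script"),
    }

    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.ATTRS:
            return
        attr, kind = self.ATTRS[tag]
        attrs = dict(attrs)
        target = attrs.get(attr)
        if not target:
            return
        # <link rel="stylesheet"> is a utility, not a hidden link.
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "stylesheet" in rel:
            kind = "stylesheet"
        self.links.append((kind, urljoin(self.base_uri, target)))


def is_internal(uri, site_host):
    """True when the URI points at the site being inventoried."""
    return urlparse(uri).netloc == site_host


def find_orphans(all_uris, links_by_source, site_host):
    """Resources with no inbound link from any other resource in the site."""
    linked = set()
    for source, links in links_by_source.items():
        for _kind, target in links:
            if target != source and is_internal(target, site_host):
                linked.add(target)
    return set(all_uris) - linked
```

Inbound links from within the site fall out of the same data by inverting links_by_source; inbound links from other sites need a source outside the crawl, such as referrer logs.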

Now, What To Do About It?

Status, augmenting existing bibo:DocumentStatus with: empty, incomplete, incorrect, obsolete, retired and orphan, which I already mentioned
Action, as in what to do about it and who to pin it on: keep, split, merge, update metadata, proofread, revise, rewrite and retire
Any helpful ad-hoc annotations or bookmarks through a mechanism like Annotea
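
The status and action vocabularies above are small enough to pin down as enumerations. What follows is only a sketch: the class names, the Assessment record and the worklist function are invented for illustration, and the Status enum lists just the additions named here, not the existing bibo:DocumentStatus values.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Only the proposed additions; existing BIBO statuses are not listed."""
    EMPTY = "empty"
    INCOMPLETE = "incomplete"
    INCORRECT = "incorrect"
    OBSOLETE = "obsolete"
    RETIRED = "retired"
    ORPHAN = "orphan"


class Action(Enum):
    KEEP = "keep"
    SPLIT = "split"
    MERGE = "merge"
    UPDATE_METADATA = "update metadata"
    PROOFREAD = "proofread"
    REVISE = "revise"
    REWRITE = "rewrite"
    RETIRE = "retire"


@dataclass
class Assessment:
    """What to do about one resource, and who to pin it on."""
    uri: str
    status: Status
    action: Action
    assignee: str
    note: str = ""  # free-form annotation


def worklist(assessments):
    """Group pending (uri, action) pairs by assignee, skipping 'keep'."""
    todo = defaultdict(list)
    for a in assessments:
        if a.action is not Action.KEEP:
            todo[a.assignee].append((a.uri, a.action))
    return dict(todo)
```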

Wait, You Forgot Sections!

Nope. I'm pretty convinced that the idea of prescribed sections on the web is a diversion from what the web excels at. The structure of the RDF data model is such that it represents connections between resources based on what they mean, both in the type of connection and the content of the resources themselves. These connections accrue over time and represent associations that people have ultimately found meaningful. When we pile all this information together, along with certain elements above, we can tease the model apart along its natural fissures. Try as we might, our best tools for hand-crafting sets of resources equate to biased heuristics. We would do ourselves a service to take advantage of the data. Besides, any existing section landings will still be present in the scan; it's just a question of how useful they will be.
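
For a very crude taste of letting groupings fall out of the data instead of prescribing them, the sketch below buckets resources by the concepts they point at. It is a much simpler stand-in for any real analysis of the graph, and it assumes triples shaped as plain (subject, predicate, object) tuples with dcterms:subject as the default grouping predicate; the function name is made up.

```python
from collections import defaultdict

SUBJECT = "http://purl.org/dc/terms/subject"


def group_by_concept(triples, predicates=(SUBJECT,)):
    """Map each concept URI to the set of resources that point at it.

    Only triples whose predicate is in `predicates` are considered, so
    the type of connection decides what counts as belonging together.
    """
    groups = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate in predicates:
            groups[obj].add(subject)
    return dict(groups)
```

A resource can turn up in more than one group, which is exactly what a fixed section hierarchy has trouble expressing.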

Anyhoo…

My super-ghetto prototype website-to-RDF crawler ~~is chugging away on iainstitute.org as I write. It is slow, because I am too lazy to make it multithreaded. When it exits, I'm gonna make another pretty picture like this, and then I'm going~~ is now finished and the results are at the top of the page. Now to get to Actual Work™.
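
For the curious, a single-threaded website-to-triples crawler can be sketched in a few dozen lines of standard-library Python. This is not the prototype described above, just an illustration of the general shape: a breadth-first walk that stays on one host and records each link as a triple. The linksTo predicate is a made-up placeholder, and the fetch function is passed in as a parameter so the walk can be tried without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen

LINKS_TO = "http://example.org/vocab#linksTo"  # hypothetical predicate


class _Anchors(HTMLParser):
    """Collect the raw href of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch_html(uri):
    """Default fetcher: return the response body as text."""
    with urlopen(uri, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(start_uri, fetch=fetch_html, limit=1000):
    """Breadth-first, one page at a time, never leaving the start host."""
    host = urlparse(start_uri).netloc
    queue = deque([start_uri])
    seen = {start_uri}
    triples = []
    fetched = 0
    while queue and fetched < limit:
        uri = queue.popleft()
        try:
            body = fetch(uri)
        except OSError:  # urllib's URLError is a subclass of OSError
            continue
        fetched += 1
        parser = _Anchors()
        parser.feed(body)
        for href in parser.hrefs:
            # Resolve relative links and drop the #fragment.
            target, _fragment = urldefrag(urljoin(uri, href))
            triples.append((uri, LINKS_TO, target))
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return triples
```

Calling crawl with a start URI and the default fetcher walks a live site; swapping in a dictionary-backed fetch function makes it easy to poke at on canned pages. Being single-threaded, it is as slow as advertised.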

Peace out.