stuff designers and developers should know about the web

The Basic Unit: The Resource

A Web resource is a relation between:

one or more identifiers,
to one or more representations,
in response to one or more request methods.

To compare this to something familiar, a file has exactly one identifier: its name, and exactly one representation: its content. A file can be opened and read, written to and overwritten, or deleted outright. The hypertext transfer protocol has analogous verbs for these operations: GET, and the much less-frequently used PUT and DELETE. There is also the method POST, which, in effect, is a way to tell a resource something and have it optionally respond, without necessarily modifying the resource itself.

It's important to understand that an HTTP resource is not an artifact, but rather a process, or more accurately, a collection of processes, one for each request method. Even if it's as mundane as looking up a file, opening it up and spewing its contents down the wire, the resource is that operation, plus all other implemented, enabled, and permitted operations over that particular file, not the file itself.

To sum up, an HTTP resource is what happens at a certain address—or addresses—when requested.

On Representations

A representation is a finite segment of data intended to convey the meaningful content of the resource. Unlike a plain file, which only has one representation, an HTTP resource has five dimensions along which the same content can vary. And that's not including ways in which a resource can be designed to deliver different content under different conditions.

Two of these dimensions are not meaningful to designers—character set and compression—but two are, and I don't see them exploited nearly as often as they could be: content type and language. The fifth dimension unfortunately never made it out of the lab, which is too bad because it's the coolest: content features.

Content features would have enabled Web browsers to signal the server whether they were being viewed on luminous, reflective or projected displays, whether they preferred their printables in Letter or A4, their screen size, both in pixels and physical units, and whether or not they were Retina. Some of those capabilities are possible through different routes, but in my opinion, they aren't nearly as elegant.

This means that depending on how you request a resource, it could give you back an English HTML page, a French PDF or a JSON data structure in Chinese, all containing the same information—or rather the same meaning. This means that the resource's identifier can denote the content in the abstract, and have the specifics negotiated on a per-request basis. As far as I'm concerned, this is a huge untapped capability with enormous potential for applications of content strategy, information architecture, and general user experience design.

Web browsers send metadata along with their requests, like the user's language preference, which in turn is retrieved from the operating system. I used this technique at a now-defunct company from 2002 to 2005 to deliver content seamlessly in 15 different languages. The user's experience was that the site was in their native language and that's just the way it was. We did have a language selector, though you would only need to use it if you spoke a different language than what your computer was configured for.

The ability to vary a resource's representation by language is a no-brainer for seamless internationalized experiences, but varying by media type is also extremely powerful. Consider how so many web pages are not documents per se, but output from a database: lists, sets, tables, hierarchical structures, et cetera. Now, lots of large sites have also spent a lot of money creating APIs which ape several aspects of their vanilla human-oriented content, but geared for consumption by machine, upon which developers can create value-added applications. What if it were possible to create a single polymorphic resource that was a legible page for people, and a structured data object for programs? A site made out of resources like this would be its own API.

And then of course there's the practice of embedding machine-readable data into human-readable content, but let's consider one thing at a time.

Comprehending that the same HTTP resource can have multiple representations means we can think about that resource in terms of its general informational content rather than the details of what language it's in or what program is appropriate to open it.

Yes I do understand that internationalization is expensive, and I'm glossing over huge swathes of problems—many of which I've tackled first-hand—around organizing internationalized content. Know what else is expensive? Changing a system from assuming a single natural language to being fully internationalized. Eraser : drafting table :: sledgehammer : building site.

Furthermore, thinking about HTTP resources as fundamentally polymorphic opens up a different way to organize projects. Websites may have arbitrarily many resources, but are almost guaranteed not to each have more than a handful of resource types. A resource type is not the same thing as a file type. A resource type is this: news article, photograph, income statement. A file type is this: HTML, JPEG, CSV. Web projects are usually organized by feature and section, but resource types only embody a narrow concept of their own behaviour, and zero concept of their location on the site. I strongly suspect that boiling a Web project down to its constituent resource types instead of working in terms of features and sections would probably yield more robust returns, sooner, and with a cornucopia of serendipitous wins in tow.

This would be especially true if you borrowed the notion of class inheritance from object-oriented programming. Specialized resource types could inherit properties from general ones, meaning less code would have to be written overall. Just like all other open-source projects, implementations for the more generic resource types would start to show up in the usual places, free to reuse and expand upon.

On Identifiers

An identifier is a symbol which denotes a resource, which is hopefully shorter than the meaningful content of the resource itself. The uniform resource identifier was invented for the Web, and it does what it says on the tin: identify information resources using a grammar which is uniform. When an identifier encodes information about how to retrieve a resource, it is also called a locator or address. The generic form of a URI—or more specifically, URL, for locator—looks like this:

scheme://authori.ty/path?query#fragment

The scheme says how to deal with the identifier, the authority says who, which on the Internet also means where. On the Web, it's the path and query components that get sent to the server.

The /path?query pair can literally contain anything as long as it's syntactically valid from the perspective of the protocol. For instance, this is legal:

/}mMq5C8sVQ[E?/+P9h/oA%22C/~%20'K'4b2%25

The browser does not try to infer any meaning from the content of the identifier.
Neither does the communication protocol.
The only thing that needs to interpret the identifier is the particular server to which the request was directed.
Oh, and I suppose the identifier can be made meaningful to users, but I'm going to flesh that out elsewhere.

Except I will say this: When browsing the Web, ever cut off the end of a page's address to reveal the landing page for that section of the site? Ever just get a list of files instead, like a folder on your computer? Ever get something completely unexpected?

All of that behaviour, the idea that the elements between the / character correspond to folders, that there will be an index page at every slash—or list of files in lieu—all of that is an illusion. Or perhaps, more accurately, a convention. There's no spec or standard dictating that kind of behaviour, but a lot of effort has gone into each server implementation to make the Web mimic files and folders. That makes sense, because the early Web was just files and folders, and to a great extent still is, but the point I'm trying to impress is that any relationship between HTTP URLs and an ordinary file system is metaphorical at best.

Also worth noting: file extensions, like .html or .jpg are completely unnecessary in URLs, because the media type of the representation is sent alongside in a protocol header. File extensions can be done away with entirely on the Web, unless you wanted to use them deliberately as part of the user interface.

Once again: there is no connection whatsoever between the layout of a website's URLs and the site's logical structure. Any hard binding between the two evaporated a long time ago. This fact ought to be meaningful to both designers and developers, because it means that the URL is now 100% pure UI.

Let me put it this way: you could conceivably make a site whose addresses were all serial numbers that incremented in the order the resource was added to the system, like this:

http://my.site/4739

A strategy like this would be perfectly fine for a system that wasn't trying to use its URLs to imply anything meaningful about its internal structure. URL shorteners do precisely this, though they encode the serial number in base 60-ish, using capital and lower-case letters in addition to decimal digits, to get the shortest possible end result. Underneath, however, is still an integer, so the very existence of /4739 implies the definite presence of /4738 and the possibility of a /4740. Now consider this:

http://my.site/6a53fe60-1bd4-4b8b-9144-5c9fb1646ea6

The symbol making up the path component of that URL is called a UUID, and there are 2¹²² of them. It's basically an enormous random number with a guaranteed fixed length, giving it all sorts of useful properties under the hood. I use UUIDs in my own practice as canonical identifiers for resources, meaning they're stable and unique, and I can add more meaningful identifiers later. In fact, not having to care about what to call a resource, or where to put it in order to make progress, has utterly changed the way I work with the Web: I can decide on the site structure and naming scheme long after I've created the content and functionality.

Separating the act of deciding what to call and where to file a resource from that of crafting its content and/or behaviour goes beyond just making it easier to focus on one without having to stop and consider the other. We can actually start thinking about the business of address resolution in a completely different way.

The critical property of a URL is that it only points to one resource. It can't be ambiguous. There are a lot of other properties that are nice to have in a URL, but that one is actually required for them to work. It also helps when the address is stable, so that a resource retrieved now can also be retrieved later. The nice-to-haves include being short and being memorable, which work against the goal of being unique and stable: short and memorable occupies scarce prime real estate, so over time there is the potential that a resource will be forced to give up its address to a different one, implicitly track-switching all inbound links to a target that the people making those links had not originally intended.

You see this a lot in redesign projects: people whose job titles include the word designer are quite content to obliterate a site's address space and transform it into something else without a semblance of an acknowledgement that other sites might be referring to resources under the existing scheme, let alone a plan for dealing with them.

Sayonara 404

A website would work perfectly fine if all its resources were identified by UUIDs, or serial numbers, or random strings. Its structure could be completely flat, or sport an intricate nesting of sections, one for each and every resource. This is because a website's structure is a fiction, and hardly even for the benefit of the user. Arguably a website's structure—that is, URL structure—is more influenced by a mishmash of project management interests and search engine optimization. It'd be interesting to see what percentage of users who aren't also Web professionals actually use a site's URL to orient themselves, over, say, a graphical cue on the page itself. It would likewise be interesting to know how many navigate, with the exception of home pages, by direct input. Especially nowadays that mobile browsers tend to mask the location with the page's title.

I want to talk about the 404. Even if you're not in the business, you've invariably encountered the infamous Not Found error. As the guy who invented the Web said in 1998, there is no excuse for a 404.

Let's imagine this: a Web server is a computer program that plays Go Fish. The pool is the set of strings that represent syntactically-valid URL paths, and the server's hand is the infinitesimally-smaller set of URLs it will actually respond to. Even if you whittle the former set down to the set of symbols that may be meaningful to people, it will still be several orders of magnitude larger than the set of resources that are actually there. Scoring a hit on any resource without intimate foreknowledge of its whereabouts is virtually impossible. For all our talk about findability, it's strange that we haven't done much to improve a system so resolutely biased against finding things. The 404 error is the server telling us to go fish.

In effect what I'm suggesting is to turn the web server's URL resolver into a kind of metadata search engine—one that is more likely than not to respond positively to arbitrary requests. Even while employing redirections to nudge users toward the addresses we prefer to stand for our resources, we can still be extremely permissive with respect to the URLs actually typed in.

Arguably the built-in URL resolver present in most server implementations already is a metadata search engine, just a really bad one: the only metadata it has access to are the names of files and folders, between which of course the only semantic relation that can be expressed is containment.

The reason for caring is that when we mint a URL, we lose control over it—at least absolute control. URLs get bookmarked, they get emailed around, they get chopped up and permuted in weird and unexpected ways. This is all before addressing the problem Tim Berners-Lee pointed out in that 1998 article: We nonchalantly shuffle around some files on a Web server somewhere and potentially years of accumulated inbound links are instantly snuffed, both from within the organization and without. The solution? Decouple content addressing from content creation. Make them two separate, highly-instrumented tasks.

The real use case for this behaviour is somebody using a boring old text editor to write a boring old web page that links to you. Who knows how they screw it up, it doesn't matter. Maybe they typo. Maybe they delete a character by accident. The point is they don't have any fancy-pants content management features to verify the link or check if it ever goes bad in the future.

Anyway, since we're already well past the 3000-word mark by now, I'll follow up later with precisely how to do this.

The Basic Unit: The Resource

On Representations

On Identifiers

Sayonara 404

Resource as Building Block