This is the second of two pieces.

Perhaps it is my early exposure to the relatively primitive, close-to-the-metal technologies of CGI and Web server APIs that has coloured my thinking, but I wonder if we're overlooking an opportunity to provide a much more powerful and seamless experience than we currently do. In its most raw form, the Web is stateless, anonymous and extremely loosely coupled. This means that anything that can understand HTTP and HTML can also move information back and forth, and navigate between resources. The current incarnation of the protocol was explicitly designed to make heavy use of content negotiation, caching and proxies. Content was intended to be transformed, filtered, sliced, diced and julienned. We seem to have forgotten a great deal of this functionality, and only venture there when the situation incontrovertibly demands it. Many sites nowadays simply won't work outside of a conventional browser with cookies and JavaScript enabled. The impending HTML5 specification introduces its own SQL database, which will enable browsers to store even more complex application state.

Take Your App and Shove It

I suppose I find this part the most irksome, because to me the Web isn't about building apps; it's about placing and finding information and moving data from one place to another. Every Web-based system reduces to this, no matter how it is applied. When we cast a system as an application, however, we frame otherwise generically applicable data, information or content as being for a particular purpose, and that taints the entire experience.

Possibly the most confusing development is the treatment of the Web API, not to be confused with server APIs. What Roy Fielding has been trying to tell us for the last ten years is that HTTP is the API. And he should know, after all: he designed the protocol. He also coined the term REST and the idea behind it, which has since been co-opted, corrupted and abused by an ignorant and unappreciative developer community.
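To make "HTTP is the API" concrete, here is a minimal sketch of the uniform interface: URIs name resources, and the standard methods do all the work, with no custom verbs or endpoints beyond the resources themselves. The resource paths and the in-memory store are invented for illustration.

```python
# A toy WSGI app: the entire "API" is URIs plus GET/PUT/DELETE.
RESOURCES = {"/notes/1": b"Buy milk"}  # hypothetical in-memory resource store

def app(environ, start_response):
    path = environ["PATH_INFO"]
    method = environ["REQUEST_METHOD"]
    if method == "GET":
        if path in RESOURCES:
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [RESOURCES[path]]
        start_response("404 Not Found", [])
        return []
    if method == "PUT":
        # replace (or create) the resource with the request body
        length = int(environ.get("CONTENT_LENGTH") or 0)
        RESOURCES[path] = environ["wsgi.input"].read(length)
        start_response("204 No Content", [])
        return []
    if method == "DELETE":
        RESOURCES.pop(path, None)
        start_response("204 No Content", [])
        return []
    start_response("405 Method Not Allowed", [("Allow", "GET, PUT, DELETE")])
    return []
```

Any HTTP client, from curl to a search crawler, can drive this without a wrapper library, which is precisely the point.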

But the Data is Already There

Many APIs, even those that claim to be RESTful, do little more than provide the same information that is already on the pages, or the same functionality as submitting certain forms on the site. They do so inside a sandbox of API keys, dodgy cryptographic handshakes and rate limiting. They then demand the creation of wrappers to translate their baroque innards into usable software libraries, one for each programming language. The supposed benefit is access to a conduit of reliable, machine-readable data, but I wonder how much more effort it would really be just to write a Web scraper which does the same thing. I have personally written scrapers to fill the gaps in APIs that lacked information I wanted, and haven't had to repair them any more often than any API wrapper I've used. As a bonus, I don't have to fool around with an API key or some weird token-passing login mechanism, or tiptoe around what often turns out to be arbitrary and officious usage accounting.
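A scraper of the kind described above can be surprisingly small. This sketch uses nothing but the standard library's HTML parser; the page markup and the `class="price"` convention are invented for illustration, not taken from any real site.

```python
from html.parser import HTMLParser

class PriceScraper(HTMLParser):
    """Collect the text content of every element with class="price"."""
    def __init__(self):
        super().__init__()
        self.grab = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # flag the next text node if this element carries the price class
        if ("class", "price") in attrs:
            self.grab = True

    def handle_data(self, data):
        if self.grab:
            self.prices.append(data.strip())
            self.grab = False

# hypothetical page markup standing in for a fetched document
page = """<html><body>
  <span class="price">$4.99</span>
  <span class="price">$12.50</span>
</body></html>"""

scraper = PriceScraper()
scraper.feed(page)
```

After `feed()`, `scraper.prices` holds the extracted values, and no key, token or handshake was involved.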

Here's the thing: the data, whatever data, is already on the page. You can go fetch it by visiting its URI. Then you can extract it with whatever tools you see fit. Don't like the extra overhead of superfluous HTML, or afraid a change to the layout will torpedo your data pipeline? Fine, supply a more machine-friendly variant of the same data, and put it at more or less the same address. A few outfits do this, but only a few. Twitter is one of them. In addition to opening up to third-party developers, they seem to use their API to drive their own site. They also provide an option to use plain-Jane stateless HTTP authentication, sans cookies, shared secrets or bogus crypto. Nice. Almost.
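The "same data at more or less the same address" idea is just content negotiation: one resource, several representations, selected by the client's Accept header instead of by a separate /api/ URL. A minimal sketch, with invented sample data, and with only naive substring matching on the header (a real server would honour q-values and precedence):

```python
import json

# hypothetical record backing a single resource
RECORD = {"name": "Alice", "joined": "2009-03-14"}

def render(accept):
    """Pick a representation of RECORD from the client's Accept header."""
    if "application/json" in accept:
        return "application/json", json.dumps(RECORD)
    # fall back to HTML for browsers and everything else
    rows = "".join(f"<dt>{k}</dt><dd>{v}</dd>" for k, v in RECORD.items())
    return "text/html", f"<dl>{rows}</dl>"
```

A browser asking for `text/html` gets a page; a program asking for `application/json` gets clean data, both from the same URI.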

Easier, But Not Orders of Magnitude Easier

The problem is that their machine-friendly representations of data objects are their representations. They have their own idiosyncratic understanding of what a user object is, for example. Even though they supply the data in a vendor-neutral syntax like XML or JSON, I still have to translate the semantics of each field into my internal representation. This means I can't just haul in data from Twitter, or any other API provider, through a generic data interface; I need to write an additional layer of code specific to them. This is the difference between writing one adapter for all Web sites and writing one for each site I'm interested in.
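The per-site adapter tax looks something like this in practice. The field names below are invented stand-ins for two providers' idiosyncratic user objects; the point is that each new provider means another hand-written function, even though the underlying facts are identical.

```python
def adapt_site_a(raw):
    """Hypothetical site A calls the handle 'screen_name' and nests location."""
    return {"name": raw["screen_name"], "city": raw["location"]["city"]}

def adapt_site_b(raw):
    """Hypothetical site B flattens everything but picks different keys."""
    return {"name": raw["username"], "city": raw["home_town"]}
```

Both functions emit the same internal shape, but neither could have been generated mechanically, because nothing in the JSON itself says that `screen_name` and `username` mean the same thing.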

There's an App for That

This problem, by the way, is exactly the one the Semantic Web was designed to solve. If a Web site provides its data as RDF, the lingua franca of the Semantic Web, any consumer can unambiguously determine what every single little nugget of data is intended to represent, without any site-specific code. If the site in question understands RDF, the same applies to data coming from other sources, like me. With RDFa, even conventional HTML pages can be imbued with semantic data, obviating the need for data consumers to rely on custom scrapers or error-prone heuristics. I see this greatly improving search and findability, simplifying client-side programming and opening up a number of new, interesting and valuable uses for the data.
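Why RDF dissolves the adapter problem can be sketched with nothing but tuples. Every fact is a (subject, predicate, object) triple whose predicate is a globally defined URI; here the real FOAF vocabulary's `name` property. The two site URLs are invented, but because both sources use the same predicate, merging them is set union and one generic query serves everybody.

```python
# FOAF's "name" property; a real, globally agreed-upon identifier
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# hypothetical triples published by two unrelated sites
site_a = [("http://a.example/users/1", FOAF_NAME, "Alice")]
site_b = [("http://b.example/people/7", FOAF_NAME, "Bob")]

# merging disparate sources is just set union; no per-site code
graph = set(site_a) | set(site_b)

# one generic query: "give me everything that has a name"
names = sorted(o for (s, p, o) in graph if p == FOAF_NAME)
```

A real system would use an RDF library and a query language like SPARQL rather than tuples and comprehensions, but the mechanics are the same: shared predicates, not shared adapters.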

Resource, not Service-Oriented Architecture

There is one other consideration, and that is the siteness of a site. It is best expressed neither as a folder full of files on a server, nor as a monolithic application off in the cloud, but as a collection of potentially disparate resources under the management of a particular business entity. Any resource on the Web can point to and use any other resource anywhere else with the same trivial effort. It can even do so across organizational boundaries, unless the target organization refuses to cooperate.

This is strange behaviour, as the organization in question usually gains on the whole from what has become a perfectly natural form of reference we call linking.

So if we took all these ideas and put them together, what would it look like? More importantly, what would we gain from it?

What are You Going to Do About It?

The system I envision is in some ways as much a conduit for information as it is an origin. It interacts heavily with other Web-based systems, but doesn't completely trust any one of them. It works with all clients, not just the latest browsers, where "works" means it can consume content, follow links and send data. It does not depend on client-side state mechanisms or code, such as cookies or JavaScript, though it may use them to augment the experience. It is friendly and forthcoming to robots and search engines, though resilient to abuse. It relays data to and from third parties, but maintains its position between its users and other entities to ensure it can deliver on its promises. It records every click for the most comprehensive statistical analysis. Every datum is represented in a semantic structure in every representation capable of expressing one, and every URI effectively doubles as an API endpoint. Part of it could reside in the cloud, while another could easily live in your office or hall closet.

The gain, I hope, is manifold. By reducing the command language back to plain HTTP, I can enfranchise everybody, regardless of development platform. This includes operations within the confines of the system itself, potentially composing it of many different platforms running in many different places, each contributing bits and pieces of functionality. By representing all my data as RDF, I provide a clear meaning for each structure. I can also borrow established vocabularies made by people who have had much more time to consider how such structures ought to look, and which are much more likely to be adopted by other people. I can likewise consume data from any appropriate source and weave it into my own. By maximizing the system's statelessness and operability, I no longer have to worry about turning away paranoid users or those with antiquated browsers, even though their experience may be decidedly clunkier. Mostly, I'm interested in demonstrating how responsible Web development behaviour can be fostered by taking away the bulk of the work that gets in its way.

Oh, and this is something I am actually working on, inching toward completion between more pressing matters. None of it is technically very difficult; it's more a matter of making thousands of decisions about how I want it to behave. The actual product, so far, is surprisingly compact. If I succeed, you won't be able to tell I'm even using it. No promises as to when it'll be ready, though; I don't want to jinx it.