The first phase of my Summer of Protocols project entails a massive refactor of a piece of code which I am currently calling the Content Swiss Army Knife, or Intertwingler. The objective is to take it from an artifact that could be described as akin to a breadboard or laboratory for experiments, to a load-bearing software library and concomitant daemon, with a sensible command-line interface.

The current state of Intertwingler is that it acts as a static website generator, merging a graph database with a directory full of files and producing a different (and disposable) directory full of files, which can be uploaded en masse to a Web server. Intertwingler marshals a Resolver, which maps the durable identifiers used internally to human-friendly HTTP URLs and tracks their naming history. It also generates certain special resources out of whole cloth, such as indexes and feeds.
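To make the resolution idea concrete, here is a toy sketch in Ruby. The class and method names are hypothetical, not Intertwingler's actual API; the point is the mapping from durable identifiers to the URLs they are currently published under, plus the naming history that lets retired URLs keep working.

```ruby
require 'uri'

# Hypothetical sketch only, not Intertwingler's real Resolver: map durable
# internal identifiers (e.g. UUID URNs) to the human-friendly URL each one
# is currently published under, and remember the slugs they used to have.
class ToyResolver
  def initialize(base)
    @base    = URI(base)
    @slugs   = {}                              # durable identifier => current slug
    @history = Hash.new { |h, k| h[k] = [] }   # durable identifier => prior slugs
  end

  def assign(urn, slug)
    @history[urn] << @slugs[urn] if @slugs[urn]
    @slugs[urn] = slug
  end

  def resolve(urn)
    URI.join(@base, @slugs.fetch(urn))
  end

  # An old slug can still be answered with a redirect to the current one.
  def previously?(slug)
    @history.any? { |_urn, slugs| slugs.include?(slug) }
  end
end

r = ToyResolver.new('https://example.org/')
r.assign('urn:uuid:1a2b3c4d-0000-0000-0000-000000000000', 'my-first-note')
r.assign('urn:uuid:1a2b3c4d-0000-0000-0000-000000000000', 'a-better-title')
r.resolve('urn:uuid:1a2b3c4d-0000-0000-0000-000000000000')
# => #<URI::HTTPS https://example.org/a-better-title>
r.previously?('my-first-note') # => true
```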

I have elected to respond to these desiderata in the following way:

Since I am completely reorganizing the library itself, I also need to reorganize how it is configured. It has always taken a mouthful of configuration parameters, and I need to give it more. I ultimately intend to run Intertwingler as an application server (behind a caching reverse proxy), which means that unless I want to configure a bunch of redundant processes, it's going to need to respond to multiple sites (i.e., dispatch its responses via the incoming Host: header). The configuration is therefore going to have to be multi-lobed, though since a lot of it is going to be shared, I want to be able to specify a global baseline configuration, with site-specific deltas. This hopefully will prevent an entire class of configuration mistakes.
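As a sketch of what I mean by a baseline plus deltas (the key names here are invented for illustration, not Intertwingler's actual configuration schema), the idea is to deep-merge a per-site stanza over the shared defaults, keyed by the incoming Host: header:

```ruby
require 'yaml'

# Illustrative only: made-up keys, showing the shape of the configuration,
# i.e. a global baseline plus per-site deltas, merged at load time.
RAW = YAML.safe_load(<<~YAML)
  defaults:
    graph: urn:x-example:graph
    transforms: [markdown, rewrite-links]
  sites:
    example.org:
      base: https://example.org/
    example.net:
      base: https://example.net/
      transforms: [markdown]   # overrides the baseline list
YAML

def deep_merge(base, delta)
  base.merge(delta) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

CONFIG = RAW['sites'].transform_values { |delta| deep_merge(RAW['defaults'], delta) }

# Dispatch on the Host: header, e.g. site = CONFIG.fetch(request.host)
CONFIG['example.net']
# => {"graph"=>"urn:x-example:graph", "transforms"=>["markdown"], "base"=>"https://example.net/"}
```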


Now we can finally talk about what motivated me to write this note: when software configuration gets particularly wordy, it is often moved into a file, and these files typically reduce to some ordinary form of text. The configuration data, however, often requires some sort of constraint on what kind of text is valid, or is of some other type entirely (e.g. numbers, dates, controlled vocabularies, etc.). The composite structures present in the configuration (primarily list, set, and dictionary types) are similarly constrained in terms of what they can (or must) contain. Verifying these constraints, and coercing (as it is called) the textual representation of a piece of configuration data (or, say, the content of an API request) into a more useful data type, is a dirt-common activity that happens all over the place in software development. Every sufficiently mature programming language has at least one library to do it.
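A toy example of the activity in question, using no particular library and made-up keys: everything arrives as text and has to be checked and coerced before the rest of the program can trust it.

```ruby
require 'uri'

# Raw configuration as it arrives: all strings.
raw = { 'port' => '8080', 'base' => 'https://example.org/', 'tags' => 'a, b, b' }

port = Integer(raw.fetch('port'))            # raises ArgumentError if not numeric
base = URI(raw.fetch('base'))
raise ArgumentError, 'base must be http(s)' unless base.is_a?(URI::HTTP)
tags = raw.fetch('tags').split(/\s*,\s*/).uniq   # a set-like constraint

{ port: port, base: base, tags: tags }
# => { port: 8080, base: #<URI::HTTPS https://example.org/>, tags: ["a", "b"] }
```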

Not all of these libraries are created equal.

Like many a Web greybeard, I have spent the bulk of my career writing Perl. I wrote mainly Perl from 1997 through 2017, and a big part of the reason I stuck with it so long is that the median third-party module turns out to be incredibly well-designed.

The way (or at least a way) to do data validation in Perl is a thing called Type::Tiny. Type::Tiny is excellent. Its author, Toby Inkster, thought of absolutely everything. It is a generic, low-footprint solution to a ubiquitous and perennial problem, which it handles competently and efficiently, and does so in a way that blends into the idiom of the host programming language. Most importantly, it doesn't surprise you. But I'm not writing Intertwingler in Perl, I'm writing it in Ruby.

The least-bad way to do data validation in Ruby, from what I can tell, is a thing called dry-rb. This is a very sophisticated, highly organized software framework that is scrupulously documented and written by people who clearly know how to make software. And yet, I dread using it. I find it infuriating in a way that is characteristic of it, yet hard to articulate. It's like instead of a delicious piece of cake, I'm looking at a bunch of raw ingredients.
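For a sense of what it looks like on the page, here is roughly the kind of declaration dry-schema (one of the constituent gems) is built for. This is a minimal sketch with invented field names, and nothing like a full picture of the framework:

```ruby
require 'dry/schema'

# Declare the shape once; get coercion plus validation errors in one pass.
SiteSchema = Dry::Schema.Params do
  required(:base).filled(:string)
  required(:port).filled(:integer)
  optional(:tags).array(:string)
end

ok = SiteSchema.call('base' => 'https://example.org/', 'port' => '8080')
ok.success?      # => true
ok[:port]        # => 8080 (coerced from the string)

bad = SiteSchema.call('port' => 'not a number')
bad.errors.to_h  # => errors for the missing :base and the uncoercible :port
```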

Literally every time I've used it, dry-rb has done something to piss me off. I think its biggest crime is that it looks like it should be able to do things that it can't, and the only way to find out that it can't is to waste a bunch of time with it.

Another issue with dry-rb is that it's split into about two dozen separate packages with functionality that suspiciously appears to overlap in some places, and it isn't clear what they all do, or why, or even if you need them. A prominent side effect of this situation is that the documentation is—as it tends to be—organized by package, which means to figure something out you're often flipping back and forth between three or four different documents at once. There's no single screen you can look at to get the entire picture.

The dry-rb documentation is mercifully full of examples, but they seem to be, not contrived, exactly, but definitely localized—parochial to the individual module, rather than situated in the greater system. I suspect that if you tried to concoct a sufficiently complex, yet nevertheless realistic and even common example scenario, you would discover a bunch of places where it just plain doesn't work. I am confident of this because my sporadic attempts to use this software over the last five or so years have not been particularly exotic; they scarcely rate beyond the mundane. The failures range from surprising errors that stem from poor visibility into the fine-grained semantics of what you're writing (and one could argue that the need to keep a mental inventory of such fine-grained semantics is itself a failure) to bafflingly missing functionality for obvious things. The first time, I gave up completely. The second resulted in an embarrassing workaround. So did this most recent excursion—although the workaround was slightly less embarrassing.

A member of the general category of data validation software has to make an extraordinarily strong argument for why you should include it in your kit, because what it's offering to do is save you time, both by making things more organized, and by keeping you from having to repeat yourself—which is what DRY stands for. If you have to work around it—i.e., repeat yourself—to get your desired results, then its value is sharply diminished.

Moreover, everything dry-rb does is code I can write in my sleep. Like anybody who's been in the business for a while, I know how to write a one-off data validator; it's just that I would prefer not to. Having to duct-tape a purpose-made data validation framework to one-off data validation code (which is almost certainly redundant, not to mention likely to break the next time the framework authors issue an update) feels farcical to me. There is a special insult to it: you could have done this your way, but we convinced you to do it our way, but you can't actually get what you need by doing it our way, and there's no time left to back out and do it your way, so now you have to do it in a way that is neither your way nor ours.

Anyway, I finally got over the hump of the new configuration, after putting it off for weeks, for all the reasons I stated above. I don't love the solution, but the benefits of using dry-rb do outweigh the baggage.


I suppose it's high time I shared some thoughts about the politics of open-source software, by entertaining refrains that go like:

Some people publish open-source software with no promise of support, and an attitude that if people want to use it, that's up to them, caveat downloador. Others are deliberately trying to occupy thematic territory. By that I mean they want people to use their software, because they want to be the premier solution to a particular class of problems. There are a lot of good reasons for either position, but I submit that if you start off with the latter, reverting to the former for rhetorical purposes is disingenuous: Just don't use it then is a bad-faith response when you are actively courting mindshare. Make your own is little more than a goad.

Send a patch is likewise easier said than done. There is something of an unspoken compact in open-source development that goes like I will fix your software, and you will perpetuate the fix. The complication here is, from the point of view of the maintainer, not all fixes are helpful, and from the point of view of the fixer, not all maintainers are responsive, or even reasonable people.

My own experience patching other people's software (to say nothing of receiving patches) is…checkered. It definitely elicited a sort of tacit diplomatic protocol that I now try to follow. Fixing another person's code, after all, can be a costly proposition. With the time it takes to learn its idiosyncrasies, conform to the maintainer's style and other stipulations, and engage in copious correspondence on top of the essential work itself, the risk remains that your effort goes nowhere—or at least nowhere in time to be useful. On the receiving end, you have to deal with people showing up cold with a patch that significantly alters your editorial direction, with the earnest expectation that you're going to go ahead and rubber-stamp it. Today.

As a patch-submitter, I now probe the maintainer ahead of time to see how amenable they are. I also do some recon to see how active they are in their software project: if my patch has to wait for your next release, and you don't release for several months, it's worth much less to me. GitHub has provided a reasonably tidy venue for this activity, and the more relaxed, neutral-sounding term issue (certainly not unique to GitHub) is less adversarial than bug or defect.

There is an art to filing a bug report. When the issue in question is in fact a genuine defect, the argument to fix it is already manifest. What you concentrate on instead in your report are your observations of the defective behaviour and the steps to replicate it, along with what you believe the expected behaviour should be. Bonus points if all of this is directly executable by machine. Bug reports of this kind are costly to prepare, as the implication is that the maintainer will be the one fixing the code. Proposals for strictly additive changes to system behaviour (which many call feature requests, though I prefer to focus on behaviour rather than features) are generally speculative, so what you're proposing is usually that you will attach new behaviour to another person's product, which they will then go on to maintain. Here the focus of the correspondence is less on what the artifact should do than on why it should do it, and it is generally more of an ongoing conversation than a highly front-loaded, transactional missive. In other words, you approach a patch proposal with more conviviality and less formality than a bug report.
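To illustrate what "directly executable by machine" can mean in practice, here is a sketch of a reduced reproduction written as a test. The gem, version, and behaviour are invented for the example; only the shape matters: the smallest possible input, the observed result, and the expected result.

```ruby
# Hypothetical reproduction attached to a bug report. 'some_library' and its
# behaviour are made up; substitute the actual gem under report.
require 'minitest/autorun'
require 'some_library'

class RegressionTest < Minitest::Test
  def test_round_trip_preserves_unicode
    input = 'café'
    # Observed on some_library 2.3.1 under Ruby 3.2: returns "caf?"
    # Expected: the round trip is lossless.
    assert_equal input, SomeLibrary.decode(SomeLibrary.encode(input))
  end
end
```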


I guess I'm thinking about all this because I'm trying to shift, attitudinally, from treating open-source projects—at least this one in particular—as incidental offgassing, toward treating them as a deliberate product with a clear vision of a user who is somebody other than me. Most, if not all, open-source projects begin life as an attempt to scratch one's own personal itch—Intertwingler certainly did. dry-rb evidently did as well. If Type::Tiny did too, which is likely, it managed to transcend this state and became the premier data validator for Perl. I have the added wrinkle that I'm not actually trying to make the premier anything—Web framework or CMS or whatever—for Ruby: I'm trying to create a reference implementation that enumerates a bill of materials for a dense hypermedia system, and Ruby happens to be the most convenient language at the moment in which to write it. This means it has to be legible on multiple levels, from end users to those who would port the paradigm to other programming languages.