I was an early Subversion adopter, because CVS is really terrible. Subversion, however, is utterly baroque. Implemented as an Apache module, it's the only thing I am aware of that makes extensive use of the versioning part of the also-baroque WebDAV spec. Subversion is a messy, consummately cantankerous system, and I am glad we have evolved past it.

The greener members of the audience may not be aware that once upon a time, version control was not distributed like Mercurial and Git are, but rather the paradigm held that the master copy ought to be cloistered away behind some network protocol or other. Forking—copying the full version history of an asset repository to one under your own control—was not only a resource-intensive proposition, but also politically charged: it was seen—and often meant—as a no-confidence vote in the original authority. With distributed version control, forking is the first thing you do. The authoritative repository is determined by policy, rather than through technical means.

Subversion was billed from the start as a better CVS, which itself is just RCS with networking support. If the client-server paradigm represents the bronze age of version control, RCS is demonstrably neolithic. Over a decade ago, the new distributed tools made it look a heck of a lot like the technology was regressing. Indeed, in many ways they're much simpler than the client-server systems they supplant, yet conceptually they're considerably more sophisticated.

Mercurial was not my first distributed version control system; that honour goes to Darcs. As a treatise on epistemology, Darcs is eminently fascinating, but as a software product intended to do real work, I found it to be a liability. For putting up with its intolerable slowness, it rewarded me by corrupting two repositories and ruining a bunch of my work. I dumped it after only a few weeks.


A version control system, distributed or otherwise, needs to be:

  1. Reliable,
  2. Fast, or at least fast enough, and
  3. Programmable, so you can extend the toolset.

In 2007, Git only scored two out of three—or maybe two and a half. At the time it was—and to some extent the reference implementation still is—a bunch of janky C programs duct-taped together by equally janky shell scripts. You programmed Git by wrapping the executables in a pipe and parsing their output. Mercurial, on the other hand, being written mostly in Python, had an API from the very beginning. Plus Mozilla had just staked their entire codebase on Mercurial, so switching to it was a no-brainer.

11 years later, the VHS of distributed version control is clearly Git. The core toolset has been cleaned up considerably, and it has had a proper programming interface now for quite some time. This wasn't enough motivation to switch on its own. The Mercurial plugin hg-git is more than serviceable enough to interface with Git-based projects and services. For the most part, if you're using Mercurial, you can pretend Git doesn't exist.

Well, let me qualify that. From the outside, a version control system only needs to do a handful of things.

In my opinion, Git and Mercurial are virtually indistinguishable from this perspective. I expect this is heresy to proponents of either, and both sides have an arsenal of arguments about features one product has that the other lacks. However, if a given feature really was so essential, I submit you would quickly see parity across all available products.

No, there were really two reasons I started giving Git serious consideration:

The first, and less interesting perhaps, is the close coupling of Mercurial to Python. I like Python fine—I've used it for various things since 2001—but I have never loved it. I'll take a Python interface over parsing program output, but if I had my druthers I'd pick another language. Since I do most of my work in not-Python, it is becoming an increasing burden to have to maintain a bunch of Python code for the sole purpose of interacting with my version control.

My other reason for taking a closer look at Git is something that caught my eye while I was reading the section in the reference manual about its internals:

First, if it isn’t yet clear, Git is fundamentally a content-addressable filesystem with a VCS user interface written on top of it.

Oh. Well. I know a thing or two about content-addressable storage. That is legitimately interesting. No wonder that a) Git took relatively longer to mature, and b) its repository design is now fully decoupled from its tooling implementation: Git is a set of specifications for protocols and data formats first, and a software product second. Mercurial is a software product, period.
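That claim is easy to verify for yourself. A Git blob's identifier is nothing more than the SHA-1 of a short header followed by the file's raw contents, so a few lines of Python reproduce exactly what `git hash-object` computes:

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Compute the object id Git assigns to a blob: the SHA-1 of
    a 'blob <size>\\0' header followed by the raw content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# The same id `git hash-object` prints for a file containing "hello\n":
print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

Because the name of an object is a pure function of its bytes, any program in any language can address Git's storage directly, without going through Git's tooling at all.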

These reasons, coupled with the Git-chauvinism present in myriad other tools, are why I feel it's finally time to switch.

What DVC Systems All Have in Common

The general strategy adopted by Git, Mercurial, and presumably all the other distributed version control systems—I don't know how they would accomplish what they do otherwise—is sound. By this I mean the use of cryptographic digests to identify content, as well as representations of changes to that content. Indeed, the only serious way these distributed version control systems differ is in the fine-grained details of how the changes are represented. The rest seems to be in rough consensus.

In fact, there's really no reason why you couldn't store your content in a general-purpose content-addressable store, and implement protocol adapters for the various version control products. People could use whichever one they wanted, on the same project, without making special arrangements. Interfacing with the master repository would Just Work™.
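The kernel of such a store is tiny. Here is a toy sketch—the `ContentStore` class and its methods are invented for illustration, and a real store would use disk or a network service rather than a dictionary:

```python
import hashlib

class ContentStore:
    """A toy content-addressable store: objects are keyed by the
    SHA-256 of their bytes, so identical content is stored exactly
    once and every key verifiably names one piece of content."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._objects[key] = data  # idempotent: same bytes, same key
        return key

    def get(self, key: str) -> bytes:
        data = self._objects[key]
        # Integrity checking comes for free: the name *is* the digest.
        assert hashlib.sha256(data).hexdigest() == key
        return data

store = ContentStore()
key = store.put(b"print('hello')\n")
assert store.get(key) == b"print('hello')\n"
```

Each version-control frontend would then be a protocol adapter mapping its own object formats onto keys in a store like this one.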

Version Control as Band-Aid

There is a talk by Rich Hickey in which he wonders aloud why we design information systems that have no memory—that only store the most recent state of a piece of information, and when updated, erase whatever was in there before. He answers himself rhetorically that it has to do with long-since-eliminated constraints on computing resources, particularly of storage. He offers version control as an example of a kind of system designed from the ground up to remember everything, and suggests that the hype in Big Data is at least partly to do with management getting a look at the tools developers create for themselves, and indicating that they want those capabilities too.

The ruts on this joke run so deep, it has its own array of coffee mugs, with variants for different professions.

Ordinary people—that is, non-developers—would undoubtedly demand version control, if they could be led to understand it. I don't believe, though, that understanding it is the most significant impediment.

Version control, as we know it, is oriented toward files, in particular text files. A version control system is most effective when its contents are such that you could read and write them directly. The kinds of files most at home in version control are, in principle, still accessible using a Teletype from 60 years ago. Ordinary people, by contrast, tend to manipulate files through mediating applications, whose authors may use whatever hare-brained scheme they deem appropriate to store their application's state.

As all text files and their derivations have in common the fact that they are all sequences of lines, the user interface for reviewing changes is also a sequence of lines—in other words very economical to implement. A version control system extended to proprietary application files would not only have to know how to parse the file to produce a minimal set of changes, it would have to know how to render those changes in its user interface. Vendors attempt to solve this problem by embedding version control into their products, and the results are questionable. Photoshop has its history panel, but it doesn't track across saves. Microsoft Word has Track Changes, but it necessarily affects only one particular copy of one particular file. The alternative is something like Google Docs, which no doubt stores content in a fine julienne in some kind of online database, and only projects a snapshot of that content, embodied as a conventional concrete data file, when you ask it for one.
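The economy of the line-oriented approach shows in how little machinery it takes. Python's standard `difflib`, for example, produces the familiar unified-diff rendering from nothing more than two lists of lines:

```python
import difflib

before = ["def greet():\n", "    print('hello')\n"]
after  = ["def greet(name):\n", "    print(f'hello {name}')\n"]

# Everything the reader sees is just lines prefixed with -, + or a space.
for line in difflib.unified_diff(before, after,
                                 fromfile="a/greet.py", tofile="b/greet.py"):
    print(line, end="")
```

Both the computation and the presentation fall out of treating the file as a sequence of lines; no knowledge of the file's meaning is required.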

That observation returns me to my first remark about version control systems: They're still oriented around files, rather than around content. For instance, if you cut a segment of code out of one file and paste it into another, the system can't tell that you moved it. Rather, it looks to the system like one hunk of content was deleted while another, identical yet unrelated hunk appears elsewhere from thin air. The same goes for moving content around within a file. The recording function of a version control system typically needs to be triggered explicitly. You can make a thousand edits to a thousand files, but they won't be recorded until you instruct the system to commit, and by then it can see only the net result of those edits, not the path you took to get there.


What would a future look like, in which everything had an undo button that went back arbitrarily far? In which the act of creating new digital content did not mean destroying what was there before? Something like Google Docs offers us a glimpse. Changes are transmitted over HTTP requests, as messages that only express the content of each individual change itself. On the server side, what would otherwise be encapsulated as a file—in Rich Hickey parlance, a mutable and therefore volatile place for data—could instead be represented non-destructively, as a time-stamped log of accumulating changes, which could be projected into a snapshot at any stage along the way.
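That server-side representation can be sketched in a few lines. Assuming the simplest possible change message—an insert or delete at a character offset, a format invented here purely for illustration—a document becomes an append-only log that can be projected into a snapshot as of any moment:

```python
from dataclasses import dataclass

@dataclass
class Change:
    timestamp: int  # when the change arrived
    op: str         # "insert" or "delete"
    pos: int        # character offset in the document
    text: str       # text inserted, or the text being deleted

log = [
    Change(1, "insert", 0, "Hello world"),
    Change(2, "delete", 5, " world"),
    Change(3, "insert", 5, ", Rich"),
]

def snapshot(log, as_of: int) -> str:
    """Project the append-only log into the document's state at a
    given time; nothing in the log is ever overwritten or erased."""
    doc = ""
    for c in log:
        if c.timestamp > as_of:
            break
        if c.op == "insert":
            doc = doc[:c.pos] + c.text + doc[c.pos:]
        else:  # delete
            doc = doc[:c.pos] + doc[c.pos + len(c.text):]
    return doc

assert snapshot(log, 1) == "Hello world"
assert snapshot(log, 3) == "Hello, Rich"
```

The "file" here is just one possible projection of the log; the undo button that goes back arbitrarily far is simply `snapshot` with an earlier timestamp.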

To bring about that future, we would likely have to move away from files—at least as anything other than bulk payloads of instantaneous freeze-frames of the state of a system. It would, ironically, mean going back to a client-server model. But who controls the servers? Well, you could, I mean, it's not that outlandish a proposition.


When I started writing this document, I didn't know that Microsoft was planning to acquire GitHub. There is, predictably, a panicked exodus to competing services. At least with Git you can extricate your source code, although you leave everything else behind.

The irony is not lost on many that a distributed system was made usable by being centralized. The skills to make such a system usable are, to riff off of Gibson, not themselves evenly distributed. Like cream in a cup of coffee, those skills eventually diffuse into the zeitgeist, provided they're worth caring about. This kind of oscillation between centralized and decentralized control, moderated by the cost of skill acquisition at one end and the value of knowing the skill at the other, is something we have seen time and again, and likely will for the foreseeable future.