The Great Deshittification

I have arranged my career to interact with PHP and MySQL as little as I can get away with. I have made extra effort to avoid the chimerical fusion of these two products into the gaping hellmouth known as WordPress. MySQL I haven't used by choice since Postgres was a viable alternative, around early 2002. My distaste for PHP goes back even farther, mainly because I already had an established technique by the time it had started gaining traction, and could get by just fine without it. But moreover, I've always found the language to be an absolute mess that is uniquely well-suited to enabling people to create even bigger and more elaborate messes.

This brings us to WordPress, the engine that powers an impressive one third of the Web. There's a reason for that, which I have characterized in the past as expertise arbitrage. That is, the skill level required to set WordPress up, get it pulling actual weight, and even tinkering with and extending it, is orders of magnitude narrower than the skill level required to unfuck it when it breaks. Now, you could argue that this asymmetry is everywhere, but I submit that in the WordPress ecosystem, this phenomenon is especially, if not uniquely pronounced. I say this as a person who has de-Viagra'd my fair share of WordPress blogs when nobody else for miles around had the first clue what to do, despite them being the ones who work with it for a living and me doing everything in my power to stay as far away from that consummate lifesuck as possible. The gist is, as long as everything is fine, you're fine, but the second something unusual happens, you're fucked. And even being a WordPress or even a PHP or MySQL guru isn't enough to fix it.

Such is the case of this episode. Venkat is not in the WordPress business, except as a user. The hosting company that runs the WordPress instance that powers ribbonfarm.com recently forced an upgrade to MySQL 8.0, and the net effect was that a bunch of weird symbols started showing up in his blog posts. When asked about this mishap, the tech support staff at this premium, top-dollar, WordPress-specific hosting provider, said something on the order of we don't know, perhaps you should hire a database guy.

Now, the single longest stint I ever worked as a full-time employee happened to be four years as an infrastracture developer in a heavily-internationalized environment in the early aughts before UTF-8 was ubiquitous, so I have seen just about every permutation of encoding cock-up. They often have a signature of sorts, and this one was not unusual in that regard. In particular, this had the hallmark of having taken a run of text that had been encoded in Unicode UTF-8, treated as if it had been encoded in an older single-byte Latin character set, and then encoded again in UTF-8.

If you aren't aware, UTF-8 is a variable-byte encoding system with built-in error detection, such that any single virtual, non-composite UTF-8 character—above code point 127 that is—is two to four literal bytes with a distinctive pattern. Namely, the top bits of the bytes are always masked, and the first byte in a sequence is masked in a way that tells you how many more to expect. That way the parser always knows if a UTF-8 character has been mangled, or if it is intact. When you look at the bytes of a UTF-8 character like U+2014 (that would be the em-dash, —) in hexadecimal notation, they look like E2 80 94. Through the lens of a single-byte encoding scheme like CP-1252, those bytes turn into this: â€”. Except the actual bytes on the wire were C3 A2 E2 82 AC E2 80 9D. That is, they were valid UTF-8, just not the text that was intended:

Diagnosis: text that was actually UTF-8 was treated as CP-1252 and encoded as UTF-8 a second time.

Having settled on what had happened, the next step was to figure out where it was happening. Was it WordPress that was doing this? The database? A plug-in? This was initially unclear. I'll spare you some of the drama and confirm that the culprit was indeed the database. Getting there, however, was a bit of a challenge. See, MySQL is pretty loosie-goosie about a lot of things, and character sets are but one tiny example. Up until version 8.0, which shipped in 2018, MySQL had been setting something it calls latin1 as its default character set.

Historically for MySQL, while the character set was nominally latin1, it didn't especially care what you threw into it. The only time it would matter would be if you were trying to sort text, search within it, or otherwise match pieces of text together. This is where a thing called collation would come into play—essentially normalization functions that get automatically applied to the comparands that put them in a definite order or increase the likelihood of a match.

Since WordPress effectively uses MySQL as little more than bulk storage, and since MySQL doesn't enforce latin1 encoding—not that there's really anything to enforce—WordPress can just ignore the declared character set and shove the database full of UTF-8. Fast-forward to this forced-upgrade business on the part of WPEngine, the hosting provider. Did they Do Something™ to the data when they upgraded? I mean somebody must have, 'cause the data didn't wreck itself.

You can define the default character set on MySQL at the system level, per database, per table in a given database, and per column in a table. In addition to these, you can set the character set as sent from the client, per session as received by the server, and last but not least, the results the server emits back to the client. This last one is what got me.

A number of the individual columns, particularly those containing the content of posts and comments, were set to latin1, but to select them on the mysql command line, they were showing up in my terminal as UTF-8. What the hell was happening? Well, turns out the character_set_results variable, despite a wholesale upgrade across the board, was still set to latin1. What this meant was that the mysql command-line client was decoding the spurious UTF-8 content into the actual UTF-8 content, which it thought was latin1. This had the effect of rendering correctly in my UTF-8 terminal, which made absolutely no sense. It wasn't without further probing the documentation that I eventually found and set the variable to utf8 and saw the wrong characters in my terminal, that I determined that the wrong character sequences were what were actually stored in the database.

Satisfied with both the diagnosis and the proposed treatment—the bad characters were in the database and the task was to get them out—the question moved to how to go about doing that. Since the database user didn't have access to modify the INFORMATION_SCHEMA—a sensible mitigation tactic—the easiest thing to do to purge those per-column latin1 designations was to just sed them out of the most recent database dump and reload it. Why they were there in the first place, I suspect, is because Ribbonfarm dates back to 2007, long before UTF-8 was MySQL policy. Alternatively, they could have been put there during some other past upgrade of one part of the system or another as a prophylaxis against messing up the character set, eventually turning out to have the opposite effect.

The uncompressed database dump was about 300 megabytes. To complicate matters, MySQL puts its bulk INSERT statements all on one line, and lines of text several megabytes long tend to choke out terminals and text editors. Here is where Artem helped with a second set of eyes and hands. If we could just convert the database dump file to UTF-8 using an off-the-shelf program like iconv, we could reimport it wholesale and this would be a done deal; no screwing around with SQL.

Not so fast. Turns out that 9D is one of the few unassigned bytes in CP-1252 (what MySQL loosely called latin1) and the dump file was full of them (they are, as I described above, the result of double-encoding an em-dash). Nominally, the solution would have been as easy as piping the whole 300MB dump through iconv, but since there was no viable target encoding (CP-1252 being the closest one but crucially missing that 9D), iconv was a bust. Here is where Artem tried just writing something to brute-force the conversion, compliance be damned. This attempt was foiled as well, as some of the comments were neither UTF-8 nor latin1. Not really surprising for a sixteen-year-old blog, come to think of it. So we were back to looking for an SQL solution.

After some trial and error, the database eventually yielded to the following incantation:

…where `table` and `column` are supplanted with whatever ones you're trying to fix. There was yet another hiccup in the process due to MySQL having two separate UTF-8 encodings, the distinction between which didn't become apparent until we tried to convert a comment containing an emoji: 😇. On MySQL's long road to Unicode compliance, they made a pit stop with a flavour of UTF-8 encoding that maxed out at three bytes per character (which they call utf8mb3), and emojis take four bytes to store. So now the Ribbonfarm database is 100% utf8mb4, and this species of problem should never happen again.

Coda: Why do websites even use SQL, anyway?

…and why MySQL in particular?

The pattern of backing a website onto an SQL database is not immediately obvious. Rather, it's a relationship of convenience. SQL is one way to address complex, fine-grained data structures, and store them such that they persist. Every other strategy available at the time—the mid-1990s, that is—involved trading off either the addressability or complexity of the structure, or the ability for two or more requests to write at the same time (still a constraint with SQLite, which powers many single-user desktop and mobile apps, and is actually quite a bit younger than MySQL). SQL databases also had built-in data type serializations that developers would otherwise have had to concoct by hand.

Now, SQL databases were typically expensive things—Oracle, Sybase/Microsoft, IBM DB2. They were made to run on mainframes—to power banks, government agencies, that kind of thing. The free alternative was a thing called mSQL (for mini), which still exists. MySQL was started shortly after as a personal project, its author citing a dissatisfaction with the performance of mSQL. Attention among Web developers quickly shifted to MySQL, making it effectively the only game in town for a while.

The MySQL developers have always prioritized speed—as in runtime performance—often at the expense of correctness. For a long time, it infamously didn't have transactions—a dealbreaker for anybody who was serious about their database platform. Transactions weren't even available on an experimental basis until 2001, and not the default until 2010. Other essential features like foreign key constraints didn't show up in MySQL until well into its career. PostgreSQL, a couple years younger than MySQL and its chief competitor, prioritized correctness, and so got a reputation for being relatively sluggish.

In many ways, the prevailing style of Web development has coevolved, not just with SQL, but with MySQL in particular. The absence of composite keys meant auto-incrementing serial numbers—that were meaningless except within the database itself—would show up in public-facing places like URLs, which would vanish when those serial numbers were reinitialized. Stitching SQL queries together gave way to libraries that availed themselves of placeholders and were thus more robust against injection attacks. These were further expanded into object-relational mappers and related infrastructure, which caused the field to switch from a callback paradigm—which is central to the design of PHP, and if I understand correctly, still WordPress—to model-view-controller, with frameworks like Ruby on Rails. This newer-generation middleware, in addition to eliminating much of the labour of safely and correctly constructing complex SQL queries, also spackled over most of MySQL's many shortcomings. It didn't matter, for example, that MySQL's date and time arithmetic is and always has been hot garbage, because the ORM would handle that. Triggers weren't necessary, because the ORM would handle that. The net effect is an SQL database that is calibrated for storing Web content, and very little else.

If all you're doing with your database is serving Web pages, SQL is actually pretty clunky. For starters, SQL syntax is super verbose and generating it without using an ORM is crazy talk. Moreover, since a page is essentially a tree-like structure while SQL speaks in grids, you have a performance trade-off to make, between multiple round trips to the database server, and pulling many redundant results over the wire, which you promptly throw away. This has led to all kinds of increasingly baroque caching infrastructure—cobbling together results in the application layer that would normally have been done in the database with a series of joins. This in turn led some to interrogate the wisdom of using SQL databases when they didn't really actually use the SQL part—say, for business intelligence—thus spawning the NoSQL movement.

No sooner had that happened, that people began looking to put back what SQL had been giving them for free. NoSQL basically put Web development right back where it was in 1994, except with faster hardware and marginally better data hygiene. Completely lacking were any standards (with the possible exceptions of HTTP for the transport and JSON for what was being transported), so those started getting built back up (e.g., in the form of GraphQL). To put it in perspective, MapReduce—which has since receded into just another boring, utilitarian data processing task—was still very much a hot topic. Now the state of the art is an Ourobouros of lambdas running microservices that interface over Web APIs. That this situation is any better is far from a self-evident proposition.

My own recent projects—where recent is defined as the last ten years or so—use an RDF graph database, also known as a triple store. This is a completely different way to write Web-based software. It doesn't use SQL, but it's also not self-consciously trying not to use it. What it does have in common with the NoSQL movement is the data over which you operate is already much closer to the shape you ultimately need it in, so you spend a lot less effort restructuring it. My current project, a thing called Intertwingler, is a piece of website-making infrastructure designed to manage huge quantities of very small objects, densely connected to one another by a thicket of typed, bidirectional links. It's the kind of thing that would be prohibitive to write on top of SQL, or even using a conventional NoSQL strategy.

In any case, the SQL-backed website has been a pretty hard thing to kill. Despite the fact that WordPress—whose architecture ossified twenty years ago—is starting to look like an antique, on its face it's a good-enough strategy for small to medium systems that aren't trying to stray too far from convention. There is a deep well of both expertise and resources to draw on—probably deeper than anything else in the industry. Even the MVC frameworks that constitute the generation that came immediately after WordPress are similar enough to each other that if you've seen one of them, you've more or less seen them all. There's even an argument to be made that since scale is a problem any product developer would love to have, an exotic, novel design is liable to bite you in the ass, and so MVC running on SQL is a plenty sensible way to to get started. I, however, have had my fill.