Spreadsheet Rantifesto

Alan Kay famously said to choose the right data structure before you start tinkering around with the algorithm…you will have most of the result computed almost automatically as part of its inherent structure. Which is why the spreadsheet as the go-to data structure is so lamentable. In general, I never want data as it comes in a spreadsheet. I almost always want it as a graph, or at least a tree. Graphs and trees—at least ordered trees like XML or JSON—can embed tabular data, such as that found in a spreadsheet, but the same cannot be said for the other way around.

The frustrating part of these structures is that spreadsheet programs use them internally, but only yield access to a mickey-mouse subset. I almost always want to model one-to-many relationships, which spreadsheets straight-up can't do. I always want to define the data semantics—what the columns and rows unambiguously mean. Spreadsheets can't do this either.

CSV

Delimited text files—usually but not necessarily by commas—have emerged over the decades as the de facto way to exchange record-oriented data. This situation could be much worse, but it isn't especially good. CSV is usually disgorged from some kind of automated system such as a database or mainframe, and if it isn't, it's—hopefully—equivalent to a spreadsheet with a single table, containing no formulas, that starts in the top-left corner, with all that entails.

Headers versus no headers versus multiple-row headers: Sometimes the first record of a CSV file is a set of human-readable headers, and sometimes the file just jumps straight into data, assuming you already know what the columns mean. In practice there is no way to know without examining it. There is a parameter to the text/csv content type defined in RFC 4180, but that metadata is only going to be visible if you download the file from the Web or receive it in an e-mail, and isn't going to be preserved in any useful way on an ordinary file system. Sometimes you encounter headers that span multiple lines, and even that header parameter won't help you identify those.
Nulls: The way you represent a null value in CSV is by simply putting nothing. Unfortunately, this is also one of several ways to represent an empty string. If the difference is significant to you, you're out of luck.
Dates and times: There are about eleventy gajillion ways to represent dates and times as a text string, and then there are the ones that aren't even standardized. Have fun with all that.
Coded properties: Columns that take a finite set of values, called coded properties or enums, which include the boolean values true and false, are indistinguishable from a string. Since the data doesn't encode the set of valid values, you neither know what all those values are, nor if the data contains invalid values. Also, in just about every programming language, the string false evaluates to true. You can see how this could be a problem.
Mixed datatypes: Since there is no formal declaration for what type a given cell actually is, we have to use heuristics to determine what it probably is. Results on the ground may not be consistent. A common scenario is a column of numbers interspersed by something like N/A: that is going to trip up any parser that isn't tipped off about it ahead of time.
Cube and rollup: Here the database or whatever excreted the CSV file is doing something cute and changing the meanings of certain records by making some of them be aggregates. The only hint you get is that the adjacent columns will be null.

I suppose you could say the spiritual successor to CSV is JSON, which emerged in the early 2000s as a hack✱ for the communication of ordered-tree-shaped data. In my opinion, it really only solves the one-to-many problem, and I guess the header problem too †, and can be considered only an incremental improvement over CSV.

Anything that originates as an actual spreadsheet

Actual spreadsheets import all the problems of CSV and have plenty of their own. This is because machines don't make spreadsheets, people make spreadsheets. A spreadsheet is an integrated development environment for non-programmers; it is intended to be produced and consumed directly by human beings. Very little consideration is given to the life of the data beyond the confines of the spreadsheet program.

Typos: Modern spreadsheets of course have spell checkers and autocorrect, but those cause almost as many problems as they solve. There's also some truth to that oneliner that goes you must be Excel because you think this is a date. Coded properties are also a problem, and may be typed in with variations that need to be collated before they can be grouped and pivoted properly.
Misalignments, transpositions: When the sheet is sparse, that is, when there are lots of blank cells, a common mistake is to put values into the wrong row or column. There is no reliable way to tell, short of human eyeballs, if this is a mistake or if it is on purpose.
Regions: Spreadsheet authors routinely pick out regions on the sheet to have some special significance. While you can usually pick these out by heuristic, there is no way to tell what they actually represent. Regions can go in one dimension, i.e., a special set of columns or rows, or two: a special rectangle. Regions may also have one or more rows of headers, which are similarly ambiguous.

Apple palliated this issue somewhat by making Numbers oriented toward tables that float on a canvas, although the default template is just one big table that goes edge to edge like Excel.
Multiple sheets: Spreadsheet files can house arbitrarily many different sheets, each with a different layout and interpretation. Like regions, there is likewise no way to know programmatically what they represent.
Records versus coordinate planes: A region on a spreadsheet could represent N-dimensional records—in either the rows or the columns—or it could specifically represent a two-dimensional coordinate plane, where the edges are values of coded properties and the cells in the body are quantities that lie at the intersections of the edge values. Again, there is no programmatic clue that what you are looking at is a record set or a coordinate plane.
Pseudo-multivalues: It is common for people to fudge multiple-valued columns by simply adding extra columns. I am going to stop writing there is no way to tell programmatically now because I'm starting to feel like a broken record.
Pseudo-hierarchies: Another common trick, this time for representing hierarchical structures, is to identify the first column with the topmost rung in the hierarchy, all the way until column N+1 where the actual data begins.
Out-of-band denotations: Columns, rows, and cells are often given special meaning by changing graphical attributes like the font weight or background colour. Again these have no internal semantics and it's impossible to tell if, for example, an italicized bold cell means something over and above ordinary typographical emphasis. It is possible in some spreadsheet products to define and enumerate styles, which will be saved into the file and can be hackishly identified with some external dictionary, but people will have to remember to use them.
Formulas & macros: Luckily when spreadsheet files are saved out, cells containing formulas are evaluated with their current inputs, meaning you can read them off programmatically. If you wanted to operate on different inputs, however, you would need to implement the function library yourself. Same goes for macros if you wanted to execute them.

I worked on a project last year where I interfaced with a teammate primarily through a spreadsheet. Being a site map, it shook out as a pseudo-hierarchy of about 200 entities. My job was to consume that data and transform it into a formal structure, and I found—as I had expected to find—chasing the vagaries of ordinary spreadsheet use to be not only terrifically laborious, but really tested my ETL chops.

My colleague graciously suggested I draft up a list of rules to abide by, but I declined: Even though doing so was a huge pain in the ass—as anticipated and planned for—it was easier to just watch and accommodate rather than try to assign to a person the rightful job of a computer, especially when the tradeoff would be writing a less robust interpreter that's just going to fail some other way. People are always going to use software in ways you can't anticipate. It's the spreadsheet writ large that's to blame for not entertaining the notion that there's a world beyond the spreadsheet.

How do we fix this?

I want to underscore that I believe there is great value in being able to type in data with your fingers and move it around ad-hoc like a spreadsheet affords. I also believe that very same ad-hockery is what has limited the spreadsheet's functionality since it was invented over four decades ago. The original design was never meant to grow past the confines of an individual PC. Now we have Google treating spreadsheets like an industrial-strength networked data repository. You can now spreadsheet on your phone and your changes get automatically saved to the internet.

How do you preserve this beefed-up, ad-hoc flexibility while adding the guardrails necessary to make the data more valuable to the larger ecosystem? I see two main challenges: data semantics and user interface. Being XML, the OpenDocument and Office Open file formats could conceivably model this additional structure, but they will almost certainly have to be updated with additional syntax to capture it. That is an enormous, politically-fraught job. Coming up with a UI that wasn't too jarring and alien to spreadsheet users would also be a challenge: figuring out how to represent one-to-many and many-to-one relationships on a familiar cell grid, real hierarchies instead of pseudo-hierarchies, cycles, prototypical tuples and regions, and so on.

Challenging spreadsheet incumbents head-on is a risky proposition; you'd have to offer customers something that was hard enough to copy and distinct enough for them to entertain yet another spreadsheet-adjacent product. Two entrants come to mind: Causal.app, which appears to be on the ascendancy, and Guesstimate, which seems to have already peaked. Both of these products do something an ordinary spreadsheet cannot: probabilistic modelling. They are likely to make attractive acquisition targets for Microsoft, Apple, or Google.

Indeed, when you hit the edge of a spreadsheet's performance envelope, it shatters into a million pieces. If you need more exotic statistical tools than even these aforementioned next-gen products have to offer, your choices are things like R and Jupyter. Even heavier scientific and engineering applications demand MATLAB, Maple, or Mathematica. Sanitizing and normalizing is the province of OpenRefine. Visualization? D3 or Tableau. Complex data structures, that get operated over in bulk, require a Real Database™, even if that is just Access, AirTable, or SQLite, to say nothing of the graph databases or industrial-grade RDBMS products.

Getting a modest quantity of data, by hand, into a persistent structure, that doesn't demand any up-front work on schema definition, and provides a rudimentary computational vocabulary, is what I believe to be the core strength of the spreadsheet. Bonus points if you can serialize the result into something useful elsewhere. Network sync and telecollaboration—when considering the total addressable market—is a nice-to-have, although I'll wager it's easier to write a product this way these days than not to. The real value-add will be in solving how this additional structure I've been talking about will be manipulated by hand, and preserved in some kind of exchangeable data format.

Anyway, I figure about five million bucks ought to cover it. Anybody in?