Alan Kay famously said to choose the right data structure before you start tinkering around with the algorithm…you will have most of the result computed almost automatically as part of its inherent structure.
Which is why the spreadsheet as the go-to data structure is so lamentable. In general, I never want data as it comes in a spreadsheet. I almost always want it as a graph, or at least a tree. Graphs and trees—at least ordered trees like XML or JSON—can embed tabular data, such as that found in a spreadsheet, but the same cannot be said for the other way around.
The frustrating part of these structures is that spreadsheet programs use them internally, but only yield access to a mickey-mouse subset. I almost always want to model one-to-many relationships, which spreadsheets straight-up can't do. I always want to define the data semantics—what the columns and rows unambiguously mean. Spreadsheets can't do this either.
Delimited text files—usually but not necessarily by commas—have emerged over the decades as the de facto way to exchange record-oriented data. This situation could be much worse, but it isn't especially good. CSV is usually disgorged from some kind of automated system such as a database or mainframe, and if it isn't, it's—hopefully—equivalent to a spreadsheet with a single table, containing no formulas, that starts in the top-left corner, with all that entails.
text/csv content type defined in RFC 4180, but that metadata is only going to be visible if you download the file from the Web or receive it in an e-mail, and isn't going to be preserved in any useful way on an ordinary file system. Sometimes you encounter headers that span multiple lines, and even that header parameter won't help you identify those.true and false, are indistinguishable from a string. Since the data doesn't encode the set of valid values, you neither know what all those values are, nor if the data contains invalid values. Also, in just about every programming language, the string falseevaluates to
true. You can see how this could be a problem.N/A: that is going to trip up any parser that isn't tipped off about it ahead of time.I suppose you could say the spiritual successor to CSV is JSON, which emerged in the early 2000s as a hack✱ for the communication of ordered-tree-shaped data. In my opinion, it really only solves the one-to-many problem, and I guess the header problem too †, and can be considered only an incremental improvement over CSV.
Actual spreadsheets import all the problems of CSV and have plenty of their own. This is because machines don't make spreadsheets, people make spreadsheets. A spreadsheet is an integrated development environment for non-programmers; it is intended to be produced and consumed directly by human beings. Very little consideration is given to the life of the data beyond the confines of the spreadsheet program.
you must be Excel because you think this is a date. Coded properties are also a problem, and may be typed in with variations that need to be collated before they can be grouped and pivoted properly.
Spreadsheet authors routinely pick out regions on the sheet to have some special significance. While you can usually pick these out by heuristic, there is no way to tell what they actually represent. Regions can go in one dimension, i.e., a special set of columns or rows, or two: a special rectangle. Regions may also have one or more rows of headers, which are similarly ambiguous.
there is no way to tell programmaticallynow because I'm starting to feel like a broken record.
I worked on a project where I interfaced with a teammate primarily through a spreadsheet. Being a site map, it shook out as a pseudo-hierarchy of about 200 entities. My job was to consume that data and transform it into a formal structure, and I found—as I had expected to find—chasing the vagaries of ordinary spreadsheet use to be not only terrifically laborious, but really tested my ETL chops.
My colleague graciously suggested I draft up a list of rules to abide by, but I declined: Even though doing so was a huge pain in the ass—as anticipated and planned for—it was easier to just watch and accommodate rather than try to assign to a person the rightful job of a computer, especially when the tradeoff would be writing a less robust interpreter that's just going to fail some other way. People are always going to use software in ways you can't anticipate. It's the spreadsheet writ large that's to blame for not entertaining the notion that there's a world beyond the spreadsheet.
I want to underscore that I believe there is great value in being able to type in data with your fingers and move it around ad-hoc like a spreadsheet affords. I also believe that very same ad-hockery is what has limited the spreadsheet's functionality since it was invented over four decades ago. The original design was never meant to grow past the confines of an individual PC. Now we have Google treating spreadsheets like an industrial-strength networked data repository. You can now spreadsheet on your phone and your changes get automatically saved to the internet.
How do you preserve this beefed-up, ad-hoc flexibility while adding the guardrails necessary to make the data more valuable to the larger ecosystem? I see two main challenges: data semantics and user interface. Being XML, the OpenDocument and Office Open file formats could conceivably model this additional structure, but they will almost certainly have to be updated with additional syntax to capture it. That is an enormous, politically-fraught job. Coming up with a UI that wasn't too jarring and alien to spreadsheet users would also be a challenge: figuring out how to represent one-to-many and many-to-one relationships on a familiar cell grid, real hierarchies instead of pseudo-hierarchies, cycles, prototypical tuples and regions, and so on.
Challenging spreadsheet incumbents head-on is a risky proposition; you'd have to offer customers something that was hard enough to copy and distinct enough for them to entertain yet another spreadsheet-adjacent product. Two entrants come to mind: Causal.app, which appears to be on the ascendancy, and Guesstimate, which seems to have already peaked. Both of these products do something an ordinary spreadsheet cannot: probabilistic modelling. They are likely to make attractive acquisition targets for Microsoft, Apple, or Google.
Indeed, when you hit the edge of a spreadsheet's performance envelope, it shatters into a million pieces. If you need more exotic statistical tools than even these aforementioned next-gen products have to offer, your choices are things like R and Jupyter. Even heavier scientific and engineering applications demand MATLAB, Maple, or Mathematica. Sanitizing and normalizing is the province of OpenRefine. Visualization? D3 or Tableau. Complex data structures, that get operated over in bulk, require a Real Database™, even if that is just Access, AirTable, or SQLite, to say nothing of the graph databases or industrial-grade RDBMS products.
Getting a modest quantity of data, by hand, into a persistent structure, that doesn't demand any up-front work on schema definition, and provides a rudimentary computational vocabulary, is what I believe to be the core strength of the spreadsheet. Bonus points if you can serialize the result into something useful elsewhere. Network sync and telecollaboration—when considering the total addressable market—is a nice-to-have, although I'll wager it's easier to write a product this way these days than not to. The real value-add will be in solving how this additional structure I've been talking about will be manipulated by hand, and preserved in some kind of exchangeable data format.
Anyway, I figure about five million bucks ought to cover it. Anybody in?