<?xml version="1.0"?>
<?xml-stylesheet href="/transform" type="text/xsl"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:bs="http://purl.org/ontology/bibo/status/" xmlns:ci="https://vocab.methodandstructure.com/content-inventory#" xmlns:dct="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:http="http://www.w3.org/2011/http#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:xhv="http://www.w3.org/1999/xhtml/vocab#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" lang="en" prefix="bibo: http://purl.org/ontology/bibo/ bs: http://purl.org/ontology/bibo/status/ ci: https://vocab.methodandstructure.com/content-inventory# dct: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ http: http://www.w3.org/2011/http# rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# xhv: http://www.w3.org/1999/xhtml/vocab# xsd: http://www.w3.org/2001/XMLSchema#" vocab="http://www.w3.org/1999/xhtml/vocab#" xml:lang="en">
  <head>
    <title lang="en" property="dct:title" xml:lang="en">Spreadsheet Rantifesto</title>
    <base href="https://doriantaylor.com/spreadsheet-rantifesto"/>
    <link href="document-stats#EwN5WfQB4VTRxmAYLssjYL" rev="ci:document"/>
    <link href="elsewhere" rel="alternate bookmark" title="Elsewhere"/>
    <link href="this-site" rel="alternate index" title="This Site"/>
    <link href="http://purl.org/ontology/bibo/status/published" rel="bibo:status"/>
    <link href="" rel="ci:canonical" title="Spreadsheet Rantifesto"/>
    <link href="lexicon/#software-developer" rel="dct:audience" title="software developer"/>
    <link href="person/dorian-taylor#me" rel="dct:creator" title="Dorian Taylor"/>
    <link href="http://purl.org/ontology/bibo/status/draft" rel="dct:references"/>
    <link href="file/nelly-dilemma" rel="foaf:depiction"/>
    <link href="person/dorian-taylor" rel="meta" title="Who I Am"/>
    <link about="./" href="3f36c30c-6096-454a-8a22-c062100ae41f" rel="alternate" type="application/atom+xml"/>
    <link about="./" href="this-site" rel="alternate"/>
    <link about="./" href="elsewhere" rel="alternate"/>
    <link about="./" href="e341ca62-0387-4cea-b69a-cdabc7656871" rel="alternate" type="application/atom+xml"/>
    <link about="./" href="f07f5044-01bc-472d-9079-9b07771b731c" rel="alternate" type="application/atom+xml"/>
    <link about="verso/" href="3f36c30c-6096-454a-8a22-c062100ae41f" rel="alternate" type="application/atom+xml"/>
    <link about="verso/" href="this-site" rel="alternate"/>
    <link about="verso/" href="elsewhere" rel="alternate"/>
    <meta content="spreadsheet-rantifesto" datatype="xsd:token" property="ci:canonical-slug"/>
    <meta content="I think I have finally identified what bugs me so much about spreadsheets: they come right up to the line of being useful for SO many other things, but stop short. I am almost mad enough to do something about it." name="description" property="dct:abstract"/>
    <meta content="2021-04-15T19:50:19+00:00" datatype="xsd:dateTime" property="dct:created"/>
    <meta content="2021-04-21T14:14:08+00:00" datatype="xsd:dateTime" property="dct:modified"/>
    <meta content="2021-05-04T05:25:16+00:00" datatype="xsd:dateTime" property="dct:modified"/>
    <meta content="2021-05-12T06:17:42+00:00" datatype="xsd:dateTime" property="dct:modified"/>
    <meta content="2022-05-31T15:10:50+00:00" datatype="xsd:dateTime" property="dct:modified"/>
    <meta about="person/dorian-taylor#me" content="Dorian Taylor" name="author" property="foaf:name"/>
    <meta content="summary_large_image" name="twitter:card"/>
    <meta content="@doriantaylor" name="twitter:site"/>
    <meta content="Spreadsheet Rantifesto" name="twitter:title"/>
    <meta content="I think I have finally identified what bugs me so much about spreadsheets: they come right up to the line of being useful for SO many other things, but stop short. I am almost mad enough to do something about it." name="twitter:description"/>
    <meta content="https://doriantaylor.com/file/nelly-dilemma" name="twitter:image"/>
    <object>
      <nav>
        <ul>
          <li>
            <a href="//dorian.substack.com/p/the-nerden-of-dorking-paths" rev="dct:references" typeof="bibo:Article">
              <span property="dct:title">The Nerden of Dorking Paths</span>
            </a>
          </li>
          <li>
            <a href="document-stats#EwN5WfQB4VTRxmAYLssjYL" rev="ci:document" typeof="qb:Observation">
              <span>urn:uuid:c0de567d-0078-4553-b471-98060bb2c8d8</span>
            </a>
          </li>
        </ul>
      </nav>
    </object>
  </head>
  <body about="" id="Ev7lmrOxN3b33J7_doj2pI" typeof="bibo:Article">
    <section id="E01ymca051x7Apvgpru96I">
      <p><a href="https://youtu.be/p2LZLYcu_JY?t=4246" rel="dct:references">Alan Kay famously said</a> to <q>choose the right data structure before you start tinkering around with the algorithm&#x2026;you will have most of the result computed almost automatically as part of its inherent structure.</q> Which is why the spreadsheet as the go-to data structure is so lamentable. In general, I never want data as it comes in a spreadsheet. I almost <em>always</em> want it as a <a href="https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)" rel="dct:references">graph</a>, or at <em>least</em> a <a href="https://en.wikipedia.org/wiki/Tree_(graph_theory)" rel="dct:references">tree</a>. Graphs and trees&#x2014;at least ordered trees like <abbr>XML</abbr> or <abbr>JSON</abbr>&#x2014;can <a href="https://en.wikipedia.org/wiki/Graph_embedding" rel="dct:references">embed</a> tabular data, <span class="parenthesis" title="indeed the two most popular spreadsheet formats are built on top of XML, and Numbers is Protobuf">such as that found in a spreadsheet</span>, but the same cannot be said for the other way around.</p>
      <aside role="note" id="Ex735urhMF1ZTB_OVeLQ1I">
        <p>A vector/matrix/tensor type, which <em>can</em> be represented in a spreadsheet, is also computationally convenient when organized as an array, but I tend not to need those as much, and can construct them out of list-like structures on demand. You would have to scan these objects sequentially anyway to parse and serialize them.</p>
      </aside>
      <p>The frustrating part of these structures is that spreadsheet programs use them <em>internally</em>, but only yield access to a mickey-mouse subset. I almost always want to model one-to-many relationships, which spreadsheets straight-up can't do. I <em>always</em> want to define the data semantics&#x2014;what the columns and rows unambiguously <em>mean</em>. Spreadsheets can't do this either.</p>
    </section>
    <section id="EwGlfSvJee9B6Jq1NUYloJ">
      <h2><abbr title="Comma-Separated Values">CSV</abbr></h2>
      <p>Delimited text files&#x2014;usually but not necessarily by commas&#x2014;have emerged over the decades as the de facto way to exchange record-oriented data. This situation could be much worse, but it isn't especially good. <abbr>CSV</abbr> is usually disgorged from some kind of automated system such as a database or mainframe, and if it isn't, it's&#x2014;<em>hopefully</em>&#x2014;equivalent to a spreadsheet with a single table, containing no formulas, that starts in the top-left corner, with all that entails.</p>
      <dl>
        <dt>Headers versus no headers versus multiple-row headers</dt>
        <dd>Sometimes the first record of a <abbr>CSV</abbr> file is a set of human-readable headers, and sometimes the file just jumps straight into data, assuming you already know what the columns mean. In practice there is no way to know without examining it. There <em>is</em> a parameter to the <code>text/csv</code> content type <a href="https://tools.ietf.org/html/rfc4180" rel="dct:references">defined in RFC 4180</a>, but that metadata is only going to be visible if you download the file from the Web or receive it in an e-mail, and isn't going to be preserved <span class="parenthesis" title="I mean your Mac miiiight store it but nothing will use it.">in any useful way on an ordinary file system.</span> Sometimes you encounter headers that span multiple lines, and even that <code>header</code> parameter won't help you identify those.</dd>
        <dt>Nulls</dt>
        <dd>The way you represent a null value in <abbr>CSV</abbr> is by simply putting nothing. Unfortunately, this is also one of several ways to represent an empty string. If the difference is significant to you, you're out of luck.</dd>
        <dt>Dates and times</dt>
        <dd>There are about eleventy gajillion ways to represent dates and times as a text string, and then there are the ones that aren't even standardized. Have fun with all that.</dd>
        <dt>Coded properties</dt>
        <dd>Columns that take a finite set of values, called coded properties or <dfn>enums</dfn>, which include the boolean values <span class="parenthesis" title="or yes and no or Y and N or 1 and 0"><code>true</code> and <code>false</code></span>, are indistinguishable from a string. Since the data doesn't encode the set of valid values, you neither know what all those values are, nor if the data contains invalid values. Also, in just about every programming language, the string <q>false</q> evaluates to <code>true</code>. You can see how this could be a problem.</dd>
        <dt>Mixed datatypes</dt>
        <dd>Since there is no formal declaration for what type a given cell <em>actually</em> is, we have to use heuristics to determine what it <em>probably</em> is. Results on the ground may not be consistent. A common scenario is a column of numbers interspersed by something like <code>N/A</code>: that is going to trip up any parser that isn't tipped off about it ahead of time.</dd>
        <dt>Cube and rollup</dt>
        <dd>Here the database or whatever excreted the <abbr>CSV</abbr> file is doing something cute and changing the meanings of certain records by making some of them be aggregates. The only hint you get is that <span class="parenthesis" title="Or maybe they are empty strings! who knows!">the adjacent columns will be null.</span></dd>
      </dl>
      <p>I suppose you could say the spiritual successor to <abbr>CSV</abbr> is <a href="https://tools.ietf.org/html/rfc8259" rel="dct:references"><abbr title="JavaScript Object Notation">JSON</abbr></a>, which emerged in the early 2000s as a hack&#x2731; for the communication of ordered-tree-shaped data. In my opinion, <span class="parenthesis" title="and I suppose the null-versus-empty-string problem">it really only solves the one-to-many problem</span>, and I <em>guess</em> the header problem too&#x200A;&#x2020;, and can be considered only an incremental improvement over <abbr>CSV</abbr>.</p>
      <aside role="note" id="EpWeRyxD9vA8RJCofZKp7L">
        <p>&#x2731;&#x2009;<em>inb4 haters</em>: Before dedicated parsers, the way you used to get <abbr>JSON</abbr> into your application was by <code>eval()</code>ing it.</p>
        <p>&#x2020;&#x2009;Object keys can be construed as header names but there is no mechanism for enforcing any kind of consistency in those names from one object to another&#x2014;at least not without <a href="https://json-schema.org/" rel="dct:references">getting</a> <a href="https://json-ld.org/" rel="dct:references">fancy</a>.</p>
        <p>Honourable(?) mention goes to <a href="https://yaml.org/" rel="dct:references"><abbr title="YAML Ain't Markup Language">YAML</abbr></a>, which is yet another <del>markup language</del> <del>liability</del> <ins>borderline-Turing-complete data serialization format</ins>.</p>
      </aside>
    </section>
    <section id="EVI4_6TddpFPmhUx1rdPqK">
      <h2>Anything that originates as an actual spreadsheet</h2>
      <p>Actual spreadsheets import all the problems of <abbr>CSV</abbr> and have plenty of their own. This is because machines don't make spreadsheets, <em>people</em> make spreadsheets. A spreadsheet is an integrated development environment for non-programmers; it is intended to be produced and consumed directly by human beings. Very little consideration is given to the life of the data beyond the confines of the spreadsheet program.</p>
      
      <dl>
        <dt>Typos</dt>
        <dd>Modern spreadsheets of course have spell checkers and autocorrect, but those cause almost as many problems as they solve. There's also some truth to that oneliner that goes <q>you must be Excel because you think this is a date</q>. Coded properties are also a problem, and may be typed in with variations that need to be collated before they can be grouped and pivoted properly.</dd>
        <dt>Misalignments, transpositions</dt>
        <dd>When the sheet is <em>sparse</em>, that is, when there are lots of blank cells, a common mistake is to put values into the wrong row or column. There is no reliable way to tell, short of human eyeballs, if this is a mistake or if it is on purpose.</dd>
        <dt>Regions</dt>
        <dd>
          <p>Spreadsheet authors routinely pick out regions on the sheet to have some special significance. While you can usually pick these out by heuristic, there is no way to tell what they actually represent. Regions can go in one dimension, i.e., a special set of columns or rows, or two: a special rectangle. Regions may also have one or more rows of headers, which are similarly ambiguous.</p>
          <aside role="note" id="EZBXF8eXdAz116XejhD0YI">
            <p>Apple palliated this issue somewhat by making Numbers oriented toward tables that float on a canvas, although the default template is just one big table that goes edge to edge like Excel.</p>
          </aside>
        </dd>
        <dt>Multiple sheets</dt>
        <dd>Spreadsheet files can house arbitrarily many different sheets, each with a different layout and interpretation. Like regions, there is likewise no way to know programmatically what they represent.</dd>
        <dt>Records versus coordinate planes</dt>
        <dd>A region on a spreadsheet could represent N-dimensional records&#x2014;in either the rows <em>or</em> the columns&#x2014;or it could specifically represent a two-dimensional coordinate plane, where the edges are values of coded properties and the cells in the body are  quantities that lie at the intersections of the edge values. Again, there is no programmatic clue that what you are looking at is a record set or a coordinate plane.</dd>
        <dt>Pseudo-multivalues</dt>
        <dd>It is common for people to fudge multiple-valued columns by simply adding extra columns. I am going to stop writing <q>there is no way to tell programmatically</q> now because I'm starting to feel like a broken record.</dd>
        <dt>Pseudo-hierarchies</dt>
        <dd>Another common trick, this time for representing hierarchical structures, is to identify the first column with the topmost rung in the hierarchy, all the way until column N+1 where the actual data begins.</dd>
        <dt>Out-of-band denotations</dt>
        <dd>Columns, rows, and cells are often given special meaning by changing graphical attributes like the font weight or background colour. Again these have no internal semantics and it's impossible to tell if, for example, an italicized bold cell means something over and above ordinary typographical emphasis. <span class="parenthesis" title="but not Google sheets!">It <em>is</em> possible in some spreadsheet products</span> to define and enumerate <em>styles</em>, which will be saved into the file and can be hackishly identified with some external dictionary, but people will have to remember to use them.</dd>
        <dt>Formulas &amp; macros</dt>
        <dd>Luckily when spreadsheet files are saved out, cells containing formulas are evaluated with their current inputs, meaning you can read them off programmatically. If you wanted to operate on <em>different</em> inputs, however, you would need to implement the function library yourself. Same goes for macros if you wanted to execute them.</dd>
      </dl>
      <p>I worked on a project <time datetime="2020">last year</time> where I interfaced with a teammate primarily through a spreadsheet. Being a site map, it shook out as a pseudo-hierarchy of about 200 entities. My job was to consume that data and transform it into a formal structure, and I found&#x2014;as I had expected to find&#x2014;chasing the vagaries of ordinary spreadsheet use to be not only terrifically laborious, but really tested my <abbr title="extract, transform, load">ETL</abbr> chops.</p>
      <p>My colleague graciously suggested I draft up a list of rules to abide by, but I declined: Even though doing so was a huge pain in the ass&#x2014;as anticipated and planned for&#x2014;it was easier to just watch and accommodate rather than try to assign to a person the rightful job of a computer, especially when the tradeoff would be writing a less robust interpreter that's just going to fail some other way. People are always going to use software in ways you can't anticipate. It's the spreadsheet writ large that's to blame for not entertaining the notion that there's a world beyond the spreadsheet.</p>
    </section>
    <section id="Ea6qhK5flTZD01ZiY6OulK">
      <h2>How do we fix this?</h2>
      <p>I want to underscore that I believe there is great value in being able to type in data with your fingers and move it around ad-hoc like a spreadsheet affords. I also believe that very same ad-hockery is what has limited the spreadsheet's functionality since it was invented over four decades ago. <a href="https://www.wired.com/2014/10/a-spreadsheet-way-of-knowledge/" rel="dct:references">The original design</a> was never meant to grow past the confines of an individual <abbr>PC</abbr>. Now <a href="https://developers.google.com/sheets/api/" rel="dct:references">we have Google treating spreadsheets like an industrial-strength networked data repository</a>. You can now spreadsheet on your phone and your changes get automatically saved to the internet.</p>
      <aside role="note" id="EaG8_SeM6lTStTIMR1Xx7L">
        <p><a href="https://covidtracking.com/" rel="dct:references">The COVID Tracking Project</a> literally ran its entire production infrastructure off a Google sheet. Because that's a thing you can do nowadays.</p>
      </aside>
      <p>How do you preserve this beefed-up, ad-hoc flexibility while adding the guardrails necessary to make the data more valuable to the larger ecosystem? I see two main challenges: <strong>data semantics</strong> and <strong>user interface</strong>. Being <abbr>XML</abbr>, the <a href="https://docs.oasis-open.org/office/v1.2/OpenDocument-v1.2.html" rel="dct:references">Open&#xAD;Document</a> and <a href="https://www.ecma-international.org/publications-and-standards/standards/ecma-376/" rel="dct:references">Office Open</a> file formats <em>could</em> conceivably model this additional structure, but they will almost certainly have to be updated with additional syntax to capture it. That is an enormous, politically-fraught job. Coming up with a <abbr>UI</abbr> that wasn't too jarring and alien to spreadsheet users would also be a challenge: figuring out how to represent one-to-many <em>and</em> many-to-one relationships on a familiar cell grid, real hierarchies instead of pseudo-hierarchies, cycles, prototypical tuples and regions, and so on.</p>
      <p>Challenging spreadsheet incumbents head-on is a risky proposition; you'd have to offer customers something that was hard enough to copy and distinct enough for them to entertain yet another spreadsheet-adjacent product. Two entrants come to mind: <a href="https://causal.app/" rel="dct:references">Causal.app</a>, which appears to be on the ascendancy, and <a href="https://www.getguesstimate.com/" rel="dct:references">Guesstimate</a>, which seems to have already peaked. Both of these products do something an ordinary spreadsheet cannot: probabilistic modelling. They are likely to make attractive acquisition targets for Microsoft, Apple, or Google.</p>
      <p>Indeed, when you hit the edge of a spreadsheet's performance envelope, it shatters into a million pieces. If you need more exotic statistical tools than even these aforementioned next-gen products have to offer, your choices are things like <a href="https://www.r-project.org/" rel="dct:references">R</a> and <a href="https://jupyter.org/" rel="dct:references">Jupyter</a>. Even heavier scientific and engineering applications demand <a href="https://www.mathworks.com/products/matlab.html" rel="dct:references">MATLAB</a>, <a href="https://maplesoft.com/" rel="dct:references">Maple</a>, or <a href="https://www.wolfram.com/mathematica/" rel="dct:references">Mathematica</a>. Sanitizing and normalizing is the province of <a href="https://openrefine.org/" rel="dct:references">OpenRefine</a>. Visualization? <a href="https://d3js.org/" rel="dct:references">D3</a> or <a href="https://www.tableau.com/" rel="dct:references">Tableau</a>. Complex data structures, that get operated over in bulk, require a Real Database&#x2122;, even if that is just <a href="https://www.microsoft.com/en-us/microsoft-365/access" rel="dct:references">Access</a>, <a href="https://airtable.com/" rel="dct:references">AirTable</a>, or <a href="https://sqlite.org/" rel="dct:references">SQLite</a>, to say nothing of the graph databases or <span class="parenthesis" title="PostgreSQL, Oracle, Microsoft SQL Server, MySQL I guesssss&#x2026;">industrial-grade <abbr title="relational database management system">RDBMS</abbr> products</span>.</p>
      <p>Getting a modest quantity of data, by hand, into a persistent structure, that doesn't demand any up-front work on schema definition, and provides a rudimentary computational vocabulary, is what I believe to be the core strength of the spreadsheet. Bonus points if you can serialize the result into <span class="parenthesis" title="although for users like me, this is essential">something useful elsewhere</span>. Network sync and telecollaboration&#x2014;when considering the total addressable market&#x2014;<span class="parenthesis" title="I am willing to bet most people don't use this, but in ten years they will">is a nice-to-have</span>, although I'll wager it's easier to write a product this way these days than not to. The real value-add will be in solving how this additional structure I've been talking about will be manipulated by hand, and preserved in some kind of exchangeable data format.</p>
      <p>Anyway, I figure about five million bucks ought to cover it. Anybody in?</p>
    </section>
  </body>
</html>
