
1 preamble (0:00-0:45)

  • so much of the information we are concerned with nowadays takes the form of symbolic representations of entities of all sorts, arranged primarily for the purpose of manipulation by computer.

    • not just documents and files, but representations of people, organizations, places, things…
    • some of these entities so represented are concrete, like specific people, specific buildings, a particular copy of a particular book
    • some of these entities are virtual, like a company, or the concept of the colour red…
    • still other entities are arguably their own representations, that is, they represent some result that exists solely inside the computer, and don't correspond to anything in the outside world.
    • information of this kind is generally referred to as "data".

1.1 what is data, anyway? (00:45-2:09)

  • data is the plural of datum!
    • (i'm not being glib here, i promise)
  • so what is a datum?
    • a datum (a word ordinary people don't often hear) can be understood as a specific fact, measurement, assertion, or claim about a thing, that is to say, about an entity.
  • now i'm going to conjecture here that part of the reason why you don't hear the singular word datum very often is because one datum is typically not found all by itself.
    • this is because a datum is typically only addressable relative to the thing it's describing.
  • at the very least, it needs some kind of container, something you can point to, some kind of information resource, to convey it.
    • like a document, database, or similar structure
    • because these take effort to compile, there's bound to be more than one datum per resource.
    • indeed there is likely, in and around a given resource, to be a great number of data: so many that it stops making sense to talk about data as individual objects, and starts making sense to treat data as a quantity.
      • just like you wouldn't say "glass of waters", or "pail of sands", "data" has become a mass noun.

1.2 so what's the problem? (2:09-4:34)

  • we regularly rely on specific information for understanding our situation, gleaning insights, planning, and making decisions
  • we process information at different places and times,
    • the results of this processing are often fed into subsequent processes,
  • the information — the data — we're interested in, may span many different resources and systems,
    • the information needs to be correct, or it is at best useless, if not actively harmful.
      • correctness is typically maintained by nominating a single authoritative source for a given piece of information, and then propagating that information "downstream".
      • this means that different systems are not only going to control the representations of different entities, but different assertions about the same entities could themselves be spread across disparate information systems.
        • these information systems themselves could be located in different administrative zones, or in other words, owned and controlled by separate people or organizations.
  • in order to reason, computationally, about certain entities, it may be necessary to integrate information from two or more sources into a single working set.
    • transforming these representations from disparate information systems into a common working representation is a big chore that has to be done anew for each system and each category of entity.
      • this is a huge pain, even when you control all the systems in question.
      • it is an even bigger pain in the much more likely scenario: that you don't control all, or even any of them.
    • so we have a couple of concrete problems:
      1. how do we mitigate inaccuracies caused by outdated data?
        • this is easy in principle
        • we let different information systems handle the resources they can assert some meaningful authority over, and we access these resources when we need them.
      2. how do we mitigate the transformation overhead when integrating data from different systems? specifically:
        • how do we encode these representations of entities so that we can maximize their reuse across information systems, including systems across administrative boundaries?
        • how do we identify and locate these entities?
    • these are questions that require much more detailed answers.

2 bob likes cookies (or: the basic problem of representation; 4:34-)

  • having characterized the problem somewhat, we can go into depth with a concrete example
  • take the following statement: "Bob likes cookies."
    • (with apologies to manu sporny)
    • if you were to write this down as an english sentence, it may be intelligible to you, but as computational raw material it is essentially worthless.
    • about all you can do with this text, "Bob likes cookies.", is search it for keywords, which won't tell you much.
    • in order to try to reason over this string of letters, we need to turn it into a formal representation:
      • likes: Bob -> Cookies (draw this as a graph)
    • it turns out you actually need some pretty sophisticated natural language processing to turn this (the text string) into this (the graph)
    • of course if i don't address the fact that i'm writing and drawing these examples on a piece of paper, i'm cheating you.
      • it takes a state-of-the-art neural network (that is to say, artificial intelligence) to get the handwritten text into digital text, to which the standard natural language processing could subsequently be applied.
      • it likely would take a custom neural net to do the same for the drawing, although you could skip the natural language processing part and go straight from the drawing to the data structure.
      • and this is with the caveat that all of this cutting-edge artificial intelligence might make a mistake (up to about 10% of the time).
      • (just to footnote here: the process of drawing the diagram is a lot easier to map to a formal structure than the completed diagram itself; that capability was first demonstrated in the late 1960s by GRAIL, the RAND graphical input language.)
    • what i'm trying to say here is that even with modern artificial intelligence to (usually) make sense out of what is effectively mud, there is a lot of value in a structured representation for data, so you don't have to go and re-run all those expensive computations that you may not always have access to.
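  • to make that gap concrete, here is a minimal sketch (in Python; the triple structure is my own illustration, not a standard) of the difference between the raw string and a structured representation:

        # the raw sentence: opaque to the machine beyond keyword search
        sentence = "Bob likes cookies."
        print("cookies" in sentence)  # True, but that's about all you can ask

        # the same claim as a structured triple: (subject, predicate, object)
        claim = ("Bob", "likes", "Cookies")
        subject, predicate, obj = claim
        print(f"{subject} --{predicate}--> {obj}")  # Bob --likes--> Cookies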

2.1 some questions (5:12-6:22)

  • who is Bob? or more accurately which Bob are we talking about? because it's common knowledge that there's more than one Bob.
    • for that matter, what is Bob?
      • is he a person? is he an animal? is he something else?
      • is he real, or is he fictional?
      • is he even a he?
    • (these are details that are not immediately obvious to a computer!)
  • what are cookies?
    • are we talking about the particular form of baked good?
    • or do we mean the pieces of data that get sent to your web browser?
    • or do we mean something else?
    • whether it is one or the other, is there a general consensus on what a cookie even is? and irrespective of that, does Bob agree with the consensus or does he have his own private definition?
      • (if Bob is a dog, then maybe "cookies" are a colloquialism for dog treats)
  • what does it mean for Bob to 'like' something?
    • does it mean Bob enjoys physically eating the food item known as "cookies", or does he merely enjoy contemplating the abstract concept of "cookies"?
    • (the distinction might matter when you least expect it!)
  • these are all questions that would need answers if you wanted to operationalize this one piddly little statement.
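  • a two-line sketch of why this matters computationally: nothing about the bare string "Bob" distinguishes one Bob from another, so a program merging claims will silently conflate them (the systems here are hypothetical):

        # two claims from different systems: same Bob? who knows
        claim_from_system_a = ("Bob", "likes", "cookies")  # Bob the person?
        claim_from_system_b = ("Bob", "likes", "cookies")  # Bob the dog?
        print(claim_from_system_a == claim_from_system_b)  # True, regardless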

3 a more likely scenario (6:22-)

  • "Bob likes cookies." is a single statement: a single datum, a claim about the entity Bob.
  • it is customary that an information system will bundle together a number of such claims associated with an entity, and then present that as a single logical object.
    • it is likewise customary to bundle the bundles together into a document, database, or other identifiable information resource.
      • something you can point to, something you can go to; something you can retrieve.
  • a familiar representation of such a bundling might be something that looks like a spreadsheet:
    • (show spreadsheet)
    • it is customary, though not essential, that each column represents a field, and each row represents a record, with the first row reserved to label the columns
    • we could just as easily flip the axes though, so the columns represent records and the rows represent fields:
      • (show transposed representation)
    • this two-dimensional representation is actually an illusion anyway: it's a one-dimensional sequence of one-dimensional sequences.
      • this means the underlying data structure could be congruent to the records, or it could cut across them all.
      • this is a good overture to the kinds of challenges we're faced with when exchanging data between systems:
      • the spreadsheet has no way to "know" which way it's oriented, only the human user can tell.
      • it's not just the content, but also the structure that is meaningful.
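  • to make the orientation problem concrete, a minimal sketch in Python (the field names are invented for illustration):

        # one "spreadsheet": a sequence of sequences, with rows as records
        rows = [
            ["name", "likes", "species"],
            ["Bob", "cookies", "dog"],
            ["Alice", "tea", "human"],
        ]

        # the transpose: columns as records; same content, different structure
        columns = [list(col) for col in zip(*rows)]

        # nothing in either structure says which orientation is "right";
        # that knowledge lives entirely in the consumer's head (or code)
        print(rows[1])     # ['Bob', 'cookies', 'dog']
        print(columns[1])  # ['likes', 'cookies', 'tea']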

3.1 let's go shopping (for math)

  • if the structure is as meaningful as the content, then we should probably pick a good one,
    • and it's the math people who always have the best structures.
  • there are a number of ways to represent these structures mathematically, but a particularly appropriate one is a thing called a tuple:
    • (think single, double, triple, quadruple, quintuple, et cetera; tuple is short for N-tuple where N is some number)
    • each position in a tuple is identified with a meaning, so the first slot means this, the second means that, and so on
      • (maybe show one?)
    • so you can make a tuple of labels and then match their positions to the tuples of values; it's a straightforward encoding scheme: name -> # -> value
    • this is very close to exactly what is often happening inside the computer
    • many programming languages make this kind of structure more convenient by providing a mapping type, but all those do is collapse this lookup process from two logical steps to one
      • (mapping types go by other names:)
        • "dictionary"
        • "hash" (referring to a usually-unimportant detail of the underlying implementation)
        • sometimes "object", though in most languages an "object" type is distinct from a mapping type, even though they might behave similarly
  • inside a single program running on a single computer, these structures can take any form that's most convenient for the program's execution.
    • if you need to move these things from one computer to another (or to a different program on the same computer, or to the same program that needs to exit and start again!), they have to be piled together into what amounts to a set of instructions for recreating the structure and its semantics internally.
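  • a minimal sketch of the scheme just described (the names are invented for illustration): a tuple of labels matched by position to a tuple of values, and the mapping type that collapses the lookup:

        # one tuple of labels, matched by position to a tuple of values
        labels = ("name", "likes", "species")
        record = ("Bob", "cookies", "dog")

        # two logical steps: name -> position -> value
        position = labels.index("likes")
        print(record[position])  # cookies

        # a mapping type collapses the lookup to one step: name -> value
        mapping = dict(zip(labels, record))
        print(mapping["likes"])  # cookies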

3.2 fuckin json how does it work

  • the term for said instructions is a serialization: a format suitable for storing as a file, or sending as a message across a communication channel.
    • so we can transform the table from this (row-record table)
    • to this (column-record table)
    • to this (tuples)
    • to this (map)
    • to this (json with ordinary strings as keys)
    • and now we can finally talk about the actual concrete problem we're trying to solve.
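    • as a runnable sketch of that chain (using Python's standard json module; the field names are invented for illustration):

          import json

          # the in-memory structure, convenient for the program...
          record = {"name": "Bob", "likes": ["cookies"], "species": "dog"}

          # ...serialized into a run of text suitable for a file or a message
          text = json.dumps(record)
          print(text)  # {"name": "Bob", "likes": ["cookies"], "species": "dog"}

          # ...and deserialized back into an equivalent structure elsewhere
          reconstructed = json.loads(text)
          print(reconstructed == record)  # True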

3.2.1 baby's first json

  • what we have here is one representation of an actual set of instructions to recreate one of these mapping objects
    • (or rather, since what you are seeing is a picture of a run of text, it is actually a representation of a representation.)
  • this formal notation, which is only one of an entire universe of formal notations, makes it much easier to move pre-digested information around.
    • the curly braces represent the start and end of the denotation
    • the colons connect the left and right sides as key-value pairs
    • the square brackets denote an explicit sequence of elements
    • individual elements are separated by commas
    • (note as well that the spacing and line breaks are not significant, they are merely there to make this example more readable.)
  • this brings us back to the generalized Bob-likes-cookies problem:
    • you can ingest this run of text into a program and it will generate the corresponding structure, but the structure will be useless to the program unless the program "knows" what it's looking at.
  • the first remark is: what kind of entity does this representation even represent?
    • identification is often context-dependent: the consuming program "knows" what is being represented because it "knows" where it got the information from.
      • this is like saying "I found this object in the fork drawer, so it must be a fork."
      • there are a zillion problems with this, a number of which can be yoked together by contemplating what happens when we can't rely on where it came from to identify it.
      • in order to be of any use, the object needs a mechanism for declaring what kind of thing it represents.
        • I won't get into specific detail yet, because there are some other issues I need to address.
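  • to see the ingestion step concretely, a sketch in Python (the document and its field names are invented for illustration):

        import json

        # a run of text in the formal notation described above
        text = '{"name": "Bob", "likes": ["cookies", "naps"], "good boy": true}'

        # ingesting it generates the corresponding structure, but the program
        # still has no idea what kind of entity this structure represents
        thing = json.loads(text)
        print(thing["likes"])  # ['cookies', 'naps']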

3.2.2 the symbol management problem

  • let's focus our attention for a moment on these left-hand-side labels:
    • these may be suitable for labels, at least in English, but are altogether unsuitable as identifiers:
      • it is important to understand that the computer doesn't care what these identifiers are: they could be numbers, they could be wads of gibberish.
      • the only genuine requirement on the part of the computer is that they are unique, which in practice also entails that they are exact.
        • another convention, although arguably not strictly necessary, is that identifiers conform to the characters you would see on a standard American typewriter, and that they don't contain any spaces.
      • otherwise, identifiers are a piece of user interface, where the "user" is typically a programmer.
      • the way to deal with identifiers is to put them into what is called a controlled vocabulary, which is exactly what it sounds like: a very strict dictionary of terms and their very specific meanings.
      • other niceties of identifiers include not only being memorable, but also inferrable
      • that is, given knowledge of a subset of identifiers in a vocabulary, you are likely to guess others correctly, without having to go look them up.
  • so given that, let's change those labels to something more identifier-y.
  • there are a few conventions for multi-word identifiers; i'm not gonna get into them in depth, but here are the usual ones:
    • camel case (likesCookies)
    • pothole or snake case (likes_cookies), which uses underscores
    • lisp or kebab case (likes-cookies), which uses hyphens
  • the tendency is to pick either upper or lower case; upper case used to be more fashionable but is now considered the text equivalent of yelling
    • in principle it doesn't matter which you pick, as long as you pick one
  • again, the computer doesn't care which you use, but it's considered gauche to mix these conventions, so pick one and stick with it.
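  • continuing the invented example from above, the same record with more identifier-y keys:

        # "good boy" is fine as a label, but as an identifier it violates the
        # no-spaces convention; pick one multi-word style and stick with it
        record = {
            "name": "Bob",
            "likes": ["cookies", "naps"],
            "goodBoy": True,     # camel case
            # "good_boy": True,  # pothole/snake case
            # "good-boy": True,  # kebab case (fine as a string key, though
            #                    # not as a bare name in most languages)
        }
        print(record["goodBoy"])  # True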

3.3 remember what i was saying about uniqueness being important

  • so what i'm showing you is more or less the best current practice for data exchange in networked information systems
    • it nevertheless leaves some serious problems unsolved.
  • so, you pile together all your terminology into a controlled vocabulary, you put that online in some ad-hoc form, and you repeat that process for every information system you create.
  • and everybody on the consuming side has to absorb your vocabulary, plus the vocabulary of every other system they consume data from, plus that of their own system.
    • (or at the very least they have to integrate your controlled vocabulary with every single other controlled vocabulary they interact with.)

3.4 who owns a word, anyway?

  • so we have a situation where every information system that publishes a data interface has to define what it means by every term in its controlled vocabulary, though most of these systems are very often talking about the same things (people, companies, products, documents, etc.)
    • considerable effort has to be expended on the consumer side, either reconciling different terms that mean the same thing, or reconciling the same term used in two or more systems to mean different things.
  • the root of the problem is that these are just words, and anybody can use them privately however they like
    • for instance, i can make a private rule that asserts that every time i say the word "apple", what i mean is "banana"
      • but getting other people to adopt this convention requires a lot more muscle.
  • entities like facebook and google can simply command compliance with their worldview by dint of the sheer size of their platforms, which operate as totalitarian fiefdoms.
    • you have to speak the king's tongue in order to interact with the realm.
    • this limits not only what you can say, but also what you can mean, since interpretation is completely up to the platform in question.
  • this is the kind of issue standards bodies, or at least industry consortia, exist to settle.
    • a lot of these entities are just big clearing-houses of "what we mean when we say X"
    • but then come the politics!
    • it turns out that being able to impose public definitions of what words mean is incredibly powerful!
    • it is therefore naturally highly contested territory!
  • and then these standards bodies are often captured by their most powerful members
  • the situation reminds me of a quote i like:
    • "Technology standardization is commercial diplomacy and the purpose of individual players (as with all diplomats) is to expand one's area of economic influence while defending sovereign territory."

3.5 finally, fiiiiiiinally…

  • standards bodies decouple these controlled vocabularies from the platforms, but as I just said, influencing the content of these vocabularies, either the terms themselves or what they mean, is a costly process.
    • so what if there was a way to create these controlled vocabularies unilaterally and put them online for whoever wants to use them?
    • what if, likewise, you could use and extend third-party vocabularies just as you would any open-source software?
    • it turns out that standardizing this capability has been in the works since 1996, and while it was slow to get going, it has had all the necessary (in my opinion) parts and pieces in place since about 2010 at the latest.
    • this family of standards is called RDF, or more informally, linked data.
      • you might also hear the phrase knowledge graph, which is new on the scene, and you might remember hearing semantic web.
      • these are all overlapping regions in a venn diagram.
        • (show venn diagram)

3.6 what does it consist of?

  • you take these identifiers and turn them into Web addresses.
    • what is the effect of this change?
      1. first of all it guarantees the identifiers will be unique in a global context, since URLs have an authority component,
      2. if the vocabulary authors follow best practices, the terms are de facto links to their own documentation and formal, machine-readable specification. (This is extremely handy!)
      3. these formal specifications are therefore online and available for anybody to use and build off of
        • the ones that are useful get currency and stick around
    • what about side effects?
      • well, the identifiers themselves naturally get a lot longer
      • this means more to type and therefore more to mistype if you're laying these structures out by hand
        • by the way, don't lay this stuff out by hand
      • you will also see arguments that all these extra letters are inefficient for storage and transmission of the data
        • in practice, any efficiency penalty is going to be negligible
      • there are likewise common conventions that make the basis for this complaint go away
      • there is the (in my opinion) more serious problem of these vocabularies disappearing from the web, or their authors otherwise losing control over them, but this can be mitigated through sensible practices as well.
  • I should also remark that we can give the same URL treatment to the identity of the resource itself along with the type of entity it's supposed to represent.
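  • as one concrete (and real) way this plays out, JSON-LD lets a document map short keys to full URLs in a "context", which is also the usual answer to the long-identifier complaint; a minimal sketch (https://schema.org/name is a real term, while the example.com vocabulary and addresses are hypothetical):

        import json

        # a JSON-LD-style document: the @context maps short keys to URLs,
        # so the body stays terse while the identifiers stay globally unique
        doc = {
            "@context": {
                "name": "https://schema.org/name",
                "likes": "https://example.com/vocab#likes",
            },
            # the resource itself and its type get the same URL treatment
            "@id": "https://example.com/people/bob",
            "@type": "https://example.com/vocab#Dog",
            "name": "Bob",
            "likes": ["cookies"],
        }
        print(json.dumps(doc, indent=2))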

3.7 so why aren't these standards being used more broadly?

  • the answer is that these standards are being used, just not in the one place where you'd most expect them: mainstream Web development.
    • The big operators are:
      • publishing and journalism (other than Web, apparently)
      • GLAMR (galleries, libraries, archives, museums, records)
      • finance
      • government
      • but the big one is biomedical, you see that industry's fingerprints on everything in this space
      • as a rule, adoption also seems to be bigger in europe than in north america
  • these standards are being used on the Web in a narrow sense
    • google and facebook use RDF to help you help them put enhanced web page previews on google and facebook
      • but they've managed to mangle it, each in their own way, so the information isn't terribly useful for anything but putting web page previews on google and facebook.
        • (twitter made a superficially similar effort but it doesn't look like they even tried to comply with the spec.)
    • this is probably the most unimaginative and impoverished application of the technology, deployed in the most cynical way
  • i'm not exactly sure why these standards don't get wider adoption, though.
    • a common argument against it is usually an aesthetic one, couched in "efficiency" at some scale or another
      • often the charge is that this technology unnecessarily complicates things.
      • it does complicate things initially, as introducing any new way of doing things would, there's no denying that.
      • but what does and does not qualify as "necessary" depends on what your values are, so let's talk about those.

3.8 values lol

  • the technologies that underpin linked data are very good at decoupling meaningful informational content from the particular silo in which it is stored.
  • the principal business model of silicon valley (and its would-be mimics elsewhere in the world) is precisely to build silos in order to hoard and arbitrage data.
    • i mean, that's what a platform is.
  • indeed, at the heart of any business is giving people a reason to deal with you instead of somebody else.
  • information is an attractive reason because while it can be copied, it can't be substituted.
    • so for any particular piece of information, if people can't get a copy more easily somewhere else, they have to get it from you.

3.9 what if you didn't care about hoarding data?

  • the key benefit of this technology is moving information between distinct systems.
    • people who work at companies that don't prioritize this kind of interoperability are going to view it as unnecessarily complicated
      • we can postulate that this is because gaining proficiency around radical interoperability simply doesn't get you promoted in what has come to be called "surveillance capitalism".
    • but virtually every organization has more than one information system, so this capability is still useful, even if you don't expose it to end users.
      • that's because it solves a host of practical problems that arise with trying to merge information from different systems
      • you can tell they're serious problems because there are alternative, and in my opinion, inferior, attempts to solve them.
  • but what if you don't care about hoarding data?
    • what if you're a company whose value proposition is precisely that your customers can cash out a hundred percent of their data and never look back?
    • then you could take advantage of solutions to a raft of boring practical problems that have plagued information systems as long as they have existed, and open up an entire new arena of competition, where the data-hoarding incumbents will have a hard time chasing you.
    • or what about organizations that exist to share information and/or don't "compete" in the traditional sense?
      • universities,
      • government agencies,
      • think tanks
      • museums and archives,
      • societies, unions, and trade or professional associations,
      • etc
  • organizations like these don't have the same priorities as tech startups, so why derive their methods from silicon valley values?

4 wrap it up, nice and neat

  • from my own perspective, and in my professional capacity, there are two principles i adhere to:
    1. that a business relationship is likely to be more equitable if there are alternatives to it,
    2. and that computers could be doing more work than they currently do.
  • relationships with businesses built upon hoarding data are naturally lopsided in their favour. that's just their nature.
    • and if we rely exclusively on these entities, our capabilities to understand our environment and to express ourselves as individuals, communities, and as citizens in civil society will always be limited to whatever is on their menu.
  • just like the free and open-source software movements of previous decades, the radical decoupling of data from data silos is a political agenda.
  • it is my conviction that people and organizations who have the temerity to view truly open data as a strength rather than a weakness, will be rewarded, if not immediately, then surely in the longer term.
    • perhaps this could be you.

Author: dorian

Created: 2022-03-18 Fri 10:36
