RDF-KV

This is a very early draft, more like running notes, of a protocol for embedding RDF data into plain-jane, no-JavaScript, your-grandma's HTML.

Rationale

Just about any data you could express in a web form can also be expressed in RDF.

As such, it makes sense to write web apps that natively speak RDF in order to take advantage of the vast library of vocabularies, inference, validation, etc.

The only problem is that any browser-based application would then need some kind of JavaScript contraption to send RDF (Turtle, JSON-LD, whatever) to the server, which is inherently brittle and adds complexity to debug.

Solution: create a way to express RDF using boring old web forms, and then put a filter on the server that turns conforming application/x-www-form-urlencoded request content into RDF before processing.

Requirements

Must yield valid HTML 4 because 5 is no longer deterministic
Grammar must be regular so it can be parsed easily with a regex
Must be able to be typed by hand to facilitate slapdash prototypes
Must assume input is already parsed, and not rely on the order of form inputs (unlike RDF/POST)
Must not depend on inference, stored prefixes, etc, though another part of the implementation certainly could use them
Must ignore input that doesn't match the protocol
Should, however, raise an exception on malformed attempts to match the protocol
Must "fail open", i.e. not do stupid or destructive stuff if malformed
Must not be too chatty, i.e. be as succinct as possible; use the fewest bytes to express semantics

Basic Syntax

The form element's action URI is the subject, the input elements' names are predicates, and their values are the objects.

<form method="POST" action="http://example.com/my/resource">
  <input type="text" name="http://purl.org/dc/terms/title"/>
  <button>Set the Title</button>
</form>

will produce:

<http://example.com/my/resource> dct:title "Whatever you wrote" .
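
A minimal sketch in Python of the mapping above, assuming the server already knows the form's action URI. The function name basic_statements and the plain-tuple representation of a triple are illustrative assumptions, not part of the protocol. It covers only the basic syntax, where the whole field name is the predicate and the value is a plain literal.

```python
from urllib.parse import parse_qsl

def basic_statements(action_uri, body):
    """Turn a urlencoded request body into (subject, predicate, literal) tuples."""
    # parse_qsl percent-decodes names and values and, by default,
    # drops fields whose value is blank.
    pairs = parse_qsl(body)
    return [(action_uri, name, value) for name, value in pairs]

body = "http%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Ftitle=Whatever+you+wrote"
triples = basic_statements("http://example.com/my/resource", body)
```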

If you aren't aware, the DTD attribute type for name is and always has been CDATA, which means it can be any non-empty string. This is great for creating triples with plain literal values, but the addresses of resources could also be inferred from rdfs:range properties in whatever schema a given predicate belongs to.

Resources and Blank Node Identifiers

In lieu of such inference, however, we can supply the following:

<input name="http://www.w3.org/1999/02/22-rdf-syntax-ns#type :"/>

The colon : at the end of the name signifies that the input's value should be treated as a resource. If we want the input's value to represent a blank node identifier, we use the underscore character _ instead.

Literals, Languages and Data Types

Even though the default behaviour is to treat input values as plain literals, there are language-tagged and typed literals to consider as well. We encode these by adapting a syntax similar to Turtle's:

<input name="http://purl.org/dc/terms/description @en"/>
<input name="http://purl.org/dc/terms/created
             ^http://www.w3.org/2001/XMLSchema#date"/>

Here, the two inputs above prescribe the language and datatype of their respective values. Note that a literal can only have a language or a datatype, not both. For the sake of completeness, although it likely won't come up often in practice, the character to disambiguate plain literals is the apostrophe '.
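
The designators so far (:, _, ', @lang and ^datatype) could be interpreted along these lines. This is a sketch under assumptions: the function name make_object and the (kind, value, qualifier) tuple are made up for illustration, and the designator is assumed to have already been split off the field name.

```python
def make_object(designator, value):
    """Return a (kind, value, qualifier) tuple for one form value.

    designator is the token following the predicate, or None if absent.
    """
    if designator is None or designator == "'":
        return ("literal", value, None)            # plain literal (the default)
    if designator == ":":
        return ("resource", value, None)           # value is a URI
    if designator == "_":
        return ("bnode", value, None)              # value is a blank node id
    if designator.startswith("@") and len(designator) > 1:
        return ("literal", value, ("lang", designator[1:]))
    if designator.startswith("^") and len(designator) > 1:
        return ("literal", value, ("datatype", designator[1:]))
    # Malformed attempts to match the protocol should raise.
    raise ValueError(f"malformed object designator: {designator!r}")
```

Because the language and datatype travel in the same slot, a literal with both cannot be expressed, which matches the restriction noted above.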

Subject

In the case you need to specify a different subject, simply prepend it to the predicate.

<form method="POST" action="http://example.com/my/resource">
  <input type="text" name="http://example.com/other/resource
                           http://purl.org/dc/terms/title"/>
  <button>Set the Title</button>
</form>

Graph

If you need to specify a graph other than the default for an individual statement, put the graph's URI after the object designator.

<input name="http://purl.org/dc/terms/title ' http://example.com/my/graph"/>

This would be a rare instance in which you would encounter the need for the ' designator, which you can naturally omit if you also specify a subject:

<input name="http://example.com/other/resource
             http://purl.org/dc/terms/title
             http://example.com/my/graph"/>
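
Here is a rough Python sketch of how a field name could be pulled apart into subject, predicate, object designator and graph, covering the forms shown so far. The names split_name and is_designator are hypothetical, and the sketch deliberately leaves out the leading modifiers and the trailing $ covered in other sections.

```python
def is_designator(token):
    """True for the object designators : _ ' @lang ^datatype."""
    return token in (":", "_", "'") or token[0] in "@^"

def split_name(name):
    """Split a field name into (subject, predicate, designator, graph).

    Missing parts come back as None; unparseable names raise ValueError.
    """
    tokens = name.split()  # any run of whitespace, including newlines
    marks = [i for i, t in enumerate(tokens) if is_designator(t)]
    if len(marks) > 1:
        raise ValueError(f"more than one object designator: {name!r}")
    if marks:
        before, after = tokens[:marks[0]], tokens[marks[0] + 1:]
        designator = tokens[marks[0]]
    else:
        # Without a designator, a third term can only be the graph.
        before, after, designator = tokens[:2], tokens[2:], None
    if not 1 <= len(before) <= 2 or len(after) > 1:
        raise ValueError(f"cannot parse field name: {name!r}")
    subject = before[0] if len(before) == 2 else None
    graph = after[0] if after else None
    return subject, before[-1], designator, graph
```

A caller would fall back to the form's action URI when the subject is None, and to the default graph when the graph is None.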

Statement Reversal

The condition often arises that we wish to specify the input's value as a subject rather than an object. To accommodate this, put a bang ! at the front of the field name:

<input name="! http://purl.org/dc/terms/creator"/>

This changes the direction of the statement, so subjects become objects and objects become subjects. Note that you can only specify URIs or blank nodes this way. If you want to use a reverse statement with a literal, use a placeholder.

Add/Subtract

The default behaviour is to merge relevant resources with the contents of the form, but if you want to delete statements, prepend with a -. + is a no-op for the default behaviour.

<input name="- dct:title"/>

Also consider = for "nuke all subject-predicate pairs of this kind and replace them with this value"

<input name="= dct:title"/>
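
To make the three behaviours concrete, here is a small sketch that applies one statement to a graph modelled as a Python set of (subject, predicate, object) tuples. The function name apply_statement and the set-of-tuples model are assumptions for illustration; the modifier is assumed to have been split off the front of the field name already.

```python
def apply_statement(graph, modifier, triple):
    """Apply one statement to a set of (s, p, o) tuples, in place."""
    s, p, _o = triple
    if modifier == "-":
        graph.discard(triple)                 # delete; no error if absent
    elif modifier == "=":
        # Nuke every statement with this subject and predicate first.
        for old in [t for t in graph if t[0] == s and t[1] == p]:
            graph.remove(old)
        graph.add(triple)
    elif modifier in ("+", None):
        graph.add(triple)                     # default: merge
    else:
        raise ValueError(f"unknown modifier: {modifier!r}")

graph = {("s", "p", "a"), ("s", "p", "b"), ("s", "q", "c")}
apply_statement(graph, "=", ("s", "p", "z"))
```

Note that applying = one statement at a time means a second = value for the same subject and predicate would wipe out the first. A real implementation would probably want to collect all the removals before doing any additions.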

Control Words

because you will invariably want to change the global behaviour: $PREFIX etc

SUBJECT: the default subject instead of the form's action
GRAPH: override the default graph
PREFIX: override namespace prefix declarations
TARGET: redirect to some other address (dunno if this is smart or dumb)

Abbreviation

typing in full URIs sucks

prefixes/CURIEs, duh

ok but consider the situation where the entry exists but the prefix isn't registered (then you get garbage data)

or the prefix collides with a URI scheme (like http) (then you get MORE garbage data)

so here's a question: do we do "variables" as in proper variables, or do we do "macros" as in dumb text substitution?

ok how about this bnf

rdf-kv ::= partial-statement | declaration
declaration ::= '$' WS NCName (WS '$')?
macro ::= '$' NCName | '${' NCName '}'
term ::= IRI | CURIE | macro
partial-statement ::= (modifier WS)? (term (WS term)? (WS designator)? |
    term WS designator WS term |
    term WS term (WS designator)? WS term) (WS '$')?

Validation

Obviously you could use RDFS/OWL/XSD to validate form contents, but there is a problem with tracing invalid statements back to the original inputs. Basically you will need to keep track of the original, bit-for-bit, verbatim form keys and values, in document order (the order of the keys is not so important, but being verbatim is). Then it should be a matter of devising a response body that contains enough information to stitch the offending form controls back together.

About Those Macro Expansions

The macros in RDF-KV are basic string-replacements à la shell variables.

Note that the form designer should endeavour to hide this macro business from end users. It is for me, not for them.

The reason we need this type of system in the first place is because the menu of HTML (pre-5, and arguably 5 as well) form controls is pretty bleak for the kind of information real human beings actually tend to need to manipulate (remember, the ultimate objective is to make this easier for people).

Take dates, for example, often separated into three <select> boxes for year, month and day, in lieu of a sane alternative. There has to be some mechanism for concatenating those values together, and if the product of this protocol is a set of RDF statements that require no further manipulation, those values have to be concatenated in transit.

<select name="$ y">
  <option value="2013">2013</option>
  <!-- and so on -->
</select>
<select name="$ m">
  <option value="01">January</option>
  <!-- et cetera -->
</select>
<select name="$ d">
  <option value="01">1</option>
  <!-- und so weiter -->
</select>
<input type="hidden" name="dct:created ^xsd:date $" value="$y-$m-$d"/>

Conditional Expansion

You might have noticed the $ terminating the statement template. It is an explicit signal to expand macros in the statement's value. Macro expansion is off for values (of either statements or macro declarations) by default, in case the end user happens to accidentally type one in. However, macro expansion is always on for the statement templates.
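
A sketch of the opt-in expansion for values, in Python. Two assumptions worth flagging loudly: bare macro names are restricted here to letters, digits and underscores, which is narrower than NCName, because a full NCName may contain hyphens and "$y-$m-$d" would otherwise read as a macro called "y-"; and each macro holds exactly one string value. The names expand and statement_value are made up for illustration.

```python
import re

# Assumption: narrower than NCName so that "$y-$m-$d" splits as intended.
NAME = r"[A-Za-z_][A-Za-z0-9_]*"
MACRO = re.compile(r"\$(?:\{(" + NAME + r")\}|(" + NAME + r"))")

def expand(template, macros):
    """Replace $name and ${name} using the single-valued dict macros.

    References to undeclared macros are left untouched in this sketch.
    """
    def repl(match):
        name = match.group(1) or match.group(2)
        return macros.get(name, match.group(0))
    return MACRO.sub(repl, template)

def statement_value(name, value, macros):
    """Expand value only when the field name ends with a standalone $."""
    tokens = name.split()
    if tokens and tokens[-1] == "$":
        return expand(value, macros)
    return value  # expansion is off by default for values

date_parts = {"y": "2013", "m": "01", "d": "01"}
created = statement_value("dct:created ^xsd:date $", "$y-$m-$d", date_parts)
```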

Empty Values

Empty values are going to have to have a different meaning for macros than statement templates. Namely, you can't discard the declaration because it's empty, since it might be used later on. But, if a macro declaration has multiple values, where one or more values are empty and at least one isn't, the empty ones should be discarded.
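
One way to read the rule for empty macro values, sketched in Python. The function name collect_macros is hypothetical, macro names are not validated here, and collapsing an all-empty declaration down to a single empty string is my assumption rather than something decided above.

```python
def collect_macros(pairs):
    """Gather "$ name" and "$ name $" declarations from (key, value) pairs.

    Returns a dict mapping each macro name to a list of values.
    """
    raw = {}
    for key, value in pairs:
        tokens = key.split()
        is_declaration = (
            len(tokens) in (2, 3)
            and tokens[0] == "$"
            and (len(tokens) == 2 or tokens[2] == "$")
        )
        if is_declaration:
            raw.setdefault(tokens[1], []).append(value)
    macros = {}
    for name, values in raw.items():
        nonempty = [v for v in values if v != ""]
        # Keep the declaration even when every value is empty,
        # since it might be referenced later on.
        macros[name] = nonempty if nonempty else [""]
    return macros
```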

Multiple Values

These are web forms and application/x-www-form-urlencoded carrying this data, so that means macros can be defined more than once. What does that mean for variable substitution? I'm thinking the behaviour that would ultimately be the least surprising would be Cartesian product. (Not surprising in the logical sense, but potentially very surprising in the engineering sense!)

Look at it from the perspective of the user (in this case, the form designer), who is designing for their own user and wants that user to do as little work filling out forms as possible. Enter one value in a slot and use it in multiple places; enter two values and it should generate two statements.

The problem with the Cartesian product of N sets is that it can get really big, really fast.

I'm going to permit recursive expansion in macro declarations, because it would be lame if I didn't.

I'm not going to permit macro expansion at all in the names of the macros themselves, because that is just crayzo.

For statement values, there is no need for recursion, but consider the interaction of having multiple, multiple-valued macros in the same value: Cartesian product.

<!-- imagine both $first and $last each have 10 first/last names -->

<input type="hidden" name="dct:contributor $" value="$first $last"/>

<!-- you're looking at 100 (meaningless) statements getting generated -->

The statement templates are where things get interesting. They would behave the same way as the values do, but they would multiply the number of statements produced even higher. We're talking about a Cartesian product (statements) of a Cartesian product (values) of a Cartesian product (macros). Immediately that situation brings to mind denial-of-service attacks where a tiny message explodes into a crippling logic bomb. But such a device would have characteristically low initial entropy, which is amenable to detection heuristics: essentially, any enormous number of RDF statements generated in this fashion is simply not going to be very interesting, and would therefore be immediately suspect, and the process of inflating them can be shut down long before the set gets too big.
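
The value-level Cartesian product can be sketched with itertools.product. Everything named here is an assumption for illustration: the function expand_all, the restricted macro-name pattern (letters, digits, underscores), and especially the limit parameter, which is only a crude stand-in for the detection heuristics mentioned above. It counts the product before building it, so the inflation never starts if the result would be too big.

```python
import itertools
import re

NAME = r"[A-Za-z_][A-Za-z0-9_]*"
MACRO = re.compile(r"\$(?:\{(" + NAME + r")\}|(" + NAME + r"))")

def expand_all(template, macros, limit=1000):
    """Return every expansion of template; macros maps name -> list of values."""
    names = []
    for match in MACRO.finditer(template):
        name = match.group(1) or match.group(2)
        if name in macros and name not in names:
            names.append(name)
    total = 1
    for name in names:
        total *= len(macros[name])
    if total > limit:
        raise ValueError(f"expansion would yield {total} values (limit {limit})")
    results = []
    for combo in itertools.product(*(macros[n] for n in names)):
        binding = dict(zip(names, combo))
        results.append(MACRO.sub(
            lambda m: binding.get(m.group(1) or m.group(2), m.group(0)),
            template))
    return results

people = {"first": ["Ada", "Alan"], "last": ["Lovelace", "Turing"]}
names_out = expand_all("$first $last", people)
```

With two first names and two last names this yields four values; ten of each yields the hundred mentioned above.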

Unbound Macros

suppose you reference a macro that was never declared. what happens?

ignore it (leave the $symbol reference alone): pros: doesn't screw with input beyond defined macros, can use literal $ characters; cons: will create garbage data if there is an error in the form.
raise an error: pros: informative to the form designer, who is really the intended user of macros; cons: might blow the end user up unexpectedly, plus you'll have to pull some crap to get literal $ chars into the form values if you want 'em (e.g. by making a non-expanding macro that contains a dollar sign).
replace it with the empty string: pros: consistent with the way it works in Bourne etc shells; cons: fails silently and produces garbage data.

currently leaning toward leaving it alone
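
The three options are easy to compare side by side in a sketch. The unbound parameter and its values ("ignore", "error", "empty") are names I made up for illustration, as is the restricted macro-name pattern; macros are single-valued here to keep the example short.

```python
import re

NAME = r"[A-Za-z_][A-Za-z0-9_]*"
MACRO = re.compile(r"\$(?:\{(" + NAME + r")\}|(" + NAME + r"))")

def expand(template, macros, unbound="ignore"):
    """Expand $name and ${name}, handling undeclared macros per policy."""
    if unbound not in ("ignore", "error", "empty"):
        raise ValueError(f"unknown policy: {unbound!r}")
    def repl(match):
        name = match.group(1) or match.group(2)
        if name in macros:
            return macros[name]
        if unbound == "ignore":
            return match.group(0)   # leave the $symbol reference alone
        if unbound == "empty":
            return ""               # Bourne shell behaviour
        raise KeyError(name)        # "error": tell the form designer
    return MACRO.sub(repl, template)
```

Under "ignore", a value like "costs $5" passes through unchanged either way, since $5 does not look like a macro reference under this pattern.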

Gimme/"Environment" Variables

The server should never trust the client.

This is an actual use case I'm interested in: I want to use this protocol to make a complex RDF structure, and I want the subjects to be UUID URNs (a technique I use religiously for canonical URIs and/or ones I haven't decided what to name yet). I want to make sure I pick UUIDs that don't collide with ones already in the database, lest I corrupt a bunch of existing data.

But, you say, if the UUIDs are generated in the standard fashion, the likelihood of that happening is infinitesimal. Indeed that's how they were designed. Sure, but consider boneheaded scenarios where one is hard-coded, or left behind from some other process, and so on. Better yet: what about if somebody is up to no good and knows the URI (UUID or not) and slips some harmful statements into the form? Fine. Throw an ACL on the target. But what about the error message (for the benign user whose form submission incomprehensibly doesn't work)? Best to just let the server generate any necessary new identifiers. Consider:

<input type="hidden" name="$ new $" value="urn:uuid:$NEW_UUID"/>
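
A sketch of how the server side of this might look, assuming macros are kept as a dict of name to list of values and that the set of identifiers already in the database can be consulted. The names fresh_uuid and merge_macros are hypothetical, and handing out one UUID per request (rather than one per reference) is my assumption. The server-supplied value always wins over anything the client declared under the same name.

```python
import uuid

def fresh_uuid(existing):
    """Generate a random UUID string that is not in the set existing."""
    while True:
        candidate = str(uuid.uuid4())
        if candidate not in existing:
            return candidate

def merge_macros(client_macros, existing_ids):
    """Overlay server-generated variables on the client's declarations."""
    merged = dict(client_macros)
    # Never trust the client: overwrite any NEW_UUID it tried to declare.
    merged["NEW_UUID"] = [fresh_uuid(existing_ids)]
    return merged
```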

I am well aware this kind of functionality can get out of hand.

Security Considerations

Oh boy! You mean besides the ones already considered?

If it isn't already evident, this protocol should only be used to POST standard application/x-www-form-urlencoded HTML forms. It would make a complete mess of the query string if it was used with GET. Also, at the time of this writing, I have no idea how you would reconcile this protocol with file uploads.

Since that nasty FTP URL trick was plugged, I'm not sure how you get browsers to POST across domains without JavaScript, so I think we're safe there.

Pretty certain all subjects mentioned in the form, unless they're brand new, should be topologically connected somehow to the resource to which the form was POSTed. Certainly if you're going to be making any destructive changes. We can imagine this lending itself to an escalation attack where the attacker connects two unconnected resources together in order to mess with one of them in a later request. Actually we can imagine a lot of things, so it's probably best to get this thing working so we can figure out all the glorious ways we can break it.

Also, this is going to have to go for $SUBJECT, which is the override for the form action URI.
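
The connectivity check could be as simple as a breadth-first search over the existing graph, ignoring edge direction. This sketch assumes the caller passes only resource-to-resource statements (literals filtered out beforehand, otherwise two resources sharing a literal value would look connected). The function name connected is made up for illustration.

```python
from collections import deque

def connected(triples, start, target):
    """True if target is reachable from start, ignoring edge direction.

    triples is an iterable of (subject, predicate, object) tuples whose
    subjects and objects are all resources or blank nodes.
    """
    neighbours = {}
    for s, _p, o in triples:
        neighbours.setdefault(s, set()).add(o)
        neighbours.setdefault(o, set()).add(s)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

A server might refuse destructive statements about any subject for which this returns False relative to the POST target, or relative to the SUBJECT override when one is given.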

Implementation

Perl, because I need that right now.
Python, because lots of people (including me) use that.
Some kind of JavaScript/jQuery adapter, because that would be useful for progressive enhancement.
Others? FYPM.

Future Directions

Shorthand for rdf:List, Seq, Bag, Alt?
some functionality for doing basic lookups in the existing graph?