Slow Data, Decentralization and Semantic Web Architecture

[I've still got a bug in my blog software which mangles links, so apologies for the ironically unlinky URIs]

Slow Food (http://en.wikipedia.org/wiki/Slow_Food) is an international movement founded to offer an alternative to fast food: "it strives to preserve traditional and regional cuisine and encourages farming of plants, seeds and livestock characteristic of the local ecosystem". By a little analogical legerdemain, fast data is the kind of stuff you get from regular search engines - quick but not very nutritious, probably bad for you. Slow Data on the other hand has been harvested with care and with attention paid to its preparation. It's far more satisfying in the long run. While complex Semantic Web systems are currently at a slight disadvantage performance-wise (largely due to their youth), there's no reason that high quality data can't be readily accessible at high speed using existing, well-documented Web techniques. But I'll call it Slow Data anyhow.

So... I recently got a letter (!) which included a description of a proposed social net application based around RDF data. The author knew what they were talking about and the system sounded good, but they were really struggling with one aspect: how to avoid making a centralised system.

One of the great rallying cries of the Linked Data movement has been to open data out to the Web. I doubt very much that I've seen a presentation on the subject that hasn't referred to data silos, usually with a predictable image. This antipattern reaches its zenith in applications where the only interface to the data is a dedicated 'snowflake' API (so named because every one is unique), severely limiting the potential for Web-style interconnection (links). Behind the scenes the application implementation may be highly distributed, but all the user or developer can see is a walled garden with a gatekeeper. That's a lot of buzzwords in one paragraph, so I'd better move towards the point.

How is an RDF triplestore any more open than a SQL-style database hooked up to the Web?
It might sound heretical, but it isn't, or at least isn't necessarily. The only advantage it has is that by default it uses URIs as identifiers for things (corresponding to the keys in a SQL store) which if designed properly will be dereferenceable over HTTP, i.e. they will be links which can be followed to find out more about the named resources. But SQL-backed Web applications can expose links that can be followed, and many do. (The same goes for NoSQL stores.) SPARQL is a query language that can be applied to a particular variety of graphs, but again in itself it isn't really any more webby than the triplestores it addresses. However there is the SPARQL Protocol for RDF (SPROT) http://www.w3.org/TR/rdf-sparql-protocol/ which allows things like an HTTP GET /sparql/?query=EncodedQuery and changes the whole ball game (you don't hear much mention of SPROT, I suppose because of the ugly name and a spec that's mostly WSDL stuff that everyone ignores).
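To make that GET concrete, here's a minimal sketch of building such a request URL in Python. The endpoint address and the sample query are made up for illustration; the only thing taken from the protocol is that the query text travels percent-encoded in a parameter named query.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint; substitute the address of a real SPARQL service.
ENDPOINT = "http://example.org/sparql/"

def sparql_get_url(endpoint, query):
    """Build the URL for a SPARQL query sent over plain HTTP GET."""
    # urlencode percent-encodes the query text for use in the URL.
    return endpoint + "?" + urlencode({"query": query})

query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"
url = sparql_get_url(ENDPOINT, query)
print(url)

# Decoding the URL again gives back the original query text.
assert parse_qs(urlsplit(url).query)["query"] == [query]
```

Anything that can issue a GET (a browser, curl, another server) can then ask the store a question, which is what makes the store a Web resource rather than a back-room database.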

Hopefully everyone's familiar with Chapter 5 of Fielding's dissertation - http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm - so to cut the waffle I'll cherry-pick one heading: 5.1.4 Cache. If we imagine the Web (of Data) as one huge interlinked information space, then individual stores such as those associated with specific applications can be considered as caches of small chunks of the Web of data. This is probably easiest to conceive by contrasting two different pieces of software. For one let's have a social net app that lets people discover other people with similar interests. It will store data around resources of the type foaf:Person with properties such as foaf:interest and to leverage the social angle foaf:knows. A traditional app for this kind of thing would involve people signing up and entering information about themselves. But quite justifiably a person might say "I don't want to enter loads of stuff into a form in application Y when I already entered it in application X yesterday" (yes, this is the old Data Portability thing). But pause there and for a second piece of software let's have a generic link-follower and data aggregator, i.e. a crawler or bot, or as they're known in FOAF circles, a scutter. It's not difficult to make such things directed, so they only follow specific link types of interest (check Slug http://ldodds.com/projects/slug/ - see also https://github.com/ldodds/slug). Let's make the storage system for this scutter a triplestore. Ok, set the scutter going on the Web at large with a plan to follow foaf:Person related links and slurp the data. Come back a few hours later, and you have an already-populated store to which you can plug in the social app, no need for people to sign up (in an ideal world, and ignoring privacy matters).
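For the avoidance of hand-waving, here's a rough sketch of what "directed" means. It is not how Slug does it; the WEB dictionary is a made-up stand-in for HTTP fetching plus RDF parsing, and all the URIs in it are invented. The point is only that the scutter keeps every triple it sees but follows just the link types it has been told to.

```python
from collections import deque

FOAF = "http://xmlns.com/foaf/0.1/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Toy stand-in for the Web: URL -> triples found at that URL (all invented).
WEB = {
    "http://example.org/fred": [
        ("http://example.org/fred", FOAF + "interest", "http://example.org/topics/rdf"),
        ("http://example.org/fred", FOAF + "knows", "http://example.net/wilma"),
        ("http://example.org/fred", RDFS + "seeAlso", "http://example.org/fred-photos"),
    ],
    "http://example.net/wilma": [
        ("http://example.net/wilma", FOAF + "interest", "http://example.org/topics/sparql"),
    ],
    "http://example.org/fred-photos": [
        ("http://example.org/fred-photos", RDFS + "label", "Holiday snaps"),
    ],
}

def scutter(seed, follow, fetch=WEB.get):
    """Breadth-first crawl from seed, following only predicates in `follow`."""
    store, seen, queue = set(), {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        for s, p, o in fetch(url) or []:
            store.add((s, p, o))               # keep every statement found
            if p in follow and o not in seen:  # but only chase chosen links
                seen.add(o)
                queue.append(o)
    return store

store = scutter("http://example.org/fred", follow={FOAF + "knows"})
print(len(store))
```

Here the crawl reaches wilma through foaf:knows but never fetches the photos document, because rdfs:seeAlso isn't in the follow set.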

Now the scutter plan for this (i.e. get people data) is pretty much isomorphic to a SPARQL query along the lines of:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

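CONSTRUCT { ?s ?p ?o } WHERE {
  ?s rdf:type foaf:Person .
  ?s ?p ?o .
  # Keep only the properties the app cares about. Separate interest and
  # knows patterns sharing ?o would require them to have the same value.
  FILTER (?p = rdf:type || ?p = foaf:interest || ?p = foaf:knows)
}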

This is exactly the kind of query you'd also want to be asking in the social net app. Going through the scutter, you're asking the Web at large, but because the data has already been aggregated in your store, it doesn't take a thousand GET requests to find relevant statements. But the statements are exactly the same. In other words, an RDF store is just a cache of a small chunk of the Web of Data.

For performance reasons this kind of cache would be selective in the data collected, so maybe strictly speaking the architecture is more like Uniform Pipe and Filter http://www.ics.uci.edu/~fielding/pubs/dissertation/net_arch_styles.htm#sec_3_2_2 with the uniformity essentially maintained by following the SPARQL and SPROT specs (and 5.1.6 Layered System is probably relevant too).

This kind of thing is entirely implementable today, in fact the Semantic Web Client Library http://www4.wiwiss.fu-berlin.de/bizer/ng4j/semwebclient/ can do SPARQL queries on the Web at large (SELECT at least, not sure if it supports CONSTRUCT).

There are other pieces of the Semantic Web toolkit that can be cleanly inserted into Web architecture (as one would hope, given that the Semantic Web is meant to be an extension of the existing Web). For example, a general-purpose WebID setup (FOAF+SSL http://esw.w3.org/WebID) could be inserted between client and server to handle authentication, acting as a proxy and/or gateway.

Somewhere recently (I think in a paper by danbri and others) I saw discussion about what was needed to get from a Web of Linked Data to a more fully Semantic Web. In other words, even if you score 5 stars at http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/ there might be more you can offer. I might have dreamt it, but I believe the discussion mentioned inference and reasoners. The thing is, we have lots of linked data already out there, and we already have pretty performant reasoners (e.g. http://clarkparsia.com/pellet/ ), but reasoning over Web-scale data is likely to remain a fantasy. That is, unless you imagine multiple reasoners acting as dedicated, fairly task-specific agents/services over their own manageable little batch of data. These again could be deployed as proxies. For example, another bit of FOAF jargon is smushing, which originally (when people were bnodes) meant the unification of data about a person based on the assumption that the person could be identified by means of their email address or homepage. Since it's more common now to use URIs to identify people (see http://dig.csail.mit.edu/breadcrumbs/node/71) I don't think it's unreasonable to extend the term to cover unification of multiple URIs for a person (typically with owl:sameAs links somewhere). Now going back to the triplestore of the app described above, that's only really interested in statements including the identified foaf:Person, foaf:interest and foaf:knows. There's nothing to stop this store treating a person as two individuals if data has been pulled from sites which use their own person ID schemes. But if somewhere else on the Web at large there was a triplestore with reasoning capability that could eat person IDs, foaf:mbox and foaf:homepage data and spit out owl:sameAs statements, this could be used to unify the descriptions for the application.
This triplestore could have a scutter as its input and a SPARQL endpoint to provide output, in other words being a uniform pipe kind of proxy.
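The smushing step itself needn't involve a heavyweight reasoner. As a sketch (plain tuples instead of a real triplestore, invented site URIs, and a simple union-find rather than anything OWL-aware), person URIs that share a foaf:mbox or foaf:homepage value get grouped, and each non-canonical URI gets an owl:sameAs link to its group's canonical one:

```python
FOAF = "http://xmlns.com/foaf/0.1/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
# Properties assumed to identify a person uniquely for smushing purposes.
IDENTIFYING = {FOAF + "mbox", FOAF + "homepage"}

def smush(triples):
    """Return owl:sameAs triples linking subjects that share an identifying value."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # (property, value) -> first subject seen with it
    for s, p, o in triples:
        if p in IDENTIFYING:
            other = first_seen.setdefault((p, o), s)
            ra, rb = find(s), find(other)
            if ra != rb:
                # The lexicographically smaller URI becomes the canonical one.
                parent[max(ra, rb)] = min(ra, rb)

    return sorted((s, OWL_SAME_AS, find(s)) for s in list(parent) if find(s) != s)

# Invented data: three sites, each with its own ID scheme for the same person.
triples = [
    ("http://site-a.example/people/42", FOAF + "mbox", "mailto:fred@example.org"),
    ("http://site-b.example/users/fred", FOAF + "mbox", "mailto:fred@example.org"),
    ("http://site-b.example/users/fred", FOAF + "homepage", "http://fred.example.org/"),
    ("http://site-c.example/id/f", FOAF + "homepage", "http://fred.example.org/"),
    ("http://site-c.example/id/w", FOAF + "mbox", "mailto:wilma@example.org"),
]
same_as = smush(triples)
for t in same_as:
    print(t)
```

Note that sites a and c never share a value directly; they're unified only because site b links the mailbox to the homepage, which is the transitive bit a per-pair comparison would miss.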

Ok, so effectively I'm arguing here that we already have all the bits from which we can glue together a Semantic Web that sits nicely with the Architecture of the World Wide Web http://www.w3.org/TR/webarch/ . But I do think there are at least two specific areas that need attention in the near future. One is in the increased use and optimization for named graphs, especially those of the order of only tens or hundreds of statements. I thought I had a good justification for this, but now my mind's gone blank, so just call it a gut feeling. The other thing is in description of datasets - there's already some stuff around annotations and provenance etc, but I'm thinking more in terms of discovery and agents/services being able to advertise themselves, so that a client that's looking for some particular kind of data can find them. The Vocabulary of Interlinked Datasets (voiD, http://vocab.deri.ie/void) is pretty good in this space, but I reckon we need to go a lot further, and have been mulling over a little quasi-protocol for matchmaking between datasets and agents. I'll post more on that once I've got something to talk about...

There is a teeny bit of low-cost, potentially invaluable data that it'd be nice to see more of. Let's say a directed scutter has crawled the Web and has aggregated all statements of the form <http://example.org/fred> foaf:interest ?x. While ideally it will be placing the triples it's found into named graphs corresponding to the provenance, a more likely coding scenario (because the queries will get silly with thousands of FROMs - hmm, does SPARQL NG do anything about that?) would be to dump everything into a default graph. But, while the full provenance may not be retained in this setup, it can still be made available to consumers of the data if statements of the form <http://example.org/fred> rdfs:seeAlso <http://wherever.com/source/somedataaboutfred> are added to the store. Call it future-proofing.
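In code this costs next to nothing. A sketch, again using a plain Python set as a pretend default graph (the helper name add_with_source is made up):

```python
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"

def add_with_source(store, triples, source_url):
    """Add triples to a flat store, plus an rdfs:seeAlso pointer from each
    subject back to the document the triples were fetched from."""
    for s, p, o in triples:
        store.add((s, p, o))
        store.add((s, RDFS_SEE_ALSO, source_url))
    return store

store = set()
add_with_source(
    store,
    [("http://example.org/fred",
      "http://xmlns.com/foaf/0.1/interest",
      "http://example.org/topics/rdf")],
    "http://wherever.com/source/somedataaboutfred",
)
print(sorted(store))
```

Because the store is a set, a subject with many statements from the same source still ends up with just one seeAlso pointer to it.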


danja
2010-12-14T11:08:17+01:00
architecture arch semweb rdf