Firefox browsing history fun

The only distinct thing I had to do today was "This Week's Semantic Web"; after that it was the fairly open-ended task of getting tutorial material together for the Talis Platform. I lost most of the morning updating Eclipse on the laptops. This afternoon, while going through my sources, I ran across a post from Sean B. Palmer, which led me down a right old rabbit hole.

It turns out that the history of visited pages in Firefox is kept in a text file, using the Mork format. This format is rather painful. Sean pointed to a Perl parser (which contains amusing comments, btw), and Phil Wilson had done a bookmarklet too, but that doesn't work with recent FF, so to me both of these paths sounded like hard work.

But then I found a Python script, demork.py, that could convert the stuff into XML (dunno how sbp missed that). So I tweaked that to produce RDF/XML: mork2rdf.py. Here's a sample of the result: history.rdf. [PS. now I realise there should be some separation between the item and the person & their browsing history - next time...] You may notice I cleared my history earlier - it was a bit too big for experiments, and probably too embarrassing for publication. (Note that the script is now only for history.dat, not for any other Mork files - apparently Thunderbird uses the format for contacts, but I couldn't find an example locally, so I'll save that for another day.)
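For illustration only, here is a minimal sketch of the kind of output step such a conversion needs: turning already-parsed history entries into RDF/XML with each page as an rss:item. This is not the actual mork2rdf.py. The function name history_to_rdfxml, the (url, title, visited) tuple shape, and the choice of rss:title, rss:link and dc:date as properties are all assumptions made for the example; only the use of rss:item comes from the post itself.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_NS = "http://purl.org/rss/1.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def history_to_rdfxml(entries):
    """Serialise (url, title, visited) tuples as rss:item resources in RDF/XML.

    Hypothetical helper: the property choices are illustrative assumptions.
    """
    # Ask ElementTree to use readable prefixes instead of ns0, ns1, ...
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("rss", RSS_NS)
    ET.register_namespace("dc", DC_NS)

    root = ET.Element("{%s}RDF" % RDF_NS)
    for url, title, visited in entries:
        # Each visited page becomes one rss:item identified by its URL
        item = ET.SubElement(
            root, "{%s}item" % RSS_NS, {"{%s}about" % RDF_NS: url}
        )
        ET.SubElement(item, "{%s}title" % RSS_NS).text = title
        ET.SubElement(item, "{%s}link" % RSS_NS).text = url
        ET.SubElement(item, "{%s}date" % DC_NS).text = visited
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = [("http://example.org/", "Example page", "2007-09-10T12:00:00Z")]
    print(history_to_rdfxml(sample))
```

Building the tree with ElementTree rather than string concatenation means titles containing characters such as & or < are escaped automatically.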

Yesterday, when I intended doing a bit more Erlang, I wound up instead playing with Python on the Platform, and got as far as posting data up to a store. So I couldn't resist trying the history RDF with that (the posting Python is just a first pass - after sleeping on it I decided the structural approach wasn't very good, but still the relevant method was easy enough to get at). I'd already done:

python mork2rdf.py history.dat > history.rdf

So after a bit of fiddling in Python I then set the auth credentials for my store (the Platform uses HTTP Digest) and did:

from talis_platform import SimpleTalisGraph

# Read the RDF/XML produced by mork2rdf.py
file = open("history.rdf", "rt")
data = file.read()
file.close()

# The TALIS_* values are the store URI and Digest credentials set beforehand
graph = SimpleTalisGraph()
graph.set_credentials(TALIS_STORE_URI, TALIS_USERNAME, TALIS_PASSWORD, TALIS_REALM)
print graph.postWithDigestAuth(data)

which responded:

URI : http://api.talis.com/stores/danja-dev1/meta
HTTP Error 204: No Content

Which was what I was after (urllib2 thinks anything other than a 200 is an error - go figure). Now I can run SPARQL queries over the stuff online from its endpoint. The data is now automagically merged with the beginnings of my Personal Graph, though I haven't got enough in there yet to make it particularly interesting/useful. But there's another handy RESTful interface which gives you RSS 1.0 vocab data, and as it happens I'd modelled the history resources as using rss:item. So I went to the search interface, entered "Python" and got the related material as feed data. Voila: ADD by subscription.
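As a rough sketch of what a Digest-authenticated POST of this kind can look like using only the standard library, and of treating any 2xx status as success instead of just 200: the code below uses the modern urllib.request module (the successor to urllib2), and it is not the talis_platform code. The function names build_digest_opener and post_rdf are hypothetical, and sending the data as application/rdf+xml is an assumption about what the store expects. Current versions of urllib.request already accept every 2xx response, so the except branch mainly documents the intent for older stacks that raise on a 204.

```python
import urllib.error
import urllib.request


def build_digest_opener(store_uri, username, password, realm):
    """Return an opener that answers HTTP Digest challenges for store_uri."""
    password_mgr = urllib.request.HTTPPasswordMgr()
    password_mgr.add_password(realm, store_uri, username, password)
    handler = urllib.request.HTTPDigestAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)


def post_rdf(opener, uri, rdf_xml):
    """POST RDF/XML to uri and return the HTTP status code.

    Any 2xx status (including 204 No Content) counts as success;
    other HTTP errors are re-raised unchanged.
    """
    request = urllib.request.Request(
        uri,
        data=rdf_xml.encode("utf-8"),  # supplying data makes this a POST
        headers={"Content-Type": "application/rdf+xml"},  # assumed media type
    )
    try:
        response = opener.open(request)
        return response.getcode()
    except urllib.error.HTTPError as err:
        if 200 <= err.code < 300:
            return err.code
        raise
```

With that in place, a call along the lines of post_rdf(build_digest_opener(uri, user, password, realm), uri, data) would return 204 for a response like the one above rather than reporting it as an error.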

Although it means "This Week's SemWeb" will be late again, I reckon I'll file all this as structured procrastination.

Danny Ayers
2007-09-10T21:08:37+02:00
