Today I added a sitemap to this blog. Some notes-to-self.
Not sure what inspired me to do this, but for a while I have been wanting a complete list of blog post URIs, to play around with augmenting the data (e.g. pulling out the links contained within posts and grabbing more info about them).
Blog engine general setup
HTTP requests first go through Apache: if a file on the filesystem matches the request, that file is returned. If not, the request gets transparently forwarded to an instance of Gradino running on port 8080. There the request is dispatched through JAX-RS to the appropriate handler in the code (most of the code is in Scala, but it uses various Java libs). All the blog data is stored in a Jena TDB triplestore. When a request is made, a SPARQL query is run programmatically against the store, and the results are formatted as appropriate using a little crude templating (example). Results for the front page and feed are both cached as in-memory strings.
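For my own reference, the "file on disk wins, otherwise hand over to the app" decision can be sketched in a few lines. This is only an illustration in Python, not the actual Apache configuration or anything from Gradino; the function name `route` and the `doc_root` parameter are made up for the sketch.

```python
from pathlib import Path


def route(request_path: str, doc_root: Path) -> str:
    """Return "static" if a file under doc_root matches the request, else "proxy".

    Hypothetical helper: it only mimics the decision Apache makes in front
    of the blog engine, it does not serve or forward anything.
    """
    root = doc_root.resolve()
    candidate = (root / request_path.lstrip("/")).resolve()
    # Only accept real files that live inside the document root.
    if root in candidate.parents and candidate.is_file():
        return "static"
    # Everything else would be forwarded to the app (port 8080 in my setup).
    return "proxy"
```

So a request for a generated file is answered straight from disk, while a post URI with no matching file falls through to the blog engine.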
Adding sitemap generator
So for the sitemap, on the first pass I set things up in the same fashion as the front page and feed are generated, just without a LIMIT on the SPARQL query. This wound up making Apache give a proxy error. I'm not sure exactly why (for some reason error messages didn't show), but it seemed reasonable to assume that it was somehow related to the quantity of results, maybe a silent timeout. I've got archives in the store going back years, and my current query (excluding everything with "comment" in the URI) produces just over 5,500 results.
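To make the difference between the two queries concrete, here is a rough sketch of building the SPARQL string with and without a LIMIT. It is Python rather than the Scala the blog actually runs, and the function name `build_post_query` and the use of `dc:date` to find posts are assumptions for illustration; the real query uses whatever vocabulary is in the store.

```python
from typing import Optional


def build_post_query(limit: Optional[int] = None) -> str:
    """Build a SPARQL SELECT for post URIs, skipping anything with "comment" in the URI.

    Hypothetical: dc:date stands in for whatever property marks a post.
    """
    query = (
        "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n"
        "SELECT ?post ?date\n"
        "WHERE {\n"
        "  ?post dc:date ?date .\n"
        '  FILTER (!regex(str(?post), "comment"))\n'
        "}\n"
        "ORDER BY DESC(?date)\n"
    )
    # Front page and feed would pass a limit; the sitemap asks for everything.
    if limit is not None:
        query += f"LIMIT {limit}\n"
    return query
```

Calling `build_post_query(10)` gives the bounded front-page style of query, while `build_post_query()` is the unbounded sitemap version that returns the whole archive in one go.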
So then I decided to modify things to generate a static file when a POST was received at a particular URL. I should have seen this coming, but my initial attempt at this also gave a proxy error. D'oh! Performance-wise it was effectively the same routine running in the same thread, so the request still didn't return until generation had finished.
But I was able to get it working by making the sitemap generator class a Scala Actor. When the appropriate POST is received, the handler creates a new instance of the Actor and sends it a message, but then continues along the original thread, returning an "ok" message to the browser.
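The shape of that fix, sketched in Python rather than with Scala Actors: the handler starts a worker with its own mailbox, drops a message in, and returns straight away. None of this is the real code; `SitemapWorker`, `handle_post` and the `"generate"` message are names invented for the sketch, and a thread plus a queue is only a stand-in for an Actor.

```python
import queue
import threading


class SitemapWorker(threading.Thread):
    """Actor-like worker: one mailbox, messages handled on its own thread."""

    def __init__(self, generate):
        super().__init__(daemon=True)
        self.mailbox = queue.Queue()
        self.generate = generate          # the slow sitemap-building routine
        self.done = threading.Event()     # set once the message has been handled

    def run(self):
        message = self.mailbox.get()      # blocks until a message arrives
        if message == "generate":
            self.generate()
        self.done.set()


def handle_post(generate):
    """Hypothetical POST handler: kick off generation, reply immediately."""
    worker = SitemapWorker(generate)
    worker.start()
    worker.mailbox.put("generate")        # fire and forget
    return "ok", worker                   # the response does not wait for the worker
```

The point is just that the "ok" goes back to the browser while the slow job carries on elsewhere, so the proxy never sits waiting on it.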
Along the way I refined what I reckoned was most suitable to put in the sitemap file. The blog front page just uses the core sitemap terms, and its entry is hard-coded:
<url>
  <loc>http://dannyayers.com/</loc>
  <changefreq>daily</changefreq>
  <priority>0.9</priority>
</url>
Initially I had individual posts using the News sitemap terms, until I noticed just now that they are only for things that change a lot... So instead they just look like this:
<url>
  <loc>http://dannyayers.com/2003/05/12/bufo-bufo/</loc>
  <lastmod>1970-01-01T01:00:00Z</lastmod>
  <changefreq>monthly</changefreq>
</url>
I've left it as monthly in case I want to change any of the template of the individually rendered pages, but reindexing isn't really a priority once the content text has been looked at.
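Entries like the two above are easy to produce from a list of results. A minimal sketch using Python's standard `xml.etree.ElementTree`, not my actual generator (which is Scala with crude templating); the function name `build_sitemap` and the dict-per-entry input are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Serialise a list of dicts (loc, lastmod, changefreq, priority) as a sitemap.

    Hypothetical helper: keys missing from an entry are simply left out.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # Keep the element order used by the sitemap schema.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = entry[tag]
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    {"loc": "http://dannyayers.com/", "changefreq": "daily", "priority": "0.9"},
    {"loc": "http://dannyayers.com/2003/05/12/bufo-bufo/",
     "lastmod": "1970-01-01T01:00:00Z", "changefreq": "monthly"},
])
```

Going through an XML library rather than string templates also takes care of escaping, which is the sort of little error that otherwise only turns up once a crawler reads the file.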
Next I guess I should look at Semantic Sitemaps.
I'm typing this as yet another version of the generator code is running. I've kept making little errors that only show up when I point Google at the sitemap file... But if you're reading this then Google is happy with the current version :)