Dave Beckett - Journalblog

Hacking the semantic linked data web

  • Recent posts

  • Follow me on twitter

Category: comment

Leaving Yahoo – Joining Digg

I’m heading to a new adventure at Digg in San Francisco to be a lead software engineer working on APIs and syndication.

I’ve been at Yahoo! nearly 5 years so it is both a happy and sad time for me, and I wish all the excellent people I worked with the best of luck in future.

Here is a summary of the main changes:

  • Silicon Valley -> San Francisco
  • 15,000 staff -> 100 staff
  • Architect -> Software engineer
  • strategizing, meeting -> coding
  • Powerpoint, OmniGraffle, twiki -> emacs, eclipse, …?
  • (No coding!) -> Python, Java, Hadoop, Cassandra, …?
  • Sunny days -> Foggy days
  • 15 min commute -> 2.5hr commute (until I move to SF)
  • Public company -> private company

Exciting!

Rasqal RDF Query Library 0.9.20

I just released a new version of my Rasqal RDF Query Library for two main new features:

  1. Support more of the new W3C SPARQL working drafts of 1 June 2010 for SPARQL 1.1 Query and SPARQL 1.1 Update.
  2. Support building with Raptor V2 API as well as Raptor V1 API..

The main change is to start to add to Rasqal’s APIs and query engine changes for the new SPARQL 1.1 working drafts. This release adds support the syntax for all the changes for Query and Update. The new draft syntax is available via the ‘laqrs’ query language name, until the SPARQL 1.1 syntax is finalized. The ‘sparql’ query language provides SPARQL 1.0 support.

On Query 1.1, the addition is primarily syntax and API support for the new syntax. There is expression execution for the new functions IF(), URI(), STRLANG(), STRDT(), BNODE(), IN() and NOT IN() which are noew usable as part of the normal expression grammar. The existing aggregate function support was extended to add the new SAMPLE() and GROUP_CONCAT() but remains syntax-only. Finally the new GROUP BY with HAVING conditions were added to the syntax and had consequent API updates but no query engine execution of them.

For Update 1.1 the full set of update operations syntax were added and they create API structures. Note, however there seem to be some ambiguities in the draft syntax especially around multiple optional tokens in a row near WITH which are particularly hard to implement in flex and bison (aka “lex and yacc”).

The main non-SPARQL 1.1 related change is to allow building Rasqal with Raptor V2 APIs rather than V1. Raptor V2 is in beta so this is not a final API and is thus not the default build, it has to be enabled with --enable-raptor2 with configure. When raptor V2 is stable (2.0.0), Rasqal will require it.

The changes to Rasqal in this release, in summary, are:

  • Updated to handle more of the new syntax defined by the SPARQL 1.1 Query and SPARQL 1.1 Update W3C working drafts of 1 June 2010
  • Added execution support for new SPARQL 1.1 query built-in expressions IF(), URI(), STRLANG(), STRDT(), BNODE(), IN() and NOT IN().
  • Added an ‘html’ query result table format from patch by Nicholas J Humfrey
  • Added API support for group by HAVING expressions.
  • Added XSD Date comparison support.
  • Support building with Raptor V2 API if configured with --with-raptor2.
  • Many other bug fixes and improvements were made.
  • Fixed Issues: #0000352, #0000353, #0000354, #0000360, #0000374, #0000377 and #0000378

See the Rasqal 0.9.20 Release Notes for the full details of the changes.

Get it at http://download.librdf.org/source/rasqal-0.9.20.tar.gz.

PS The source code control has also moved to GIT and hosted at GitHub.

Raptor RDF Syntax Library V2 beta 1

Today I released the first beta version of Raptor 2. This is the culmination of about 9 months work refactoring the Raptor 1 codebase. In hindsight, I should have done this years ago, but I knew it would be a lot of work, and it was.

The reasoning behind doing this is multi-fold, but basically the code had a lot of cruft and bad design choices that couldn’t be removed without breaking the APIs in lots of ways, and at some point it’s easier to just do it all at once, and that’s where we are now.

Cruft meant removing stuff deprecated for a long time but also renaming all the functions to follow the same “objects in C” style used throughout Redland’s libraries which has standard naming forms:

  • raptor_class_method()
  • Constructors: raptor_new_class() (core constructor or 1 arg constructor) and raptor_new_class_from_extras()
  • Copy constructor: raptor_class_copy()
  • Destructors: raptor_free_class()

The major addition was a raptor_world object that is used as a single object to hold on to all shared resources and configuration. This was a design pattern I put in librdf and Rasqal but for some reason, never considered it for raptor. This turned out to be a mistake since I had to then pass around a lot of parameters and configuration to individual object instances, more than was really needed. Examples of this include the error handling which added two parameters to several constructors. The error handling, now expanded to a general log mechanism after librdf’s handles multiple structured log record types and the logging policy is once-per-world.

The addition of the world object meant that each constructor for an object in raptor now takes that object, so it can get access to the shared configuration and resources. That itself meant the change was extensive, broad in scope. The single place to manage resources means it’s easier to ensure proper cleanup and deal with library-wide issues.

One other pain point was Raptor’s simplistic (but functional!) URI class. It manipulated URIs as plain old C strings (char*). I knew from building librdf, that this could be more efficient by interning the strings so a URI for a particular string is held only once, and reference counted. I used the already built raptor AVL-Tree to implement it, and as a bonus, moved that AVL Tree to the public API, so it can be reused (Rasqal has a copy of the implementation). The resulting reference-counted URIs mean that after URI construction, comparison and copying are very cheap – and given that this is RDF, those are done a lot. The old URI code also had a swappable implementation which added a lot of complexity to the code and that has gone now, since the new implementation is more sophisticated. There is probably more work that can be done here to make this URI work better, such as caching the URI structure so that it’s quicker to generate relative URIs. Also one day I should really validate that all the URIs built are legal to the syntax.

Another long term problem was the triple itself, which I had called ‘statement’ way back when I was creating it. Unfortunately a raptor_statement had hard-coded the RDF specifics – the subject can only be URI or blank node, predicate can only be a URI etc. That meant the code was twisty. That has been replaced by an array of 3 or 4 raptor terms (URI or blank node or literal) so it can handle both triples, quads and any possible extension beyond RDF (2004), although today none of the current parsers or serializers expect non-RDF statements. That change also made a lot of the internal code simpler to understand and quicker. The RDF terms were also introduced in a reference count manner, along with adding reference counting to the statements, it meant that passing triples around which used to involve a lot of copying, is now a simple integer increment of the reference. More speed!

That sorted out the fundamentals of statements, terms and URIs and changed pretty much every piece of code that touched them in all the parsers and serializers and core code.

There were a few pieces of new work added – two new serializers and one new parser. Two of those were written by Nicholas J Humfrey who is now a core committer.

I’d also like to call out thanks to Lauri Aalto for keeping raptor, rasqal and librdf relatively buildable while I was refactoring and breaking things. He wrote the code to make Rasqal and librdf build and work with raptor V1 and V2 at the same time.

Other work included updating all the reference documentation, tutorials, examples and sundry documentation for the new APIs including admin code to automate some of the documentation so it always included accurate details about formats.

There is lots more that changed in detail, listed in the Raptor 1.9.0 Release Notes, help on upgrading and there’s even a perl script docs/upgrade-script.pl thrown in (generated by another perl script!) that may help with applying the changes. The reference manual contains a full reference on changes between raptor 1.4.21 and 1.9.0 in the form of old / new mappings with explanations.

I know that Raptor 2 is not going to place Raptor 1 for applications for some time, so this is a separately installed library with a new location for the header file and a new shared library base. However, once this hits 2.0.0 it’ll be a dependency of Rasqal and librdf.

Summary of release:

  • Removed all deprecated functions and typedefs.
  • Renamed all functions to the standard raptor_class_method() form.
  • All constructors take a raptor_world argument.
  • URIs are interned and there is no longer a swappable implementation.
  • Statement is now an array of 3-4 RDF Terms to support triples and quads.
  • World object owns logging, blank node ID generation and describing syntaxes.
  • Features are now called options and have typed values.
  • GRDDL parser now saves and restores shared libxslt state.
  • Added serializers for HTML ‘html’ and N-Quads ‘nquads’.
  • Added parser ‘json’ for JSON-Resource centric and JSON-Triples.
  • Switched to GIT version control hosted by GitHub.
  • Added memory-based AVL-Tree to the public API.
  • Fixed reported issues:

    0000357, 0000361, 0000369, 0000370, 0000373 and 0000379

It turns out that after all that, the resulting libraries for raptor 2 are actually 4% smaller than raptor 1 when installed (Debian, i386):

 -rw-r--r-- 1 root root 379780 Mar 10 06:59 /usr/lib/libraptor.so
 -rw-r--r-- 1 root root 364448 Aug 16 17:30 /usr/lib/libraptor2.so

The gzipped tarball itself is as small as raptor 1.4.17 from 2008!

Get it at http://download.librdf.org/source/raptor2-1.9.0.tar.gz

PS The source code control has also moved to GIT and hosted at GitHub.

Happy 10th Birthday Redland

Redland‘s 10th year source code commit birthday is today 28th Jun at 9:05am PST – the first commit was Wed Jun 28 17:04:57 2000 UTC.

Happy 10th Birthday! Please celebrate with tea and cake.

(New releases soon, but it’s nice and sunny here in California, it’s very distracting.)

Command Line Semantic Web with Redland

I gave a ‘lightning’ talk (actually in about 15 mins) Command Line Semantic Web with Redland on 15th March 2010 at the Semantic Web Austin Meetup during SXSW at Texas Coworking, Austin, TX, USA. Today I recorded it as a screencast and put it online.

The embedded Vimeo version is below (best to view full screen to see all the text), but you can also get alternate hosted and downloadable versions (iPhone, 3GP, Full size) from my site.

Command Line Semantic Web With Redland from Dave Beckett on Vimeo.