Writing an RDF query engine. Twice

“You’ll end up writing a database” said Dan Brickley prophetically in early 2000. He was, of course, correct. What started as an RDF/XML parser and a BerkeleyDB-based triple store and API ended up as a much more complex system that I named Redland, with the librdf API. It does indeed have persistence, transactions (when using a relational database) and querying. However, RDF query is not quite the same thing as SQL, since the data model is schemaless and graph-centric, so when RDQL and later SPARQL came along, Redland gained a query engine component in 2003 named Rasqal: the RDF Syntax and Query Library for Redland. I still do not consider it a 1.0 library after over 7 years of work.

Query Engine The First

The first query engine was written to execute RDQL, which today looks like a relatively simple query language. There is one type of SELECT query, returning sequences of sets of variable bindings in a tabular result like SQL. The query is a fixed pattern and doesn’t allow any optional, union or conditional pattern matching. This was relatively easy to implement in what I’ve called a static execution model:

  1. Break the query up into a sequence of triple patterns: triples that can include variables in any position which will be found by matching against triples. A triple pattern returns a sequence of sets of variable bindings.
  2. Match each of the triple patterns in order, top to bottom, to bind the variables.
  3. If there is a query condition like ?var > 10 then check that it evaluates true.
  4. Return the result.
  5. Repeat at step #2.

The only state that needed saving was how far through the sequence of triple patterns execution had got – pretty much an integer – so that the looping could continue. When a particular triple pattern was exhausted it was reset, the previous one was incremented and execution continued.
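The static execution model above can be sketched as follows. This is a minimal Python illustration, not Rasqal’s actual C implementation; all the names are mine, and backtracking is expressed with recursion rather than the integer cursor described above.

```python
# Sketch of QE1-style static execution: a fixed list of triple patterns is
# matched in order, binding variables, with backtracking on exhaustion.

def match(pattern, triples, bindings):
    """Yield extended binding sets for one triple pattern against the data."""
    for triple in triples:
        new = dict(bindings)
        ok = True
        for p, t in zip(pattern, triple):
            if isinstance(p, str) and p.startswith("?"):   # a variable
                if p in new and new[p] != t:               # conflicting binding
                    ok = False
                    break
                new[p] = t
            elif p != t:                                   # a constant must match
                ok = False
                break
        if ok:
            yield new

def execute(patterns, triples, condition=lambda b: True):
    """Depth-first join of all patterns; lazily yields each result row."""
    def step(i, bindings):
        if i == len(patterns):          # all patterns matched: a result
            if condition(bindings):     # e.g. a ?var > 10 style check
                yield bindings
            return
        for b in match(patterns[i], triples, bindings):
            yield from step(i + 1, b)   # advance, backtrack on exhaustion
    yield from step(0, {})

data = [("alice", "knows", "bob"), ("bob", "knows", "carol")]
rows = list(execute([("?x", "knows", "?y"), ("?y", "knows", "?z")], data))
```

Because `execute()` is a generator, the caller pulls one result at a time, which is the lazy evaluation property described above.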

This worked well and executed all of RDQL without problems. In particular, it was a lazy execution model – it only did work when the application asked for an additional result. However, in 2004 RDF query standardisation started and the language grew.

Enter The Sparkle

The new standard RDF query language, named SPARQL, added much to the static patterns of the RDQL model. In particular it added OPTIONAL, which allowed optionally matching an inner set of triple patterns (a graph pattern) and binding more variables. This is useful when querying heterogeneous data, where some graphs contain useful extra bits of data and others do not.

This meant that the engine had to be able to match multiple graph patterns – the outer one and any inner optional graph pattern – as well as reset the execution of graph patterns when optionals were retried. Optionals could also be nested to an arbitrary depth.

This combination meant that the state that had to be preserved to get the next result became a lot more complex than an integer. Query engine #1 was updated to handle one level of nesting: a combination of an outer fixed graph pattern plus one optional graph pattern. This mostly worked, but it was clear that a fixed state model for the entire query was not going to work as queries became more complex and dynamic. So query engine #1 could not handle the full SPARQL OPTIONAL model and would never implement UNION, which required more state tracking.

This meant that Query Engine #1 (QE1) needed replacing.

Query Engine The Second

The first step was a lot of refactoring. In QE1 there was a lot of shared state that needed pulling apart: the query itself (graph patterns, conditions, the result of the parse tree), the engine that executed it, and the query result (a sequence of rows of variable bindings). Separating these meant the query engine could be changed independently of the query or its results.

Rasqal 0.9.15 at the end of 2007 was the first release with the start of the refactoring. During the work for that release it also became clear that an API and ABI break was necessary as well to introduce a Rasqal world object, to enable proper resource tracking – a lesson hard learnt. This was introduced in 0.9.16.

There were plenty of other changes going on in Rasqal outside the query engine model, such as supporting reading and writing result formats, providing result ordering and distincting, completing the value expression and datatype handling, and general resilience fixes.

The goals of the refactoring were to produce a new query engine that could execute a more dynamic query, be broken into understandable components even for complex queries, be testable in small pieces, and continue to execute all the queries that QE1 could. It should also remain a lazy-evaluation model in which the user could request a single result and the engine would do the minimum work needed to return it.

Row Sources and SPARQL

The new query engine was designed around a new concept: a row source. This is an active object that, on request, returns a row of variable bindings – it generates what corresponds to a row in a SQL result. This active object is the key to implementing lazy evaluation. At the top level of the query execution there is basically one call to top_row_source.getRow(), which itself calls the inner rowsources’ getRow() in order to execute the query and return the next result.

Each rowsource corresponds approximately to a SPARQL algebra concept, and since the algebra has a well-defined way to turn a query structure into an executable structure, or query plan, the query engine’s main role in preparing the query was to become a SPARQL query algebra implementation. The algebra concepts were added to Rasqal, enabling the hierarchical graph pattern structure to be turned into algebra concepts and the optimization and algebra transformations in the specification to be performed. These transformations were tested and validated against the examples in the specification. The resulting tree of “top down” algebra structures was then used to build the “bottom up” rowsource tree.

The rowsource concept also allowed breaking up the complete query engine execution into understandable and testable chunks. The rowsources implemented at this time include:

  • Assignment: binds a new variable from an input rowsource
  • Distinct: applies distinctness to an input rowsource
  • Empty: returns no rows; used in legitimate queries as well as in transformations
  • Filter: evaluates an expression for each row in an input rowsource and passes on those that evaluate true
  • Graph: matches against a graph URI and/or binds a graph variable
  • Join: (left-)joins two inner rowsources; used for OPTIONAL
  • Project: projects a subset of the input row variables to the output row
  • Row Sequence: generates a rowsource from a static set of rows
  • Sort: sorts an input rowsource by a list of order expressions
  • Triples: matches a triple pattern against a graph and generates a row; this is the fundamental triple pattern or Basic Graph Pattern (BGP) in SPARQL terms
  • Union: returns results from the two input rowsources, in order
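The rowsource idea can be sketched in a few lines. This is an illustrative Python sketch, not Rasqal’s C API; the class and method names are mine. Each rowsource pulls rows lazily from an inner rowsource on demand, so driving the top rowsource’s get_row() executes the whole pipeline one result at a time.

```python
# Sketch of a rowsource pipeline: each object returns one row of variable
# bindings per get_row() call, pulling from its inner rowsource as needed.

class RowSequence:
    """Leaf rowsource over a static list of rows (variable-binding dicts)."""
    def __init__(self, rows):
        self._iter = iter(rows)
    def get_row(self):
        return next(self._iter, None)          # None signals exhaustion

class Filter:
    """Pass on only the rows for which the expression evaluates true."""
    def __init__(self, inner, expr):
        self.inner, self.expr = inner, expr
    def get_row(self):
        while (row := self.inner.get_row()) is not None:
            if self.expr(row):
                return row
        return None

class Project:
    """Project a subset of the input row's variables to the output row."""
    def __init__(self, inner, variables):
        self.inner, self.variables = inner, variables
    def get_row(self):
        row = self.inner.get_row()
        if row is None:
            return None
        return {v: row[v] for v in self.variables}

# Build the pipeline bottom-up, then drive it with top-level get_row() calls.
source = RowSequence([{"?n": 1, "?s": "a"}, {"?n": 12, "?s": "b"}])
top = Project(Filter(source, lambda r: r["?n"] > 10), ["?s"])

results = []
while (row := top.get_row()) is not None:
    results.append(row)
```

Each class here is small and testable in isolation, which is exactly the maintenance property described above.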

The QE1 entry point was refactored to look like getRow() and the two query engines were tested against each other. In the end QE2’s results were identical, and eventually QE2 was improved such that it passed more DAWG SPARQL tests than QE1.

So in summary QE2 works like this:

  • Parse the query string into a hierarchy of graph patterns such as basic, optional, graph, group, union, filter etc. (This is done in rasqal_query_prepare())
  • Create a SPARQL algebra expression from the graph pattern tree that describes how to evaluate the query. (This is in rasqal_query_execute() calling QE2 )
  • Invert the algebra expression to a hierarchy of rowsources where the top rowsource getRow() call will evaluate the entire query (Ditto)

(If you want to see some of the internals on a real query, run roqet -d debug query.rq with roqet built in maintainer mode, and both the query structure and the algebra version will be generated.)

The big advantage from a maintenance point of view is that it is divided into small understandable components that can be easily added to.

The result was released in Rasqal 0.9.17 at the end of 2009, 15 months after the previous release. It’s tempting to say nobody noticed the new query engine, except that it did more work. There is no way to use the old query engine except via a configure argument when building it. The QE1 code is never called and should be removed from the sources.

Example execution

When QE2 is called by the command line utility roqet, there is a lot going on inside Rasqal and Raptor. A simplified version of what goes on when a query like this is run:

$ cat example.rq
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?website
FROM   <http://planetrdf.com/bloggers.rdf>
WHERE { ?person foaf:weblog ?website ;
                foaf:name ?name .
        ?website a foaf:Document
      }
ORDER BY ?name

$ roqet example.rq 
roqet: Querying from file example.rq
roqet: Query has a variable bindings result
result: [name=string("AKSW Group - University of Leipzig"), website=uri<http://blog.aksw.org/>]
...

is described in the following picture, if you follow the numbers in order:

This doesn’t include details of content negotiation, base URIs, result formatting, or the internals of the query execution described above.

SPARQL 1.1

Now it is the end of 2010 and SPARQL 1.1 work is underway to update the original SPARQL Query language, which was completed in January 2008. It is a substantial new version that adds greatly to the language. The SPARQL 1.1 draft of 2010-10-14 adds (these items may or may not be in the final version):

  • Assignment with BIND(expr AS ?var)
  • Aggregate expressions such as SUM() and COUNT(), including grouping and group filtering with HAVING
  • Negation between graph patterns using MINUS
  • Property path triple matching
  • Computed select expressions: SELECT ... (expr AS ?var)
  • Federated queries using SERVICE to make SPARQL HTTP query requests
  • Sub-SELECT and BINDINGS to allow queries/results inside queries
  • Updates allowing insertion, deletion and modification of graphs via queries, as well as other graph and dataset management

The above is my reading of the major items in the latest draft of the SPARQL 1.1 query language and its dependent required specifications.

Rasqal next steps

So does SPARQL 1.1 mean Rasqal Query Engine 3? Not yet, although the Rasqal API is still changing too much to call it stable, and another API/ABI break is possible. There’s also the question of making an optimizing query engine, a more substantial activity. At this time I’m not motivated to implement property paths, since it seems like a lot of work and there are other pieces I want to do first. Rasqal in GIT handles most of the syntax and is working towards implementing most of the execution of aggregate expressions, sub-selects and SERVICE, although there are no dates yet. I work on Rasqal in my spare time when I feel like it, so maybe it won’t be mature with a stable API (which would be a 1.0) until SPARQL 2 rolls by.

Redland librdf 1.0.11 released

I have just released Redland librdf library version 1.0.11 which has been in progress for some time, delayed by the large amount of work to get out Raptor V2 as well as initial SPARQL 1.1 draft work for Rasqal 0.9.20.

See the Redland librdf 1.0.11 Release Notes for the full details of the main features and changes in this release.

Note that the Redland language bindings 1.0.10.1 work fine with Redland librdf 1.0.11, but the bindings will soon have a release to match.

Leaving Yahoo – Joining Digg

I’m heading to a new adventure at Digg in San Francisco to be a lead software engineer working on APIs and syndication.

I’ve been at Yahoo! nearly 5 years so it is both a happy and sad time for me, and I wish all the excellent people I worked with the best of luck in future.

Here is a summary of the main changes:

  • Silicon Valley -> San Francisco
  • 15,000 staff -> 100 staff
  • Architect -> Software engineer
  • strategizing, meeting -> coding
  • Powerpoint, OmniGraffle, twiki -> emacs, eclipse, …?
  • (No coding!) -> Python, Java, Hadoop, Cassandra, …?
  • Sunny days -> Foggy days
  • 15 min commute -> 2.5hr commute (until I move to SF)
  • Public company -> private company

Exciting!

Rasqal RDF Query Library 0.9.20

I just released a new version of my Rasqal RDF Query Library for two main new features:

  1. Support more of the new W3C SPARQL working drafts of 1 June 2010 for SPARQL 1.1 Query and SPARQL 1.1 Update.
  2. Support building with the Raptor V2 API as well as the Raptor V1 API.

The main change is the start of adding to Rasqal’s APIs and query engine the changes needed for the new SPARQL 1.1 working drafts. This release adds support for the syntax of all the changes for Query and Update. The new draft syntax is available via the ‘laqrs’ query language name until the SPARQL 1.1 syntax is finalized; the ‘sparql’ query language provides SPARQL 1.0 support.

For Query 1.1, the addition is primarily syntax and API support for the new syntax. There is expression execution for the new functions IF(), URI(), STRLANG(), STRDT(), BNODE(), IN() and NOT IN(), which are now usable as part of the normal expression grammar. The existing aggregate function support was extended to add the new SAMPLE() and GROUP_CONCAT(), but remains syntax-only. Finally, the new GROUP BY with HAVING conditions was added to the syntax, with consequent API updates but no query engine execution.

For Update 1.1, the syntax for the full set of update operations was added, and it creates API structures. Note, however, that there seem to be some ambiguities in the draft syntax, especially around multiple optional tokens in a row near WITH, which are particularly hard to implement in flex and bison (aka “lex and yacc”).

The main non-SPARQL 1.1 change is to allow building Rasqal with the Raptor V2 APIs rather than V1. Raptor V2 is in beta, so this is not a final API and is thus not the default build; it has to be enabled with --enable-raptor2 when running configure. When Raptor V2 is stable (2.0.0), Rasqal will require it.

The changes to Rasqal in this release, in summary, are:

  • Updated to handle more of the new syntax defined by the SPARQL 1.1 Query and SPARQL 1.1 Update W3C working drafts of 1 June 2010
  • Added execution support for new SPARQL 1.1 query built-in expressions IF(), URI(), STRLANG(), STRDT(), BNODE(), IN() and NOT IN().
  • Added an ‘html’ query result table format from patch by Nicholas J Humfrey
  • Added API support for group by HAVING expressions.
  • Added XSD Date comparison support.
  • Support building with Raptor V2 API if configured with --with-raptor2.
  • Many other bug fixes and improvements were made.
  • Fixed Issues: #0000352, #0000353, #0000354, #0000360, #0000374, #0000377 and #0000378

See the Rasqal 0.9.20 Release Notes for the full details of the changes.

Get it at http://download.librdf.org/source/rasqal-0.9.20.tar.gz.

PS The source code control has also moved to GIT and is hosted at GitHub.

Raptor RDF Syntax Library V2 beta 1

Today I released the first beta version of Raptor 2. This is the culmination of about 9 months work refactoring the Raptor 1 codebase. In hindsight, I should have done this years ago, but I knew it would be a lot of work, and it was.

There were several reasons for doing this, but basically the code had a lot of cruft and bad design choices that couldn’t be removed without breaking the APIs in lots of ways, and at some point it’s easier to just do it all at once. That’s where we are now.

Removing cruft meant removing things deprecated for a long time, but also renaming all the functions to follow the same “objects in C” style used throughout Redland’s libraries, which has standard naming forms:

  • raptor_class_method()
  • Constructors: raptor_new_class() (core constructor or 1 arg constructor) and raptor_new_class_from_extras()
  • Copy constructor: raptor_class_copy()
  • Destructors: raptor_free_class()

The major addition was a raptor_world object that is used as a single object to hold all shared resources and configuration. This was a design pattern I put in librdf and Rasqal but, for some reason, never considered for Raptor. This turned out to be a mistake, since I then had to pass around a lot of parameters and configuration to individual object instances – more than was really needed. One example is the error handling, which added two parameters to several constructors. The error handling has now expanded into a general log mechanism, modelled after librdf’s, which handles multiple structured log record types; the logging policy is set once per world.

The addition of the world object meant that each constructor for an object in Raptor now takes that object, so it can access the shared configuration and resources. That in itself made the change extensive and broad in scope. Having a single place to manage resources makes it easier to ensure proper cleanup and to deal with library-wide issues.

One other pain point was Raptor’s simplistic (but functional!) URI class. It manipulated URIs as plain old C strings (char*). I knew from building librdf that this could be made more efficient by interning the strings, so that the URI for a particular string is held only once and reference counted. I used the already-built Raptor AVL tree to implement it and, as a bonus, moved that AVL tree to the public API so it can be reused (Rasqal has a copy of the implementation). The resulting reference-counted URIs mean that after URI construction, comparison and copying are very cheap – and given that this is RDF, those are done a lot. The old URI code also had a swappable implementation which added a lot of complexity to the code; that has gone now, since the new implementation is more sophisticated. There is probably more work that can be done here to make the URI support better, such as caching the URI structure so that it’s quicker to generate relative URIs. Also, one day I should really validate that all the URIs built are legal according to the syntax.
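The interning idea can be shown in miniature. This is an illustrative Python sketch using a dictionary rather than Raptor’s AVL tree, and the class and method names are mine, not Raptor’s C API:

```python
# Sketch of interned, reference-counted URIs: each distinct URI string is
# stored once, so copying is a counter bump and comparison is an identity test.

class URI:
    _interned = {}                     # string -> URI instance (the intern table)

    def __new__(cls, string):
        uri = cls._interned.get(string)
        if uri is None:                # first use of this string: create + intern
            uri = super().__new__(cls)
            uri.string, uri.refcount = string, 0
            cls._interned[string] = uri
        uri.refcount += 1              # "copying" is just a reference count bump
        return uri

    def release(self):                 # "destructor": drop from table on last use
        self.refcount -= 1
        if self.refcount == 0:
            del URI._interned[self.string]

a = URI("http://example.org/")
b = URI("http://example.org/")         # same interned object, refcount now 2
same = a is b                          # equality is pointer identity, no strcmp
```

The payoff is exactly the one described above: constructing a URI does the work once, after which comparison and copying are nearly free.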

Another long-term problem was the triple itself, which I had called a ‘statement’ way back when I was creating it. Unfortunately raptor_statement hard-coded the RDF specifics – the subject can only be a URI or a blank node, the predicate can only be a URI, and so on. That meant the code was twisty. It has been replaced by an array of 3 or 4 Raptor terms (URI, blank node or literal), so it can handle triples, quads and any possible extension beyond RDF (2004), although today none of the current parsers or serializers expect non-RDF statements. That change also made a lot of the internal code simpler to understand and quicker. The RDF terms were also made reference counted and, along with adding reference counting to the statements, this means that passing triples around – which used to involve a lot of copying – is now a simple increment of a reference count. More speed!
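The shape of that change can be sketched briefly. This is an illustrative Python sketch of the design, not Raptor’s C structs; the names are invented:

```python
# Sketch of a statement as an array of 3 or 4 reference-counted terms,
# instead of hard-coded subject/predicate/object fields.

class Term:
    """A URI, blank node or literal term, shared via a reference count."""
    def __init__(self, kind, value):   # kind: "uri", "blank" or "literal"
        self.kind, self.value, self.refcount = kind, value, 1
    def copy(self):
        self.refcount += 1             # copying a term is just a counter bump
        return self

class Statement:
    """3 terms make a triple, 4 make a quad; the code need not care which."""
    def __init__(self, *terms):
        assert len(terms) in (3, 4)
        self.terms = [t.copy() for t in terms]

s = Statement(Term("uri", "ex:alice"),
              Term("uri", "ex:knows"),
              Term("literal", "Bob"))
is_triple = len(s.terms) == 3
```

Because a statement is just an array of terms, code that walks or serializes statements can loop over the terms uniformly rather than special-casing each position.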

That sorted out the fundamentals of statements, terms and URIs and changed pretty much every piece of code that touched them in all the parsers and serializers and core code.

There were a few pieces of new work added – two new serializers and one new parser. Two of those were written by Nicholas J Humfrey who is now a core committer.

I’d also like to call out thanks to Lauri Aalto for keeping raptor, rasqal and librdf relatively buildable while I was refactoring and breaking things. He wrote the code to make Rasqal and librdf build and work with raptor V1 and V2 at the same time.

Other work included updating all the reference documentation, tutorials, examples and sundry documentation for the new APIs, including admin code to automate some of the documentation so it always includes accurate details about formats.

There is lots more that changed in detail, listed in the Raptor 1.9.0 Release Notes; there is help on upgrading and even a perl script docs/upgrade-script.pl thrown in (generated by another perl script!) that may help with applying the changes. The reference manual contains a full reference to the changes between Raptor 1.4.21 and 1.9.0 in the form of old/new mappings with explanations.

I know that Raptor 2 is not going to replace Raptor 1 in applications for some time, so this is a separately installed library with a new location for the header file and a new shared library base name. However, once this hits 2.0.0 it will be a dependency of Rasqal and librdf.

Summary of release:

  • Removed all deprecated functions and typedefs.
  • Renamed all functions to the standard raptor_class_method() form.
  • All constructors take a raptor_world argument.
  • URIs are interned and there is no longer a swappable implementation.
  • Statement is now an array of 3-4 RDF Terms to support triples and quads.
  • World object owns logging, blank node ID generation and describing syntaxes.
  • Features are now called options and have typed values.
  • GRDDL parser now saves and restores shared libxslt state.
  • Added serializers for HTML ‘html’ and N-Quads ‘nquads’.
  • Added parser ‘json’ for JSON-Resource centric and JSON-Triples.
  • Switched to GIT version control hosted by GitHub.
  • Added memory-based AVL-Tree to the public API.
  • Fixed reported issues: 0000357, 0000361, 0000369, 0000370, 0000373 and 0000379

It turns out that after all that, the resulting libraries for raptor 2 are actually 4% smaller than raptor 1 when installed (Debian, i386):

 -rw-r--r-- 1 root root 379780 Mar 10 06:59 /usr/lib/libraptor.so
 -rw-r--r-- 1 root root 364448 Aug 16 17:30 /usr/lib/libraptor2.so

The gzipped tarball itself is as small as raptor 1.4.17 from 2008!

Get it at http://download.librdf.org/source/raptor2-1.9.0.tar.gz

PS The source code control has also moved to GIT and is hosted at GitHub.