• Recent posts

    • Open to Rackspace
    • Undugg
    • Releases = Tweets
    • Rasqal 0.9.21 and SPARQL 1.1 Query aggregation
    • End of life of Raptor V1. End of support for Raptor V1 in Rasqal and librdf
  • Follow me on twitter

Open to Rackspace

I’m happy to announce that today I started work at Rackspace in San Francisco in a senior engineering role. I am excited anticipating these aspects:

  1. The company: a fun, fast moving organization with a culture of innovation and openness
  2. The people: lots of smart engineers and devops to work with
  3. The technologies: Openstack cloud, Hadoop big data and lots more
  4. The place: San Francisco technology world and nearby Silicon Valley

My personal technology interests and coding projects such as Planet RDF, the Redland librdf libraries and Flickcurl will continue in my own time.

Undugg

Digg just announced that Digg Engineering Team Joins SocialCode and The Washington Post reported SocialCode hires 15 employees from Digg.com

This acquihire does NOT include me. I will be changing jobs shortly but have nothing further to announce at this time.

I wish my former Digg colleagues the best of luck in their new roles. I had a great time at Digg and learned a lot about working in a small company, social news, analytics, public APIs and the technology stack there.

Releases = Tweets

I got tired of posting release announcements to my blog so I just emailed the announcements to the redland-dev list, tweeted a link to it from @dajobe and announced it on Freshmeat which a lot of places still pick up..

Here are the tweets for the 13 releases I didn’t blog since the start of 2011:

  • 3 Jan: Released Raptor RDF syntax library 2.0.0 at http://librdf.org/raptor/ only 10 years in the making :)
  • 12 Jan: Released Rasqal RDF Query Library 0.9.22: Raptor 2 only, ABI/API break, 16 new SPARQL Query 1.1 builtins and more http://bit.ly/fzb9xW #rdf
  • 27 Jan: Rasqal 0.9.23 RDF query library released with SPARQL update query structure fixes (for @theno23 and 4store ): http://bit.ly/gVDp57
  • 1 Feb: Released Redland librdf 1.0.13 C RDF API and Triplestores with Raptor 2 support + more http://bit.ly/hOr4HA
  • 22 Feb: Released Rasqal RDF Query Library 0.9.25 with many SPARQL 1.1 new things and fixes. RAND() and BIND() away! http://bit.ly/flFDH1
  • 20 Mar: Raptor RDF Syntax Library 2.0.1 released with minor fixes for N-Quads serialializer and internal librdfa parser http://bit.ly/fT3aPX
  • 26 Mar: Released my Flickcurl C API to Flickr 1.21 with some bug fixes and Raptor V2 support (optional) See http://bit.ly/f7QncO
  • 1 Jun: Released Raptor 2.0.3 RDF syntax library: a minor release adding raptor2.h header, Turtle / TRiG and ohter fixes. http://bit.ly/jHKaB8
  • 27 Jun: Rasqal RDF query library 0.9.26 released with better UNION execution, SPARQL 1.1 MD5, SHA* digests and more http://bit.ly/lI7lDW
  • 23 Jul: Released Redland librdf RDF API / triplestore C library 1.0.14: core code cleanups, bug fixes and a few new APIs. http://bit.ly/qqV1Rb
  • 25 Jul: Raptor RDF Syntax C library 2.0.4 released with YAJL V2, and latest curl support, SSL client certs, bug fixes and more http://bit.ly/oCIIDd

(yes 13; I didn’t tweet 2 of them: Rasqal 0.9.24 and Raptor 2.0.2)

You know it’s quite tricky to collapse months of changelogs (GIT history) into release notes, compress it further into a news summary of a few lines and even harder to compress that into less than 140 characters. It is way less if you include room for a link url and space for retweeting and sometimes need a hashtag for context.

So how do you measure a release? Let’s try!

Tarballs

Released tarball files from the Redland download site.

date package old
version
new
version
old
tarball size
new
tarball size
tarball
byte diff
tarball
%diff
2011-01-03 raptor 1.4.21 2.0.0 1,651,843 1,635,566 -16,277 -0.99%
2011-01-12 rasqal 0.9.21 0.9.22 1,356,923 1,398,581 +41,658 3.07%
2011-01-27 rasqal 0.9.22 0.9.23 1,398,581 1,404,087 +5,506 0.39%
2011-01-30 rasqal 0.9.23 0.9.24 1,404,087 1,412,165 +8,078 0.58%
2011-02-01 redland 1.0.12 1.0.13 1,552,241 1,554,764 +2,523 0.16%
2011-02-22 rasqal 0.9.24 0.9.25 1,412,165 1,429,683 +17,518 1.24%
2011-03-20 raptor 2.0.0 2.0.1 1,635,566 1,637,928 +2,362 0.14%
2011-03-20 raptor 2.0.1 2.0.2 1,637,928 1,633,744 -4,184 -0.26%
2011-03-26 flickcurl 1.20 1.21 1,775,246 1,775,999 +753 0.04%
2011-06-01 raptor 2.0.2 2.0.3 1,633,744 1,652,679 +18,935 1.16%
2011-06-27 rasqal 0.9.25 0.9.26 1,429,683 1,451,819 +22,136 1.55%
2011-07-23 raptor 2.0.3 2.0.4 1,652,679 1,660,320 +7,641 0.46%
2011-07-25 redland 1.0.13 1.0.14 1,554,764 1,581,695 +26,931 1.73%
Barchart of %diffs between tarball releases.  Noticeable differences are raptor 2.0.0 with a big negative change and rasqal 0.9.22 with largest increase.
Click image to embiggen

Releases that stand out here are Raptor 2.0.0 which was a major release with lots of changes and Rasqal 0.9.21; that changed a lot upwards and it was both an API break as well as lots of new functionality.

Sources

Taken from my GitHub repositories extracting the tagged releases, excluding ChangeLog* files, and running diffstat over the output of a recursive diff -uRN.

date package old
version
new
version
source
files
changed
source
lines
inserted
source
lines
deleted
source
lines
net
2011-01-03 raptor 1.4.21 2.0.0 215 34,018 30,348 64,366
2011-01-12 rasqal 0.9.21 0.9.22 94 11,641 5,712 17,353
2011-01-27 rasqal 0.9.22 0.9.23 25 5,663 5,199 10,862
2011-01-30 rasqal 0.9.23 0.9.24 48 1,107 227 1,334
2011-02-01 redland 1.0.12 1.0.13 96 3,721 5,627 9,348
2011-02-22 rasqal 0.9.24 0.9.25 64 3,857 1,333 5,190
2011-03-20 raptor 2.0.0 2.0.1 42 6,163 5,833 11,996
2011-03-20 raptor 2.0.1 2.0.2 9 55 12 67
2011-03-26 flickcurl 1.20 1.21 19 737 308 1,045
2011-06-01 raptor 2.0.2 2.0.3 88 2,827 2,232 5,059
2011-06-27 rasqal 0.9.25 0.9.26 116 7,130 4,272 11,402
2011-07-23 raptor 2.0.3 2.0.4 33 808 103 911
2011-07-25 redland 1.0.13 1.0.14 75 3,681 5,477 9,158
Total 924 81,408 66,683 148,091
Barchart of number of source lines changed (insertions + deletions) in each release.  Raptor 2.0.0 stands out as much larger than all the others, nearly combined.
Click image to embiggen

Again Raptor 2.0.0 stands out as changing a huge number of files and lines. Also you can see the mistake that was Raptor 2.0.1 being corrected the same day with Raptor 2.0.2 with a few changes. This didn’t seem to get tweeted. However also note that several of the Rasqal releases like 0.9.22 and 0.9.26 changed many files. The ‘source lines net’ column is the addition of the insert and deletes although some of those lines are the same.

Words

Words from the changelog, the release notes and the news post comparing the number of words in the rendered output.

date package old
version
new
version
changelog
words
release
note
words
changelog
to release
word ratio
news
words
changelog
to news
word ratio
2011-01-03 raptor 1.4.21 2.0.0 15,411 2,709 5.69 365 42.22
2011-01-12 rasqal 0.9.21 0.9.22 3,465 1,199 2.89 162 21.39
2011-01-27 rasqal 0.9.22 0.9.23 318 135 2.36 52 6.12
2011-01-30 rasqal 0.9.23 0.9.24 450 254 1.77 59 7.63
2011-02-01 redland 1.0.12 1.0.13 778 235 3.31 73 10.66
2011-02-22 rasqal 0.9.24 0.9.25 1,649 558 2.96 136 12.13
2011-03-20 raptor 2.0.0 2.0.1 247 76 3.25 50 4.94
2011-03-20 raptor 2.0.1 2.0.2 42 27 1.56 42 1.00
2011-03-26 flickcurl 1.20 1.21 119 - - 68 -
2011-06-01 raptor 2.0.2 2.0.3 872 266 3.28 28 31.14
2011-06-27 rasqal 0.9.25 0.9.26 4,410 970 4.55 96 45.94
2011-07-23 raptor 2.0.3 2.0.4 517 345 1.50 77 6.71
2011-07-25 redland 1.0.13 1.0.14 1,347 620 2.17 88 15.31
Total 29,625 7,394 1,296
Bar chart of the ratio of the number of words in the changelog to those in the release news.  Rasqal 0.9.26 and Raptor 2.0.0 are the argest but the tiny Raptor 2.0.2 bug fix is also notable.
Click image to embiggen

So now we get to words. Yes, lots of words, most of them by me. Starting with the changelog which is a hand edited version of the SVN and later GIT changes was over 15K words for Raptor 2.0.0. And that gets boiled down lots into release notes, news and then a terse tweet. Since the changelog corresponds roughly to source changes but the news to user visible changes like APIs, you can see that the oddities are again Rasqal 0.9.26 where there were lots of changes but not so much news; it was mostly internal work.

Now I need to go summarise this blog post in a tweet: Releases = Tweets in 1156 words http://bit.ly/n88ZIQ

Rasqal 0.9.21 and SPARQL 1.1 Query aggregation

Rasqal 0.9.21 was just released on Saturday 2010-12-04 (announcement) containing the following new features:

  • Updated to handle aggregate expression execution as defined by the SPARQL 1.1 Query W3C working draft of 14 October 2010
  • Executes grouping of results: GROUP BY
  • Executes aggregate expressions: AVG, COUNT, GROUP_CONCAT, MAX, MIN, SAMPLE, SUM
  • Executes filtering of aggregate expressions: HAVING
  • Parses new syntax: BINDINGS, isNUMERIC(), MINUS, sub SELECT and SERVICE.
  • The syntax format for parsing data graphs at URIs can be explictly declared.
  • The roqet utility can execute queries over SPARQL HTTP Protocol and operate over data from stdin.
  • Added several new APIs
  • Fixed Issue: #0000388

See the Rasqal 0.9.21 Release Notes for the full details of the changes.

I’d like to emphasis a couple of the changes to the roqet(1) utility program: you can now use it to query over data from standard input, i.e. use it as a filter, but only if you are querying over one graph. You can also specify the format of the data graphs either on standard input or from URIs, if the format can’t be determined or guessed from the mime type, name or bytes. Finally roqet(1) can execute remote queries at a SPARQL Protocol HTTP service, sometimes called a “SPARQL endpoint”.

The new support for SPARQL Query 1.1 aggregate queries (and other features) led me to make comments to the SPARQL working group about the latest SPARQL Query 1.1 working draft based on the implementation experience. The comments (below) were based on the implementation I previously outlined in Writing an RDF query engine. Twice at the end of October 2010.

The implementation work to create the new features was substantial but relatively simple to describe: new rowsources were added for each of the aggregation operations and then included in the execution plan when the query structure indicated their presence after parsing. There was some additional glue code that needed to be add to allow rows to indicate their presence in a group; a simple integer group ID was sufficient and the ID value has no semantics, just a check for a change of ID is enough to know a group started or ended.

I also introduced an internal variable to bind the result of SELECT aggregate expressions after grouping ($$aggID$$ which are illegal names in sparql). I then used that to replace the aggregate expression in the SELECT and the HAVING expressions and used the normal evaluation mechanisms. As I understand it, the SPARQL WG is now considering adding a way to name these explicitly in the GROUP statement. A happy coincidence since I had implemented it without knowing that.

To prepare this I did think about the approach a lot and developed a couple of diagrams for the grouping and aggregation rowsources that might help to understand how they work, how they can be implemented and tested as standalone unit modules, which they were.

Rasqal Group By Rowsource
Diagram showing input and outputs of group by rowsource in Rasqal.

As always, the above isn’t quite how it is implemented. There is no group by node if there is an implicit group when GROUP BY is missing but an aggregate expression is used; instead the Rasqal rowsource class synthesizes 1 group around the incoming row, when grouping is requested.

Rasqal Aggregation Rowsource
Diagram showing inputs and outputs of Rasqal aggregation rowsource.

This shows the guts of the aggregate expression evaluation where internal variables are introduced, substituted into the SELECT and HAVING expressions and then created as the aggregate expressions are executed over the groups.

The rest of this post are my detailed thoughts on this draft of SPARQL 1.1 Query as posted to the SPARQL WG comments list but turned into HTML markup here.


Dave Beckett comments on SPARQL 1.1 Query Language W3C WD 2010-10-14

These are my personal comments (not speaking for any past or current employer) on: SPARQL 1.1 Query Language W3C Working Draft 14 October 2010

My comments are based on the work I did to add some SPARQL 1.1 query and update support to my Rasqal RDF query library (engine and API) in version 0.9.21 just released 2010-12-04 as announced.

Some background to my work is given in a blog post: Writing an RDF query engine. Twice

I. General comments

I felt the specification introduced more optional features bundled together, where it was not entirely clear what the combination of those features would do. For example a query with no aggregate expression but has a GROUP BY and HAVING is allowed by the syntax and the main document doesn’t say if it’s allowed or what it means.

I found it hard to assemble all the pieces from the mathematical explanations into something I could code.

The spec has several terms in the grammar not in the query document. After asking, these turned out to be federated query (BINDINGS), or update (LOAD, …) but these are not pointed out or linked to clearly although there is mention of the documents in the status section. Please make these more clear.

I decided to concentrate on the new Aggregates feature since I had already implemented SELECT expressions, leaving Subqueries and Negation to later. Property paths should be in the list of new features in the status section at the of the document.

“SPARQL 1.1 Uniform HTTP Protocol for Managing RDF Graphs” is rather a long title; what does ‘Uniform’ or ‘HTTP’ add? SOAP is dead. suggest “SPARQL 1.1 RDF Graph Management Protocol” or RDF dataset

With all the additions especially property paths (a new query language), update (data management language) and federated query (remote query support) and I understand ~30 additional keywords are being added beyond this draft for functions and operators, I see this as a major change to SPARQL 1.0, more of a SPARQL 2. You should consider renaming it.

II. Aggregates

I found the math in the aggregation and grouping sections rather hard to understand so I also looked what MySQL and SQLite did, and wrote my own diagram based on the data flow:

Guessed dataflow of SPARQL 1.1 query engine in executing a query
See http://www.dajobe.org/2009/11/sparql11/

so for me it was easier to see the individual components/stages (which roughly correspond to SPARQL algebra terms).

I had to make several of my own tests with my guess on what the answers should be. With all the pieces for aggregate expressions: grouping, aggregate expression, distinct, having, counting (count * vs count(expr)) there needs to be several tests with good coverage.

I felt aggregate functions can be broken down into these parts

  1. selecting of aggregate function value
  2. grouping of results – optional; explicit, implicit when agg func present
  3. execution of aggregate functions – optional; with some special cases
  4. filtering of group results with having – optional

(following my diagram above)

As it is clear they are all optional, it probably is worth explaining what it means when they are absent, such as group by + having with no aggregate expression as mentioned above.

III. Bindings is a new syntax

BINDINGS essentially gives a new way to write down a variable bindings result set. Even though it is discussed in the federated query spec about using it for SERVICE, it’s not restricted to that by the grammar or specifications.

BINDINGS in the query grammar (rBindingsClause) I previously asked about on 2010-10-15 and this comment is an extension of that comment.

So as I read it this is a valid ‘query’ which does no real execution but just returns a set.

SELECT * WHERE BINDINGS ?var1 ?var2 {
  ( "var1-value1"  "var2-value1" )
  ( "var1-value2"  "var2-value2" )
}

or if you really must you can leave out the WHERE:

SELECT * BINDINGS {
  ( "var1-value1"  "var2-value1" )
  ( "var1-value2"  "var2-value2" )
}

My question is to ask if this is correct and to clarify in the spec the intended use, whether or not it is intended for use with SERVICE only.

IV. Section-by-section comments

Section: Status of this Document

Should mention property paths as new since that is a major addition after SPARQL 1.0

Please link to the documents in the status, these are just text.

Sections 1-8

Skipped, they are same as SPARQL 1.0 I hope

9 Property Paths

I am unlikely to ever implement any of this, it’s a second query language inside SPARQL. How many systems implemented this before the SPARQL 1.1 work was started?

10 Aggregates

I took all the examples in this section and turned them into test cases where possible.

10.2

The explanation of errors and ListEvalE is rather opaque. It is still not clear to me what is done with errors in GROUP BY, HAVING and arguments to aggregate expressions. Some are skipped, some are ignored and return NULL. Examples and tests will enable checking this but the spec needs to be clearer.

Definition: Group and Aggregation were hard for me to understand. The input to Aggregation being a ‘scalar’ meaning actually a set of key:value pairs was confusing. It is not also not clear if those are a set or an ordered set of parameters. This is only used today for the ‘separator’ with GROUP_CONCAT.

10.2.1 HAVING

What happens when there is an expression error?

What variables and expressions can be used here and what is their scope?

10.2.2 Set Functions

Another confusing section. I mostly ignored this and did what SQL did.

None of the functions that I can tell, ever use ‘err’.

10.2.3 Mapping from Abstract Syntax to Algebra

scalarvals argument is used here – I think this is called ‘scalar’ earlier.

Un-numbered Section after 10.2.3: Joining Aggregate Values

Never figured out what this was trying to define but my code executes the example.

11. Subqueries

(Ignored in my current work)

12 RDF Dataset

(Same as SPARQL 1.0 I assume so no comments)

13 Basic Federated Query

Yes, please merge in the text here.

14 Solution Sequences and Modifiers

( Aside: This is one of those SPARQL parts where everything mentioned is optional. Otherwise this section has no change from SPARQL 1.0, I am just mentioning it as a pointer of a trend. )

15. Query Forms

No comments.

16. Testing Values

16.3 Operator Mapping

Is it worth noting the new operators in SPARQL 1.1?

Operators: implemented isNUMERIC()

16.4 Operators Definitions

My current state of implementation of new to SPARQL 1.1 expressions

  • 16.4.16 IF – implemented
  • 16.4.17 IN – implemented
  • 16.4.18 NOT IN – implemented
  • 16.4.19 IRI – implemented
  • 16.4.20 URI – implemented
  • 16.4.21 BNODE – implemented
  • 16.4.22 STRDT – implemented
  • 16.4.23 STRLANG – implemented

No comments on the above

16.4.24 NOT EXISTS and EXISTS

I am lumping these together with sub-SELECT to implement. My concern here is that the syntax gets super-complex since all the graph pattern syntax can now appear inside any expression syntax.

There is a filter operator “exists” that …

Does this imply these can only appear in FILTER expressions? Please clarify.

17 Definition of SPARQL

I looked at the 17.2.3 for aggregate queries and it was more helpful than the math earlier. The pseudo code in Step 4 is a bit too unclear. Is that an example implementation or the required one?

17.6 Extending SPARQL Basic Graph Matching

Ignored.

18 SPARQL Grammar

Clearly this is not complete; there are lots of notes to update it.

19 Conformance

If property paths are not removed, please add a conformance level that includes SPARQL 1.1 without property paths.

Does SPARQL 1.1 Query require implementation of the dependent specs - federated query and update? Looks to me that protocol may also be dependent?


End of life of Raptor V1. End of support for Raptor V1 in Rasqal and librdf

Raptor V1 was last released in January 2010 and Raptor V2 seems pretty stable and working. I am therefore announcing that from early 2011, Raptor V2 will replace Raptor V1 and be a requirement for Rasqal and librdf.

End of life timeline

Now

  1. Raptor V1 last release remains 1.4.21 of 2010-01-30
  2. Raptor V2 release 2.0.0 will happen “soon”.
  3. The next Rasqal release will support Raptor V1 and Raptor V2.
  4. The next librdf release will support Raptor V1 and Raptor V2 (and requires Rasqal built with the same Raptor version).

2011

  1. The next Rasqal release will support Raptor V2 only.
  2. The next librdf release will support Raptor V2 only (and require a Rasqal built with Raptor V2).

In the style of open source I’ve been using for the Redland libraries, which might be described as “release when it’s ready, not release by date”, these dates may slip a little but the intention is that Raptor V2 becomes the mainline.

I do NOT rule out that there will be another Raptor V1 release but it will be ONLY for security issues. It will contain minimal changes and not add any new features or fix any other type of bug.

Developer Actions

If you use the Raptor V1 ABI/API directly, you will need to upgrade. If you want to write conditional code, that’s possible. The redland librdf GIT source (or 1.0.12) uses the approach of macros that rewrite V2 into V1 forms and I recommend this way since dropping Raptor V1 support then amounts to removing the macros.

The Raptor V2 API documentation has a detailed section on the changes and there is also an upgrading document plus it points to a perl script docs/upgrade-script.pl (also in the Raptor V2 distribution) that automates some of the work (renames mostly) and leaves markers where a human has to fix.

The Raptor V1 API documentation will remain in a frozen state available at http://librdf.org/raptor/api-1.4/

Packager Actions

If you are a packager of the redland libraries, you need to prepare for the Raptor V1 / Raptor V2 transition which can vary depending on your distribution’s standards. The two versions share two files: the rapper binary and the rapper.1 man page. I do not want to rename them to rapper2 etc. since rapper is a well known utility name in RDF and I want ‘rapper’ to provide the latest version.

In the Debian packaging which I maintain, these are already planned to be in separate packages so that both libraries can be installed and you can choose the raptor-utils2 package over raptor-utils (V1).

In other distributions where everything is in one package (BSD Ports for example) you may have to move the rapper/rapper.1 files to the raptor V2 package and create a new raptor1 package without them. i.e. something like this

Raptor V1 package 1.4.21-X:
/usr/lib/libraptor1.so.1* …
(no /usr/bin/rapper or /usr/share/man/man1/rapper.1 )
Raptor V2 package 2.0.0-1:
/usr/lib/libraptor2.so.0* …
/usr/bin/rapper
/usr/share/man/man1/rapper.1
conflicts with older raptor1 packages before 1.4.21-X

The other thing to deal with is that when Rasqal is built against Raptor V2, it has internal change that mean librdf has to also be built against rasqal-with-raptor2. This needs enforcing with packaging dependencies.

This packaging work can be done/started as soon as Raptor V2 2.0.0 is released which will be “soon”.