Dave Beckett - Journalblog

Hacking the semantic linked data web

Month: July, 2004

Redland Porting Hints

Just dashed off some quick Porting Hints for Redland, Raptor and Rasqal, mostly for win32, based on what I’ve heard from various people and have described in email from time to time.

I’m pretty confident porting Raptor is straightforward, and quite confident about Rasqal (it has no requirements other than Raptor). Redland and the language bindings, however, can use a lot of optional libraries (SleepycatDB/BDB, MySQL, …).

Parsing 8G of RDF/XML

Last week the UniProt in RDF Format was announced by Eric Jain. This includes a 700MB gzipped RDF/XML file in the data area, which uncompresses to 8369854785 bytes of RDF/XML. So, of course, I had to throw my Raptor parser at it to see if it’d survive.

Round 1, it died with this error:

$ gunzip < uniprot.rdf.gz | rapper -c - http://example.org
rapper: Error - URI http://example.org:41640376 - Duplicated rdf:ID value '_501D28'

So there is a duplicate ID at around line 41.6M; I wonder if they know about it.

Round 2, disable errors:

$ gunzip < uniprot.rdf.gz | rapper -c --ignore-errors - http://example.org
rapper: Parsing returned 134091199 statements

This took about 26 minutes on my 2 year old desktop PC to count the 134M triples, which works out to around 86,000 triples/second. The PC was struggling with the CPU load of the ID checking as well as that of gunzip. Parsing the uncompressed file directly might not work, since Raptor uses standard C I/O rather than large file I/O, so it may not be able to seek in or read all of an 8G file.
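
As a quick sanity check of that rate, the arithmetic can be redone from the two figures quoted here. This is only a sketch: the variable names are mine, and "about 26 minutes" is treated as exactly 26 minutes, so the result is approximate.

```python
# Figures quoted in the post; the run time is approximate.
statements = 134_091_199     # count reported by rapper
elapsed_seconds = 26 * 60    # "about 26 minutes", assumed to be exactly 26

rate = statements / elapsed_seconds
print(f"{rate:,.0f} triples/second")  # prints "85,956 triples/second"
```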

rapper was taking a huge amount of memory, I suspect for the rdf:ID duplicate value checking, which hasn't seen data of that size before. While I was waiting I thought of a few ways to optimise it.
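
One way the memory cost of duplicate checking could be cut is to remember a short fixed-size digest of each ID rather than the full string. The sketch below is an assumption about what such an optimisation might look like, not a description of what Raptor does or of the ideas mentioned above. `DuplicateIdChecker` and its `check` method are hypothetical names, and the 8-byte digest trades a small chance of a false duplicate report for less storage per ID.

```python
import hashlib


class DuplicateIdChecker:
    """Hypothetical helper that remembers which rdf:ID values were seen."""

    def __init__(self):
        # Set of 8-byte digests, one per (base URI, rdf:ID) pair seen so far.
        self._seen = set()

    def check(self, base_uri, rdf_id):
        """Return True if this (base URI, rdf:ID) pair was seen before."""
        # rdf:ID values must be unique relative to the base URI in scope,
        # so the base URI is part of the key.
        name = f"{base_uri}#{rdf_id}".encode("utf-8")
        key = hashlib.blake2b(name, digest_size=8).digest()
        if key in self._seen:
            return True
        self._seen.add(key)
        return False


checker = DuplicateIdChecker()
print(checker.check("http://example.org", "_501D28"))  # False: first sighting
print(checker.check("http://example.org", "_501D28"))  # True: duplicate
```

In Python the saving is modest because every object carries its own overhead; the idea pays off more in C, where a fixed-width key can live in a compact hash table.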

DOAP part 4 – Launching the DOAP vocabulary

Just out is Describe open source projects with XML, Part 4 by Edd Dumbill for XML Watch at IBM developerWorks, aka DOAP part 4.

FOAF driven development paper for FOAF Galway

Just wrote FOAF Driven Development, my position paper for the FOAF Galway workshop run by SWAD-Europe and DERI.

Raptor RDF Parser Toolkit 1.3.2

Raptor RDF Parser Toolkit 1.3.2 has been announced. It is a minor update with the following changes:

  • Added support for compiling against expat source trees (patch from Jose Kahan).
  • Added raptor_alloc_memory to allocate memory in raptor, typically needed by handler routines on win32.
  • Made errors in fetching WWW content pass to the main error handler.
  • Added accessor functions for parts of the raptor_locator structure (patch from Edd Dumbill).
  • Disabled the broken Unicode NFC checking via GNOME glib for this release.

The last of these is something I want to replace. I’ve estimated it’ll take a day or two’s coding, and a checker for NFC alone should be easier to make smaller and faster than the full NFC/NFD/NFKC/NFKD normalisation that the GNOME glib function attempted.
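
To illustrate the difference between checking and normalising, here is a minimal sketch using Python's standard `unicodedata` module (`is_normalized` needs Python 3.8 or later). It is not Raptor code, and `is_nfc` is just a name made up for the example. It only answers whether a string is already in NFC and never builds a normalised copy.

```python
import unicodedata


def is_nfc(text):
    """Return True if text is already in Unicode Normalization Form C."""
    return unicodedata.is_normalized("NFC", text)


composed = "caf\u00e9"     # 'é' as one precomposed code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

print(is_nfc(composed))    # True
print(is_nfc(decomposed))  # False
```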