Last week Eric Jain announced UniProt in RDF Format. This includes a 700MB gzipped RDF/XML file in the data area, which uncompresses to 8369854785 bytes of RDF/XML. So, of course I had to throw my Raptor parser at it to see if it’d survive.
Round 1, it died with this error:
$ gunzip < uniprot.rdf.gz | rapper -c - http://example.org
rapper: Error - URI http://example.org:41640376 - Duplicated rdf:ID value '_501D28'
So there's a duplicate ID at around line 41.6M; I wonder if they know about that.
Round 2, disable errors:
$ gunzip < uniprot.rdf.gz | rapper -c --ignore-errors - http://example.org
rapper: Parsing returned 134091199 statements
It took about 26 minutes on my two-year-old desktop PC to count the 134M triples, which works out to around 86,000 triples/second. The PC was struggling with the CPU cost of the ID checking as well as that of gunzip. Parsing the uncompressed file directly might not work, since Raptor uses standard C I/O rather than large-file I/O and so may not be able to read all of an 8G file.
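The throughput figure above can be rechecked with a couple of lines of arithmetic. This is only a sketch using the numbers quoted in the post; the variable names are mine, and 26 minutes is an approximate timing, so the result is a rough rate rather than a benchmark.

```python
# Figures quoted in the post.
statements = 134_091_199   # statements reported by rapper
elapsed_minutes = 26       # approximate wall-clock time

elapsed_seconds = elapsed_minutes * 60
triples_per_second = statements / elapsed_seconds

print(f"{triples_per_second:,.0f} triples/second")  # 85,956 triples/second
```

Rounded, that is the "around 86,000 triples/second" mentioned above.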
rapper was taking a huge amount of memory, I suspect for the rdf:ID duplicate value checking, which hasn't seen data of that size before. While I was waiting I thought of a few ways to optimise it.
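To see why duplicate checking can eat memory on input this size, here is a minimal sketch of the general technique: remember every ID seen so far and report any repeat. This is not Raptor's actual implementation (which is in C); the function name and the sample IDs other than '_501D28' are made up for illustration.

```python
def find_duplicate_ids(ids):
    """Yield each ID that already appeared earlier in the stream."""
    seen = set()  # grows with the number of distinct IDs in the document
    for value in ids:
        if value in seen:
            yield value
        else:
            seen.add(value)


# Hypothetical sample; only '_501D28' comes from the error message above.
sample = ["_501D27", "_501D28", "_501D29", "_501D28"]
duplicates = list(find_duplicate_ids(sample))
print(duplicates)  # ['_501D28']
```

The set has to hold every distinct ID until the end of the document, so memory use grows with the number of rdf:ID values rather than with the size of any one record.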
Redland Porting Hints
Just dashed off some quick Porting Hints for Redland, Raptor and Rasqal mostly for win32, based on what I’ve heard from various people and described in email from time to time.
I’m pretty confident Raptor is straightforward to port, and quite confident about Rasqal (it has no requirements other than Raptor), but Redland and the language bindings can pull in a lot of optional libraries (SleepycatDB/BDB, MySQL, …).