Dave Beckett - Journalblog
RDF and free software hacking

28 June 2008

The Flickcurl Story

In January 2007 I started playing with the Flickr API - the HTTP-based web service that lets you manipulate Flickr. At that point I was using it to play with machine tags and to see how the most popular Web Service API worked, especially in the area of authentication. This was in the days before OAuth if you can remember that far back.

I started with a test program in C that called libcurl and did some of the signing and parameter marshaling of the flickr.photos.getInfo call, which is where all the juicy metadata about photos is. I started thinking about ways to map photo metadata into RDF for manipulating and querying; there is an existing Perl Flickr RDF mapping but it didn’t contain everything. This early state of the sources was useful; it contained a small library with the one API method plus a command-line utility to call it. Since I was using libcurl to call Flickr, I named it Flickcurl. Also CFlickr was taken! (Flickcurl also uses libxml but flickcurlibxml is just nuts.)
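For anyone who has not met the signing step, here is a minimal sketch of it. It is in Python rather than C for brevity and is not Flickcurl's own code. It assumes the scheme described in Flickr's authentication documentation of the time: sort the parameters by name, append each name and value to the shared secret, and send the MD5 of that string as api_sig. The API key and secret are made-up placeholders, and sign_params is a name invented for this illustration.

```python
import hashlib
from urllib.parse import urlencode


def sign_params(params, secret):
    """Return a copy of params with an api_sig entry added."""
    # Sort by parameter name, then join each name directly with its value.
    base = secret + "".join(name + params[name] for name in sorted(params))
    signed = dict(params)
    signed["api_sig"] = hashlib.md5(base.encode("utf-8")).hexdigest()
    return signed


# Placeholder credentials: neither the key nor the secret is real.
params = {
    "method": "flickr.photos.getInfo",
    "api_key": "0123456789abcdef",
    "photo_id": "196308964",
}
signed = sign_params(params, secret="not-a-real-secret")

# Marshal the signed parameters into a query string for the HTTP request.
query = urlencode(sorted(signed.items()))
print(query)
```

The dictionary of parameters is the whole call; the HTTP request is just that dictionary turned into a query string.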

Apart from playing with photo metadata I also had some personal reasons to make something new. I wanted a lighter weight and less formal project than the way I had been building the Redland RDF Libraries. More of the “it compiles, ship it” model and less of the “unit tests, test cases and continual make check, worrying about portability” approach. Maybe more fun would be a way to put it. I’m happiest as a free software / open source software tool-builder, and at this point in 2007 I was spending a lot of time at work doing non-coding things such as designing specifications and doing technical leadership. The chance to work on some different code now and then was an appealing counterpoint to the work stuff.

Redland is a set of libraries that have been growing since mid-2000 with more and more features as the semantic web technology stack grows so at any point in time there is no clear end state. For this project I wanted a clear goal to reach so I could be clearly done at some point. This is possible with the Flickr API since there are at any time a finite number of API calls (something like 100) so progress can be measured… although Flickr did add API calls while I was working on it. The result was I made a Flickcurl API coverage page with embedded API changelog (automatically generated from source code comments).

Flickcurl 0.1 was “released” 2007-01-21 although I didn’t announce it to anyone at that point. It was more of a tarball than an actual release.

One more thing I wanted to do was to experiment with different ways to tell people about software, compared to the ways I was using with Redland, which were mostly email based but also via SourceForge and Freshmeat. So for Flickcurl I tried a bunch of different ways:

That was kind of fun, and I also followed a similar lightweight process with Triplr but that’s another story. I think caring less worked out fine; people did use it and submit patches. Right now I still use the Flickr mailing list, API group, and freshmeat project.

As the library headed towards 100% of the API and beyond it did get a bit more formal and I imported what I think are the best practices from the Redland libraries:

  • objects in C design
  • always refactoring the source code: refactoring is not just for dynamic languages
  • source code docu-comments generating an HTML API reference via gtk-doc
  • folding in portability fixes
  • make it work with optional libraries for extra functionality (Raptor in this case, to allow serialising to other RDF syntaxes)
  • built in portable ANSI C
  • taking care about memory leaks with valgrind
  • comes with a utility program able to exercise the entire API (called flickcurl)
  • Debian packages (created by somebody else, yay!)
  • manual pages for the command line utilities

The general aim was to get 100% of the Flickr API done by the end of 2007 and I actually reached it for Flickcurl 1.0 on 2008-01-12 which was pretty close.

So right now the library has gone beyond 1.0; the latest release is Flickcurl 1.4, which was released last Tuesday, 24th June (see release notes). It primarily added video support, but I also updated the photo metadata mapping to RDF by adding a serializer class for abstracting the photo-to-triples process.

The RDF triples mapping is something that has always been there but not part of the core library. It could be optionally used inside Raptor to automatically read Flickr photo URIs as RDF data sources. I doubt it’ll ever be presented inside a public web service like Triplr since it would require passing in Flickr API authentication tokens and user credentials.

The RDF triples mapping I’ve made for the Flickr photo metadata has a mixture of vocabularies which are in 3 buckets:

  1. Existing Vocabularies: well known RDF schemas (class and properties) that have been developed over many years by multiple people and organisations, sometimes with a lot of formality.
  2. Flickr-specific Vocabularies: vocabularies I made up mostly for Flickr video and places API terms.
  3. Machine Tag Vocabularies: I made them up using machinetags.org/ns URIs as a root for the namespaces associated with the vocabularies. The terms in the vocabularies come from how people used machine tags on Flickr and are not always defined.
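As a rough illustration of the third bucket, here is a sketch in Python of turning one machine tag into a triple. Machine tags have the shape namespace:predicate=value. The exact URI layout under the namespace root is an assumption made for this example, as are the pattern for valid names and the function name; the real mapping is described in the release notes.

```python
import re

# Machine tags look like namespace:predicate=value.
# Assumption: names start with a letter and continue with letters, digits or "_".
MACHINE_TAG = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):([A-Za-z][A-Za-z0-9_]*)=(.*)$")


def machine_tag_to_triple(photo_uri, raw_tag, ns_root):
    """Map one machine tag to a (subject, predicate, literal) tuple, or None."""
    match = MACHINE_TAG.match(raw_tag)
    if match is None:
        return None  # an ordinary tag, not a machine tag
    namespace, predicate, value = match.groups()
    # Hypothetical URI layout: <root>/<namespace>#<predicate>.
    return (photo_uri, f"{ns_root}/{namespace}#{predicate}", value.strip('"'))


photo = "http://www.flickr.com/photos/dajobe/196308964/"
root = "http://machinetags.org/ns"  # hypothetical namespace root
triple = machine_tag_to_triple(photo, "upcoming:event=12345", root)
print(triple)
print(machine_tag_to_triple(photo, "jellyfish", root))
```

The predicate exists only because somebody used that tag, which is why the terms in these vocabularies are not always defined anywhere.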

This is a range of what might be called semantic web heavy to light although there is absolutely nothing wrong with mixing things up if you are not worried about inference. This is OK! I should probably put some html/schema documents at the vocabularies and get the redirects and all that # and / business sorted so that the linked data works out properly but what I have now is just a start and I’d be interested to see what people think. There are more details of the vocabularies and terms I’m using in the Flickcurl 1.4 release notes although I should probably add vocabulary information to the documentation too.

That’s all for now but I’ll expand some more in another post about the Flickr API itself and my experience with it and impressions of it as both a software developer and HTTP Web Service designer.

5 April 2008

Raptor Web Library 1.4.17

Last week I released version 1.4.17 of my Raptor C library (release notes) but in the 38 releases over the 7 years or so since I started building it, there’s more to see than just triples.

I/O stream API
Abstracts from specifics of I/O so that you can read from/write to any of a string/memory buffer, a file, a C FILE* or any custom user data structure. Similar to the C++ idea of a stream. This allows lazy evaluation of I/O and using language-specific I/O routines such as PerlIO (potentially, not yet!). The read abstraction is new in 1.4.17.
Sequence API
For small lists of data items that can grow at either end. This allows small lists, stacks and queues to be made when needed, and iterated over. Do not use for large lists! (Something else is coming along to handle this soon.)
Stringbuffer API
Provides a growing string class, similar to Java’s StringBuffer, which can be added to many times and then the result obtained (with no further changes allowed). This is handy for constructing formatted output, queries using parameters and values which is hard in C without entering the world of char buf[big enough]. The stringbuffer API tries hard to minimise copying.
Unicode API
For a small part of Unicode that is used near XML; UTF-8 encoding and decoding and Normal Form C validation. Nearby it provides XML 1.0 and 1.1 name-checking/validation functions.
URI API
Construction of URIs, resolving of relative URIs and turning absolute URIs into relative. It handles the RFC3986 URI spec updates.
Web fetch API
Retrieving of content from a URI in chunks of bytes. It works by calling curl, libxml or BSD libfetch underneath. It also handles redirections and adjusting any base URIs for 30x responses, returns content type headers early so that content negotiation can be done, and allows customisation of the requests to send appropriate User-Agent and Cache-Control headers (the latter new in 1.4.17).
XML with Namespaces Streaming (SAX2) Reader API (newly public in 1.4.17).
Streaming reading of large XML documents over either libxml2 or expat (build time choice with configure). Expat out of the box does not support namespaces and XML QNames, so Raptor adds those and hides various library differences and some bugs. The XML API also provides full XML Base (xml:base) support.
XML Writer API
Write XML elements as Canonical XML output.
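
To give a flavour of what the URI API item above means by resolving relative URIs, here is a small sketch. It uses urljoin from Python's standard library as a stand-in, not Raptor's own functions, and the base URI and references are the ones used in the examples of RFC 3986 section 5.4.

```python
from urllib.parse import urljoin

# Base URI used by the examples in RFC 3986, section 5.4.
BASE = "http://a/b/c/d;p?q"

# Each relative reference is resolved against the base to give an absolute URI.
for reference in ("g", "../g", "../../g", "?y", "#s"):
    print(f"{reference!r:10} -> {urljoin(BASE, reference)}")
```

The same resolution step is what has to be redone when a fetch is redirected and the base URI changes.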

Some of the above are large pieces of work and some are small, but they are all solid and many have been used for multiple years in production. These turned out to be handy datatype classes for web programming and I needed them since RDF is built with web technology.

The bonus is that all of the above is used to provide the signature features of Raptor: RDF Parsing - turning syntax into triples and RDF Serializing - turning triples into syntax. Raptor now parses 7 syntaxes (GRDDL, N-Triples, RDF/XML, RSSes, Atom, TRiG, Turtle) and serializes to 10 (Atom 1.0, GraphViz DOT, JSON * 2, N-Triples, RDF/XML * 3, RSS 1.0, Turtle). The JSON outputs are new for 1.4.17.

So although Raptor deals with all the RDF syntax details, it does a lot more. But I’m not changing the name!

13 March 2008

Yahoo! Search reading the semantic web

Yahoo! Search announced today in their blog post that they will soon support in the Search Monkey project the use of semantic web technologies such as RDFa with standard vocabularies such as FOAF and Dublin Core. It also will do a lot of other cool stuff that you can read about above.

It was nice to see that several other people noticed this. (Techmeme frozen page - #1 story)

I didn’t work on this project and don’t work in the search division, but continue to build with Semantic Web technologies in another more internal-facing part of Yahoo! But it is exciting to see that there are more public applications getting out, such as this and research projects like microsearch by Peter Mika.

10 February 2008

Birthdays - XML is 10 and RDF/XML is 9

Happy 10th Birthday XML.

It’s clear you are going to be around for some time. People know your good points and bad and have got the kinks worked out using you in production, in diversity and at scale.

Take care not to be distracted in the next 10 years by sexy new text formats that overlap in some features, but don’t replace you for many uses. I’m talking about you, JSON.

In the RDF world, RDF/XML is the syntax people love to hate, or just love/hate. It is 1 year younger than you, so maybe in February 2009 we’ll have something to celebrate about that. Yeah, it might happen :)

I recently made a new textual RDF syntax sibling Turtle with TimBL whose official birthday was last month, although its actual birth was January 2004 in Bristol, or earlier if you look into its ancestry. In 6 (10?) more years it’ll be something we can properly rely on, like XML is today.

Dave

P.S. For more memories, check out Tim, Eve and Norm who were involved in XML from very early on when I was just an observer.

13 August 2007

Semantic Web Yahoo - Part Deux

It’s been nearly 2 years since I joined Yahoo! and the semantic web-based technology I helped develop has been deployed in production for some time. It has been encouraging to see the ideas get more accepted: today I noticed that in a hotjobs search for rdf yahoo near Sunnyvale there are 5 jobs open - not in my group, but in Yahoo! Local.

Our group in Sunnyvale is continuing to look for HTTP and web caching experts, designers and coders for building REST-based web services. Right here and now we have interesting, large scale, rich data problems and are applying semweb techniques to them. Contact me if any of this sounds exciting to you.

Semantic Web Yahoo - Part one

3 August 2007

Flickcurl - C API to Flickr

In January 2007 just for fun I started writing Flickcurl, a C API to Flickr using the Flickr web services. The name was because it was originally built using Flickr via libcurl to do the HTTP work … although right now it contains more use of libxml than of libcurl.

I started this for a bunch of reasons, including to learn more about “web 2.0” web APIs, see how RESTy the Flickr API really is (Answer: not much, it’s very much an RPC model) and the issues with developing a Web API. It’s clear this is an evolved and evolving one since now and then I discover undocumented returned attributes in the XML and cases where it is not clear why attributes were used instead of elements. It’s very suited towards dynamic scripting languages where it is easy to pass around dictionaries / hashes / associative arrays of parameters that can grow. So in some sense, making something feel like a natural API in a static language like C is rather going against the grain and rather slow work.

There are, however, things available to help. There are method reflection APIs so I wrote a code generating program that can nicely automate writing many of the simpler calls that return no value or just a single one. I also used a lot of similar patterns so that parsing tags xml is quite similar to parsing comments xml. The XML is primarily read via XPath and a little DOM.

One other nice thing about this is that this is a piece of work with a fixed size, albeit growing slowly. The Flickr API currently has 104 calls - depending on how you measure them - so it’s easy to check progress, and that’s how I’ve been doing it. I built tools to read the docu-comments (javadoc, gnome-doc, kernel-doc style) and mark the Flickcurl coverage release by release.
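The coverage check can be sketched roughly like this. It is a Python illustration, not the actual tools, and it assumes a convention where each wrapper's docu-comment names the Flickr method it implements. The function names in the sample comments and the short method list are invented for the example.

```python
import re

# Assumed convention: each wrapper's docu-comment names the Flickr method it implements.
METHOD_NAME = re.compile(r"\bflickr\.[a-z]+(?:\.[A-Za-z]+)+")
# Only /** ... */ docu-comments count, not plain /* ... */ comments.
COMMENT = re.compile(r"/\*\*.*?\*/", re.DOTALL)


def implemented_methods(source):
    """Collect Flickr method names mentioned inside /** ... */ comments."""
    found = set()
    for comment in COMMENT.findall(source):
        found.update(METHOD_NAME.findall(comment))
    return found


def coverage(source, all_methods):
    """Return (done, total, percent) for the given list of API methods."""
    done = implemented_methods(source) & set(all_methods)
    return len(done), len(all_methods), 100.0 * len(done) / len(all_methods)


# Hypothetical C source fragment with two documented wrappers.
sample = """
/**
 * example_photos_getInfo:
 *
 * Implements flickr.photos.getInfo
 */
/**
 * example_tags_getHotList:
 *
 * Implements flickr.tags.getHotList
 */
/* flickr.test.echo is only mentioned in a plain comment, so it is ignored */
"""
api = [
    "flickr.photos.getInfo",
    "flickr.tags.getHotList",
    "flickr.test.echo",
    "flickr.photos.search",
]
print(coverage(sample, api))
```

Run against each release's sources with the full method list, the same count gives a coverage figure per release.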

The news today is that I have reached the half way point: 50% of the API with the release of Flickcurl 0.11, at least until they add something more! I have also done most of what I think are the trickier parts - the uploading, searching and getting info about photos. The remaining API parts are more regular, so I feel like I’m coding downhill now.

Now there’s something else it does - and this won’t be a surprise to most given my interests. Flickcurl generates RDF descriptions from Flickr photos with a flickrdf utility, including reading Machine Tags. The namespaces are either well known ones, or invented by me, pointing at the machinetags.org wiki - you can create your own definition.

flickrdf uses Raptor to do nicer serializing when it is available. So this means I can turn jellyfish into Turtles. W00t! (*)

$ ./flickrdf -o turtle http://www.flickr.com/photos/dajobe/196308964/
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/#> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://www.flickr.com/photos/dajobe/196308964/>
    dc:creator [
        a foaf:Person;
        foaf:maker <http://www.flickr.com/photos/dajobe/196308964/>;
        foaf:name "Dave Beckett";
        foaf:nick "dajobe"
    ];
    dc:dateSubmitted "2006-07-23T18:16:13Z"^^xsd:dateTime;
    dc:rights <http://creativecommons.org/licenses/by-nc-sa/2.0/>;
    dc:modified "2007-02-25T07:45:46Z"^^xsd:dateTime;
    dc:issued "2006-07-23T18:16:13Z"^^xsd:dateTime;
    dc:created "2006-07-23T05:28:50Z"^^xsd:dateTime;
    geo:lat "36.620487";
    geo:long "-121.904468";
    dc:title "Jellyfish at Monterey Aquarium";
    dc:subject "jellyfish" .

After that bad joke (and it could have been worse if I had a picture of a Turtle) here’s what you need to know. Get it at flickcurl-0.11.tar.gz (md5sum eea351e4d35e8d1c63b124cd8ee257ba, sha1sum d220f6371c0c5334c824a51ba848d9358d73e533) or the latest in the Flickcurl Subversion. It’s licensed under the GPL2 / LGPL2 / Apache 2.0 or any newer versions of any of them.

Note: I work for Yahoo! and although Flickr is part of Yahoo! this project is my own personal work.

(*) Actually I’m slightly cheating with this example: there are a couple of bug fixes in SVN after the release which are needed to get this output.

20 July 2007

Heading to OSCON

I am heading to Portland next week for the O’Reilly Open Source Conference 2007 which I’ve been to only once before. This year’s conference has around 15 parallel tracks, which makes it pretty huge to follow. I’m hoping to learn lots of new things, be inspired with ideas, meet some great people, and who knows, maybe hire some to work with me ;)

25 March 2007

Triplr - stuff in, triples out

I’ve made a new thing: Triplr for GETting semwebby data. Go check it out.

It’s unrelated to the other older new thing not previously mentioned in a blog post: Flickcurl, which is the C library I made for the Flickr API (about 25% complete), although I did steal the cute name from the utility which turns a Flickr photo’s description into triples, with the help of the new machine tags support. My conversion could be improved; I had to invent some namespaces.

17 March 2007

semantic web is webby data

I have often been puzzled why people write “The Semantic Web is AI” and “The Semantic Web is a top-down design” and “The Semantic Web is Ontologies”. As far as I’m concerned, all of those are bogus. I think I’ve worked out why they write this - they aren’t talking to anyone actually working directly on the technologies.

The semantic web is: a webby way to link data. That is all.

Everything beyond that is entirely optional fluff: data vs metadata, syntaxes, ontologies, query languages, rules, logic, …

This is my “lowercase” semantic web and the basis of what I have in running production code right now.

I’ll probably use that as my theme when I speak about A Little Semantics Can Go a Long Way on the panel at the Semantic Technology Conference in San Jose in May. ( I’ll also be at WWW2007 in Banff, Canada and XTech 2007 in Paris. )

17 February 2007

Grindstone to The GRDDL

My Raptor RDF parsing / serialising library has been doing GRDDL processing of a sort to make RDF triples for several years but it was only in Raptor 1.4.14 announced 31 January that I finally got round to managing the recursion through XML Namespace URIs and HTML head profiles, so that I was covering the majority of the spec. That was my coding over the Christmas break I took in the UK.

In the last few weeks I have been working on the GRDDL tests, some of which themselves are in beta, and getting my code through them, or fixing the tests. I’m happy that finally I’ve got to the stage where I think either a) I pass a test or b) the test has the wrong result. I’m currently waiting for the answer to my last report to the WG and they could still change or add to the spec but I expect it’ll be Last Call very shortly. I’ll wait until their reply before I ship a new version of Raptor with the most recent changes, which you can read now in the draft release notes if you want to know more.

So apart from diving into Raptor Subversion - feel free - you can kick the tires of the fixes right now with the Raptor parser demo for GRDDL and you’ll need a URI with some GRDDL-compatible markup for the URI box. Try http://www.w3.org/ :) or the examples in GRDDL spec or Dan Connolly’s home page.

That’s not the only thing I’ve done since Christmas but I’ll leave Flickcurl for another day.