Munging Planet RDF
In the slide Munging, from his slides on the pitfalls around Unicode, XML and HTTP, Sam Ruby says:
Planet RDF will take HTML and run it through a iso-8859-1 to utf-8 conversion
This is not quite correct.
The code behind PlanetRDF uses the source blogroll to get the RSS feed URIs. Each feed is fetched and parsed with the Ultra-liberal RSS parser, which yields Unicode strings inside Python. That data is used to create a skeleton HTML document in UTF-8, which is passed to tidy to try to fix HTML escaping and tagging messes; tidy is told to read and write UTF-8. The aggregation is then performed, producing a new RDF/XML (RSS 1.0) feed in UTF-8, which is in turn transformed with XSLT into XHTML in UTF-8. There is no explicit transcoding at any step, so if there is an encoding problem, it'll be at the first stage, the RSS parsing.
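To make the "skeleton HTML document in UTF-8" step concrete, here is a minimal sketch in Python 3. It is not the PlanetRDF code: the function name `build_skeleton` and the sample entry are hypothetical, and the only point is that text stays Unicode inside Python and is encoded exactly once, at the boundary where it is handed to an external tool.

```python
from html import escape


def build_skeleton(title: str, body_html: str) -> bytes:
    """Wrap one entry in a minimal HTML document and encode it once, as UTF-8.

    Hypothetical helper for illustration only. The body is passed through
    untouched because cleaning up its markup is the external tool's job.
    """
    doc = (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        f"<title>{escape(title)}</title>"
        "</head><body>"
        f"{body_html}"
        "</body></html>"
    )
    # The single place where Unicode text becomes bytes.
    return doc.encode("utf-8")


# Sample entry with non-ASCII text and deliberately unclosed markup.
skeleton = build_skeleton("Søren's notes", "<p>Unclosed paragraph with an é")
```

The resulting bytes are what would be fed to tidy, with tidy configured to treat both its input and its output as UTF-8.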
There are sometimes encoding errors in titles in the main page body. These happen when tidy emits UTF-8 encoded bytes and Python attempts to read them as ASCII. The right-hand side of the page is always correct, since it is generated entirely in RDF from the source blogroll, with no munging.
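The failure mode described above is easy to reproduce. The following is a small sketch in Python 3, not the PlanetRDF code: the sample title and the two helper functions are made up for illustration. It shows UTF-8 bytes being rejected when read as ASCII, and surviving when decoded explicitly as UTF-8.

```python
# A title containing non-ASCII characters, as bytes the way a tool
# writing UTF-8 would emit them.
title = "Café résumé"
utf8_bytes = title.encode("utf-8")


def read_as_ascii(data: bytes) -> str:
    """Mimic the bug: treat UTF-8 bytes as if they were ASCII."""
    return data.decode("ascii")


def read_as_utf8(data: bytes) -> str:
    """The fix: decode with the encoding the bytes were written in."""
    return data.decode("utf-8")


try:
    read_as_ascii(utf8_bytes)
    ascii_failed = False
except UnicodeDecodeError:
    # Any byte above 0x7F is not valid ASCII, so decoding stops here.
    ascii_failed = True

round_tripped = read_as_utf8(utf8_bytes)
```

Titles made up only of ASCII characters pass through either way, which would explain why the errors show up only sometimes.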
I guess it’s time to junk the “Ultra-liberal” parser and replace it with a real one. Since all PlanetRDF feeds are RSS 1.0, not RSS tag soup, we can use an RDF/XML parser. At that point PlanetRDF will be triples all the way down :)
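As a rough illustration of what "triples" means for an RSS 1.0 item, here is a sketch in Python 3. It is an assumption-heavy stand-in, not a real RDF/XML parser: it uses the standard library's namespace-aware XML parser and only handles the simplest item shape (an `rdf:about` subject with literal-valued child properties). The sample feed, its URIs and the function name `item_triples` are invented for the example.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"

# Invented sample feed, held as UTF-8 bytes so the XML parser, not our
# code, decides how to decode it.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://example.org/post/1">
    <title>Café triples</title>
    <link>http://example.org/post/1</link>
  </item>
</rdf:RDF>
""".encode("utf-8")


def item_triples(feed_bytes: bytes):
    """Return (subject, predicate, object) tuples for each RSS 1.0 item.

    Simplification: every child element is treated as a property with a
    plain literal value. Real RDF/XML allows much more than this.
    """
    root = ET.fromstring(feed_bytes)
    triples = []
    for item in root.findall(f"{{{RSS}}}item"):
        subject = item.get(f"{{{RDF}}}about")
        for prop in item:
            # ElementTree tags look like "{namespace}local"; joining the
            # two parts gives the property URI.
            predicate = prop.tag[1:].replace("}", "")
            triples.append((subject, predicate, (prop.text or "").strip()))
    return triples


triples = item_triples(SAMPLE)
```

A proper RDF/XML parser would also deal with nested resources, typed nodes, `rdf:resource` references and the rest of the syntax, which is the reason to use one rather than a sketch like this.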
More detail on how PlanetRDF works was given in Planet Blog by Edd Dumbill.