When in Rome…
April 18, 2008
I have been doing a lot of work parsing feeds (both RSS and Atom) lately and have been using a tool called “Project ROME” for that. I know there is another tool called Abdera, but that only handles Atom feeds.
The ROME project page describes it as follows:
ROME is an set of open source Java tools for parsing, generating and publishing RSS and Atom feeds. The core ROME library depends only on the JDOM XML parser and supports parsing, generating and converting all of the popular RSS and Atom formats including RSS 0.90, RSS 0.91 Netscape, RSS 0.91 Userland, RSS 0.92, RSS 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom 0.3, and Atom 1.0. You can parse to an RSS object model, an Atom object model or an abstract SyndFeed model that can model either family of formats.
Which is what it does, and it does it very well. I have thrown any number of feeds at it and it has coped with them well. What I particularly like is that foreign markup is accessible, so special tags such as those from the iTunes and Media RSS extensions can still be read.
No tool is perfect and there are a few ‘lackings’ in it.
- For some reason it does not support comment URLs in items. I am not sure why, since I would have expected it to.
- Some feeds contain XSL/CSS directives located just before the feed itself. These direct a browser to “pretty print” the feed when displaying it, rather than showing raw XML. ROME does not like them at all, and they need to be stripped from the feed before it is handed over for parsing.
- Some feeds (like the NY Times, ahem…) have lots of null characters past the end of the feed that are nonetheless part of the document. I suspect what is happening somewhere is that the feed is deemed to be longer than it actually is and the empty space is filled with null characters (let us pass on the existential issue of filling empty space with nulls). Those also need to be stripped out.
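The two stripping steps in the list above can be sketched with nothing but the JDK. This is a rough illustration rather than anything ROME provides: the `FeedSanitizer` class and its method names are hypothetical, the feed is assumed to be UTF-8 encoded (in UTF-16 a zero byte can be a legitimate part of a character), and the regular expression only targets `<?xml-stylesheet … ?>` processing instructions, so it would also remove one that happened to appear inside a comment or CDATA section.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of ROME): cleans raw feed data before parsing. */
final class FeedSanitizer {

    // Matches an <?xml-stylesheet ... ?> processing instruction plus any
    // whitespace that follows it. The plain <?xml ... ?> declaration is untouched.
    private static final Pattern STYLESHEET_PI =
            Pattern.compile("<\\?xml-stylesheet\\b.*?\\?>\\s*", Pattern.DOTALL);

    private FeedSanitizer() {
    }

    /** Returns a copy of the input without the run of NUL (0x00) bytes at its end. */
    static byte[] stripTrailingNulls(byte[] raw) {
        int end = raw.length;
        while (end > 0 && raw[end - 1] == 0) {
            end--;
        }
        return Arrays.copyOf(raw, end);
    }

    /** Removes every xml-stylesheet processing instruction from the document text. */
    static String stripStylesheetDirectives(String xml) {
        return STYLESHEET_PI.matcher(xml).replaceAll("");
    }

    /** Applies both clean-up steps; assumes the raw bytes are UTF-8. */
    static String sanitize(byte[] raw) {
        String xml = new String(stripTrailingNulls(raw), StandardCharsets.UTF_8);
        return stripStylesheetDirectives(xml);
    }
}
```

The string returned by `FeedSanitizer.sanitize(bytes)` could then be wrapped in a `java.io.StringReader` and handed to the feed parser in place of the raw download.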
Unfortunately the last release was made in December 2006, and no work seems to have been done on the project since. Hopefully someone will step up to the plate and take it on; I might when work lets up. The one obvious thing I would do is add generics to it.