Reading Articles in a Web Browser

I am a big fan of the new “Reader” feature in Apple’s Safari browser. It makes it much easier to read an article without all the side content pulling your attention away. The really nice thing about “Reader” is that it fetches all the pages that make up an article, so you don’t have to page forward as you read, unlike “Readability” (which I also really like). “Reader” also keeps the images in place, as well as any objects that may be embedded in the article (Flash, for example, cough!! cough!!).

As a side project I have been implementing something similar in Java to extract indexable text from web pages. I started off by looking at the “Readability” source code (the fact that I understood what was going on despite never having touched JavaScript is a credit to the developer), and then took off in my own direction, inspired by “Reader”. The project took longer than I expected, but it is almost code-complete. I used the excellent Jericho HTML Parser for this. The trick is to strike a balance between coding for specific sites and keeping the code generic enough to deal with as many sites as possible.

As it turned out, the code handles multi-page articles, copes with more sites than “Reader” can, outputs both plain text for indexing and simplified HTML for reading, and handles images quite nicely.
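
To illustrate the generic end of that trade-off, here is a minimal Java sketch of one heuristic in the spirit of “Readability”: keep paragraphs that are long enough and not dominated by link text. It is not the project's code. The class name, the thresholds and the regex-based parsing are all assumptions made to keep the example free of dependencies, and a real implementation would walk a properly parsed document (with Jericho, for instance) rather than pattern-match raw HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not the project's code, and it uses regexes rather than Jericho.
class ArticleTextSketch {

    // Matches each <p>...</p> block; a real parser copes with malformed markup far better.
    private static final Pattern PARAGRAPH =
            Pattern.compile("<p\\b[^>]*>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Matches each <a>...</a> block so the amount of linked text can be measured.
    private static final Pattern LINK =
            Pattern.compile("<a\\b[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Assumed thresholds, chosen for illustration only.
    static final int MIN_LENGTH = 80;
    static final double MAX_LINK_DENSITY = 0.3;

    // Removes tags and collapses runs of whitespace into single spaces.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Fraction of a paragraph's visible text that sits inside links.
    // Empty paragraphs are treated as fully linked so that they are rejected.
    static double linkDensity(String paragraphHtml) {
        int total = stripTags(paragraphHtml).length();
        if (total == 0) {
            return 1.0;
        }
        int linked = 0;
        Matcher m = LINK.matcher(paragraphHtml);
        while (m.find()) {
            linked += stripTags(m.group(1)).length();
        }
        return (double) linked / total;
    }

    // Keeps paragraphs that look like article body text rather than navigation.
    static List<String> extractParagraphs(String html) {
        List<String> kept = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            String inner = m.group(1);
            String text = stripTags(inner);
            if (text.length() >= MIN_LENGTH && linkDensity(inner) <= MAX_LINK_DENSITY) {
                kept.add(text);
            }
        }
        return kept;
    }

    // Joins the kept paragraphs into plain text suitable for indexing.
    static String extractText(String html) {
        return String.join("\n\n", extractParagraphs(html));
    }
}
```

Calling `ArticleTextSketch.extractText(html)` on a page returns the surviving paragraphs separated by blank lines. Site-specific handling would sit on top of a generic filter like this one, for example by adjusting the thresholds or by pointing the filter at a known container element for a particular site.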
