Academic Journals, Presentation and Discoverability
I have recently been overhauling a crawler for a client. The crawler fetches the tables of contents and articles of various academic journals, applies a series of extractions, markups and lookups, and stores the results in a SQL database. A website presents the results in a normalized format for review.
The crawler was originally built in 2018 and has been extended and updated over the years, but it reached a point where a major refactoring was clearly required.
Assumptions made when it was built did not hold over time, and the crawler was regularly missing articles and issues. On top of that, it was built around RSS/RDF/Atom feeds, which were well supported in 2018 but are less so now.
RSS
As I mentioned, RSS/RDF/Atom feeds are less well supported now than they were, and there were a few lessons learned along the way.
First, the feeds themselves are not always well thought out. Ideally there is one feed for the current issue and another for pre-prints. This provides the most flexibility in choosing what content to fetch, since we want pre-prints for some journals but not others. Some journals combine both in a single feed, which is not ideal.
Second, feeds are not well supported or maintained. It is clear that most of the money (if not all) comes in through the website. It is not unusual for feeds to lag behind the website, or to stop being published entirely and then skip issues once they are fixed.
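One way to catch a lagging feed is to compare its newest item date against the newest date seen on the website. What follows is only a minimal sketch, not the client crawler's code: the function names and the seven-day tolerance are made up for illustration. It handles plain RSS 2.0 only (RDF and Atom feeds use namespaces and different date elements), and it assumes every pubDate carries a timezone.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def latest_feed_date(rss_xml):
    """Return the newest <pubDate> among the feed's items, or None if there are none.

    Hypothetical helper: assumes RSS 2.0 with RFC 822 dates that include a timezone.
    """
    root = ET.fromstring(rss_xml)
    dates = []
    for item in root.iter("item"):
        text = item.findtext("pubDate")
        if text:
            dates.append(parsedate_to_datetime(text.strip()))
    return max(dates) if dates else None


def feed_is_stale(rss_xml, site_latest, tolerance=timedelta(days=7)):
    """True when the feed is empty or trails the website by more than `tolerance`.

    `site_latest` is the timezone-aware date of the newest article found on the
    website; the seven-day default is an arbitrary illustrative value.
    """
    newest = latest_feed_date(rss_xml)
    return newest is None or site_latest - newest > tolerance
```

A feed flagged as stale would then be a signal to fall back to crawling the website for that journal.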
I am a big advocate of feeds. They are lightweight, easy to produce, and a great way to keep up with a site, but they seem to be relegated to second-class (or third-class) citizens these days. So it is easier to crawl the website, which leads me to…
Tables of Contents
There are two tables of contents we are interested in. The main one is the list of articles in an issue. No problems there: all journals handle this well, even when the list is spread over multiple pages.
Things get a little more uneven with the tables of contents for prior issues. Ideally a single page would list the current issue and prior issues in reverse chronological order, so we can see all the issues at a glance. Only one journal does this (out of about 40); the others use tabs, pulldowns, accordions, etc. All this makes discoverability quite complex, which leads me to…
Metadata
Most journals include metadata in the HTML, which really helps crawlers (and search engines) work out what is important and index it. There are a number of metadata schemes available. Google Scholar, PRISM and Dublin Core are all very popular, and it is a great idea to include all of them. Open Graph is another, but it is generic. The majority of journals include great metadata and it really helps.
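To give a feel for how this metadata can be picked up, here is a rough sketch using only the Python standard library. It is not the crawler's actual code: the class and function names are hypothetical, and the prefix list (citation_ for Google Scholar, dc. for Dublin Core, prism. for PRISM, og: for Open Graph) is an assumption about how journals commonly name these tags. Real pages vary.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumed name prefixes for the schemes discussed above; real pages vary.
PREFIXES = ("citation_", "dc.", "prism.", "og:")


class MetaTagParser(HTMLParser):
    """Collects <meta> tags whose name (or property) starts with a known prefix."""

    def __init__(self):
        super().__init__()
        self.tags = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Open Graph uses property="..."; the other schemes use name="...".
        name = attr.get("name") or attr.get("property")
        content = attr.get("content")
        if not name or content is None:
            return
        key = name.lower()  # journals differ on case, e.g. DC.title vs dc.title
        if key.startswith(PREFIXES):
            # Tags such as citation_author repeat, so keep every value in order.
            self.tags[key].append(content.strip())


def extract_metadata(html):
    """Return a dict mapping lower-cased tag names to lists of content values."""
    parser = MetaTagParser()
    parser.feed(html)
    return dict(parser.tags)
```

Keeping every value as a list matters because repeated tags such as citation_author are the usual way to list multiple authors.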
The one negative is that I have not found an exhaustive list of Google Scholar metadata tags. The Google Scholar pages include examples, but I was not able to find a definitive list, though this blog post lists the tags they have run into. We have seen some variability in how the metadata tags are used, and good documentation would really help here.