Sometimes It’s The Little Things
21 January 2010
(stackoverflow rep: 9138, Project Euler 90/274 complete)
From time to time I trawl through my blog subscriptions: some are defunct, while others may have changed their feed details sufficiently that they’re no longer being picked up. I have about 270 subscriptions, which makes the job a chore, and hence it doesn’t get done very frequently. The upshot is that, for the case where the blog hasn’t just died, I sometimes miss something.
What should we do with tedious manual activities? Automate! I went and did some investigation.
Google Reader will, through the “manage subscriptions” link (it’s at the bottom of the subscriptions list in my browser), let you download your details in an XML (more specifically, Outline Processor Markup Language, or OPML) file. It looks like this (heavily snipped to avoid excess tedium):
<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.0">
  <head>
    <title>mikewoodhouse subscriptions in Google Reader</title>
  </head>
  <body>
    <outline title="misc" text="misc">
      <outline text="Grumpy Old Programmer" title="Grumpy Old Programmer" type="rss" xmlUrl="https://grumpyop.wordpress.com/feed/" htmlUrl="https://grumpyop.wordpress.com"/>
      <outline text="Google Code Blog" title="Google Code Blog" type="rss" xmlUrl="http://google-code-updates.blogspot.com/atom.xml" htmlUrl="http://googlecode.blogspot.com/"/>
    </outline>
  </body>
</opml>
Ignoring for the moment the beauties of XML, this is pretty simple: there’s an outer “outline” that matches the folder I’ve created in Reader, within which is an outline for each feed to which I’m subscribed.
What I wanted to do was something like this:
- parse the OPML, extracting the xmlUrl attribute of each feed;
- download the feed from that URL;
- scan the entry listing in the feed to find the latest entry date, as a proxy for the last-known activity on that blog;
- review the blogs that seemed oldest and deadest for update or removal.
Well, with a little Googling and not much Rubying, it actually turned out to be so. John Nunemaker’s HappyMapper gem does a quick enough job of the parsing:
require 'happymapper'

module OPML
  class Outline
    include HappyMapper
    tag 'outline'
    attribute :title, String
    attribute :xmlUrl, String
    has_many :outlines, Outline
  end
end

sections = OPML::Outline.parse(File.read("google-reader-subscriptions.xml"))
# drop outlines that have no children: these are the feeds, parsed a second time in their own right
sections.delete_if { |section| section.outlines.size == 0 }
The delete_if part is there to cater for my parse creating duplicates of the “child” outlines: once in their own right and once within their parent section. I’m pretty sure I’ve seen how to avoid that somewhere, but for now this will do, since all my subscriptions live in folders. It leaves something there for the next iteration.
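One possible way to sidestep the duplicates, sketched here with REXML (bundled with Ruby) rather than HappyMapper, is to ask only for the outlines that sit directly under the body element. This is a swapped-in technique, not the HappyMapper fix alluded to above; the method name feeds_by_folder and the trimmed sample constant are made up for illustration.

```ruby
require 'rexml/document'

# Sample OPML in the shape shown earlier in the post (trimmed for illustration).
OPML_SAMPLE = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <opml version="1.0">
    <head><title>subscriptions</title></head>
    <body>
      <outline title="misc" text="misc">
        <outline text="Grumpy Old Programmer" title="Grumpy Old Programmer" type="rss"
                 xmlUrl="https://grumpyop.wordpress.com/feed/" htmlUrl="https://grumpyop.wordpress.com"/>
        <outline text="Google Code Blog" title="Google Code Blog" type="rss"
                 xmlUrl="http://google-code-updates.blogspot.com/atom.xml" htmlUrl="http://googlecode.blogspot.com/"/>
      </outline>
    </body>
  </opml>
XML

# Hypothetical helper: returns { folder_title => [[feed_title, feed_url], ...] }.
def feeds_by_folder(opml_text)
  doc = REXML::Document.new(opml_text)
  folders = {}
  # The explicit path matches only outlines directly under <body>,
  # so the nested feed outlines never show up as folders themselves.
  REXML::XPath.each(doc, '/opml/body/outline') do |folder|
    feeds = folder.get_elements('outline').map do |feed|
      [feed.attributes['title'], feed.attributes['xmlUrl']]
    end
    folders[folder.attributes['title']] = feeds
  end
  folders
end

feeds_by_folder(OPML_SAMPLE).each do |folder, feeds|
  feeds.each { |title, url| puts "#{folder}: #{title} -> #{url}" }
end
```

Like the delete_if approach, this assumes every subscription lives in a folder: a feed sitting directly under the body would come back as a folder with no feeds in it.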
And then there’s the spiffy little Feed Normalizer gem, that will parse RSS or Atom agnostically, which is good: I don’t want to have to care.
require 'feed-normalizer'
require 'open-uri'

sections.each do |section|
  section.outlines.each do |feed|
    # URI.open needs Ruby 2.5+; on older Rubies use plain open(feed.xmlUrl)
    list = FeedNormalizer::FeedNormalizer.parse(URI.open(feed.xmlUrl))
    latest = list.entries.map { |entry| entry.date_published }.max
    puts "#{section.title} #{feed.title} #{latest}"
  end
end
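For the last step of the plan, reviewing the oldest and deadest blogs first, it may help to collect the results and sort them instead of printing as you go. What follows is a minimal sketch in plain Ruby, assuming each result carries a Time (or nil when no date was found); the FeedStatus struct, the stalest_first method and the sample dates are all hypothetical, invented for illustration.

```ruby
# Hypothetical record for one feed's result; not part of either gem.
FeedStatus = Struct.new(:section, :title, :latest)

# Oldest first. Feeds with no usable date (nil) go to the front,
# since they are the ones most in need of a look.
def stalest_first(statuses)
  undated, dated = statuses.partition { |status| status.latest.nil? }
  undated + dated.sort_by(&:latest)
end

# Made-up sample data, just to show the ordering.
statuses = [
  FeedStatus.new('misc', 'Blog A', Time.utc(2010, 1, 15)),
  FeedStatus.new('misc', 'Blog B', nil),
  FeedStatus.new('misc', 'Blog C', Time.utc(2008, 6, 1))
]

stalest_first(statuses).each do |status|
  puts "#{status.section} #{status.title} #{status.latest || 'no date found'}"
end
```

Putting the undated feeds first is a judgment call: a feed with no dates at all is either broken or very dead, so it seems worth seeing before the merely stale ones.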
Job done.
OK, this is the everything-works-as-expected version, which assumes that files will always exist (they won’t) and that date strings are present and valid (they aren’t), but nobody wants to see a pile of exception- and error-handling code. Or at least, they shouldn’t. Not in a blog post.