Sometimes It’s The Little Things

(stackoverflow rep: 9138, Project Euler 90/274 complete)

From time to time I trawl through my blog subscriptions: some are defunct, while others may have changed their feed details sufficiently that they’re no longer being picked up. I have about 270 subscriptions, which makes the job a chore, and hence it doesn’t get done very frequently. The upshot is that, where a blog has moved rather than died, I sometimes miss something.

What should we do with tedious manual activities? Automate! I went and did some investigation.

Google Reader will, through the “manage subscriptions” link (it’s at the bottom of the subscriptions list in my browser), let you download your details as an XML file (more specifically, Outline Processor Markup Language, or OPML). It looks like this (heavily snipped to avoid excess tedium):

<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.0">
    <head>
        <title>mikewoodhouse subscriptions in Google Reader</title>
    </head>
    <body>
        <outline title="misc" text="misc">
            <outline text="Grumpy Old Programmer"
                title="Grumpy Old Programmer" type="rss"
                xmlUrl="https://grumpyop.wordpress.com/feed/" htmlUrl="https://grumpyop.wordpress.com"/>
            <outline text="Google Code Blog" title="Google Code Blog"
                type="rss"
                xmlUrl="http://google-code-updates.blogspot.com/atom.xml" htmlUrl="http://googlecode.blogspot.com/"/>
        </outline>
    </body>
</opml>

Ignoring for the moment the beauties of XML, this is pretty simple: there’s an outer “outline” that matches the folder I’ve created in Reader, within which is an outline for each feed to which I’m subscribed.
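
To make the folder-then-feed nesting concrete, here is a minimal sketch that walks it using REXML, which comes with standard Ruby installs. It is not the approach used in the rest of this post; the `feeds_by_folder` helper and the cut-down `OPML_SAMPLE` document are made up for illustration, and it assumes every feed sits inside a folder.

```ruby
require 'rexml/document'

# A cut-down OPML document with the same shape as the export above
# (illustrative sample, not a real export).
OPML_SAMPLE = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <opml version="1.0">
    <body>
      <outline title="misc" text="misc">
        <outline title="Grumpy Old Programmer" type="rss"
                 xmlUrl="https://grumpyop.wordpress.com/feed/"/>
        <outline title="Google Code Blog" type="rss"
                 xmlUrl="http://google-code-updates.blogspot.com/atom.xml"/>
      </outline>
    </body>
  </opml>
XML

# Hypothetical helper: maps each folder title to a list of
# [feed title, feed URL] pairs.
def feeds_by_folder(opml_text)
  doc = REXML::Document.new(opml_text)
  folders = {}
  # Outlines directly under <body> are treated as folders...
  doc.elements.each('opml/body/outline') do |folder|
    feeds = []
    # ...and the outlines directly inside each folder as feeds.
    folder.elements.each('outline') do |feed|
      feeds << [feed.attributes['title'], feed.attributes['xmlUrl']]
    end
    folders[folder.attributes['title']] = feeds
  end
  folders
end

feeds_by_folder(OPML_SAMPLE).each do |folder, feeds|
  feeds.each { |title, url| puts "#{folder}: #{title} (#{url})" }
end
```

Note that the feed address lives in the `xmlUrl` attribute of each inner outline, which is the only piece needed later on.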

What I wanted to do is something like this:

  • parse the OPML, extracting the xmlUrl attribute of each feed;
  • download the feed from that URL;
  • scan the entry listing in the feed to find the latest entry date, as a proxy for the last-known activity on that blog;
  • review the blogs that seemed oldest and deadest for update or removal.

Simples!

Well, with a little Googling and not much Rubying, it actually turned out to be so. John Nunemaker’s HappyMapper gem does a quick enough job of the parsing:

require 'happymapper'
module OPML
  class Outline
    include HappyMapper
    tag 'outline'
    attribute :title, String
    attribute :xmlUrl, String   # the feed address; absent on folder outlines
    has_many :outlines, Outline # nested outlines: the feeds inside a folder
  end
end

sections = OPML::Outline.parse(File.read("google-reader-subscriptions.xml"))
sections.delete_if { |section| section.outlines.empty? } # drop childless outlines: the feeds that were also picked up in their own right

The delete_if part is there to cater for my parse creating duplicates of the “child” outlines: once in their own right and once within their parent section. I’m pretty sure I’ve seen how to avoid that somewhere, but for now this will do, since all my subscriptions live in folders. It leaves something there for the next iteration.
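
For that next iteration, one possible way to keep feeds that are not in any folder while still dropping the duplicates is sketched below. This is an assumption-laden sketch, not a HappyMapper feature: the `Outline` struct merely stands in for `OPML::Outline` so the example runs on its own, `split_outlines` is a hypothetical helper, and duplicates are assumed to be identifiable by their `xmlUrl`.

```ruby
# Stand-in for OPML::Outline, so the sketch runs without HappyMapper.
Outline = Struct.new(:title, :xmlUrl, :outlines)

# Hypothetical helper: splits a flat parse result into folders and
# feeds that live outside any folder.
def split_outlines(all_outlines)
  folders = all_outlines.reject { |o| o.outlines.empty? }
  # URLs of feeds already reachable through a folder.
  filed_urls = folders.flat_map { |f| f.outlines.map(&:xmlUrl) }
  # A childless outline is "unfiled" only if no folder contains its URL.
  unfiled = all_outlines.select do |o|
    o.outlines.empty? && !filed_urls.include?(o.xmlUrl)
  end
  [folders, unfiled]
end

# Made-up data: one folder, its feed (parsed twice), and one loose feed.
grumpy = Outline.new('Grumpy Old Programmer', 'https://grumpyop.wordpress.com/feed/', [])
loose  = Outline.new('Unfiled Blog', 'http://example.com/feed', [])
misc   = Outline.new('misc', nil, [grumpy])

folders, unfiled = split_outlines([misc, grumpy, loose])
puts folders.map(&:title).inspect # => ["misc"]
puts unfiled.map(&:title).inspect # => ["Unfiled Blog"]
```

A feed that appears both inside a folder and loose at the top level would be treated as filed here, which seems a tolerable simplification for a tidy-up script.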

And then there’s the spiffy little Feed Normalizer gem, that will parse RSS or Atom agnostically, which is good: I don’t want to have to care.

require 'feed-normalizer'
require 'open-uri'

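sections.each do |section|
  section.outlines.each do |feed|
    # URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
    list = FeedNormalizer::FeedNormalizer.parse(URI.open(feed.xmlUrl))
    # the newest entry date stands in for the blog's last-known activity
    latest = list.entries.map { |entry| entry.date_published }.max
    puts "#{section.title} #{feed.title} #{latest}"
  end
end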
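
The loop above prints feeds in subscription order. For the last step of the plan, reviewing the oldest and deadest first, the results could be collected and sorted instead. A minimal sketch, assuming the dates are `Time` objects and a missing date shows up as `nil`; the `FeedStatus` struct, the `stalest_first` helper and the sample data are all hypothetical:

```ruby
# Hypothetical record for one feed's result.
FeedStatus = Struct.new(:section, :title, :latest)

# Oldest first. Feeds with no usable date (nil) are treated as
# infinitely old, so they land at the top of the review list.
def stalest_first(statuses)
  statuses.sort_by { |s| s.latest || Time.at(0) }
end

# Made-up results standing in for what the loop would collect.
statuses = [
  FeedStatus.new('misc', 'Active Blog', Time.utc(2010, 1, 20)),
  FeedStatus.new('misc', 'Quiet Blog', Time.utc(2008, 6, 1)),
  FeedStatus.new('misc', 'No Dates Blog', nil)
]

stalest_first(statuses).each do |s|
  puts "#{s.section} #{s.title} #{s.latest || 'unknown'}"
end
```

Putting the dateless feeds first is a design choice: a feed with no dates at all is at least as suspicious as one that has gone quiet.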

Job done.

OK, this is the everything-works-as-expected version, which assumes files will always exist (they won’t), date strings are present and valid (they aren’t), but nobody wants to see a pile of exception- and error-handling code. Or at least, they shouldn’t. Not in a blog post.


