Sometimes It’s The Little Things

(stackoverflow rep: 9138, Project Euler 90/274 complete)

From time to time I trawl through my blog subscriptions: some are defunct while others may have changed their feed details sufficiently that they’re no longer being picked up. I have about 270 subscriptions, which makes the job a chore and hence it doesn’t get done very frequently. The upshot is that, for the blogs that have moved rather than died, I sometimes miss something.

What should we do with tedious manual activities? Automate! I went and did some investigation.

Google Reader will, through the “manage subscriptions” link (it’s at the bottom of the subscriptions list in my browser) let you download your details in an XML (more specifically, Outline Processor Markup Language, or OPML) file. It looks like this (heavily snipped to avoid excess tedium):

<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.0">
    <head>
        <title>mikewoodhouse subscriptions in Google Reader</title>
    </head>
    <body>
        <outline title="misc" text="misc">
            <outline text="Grumpy Old Programmer"
                title="Grumpy Old Programmer" type="rss"
                xmlUrl="https://grumpyop.wordpress.com/feed/" htmlUrl="https://grumpyop.wordpress.com"/>
            <outline text="Google Code Blog" title="Google Code Blog"
                type="rss"
                xmlUrl="http://google-code-updates.blogspot.com/atom.xml" htmlUrl="http://googlecode.blogspot.com/"/>
        </outline>
    </body>
</opml>

Ignoring for the moment the beauties of XML, this is pretty simple: there’s an outer “outline” that matches the folder I’ve created in Reader, within which is an outline for each feed to which I’m subscribed.

What I wanted to do is something like this:

  • parse the OPML, extracting the xmlUrl attribute of each feed;
  • download the feed from that URL;
  • scan the entry listing in the feed to find the latest entry date, as a proxy for the last-known activity on that blog;
  • review the blogs that seemed oldest and deadest for update or removal.

Simples!

Well, with a little Googling and not much Rubying, it actually turned out to be so. John Nunemaker’s HappyMapper gem does a quick enough job of the parsing:

require 'happymapper'
module OPML
  class Outline
    include HappyMapper
    tag 'outline'
    attribute :title, String
    attribute :xmlUrl, String    # the feed URL; absent on folder-level outlines
    has_many :outlines, Outline  # nested outlines: the feeds inside a folder
  end
end

sections = OPML::Outline.parse(File.read("google-reader-subscriptions.xml"))
sections.delete_if { |section| section.outlines.size == 0 } # drop outlines with no children: those are feeds, not folders

The delete_if part is there to cater for my parse creating duplicates of the “child” outlines: once in their own right and once within their parent section. I’m pretty sure I’ve seen how to avoid that somewhere, but for now this will do, since all my subscriptions live in folders. It leaves something there for the next iteration.
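For the curious, here is one way the duplicates might be avoided altogether. It is only a minimal sketch, and it swaps HappyMapper for REXML (which ships with Ruby), using an XPath query to pick out just the outlines that carry an xmlUrl attribute. The method name feeds_from_opml, the SAMPLE_OPML constant and the folderless "Loose Feed" at example.com are all made up for illustration; this is not how HappyMapper itself would solve it.

```ruby
require 'rexml/document'

# Returns one [folder_title, feed_title, feed_url] triple per feed.
# Only <outline> elements with an xmlUrl attribute count as feeds,
# so folder outlines never appear as entries in their own right.
def feeds_from_opml(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//outline[@xmlUrl]').map do |feed|
    parent = feed.parent
    # A feed sitting directly under <body> has no folder.
    folder = parent.name == 'outline' ? parent.attributes['title'] : nil
    [folder, feed.attributes['title'], feed.attributes['xmlUrl']]
  end
end

# Hypothetical sample: one feed in a folder, one without.
SAMPLE_OPML = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <opml version="1.0">
    <body>
      <outline title="misc" text="misc">
        <outline text="Grumpy Old Programmer" title="Grumpy Old Programmer" type="rss"
                 xmlUrl="https://grumpyop.wordpress.com/feed/" htmlUrl="https://grumpyop.wordpress.com"/>
      </outline>
      <outline text="Loose Feed" title="Loose Feed" type="rss"
               xmlUrl="http://example.com/feed" htmlUrl="http://example.com"/>
    </body>
  </opml>
XML

feeds_from_opml(SAMPLE_OPML).each do |folder, title, url|
  puts "#{folder || '(no folder)'} | #{title} | #{url}"
end
```

A side effect of selecting on xmlUrl is that feeds living outside any folder are picked up too, which the delete_if approach relies on not having.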

And then there’s the spiffy little Feed Normalizer gem, which will parse RSS or Atom agnostically. That’s good: I don’t want to have to care.

require 'feed-normalizer'
require 'open-uri'

sections.each do |section|
  section.outlines.each do |feed|
    # URI.open is open-uri's entry point on Ruby 2.5+; older Rubies used a bare open()
    list = FeedNormalizer::FeedNormalizer.parse(URI.open(feed.xmlUrl))
    latest = list.entries.map { |entry| entry.date_published }.max
    puts "#{section.title} #{feed.title} #{latest}"
  end
end

Job done.

OK, this is the everything-works-as-expected version, which assumes that feeds will always exist (they won’t) and that date strings are present and valid (they aren’t). But nobody wants to see a pile of exception- and error-handling code. Or at least, they shouldn’t. Not in a blog post.
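For anyone who does want a peek, the defensive bits might look roughly like the sketch below. It is not the code I actually ran: latest_date, safely and sort_stalest_first are hypothetical helper names, and the "Blog A/B/C" rows are invented sample data standing in for real feeds. It needs Ruby 2.7 or later for filter_map.

```ruby
require 'time'

# Returns the most recent usable date from a feed's entries,
# or nil when no entry has a date that can be parsed.
def latest_date(raw_dates)
  raw_dates.filter_map do |raw|
    next raw if raw.is_a?(Time)
    begin
      Time.parse(raw.to_s)
    rescue ArgumentError
      nil # missing or malformed date: skip this entry
    end
  end.max
end

# Runs the block that fetches and parses one feed; logs the failure
# and returns nil instead of letting one dead feed stop the whole run.
def safely(label)
  yield
rescue StandardError => e
  warn "#{label}: #{e.class}: #{e.message}"
  nil
end

# Orders [title, date] rows so that feeds with no usable date come
# first, followed by the rest from oldest to newest.
def sort_stalest_first(rows)
  rows.sort_by { |_title, date| date ? [1, date] : [0, Time.at(0)] }
end

# Invented sample data: a feed with messy dates, a feed that fails, a stale feed.
rows = [
  ['Blog A', latest_date(['Mon, 01 Jun 2009 10:00:00 GMT', nil, 'garbage'])],
  ['Blog B', safely('Blog B') { raise IOError, 'feed not found' }],
  ['Blog C', latest_date(['2008-03-15T09:30:00Z'])]
]

sort_stalest_first(rows).each do |title, date|
  puts "#{title}: #{date || 'no usable date'}"
end
```

In the real loop, the block passed to safely would wrap the download and the FeedNormalizer parse, and the sorted list is then the review queue for the last step: the deadest-looking blogs float to the top.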

harrylillis.com would probably have been cheaper

Putting 2 and 2 together, Jeff Atwood appears to have paid* a fairly large (to me) sum to acquire the superuser.com domain. I wonder how much Microsoft paid for bing.com?

I switched my default search engine to Microsoft’s new beta search engine yesterday. Today I switched back to Google. Not that Bing was all that bad – to be honest I couldn’t see much difference between what it gave me and what I see from Google. The background picture, which yesterday I guessed was of some Greek island (today it’s somewhere different, but similarly attractive), was certainly pleasant.

The killer was that after Firefox (3.0.10) reported that the page load was “Done” (and the results certainly seemed to be present) there was a delay – during which time FF froze completely – of about 12 seconds, after which my browser shook itself and woke up.

I don't need to search to find Microsoft being annoying...

Bing could be the best search engine in the world ever and I’d still not use it if that delay were present. I can’t believe it occurs for all users on all browsers, but I’m only me and I prefer Firefox. It could be some interaction with one or more of my – fairly standard – plugins. Maybe I’ll try it again when the beta is done.

Your mileage, of course, may vary.

* Unless of course he was tweeting about another one, which is entirely possible.