Sometimes It’s The Little Things

(stackoverflow rep: 9138, Project Euler 90/274 complete)

From time to time I trawl through my blog subscriptions: some are defunct while others may have changed their feed details sufficiently that they’re no longer being picked up. I have about 270 subscriptions, which makes the job a chore, so it doesn’t get done very often. The upshot is that, where a blog hasn’t simply died, I sometimes miss something.

What should we do with tedious manual activities? Automate! I went and did some investigation.

Google Reader will, through the “manage subscriptions” link (it’s at the bottom of the subscriptions list in my browser), let you download your details as an XML (more specifically, Outline Processor Markup Language, or OPML) file. It looks like this (heavily snipped to avoid excess tedium):

<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.0">
    <head>
        <title>mikewoodhouse subscriptions in Google Reader</title>
    </head>
    <body>
        <outline title="misc" text="misc">
            <outline text="Grumpy Old Programmer"
                title="Grumpy Old Programmer" type="rss"
                xmlUrl="https://grumpyop.wordpress.com/feed/" htmlUrl="https://grumpyop.wordpress.com"/>
            <outline text="Google Code Blog" title="Google Code Blog"
                type="rss"
                xmlUrl="http://google-code-updates.blogspot.com/atom.xml" htmlUrl="http://googlecode.blogspot.com/"/>
        </outline>
    </body>
</opml>

Ignoring for the moment the beauties of XML, this is pretty simple: there’s an outer “outline” that matches the folder I’ve created in Reader, within which is an outline for each feed to which I’m subscribed.

What I wanted to do was something like this:

  • parse the OPML, extracting the xmlUrl attribute for each feed;
  • download each feed from that URL;
  • scan the entry listing in the feed to find the latest entry date, as a proxy for the last-known activity on that blog;
  • review the blogs that seemed oldest and deadest for update or removal.

Simples!

Well, with a little Googling and not much Rubying, it actually turned out to be so. John Nunemaker’s HappyMapper gem does a quick enough job of the parsing:

require 'happymapper'

module OPML
  class Outline
    include HappyMapper
    tag 'outline'
    attribute :title, String
    attribute :xmlUrl, String
    has_many :outlines, Outline
  end
end

sections = OPML::Outline.parse(File.read("google-reader-subscriptions.xml"))
sections.delete_if { |section| section.outlines.size == 0 } # drop the duplicated child outlines that also show up at the top level

The delete_if part is there to cater for my parse creating duplicates of the “child” outlines: once in their own right and once within their parent section. I’m pretty sure I’ve seen how to avoid that somewhere, but for now this will do, since all my subscriptions live in folders. It leaves something there for the next iteration.

And then there’s the spiffy little Feed Normalizer gem, which will parse RSS or Atom agnostically; that’s good, because I don’t want to have to care.

require 'feed-normalizer'
require 'open-uri'

sections.each do |section|
  section.outlines.each do |feed|
    list = FeedNormalizer::FeedNormalizer.parse(open(feed.xmlUrl))
    latest = list.entries.map { |entry| entry.date_published }.max
    puts "#{section.title} #{feed.title} #{latest}"
  end
end
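
That also covers the last step on my list, near enough: collect the results instead of printing as you go, sort, and the deadest-looking blogs float to the top. A sketch of that variant, with the same everything-works assumptions as the loop above:

# Collect (section, feed title, latest entry date) tuples, then list the
# staleness suspects first. Feeds with no parseable date sort to the top,
# since they're the prime candidates for the chop.
results = []
sections.each do |section|
  section.outlines.each do |feed|
    list = FeedNormalizer::FeedNormalizer.parse(open(feed.xmlUrl))
    latest = list.entries.map { |entry| entry.date_published }.max
    results << [section.title, feed.title, latest]
  end
end

results.sort_by { |r| r[2] || Time.at(0) }.each do |r|
  puts "#{r[2] || 'no date found'}  #{r[0]} / #{r[1]}"
end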

Job done.

OK, this is the everything-works-as-expected version, which assumes files will always exist (they won’t), date strings are present and valid (they aren’t), but nobody wants to see a pile of exception- and error-handling code. Or at least, they shouldn’t. Not in a blog post.

Wax On, Wax Off, Wax Lirrical?

(stackoverflow rep: 8502, Project Euler 88/266 complete)

One of the presentations at the Stack Overflow London DevDay (and I believe something similar occurred at several of the other events) was an introduction to Python, conducted via a dissection of a classic piece of code developed by Peter Norvig, currently Chief Scientist at Google.

The code is a “toy” spell-check routine and the talk focused on some of the “interesting” aspects of Python that help to make it powerful despite being very concise. List comprehensions, in particular.

What wasn’t so apparent at the time was how the code actually worked, and I wanted to understand it. Plus, in my competitive way, and given that I kind of left Python behind a few years back*, I wondered whether a Ruby implementation could be as terse.

So, obviously enough, given that I’m writing this, I wrote one. I didn’t quite get to Norvig’s 21 lines: Python not needing “end”s gives it a big advantage. If Ruby had significant whitespace I’d have made it. Actually, if I were prepared to make one really long line out of the Array creation inside edits1 below I’d have squeaked under, but we’re not playing code golf and I balk at 170-character code lines.

There are several other-language implementations listed at the bottom of the page, including a Ruby one. On my WinXP machine, running straight MRI 1.8.6, my version is a gratifying 2.7 times faster, as well as being about 8 lines shorter.

Having got there, and achieved the original purpose of gaining a reasonable understanding of the algorithm in the process, I’m less than wholly satisfied with the outcome. It does the job, but it’s a long way from being idiomatic Ruby or being useful as anything more than a demonstration. Since code kata** seem to be on the menu this month, I may go back and code this up some more different ways to explore the algorithm, the language, the environment and my own personal inadequacies. For one thing, I’m curious to see how I would test-drive the development of the algorithm.

Anyway, FWIW, this is what I have so far…

Alphabet = ('a'..'z').to_a

NWORDS = begin
  words = Hash.new(0)
  File.read('big.txt').downcase.scan(/[a-z]+/).each { |w| words[w] += 1 }
  words
end

def correct(word)
  (known([word]) || known(edits1(word)) || known_edits2(word) || [word]).max { |a, b| NWORDS[a] <=> NWORDS[b] }
end

def edits1(word)
  n = word.size
  (0..n).map do |i|
    a, b = [word[0, i], word[i - n, n - i]] # split the word at position i
    [ (a + b[1, n] unless b.empty?),                          # deletion
      (a + b[1, 1] + b[0, 1] + b[2, n] unless b.size < 2),    # transposition
      (Alphabet.map { |c| a + c + b[1, n] } unless b.empty?), # replacement
      Alphabet.map { |c| a + c + b } ]                        # insertion
  end.flatten.uniq
end

def known_edits2(word)
  list = edits1(word).map { |e1| edits1(e1).select { |e2| NWORDS.has_key?(e2) } if e1 }.flatten.compact
  list.empty? ? false : list # an empty Array is truthy in Ruby, so be explicit about the fallback
end

def known(words)
  list = words.select { |w| NWORDS.has_key?(w) }
  list.size > 0 ? list : false
end

And yes, it does come up with the right correction for the misspelling in the title.
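
A quick usage sketch, assuming big.txt (Norvig’s training corpus) is sitting in the working directory:

puts correct('speling')  # "spelling" - Norvig's own example
puts correct('lirrical') # "lyrical" - the title's misspelling put right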


* Not Python’s fault – the project where I was using it ceased to be, and most of my work for the last year or so has been web-based, with a choice of Java- or Ruby/Rails frameworks to work with. I grasped the latter like a British MP grasps an expense claim.

** One of Uncle Bob’s favourite kata is prime factoring. There’s a Ruby distillation of such an algorithm at Ben Rady’s blog. I don’t know when I’ve been more tickled by an algorithm. I’d love to see the refactoring steps that got to it.
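
For flavour, here’s a common Ruby distillation of that kata (not necessarily the version on Ben’s blog):

# Peel off each candidate divisor while it still divides n; by the time a
# composite divisor is reached, its own prime factors have already been removed.
def prime_factors(n)
  factors = []
  divisor = 2
  while n > 1
    while n % divisor == 0
      factors << divisor
      n /= divisor
    end
    divisor += 1
  end
  factors
end

prime_factors(360) # => [2, 2, 2, 3, 3, 5]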

Round-tripper

(stackoverflow rep: 3856, Project Euler 63/235 complete)

Good grief. I wrote the first draft of this about a month ago, planning on completing and posting it when the code was done. I expected that to take a few more days. A little more work required on the estimating front, then.

I’m starting to go off Oracle.

Let me put that into context a little. I first encountered Oracle some time in 1988, when version 5 was all the rage. I’d actually taught data analysis, third and fifth normal form, stuff like that for a few years previously, but actual hands-on table creation had to wait. Strange but true. Anyway, over the next two or three years, some of which I spent as “Technical Architect” for the investment bank where I worked, I got to be something of a whiz with both version 5 and the swanky new version 6. Heck, I knew the query optimiser’s rules off by heart. I’m not just blowing my own trumpet, mind: when I was untimately (fortuitous typo retained) laid off, I was offered a job by Oracle, which I rejected because I didn’t want to take a pay cut.

I spent five more years in Oracle-land with another bank before drifting into the realms of Sybase in its early MS SQL Server guise, and then Sybase itself across three jobs and four years (it seems like longer). Now, fourteen years after we parted company, Oracle and I are back together.

But we’ve both changed. I no longer code in COBOL and have acquired a pathological dislike of business logic in the database. Oracle has a cost-based optimiser, loves to grab all your business rules (more processors = more revenue) and has become a fat, bloated, porcine creation. Even the free “personal” 10g Express Edition for Windows is a 165MB download (OK, SQL Server Express 2008 is even larger, I checked). When running, the thing takes out 642MB of virtual memory. OK, it’s almost entirely swapped out, but still.

How we did parallel processing in the old days

But Oracle is still a helluva fast platform. Unoptimised, I was seeing about 8K inserts a minute on my development PC, and three times that on a real server. Unfortunately our db server currently lives abroad for tax reasons (or something) and the network latency is fierce. About 900 inserts a minute fierce. So I needed to batch up my inserts or enter the living hell that is SQL*Loader.

In order to get multiple insert processes working within my Ruby on Rails-based code, I split each file into several fragments, then run a separate process on each fragment. This takes a bit of doing, generating lots of CMD files that run Ruby scripts with “START [/WAIT] CMD /C whatever_I_want_goes_here”.
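
A sketch of the sort of generator I mean; load_fragment.rb and the file names here are made-up stand-ins, not the real scripts:

# Write one .cmd per fragment, each launching its own Ruby process, plus a
# master .cmd to kick them all off in parallel.
fragment_paths = (1..4).map { |i| "fragment_#{i}.csv" }

fragment_paths.each_with_index do |path, i|
  File.open("load_#{i}.cmd", 'w') do |cmd|
    cmd.puts %(START CMD /C ruby load_fragment.rb #{path})
  end
end

File.open('load_all.cmd', 'w') do |cmd|
  fragment_paths.each_index { |i| cmd.puts "CALL load_#{i}.cmd" }
end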

My file-splitting code, I thought, was rather spiffy: it needs to put the headings from the original into each fragment (because they’re used to figure out what’s in the file), then it starts dealing out the records:

def create_fragment_files(paths)
  # file_path (the source file) comes from the surrounding class
  File.open(file_path, 'r') do |fin|
    hdgs = fin.readline.chomp
    files = paths.map { |path| File.open(path, 'w+') }
    files.each { |fout| fout.puts hdgs }
    fin.each_line do |line|
      files.first.puts line
      files.push files.shift # the first shall be last...
    end
    files.each { |fout| fout.close }
  end
end

There are faster ways, I’m sure – I could calculate the “ideal” file size and dump records into a file until it’s reached, but this is fast enough (well under a minute for an 85MB file) and it pleases me.
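
For the record, a sketch of that size-based alternative; the method name and the explicit source_path argument are mine, not the real code:

# Fill each fragment up to a rough target size; the last one takes whatever remains.
def create_fragment_files_by_size(source_path, paths)
  target = File.size(source_path) / paths.size # "ideal" fragment size, in bytes
  File.open(source_path, 'r') do |fin|
    hdgs = fin.readline.chomp
    paths.each_with_index do |path, i|
      last = (i == paths.size - 1)
      File.open(path, 'w+') do |fout|
        fout.puts hdgs
        written = 0
        until fin.eof? || (!last && written >= target)
          line = fin.readline
          fout.puts line
          written += line.size
        end
      end
    end
  end
end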

There’s a handy little library, ar-extensions, that makes batching of inserts possible within ActiveRecord (which is the default data mapping library within Rails). It works nicely with MySQL, but turned out to have the Oracle code stubbed and invalid. It only took me a day or two to find a solution to that problem, although I still haven’t figured out how to push an update through a proxy server to github. Finally a chance to do something open sourceful, and I’m thwarted at every turn.
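
To give a flavour of what the batching looks like with ar-extensions’ import method; the Reading model, its columns and parsed_lines are made-up stand-ins for my real tables, and the gem is assumed to be loaded by the Rails environment:

require 'enumerator' # for each_slice on Ruby 1.8.6

# One multi-row INSERT per slice instead of one round trip per record.
columns = [:recorded_at, :meter_id, :value]
rows    = parsed_lines.map { |l| [l.recorded_at, l.meter_id, l.value] }

rows.each_slice(1000) do |batch|
  Reading.import columns, batch, :validate => false
end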

So all in all, it’s taken a month. OK, a month in which a lot of other stuff got done, but still. On the plus side, I just fired it up and I’m watching about 36,000 inserts a minute go through. It’ll be faster when the lookup tables are fully populated. (Another day on, and I’m looking at it: 46,000 – and I still have a few tricks up my sleeve.)

While the nearly two years’ worth of data backfills, I now get to rewrite the front end.

And the point of this post? In no small part, to remind me of what I actually spent the lion’s share of the last month doing. Also, to record my first-ever open-source contribution, even if I still haven’t worked out how to get my source out into the open.

If you have been, thanks for your forbearance.

Route 55 (and Route 19)

(Stackoverflow reputation down to 2232 after they cleaned up some over-voted stuff from the early days, sniff)

While I prevaricate over all kinds of things, including a redesign of the xlUnit interface, I have been enjoying Michael’s series of articles on Project Euler solutions in VBA, posted at Daily Dose of Excel. There are some ingenious solutions to long-standing VB/VBA deficiencies, not least the absence of built-in facilities for handling arbitrarily large numbers beyond double-precision variables.

I have an abiding fondness for the “classic” VB family – not least because it was the skill that fed and housed me and my family for a good 15 years or more. But boy, it can look a bit tired these days.

As it happens, I’d been taking a few shots at the Euler problems myself, but in Ruby, since that’s my language of choice these days (not least because much of my working day is currently spent working on intranet applications using Rails). So it was interesting to compare the two.

Let’s take problem 55. I took Michael’s code (with the neat little large number AddAsStrings routine) into Excel on my whizz-bang dual-dual-Xeon machine and it solved the problem in 0.554 seconds, which, considering the amount of string-based arithmetic that’s going on, is a testament to the speed of modern PCs.

Below is my Ruby version, which takes a rather different approach. Firstly, in Ruby we have support for arbitrarily large numbers, via the built-in Bignum class, so the string adding business is taken care of. Secondly, classes in Ruby, even compiled standard ones, are open to modification, via a technique colloquially known as monkey-patching. So I could patch in a method directly to the Integer class, which seems appropriate, since we’re looking for a property of the number.

Here’s the code:

class Integer
  def lychrel?(max = 50)
    temp = self
    max.times do
      temp = temp + temp.to_s.reverse.to_i
      return false if temp.to_s == temp.to_s.reverse
    end
    true
  end
end
puts (1..9999).inject(0) { |t, i| t + (i.lychrel? ? 1 : 0) }

That took 0.410 seconds on the same machine. I can see at least one inefficiency: calling to_s twice on the same number, which is expensive.
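
A quick sketch of that tweak, building the string form once per iteration and reusing it:

class Integer
  def lychrel?(max = 50)
    temp = self
    max.times do
      temp += temp.to_s.reverse.to_i
      s = temp.to_s                  # one to_s per check...
      return false if s == s.reverse # ...reused for the palindrome test
    end
    true
  end
end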

On the other hand, VBA has the edge on problem 19. I spotted a little optimisation in Michael’s code, which gave me the VBA below: at 0.0019 seconds it’s about 20 times faster than the original:

Dim Start   As Date
Dim Answer  As Long
Start = DateSerial(1901, 1, 1)
Do While Start < DateSerial(2001, 1, 1)
   If Weekday(Start) = vbSunday Then
      Answer = Answer + 1
   End If
   Start = DateSerial(Year(Start), Month(Start) + 1, 1)
Loop
Debug.Print Answer

Ruby’s Date class doesn’t do anything clever with months outside the 1 to 12 range, so I had to inject a little logic, but otherwise we’re pretty much in synch, algorithmically:

require 'date'

res = 0
d = Date.new(1901, 1, 1)
end_date = Date.new(2000, 12, 31)
until d > end_date do
  res += 1 if d.cwday == 7
  y, m = d.year, d.month + 1
  y, m = y + 1, 1 if m > 12
  d = Date.new(y, m, 1)
end
puts res
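
For what it’s worth, Date#>> adds calendar months (year rollover included), which would fold that extra logic away:

require 'date'

res = 0
d = Date.new(1901, 1, 1)
until d > Date.new(2000, 12, 31)
  res += 1 if d.cwday == 7
  d = d >> 1 # next month; rollover into January handled for us
end
puts res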

Not much in it, lines-of-code-wise, as you’d probably expect, but the Ruby code takes about 0.36 seconds, which is a hell of a difference.

Call it one-all for now.

Whither (Wither?) VBA?

Considering the origins of the various Office components (and a light bulb may be flickering dimly in the minds of those who can remember back that far) we’ve come a long way in the application automation stakes. Can you remember Access Basic? WordBasic? Excel macros pre-VBA? Did early versions of Outlook have macros?

Office application macro capabilities have come a long way but they’ve been pretty much stuck at the last version of the VB6 runtime. That’s about a decade with no significant change. In that time Microsoft have hammered their way up to version 3.5 of .NET, but with only half-arsed (my opinion) gestures made towards improving/extending/renewing the internal automation aspects of Office.

I say “half-arsed”, which, I dunno, might be a little harsh, but the whole VSTO thing just seems like a thin wrapper on COM Interop, which is itself a wrapper to permit communication between shiny new .NET code and skanky old legacy stuff. Why do I need to use VSTO at all? If I need complex, compiled high-performance extensions then I’m probably better off getting one of my C++ literate colleagues to write a “proper” XLL add-in that won’t have to deal with COM at all. If I don’t need high-performance then any scripting language that can talk to COM will do the job. Heck, I can use Ruby (and do) – David Mullett has a whole blog on the topic of Windows automation with Ruby.
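
To illustrate, the standard win32ole library will happily drive Excel over COM from plain Ruby; a minimal sketch (Windows-only, and the path is just an example):

# No VSTO, no XLL: just COM automation from a Ruby script.
require 'win32ole'

xl = WIN32OLE.new('Excel.Application')
xl.Visible = true

wb = xl.Workbooks.Add
ws = wb.Worksheets(1)
ws.Range('A1').Value = 'Hello from Ruby'
ws.Range('A2').Value = Time.now.strftime('%Y-%m-%d %H:%M')

wb.SaveAs('C:\\ruby_via_com.xls') # example path only
xl.Quit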

Microsoft want to get away from VBA, I think that’s clear. They’re never, never, never going to get the current host of non-technical VBA users to switch to VSTO. Forget it, it’s not going to happen. Hell, I don’t want to have to use VSTO and I’m one who should benefit from raising the cost of entry to macro programming. Do MS want to get away from COM? Maybe. They wanted to get away from DDE too, but it’s still lurking somewhere not too deep inside Windows.

How to make an old programmer slightly less grumpy

But here we have the Dynamic Language Runtime, which sits on top of the .NET CLR and allows fun things such as IronPython, IronRuby and others. Snappy performance, ability to use .NET libraries, interoperability between languages, sounds like fun. According to Wikipedia, the plan is to move VB to run on it. Now there’s a thought: why shouldn’t Excel be rebuilt in sexy modern managed code, with VBA ported to the DLR and the old COM interfaces reduced to a shim to keep backwards compatibility? Then we’d have macro programming where it should be, in the application, with the billions of lines of legacy code still runnable, and I’d be able to hit Alt-F11 and work in Ruby.

Seems like a win-win scenario to me.