Wax On, Wax Off, Wax Lirrical?

(stackoverflow rep: 8502, Project Euler 88/266 complete

One of the presentations at the Stack Overflow London DevDay (and I believe something similar occurred at several of the other events) was an introduction to Python, conducted via a dissection of a classic piece of code developed by Peter Norvig, currently Chief Scientist at Google.

The code is a “toy” spell-check routine and the talk focused on some of the “interesting” aspects of Python that help to make it powerful despite being very concise. List comprehensions, in particular.

What wasn’t so apparent at the time was how the code actually worked, and I wanted to understand it. Plus, in my competitive way, and given that I kind of left Python behind a few years back*, I wondered whether a Ruby implementation could be as terse.

So, obviously enough, given that I’m writing this, I wrote one. I didn’t quite get to Norvig’s 21 lines: Python not needing “end”s gives it a big advantage. If Ruby had significant white space I’d have made it. Actually, if I was prepared to make one really long line out of the Array creation (lines 17-20 below) I’d have squeaked under, but we’re not playing code golf and I balk at 170-character code lines.

There are several other-language implementations listed at the bottom of the page, including a Ruby one. On my WinXP machine, running straight MRI 1.8.6, my version is a gratifying 2.7 times faster, as well as being about 8 lines shorter.

Having got there, and achieved the original purpose of gaining a reasonable understanding of the algorithm in the process, I’m less than wholly satisfied with the outcome. It does the job, but it’s a long way from being idiomatic Ruby or being useful as anything more than a demonstration. Since code kata** seem to be on the menu this month, I may go back and code this up some more different ways to explore the algorithm, the language, the environment and my own personal inadequacies. For one thing, I’m curious to see how I would test-drive the development of the algorithm.

Anyway, FWIW, this is what I have so far…

Alphabet = ('a'..'z').to_a

NWORDS = begin
 words = Hash.new(0)
 File.read('big.txt').downcase.scan(/[a-z]+/).each { |w| words[w] += 1 }
 words
end

def correct(word)
 (known([word]) || known(edits1(word)) || known_edits2(word) || [word]).max{|a,b| NWORDS[a] <=> NWORDS[b]}
end

def edits1(word)
 n = word.size
 (0..n).map do |i|
 a, b = [word[0, i], word[i-n, n-i]]
 [ (a + b[1, n] unless b.empty?),
 (a + b[1, 1] + b[0, 1] + b[2, n] unless b.size < 2),
 (Alphabet.map { |c| a + c + b[1, n] } unless b.empty?),
 Alphabet.map { |c| a + c + b } ]
 end.flatten.uniq
end

def known_edits2(word)
 edits1(word).map { |e1| edits1(e1).select { |e2| NWORDS.has_key?(e2) } if e1 }.flatten
end

def known(words)
 list = words.select { |w| NWORDS.has_key?(w) }
 list.size > 0 ? list : false
end

And yes, it does come up with the right correction for the misspelling in the title.


* Not Python’s fault – the project where I was using it ceased to be, and most of my work for the last year or so has been web-based, with a choice of Java- or Ruby/Rails frameworks to work with. I grasped the latter like a British MP grasps an expense claim.

** One of Uncle Bob’s favourite kata is prime factoring. There’s a Ruby distillation of such an algorithm at Ben Rady’s blog. I don’t know when I’ve been more tickled by an algorithm. I’d love to see the refactoring steps that got to it.