The Probability of Culpability

(stackoverflow rep: 10,976, Project Euler 98/288 complete)

(in which we are reminded that finding solutions to most of the issues we face requires looking no further than our own immediate surroundings)

“I think there’s a problem with the web server – one of my pages isn’t displaying”.

“Strange – everything else seems to be fine – have you made any changes recently?”

“No – and it was fine yesterday. It must be the server; I’m going to escalate it to support.”

A few minutes later, I discovered that a directory had, as a result of a deviant drag-and-drop, been relocated inside another directory, with the result that the failing page’s Javascripts were no longer loading. Panic over, issue de-escalated. It was our team’s collective fault after all.

I have, over the years, found myself working in a support capacity. Needs must, from time to time. I doubt very much that I’m the only support person who ever had a user who claimed that they’d found a bug in Windows/a compiler/Excel/some other third-party application. I also suspect that the number of occasions where the actual error really did reside where the user asserted it did is small. Vanishingly small.

Ask not at what the finger pointeth...

As a general rule of thumb, when faced with a perplexing error, we may consider the “stack” of software within which our problem has arisen. With that in mind, we should consider how much effort has been invested into assuring the correctness of each layer. Take an Excel/VBA application running on, say, Windows 7. I’d guess that the test effort invested in Windows would be an order of magnitude greater – at least – than for Excel. And the step should be as great when we come down to the app we built. That’s reasonable – it reflects the impact of a detected problem in each layer, an impact we can measure in time, money, reputation and the like.

The web environment above supports hundreds of internal sites and is supported by a dedicated team of engineers. The environment changes slowly and only after extensive testing. It has to – the cost of messing up is high.

Sometimes the problem really is upstream, don’t get me wrong. At least one version of Excel, when switching from German to English, doesn’t translate BRTEILJAHRE() to YEARFRAC(), for example. IronRuby has reached version 1.0 but I did discover a little bug (well, in one of the standard libraries, at least) and years ago I had the excitement of working on a PC with a P60 CPU, FDIV bug and all.

The very fact that I remember these examples shows how rare they are, compared to the number of bugs I identify and fix in my own code every day. I can barely remember what I fixed this morning, for goodness’ sake. Orders of magnitude.

The UK National Lottery (or “Lotto” as I think it’s now called) used to have a tagline: “it could be you”. In that particular case it was generally a safe bet to append “but it probably won’t be”. When we’re looking for the source of software errors, we can change that to “and it probably is”.

Estimated distribution of causes of problems

Large Corporation IT Excise

Excise: an indirect tax

We’re getting a temporary 2.5% cut in Value Added Tax, to encourage us to spend our way out of recession. I must say, I’m motivated to get out there and spend, spend, spend. (irony mode off).

Amongst the many forms of excise levied on the Large Corporate employee is the “standard build”. Of course, in most Large Corporates, the build is anything but standard, but they like to maintain the fiction because it implies that costs are being controlled to within an inch of their costly little lives.

My main work PC got sick on Monday. Following a weekend “security update” it no longer felt comfortable about running Excel or an other Office app, insisting that, once loaded, the application was not in fact installed. Since my machine, by virtue of my being a developer with local admin rights, is by no means standard, a rebuild was scheduled within a couple of hours of identifying the problem. Apparently the Service Level Agreement for rebuilds stipulates a maximum four day turnaround.

Four. Frickin’. Days. Are they having a Turkish*?

In the event the PC was collected and driven away on a trolley in just under two, to be returned the following morning, which should be tomorrow.

Such is progress. Ten years ago, in a different bank, when a set of DLLs I was testing managed to destabilise my PC, a chap came round inside an hour, inserted a floppy disc and gave the three-finger salute. An hour later, NT 4.0 was up and running, good as new. Come to think, that site may still have been on 3.51…

When the machine comes back, I’ll get the dubious pleasure (assuming my admin rights have somehow been preserved) of reloading the various items I need to be able to do something that actually justifies my comfortable income.

To be honest, I expected the list to be longer. I will install Visual Studio at some point, but I haven’t written a line of C# this year, so it’s hardly pressing.

I reckon I should have it all running again by Friday, which makes a week’s money for maybe two days’ actual work. Large Corporation Excise – I hope the increased security was worth it.

* Rhyming slang: Turkish bath (barff) = Laugh