The Probability of Culpability
28 June 2010 Leave a comment
(stackoverflow rep: 10,976, Project Euler 98/288 complete)
(in which we are reminded that finding solutions to most of the issues we face requires looking no further than our own immediate surroundings)
“I think there’s a problem with the web server – one of my pages isn’t displaying”.
“Strange – everything else seems to be fine – have you made any changes recently?”
“No – and it was fine yesterday. It must be the server; I’m going to escalate it to support.”
A few minutes later, I discovered that a directory had, as a result of a deviant drag-and-drop, been relocated inside another directory, with the result that the failing page’s Javascripts were no longer loading. Panic over, issue de-escalated. It was our team’s collective fault after all.
I have, over the years, found myself working in a support capacity. Needs must, from time to time. I doubt very much that I’m the only support person who ever had a user who claimed that they’d found a bug in Windows/a compiler/Excel/some other third-party application. I also suspect that the number of occasions where the actual error really did reside where the user asserted it did is small. Vanishingly small.
As a general rule of thumb, when faced with a perplexing error, we may consider the “stack” of software within which our problem has arisen. With that in mind, we should consider how much effort has been invested into assuring the correctness of each layer. Take an Excel/VBA application running on, say, Windows 7. I’d guess that the test effort invested in Windows would be an order of magnitude greater – at least – than for Excel. And the step should be as great when we come down to the app we built. That’s reasonable – it reflects the impact of a detected problem in each layer, an impact we can measure in time, money, reputation and the like.
The web environment above supports hundreds of internal sites and is supported by a dedicated team of engineers. The environment changes slowly and only after extensive testing. It has to – the cost of messing up is high.
Sometimes the problem really is upstream, don’t get me wrong. At least one version of Excel, when switching from German to English, doesn’t translate BRTEILJAHRE() to YEARFRAC(), for example. IronRuby has reached version 1.0 but I did discover a little bug (well, in one of the standard libraries, at least) and years ago I had the excitement of working on a PC with a P60 CPU, FDIV bug and all.
The very fact that I remember these examples shows how rare they are, compared to the number of bugs I identify and fix in my own code every day. I can barely remember what I fixed this morning, for goodness’ sake. Orders of magnitude.
The UK National Lottery (or “Lotto” as I think it’s now called) used to have a tagline: “it could be you”. In that particular case it was generally a safe bet to append “but it probably won’t be”. When we’re looking for the source of software errors, we can change that to “and it probably is”.
- Estimated distribution of causes of problems
