Getting Bigger All The Time

in which we try to get tricky with Excel and find that Excel has adequate built-in trickiness already, thank you very much indeed

Over the years I’ve had a number of roles that involved fixing, tuning, cleaning and otherwise future- and bullet-proofing a fair number of Excel workbooks. There’s a whole catalog of potential issues to entertain and amuse but very common is the Extending Input Data Set. (I capitalised that in the hope that I might have christened some kind of Excel anti-pattern).

Your average Excel user is smart enough to know that some input data may grow over time: they may be copy-pasting from some other source, grabbing a database query result or even just typing in more values over time. It’s very common to see a row-by-row enrichment of such data for analysis and the formulae need to be extended to match the rows of input data. In the worst case, users have pre-copied more rows than they expect ever to need and don’t notice when they’ve started to miss data. More happily, they copy down as they see the data extending. If they see the data extending, that is1.

Helping a colleague to avoid such an unpleasantness recently led to a couple of interesting (to me) solutions. Firstly, we looked at extending the calculations applied to a list of data that we knew would extend every time we ran the spreadsheet model.

We assume we have a defined name, input_data that does what it says. Further, we have output_formulae, which should correspond one-for-one with the input_data rows. When input_data  changes, we want to automatically extend output_formulae to match. Pretty straightforward, as it turns out:

Private Sub Worksheet_Change(ByVal Target As Range)
  If Intersect(Target, Range("input_data")) Is Nothing Then
    Exit Sub
  End If
  ExtendOutputFormulaeAsRequired
End Sub

Sub ExtendOutputFormulaeAsRequired()

  Dim inputRowCount As Long, calcRowCount As Long
  inputRowCount = Range("input_data").Rows.Count
  calcRowCount = Range("output_formulae").Rows.Count

  If inputRowCount <= calcRowCount Then
    Exit Sub ' enough formula rows already - could reduce if needed, but not done here
  End If

  With Range("output_formulae")
    ' Assumes just formulae are needed:
    .Offset(calcRowCount, 0).Resize(inputRowCount - calcRowCount, .Columns.Count).Formula = _
        .Rows(calcRowCount).Formula
  End With
End Sub

That works pretty well. At least, it does if input_data is defined, something we can usually do fairly easily, using a named formula, something like =OFFSET(B2,,,COUNT(B:B),6) (assuming we have 6 rows and there’s a non-numeric column heading in B1, which in this case there is).

Here’s the wrinkle that gives rise to the second interesting (to me) question, that necessitated a question on Stack Overflow. What if I can’t be sure that all my columns will be the same length? I want the length of the longest column (there won’t be gaps vertically, which is something at least). So in the sample below, there are 7 rows of data in columns B to G, with a heading on row 1, so 7 is the longest column and defines the number of rows in input_data.

Input data – extend me!
1 1 1 1 1 1
1 1 2 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
1 1 1
1
1

I don’t want to use .CurrentRegion because I can’t be certain that adjacent cells won’t be populated at some time in the future. We can’t do something like MAX(COUNT(B:G)) because COUNT() is built to handle two-dimensional ranges. I tried to be tricky, {=MAX(COUNT(OFFSET(B1:G1),,,1000000,1))}, hoping that Excel would evaluate the COUNT() for a set of OFFSETs from B1 to G1. Excel wasn’t fooled. Stack Overflow time then, and enter SUBTOTAL(), a function I confess I’ve never used in over 20 years of Excellence.

I define a new name, input_data_columns as =Sheet1!$B:$G, which I can then use in the definition of input_data_rowcount:

=MAX(SUBTOTAL(2,OFFSET(input_data_columns,,COLUMN(input_data_columns)-MIN(COLUMN(input_data_columns)),,1)))

In a worksheet this needs to be Control-Shift-Entered as an array formula, but there’s no obvious way to do this when defining a Named Formula. Fortunately, although I don’t understand why, it works OK anyway, leading in turn to input_data becoming

=OFFSET(Sheet1!$B$2,0,0,input_data_rowcount,6)

That works! Is it future-proof? Probably not2 but it might very well be an improvement on what came before.

Briefly checking what happens, if we remove the MAX() and put the remainder into a worksheet as an array formula, we get {4, 4, 7, 4, 5, 5}, which is indeed the result of COUNT() on each column.


1 From a selfish perspective, this is a Good Thing – it requires “professional” expertise to deal with and has put bread on my table many a time.

2 See 1

Another notch in the utility belt

Having been largely an all-thumbs c# programmer for the last year or so I’ve had little opportunity to add to the world of Excel know-how (and I don’t really feel qualified to say much about the .Net world yet, even if I did once get a Tweet from Jon Skeet). I do keep an eye out for Excel questions on StackOverflow, though, and one caught my eye this morning.

The questioner wanted to add a zero to a chart range. Not sure why he didn’t just add an extra constant value at the end (start or beginning) of the range, but perhaps he couldn’t (maybe the data was in another, read-only workbook?). Anyway, I came up with something that extended an existing trick a little and I thought it was mildly interesting.
Using a named formula in a chart series is1 a fairly well-known idea theses days. It lets us define a dynamic series (usually using OFFSET()) and have our charts update automagically as our series grow and shrink. What I hadn’t previously thought of2 was that I could use a VBA function call as the Name’s “Refers to” with the result being useable in a chart Series definition.

Turns out it could.

Public Function PrependZero(rng As range)
Dim val As Variant, res As Variant
Dim idx As Long
    val = Application.Transpose(rng.Columns(1).Value) ' nifty!
    ReDim res(LBound(val) To UBound(val) + 1)
    res(LBound(res)) = 0 ' this is where the function lives up to its name
    For idx = LBound(val) To UBound(val)
        res(idx + 1) = val(idx) ' copy the remainder
    Next
    PrependZero = res
End Function

(That Transpose thing is handy – I hadn’t known about it as a way of getting a 1-D range value as a vector. Two transposes needed for data along a row.)

You could use this approach for lots of things. Like, well, lots of things. Honestly, you don’t need me to spell them out.


1 This makes me a little sad – it was a source of Excel-fu that impressed people to a quite disproportionate degree. Array formulae and VBA classes still have this power.
2 I’m aware that I might be the last kid in the class here. Be gentle.

Pin Me Up, Pin Me Down

(in which we return to VBA for a Small But Useful macro-writing session)

My friend Jens comes up with interesting questions. Even better, he doesn’t come up with them too often.

Today’s poser refers to Excel 2007, a technological marvel that has only reached our work PCs relatively recently. Jens had discovered the Usefulness that is the “pin” tool on the “Recent Documents” list:

So the pinned documents don’t go away, but they do move around on the list, rising up when used, drifting down as lesser workbooks are opened. What Jens wanted was to have his pinned files stay at the top. Or the bottom. Either, really, providing they hung around together.

We couldn’t find anything in Options, Advanced or otherwise, and some Googling didn’t turn up much of use1 and at time of writing (admittedly not too long ago) I hadn’t got any useful feedback on Stackoverflow. Time to apply a different set of talents.

Excel has Application.RecentFiles, which does pretty much what it suggests. Even better, the “.Add” method, when called with a file that’s already on the list, promotes that file to the top. If we could call that method for the pinned files, we’d be laughing. If we could identify them. It turns out we can.

Excel has to store the list somewhere so it can find it when restarting. Where do “modern” Windows programs store that kind of information? The registry2. Specifically, “HKEY_CURRENT_USER\Software\Microsoft\Office\12.0\Excel\File MRU”, where MRU stands, most likely, for Most Recently Used. My registry has 50 entries here, probably reflecting that Advanced Excel Options allows me to display up to 50 documents, screen real estate permitting.

So looking at the entries, we see that there’s one string for each of our MRU entries and that each string has some structure to it:

That first part: “[F00000001]“, is particularly interesting: the last digit is a “1″ for all the files we’ve marked as pinned. So we can identify pinned files. All we need now is to be able to read the registry from VBA. We can’t use “GetSetting” because it’s limited in scope. More Googling and we discover that the Windows Scripting Host knows about registries so we add a reference to “Windows Script Host Object model” in order to be able to early-bind to WshShell.

So here’s the first attempt. Probably heaps of unhandled situations, but right now it works on my machine…

Public Sub PromotePinnedFiles()

Const KEY_ROOT = "HKEY_CURRENT_USER\Software\Microsoft\Office\12.0\Excel\File MRU\Item "

Dim itemIndex As Long
Dim itemData As String
Dim itemPath As String

On Error Resume Next ' general-purpose error-ignoring

With New WshShell ' Or use CreateObject("WScript.Shell") if reference not set
 For itemIndex = 1 To 50 ' I think 50 is the most Excel will store
  itemData = .RegRead(KEY_ROOT & itemIndex) ' get the item (ignoring errors if fewer than 50 present)
  If Len(itemData) > 0 Then
   If InStr(itemData, "*") > 0 Then
    If Mid(itemData, 10, 1) = "1" Then
     itemPath = Mid(itemData, InStr(itemData, "*") + 1)
     Application.RecentFiles.Add itemPath ' "promotes" the item to #1 in the list
    End If
   End If
  End If
 Next
End With

End Sub

The effect of which is this:

It just remains to decide how best to package it for my learned colleague.


1 Inevitably, there will be any number of comments telling me precisely where I should have looked.
2 Readers who answered “INI files” are invited to remove themselves to the back of the classroom, where the machines are still running Excel 3.0.

Tiny VBA Tooltippery Tip

Project Euler 100/304

(in which we discover What Went On In 1997)

This morning’s iteration of the daily blog/news trawl for useful information threw up “Five tips for debugging a routine in the Visual Basic Editor“, all of which are sensible, although unlikely to be news to anyone reading this, if we’re honest.

Tip #3, “View variables using data tips”, however, reminded me of something that I don’t believe is widely known. Since the site seems to require a full-blown account creation that I can’t see as appropriate for a simple comment, I’m going to mention it here.

Hovering the mouse pointer over a variable while in VBA’s Break mode will show the variable’s value in a tool tip:

The smart VBA programmer

That’s fine: almost all the time we get to see exactly what we want. Above about (or maybe exactly) 60 characters, however, we get the leading part and three little dots:

Still no problem if we only want the start of the string...

What if we want to see what’s at the end of the string, though? Well, back in (I think) 1997, I managed to get my then employer to send me to VBA DevCon (no easy task, given that the location was EuroDisney), at which I happened to meet the Microsoft guy who actually wrote the hover/tooltip thing (it was in the VB4 editor first, I believe) and he told me that viewing the last 60-ish characters of the string could be achieved by holding down the Control key before moving the pointer over the variable name:

Presto!

I don’t think I’ve ever seen this recorded. Of course, I haven’t exactly gone looking for it, so if you came all the way to the end only to discover that I was just repeating something that everyone knows, then I can only apologise. We’ll get over it.

Sorted for Excel and Whee!

If you happened upon the VBA Lamp and by rubbing it were able to produce the Excel VBA Genie and were granted a VBA Wish, would you ask for a built-in Sort() function?

If you build the kind of Excel apps that I do, you’ll run up against the need for a Sort all too frequently. Sorting arrays with sizes in the tens of thousands is not unusual and I was reminded of this when reading a post in this entry at Andrew’s Excel Tips the other day.

While it’s not crucial in the (useful) idea presented, the code has a quick and and dirty sort that is about the worst sort algorithm1 one could intelligently come up with. It’s also an prettty intuitive solution, which just goes to show that sorting may not be as simple as we may think. I’m not attacking Andrew’s skillz here, btw: the code as presented is certainly fit for purpose; it’s not presented as a general-purpose utility (at least I hope it isn’t).

I’ve accumulated a few algorithms over the years and I’ve coded up a couple more while “researching” this piece. On the one hand, we have the oft-derided BubbleSort and its close relation, CocktailSort. In the same group I’d include InsertionSort and SelectionSort. I’m going to be harsh and categorise those, with the naive sort above, as “Slow”. Well, “mostly slow”, as we’ll see.

In the “Fast” group, we have the much-touted QuickSort, and somewhere in between, we have HeapSort,and my current algorithm of choice, CombSort. As I was researching this, I also coded up ShellSort, which is about as old as I am and which was claimed to be faster than QuickSort under some conditions.

I ran some comparisons, not meant in any way to be perfectly scientific2. I ran each algorithm on arrays of 50 to 50,000 values with six different characteristics:

  • already sorted
  • exactly reversed
  • mostly sorted (values typically within one or two places of target)
  • ordered blocks of 100 random values (the first 100 values are 0 + RAND(), then 100 + RAND() and so on)
  • completely random
  • random 10-character strings

First off, the 50-record results:


50 recs (ms) sorted near sorted blocks random strings reversed
Shell 0.06 0.06 0.11 0.10 0.24 0.09
Quick 0.13 0.13 0.16 0.13 0.29 0.14
Comb 0.09 0.10 0.17 0.17 0.33 0.13
Heap 0.34 0.33 0.32 0.32 0.52 0.28
Insertion 0.02 0.02 0.20 0.17 0.47 0.37
Selection 0.25 0.25 0.25 0.25 0.54 0.25
Cocktail 0.01 0.02 0.44 0.39 1.02 0.77
Bubble 0.01 0.02 0.50 0.45 1.12 0.78
Naive 0.22 0.23 0.50 0.46 1.06 0.77

I’d say it’s pretty clear that it doesn’t matter much what you use to sort a small array, just about anything will be fast enough (unless you’re going to perform that sort tens of thousands of times in a run). It’s also apparent that the “slow” algorithms are actually pretty good if our data is already reasonably well-ordered.

So far, so “so what?”

Let’s look at the opposite end of the spectrum: 50,000 values? Here, the Fast/Slow divide is apparent. First the “Slows” (two tests only, for reasons that should become apparent):


50K (ms) near sorted random
Bubble 31 522,216
Cocktail 30 449,696
Insertion 19 179,127
Naive 219,338 510,010
Selection 220,735 220,743

Yes, that’s hundreds of thousands of milliseconds. “Three or four minutes” to you and me. The “Fasts”, meanwhile:


50K (ms) sorted near sorted blocks random strings reversed
Shell 162 164 219 377 929 250
Quick 296 298 327 365 790 306
Comb 390 396 477 622 1,348 452
Heap 899 903 885 874 1,548 844

(I only ran two tests on the “Slows”, for fear of dozing off completely.)

Again, for data where values are near their final sorted positions there’s clear evidence that something like an Insertion Sort is much faster than any of the “sexier” options. Provided you know your data will actually meet that criterion, of course.

All that considered, I’m switching from CombSort to ShellSort as my default algorithm. While it loses out a little to QuickSort in the “random” test (probably most representative of my normal use case) it doesn’t carry with it the fear of stack overflow through extreme recursion, something that’s bitten me with QS in the past. Anyway, us old’uns have got to stick together.

As already mentioned, if you have small quantities of data or infrequent sort calls in your application, it really doesn’t make much difference which algorithm you use, although I’d still suggest being aware of the properties of several options and having a basic implementation to hand. Once you reach a point where sorting contributes materially to your application run time then you owe it to yourself and your users to make an intelligent selection.

Here’s my ShellSort implemenation in VB, transcoded fairly literally from the Wikipedia pseudo-code (optimisations welcome):


Public Sub ShellSort(inp)
' sorts supplied array in place in ascending order
Dim inc As Long, i As Long, j As Long
Dim temp ' don't know what's in the input array...
  If Not IsArray(inp) Then Exit Sub ' ...but it had better be an array
  inc = (UBound(inp) - LBound(inp) + 1) / 2 ' use Shell's originally-proposed gap sequence
  Do While inc > 0
    For i = inc To UBound(inp)
      temp = inp(i)
      j = i
      Do
        If j < inc Then Exit Do ' check these conditions separately, as VBA Ors don't short-circuit
        If inp(j - inc) <= temp Then Exit Do ' ... and this will fail when j < inc
        inp(j) = inp(j - inc)
        j = j - inc
      Loop
      inp(j) = temp
    Next
    inc = Round(CDbl(inc) / 2.2)
  Loop

End Sub


1 Not the absolute worst: there’s the catastrophic genius of BogoSort, to name but one, but let’s not go anywhere nearer to there.
2 Just so we understand each other, these are my figures for my code on one of my machines. YMMV. A lot.

UK Excel Developer? Confer!

Should somehow missed it thus far, an Event is forthcoming in London in July: the

UK Excel Developer Conference

has been announced for Monday 12th July 2010.

Location: Skills Matter Exchange, just North of the City (map), at 116-120 Goswell Road, London EC1V 7DP

Registration a very reasonable looking £150+VAT (£50 more in the final week).

Speakers to include Simon “Smurf” Murphy, performance guru Charles Williams, and Excel’s own podcast-jockey, Ross McLean.

The agenda is packed – heavily targeted at the serious end of the Excel development spectrum: extension strategies, automation, performance and the like. I’d expect there to be much useful learning-type information to be acquired, both in session and during the breaks.

If it makes you more likely to attend, I plan on being there. (If that thought makes you less likely to attend, pretend I didn’t say that.)

Going Metric

(stackoverflow rep: 10,307, Project Euler 97/288 complete)

(in which a single comment substitutes for a butterfly’s wing, metaphorically speaking)

A week or so back , as part of the unfocused rant1 that stemmed from exposure to a particularly ghastly Excel travesty.

Since it struck a chord with at least one person2 and I’d been meaning for months to write something more about it, I’m going to ramble about my feeble (but still surprisingly useful) approach to VBA code metrics.

The original idea came from three contemporaneous events several years: a large and nasty workbook to be heavily modified, some reading about McCabe’s cyclomatic complexity metric and coming across a little VB utility named Oh No!

Oh No!

This utility scanned a VB project and displayed a chart that attempted to show you where possible problem areas might lie by using size and complexity of routines on its X and Y axes (Function, Sub, Property etc). Size was number on non-blank, non-comment lines, while complexity was (from the Help file):

The complexity is measured as the total number of structure statements in a procedure. These statements are: Do, For, If, ElseIf, Select, While and With.

It looked like this (pointed in this instance at its own source code):

The third dimension, “blob size”, was related to the number of parameters in the routine. All in all, it’s a nicely-done, well thought-out little app.

For my uses it had its drawbacks, of course. To start with, it wanted a .vbp (VB project) file for input, something I didn’t have in my Excel/VBA project. For another, I didn’t seem to have the sort of brain that could easily extract useful information out of the graphical display. The code didn’t like continued lines, something I use a lot, and I felt that size and parameter count were contributors to complexity: we didn’t really need three dimensions.

Most important of all, writing my own app looked like a fun part-time project and I’m a complete sucker for those.

I should confess now that a cyclomatic complexity calculation proved to be beyond me when I developed this utility in 2002. It’s probably still beyond me now, but back then the reference I had was the original paper and gaining the necessary understanding from that would in all likelihood have taken all the fun out of the whole thing. So I fudged it. Sue me.

Approaching the problem

The VBComponents of the target workbook are scanned and a collection of modules is constructed, each of which contains a separate object for each procedure2 it contains. As procedures are constructed from lines, they attempt to classify what they’re being presented and tally up the incidences of things worth counting:

Public Enum eCounterType
    ecComments = 1
    ecDeclares = 2
    ecElseIfs = 3
    ecGotos = 4
    ecHighestNestLevel = 5
    ecIifs = 6
    ecIfs = 7
    ecLines = 8
    ecLongLines = 9
    ecLoops = 10
    ecMultiDimCount = 11
    ecOneCharVars = 12
    ecParameters = 13
    ecSelects = 14
    ecWiths = 15
End Enum

You may have spotted that the deepest nesting is also recorded, although it’s not exactly a “counter” – perhaps I should have used “scoreCard” or something more expressive.

Each procedure gets a complexity “score”, which ended up being calculated, after much tweaking, using a tariff something like this:

Pts Awarded for
1.0 every “nesting” command (If, Do, For, Select, ElseIf, etc)
0.1 each line of executable code after the first, rounded down
0.1 each declared variableafter the first, rounded down
1.0 each line that contains multiple Dims
0.5 each parameter after the first
2.0 each If (more than loops because of the arbitrarily complex branching)
0.2 each long line (defined as 60 chars or more, after concatenating continued lines)
1.0 each Go To
special the square of the deepest nesting level encountered

.
It’s not supposed to be exact and it doesn’t need to be: the purpose is to give a quick sanity check and, particularly when encountering new code for the first time, an idea of where problem areas may lie.

There are deficiencies: I don’t think the code handles one-line If/End Ifs correctly – I believe it incorrectly increments the nesting level, which leads to a higher score. Since I really don’t like one-line Ifs, that’s just going to increase the chance that I see them, so I was happy to leave it like that.

If you look at the output, there are other metrics: information on the whole project, module-level summaries (if you see 300 global variables you might consider that worth investigating, for example).

There’s no penalty for “Static” declarations, probably because I hadn’t stumbled across any code that thought it was a good idea. I’d put a fairly hefty tariff on that: it’s a major obstacle to the unravelling process.

Another “drawback”: there’s no real parser – a better approach might have been to write one and utilise an AST (the very useful roodi library for Ruby uses one and can report a “proper” cyclomatic complexity metric as a result). Also, I didn’t use regular expressions – I probably hadn’t encountered them (or was still frightened) at that time: it would have made some of the line-parsing a good deal more terse.

I called it “LOCutus” – “LOC” for “lines of code”, the balance presumably indicating that the Borg were on my mind at the time. For better or worse, I’ve added it to the xlVBADevTools download (in a ZIP file with xlUnit). It’s an Excel XP file – I don’t think I’ve ever tested on XL2007 or higher, since I’ve yet to work at a site where it was used and prefer not to look at large crappy Excel apps without being paid for it…

If you download it, you’ll see that the report is showing the results of a scan on the app’s own code (shown as a HTML table below, so formatting may vary slightly):

VBProc (Module) Globals Members Procs
0 7 21
Procedure Lines Max nest Comments Declares Ifs Loops Long lines Go Tos OCV X
Load 8 2 0 3 1 1 0 0 0 9
ProcessLine 20 1 0 1 5 0 0 0 0 12
CheckForOneCharVars 7 2 0 1 1 1 0 0 0 8
CountParameters 14 3 0 4 3 1 0 0 0 17

.
The “worst” proc scores 17, largely due to its 3-level nesting. If I were to have a need to change it, I’d consider breaking out the inner loop (or taking the outermost If and refactoring into a guard clause). Seventeen isn’t generally high enough to make me concerned though.

That’s the most “formal” part of the diagnostic “toolkit” that I apply to strange new Excel spreadsheets – anyone out there have something else we could look at?

xlVBADevTool download


1 Aren’t they all? You should hear me around the office. Probably not, actually.
2 OK, exactly one.
3 These days I favour “routine” as a generic term for Function/Sub/Property or whatever. Times change.

It’s An Ill Wind

I was pontificating recently about the horror of tightly-coupled worksheet calculations and macro code in Excel and the mess that results from carelessly mashed-together programming paradigms.

By a horrible coincidence, a few smug weeks later I suddenly found myself deep in the horror again. This time, I had to contend with two versions, needing to set up and run several tens of thousands of real-world cases through “before” and “after” scenarios.

I soon started to turn up nuggets of wonderment. Here are a couple of gems from the worksheets:

=IF(CPPI=TRUE,TRUE,FALSE)
=MAX(1,DATE(YEAR(C4),MONTH(C4),DAY(C4))-DATE(YEAR(C6),MONTH(C6),DAY(C6)))

“Why”, I asked myself, “don’t these monkeys think?” (When someone asked me which monkeys I was referring to, I realised that my inner monologue was malfunctioning, but that’s not important right now.)

Obviously they were thinking, apart from anything else the whole thing does seem to work, inasmuch as a set of inputs produce a set of plausible-looking outputs. I’d assert, however, that an experienced, competent developer has a continuous, higher-order review process running, asking questions like “does that look right?”, “could that be written better?”, “is that clear?”, “could someone else understand that?” It would appear that the individuals concerned in this case had no such process running.

I have a little tool that I bring out when faced with an unknown Excel VBA application. It scans the code and delives a report that gives me an idea of the degree of caution I should exercise, calculating for each routine an “X” number, being some arcane function of code lines, mesting level, numbers of ifs, loops, one-character variables (seldom straightforward to rename, even if they’re called “q” or “z”). A score of 10 or above indicates a possible problem area. Over 50 and we’re definitely in bandit country. Here’s the report for the top-scoring routine in the horror workbook (names changed to conceal the guilty):

422 is by a factor of about 3 the highest-scoring routine I’ve ever encountered. (By contrast, and perhaps it’s an unfair comparison, the top score in xlunit is 24.) That 1044 lines of code doesn’t include comments, blank lines, variable declarations or similar. Even better, it’s declared Static, something so appallingly dangerous (to me, at least – I get nauseous when I see even a Static variable) that I had to change the analyser because it had never seen it before. “Option Explicit” is honoured more in the breach than the observance and there are a mere (ha!) 156 global variables.

I was flicking through my cherished (and slightly stained) copy of The Pragmatic Programmer last night and I’ll take as my text the section “What Make a Pragmatic Programmer?” on pages xiii and xix of the Preface. Summarising the characteristics and comparing them with the authors of the unholy mess described, I get:

Early adopter/fast adapter: given that the abomination I’m working with is the culmination of more than two years’ work and is the fourth or fifth (and probably “best”) version, FAIL.

Inquisitive: I see no sign of any curiosity, not even “why do we suck?” FAIL

Realistic: I’m (generously) prepared to specify that this one may not really apply here. (Big of me.)

Jack of all trades: and masters of none? These guys didn’t make it to Jack of one trade. FAIL.

Critical Thinker: referring to the inner monologue again, there’s no evidence to suggest that anything like the questions listed above were ever applied here. And plenty of evidence to the contrary. BIG RED FAIL.

The tragedy here is that the monkeys described are unlikely to realise exactly how snivellingly far from acceptable (let’s not waste “best” here) practice they are, as the Dunning-Kruger effect tells us is likely, so they’re not going to change in any useful way. Failing to see that the twin demons of high coupling and low (non-existent, if we’re honest) cohesion are rampaging through their code leaves me wondering where we’d start.

Depressingly, these guys aren’t unusual, they’re not even the worst. The type is perhaps a little more common in, say, the Excel world, where the tools to develop staggering incompetence are so widely available, but they’re everywhere else – the invisible 90% of the iceberg.

On the plus side, even moderately competent technicians who are prepared to don the elbow-length protective gloves and dive in to such messes will be able to look forward with some confidence to a lifetime in which there will be a paycheck for as long as they can stand to pick it up.

Buddy, Can You Paradigm?

(stackoverflow rep: 8998, Project Euler 89/273 complete

(or: “If Paradigm was half as nice”)

I started reading Coders At Work recently. In the style of a few predecessors, the book comprises transcribed (and presumably edited) interviews with fifteen programmers who have made significant contributions of various kinds. It’s a weighty tome, possibly a little overweight if I’m honest, but interesting nonetheless; it makes a change from the kind of technical tome I usually cart around on the daily commute.

Circling a little closer to, and nodding vaguely in the general direction of the point, interviewee number seven, reached just this morning, is Simon Peyton Jones, who was one of the progenitors of Haskell, one of, if not the first name that springs to mind when one hears “functional programming“. For the last decade, Jones has been a researcher at Microsoft Research in Cambridge, which suggests that he would have had at least one finger in the F# pie… There’s a DotNetRocks show number 310 features an interview with the man, by the way. I’m enviously disturbed that despite our very similar ages, he appears considerably less follicly-challenged than me, although there could be some comb-over action going on there.

In the taxonomy of programming paradigms (does that make sense?) we have Declarative and Imperative. They are opposites. In the Declarative fold, we get stuff like SQL, XSLT, all the functional languages and, I submit Excel worksheets. Actually it’s not just me: that Wikipedia link places spreadsheet cell-based programming within the Dataflow subcategory of Functional. Imperative programming, on the other hand, includes the panoply of more “traditional” languages, including good old Visual Basic and its slightly less functional sibling VBA.

Thinking about Excel and VBA developement, I’d say there are several fairly distinct uses for VBA code (usually, but not exclusively: we can write add-ins or perform automation in plenty of other ways) which include:

  • New worksheet functions;
  • Interface extensions (menus, toolbars, ribbons etc);
  • Helpers (or wizards);
  • Control and simulation

…of which I write a lot of the latter two.

(You know how some comedians talk about a string of seemingly unrelated things and then cleverly tie them all together at the end? Well, fingers crossed, stay tuned)

A while back, I was handed a workbook, developed by an external consultancy, that modelled the 20-year evolution of a financial product. We wanted to run it across a significant number of stochastic market data simulations to get a picture of the distribution of possible outcomes. About 30,000 such simulations (and about 16GB of simulated market data) would be sufficient, we thought. The workbook took about 20 minutes to perform one such calculation. We didn’t have a year.

Fortunately, there were any number of classic optimisation failures to rectify: massive VBA/worksheet interaction at the cell level (changed to use large arrays), screen update enabled at all times, lots of less painful VBA-specific stuff, Application.Calculation = xlCalculationAutomatic, you know the drill. Easy stuff. Within a day it was taking about a minute and I could look at having the whole analysis done in under a week (I have a four-CPU machine).

If it all seems straightforward, it wasn’t. I’d wrapped the workbook in a modified (hacked-up from xlUnit) test jacket to check that my changes weren’t affecting the results, pretty much a necessity for any refactoring work, and one change kept breaking. Every time I tried to switch off automatic calculation, triggering it only when I thought it was needed, the final results came out different. I finally realised that the VBA and worksheet were extremely highly-coupled – calculations were occurring as the code pushed values through: change a value, auto-calc, change another value, auto-calc again and so on. Eek. It took almost a week to unravel. Of course, since I actually enjoy doing this kind of work for the most part, it wasn’t a complete disaster, but I could have lived without the time pressure…

As a finance worker (perversely, I’m increasingly inclined to describe myself as a Banker1 these days) I have cause, from time to time, to access The Bloomberg from Excel. At least one of the functions in the (extensive and powerful, let’s be fair) API is a real functional-imperative mess. Here’s one with which I’m not at all happy:

=BDH("MXWO", "PX_LAST", "01/06/2007", "28/12/2009", "Dir=V", "Dts=S", "Sort=A", "Quote=C", "QtTyp=Y", "Days=T", "Per=cd", "DtFmt=D", "cols=2;rows=672", "FX=EUR")

Yes, there are a lot of arguments2, but most are at least not positional, which shows they’re trying. This particular example asks for the last daily quote of the MSCI World Index, in Euro, between the dates given. Even better, I guess, there’s a wizard to generate it all for you. But here’s the nasty bit: the call fills as many rows – without warning about overwrites – as it needs. Oh, and that "cols=2;rows=672" parameter? It’s rewritten if you make any change that would affect the number of cells output. This is not easy to accomplish. Try duplicating the effect in VBA, go on, I double-dare you. I think there’s a separate server process at work here, able to make asynchronous (but timely) modifications to user worksheets.

Yikes. And ugh.

To me, there’s something broken here. They’re using am apparently functional programming model but allowing the code to affect the state of the worksheet in (from the user’s perspective at least) a rather arbitrary (or at least, unpredictable) way. We’re not even getting an array formula here – the function just causes chunks of worksheet to be filled with data. In point of fact, there looks to have been some serious programming work here – I don’t think in the normal course of events that calling a function is allowed to cause areas outside the calling range to be modified. I rather suspect an external process is being asked to pump values into the sheet through some devious means like (but let’s hope it isn’t) DDE.

(This is where I try to pull it all together. A little encouragement wouldn’t go amiss here…)

The fact that we have both paradigms represented as first-class components of a single development platform may be a large part of the reason for Excel’s dominance in the marketplace. But the interface between declarative and imperative code needs to be carefully managed to avoid the kind of unnecessary side-effects that I spent so long untangling.

The messed-up simulation I whined about above failed to recognise that the worksheets should have been, in effect, a great big, super-complex function, which the code used by setting up the input conditions, firing off a Calculate and then reading the output values. Instead, it tried to implement a tightly-coupled declarative-imperative abomination, with unfortunate results.

Let’s try not to repeat that mistake, for the sake of the children.

I’ve been (and remain) grumpy – thank you and good night.

You've been a lovely audience


1 Wanna make something of it?

2 Many Reuters functions do something similar, but push all the key-value pairs into a long string.

A Third Way: DNA?

(stackoverflow rep: 8417, Project Euler 87/261 complete

When we want Excel to talk to a compiled library, the list of options available to us is not long.

On the one hand, we can write a compiled library that references the Excel SDK (pre- or post-2007), setting its prefix to XLL for convenience, typically we’d grit our collective teeth and accomplish this with C or (cough, spit) C++. One big plus here is speed – we’re talking to Excel in its own language. Or something awfully close to it. Another plus, of course, is that this approach probably involves the minimum set of dependencies for non-enterprise distribution convenience. If you haven’t spotted the big minus then we should probably agree to disagree on the relative merits of programming languages.

On the other hand, we can write a library that exposes its functions through good ol’ COM, referencing it though the Tool…Add-ins dialog. That’s probably still the best-known way to integrate functionality developed in .NET, although it does involve paying a performance price as the COM interface is crossed and recrossed. This would be the place in the world most usually occupied by VSTO. Of course, you could also write your COM library in non-managed code: VB6 is easy (if you can find it) but slow, the choice from others depends on your capacity (or desire) for pain.

On the gripping hand, we have ExcelDNA, which for my money is one of the all-round cunningest things you’re likely to come across in quite a long time. (Unless you’ve already encountered its cunningness, of course, in which case you’re probably already nodding along in sage agreement). How so? How about being able to write your spiffy new functions in C# but without having to incur the needless ins and outs of the wasteful and superfluous COM middleman?

ExcelDNA provides a small XLL that can talk to managed code. There’s one attribute to set in order to make a function visible to Excel and one libary to import (in order to be able to add the attribute). Oh, and a little config file to tell the XLL what to load. That’s it. Actually, that’s the complicated version. In case you didn’t know, the CLR includes a compiler, allowing code to be created and compiled at run-time. The simplest way to talk to Excel from .NET via ExcelDNA is just to put your code directly into that little config file and let the XLL compile it at load time. OK, there’s a second or so’s overhead, but how simple – anything simpler would probably have to be involve 21st-Century language integration where VBA lives today.

By now you’re probably muttering something like “write code, fat bloke”, which is a little cruel, but a sentiment otherwise understandable. Now only the other day, Dick Kusleika posted his take on the everybody-has-one timeless RangeJoin() UDF topic. Here’s a rather simpler implementation:


<DnaLibrary Language="CS">
<![CDATA[
using ExcelDna.Integration;
using System.Collections.Generic;

    public class MyFunctions
    {
        [ExcelFunction(Description="Joins cell values", Category="My ExcelDNA functions")]
        public static object RangeJoin(object[,] cells)
        {
            List<string> list = new List<string>();

            foreach (object o in cells)
                list.Add(o.ToString());

            return string.Join(",", list.ToArray());
        }
    }
]]>
</DnaLibrary>

To get it running, I saved it as “RangeJoin.dna”, then copied the ExcelDNA.xll into the same folder and renamed that to “RangeJoin.xll”. Because it isn’t currently anywhere on my path, I also put ExcelDna.Integration.dll in the folder. Then I double-clicked my new XLL file to start Excel and load the library. (Pictures below are from Excel 2002, but the code is tested up to 2007).

Not surprising...

There’s a tiny extra delay before the usual macro warning appears – the C# code’s being compiled, which is where compile-time failures are displayed. Then you’re up and running. The function shows up in the Insert Function dialog…

… and we can try calling it:

… with the following result:

I did notice the the =NA() output and the Euro amount came out different to what I’d have hoped. It rather looks like I’d have to dig a little deeper into the ExcelDna.Integration.dll and the ExcelReference object in particular in order to be able to access the equivalent of VBA’s Text property.

Still, not bad for a quick exercise, I’d say.

There are some alternative approaches to the dot-net-and-excel-without-going-anywhere-near-COM-add-ins topic:

For Free

Excel4Net is now free – it seems to work by implementing a single worksheet function that calls out to managed code;

The last time I looked at XLW it was focused on C++. That’s changed in the interim – C# and VB.NET are now options.

Possibly Paid-for Propositions

Len Holgate appears to be at beta test stage with JetXLL, which looks like it’ll be a commercial offering;

I also just discovered ExHale. In beta, appears current – can’t see what the terms are likely to be;

Definitely Demanding Dollars

Possibly the first implementation, ManagedXLL is (was?) a paid-for product that looks to have been similar to, but possibly broader in scope than ExcelDNA. It’s not clear whether the company is still active or the product is still available – a curious colleague recently tried to make enquiries and failed.

Follow

Get every new post delivered to your Inbox.