Getting Bigger All The Time

in which we try to get tricky with Excel and find that Excel has adequate built-in trickiness already, thank you very much indeed

Over the years I’ve had a number of roles that involved fixing, tuning, cleaning and otherwise future- and bullet-proofing a fair number of Excel workbooks. There’s a whole catalog of potential issues to entertain and amuse, but one very common one is the Extending Input Data Set (I capitalised that in the hope that I might have christened some kind of Excel anti-pattern).

Your average Excel user is smart enough to know that some input data may grow over time: they may be copy-pasting from some other source, grabbing a database query result or even just typing in more values over time. It’s very common to see a row-by-row enrichment of such data for analysis and the formulae need to be extended to match the rows of input data. In the worst case, users have pre-copied more rows than they expect ever to need and don’t notice when they’ve started to miss data. More happily, they copy down as they see the data extending. If they see the data extending, that is1.

Helping a colleague to avoid such an unpleasantness recently led to a couple of interesting (to me) solutions. Firstly, we looked at extending the calculations applied to a list of data that we knew would extend every time we ran the spreadsheet model.

We assume we have a defined name, input_data, that does what it says. Further, we have output_formulae, which should correspond one-for-one with the input_data rows. When input_data changes, we want to extend output_formulae automatically to match. Pretty straightforward, as it turns out:

Private Sub Worksheet_Change(ByVal Target As Range)
  If Intersect(Target, Range("input_data")) Is Nothing Then
    Exit Sub
  End If
  ExtendOutputFormulaeAsRequired
End Sub

Sub ExtendOutputFormulaeAsRequired()

  Dim inputRowCount As Long, calcRowCount As Long
  inputRowCount = Range("input_data").Rows.Count
  calcRowCount = Range("output_formulae").Rows.Count

  If inputRowCount <= calcRowCount Then
    Exit Sub ' enough formula rows already - could reduce if needed, but not done here
  End If

  With Range("output_formulae")
    ' Assumes just formulae are needed:
    .Offset(calcRowCount, 0).Resize(inputRowCount - calcRowCount, .Columns.Count).Formula = _
  End With
End Sub

That works pretty well. At least, it does if input_data is defined, something we can usually do fairly easily using a named formula, something like =OFFSET(B2,,,COUNT(B:B),6) (assuming we have 6 columns and there’s a non-numeric column heading in B1, which in this case there is).
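
Should you prefer to set such a name up from VBA, something along these lines works (the sheet, anchor cell and width here are assumed for illustration):

' Creates (or replaces) the dynamic name; adjust sheet, anchor and width to taste
ThisWorkbook.Names.Add Name:="input_data", _
    RefersTo:="=OFFSET(Sheet1!$B$2,0,0,COUNT(Sheet1!$B:$B),6)"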

Here’s the wrinkle that gives rise to the second interesting (to me) question, one that necessitated a question on Stack Overflow. What if I can’t be sure that all my columns will be the same length? I want the length of the longest column (there won’t be gaps vertically, which is something at least). So in the sample below, there are 7 rows of data in columns B to G, with a heading on row 1; the longest column has 7 values, which defines the number of rows in input_data.

Input data – extend me!

    B  C  D  E  F  G
2   1  1  1  1  1  1
3   1  1  2  1  1  1
4   1  1  1  1  1  1
5   1  1  1  1  1  1
6         1     1  1
7         1
8         1

I don’t want to use .CurrentRegion because I can’t be certain that adjacent cells won’t be populated at some time in the future. We can’t do something like MAX(COUNT(B:G)) because COUNT() happily aggregates across a two-dimensional range and returns a single number. I tried to be tricky, {=MAX(COUNT(OFFSET(B1:G1,,,1000000,1)))}, hoping that Excel would evaluate the COUNT() for a set of OFFSETs from B1 to G1. Excel wasn’t fooled. Stack Overflow time then, and enter SUBTOTAL(), a function I confess I’ve never used in over 20 years of Excellence.

I define a new name, input_data_columns, as =Sheet1!$B:$G, which I can then use in the definition of input_data_rowcount, which ends up as something like:

=MAX(SUBTOTAL(2,OFFSET(input_data_columns,0,COLUMN(input_data_columns)-MIN(COLUMN(input_data_columns)),,1)))

(SUBTOTAL’s function number 2 means COUNT; the OFFSET produces one single-column reference per column of input_data_columns, so SUBTOTAL returns one count per column.) In a worksheet this needs to be Control-Shift-Entered as an array formula, but there’s no obvious way to do this when defining a Named Formula. Fortunately, although I don’t understand why, it works OK anyway, leading in turn to input_data becoming something like:

=OFFSET(Sheet1!$B$2,0,0,input_data_rowcount,6)
That works! Is it future-proof? Probably not2, but it might very well be an improvement on what came before.

Briefly checking what happens: if we remove the MAX() and put the remainder into a worksheet as an array formula, we get {4, 4, 7, 4, 5, 5}, which is indeed the result of COUNT() on each column.

1 From a selfish perspective, this is a Good Thing – it requires “professional” expertise to deal with and has put bread on my table many a time.

2 See 1

My (Im)perfect Cousin?

in which we start to worry about the source of our inspiration
Mona Lisa Vito: So what’s your problem?
Vinny Gambini: My problem is, I wanted to win my first case without any help from anybody.
Lisa: Well, I guess that plan’s moot.
Vinny: Yeah.
Lisa: You know, this could be a sign of things to come. You win all your cases, but with somebody else’s help, right? You win case after case, and then afterwards you have to go up to somebody and you have to say “thank you”! Oh my God, what a fuckin’ nightmare!

It is one of the all-time great movies, and netted Marisa Tomei an Oscar in the process. Yes it is. It really is1.

Not only that, but My Cousin Vinny2 throws up parallels in real life all the time. Yes it does. It really does3.

Why, only recently, I was puzzling over the best (or least worst) way to implement a particularly nonsensical requirement for an intransigent client. After summarising the various unpalatable options in an email, a reply arrived from a generally unproductive source. The message content made it obvious that he’d somewhat missed the point, but the conclusion he drew from that misunderstanding triggered a new thought process that gave us a new, even less, er, worser solution to our problem.

Sadly, my unwitting muse has moved on now, but he left his mark for all time4 on our latest product. I suppose he should also take partial credit for the creation of a hitherto unknown development methodology: PowerPoint-Driven Development, but that’s a story for another day.

1 All right, IMHO
2 See also My Cousin Vinny At Work, application of quotes therefrom
4 Or at least until we have a better idea and change the whole damn thing

This Wheel Goes To Eleven

(in which we make an unexpected connection regarding the D in SOLID and get all hot under the collar about it)

Let’s not beat about the bush: I think I may have reinvented Dependency Injection. While it looks rather casual, stated like that, I’ve actually spent much of the last six months doing it. (Were you wondering? Well, that.)

I’ve been designing/building/testing/ripping apart/putting back together again a library/app/framework/tool thing that allows us to assemble an asset allocation algorithm for each of our ten or so products1, each of which may have been modified at various times since inception. It’s been interesting and not a little fun, plus I’ve been climbing the C# learning curve (through the three versions shipped since my last serious exposure) like Chris Bonington on amphetamines.

Our products are broadly similar but differ in detail in some places. So there’s lots of potential for reuse, but no real hierarchy (if you can see a hierarchy in the little chart here, trust me, it’s not real).

So Product A needs features 1, 2 & 3, in that order. B needs 1 & 4; C needs 1, 3 & 5; etc. What I came up with was to encapsulate each feature in a class, each class implementing a common interface. Call it IFeature or some such. At run-time, I can feed my program an XML file (or something less ghastly, perhaps) that says which classes I need (and potentially the assemblies in which they may be found), applying the wonder that is System.Reflection to load the specified assemblies and create instances of the classes I need, storing them in, for example, a List<IFeature>. To run my algorithm, all I need to do is call the method defined in my interface on each object in turn. A different product, or a new version of an existing one, has a different specification and it Should Just Work.
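
Stated baldly like that, the plumbing sounds grander than it is; it boils down to something like the sketch below. All the names here (IFeature, Execute, ProductAssembler) and the XML layout are invented for illustration – the real thing is, naturally, rather more involved:

// A product definition might look something like:
//   <features>
//     <feature assembly="Features.Core.dll" type="Products.ApplyCap" />
//     <feature assembly="Features.Extra.dll" type="Products.Rebalance" />
//   </features>
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Xml.Linq;

public interface IFeature
{
    void Execute(); // the one method every algorithm step shares
}

public static class ProductAssembler
{
    // Read a product definition and build its algorithm as an ordered list of steps
    public static List<IFeature> Load(string definitionPath)
    {
        var features = new List<IFeature>();
        foreach (var el in XDocument.Load(definitionPath).Descendants("feature"))
        {
            var asm = Assembly.LoadFrom((string)el.Attribute("assembly")); // load the specified assembly...
            var type = asm.GetType((string)el.Attribute("type"), true);    // ...find the named class...
            features.Add((IFeature)Activator.CreateInstance(type));        // ...and instantiate it
        }
        return features;
    }

    // Running the algorithm is just calling each step in the configured order
    public static void Run(IEnumerable<IFeature> algorithm)
    {
        foreach (var feature in algorithm)
            feature.Execute();
    }
}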

It’s all very exciting.

So changing a single feature of an existing product means writing one new class that implements the standard interface and pointing the product definition at the library that contains the new class (which may – should – be different from those already in use).

The discerning reader may, er, discern that there are elements of Strategy and Command patterns in here as well. Aren’t we modern?

While all this is very exciting (to me at least – a profound and disturbing symptom of work-life imbalance) it’s still not the end of the line. I’ve built functions and then chosen to access them serially, relying on carefully-crafted (or, less charitably, tricky & tedious) XML definitions to dictate sequence. I’m thinking that I can go a long way further into declarative/functional territory, possibly gaining quite a bit. And there’s a whole world of Dynamic to be accessed, plus Excel and C++ interfaces of varying degrees of sexiness to be devised.

More on much of that when I understand it well enough to say something.

1 There are billions at stake, here, billions I tell you.

Tiny VBA Tooltippery Tip

Project Euler 100/304

(in which we discover What Went On In 1997)

This morning’s iteration of the daily blog/news trawl for useful information threw up “Five tips for debugging a routine in the Visual Basic Editor”, all of which are sensible, although unlikely to be news to anyone reading this, if we’re honest.

Tip #3, “View variables using data tips”, however, reminded me of something that I don’t believe is widely known. Since the site seems to require a full-blown account creation that I can’t see as appropriate for a simple comment, I’m going to mention it here.

Hovering the mouse pointer over a variable while in VBA’s Break mode will show the variable’s value in a tooltip:

The smart VBA programmer

That’s fine: almost all the time we get to see exactly what we want. Above about (or maybe exactly) 60 characters, however, we get the leading part and three little dots:

Still no problem if we only want the start of the string...

What if we want to see what’s at the end of the string, though? Well, back in (I think) 1997, I managed to get my then employer to send me to VBA DevCon (no easy task, given that the location was EuroDisney), at which I happened to meet the Microsoft guy who actually wrote the hover/tooltip thing (it was in the VB4 editor first, I believe) and he told me that viewing the last 60-ish characters of the string could be achieved by holding down the Control key before moving the pointer over the variable name:


I don’t think I’ve ever seen this recorded. Of course, I haven’t exactly gone looking for it, so if you came all the way to the end only to discover that I was just repeating something that everyone knows, then I can only apologise. We’ll get over it.

Lacking Anything Worthwhile To Say, The Third Wise Monkey Remained Mostly Mute

Project Euler 100/304 complete (on the permanent leaderboard at last!)

(in which we discover What’s Been Going On lately)


Don't talk ... code


I’ve been coding like crazy in C# of late, a language I’ve barely touched in the last few years. Put it this way, generics were new and sexy the last time I wrote anything serious in .NET… (There should be some ExcelDNA fun to be had later.)

I’d forgotten how flat-out fast compiled languages, even of the bytecode/IL kind, can be. It’s not a completely fair comparison, to be sure, but a Ruby script to extract 300,000 accounts from my Oracle database and write them as XML takes a couple of hours, mostly in the output part. A C# program handled the whole extraction in 5 minutes, then processed the accounts in about 30 seconds, of which 10 were spent deserializing the XML into objects, 10 serializing the results back to XML and 10 performing the moderately heavy-duty mathematical transformations in between.
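
In shape, if not in detail, the C# side of that pipeline is the kind of thing sketched below – Account, its fields and the file layout are invented for illustration, not the real classes:

using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class Account
{
    public string Id;
    public double Balance;
}

public class AccountFile
{
    public List<Account> Accounts = new List<Account>();
}

public static class Pipeline
{
    public static void Run(string inPath, string outPath)
    {
        var serializer = new XmlSerializer(typeof(AccountFile));

        AccountFile data;
        using (var reader = File.OpenRead(inPath))
            data = (AccountFile)serializer.Deserialize(reader); // the ~10s deserialization step

        foreach (var account in data.Accounts)
            account.Balance = Transform(account.Balance);       // the ~10s of maths in the middle

        using (var writer = File.Create(outPath))
            serializer.Serialize(writer, data);                 // the ~10s serialization step
    }

    static double Transform(double x) { return x * 1.05; }      // stand-in for the real transformations
}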


Click to test its value for money


Lacking at present a paid-for version of Visual Studio 2010 (the Express Edition, while brilliantly capable, won’t do plugins, which precludes integrating Subversion and NUnit, to name but two essentials), I have been enjoying greatly my experience with SharpDevelop, which spots my installs of TortoiseSVN and NUnit and allows both to be used inside the IDE. It’s not perfect: there are areas, particularly in its Intellisense analogue, where exceptions get thrown, but they’re all caught and I have yet to lose any work. While the polish is, unsurprisingly, at a lower level than Microsoft’s, it’s entirely adequate (and I mean that in a good way) and the price is right. I particularly liked being able to open an IronRuby session in the IDE and use it to interact with the classes in the DLL on which I was working.

While I expect VS2010 to become available as the budgeting process grinds through, I’m not at all sure that it’ll be necessary to switch. An extended set of automated refactoring tools could be attractive, although Rename and Extract Method are probably the two most useful, productivity-wise, and they’re already present. I would rather like to have Extract Class, which isn’t needed often but would be a big time (and error) saver when called for.

On another topic entirely, should you be looking for entertaining reading in the vaguely technical, erudite and borderline insane category, may I recommend To Umm Is Human to you? Any blog that has “orang utan” amongst its tags is worth a look, I’d say. If you like it, you’ll like it a lot. I was once made redundant by a Doubleday, but I don’t think they’re related.

There’s an interesting new programmer-oriented podcast on the block, too: This Developer’s Life has slightly higher production values than most, which may ultimately limit its life – the time to produce an episode must be substantial. I found myself wanting to join in the conversation with stories of my own, a sure sign that I was engaged in the content.

That Do Impress Me Much

Over at stackoverflow, now that they have a pile of money to spend (sorry, invest), the rate of change is picking up. There’s the re-worked Stack Exchange model, which has changed dramatically – and quite likely for the better. They’ve moved away from the original paid-for hosted service to a community-driven process, whereby a community needs to form and commit to the idea of a new site. The objective is to improve the prospects of achieving critical mass for a new site, thus increasing its chances of success. I imagine a revenue model is mooted, although it may be little more than “if we build it, they will come” at present. Sponsored tags and ads spring to mind.

This week we’ve seen the covers removed (perhaps for a limited time initially) from a “third place”, to go with the existing main Q&A and “meta” (questions about the Q&A site). It’s a chat room. Well, lots of chat rooms of varying degrees of focus, to be more specific. Quite nicely done, too.

What has really impressed me has been that during this “limited sneak beta preview”, bugs, issues, feature requests and the like have been flowing through the interface at a fair rate of knots and many have been addressed and released within hours. Minutes, sometimes.

Think about it. User detects a bug, reports it and gets a fix, to an application with global reach, in a couple of hours or less. That’s agile.

A crucial part of the Lean movement in manufacturing (and its younger counterpart in software development) is eliminating waste. “Waste” is broadly defined – very broadly defined, in fact – but one easily identifiable component is Work In Progress (WIP). In software terms, this often represents effort that has been invested (and money that’s been tied up) without having been included in a release. The more effort we invest without releasing, the more we waste, since we have no possibility of obtaining a return on that investment.

Here’s a particularly quick find/fix from earlier today:

Yes, it was probably a trivial bug, but the problem was notified, found, fixed and released in eight frickin’ minutes. How many of us can turn anything around that fast?

I’m looking forward to seeing where this goes.

Sorted for Excel and Whee!

If you happened upon the VBA Lamp and by rubbing it were able to produce the Excel VBA Genie and were granted a VBA Wish, would you ask for a built-in Sort() function?

If you build the kind of Excel apps that I do, you’ll run up against the need for a Sort all too frequently. Sorting arrays with sizes in the tens of thousands is not unusual, and I was reminded of this when reading this entry at Andrew’s Excel Tips the other day.

While it’s not crucial to the (useful) idea presented, the code has a quick and dirty sort that is about the worst sort algorithm1 one could intelligently come up with. It’s also a pretty intuitive solution, which just goes to show that sorting may not be as simple as we might think. I’m not attacking Andrew’s skillz here, btw: the code as presented is certainly fit for purpose; it’s not presented as a general-purpose utility (at least I hope it isn’t).

I’ve accumulated a few algorithms over the years and I’ve coded up a couple more while “researching” this piece. On the one hand, we have the oft-derided BubbleSort and its close relation, CocktailSort. In the same group I’d include InsertionSort and SelectionSort. I’m going to be harsh and categorise those, with the naive sort above, as “Slow”. Well, “mostly slow”, as we’ll see.

In the “Fast” group, we have the much-touted QuickSort, and somewhere in between, we have HeapSort, and my current algorithm of choice, CombSort. As I was researching this, I also coded up ShellSort, which is about as old as I am and which was claimed to be faster than QuickSort under some conditions.

I ran some comparisons, not meant in any way to be perfectly scientific2, timed with a harness along the lines of the one sketched after the list. I ran each algorithm on arrays of 50 to 50,000 values with six different characteristics:

  • already sorted
  • exactly reversed
  • mostly sorted (values typically within one or two places of target)
  • ordered blocks of 100 random values (the first 100 values are 0 + RAND(), then 100 + RAND() and so on)
  • completely random
  • random 10-character strings
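
The harness itself (a sketch, not the original code – to resolve the sub-millisecond timings below you’d repeat each sort many times and divide by the count):

' Times one algorithm on one data characteristic; the real harness
' presumably looped over all nine sorts and six cases
Sub TimeRandomCase()
    Const N As Long = 50000
    Dim data As Variant, i As Long, t As Double
    ReDim data(1 To N)
    Randomize
    For i = 1 To N
        data(i) = Rnd ' the "completely random" case; vary this for the other five
    Next i
    t = Timer
    ShellSort data ' the ShellSort listed at the end of this post
    Debug.Print "ShellSort, "; N; " values: "; Format$((Timer - t) * 1000, "0.00"); " ms"
End Sub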

First off, the 50-record results:

50 recs (ms)  sorted  near-sorted  blocks  random  strings  reversed
Shell           0.06      0.06      0.11    0.10    0.24     0.09
Quick           0.13      0.13      0.16    0.13    0.29     0.14
Comb            0.09      0.10      0.17    0.17    0.33     0.13
Heap            0.34      0.33      0.32    0.32    0.52     0.28
Insertion       0.02      0.02      0.20    0.17    0.47     0.37
Selection       0.25      0.25      0.25    0.25    0.54     0.25
Cocktail        0.01      0.02      0.44    0.39    1.02     0.77
Bubble          0.01      0.02      0.50    0.45    1.12     0.78
Naive           0.22      0.23      0.50    0.46    1.06     0.77

I’d say it’s pretty clear that it doesn’t matter much what you use to sort a small array, just about anything will be fast enough (unless you’re going to perform that sort tens of thousands of times in a run). It’s also apparent that the “slow” algorithms are actually pretty good if our data is already reasonably well-ordered.

So far, so “so what?”

Let’s look at the opposite end of the spectrum: 50,000 values. Here, the Fast/Slow divide is apparent. First the “Slows” (two tests only, for reasons that should become apparent):

50K (ms)    near-sorted    random
Bubble            31      522,216
Cocktail          30      449,696
Insertion         19      179,127
Naive        219,338      510,010
Selection    220,735      220,743

Yes, that’s hundreds of thousands of milliseconds. “Three to nine minutes” to you and me. The “Fasts”, meanwhile:

50K (ms)   sorted  near-sorted  blocks  random  strings  reversed
Shell        162       164       219     377      929     250
Quick        296       298       327     365      790     306
Comb         390       396       477     622    1,348     452
Heap         899       903       885     874    1,548     844

(I only ran two tests on the “Slows”, for fear of dozing off completely.)

Again, for data where values are near their final sorted positions there’s clear evidence that something like an Insertion Sort is much faster than any of the “sexier” options. Provided you know your data will actually meet that criterion, of course.

All that considered, I’m switching from CombSort to ShellSort as my default algorithm. While it loses out a little to QuickSort in the “random” test (probably most representative of my normal use case) it doesn’t carry with it the fear of stack overflow through extreme recursion, something that’s bitten me with QS in the past. Anyway, us old’uns have got to stick together.

As already mentioned, if you have small quantities of data or infrequent sort calls in your application, it really doesn’t make much difference which algorithm you use, although I’d still suggest being aware of the properties of several options and having a basic implementation to hand. Once you reach a point where sorting contributes materially to your application run time then you owe it to yourself and your users to make an intelligent selection.

Here’s my ShellSort implementation in VB, transcoded fairly literally from the Wikipedia pseudo-code (optimisations welcome):

Public Sub ShellSort(inp)
' sorts supplied array in place in ascending order
Dim inc As Long, i As Long, j As Long
Dim temp ' don't know what's in the input array...
  If Not IsArray(inp) Then Exit Sub ' ...but it had better be an array
  inc = (UBound(inp) - LBound(inp) + 1) \ 2 ' initial gap: half the array, as Shell originally proposed
  Do While inc > 0
    For i = LBound(inp) + inc To UBound(inp)
      temp = inp(i)
      j = i
      Do ' shuffle gap-separated earlier entries along until the insertion point is found
        If j < LBound(inp) + inc Then Exit Do ' check these conditions separately, as VBA Ors don't short-circuit
        If inp(j - inc) <= temp Then Exit Do ' ...and this one would fail when j - inc is out of bounds
        inp(j) = inp(j - inc)
        j = j - inc
      Loop
      inp(j) = temp
    Next i
    inc = Round(CDbl(inc) / 2.2) ' shrinking by 2.2 rather than 2 is reported to perform a little better
  Loop
End Sub
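
And a quick smoke test, in case it’s useful:

' Exercises the routine above on a small Variant array
Sub DemoShellSort()
    Dim a As Variant, v As Variant
    a = Array(3, 1, 4, 1, 5, 9, 2, 6)
    ShellSort a
    For Each v In a
        Debug.Print v; ' prints 1 1 2 3 4 5 6 9
    Next v
    Debug.Print
End Sub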

1 Not the absolute worst: there’s the catastrophic genius of BogoSort, to name but one, but let’s not go anywhere nearer to there.
2 Just so we understand each other, these are my figures for my code on one of my machines. YMMV. A lot.

