## Sorted for Excel and Whee!

6 July 2010 10 Comments

If you happened upon the VBA Lamp and by rubbing it were able to produce the Excel VBA Genie and were granted a VBA Wish, would you ask for a built-in Sort() function?

If you build the kind of Excel apps that I do, you’ll run up against the need for a Sort all too frequently. Sorting arrays with sizes in the tens of thousands is not unusual and I was reminded of this when reading a post in this entry at Andrew’s Excel Tips the other day.

While it’s not crucial in the (useful) idea presented, the code has a quick and and dirty sort that is about the worst sort algorithm^{1} one could intelligently come up with. It’s also an prettty intuitive solution, which just goes to show that sorting may not be as simple as we may think. I’m not attacking Andrew’s skillz here, btw: the code as presented is certainly fit for purpose; it’s not presented as a general-purpose utility (at least I hope it isn’t).

I’ve accumulated a few algorithms over the years and I’ve coded up a couple more while “researching” this piece. On the one hand, we have the oft-derided BubbleSort and its close relation, CocktailSort. In the same group I’d include InsertionSort and SelectionSort. I’m going to be harsh and categorise those, with the naive sort above, as “Slow”. Well, “mostly slow”, as we’ll see.

In the “Fast” group, we have the much-touted QuickSort, and somewhere in between, we have HeapSort,and my current algorithm of choice, CombSort. As I was researching this, I also coded up ShellSort, which is about as old as I am and which was claimed to be faster than QuickSort under some conditions.

I ran some comparisons, not meant in any way to be perfectly scientific^{2}. I ran each algorithm on arrays of 50 to 50,000 values with six different characteristics:

- already sorted
- exactly reversed
- mostly sorted (values typically within one or two places of target)
- ordered blocks of 100 random values (the first 100 values are 0 + RAND(), then 100 + RAND() and so on)
- completely random
- random 10-character strings

First off, the 50-record results:

50 recs (ms) | sorted | near sorted | blocks | random | strings | reversed |

Shell | 0.06 | 0.06 | 0.11 | 0.10 | 0.24 | 0.09 |

Quick | 0.13 | 0.13 | 0.16 | 0.13 | 0.29 | 0.14 |

Comb | 0.09 | 0.10 | 0.17 | 0.17 | 0.33 | 0.13 |

Heap | 0.34 | 0.33 | 0.32 | 0.32 | 0.52 | 0.28 |

Insertion | 0.02 | 0.02 | 0.20 | 0.17 | 0.47 | 0.37 |

Selection | 0.25 | 0.25 | 0.25 | 0.25 | 0.54 | 0.25 |

Cocktail | 0.01 | 0.02 | 0.44 | 0.39 | 1.02 | 0.77 |

Bubble | 0.01 | 0.02 | 0.50 | 0.45 | 1.12 | 0.78 |

Naive | 0.22 | 0.23 | 0.50 | 0.46 | 1.06 | 0.77 |

I’d say it’s pretty clear that it doesn’t matter much what you use to sort a small array, just about anything will be fast enough (unless you’re going to perform that sort tens of thousands of times in a run). It’s also apparent that the “slow” algorithms are actually pretty good if our data is already reasonably well-ordered.

So far, so “so what?”

Let’s look at the opposite end of the spectrum: 50,000 values? Here, the Fast/Slow divide is apparent. First the “Slows” (two tests only, for reasons that should become apparent):

50K (ms) | near sorted | random |

Bubble | 31 | 522,216 |

Cocktail | 30 | 449,696 |

Insertion | 19 | 179,127 |

Naive | 219,338 | 510,010 |

Selection | 220,735 | 220,743 |

Yes, that’s hundreds of thousands of milliseconds. “Three or four minutes” to you and me. The “Fasts”, meanwhile:

50K (ms) | sorted | near sorted | blocks | random | strings | reversed |

Shell | 162 | 164 | 219 | 377 | 929 | 250 |

Quick | 296 | 298 | 327 | 365 | 790 | 306 |

Comb | 390 | 396 | 477 | 622 | 1,348 | 452 |

Heap | 899 | 903 | 885 | 874 | 1,548 | 844 |

(I only ran two tests on the “Slows”, for fear of dozing off completely.)

Again, for data where values are near their final sorted positions there’s clear evidence that something like an Insertion Sort is much faster than any of the “sexier” options. Provided you know your data will actually meet that criterion, of course.

All that considered, I’m switching from CombSort to ShellSort as my default algorithm. While it loses out a little to QuickSort in the “random” test (probably most representative of my normal use case) it doesn’t carry with it the fear of stack overflow through extreme recursion, something that’s bitten me with QS in the past. Anyway, us old’uns have got to stick together.

As already mentioned, if you have small quantities of data or infrequent sort calls in your application, it really doesn’t make much difference which algorithm you use, although I’d still suggest being aware of the properties of several options and having a basic implementation to hand. Once you reach a point where sorting contributes materially to your application run time then you owe it to yourself and your users to make an intelligent selection.

Here’s my ShellSort implemenation in VB, transcoded fairly literally from the Wikipedia pseudo-code (optimisations welcome):

Public Sub ShellSort(inp) ' sorts supplied array in place in ascending order Dim inc As Long, i As Long, j As Long Dim temp ' don't know what's in the input array... If Not IsArray(inp) Then Exit Sub ' ...but it had better be an array inc = (UBound(inp) - LBound(inp) + 1) / 2 ' use Shell's originally-proposed gap sequence Do While inc > 0 For i = inc To UBound(inp) temp = inp(i) j = i Do If j < inc Then Exit Do ' check these conditions separately, as VBA Ors don't short-circuit If inp(j - inc) <= temp Then Exit Do ' ... and this will fail when j < inc inp(j) = inp(j - inc) j = j - inc Loop inp(j) = temp Next inc = Round(CDbl(inc) / 2.2) Loop End Sub

^{1} Not the absolute worst: there’s the catastrophic genius of BogoSort, to name but one, but let’s not go anywhere nearer to there.

^{2} Just so we understand each other, these are *my* figures for *my* code on one of *my* machines. YMMV. A lot.