Sorted for Excel and Whee!
6 July 2010 10 Comments
If you happened upon the VBA Lamp and by rubbing it were able to produce the Excel VBA Genie and were granted a VBA Wish, would you ask for a built-in Sort() function?
If you build the kind of Excel apps that I do, you’ll run up against the need for a Sort all too frequently. Sorting arrays with sizes in the tens of thousands is not unusual and I was reminded of this when reading a post in this entry at Andrew’s Excel Tips the other day.
While it’s not crucial in the (useful) idea presented, the code has a quick and and dirty sort that is about the worst sort algorithm1 one could intelligently come up with. It’s also an prettty intuitive solution, which just goes to show that sorting may not be as simple as we may think. I’m not attacking Andrew’s skillz here, btw: the code as presented is certainly fit for purpose; it’s not presented as a general-purpose utility (at least I hope it isn’t).
I’ve accumulated a few algorithms over the years and I’ve coded up a couple more while “researching” this piece. On the one hand, we have the oft-derided BubbleSort and its close relation, CocktailSort. In the same group I’d include InsertionSort and SelectionSort. I’m going to be harsh and categorise those, with the naive sort above, as “Slow”. Well, “mostly slow”, as we’ll see.
In the “Fast” group, we have the much-touted QuickSort, and somewhere in between, we have HeapSort,and my current algorithm of choice, CombSort. As I was researching this, I also coded up ShellSort, which is about as old as I am and which was claimed to be faster than QuickSort under some conditions.
I ran some comparisons, not meant in any way to be perfectly scientific2. I ran each algorithm on arrays of 50 to 50,000 values with six different characteristics:
- already sorted
- exactly reversed
- mostly sorted (values typically within one or two places of target)
- ordered blocks of 100 random values (the first 100 values are 0 + RAND(), then 100 + RAND() and so on)
- completely random
- random 10-character strings
First off, the 50-record results:
50 recs (ms) | sorted | near sorted | blocks | random | strings | reversed |
Shell | 0.06 | 0.06 | 0.11 | 0.10 | 0.24 | 0.09 |
Quick | 0.13 | 0.13 | 0.16 | 0.13 | 0.29 | 0.14 |
Comb | 0.09 | 0.10 | 0.17 | 0.17 | 0.33 | 0.13 |
Heap | 0.34 | 0.33 | 0.32 | 0.32 | 0.52 | 0.28 |
Insertion | 0.02 | 0.02 | 0.20 | 0.17 | 0.47 | 0.37 |
Selection | 0.25 | 0.25 | 0.25 | 0.25 | 0.54 | 0.25 |
Cocktail | 0.01 | 0.02 | 0.44 | 0.39 | 1.02 | 0.77 |
Bubble | 0.01 | 0.02 | 0.50 | 0.45 | 1.12 | 0.78 |
Naive | 0.22 | 0.23 | 0.50 | 0.46 | 1.06 | 0.77 |
I’d say it’s pretty clear that it doesn’t matter much what you use to sort a small array, just about anything will be fast enough (unless you’re going to perform that sort tens of thousands of times in a run). It’s also apparent that the “slow” algorithms are actually pretty good if our data is already reasonably well-ordered.
So far, so “so what?”
Let’s look at the opposite end of the spectrum: 50,000 values? Here, the Fast/Slow divide is apparent. First the “Slows” (two tests only, for reasons that should become apparent):
50K (ms) | near sorted | random |
Bubble | 31 | 522,216 |
Cocktail | 30 | 449,696 |
Insertion | 19 | 179,127 |
Naive | 219,338 | 510,010 |
Selection | 220,735 | 220,743 |
Yes, that’s hundreds of thousands of milliseconds. “Three or four minutes” to you and me. The “Fasts”, meanwhile:
50K (ms) | sorted | near sorted | blocks | random | strings | reversed |
Shell | 162 | 164 | 219 | 377 | 929 | 250 |
Quick | 296 | 298 | 327 | 365 | 790 | 306 |
Comb | 390 | 396 | 477 | 622 | 1,348 | 452 |
Heap | 899 | 903 | 885 | 874 | 1,548 | 844 |
(I only ran two tests on the “Slows”, for fear of dozing off completely.)
Again, for data where values are near their final sorted positions there’s clear evidence that something like an Insertion Sort is much faster than any of the “sexier” options. Provided you know your data will actually meet that criterion, of course.
All that considered, I’m switching from CombSort to ShellSort as my default algorithm. While it loses out a little to QuickSort in the “random” test (probably most representative of my normal use case) it doesn’t carry with it the fear of stack overflow through extreme recursion, something that’s bitten me with QS in the past. Anyway, us old’uns have got to stick together.
As already mentioned, if you have small quantities of data or infrequent sort calls in your application, it really doesn’t make much difference which algorithm you use, although I’d still suggest being aware of the properties of several options and having a basic implementation to hand. Once you reach a point where sorting contributes materially to your application run time then you owe it to yourself and your users to make an intelligent selection.
Here’s my ShellSort implemenation in VB, transcoded fairly literally from the Wikipedia pseudo-code (optimisations welcome):
Public Sub ShellSort(inp) ' sorts supplied array in place in ascending order Dim inc As Long, i As Long, j As Long Dim temp ' don't know what's in the input array... If Not IsArray(inp) Then Exit Sub ' ...but it had better be an array inc = (UBound(inp) - LBound(inp) + 1) / 2 ' use Shell's originally-proposed gap sequence Do While inc > 0 For i = inc To UBound(inp) temp = inp(i) j = i Do If j < inc Then Exit Do ' check these conditions separately, as VBA Ors don't short-circuit If inp(j - inc) <= temp Then Exit Do ' ... and this will fail when j < inc inp(j) = inp(j - inc) j = j - inc Loop inp(j) = temp Next inc = Round(CDbl(inc) / 2.2) Loop End Sub
1 Not the absolute worst: there’s the catastrophic genius of BogoSort, to name but one, but let’s not go anywhere nearer to there.
2 Just so we understand each other, these are my figures for my code on one of my machines. YMMV. A lot.
Hi Mike,
Useful stuff!
Interesting.
On the code above, the line:
>> inc = (UBound(inp) – LBound(inp) + 1) / 2
should that be rounded?
I have also implemented (in Javascript in this case) a version of quick sort using a local array as a stack, instead of recursion.
@Julian – I’m not sure that it matters hugely. The pseudo-code I translated has a round(), but it’s not clear what kind of rounding it’s using (VBA does “banking rounding” and 0.5s can go up or down). AFAICT, this line chooses the starting point for the “gap sequence” and that itself seems to have been the source of some conjecture. I don’t think it’s making much of a difference…
If I can get my head far enough around it, I might try to explore some of the alternatives mentioned in the Wikipedia article. I have Sedgewick’s “Algorithms” book at home somewhere, albeit in the Pascal edition – I don’t think he ever did a VBA one 😉
@Mike – Ahh, I forgot that VBA must do some rounding if a Double is used to access an Array index position.
It is quite an interesting subject. I like the visualisations here http://www.cs.ubc.ca/~harrison/Java/sorting-demo.html
and here http://www.cs.rit.edu/~atk/Java/Sorting/sorting.html
In particular, I have never quite understood the Shear Sort.
There is another assumption in the code, hidden by the use of LBound.
LBound(imp) must be 0
Otherwise j < inc would be wrong, as it would allow j = inc through to the next test giving inp(0) <= tmp
@Julian – I think you’re right on this: the WIkipedia pseudocode assumes a zero-based array and my test data was the same. There would need to be some more (careful) offsetting for VB/VBA’s more flexible array bounds.
Hi Mike
Thanks for your comments.
I played around with your code a little and came up with the following (for a typed array of Longs in my case). Sadly I have a bad habit of using “My” all over the place!
Public Function Array_ShellSort_Long( _
MyArray() As Long _
)
Dim MyInc As Long
Dim MyI As Long
Dim MyJ As Long
Dim MyTemp As Long
Dim MyLBound As Long
If Not IsArray(MyArray) Then Exit Function
MyLBound = LBound(MyArray)
MyInc = Round(((UBound(MyArray) – MyLBound) – (LBound(MyArray) – MyLBound) + 1) / 2)
Do While MyInc > 0
For MyI = (MyInc + MyLBound) To UBound(MyArray)
MyTemp = MyArray(MyI)
MyJ = MyI
Do
If (MyJ – MyInc) < MyLBound Then Exit Do
If MyArray(MyJ – MyInc) <= MyTemp Then Exit Do
MyArray(MyJ) = MyArray(MyJ – MyInc)
MyJ = MyJ – MyInc
Loop
MyArray(MyJ) = MyTemp
Next MyI
MyInc = Round(CDbl(MyInc) / 2.2)
Loop
End Function
Regards
Julian
Love the article title 😉
I posted a comb sort function and a few sort related links a little while ago:
http://newtonexcelbach.wordpress.com/2009/03/23/a-sort-function/
which you might like to have a look at.
Do you know anything about radix sorts? They seem incredibly quick, but I never got round to implementing it properly.
Why reinvent the wheel? Just dump the numbers into a ‘scratch’ worksheet, programmatically have Excel sort them there, copy them back into an array or wherever, and delete the scratch worksheet. Oh, and turn screen updating off and on around the operation.