Sorted for Excel and Whee!

If you happened upon the VBA Lamp and by rubbing it were able to produce the Excel VBA Genie and were granted a VBA Wish, would you ask for a built-in Sort() function?

If you build the kind of Excel apps that I do, you’ll run up against the need for a Sort all too frequently. Sorting arrays with sizes in the tens of thousands is not unusual and I was reminded of this when reading a post in this entry at Andrew’s Excel Tips the other day.

While it’s not crucial in the (useful) idea presented, the code has a quick and and dirty sort that is about the worst sort algorithm1 one could intelligently come up with. It’s also an prettty intuitive solution, which just goes to show that sorting may not be as simple as we may think. I’m not attacking Andrew’s skillz here, btw: the code as presented is certainly fit for purpose; it’s not presented as a general-purpose utility (at least I hope it isn’t).

I’ve accumulated a few algorithms over the years and I’ve coded up a couple more while “researching” this piece. On the one hand, we have the oft-derided BubbleSort and its close relation, CocktailSort. In the same group I’d include InsertionSort and SelectionSort. I’m going to be harsh and categorise those, with the naive sort above, as “Slow”. Well, “mostly slow”, as we’ll see.

In the “Fast” group, we have the much-touted QuickSort, and somewhere in between, we have HeapSort,and my current algorithm of choice, CombSort. As I was researching this, I also coded up ShellSort, which is about as old as I am and which was claimed to be faster than QuickSort under some conditions.

I ran some comparisons, not meant in any way to be perfectly scientific2. I ran each algorithm on arrays of 50 to 50,000 values with six different characteristics:

  • already sorted
  • exactly reversed
  • mostly sorted (values typically within one or two places of target)
  • ordered blocks of 100 random values (the first 100 values are 0 + RAND(), then 100 + RAND() and so on)
  • completely random
  • random 10-character strings

First off, the 50-record results:


50 recs (ms) sorted near sorted blocks random strings reversed
Shell 0.06 0.06 0.11 0.10 0.24 0.09
Quick 0.13 0.13 0.16 0.13 0.29 0.14
Comb 0.09 0.10 0.17 0.17 0.33 0.13
Heap 0.34 0.33 0.32 0.32 0.52 0.28
Insertion 0.02 0.02 0.20 0.17 0.47 0.37
Selection 0.25 0.25 0.25 0.25 0.54 0.25
Cocktail 0.01 0.02 0.44 0.39 1.02 0.77
Bubble 0.01 0.02 0.50 0.45 1.12 0.78
Naive 0.22 0.23 0.50 0.46 1.06 0.77

I’d say it’s pretty clear that it doesn’t matter much what you use to sort a small array, just about anything will be fast enough (unless you’re going to perform that sort tens of thousands of times in a run). It’s also apparent that the “slow” algorithms are actually pretty good if our data is already reasonably well-ordered.

So far, so “so what?”

Let’s look at the opposite end of the spectrum: 50,000 values? Here, the Fast/Slow divide is apparent. First the “Slows” (two tests only, for reasons that should become apparent):


50K (ms) near sorted random
Bubble 31 522,216
Cocktail 30 449,696
Insertion 19 179,127
Naive 219,338 510,010
Selection 220,735 220,743

Yes, that’s hundreds of thousands of milliseconds. “Three or four minutes” to you and me. The “Fasts”, meanwhile:


50K (ms) sorted near sorted blocks random strings reversed
Shell 162 164 219 377 929 250
Quick 296 298 327 365 790 306
Comb 390 396 477 622 1,348 452
Heap 899 903 885 874 1,548 844

(I only ran two tests on the “Slows”, for fear of dozing off completely.)

Again, for data where values are near their final sorted positions there’s clear evidence that something like an Insertion Sort is much faster than any of the “sexier” options. Provided you know your data will actually meet that criterion, of course.

All that considered, I’m switching from CombSort to ShellSort as my default algorithm. While it loses out a little to QuickSort in the “random” test (probably most representative of my normal use case) it doesn’t carry with it the fear of stack overflow through extreme recursion, something that’s bitten me with QS in the past. Anyway, us old’uns have got to stick together.

As already mentioned, if you have small quantities of data or infrequent sort calls in your application, it really doesn’t make much difference which algorithm you use, although I’d still suggest being aware of the properties of several options and having a basic implementation to hand. Once you reach a point where sorting contributes materially to your application run time then you owe it to yourself and your users to make an intelligent selection.

Here’s my ShellSort implemenation in VB, transcoded fairly literally from the Wikipedia pseudo-code (optimisations welcome):


Public Sub ShellSort(inp)
' sorts supplied array in place in ascending order
Dim inc As Long, i As Long, j As Long
Dim temp ' don't know what's in the input array...
  If Not IsArray(inp) Then Exit Sub ' ...but it had better be an array
  inc = (UBound(inp) - LBound(inp) + 1) / 2 ' use Shell's originally-proposed gap sequence
  Do While inc > 0
    For i = inc To UBound(inp)
      temp = inp(i)
      j = i
      Do
        If j < inc Then Exit Do ' check these conditions separately, as VBA Ors don't short-circuit
        If inp(j - inc) <= temp Then Exit Do ' ... and this will fail when j < inc
        inp(j) = inp(j - inc)
        j = j - inc
      Loop
      inp(j) = temp
    Next
    inc = Round(CDbl(inc) / 2.2)
  Loop

End Sub


1 Not the absolute worst: there’s the catastrophic genius of BogoSort, to name but one, but let’s not go anywhere nearer to there.
2 Just so we understand each other, these are my figures for my code on one of my machines. YMMV. A lot.

Advertisements

10 Responses to Sorted for Excel and Whee!

  1. Hi Mike,

    Useful stuff!

  2. Julian Turner says:

    Interesting.

    On the code above, the line:

    >> inc = (UBound(inp) – LBound(inp) + 1) / 2

    should that be rounded?

    I have also implemented (in Javascript in this case) a version of quick sort using a local array as a stack, instead of recursion.

    • mikewoodhouse says:

      @Julian – I’m not sure that it matters hugely. The pseudo-code I translated has a round(), but it’s not clear what kind of rounding it’s using (VBA does “banking rounding” and 0.5s can go up or down). AFAICT, this line chooses the starting point for the “gap sequence” and that itself seems to have been the source of some conjecture. I don’t think it’s making much of a difference…

      If I can get my head far enough around it, I might try to explore some of the alternatives mentioned in the Wikipedia article. I have Sedgewick’s “Algorithms” book at home somewhere, albeit in the Pascal edition – I don’t think he ever did a VBA one 😉

  3. Julian Turner says:

    There is another assumption in the code, hidden by the use of LBound.

    LBound(imp) must be 0

    Otherwise j < inc would be wrong, as it would allow j = inc through to the next test giving inp(0) <= tmp

    • mikewoodhouse says:

      @Julian – I think you’re right on this: the WIkipedia pseudocode assumes a zero-based array and my test data was the same. There would need to be some more (careful) offsetting for VB/VBA’s more flexible array bounds.

      • Julian Turner says:

        Hi Mike

        Thanks for your comments.

        I played around with your code a little and came up with the following (for a typed array of Longs in my case). Sadly I have a bad habit of using “My” all over the place!

        Public Function Array_ShellSort_Long( _
        MyArray() As Long _
        )

        Dim MyInc As Long
        Dim MyI As Long
        Dim MyJ As Long
        Dim MyTemp As Long
        Dim MyLBound As Long

        If Not IsArray(MyArray) Then Exit Function

        MyLBound = LBound(MyArray)

        MyInc = Round(((UBound(MyArray) – MyLBound) – (LBound(MyArray) – MyLBound) + 1) / 2)

        Do While MyInc > 0

        For MyI = (MyInc + MyLBound) To UBound(MyArray)

        MyTemp = MyArray(MyI)

        MyJ = MyI

        Do
        If (MyJ – MyInc) < MyLBound Then Exit Do

        If MyArray(MyJ – MyInc) <= MyTemp Then Exit Do

        MyArray(MyJ) = MyArray(MyJ – MyInc)

        MyJ = MyJ – MyInc

        Loop

        MyArray(MyJ) = MyTemp

        Next MyI

        MyInc = Round(CDbl(MyInc) / 2.2)

        Loop

        End Function

        Regards

        Julian

  4. Gordon says:

    Love the article title 😉

  5. dougaj4 says:

    I posted a comb sort function and a few sort related links a little while ago:
    http://newtonexcelbach.wordpress.com/2009/03/23/a-sort-function/

    which you might like to have a look at.

    Do you know anything about radix sorts? They seem incredibly quick, but I never got round to implementing it properly.

  6. Yawar says:

    Why reinvent the wheel? Just dump the numbers into a ‘scratch’ worksheet, programmatically have Excel sort them there, copy them back into an array or wherever, and delete the scratch worksheet. Oh, and turn screen updating off and on around the operation.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: