In my previous post, I explained the bin packing problem, explained an example scenario, and established a baseline for both speed and efficiency of bin packing algorithms by writing a rather crude cursor-based procedure. In this part, I will look at some small modifications that can be made to this code to make it better at packing bins as full as possible, so that less bins will be needed.
Reordering the cursor
The Baseline procedure didn’t process the registrations in any particular order. Granted, there is an ORDER BY Year, Quarter in the cursor declaration, but that is only needed to enable me to generate consecutive session numbers within each quarter without having to do a SELECT MAX(SessionNo) query. Had I elected to use an IDENTITY column for the Sessions table, I could even have omitted the ORDER BY completely.
Since there is no order specified for the registrations within the quarter, we can assume that they will be processed in some unspecified order. But maybe we can get a more efficient distribution of registrations if we order them before processing? I am, of course, not referring to ordering by subject code, as the actual subjects are irrelevant for the algorithm – I am referring to order by the number of candidates in a registration.
This is very easy to test, of course. All it takes is adding one extra column to the ORDER BY clause of the cursor definition. So I created the procedures dbo.OrderAsc and dbo.OrderDesc to test the effects of ordering by ascending or descending number of candidates (see OrderAsc.sql and OrderDesc.sql in the attached ZIP file).
Ordering by ascending number of candidates turned out to be a pretty lousy idea. Well, to be fair, I didn’t expect otherwise – after all, if you save all the biggest registrations for the last, you’ll have no smaller registrations left to fill up the gaps. In fact, all registrations for 51 or more candidates will get a session of their own, and will not be combined with any other registration. So it’s not surprising at all to see that this method results in a huge amount of extra sessions as compared to the baseline version – an increase of no less than 19.9%!
Quarter NumSessions AvgSessionSize AvgEmptySeats
——- ———– —————- —————-
1 6435 78.116860 14081.800000
2 3419 81.741444 6242.600000
3 7796 77.624166 17444.200000
4 5686 73.602004 15009.900000
ALL 23336 77.383227 13194.625000
Equally unsurprising is the fact that changing the ORDER BY clause to sort by descending number of candidates results in more successful packing of the registrations in less sessions. This version saves 2.7% as compared to the baseline, as shown in this breakdown:
Quarter NumSessions AvgSessionSize AvgEmptySeats
——- ———– —————- —————-
1 5085 98.855850 581.800000
2 2799 99.847802 42.600000
3 6771 89.374981 7194.200000
4 4268 98.055529 829.900000
ALL 18923 95.429635 2162.125000
Even though this part mainly focuses on efficiency achieved, the time taken still remains an important factor. So I ran these two procedures five times in a row, recorded execution times, and calculated the average. For dbo.OrderAsc, the average execution time was 78,531 ms, whereas dbo.OrderDesc clocked in at 77,934 ms. As you see, both versions are slightly slower than the baseline version. For the OrderDesc version, this is caused by the rapid growth of the number of sessions as the first, biggest registrations are processed first. This means that searching a session with enough empty space for a registration soon becomes a more time-consuming task. For OrderAsc, the reverse is true – since the smallest registrations are processed first, there will at first be only a few sessions. This means that this algorithm will be a lot faster at first – but once the bigger registrations are processed and the total number of sessions rapidly increases to be way more that that in the baseline version, this advantage is lost, and the time required to search for sessions with enough empty space soon gets so high that the advantage this algorithm had as first then turns into a disadvantage.
Sorting the registrations by ascending number of candidates within a quarter before processing them hurts both speed and packing efficiency of the algorithm; we can henceforth forget about this option. On the other hand, sorting by descending number of candidates increases the packing efficiency by 2.7%, though this comes at the price of a 5.7% increase in execution time. If I had to choose between these two, my pick would depend on the needs of the organization I’m working for – if the cost per session is high and plenty of time is available for the computation, I’d go with the ordered version, but if speed is more important than saving those few extra sessions, I’d use the unsorted one.
Less obvious orderings
But why choose between these two only? I have thus far only considered the obvious sort orders, ascending and descending. Why not try a few more variations?
Thinking about the even distribution of data generated for the first quarter of each year in my test set, I considered that, were I tasked with manually combining the registrations as efficient as possible, I’d probably start by making sessions by combining two registrations for 50 candidates, than combining a registration for 51 candidates with one for 49, and so on. All of these sessions would total 100 candidates, and because of the even distribution of data, I expect roughly the same number of registrations for each size so I’d have only a few spare sessions left in the end.
That technique can’t be exactly mimicked by changing the sort order of the cursor. There are other ways to mimic it, though – but I’ll leave those for a future post J. But we can simulate this effect by ordering the registrations so that those with 50 candidates come first, then those with 49 and 51 candidates, and so on. This is done by changing the ORDER BY clause in the cursor definition to order by the “distance” between the number of candidates and the magic number 50, being half the maximum session size:
ORDER BY Year, Quarter,
ABS(NumCandidates – (@MaxCandidatesPerSession / 2.0)) ASC;
I didn’t hardcode the number 50, because I wanted my stored procedures to be fit for any maximum number of candidates. I divide by 2.0 instead of just 2 so that for an odd maximum session size (e.g. 25), the fraction is retained and registrations for 12 and 13 candidates are kept together because they are the same distance from half the maximum size (12.5).
It is of course also possible to use DESC instead of ASC to start with registrations for 100 candidates, then those for 99 or 1, and so on, saving the 50-candidate registration for the last. Both these versions are included in the attached ZIP file, in the files Order50First.sql and Order50Last.sql.
These versions both took slightly more time than the baseline version when I tested them on my laptop: 76,807 ms for dbo.Order50First, and 77,322 ms for dbo.Order50Last. The packing efficiency of dbo.Order50First is better than the baseline, but not as good as that of dbo.OrderDesc:
Quarter NumSessions AvgSessionSize AvgEmptySeats
——- ———– —————- —————-
1 5096 98.642464 691.800000
2 2805 99.634224 102.600000
3 6780 89.256342 7284.200000
4 4269 98.032560 839.900000
ALL 18950 95.293667 2229.625000
For dbo.Order50Last, the resulting number of sessions is even more than we had in the baseline!
Quarter NumSessions AvgSessionSize AvgEmptySeats
——- ———– —————- —————-
1 5444 92.336884 4171.800000
2 3032 92.174802 2372.600000
3 6928 87.349595 8764.200000
4 4617 90.643491 4319.900000
ALL 20021 90.196044 4907.125000
The reason for the disappointing efficiency of the dbo.Order50First procedure is that there is no control over the order of the registrations that have the same distance to 50. So it is quite possible, for instance, to start with a bunch of registrations for 49 candidates that will promptly be combined to sessions for 98 candidates each – so that, when the 51-sized registrations start coming in, they have to get sessions of their own. In an attempt to fix that, I tweaked the sort order some more, making sure that the for registrations with the same distance from 50, the “bigger” registrations come before the “smaller” ones.
ORDER BY Year, Quarter,
ABS(NumCandidates – (@MaxCandidatesPerSession / 2.0)) ASC,
NumCandidates DESC;
With this ORDER BY clause, I can be certain that all 51-candidate registrations are processed first, each getting its own session. After that, the 49-candidate registrations will exactly fill out all those sessions. This version (enclosed in Order50FirstB.sql) had a slightly better packing ration than dbo.Order50First – but still not as good as dbo.OrderDesc. Here are the results, which took 75,547 ms (on average) to achieve:
Quarter NumSessions AvgSessionSize AvgEmptySeats
——- ———– —————- —————-
1 5089 98.778148 621.800000
2 2804 99.669757 92.600000
3 6771 89.374981 7194.200000
4 4268 98.055529 829.900000
ALL 18932 95.384270 2184.625000
After these tests, there was still one thing left I wanted to try. Starting with registrations for 50 places was based on an idea for evenly distributed data. For other distributions, this might turn out to be a much worse idea (though the results don’t show as much). But what if, instead of starting at half the maximum session size, we start at the average registration size? For evenly distributed data, this should work out approximately the same. But maybe this order achieves a better packing ratio for other distributions? Let’s find out.
Ordering by distance from the average session size for a quarter can be accomplished by using a correlated subquery in the ORDER BY clause (compatible with all versions of SQL Server), or by using an AVG function with the OVER clause (only SQL Server 2005 and up):
ORDER BY a.Year, a.Quarter,
ABS(a.NumCandidates – (SELECT AVG(b.NumCandidates * 1.0)
FROM dbo.Registrations AS b
WHERE b.Year = a.Year
AND b.Quarter = a.Quarter)) ASC;
or
ORDER BY a.Year, a.Quarter,
ABS(a.NumCandidates – AVG(a.NumCandidates * 1.0)
OVER (PARTITION BY a.Year, a.Quarter)) ASC;
Surprisingly, when dry-testing the query by itself, the correlated subquery turned out to be faster than the one using the OVER clause, so I didn’t have to sacrifice speed for backward compatibility. I used the correlated subquery, both with the ASC and the DESC sort option (see OrderHalfFirst.sql and OrderHalfLast.sql in the attachment), to test the two possible variations of this option. Both versions turned out to be quite inefficient packers, since they both took more sessions than the baseline. Here are the results of dbo.OrderHalfFirst, acquired in 75,056 ms:
Quarter NumSessions AvgSessionSize AvgEmptySeats
——- ———– —————- —————-
1 5120 98.180078 931.800000
2 3264 85.623161 4692.600000
3 6771 89.374981 7194.200000
4 5044 82.970063 8589.900000
ALL 20199 89.401207 5352.125000
And OrderHalfLast, after running 79,294 ms, produced these results:
Quarter NumSessions AvgSessionSize AvgEmptySeats
——- ———– —————- —————-
1 6036 83.280649 10091.800000
2 2941 95.026861 1462.600000
3 7796 77.624166 17444.200000
4 4951 84.528580 7659.900000
ALL 21724 83.125345 9164.625000
Conclusion
I’ve investigated many options to increase packing efficiency. It turns out that, at least with the test data I used, just starting with the biggest registration and working down to the smallest yields the best results. This is not the fastest option, though. The baseline version discussed in the previous episode of this series is still fastest. So the choice would appear to depend on the requirements of your application – if you have plenty of time and computer capacity but need to use as little sessions as possible, go for the dbo.OrderDesc version. If execution time is of utmost importance and a few extra sessions are no big deal, then stick to the baseline (for now).
If you are in search of a solution that offers both speed and efficient packing, then the dbo.Order50FirstB version seems to be the right choice. It is only 0.05% less efficient than the best packer (dbo.OrderDesc), but over 3% faster. In the next episode I’ll be looking at ways to make the algorithm go faster. I’ll be making huge performance improvements – but packing efficiency will suffer. How much? As soon as I have completed all my tests and written the accompanying text, you’ll read it. Right here on sqlblog.com.