In part 5 of my series on the bin packing problem, I presented a method that sits somewhere in between the true row-by-row iterative characteristics of the first three parts and the truly set-based approach of the fourth part. I did use iteration, but each pass through the loop would use a set-based statement to process a lot of rows at once. Since that statement is fairly complex, I am sure that a single execution of it is far from cheap – but the algorithm used is efficient enough that the entire input set is processed after only a few repetitions of that statement.
This approach runs rings (performance-wise) around all the other solutions I had previously tried, and it also produces a more efficient packing than most (but not all) of the slower alternatives. After adding the set-based iteration algorithm to the table of test results and removing all alternatives that are both slower and less efficient (except the baseline, of course), I now only have a few options left:
Based on the measurements in this table, your choices when you need to implement a bin packing algorithm appear pretty clear – either go for a very fast algorithm that wastes very little bins, or use an algorithm that’s over 5 times as slow but wastes even less bins, or sacrifice another 35-40% performance to save yet another 0.01% of space. I have to note, though, that for a true efficiency comparison the algorithms have to be tested with many different sets of data instead of the single set I have used in this series. If you really have to implement a bin packing algorithm and efficiency is more relevant than speed, I’d advise you to conduct those tests, using all the algorithms in all the parts I wrote, using many different sets of sample data that match the distribution of your actual data as close as possible.
On the other hand, if you can live with a fairly high (but not the highest) packing efficiency and need the process to finish fast, go for set-based iteration. The performance difference between this version and all the others is significant enough to safely conclude that even with other data distributions, it will still be the fastest of them all. The only caveat is here is how the algorithms scale. If execution time for one algorithm grows linearly with the amount of input and another scales exponentially, then there will probably be some point where another algorithm will become the winner. I will take a look at that in a later blog post – but first, I want to try if I can improve the existing algorithm a bit more.
How to abuse integer division
When I was busy trying out the variations I’ll describe later in this post, I suddenly realized that one of the queries in the original version of my algorithm uses computations with fractions instead of integers. Since integer computation is generally cheaper than computation with fractional precision, changing this should improve performance a bit. I refer to the expression that calculates the minimum amount of sessions as the total number of candidates divided by the maximum session size, rounded up. I implemented this very straightforward with the expression: CEILING(1.0 * SUM(NumCandidates) / @MaxCandidatesPerSession). I multiply by 1.0 to convert from integer to numeric (basically a trick to save the few extra keystrokes that CEILING(CAST(SUM(NumCandidates) AS numeric(10,2)) needs). Otherwise the division would be done as integer division, truncating the fractional part; the CEILING function would then have no further effect. However, it’s also possible to use a trick that effectively changes the truncation of integer division to rounding up – add the divisor minus 1 to the divided before dividing. So I changed the formula for the computation of the minimum required number of sessions to (SUM(NumCandidates) + @MaxCandidatesPerSession – 1) / @MaxCandidatesPerSession and saved this version as SetBasedIter1. Now this is not the most executed query in the procedure, but it’s still executed a few times, and this change did indeed give me a performance improvement – albeit a tiny one.
Improving the first estimate
There are scenarios where the set-based iteration algorithm starts with an amount of bins that every human being will immediately recognise as insufficient. For example, if the bin size is 10 and there are three packages of size 6, everybody will immediately see that you need three bins. But my algorithm will calculate the total size of all packages (18), divide by bin size and round up to 2, and then attempt to squeeze all packages in those bins. The result will be that the outer loop will have to be executed a second time; the entire process will thus be far less efficient then it would have been if the algorithm had started with three bins.
To profit from this observation, I created yet another version of the algorithm and called it SetBasedIter2. This one uses two methods to calculate the amount of bins needed: the already familiar one (total size divided by bin size and rounded up), but also a simple count of the number of packages that are over half the bin size. The actual amount of bins used is then the highest of those two numbers. First tests (I did more extensive tests later) indicated that this version did indeed run a bit quicker while using the exact same amount of bins.
Encouraged by this success, I went even further. After all, we don’t need to limit ourselves to just looking at the number of packages over half the maximum size. The next step is packages over one third of the bin size. Let’s for example assume that the bin size is still 10, and we now have five packages of size 4. You will immediately see that only two of those packages fit in a bin, so we need three bins. But even with the improved algorithm, we still start with one bin too little – the total size of all packages is 20, divided by 10 and rounded up yields 2. And the number of packages over size 5 is 0, so we do indeed start with two bins, and we’ll need another iteration of the outer loop to get all packages in a bin. So SetBasedIter3 expands on SetBasedIter2 by also including a count of packaged over one third of the bin size, divided by 2 and rounded up. And I then also added SetBasedIter4, that adds yet another step to get a good first estimation of the amount of bins needed: the amount of packages over a quarter of the bin size, divided by three and rounded up. I could have gone on like that, but I decided to stop here and do some more extensive testing first.
The proper way to test
In the previous posts, I always tested with the same set of data, to enable easy comparison of the efficiency of the various algorithms. But I realised that the effect of the new versions of the algorithm depends strongly on the actual data distribution. If, for example, there are no packages over half the bin size, then the extra code to count them is just overhead that adds some CPU cycles but doesn’t change the outcome in any way. But if there are a lot of those large packages, then the extra CPU invested can pay off by saving an extra execution of the outer loop. So the only way to get a fair comparison between the different versions of the algorithm would be to run it lots of times, with different data all the time.
I set up an automated test script that generates a new set of test data, using a different seed but the same distribution (so you could argue that my test method is still far from perfect – and indeed, I do recommend anyone who wants to implement any bin packing algorithm in a production environment where speed and/or packing efficiency matter to set up their own tests, including ALL versions of the various algorithms, while making sure that the amount of data and its distribution matches the expected production data as closely as possible), then executes each of the algorithms ten times in a row, inserting the elapsed time and the amount of bins used in a table. I put this in an endless loop, tested that it worked the way I wanted, and then hit the execute button and then went to enjoy myself at the snooker club, followed by a good night sleep. The next morning, a total of 215 different sets of test data had been processed. I cancelled execution and noted the average execution time of each of the algorithms.
The most surprising result was that of SetBasedIter. When I tested this same algorithm for my previous post, it took 6,741 ms. This time, the exact same code resulted in an average execution time of 2,962 ms, over 50% less. Even the highest measured execution time of this algorithm, at 4,463 ms is much faster than my previously measured average execution time. I don’t know how to explain this. Maybe the distribution of test data in my original test case just happens to be extremely unlucky for this algorithm, maybe I had forgotten to shut down some other program that was running on my laptop? Maybe even a bit of both? If you have any explanation to offer, please let me know!
Other than that, the results confirmed both my suspicions: that the integer calculation of SetBaserIter1 is a tiny bit faster than the numeric computation of the original version, and that at some point, the overhead of all the extra computations in the other versions starts to exceed the average savings due to reduced number of executions of the outer loop. It turns out that this cut-off point comes pretty quick – counting the number of packages over half the maximum bin size pays off, but also counting the number of packages over a third of the maximum bin size already makes the average execution time go up.
How about efficiency?
Though not included in the table above, I have also monitored the number of sessions used by each of the algorithms. In most test cases, all versions of the algorithm used the same amount of bins, but there were a few exception. Out of 215 sets of test data, 7 used 1 extra session with the algorithms SetBasedIter2, SetBasedIter3, and SetBasedIter4, and 3 others used 1 extra session with only the algorithms SetBasetIter3 and SetBasedIter4.
The explanation for this difference is partly simple, partly not. The simple part is that, when extra bins are introduced in the first pass, the division of packages across bins will be totally different, so that bins will have different amounts of empty space remaining. And if, for instance, the remaining empty space for two bins is divided as 2+2 in one case and as 3+1 in another case, that remaining package of size 3 will take up an extra bin in the former case, but not in the latter.
I had expected this difference to show up (that’s why I kept counting sessions in my fully automated overnight-version of the test script). I had no idea beforehand of how frequent I would see these differences, and what algorithms would benefit from the differences. So I was surprised to see how little versions of the test data were actually impacted, and equally surprised that in cases where there was a difference, it was always in the disadvantage of the versions with a better estimation for the first pass. Apparently, starting with more sessions (even if we know they will definitely be needed) is never good for the packing efficiency; for some reason the packing is less tight; more empty room remains. Again, if you have any explanation of this, please let me know!
If you have read the explanation of the set-based iteration method in my previous post on this topic, you’ll remember that at one point I wrote: “The threshold calculation, the ranking of bins by remaining capacity, and the test that a package fits the bin with the same rank may all seem pretty pointless when assigning the first bunch of packages to the bins. But the same code is reused for later iterations and then these are all important ingredients, as you will see in a bit”. But if that extra complexity is only required for the first execution of the code, then why not use separate code paths for the first and the subsequent executions? That way, the first execution can use a less complicated (and hence faster, I hope) query.
I decided to put that theory to the test. I created another version of the code, SetBasedIter2B, starting at SetBasedIter2 (the fastest version so far). I duplicated the code of the complex query that creates the correct amount of session and the even more complex query that assigns registrations to sessions (which also required me to reorder some of the code), then looked at possible optimizations.
For the query that creates new sessions, the first execution could be simplified because I don’t have to find the highest used session number for each quarter. The subsequent sessions could also be simplified by changing the join type for those highest used session numbers to an inner join (the outer join was only needed for the first execution), and by removing the extra line to count registrations over half the maximum session size – the algorithm guarantees that those will all be assigned in the first pass of the outer loop, so there’s no need to count them on subsequent passes.
For the query that assigns registrations to sessions, the first execution in the inner loop can be simplified a lot. If it’s also the first execution of the outer loop, all sessions are still completely empty. But it’s also possible that it’s a later iteration of the outer loop, in which case there may be sessions with some empty space (but not enough for any of the remaining registrations) as well as new sessions that are completely empty. So instead of filtering on SpaceLeft > 0, I now filter on SpaceLeft = @MaxCandidatesPerSession. Since all sessions now have the same amount of empty space, the ordering for the ROW_NUMBER() function is not relevant; I made this explicit in the query (it’s unfortunately not allowed to leave out the ORDER BY clause), so that the optimizer doesn’t have to do unnecessary work. But the best simplification is that I can leave out the entire threshold calculation – since all (new) sessions are still completely empty, the threshold is not relevant.
I could also have added an extra codepath to separate the first execution of the inner loop in the first execution of the outer loop from the first execution of the inner loop in a later execution of the outer loop. In that case, I could have omitted the WHERE clause in both CTE expressions for the former codepath. But since the columns used in these WHERE clauses are not indexed, I don’t expect any saving from this, whereas the logic to choose the right code path would increase complexity (and maybe even execution time).
I included this version as well in the huge test on 215 sets of test data. The amount of sessions for SetBasedIter2B was always the same as for SetBasedIter2, but the performance was now a bit better – 2,789 ms on average instead of the 2,884 for SetBasedIter2; a 3.3% saving.
Sacrificing speed for efficiency
At this point I had run out of ideas to optimize the algorithm any further. But I did want to try something else. When testing the versions that sometimes add a few extra sessions based on the number of registrations over half, one third, or one quarter of the maximum, I found that these would usually produce the same amount of sessions, but sometimes produce one extra sessions. And though I could not explain it, I did decide to try if the reverse would also hold. In other words, if I deliberately start with too few sessions, will the packing efficiency increase?
In order to test this, I created yet two new versions of my algorithm: SetBasedIter1BMin10 and SetBasedIter1BMin25. Both are based on SetBasedIter1, but with the performance enhancement of the SetBasedIter2B version (duplicating the queries and simplifying them for the first pass). The change I made was that the amount of new sessions used on each pass is now set to 10% or 25% less than the amount needed (rounded up, to prevent endless repetition of the outer loop). This guarantees that we will always need extra iterations of the outer loop, so performance will definitely suffer. But I hope this will be offset by increased efficiency.
These two versions were included in the test with 215 sets of random data as well. The speed was as expected: 4,171 ms for SetBasedIter1BMin10 (almost 50% slower than the fastest option algorithm), and 5,903 ms for SetBasedIter1BMin25 (over 111% slower). But the packing efficiency did indeed increase. For the 215 sets, the amount of sessions was decreased by 0.57% on average with the Min10 version, and by 1.55% with the Min25 version. For the original set of test data, the Min10 version required 19,182 session; the Min25 version needed 18,986. Still not as good as the most efficient (but slow!) cursor-based versions, though; it seems my algorithm cannot beat those versions for packing efficiency.
After all these measurements, it’s high time for a new overview of all the relevant versions of the bin packing algorithms. I consider a version irrelevant if there is another version that is both faster and more efficient. Leaving those versions out (except the baseline), I now have the following overview. Note that the execution time for all the “SetBasedIter” versions is based on testing with 215 different sets of random data, where each test is repeated 10 times and the fastest and slowest times are discarded; the execution time for the other algorithms and the number of sessions for all algorithms is based on a single execution with the standard set of data used in the previous parts of this series.
1 I expect the SetBasedIter1B version to be 3-3.5% faster (~2.84 ms), but I did not include this version in my tests. It’s quite easy to write this version – simply start with the SetBasedIter2B version and remove the extra code required to count the number of registrations over half the maximum session size.
2 Though the number of sessions for SetBasedIter2B is equal to that for SetBasedIter1 in this specific test case, I observed it to be one more in 3.2% of all my test cases. So this version is (slightly) less efficient.
When I finished the previous part of this series, I expected to be able to cover both the final improvements to the set-based iteration algorithm and the scaling tests in this part. But as I was working, I found many more improvement possibilities than I expected. This post is, again, much longer than what I think a blog post should be.
So I will have to postpone testing how the various algorithms scale to the next post in this series. Will that post conclude this series? Frankly, I don’t know yet. At this time, I have no ideas for further improvements or other approaches. However, if someone else has a good idea, feel free to let me know (in the comments or by mail). I’ll look at your ideas and test them – and if they are good, I will include them in yet another episode in this series (that was originally planned to span 5 posts at most – oops!)
File Attachment: ImEx6.zip