I am in the middle of working on a project that relies heavily on the ability to select random items from an Amazon Web Services SimpleDB domain. A little bit of google-fu turned up an answer to a post over at Stack Overflow which described Amazons recommended approach. A really rough psuedo-code implementation looks like this:
The algorithm stores a randomizer field with a random value on all of the items; when you need a row, generate another random value and select the first item whose randomizer attribute is less than this new random value. Unfortunately this approach is broken.
Why This Approach is Broken
The selection probability in the above algorithm is non-uniform, in fact it behaves like a roulette wheel selection were the item’s weight is proportional to the distance to the next highest randomizer. For a quick demonstration lets say we have 3 items and that all our random numbers are integers in the range of 1 to 10, we could possibly end up with the following state:
The selected on column shows a list of the randoms values for which that item is selected and you can see from the probability column that the distribution is far from uniform. Depending on our requirements this might be fine but lets see if the algorithm can’t be tuned a bit to produce more uniform results.
The Coin Toss
Instead of always selecting an item whose randomizer is less or equal to some random value alternate randomly between less or equal and greater or equal. The new code looks something like this:
Reusing the same example from above with 3 items the selection probability now looks like this:
Greater or Equal
Less or Equal
This is still not uniform however it is significantly better and there is still one more thing we can add.
Using a Dynamic Randomizer
In order to achieve that nice uniform distribution requires adding something that changes uniformly. The easiest way to do this is to change an items randomizer attribute every time it is selected so it’s future probability of selection is no longer constant.
Because this method isn’t conducive to a nice countable table that fits on a blog I’ll skip that bit, however it is a pretty straight forward exercise to run a quick simulation of the above algorithm and show that it does in fact generate nice uniform results.