Posted on

Selecting Random Items in Amazon SimpleDB

I am in the middle of working on a project that relies heavily on the ability to select random items from an Amazon Web Services SimpleDB domain. A little bit of google-fu turned up an answer to a post over at Stack Overflow which described Amazons recommended approach. A really rough psuedo-code implementation looks like this:
Item selectRandomItem() {
// Generate a random value
String randomValue = generateRandomString();

// Retrieve the data from SDB
Item item = randomLE(randomValue);
if(item == null) {
// Handle the edge case where there are no items
// with randomizers less than the random value
item = randomGE();
}
return item;
}

Item randomLE(String randomValue) {
return sdb.select(
"select * " +
"from MyStore " +
"where randomizer <= '" + randomValue + "'" +
"order by randomizer desc " +
"limit 1"
);
}
Item randomGE(String randomValue) {
return sdb.select(
"select * " +
"from MyStore " +
"where randomizer >= '" + randomValue + "'" +
"order by randomizer asc " +
"limit 1"
);
}
The algorithm stores a randomizer field with a random value on all of the items; when you need a row, generate another random value and select the first item whose randomizer attribute is less than this new random value. Unfortunately this approach is broken.

Why This Approach is Broken

The selection probability in the above algorithm is non-uniform, in fact it behaves like a roulette wheel selection were the item’s weight is proportional to the distance to the next highest randomizer. For a quick demonstration lets say we have 3 items and that all our random numbers are integers in the range of 1 to 10, we could possibly end up with the following state:
ItemRandomizerSelected OnProbability
131,2,3,4,5,6.6 (60%)
277,8,9.3 (30%)
31010.1 (10%)
The selected on column shows a list of the randoms values for which that item is selected and you can see from the probability column that the distribution is far from uniform. Depending on our requirements this might be fine but lets see if the algorithm can’t be tuned a bit to produce more uniform results.

The Coin Toss

Instead of always selecting an item whose randomizer is less or equal to some random value alternate randomly between less or equal and greater or equal. The new code looks something like this:
Item selectRandomItem() {
Item item = null;

// Generate some randomness
Random randomStream = new Random();
String randomValue = generateRandomString();

if(randomStream.nextBoolean()) {
// Retrieve the data from SDB
item = randomLE(randomValue);
if(item == null) {
// Handle the edge case where there are no items
// with randomizers less than the random value
item = randomGE(randomValue);
}
}
else {
// Retrieve the data from SDB
item = randomGE();
if(item == null) {
// Handle the edge case where there are no items
// with randomizers greater than the random value
item = randomLE(randomValue);
}
}
return item;
}
Reusing the same example from above with 3 items the selection probability now looks like this:
ItemRandomizerGreater or EqualLess or EqualProbability
131,2,31,2,3,4,5,6.45 (45%)
274,5,6,77,8,9.35 (35%)
3108,9,1010.20 (20%)
This is still not uniform however it is significantly better and there is still one more thing we can add.

Using a Dynamic Randomizer

In order to achieve that nice uniform distribution requires adding something that changes uniformly. The easiest way to do this is to change an items randomizer attribute every time it is selected so it’s future probability of selection is no longer constant.
// Select a random item using the coin toss approach
Item item = selectRandomItem();

// Generate a new randomizer
String newRandomizer = generateRandomString();

// Update the items randomizer
sdb.putAttribute("MyStore", item.name(), "randomizer", newRandomizer);
Because this method isn’t conducive to a nice countable table that fits on a blog I’ll skip that bit, however it is a pretty straight forward exercise to run a quick simulation of the above algorithm and show that it does in fact generate nice uniform results.

3 thoughts on “Selecting Random Items in Amazon SimpleDB

  1. Hi! Thanks for this article, I believe it will come in handy in the project I’m currently working on. Question though: How are you generating the random string that you keep in your db so that you have no duplicates? Or do you have duplicates? I was thinking that if you had duplicates then there would be some items that would never get selected.

    1. Hey Zack,

      Normally you would be correct, however after selecting an item you assign it a new randomizer so even if there are duplicates they aren’t around for long. Since everything has the same probability of having a duplicate at any given time it does not skew the uniform distribution either.

      It also depends on how long you make your randomizer string. If it is to short (e.g. There are more rows in your table then unique randomizers) then you will constantly have duplicates and some items will never be selected.

      My personal preference for the randomizer string is to generate 128-bits of random entropy and then encode it in base64.

      1. Ah, I get it now. Very clever. Thanks for answering this, I know it’s been a while since it was posted.

Leave a Reply

Your email address will not be published. Required fields are marked *