There are a couple common equations of importance for what I'm working on at the moment. Both of them involve the idea of a set of items, where each item has a probability of being chosen (and more than one may be chosen in some cases). It's trivial to go item by item, draw a random number, and see if that item is chosen, but this is obviously inefficient. I wanted equations that would allow me to stick in a single random number and it will tell me the first relative item to be chosen, with a fixed number of operations.

The first equation is based on each item in the set having an equal probability of being chosen (note that more than one may be chosen, or none may be chosen):

probability(item) = perItemProbability

The general form of this decision is (rnd <= perItemProbability), where rnd is between 0 and 1 (as is the case with all instances of rnd in this post). Trivial. The equation we want to derive, however, not quite as much.

While it's tempting to think that the probability of any of n items being chosen is (perItemProbability * n), it's not. This isn't a linear function. Rather, you have to derive it from the probability that NONE of those items will be chosen, which is exponential.

(1 - perItemProbability)^n : probability none of n items will be chosen

1 - (1 - perItemProbability)^n : probability that at least one of n items will be chosen

rnd = 1 - (1 - perItemProbability)^n : switch out probability for a random number

1 - rnd = (1 - perItemProbability)^n : arithmetic rearrangement

log(1 - rnd) = log(1 - perItemProbability) * n : take the log of both sides

log(1 - rnd) / log(1 - perItemProbability) = n : arithmetic rearrangement

This gives us what we want: the index of the first chosen item. Of course you have to watch for values that result in taking log(0). You also have to notice that the range of the result is [0, infinity]; this is correct, but if the value is greater than the upper bound of your set of items, you have to recognize that no item was chosen.

Next up is the case where the probability of each item follows a geometric series:

probability(n + 1) = probability(n) * (1/divisor)

probability(0) = p0

sum of probabilities = 1

With this series there's no misleading simple solution, so we're forced to go straight to a more complex one. We know the sum of all the probabilities is 1, but we don't know what value of p0 to use to produce that sum. We can trivially calculate this using the geometric series equation:

1 = p0 * (1 - (1/divisor)^n) / (1 - (1/divisor)) : geometric series equation

p0 = (1 - (1/divisor)) / (1 - (1/divisor)^n) : arithmetic rearrangement

Okay, that's the easy part. Now the equation itself. Here again we'll solve for n:

rnd = p0 * (1 - (1/divisor)^n) / (1 - (1/divisor)) : substitute rnd for sum of probability

rnd / p0 * (1 - (1/divisor)) = 1 - (1/divisor)^n : arithmetic rearrangement

1 - (rnd / p0) * (1 - (1/divisor)) = (1/divisor)^n : arithmetic rearrangement

log(1 - (rnd / p0) * (1 - (1/divisor))) = log(1/divisor) * n : log of both sides

n = log(1 - (rnd / p0) * (1 - (1/divisor))) / log(1/divisor) : arithmetic rearrangement

And thus the solution: a long, ugly equation to calculate something simple. Again, you have to watch for values that result in errors: divisor must be greater than 1; but that's pretty obvious: a geometric series will only sum to a finite number if (1/divisor) is less than 1. You also have to note that when rnd = 1, n = the number of items in the set, which will be out of bounds.

Isn't math fun?

## Saturday, February 06, 2010

Subscribe to:
Post Comments (Atom)

## No comments:

Post a Comment