In this post, I will talk about a particular type of randomized algorithm called Reservoir sampling.
(pic from this page)
The task is to uniformly at random pick ‘k’ elements from a set of ‘n’ elements, where k <<< n ( k is much smaller than n) and n is so large that all the elements does not fit in memory. Here 'n' elements could be all the queries in Google search or all web pages on the internet etc. Note: This is a kind of sampling where we do sampling “without replacement” i.e. each item is only selected once from the set.
Let’s just re-fresh our Probability fundamentals
a) Probability that out of `N` elements an item is selected in ‘k’ tries (Without replacement):
P(selected) = 1 – P(not selected in k tries)
= 1 - ( (n-1)/N * (n-1)/(n-2) ... ) = n/N
b) With replacement P(selected)
P(selected) = 1 - ((N-1)/N) ^ n
Suppose, there is an infinite stream of elements and your task is to select an element from the stream and put them in a reservoir of size k. The constraint is that the probability of selecting each element in the reservoir (of size k) is equal which is equal to = k/total_size_set. Just note this is selecting an item without replacement, which means if one item is selected from the entire set, it is not replaced in the set again, (hence, cannot be re-selected.)
Case I: When the data fits in the memory.
This is an easy case when all the ‘N’ elements fits in the memory. The task to pick ‘k’ elements with equal probability can be defined as :-
- Generate a random number between 1 to N, and pick the element at that position and replace it with the last element in the set
- To pick the second element, generate a random number between 1 to (N-1) and pick the element at that position and replaced it with the 2nd from the last element.
- Do this process total ‘k’ times.
- In the end, the last k elements in the set are selected with equal probability.
Astute readers can related Case I with Fisher Yates shuffle algorithm
Case II: When all the ‘N’ elements doesn’t fit in memory and its an infinite stream of numbers.
The algorithm to achieve this is explained below: ( Assume k to be the size of reservoir and ith be any iteration.)
- If i is less than k, then the task is easy. Just take the current i-th element from the stream and place it in the empty position in the reservoir. At this point, all ‘k; elements in the set have an equal probability of being in the reservoir which is 1. (=k/k).
- Now, when i is greater than k. This is when it gets tricky. Since, there already are k elements in the reservoir, the question is whether or not to select this new element in the i-th iteration. The idea is to select this new element in the i-th iteration with a probability of k/i and replace it with any of the existing element in the reservoir with equal probability.
One can easily tell that the i-th element in the stream is selected by a probability of k/i. The question arises: What is the probability of selection for all the other k-1 elements in the set. The answer is again k/i. Interested readers can read on to below the proof.
I will show by Induction that the probability of the remaining elements selected in the reservoir is also k/i.
- Let’s assume that at the end of iteration (i-1), the probability that the k-elements in reservoir survives is k/(i-1). This is because we want to select k elements from a set of i-1 elements with equal probability ( coming from point a of this blog post).
- Now, at the i-th iteration, the new element in the stream is selected with a probability of k/i. This new element is replaced with any element from the reservoir.
- So, the probability that any element of the set is replaced is probability of selecting the i-th element and probability of selecting any of the one element from the reservoir of size k. This is nothing but the product of two probabilities =
= (k/i) * (1/k) = 1/i
- Thus, the probability that an element is replaced is 1/i and the probability that it survives is: .
= 1 - 1/i = (i-1)/i
- Thus, at the end of i-th iteration, probability that an elements exists (or is selected in the reservoir) is probability that it survives in the (i-1) iteration and it survives in the i-th iteration. This equals to (k/(i-1)) * ((i-1)/i) = k/i. (The first term coming induction and the second from the bullet point above.)
- Hence, proved that the probability that any item in the reservoir of size ‘k’ to be uniformly selected after i-th iteration is :-
P(selected) = k/i
References of this post;
a) Simple random sample : http://en.wikipedia.org/wiki/Simple_random_sample
b) Pic : http://www.cloudtm.eu/downloads/workload-analyzer-prototype#XMaldonadoMFLMR11
I will talk about the advanced topic of Weighted reservoir sampling and Distributed sampling sometime in the future. I have attached the links about them below:
- Weighted Reservior Sampling http://gregable.com/2007/10/reservoir-sampling.html
- Distributed Sampling http://blogs.msdn.com/b/spt/archive/2008/02/05/reservoir-sampling.aspx