4.4.1 Bootstrap sampling

Consider a data set $D_N$. A bootstrap data set $D_{(b)}$, $b=1,\dots ,B$ is created by randomly selecting $N$ points from the original set $D_ N$ with replacement.

Since $D_N$ itself contains $N$ points there is nearly always duplication of individual points in a bootstrap data set. Each point has equal probability $1/N$ of being chosen on each draw. Hence, the probability that a point is chosen exactly $k$ times is given by the binomial distribution

[ \mbox{Prob}\lbrace k \rbrace =\frac{N!}{k! (N-k)!} \left( \frac{1}{N}\right)^ k \left( \frac{N-1}{N}\right)^{N-k} \qquad 0 \le k \le N ]

Given a set of $N$ distinct values, there is a total of $ \binom {2N-1}{N}$ distinct bootstrap datasets. The number is quite large already for $N > 10$. For example, if $N=3$ and $D_N={a,b,c} $, we have $10$ different bootstrap sets:

$\lbrace a,b,c\rbrace,\lbrace a,a,b\rbrace,\lbrace a,a,c\rbrace,\lbrace b,b,a\rbrace,\lbrace b,b,c\rbrace,\\ \lbrace c,c,a\rbrace,\lbrace c,c,b\rbrace,\lbrace a,a,a\rbrace,\lbrace b,b,b\rbrace,\lbrace c,c,c\rbrace$.

Under balanced bootstrap sampling the $B$ bootstrap sets are generated in such a way that each original data point is present exactly $B$ times in the entire collection of bootstrap samples.