4.5.1 The bootstrap principle

Given an unknown parameter $\theta $ of a distribution $F_{\mathbf z}$ and an estimator $\hat{\boldsymbol {\theta }}$, the goal of any estimation procedure is to derive or approximate the distribution of $\hat{\boldsymbol {\theta }}-\theta $. For example, computing the variance of $\hat{\boldsymbol {\theta }}$ requires the knowledge of $F_{\mathbf z}$ and the calculation of $E_{{\mathbf D}_N}[(\hat{\boldsymbol {\theta }}-E[\hat{\boldsymbol {\theta }}])^2]$. In practical contexts, however, $F_{\mathbf z}$ is unknown and $E_{{\mathbf D}_N}[(\hat{\boldsymbol {\theta}}-E[\hat{\boldsymbol {\theta }}])^2]$ cannot be computed analytically. The rationale of the bootstrap approach is (i) to replace $F_{\mathbf z}$ by its empirical counterpart $\hat{F}_{\mathbf z}$ and (ii) to approximate $E_{{\mathbf D}_N}[(\hat{\boldsymbol {\theta}}-E[\hat{\boldsymbol {\theta }}])^2]$ by a Monte Carlo simulation approach, where several samples of size $N$ are generated by resampling $D_N$ with replacement.
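
A minimal sketch of this resampling scheme in Python (using NumPy; the function and variable names, such as `bootstrap_variance`, are illustrative and not taken from the text):

```python
import numpy as np

def bootstrap_variance(D, estimator, B=200, seed=None):
    """Monte Carlo bootstrap estimate of Var[theta_hat].

    D         : 1-D array of N i.i.d. observations (the dataset D_N)
    estimator : function mapping a sample to a scalar estimate theta_hat
    B         : number of bootstrap replications
    """
    rng = np.random.default_rng(seed)
    N = len(D)
    # Draw B bootstrap samples of size N (resampling D_N with replacement)
    # and recompute the estimator on each of them.
    theta_b = np.array([estimator(rng.choice(D, size=N, replace=True))
                        for _ in range(B)])
    # Variance of the bootstrap replicates around their own mean.
    return theta_b.var(ddof=1)

# Toy usage: variance of the sample mean of N = 100 Gaussian observations;
# the bootstrap estimate should lie close to sigma^2 / N = 4 / 100.
rng = np.random.default_rng(0)
D_N = rng.normal(loc=0.0, scale=2.0, size=100)
print(bootstrap_variance(D_N, np.mean, B=200, seed=1))
```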

The outcome of a bootstrap technique is a Monte Carlo approximation of the distribution of $\hat{\boldsymbol {\theta }}_{(b)}-\hat{\theta}$, where $\hat{\boldsymbol {\theta }}_{(b)}$ denotes the estimator computed on the $b$-th bootstrap sample. In other terms, the variability of $\hat{\boldsymbol {\theta}}_{(b)}$ (based on the empirical distribution) around $\hat{\theta}$ is expected to mimic the variability of $\hat{\boldsymbol {\theta }}$ (based on the true distribution) around $\theta $.

The bootstrap principle relies on the following two properties: (i) as $N$ gets larger, the empirical distribution $\hat{F}_{\mathbf z}(\cdot )$ converges (almost surely) to $F_{\mathbf z}(\cdot )$ and (ii) as $B$ gets larger, $\text {Var}_{\text{bs}}[\hat{\boldsymbol {\theta }}]$ converges (in probability) to the variance of the estimator $\hat{\boldsymbol {\theta }}$ based on the empirical distribution. In other terms

\begin{equation} \text {Var}_{\text {bs}}[\hat{\boldsymbol{\theta }}] \stackrel{B \rightarrow \infty }{\rightarrow } E_{\widehat{{\mathbf D}_ N}}[(\hat{\boldsymbol {\theta}}-E[\hat{\boldsymbol {\theta }}])^2] \stackrel{N \rightarrow \infty }{\rightarrow } E_{{{\mathbf D}_ N}}[(\hat{\boldsymbol{\theta }}-E[\hat{\boldsymbol {\theta }}])^2] \end{equation}

where $E_{\widehat{{\mathbf D}_ N}}[(\hat{\boldsymbol {\theta}}-E[\hat{\boldsymbol {\theta }}])^2]$ stands for the plug-in estimate of the variance of $\hat{\boldsymbol {\theta }}$ based on the empirical distribution.
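
For concreteness, $\text {Var}_{\text {bs}}[\hat{\boldsymbol{\theta }}]$ denotes the sample variance of the $B$ bootstrap replicates; a common convention (assumed here, with the $1/(B-1)$ normalisation, while some presentations use $1/B$) is

\begin{equation} \text {Var}_{\text {bs}}[\hat{\boldsymbol{\theta }}] = \frac{1}{B-1} \sum _{b=1}^{B} \left( \hat{\theta }_{(b)} - \bar{\theta }_{(\cdot )} \right)^2, \qquad \bar{\theta }_{(\cdot )} = \frac{1}{B} \sum _{b=1}^{B} \hat{\theta }_{(b)}. \end{equation}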

In practice, for a small finite $N$, bootstrap estimation inevitably returns some error. This error is a combination of a statistical error and a simulation error. The statistical error component is due to the difference between the underlying distribution $F_{\mathbf z}(\cdot)$ and the empirical distribution $\hat{F}_{\mathbf z}(\cdot )$. The magnitude of this error depends on the choice of the estimator $\hat{\boldsymbol {\theta }}({\mathbf D}_N)$ and decreases as the number $N$ of observations increases.

The simulation error component is due to the use of empirical (Monte Carlo) properties of $\hat{\boldsymbol {\theta }}({\mathbf D}_N)$ rather than exact properties. The simulation error decreases as the number $B$ of bootstrap replications increases.

Unlike the jackknife method, in the bootstrap the number of replications $B$ can be adjusted to the available computational resources. In practice two “rules of thumb” are typically used:

  1. Even a small number of bootstrap replications, e.g. $B=25$, is usually informative. $B=50$ is often enough to give a good estimate of $\text {Var}\left[\hat{\boldsymbol {\theta }} \right]$.

  2. Very seldom are more than $B=200$ replications needed for estimating $\text {Var}\left[\hat{\boldsymbol {\theta }} \right]$. Much bigger values of $B$ are required for bootstrap confidence intervals.
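
As a quick illustration of these rules of thumb, the following sketch (reusing the illustrative `bootstrap_variance` function introduced above; the exact numbers depend on the random seed) shows how the estimate of $\text {Var}\left[\hat{\boldsymbol {\theta }} \right]$ fluctuates for small $B$ and stabilises as $B$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
D_N = rng.normal(loc=0.0, scale=2.0, size=100)   # same toy dataset as above

# Bootstrap variance of the sample mean for increasing numbers of replications;
# the estimates wobble for small B and settle near the plug-in value s^2 / N.
for B in (25, 50, 200, 2000):
    est = bootstrap_variance(D_N, np.mean, B=B, seed=B)
    print(f"B = {B:4d}   Var_bs = {est:.4f}")
```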

Note that the use of rough statistics $\hat{\boldsymbol {\theta }}$ (e.g. unsmooth or unstable) can make the resampling approach behave wildly. Examples of nonsmooth statistics are sample quantiles, including the median.
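
To see why nonsmooth statistics are problematic, here is a small assumed toy example (not from the text): with an odd sample size, every bootstrap replicate of the median coincides with one of the original observations, so the bootstrap distribution of the median is coarse and discrete, whereas that of the mean varies smoothly.

```python
import numpy as np

rng = np.random.default_rng(0)
D_N = rng.normal(size=15)      # small sample with odd N
B = 1000

# Bootstrap replicates of the median and of the mean.
med_b = [np.median(rng.choice(D_N, size=len(D_N), replace=True)) for _ in range(B)]
mean_b = [np.mean(rng.choice(D_N, size=len(D_N), replace=True)) for _ in range(B)]

# The median replicates take at most N distinct values (the observations
# themselves), while the mean replicates are essentially all different.
print(len(set(med_b)), "distinct values among", B, "median replicates")
print(len(set(mean_b)), "distinct values among", B, "mean replicates")
```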

In general terms, for i.i.d. observations, the following conditions are required for the convergence of the bootstrap estimate:

  1. the convergence of $\hat{F}$ to $F$ as $N \rightarrow \infty $ (guaranteed by the Glivenko-Cantelli theorem);

  2. a plug-in estimator, i.e. one such that the estimate $\hat{\theta }$ is the corresponding functional of the empirical distribution (see the worked example after this list):

    \[ \theta = t(F) \quad \rightarrow \quad \hat{\theta } = t(\hat{F}) \]

    This is satisfied for sample means, standard deviations, variances, medians and other sample quantiles.

  3. a smoothness condition on the functional. This is not true for extreme order statistics such as the minimum and the maximum values.
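
As a worked example of the plug-in condition in point 2, take the mean functional: since the empirical distribution $\hat{F}_{\mathbf z}$ puts mass $1/N$ on each observation $z_i$, the plug-in estimate is the sample average,

\begin{equation} \theta = t(F_{\mathbf z}) = \int z \, dF_{\mathbf z}(z) \quad \rightarrow \quad \hat{\theta } = t(\hat{F}_{\mathbf z}) = \int z \, d\hat{F}_{\mathbf z}(z) = \frac{1}{N} \sum _{i=1}^{N} z_i. \end{equation}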

But what happens when the dataset $D_N$ is not i.i.d. sampled from a distribution $F$? In such non-conventional configurations, the most basic version of the bootstrap might fail. Examples are: incomplete data (survival data, missing data), dependent data (e.g. the variance of a correlated time series) and dirty data (outliers). In these cases specific adaptations of the bootstrap procedure are required. For reasons of space, we will not discuss them here. However, for a more exhaustive view of the bootstrap, we refer the reader to the volume Exploring the limits of bootstrap, edited by Le Page and Billard, which is a compilation of the papers presented at a special conference of the Institute of Mathematical Statistics held in Ann Arbor, Michigan, in 1990.