Chapter 6 Bootstrap
- Bootstrap can be summarized as “simulating from an estimated model”
- It is used for inference (confidence intervals / hypothesis testing)
- It can also be used for estimating the predictive power of a model (similarly to cross validation) via out-of-bootstrap generalization error
6.1 Motivation
Consider i.i.d. data $Z_1, \dots, Z_n \sim P$ with $Z_i = (X_i, Y_i)$, and assume a statistical procedure $\hat{\theta} = g(Z_1, \dots, Z_n)$. Here $g(\cdot)$ can be a point estimator for a regression coefficient, a non-parametric curve estimator, or a generalization error estimator based on one new observation, e.g. $$\hat{\theta}_{n+1} = g(Z_1, \dots, Z_n, Z_{new}) = \big(Y_{new} - \hat{m}_{Z_1, \dots, Z_n}(X_{new})\big)^2.$$ To make inference, we want to know the distribution of $\hat{\theta}$. In some cases, we can derive this distribution analytically if we know the distribution $P$. The central limit theorem states that a (suitably standardized) sum of i.i.d. random variables is approximately normal as $n \to \infty$. Therefore, we know that an estimator of the mean of the random variables is asymptotically normal, $$\hat{\theta}_n = n^{-1} \sum_{i=1}^{n} X_i \approx \mathcal{N}(\mu_X, \sigma_X^2 / n) \quad (n \to \infty),$$ for any $P$ with finite variance. However, if $\hat{\theta}$ does not involve a sum of random variables and the CLT does not apply, it is not as straightforward to obtain the distribution of $\hat{\theta}$. Also, if $P$ is not the normal distribution but some other distribution, we cannot find the distribution of $\hat{\theta}$ easily. The script mentions the median estimator as an example where already the asymptotic variance depends on the density of $P$. Hence, deriving properties of estimators analytically, even just the asymptotic ones, is a pain. Therefore, if we knew $P$, we could simply simulate many times and obtain the distribution of $\hat{\theta}$ this way: draw many samples $(X_i^*, Y_i^*)$ from that distribution and compute $\hat{\theta}$ for each draw.
The problem is that we don't know $P$. But we have a data sample that was generated from $P$. Hence, we can instead take the empirical distribution $\hat{P}$ that places probability mass $1/n$ on each observation, draw a sample from this distribution (which simply means drawing uniformly at random from our sample with replacement) and compute our estimate of interest from this sample: $$\hat{\theta}^* = g(Z_1^*, \dots, Z_n^*).$$ We can do that many times to get an approximate distribution of $\hat{\theta}$. A crucial assumption is that $\hat{P}$ resembles $P$. If our data are not i.i.d., this may not be the case and hence bootstrapping might be misleading. Below, we can see that i.i.d. sampling (red line) resembles the true distribution (black line) quite well, whereas biased sampling (blue line) obviously does not: we produce a sample that places higher probability mass on large absolute values.
library("tidyverse")
set.seed(1)
pop <- tibble(pop = rnorm(10000) * 1:10000)  # population: larger index -> larger absolute value
iid <- sample(pop$pop, 1000, replace = TRUE)  # i.i.d. sample from the population
# non-i.i.d. sample: biased towards high absolute values
ind <- rbinom(10000, size = 1, prob = seq(0, 1, length.out = 10000))
not_iid <- pop$pop[as.logical(ind)]  # biased subset of the population
not_iid <- sample(not_iid, 1000, replace = TRUE)  # reduce sample size to 1000
out <- tibble(iid = iid, not_iid = not_iid) %>%
  gather(type, value, iid, not_iid)
ggplot(out, aes(x = value, color = type)) +
  geom_density() +
  geom_density(aes(x = pop, color = NULL), data = pop)  # true population density in black
We can summarize the bootstrap procedure as follows.
- Draw a bootstrap sample $Z_1^*, \dots, Z_n^*$.
- Compute the estimator $\hat{\theta}^*$ based on that sample.
- Repeat the first two steps $B$ times to get bootstrap estimators $\hat{\theta}^{*1}, \dots, \hat{\theta}^{*B}$ and therefore an estimate of the distribution of $\hat{\theta}$.
- Use the $B$ bootstrap estimators as approximations for the bootstrap expectation, quantiles and so on, e.g. $$E^*[\hat{\theta}_n^*] \approx B^{-1} \sum_{j=1}^{B} \hat{\theta}_n^{*j}$$ (see the R sketch below).
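As a minimal sketch in R (not from the script; the exponential data, the sample size and the choice of the median are purely illustrative), the procedure could look like this:
# Minimal sketch: approximate the bootstrap distribution of the median,
# an estimator whose variance is hard to derive analytically (see above).
set.seed(42)
n <- 100
z <- rexp(n)                  # observed sample from some unknown P (illustrative)
theta_hat <- median(z)        # estimator computed on the data

B <- 1000
theta_star <- replicate(B, {
  z_star <- sample(z, n, replace = TRUE)  # draw Z_1*, ..., Z_n* from P_hat
  median(z_star)                          # compute theta* on the bootstrap sample
})

mean(theta_star)  # approximates the bootstrap expectation E*[theta*]
sd(theta_star)    # approximates the bootstrap standard deviation of theta_hat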
6.2 The Bootstrap Distribution
With $P^*$, we denote the bootstrap distribution, which is the conditional probability distribution induced by sampling i.i.d. from the empirical distribution $\hat{P}$. Hence, $P^*$ of $\hat{\theta}^*$ is the distribution that arises from sampling i.i.d. from $\hat{P}$ and applying the transformation $g(\cdot)$ to the data. Conditioning on the data allows us to treat $\hat{P}$ as fixed.
6.3 Bootstrap Consistency
The bootstrap is called consistent if $$P[a_n(\hat{\theta} - \theta) \le x] - P^*[a_n(\hat{\theta}^* - \hat{\theta}) \le x] \to 0 \quad (n \to \infty).$$ Consistency of the bootstrap typically holds if the limiting distribution is normal and the samples $Z_1, \dots, Z_n$ are i.i.d. Consistency of the bootstrap implies consistent variance and bias estimation:
$$\frac{\operatorname{Var}^*(\hat{\theta}^*)}{\operatorname{Var}(\hat{\theta})} \to 1, \qquad \frac{E^*(\hat{\theta}^*) - \hat{\theta}}{E(\hat{\theta}) - \theta} \to 1.$$ You can think of $\theta$ as the true parameter and $\hat{\theta}$ as the estimate based on a sample. Similarly, in the bootstrap world, $\hat{\theta}$ is the true parameter and the $\hat{\theta}^{*j}$ are estimators of this parameter. The bootstrap world is an analogue of the real world, so in our bootstrap simulation we know the true parameter $\hat{\theta}$. From our simulation, we get many $\hat{\theta}^{*j}$ and can compute the bootstrap expectation $E^*[\hat{\theta}_n^*] \approx B^{-1} \sum_{j=1}^{B} \hat{\theta}_n^{*j}$. The idea is now to generalize from the bootstrap world to the real world, i.e. by saying that the relationship between $\hat{\theta}^*$ and $\hat{\theta}$ is similar to the one between $\hat{\theta}$ and $\theta$.
A simple trick to remember all of this is:
- if there is no hat, add one
- if there is a hat, add a star.
6.4 Bootstrap Confidence Intervals
Note that these confidence intervals are not obtained by simply taking the quantiles of the bootstrap distribution. The trick is really to make use of the analogy between the real world and the bootstrap world: the real-world error $\hat{\theta} - \theta$ is approximated by its bootstrap-world counterpart $\hat{\theta}^* - \hat{\theta}$, which leads to the reversed-quantile interval $$\big[\hat{\theta} - q_{1-\alpha/2},\ \hat{\theta} - q_{\alpha/2}\big],$$ where $q_{\beta}$ denotes the $\beta$-quantile of the bootstrap distribution of $\hat{\theta}^* - \hat{\theta}$. So when we see that our bootstrap expectation $E^*[\hat{\theta}_n^*]$ is way higher than $\hat{\theta}$, then we should also believe that our $\hat{\theta}$ is higher than $\theta$; the interval above corrects in the opposite direction and thereby accounts for that.
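A minimal sketch in R, continuing with the illustrative median example from above (all names are mine, not from the script):
# Minimal sketch: reversed-quantile bootstrap CI for the median of a sample z.
set.seed(42)
z <- rexp(100)
theta_hat <- median(z)
theta_star <- replicate(1000, median(sample(z, replace = TRUE)))

alpha <- 0.05
q <- quantile(theta_star - theta_hat, c(alpha / 2, 1 - alpha / 2))
c(lower = theta_hat - q[[2]], upper = theta_hat - q[[1]])
# Compare with the naive percentile interval quantile(theta_star, c(0.025, 0.975)),
# which does not correct for a bias visible in the bootstrap distribution.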
6.5 Bootstrap Estimator of the Generalization Error
We can also use the bootstrap to estimate the generalization error $E[\rho(Y_{new}, \hat{m}(X_{new}))]$. In the bootstrap world, the analogous quantity is $$E^*[\rho(Y_{new}^*, \hat{m}^*(X_{new}^*))].$$
- We draw a sample $(Z_1^*, \dots, Z_n^*, Z_{new}^*)$ from $\hat{P}$.
- We compute the bootstrapped estimator $\hat{m}^*(\cdot)$ based on $(Z_1^*, \dots, Z_n^*)$.
- We estimate $E^*[\rho(Y_{new}^*, \hat{m}^*(X_{new}^*))]$, where the expectation is with respect to both training and test data.
We can rewrite the bootstrap generalization error as follows: $$E^*[\rho(Y_{new}^*, \hat{m}^*(X_{new}^*))] = E^*_{train}\Big[E^*_{test}\big[\rho(Y_{new}^*, \hat{m}^*(X_{new}^*)) \mid train\big]\Big].$$ Conditioning on the training data in the inner expectation, $\hat{m}^*(\cdot)$ is non-random / fixed; the only random component is $Z_{new}^* = (X_{new}^*, Y_{new}^*)$. Since we draw from the empirical distribution, which places probability mass $1/n$ on every data point, we can calculate the inner (discrete) expectation easily via $E(X) = \sum_{j=1}^{n} p_j x_j = n^{-1} \sum_{j=1}^{n} x_j$. The expectation therefore becomes $$E^*_{train}\Big[n^{-1} \sum_{i=1}^{n} \rho(Y_i, \hat{m}^*(X_i))\Big] = n^{-1} \sum_{i=1}^{n} E^*\big[\rho(Y_i, \hat{m}^*(X_i))\big].$$
We can see that there is no need to actually draw $Z_{new}^*$. The final algorithm looks as follows (a minimal R sketch follows the list):
- Draw a bootstrap sample $(Z_1^*, \dots, Z_n^*)$.
- Compute the bootstrap estimator $\hat{m}^*(\cdot)$.
- Evaluate this estimator on all original data points and average over them, i.e. $err^* = n^{-1} \sum_{i=1}^{n} \rho(Y_i, \hat{m}^*(X_i))$.
- Repeat the steps above $B$ times and average all error estimates to get the bootstrap GE estimate, i.e. $GE^* = B^{-1} \sum_{j=1}^{B} err^*_j$.
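A minimal sketch in R, using a simple linear model and squared error loss (the simulated data and model are purely illustrative):
# Minimal sketch: bootstrap estimate of the generalization error of a linear
# model with squared error loss rho.
set.seed(42)
n <- 100
d <- data.frame(x = runif(n))
d$y <- 2 * d$x + rnorm(n)

B <- 200
err_star <- replicate(B, {
  idx <- sample(n, replace = TRUE)      # bootstrap sample (Z_1*, ..., Z_n*)
  fit <- lm(y ~ x, data = d[idx, ])     # bootstrapped estimator m*
  mean((d$y - predict(fit, d))^2)       # evaluate on all n original data points
})
mean(err_star)                          # bootstrap GE estimate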
6.6 Out-of-Bootstrap Sample for Estimating the GE
One can criticize the GE estimate above because some observations are used in both the training and the test set. This leads to over-optimistic estimates and can be avoided by using the out-of-bootstrap approach. With this technique, we first generate a bootstrap sample to compute our estimator and then use the remaining observations, i.e. those not appearing in the bootstrap sample, to evaluate the estimator. We do that $B$ times; the size of the test set varies from iteration to iteration. You can see this as a kind of cross-validation where on average about $1 - e^{-1} \approx 36.8\%$ of the data end up in the test set. The difference is that some observations appear multiple times in the training data, so the training set always has size $n$ (instead of, for example, $0.9\,n$ for 10-fold CV).
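A minimal sketch in R, with the same illustrative linear-model setup as above:
# Minimal sketch: out-of-bootstrap estimate of the generalization error.
set.seed(42)
n <- 100
d <- data.frame(x = runif(n))
d$y <- 2 * d$x + rnorm(n)

B <- 200
oob_err <- replicate(B, {
  idx <- sample(n, replace = TRUE)            # training indices (with replacement)
  oob <- setdiff(seq_len(n), idx)             # observations not in the bootstrap sample
  fit <- lm(y ~ x, data = d[idx, ])
  mean((d$y[oob] - predict(fit, d[oob, ]))^2) # evaluate only on out-of-bootstrap points
})
mean(oob_err)                                 # out-of-bootstrap GE estimate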
6.7 Double Bootstrap Confidence Intervals
Confidence intervals are almost never exact, meaning that $$P[\theta \in I^*(1-\alpha)] = 1 - \alpha + \Delta,$$ where $I^*(1-\alpha)$ is a bootstrap confidence interval with nominal coverage $1-\alpha$ and $\Delta$ is the coverage error. However, by changing the nominal coverage to some level $1-\alpha'$, it is possible to make the actual coverage equal to the desired $1-\alpha$, i.e.
$$P[\theta \in I^*(1-\alpha')] = 1 - \alpha.$$ The problem is that $\alpha'$ is unknown. But another level of bootstrap can be used to estimate it, yielding $\hat{\alpha}'$, which typically achieves $$P[\theta \in I^*(1-\hat{\alpha}')] = 1 - \alpha + \Delta' \quad \text{with } \Delta' \text{ smaller (in absolute value) than } \Delta.$$
To implement a double bootstrap confidence interval, proceed as follows:
- Draw a bootstrap sample $(Z_1^*, \dots, Z_n^*)$.
- From this sample, draw $B$ second-level bootstrap samples and compute the estimator of interest and one confidence interval $I^{**}(1-\alpha)$ based on these $B$ second-level bootstrap samples.
- Evaluate whether $\hat{\theta}$ lies within the bootstrap confidence interval from the previous step: $cover^*(1-\alpha) = \mathbf{1}\big[\hat{\theta} \in I^{**}(1-\alpha)\big]$.
- Repeat the steps above $M$ times to get $cover^*_1, \dots, cover^*_M$ and hence approximate $P[\theta \in I^*(1-\alpha)]$ by $p^*(\alpha) = M^{-1} \sum_{m=1}^{M} cover^*_m$.
- Vary $\alpha$ in all of the steps above to find $\alpha'$ such that $p^*(\alpha') = 1 - \alpha$ (a sketch of the coverage-estimation step follows below).
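A minimal sketch in R of the coverage-estimation step for the illustrative median example, using the reversed-quantile interval; the outer search over $\alpha$ to find $\alpha'$ is omitted:
# Minimal sketch: estimate the actual coverage p*(alpha) of a bootstrap CI
# via a second level of bootstrap; the search for alpha' is omitted.
set.seed(42)
z <- rexp(50)
theta_hat <- median(z)
alpha <- 0.05
M <- 100  # number of first-level bootstrap samples
B <- 100  # number of second-level bootstrap samples per first-level sample

# reversed-quantile CI from an estimate and its bootstrap replicates
basic_ci <- function(est, est_star, a) {
  q <- quantile(est_star - est, c(a / 2, 1 - a / 2))
  c(est - q[[2]], est - q[[1]])
}

cover <- replicate(M, {
  z_star <- sample(z, replace = TRUE)                                   # first level
  theta_star <- median(z_star)
  theta_2star <- replicate(B, median(sample(z_star, replace = TRUE)))   # second level
  ci <- basic_ci(theta_star, theta_2star, alpha)
  theta_hat >= ci[1] && theta_hat <= ci[2]    # does I**(1 - alpha) cover theta_hat?
})
mean(cover)  # estimate of p*(alpha); vary alpha until this equals 1 - alpha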
6.8 Three Versions of the Bootstrap
- So far, we discussed the fully non-parametric bootstrap, which simulates from the empirical distribution.
- At the other extreme of the scale, there is the parametric bootstrap.
- The middle way is the model-based bootstrap.
6.8.1 Non-parametric Bootstrap
We draw a bootstrap sample $(Z_1^*, \dots, Z_n^*) \sim \hat{P}$, i.e. we sample from the data (the empirical distribution) with replacement.
6.8.2 Parametric Bootstrap
Here, we assume the data are realizations from a known distribution family $P_\theta$, which is determined up to some unknown parameter (vector) $\theta$. That means we sample $(Z_1^*, \dots, Z_n^*) \sim P_{\hat{\theta}}$. For example, take the following regression model $y = X\beta + \epsilon$, where we know the errors are Gaussian.
- We can estimate our regression model, obtain the residuals and compute their mean (which is zero) and their standard deviation $\hat{\sigma}$.
- To generate a bootstrap sample, we simulate residuals $\epsilon^*$ from $\mathcal{N}(0, \hat{\sigma}^2)$ and
- add them to the fitted values, i.e. we obtain $(Y_1^*, \dots, Y_n^*)$ from $X\hat{\beta} + \epsilon^*$. Hence, the final bootstrap sample we use is $(x_1, Y_1^*), \dots, (x_n, Y_n^*)$, where the $x_i$ are just the observed data.
- We can compute bootstrap estimates $\hat{\beta}^*$, the bootstrap expectation $E^*[\hat{\beta}^*]$, as well as confidence intervals for the regression coefficients or generalization errors, just as shown in detail above. The only difference is how the bootstrap sample is obtained. A minimal R sketch is given below.
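A minimal sketch in R for a Gaussian linear model (the simulated data and all names are illustrative):
# Minimal sketch: parametric bootstrap for a Gaussian linear regression model.
set.seed(42)
n <- 100
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)

fit <- lm(y ~ x)
sigma_hat <- summary(fit)$sigma     # estimated error standard deviation
y_fitted <- fitted(fit)             # X beta_hat

B <- 1000
beta_star <- replicate(B, {
  eps_star <- rnorm(n, mean = 0, sd = sigma_hat)  # simulate epsilon* from N(0, sigma_hat^2)
  y_star <- y_fitted + eps_star                   # bootstrap responses at the observed x
  coef(lm(y_star ~ x))["x"]                       # bootstrap estimate of the slope
})
mean(beta_star)                     # bootstrap expectation of the slope estimate
sd(beta_star)                       # bootstrap standard error of the slope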
Similarly, for time series data, we may assume an AR(p) model.
- Initialize $X_0^*, \dots, X_{-p+1}^*$ with 0.
- Generate random noise $\epsilon_1^*, \dots, \epsilon_{n+m}^*$ according to $P_{\hat{\theta}}$.
- Construct the time series via $X_t^* = \sum_{j=1}^{p} \hat{\theta}_j X_{t-j}^* + \epsilon_t^*$ for $t = 1, \dots, n+m$, yielding $(X_1^*, \dots, X_{n+m}^*)$.
- Throw away the first $m$ observations that were used as fade-in.
- Proceed with the $B$ bootstrap samples $(X_1^*, \dots, X_n^*)$ as outlined above for the non-parametric bootstrap to obtain coefficient estimates for $\theta$, confidence intervals, or estimates of the generalization error of the model (see the sketch below).
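A minimal sketch in R for an AR(1) model with Gaussian innovations (all names and settings are illustrative):
# Minimal sketch: parametric bootstrap for an AR(1) model with Gaussian innovations.
set.seed(42)
x <- arima.sim(list(ar = 0.6), n = 200)          # stand-in for the observed series
fit <- arima(x, order = c(1, 0, 0), include.mean = FALSE)
phi_hat <- coef(fit)["ar1"]                      # estimated AR coefficient
sigma_hat <- sqrt(fit$sigma2)                    # estimated innovation standard deviation

n <- length(x)
m <- 50                                          # fade-in length
B <- 200
phi_star <- replicate(B, {
  eps <- rnorm(n + m, sd = sigma_hat)            # simulate innovations
  x_star <- numeric(n + m)
  x_prev <- 0                                    # initialize X_0* with 0
  for (t in seq_len(n + m)) {
    x_star[t] <- phi_hat * x_prev + eps[t]       # AR(1) recursion
    x_prev <- x_star[t]
  }
  x_star <- x_star[-seq_len(m)]                  # throw away the fade-in
  coef(arima(x_star, order = c(1, 0, 0), include.mean = FALSE))["ar1"]
})
sd(phi_star)                                     # bootstrap standard error of the AR coefficient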
6.8.3 Model-Based Bootstrap
The middle way is the model-based bootstrap. As with the parametric bootstrap, we assume that we know the model, e.g. $y = m(x) + \epsilon$ (where $m(\cdot)$ might be a non-parametric curve estimator), but we do not make an assumption about the error distribution. Instead of generating $\epsilon^*$ from a known distribution with estimated parameters ($\epsilon^* \sim P_{\hat{\theta}}$, as in the parametric bootstrap), we draw them with replacement from the empirical distribution of the residuals. To sum up, these are the necessary steps:
- Estimate $\hat{m}(\cdot)$ from all data.
- Simulate $\epsilon_1^*, \dots, \epsilon_n^*$ by drawing from the (centered) residuals with replacement.
- Obtain $(Y_1^*, \dots, Y_n^*)$ from $\hat{m}(x_i) + \epsilon_i^*$. As for the parametric bootstrap, the final bootstrap sample we use is $(x_1, Y_1^*), \dots, (x_n, Y_n^*)$, where the $x_i$ are just the observed data.
- Again, you can use the bootstrap samples just as for the other two methods. A minimal R sketch follows.
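A minimal sketch in R for a linear model with non-Gaussian errors (illustrative data and names):
# Minimal sketch: model-based bootstrap, resampling centered residuals instead
# of simulating them from a parametric distribution.
set.seed(42)
n <- 100
x <- runif(n)
y <- 1 + 2 * x + (rexp(n) - 1)                 # skewed, non-Gaussian errors

fit <- lm(y ~ x)                               # estimate m_hat from all data
res <- residuals(fit) - mean(residuals(fit))   # centered residuals
y_fitted <- fitted(fit)

B <- 1000
beta_star <- replicate(B, {
  eps_star <- sample(res, n, replace = TRUE)   # draw eps* from the residuals with replacement
  y_star <- y_fitted + eps_star                # Y* = m_hat(x) + eps*
  coef(lm(y_star ~ x))["x"]                    # bootstrap estimate of the slope
})
sd(beta_star)                                  # bootstrap standard error of the slope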
6.9 Conclusion
Which version of the bootstrap should I use? The answer is the classic "it depends". If a parametric model fits the data very well, there is no need to estimate the distribution non-parametrically; moreover, if there is very little data, it might be very difficult to estimate $P$ non-parametrically, so the parametric bootstrap can be preferable. On the other hand, the non-parametric bootstrap is less sensitive to model misspecification and does not require an assumption about the underlying distribution.