The bootstrap is a statistical technique that can be used to estimate the variability of a statistic. It works by resampling the original dataset with replacement to create pseudo-datasets that are similar to, but slightly perturbed from, the original dataset. The bootstrap can be used to estimate the standard error of a statistic, or to construct confidence intervals for the statistic.
In this example, we are using the bootstrap to estimate the confidence interval for the median of the sulfate particulate matter data. We first create a function that calculates the median of a bootstrap sample. This function takes the original dataset as input and returns the median of the bootstrap sample. We then use the `replicate()` function to generate 5000 bootstrap samples. For each bootstrap sample, we call the `median_boot()` function to calculate the median. We then use the `quantile()` function to calculate the 2.5th and 97.5th percentiles of the distribution of bootstrapped medians. These percentiles give us the 95% confidence interval for the median of the sulfate particulate matter data.
library(boot)
# Function to calculate the median of a bootstrap sample
median_boot <- function(data) {
return(median(sample(data, replace = TRUE)))
}
# Generate 5000 bootstrap samples
set.seed(1)
med.boot <- replicate(5000, median_boot(sulf))
# Calculate the 95% confidence interval for the median
quantile(med.boot, c(0.025, 0.975))
This code will produce the following output:
2.5% 97.5%
2.70 3.47
This means that we are 95% confident that the true median of the sulfate particulate matter data is between 2.70 and 3.47 micrograms per cubic meter.
One thing to keep in mind when using the bootstrap is that it can be computationally expensive, especially if you are bootstrapping a large dataset. In this case, we are bootstrapping a dataset of 500 observations, which is not too large. However, if you were bootstrapping a dataset of 10,000 observations, it would take much longer to run the bootstrap procedure.
Another thing to keep in mind is that the bootstrap is a nonparametric technique. This means that it does not make any assumptions about the distribution of the data. This can be an advantage, as it means that the bootstrap can be used with any type of data. However, it also means that the bootstrap confidence intervals may not be as accurate as confidence intervals that are calculated using parametric methods.
Generating Random Numbers
In R, there are a number of functions that can be used to generate random numbers. Some of the most common functions include:
- `runif()`: This function generates random numbers from a uniform distribution.
- `rnorm()`: This function generates random numbers from a normal distribution.
- `rbinom()`: This function generates random numbers from a binomial distribution.
- `rpois()`: This function generates random numbers from a Poisson distribution.
The `runif()` function takes two arguments: the minimum and maximum values of the range of random numbers to be generated. For example, the following code would generate 10 random numbers from a uniform distribution between 0 and 1:
> runif(10)
[1] 0.3291961 0.8373352 0.5078716 0.1460831 0.9869998
[6] 0.6887480 0.2038407 0.4272420 0.0463083 0.7802943
The `rnorm()` function takes three arguments: the mean, standard deviation, and number of random numbers to be generated. For example, the following code would generate 10 random numbers from a normal distribution with a mean of 0 and a standard deviation of 1:
> rnorm(10, mean=0, sd=1)
[1] -0.1027722 0.2183059 -1.0823986 0.6231446 0.5706418
[6] -1.2433315 0.3375700 -0.2904638 0.0175056 1.0054117
The `rbinom()` function takes four arguments: the number of trials, the probability of success, and the number of random numbers to be generated. For example, the following code would generate 10 random numbers from a binomial distribution with 10 trials and a probability of success of 0.5:
> rbinom(10, 10, 0.5)
[1] 6 6 4 6 5 5 7 5 6 5
The `rpois()` function takes two arguments: the mean and number of random numbers to be generated. For example, the following code would generate 10 random numbers from a Poisson distribution with a mean of 5:
> rpois(10, 5)
[1] 3 5 4 5 5 7 4 5 6 4
When running R code in a parallel environment, it is important to use a random number generator that is reproducible. This means that the same random numbers will be generated each time the code is run, regardless of the number of cores or processors that are used.
The `parallel` package provides a way to do this using the `RNGkind()` function. The `RNGkind()` function takes one argument: the name of the random number generator to use. The default random number generator is not reproducible, so you will need to use the `L'Ecuyer-CMRG` random number generator.
For example, the following code would set the random number generator to `L'Ecuyer-CMRG` and then generate 10 random numbers:
> RNGkind("L'Ecuyer-CMRG")
> rnorm(10)
[1] -0.2858369 0.2496919 0.8446018 -1.3092758 -0.1796944
[6] 1.5723562 -0.6021696 0.6765183 0.6964587 0.1634475
The `set.seed()` function can also be used to set the random seed. This is useful if you want to reproduce the results of a particular run of your code.
For example, the following code would set the random seed to 1 and then generate 10 random numbers: