10.3 Random Permutation and Random Sampling
Now that you have learned to work with distributions in R using the four functions available for each distribution (Sections 10.1 and 10.2), this section explores how to perform random permutations and random sampling in R. These techniques are widely used in statistics, machine learning, and data analysis for tasks such as model validation and resampling.
10.3.1 Random Permutation
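A random permutation rearranges the elements of a vector in a random order. This is often needed in machine learning, for example to shuffle the data before splitting it into training and validation sets, and in statistics for permutation tests. In R, calling sample() on a vector with no other arguments returns such a permutation.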
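As a minimal sketch, assume x is the five-element vector 1:5. The text does not show its definition; it only implies later that x has 5 elements. The seed value 1 is an arbitrary choice made here so that the result can be reproduced.

```r
# Assumption: x is defined as 1:5 (the section only implies that x has 5 elements).
x <- 1:5

# Fix the RNG seed (arbitrary value) so the permutation can be reproduced.
set.seed(1)

# With no other arguments, sample() returns all elements of x in random order.
perm <- sample(x)
perm

# A permutation contains exactly the same elements as x, only reordered.
identical(sort(perm), x)
```

Running the block again with the same seed should give the same ordering, while a different seed will generally give a different one.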
10.3.2 Random Sampling Without Replacement
Sampling without replacement selects elements from a vector so that each element can be chosen at most once. This is useful when you need a random subset of the elements of the vector.
Let’s randomly sample 3 elements from x with sample(x, size = 3).
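The call can be sketched as follows, again assuming x is 1:5 and using an arbitrary seed. The replace argument is written out explicitly for clarity, although FALSE is already its default.

```r
# Assumption: x is 1:5, as implied by the rest of the section.
x <- 1:5

# Arbitrary seed for reproducibility.
set.seed(1)

# Draw 3 distinct elements of x.
s <- sample(x, size = 3, replace = FALSE)
s

# anyDuplicated() returns 0 when there are no repeated values.
anyDuplicated(s) == 0
```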
Here, the size argument specifies the number of elements to sample. If size is greater than the length of the vector, an error occurs:
sample(x, size = 6, replace = FALSE)
#> Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'
Instead of a vector, the first argument of sample can also be a single positive integer (e.g., 10), which is equivalent to using x = 1:10.
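The equivalence can be checked with a short sketch: resetting the same (arbitrary) seed before each call should make the two forms return identical results.

```r
# Sample 3 values by passing the integer 10 directly.
set.seed(1)
a <- sample(10, size = 3)

# Sample 3 values from the explicit vector 1:10 with the same seed.
set.seed(1)
b <- sample(1:10, size = 3)

# Both calls draw from 1:10, so the results match.
identical(a, b)
```

One consequence worth keeping in mind: if a numeric vector happens to have length one, sample() treats it as this integer shortcut rather than as a one-element population.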
10.3.3 Random Sampling with Replacement
Sometimes, you may want to draw a sample with replacement. You still use the sample() function, but with the argument replace = TRUE. For example, sample(x, size = 10, replace = TRUE) draws 10 elements with replacement from x.
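A minimal sketch of this call, assuming as before that x is 1:5 and using an arbitrary seed:

```r
# Assumption: x is 1:5, as implied by the rest of the section.
x <- 1:5

# Arbitrary seed for reproducibility.
set.seed(1)

# Draw 10 elements; each draw can pick any element of x again.
s <- sample(x, size = 10, replace = TRUE)
s

# With 10 draws from only 5 distinct values, repeats are unavoidable.
anyDuplicated(s) > 0
```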
As expected, you will see some duplicated elements in the output vector.
A very important application of random sampling with replacement is the bootstrap. A bootstrap sample is drawn with replacement from the original data and has the same size as the original data. So, to get a bootstrap sample from x, you sample 5 elements with replacement from x.
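A bootstrap sample could be drawn as in the sketch below, which assumes x is 1:5 and uses an arbitrary seed. The size argument is deliberately left out.

```r
# Assumption: x is 1:5, as implied by the rest of the section.
x <- 1:5

# Arbitrary seed for reproducibility.
set.seed(1)

# One bootstrap sample: same length as x, drawn with replacement.
boot <- sample(x, replace = TRUE)
boot

# The bootstrap sample has as many elements as the original vector.
length(boot) == length(x)
```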
Note that, when the argument size is not provided, it takes its default value: the length of x.
10.3.4 Random Sampling with Unequal Probabilities
By default, the sample() function draws each element with the same probability. In some cases, you may want to assign different probabilities to different elements.
To draw elements with different probabilities, the first method is to use the random number generator (RNG) for the binomial distribution, of which the Bernoulli distribution is the special case with size = 1. Let’s say we want to randomly sample 100 elements from a Bernoulli distribution with success probability \(p=0.2\).
rbinom(100, size = 1, prob = 0.2)
#> [1] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0
#> [38] 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0
#> [75] 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
In addition to using the rbinom function introduced in Section 10.2, you can use the sample function with the prob argument to achieve the same goal.
sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.8, 0.2))
#> [1] 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1
#> [38] 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
#> [75] 1 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
Here, you sample 100 elements with replacement from c(0, 1), where the probability of drawing 0 is 0.8 and the probability of drawing 1 is 0.2.
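As a rough check of these probabilities, the sketch below draws a much larger sample and looks at the observed proportions. The sample size of 100000 and the seed are arbitrary choices made here, not values from the text.

```r
# Arbitrary seed and sample size, chosen only for this illustration.
set.seed(1)
n <- 100000

# Draw 0 with probability 0.8 and 1 with probability 0.2.
draws <- sample(c(0, 1), size = n, replace = TRUE, prob = c(0.8, 0.2))

# The proportion of 1s should be close to 0.2 for a sample this large.
mean(draws)

# Observed proportion of each value.
prop.table(table(draws))
```

The observed proportions will not match 0.8 and 0.2 exactly, but they should get closer as n grows.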
10.3.5 Exercise
- Randomly permute the vector x <- 1:20. Set the seed to 66 and verify that the permutation is reproducible.
- Create a random sample of 5 elements from the vector letters without replacement.
- Simulate 1000 random samples with replacement from the vector 1:10. Compute the frequency of each number in the sample.
- Split the vector 1:100 into two random subsets of size 70 and 30 without replacement. Calculate the mean of each subset.