2.5 Statistical Functions on Vectors
In this section, we will continue talking about functions on vectors, and focus on various statistical functions.
2.5.1 Numeric vectors
Let’s first create a numeric vector.
h <- c(3, 2, 75, 0, 100)
h # check the value of h
Next, we will divide statistical functions into several groups, and introduce them one by one.
Group A: minimum and maximum
min(h)
max(h)
range(h)
First, you can get the minimum and maximum values of a numeric vector with min() and max(). The range() function produces a length-2 vector containing the minimum value (the first element) and the maximum value (the second element).
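As a small check of the relationship described above, the following sketch compares range() with min() and max() on the same vector. The variable name r is only for illustration.

```r
h <- c(3, 2, 75, 0, 100)

r <- range(h)   # length-2 vector: minimum first, maximum second
r[1]            # 0, the same as min(h)
r[2]            # 100, the same as max(h)

# range(h) carries the same information as c(min(h), max(h))
identical(r, c(min(h), max(h)))   # TRUE
```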
which.min(h)
which.max(h)
In addition to the minimum and maximum values, it is often useful to know where they occur. Here, the fourth element in h has the minimum value 0, so which.min() returns 4. If several elements share the minimum value, which.min() returns the first location. Similarly, which.max() tells you the location of the maximum value.
g <- c(2, 2, 1, 1)
which.min(g)
which.max(g)
The third and fourth elements in g both have the minimum value, but which.min(g) returns 3 because the third element is the first location with the minimum value. Similarly, which.max(g) returns 1.
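If you want every location of the minimum or maximum rather than only the first, one possible approach is sketched below. It assumes the which() function and the == comparison, which are not covered in this section, so treat it as an illustration rather than the recommended method.

```r
g <- c(2, 2, 1, 1)

which.min(g)          # 3: only the first location of the minimum
which(g == min(g))    # 3 4: all locations of the minimum

which.max(g)          # 1: only the first location of the maximum
which(g == max(g))    # 1 2: all locations of the maximum
```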
cummin(h)
cummax(h)
In addition to the minimum of all elements, you can compute cumulative minimums with cummin(). It returns a vector of the same length as the input, where the value at each location is the minimum of all elements up to and including that location in the original vector. For example, the first element of cummin(h) is 3, since the minimum of a single element is itself; the second element is 2, since the minimum of the first two elements (3 and 2) in h is 2; and so on. Note that once we reach the minimum value of the vector, all remaining elements of the cumulative minimum equal that minimum value. The corresponding function for cumulative maximums is cummax().
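To make the definition concrete, here is a minimal sketch that rebuilds the cumulative minimum by hand. It assumes sapply() and seq_along(), which have not been introduced in this section, and the name manual_cummin is hypothetical.

```r
h <- c(3, 2, 75, 0, 100)

# For each location i, take the minimum of the first i elements of h.
manual_cummin <- sapply(seq_along(h), function(i) min(h[1:i]))

manual_cummin   # 3 2 2 0 0
cummin(h)       # 3 2 2 0 0
cummax(h)       # 3 3 75 75 100
```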
Group B: sum and product
sum(h)
cumsum(h)
Next, let’s look at the sum() function, which produces the sum of all elements in the vector. For the numeric vector h, the sum is 3+2+75+0+100, which is 180. Similar to cummin(), the cumsum() function computes cumulative sums by adding up the elements of the original vector up to each location. In cumsum(h), the first element is 3 since there is only one element to sum, the second element is 5 since the sum of the first two elements (3 and 2) in h is 5, and you can verify the remaining elements yourself.
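If you would like to check the remaining elements, the sketch below lists them and confirms that the last cumulative sum is the overall sum. The name cs is only for illustration.

```r
h <- c(3, 2, 75, 0, 100)

cs <- cumsum(h)
cs                 # 3 5 80 80 180

# The last cumulative sum adds up every element, so it equals sum(h).
cs[length(h)]      # 180
cs[length(h)] == sum(h)   # TRUE
```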
prod(h)
cumprod(h)
We also have the prod() function, which computes the product of all elements of h. Since h contains 0, the result is 0. Again, there is a cumulative version, cumprod(), which multiplies the elements of the original vector cumulatively up to each location.
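The effect of a zero on the cumulative product is easy to see in a short sketch. The second vector, k, is a hypothetical example without zeros, added only for contrast.

```r
h <- c(3, 2, 75, 0, 100)

# Once a 0 is reached, every later cumulative product is 0 as well.
cumprod(h)   # 3 6 450 0 0
prod(h)      # 0

# A hypothetical vector without zeros, for comparison.
k <- c(1, 2, 3, 4)
cumprod(k)   # 1 2 6 24
prod(k)      # 24
```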
Group C: mean and median
sort(h)
#> [1] 0 2 3 75 100
Before introducing this group, let’s first review the sort() function introduced in Section 2.4. By default, this function sorts elements from the smallest to the largest.
mean(h)
median(h)
The mean() function returns the average of all elements. The median() function returns the middle number of the sorted vector, that is, the result of sort(), where the elements are listed from the smallest to the largest. If the vector length is odd, the median is the element in the central location. In sort(h), the median is the third of the five numbers, 3, since two numbers are larger than 3 and two are smaller. If the vector length is even, the median is the average of the two middle elements after sorting.
sort(g)
median(g)
Take g for example. After sorting, you will see that 1 and 2 are in the middle. The median is then the average of these two elements, which equals 1.5.
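Both cases of the median can be reproduced by hand from the sorted vectors. The following is only a sketch of the definition, with sorted_h and sorted_g as illustrative names; median() itself handles both cases for you.

```r
h <- c(3, 2, 75, 0, 100)   # odd length (5)
g <- c(2, 2, 1, 1)         # even length (4)

# Odd length: the element in the central location of the sorted vector.
sorted_h <- sort(h)
sorted_h[(length(h) + 1) / 2]   # 3, the same as median(h)

# Even length: the average of the two middle elements of the sorted vector.
sorted_g <- sort(g)
middle <- c(length(g) / 2, length(g) / 2 + 1)   # locations 2 and 3
mean(sorted_g[middle])          # 1.5, the same as median(g)
```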
Group D: quantiles
quantile(h)
#>   0%  25%  50%  75% 100%
#>    0    2    3   75  100
quantile() produces sample quantiles of a given numeric vector. By default, it generates five numbers. The top row shows the percentiles: the 0th percentile, 25th percentile (0.25 quantile), 50th percentile (0.5 quantile), 75th percentile (0.75 quantile), and 100th percentile. The second row gives the corresponding value of each quantile. We next go over all five quantile values.
First of all, the 0th percentile and the 100th percentile are always the minimum and the maximum values, respectively. The 50th percentile (0.5 quantile) is the same as the median.
The 25th percentile (0.25 quantile), also called the first quartile, is the value such that 25 percent (a quarter) of the remaining data (the whole data without this number) is smaller than it. For vector h, the value is 2 since exactly 1 number, which is 25 percent of the remaining 4 numbers, is smaller than 2.
Similarly, the 75th percentile (0.75 quantile), also called the third quartile, is the value such that 75 percent of the remaining data is smaller than it. For vector h, the value is 75 since 3 numbers, which are 75 percent of the remaining 4 numbers, are smaller than 75.
Another important concept is the interquartile range (IQR), defined as the difference between the third quartile (75th percentile) and the first quartile (25th percentile). The interquartile range of h is 73, which is 75 - 2.
IQR(h)
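The definition of the interquartile range can be checked directly from the two quartiles. In this sketch, q and manual_iqr are illustrative names, and unname() is used only to drop the percentile label from the result.

```r
h <- c(3, 2, 75, 0, 100)

# First and third quartiles in one call.
q <- quantile(h, probs = c(0.25, 0.75))

# Third quartile minus first quartile.
manual_iqr <- unname(q[2] - q[1])
manual_iqr   # 73
IQR(h)       # 73
```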
In addition to the default five percentiles, you can use the quantile() function to get the quantile for any probability between 0 and 1. To do this, specify the second argument, probs. Let’s try to find the 95th percentile (0.95 quantile).
quantile(h, probs = 0.95)
As before, this asks you to compute the 95th percentile, meaning 95 percent of the remaining data is smaller than this value. Because you only have 5 values in this vector, it may not be very intuitive. However, if you have more elements in a vector, say 1001, you can count the number of the remaining data that is smaller than this value, which should be 950 (the number of remaining data is 1000, and 95 percent of 1000 is 950).
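With only five values, no element sits exactly at the 95th percentile, so quantile() interpolates between neighboring sorted values. The sketch below reproduces the result by hand, assuming the default method of quantile() (type = 7), which places probability p at position 1 + (n - 1) * p in the sorted vector. The names pos, lower, frac, and manual_q are illustrative, and the sketch only covers probabilities below 1.

```r
h <- c(3, 2, 75, 0, 100)
p <- 0.95

sorted_h <- sort(h)           # 0 2 3 75 100
n <- length(h)

pos   <- 1 + (n - 1) * p      # 4.8: between the 4th and 5th sorted values
lower <- floor(pos)           # 4
frac  <- pos - lower          # 0.8

# Move 80 percent of the way from the 4th value (75) to the 5th value (100).
manual_q <- sorted_h[lower] + frac * (sorted_h[lower + 1] - sorted_h[lower])
manual_q                      # 95

quantile(h, probs = 0.95)     # also 95, labeled 95%
```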
In addition, the second argument can be a vector of probabilities, which will produce a numeric vector of the corresponding quantiles.
quantile(h, probs = c(0.1, 0.2, 0.99))
Group E: summary statistics
summary(h)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>       0       2       3      36      75     100
Compared with quantile(), summary() gives a more comprehensive overview of a numeric vector. From summary(), you get the five percentiles and the mean.
(Min.: 0th percentile, 1st Qu.: 25th percentile, Median: 50th percentile, 3rd Qu.: 75th percentile, Max.: 100th percentile)
Group F: variance and standard deviation
var(h)
sd(h)
The last group of functions consists of var() and sd(), which compute the sample variance and sample standard deviation of a numeric vector, respectively. The formula for the sample variance of vector \(h\) is \[\mathrm{var}(h) = \frac{1}{n-1}\sum_{i=1}^n (h_i-\bar h)^2,\] where \(n\) is the length of \(h\) and \(\bar h\) is the average of all elements. By definition, the sample standard deviation is the square root of the sample variance, which you can verify with sqrt(var(h)).
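The formula can be translated into R almost term by term. This is a sketch for checking the definition, with n, h_bar, and manual_var as illustrative names; in practice you would simply call var() and sd().

```r
h <- c(3, 2, 75, 0, 100)

n     <- length(h)   # 5
h_bar <- mean(h)     # 36

# Sum of squared deviations from the mean, divided by n - 1.
manual_var <- sum((h - h_bar)^2) / (n - 1)
manual_var                           # 2289.5

all.equal(manual_var, var(h))        # TRUE
all.equal(sqrt(manual_var), sd(h))   # TRUE
```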
For your convenience, the following table summarizes all the functions introduced in this section.
Operation | Explanation |
---|---|
min(h) | the minimum value |
max(h) | the maximum value |
range(h) | both the minimum value and the maximum value |
which.min(h) | the (first) location of the minimum value |
which.max(h) | the (first) location of the maximum value |
cummin(h) | the cumulative minimum values |
cummax(h) | the cumulative maximum values |
sum(h) | the sum of all elements |
cumsum(h) | the cumulative sum |
prod(h) | the product of all elements |
cumprod(h) | the cumulative products |
mean(h) | the average of all elements |
median(h) | the middle number in sort(h) |
quantile(h) | the 0 percentile, 25-th percentile, 50-th percentile, 75-th percentile, and 100-th percentile |
IQR(h) | the difference between the 3rd quartile and the 1st quartile |
quantile(h, probs = 0.95) | the 95-th percentile |
quantile(h, probs = c(0.1, 0.2, 0.99)) | several quantiles at a time |
summary(h) | 5 percentiles and the mean |
var(h) | the sample variance |
sd(h) | the sample standard deviation |
2.5.2 Character vectors
Compared with numeric vectors, there are far fewer things you can do with character vectors. You can, however, still apply summary().
animals <- rep(c("sheep", "pig", "monkey"), 3:1)
summary(animals)
From the result, you can see that the summary() function only tells you the vector length (6 elements) and the vector type (character vector), which is much less useful than in the numeric case.
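If you need more than the length and type of a character vector, counting how often each value appears is one option. The sketch below assumes the unique() and table() functions, which are not discussed in this section, so take it as a hint rather than part of the material above.

```r
animals <- rep(c("sheep", "pig", "monkey"), 3:1)

length(animals)   # 6
unique(animals)   # "sheep" "pig" "monkey", in order of first appearance

# Frequency of each distinct value, listed alphabetically:
# monkey 1, pig 2, sheep 3
counts <- table(animals)
counts
```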
2.5.3 Logical vectors
What if we apply summary() to logical vectors?
logic <- rep(c(TRUE, FALSE, TRUE), 3:1)
summary(logic)
Similar to character vectors, you get the vector type, which is logical here. You also get a frequency table showing the number of times FALSE and TRUE appear in the vector.
Unlike character vectors, you can apply almost all the functions summarized in the table in Section 2.5.1 to logical vectors. The coercion rule introduced in Section 2.2.3 converts all logical values into numeric values: TRUE is converted to 1 and FALSE is converted to 0. Let’s take a look at an example.
a <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
sum(a)
mean(a)
Clearly, sum(a) equals 3 since there are 3 TRUE values, and mean(a) equals 0.6 since it is the average of three 1s and two 0s. You are welcome to try other functions on a logical vector.
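As a starting point for trying other functions, the sketch below shows the coercion explicitly with as.numeric() and then applies a few functions from Section 2.5.1 to the same logical vector.

```r
a <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

# The coercion made explicit: TRUE becomes 1 and FALSE becomes 0.
as.numeric(a)   # 1 1 0 0 1

sum(a)          # 3: the number of TRUE values
mean(a)         # 0.6: the proportion of TRUE values
cumsum(a)       # 1 2 2 2 3: running count of TRUE values
which.max(a)    # 1: location of the first TRUE
which.min(a)    # 3: location of the first FALSE
```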
2.5.4 Exercises
Suppose x <- c(5, 2, 4, 1, 2, 1) and y <- c(T, F, F, F, F, T, T, F, F, T).
1. Write R code to reproduce each element of the summary vector summary(x).
2. Write R code to generate the cumulative sum, cumulative product, cumulative minimum, and cumulative maximum of x.
3. Write R code to generate a vector consisting of the 0.1, 0.2, 0.6, 0.8, 0.9 quantiles of x.
4. Write R code to calculate the sample variance and sample standard deviation of x.
5. Write R code to generate a length-2 vector consisting of the sum and mean of y. Then show the unique elements of y and their corresponding frequencies.