2.8 Numeric vector: Summary Statistics

We’ve spent several sections learning the features of numeric vectors. In this section, we will introduce statistical functions on numeric vectors.

Let’s first create a numeric vector.

h <- c(3, 2, 75, 0, 100)
h #check the value of h
#> [1]   3   2  75   0 100

Next, we will divide statistical functions into several groups, and introduce them one by one.

2.8.1 Minimum and Maximum

min(h) 
#> [1] 0
max(h) 
#> [1] 100
range(h)
#> [1]   0 100

First, you can get the minimum and maximum values of a numeric vector by calling the min() and max() functions, respectively. To get these two values at one time, you can call the range() function, which returns a length-2 vector with the minimum value(the first element) and maximum value(the second element).

which.min(h) 
#> [1] 4
which.max(h) 
#> [1] 5

In addition to getting the minimum and the maximum values, it is often useful to get the corresponding locations of them. In h, the minimum value 0 is located in the fourth place, and this is why you will get a result of 4 from which.min(). If multiple elements are assigned with the same minimum value, which.min() will return the first location. Similarly, which.max() tells you the location of the maximum value.

g <- c(2, 2, 1, 1)
which.min(g)
#> [1] 3
which.max(g)
#> [1] 1

The third and fourth elements in g both have the minimum value, but which.min(g) returns a result of 3 since the minimum value firstly appears in the third element. Similarly, which.max() gives you a result of 1.

cummin(h) 
#> [1] 3 2 2 0 0
cummax(h) 
#> [1]   3   3  75  75 100

In addition to calculating the minimum value of all elements, you can also use the cumulative minimum function, called cummin(). It returns a vector of the same length as the input vector, with the value at each location being the minimum of all preceding elements until that location in the original vector. For example, the first element of cummin(h) is 3 because the minimum of the first element in the original vector is always itself. The second element of cummin(h) is 2, since the minimum of the first two elements (3 and 2) in his 2, and so on.

Note that once we reach the minimum value of the original vector, the remaining elements of the cumulative minimum function will always equal to the minimum value. There is also a corresponding function for computing the cumulative maximums, called cummax().

2.8.2 Sum and Product

sum(h)
#> [1] 180
cumsum(h)
#> [1]   3   5  80  80 180

Next, let’s look at the sum() function, which produces the sum of all elements in the vector. For the numeric vector h, the sum is 3+2+75+0+100, which is 180. Similar to cummin(), you can use the cumsum() function to compute the cumulative sums, which works by summing up the elements of the original vector cumulatively up to each location. In cumsum(h), the first element is 3 since there is only one element (3) to do summation, and the second element is 5 since the summation of the first two elements (3 and 2) in h is 5. You can easily verify the values of the remaining elements by yourself.

prod(h)
#> [1] 0
cumprod(h)
#> [1]   3   6 450   0   0

As long as we can calculate the sum and cumulative sum of a numeric vector, we can also compute the product of a numeric vector’s elements by the prod() function. In h, since there is 0 in h, the result is 0. Similarly, we have the cumulative product function cumprod() working by multiplying the elements of the original vector cumulatively up to each location.

h
#> [1]   3   2  75   0 100
diff(h)
#> [1]  -1  73 -75 100

It is worth noting that the diff() function doesn’t work the same way as the sum() function. Specifically, diff() returns the difference between consecutive elements in a numeric vector. In h, the difference between the first and the second elements (3 and 2) is -1, whereas that between the second and the third elements (2 and 75) is 73. In other words, the computations in done by subtracting the former element from the latter one in a consecutive pair.

2.8.3 Mean and median

mean(h)
#> [1] 36
median(h)
#> [1] 3

The arithmetic mean of a numeric vector’s elements can be easily computed via the mean() function. To get the median of a numeric vector’s elements, we would need to place them in value order and find the middle. The sort() function introduced previously can list elements in order from the smallest to the largest. Comparatively, the median() function first places the elements in value order and then returns the middle number.

If the vector length is odd, the middle number is the value of the element in the central location. In sort(h), we can see that the median corresponds to the third number out of five numbers since there are two numbers larger than 3 and two numbers smaller than 3. If the vector length is even, the middle number is the average of the two middle elements after sorting.

sort(g)
#> [1] 1 1 2 2
median(g)
#> [1] 1.5

Recall g with values of 2, 2, 1, 1. After sorting, you will see that 1 and 2 are in the middle. The median is then defined as the average of these two elements, namely 1.5.

2.8.4 Quantiles

quantile(h)
#>   0%  25%  50%  75% 100% 
#>    0    2    3   75  100

quantile() produces sample quantiles of a given numeric vector. By default, it generates 2 rows and 5 numbers. The top row represents the different percentiles, which are always the 0 percentile, 25th percentile (0.25 quantile), 50th percentile (0.5 quantile), 75th percentile (0.75 quantile), and 100th percentile. The second row consists of the corresponding values of each quantile. We next go over all five quantile values.

First of all, 0 percentile and 100th percentile are always the minimum and the maximum values, respectively. The 50th percentile (0.5 quantile) is the same as the median.

The 25-th percentile (0.25 quantile), also called the first quartile, is the value such that there are 25 percent (or a quarter) of the remaining data (all elements without this number) smaller than it. For vector h, the value is 2 since there is exactly 1 number, which is 25 percent of the remaining 4 numbers, smaller than 2.

Similarly, the 75-th percentile, also called the third quartile, is the value such that 75 percent of the remaining data is smaller than this number. For vector h, the value is 75 since there are 3 numbers, which are 75 percent of the remaining 4 numbers, smaller than 75.

You also have an important concept called interquartile range (IQR), defined as the difference between the 3rd quartile (75-th percentile) and the 1st quartile (25-th percentile). The interquartile range of h is 73, which is 75 - 2.

IQR(h)
#> [1] 73

In addition to the default five percentiles, the quantile() function can also compute a specific quantile between 0 and 1. To do this, you just need to specify the second argument probs. Let’s try to find the 95th quantile.

quantile(h, probs = 0.95)
#> 95% 
#>  95

As before, this asks R to compute the 95th percentile, meaning 95 percent of the remaining data is smaller than this value. Because you only have 5 values in this vector, it may not be very intuitive. However, if you have more elements in a vector, say 1001, you can count the number of the remaining data that is smaller than this value, which should be 950 (the number of remaining data is 1000, and 95 percent of 1000 is 950).

In addition, the second argument can be a vector of probabilities, which will produce a numeric vector of the corresponding quantiles.

quantile(h, probs = c(0.1, 0.2, 0.99))
#>  10%  20%  99% 
#>  0.8  1.6 99.0

2.8.5 Summary statistics

summary(h)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       0       2       3      36      75     100

Compared with quantile(), the summary function produces a more comprehensive picture of a numeric vector. Applying summary(), you can get the 5 percentiles and the arithmetic mean at one time.
(Min: 0 percentile, 1st Qu: 25-th percentile, Median: 50-th percentile, 3rd Qu: 75-th percentile, Max: 100-th percentile)

Group F: variance and standard deviation

var(h)
#> [1] 2289.5
sd(h)
#> [1] 47.84872

The last group of functions are var() and sd(), which compute the sample variance and sample standard deviation of a numeric vector, respectively. The formula of sample variance of vector \(h\) is \[var(h) = \frac{1}{n-1}\sum_{i=1}^n (h_i-\bar h)^2,\] where \(n\) is the length of \(h\) and \(\bar h\) is the average of all elements. By definition, the sample standard deviation is the square root of sample variance, which you can verify by sqrt(var(h)).

For your convenience, we would like to provide a summary of all the functions introduced in the following table.

Operation Explanation Length
min(h) the minimum value 1
max(h) the maximum value 1
range(h) both the minimum value and the maximum value 2
which.min(h) the (first) location of the minimum value 1
which.max(h) the (first) location of the maximum value 1
cummin(h) the cumulative minimum values same as the original numeric vector
cummax(h) the cumulative maximum values same as the original numeric vector
sum(h) the sum of all elements 1
cumsum(h) the cumulative sum same as the original numeric vector
prod(h) the product of all elements 1
cumprod(h) the cumulative products same as the original numeric vector
mean(h) the average of all elements 1
median(h) the middle number in sort(h) 1
quantile(h) the 0 percentile, 25-th percentile, 50-th percentile, 75-th percentile, and 100-th percentile 5
IQR(h) the difference between the 3rd quartile and the 1st quartile 1
quantile(h, probs = 0.95) the 95-th percentile 1
quantile(h, probs = c(0.1, 0.2, 0.99)) several quantiles at a time same as the argument numeric vector
summary(h) 5 percentiles and the mean 6
var(h) the sample variance 1
sd(h) the sample standard deviation 1

2.8.6 Exercises

Suppose x <- c(5, 2, 4, 1, 2, 1)

  1. Write the R code to reproduce each element of the summary vector summary(x)

  2. Write the R code to generate the cumulative sum, cumulative product, cumulative minimum, and cumulative maximum of x.

  3. Write R code to generate a vector consisting of the 0.1, 0.2, 0.6, 0.8, 0.9 quantiles of x.

  4. Write R code to calculate the sample variance and sample standard deviation of x.