## 2.5 Statistical Functions on Vectors

In this section, we continue our discussion of functions on vectors, focusing on statistical functions.

### 2.5.1 Numeric vectors

Let’s first create a numeric vector.

```
h <- c(3, 2, 75, 0, 100)
h # check the value of h
```

Next, we will divide statistical functions into several groups, and introduce them one by one.

*Group A: minimum and maximum*

```
min(h)
max(h)
range(h)
```

First, `min()` and `max()` return the **minimum** and **maximum** values of a numeric vector, and `range()` produces a length-2 vector containing both the minimum value (the first element) and the maximum value (the second element).

```
which.min(h)
which.max(h)
```

In addition to the minimum and maximum values themselves, it is often useful to know where they are located. Here, the fourth element of `h` has the minimum value 0, so `which.min()` returns `4`. If several elements share the minimum value, `which.min()` returns the first such location. Similarly, `which.max()` tells you the location of the maximum value.

```
g <- c(2, 2, 1, 1)
which.min(g)
which.max(g)
```

The third and fourth elements of `g` both have the minimum value, but `which.min(g)` returns 3 because the third element is the first location with the minimum value. Similarly, `which.max(g)` returns `1`.
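If you need every location of the minimum rather than only the first one, one option is to compare the vector against its minimum. The sketch below relies on the base R function `which()`, which this section does not introduce, so treat it as a side note; the names `first_min`, `all_min`, and `all_max` are chosen here purely for illustration.

```r
g <- c(2, 2, 1, 1)

# which.min() reports only the first location of the minimum
first_min <- which.min(g)

# which() returns every index where the comparison is TRUE
all_min <- which(g == min(g))
all_max <- which(g == max(g))

first_min  # 3
all_min    # 3 4
all_max    # 1 2
```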

```
cummin(h)
cummax(h)
```

In addition to the minimum of all elements, you can compute the **cumulative minimum** with `cummin()`. It returns a vector of the same length as the input, where the value at each location is the minimum of all elements of the original vector up to and including that location. For example, the first element of `cummin(h)` is `3`, since the minimum of a single element is the element itself; the second element is `2`, since the minimum of the first two elements of `h` (`3` and `2`) is 2; and so on. Note that once we reach the minimum value of the whole vector, all remaining elements of the cumulative minimum equal that minimum value. The corresponding function for cumulative maximums is `cummax()`.
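To make the definition concrete, the cumulative minimum and maximum can be rebuilt by hand, taking the minimum (or maximum) of the first `i` elements for every location `i`. This is only a sketch to check the definition: it uses `sapply()` with an anonymous function, which may not have been covered yet, and the names `manual_cummin` and `manual_cummax` are made up for this example.

```r
h <- c(3, 2, 75, 0, 100)

# For each location i, take the minimum (or maximum) of h[1], ..., h[i]
manual_cummin <- sapply(seq_along(h), function(i) min(h[1:i]))
manual_cummax <- sapply(seq_along(h), function(i) max(h[1:i]))

manual_cummin                     # 3 2 2 0 0
manual_cummax                     # 3 3 75 75 100
all(manual_cummin == cummin(h))   # TRUE
all(manual_cummax == cummax(h))   # TRUE
```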

*Group B: sum and product*

```
sum(h)
cumsum(h)
```

Next, let’s look at the `sum()` function, which produces the **sum** of all elements in the vector. For the numeric vector `h`, the sum is `3+2+75+0+100`, which is `180`. Similar to `cummin()`, the `cumsum()` function computes the *cumulative sums* by adding up the elements of the original vector up to each location. In `cumsum(h)`, the first element is `3` since there is only one element to sum, the second element is `5` since the sum of the first two elements of `h` (`3` and `2`) is 5, and you can easily verify the remaining elements yourself.

```
prod(h)
cumprod(h)
```

We also have the `prod()` function, which computes the **product** of all elements of `h`. Since `h` contains a `0`, the result is `0`. Again, there is a *cumulative product* function, `cumprod()`, which multiplies the elements of the original vector cumulatively up to each location.
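A quick way to check the cumulative functions is to compare their last element with the overall result. The sketch below does this for `h`; the names `cs` and `cp` are chosen only for this example.

```r
h <- c(3, 2, 75, 0, 100)

cs <- cumsum(h)   # 3 5 80 80 180
cp <- cumprod(h)  # 3 6 450 0 0

# The last element of each cumulative vector is the overall result
cs[length(h)] == sum(h)   # TRUE
cp[length(h)] == prod(h)  # TRUE
```

Notice that every cumulative product from the fourth location onward is 0, because the fourth element of `h` is 0.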

*Group C: mean and median*

```
sort(h)
#> [1] 0 2 3 75 100
```

Before introducing this group, let’s first review the `sort()` function introduced in Section 2.4. By default, it sorts the elements from smallest to largest.

```
mean(h)
median(h)
```

The `mean()` function returns the average of all elements. The `median()` function returns the middle number of the vector produced by `sort()`, where the elements are listed from smallest to largest. If the vector length is odd, the median is the value of the element in the central location. In `sort(h)`, the median is the third of the five numbers, 3, since two numbers are smaller than 3 and two are larger. If the vector length is even, the median is the average of the two middle elements after sorting.

```
sort(g)
median(g)
```

Take `g` as an example. After sorting, you will see that `1` and `2` are in the middle. The median is then the average of these two elements, which equals 1.5.
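The odd and even cases can be written out as a small function. The following is a minimal sketch of the rule described above, not how `median()` is implemented internally; `manual_median()` is a hypothetical helper, and it uses a user-defined function and an `if` statement, which may not have been covered yet.

```r
# manual_median() is a hypothetical helper written for illustration only
manual_median <- function(v) {
  s <- sort(v)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                 # odd length: the central element
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2  # even length: average of the two middle elements
  }
}

h <- c(3, 2, 75, 0, 100)
g <- c(2, 2, 1, 1)

manual_median(h)  # 3
manual_median(g)  # 1.5
```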

*Group D: quantiles*

```
quantile(h)
#> 0% 25% 50% 75% 100%
#> 0 2 3 75 100
```

`quantile()` produces **sample quantiles** of a given numeric vector. By default, it generates 5 numbers. In the output, the top row shows the percentiles: the 0th percentile, 25th percentile (0.25 quantile), 50th percentile (0.5 quantile), 75th percentile (0.75 quantile), and 100th percentile. The second row shows the corresponding value of each quantile. We next go over all five quantile values.

First of all, the 0th percentile and the 100th percentile are always the minimum and the maximum values, respectively. The 50th percentile (0.5 quantile) is the same as the median.

The 25th percentile (0.25 quantile), also called the **first quartile**, is the value such that 25 percent (a quarter) of the remaining data (the whole data without this number) is smaller than it. For vector `h`, the value is 2 since exactly 1 number, which is 25 percent of the remaining 4 numbers, is smaller than 2.

Similarly, the 75th percentile, also called the **third quartile**, is the value such that 75 percent of the remaining data is smaller than it. For vector `h`, the value is 75 since 3 numbers, which are 75 percent of the remaining 4 numbers, are smaller than 75.

Another important concept is the **interquartile range** (IQR), defined as the difference between the third quartile (75th percentile) and the first quartile (25th percentile). The interquartile range of `h` is 73, which is `75 - 2`.

`IQR(h)`
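The relationship between `IQR()` and `quantile()` can be checked directly. The sketch below extracts the two quartiles by name from the output of `quantile(h)`; the variable names `q`, `q1`, and `q3` are chosen only for this example.

```r
h <- c(3, 2, 75, 0, 100)

q <- quantile(h)   # named vector with elements "0%", "25%", "50%", "75%", "100%"
q1 <- q[["25%"]]   # first quartile
q3 <- q[["75%"]]   # third quartile

q3 - q1            # 73
IQR(h)             # 73
```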

In addition to the default five percentiles, you can use the `quantile()` function to get any quantile between 0 and 1. To do this, specify the second argument, `probs`. Let’s find the 95th percentile (0.95 quantile).

`quantile(h, probs = 0.95)`

As before, this computes the 95th percentile, meaning that 95 percent of the remaining data is smaller than this value. With only 5 values in the vector, this may not be very intuitive. However, if a vector has more elements, say 1001 distinct values, you can count how many of the remaining values are smaller than the 95th percentile: the answer is 950, since 1000 values remain and 95 percent of 1000 is 950.
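The counting argument can be tried on a concrete vector. The sketch below assumes, purely for illustration, that the 1001 values are the integers from 0 to 1000; it also counts with `sum()` on a logical vector, which works because each `TRUE` is treated as 1 (see Section 2.5.3).

```r
x <- 0:1000                        # 1001 distinct values (an arbitrary choice)

q95 <- quantile(x, probs = 0.95)   # 95th percentile
q95                                # 950

# Number of values strictly smaller than the 95th percentile
n_smaller <- sum(x < q95)
n_smaller                          # 950
```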

In addition, the second argument can be a vector of probabilities, which will produce a numeric vector of the corresponding quantiles.

`quantile(h, probs = c(0.1, 0.2, 0.99))`
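Quantiles such as the 0.1 quantile of `h` are usually not elements of `h` itself, because the default method of `quantile()` (documented as `type = 7`) interpolates linearly between neighboring sorted values. The sketch below mimics that rule for a single probability; `quantile_type7()` is a hypothetical helper written for this example, not a function provided by R.

```r
# quantile_type7() is a hypothetical helper written for illustration only;
# it mimics the default (type 7) rule for a single probability p
quantile_type7 <- function(v, p) {
  s <- sort(v)
  n <- length(s)
  index <- 1 + (n - 1) * p     # fractional position in the sorted vector
  lo <- floor(index)           # neighbor on the left
  hi <- ceiling(index)         # neighbor on the right
  w <- index - lo              # weight given to the right neighbor
  (1 - w) * s[lo] + w * s[hi]  # linear interpolation
}

h <- c(3, 2, 75, 0, 100)

quantile_type7(h, 0.1)                   # 0.8, between the sorted values 0 and 2
quantile(h, probs = 0.1, names = FALSE)  # same value from the built-in function
```

For `p = 0.1`, the fractional position is `1 + 4 * 0.1 = 1.4`, so the result lies 40 percent of the way from the first sorted value (0) to the second (2).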

*Group E: summary statistics*

```
summary(h)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0 2 3 36 75 100
```

Compared with `quantile()`, `summary()` gives a more comprehensive overview of a numeric vector: it returns the five percentiles together with the mean.

(Min.: 0th percentile, 1st Qu.: 25th percentile, Median: 50th percentile, 3rd Qu.: 75th percentile, Max.: 100th percentile)

*Group F: variance and standard deviation*

```
var(h)
sd(h)
```

The last group consists of `var()` and `sd()`, which compute the **sample variance** and **sample standard deviation** of a numeric vector, respectively. The formula for the sample variance of a vector \(h\) is \[\mathrm{var}(h) = \frac{1}{n-1}\sum_{i=1}^n (h_i-\bar h)^2,\] where \(n\) is the length of \(h\) and \(\bar h\) is the average of all its elements. By definition, the sample standard deviation is the square root of the sample variance, which you can verify with `sqrt(var(h))`.
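The formula can be evaluated step by step and compared with the built-in functions. This is a minimal sketch; the names `n`, `h_bar`, `manual_var`, and `manual_sd` are chosen only for this example.

```r
h <- c(3, 2, 75, 0, 100)

n <- length(h)
h_bar <- mean(h)                             # 36

manual_var <- sum((h - h_bar)^2) / (n - 1)   # sample variance from the formula
manual_sd <- sqrt(manual_var)                # sample standard deviation

manual_var                                   # 2289.5
all.equal(manual_var, var(h))                # TRUE
all.equal(manual_sd, sd(h))                  # TRUE
```

Here the squared deviations from the mean sum to 9158, and dividing by `n - 1 = 4` gives 2289.5.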

For your convenience, the following table summarizes all the functions introduced in this section.

Operation | Explanation |
---|---|
`min(h)` | the minimum value |
`max(h)` | the maximum value |
`range(h)` | both the minimum value and the maximum value |
`which.min(h)` | the (first) location of the minimum value |
`which.max(h)` | the (first) location of the maximum value |
`cummin(h)` | the cumulative minimum values |
`cummax(h)` | the cumulative maximum values |
`sum(h)` | the sum of all elements |
`cumsum(h)` | the cumulative sums |
`prod(h)` | the product of all elements |
`cumprod(h)` | the cumulative products |
`mean(h)` | the average of all elements |
`median(h)` | the middle number in `sort(h)` |
`quantile(h)` | the 0th, 25th, 50th, 75th, and 100th percentiles |
`IQR(h)` | the difference between the 3rd quartile and the 1st quartile |
`quantile(h, probs = 0.95)` | the 95th percentile |
`quantile(h, probs = c(0.1, 0.2, 0.99))` | several quantiles at a time |
`summary(h)` | 5 percentiles and the mean |
`var(h)` | the sample variance |
`sd(h)` | the sample standard deviation |

### 2.5.2 Character vectors

Compared with numeric vectors, there are far fewer things you can do with character vectors. You can, however, still apply `summary()`.

```
animals <- rep(c("sheep", "pig", "monkey"), 3:1)
summary(animals)
```

From the result, you can see that `summary()` only tells you the vector length (6 elements) and the vector type (character), which is much less informative than for numeric vectors.

### 2.5.3 Logical vectors

What if we apply `summary()` to logical vectors?

```
logic <- rep(c(TRUE, FALSE, TRUE), 3:1)
summary(logic)
```

Similar to character vectors, you get the vector type, which is `logical` here. You also get a frequency table showing how many times `FALSE` and `TRUE` appear in the vector.

Unlike character vectors, logical vectors work with almost all the functions in the summary table at the end of Section 2.5.1, because the **coercion rule** introduced in Section 2.2.3 converts logical values into numeric values: `TRUE` is converted to 1 and `FALSE` is converted to 0. Let’s take a look at an example.

```
a <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
sum(a)
mean(a)
```

Clearly, `sum(a)` equals 3 since there are 3 `TRUE` values, and `mean(a)` equals 0.6 since it is the average of three 1s and two 0s. You are welcome to try other functions on a logical vector.
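As a starting point, here is a short sketch applying a few more functions from Section 2.5.1 to `a`. The result names (`a_num`, `a_cumsum`, `a_max`, `a_var`) are chosen only for this example.

```r
a <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

a_num <- as.numeric(a)   # explicit version of the coercion
a_cumsum <- cumsum(a)    # running count of TRUE values
a_max <- max(a)          # 1 if at least one element is TRUE
a_var <- var(a)          # sample variance of the coerced 0/1 values

a_num      # 1 1 0 0 1
a_cumsum   # 1 2 2 2 3
a_max      # 1
a_var      # 0.3
```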

### 2.5.4 Exercises

Suppose `x <- c(5, 2, 4, 1, 2, 1)` and `y <- c(T, F, F, F, F, T, T, F, F, T)`.

1. Write R code to reproduce each element of the summary vector `summary(x)`.
2. Write R code to generate the cumulative sum, cumulative product, cumulative minimum, and cumulative maximum of `x`.
3. Write R code to generate a vector consisting of the 0.1, 0.2, 0.6, 0.8, 0.9 quantiles of `x`.
4. Write R code to calculate the sample variance and sample standard deviation of `x`.
5. Write R code to generate a length-2 vector consisting of the sum and mean of `y`. Then show the unique elements of `y` and their corresponding frequencies.