9.1 Normal Distribution

First, let’s review the definition of the normal distribution, which is also called the Gaussian distribution. If \(X\sim N(\mu, \sigma^2)\), then \(X\) is a random variable that follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\).

The following table lists four useful functions for the normal distribution. Each one is introduced in one of the next four parts.

Code                 Name                               Section
dnorm(x, mean, sd)   probability density function       9.1.1
pnorm(q, mean, sd)   cumulative distribution function   9.1.2
qnorm(p, mean, sd)   quantile function                  9.1.3
rnorm(n, mean, sd)   random number generator            9.1.4

9.1.1 Probability Density Function (pdf)

To characterize the distribution of a continuous random variable, you can use the probability density function (pdf). When \(X\sim N(\mu,\sigma^2)\), its pdf is \[f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right].\]

In R, you can use dnorm(x, mean, sd) to calculate the pdf of a normal distribution.

  • The argument x represents the location(s) at which to compute the pdf.

  • The arguments mean and sd represent the mean and standard deviation of the normal distribution, respectively.

For example, dnorm(0, mean = 1, sd = 2) computes the pdf at location 0 of \(N(1, 4)\), the normal distribution with mean 1 and variance 4.

Note that the argument sd is the standard deviation, which is the square root of the variance.

In particular, dnorm() without specifying the mean and sd arguments will compute the pdf of \(N(0,1)\), which is the standard normal distribution. Let’s see examples of computing the pdf at one location for three different normal distributions.

dnorm(0, mean = 1, sd = 2)
#> [1] 0.1760327
dnorm(1, mean = -1, sd = 0.5)
#> [1] 0.0002676605
dnorm(0) #standard normal
#> [1] 0.3989423
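As a quick sanity check, the density formula can also be evaluated by hand and compared with dnorm(). The helper name normal_pdf below is hypothetical and used only for illustration. This is a minimal sketch of the formula, not a description of how dnorm() is implemented internally.

```r
# Hypothetical helper: evaluate the N(mu, sigma^2) density directly from the formula.
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / (sqrt(2 * pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
}

manual  <- normal_pdf(0, mu = 1, sigma = 2)
builtin <- dnorm(0, mean = 1, sd = 2)

# The two values should agree up to floating-point tolerance.
all.equal(manual, builtin)
```

Note that the helper takes the standard deviation sigma, matching the sd argument of dnorm(), rather than the variance.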

In addition to computing the pdf at one location for a single normal distribution, dnorm() also accepts vectors with more than one element in all three arguments. For example, you can use the following code to compute the three pdf values in the previous code block.

dnorm(c(0, 1, 0), mean = c(1, -1, 0), sd = c(2, 0.5, 1))
#> [1] 0.1760326634 0.0002676605 0.3989422804

If you want to compute the pdf at the same location 0 for distributions \(N(1,4)\), \(N(-1, 0.25)\), and \(N(0, 1)\), you can use the following code.

dnorm(0, mean = c(1, -1, 0), sd = c(2, 0.5, 1))
#> [1] 0.1760327 0.1079819 0.3989423

If you want to compute the pdf at three different locations (-3, 2, and 5) for distribution \(N(3, 4)\), you can use the following code.

dnorm(c(-3, 2, 5), mean = 3, sd = 2)
#> [1] 0.002215924 0.176032663 0.120985362
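The three calls above rely on R's usual vector recycling, where shorter arguments are reused to match the longest one. The sketch below makes this explicit by comparing a vectorized call with element-by-element calls through mapply(). The variable names are illustrative only.

```r
locs  <- c(0, 1, 0)
means <- c(1, -1, 0)
sds   <- c(2, 0.5, 1)

# One vectorized call: element i uses locs[i], means[i], and sds[i].
vectorized <- dnorm(locs, mean = means, sd = sds)

# The same computation written as three separate calls.
one_by_one <- mapply(function(x, m, s) dnorm(x, mean = m, sd = s),
                     locs, means, sds)
all.equal(vectorized, one_by_one)

# A single location is recycled against every (mean, sd) pair.
recycled <- dnorm(0, mean = means, sd = sds)
expanded <- dnorm(c(0, 0, 0), mean = means, sd = sds)
all.equal(recycled, expanded)
```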

To get a better understanding of the shape of the normal pdf, let’s visualize the pdf of \(N(0,1)\). You first need to create an equally spaced vector x from -5 to 5 with increment 0.05. Then, you can compute the pdf value for each element of x using dnorm(). Finally, you can visualize the pdf using geom_line().

library(ggplot2) # provides ggplot() and geom_line()
x <- seq(from = -5, to = 5, by = 0.05)
norm_dat <- data.frame(x = x, pdf = dnorm(x))
ggplot(norm_dat) + geom_line(aes(x = x, y = pdf))

Next, you can take a step further to visualize three different normal distributions in the same plot, \(N(0,1)\), \(N(1,4)\), and \(N(-1, 0.25)\). You can use the same vector x and compute the three pdfs on each element of x. geom_line() is still used with the variable dist mapped to the color aesthetic.

x <- seq(from = -5, to = 5, by = 0.05)
norm_dat_1 <- data.frame(dist = "N(0,1)", x = x, pdf = dnorm(x))
norm_dat_2 <- data.frame(dist = "N(1,4)", x = x, pdf = dnorm(x, mean = 1, sd = 2))
norm_dat_3 <- data.frame(dist = "N(-1, 0.25)", x = x, pdf = dnorm(x, mean = -1, sd = 0.5))
norm_dat <- rbind(norm_dat_1, norm_dat_2, norm_dat_3)
ggplot(norm_dat) + geom_line(aes(x = x, y = pdf, color = dist))

9.1.2 Cumulative Distribution Function (cdf)

In addition to the pdf, you can compute the cumulative distribution function (cdf) of the normal distribution using the function pnorm(q, mean, sd). Generally speaking, the cdf of a random variable \(X\) is defined as \[F(x) = P(X\leq x).\] Like dnorm(), pnorm() has two optional arguments, mean and sd, which represent the mean and standard deviation of the normal distribution, respectively. If you don’t specify these two arguments, pnorm() will compute the cdf of \(N(0,1)\).

pnorm(0, mean = 1, sd = 2)
#> [1] 0.3085375
pnorm(0) # cdf at 0 of standard normal
#> [1] 0.5
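Because the cdf is the area under the pdf to the left of a point, pnorm() can be cross-checked against numerical integration of dnorm(), and differences of cdf values give interval probabilities. The following is a minimal sketch using base R's integrate(). The variable names are illustrative.

```r
# Area under the N(1, 4) pdf from -Inf to 0, computed numerically.
area <- integrate(dnorm, lower = -Inf, upper = 0, mean = 1, sd = 2)$value

# Should agree with the cdf value, up to the accuracy of the numerical integration.
all.equal(area, pnorm(0, mean = 1, sd = 2), tolerance = 1e-4)

# P(a < X <= b) = F(b) - F(a); here X ~ N(0, 1), a = -1, b = 1.
prob_within_1sd <- pnorm(1) - pnorm(-1)
round(prob_within_1sd, 4) # about 0.6827
```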

You can also use pnorm() to visualize the cdf of the standard normal distribution.

q <- seq(from = -5, to = 5, by = 0.1)
norm_dat <- data.frame(q = q, cdf = pnorm(q))
ggplot(norm_dat) + geom_line(aes(x = q, y = cdf))

9.1.3 Quantile Function

The third useful function related to distributions is the quantile function. You can compute the quantiles of the normal distribution using qnorm(p, mean, sd). The quantile function is the inverse function of the cdf. In particular, for \(0 < p < 1\), the \(p\) quantile is the value \(x\) such that \[F(x) = P(X\leq x) = p.\]

Let’s verify that qnorm() is indeed the inverse function of pnorm() using the following example.

pnorm(qnorm(c(0.5, 0.7)))
#> [1] 0.5 0.7

When \(p=0.5\), qnorm() gives us the median of the normal distribution. Let’s see a few examples for computing the quantiles.

qnorm(0.5, mean = 1, sd = 2)
#> [1] 1
qnorm(0.5)
#> [1] 0
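A common use of the quantile function is to find cut-off values that enclose a given central probability. As a sketch (the variable names are illustrative), the code below finds the interval containing the middle 95% of \(N(1, 4)\) and checks it with pnorm().

```r
# Lower and upper cut-offs leaving 2.5% in each tail of N(1, 4).
bounds <- qnorm(c(0.025, 0.975), mean = 1, sd = 2)
bounds

# The probability between the two cut-offs should be 0.95.
coverage <- pnorm(bounds[2], mean = 1, sd = 2) - pnorm(bounds[1], mean = 1, sd = 2)
coverage

# Quantiles of N(mu, sigma^2) are a shift and rescale of standard normal quantiles.
all.equal(bounds, 1 + 2 * qnorm(c(0.025, 0.975)))
```

The last line reflects the fact that if \(Z\sim N(0,1)\), then \(\mu + \sigma Z\sim N(\mu, \sigma^2)\).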

You can also visualize the shape of the quantile function.

p <- seq(from = 0.01, to = 0.99, by = 0.01)
norm_dat <- data.frame(p = p, quantile = qnorm(p))
ggplot(norm_dat) + geom_line(aes(x = p, y = quantile))

9.1.4 Random Number Generator

Lastly, to generate (draw) random numbers from a normal distribution, you can use the function rnorm(n, mean, sd). The argument n is the number of random numbers to generate, and the arguments mean and sd are the mean and standard deviation of the normal distribution you would like to generate from. Again, if you supply only the argument n, you will generate random numbers from \(N(0,1)\).

rnorm(3, mean = 0, sd = 1) #generate 3 random numbers from N(0, 1)
#> [1] -0.3393447 -0.1928847  0.2492699
rnorm(3) #generate another 3 random numbers from N(0,1)
#> [1]  0.1987472 -0.9778133  1.1789737

Since you are generating random numbers, the results may be different each time. In many applications, however, you may want to make the results reproducible. To do this, you can set the random seed using the function set.seed() before generating the random numbers. Let’s see the following example.

set.seed(0)
rnorm(3)
#> [1]  1.2629543 -0.3262334  1.3297993

Now, let’s run it one more time.

set.seed(0)
rnorm(3)
#> [1]  1.2629543 -0.3262334  1.3297993

You can see that exactly the same three numbers are reproduced because the same random seed, 0, is used. Running these two lines of code on another machine with the same R version and default random number generator settings should give the same three random numbers.

Note that the code involving randomness needs to be identical to reproduce the results. If you change the arguments in rnorm(), the output of each call can change. See the following example.

set.seed(0)
rnorm(1)
#> [1] 1.262954
rnorm(3)
#> [1] -0.3262334  1.3297993  1.2724293

By setting a different random seed, you will see different results, as in the following example.

set.seed(1)
rnorm(3)
#> [1] -0.6264538  0.1836433 -0.8356286
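The behaviour described in this part can also be checked programmatically with identical(). The sketch below reuses the seed 0 mentioned above. The variable names are illustrative, and the split-call comparison assumes R's default random number generator settings.

```r
set.seed(0)
first_run <- rnorm(3)

set.seed(0)
second_run <- rnorm(3)
identical(first_run, second_run) # TRUE: same seed, same call

# Splitting the draws across two calls still consumes the same random stream.
set.seed(0)
split_run <- c(rnorm(1), rnorm(2))
identical(first_run, split_run)

# A different seed starts a different stream.
set.seed(1)
other_run <- rnorm(3)
identical(first_run, other_run) # FALSE
```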

Lastly, let’s do a simple statistical exercise by checking how close the sample mean, sample variance, and sample standard deviation are to their population counterparts (1, 4, and 2 for \(N(1, 4)\)).

x <- rnorm(1e6, mean = 1, sd = 2)
mean(x) #sample mean
#> [1] 1.000093
var(x)  #sample variance
#> [1] 4.001485
sd(x)   #sample standard deviation
#> [1] 2.000371
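The same idea extends beyond the mean and standard deviation: with a large sample, empirical proportions and sample quantiles should be close to the values given by pnorm() and qnorm(). The following is a minimal sketch, with an arbitrarily chosen seed (123) so that the run is reproducible.

```r
set.seed(123) # arbitrary seed, chosen only for reproducibility
x <- rnorm(1e6, mean = 1, sd = 2)

# Empirical P(X <= 0) versus the theoretical cdf value.
emp_cdf <- mean(x <= 0)
c(empirical = emp_cdf, theoretical = pnorm(0, mean = 1, sd = 2))

# Sample 0.9 quantile versus the theoretical quantile.
emp_q90 <- unname(quantile(x, 0.9))
c(empirical = emp_q90, theoretical = qnorm(0.9, mean = 1, sd = 2))
```

With one million draws, the empirical and theoretical values typically agree to two or three decimal places.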

9.1.5 Exercise