4.12 Histograms

In Section 4.10, we learned how to use geom_bar() to generate bar charts for visualizing the distributions of discrete variables. You may be wondering how to visualize continuous variables. One popular choice is the histogram.

Let’s again use the sahp housing price data set.

4.12.1 Using the hist() function

To generate a histogram, you can simply use hist() with the variable as the argument.

library(r02pro)
hist(sahp$sale_price)

On the x-axis, the histogram displays the range of values for the sale price. The histogram divides the x-axis into bins of equal width and draws a bar over each bin, with the height of the bar showing the number of observations that fall in that bin (labeled Frequency on the y-axis).
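To see the numbers behind the picture, hist() can return the bins and counts without drawing anything. The following is a minimal sketch on a small made-up vector; the values in x and the chosen breaks are hypothetical and are not taken from sahp.

```r
# Hypothetical toy data standing in for a continuous variable
x <- c(12, 47, 58, 63, 71, 88, 95, 104, 131, 187)

# plot = FALSE returns the binning information instead of drawing the plot
h <- hist(x, breaks = seq(from = 0, to = 200, by = 50), plot = FALSE)

h$breaks  # bin boundaries: 0 50 100 150 200
h$counts  # observations per bin: 2 5 2 1
```

The counts always add up to the number of non-missing observations, which is a quick sanity check on any choice of bins.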

4.12.2 Using the geom_histogram() function

In addition to the hist() function in base R, the geom_histogram() function in the ggplot2 package provides richer functionality. Let’s take a quick look.

library(ggplot2)
ggplot(data = sahp) + 
  geom_histogram(mapping = aes(x = sale_price))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We see a message saying “stat_bin() using bins = 30”, which indicates that the histogram uses 30 bins by default. Next, we introduce three different ways to customize the bins.

a. Use aesthetic bins.

We can change the number of bins with the global aesthetic bins.

ggplot(data = sahp) + 
  geom_histogram(mapping = aes(x = sale_price), 
                 bins = 5)

We can see that the histogram now has 5 bins and each bin has the same width.

b. Use aesthetic binwidth.

As the message suggests, another way to control the bins is to specify the binwidth, which is another global aesthetic. Here we set the width of each bin to 100.

ggplot(data = sahp) + 
  geom_histogram(mapping = aes(x = sale_price), 
                 binwidth = 1e2)
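If you want to check which bins a given binwidth actually produces, one option is to inspect the data that ggplot2 computes for the layer. The sketch below uses layer_data() on a made-up data frame; the name toy and its values are hypothetical, and where the bins start depends on ggplot2's defaults, so no particular bin edges are assumed here.

```r
library(ggplot2)

# Hypothetical toy data, not taken from sahp
toy <- data.frame(price = c(12, 47, 58, 63, 71, 88, 95, 104, 131, 187))

p <- ggplot(data = toy) +
  geom_histogram(mapping = aes(x = price), binwidth = 50)

# layer_data() returns the data computed for the layer,
# including the bin edges (xmin, xmax) and the count in each bin
bins_used <- layer_data(p)
bins_used[, c("xmin", "xmax", "count")]
```

Every row should span exactly one binwidth, and the counts should add up to the number of observations.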

c. Manually set the bins.

If desired, you can manually set the bins via the breaks argument in the geom_histogram() function.

ggplot(data = sahp) + 
  geom_histogram(mapping = aes(x = sale_price), 
                 breaks = seq(from = 0, to = 600, by = 100))  

The breaks argument is a numeric vector giving the boundaries of the bins. Let’s verify the height of the first bin.

# verify the count in the first bin; na.rm = TRUE skips missing sale prices
sum(sahp$sale_price < 100, na.rm = TRUE)
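The same kind of check can be done for all bins at once with cut() and table(). This is a sketch on hypothetical values rather than on sahp. It uses right-closed intervals (a, b], which, as far as I know, matches the default closed = "right" setting of geom_histogram().

```r
# Hypothetical sale prices; NA marks a missing value
price <- c(12, 47, 58, 63, 71, 88, 95, 104, 131, 187, NA)
breaks <- seq(from = 0, to = 200, by = 50)

# cut() assigns each value to a bin; right = TRUE gives (a, b] intervals,
# and include.lowest = TRUE also keeps a value equal to the first break
bin <- cut(price, breaks = breaks, right = TRUE, include.lowest = TRUE)

# table() drops NA by default, so the missing value is not counted
bin_counts <- table(bin)
bin_counts
```

With right-closed bins, a value exactly equal to a break belongs to the bin on its left, so a strict comparison such as price < 50 and the bin count can differ when such values are present.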

4.12.3 Aesthetics in geom_histogram()

Next, we introduce the aesthetics in histograms, which are very similar to those in bar charts. For example, we can map a variable to the fill aesthetic.

ggplot(data = sahp) + 
  geom_histogram(mapping = aes(x = sale_price, fill = house_style), 
                 bins = 5)

We can see that, as in the bar chart, the bar for each bin is now divided into sub-bars with different colors. Each color corresponds to a value of house_style, and the height of each sub-bar represents the number of houses with sale_price in that bin and that value of house_style. Just like geom_bar(), geom_histogram() has a global aesthetic called position, which adjusts the position of the sub-bars. The default value is again "stack" if you don’t specify it.
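The sub-bar heights described above are simply the cells of a two-way table of bin by group. Here is a minimal sketch in base R; the vectors price and style are hypothetical stand-ins for sale_price and house_style.

```r
# Hypothetical data: a continuous variable and a grouping variable
price <- c(12, 47, 58, 63, 71, 88, 95, 104, 131, 187)
style <- c("1Story", "2Story", "1Story", "1Story", "2Story",
           "1Story", "2Story", "2Story", "1Story", "2Story")

# Assign each price to a bin, then cross-tabulate bin by style
bin <- cut(price, breaks = seq(from = 0, to = 200, by = 50),
           include.lowest = TRUE)
counts <- table(bin, style)

counts           # each cell is the height of one sub-bar
rowSums(counts)  # each row sum is the total height of a stacked bar
```

Stacking and dodging show the same cell counts; they differ only in whether the sub-bars are placed on top of or next to each other.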

a. Stacked bars

ggplot(data = sahp) + 
  geom_histogram(mapping = aes(x = sale_price, fill = house_style),
                 position = "stack", 
                 bins = 5)

As expected, we get the same histogram as before.

b. Dodged bars

The second option for position is "dodge", which places the sub-bars beside one another, making it easier to compare individual counts for the combination of a bin of sale_price and house_style.

ggplot(data = sahp) + 
  geom_histogram(mapping = aes(x = sale_price, fill = house_style),
                 position = "dodge", 
                 bins = 5)

c. Filled bars

Another option for position is "fill". It works like stacking, but makes each set of stacked bars the same height.

ggplot(data = sahp) + 
  geom_histogram(mapping = aes(x = sale_price, fill = house_style),
                 position = "fill", 
                 breaks = seq(from = 0, to = 600, by = 100))

Just like with geom_bar(), the y-axis would be more precisely labeled “proportion” rather than “count.” This makes it easier to compare the proportions of different values of house_style across different bins of sale_price. For example, we can see that the proportion of house_style = "1.5Fin" is highest for the houses with sale price less than 100.
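The proportions shown by position = "fill" can be reproduced by normalizing the bin-by-group counts within each bin. The sketch below uses prop.table() on hypothetical data; price and style are made-up stand-ins for sale_price and house_style.

```r
# Hypothetical data: a continuous variable and a grouping variable
price <- c(12, 47, 58, 63, 71, 88, 95, 104, 131, 187)
style <- c("1Story", "2Story", "1Story", "1Story", "2Story",
           "1Story", "2Story", "2Story", "1Story", "2Story")

bin <- cut(price, breaks = seq(from = 0, to = 200, by = 50),
           include.lowest = TRUE)

# margin = 1 normalizes within each row, i.e. within each bin
props <- prop.table(table(bin, style), margin = 1)

props           # share of each style inside every bin
rowSums(props)  # each bin sums to 1, matching the equal bar heights
```

Note that a bin with no observations would give NaN here, because its counts are divided by a total of zero.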

4.12.4 Density Estimate Using Histograms

In addition to using histograms to visualize the distribution of a continuous variable, you can also construct a density estimate of the variable using a proper normalization. To generate such a density estimate, you can add y = ..density.. as a mapping in the aes() function in geom_histogram(). In ggplot2 3.4.0 and later, y = after_stat(density) is the preferred spelling of the same mapping. Let’s see an example below.

ggplot(data = sahp) +  
  geom_histogram(aes(x = sale_price, y = ..density..), 
                 breaks = seq(from = 0, to = 6e2, by = 1e2)) + 
  scale_y_continuous(breaks = seq(from = 0, to = 7e-3, by = 1e-3))

Here, we set the breaks both for the bins and for the tick marks on the y-axis.

Let’s calculate the height of the first bar together. By the definition of density, the total area of the bars is 1, so the area of each bar equals the proportion of observations that fall in its bin. First, we get the area of the first bar. Then we divide it by the bin width (100) to get the height.

sale_price_no_na <- na.omit(sahp$sale_price) 
sum(sale_price_no_na < 1e2)/length(sale_price_no_na) #area of the first bar
#> [1] 0.1036585
sum(sale_price_no_na < 1e2)/length(sale_price_no_na) /1e2 #height of the first bar
#> [1] 0.001036585

You can see that the height matches the value on the y-axis for the first bar.
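The same calculation works for every bar. As a minimal sketch in base R, again on a hypothetical vector rather than sahp, we can compute the density heights by hand, compare them with the ones returned by hist(), and confirm that the total area is 1.

```r
# Hypothetical toy data standing in for a continuous variable
x <- c(12, 47, 58, 63, 71, 88, 95, 104, 131, 187)

h <- hist(x, breaks = seq(from = 0, to = 200, by = 50), plot = FALSE)
widths <- diff(h$breaks)

# height = (proportion of observations in the bin) / (bin width)
manual_density <- h$counts / sum(h$counts) / widths

all.equal(manual_density, h$density)  # the two agree
sum(manual_density * widths)          # total area of the bars is 1
```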

4.12.5 Exercises

Use the sahp data set to answer the following questions.

  1. Create three different histograms of the living area (liv_area), one for each of the following settings:
  • Use 10 bins.
  • Set the binwidth to be 300.
  • Set the bins manually to an equally spaced sequence from 0 to 3500 with increment 500.
  2. Create a histogram of the living area (liv_area) with 5 bins, and show the information of different kit_qual values in each bar. What conclusions can you draw from this plot?