4.9 Jitter and Count Plots

We have seen that a scatterplot is a useful tool for visualizing the relationship between two continuous variables. You may be wondering what happens if we use it on two discrete variables.

4.9.1 Overplotting

As an example, let’s generate a scatterplot between kit_qual and heat_qual.

library(r02pro)
library(tidyverse)
ggplot(data = sahp) + 
  geom_point(mapping = aes(x = kit_qual, y = heat_qual)) #overplotting

From the plot, you may immediately notice that many data points overlap. In fact, the plot can contain at most 16 distinct points, since both variables have 4 categories. This phenomenon is called overplotting. Overplotting is undesirable because it hides useful information about the joint distribution. For example, we cannot tell which of the 16 possible value pairs appear more frequently in the data.
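To make the loss of information concrete, here is a minimal base R sketch on a small made-up data frame. The object toy and its columns kit and heat are hypothetical stand-ins, not the sahp data. It compares the number of rows with the number of distinct positions a plain scatterplot could show.

```r
# Toy data (hypothetical, not sahp): two discrete variables with 4 levels each
set.seed(1)
lv <- c("Fair", "Average", "Good", "Excellent")
toy <- data.frame(
  kit  = sample(lv, 200, replace = TRUE),
  heat = sample(lv, 200, replace = TRUE)
)

n_rows     <- nrow(toy)          # number of points a scatterplot would draw
n_distinct <- nrow(unique(toy))  # number of distinct positions, at most 4 * 4

n_rows
n_distinct
```

However many rows the data has, the number of distinct positions cannot exceed 16, so most points are drawn exactly on top of one another.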

To solve the overplotting issue, we introduce two solutions: jittering and count plots.

4.9.2 Jittering

The first method for solving the overplotting issue is jittering, i.e., adding a small random perturbation to every data point. You can use the geom_jitter() function, which first perturbs the data points and then generates a scatterplot.

ggplot(data = sahp) + 
  geom_jitter(mapping = aes(x = kit_qual, y = heat_qual))

In the jittered plot, we can clearly see which pairs of kit_qual and heat_qual have more observations. By default, the perturbation is applied both horizontally and vertically, and in each direction a point can move by up to 40% of the resolution of the data. To customize the amount of jittering, you can specify the argument width for the horizontal jittering and height for the vertical jittering. Because discrete categories are placed one unit apart, a value of 0.1 moves each point by at most 0.1 of the distance between neighboring categories. To turn off the horizontal jittering, you can specify width = 0.

ggplot(data = sahp) + 
  geom_jitter(mapping = aes(x = kit_qual, y = heat_qual),  
              width = 0.1, 
              height = 0.1)
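The idea behind jittering can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the positions in x are made up, and the noise is assumed to be uniform, which is how base R's jitter() behaves when given an amount. The internals of geom_jitter() may differ in detail.

```r
# Sketch of jittering on toy positions (hypothetical data, not sahp)
set.seed(42)                      # make the random noise reproducible
x <- rep(1:4, each = 50)          # four categories placed at 1, 2, 3, 4
width <- 0.4                      # 40% of the resolution, which is 1 here

# Add uniform noise between -width and +width to every point
x_jit <- x + runif(length(x), min = -width, max = width)

# Every point stays within `width` of its category ...
max_shift <- max(abs(x_jit - x))
# ... so rounding recovers the original category
recovered <- all(round(x_jit) == x)
```

Since the shift is always below 0.5, the clouds around neighboring categories never overlap. With a width or height of 0.5 or more, the categories would start to blend into each other.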

4.9.3 Count Plots

When we want to visualize the distribution of a pair of discrete variables, another method for solving the overplotting issue is the count plot, which uses circles of different sizes to represent the frequency of each value pair. You can use the function geom_count() to generate a count plot.

ggplot(data = sahp) + 
  geom_count(mapping = aes(x = kit_qual, y = heat_qual))

From this plot, you can read off the frequency of each value pair using the legend, which relates the size of each circle to the count.
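The tally that a count plot displays can also be computed directly. The following base R sketch uses a tiny made-up data frame (toy, kit, and heat are hypothetical names, not part of sahp) and produces one row per observed value pair, with the frequency in a column n. It illustrates the idea only and is not the code geom_count() runs internally.

```r
# Toy data (hypothetical, not sahp): six observations of two discrete variables
toy <- data.frame(
  kit  = c("Good", "Good", "Fair", "Good", "Fair", "Average"),
  heat = c("Good", "Good", "Good", "Fair", "Good", "Average")
)

# Cross-tabulate the two variables and reshape to one row per value pair
counts <- as.data.frame(table(kit = toy$kit, heat = toy$heat),
                        responseName = "n")

# Keep only the pairs that actually occur in the data
counts <- counts[counts$n > 0, ]
counts
```

Each remaining row corresponds to one circle in a count plot, and the value of n determines the size of that circle.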

4.9.4 Exercises

For the sahp dataset, answer the following questions.

  1. Create a scatterplot between bedroom and bathroom.

  2. What problem do you think this plot has? Provide two different plots that address this issue.