We have seen that a scatterplot is a useful tool for visualizing the relationship between two continuous variables. You may be wondering what happens if we use it on two discrete variables.
As an example, let’s generate a scatterplot between kit_qual and heat_qual.
library(r02pro)
library(tidyverse)
ggplot(data = sahp) + geom_point(mapping = aes(x = kit_qual, y = heat_qual)) # overplotting
From the plot, you may immediately notice that many data points overlap. In fact, there can be at most 16 distinct points on the plot, since both variables have 4 categories. This phenomenon is called overplotting. Overplotting is undesirable because it hides useful information about the joint distribution. For example, we cannot tell which of the 16 possible value pairs appear more frequently in the data.
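To make the gap between observations and visible points concrete, here is a minimal sketch in base R. The data frame toy is made-up illustrative data, not the sahp data set; it only reuses the two column names.

```r
# Made-up illustrative data (hypothetical, not the sahp data set)
toy <- data.frame(
  kit_qual  = c("Good", "Good", "Good", "Fair", "Fair", "Average"),
  heat_qual = c("Good", "Good", "Fair", "Fair", "Fair", "Good")
)

# Number of observations in the data
n_obs <- nrow(toy)

# Number of distinct (kit_qual, heat_qual) pairs, i.e. the number of
# separate positions a plain scatterplot could show
n_positions <- nrow(unique(toy))

c(observations = n_obs, distinct_positions = n_positions)
```

Whenever the second number is smaller than the first, some points in a plain scatterplot must lie exactly on top of each other.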
To address overplotting, we introduce two solutions: jittering and counts plots.
The first method for solving the overplotting issue is to add a small random perturbation to every data point, i.e., jittering. You can use the geom_jitter() function, which works by first perturbing the data points and then generating a scatterplot.
ggplot(data = sahp) + geom_jitter(mapping = aes(x = kit_qual, y = heat_qual))
In the jittered plot, we can clearly see which pairs of kit_qual and heat_qual have more observations. By default, the points are perturbed both vertically and horizontally, each by up to 40% of the resolution of the data. To customize the amount of jittering, you can specify the arguments width for the amount of horizontal jittering and height for the amount of vertical jittering, both in units of the resolution of the data. To turn off the horizontal jittering, you can specify width = 0.
ggplot(data = sahp) + geom_jitter(mapping = aes(x = kit_qual, y = heat_qual), width = 0.1, height = 0.1)
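To see what the amount of jittering means numerically, the following base R sketch perturbs a few made-up category codes by hand. It illustrates the idea only and is not ggplot2's own code; in particular, drawing the shifts from a uniform distribution is an assumption of this sketch, and x and width here are hypothetical stand-ins.

```r
set.seed(1)  # fix the random draws so the sketch is reproducible

# Made-up category codes 1..4 standing in for a discrete variable
x <- c(1, 1, 1, 2, 2, 3, 4, 4)

# Maximum horizontal shift, in units of the resolution of the data
# (the resolution is 1 here because neighbouring categories are 1 apart)
width <- 0.4

# Shift every point by an independent draw from [-width, width]
# (assumption: uniform shifts, chosen for illustration)
x_jittered <- x + runif(length(x), min = -width, max = width)

round(x_jittered, 2)
```

With a width of 0.4, each point stays within 0.4 of its original position, so points from neighbouring categories cannot cross over. A smaller value such as 0.1 produces tighter clusters around each category.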
When we want to visualize the joint distribution of a pair of discrete variables, another way to solve the overplotting issue is the counts plot, which uses circles of different sizes to represent the frequency of each value pair. You can use the function geom_count() to generate a counts plot.
ggplot(data = sahp) + geom_count(mapping = aes(x = kit_qual, y = heat_qual))
From this plot, you can read off the frequency of each value pair using the legend, which relates the size of each circle to the count.
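If you want the exact counts rather than an estimate from the legend, one option is to inspect the values the plot layer computes. The sketch below uses the ggplot2 function layer_data() on made-up illustrative data (toy is hypothetical, not the sahp data set).

```r
library(ggplot2)

# Made-up illustrative data (hypothetical, not the sahp data set)
toy <- data.frame(
  kit_qual  = c("Good", "Good", "Good", "Fair", "Fair", "Average"),
  heat_qual = c("Good", "Good", "Fair", "Fair", "Fair", "Good")
)

p <- ggplot(data = toy) + geom_count(mapping = aes(x = kit_qual, y = heat_qual))

# layer_data() returns the data computed for the first layer of the plot;
# the column n holds the number of observations at each (x, y) position
counts <- layer_data(p)
counts[, c("x", "y", "n")]
```

Here x and y are the positions of the categories on the two axes, and the values of n add up to the number of rows in toy.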