4.1 Introduction to the Ames Housing Price Data Set

Before generating any beautiful plots, let’s first introduce the data set that will be used throughout this chapter. The data set is a part of the Ames Housing Price data, containing 165 observations and 12 features including the sale date and price.

The data set sahp is located in the R package r02pro, the companion package of this book. Besides the r02pro package, we will also use the ggplot2 package extensively for visualization in this chapter. Like the tibble package, ggplot2 is a member of the tidyverse collection of packages. If you haven’t installed ggplot2 yet, you can do so as follows.

install.packages("ggplot2")

First, let’s load the r02pro, ggplot2, and tibble packages.

library(r02pro)
library(ggplot2)
library(tibble)

After loading the three packages, you can type sahp to take a quick look at the data set.

sahp
#> # A tibble: 165 × 12
#>    dt_sold    bedroom bathroom gar_car oa_qual liv_area lot_area house_style
#>    <date>       <dbl>    <dbl>   <dbl>   <dbl>    <dbl>    <dbl> <chr>      
#>  1 2010-03-25       3      2.5       2       6     1479    13517 2Story     
#>  2 2009-04-10       4      3.5       2       7     2122    11492 2Story     
#>  3 2010-01-15       3      2         1       5     1057     7922 1Story     
#>  4 2010-04-19       3      2.5       2       5     1444     9802 2Story     
#>  5 2010-03-22       3      2         2       6     1445    14235 1.5Fin     
#>  6 2010-06-06       2      2.5       2       6     1888    16492 1Story     
#>  7 2006-06-14       2      3         2       6     1072     3675 SFoyer     
#>  8 2010-05-08       3      2         2       5     1188    12160 1Story     
#>  9 2007-06-14       2      1         1       5      924    15783 1Story     
#> 10 2007-09-01       5      2.5       2       5     2080    11606 2Story     
#> # … with 155 more rows, and 4 more variables: kit_qual <chr>, heat_qual <chr>,
#> #   central_air <chr>, sale_price <dbl>

You can see that sahp is a tibble with 165 observations and 12 variables. By default, the output only shows the first 10 observations in the tibble along with the first few variables that can fit the window. To view the full dataset, you can use the View() function, which will open the dataset in a new window.

View(sahp)

To view the top rows of the dataset, you can use the head() function, which produces the first 6 observations by default. You can also set an optional second argument to pick any given number of observations.

head(sahp)
head(sahp, 15)
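As a small companion sketch, the base R functions tail(), nrow(), ncol(), and dim() complement head(). The data frame toy below is made up for illustration so that the code runs without the r02pro package; the same calls should work on sahp.

```r
# A made-up data frame standing in for sahp (hypothetical values)
toy <- data.frame(
  bedroom    = c(3, 4, 3, 2, 5, 3, 2, 4),
  sale_price = c(130, 220, 109, 174, 138, 190, 140, 142)
)

first_three <- head(toy, 3)   # first 3 rows
last_two    <- tail(toy, 2)   # last 2 rows

nrow(toy)                     # number of observations
ncol(toy)                     # number of variables
dim(toy)                      # both at once: rows, then columns
```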

To get a first impression of the dataset, you can use the summary() function introduced in Section 2.5.

summary(sahp)

In the output, we get the summary statistics for each variable. For numeric variables, we get the minimum, 1st quartile, median, mean, 3rd quartile, and maximum, as well as the number of NAs if the variable has any missing values. For character variables, we only get the length of the vector, the class, and the mode.
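To see the difference between the two kinds of summaries in isolation, here is a minimal sketch on two short vectors. The vectors price and style are made up for illustration and are not taken from sahp.

```r
# Made-up vectors for illustration (not taken from sahp)
price <- c(130, NA, 109, 174, 138)
style <- c("2Story", "2Story", "1Story", "2Story", "1.5Fin")

num_summary <- summary(price)   # six statistics plus an NA's count
chr_summary <- summary(style)   # only Length, Class, and Mode

names(num_summary)
names(chr_summary)

# Unlike summary(), functions such as mean() return NA when values are
# missing, unless na.rm = TRUE is supplied
mean(price)                 # NA
mean(price, na.rm = TRUE)   # 137.75
```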

Although the type of each variable is shown in the result when typing sahp, a more detailed list can be obtained with the function str().

str(sahp)
#> tibble [165 × 12] (S3: tbl_df/tbl/data.frame)
#>  $ dt_sold    : Date[1:165], format: "2010-03-25" "2009-04-10" ...
#>  $ bedroom    : num [1:165] 3 4 3 3 3 2 2 3 2 5 ...
#>  $ bathroom   : num [1:165] 2.5 3.5 2 2.5 2 2.5 3 2 1 2.5 ...
#>  $ gar_car    : num [1:165] 2 2 1 2 2 2 2 2 1 2 ...
#>  $ oa_qual    : num [1:165] 6 7 5 5 6 6 6 5 5 5 ...
#>  $ liv_area   : num [1:165] 1479 2122 1057 1444 1445 ...
#>  $ lot_area   : num [1:165] 13517 11492 7922 9802 14235 ...
#>  $ house_style: chr [1:165] "2Story" "2Story" "1Story" "2Story" ...
#>  $ kit_qual   : chr [1:165] "Good" "Good" "Good" "Average" ...
#>  $ heat_qual  : chr [1:165] "Excellent" "Excellent" "Average" "Good" ...
#>  $ central_air: chr [1:165] "Y" "Y" "Y" "Y" ...
#>  $ sale_price : num [1:165] 130 NA 109 174 138 ...

The str() function gives a list of each component, the corresponding type, the length, and the first several values.

4.1.1 Are two-story houses more expensive than one-story ones?

Let’s try to answer this question by doing some analysis. First, let’s create the logical vectors corresponding to two-story and one-story houses.

story_2 <- sahp$house_style == "2Story"
story_1 <- sahp$house_style == "1Story"

Then, we create two vectors containing the prices of the two groups, respectively.

sale_price_2 <- sahp$sale_price[story_2]
sale_price_1 <- sahp$sale_price[story_1]

Finally, we can run the summary() function on both vectors.

summary(sale_price_2)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>    55.0   137.9   174.0   197.8   231.5   545.2       1
summary(sale_price_1)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    44.0   129.0   160.0   183.0   224.2   465.0

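From these summaries, we can see that each of the six statistics (minimum, 1st quartile, median, mean, 3rd quartile, and maximum) is larger for two-story houses than for one-story ones. Note that one two-story house has a missing sale price, which summary() reports under NA's and excludes from the calculation. Based on this descriptive comparison, the two-story houses in this data set tend to have a higher sale price than the one-story ones.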
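The same kind of group comparison can also be sketched in one step with the base R function tapply(), which applies a function to a vector within each group. The data frame toy below uses made-up values so that the code runs on its own; with sahp, the same call would use sahp$sale_price and sahp$house_style.

```r
# Made-up data standing in for sahp (hypothetical values, in thousands)
toy <- data.frame(
  house_style = c("2Story", "1Story", "2Story", "1Story", "2Story", "1.5Fin"),
  sale_price  = c(200, 150, NA, 170, 180, 120)
)

# Keep only the two styles being compared
two_styles <- toy[toy$house_style %in% c("1Story", "2Story"), ]

# Median sale price within each style, ignoring missing prices
med_by_style <- tapply(two_styles$sale_price, two_styles$house_style,
                       median, na.rm = TRUE)
med_by_style
```

Here na.rm = TRUE is passed on to median(), so the missing price is dropped rather than turning the whole group median into NA.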

4.1.2 Converting Data Types

When you import a data set into R, some variables may not have the desired types. In this case, it would be useful to convert them into the types you want before conducting further data analysis.

a. Convert a character vector to an unordered factor

Let’s look at the variable house_style in sahp. We can see from the output of str(sahp) that it is of chr type. Let’s confirm this and get its summary.

is.character(sahp$house_style)
#> [1] TRUE
summary(sahp$house_style)
#>    Length     Class      Mode 
#>       165 character character

As briefly mentioned before, using the summary() function on a character vector doesn’t provide us much useful information. Let’s find the unique values of this vector and get the frequency table.

unique(sahp$house_style)
#> [1] "2Story" "1Story" "1.5Fin" "SFoyer" "SLvl"
table(sahp$house_style)
#> 
#> 1.5Fin 1Story 2Story SFoyer   SLvl 
#>     21     81     50      5      8

We can see that there are five house styles along with their frequencies. It turns out to be particularly useful to convert this type of variable into a factor. Let’s use the function factor() and run the summary() function again.

house_style_factor <- factor(sahp$house_style)
summary(house_style_factor)
#> 1.5Fin 1Story 2Story SFoyer   SLvl 
#>     21     81     50      5      8
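To make it clearer what the conversion does, the following sketch inspects a small factor with levels(), nlevels(), and as.integer(). The vector style is made up for illustration.

```r
# Made-up character vector for illustration
style <- c("2Story", "1Story", "1.5Fin", "1Story", "SLvl")
style_factor <- factor(style)

levels(style_factor)       # distinct values, sorted by default
nlevels(style_factor)      # number of levels
as.integer(style_factor)   # integer codes that index into levels()
table(style_factor)        # frequency of each level
```

A factor stores each value as an integer code pointing to one of its levels, which is why summary() can report a frequency for each level instead of just the length.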

b. Convert a character vector to an ordered factor

Now, let’s take a look at another variable called kit_qual, measuring the kitchen quality. Again, let’s check the unique values.

unique(sahp$kit_qual)
#> [1] "Good"      "Average"   "Fair"      "Excellent"

In addition to taking four different values, the kitchen quality has a natural ordering. In particular, we know Fair < Average < Good < Excellent. To reflect this, you can convert this variable into an ordered factor using the factor() function. The ordered = TRUE argument indicates that we want to create an ordered factor, and the levels argument specifies the order from lowest to highest.

kit_qual_ordered_factor <- factor(sahp$kit_qual, ordered = TRUE, levels = c("Fair", "Average", "Good", "Excellent")) #convert to ordered factor
summary(kit_qual_ordered_factor)
#>      Fair   Average      Good Excellent 
#>         9        85        57        14
str(kit_qual_ordered_factor)
#>  Ord.factor w/ 4 levels "Fair"<"Average"<..: 3 3 3 2 2 3 2 2 2 1 ...
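One practical benefit of an ordered factor is that comparison operators respect the level order. The sketch below uses a short made-up vector qual rather than sahp$kit_qual.

```r
# Made-up quality ratings for illustration
qual <- c("Good", "Average", "Fair", "Excellent", "Good")
qual_ord <- factor(qual, ordered = TRUE,
                   levels = c("Fair", "Average", "Good", "Excellent"))

qual_ord >= "Good"        # TRUE for "Good" and "Excellent"
sum(qual_ord >= "Good")   # number of ratings that are at least "Good"
max(qual_ord)             # highest rating present
```

These comparisons are not meaningful for an unordered factor, where R returns NA with a warning.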

c. Convert a character vector to a logical vector

Lastly, let’s look at the variable central_air, representing whether the house has central AC or not. As before, let’s get the unique elements.

unique(sahp$central_air)
#> [1] "Y" "N"

Intuitively, you can create a logical vector representing whether the house has central AC.

central_air_logi <- sahp$central_air == "Y"
summary(central_air_logi)
str(central_air_logi)
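A logical vector is convenient because TRUE counts as 1 and FALSE as 0 in arithmetic. The sketch below shows this on a made-up vector air; the same functions apply to central_air_logi.

```r
# Made-up central AC indicators for illustration
air <- c("Y", "Y", "N", "Y", "N")
air_logi <- air == "Y"

sum(air_logi)     # number of houses with central AC
mean(air_logi)    # proportion of houses with central AC
table(air_logi)   # counts of FALSE and TRUE
```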

Sometimes, you may also want to create additional variables from the existing ones. For example, we know the overall quality of the house ranges from 2 to 10.

table(sahp$oa_qual)

Suppose we want to call a house good quality if oa_qual is larger than 5. We can then create a new logical variable as follows.

good_qual <- sahp$oa_qual > 5
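A derived variable like this is often stored as a new column and then used to select rows. The following is a minimal sketch on a made-up data frame toy; the column values are hypothetical.

```r
# Made-up data standing in for sahp (hypothetical values)
toy <- data.frame(
  oa_qual    = c(6, 7, 5, 4, 8),
  sale_price = c(130, 220, 109, 95, 310)
)

toy$good_qual <- toy$oa_qual > 5      # store the new logical variable as a column
table(toy$good_qual)                  # how many houses fall in each group
good_houses <- toy[toy$good_qual, ]   # keep only the good-quality houses
good_houses
```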

4.1.3 Recover Modified Values

When you are working with a data set inside a package, you may accidentally modify some values. In this situation, you don’t need to panic, as you can easily restore the data set to its factory setting (i.e., the original version inside the package). To do this, you just need to use the data() function with the data set name as its argument. Let’s try to modify one value of sahp and recover the data set afterward.

sahp[1,2]              #get the original value
#> # A tibble: 1 × 1
#>   bedroom
#>     <dbl>
#> 1       3
sahp[1,2] <- 5         #modify the value
sahp[1,2]              #verify the modified value
#> # A tibble: 1 × 1
#>   bedroom
#>     <dbl>
#> 1       5
data(sahp)             #recover the data
sahp[1,2]              #verify the value is recovered
#> # A tibble: 1 × 1
#>   bedroom
#>     <dbl>
#> 1       3
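The same pattern can be tried without r02pro by using mtcars, a data set that ships with base R in the datasets package. This sketch assumes it is run at the top level of an R session, where data() loads the object into the global environment.

```r
original <- mtcars[1, "mpg"]    # value in the packaged data set
mtcars[1, "mpg"] <- 99          # creates a modified copy in the workspace
modified <- mtcars[1, "mpg"]    # the modified value
data(mtcars)                    # reload the packaged version
recovered <- mtcars[1, "mpg"]   # value after recovery

identical(original, recovered)  # checks that the original value is back
```

Note that the assignment never changes the package itself. It only creates a modified copy in your workspace, which data() then overwrites with the packaged version.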