Chapter 6 Data Manipulation

Data analysis often requires various kinds of data manipulation. We will use the ahp data set in the r02pro package throughout this chapter. Let’s first look at the data set.

library(r02pro)
ahp

ahp is a data set of 2048 houses sold in Ames, Iowa, between 2006 and 2010, with 56 variables including the sale date and price. To learn more about each variable, you can look at its documentation.

?ahp

To view the entire data set, you can use the View() function, which opens the data set in a new viewer window.

View(ahp)

To get the first 6 rows of ahp, you can use the head() function. Its optional argument n lets you request a different number of rows from the top.

head(ahp)
head(ahp, n = 10) # the first 10 rows of ahp

The following are some possible questions we may want to explore.

  1. (pick observations by their values) Find the houses that were sold in January 2009.

You will learn how to filter observations in Section 6.1.
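As a rough preview, here is one way to express this kind of row filtering in base R. It is only a sketch on a small made-up data frame: the name houses and the columns id, yr_sold, mo_sold, and sale_price are hypothetical stand-ins, not necessarily the names used in ahp.

```r
# Toy stand-in for ahp; all names and values here are hypothetical.
houses <- data.frame(
  id         = 1:5,
  yr_sold    = c(2008, 2009, 2009, 2009, 2010),
  mo_sold    = c(1, 1, 6, 1, 1),
  sale_price = c(150000, 210000, 98000, 325000, 180000)
)

# Keep only the rows where both conditions hold (sold in January 2009).
jan_2009 <- houses[houses$yr_sold == 2009 & houses$mo_sold == 1, ]
jan_2009
```

If a column contains missing values, the comparison yields NA for those rows and this style of indexing returns rows filled with NA; wrapping the condition in which() is one way to avoid that.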

  1. (reorder the observations) Find the 10 houses with the highest sale prices.

You will learn how to reorder observations in Section 6.2.
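A minimal base R sketch of this idea, again on a made-up data frame (houses, id, and sale_price are hypothetical names chosen for illustration): sort the rows by price in decreasing order, then keep the top few.

```r
# Toy stand-in for ahp; all names and values here are hypothetical.
houses <- data.frame(
  id         = 1:5,
  sale_price = c(150000, 210000, 98000, 325000, 180000)
)

# order() returns the row indices that sort sale_price;
# decreasing = TRUE puts the most expensive house first.
by_price <- houses[order(houses$sale_price, decreasing = TRUE), ]

# head() keeps the top rows (n = 3 for this toy data;
# n = 10 would give the ten most expensive houses).
top3 <- head(by_price, n = 3)
top3
```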

  1. (pick variables by their names) The data set has 56 columns. For a particular data analysis question, we may want to focus on a subset of them.

You will learn how to select variables in Section 6.4.
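To make this concrete, the sketch below picks two columns by name using base R indexing. The data frame and its column names (liv_area, lot_area, sale_price) are hypothetical and used only for illustration.

```r
# Toy stand-in for ahp; all names and values here are hypothetical.
houses <- data.frame(
  id         = 1:3,
  liv_area   = c(1200, 1850, 960),
  lot_area   = c(8000, 10500, 7200),
  sale_price = c(150000, 210000, 98000)
)

# Index the columns by name to keep only the ones of interest.
price_area <- houses[, c("liv_area", "sale_price")]
names(price_area)
```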

  1. (create new variables as functions of existing ones) From the existing variables, we may want to create new ones, for instance, the sale price per unit of living area.

You will learn how to create new variables in Section 6.5.
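One possible base R sketch of such a derived variable is shown below. The column names liv_area, sale_price, and price_per_area are assumptions made for this example, as are the values.

```r
# Toy stand-in for ahp; all names and values here are hypothetical.
houses <- data.frame(
  liv_area   = c(1200, 2000, 1000),
  sale_price = c(150000, 210000, 98000)
)

# Element-wise division creates one new value per row.
houses$price_per_area <- houses$sale_price / houses$liv_area
houses
```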

  1. (create summary statistics) We may want to compute summary statistics for groups of observations. For example, what is the average sale price for each type of house?

You will learn how to group observations and create summary statistics for each group in Section 6.6.
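As a hedged illustration of a grouped summary, the base R function aggregate() can compute a mean within each group. The data frame, the grouping column house_style, and its values are made up for this sketch.

```r
# Toy stand-in for ahp; all names and values here are hypothetical.
houses <- data.frame(
  house_style = c("1Story", "2Story", "1Story", "2Story", "1Story"),
  sale_price  = c(150000, 210000, 90000, 330000, 180000)
)

# The formula reads "sale_price by house_style"; FUN is applied within each group.
avg_by_style <- aggregate(sale_price ~ house_style, data = houses, FUN = mean)
avg_by_style
```

The result has one row per group, with the group label in the first column and the group mean in the second.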