10.5 Hypothesis Testing
In this section, we introduce the basics of hypothesis testing in R. Hypothesis testing is a fundamental tool in statistics that allows us to make inferences about population parameters based on sample data.
10.5.1 The t-test
The t-test is used to test whether the mean of a population (or the difference between two population means) equals a specified value. R provides the t.test() function for this purpose.
10.5.1.1 One-sample t-test
Suppose we want to test whether the average sale price of houses in the sahp dataset is different from $200,000 (that is, 200, since sale_price is recorded in thousands of dollars).
library(r02pro)
t.test(sahp$sale_price, mu = 200)
#>
#> One Sample t-test
#>
#> data: sahp$sale_price
#> t = -3.0911, df = 163, p-value = 0.002346
#> alternative hypothesis: true mean is not equal to 200
#> 95 percent confidence interval:
#> 167.0439 192.7363
#> sample estimates:
#> mean of x
#> 179.8901
The output includes:
- t: the test statistic
- df: degrees of freedom
- p-value: the probability of observing a test statistic at least as extreme as the one computed, assuming the null hypothesis is true
- 95 percent confidence interval: a range of plausible values for the true mean
- sample estimates: the sample mean
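If the p-value is less than our significance level (commonly 0.05), we reject the null hypothesis. Here the p-value is 0.002346, so at the 0.05 level we reject the null hypothesis that the average sale price is $200,000.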
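To make the pieces of the output more concrete, the following sketch computes the test statistic, degrees of freedom, and two-sided p-value by hand and compares them with what t.test() returns. The price vector is made up for illustration and is not taken from sahp; the names price, mu0, t_stat, and p_value are likewise only for this example.

```r
# Hypothetical prices in thousands of dollars (made up for illustration,
# not taken from sahp)
price <- c(172, 215, 149, 188, 203, 161, 178, 196, 154, 183)
mu0 <- 200                      # hypothesized mean under the null

n <- length(price)
t_stat <- (mean(price) - mu0) / (sd(price) / sqrt(n))  # test statistic
df <- n - 1                                             # degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)                     # two-sided p-value

# t.test() returns a list, so the same quantities can be extracted by name
res <- t.test(price, mu = mu0)
c(by_hand = t_stat, t_test = unname(res$statistic))
c(by_hand = p_value, t_test = res$p.value)
res$conf.int                    # 95 percent confidence interval
p_value < 0.05                  # TRUE means reject the null at the 0.05 level
```

Extracting res$p.value or res$conf.int in this way is convenient when the result of a test needs to be reused in later code rather than read off the printed output.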
10.5.1.2 Two-sample t-test
We can also compare the means of two groups. Let’s test whether one-story and two-story houses have different average sale prices.
price_1story <- sahp$sale_price[sahp$house_style == "1Story"]
price_2story <- sahp$sale_price[sahp$house_style == "2Story"]
t.test(price_1story, price_2story)
#>
#> Welch Two Sample t-test
#>
#> data: price_1story and price_2story
#> t = -0.87927, df = 88.388, p-value = 0.3816
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -48.19646 18.62819
#> sample estimates:
#> mean of x mean of y
#> 183.0277 197.8118
By default, t.test() performs a Welch two-sample t-test, which does not assume equal variances. To assume equal variances, set var.equal = TRUE.
t.test(price_1story, price_2story, var.equal = TRUE)
#>
#> Two Sample t-test
#>
#> data: price_1story and price_2story
#> t = -0.91686, df = 128, p-value = 0.3609
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -46.68985 17.12157
#> sample estimates:
#> mean of x mean of y
#> 183.0277 197.8118
10.5.2 The Chi-Squared Test
The chi-squared test is used to test the association between two categorical variables. R provides chisq.test() for this purpose.
Let’s test whether kitchen quality (kit_qual) and house style (house_style) are independent in the sahp dataset.
# Create a contingency table
quality_table <- table(sahp$kit_qual, sahp$house_style)
quality_table
#>
#>             1.5Fin 1Story 2Story SFoyer SLvl
#>   Average       14     43     18      4    6
#>   Excellent      0      9      5      0    0
#>   Fair           2      5      2      0    0
#>   Good           5     24     25      1    2
# Perform chi-squared test
chisq.test(quality_table)
#>
#> Pearson's Chi-squared test
#>
#> data: quality_table
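#> X-squared = 15.492, df = 12, p-value = 0.2156
A small p-value suggests that the two variables are not independent. Here the p-value is 0.2156, so at the 0.05 level we do not reject the null hypothesis that kitchen quality and house style are independent.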
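Several cells in this table have small counts, and the chi-squared approximation is usually considered reliable only when the expected counts are not too small (a common rule of thumb is at least 5 per cell). As a rough check, the sketch below rebuilds the table from the counts printed above, inspects the expected counts stored in the test result, and computes a Monte Carlo p-value as one possible alternative. The seed and the number of replicates B are arbitrary choices for this example.

```r
# Rebuild the contingency table from the counts printed above
quality_table <- matrix(
  c(14, 43, 18, 4, 6,
     0,  9,  5, 0, 0,
     2,  5,  2, 0, 0,
     5, 24, 25, 1, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    kit_qual = c("Average", "Excellent", "Fair", "Good"),
    house_style = c("1.5Fin", "1Story", "2Story", "SFoyer", "SLvl")
  )
)

# R may warn here that the chi-squared approximation may be incorrect,
# because many expected counts are small; the warning is suppressed
# only to keep the example output short
res <- suppressWarnings(chisq.test(quality_table))

round(res$expected, 2)   # expected counts under independence
sum(res$expected < 5)    # number of cells with expected count below 5

# One alternative: a Monte Carlo p-value that does not rely on the
# chi-squared approximation (seed and B are arbitrary choices)
set.seed(1)
sim <- chisq.test(quality_table, simulate.p.value = TRUE, B = 2000)
sim$p.value
```

If the simulated p-value leads to the same decision as the approximate one, the conclusion is less likely to be an artifact of the sparse cells.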
10.5.3 The Correlation Test
To test whether the correlation between two numeric variables is significantly different from zero, use cor.test().
cor.test(sahp$liv_area, sahp$sale_price)
#>
#> Pearson's product-moment correlation
#>
#> data: sahp$liv_area and sahp$sale_price
#> t = 12.443, df = 162, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.6113017 0.7698383
#> sample estimates:
#> cor
#> 0.6990621
The output shows the estimated correlation, the test statistic, and the p-value.
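The numbers in this output are linked by simple formulas. As an illustration, the sketch below starts from the reported correlation and degrees of freedom and recovers the test statistic, and then the confidence interval using the Fisher z-transformation. The variable names are only for this example.

```r
r <- 0.6990621   # estimated correlation reported above
df <- 162        # degrees of freedom reported above (n - 2)
n <- df + 2      # number of complete pairs

# Test statistic for the null hypothesis that the true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
t_stat

# 95 percent confidence interval via the Fisher z-transformation
z <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci <- tanh(c(z - half_width, z + half_width))
ci
```

Up to rounding, t_stat and ci should agree with the t value and the confidence interval printed by cor.test().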
10.5.4 Interpreting Results
When interpreting hypothesis test results, keep these points in mind:
- A small p-value (typically < 0.05) provides evidence against the null hypothesis, but does not prove that the alternative is true.
- Statistical significance does not imply practical significance. A very large sample can make tiny differences statistically significant.
- The confidence interval provides more information than the p-value alone, as it shows the range of plausible values for the parameter.
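The second and third points can be illustrated with a small simulation. The sketch below uses simulated data, not any dataset from this section: the true mean is set to 0.01, a departure from the null value of 0 that would rarely matter in practice, and the sample size is very large. The seed is an arbitrary choice.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible

# Simulated data: the true mean is 0.01, a tiny departure from the null value 0
x <- rnorm(1e6, mean = 0.01, sd = 1)

res <- t.test(x, mu = 0)
res$p.value     # very small because the sample is huge
res$estimate    # yet the estimated mean is still close to 0
res$conf.int    # the interval shows how small the effect is
```

The p-value alone would suggest a strong result, while the confidence interval makes it clear that the estimated effect is only about one hundredth of a standard deviation.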
10.5.5 Exercises
1. Using the gm2004 dataset from the r02pro package, perform a one-sample t-test to test whether the average life_expectancy across all countries is different from 70. What do you conclude at the 0.05 significance level?
2. Using the gm2004 dataset, perform a two-sample t-test to compare the average cholesterol between "male" and "female" groups (the gender variable). Is there a statistically significant difference?
3. Create a contingency table of continent and gender from the gm2004 dataset. Perform a chi-squared test to determine whether continent and gender are independent.
4. Test whether the correlation between BMI and cholesterol in the gm2004 dataset is statistically significant. Report the estimated correlation and the p-value.
5. Using the sahp dataset, test whether houses with central air conditioning (central_air == "Y") have a significantly higher average sale_price than those without. Use a two-sample t-test with var.equal = FALSE.