The t-test and t-distribution are widely considered to be at the very foundation of modern statistical science, and they remain central to the everyday practice of statistics. Who would believe they were invented to make great beer better?
William Sealy Gosset is credited with inventing and applying the idea of the t-test to assist in scientific quality control while working for the Guinness brewery. The idea was initially described in a paper published anonymously under the name "Student", in order to protect the commercial interests of Guinness, and was later refined and championed by the great statistician R. A. Fisher. Today it is perhaps one of the most prevalent and basic tools in statistics, and its origin is a fascinating story.
Here, we will briefly look at the practical basis of the t-test before going on to look at different ways it can be applied to data.
1.1 Objectives
The question of the t-test
Data and assumptions
Graphing
Test and alternatives
Practice exercises
2 The question of the t-test
There are three common versions of the t-test.
The typical premise of the t-test is that it compares populations you are interested in, which you measure with independent samples. There are a few versions of the basic question.
2.1 2 independent samples
Here you have measured a numeric variable and have two samples. The question is: are the means of the two samples different (i.e., did the samples come from different populations)? A typical design here might be an experiment with a control group and one treatment group. Another, more general, design might be a sample taken under one defined condition compared to a sample taken under a different condition.
The data could be summarized in two different ways. The “long format” way would be to have a vector with the measured variable, and another vector that is a factor with 2 levels defining the two sample conditions (one numeric vector, one factor vector).
## 2 sample t-test long format data
density <- c(rep('high', 7), rep('low', 7))
height <- c(2,3,4,3,4,3,2,6,8,6,9,7,8,7)
(long.data <- data.frame(density, height))
density height
1 high 2
2 high 3
3 high 4
4 high 3
5 high 4
6 high 3
7 high 2
8 low 6
9 low 8
10 low 6
11 low 9
12 low 7
13 low 8
14 low 7
The “wide format” way would be to have a different numeric column for each of the samples (2 numeric vectors).
## 2 sample t-test wide format data
(wide.data <- data.frame(high.ht = c(2,3,4,3,4,3,2),
                         low.ht  = c(6,8,6,9,7,8,7)))
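The two arrangements are easy to convert between in base R; here is a minimal sketch using the long.data and wide.data objects created above:

# long to wide: unstack() splits the numeric vector by the factor
unstack(long.data, height ~ density)

# wide to long: stack() gathers the columns into values + ind
stack(wide.data)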
A typical graph representing data like this is a boxplot(). Optionally, to be maximally informative, one can add the raw data as points over the box summaries.
# Boxplot
boxplot(height ~ density, data = long.data,
        main = "2 independent samples")

# Optional: add raw data points
# jitter() nudges the x-axis placement so that the points do not overlap
set.seed(42)
points(x = jitter(rep(1:2, each = 7), amount = .2), # 7 points at x = 1, 7 at x = 2
       y = long.data$height,
       col = "red", pch = 16, cex = .8) # Mere vanity
2.2 Compare 1 sample to a known mean
Here you have one sample which you wish to compare to a known mean value. The basic question is: did the sample come from a population with that known mean?
The data are simply a single numeric vector, plus the population mean for comparison.
# The data
mysam <- c(2.06, 1.77, 1.9, 1.94, 1.91, 1.83, 2.08, 1.84, 2.15, 1.84,
           2.05, 2.19, 1.64, 1.81, 1.83)
boxplot(mysam, main = "Is your sample population different from the dashed line?")
points(x = jitter(rep(1, 15), amount = .1),
       y = mysam,
       col = "red", pch = 16, cex = .8) # Mere vanity
abline(h = 2.0,
       col = "blue", lty = 2, lwd = 2) # Mere vanity
2.3 Paired samples
This is a third kind of t-test question. Here the individual observations comprising the 2 samples are not independent. A typical example is the measurement of some variable before and after a treatment (e.g., crop yield measured in field plots before and after a soil treatment in successive years); another classic example is plots that are paired spatially (e.g., within each chosen plot, one treated and one untreated measurement is made).
For each of these examples there is a unit, patient, or plot identifier that represents the relationship between each pair of measures.
2.4 Plotting paired samples
# Biochar application, measure N before and after
# Data (the code is kind of ugly, but run it to "make" biochar)
biochar <- structure(list(
  plot = c("A", "B", "C", "D", "E", "F", "G", "H",
           "I", "J", "K", "L", "M", "N", "O"),
  N.first = c(13.4, 16.7, 17.9, 18.5, 18.6, 18.6, 18.7, 20.5,
              20.6, 21.5, 24.2, 24.5, 25, 27.1, 28.1),
  N.second = c(16, 16.7, 18.7, 18.7, 22.1, 22.7, 23.1, 23.1,
               23.2, 23.5, 25.4, 25.9, 27.6, 28, 29.7)),
  class = "data.frame", row.names = c(NA, -15L))
biochar
plot N.first N.second
1 A 13.4 16.0
2 B 16.7 16.7
3 C 17.9 18.7
4 D 18.5 18.7
5 E 18.6 22.1
6 F 18.6 22.7
7 G 18.7 23.1
8 H 20.5 23.1
9 I 20.6 23.2
10 J 21.5 23.5
11 K 24.2 25.4
12 L 24.5 25.9
13 M 25.0 27.6
14 N 27.1 28.0
15 O 28.1 29.7
# boxplot() would work, but hides the pairwise relationship
# Try this:
plot(x = jitter(c(rep(1, 15), rep(2, 15)), amount = .02),
     y = c(biochar$N.first, biochar$N.second),
     xaxt = "n", xlim = c(0.5, 2.5),
     cex = .8, col = "blue", pch = 16, # Mere vanity
     xlab = "Biochar treatment",
     ylab = "Soil N",
     main = "Do the lines tend to increase?")
mtext(side = 1, at = 1:2, text = c("before", "after"), line = 1)

# Get crazy: add lines connecting each pair of points
for(i in 1:15){
  lines(x = c(1.05, 1.95),
        y = c(biochar$N.first[i], biochar$N.second[i]),
        lty = 2, lwd = 1, col = "red") # Mere vanity
}
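A useful fact about paired designs: the paired t-test is mathematically equivalent to a one-sample t-test on the within-pair differences, with mu = 0. A minimal sketch using the biochar data above; both calls return identical t, df, and P values:

# Paired t-test on the before/after columns
t.test(x = biochar$N.first, y = biochar$N.second, paired = TRUE)

# Equivalent one-sample t-test on the within-pair differences
t.test(x = biochar$N.first - biochar$N.second, mu = 0)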
3 Data and assumptions
The principal assumptions of the t-test are:
Gaussian distribution of observations WITHIN each sample
Homoscedasticity (our old friend) - i.e., the variance is equal in each sample
Independence of observations
3.1 Evaluating and testing the assumptions
The t-test is thought to be somewhat robust to violation of its assumptions. For example, if the assumptions of Gaussian distribution and homoscedasticity are violated (a little), your results are not likely to be greatly biased. The assumption of independence of observations, however, is always of high importance.
If you are in doubt, rather than relying on personal subjective judgement based on your data analysis experience, it is always best to fully evaluate the evidence for whether the assumptions of parametric statistical tests have been violated, using formal testing.
3.2 Gaussian distribution of observations WITHIN each sample
We would typically first test the assumption of Gaussian distribution using graphical evaluation with, for example, a histogram (hist()) and a q-q plot (e.g., with qqnorm()), possibly along with a statistical test evaluating whether data are Gaussian (e.g., shapiro.test()).
NB 1: the assumption of Gaussian distribution does not apply to ALL OBSERVATIONS TOGETHER for both samples of a two sample t-test, but applies TO EACH SAMPLE SEPARATELY (this is sometimes confusing for beginners). The reason is that the very nature of the two sample t-test hypothesizes that the two samples come from DIFFERENT POPULATIONS, but that each population adheres to the Gaussian distribution.
NB 2: testing whether each sample is Gaussian is analogous to testing whether the residuals are Gaussian for regression.
The following code explores the ideas here further.
# First make some data as an example.
# This is height in cm data measured from 30 females and 30 males
# NB because we simulate this data, we literally specify that the
# samples are indeed from DIFFERENT, GAUSSIAN populations.
# Obviously in a real sampling situation you would not know this,
# hence we will test it formally.
set.seed(1)
height <- c(rnorm(30, 185, 10), rnorm(30, 150, 10))
sex <- rep(c('m', 'f'), each = 30)

# Package the variables we made into a data frame
# and remove the variables outside the data frame to
# keep our global environment tidy (not required, but satisfying!)
data <- data.frame(height, sex)
rm(height, sex)

# plot raw data
hist(x = data$height, main = 'Height ignoring sex',
     xlab = 'Height (cm)',
     freq = F)

# Draw the overall mean and the means
# for each level of sex
abline(v = c(mean(data$height),                   # mean overall
             mean(data$height[data$sex == 'f']),  # mean f
             mean(data$height[data$sex == 'm'])), # mean m
       lty = c(1, 2, 2), # line types for each
       lwd = 3,          # line width for all 3
       col = c("black", "red", "blue")) # colour for each

# Draw expected Gaussian density curve
mv <- data$height # generalizes following code
xcoord <- seq(min(mv), # make x coordinates
              max(mv),
              length = length(mv))
ycoord <- dnorm(x = xcoord, # make y coordinates
                mean = mean(mv),
                sd = sd(mv))
lines(x = xcoord, y = ycoord, # draw curve
      col = 'darkgreen', lwd = 3)

# for clarity here, add on some labels...
text(x = 155, y = .01, labels = "f\nmean", col = "red")
text(x = 182, y = .011, labels = "m\nmean", col = "blue")
text(x = 172.1, y = .015, labels = "grand\nmean")
text(x = 140, y = .012, labels = "Gaussian\ndensity", col = "darkgreen")
Note the gray bars of the histogram have two peaks and look terribly non-Gaussian; the data do not resemble the overlaid Gaussian density curve at all. To emphasize the point: we do not even expect these height measures to come from the same population (in fact our hypothesis is that they do not), therefore we do not expect the WHOLE vector of data to be Gaussian.
We can examine this further with a quantile-quantile ("q-q") plot, and formally with a statistical test of Gaussian distribution like the Shapiro-Wilk test.
qqnorm(data$height,
       main = "Q-Q Gaussian line for our Height data")
qqline(data$height)
Notice the q-q plot shows divergences from the Gaussian expected line at both ends and in the middle! Compare this to our histogram and expected density curve above.
The Shapiro Test will confirm that these data are not Gaussian.
shapiro.test(data$height)
Shapiro-Wilk normality test
data: data$height
W = 0.91918, p-value = 0.0007111
So testing formally, we find that our data are different to the expected Gaussian (W = 0.92, n = 60, p = 0.0007).
3.3 Properly testing Gaussian for the two sample t-test
To properly test the assumption of Gaussian for a 2 sample t-test, you would test the assumption separately for each group. Here is an example that uses graphs and the Shapiro test:
# step 1 graph a hist() of the continuous variable
# separately for each factor level
par(mfrow = c(2, 1)) # set so hists "stack"

# draw males hist
hist(data$height[data$sex == 'm'], # select males
     xlim = c(min(data$height), max(data$height)), # set limits from ALL data
     col = "blue", freq = F,
     xlab = "Height (cm)", main = "Males")
mv <- data$height[data$sex == 'm']
xcoord <- seq(min(mv), # make x coordinates
              max(mv),
              length = length(mv))
ycoord <- dnorm(x = xcoord, # make y coordinates
                mean = mean(mv),
                sd = sd(mv))
lines(x = xcoord, y = ycoord, # draw curve
      col = 'lightblue', lwd = 3)

# draw females hist
hist(data$height[data$sex == 'f'], # select females
     xlim = c(min(data$height), max(data$height)), # set limits from ALL data
     col = "red", freq = F,
     xlab = "Height (cm)", main = "Females")
mv <- data$height[data$sex == 'f']
xcoord <- seq(min(mv), # make x coordinates
              max(mv),
              length = length(mv))
ycoord <- dnorm(x = xcoord, # make y coordinates
                mean = mean(mv),
                sd = sd(mv))
lines(x = xcoord, y = ycoord, # draw curve
      col = 'pink', lwd = 3)
# set default layout back
par(mfrow = c(1, 1))
You can really see the difference between the sexes here, and each respective sample seems to conform relatively closely to Gaussian (allowing for sampling error). This is EXACTLY what we would expect for morphological data.
Now we can make q-q plots and formally test the distributions with the Shapiro Test.
First we graphically examine the distributions:
## QQ Plots
par(mfrow = c(1, 2))

# males
qqnorm(data$height[data$sex == 'm'],
       main = "Q-Q Gaussian line for male height")
qqline(data$height[data$sex == 'm'])

# females
qqnorm(data$height[data$sex == 'f'],
       main = "Q-Q Gaussian line for female height")
qqline(data$height[data$sex == 'f'])
par(mfrow = c(1, 1))
The q-q plots look much better for each sex separately - i.e., most of the points are near the expected lines. Compare this to the combined q-q plot above.
Second, we evaluate whether the distributions deviate from Gaussian with an objective test:
shapiro.test(data$height[data$sex == 'm'])
Shapiro-Wilk normality test
data: data$height[data$sex == "m"]
W = 0.95011, p-value = 0.1703
shapiro.test(data$height[data$sex == 'f'])
Shapiro-Wilk normality test
data: data$height[data$sex == "f"]
W = 0.98568, p-value = 0.9482
The Shapiro test for Gaussian showed that neither sample of height data for each respective sex deviates significantly from Gaussian (males: W = 0.95, n = 30, p = 0.17; females: W = 0.99, n = 30, p = 0.95).
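As noted in NB 2 above, a closely related approach is to fit a model with sex as the predictor and test the residuals, which are simply the observations centred on their own group means; this checks both samples in a single step. A minimal sketch using the data frame created above:

# Fit the two-group model; its residuals are the within-group deviations
m <- lm(height ~ sex, data = data)
shapiro.test(resid(m))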
If the Gaussian assumption cannot reasonably be met with the data, a common alternative that does not require it is the Mann-Whitney U-test, also known as the Wilcoxon test (which we will look at below).
3.4 Homoscedasticity assumption
We would typically examine the variance graphically or through calculation and comparison of descriptive statistics, although a formal test (e.g., using var.test()) is possible. However, for the 2 sample t-test, the Welch version of the test adjusts the degrees of freedom to account for unequal variances (NB this is the default setting in the base R t.test(), i.e., var.equal = FALSE). Thus, when the Welch t-test is used, it is not strictly necessary to test this assumption for the specific case of the 2 sample t-test (though the relative variances may still be interesting to be aware of as a feature of your data).
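For completeness, here is a minimal sketch of a formal variance-ratio (F) test, using the long.data object from section 2.1:

# F test comparing the two sample variances
# (null hypothesis: the ratio of the variances is 1)
var.test(height ~ density, data = long.data)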
3.5 Independence assumption
The assumption of independence of data is extremely important, and relates to making an INFERENCE about a POPULATION of interest via SAMPLING one or more populations of interest. If, for example, pairs of observations are not independent, an alternative test is appropriate, like the paired t-test.
4 Graphing
We have already examined the most common cases for data, data arrangement, and the types of question that fit the t-test. The principal graph types are simple:
2 independent samples - Boxplot or similar graph, showing the central tendency of the data and making it easy to visually compare the two samples.
1 sample - Again, boxplot or similar showing the variation of the single sample. Indicating the population mean as a reference is useful.
2 paired samples - A boxplot is second best to a graph that indicates the tendency for change between the paired observations (e.g., paired points joined by lines, as in the biochar example above).
5 Examples of the t-test and alternatives
We will cover four examples here looking at the t-test variants and alternatives when assumptions are not met:
t-test for 2 independent samples
1 sample t-test
2 paired samples
Mann-Whitney U-test
5.1 t-test for 2 independent samples
The example we will use here is the amount of tree growth over a period of time, where samples were taken of individual trees grown under two conditions - high density of trees versus low density. The hypothesis we are testing is whether there is evidence the samples came from different populations (by inference, we are possibly interested in whether there is an effect of density on growth). The data we will use (treegrowth) are similar to the long.data from above.
# The histograms are a little "wooly", but there are no huge
# deviations from the expectation of Gaussian and the
# q-q plots look ok: proceed

# ?t.test
# NB 1 - the x argument can be a formula
#   x = height ~ density
# or we can set our samples to x and y respectively
#   x = height[low], y = height[high]
t.test(formula = height ~ density, data = treegrowth)
Welch Two Sample t-test
data: height by density
t = -8.6257, df = 11.868, p-value = 1.865e-06
alternative hypothesis: true difference in means between group high and group low is not equal to 0
95 percent confidence interval:
-5.190611 -3.095103
sample estimates:
mean in group high mean in group low
3.428571 7.571429
# Flash challenge: Make a good graph of the data
There are a few points to note in the output.
The main test statistic is the "t value", which can be positive or negative (depending on which sample mean is larger); the larger the absolute value of t, the stronger the evidence that the samples came from different populations.
The degrees of freedom (df) need not be a whole number here: the Welch test adjusts df to account for differences in sample variance.
The 95% confidence interval is an estimate of the true difference in means between the populations from which the samples were taken.
The P-value is the probability of obtaining a t statistic at least as extreme as the one observed, if the samples in fact came from populations with the same mean.
Here, we might report our results in the technical style as follows (remember to ALWAYS report the test performed, the test statistic, the degrees of freedom or sample size, and the P-value):
We detected a significant difference between mean height for trees grown at high or low density (2-sample t-test: t = -8.63, df = 11.9, P < 0.0001).
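When reporting in the technical style, it can help to extract the numbers directly from the fitted test object rather than retyping them. A minimal sketch, shown here with the long.data object from section 2.1 (since the treegrowth data are not listed here):

res <- t.test(height ~ density, data = long.data)
res$statistic # t value
res$parameter # degrees of freedom (df)
res$p.value   # P-value
res$conf.int  # 95% confidence interval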
5.2 1 sample t-test
The example is a sample of earwigs measured in length to the nearest 0.1 mm. There is a hypothesis that global warming has impacted development in the population and the earwigs are getting larger. A long-term dataset has established a mean earwig length of 17.0 mm (NB, this is our estimate of "mu", \(\mu\), the population mean we will compare our sample to). Are the earwigs getting bigger?
# Try this:
# Data
earwigs <- c(22.1, 16.3, 19.1, 19.9, 19.2, 17.7, 22.5, 17.7, 24.1, 17.8,
             21.9, 24.9, 13.8, 17.2, 17.6, 19.9, 17.1, 10, 10.7, 22)

# Flash Challenge: Assess this data for adherence to the Gaussian assumption

mymu <- 17.0 # Our mu

# ?t.test # notice the mu argument
t.test(x = earwigs, mu = mymu)
One Sample t-test
data: earwigs
t = 1.7845, df = 19, p-value = 0.09031
alternative hypothesis: true mean is not equal to 17
95 percent confidence interval:
16.72775 20.42225
sample estimates:
mean of x
18.575
# Flash challenge: Make a great graph representing this test
We found no evidence that the mean length of earwigs in our sample was different from the historical mean (1-sample t-test: t = 1.78, df = 19, P = 0.09).
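One detail worth noting: our prediction was directional (the earwigs are getting larger), while t.test() defaults to a two-sided alternative. The alternative argument specifies a one-sided test; here is a minimal sketch (whether a one-sided test is justified should be decided before the data are collected):

# One-sided alternative: is the true mean GREATER than mu?
t.test(x = earwigs, mu = 17.0, alternative = "greater")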
5.3 2 paired samples
The example here is a measure of the hormone cortisol in pregnant cows (cortisol is related to stress in mammals). A measure was taken twice in each individual: once as a baseline measure, and once after a treatment of soothing music played to the cows for 3 hours per day. The prediction is that the mean level of cortisol will decrease relative to baseline after experiencing the music treatment.
# cort.t0 is the baseline measure, cort.t1 the measure after
# the music treatment (the data vectors are not shown here)
# ?t.test # notice the paired argument
t.test(x = cort.t0, y = cort.t1, paired = TRUE)
Paired t-test
data: cort.t0 and cort.t1
t = -3.7324, df = 19, p-value = 0.001412
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-0.31605728 -0.08894272
sample estimates:
mean difference
-0.2025
# Flash Challenge:
# 1) Make a great graph that represents these data
# 2) Do the data conform to the assumptions?
# 3) Was your hypothesis upheld...?
# 4) Format and report the results in the technical style!
5.4 Mann-Whitney U-test
The Mann-Whitney U-test (wilcox.test() in R, don't ask why…) is an alternative to the t-test when your data cannot meet the t-test assumptions. The t-test is robust, especially when sample size is large, or when the deviation from the assumptions is similar in both samples. However, when sample size is not very large (e.g., ~30 per sample or less), and there is skew or the samples are dissimilar, the Mann-Whitney U-test is a good choice. Two sample and one sample versions exist.
The example here is chicken egg counts for hens on a control diet and on a diet supplemented with bone meal (+bonemeal).
# NB qqPlot() is in the car package; diet and diet.bone are
# numeric vectors of egg counts (the data are not shown here)
library(car)

par(mfrow = c(1, 2))
qqPlot(diet, main = "Not Gaussian") # Divergence
[1] 8 1
qqPlot(diet.bone, main = "Diff. to diet?")
[1] 8 2
par(mfrow = c(1, 1))

# ?wilcox.test
wilcox.test(x = diet, y = diet.bone)
Warning in wilcox.test.default(x = diet, y = diet.bone): cannot compute exact
p-value with ties
Wilcoxon rank sum test with continuity correction
data: diet and diet.bone
W = 63.5, p-value = 0.0374
alternative hypothesis: true location shift is not equal to 0
Don’t forget to make a good graph of the data.
boxplot(diet, diet.bone, ylab = 'Egg count')
points(x = jitter(rep(1, 15), 4), y = diet, col = 'red', pch = 16)      # pure vanity
points(x = jitter(rep(2, 15), 2), y = diet.bone, col = 'red', pch = 16) # pure vanity
# Flash challenge: make this graph better!
We found evidence of a difference in the number of eggs laid under a control diet and a diet supplemented with bone meal (Mann-Whitney U-test: W = 63.5, n = [15, 15], P = 0.037).
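As noted above, a one-sample version also exists: the Wilcoxon signed rank test, an alternative to the 1-sample t-test when the Gaussian assumption fails. A minimal sketch, reusing the earwigs data from section 5.2:

# One-sample Wilcoxon signed rank test against the historical mean
wilcox.test(x = earwigs, mu = 17.0)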
6 Practice exercises
For the following exercises, use the dataset in the file 11-Davis.xlsx. The dataset is in tidy format; take advantage of this by looking at the terse information in the data dictionary tab. The data are the result of a survey of university students, who were asked to report their height and weight, after which their height and weight were measured. There will be some data handling as part of the exercises below, a practical and important part of every real data analysis.
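To get started, here is a minimal sketch for reading the data into R; it assumes the readxl package is installed and that the data sit on the first sheet (check the tab names, including the data dictionary, with excel_sheets()):

library(readxl)

excel_sheets("11-Davis.xlsx")           # list the workbook tabs
davis <- read_excel("11-Davis.xlsx", 1) # read the first (data) sheet
str(davis)                              # check variable types before analysis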
6.1
Pick the appropriate form of t-test to ask whether male reported and actual height are the same. Perform the test, make a great graph to illustrate, and report your results in the technical style. Show all required code.
6.2
Devise a similar test to the one in the previous question using the weight variables in females. Formally state your hypothesis, perform the test, make a great graph to illustrate, and report your results in the technical style. Show all required code.
6.3
Calculate the difference between reported and measured height for all study subjects and place the result in a new numeric vector. Use a t-test to discover whether the degree of discrepancy in reported height differs between males and females. Report your results in the technical style. Show all required code.
6.4
Devise a way to examine the question of whether taller people tend to self-report height similarly to shorter people. Discuss your approach and present any evidence, graphical or otherwise, to resolve your question.
6.5
The subjects in this dataset were students at the University of California Davis in the Psychology Department. The average height of adult men in California has been estimated as 176.5 cm. Test whether the males in our dataset differ from this value. State your conclusion and results, and briefly discuss an explanation for the pattern (i.e., the difference or lack of difference) that you observe. Comment on sampling assumptions when you do so.
6.6
Write a plausible practice question involving any aspect of data handling, graphing, or analysis within the t-test framework that asks a novel question about the Davis student height data.