As many of you know, how we analyze our data depends a lot on the format of our data (numbers, factors, etc.), the design of our experiment, and whether or not our data are normally distributed
we'll use some simulated data and some of the datasets built into R to work through this today
so what if we have data that is normally distributed? we can simulate normally distributed data with rnorm(), specifying its mean and standard deviation. We will set the seed so that we all get the same results
set.seed(1)
y<-rnorm(100, mean=10, sd=2)
if we tell it to list the data, what do we see?
y
## [1] 8.747092 10.367287 8.328743 13.190562 10.659016 8.359063 10.974858
## [8] 11.476649 11.151563 9.389223 13.023562 10.779686 8.757519 5.570600
## [15] 12.249862 9.910133 9.967619 11.887672 11.642442 11.187803 11.837955
## [22] 11.564273 10.149130 6.021297 11.239651 9.887743 9.688409 7.058495
## [29] 9.043700 10.835883 12.717359 9.794425 10.775343 9.892390 7.245881
## [36] 9.170011 9.211420 9.881373 12.200051 11.526351 9.670953 9.493277
## [43] 11.393927 11.113326 8.622489 8.585010 10.729164 11.537066 9.775308
## [50] 11.762215 10.796212 8.775947 10.682239 7.741274 12.866047 13.960800
## [57] 9.265557 7.911731 11.139439 9.729891 14.803236 9.921520 11.379479
## [64] 10.056004 8.513454 10.377585 6.390083 12.931110 10.306507 14.345223
## [71] 10.951019 8.580107 11.221453 8.131805 7.492733 10.582892 9.113416
## [78] 10.002211 10.148683 8.820958 8.862663 9.729643 12.356174 6.952866
## [85] 11.187892 10.665901 12.126200 9.391632 10.740038 10.534198 8.914960
## [92] 12.415736 12.320805 11.400427 13.173667 11.116973 7.446816 8.853469
## [99] 7.550775 9.053199
does it look normal to you? It should, given we pulled it from a normal distribution
hist(y)
what if we wanted to do the histogram in ggplot? We would need to load the ggplot2 package
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
p<-ggplot(mapping=aes(x=y))+geom_histogram()
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
how can we test for it? Maybe a Shapiro-Wilk test (shapiro.test)
shapiro.test(y)
##
## Shapiro-Wilk normality test
##
## data: y
## W = 0.9956, p-value = 0.9876
the p-value is not significant, so we have no evidence that the data deviate from normality (the test can't prove normality, but the data are consistent with it)
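another common visual check, besides the histogram, is a normal Q-Q plot; here is a minimal sketch using base R's qqnorm() and qqline()
# Q-Q plot of y against theoretical normal quantiles, with a reference line;
# points close to the line are consistent with a normal distribution
qqnorm(y)
qqline(y)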
so if we have two different distributions that are normal and numeric, how can we test if they’re different?
set.seed(5)
y1<-rnorm(100, mean=11, sd=2)
y1
## [1] 9.318289 13.768719 8.489016 11.140286 14.422882 9.794184 10.055667
## [8] 9.729257 10.428453 11.276216 13.455261 9.396441 8.839215 10.684931
## [15] 8.856480 10.722028 9.805374 6.632066 11.481635 10.481289 12.801024
## [22] 12.883739 13.935924 12.413522 12.638018 10.413036 13.837178 13.997548
## [29] 9.685836 9.294409 11.631830 13.219388 15.430921 13.434207 13.958444
## [36] 12.903148 8.980935 6.999055 7.475628 10.714784 14.100121 9.395154
## [43] 10.850842 14.791336 10.086862 12.124447 9.225983 10.079511 9.551343
## [50] 10.861578 13.926497 11.375452 13.044046 9.816330 10.775599 9.150094
## [57] 12.506610 10.774782 10.871818 11.466551 8.726834 12.709661 9.843259
## [64] 11.992723 9.479884 10.317227 6.795342 10.396595 8.455233 10.440668
## [71] 10.591805 10.548772 11.694057 11.064736 11.827063 10.689303 12.946971
## [78] 11.242180 11.378347 9.874230 11.996832 7.515395 12.951058 10.951834
## [85] 12.351369 9.579381 15.774465 10.053136 10.848455 9.956320 12.852094
## [92] 8.875178 12.114068 12.801461 12.979891 11.767216 10.306832 9.919621
## [99] 10.634889 10.881401
how do we visualize them together?
hist(y1, col="blue", breaks=20)
hist(y, col="red", breaks=20, add=T)
the data aren't in a very useful format right now for a ggplot histogram with two different colours, but we can get there by reshaping; see the sketch below
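as a hedged sketch of that reshaping: stack y and y1 into one long data frame (the names "both", "value" and "dist" here are just illustrative) and let ggplot fill by group
# stack the two vectors into one long data frame so ggplot can colour by group
both <- data.frame(value = c(y, y1),
                   dist  = rep(c("y", "y1"), each = 100))
ggplot(both, aes(x = value, fill = dist)) +
  geom_histogram(bins = 20, alpha = 0.5, position = "identity")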
What if we do a t-test to compare these two statistically?
t.test(y,y1)
##
## Welch Two Sample t-test
##
## data: y and y1
## t = -3.242, df = 197.49, p-value = 0.001393
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3597919 -0.3311987
## sample estimates:
## mean of x mean of y
## 10.21777 11.06327
we see a fairly basic output saying that the two means are significantly different, which is expected given the means we assigned and the sample size
what test would we do if they weren't normally distributed, though? maybe a Wilcoxon rank-sum test (wilcox.test)
wilcox.test(y,y1)
##
## Wilcoxon rank sum test with continuity correction
##
## data: y and y1
## W = 3814, p-value = 0.003772
## alternative hypothesis: true location shift is not equal to 0
we can see that it is still significant, but with a less extreme p-value
what if they are actually paired though?
t.test(y,y1, paired=TRUE)
##
## Paired t-test
##
## data: y and y1
## t = -3.2742, df = 99, p-value = 0.00146
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3578724 -0.3331182
## sample estimates:
## mean of the differences
## -0.8454953
we can see that we get a different p-value. What else is different? Our degrees of freedom: instead of 200 independent observations, we have 2 measurements on each of 100 things, so df = 99. One way to see why is sketched below
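the paired t-test is equivalent to a one-sample t-test on the 100 pairwise differences, which is where the 99 degrees of freedom come from
# one-sample t-test on the differences gives the same t, df and p-value
# as t.test(y, y1, paired=TRUE)
t.test(y - y1)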
what about the non-parametric alternative?
wilcox.test(y,y1, paired=TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: y and y1
## V = 1612, p-value = 0.001704
## alternative hypothesis: true location shift is not equal to 0
how do we simulate character/group data? This is more likely how your own data would be formatted. With the rep() function we can label groups with letters: if we use letters in the range 1:2, they come out as "a" and "b"
rep(letters[1:2])
## [1] "a" "b"
and if we use a different range, we get different letters
rep(letters[6:16])
## [1] "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p"
So we can set these and assign them. The length.out argument sets how many values in total should be generated, alternating "a" and "b".
group<-rep(letters[1:2], length.out=200)
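a quick optional sanity check: tabulate the vector to confirm we got 100 of each letter
# count how many of each label we generated
table(group)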
let's set our response data with two different peaks, at means of 10 and 11
set.seed(6)
response<-rnorm(n=100, mean=c(10, 11), sd=2)
let's put this into a data frame called test. Note that group has 200 values and response only 100, so data.frame() recycles the response vector to fill the 200 rows
test<-data.frame(group,response)
the a's should have lower values on average (mean 10 vs 11). Let's look at just the first few rows of the data, otherwise it will print all of it
head(test)
## group response
## 1 a 10.539212
## 2 b 9.740029
## 3 a 11.737320
## 4 b 14.454391
## 5 a 10.048375
## 6 b 11.736050
did this data come out the way we wanted? We can see how R classified each column by using the str() function to look at the structure of the data
str(test)
## 'data.frame': 200 obs. of 2 variables:
## $ group : Factor w/ 2 levels "a","b": 1 2 1 2 1 2 1 2 1 2 ...
## $ response: num 10.54 9.74 11.74 14.45 10.05 ...
what if we do a basic t-test by group? We use the standard formula notation: the response (dependent) variable first, then "~", then the independent (grouping) variable. This format is pretty standard throughout R
t.test(response~group, data=test)
##
## Welch Two Sample t-test
##
## data: response by group
## t = -2.1976, df = 197.55, p-value = 0.02914
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.21415080 -0.06567404
## sample estimates:
## mean in group a mean in group b
## 10.15963 10.79954
what about visualizing this? boxplot() takes the same formula format as the t-test
boxplot(response~group, data=test)
what about a ggplot version of this boxplot? we’ll even add a bit of colour
p2<-ggplot(test, aes(x=group, y=response))+geom_boxplot(fill="slateblue", alpha=0.5)
p2
we see a significant difference. any guesses what the non-parametric version would be?
wilcox.test(response~group, data=test)
##
## Wilcoxon rank sum test with continuity correction
##
## data: response by group
## W = 4176, p-value = 0.0442
## alternative hypothesis: true location shift is not equal to 0
what if we had more than 2 groups? what about an ANOVA? let's simulate groups and responses again, but with three groups
group.aov <- rep(letters[1:3], length.out=300)
response.aov<-rnorm(n=300, mean=c(10, 20, 30), sd=2)
visualize it
hist(response.aov)
hard to see, so let’s break it up some more
hist(response.aov, breaks=20)
put it into a data frame
test.aov<-data.frame(group.aov, response.aov)
what are the groups?
str(test.aov)
## 'data.frame': 300 obs. of 2 variables:
## $ group.aov : Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ...
## $ response.aov: num 8.12 20.19 30.13 8.02 19.86 ...
head(test.aov)
## group.aov response.aov
## 1 a 8.118466
## 2 b 20.187821
## 3 c 30.132507
## 4 a 8.018688
## 5 b 19.855770
## 6 c 29.553486
how about a histogram with more than one group in ggplot?
p2<-ggplot(test.aov, aes(x=response.aov, fill=group.aov))+geom_histogram()
p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
set up the anova
test.aov.fit<-aov(response.aov~group.aov, data=test.aov)
if we use the plot() function on the fitted model, we can see the standard diagnostic plots (residuals vs. fitted, normal Q-Q, etc.)
plot(test.aov.fit)
let's see the output. Printing the fitted object gives sums of squares and degrees of freedom, but no p-value
test.aov.fit
## Call:
## aov(formula = response.aov ~ group.aov, data = test.aov)
##
## Terms:
## group.aov Residuals
## Sum of Squares 18817.98 1114.08
## Deg. of Freedom 2 297
##
## Residual standard error: 1.936779
## Estimated effects may be unbalanced
let's use summary() to get the ANOVA table. We see it's highly significant, but that shouldn't be surprising given the means we told it to use
summary(test.aov.fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## group.aov 2 18818 9409 2508 <2e-16 ***
## Residuals 297 1114 4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
since the ANOVA shows a significant difference somewhere, but doesn't tell us which groups differ, we can do a post-hoc test, like Tukey's HSD
TukeyHSD(test.aov.fit)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = response.aov ~ group.aov, data = test.aov)
##
## $group.aov
## diff lwr upr p adj
## b-a 9.748184 9.103002 10.39337 0
## c-a 19.399908 18.754726 20.04509 0
## c-b 9.651724 9.006541 10.29691 0
the output shows us all of the pairwise comparisons, and we see that every group differs from every other. What if we use different numbers?
let's simulate groups and responses again with three groups, but this time two of the means are very close and the groups have different standard deviations. We'll look at them with a histogram too
group.aov2<-rep(letters[1:3], length.out=300)
response.aov2<-rnorm(n=300, mean=c(10, 10.5, 30), sd=c(2,6,1.5))
hist(response.aov2, breaks=20)
test.aov2<-data.frame(group.aov2, response.aov2)
anova time
test.aov.fit2<-aov(response.aov2~group.aov2, data=test.aov2)
summary(test.aov.fit2)
## Df Sum Sq Mean Sq F value Pr(>F)
## group.aov2 2 25880 12940 989.6 <2e-16 ***
## Residuals 297 3883 13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
still very significant overall, but what about the post-hoc comparisons?
TukeyHSD(test.aov.fit2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = response.aov2 ~ group.aov2, data = test.aov2)
##
## $group.aov2
## diff lwr upr p adj
## b-a 0.664413 -0.5401587 1.868985 0.3967147
## c-a 20.026745 18.8221736 21.231317 0.0000000
## c-b 19.362332 18.1577606 20.566904 0.0000000
we can see there is not a significant difference between a and b, which is not surprising given the means we gave them. What about a nonparametric alternative?
nonpara.aov<-kruskal.test(response.aov2~group.aov2, data=test.aov2)
nonpara.aov
##
## Kruskal-Wallis rank sum test
##
## data: response.aov2 by group.aov2
## Kruskal-Wallis chi-squared = 199.36, df = 2, p-value < 2.2e-16
there are post-hoc options for the Kruskal-Wallis test too; we won't explore them in depth right now (some involve other packages), but one base-R option is sketched below
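for example, pairwise.wilcox.test() in base R runs a Wilcoxon test on each pair of groups with a multiple-testing correction; a minimal sketch (the "BH" adjustment is just one possible choice)
# pairwise Wilcoxon rank-sum tests between the three groups,
# with Benjamini-Hochberg adjustment of the p-values
pairwise.wilcox.test(test.aov2$response.aov2, test.aov2$group.aov2,
                     p.adjust.method = "BH")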
what if we are interested in more than one response (dependent) variable? let's pull in the iris flower dataset
data("iris")
check out the data
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
we can see that there are a few different numerical traits (widths, lengths, etc.) and also species. So we can ask whether several dependent traits are affected by one independent variable: are sepal length and petal length significantly different between species? We can use a MANOVA to answer this question. In order to do this, we need to bind together the two dependent variables (sepal length and petal length) with cbind()
results.manova<-manova(cbind(iris$Sepal.Length, iris$Petal.Length)~iris$Species)
summary(results.manova)
## Df Pillai approx F num Df den Df Pr(>F)
## iris$Species 2 0.9885 71.829 4 294 < 2.2e-16 ***
## Residuals 147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
but what if we want to look at each trait separately?
summary.aov(results.manova)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## iris$Species 2 63.212 31.606 119.26 < 2.2e-16 ***
## Residuals 147 38.956 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## iris$Species 2 437.10 218.551 1180.2 < 2.2e-16 ***
## Residuals 147 27.22 0.185
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
we can see that both traits are significantly different by species
now that we've covered a basic response by group, what about continuous by continuous, i.e. regression/correlation-type data? Let's use other datasets that are already in R since they can be useful; let's call in the Motor Trend car data (mtcars)
data("mtcars")
let’s check it out
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
there's a lot to look through here, but let's keep it simple: what is directly impacting the fuel efficiency (mpg) of these cars? We're going to set up a linear model with lm() and use the same formula we've used before (dependent~independent). Let's start with just one variable, weight (wt)
carfit<-lm(mpg~wt, data=mtcars)
you can check out the plots of residuals if you like
plot(carfit)
what does the output look like?
summary(carfit)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
highly significant, with a strong R-squared value (about 75% of the variation in mpg is explained by weight!). What does it look like?
plot(mpg~wt, data=mtcars)
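if you want the fitted regression line on that base-R scatterplot, abline() will accept the lm object directly; a small sketch
# redraw the scatterplot and overlay the fitted line from carfit
plot(mpg~wt, data=mtcars)
abline(carfit)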
what if we include another variable? maybe the number of cylinders (cyl)?
carfit2<-lm(mpg~wt+cyl, data=mtcars)
we see that it is significant and has a higher R-squared value, but does that improvement matter?
summary(carfit2)
##
## Call:
## lm(formula = mpg ~ wt + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2893 -1.5512 -0.4684 1.5743 6.1004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6863 1.7150 23.141 < 2e-16 ***
## wt -3.1910 0.7569 -4.216 0.000222 ***
## cyl -1.5078 0.4147 -3.636 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared: 0.8302, Adjusted R-squared: 0.8185
## F-statistic: 70.91 on 2 and 29 DF, p-value: 6.809e-12
what model is better?
anova(carfit, carfit2)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + cyl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 278.32
## 2 29 191.17 1 87.15 13.22 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
this output shows that there is a significantly better fit with the more complex model (p=0.001064)
but what about an interaction effect with something like how many cylinders it has? if the car is heavier because of the engine and not something else, maybe the effect of weight changes with cylinder count? (note that wt*cyl by itself already expands to wt + cyl + wt:cyl, so the formula below spells out the same model explicitly)
carfit3<-lm(mpg~wt+cyl+wt*cyl, data=mtcars)
summary(carfit3)
##
## Call:
## lm(formula = mpg ~ wt + cyl + wt * cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2288 -1.3495 -0.5042 1.4647 5.2344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.3068 6.1275 8.863 1.29e-09 ***
## wt -8.6556 2.3201 -3.731 0.000861 ***
## cyl -3.8032 1.0050 -3.784 0.000747 ***
## wt:cyl 0.8084 0.3273 2.470 0.019882 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.368 on 28 degrees of freedom
## Multiple R-squared: 0.8606, Adjusted R-squared: 0.8457
## F-statistic: 57.62 on 3 and 28 DF, p-value: 4.231e-12
we see that weight is significant to the model, as is cylinder, BUT we see a significant interaction between cylinder and weight
what’s going on here? We know that as weight goes up, the mpg goes down
plot(mpg~wt, data=mtcars)
but we know as the cylinders go up, the mpg goes down too
plot(mpg~cyl, data=mtcars)
what happens to wt as cylinders go up?
plot(wt~cyl, data=mtcars)
ggplot is pretty easy to set up here too for scatter plots
p4<-ggplot(data=mtcars, aes(x=cyl, y=wt))+geom_point()
p4
if we look at our summary again, we can see that the estimate for the interaction is positive, whereas the estimates for wt and cyl individually are negative. This means that while wt and cyl each push mpg down, the effect of wt on mpg gets weaker as cyl increases: the two variables interact. A quick numeric check is sketched below
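as a sketch (b is just a throwaway name), the slope of mpg on wt at a given cylinder count is the wt coefficient plus the interaction coefficient times cyl
# effective slope of wt at 4, 6 and 8 cylinders: it shrinks toward zero
b <- coef(carfit3)
b["wt"] + b["wt:cyl"] * c(4, 6, 8)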
and if we compare the simpler wt+cyl model to our more complex model, the more complex model is significantly better
anova(carfit2, carfit3)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt + cyl
## Model 2: mpg ~ wt + cyl + wt * cyl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 191.17
## 2 28 156.98 1 34.196 6.0995 0.01988 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
you can plot these relationships with nicer graphs, and that is easy to do outside of base R (ggplot2, etc.); one possible version is sketched below
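as one hedged example, a ggplot scatterplot of mpg against weight, coloured by cylinder count, with a separate linear fit per group, shows the interaction visually (p5 is just an illustrative name)
# separate regression lines per cylinder group highlight the changing slope
p5 <- ggplot(mtcars, aes(x=wt, y=mpg, colour=factor(cyl))) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)
p5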
So what if we have categorical/character data? Maybe something from a survey, or phenotypes from genetic crosses? Suppose we want to know whether some combinations of hair and eye colour are more common than others. Let's load the dataset built into R called HairEyeColor
data("HairEyeColor")
let’s look at it
str(HairEyeColor)
## 'table' num [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
## - attr(*, "dimnames")=List of 3
## ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
## ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
## ..$ Sex : chr [1:2] "Male" "Female"
HairEyeColor
## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
it's in kind of a weird format, and for our question we are not worried about sex. We can use the margin.table() function to sum over sex and give us a marginal table of hair by eye colour
HairEyeNew<-margin.table(HairEyeColor, margin=c(1,2))
let’s see what it looks like
HairEyeNew
## Eye
## Hair Brown Blue Hazel Green
## Black 68 20 15 5
## Brown 119 84 54 29
## Red 26 17 14 14
## Blond 7 94 10 16
we can see it is all in a usable table now, but raw numbers are not what we are interested in
what kind of test would we use to see if some combinations occur more or less often than expected? a chi-square test is common, and we can run chisq.test() directly on the marginal table we saved
chisq.test(HairEyeNew)
##
## Pearson's Chi-squared test
##
## data: HairEyeNew
## X-squared = 138.29, df = 9, p-value < 2.2e-16
we see something is significant, but we also want to see what frequencies we have. to calculate the relative frequencies, we divide each count by the grand total
HairEyeNew/sum(HairEyeNew)
## Eye
## Hair Brown Blue Hazel Green
## Black 0.114864865 0.033783784 0.025337838 0.008445946
## Brown 0.201013514 0.141891892 0.091216216 0.048986486
## Red 0.043918919 0.028716216 0.023648649 0.023648649
## Blond 0.011824324 0.158783784 0.016891892 0.027027027
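if we also want to see which specific hair/eye combinations are over- or under-represented, one option (a sketch, with hec.chisq as an arbitrary name) is to save the chi-square test object and look at its standardised residuals: large positive values mean more than expected, large negative values mean fewer
# standardised (Pearson) residuals for each cell of the table
hec.chisq <- chisq.test(HairEyeNew)
hec.chisq$stdres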