Simulating some normal data

As many of you know, how we analyze our data depends a lot on the format of our data (numbers, factors, etc.), the design of our experiment, and whether or not our data are normally distributed

We’ll use some simulated data, plus some of the example datasets that come with R, to work through this today

So what if we have data that are normally distributed? We can simulate normally distributed data with rnorm(), giving it a mean and a standard deviation. We will set the seed so that we all get the same results

set.seed(1)
y<-rnorm(100, mean=10, sd=2)

If we print the data, what do we see?

y
##   [1]  8.747092 10.367287  8.328743 13.190562 10.659016  8.359063 10.974858
##   [8] 11.476649 11.151563  9.389223 13.023562 10.779686  8.757519  5.570600
##  [15] 12.249862  9.910133  9.967619 11.887672 11.642442 11.187803 11.837955
##  [22] 11.564273 10.149130  6.021297 11.239651  9.887743  9.688409  7.058495
##  [29]  9.043700 10.835883 12.717359  9.794425 10.775343  9.892390  7.245881
##  [36]  9.170011  9.211420  9.881373 12.200051 11.526351  9.670953  9.493277
##  [43] 11.393927 11.113326  8.622489  8.585010 10.729164 11.537066  9.775308
##  [50] 11.762215 10.796212  8.775947 10.682239  7.741274 12.866047 13.960800
##  [57]  9.265557  7.911731 11.139439  9.729891 14.803236  9.921520 11.379479
##  [64] 10.056004  8.513454 10.377585  6.390083 12.931110 10.306507 14.345223
##  [71] 10.951019  8.580107 11.221453  8.131805  7.492733 10.582892  9.113416
##  [78] 10.002211 10.148683  8.820958  8.862663  9.729643 12.356174  6.952866
##  [85] 11.187892 10.665901 12.126200  9.391632 10.740038 10.534198  8.914960
##  [92] 12.415736 12.320805 11.400427 13.173667 11.116973  7.446816  8.853469
##  [99]  7.550775  9.053199

Does it look normal to you? It should, given that we drew it from a normal distribution

hist(y)
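Another quick visual check (an optional extra, not in the original notes) is a Q-Q plot: if the points fall roughly along the line, that is consistent with normality.

qqnorm(y)   # sample quantiles against theoretical normal quantiles
qqline(y)   # reference line a normal sample should roughly follow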

What if we wanted to make the histogram in ggplot? We would need to load the ggplot library, which is called ggplot2

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
p<-ggplot(,aes(x=y))+geom_histogram()
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Testing for normality

How can we test for normality? Maybe a Shapiro-Wilk test (shapiro.test())

shapiro.test(y)
## 
##  Shapiro-Wilk normality test
## 
## data:  y
## W = 0.9956, p-value = 0.9876

The p-value is not significant, so we have no evidence against normality; we can treat the data as normal

Looking at two different distributions

So if we have two different distributions that are both normal and numeric, how can we test whether they’re different?

set.seed(5)
y1<-rnorm(100, mean=11, sd=2)
y1
##   [1]  9.318289 13.768719  8.489016 11.140286 14.422882  9.794184 10.055667
##   [8]  9.729257 10.428453 11.276216 13.455261  9.396441  8.839215 10.684931
##  [15]  8.856480 10.722028  9.805374  6.632066 11.481635 10.481289 12.801024
##  [22] 12.883739 13.935924 12.413522 12.638018 10.413036 13.837178 13.997548
##  [29]  9.685836  9.294409 11.631830 13.219388 15.430921 13.434207 13.958444
##  [36] 12.903148  8.980935  6.999055  7.475628 10.714784 14.100121  9.395154
##  [43] 10.850842 14.791336 10.086862 12.124447  9.225983 10.079511  9.551343
##  [50] 10.861578 13.926497 11.375452 13.044046  9.816330 10.775599  9.150094
##  [57] 12.506610 10.774782 10.871818 11.466551  8.726834 12.709661  9.843259
##  [64] 11.992723  9.479884 10.317227  6.795342 10.396595  8.455233 10.440668
##  [71] 10.591805 10.548772 11.694057 11.064736 11.827063 10.689303 12.946971
##  [78] 11.242180 11.378347  9.874230 11.996832  7.515395 12.951058 10.951834
##  [85] 12.351369  9.579381 15.774465 10.053136 10.848455  9.956320 12.852094
##  [92]  8.875178 12.114068 12.801461 12.979891 11.767216 10.306832  9.919621
##  [99] 10.634889 10.881401

How do we visualize them together?

hist(y1, col="blue", breaks=20)
hist(y, col="red", breaks=20, add=T)

The data isn’t in a very useful format right now for a two-colour ggplot histogram (ggplot wants one long data frame); we’ll come back to that later, but a quick sketch follows below
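Here is that sketch, if you want the overlaid ggplot version right away; the names df_long and dist are made up for this example, and the idea is simply to stack both vectors into one long data frame with a grouping column.

# stack both vectors into one long data frame with a label column
df_long <- data.frame(value = c(y, y1),
                      dist  = rep(c("y", "y1"), each = 100))
# overlapping, semi-transparent histograms coloured by which vector each value came from
ggplot(df_long, aes(x = value, fill = dist)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 20)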

T-test to compare

What if we do a t-test to compare these two statistically?

t.test(y,y1)
## 
##  Welch Two Sample t-test
## 
## data:  y and y1
## t = -3.242, df = 197.49, p-value = 0.001393
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3597919 -0.3311987
## sample estimates:
## mean of x mean of y 
##  10.21777  11.06327

We see a very basic output saying that these are significantly different, but that is as expected given the means we assigned them and the sample size

Non-parametric alternative to t-test

What test would we use if the data weren’t normally distributed, though? Maybe a Wilcoxon rank-sum test (wilcox.test())

wilcox.test(y,y1)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  y and y1
## W = 3814, p-value = 0.003772
## alternative hypothesis: true location shift is not equal to 0

We can see that it is still significant, though the p-value is not as low

Paired T-test

What if the observations are actually paired, though?

t.test(y,y1, paired=TRUE)
## 
##  Paired t-test
## 
## data:  y and y1
## t = -3.2742, df = 99, p-value = 0.00146
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3578724 -0.3331182
## sample estimates:
## mean of the differences 
##              -0.8454953

We can see that we get different p-values. What else is different? Our degrees of freedom: instead of 200 independent things, we have 2 measurements of the same 100 things
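One way to see where the 99 degrees of freedom come from (a small sketch, not part of the original output): a paired t-test is the same as a one-sample t-test on the 100 differences.

t.test(y - y1)   # same t, df = 99, and p-value as the paired test above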

Paired Wilcoxon test

What about the non-parametric alternative?

wilcox.test(y,y1, paired=TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  y and y1
## V = 1612, p-value = 0.001704
## alternative hypothesis: true location shift is not equal to 0

Simulating character/group data

How do we simulate character/group data? This is closer to how you would typically format your own data. With the rep() function we can label groups by letters: if we use letters from the range 1:2, they come up as “a” and “b”

rep(letters[1:2])
## [1] "a" "b"

And if we use a different range, we get different letters

rep(letters[6:16])
##  [1] "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p"

So we can set these and assign them to an object. The length.out argument sets how many values in total there should be across “a” and “b”.

group<-rep(letters[1:2], length.out=200)
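As a quick sanity check (optional), table() counts how many of each label we made.

table(group)   # should show 100 "a" and 100 "b"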

Let’s set up our response data with two different peaks, at means of 10 and 11 (rnorm() recycles the vector of means, so the simulated values alternate between the two)

set.seed(6)
response<-rnorm(n=100, mean=c(10, 11), sd=2)

Let’s put this into a data frame called test

test<-data.frame(group,response)

We can see that the a’s have somewhat lower values. Let’s look at just the first few rows of the data, otherwise it will print all of it. (Note that the 100 response values are recycled to fill the 200 rows when the data frame is built.)

head(test)
##   group  response
## 1     a 10.539212
## 2     b  9.740029
## 3     a 11.737320
## 4     b 14.454391
## 5     a 10.048375
## 6     b 11.736050

Did this data come out the way we wanted? We can see how R classified each column by using the str() function to look at the structure of the data

str(test)
## 'data.frame':    200 obs. of  2 variables:
##  $ group   : Factor w/ 2 levels "a","b": 1 2 1 2 1 2 1 2 1 2 ...
##  $ response: num  10.54 9.74 11.74 14.45 10.05 ...

T-test by group

What if we do a basic t-test by group? We use the basic formula notation: the response, or dependent, variable first, then “~”, then the independent variable. This format is pretty standard throughout R

t.test(response~group, data=test)
## 
##  Welch Two Sample t-test
## 
## data:  response by group
## t = -2.1976, df = 197.55, p-value = 0.02914
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.21415080 -0.06567404
## sample estimates:
## mean in group a mean in group b 
##        10.15963        10.79954

Making a boxplot

What about visualizing this? We use exactly the same formula format as for the t-test

boxplot(response~group, data=test)

What about a ggplot version of this boxplot? We’ll even add a bit of colour

p2<-ggplot(test, aes(x=group, y=response))+geom_boxplot(fill="slateblue", alpha=0.5)
p2

Wilcoxon test by group

We see a significant difference. Any guesses what the non-parametric version would be?

wilcox.test(response~group, data=test)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  response by group
## W = 4176, p-value = 0.0442
## alternative hypothesis: true location shift is not equal to 0

Simulating data for ANOVA

What if we had more than 2 groups? That calls for an ANOVA. Let’s simulate groups and responses again, but this time with three groups

group.aov <- rep(letters[1:3], length.out=300)

response.aov<-rnorm(n=300, mean=c(10, 20, 30), sd=2)

Let’s visualize it

hist(response.aov)

It’s hard to see much, so let’s break it up into more bins

hist(response.aov, breaks=20)

Now put it into a data frame

test.aov<-data.frame(group.aov, response.aov)

What are the groups?

str(test.aov)
## 'data.frame':    300 obs. of  2 variables:
##  $ group.aov   : Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ...
##  $ response.aov: num  8.12 20.19 30.13 8.02 19.86 ...
head(test.aov)
##   group.aov response.aov
## 1         a     8.118466
## 2         b    20.187821
## 3         c    30.132507
## 4         a     8.018688
## 5         b    19.855770
## 6         c    29.553486

Histograms with more than one group in ggplot

How about a histogram with more than one group in ggplot?

p2<-ggplot(test.aov, aes(x=response.aov, fill=group.aov))+geom_histogram()
p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ANOVA

Set up the ANOVA

test.aov.fit<-aov(response.aov~group.aov, data=test.aov)

If we use the plot() function on the fit, we can see the standard diagnostic plots (residuals, normality, etc.)

plot(test.aov.fit)
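A small optional tweak: plot() on a fitted model steps through its diagnostic plots one at a time, so it can help to show them together in a 2-by-2 grid.

par(mfrow = c(2, 2))   # 2 x 2 grid of plots
plot(test.aov.fit)     # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))   # reset to one plot per page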

Let’s look at the output. But we see no p-value or other test information?

test.aov.fit
## Call:
##    aov(formula = response.aov ~ group.aov, data = test.aov)
## 
## Terms:
##                 group.aov Residuals
## Sum of Squares   18817.98   1114.08
## Deg. of Freedom         2       297
## 
## Residual standard error: 1.936779
## Estimated effects may be unbalanced

Let’s use summary() to get a summary of the fit. We see it’s really significant, but that shouldn’t be surprising given the means we told it to use

summary(test.aov.fit)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## group.aov     2  18818    9409    2508 <2e-16 ***
## Residuals   297   1114       4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example of Tukey HSD Post hoc test

Since we see a significant difference in the ANOVA, but that difference isn’t necessarily between every pair of groups, we can do a post-hoc test, like Tukey’s HSD

TukeyHSD(test.aov.fit)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = response.aov ~ group.aov, data = test.aov)
## 
## $group.aov
##          diff       lwr      upr p adj
## b-a  9.748184  9.103002 10.39337     0
## c-a 19.399908 18.754726 20.04509     0
## c-b  9.651724  9.006541 10.29691     0

The TukeyHSD table shows us all of the pairwise comparisons, and we see that all of the groups are different from one another. What about if we use different numbers?

ANOVA and Post hoc with only some significant differences

Simulate groups and responses again with three groups, but this time two of the means are very close together and the groups have different standard deviations. We’ll look at them with a histogram too

group.aov2<-rep(letters[1:3], length.out=300)
response.aov2<-rnorm(n=300, mean=c(10, 10.5, 30), sd=c(2,6,1.5))
hist(response.aov2, breaks=20)

test.aov2<-data.frame(group.aov2, response.aov2)

ANOVA time

test.aov.fit2<-aov(response.aov2~group.aov2, data=test.aov2)
summary(test.aov.fit2)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## group.aov2    2  25880   12940   989.6 <2e-16 ***
## Residuals   297   3883      13                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Still very significant overall, but what about the post-hoc test?

TukeyHSD(test.aov.fit2)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = response.aov2 ~ group.aov2, data = test.aov2)
## 
## $group.aov2
##          diff        lwr       upr     p adj
## b-a  0.664413 -0.5401587  1.868985 0.3967147
## c-a 20.026745 18.8221736 21.231317 0.0000000
## c-b 19.362332 18.1577606 20.566904 0.0000000

Nonparametric alternative to ANOVA

We can see there is not a significant difference between a and b, which is not surprising given the means we gave them. What about a nonparametric alternative?

nonpara.aov<-kruskal.test(response.aov2~group.aov2, data=test.aov2)
nonpara.aov
## 
##  Kruskal-Wallis rank sum test
## 
## data:  response.aov2 by group.aov2
## Kruskal-Wallis chi-squared = 199.36, df = 2, p-value < 2.2e-16

There are post-hoc options for the Kruskal-Wallis test, but we will not explore them in depth right now (they may involve other packages)
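That said, if you want a quick non-parametric post-hoc without extra packages, one option (a sketch, not the only choice) is base R’s pairwise.wilcox.test(), which runs every pairwise Wilcoxon test with a multiple-testing correction.

# all pairwise Wilcoxon tests between groups, with Benjamini-Hochberg adjusted p-values
pairwise.wilcox.test(test.aov2$response.aov2, test.aov2$group.aov2,
                     p.adjust.method = "BH")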

Multiple response variables using available datasets in R

What if we are interested in more than one response (dependent) variable? Let’s pull in the iris flower dataset

data("iris")

Check out the data

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Using the iris dataset for a MANOVA

We can see that there are a few different numerical traits (widths, lengths, etc.) as well as species. So we can ask whether these dependent traits are affected by one independent variable: are sepal length and petal length significantly different between species? We can use a MANOVA to answer this question. To do this, we need to bind together the two dependent variables (sepal length and petal length) with cbind()

results.manova<-manova(cbind(iris$Sepal.Length, iris$Petal.Length)~iris$Species)
summary(results.manova)
##               Df Pillai approx F num Df den Df    Pr(>F)    
## iris$Species   2 0.9885   71.829      4    294 < 2.2e-16 ***
## Residuals    147                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

But what if we want to look at each trait separately?

summary.aov(results.manova)
##  Response 1 :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## iris$Species   2 63.212  31.606  119.26 < 2.2e-16 ***
## Residuals    147 38.956   0.265                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## iris$Species   2 437.10 218.551  1180.2 < 2.2e-16 ***
## Residuals    147  27.22   0.185                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that both traits are significantly different by species

Correlation data

Now that we’ve covered a basic response by group, what about continuous by continuous, i.e. correlation data? Let’s use another dataset that is already in R, since those can be useful: the Motor Trend car data

data("mtcars")

Let’s check it out

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Setting up linear models

There’s a lot to look through here, but let’s keep it simple: what is directly impacting the fuel efficiency (mpg) of these cars? We’re going to set up a linear model (lm()) and use the same formula we’ve used before (dependent ~ independent). Let’s start with just one variable, weight (wt)

carfit<-lm(mpg~wt, data=mtcars)

You can check out the plots of the residuals if you like

plot(carfit)

What does the output look like?

summary(carfit)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Highly significant, with a strong R-squared value (the % of variation in mpg explained by weight!). What does the relationship look like?

plot(mpg~wt, data=mtcars)
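If you want the fitted regression line on that scatterplot, abline() will take the lm object directly (a small optional addition).

plot(mpg~wt, data=mtcars)
abline(carfit)   # adds the fitted line from the lm() model above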

What if we include another variable? Maybe the number of cylinders (cyl)?

carfit2<-lm(mpg~wt+cyl, data=mtcars)

We see that it is significant and has a higher R-squared value, but does that difference matter?

summary(carfit2)
## 
## Call:
## lm(formula = mpg ~ wt + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2893 -1.5512 -0.4684  1.5743  6.1004 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
## wt           -3.1910     0.7569  -4.216 0.000222 ***
## cyl          -1.5078     0.4147  -3.636 0.001064 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared:  0.8302, Adjusted R-squared:  0.8185 
## F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12

Which model is better? We can compare the two nested models with anova()

anova(carfit, carfit2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + cyl
##   Res.Df    RSS Df Sum of Sq     F   Pr(>F)   
## 1     30 278.32                               
## 2     29 191.17  1     87.15 13.22 0.001064 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This output shows that the more complex model is a significantly better fit (p = 0.001064)
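Another common way to compare models, not covered in these notes, is AIC, where lower values indicate a better trade-off between fit and complexity.

AIC(carfit, carfit2)   # lower AIC = better; compare with the anova() result above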

Looking at interaction data

But what about an interaction effect with something like how many cylinders the car has? If the car is heavier because of the engine and not something else, maybe there’s an interaction?

carfit3<-lm(mpg~wt+cyl+wt*cyl, data=mtcars)
summary(carfit3)
## 
## Call:
## lm(formula = mpg ~ wt + cyl + wt * cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2288 -1.3495 -0.5042  1.4647  5.2344 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  54.3068     6.1275   8.863 1.29e-09 ***
## wt           -8.6556     2.3201  -3.731 0.000861 ***
## cyl          -3.8032     1.0050  -3.784 0.000747 ***
## wt:cyl        0.8084     0.3273   2.470 0.019882 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.368 on 28 degrees of freedom
## Multiple R-squared:  0.8606, Adjusted R-squared:  0.8457 
## F-statistic: 57.62 on 3 and 28 DF,  p-value: 4.231e-12

We see that weight is significant in the model, as is cylinders, BUT we also see a significant interaction between cylinders and weight

What’s going on here? We know that as weight goes up, the mpg goes down

plot(mpg~wt, data=mtcars)

but we also know that as the cylinders go up, the mpg goes down

plot(mpg~cyl, data=mtcars)

What happens to wt as the number of cylinders goes up?

plot(wt~cyl, data=mtcars)

ggplot for scatterplots

ggplot is pretty easy to set up for scatter plots here too

p4<-ggplot(data=mtcars, aes(x=cyl, y=wt))+geom_point()
p4

If we look at our summary again, we can see that the estimate for the interaction term is positive, whereas the estimates for wt and cyl individually are negative. This means that while wt and cyl each negatively impact mpg, as the number of cylinders increases, the (negative) effect of wt on mpg gets weaker. That is the interaction between these variables
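One way to visualize this interaction (a sketch; treating cyl as a factor here is a choice made for illustration) is to fit a separate wt-vs-mpg line for each cylinder count and watch how the slopes change.

# separate regression line for each number of cylinders
p5 <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
p5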

And if we compare the simpler wt + cyl model to our more complex model, the more complex model is significantly better

anova(carfit2, carfit3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ wt + cyl
## Model 2: mpg ~ wt + cyl + wt * cyl
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     29 191.17                              
## 2     28 156.98  1    34.196 6.0995 0.01988 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You can plot these models with nicer graphs, but that is something more easily done outside of base R graphics (ggplot2, etc.)

What about categorical data? Using marginal frequency tables

So what if we have categorical/character data? Maybe something from a survey, or phenotypes from genetic crosses? What if we want to know whether there is a more common combination of hair and eye colour? Let’s load the dataset available in R called HairEyeColor

data("HairEyeColor")

Let’s look at it

str(HairEyeColor)
##  'table' num [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
##   ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
##   ..$ Sex : chr [1:2] "Male" "Female"
HairEyeColor
## , , Sex = Male
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
## 
## , , Sex = Female
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8

It’s in kind of a weird format, and for our question we are not worried about sex. We can use the function margin.table() to collapse over sex and give us a marginal frequency table

HairEyeNew<-margin.table(HairEyeColor, margin=c(1,2))

Let’s see what it looks like

HairEyeNew
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    68   20    15     5
##   Brown   119   84    54    29
##   Red      26   17    14    14
##   Blond     7   94    10    16

We can see it is all in a usable table now, but raw counts are not quite what we are interested in

Chi-square test

What kind of test would we use to see whether some combinations occur more or less often than expected? A chi-square test is common. We can use chisq.test() on what we saved as the marginal frequency table

chisq.test(HairEyeNew)
## 
##  Pearson's Chi-squared test
## 
## data:  HairEyeNew
## X-squared = 138.29, df = 9, p-value < 2.2e-16

We see that something is significant, but we want to see what frequencies we have. To calculate frequencies, we need to divide the counts by the total

HairEyeNew/sum(HairEyeNew)
##        Eye
## Hair          Brown        Blue       Hazel       Green
##   Black 0.114864865 0.033783784 0.025337838 0.008445946
##   Brown 0.201013514 0.141891892 0.091216216 0.048986486
##   Red   0.043918919 0.028716216 0.023648649 0.023648649
##   Blond 0.011824324 0.158783784 0.016891892 0.027027027
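An equivalent shortcut (optional) is prop.table(), which does the same division; round() just makes the table easier to read.

round(prop.table(HairEyeNew), 3)   # same proportions as above, rounded to 3 decimals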