we’ll use the iris data set, since we have 4 continuous variables
data(iris)
Let’s see how the data looks
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
We can see that it is 4 numerical and 1 factor, species
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We’ll log transform the data to standardize the variables This is because skewness and magnitude of variables influence the PC’s. Log transformation takes care of this This is log transforming columns 1:4 (our numerical data), remember that [r,c] (R is cool). We’ll also call in the species data for later
log.ir<-log(iris[,1:4])
ir.species<-iris[,5]
We’ll run the PCA, but set center and scale. equal to TRUE in the call to prcomp. This is not default, but advisable
ir.pca<-prcomp(log.ir, center=TRUE, scale.=TRUE)
Let’s see the output We can see what the rotation data is for each PC for each variable
print(ir.pca)
## Standard deviations (1, .., p=4):
## [1] 1.7124583 0.9523797 0.3647029 0.1656840
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5038236 -0.45499872 0.7088547 0.19147575
## Sepal.Width -0.3023682 -0.88914419 -0.3311628 -0.09125405
## Petal.Length 0.5767881 -0.03378802 -0.2192793 -0.78618732
## Petal.Width 0.5674952 -0.03545628 -0.5829003 0.58044745
We can plot ot see how much of the variation is explained by each principal component, type is L. since it can be hard to read
plot(ir.pca, type="l")
most is explained by the first two PCs
summary gives us standard deviations, proportion of variance, and the cumulative variance
summary(ir.pca)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7125 0.9524 0.36470 0.16568
## Proportion of Variance 0.7331 0.2268 0.03325 0.00686
## Cumulative Proportion 0.7331 0.9599 0.99314 1.00000
first two PCs contributed most of it
If we want to plot it, we need to use the below. It isn’t super pretty, but we can see the separation by component the ir.pca$x[,1:2] indicates the first two columns, which are PC1 and PC2
plot(ir.pca$x[,1:2], pch=20, col=ir.species)
If we change this, we can look at other PC’s, like PC1 and 3
plot(ir.pca$x[,c(1,3)], pch=20, col=ir.species)
if you want to plot it nicely, you have to install a couple packages. This can be tricky, depending on version of R you have
library(devtools)
## Loading required package: usethis
install_github("vqv/ggbiplot")
## Skipping install of 'ggbiplot' from a github remote, the SHA1 (7325e880) has not changed since last install.
## Use `force = TRUE` to force installation
library(ggbiplot)
## Loading required package: ggplot2
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
Then we can use a variety of things within ggplot to see the nice plot
g <- ggbiplot(ir.pca, obs.scale = 1, var.scale = 1,
groups = ir.species, ellipse = TRUE,
circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal',
legend.position = 'top')
print(g)