we’ll use the iris data set, since we have 4 continuous variables

data(iris)

Let’s see how the data looks

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

We can see that it is 4 numerical and 1 factor, species

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

We’ll log transform the data to standardize the variables This is because skewness and magnitude of variables influence the PC’s. Log transformation takes care of this This is log transforming columns 1:4 (our numerical data), remember that [r,c] (R is cool). We’ll also call in the species data for later

log.ir<-log(iris[,1:4])
ir.species<-iris[,5]

We’ll run the PCA, but set center and scale. equal to TRUE in the call to prcomp. This is not default, but advisable

ir.pca<-prcomp(log.ir, center=TRUE, scale.=TRUE)

Let’s see the output We can see what the rotation data is for each PC for each variable

print(ir.pca)
## Standard deviations (1, .., p=4):
## [1] 1.7124583 0.9523797 0.3647029 0.1656840
## 
## Rotation (n x k) = (4 x 4):
##                     PC1         PC2        PC3         PC4
## Sepal.Length  0.5038236 -0.45499872  0.7088547  0.19147575
## Sepal.Width  -0.3023682 -0.88914419 -0.3311628 -0.09125405
## Petal.Length  0.5767881 -0.03378802 -0.2192793 -0.78618732
## Petal.Width   0.5674952 -0.03545628 -0.5829003  0.58044745

We can plot ot see how much of the variation is explained by each principal component, type is L. since it can be hard to read

plot(ir.pca, type="l")

most is explained by the first two PCs

summary gives us standard deviations, proportion of variance, and the cumulative variance

summary(ir.pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7125 0.9524 0.36470 0.16568
## Proportion of Variance 0.7331 0.2268 0.03325 0.00686
## Cumulative Proportion  0.7331 0.9599 0.99314 1.00000

first two PCs contributed most of it

If we want to plot it, we need to use the below. It isn’t super pretty, but we can see the separation by component the ir.pca$x[,1:2] indicates the first two columns, which are PC1 and PC2

plot(ir.pca$x[,1:2], pch=20, col=ir.species)

If we change this, we can look at other PC’s, like PC1 and 3

plot(ir.pca$x[,c(1,3)], pch=20, col=ir.species)

if you want to plot it nicely, you have to install a couple packages. This can be tricky, depending on version of R you have

library(devtools)
## Loading required package: usethis
install_github("vqv/ggbiplot")
## Skipping install of 'ggbiplot' from a github remote, the SHA1 (7325e880) has not changed since last install.
##   Use `force = TRUE` to force installation
library(ggbiplot)
## Loading required package: ggplot2
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid

Then we can use a variety of things within ggplot to see the nice plot

g <- ggbiplot(ir.pca, obs.scale = 1, var.scale = 1, 
              groups = ir.species, ellipse = TRUE, 
              circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal', 
               legend.position = 'top')
print(g)