R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

I am actually making this document in R. This is a functionality that I will probably not talk about much in this course, but it can be very helpful with documentation of your scripts and analyses.

I am able to type up notes and show you the scripts we used.

These scripts will “run”, so you can see both what was typed into R and the output it produced.

4-7-22

We went through the basics of R

Today we went through the basic functionality of R as a program. We discussed how R can be used as a calculator, how we can save objects, how it can read files in, and how it can give us basic summary statistics of our data. We can run a script by clicking “Run” in the top right-hand side of the script window, or by using ctrl+enter (cmd+enter on Mac). We can also type directly into the Console window. You can run specific parts of your code by highlighting them and then clicking “Run” (or pressing ctrl+enter).

Remember to annotate your code with ‘#’. R does not read anything after a ‘#’ on a line as code, so you can use it to take notes about what each step is doing. Remember you are your best and worst collaborator!

We discussed how R can be used as a calculator and perform basic functions

2+2
## [1] 4

R follows the order of operations

2+4*2
## [1] 10
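If we want the addition to happen first, we can wrap it in parentheses. A quick sketch of the difference:

```r
# Multiplication happens before addition
2 + 4 * 2      # 10

# Parentheses force the addition to happen first
(2 + 4) * 2    # 12
```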

And we can include exponents using the ^ symbol (shift 6 on your keyboard)

4^3
## [1] 64

We can even get square roots using the sqrt() function

sqrt(144)
## [1] 12

We can even use some basic functions, like “log”. We learned that log() is the natural log

log(100)
## [1] 4.60517

But that log10() is log base 10

log10(100)
## [1] 2
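As a side note that goes a little beyond what we covered, log() also accepts a “base” argument, so the two functions above are closely related. A small sketch:

```r
# Giving log() a base of 10 matches log10(100)
log(100, base = 10)   # 2

# Base 2 has its own shortcut as well
log2(8)               # 3

# exp() undoes the natural log
log(exp(1))           # 1
```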

R even has some built in constants, like pi!

pi
## [1] 3.141593

We learned how R works with an assignment operator

R is very object based. We assign “values” to objects using the “less than” symbol and a hyphen together: “<-”

We were also informed that there is a shortcut for this using alt + “-” (option + “-” on Mac)

We can assign the result of a calculation to an object. For example, if we assign “4*5” to the name “example”, the object will hold the value 20

example<-4*5
example
## [1] 20

This object can now be used in other ways (as a variable in an equation, for example)

example*5
## [1] 100

We have to be careful, though, as we can accidentally write over object values too

pi<-4129384213
pi
## [1] 4129384213
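The built-in pi is not destroyed when this happens; our new object just hides it. Here is a small sketch of one way to get the original back, using rm() to remove our own copy (rm() was not part of the session, so treat this as an extra):

```r
pi <- 4129384213   # our object now hides the built-in constant
pi

rm(pi)             # remove our object from the environment
pi                 # the built-in value (3.141593) is visible again
```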

Working in directories

Recall that we work in “directories” (a fancy word for “folder”)

In order to see what directory we are in, we can use the getwd() function

getwd()
## [1] "/Users/10961380/Desktop/R_practice"

If this is not where we want to work, we can change that with the setwd() function, where you’d include your directory path

Be sure it has quotation marks around it! Otherwise R won’t recognize it. You can also use the “three dots” in the file window on the bottom right of RStudio to find the folder you want to work in. Then you can click on the “More” button and click “Set As Working Directory”

Once we are in our directory, we can see what files we have available with the list.files() function

list.files()
## [1] "day 3 section2.R"                "deprecated"                     
## [3] "ex_plots"                        "First Week and a half of R.Rmd" 
## [5] "First-Week-and-a-half-of-R.html" "First-Week-and-a-half-of-R.Rmd" 
## [7] "test.csv"                        "test.txt"

or we can use the dir() function

dir()
## [1] "day 3 section2.R"                "deprecated"                     
## [3] "ex_plots"                        "First Week and a half of R.Rmd" 
## [5] "First-Week-and-a-half-of-R.html" "First-Week-and-a-half-of-R.Rmd" 
## [7] "test.csv"                        "test.txt"

Reading in a file

We can see in this list of files in our directory that we have a file called test.csv that we want to read in. In order to read in this file, we have to use the “read.csv()” function and assign the result to an object. We also have to be sure that the file name is in quotation marks

test.data<-read.csv("test.csv")
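If you do not have test.csv on hand, here is a small self-contained sketch of the same step. The object names (practice.file, practice.data) and the three rows are made up for illustration; the code writes a tiny csv to a temporary file and then reads it back in.

```r
# Make a small practice csv in a temporary location (hypothetical file)
practice.file <- tempfile(fileext = ".csv")
writeLines(c("Type,Number", "A,15", "A,16", "B,12"), practice.file)

# Read it in and assign it to an object, just like with test.csv
practice.data <- read.csv(practice.file)
practice.data
```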

When we read in data, we should always take some time to explore the data and see what it looks like

We can just look at the whole file by typing the object name

test.data
##    Type Number
## 1     A     15
## 2     A     16
## 3     A     18
## 4     B     12
## 5     D     21
## 6     C     25
## 7     D     20
## 8     C     23
## 9     B     11
## 10    B     13
## 11    D     21
## 12    C     24
## 13    A     17

This, however, is kind of unwieldy if we have a large datafile

We can use the head() function to see the first six rows of the datafile

head(test.data)
##   Type Number
## 1    A     15
## 2    A     16
## 3    A     18
## 4    B     12
## 5    D     21
## 6    C     25

or we can use the tail() function to look at the last six rows

tail(test.data)
##    Type Number
## 8     C     23
## 9     B     11
## 10    B     13
## 11    D     21
## 12    C     24
## 13    A     17

Sometimes we might only want the first few rows (not six). If we use the help function, ?head(), we can see what we can modify. We see that there is an argument, “n”, that sets the number of rows to print. If we only want 3 rows, we can use this script:

head(test.data, n=3)
##   Type Number
## 1    A     15
## 2    A     16
## 3    A     18
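The same “n” argument works for tail(), and nrow() and ncol() give a quick count of rows and columns. The sketch below rebuilds the practice data by hand under the hypothetical name demo.data (values copied from the printout above) so that it runs on its own.

```r
# Rebuild the practice data by hand so this runs on its own
demo.data <- data.frame(
  Type   = c("A", "A", "A", "B", "D", "C", "D", "C", "B", "B", "D", "C", "A"),
  Number = c(15, 16, 18, 12, 21, 25, 20, 23, 11, 13, 21, 24, 17)
)

tail(demo.data, n = 3)   # last 3 rows (rows 11, 12, and 13)
nrow(demo.data)          # 13 rows
ncol(demo.data)          # 2 columns
```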

We can look at just the “structure” of the data using the str() function. We can see that this file has 13 observations of two variables. One is “character” (chr) data, and one is integer (int) data

str(test.data)
## 'data.frame':    13 obs. of  2 variables:
##  $ Type  : chr  "A" "A" "A" "B" ...
##  $ Number: int  15 16 18 12 21 25 20 23 11 13 ...

R can use a “coordinates” system

Sometimes we may only want to look at one row of data. How might we do that? We can use “[,]” to tell R what to print. The part before the comma is rows, and the part after the comma is columns: [r,c]. An easy way to remember this is “R is Cool”. If we only wanted the 8th row we’d do

test.data[8,]
##   Type Number
## 8    C     23

We have a comma there, but when we leave the “column” indicator blank, it means that we want all values for columns

What if we wanted multiple rows? Say, rows 6 through 9. We could call each row individually

test.data[6,]
##   Type Number
## 6    C     25
test.data[7,]
##   Type Number
## 7    D     20
test.data[8,]
##   Type Number
## 8    C     23
test.data[9,]
##   Type Number
## 9    B     11

But that is not very convenient. We can actually use a colon to indicate we want a range of numbers

test.data[6:9,]
##   Type Number
## 6    C     25
## 7    D     20
## 8    C     23
## 9    B     11
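The same “[r,c]” idea extends a bit further than we covered. As a sketch (again using a hand-built copy of the data under the hypothetical name demo.data), c() picks rows that are not next to each other, and the column slot can take a position or a column name:

```r
# Rebuild the practice data by hand so this runs on its own
demo.data <- data.frame(
  Type   = c("A", "A", "A", "B", "D", "C", "D", "C", "B", "B", "D", "C", "A"),
  Number = c(15, 16, 18, 12, 21, 25, 20, 23, 11, 13, 21, 24, 17)
)

demo.data[c(1, 5, 13), ]    # rows 1, 5, and 13, all columns
demo.data[6:9, 2]           # rows 6 to 9, second column only: 25 20 23 11
demo.data[8, "Number"]      # row 8 of the column named "Number": 23
```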

Reading in a text file

We also had a .txt file (it’s a tsv, for tab separated values). It holds exactly the same data as the csv we loaded in, but has to be read in differently.

test.data.2<-read.table("test.txt")

If we look at the structure and head, it seems a bit odd

str(test.data.2)
## 'data.frame':    14 obs. of  2 variables:
##  $ V1: chr  "Type" "A" "A" "A" ...
##  $ V2: chr  "Number" "15" "16" "18" ...
head(test.data.2)
##     V1     V2
## 1 Type Number
## 2    A     15
## 3    A     16
## 4    A     18
## 5    B     12
## 6    D     21

This is because the “read.table()” function has a couple of different default settings than the “read.csv()” function. If we use the help function “?read.table()”, we can see that read.csv() has a default of header=TRUE, whereas read.table() has a default of header=FALSE. This tells R whether or not the first line of the file holds column names. So we have to modify our command to include these headers

test.data.2<-read.table("test.txt", header=TRUE)
str(test.data.2)
## 'data.frame':    13 obs. of  2 variables:
##  $ Type  : chr  "A" "A" "A" "B" ...
##  $ Number: int  15 16 18 12 21 25 20 23 11 13 ...

We can see that it now reads everything in ok!
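Another option, which we did not cover, is read.delim(): it is a version of read.table() whose defaults are already header=TRUE and tab-separated values. The sketch below uses a made-up three-row file written to a temporary location (practice.file is a hypothetical name) and compares the two approaches.

```r
# Write a tiny tab-separated practice file (hypothetical data)
practice.file <- tempfile(fileext = ".txt")
writeLines(c("Type\tNumber", "A\t15", "A\t16", "B\t12"), practice.file)

# read.table() needs to be told about the header (and, here, the tab separator)
from.table <- read.table(practice.file, header = TRUE, sep = "\t")

# read.delim() assumes both by default
from.delim <- read.delim(practice.file)

# Check that both approaches give the same data frame
all.equal(from.table, from.delim)
```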

4-12-22

Summary Statistics

Let’s go back to our original “test.data” file and work on calculating other summary information.

Calculating Mean

One of the most useful pieces of information we can get is the mean

mean(test.data)
## Warning in mean.default(test.data): argument is not numeric or logical:
## returning NA
## [1] NA

As you can see, this did not work too well. We have to indicate what we want the mean of. In this case, we can only get the mean of a numeric column, not of the whole data frame

We can specify the “Number” column using the bracket trick (test.data[,2]), or we can use the “dollar sign” symbol, as in

mean(test.data$Number)
## [1] 18.15385
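To show that the bracket trick and the dollar sign really do pick out the same column, here is a short sketch using a hand-built copy of the data (demo.data is a hypothetical name, with values copied from the printout earlier):

```r
# Rebuild the practice data by hand so this runs on its own
demo.data <- data.frame(
  Type   = c("A", "A", "A", "B", "D", "C", "D", "C", "B", "B", "D", "C", "A"),
  Number = c(15, 16, 18, 12, 21, 25, 20, 23, 11, 13, 21, 24, 17)
)

mean(demo.data$Number)        # dollar sign
mean(demo.data[, 2])          # bracket trick, column by position
mean(demo.data[, "Number"])   # bracket trick, column by name
```

All three lines return the same value, 18.15385.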

Calculating Median

We might also want the median. R is not very creative with its function names, so we have

median(test.data$Number)
## [1] 18

Calculating Min and Max

If we want min and max, we do the same thing

min(test.data$Number)
## [1] 11
max(test.data$Number)
## [1] 25
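If we want both at once, the range() function (not covered in class) returns the min and max together. A minimal sketch, with the Number column typed in by hand as a plain vector called numbers:

```r
# The Number column from the practice data, typed in by hand
numbers <- c(15, 16, 18, 12, 21, 25, 20, 23, 11, 13, 21, 24, 17)

range(numbers)         # min and max together: 11 25
diff(range(numbers))   # the spread between them: 14
```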

Using the summary() function

We can also use the summary function on the entire object

summary(test.data)
##      Type               Number     
##  Length:13          Min.   :11.00  
##  Class :character   1st Qu.:15.00  
##  Mode  :character   Median :18.00  
##                     Mean   :18.15  
##                     3rd Qu.:21.00  
##                     Max.   :25.00

Here we can get our min, median, mean, max, and quartile information. We can also get a summary of our “Type” column. It is character data, but if we switch it to a factor, we can get more info.

A “factor” is just a fancy way of saying we want R to treat it as a “group”. We have to be careful with this next line, as we are going to replace our data with new values. These values are the same values, but we are going to have R read them as a factor

test.data$Type<-as.factor(test.data$Type)
summary(test.data)
##  Type      Number     
##  A:4   Min.   :11.00  
##  B:3   1st Qu.:15.00  
##  C:3   Median :18.00  
##  D:3   Mean   :18.15  
##        3rd Qu.:21.00  
##        Max.   :25.00
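Once Type is a factor, we can also summarize Number within each group. The sketch below goes a step beyond what we did in class: it uses table() to count each group and tapply() to get a mean per group, on a hand-built copy of the data (demo.data is a hypothetical name).

```r
# Rebuild the practice data by hand so this runs on its own
demo.data <- data.frame(
  Type   = c("A", "A", "A", "B", "D", "C", "D", "C", "B", "B", "D", "C", "A"),
  Number = c(15, 16, 18, 12, 21, 25, 20, 23, 11, 13, 21, 24, 17)
)
demo.data$Type <- as.factor(demo.data$Type)

# How many rows fall in each group (A: 4, B: 3, C: 3, D: 3)
table(demo.data$Type)

# Mean of Number within each group (A: 16.5, B: 12, C: 24, D: about 20.67)
tapply(demo.data$Number, demo.data$Type, mean)
```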

Making Plots

Now that we have that information, we will want to add some visualization of this data.

Making a histogram for one continuous variable

We have one continuous variable (Number), so we can use a histogram to visualize this data

hist(test.data$Number)

This is a pretty nice plot, but it looks a little chunky, and the labels aren’t great. If we go into the ?hist help page, we can see modifications we can make, like breaks

hist(test.data$Number, breaks=15)

And labels

hist(test.data$Number, breaks=15, xlab="Number", ylab="Amount")

And we can add a title

hist(test.data$Number, breaks=15, xlab="Number", ylab="Amount", main="First Histogram")

But this grey color is kind of boring. What if we want it light blue?

hist(test.data$Number, breaks=15, xlab="Number", ylab="Amount", main="First Histogram", col="lightblue")
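One thing worth knowing about the breaks argument used above: when it is a single number, R treats it as a suggestion and may pick a slightly different number of bins so the edges land on “pretty” values. As a sketch, we can ask hist() to skip the drawing (plot=FALSE) and look at the bins it chose; the Number column is typed in by hand here as a vector called numbers.

```r
# The Number column from the practice data, typed in by hand
numbers <- c(15, 16, 18, 12, 21, 25, 20, 23, 11, 13, 21, 24, 17)

# plot = FALSE returns the histogram's numbers without drawing it
h <- hist(numbers, breaks = 15, plot = FALSE)

h$breaks        # where R actually put the bin edges
h$counts        # how many values fell in each bin
sum(h$counts)   # 13, since every value is counted exactly once
```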

Making a boxplot

We don’t just have a continuous variable, though; we also have a discrete variable (our Type). We can use a boxplot to visualize this dimension in the plot as well

In order to do a boxplot, we should cover the “equation” format (R calls this a “formula”)

If you think of the equation of a line, you probably say “y = mx+b”. Y is our response variable; what we get for Y depends on what we plug in to the right side of the equation.

This is how R does equations too, but it uses the little tilde (~, found above your Tab key) in place of the equals sign: y~x

So if we were to make a boxplot, the script would be like this

boxplot(test.data$Number~test.data$Type)

We can do all the same types of modifications in boxplots as we did with histograms

boxplot(test.data$Number~test.data$Type, ylab="Number", xlab="Type", col="lightblue", main="First Boxplot")

4-14-22

Using Iris data

On this day, we utilized the iris dataset, a dataset which is available in R. To make it visible in our global environment, I’m going to assign the “iris” data to an object called flower

flower<-iris

Summary Stats

We can then investigate the data here like we’ve done before

str(flower)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(flower)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
summary(flower)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

We can see we have 4 different measurement values (in cm): Sepal length, Sepal width, Petal length, and Petal width. We also have a value for “Species” (the species of iris that measurement belongs to)

Since “Species” is a factor, we can look at its “levels” (the name for each “group”)

levels(flower$Species)
## [1] "setosa"     "versicolor" "virginica"
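Two related functions, shown here as a small extra, tell us how many levels there are and how many rows belong to each one:

```r
flower <- iris

nlevels(flower$Species)   # 3 levels
table(flower$Species)     # 50 flowers of each species
```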

Making Histograms

Since these measurements are continuous variables, we can make histograms of each of them

hist(flower$Sepal.Length, col="pink", breaks=15,
     xlab="Length (in cm)", main="Histogram of Sepal Length")

hist(flower$Sepal.Width, col="pink", breaks=15,
     xlab="Width (in cm)", main="Histogram of Sepal Width")

hist(flower$Petal.Length, col="pink", breaks=25,
     xlab="Length (in cm)", main="Histogram of Petal Length")

hist(flower$Petal.Width, col="pink", breaks=25,
     xlab="Width (in cm)", main="Histogram of Petal Width")

Making Boxplots

We can see some evidence in some of the histograms for different peaks. We concluded that this may be due to differences between species, so we can make a boxplot to visualize those differences. Remember we have the equation of “y~x”.

I am showing another trick below too. We can just list the y and x by column name, with the argument data=flower to indicate that everything comes from this object

boxplot(Petal.Width~Species, data=flower, ylab="Width (cm)",
        main="Petal Width Iris", col="pink")

boxplot(Petal.Length~Species, data=flower, ylab="Length (cm)",
        main="Petal Length Iris", col="pink")

boxplot(Sepal.Width~Species, data=flower, ylab="Width (cm)",
        main="Sepal Width Iris", col="pink")

boxplot(Sepal.Length~Species, data=flower, ylab="Length (cm)",
        main="Sepal Length Iris", col="pink")

We can see that there are some differences by species, especially in petal length and width, but maybe not as much in sepals

Scatterplots

A question you might ask, though, is: Is there a relationship between petal length and width? Does one get bigger when the other gets bigger (a positive relationship)?

We can investigate this with a scatterplot (looking at two different continuous variables)

We use the same equation format as before, where the first variable is the “y” and the second is the “x”. Recall that the Y axis is our vertical axis and the X axis is our horizontal axis.

If we want to look at petal length and width we could do the below. We use the plot() function (it is the name for scatterplot, kind of strange, I know). I used the “pch” argument to change the shape of the points

plot(Petal.Length~Petal.Width, data=flower, xlab="Width (cm)",
     ylab="Length (cm)", main="Petal Information", col="purple",
     pch=16)

We see a positive relationship between these two traits

What about sepals?

plot(Sepal.Length~Sepal.Width, data=flower, xlab="Width (cm)",
     ylab="Length (cm)", main="Sepal Information", col="purple",
     pch=16)

We do not see a strong relationship here, positive or negative
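If we want a number to go with what we see in these two plots, one option (not covered in class, so treat it as an extra) is the cor() function, which gives the correlation between two numeric variables. Values near 1 or -1 mean a strong relationship, and values near 0 mean a weak one.

```r
flower <- iris

# Petals: close to 1, a strong positive relationship
cor(flower$Petal.Length, flower$Petal.Width)

# Sepals: close to 0, a weak relationship
cor(flower$Sepal.Length, flower$Sepal.Width)
```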

An additional dimension

Using ggplot

So, we have an additional element here: what about the effect of species? We saw that there were differences in petal length and width by species. Can we indicate which “points” represent each species on our plot? In order to do this, I’ll use the ggplot2 package. This will let us look at 2 continuous variables AND a discrete factor variable

You can install the package with this script

install.packages("ggplot2")

You only need to install a package once, but you always have to load it in when you want to use it

library(ggplot2)

You do not need to know how to use this script (or ggplot) for your exams, but you can use them if you’d like!

ggplot(data=flower, aes(x=Petal.Width, y=Petal.Length, colour=Species))+
  geom_point(size=4, aes(shape=Species))+theme_minimal()+xlab("Petal Width (cm)")+
  ylab("Petal Length (cm)")+ggtitle("Scatterplot")

We can see from this plot that the species do cluster in the scatterplot.
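If you would rather stay with the plot() function, a similar picture can be sketched without any extra packages. This is just one possible approach, and the colour choices and object names (species.cols, point.cols) are arbitrary: we make one colour per species, then use the factor to pick the colour for each point.

```r
flower <- iris

# One colour per species, in the same order as levels(flower$Species)
species.cols <- c("purple", "orange", "darkgreen")

# as.integer() turns the factor into 1, 2, 3, which picks a colour for each row
point.cols <- species.cols[as.integer(flower$Species)]

plot(Petal.Length ~ Petal.Width, data = flower, col = point.cols, pch = 16,
     xlab = "Petal Width (cm)", ylab = "Petal Length (cm)", main = "Scatterplot")

# Add a key so we know which colour is which species
legend("topleft", legend = levels(flower$Species), col = species.cols, pch = 16)
```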