This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
I am actually making this document in R. This is a functionality that I will probably not talk about much in this course, but it can be very helpful with documentation of your scripts and analyses.
I am able to type up notes and show you the scripts we used.
These scripts will “run”, so that you can see what was typed and run in R, AND see its output.
Today we went through the basic functionality of R as a program. We discussed how R can be used as a calculator, how we can save objects, how it can read files in, and how it can give us basic summary statistics of our data.
We can run scripts by clicking “Run” in the top right-hand side of the script window, or by using ctrl+enter (cmd+enter on Mac).
We can also type directly into the Console window.
You can run specific parts of your code by highlighting them and then clicking Run (or ctrl+enter).
2+2
## [1] 4
R follows the order of operations
2+4*2
## [1] 10
And we can include exponents using the ^ symbol (shift 6 on your keyboard)
4^3
## [1] 64
We can even get square roots using the sqrt() function
sqrt(144)
## [1] 12
We can even use some basic functions, like “log”. We learned that log() is the natural log
log(100)
## [1] 4.60517
But log10() is log base 10
log10(100)
## [1] 2
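If we ever need a different base, log() also accepts a base argument. Here is a small sketch (these particular examples were not typed in class, they are only an illustration):

```r
# log() uses base e by default, but takes an optional 'base' argument
log(100, base = 10)   # same value as log10(100)
log2(8)               # base 2 has its own shortcut function
exp(log(100))         # exp() undoes the natural log
```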
R even has some built in constants, like pi!
pi
## [1] 3.141593
R is very object based. We assign “values” to objects using the “less than” symbol and a hyphen together: “<-”
We were also informed that there is a shortcut for this in RStudio using alt + “-” (option + “-” on Mac)
We can assign the result of a calculation to a name. For example, if we assign “4*5” to the name “example”, the object will hold the value 20
example<-4*5
example
## [1] 20
This object can now be used in other ways (as a variable in an equation, for example)
example*5
## [1] 100
We have to be careful, though, as we can accidentally write over object values too
pi<-4129384213
pi
## [1] 4129384213
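If we overwrite a built-in value like this by accident, it is not lost for good. The sketch below shows one way to get it back (it was not part of the class script): our assignment only creates a copy in our own workspace that hides the built-in pi, so removing that copy with rm() restores the original.

```r
pi <- 4129384213   # our own object now hides the built-in constant
base::pi           # the built-in value is still reachable through the base package
rm(pi)             # delete our copy from the workspace
pi                 # the built-in constant is visible again
```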
Recall that we work in “directories” (a fancy word for “folder”)
In order to see what directory we are in, we can use the getwd() function
getwd()
## [1] "/Users/10961380/Desktop/R_practice"
If this is not where we want to work, we can change that with the setwd() function, where you’d include your directory path
Be sure it has quotation marks around it! Otherwise R won’t recognize it. You can also use the “three dots” in the file window on the bottom right of RStudio to find what folder you want to work in. Then you can click on the “more” button and click “set as working directory”
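Here is a minimal sketch of changing directories with setwd(). I use tempdir() only so that the example runs on any computer; in practice you would type your own folder path in quotation marks instead. The name old.dir is just something I made up for this example.

```r
old.dir <- getwd()   # remember where we started
setwd(tempdir())     # move to R's temporary folder (a stand-in for your own path)
getwd()              # confirm that we moved
setwd(old.dir)       # go back to where we were
```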
Once we are in our directory, we can see what files we have available with the list.files() function
list.files()
## [1] "day 3 section2.R" "deprecated"
## [3] "ex_plots" "First Week and a half of R.Rmd"
## [5] "First-Week-and-a-half-of-R.html" "First-Week-and-a-half-of-R.Rmd"
## [7] "test.csv" "test.txt"
or we can use the dir() function
dir()
## [1] "day 3 section2.R" "deprecated"
## [3] "ex_plots" "First Week and a half of R.Rmd"
## [5] "First-Week-and-a-half-of-R.html" "First-Week-and-a-half-of-R.Rmd"
## [7] "test.csv" "test.txt"
We can see in this list of files in our directory that we have a file called test.csv that we want to read in. In order to read in this file, we have to use the “read.csv()” function and assign the result to an object. We also have to be sure that the file name is in quotation marks
test.data<-read.csv("test.csv")
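If you do not have test.csv on your own computer, the sketch below may help you follow along. It rebuilds the same 13 rows that are printed in these notes, writes them to a temporary csv file, and reads that file back in with read.csv(). The names csv.path and from.csv are made up for this example, and the temporary file is an assumption that stands in for the real test.csv.

```r
# Rebuild the class data by hand (values copied from the printout in these notes)
test.data <- data.frame(
  Type   = c("A", "A", "A", "B", "D", "C", "D", "C", "B", "B", "D", "C", "A"),
  Number = c(15, 16, 18, 12, 21, 25, 20, 23, 11, 13, 21, 24, 17)
)

# Write it to a temporary csv file, then read it in the same way we read test.csv
csv.path <- tempfile(fileext = ".csv")
write.csv(test.data, csv.path, row.names = FALSE)
from.csv <- read.csv(csv.path)

dim(from.csv)   # number of rows, then number of columns
```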
When we read in data, we should always take some time to explore the data and see what it looks like
We can just look at the whole file by typing the object name
test.data
## Type Number
## 1 A 15
## 2 A 16
## 3 A 18
## 4 B 12
## 5 D 21
## 6 C 25
## 7 D 20
## 8 C 23
## 9 B 11
## 10 B 13
## 11 D 21
## 12 C 24
## 13 A 17
This, however, is kind of unwieldy if we have a large data file
We can use the head() function to see the first six rows of the datafile
head(test.data)
## Type Number
## 1 A 15
## 2 A 16
## 3 A 18
## 4 B 12
## 5 D 21
## 6 C 25
or we can use the tail() function to look at the last six rows
tail(test.data)
## Type Number
## 8 C 23
## 9 B 11
## 10 B 13
## 11 D 21
## 12 C 24
## 13 A 17
Sometimes we might only want the first few rows (not six). If we use the help function, ?head, we can see what we can modify. We see that there is an argument “n”, the number of rows to print. If we only want 3 rows, we can use this script:
head(test.data, n=3)
## Type Number
## 1 A 15
## 2 A 16
## 3 A 18
We can look at just the “structure” of the data using the str() function. We can see that this file has 13 observations of two variables. One is “character” (chr) data, and one is integer (int) data
str(test.data)
## 'data.frame': 13 obs. of 2 variables:
## $ Type : chr "A" "A" "A" "B" ...
## $ Number: int 15 16 18 12 21 25 20 23 11 13 ...
Sometimes we may only want to look at one row of data. How might we do that? We can use “[,]” to tell R what to print. The part before the comma is rows, and the part after the comma is columns: [r,c]. An easy way to remember this is “R is Cool”. If we only wanted the 8th row we’d do
test.data[8,]
## Type Number
## 8 C 23
We have a comma there, but when we leave the “column” indicator blank, it means that we want all values for columns
What if we wanted multiple rows? Say, rows 6 through 9. We could call each row individually
test.data[6,]
## Type Number
## 6 C 25
test.data[7,]
## Type Number
## 7 D 20
test.data[8,]
## Type Number
## 8 C 23
test.data[9,]
## Type Number
## 9 B 11
But that is not very convenient. We can actually use a colon to indicate we want a range of numbers
test.data[6:9,]
## Type Number
## 6 C 25
## 7 D 20
## 8 C 23
## 9 B 11
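The same [r,c] idea stretches a bit further than a range. The sketch below is an extra illustration that we did not type in class: c() picks rows that are not next to each other, a column name in quotation marks picks one column, and a comparison picks rows by their values. The data frame is rebuilt by hand from the printout above so the example runs on its own.

```r
test.data <- data.frame(
  Type   = c("A", "A", "A", "B", "D", "C", "D", "C", "B", "B", "D", "C", "A"),
  Number = c(15, 16, 18, 12, 21, 25, 20, 23, 11, 13, 21, 24, 17)
)

test.data[c(1, 5, 13), ]              # rows 1, 5, and 13, all columns
test.data[6:9, "Number"]              # rows 6 through 9, only the Number column
test.data[test.data$Type == "B", ]    # only the rows where Type is B
```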
We also had a .txt file (it’s a tsv, for tab separated values). It contains exactly the same data as the csv we loaded in, but it has to be read in differently.
test.data.2<-read.table("test.txt")
If we look at the structure and head, it seems a bit odd
str(test.data.2)
## 'data.frame': 14 obs. of 2 variables:
## $ V1: chr "Type" "A" "A" "A" ...
## $ V2: chr "Number" "15" "16" "18" ...
head(test.data.2)
## V1 V2
## 1 Type Number
## 2 A 15
## 3 A 16
## 4 A 18
## 5 B 12
## 6 D 21
This is because the “read.table()” function has a couple of different default settings than the “read.csv()” function. If we use the help function, ?read.table, we can see that read.csv() has a default of header=TRUE, whereas read.table() has a default of header=FALSE. This argument tells R whether the first line of the file holds column names. So we have to modify our command to include these headers
test.data.2<-read.table("test.txt", header=TRUE)
str(test.data.2)
## 'data.frame': 13 obs. of 2 variables:
## $ Type : chr "A" "A" "A" "B" ...
## $ Number: int 15 16 18 12 21 25 20 23 11 13 ...
We can see that it now reads everything in ok!
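To see the effect of the header argument without needing test.txt, here is a small sketch that writes a three-row tab separated file to a temporary location and reads it both ways. The file contents and the names tsv.path, no.header, with.header, and via.delim are made up for this example. I also include read.delim(), which is a relative of read.table() whose defaults are header=TRUE and a tab separator.

```r
# Write a tiny tab separated file (\t is a tab)
tsv.path <- tempfile(fileext = ".txt")
writeLines(c("Type\tNumber", "A\t15", "B\t12", "D\t21"), tsv.path)

no.header   <- read.table(tsv.path)                             # column names get read as data
with.header <- read.table(tsv.path, header = TRUE, sep = "\t")  # first line used as column names
via.delim   <- read.delim(tsv.path)                             # header = TRUE, sep = "\t" by default

str(no.header)
str(with.header)
```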
Let’s go back to our original “test.data” file and work on calculating other summary information
### Calculating Mean
One of the best pieces of information we can get is “means”
mean(test.data)
## Warning in mean.default(test.data): argument is not numeric or logical:
## returning NA
## [1] NA
As you can see, this did not work too well. mean() expects a set of numbers, not a whole data frame, so we have to indicate which column we want the mean of. In this case here, we can only get the mean of numerical things
We can specify the “Number” column using the bracket trick (test.data[,2]) or we can use the dollar sign ($) followed by the column name
mean(test.data$Number)
## [1] 18.15385
We might also want the median. R is not very creative with its function names, so we have
median(test.data$Number)
## [1] 18
If we want min and max, we do the same thing
min(test.data$Number)
## [1] 11
max(test.data$Number)
## [1] 25
We can also use the summary function on the entire object
summary(test.data)
##      Type               Number     
##  Length:13          Min.   :11.00  
##  Class :character   1st Qu.:15.00  
##  Mode  :character   Median :18.00  
##                     Mean   :18.15  
##                     3rd Qu.:21.00  
##                     Max.   :25.00
Here we can get our min, median, mean, max, and quartile information. We can also get a summary of our “Type” column. It is character data, but if we switch it to a factor, we can get more info.
A “factor” is just a fancy way of saying we want it as a “group”. We have to be careful with this next line, as we are going to replace our data with new values. These values are the same values, but we are going to have R read them as a factor
test.data$Type<-as.factor(test.data$Type)
summary(test.data)
##  Type      Number     
##  A:4   Min.   :11.00  
##  B:3   1st Qu.:15.00  
##  C:3   Median :18.00  
##  D:3   Mean   :18.15  
##        3rd Qu.:21.00  
##        Max.   :25.00
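Once Type is a factor, we can also summarize Number within each group. The sketch below goes a little beyond what we typed in class: table() counts the rows in each group, and tapply() applies a function (here mean) to Number separately for each Type. The data frame is rebuilt by hand so the example runs on its own, and the name group.means is made up.

```r
test.data <- data.frame(
  Type   = factor(c("A", "A", "A", "B", "D", "C", "D", "C", "B", "B", "D", "C", "A")),
  Number = c(15, 16, 18, 12, 21, 25, 20, 23, 11, 13, 21, 24, 17)
)

table(test.data$Type)   # how many rows belong to each group

# mean of Number, calculated separately for each level of Type
group.means <- tapply(test.data$Number, test.data$Type, mean)
group.means
```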
Now that we have that information, we will want to add some visualization of this data.
We have one continuous variable (Number), so we can use a histogram to visualize this data
hist(test.data$Number)
This is a pretty nice plot, but it looks a little chunky, and the labels aren’t great. If we go into the ?hist help page, we can see modifications we can make, like breaks
hist(test.data$Number, breaks=15)
And labels
hist(test.data$Number, breaks=15, xlab="Number", ylab="Amount")
And we can add a title
hist(test.data$Number, breaks=15, xlab="Number", ylab="Amount", main="First Histogram")
But this grey color is kind of boring, what if we want it light blue?
hist(test.data$Number, breaks=15, xlab="Number", ylab="Amount", main="First Histogram", col="lightblue")
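One thing worth knowing about breaks: when we give it a single number, R treats it as a suggestion and picks “pretty” cut points near that number, so we may not get exactly 15 bars. A small sketch (not from class) that checks what R actually chose, using plot=FALSE so that hist() returns its calculations instead of drawing. The vector number holds the Number column typed out by hand.

```r
number <- c(15, 16, 18, 12, 21, 25, 20, 23, 11, 13, 21, 24, 17)

h <- hist(number, breaks = 15, plot = FALSE)   # calculate, but do not draw
h$breaks   # the cut points R actually used
h$counts   # how many values fell into each bar
```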
We don’t just have a continuous variable, though; we also have a discrete variable (our Type). We can use a boxplot to visualize this dimension in the plot as well
In order to do a boxplot, we should cover the “equation” format
If you think of the equation of a line, you probably say “y = mx+b”. Y is our response variable; what we get for Y depends on what we plug into the right side of the equation.
This is how R does equations too, but it uses the little tilde (~, found above your Tab key) in place of the equals sign: y~x. R calls this a “formula”.
So if we were to make a boxplot, the script would be like this
boxplot(test.data$Number~test.data$Type)
We can do all the same types of modifications in boxplots as we did with histograms
boxplot(test.data$Number~test.data$Type, ylab="Number", xlab="Type", col="lightblue", main="First Boxplot")
On this day, we utilized the iris dataset, a dataset which is available in R. To make it visible in our global environment, I’m going to assign the “iris” data to an object called flower
flower<-iris
We can then investigate the data here like we’ve done before
str(flower)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(flower)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
summary(flower)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
We can see we have 4 different measurements (in cm): Sepal Length, Sepal Width, Petal Length, and Petal Width. We also have a value for “Species” (the species of iris that measurement belongs to)
Since “Species” is a factor, we can look at its “levels” (the name for each “group”)
levels(flower$Species)
## [1] "setosa" "versicolor" "virginica"
Since these measurements are continuous variables, we can make histograms of each of them
hist(flower$Sepal.Length, col="pink", breaks=15,
xlab="Length (in cm)", main="Histogram of Sepal Length")
hist(flower$Sepal.Width, col="pink", breaks=15,
xlab="Length (in cm)", main="Histogram of Sepal Width")
hist(flower$Petal.Length, col="pink", breaks=25,
xlab="Length (in cm)", main="Histogram of Petal Length")
hist(flower$Petal.Width, col="pink", breaks=25,
xlab="Width (in cm)", main="Histogram of Petal Width")
We can see some evidence in some of them for different peaks. We concluded that this may be due to differences between species, so we can make a boxplot to visualize those differences. Remember we have the equation format of “y~x”.
I am showing another trick below too. We can just list the y and x by column name, with the argument data=flower to indicate that everything comes from this object
boxplot(Petal.Width~Species, data=flower, ylab="Width (cm)",
main="Petal Width Iris", col="pink")
boxplot(Petal.Length~Species, data=flower, ylab="Length (cm)",
main="Petal Length Iris", col="pink")
boxplot(Sepal.Width~Species, data=flower, ylab="Width (cm)",
main="Sepal Width Iris", col="pink")
boxplot(Sepal.Length~Species, data=flower, ylab="Length (cm)",
main="Sepal Length Iris", col="pink")
We can see that there are some differences by species, especially in petal length and width, but maybe not as much in sepals
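To put numbers on what the boxplots show, one option is to calculate the mean of each measurement for each species. This is only a sketch and was not part of the class script: aggregate() uses the same “y~x” format, and cbind() lets us put more than one column on the left side. The name species.means is made up.

```r
flower <- iris

# mean of each measurement, calculated separately for each species
species.means <- aggregate(
  cbind(Petal.Length, Petal.Width, Sepal.Length, Sepal.Width) ~ Species,
  data = flower, FUN = mean
)
species.means
```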
A question you might ask, though, is: Is there a relationship between petal length and width? Does one get bigger when the other gets bigger (a positive relationship)?
We can investigate this with a scatterplot (looking at two different continuous variables)
We use the same equation format as before, where the first variable is the “y” and the second is the “x”. Recall that the Y axis is our vertical axis and the X axis is our horizontal axis.
If we want to look at petal length and width we could do the below. We use the plot() function (it is the name for scatterplot, kind of strange, I know). I used the “pch” argument to change the shape of the points
plot(Petal.Length~Petal.Width, data=flower, xlab="Width (cm)",
ylab="Length (cm)", main="Petal Information", col="purple",
pch=16)
We see a positive relationship between these two traits
What about sepals?
plot(Sepal.Length~Sepal.Width, data=flower, xlab="Width (cm)",
ylab="Length (cm)", main="Sepal Information", col="purple",
pch=16)
We do not see a strong relationship here, positive or negative
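If we want a number to go with what we see in these two scatterplots, one option (not something we covered in class) is the correlation coefficient from cor(). It runs from -1 to 1: values near 1 mean a strong positive relationship, and values near 0 mean little linear relationship. The names petal.cor and sepal.cor are made up for this sketch.

```r
flower <- iris

petal.cor <- cor(flower$Petal.Length, flower$Petal.Width)
sepal.cor <- cor(flower$Sepal.Length, flower$Sepal.Width)

petal.cor   # close to 1, matching the strong positive pattern in the petal plot
sepal.cor   # close to 0, matching the weak pattern in the sepal plot
```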
So, we have an additional element here: what about the effect of species? We saw that there were differences in petal length and width by species. Can we indicate which “points” represent each species on our plot? In order to do this, I’ll use the ggplot2 package.
This will let us look at 2 continuous variables AND a discrete factor variable. You can install the package with this script (you only need to install it once)
install.packages("ggplot2")
You always have to load a package in when you want to use it
library(ggplot2)
You do not need to know how to use this script (or ggplot) for your exams, but you can use them if you’d like!
ggplot(data=flower, aes(x=Petal.Width, y=Petal.Length, colour=Species))+
geom_point(size=4, aes(shape=Species))+theme_minimal()+xlab("Petal Width (cm)")+
ylab("Petal Length (cm)")+ggtitle("Scatterplot")
We can see from this plot that the species do cluster in the scatterplot
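If you would rather not install a package, a similar picture can be made with the plot() function we already know. This is only a sketch of one way to do it, not the way we did it in class. The colors and the names species.cols and point.cols are my own choices: a factor is stored as group numbers behind the scenes, so as.integer() turns each species into 1, 2, or 3, and we use that number to pick a color for each point.

```r
flower <- iris

species.cols <- c("red", "darkgreen", "blue")             # one color per species
point.cols   <- species.cols[as.integer(flower$Species)]  # one color per row

plot(Petal.Length ~ Petal.Width, data = flower, col = point.cols, pch = 16,
     xlab = "Petal Width (cm)", ylab = "Petal Length (cm)", main = "Scatterplot")

# add a key so we know which color is which species
legend("topleft", legend = levels(flower$Species), col = species.cols, pch = 16)
```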