Today we worked on subsetting our data

Please refer to the first R Markdown file for the basic notes. This one summarizes other material we have covered since then (and does not necessarily include everything we covered before this).

We will be working with the bird weight data from Canvas. The first step is to make sure you download that data and set your working directory to the location with the file.
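A quick way to confirm R is looking in the right place is sketched below. The path in the commented-out setwd line is a hypothetical placeholder, not a real location, so replace it with your own folder.

```r
# Where is R currently looking for files?
current_dir <- getwd()
current_dir

# Can R see the data file from there? (TRUE means yes, FALSE means no)
found <- file.exists("birdweight.csv")
found

# If it is FALSE, point R at the folder that holds the file.
# "path/to/your/folder" is a made-up placeholder:
# setwd("path/to/your/folder")
```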

First, we will read in the data as an object called “bird”.

bird<-read.csv("birdweight.csv")

You should always look at your data to see how it is formatted. If you run the summary function, you will see that it does not show the breakdown of area and morph.

summary(bird)
##      area              morph               weight     
##  Length:12000       Length:12000       Min.   :116.1  
##  Class :character   Class :character   1st Qu.:268.2  
##  Mode  :character   Mode  :character   Median :362.6  
##                                        Mean   :392.4  
##                                        3rd Qu.:503.9  
##                                        Max.   :722.2
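If you want counts before changing anything, one option is table(), which counts values even in a character column. This is only a sketch on a tiny made-up data frame (the name toy and its values are hypothetical, not the Canvas data).

```r
# Hypothetical stand-in for the bird data, with text columns kept as characters
toy <- data.frame(
  area   = c("North", "South", "East", "West", "North", "South"),
  morph  = c("Blue", "Red", "Red", "Blue", "Red", "Blue"),
  weight = c(310, 421, 129, 657, 345, 408),
  stringsAsFactors = FALSE
)

# Count how many rows fall in each area
area_counts <- table(toy$area)
area_counts

# Look at the first few rows to see how the data is laid out
head(toy, 3)
```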

If we look at the structure of the data, we will see that area and morph are stored as characters. We want them as factors, so let’s change that.

str(bird)
## 'data.frame':    12000 obs. of  3 variables:
##  $ area  : chr  "North" "South" "East" "West" ...
##  $ morph : chr  "Blue" "Red" "Red" "Blue" ...
##  $ weight: num  310 421 129 657 345 ...
bird$area<-as.factor(bird$area)
bird$morph<-as.factor(bird$morph)

Let’s check that it changed

str(bird)
## 'data.frame':    12000 obs. of  3 variables:
##  $ area  : Factor w/ 4 levels "East","North",..: 2 3 1 4 2 3 1 4 2 3 ...
##  $ morph : Factor w/ 2 levels "Blue","Red": 1 2 2 1 2 1 1 2 1 2 ...
##  $ weight: num  310 421 129 657 345 ...
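As an aside, the same conversion can be requested while the file is being read, using the stringsAsFactors argument of read.csv. The sketch below feeds read.csv a few made-up lines through its text argument (csv_text and toy are hypothetical names and values) so it runs without the Canvas file.

```r
# Hypothetical mini data set standing in for birdweight.csv
csv_text <- "area,morph,weight
North,Blue,310
South,Red,421
East,Red,129
West,Blue,657"

# stringsAsFactors = TRUE converts the text columns to factors on import
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)
str(toy)
```

With the real file, the equivalent call would be read.csv("birdweight.csv", stringsAsFactors = TRUE).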

The levels function

If you use the “levels” function, it will tell you what the levels (the distinct categories) of a factor are.

levels(bird$morph)
## [1] "Blue" "Red"
levels(bird$area)
## [1] "East"  "North" "South" "West"
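The numbers that str() printed next to each factor are positions within these levels. A minimal sketch, using a short made-up vector in the same order as the first rows of the data:

```r
# Made-up vector of areas, converted to a factor
area <- as.factor(c("North", "South", "East", "West", "North"))

levels(area)      # levels are sorted alphabetically, not in order of appearance
nlevels(area)     # how many levels there are
as.integer(area)  # position of each value within levels(area)
```

Here "North" is stored as 2 because it is the second level alphabetically, which is why str() showed the area column starting 2 3 1 4.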

If we run summary again, we can see that our data is evenly split among the areas as well as between the morphs.

summary(bird)
##     area       morph          weight     
##  East :3000   Blue:6000   Min.   :116.1  
##  North:3000   Red :6000   1st Qu.:268.2  
##  South:3000               Median :362.6  
##  West :3000               Mean   :392.4  
##                           3rd Qu.:503.9  
##                           Max.   :722.2
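summary() counts each column on its own. To count every area and morph combination at once, one option is to give table() two columns. A sketch on made-up data (toy is a hypothetical data frame with one bird per combination):

```r
# Hypothetical data: one row for each area/morph combination
toy <- data.frame(
  area  = as.factor(c("North", "North", "South", "South",
                      "East", "East", "West", "West")),
  morph = as.factor(c("Blue", "Red", "Blue", "Red",
                      "Blue", "Red", "Blue", "Red"))
)

# Rows of the result are areas, columns are morphs
counts <- table(toy$area, toy$morph)
counts
```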

Exploring our data visually

Let’s make a histogram of the weight.

hist(bird$weight, breaks=25, col="lightblue", main="Bird Weight",
     xlab="Bird Weight (g)", xlim=c(0,730))

We can always add a box around our plot by running box() right after the plot command, as below.

hist(bird$weight, breaks=25, col="lightblue", main="Bird Weight",
     xlab="Bird Weight (g)", xlim=c(0,730))
box()

We can see that the histogram has several separate “populations”, and we know that our data can be broken down by both morph and area. Let’s do a boxplot of the morphs.

boxplot(bird$weight~bird$morph, xlab="Color",
        ylab="Weight (g)", main="Weight by Morph", 
        col=c("lightblue", "lightpink"))

We see there is no clear difference in weight between the morphs.

However, we do see a difference by area!

boxplot(bird$weight~bird$area, xlab="Area",
        ylab="Weight (g)", main="Weight by Area")

Now what if I wanted to know the summary statistics of the weight for just the birds in the west? We can subset the data to do that.

Subsetting data

To do this, we will make an object called “west.bird” in which we put bird$weight only for rows which have the exact value of “West” for the area. That can be written as below. The double equals sign “==” means “exactly equal to”. (A single “=” assigns a value instead of comparing.)

west.bird<-bird$weight[bird$area == "West"]

We can then get our summary statistics using “summary()”.

summary(west.bird)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   651.1   657.1   682.4   684.6   712.1   722.2

We can even make a histogram of this data.

hist(west.bird)
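To see why the subsetting line works, it can help to look at the comparison on its own. The sketch below uses two short made-up vectors (area, weight and is_west are hypothetical stand-ins for the columns of bird).

```r
# Made-up values standing in for bird$area and bird$weight
area   <- c("North", "South", "East", "West", "West")
weight <- c(310, 421, 129, 657, 712)

# The comparison gives one TRUE or FALSE per row
is_west <- area == "West"
is_west

# Square brackets keep only the weights where is_west is TRUE
weight[is_west]

# TRUE counts as 1, so sum() gives the number of west rows
sum(is_west)
```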

We can do this subsetting for each population/area. Here’s an example for the north

north.bird<-bird$weight[bird$area == "North"]
summary(north.bird)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   299.4   311.8   333.8   329.0   346.1   352.8
hist(north.bird)
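Repeating the subset for every area works, but it gets long. One alternative that goes beyond what we did in class is tapply(), which applies a function to each level of a factor in one call. A sketch on made-up data (toy and area_means are hypothetical names):

```r
# Hypothetical data: two birds in each of two areas
toy <- data.frame(
  area   = as.factor(c("North", "North", "West", "West")),
  weight = c(310, 346, 657, 712)
)

# Mean weight within each area, one value per factor level
area_means <- tapply(toy$weight, toy$area, mean)
area_means

# Same number as subsetting by hand for a single area
north_mean <- mean(toy$weight[toy$area == "North"])
north_mean
```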

We see that each of these populations has two peaks in its histogram, suggesting that the weights may differ between the morphs. Remember that we have red and blue morphs. We can easily check this with a boxplot, because we can have more than one variable in the “x” portion of our plot.

Notice the “equation” (formula) below. We can say that weight is a function of morph AND area.

boxplot(bird$weight ~ bird$morph + bird$area, ylab="Weight (g)",
        xlab="Location and morph", col=c("lightblue", "pink"))
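The same formula can also be written with bare column names if the data frame is passed through the data argument of boxplot. A sketch on made-up data (toy and bp are hypothetical names, and the weights are invented):

```r
# Hypothetical data: two areas, two morphs, two birds per combination
toy <- data.frame(
  area   = as.factor(c("North", "North", "North", "North",
                       "West", "West", "West", "West")),
  morph  = as.factor(c("Blue", "Blue", "Red", "Red",
                       "Blue", "Blue", "Red", "Red")),
  weight = c(310, 312, 345, 346, 655, 657, 710, 712)
)

# data = toy tells boxplot where to find weight, morph and area
bp <- boxplot(weight ~ morph + area, data = toy,
              ylab = "Weight (g)", xlab = "Location and morph",
              col = c("lightblue", "pink"))

# One box is drawn per morph/area combination; these are their labels
bp$names
```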

Now what if we want to make a boxplot of just the birds of the west, with morph information? We already subset our weight by area, so we should subset our morph by area too. The line below will give us bird morph data for the rows with “West” as the area. It will have the same number of values (in the same order) as the west.bird object we have for bird weight.

west.bird.morph<-bird$morph[bird$area == "West"]

Now we can make a boxplot of the west weights by morph.

boxplot(west.bird~west.bird.morph, col=c("lightblue", "pink"),
        ylab="Weight (g)", xlab="Morph", main="West Population")
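Another way to keep weight and morph lined up is to subset whole rows of the data frame instead of one column at a time. This is a sketch on made-up data (toy and west are hypothetical names). Note the comma before the closing bracket, which means “keep all columns”.

```r
# Hypothetical stand-in for the bird data
toy <- data.frame(
  area   = as.factor(c("North", "West", "West", "South", "West")),
  morph  = as.factor(c("Blue", "Blue", "Red", "Red", "Blue")),
  weight = c(310, 657, 712, 421, 655)
)

# Rows where area is West, with every column kept
west <- toy[toy$area == "West", ]
west

# How many west rows did we get?
nrow(west)
```

Because each row still carries its own morph and weight, a call such as boxplot(weight ~ morph, data = west) can use the result directly.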

What if we want values for blue birds from the west? We can subset with more than one condition using the ampersand, & (shift+7), which means “and”.

west.blue<-bird$weight[bird$morph == "Blue" & bird$area=="West"]
str(west.blue)
##  num [1:1500] 657 658 662 658 655 ...
summary(west.blue)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   651.1   655.7   657.1   657.1   658.4   663.4
hist(west.blue)
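The ampersand compares the two conditions row by row, and a row is kept only when both are TRUE. A small sketch with made-up vectors (all names and values here are hypothetical):

```r
# Made-up values standing in for the columns of bird
morph  <- c("Blue", "Red", "Blue", "Red")
area   <- c("West", "West", "North", "North")
weight <- c(657, 712, 310, 345)

is_blue <- morph == "Blue"
is_west <- area == "West"

# TRUE only where both conditions are TRUE for the same row
both <- is_blue & is_west
both

# Only the blue bird from the west is kept
weight[both]
```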

We can also subset by excluding an option. If we wanted all the data for birds from every location except the west, we can do the below. The “!=” operator means “not equal to”.

not.west<-bird$weight[bird$area != "West"]
str(not.west)
##  num [1:9000] 310 421 129 345 408 ...
hist(not.west)

What if we want values for birds that are from neither the west nor the east? We can do that with subsetting as well, by joining two “not equal to” conditions with &.

not.east.notwest<-bird$weight[bird$area !="West" & bird$area !="East"]
str(not.east.notwest)
##  num [1:6000] 310 421 345 408 319 ...
hist(not.east.notwest)
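When the list of areas to exclude gets longer, chaining != conditions becomes awkward. One alternative is the %in% operator, which checks each value against a set, combined with ! for “not”. A sketch with made-up vectors (all names and values are hypothetical):

```r
# Made-up values standing in for bird$area and bird$weight
area   <- c("North", "South", "East", "West", "North")
weight <- c(310, 421, 129, 657, 345)

# The approach from the notes: two "not equal to" conditions
by_not_equal <- weight[area != "West" & area != "East"]

# The same selection using %in% and ! ("not")
by_in <- weight[!(area %in% c("West", "East"))]

# Both approaches pick the same values
identical(by_not_equal, by_in)
```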

Subsetting will be important for your exam 3 and your final project!

Please let me know if it doesn’t make sense!