Mining Data-BIOL 4300

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.

This script shows one way to mine a lot of data for many species you would have to change which gene you’re looking for. There are many ways to do this, so feel free to modify for efficiency, etc.

Load in packages from your library

#Loading in Libraries/packages####
library(rentrez)
library(XML)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(httr)

The below library is useful for reading in fasta files library(seqinr)

NCBI has a done of useful databases to search from, but it can be hard to remember them all, so they made a function which lists which databases they have available

entrez_dbs()

##  [1] "pubmed"          "protein"         "nuccore"         "ipg"            
##  [5] "nucleotide"      "structure"       "genome"          "annotinfo"      
##  [9] "assembly"        "bioproject"      "biosample"       "blastdbinfo"    
## [13] "books"           "cdd"             "clinvar"         "gap"            
## [17] "gapplus"         "grasp"           "dbvar"           "gene"           
## [21] "gds"             "geoprofiles"     "medgen"          "mesh"           
## [25] "nlmcatalog"      "omim"            "orgtrack"        "pmc"            
## [29] "proteinclusters" "pcassay"         "protfam"         "pccompound"     
## [33] "pcsubstance"     "seqannot"        "snp"             "sra"            
## [37] "taxonomy"        "biocollections"  "gtr"

There are a lot of key words we can specify. these can be shown by database using the below function

entrez_db_searchable(db="nuccore")

## Searchable fields for database 'nuccore'
##   ALL     All terms from all searchable fields 
##   UID     Unique number assigned to each sequence 
##   FILT    Limits the records 
##   WORD    Free text associated with record 
##   TITL    Words in definition line 
##   KYWD    Nonstandardized terms provided by submitter 
##   AUTH    Author(s) of publication 
##   JOUR    Journal abbreviation of publication 
##   VOL     Volume number of publication 
##   ISS     Issue number of publication 
##   PAGE    Page number(s) of publication 
##   ORGN    Scientific and common names of organism, and all higher levels of taxonomy 
##   ACCN    Accession number of sequence 
##   PACC    Does not include retired secondary accessions 
##   GENE    Name of gene associated with sequence 
##   PROT    Name of protein associated with sequence 
##   ECNO    EC number for enzyme or CAS registry number 
##   PDAT    Date sequence added to GenBank 
##   MDAT    Date of last update 
##   SUBS    CAS chemical name or MEDLINE Substance Name 
##   PROP    Classification by source qualifiers and molecule type 
##   SQID    String identifier for sequence 
##   GPRJ    BioProject 
##   SLEN    Length of sequence 
##   FKEY    Feature annotated on sequence 
##   PORG    Scientific and common names of primary organism, and all higher levels of taxonomy 
##   COMP    Component accessions for an assembly 
##   ASSM    Assembly 
##   DIV     Division 
##   STRN    Strain 
##   ISOL    Isolate 
##   CULT    Cultivar 
##   BRD     Breed 
##   BIOS    BioSample

entrez_db_searchable(db="taxonomy")

## Searchable fields for database 'taxonomy'
##   ALL     All terms from all searchable fields 
##   UID     Unique number assigned to publication 
##   FILT    Limits the records 
##   SCIN    Scientific name of organism 
##   COMN    Common name of organism 
##   TXSY    Synonym of organism name 
##   ALLN    All aliases for organism 
##   NXLV    Immediate parent in taxonomic hierarchy 
##   SBTR    Any parent node in taxonomic hierarchy 
##   LNGE    Lineage in taxonomic hierarchy 
##   GC      Nuclear genetic code 
##   MGC     Mitochondrial genetic code 
##   PGC     Plastid genetic code 
##   TXDV    GenBank division 
##   RANK    Hierarchical position (e.g., order, genus) 
##   EDAT    Date record first accessible through Entrez 
##   MDAT    Date of last update 
##   PROP    Property defined on particular node (e.g., terminal node) 
##   WORD    Free text associated with record 
##   NTOK    Name tokens associated with organism names 
##   HGC     Hydrogenosome genetic code

Doing a Search

We have to keep in mind that txid7227 is same as dmelanogaster If we want to do a search, we can use the “entrez_search” function, but we have to save the value of this search in an object, I’m writing the object with the name of “output”

We need to specify the database, and the “term”, which is what we’d put in our search bar. Our search has to be in quotation marks, as it is a string of characters (letters)

We also indicate how many records we want to return information for using the “retmax” argument. I set it as one for this. To specify that we want the term to correspond to “organisms” we put the searchable term “[ORGN]” after the organism inside of square brackets.

output<-entrez_search(db="nuccore", term="Drosophila melanogaster[ORGN] AND MITOCHONDRION[ALL]", retmax=1)

We can return the same result using the txid using the below line

output<-entrez_search(db=“nuccore”, term=“txid7227[ORGN] AND MITOCHONDRION[ALL]”, retmax=1)

This below function shows us how many search ids we returned. This should be one, as we set it retmax=1. However, we can also have it be zero, or NA, as not all searches will return values

length(output$ids)

## [1] 1

Getting Summaries

Doing a search just tells us if we can find something, but doesn’t actually tell us much about what we found. So we need to tell R to give us the summary of the result it found. We can use the “entrez_summary()” function to do this.

We just need to give it information on which ID we are searchin for. This value is stored in the “output” object we made above, inside the value for id

output3<-entrez_summary(db="nuccore", id=output$ids)

We can get informaiton such as accession number/verion, and confirm our ID number

output3$accessionversion

## [1] "PP764103.1"

output3$taxid

## [1] 7227

##Getting the sequence

If we want to write a sequence from our search, we need to tell R to “fetch” it we can use the entrez_fetch() function to do this, we just need to tell it we want it to return a fasta file by saying “rettype=fasta”. We tell it which sequence we want by providing the ID or the accession

output2<-entrez_fetch(db="nuccore", id=output$ids[1], rettype="fasta")

We can preview the sequence using the below line. It gives us the first 500 letters

cat(strwrap(substr(output2, 1, 500)), sep="\n")

## >PP764103.1 Drosophila melanogaster haplogroup B1 mitochondrion,
## complete genome
## AATGAATTGCCTGATAAAAAGGATTACCTTGATAGGGTAAATCATGCAGTTTTCTGCATTCATTGACTGA
## TTTATATATTATTTATAAAGATGATTTTATATTTAATAGAATTAAACTATTTCTAAAAGTATCAAAAACT
## TTTGTGCATCATACACCAAAATATATTTACAAAAAGATAAGCTAATTAAGCTACTGGGTTCATACCCCAT
## TTATAAAGGTTATAATCCTTTTCTTTTTAATTTTTAATAATTCGTCAAAAATTTTATTTATTACAATTAT
## AATTATTGGGACATTAATTACAGTTACATCTAATTCTTGGTTAGGAGCTTGAATAGGTTTAGAAATTAAT
## TTATTATCTTTTATCCCCCTATTAAGAGATAATAATAATTTAATATCTACAGAAGCTTCTTTAA

If we want to store our sequence to our working directory, we can use this function here write(mitosequence, file=“mitotestingoutput.fasta”)

Getting taxonomy information

We can search in the taxonomy database for information on our organism. If I do a search for drosophila, I find three IDs. You can then look through those ids to find which is the one you want. You won’t always have to do this

genusname<-entrez_search(db="taxonomy", term="Drosophila")
gennameid<-entrez_summary(db="taxonomy", id=genusname$ids)

this tells me “fruit flies”

gennameid$`7215`$commonname

## [1] "fruit flies"

this tells me it is at the rank of genus (we dont’ want subgenus)

gennameid$`7215`$rank

## [1] "genus"

Doing a loop to get a lot of data

By doing this search, we can return all of the taxa ids for things below the rank of Drosophila genus. This includes subgenera and species, and subspeces, etc We use the [SBTR] keyword. I told it to return a max of 2000, becuase I know there are less than 2000 ids. You would have to modify this for your system

taxout<-entrez_search(db="taxonomy", term="txid7215[SBTR]", retmax=2000)

#this tells us how many

length(unique(taxout$ids))

## [1] 1212

Here we’ll test our IDs by doing a summary of one of the ids. We can indicate which id by putting the number position of hte ID we want ot look for in brackets after the informaiton on where our IDs are stored

rec<-entrez_summary(db="taxonomy", id=taxout$ids[987])

This gives us the scientific name

rec$scientificname

## [1] "Drosophila setosifrons"

Now we’ll want to make an empty table (or matrix) to store our data in. We’ll want as many rows as we have taxa ids, so that’s why we used the formula below. We also decided on five columns

drosdat<-matrix(,nrow=length(unique(taxout$ids)), ncol=5)

We can give our columns names here

colnames(drosdat)<-c("taxid", "speciesname", "accnum", "seqname", "seqlen")

now we’ll also want to make an empty character string object to store our fasta sequences in

mitosequence<-character()

Our loop will run the below for every one of our taxa ids, replacing the number with the next number in the sequence as our variable “i”. I’ll put the code below, but it wont’ run, as I dont’ want ot crash this doc

for(i in 1:length(unique(taxout$ids))){
  #this reports which record we're on
  print(i)
  #this puts the id in the first column of the corresponding row
  drosdat[i,1]<-taxout$ids[i]
  #this gives us a taxonomy summary of the search
  record<-entrez_summary(db="taxonomy", id=taxout$ids[i])
  #this stores its species name. You might need/want to record other taxonomic information
  #such as family, tribe, etc.
  drosdat[i,2]<-record$scientificname
  #then we do a search to see if we have data.  below, the "paste" function in the argument
  ##will make it use the result of that function. In this case, it reports the taxaid appropriately
  seqout<-entrez_search(db="nuccore", term=paste("txid",taxout$ids[i],"[ORGN] AND MITOCHONDRION[ALL]", sep=""), retmax=1)
  #now we'll not always have results
  #this below "if" statment tells R what do do if we have no results of our search
  ##this can be indicated as "<1" result
  if(length(seqout$ids)<1){
    drosdat[i,3]<-NA
    drosdat[i,4]<-NA
    drosdat[i,5]<-NA
  }
  #now we can tell R what to do in every other situation
  else{
    #if we have an ID, let's get a summary of it
    outsum<-entrez_summary(db="nuccore", id=seqout$ids)
    #now that we have a summary, we'll get the accession number for the seqeuence
    drosdat[i,3]<-outsum$accessionversion
    #then we can get the title of the sequence
    drosdat[i,4]<-outsum$title
    #then its length
    drosdat[i,5]<-outsum$slen
    #now that we know we ahve a sequence, we can then ask R to get hte fasta version of it for us
    seqout<-entrez_fetch(db="nuccore", id=outsum$accessionversion, rettype="fasta")
    #this below line will replace the descriptor line of the fasta with just the species name
    ##this is useful for us later
    seqout<-sub(">([^\n]*)", paste0(">",outsum$organism), seqout)
    #we can then replace spaces with underscores, another useful formatting thing
    seqout<-gsub(" ", "_", seqout)
    #then we can put the sequence inside of the charcater string we made earlier
    ###this will simply append any new sequence to our object without deleting earlier ones
    ####this will make one long file for us
    mitosequence<-c(mitosequence, seqout)
    }
}

When the loop is finished, we can write the fasta file to our working director

write(mitosequence, file="mitotestingoutput.fasta")

We also dont’ want to forget to save the massive table we made too

write.csv(drosdat, "dros_mitochondrial_seq_info.csv")

If we load the ‘seqinr’ package, we can test that we can read in the file using hte below testing<-read.fasta(“mitotestingoutput.fasta”)