This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.
This script shows one way to mine a lot of data for many species you would have to change which gene you’re looking for. There are many ways to do this, so feel free to modify for efficiency, etc.
#Loading in Libraries/packages####
library(rentrez)
library(XML)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(httr)
The below library is useful for reading in fasta files library(seqinr)
NCBI has a done of useful databases to search from, but it can be hard to remember them all, so they made a function which lists which databases they have available
entrez_dbs()
## [1] "pubmed" "protein" "nuccore" "ipg"
## [5] "nucleotide" "structure" "genome" "annotinfo"
## [9] "assembly" "bioproject" "biosample" "blastdbinfo"
## [13] "books" "cdd" "clinvar" "gap"
## [17] "gapplus" "grasp" "dbvar" "gene"
## [21] "gds" "geoprofiles" "medgen" "mesh"
## [25] "nlmcatalog" "omim" "orgtrack" "pmc"
## [29] "proteinclusters" "pcassay" "protfam" "pccompound"
## [33] "pcsubstance" "seqannot" "snp" "sra"
## [37] "taxonomy" "biocollections" "gtr"
There are a lot of key words we can specify. these can be shown by database using the below function
entrez_db_searchable(db="nuccore")
## Searchable fields for database 'nuccore'
## ALL All terms from all searchable fields
## UID Unique number assigned to each sequence
## FILT Limits the records
## WORD Free text associated with record
## TITL Words in definition line
## KYWD Nonstandardized terms provided by submitter
## AUTH Author(s) of publication
## JOUR Journal abbreviation of publication
## VOL Volume number of publication
## ISS Issue number of publication
## PAGE Page number(s) of publication
## ORGN Scientific and common names of organism, and all higher levels of taxonomy
## ACCN Accession number of sequence
## PACC Does not include retired secondary accessions
## GENE Name of gene associated with sequence
## PROT Name of protein associated with sequence
## ECNO EC number for enzyme or CAS registry number
## PDAT Date sequence added to GenBank
## MDAT Date of last update
## SUBS CAS chemical name or MEDLINE Substance Name
## PROP Classification by source qualifiers and molecule type
## SQID String identifier for sequence
## GPRJ BioProject
## SLEN Length of sequence
## FKEY Feature annotated on sequence
## PORG Scientific and common names of primary organism, and all higher levels of taxonomy
## COMP Component accessions for an assembly
## ASSM Assembly
## DIV Division
## STRN Strain
## ISOL Isolate
## CULT Cultivar
## BRD Breed
## BIOS BioSample
entrez_db_searchable(db="taxonomy")
## Searchable fields for database 'taxonomy'
## ALL All terms from all searchable fields
## UID Unique number assigned to publication
## FILT Limits the records
## SCIN Scientific name of organism
## COMN Common name of organism
## TXSY Synonym of organism name
## ALLN All aliases for organism
## NXLV Immediate parent in taxonomic hierarchy
## SBTR Any parent node in taxonomic hierarchy
## LNGE Lineage in taxonomic hierarchy
## GC Nuclear genetic code
## MGC Mitochondrial genetic code
## PGC Plastid genetic code
## TXDV GenBank division
## RANK Hierarchical position (e.g., order, genus)
## EDAT Date record first accessible through Entrez
## MDAT Date of last update
## PROP Property defined on particular node (e.g., terminal node)
## WORD Free text associated with record
## NTOK Name tokens associated with organism names
## HGC Hydrogenosome genetic code
We have to keep in mind that txid7227 is same as dmelanogaster If we want to do a search, we can use the “entrez_search” function, but we have to save the value of this search in an object, I’m writing the object with the name of “output”
We need to specify the database, and the “term”, which is what we’d put in our search bar. Our search has to be in quotation marks, as it is a string of characters (letters)
We also indicate how many records we want to return information for using the “retmax” argument. I set it as one for this. To specify that we want the term to correspond to “organisms” we put the searchable term “[ORGN]” after the organism inside of square brackets.
output<-entrez_search(db="nuccore", term="Drosophila melanogaster[ORGN] AND MITOCHONDRION[ALL]", retmax=1)
We can return the same result using the txid using the below line
output<-entrez_search(db=“nuccore”, term=“txid7227[ORGN] AND MITOCHONDRION[ALL]”, retmax=1)
This below function shows us how many search ids we returned. This should be one, as we set it retmax=1. However, we can also have it be zero, or NA, as not all searches will return values
length(output$ids)
## [1] 1
Doing a search just tells us if we can find something, but doesn’t actually tell us much about what we found. So we need to tell R to give us the summary of the result it found. We can use the “entrez_summary()” function to do this.
We just need to give it information on which ID we are searchin for. This value is stored in the “output” object we made above, inside the value for id
output3<-entrez_summary(db="nuccore", id=output$ids)
We can get informaiton such as accession number/verion, and confirm our ID number
output3$accessionversion
## [1] "PP764103.1"
output3$taxid
## [1] 7227
##Getting the sequence
If we want to write a sequence from our search, we need to tell R to “fetch” it we can use the entrez_fetch() function to do this, we just need to tell it we want it to return a fasta file by saying “rettype=fasta”. We tell it which sequence we want by providing the ID or the accession
output2<-entrez_fetch(db="nuccore", id=output$ids[1], rettype="fasta")
We can preview the sequence using the below line. It gives us the first 500 letters
cat(strwrap(substr(output2, 1, 500)), sep="\n")
## >PP764103.1 Drosophila melanogaster haplogroup B1 mitochondrion,
## complete genome
## AATGAATTGCCTGATAAAAAGGATTACCTTGATAGGGTAAATCATGCAGTTTTCTGCATTCATTGACTGA
## TTTATATATTATTTATAAAGATGATTTTATATTTAATAGAATTAAACTATTTCTAAAAGTATCAAAAACT
## TTTGTGCATCATACACCAAAATATATTTACAAAAAGATAAGCTAATTAAGCTACTGGGTTCATACCCCAT
## TTATAAAGGTTATAATCCTTTTCTTTTTAATTTTTAATAATTCGTCAAAAATTTTATTTATTACAATTAT
## AATTATTGGGACATTAATTACAGTTACATCTAATTCTTGGTTAGGAGCTTGAATAGGTTTAGAAATTAAT
## TTATTATCTTTTATCCCCCTATTAAGAGATAATAATAATTTAATATCTACAGAAGCTTCTTTAA
If we want to store our sequence to our working directory, we can use this function here write(mitosequence, file=“mitotestingoutput.fasta”)
We can search in the taxonomy database for information on our organism. If I do a search for drosophila, I find three IDs. You can then look through those ids to find which is the one you want. You won’t always have to do this
genusname<-entrez_search(db="taxonomy", term="Drosophila")
gennameid<-entrez_summary(db="taxonomy", id=genusname$ids)
this tells me “fruit flies”
gennameid$`7215`$commonname
## [1] "fruit flies"
this tells me it is at the rank of genus (we dont’ want subgenus)
gennameid$`7215`$rank
## [1] "genus"
By doing this search, we can return all of the taxa ids for things below the rank of Drosophila genus. This includes subgenera and species, and subspeces, etc We use the [SBTR] keyword. I told it to return a max of 2000, becuase I know there are less than 2000 ids. You would have to modify this for your system
taxout<-entrez_search(db="taxonomy", term="txid7215[SBTR]", retmax=2000)
#this tells us how many
length(unique(taxout$ids))
## [1] 1212
Here we’ll test our IDs by doing a summary of one of the ids. We can indicate which id by putting the number position of hte ID we want ot look for in brackets after the informaiton on where our IDs are stored
rec<-entrez_summary(db="taxonomy", id=taxout$ids[987])
This gives us the scientific name
rec$scientificname
## [1] "Drosophila setosifrons"
Now we’ll want to make an empty table (or matrix) to store our data in. We’ll want as many rows as we have taxa ids, so that’s why we used the formula below. We also decided on five columns
drosdat<-matrix(,nrow=length(unique(taxout$ids)), ncol=5)
We can give our columns names here
colnames(drosdat)<-c("taxid", "speciesname", "accnum", "seqname", "seqlen")
now we’ll also want to make an empty character string object to store our fasta sequences in
mitosequence<-character()
Our loop will run the below for every one of our taxa ids, replacing the number with the next number in the sequence as our variable “i”. I’ll put the code below, but it wont’ run, as I dont’ want ot crash this doc
for(i in 1:length(unique(taxout$ids))){
#this reports which record we're on
print(i)
#this puts the id in the first column of the corresponding row
drosdat[i,1]<-taxout$ids[i]
#this gives us a taxonomy summary of the search
record<-entrez_summary(db="taxonomy", id=taxout$ids[i])
#this stores its species name. You might need/want to record other taxonomic information
#such as family, tribe, etc.
drosdat[i,2]<-record$scientificname
#then we do a search to see if we have data. below, the "paste" function in the argument
##will make it use the result of that function. In this case, it reports the taxaid appropriately
seqout<-entrez_search(db="nuccore", term=paste("txid",taxout$ids[i],"[ORGN] AND MITOCHONDRION[ALL]", sep=""), retmax=1)
#now we'll not always have results
#this below "if" statment tells R what do do if we have no results of our search
##this can be indicated as "<1" result
if(length(seqout$ids)<1){
drosdat[i,3]<-NA
drosdat[i,4]<-NA
drosdat[i,5]<-NA
}
#now we can tell R what to do in every other situation
else{
#if we have an ID, let's get a summary of it
outsum<-entrez_summary(db="nuccore", id=seqout$ids)
#now that we have a summary, we'll get the accession number for the seqeuence
drosdat[i,3]<-outsum$accessionversion
#then we can get the title of the sequence
drosdat[i,4]<-outsum$title
#then its length
drosdat[i,5]<-outsum$slen
#now that we know we ahve a sequence, we can then ask R to get hte fasta version of it for us
seqout<-entrez_fetch(db="nuccore", id=outsum$accessionversion, rettype="fasta")
#this below line will replace the descriptor line of the fasta with just the species name
##this is useful for us later
seqout<-sub(">([^\n]*)", paste0(">",outsum$organism), seqout)
#we can then replace spaces with underscores, another useful formatting thing
seqout<-gsub(" ", "_", seqout)
#then we can put the sequence inside of the charcater string we made earlier
###this will simply append any new sequence to our object without deleting earlier ones
####this will make one long file for us
mitosequence<-c(mitosequence, seqout)
}
}
When the loop is finished, we can write the fasta file to our working director
write(mitosequence, file="mitotestingoutput.fasta")
We also dont’ want to forget to save the massive table we made too
write.csv(drosdat, "dros_mitochondrial_seq_info.csv")
If we load the ‘seqinr’ package, we can test that we can read in the file using hte below testing<-read.fasta(“mitotestingoutput.fasta”)