The National Center for Biotechnology Information (NCBI) maintains biological and bibliographic databases including PubMed, GenBank, among many others. Although the data are hosted on NCBI servers, they are accesible through an application programming interface (API). This workshop will introduce a package called rentrez, which facilitates API calls for NCBI data via the R programming language. By the end of the class, students will be familiar with the syntax for retrieving, linking and analyzing records from NCBI databases.
NCBI houses a tremendous variety of biomedical data. From publication abstracts and keywords … to raw sequence data … to taxanomic names and lineages1. These records are stored in disparate databases but can be searched, retrieved and even linked together with a common set of tools. The application programming interface (API) for NCBI is the conduit that provides this access. The API is sometimes referred to as Entrez, which is techincally the name of the global search system (including the web interface), or as eUtils.
It’s probably worth noting that API tools can offer services at different levels of software development. More specifically, eUtils is a “Web API”2. So from a computer connected to a network, you can execute a script that sends a “request” to the NCBI servers with a series of parameters in a pre-defined “endpoint” syntax. The “response” from this query can be returned in a variety of formats, but in this case will typically be in Extensible Markup Language (XML) structure3.
XML is organized as a set of nodes, each of which usually has a individual fields.
Let’s take a look at an example:
<playlist>
<song>
<artist>Prince</artist>
<title>Kiss</title>
<album>Parade</album>
<length>3:38</length>
</song>
<song>
<artist>David Bowie</artist>
<title>Modern Love</title>
<album>Let's Dance</album>
<length>4:48</length>
</song>
<song>
<artist>Talking Heads</artist>
<title>Heaven</title>
<album>Fear of Music</album>
<length>4:03</length>
</song>
</playlist>
The value of the data housed in NCBI databases has driven the development of numerous API “clients” written and executed in various programming languages:
As mentioned above, this tutorial will use rentrez, which is a package developed and maintained by David Winter7. Because it’s written in R, it’s possible to incorporate any of the language’s data manipulation or statistical analysis functionality.
The package is available on the Comprehensive R Archive Network (CRAN). A stable release is available there. For the most recent development release (may include newer functions or versions of existing functions) you can install the package via Github:
# install via CRAN
install.packages("rentrez")
# install via Github
# nb you first need to install devtools
# install.pacakges("devtools")
devtools::install_github("ropensci/rentrez")
library(rentrez)
entrez_dbs()
One of the requirements for performing programmatic searches via rentrez is specification of the NCBI database to be searched. This is defined in the value passed to the “db” argument to entrez_search()
, which we’ll introduce in the next section. But note that the “db” must be an exact match to one of the searchable databases in NCBI. To view a full list and their respective abbreviations use entrez_dbs()
.
entrez_dbs()
## [1] "pubmed" "protein" "nuccore"
## [4] "ipg" "nucleotide" "nucgss"
## [7] "nucest" "structure" "sparcle"
## [10] "genome" "annotinfo" "assembly"
## [13] "bioproject" "biosample" "blastdbinfo"
## [16] "books" "cdd" "clinvar"
## [19] "clone" "gap" "gapplus"
## [22] "grasp" "dbvar" "gene"
## [25] "gds" "geoprofiles" "homologene"
## [28] "medgen" "mesh" "ncbisearch"
## [31] "nlmcatalog" "omim" "orgtrack"
## [34] "pmc" "popset" "probe"
## [37] "proteinclusters" "pcassay" "biosystems"
## [40] "pccompound" "pcsubstance" "pubmedhealth"
## [43] "seqannot" "snp" "sra"
## [46] "taxonomy" "biocollections" "unigene"
## [49] "gencoll" "gtr"
Each of these databases is configured to receive complex queries based on specific fields. You can view a given databases “searchable fields” with entrez_db_searchable()
.
entrez_db_searchable(db = "snp")
## Searchable fields for database 'snp'
## ALL All terms from all searchable fields
## UID Unique number assigned to publication
## FILT Limits the records
## RS Clustered SNP ID (rs)
## CHR chromosomes
## GENE locus link symbol
## HAN Submitter Handle
## ACCN nucleotide accessions
## LLID locus link UID
## ORGN Organism
## FXN Function class
## GTYP Genotype info
## NREF SNP not mapped to reference assembly
## HETZ Heterozygosity
## MPWT Map weight
## VALI Validation status
## SRAT Success rate
## CBID Original Build ID
## UBID Update Build ID
## PDAT SNP Publication date
## MDAT SNP modification date
## PCLS Population classification based on geographic location
## MCLS Assay Method
## SS Submitter ID
## SID Local SNP ID
## VARI Allele
## SCLS SNP class
## GDSC description of gene
## CPOS Chromosome base position
## GPOS Contig base position
## WORD Free text associated with record
## WTAA Reference Amino Acid
## MTAA Variant or Mutant Amino Acid
## RSNP Reference SNP
## SIDX SNP Index
## ALOR Allele originated from somatic or germline
## SUSP Variation suspected to be false based on evidence
## CLIN Variations with clinical effects or significances
## GMAF Minor Allele Frequency derived from global population (ie. 1000G)
entrez_db_searchable(db = "clinvar")
## Searchable fields for database 'clinvar'
## ALL All terms from all searchable fields
## UID Unique number assigned to variation
## FILT Limits the records
## TITL Constructed from variant and phenotype names
## WORD Free text associated with record
## ORGN scientific and common names of organism
## MDAT The last date on which the record was updated
## CHR Chromosome number or numbers; also 'mitochondrial', 'unknown' properties
## GENE Symbol or symbols of the gene
## MIM MIM number from OMIM
## DIS Diseases or traits associated with this record
## ACCN Accession of the genotype/phenotype assertion
## VRID Public ID of a variant
## TRID Public identifier for a trait (e.g. CUI, HPO)
## PROP Properties of ClinVar record
## CDAT The date on which this record first appeared
## PMID PubMed ids of accessions linked to the record
## GID Gene ID
## TID taxonomy id
## CPOS Chromosome base position
## GFN Gene full name
## SBM Submitter
## VRNM Names used for this allele
## VRTP Type of sequence change/variant call
## MCNS Consequence of the variation at the molecular level.
## RVST Review status
## ALID Unique identifier assigned to a specific sequence change at a location.
## ORIG Origin
## SID Submitter Batch
## CYT Cytogenetic location computed from sequence coordinates
## GTRT Accession for a test registered in GTR
## STNM Study Name
## CMPL Complexity
## C37 Chromosome base position for assembly GRCh37
## VLEN Length of the variant defined as the difference between reference and alternate, except when there is no lenght difference (then the length of the allele itself)
## ACCN Nucleotide or protein accession
## HGNC Identifier for Gene assigned by HUGO Gene Nomenclature Committee (HGNC)
## VID Unique number assigned to variation
## IMOD The latest date for each interpretation of a variant
nb in order to construct thorough, complex queries to these databases, you may be best served starting in a web browser and using the graphical “Advanced Search Builder” for the database of choice8.
entrez_search()
For this example, let’s assume we want to use PubMed to find records for articles published in PLoS Neglected Tropical Diseases in 2015. The two fields of interest in this case will be publication date (PDAT) and journal (JOUR).
The fundamental unit to retrieving data from any of the NCBI databases is the record ID. In fact, this is one of the first items returned in the query process. We’ll use entrez_search()
to begin searching NCBI. At minimum, this function takes two arguments:
The search function goes out to an NCBI database (in this example “pubmed”) and returns the “hits” as IDs. The record itself is not contained in this identifier, but rather will be accessed later using entrez_fetch()
.
nb we have to store the results of the search as an object in order to access the IDs later. This object is a named list, so we can access pieces of the results with the $
operator followed by the name of the attribute.
res <- entrez_search(db = "pubmed", term = "(PLoS Neglected Tropical Diseases[JOUR] AND 2015[PDAT])")
The “count” indicates the number of total results.
res$count
## [1] 888
The “ids” are identifiers for records and will be useful when fetching the individual XML content for the articles in this case.
res$ids
## [1] "26727257" "26720725" "26720603" "26720414" "26720278" "26720149"
## [7] "26720002" "26719978" "26719891" "26719888" "26713732" "26713614"
## [13] "26709922" "26709822" "26709695" "26701750" "26701602" "26697878"
## [19] "26694834" "26684831"
By default, entrez_search()
only returns 20 identifiers at a time. To change this, we can modify the “retmax” argument. It may be worth setting this to a high value to make sure as many IDs as possible are returned.
res <- entrez_search(db = "pubmed", term = "(PLoS Neglected Tropical Diseases[JOUR] AND 2015[PDAT])", retmax = 9999, use_history = TRUE)
Note that the count doesn’t change … but there are many more IDs.
Have we returned identifiers for all of the articles in which we’re interested?
length(res$ids) == res$count
## [1] TRUE
entrez_fetch()
With the IDs for the articles, we can now access their full records. As its name suggests, entrez_fetch()
will fetch the data. We’ll define this to be returned in XML format so we can parse the results.
Note that NCBI does place restricitons on the size of queries. We’ll address how to get around this issue by using web_history
(see web_history) … but for now we can just use a subset of the IDs.
recs <- entrez_fetch(db = "pubmed", id = res$ids[1:25], rettype = "xml", parsed = TRUE)
Because we’ve set the “rettype” and asked the results to be parsed, if we were to print this object we’d see nicely formatted XML. From here we can use the XML R package to parse out individual features.
library(XML)
Two functions that could be particularly useful when working with this XML are xpathSApply()
and xmlToList()
:
?xpathSApply
?xmlToList
XML documents are arranged as “trees” or “document object models (DOM)” … these must be traversed in order to find the information of interest. We need to define a starting point in that structure using XPath9. The //
and /
syntax helps point the query to the appropriate context in the document.
xpathSApply()
uses base R’s sapply()
to iterate over a list and return a vector.
xpathSApply(recs, "//MedlineCitation/Article/ArticleTitle", xmlValue)
Antoher workflow for parsing these results is to use entrez_summary()
and extract_from_esummary()
.
esums <- entrez_summary(db = "pubmed", id = res$ids[1:25])
Again, we’ve used just a subset of the PubMed IDs retrieved from our original entrez_search()
. Printing esums
will show all of the elements that could be extracted.
esums
## List of 25 esummary records. First record:
##
## $`26727257`
## esummary result with 42 items:
## [1] uid pubdate epubdate
## [4] source authors lastauthor
## [7] title sorttitle volume
## [10] issue pages lang
## [13] nlmuniqueid issn essn
## [16] pubtype recordstatus pubstatus
## [19] articleids history references
## [22] attributes pmcrefcount fulljournalname
## [25] elocationid doctype srccontriblist
## [28] booktitle medium edition
## [31] publisherlocation publishername srcdate
## [34] reportnumber availablefromurl locationlabel
## [37] doccontriblist docdate bookname
## [40] chapter sortpubdate sortfirstauthor
extract_from_esummary()
will give us a vector named with the associated PubMed ID:
extract_from_esummary(esums, "title")
## 26727257
## "Correction: Carbohydrate Recognition Specificity of Trans-sialidase Lectin Domain from Trypanosoma congolense."
## 26720725
## "Molecular Detection of Schistosome Infections with a Disposable Microfluidic Cassette."
## 26720603
## "Trichinella spiralis Paramyosin Binds Human Complement C1q and Inhibits Classical Complement Activation."
## 26720414
## "Social Network Analysis As a Tool for Research Policy."
## 26720278
## "Quantifying Poverty as a Driver of Ebola Transmission."
## 26720149
## "Glycans from Fasciola hepatica Modulate the Host Immune Response and TLR-Induced Maturation of Dendritic Cells."
## 26720002
## "Plasmodium malariae Infection Associated with a High Burden of Anemia: A Hospital-Based Surveillance Study."
## 26719978
## "Using Co-authorship Networks to Map and Analyse Global Neglected Tropical Disease Research with an Affiliation to Germany."
## 26719891
## "Schistosoma mansoni Egg, Adult Male and Female Comparative Gene Expression Analysis and Identification of Novel Genes by RNA-Seq."
## 26719888
## "The 1899 United States Kissing Bug Epidemic."
## 26713732
## "The Schistosoma mansoni Cytochrome P450 (CYP3050A1) Is Essential for Worm Survival and Egg Development."
## 26713614
## "Pregnancy Outcomes after a Mass Vaccination Campaign with an Oral Cholera Vaccine in Guinea: A Retrospective Cohort Study."
## 26709922
## "Systematic Review and Meta-analysis of the Impact of Chemical-Based Mollusciciding for Control of Schistosoma mansoni and S. haematobium Transmission."
## 26709822
## "Cissampelos pareira Linn: Natural Source of Potent Antiviral Activity against All Four Dengue Virus Serotypes."
## 26709695
## "Whole Genome Sequencing of Field Isolates Reveals Extensive Genetic Diversity in Plasmodium vivax from Colombia."
## 26701750
## "Immucillins ImmA and ImmH Are Effective and Non-toxic in the Treatment of Experimental Visceral Leishmaniasis."
## 26701602
## "A Library of Plasmodium vivax Recombinant Merozoite Proteins Reveals New Vaccine Candidates and Protein-Protein Interactions."
## 26697878
## "Identification and Characterization of Microsatellite Markers Derived from the Whole Genome Analysis of Taenia solium."
## 26694834
## "The Risk of Nosocomial Transmission of Rift Valley Fever."
## 26684831
## "Genomic and Proteomic Studies on the Mode of Action of Oxaboroles against the African Trypanosome."
## 26684756
## "Generation of a Novel Bacteriophage Library Displaying scFv Antibody Fragments from the Natural Buffalo Host to Identify Antigens from Adult Schistosoma japonicum for Diagnostic Development."
## 26678393
## "Earth Observation, Spatial Data Quality, and Neglected Tropical Diseases."
## 26678263
## "A Spatiotemporal Database to Track Human Scrub Typhus Using the VectorMap Application."
## 26671573
## "Peridomestic Infection as a Determining Factor of Dengue Transmission."
## 26659387
## "Characterization of the Paracoccidioides Hypoxia Response Reveals New Insights into Pathogenesis Mechanisms of This Important Human Pathogenic Fungus."
entrez_link()
One of the most fundamental advantages to using NCBI databases is the ability to connect data across repositories. This idea of “linking” is available in rentrez via the entrez_link()
function.
To demonstrate how this works, let’s say we want to retrieve sequence data for Aedes aegypti, which is the mosquito vector for Zika, Dengue fever and yellow fever. The first step is to retrieve the correct organism ID for the “yellow fever mosquito”:
yfm <- entrez_search(db = "taxonomy", term = "yellow fever mosquito")
yfm$ids
## [1] "7159"
If we were interested in the taxonomic record for this species of mosquito, then we could run entrez_fetch()
on the “taxonomy” database with this ID. However, since the goal is to retrieve sequences, we have to fetch from NCBI’s “nuccore” nucleotide database. But that will require a linkage of first taxonomy to genome, then genome to nuccore:
yfmlinks <- entrez_link(dbfrom = "taxonomy", id = yfm$ids, db = "genome")
genlinkid <- yfmlinks$links$taxonomy_genome
yfmlinks2 <- entrez_link(dbfrom = "genome", id = genlinkid, db = "nuccore")
nuclinkid <- yfmlinks2$links$genome_nuccore
nb the entrez_link()
function requires a single “dbfrom” argument. However, for the “db” argument you can pass either pass a single database or specify “all” to retrieve link IDs from all databases that share links with the record(s) of interest.
If you want to see which databases could possibly have links from a specific database, use entrez_db_links()
:
entrez_db_links(db = "taxonomy")
## Databases with linked records for database 'taxonomy'
## [1] assembly bioproject biosample biosystems
## [5] books cdd clone dbvar
## [9] gds gene genome geoprofiles
## [13] homologene mesh nucest nucgss
## [17] nuccore pcassay pccompound pcsubstance
## [21] pmc popset probe protein
## [25] protein proteinclusters pubmed pubmed
## [29] snp sra structure unigene
There are 12 records for Aedes aegypti in NCBI’s nucleotide database … entrez_fetch()
can pull down each in fasta format:
yfmfasta <- entrez_fetch(db = "nuccore", id = nuclinkid, rettype = "fasta")
strsplit(yfmfasta, ">")[[1]][1]
rentrez includes a number of features that extend the search -> link -> fetch workflow.
As mentioned above, NCBI places limits on large queries to its databases. However, the servers allow users to store long lists of IDs as “web history” objects. When we ran our original entrez_search()
we toggled “use_history” to TRUE. Because we did that, our original search object contains a “web_history” element.
res$web_history
## Web history object (QueryKey = 1, WebEnv = NCID_1_31327...)
So if we wanted to retrieve all records from our original search, regardless of how many there are, we can pass the web history object in the “web_history” argument to entrez_fetch()
rather than a list of IDs.
recs <- entrez_fetch(db = "pubmed", web_history = res$web_history, rettype = "xml", parsed = TRUE)
Did we get all of the titles?
plosidtitles <- xpathSApply(recs, "//MedlineCitation/Article/ArticleTitle", xmlValue)
length(plosidtitles) == res$count
entrez_post()
Another way to store IDs on NCBI servers is with the entrez_post()
function. This feature effectively also creats a web history object, but does so by explicitly using a POST protocol10.
res <- entrez_search(db = "clinvar", term= "atherosclerosis[Disease/Phenotype]")
postids <- entrez_post(db = "clinvar", res$ids)
postids
## Web history object (QueryKey = 1, WebEnv = NCID_1_27120...)
entrez_info()
For metadata about a specific database use entrez_info()
:
entrez_info(db = "mesh")
## <?xml version="1.0" encoding="UTF-8"?>
## <!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
## <eInfoResult>
## <DbInfo>
## <DbName>mesh</DbName>
## <MenuName>MeSH</MenuName>
## <Description>MeSH Database</Description>
## <DbBuild>Build171017-0315.1</DbBuild>
## <Count>273016</Count>
## <LastUpdate>2017/10/17 03:52</LastUpdate>
## <FieldList>
## <Field>
## <Name>ALL</Name>
## <FullName>All Fields</FullName>
## <Description>All terms from all searchable fields</Description>
## <TermCount>2335278</TermCount>
## <IsDate>N</IsDate>
## <IsNumerical>N</IsNumerical>
## <SingleToken>N</SingleToken>
## <Hierarchy>N</Hierarchy>
## <IsHidden>N</IsHidden>
## </Field>
## <Field>
## <Name>UID</Name>
## <FullName>UID</FullName>
## <Description>Unique number assigned to publication</Description>
## <TermCount>0</TermCount>
## <IsDate>N</IsDate>
## <IsNumerical>Y</IsNumerical>
## <SingleToken>Y</SingleToken>
## <Hierarchy>N</Hierarchy>
## <IsHidden>Y</IsHidden>
## </Field>
## <Field>
## <Name>FILT</Name>
## <FullName>Filter</FullName>
## <Description>Limits the records</Description>
## <TermCount>16</TermCount>
## <IsDate>N</IsDate>
## <IsNumerical>N</IsNumerical>
## <SingleToken>Y</SingleToken>
## <Hierarchy>N</Hierarchy>
## <IsHidden>Y</IsHidden>
## </Field>
## <Field>
## <Name>TN</Name>
## <FullName>Tree Number</FullName>
## <Description>Tree Number</Description>
## <TermCount>302401</TermCount>
## <IsDate>N</IsDate>
## <IsNumerical>N</IsNumerical>
## <SingleToken>Y</SingleToken>
## <Hierarchy>N</Hierarchy>
## <IsHidden>Y</IsHidden>
## </Field>
## <Field>
## <Name>MESH</Name>
## <FullName>MeSH Terms</FullName>
## <Description>MeSH Terms</Description>
## <TermCount>216343</TermCount>
## <IsDate>N</IsDate>
## <IsNumerical>N</IsNumerical>
## <SingleToken>Y</SingleToken>
## <Hierarchy>N</Hierarchy>
## <IsHidden>N</IsHidden>
## </Field>
## <Field>
## <Name>SUBS</Name>
## <FullName>Substance Name</FullName>
## <Description>Substance Name</Description>
## <TermCount>582168</TermCount>
## <IsDate>N</IsDate>
## <IsNumerical>N</IsNumerical>
## <SingleToken>Y</SingleToken>
## <Hierarchy>N</Hierarchy>
## <IsHidden>N</IsHidden>
## </Field>
## <Field>
## <Name>WORD</Name>
## <FullName>Text Word</FullName>
## <Description>Free text</Description>
## <TermCount>1029749</TermCount>
## <IsDate>N</IsDate>
## <IsNumerical>N</IsNumerical>
## <SingleToken>Y</SingleToken>
## <Hierarchy>N</Hierarchy>
## <IsHidden>N</IsHidden>
## </Field>
## <Field>
## <Name>ALSO</Name>
## <FullName>See Also</FullName>
## <Description>See Also</Description>
## <TermCount>9769</TermCount>
## <IsDate>N</IsDate>
## <IsNumerical>N</IsNumerical>
## <SingleToken>Y</SingleToken>
## <Hierarchy>N</Hierarchy>
## <IsHidden>Y</IsHidden>
## </Field>
## <Field>
## <Name>PREV</Name>
## <FullName>Previous Indexing</FullName>
## <Description>Previous Indexing</Description>
## <TermCount>8392</TermCount>
## <IsDate>N</IsDate>
## <IsNumerical>N</IsNumerical>
## <SingleToken>Y</SingleToken>
## <Hierarchy>N</Hierarchy>
## <IsHidden>Y</IsHidden>
## </Field>
## <Field>
## <Name>NOTE</Name>
## <FullName>Scope Note</FullName>
## <Description>Scope Note</Description>
## <TermCount>446780</TermCount>
## <IsDate>N</IsDate>
## <IsNumerical>N</IsNumerical>
## <SingleToken>Y</SingleToken>
## <Hierarchy>N</Hierarchy>
## <IsHidden>N</IsHidden>
## </Field>
## <Field>
## <Name>REG</Name>
## <FullName>Registry Number</FullName>
## <Description>Registry Number</Description>
## <TermCount>100605</TermCount>
## <IsDate>N</IsDate>
## <IsNumerical>N</IsNumerical>
## <SingleToken>Y</SingleToken>
## <Hierarchy>N</Hierarchy>
## <IsHidden>N</IsHidden>
## </Field>
## <Field>
## <Name>MULT</Name>
## <FullName>Multi</FullName>
## <Description>Multi</Description>
## <TermCount>798468</TermCount>
## <IsDate>N</IsDate>
## <IsNumerical>N</IsNumerical>
## <SingleToken>Y</SingleToken>
## <Hierarchy>N</Hierarchy>
## <IsHidden>Y</IsHidden>
## </Field>
## <Field>
## <Name>TYPE</Name>
## <FullName>Record Type</FullName>
## <Description>Record type - main heading, subheading, pharmacological action, substance name, publication type</Description>
## <TermCount>6</TermCount>
## <IsDate>N</IsDate>
## <IsNumerical>N</IsNumerical>
## <SingleToken>Y</SingleToken>
## <Hierarchy>N</Hierarchy>
## <IsHidden>N</IsHidden>
## </Field>
## <Field>
## <Name>MHUI</Name>
## <FullName>MeSH Unique ID</FullName>
## <Description>NLM MeSH Browser Unique ID</Description>
## <TermCount>272456</TermCount>
## <IsDate>N</IsDate>
## <IsNumerical>N</IsNumerical>
## <SingleToken>Y</SingleToken>
## <Hierarchy>N</Hierarchy>
## <IsHidden>N</IsHidden>
## </Field>
## </FieldList>
## <LinkList>
## <Link>
## <Name>mesh_gap</Name>
## <Menu>dbGaP Links</Menu>
## <Description>dbGaP Links</Description>
## <DbTo>gap</DbTo>
## </Link>
## <Link>
## <Name>mesh_medgen</Name>
## <Menu>MedGen</Menu>
## <Description>Related information in MedGen</Description>
## <DbTo>medgen</DbTo>
## </Link>
## <Link>
## <Name>mesh_pccompound</Name>
## <Menu>PubChem Compound Links</Menu>
## <Description>Related PubChem Compound with MeSH</Description>
## <DbTo>pccompound</DbTo>
## </Link>
## <Link>
## <Name>mesh_taxonomy</Name>
## <Menu>Taxonomy Links</Menu>
## <Description>Related Organisms</Description>
## <DbTo>taxonomy</DbTo>
## </Link>
## </LinkList>
## </DbInfo>
## </eInfoResult>
##
nb this information comes back as XML, so you’ll have to use XPATH parsing to extract specific fields.
Many of the examples and workflows presented in this lesson are heavily inspired by the developer of rentrez, David Winter, and his awesome vignette for the package11.