The National Center for Biotechnology Information (NCBI) maintains biological and bibliographic databases including PubMed, GenBank, among many others. Although the data are hosted on NCBI servers, they are accesible through an application programming interface (API). This workshop will introduce a package called rentrez, which facilitates API calls for NCBI data via the R programming language. By the end of the class, students will be familiar with the syntax for retrieving, linking and analyzing records from NCBI databases.

Introduction

NCBI houses a tremendous variety of biomedical data. From publication abstracts and keywords … to raw sequence data … to taxanomic names and lineages1. These records are stored in disparate databases but can be searched, retrieved and even linked together with a common set of tools. The application programming interface (API) for NCBI is the conduit that provides this access. The API is sometimes referred to as Entrez, which is techincally the name of the global search system (including the web interface), or as eUtils.

It’s probably worth noting that API tools can offer services at different levels of software development. More specifically, eUtils is a “Web API”2. So from a computer connected to a network, you can execute a script that sends a “request” to the NCBI servers with a series of parameters in a pre-defined “endpoint” syntax. The “response” from this query can be returned in a variety of formats, but in this case will typically be in Extensible Markup Language (XML) structure3.

XML is organized as a set of nodes, each of which usually has a individual fields.

Let’s take a look at an example:

<playlist>
    <song>
        <artist>Prince</artist>
        <title>Kiss</title>
        <album>Parade</album>
        <length>3:38</length>
    </song>
    <song>
        <artist>David Bowie</artist>
        <title>Modern Love</title>
        <album>Let's Dance</album>
        <length>4:48</length>
    </song>
    <song>
        <artist>Talking Heads</artist>
        <title>Heaven</title>
        <album>Fear of Music</album>
        <length>4:03</length>
    </song>
</playlist>

The value of the data housed in NCBI databases has driven the development of numerous API “clients” written and executed in various programming languages:

  • Entrez Direct (EDirect)4
  • Bio.Entrez5
  • bionode6

As mentioned above, this tutorial will use rentrez, which is a package developed and maintained by David Winter7. Because it’s written in R, it’s possible to incorporate any of the language’s data manipulation or statistical analysis functionality.

The package is available on the Comprehensive R Archive Network (CRAN). A stable release is available there. For the most recent development release (may include newer functions or versions of existing functions) you can install the package via Github:

# install via CRAN
install.packages("rentrez")

# install via Github
# nb you first need to install devtools
# install.pacakges("devtools")
devtools::install_github("ropensci/rentrez")
library(rentrez)

entrez_dbs()

One of the requirements for performing programmatic searches via rentrez is specification of the NCBI database to be searched. This is defined in the value passed to the “db” argument to entrez_search(), which we’ll introduce in the next section. But note that the “db” must be an exact match to one of the searchable databases in NCBI. To view a full list and their respective abbreviations use entrez_dbs().

entrez_dbs()
##  [1] "pubmed"          "protein"         "nuccore"        
##  [4] "ipg"             "nucleotide"      "nucgss"         
##  [7] "nucest"          "structure"       "sparcle"        
## [10] "genome"          "annotinfo"       "assembly"       
## [13] "bioproject"      "biosample"       "blastdbinfo"    
## [16] "books"           "cdd"             "clinvar"        
## [19] "clone"           "gap"             "gapplus"        
## [22] "grasp"           "dbvar"           "gene"           
## [25] "gds"             "geoprofiles"     "homologene"     
## [28] "medgen"          "mesh"            "ncbisearch"     
## [31] "nlmcatalog"      "omim"            "orgtrack"       
## [34] "pmc"             "popset"          "probe"          
## [37] "proteinclusters" "pcassay"         "biosystems"     
## [40] "pccompound"      "pcsubstance"     "pubmedhealth"   
## [43] "seqannot"        "snp"             "sra"            
## [46] "taxonomy"        "biocollections"  "unigene"        
## [49] "gencoll"         "gtr"

Each of these databases is configured to receive complex queries based on specific fields. You can view a given databases “searchable fields” with entrez_db_searchable().

entrez_db_searchable(db = "snp")
## Searchable fields for database 'snp'
##   ALL     All terms from all searchable fields 
##   UID     Unique number assigned to publication 
##   FILT    Limits the records 
##   RS      Clustered SNP ID (rs) 
##   CHR     chromosomes 
##   GENE    locus link symbol 
##   HAN     Submitter Handle 
##   ACCN    nucleotide accessions 
##   LLID    locus link UID 
##   ORGN    Organism 
##   FXN     Function class 
##   GTYP    Genotype info 
##   NREF    SNP not mapped to reference assembly 
##   HETZ    Heterozygosity 
##   MPWT    Map weight 
##   VALI    Validation status 
##   SRAT    Success rate 
##   CBID    Original Build ID 
##   UBID    Update Build ID 
##   PDAT    SNP Publication date 
##   MDAT    SNP modification date 
##   PCLS    Population classification based on geographic location 
##   MCLS    Assay Method 
##   SS      Submitter ID 
##   SID     Local SNP ID 
##   VARI    Allele 
##   SCLS    SNP class 
##   GDSC    description of gene 
##   CPOS    Chromosome base position 
##   GPOS    Contig base position 
##   WORD    Free text associated with record 
##   WTAA    Reference Amino Acid 
##   MTAA    Variant or Mutant Amino Acid 
##   RSNP    Reference SNP 
##   SIDX    SNP Index 
##   ALOR    Allele originated from somatic or germline 
##   SUSP    Variation suspected to be false based on evidence 
##   CLIN    Variations with clinical effects or significances 
##   GMAF    Minor Allele Frequency derived from global population (ie. 1000G)
entrez_db_searchable(db = "clinvar")
## Searchable fields for database 'clinvar'
##   ALL     All terms from all searchable fields 
##   UID     Unique number assigned to variation 
##   FILT    Limits the records 
##   TITL    Constructed from variant and phenotype names 
##   WORD    Free text associated with record 
##   ORGN    scientific and common names of organism 
##   MDAT    The last date on which the record was updated 
##   CHR     Chromosome number or numbers; also 'mitochondrial', 'unknown' properties 
##   GENE    Symbol or symbols of the gene 
##   MIM     MIM number from OMIM 
##   DIS     Diseases or traits associated with this record 
##   ACCN    Accession of the genotype/phenotype assertion 
##   VRID    Public ID of a variant 
##   TRID    Public identifier for a trait (e.g. CUI, HPO) 
##   PROP    Properties of ClinVar record 
##   CDAT    The date on which this record first appeared 
##   PMID    PubMed ids of accessions linked to the record 
##   GID     Gene ID 
##   TID     taxonomy id 
##   CPOS    Chromosome base position 
##   GFN     Gene full name 
##   SBM     Submitter 
##   VRNM    Names used for this allele 
##   VRTP    Type of sequence change/variant call 
##   MCNS    Consequence of the variation at the molecular level. 
##   RVST    Review status 
##   ALID    Unique identifier assigned to a specific sequence change at a location. 
##   ORIG    Origin 
##   SID     Submitter Batch 
##   CYT     Cytogenetic location computed from sequence coordinates 
##   GTRT    Accession for a test registered in GTR 
##   STNM    Study Name 
##   CMPL    Complexity 
##   C37     Chromosome base position for assembly GRCh37 
##   VLEN    Length of the variant defined as the difference between reference and alternate, except when there is no lenght difference (then the length of the allele itself) 
##   ACCN    Nucleotide or protein accession 
##   HGNC    Identifier for Gene assigned by HUGO Gene Nomenclature Committee (HGNC) 
##   VID     Unique number assigned to variation 
##   IMOD    The latest date for each interpretation of a variant

nb in order to construct thorough, complex queries to these databases, you may be best served starting in a web browser and using the graphical “Advanced Search Builder” for the database of choice8.

entrez_fetch()

With the IDs for the articles, we can now access their full records. As its name suggests, entrez_fetch() will fetch the data. We’ll define this to be returned in XML format so we can parse the results.

Note that NCBI does place restricitons on the size of queries. We’ll address how to get around this issue by using web_history (see web_history) … but for now we can just use a subset of the IDs.

recs <- entrez_fetch(db = "pubmed", id = res$ids[1:25], rettype = "xml", parsed = TRUE)

Because we’ve set the “rettype” and asked the results to be parsed, if we were to print this object we’d see nicely formatted XML. From here we can use the XML R package to parse out individual features.

library(XML)

Two functions that could be particularly useful when working with this XML are xpathSApply() and xmlToList():

?xpathSApply
?xmlToList

XML documents are arranged as “trees” or “document object models (DOM)” … these must be traversed in order to find the information of interest. We need to define a starting point in that structure using XPath9. The // and / syntax helps point the query to the appropriate context in the document.

xpathSApply() uses base R’s sapply() to iterate over a list and return a vector.

xpathSApply(recs, "//MedlineCitation/Article/ArticleTitle", xmlValue)

Antoher workflow for parsing these results is to use entrez_summary() and extract_from_esummary().

esums <- entrez_summary(db = "pubmed", id = res$ids[1:25])

Again, we’ve used just a subset of the PubMed IDs retrieved from our original entrez_search(). Printing esums will show all of the elements that could be extracted.

esums
## List of  25 esummary records. First record:
## 
##  $`26727257`
## esummary result with 42 items:
##  [1] uid               pubdate           epubdate         
##  [4] source            authors           lastauthor       
##  [7] title             sorttitle         volume           
## [10] issue             pages             lang             
## [13] nlmuniqueid       issn              essn             
## [16] pubtype           recordstatus      pubstatus        
## [19] articleids        history           references       
## [22] attributes        pmcrefcount       fulljournalname  
## [25] elocationid       doctype           srccontriblist   
## [28] booktitle         medium            edition          
## [31] publisherlocation publishername     srcdate          
## [34] reportnumber      availablefromurl  locationlabel    
## [37] doccontriblist    docdate           bookname         
## [40] chapter           sortpubdate       sortfirstauthor

extract_from_esummary() will give us a vector named with the associated PubMed ID:

extract_from_esummary(esums, "title")
##                                                                                                                                                                                         26727257 
##                                                                                 "Correction: Carbohydrate Recognition Specificity of Trans-sialidase Lectin Domain from Trypanosoma congolense." 
##                                                                                                                                                                                         26720725 
##                                                                                                         "Molecular Detection of Schistosome Infections with a Disposable Microfluidic Cassette." 
##                                                                                                                                                                                         26720603 
##                                                                                       "Trichinella spiralis Paramyosin Binds Human Complement C1q and Inhibits Classical Complement Activation." 
##                                                                                                                                                                                         26720414 
##                                                                                                                                         "Social Network Analysis As a Tool for Research Policy." 
##                                                                                                                                                                                         26720278 
##                                                                                                                                         "Quantifying Poverty as a Driver of Ebola Transmission." 
##                                                                                                                                                                                         26720149 
##                                                                                "Glycans from Fasciola hepatica Modulate the Host Immune Response and TLR-Induced Maturation of Dendritic Cells." 
##                                                                                                                                                                                         26720002 
##                                                                                    "Plasmodium malariae Infection Associated with a High Burden of Anemia: A Hospital-Based Surveillance Study." 
##                                                                                                                                                                                         26719978 
##                                                                     "Using Co-authorship Networks to Map and Analyse Global Neglected Tropical Disease Research with an Affiliation to Germany." 
##                                                                                                                                                                                         26719891 
##                                                              "Schistosoma mansoni Egg, Adult Male and Female Comparative Gene Expression Analysis and Identification of Novel Genes by RNA-Seq." 
##                                                                                                                                                                                         26719888 
##                                                                                                                                                   "The 1899 United States Kissing Bug Epidemic." 
##                                                                                                                                                                                         26713732 
##                                                                                        "The Schistosoma mansoni Cytochrome P450 (CYP3050A1) Is Essential for Worm Survival and Egg Development." 
##                                                                                                                                                                                         26713614 
##                                                                     "Pregnancy Outcomes after a Mass Vaccination Campaign with an Oral Cholera Vaccine in Guinea: A Retrospective Cohort Study." 
##                                                                                                                                                                                         26709922 
##                                         "Systematic Review and Meta-analysis of the Impact of Chemical-Based Mollusciciding for Control of Schistosoma mansoni and S. haematobium Transmission." 
##                                                                                                                                                                                         26709822 
##                                                                                 "Cissampelos pareira Linn: Natural Source of Potent Antiviral Activity against All Four Dengue Virus Serotypes." 
##                                                                                                                                                                                         26709695 
##                                                                               "Whole Genome Sequencing of Field Isolates Reveals Extensive Genetic Diversity in Plasmodium vivax from Colombia." 
##                                                                                                                                                                                         26701750 
##                                                                                 "Immucillins ImmA and ImmH Are Effective and Non-toxic in the Treatment of Experimental Visceral Leishmaniasis." 
##                                                                                                                                                                                         26701602 
##                                                                  "A Library of Plasmodium vivax Recombinant Merozoite Proteins Reveals New Vaccine Candidates and Protein-Protein Interactions." 
##                                                                                                                                                                                         26697878 
##                                                                         "Identification and Characterization of Microsatellite Markers Derived from the Whole Genome Analysis of Taenia solium." 
##                                                                                                                                                                                         26694834 
##                                                                                                                                      "The Risk of Nosocomial Transmission of Rift Valley Fever." 
##                                                                                                                                                                                         26684831 
##                                                                                             "Genomic and Proteomic Studies on the Mode of Action of Oxaboroles against the African Trypanosome." 
##                                                                                                                                                                                         26684756 
## "Generation of a Novel Bacteriophage Library Displaying scFv Antibody Fragments from the Natural Buffalo Host to Identify Antigens from Adult Schistosoma japonicum for Diagnostic Development." 
##                                                                                                                                                                                         26678393 
##                                                                                                                      "Earth Observation, Spatial Data Quality, and Neglected Tropical Diseases." 
##                                                                                                                                                                                         26678263 
##                                                                                                         "A Spatiotemporal Database to Track Human Scrub Typhus Using the VectorMap Application." 
##                                                                                                                                                                                         26671573 
##                                                                                                                         "Peridomestic Infection as a Determining Factor of Dengue Transmission." 
##                                                                                                                                                                                         26659387 
##                                         "Characterization of the Paracoccidioides Hypoxia Response Reveals New Insights into Pathogenesis Mechanisms of This Important Human Pathogenic Fungus."

Other Features

rentrez includes a number of features that extend the search -> link -> fetch workflow.

web_history

As mentioned above, NCBI places limits on large queries to its databases. However, the servers allow users to store long lists of IDs as “web history” objects. When we ran our original entrez_search() we toggled “use_history” to TRUE. Because we did that, our original search object contains a “web_history” element.

res$web_history
## Web history object (QueryKey = 1, WebEnv = NCID_1_31327...)

So if we wanted to retrieve all records from our original search, regardless of how many there are, we can pass the web history object in the “web_history” argument to entrez_fetch() rather than a list of IDs.

recs <- entrez_fetch(db = "pubmed", web_history = res$web_history, rettype = "xml", parsed = TRUE)

Did we get all of the titles?

plosidtitles <- xpathSApply(recs, "//MedlineCitation/Article/ArticleTitle", xmlValue)

length(plosidtitles) == res$count

entrez_post()

Another way to store IDs on NCBI servers is with the entrez_post() function. This feature effectively also creats a web history object, but does so by explicitly using a POST protocol10.

res <- entrez_search(db = "clinvar", term= "atherosclerosis[Disease/Phenotype]")
postids <- entrez_post(db = "clinvar", res$ids)
postids
## Web history object (QueryKey = 1, WebEnv = NCID_1_27120...)

entrez_info()

For metadata about a specific database use entrez_info():

entrez_info(db = "mesh")
## <?xml version="1.0" encoding="UTF-8"?>
## <!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
## <eInfoResult>
##   <DbInfo>
##     <DbName>mesh</DbName>
##     <MenuName>MeSH</MenuName>
##     <Description>MeSH Database</Description>
##     <DbBuild>Build171017-0315.1</DbBuild>
##     <Count>273016</Count>
##     <LastUpdate>2017/10/17 03:52</LastUpdate>
##     <FieldList>
##       <Field>
##         <Name>ALL</Name>
##         <FullName>All Fields</FullName>
##         <Description>All terms from all searchable fields</Description>
##         <TermCount>2335278</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>N</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>N</IsHidden>
##       </Field>
##       <Field>
##         <Name>UID</Name>
##         <FullName>UID</FullName>
##         <Description>Unique number assigned to publication</Description>
##         <TermCount>0</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>Y</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>Y</IsHidden>
##       </Field>
##       <Field>
##         <Name>FILT</Name>
##         <FullName>Filter</FullName>
##         <Description>Limits the records</Description>
##         <TermCount>16</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>Y</IsHidden>
##       </Field>
##       <Field>
##         <Name>TN</Name>
##         <FullName>Tree Number</FullName>
##         <Description>Tree Number</Description>
##         <TermCount>302401</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>Y</IsHidden>
##       </Field>
##       <Field>
##         <Name>MESH</Name>
##         <FullName>MeSH Terms</FullName>
##         <Description>MeSH Terms</Description>
##         <TermCount>216343</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>N</IsHidden>
##       </Field>
##       <Field>
##         <Name>SUBS</Name>
##         <FullName>Substance Name</FullName>
##         <Description>Substance Name</Description>
##         <TermCount>582168</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>N</IsHidden>
##       </Field>
##       <Field>
##         <Name>WORD</Name>
##         <FullName>Text Word</FullName>
##         <Description>Free text</Description>
##         <TermCount>1029749</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>N</IsHidden>
##       </Field>
##       <Field>
##         <Name>ALSO</Name>
##         <FullName>See Also</FullName>
##         <Description>See Also</Description>
##         <TermCount>9769</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>Y</IsHidden>
##       </Field>
##       <Field>
##         <Name>PREV</Name>
##         <FullName>Previous Indexing</FullName>
##         <Description>Previous Indexing</Description>
##         <TermCount>8392</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>Y</IsHidden>
##       </Field>
##       <Field>
##         <Name>NOTE</Name>
##         <FullName>Scope Note</FullName>
##         <Description>Scope Note</Description>
##         <TermCount>446780</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>N</IsHidden>
##       </Field>
##       <Field>
##         <Name>REG</Name>
##         <FullName>Registry Number</FullName>
##         <Description>Registry Number</Description>
##         <TermCount>100605</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>N</IsHidden>
##       </Field>
##       <Field>
##         <Name>MULT</Name>
##         <FullName>Multi</FullName>
##         <Description>Multi</Description>
##         <TermCount>798468</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>Y</IsHidden>
##       </Field>
##       <Field>
##         <Name>TYPE</Name>
##         <FullName>Record Type</FullName>
##         <Description>Record type - main heading, subheading, pharmacological action, substance name, publication type</Description>
##         <TermCount>6</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>N</IsHidden>
##       </Field>
##       <Field>
##         <Name>MHUI</Name>
##         <FullName>MeSH Unique ID</FullName>
##         <Description>NLM MeSH Browser Unique ID</Description>
##         <TermCount>272456</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>N</IsHidden>
##       </Field>
##     </FieldList>
##     <LinkList>
##       <Link>
##         <Name>mesh_gap</Name>
##         <Menu>dbGaP Links</Menu>
##         <Description>dbGaP Links</Description>
##         <DbTo>gap</DbTo>
##       </Link>
##       <Link>
##         <Name>mesh_medgen</Name>
##         <Menu>MedGen</Menu>
##         <Description>Related information in MedGen</Description>
##         <DbTo>medgen</DbTo>
##       </Link>
##       <Link>
##         <Name>mesh_pccompound</Name>
##         <Menu>PubChem Compound Links</Menu>
##         <Description>Related PubChem Compound with MeSH</Description>
##         <DbTo>pccompound</DbTo>
##       </Link>
##       <Link>
##         <Name>mesh_taxonomy</Name>
##         <Menu>Taxonomy Links</Menu>
##         <Description>Related Organisms</Description>
##         <DbTo>taxonomy</DbTo>
##       </Link>
##     </LinkList>
##   </DbInfo>
## </eInfoResult>
## 

nb this information comes back as XML, so you’ll have to use XPATH parsing to extract specific fields.

Acknowledgments

Many of the examples and workflows presented in this lesson are heavily inspired by the developer of rentrez, David Winter, and his awesome vignette for the package11.

References