Harvesting Data From NCBI

The National Center for Biotechnology Information (NCBI) maintains biological and bibliographic databases including PubMed, GenBank, among many others. Although the data are hosted on NCBI servers, they are accesible through an application programming interface (API). This workshop will introduce a package called rentrez, which facilitates API calls for NCBI data via the R programming language. By the end of the class, students will be familiar with the syntax for retrieving, linking and analyzing records from NCBI databases.

Introduction

NCBI houses a tremendous variety of biomedical data. From publication abstracts and keywords … to raw sequence data … to taxanomic names and lineages¹. These records are stored in disparate databases but can be searched, retrieved and even linked together with a common set of tools. The application programming interface (API) for NCBI is the conduit that provides this access. The API is sometimes referred to as Entrez, which is techincally the name of the global search system (including the web interface), or as eUtils.

It’s probably worth noting that API tools can offer services at different levels of software development. More specifically, eUtils is a “Web API”². So from a computer connected to a network, you can execute a script that sends a “request” to the NCBI servers with a series of parameters in a pre-defined “endpoint” syntax. The “response” from this query can be returned in a variety of formats, but in this case will typically be in Extensible Markup Language (XML) structure³.

XML is organized as a set of nodes, each of which usually has a individual fields.

Let’s take a look at an example:

<playlist>
    <song>
        <artist>Prince</artist>
        <title>Kiss</title>
        <album>Parade</album>
        <length>3:38</length>
    </song>
    <song>
        <artist>David Bowie</artist>
        <title>Modern Love</title>
        <album>Let's Dance</album>
        <length>4:48</length>
    </song>
    <song>
        <artist>Talking Heads</artist>
        <title>Heaven</title>
        <album>Fear of Music</album>
        <length>4:03</length>
    </song>
</playlist>

The value of the data housed in NCBI databases has driven the development of numerous API “clients” written and executed in various programming languages:

Entrez Direct (EDirect)⁴
Bio.Entrez⁵
bionode⁶

As mentioned above, this tutorial will use rentrez, which is a package developed and maintained by David Winter⁷. Because it’s written in R, it’s possible to incorporate any of the language’s data manipulation or statistical analysis functionality.

The package is available on the Comprehensive R Archive Network (CRAN). A stable release is available there. For the most recent development release (may include newer functions or versions of existing functions) you can install the package via Github:

# install via CRAN
install.packages("rentrez")

# install via Github
# nb you first need to install devtools
# install.pacakges("devtools")
devtools::install_github("ropensci/rentrez")

library(rentrez)

`entrez_dbs()`

One of the requirements for performing programmatic searches via rentrez is specification of the NCBI database to be searched. This is defined in the value passed to the “db” argument to entrez_search(), which we’ll introduce in the next section. But note that the “db” must be an exact match to one of the searchable databases in NCBI. To view a full list and their respective abbreviations use entrez_dbs().

entrez_dbs()

##  [1] "pubmed"          "protein"         "nuccore"        
##  [4] "ipg"             "nucleotide"      "nucgss"         
##  [7] "nucest"          "structure"       "sparcle"        
## [10] "genome"          "annotinfo"       "assembly"       
## [13] "bioproject"      "biosample"       "blastdbinfo"    
## [16] "books"           "cdd"             "clinvar"        
## [19] "clone"           "gap"             "gapplus"        
## [22] "grasp"           "dbvar"           "gene"           
## [25] "gds"             "geoprofiles"     "homologene"     
## [28] "medgen"          "mesh"            "ncbisearch"     
## [31] "nlmcatalog"      "omim"            "orgtrack"       
## [34] "pmc"             "popset"          "probe"          
## [37] "proteinclusters" "pcassay"         "biosystems"     
## [40] "pccompound"      "pcsubstance"     "pubmedhealth"   
## [43] "seqannot"        "snp"             "sra"            
## [46] "taxonomy"        "biocollections"  "unigene"        
## [49] "gencoll"         "gtr"

Each of these databases is configured to receive complex queries based on specific fields. You can view a given databases “searchable fields” with entrez_db_searchable().

entrez_db_searchable(db = "snp")

## Searchable fields for database 'snp'
##   ALL     All terms from all searchable fields 
##   UID     Unique number assigned to publication 
##   FILT    Limits the records 
##   RS      Clustered SNP ID (rs) 
##   CHR     chromosomes 
##   GENE    locus link symbol 
##   HAN     Submitter Handle 
##   ACCN    nucleotide accessions 
##   LLID    locus link UID 
##   ORGN    Organism 
##   FXN     Function class 
##   GTYP    Genotype info 
##   NREF    SNP not mapped to reference assembly 
##   HETZ    Heterozygosity 
##   MPWT    Map weight 
##   VALI    Validation status 
##   SRAT    Success rate 
##   CBID    Original Build ID 
##   UBID    Update Build ID 
##   PDAT    SNP Publication date 
##   MDAT    SNP modification date 
##   PCLS    Population classification based on geographic location 
##   MCLS    Assay Method 
##   SS      Submitter ID 
##   SID     Local SNP ID 
##   VARI    Allele 
##   SCLS    SNP class 
##   GDSC    description of gene 
##   CPOS    Chromosome base position 
##   GPOS    Contig base position 
##   WORD    Free text associated with record 
##   WTAA    Reference Amino Acid 
##   MTAA    Variant or Mutant Amino Acid 
##   RSNP    Reference SNP 
##   SIDX    SNP Index 
##   ALOR    Allele originated from somatic or germline 
##   SUSP    Variation suspected to be false based on evidence 
##   CLIN    Variations with clinical effects or significances 
##   GMAF    Minor Allele Frequency derived from global population (ie. 1000G)

entrez_db_searchable(db = "clinvar")

## Searchable fields for database 'clinvar'
##   ALL     All terms from all searchable fields 
##   UID     Unique number assigned to variation 
##   FILT    Limits the records 
##   TITL    Constructed from variant and phenotype names 
##   WORD    Free text associated with record 
##   ORGN    scientific and common names of organism 
##   MDAT    The last date on which the record was updated 
##   CHR     Chromosome number or numbers; also 'mitochondrial', 'unknown' properties 
##   GENE    Symbol or symbols of the gene 
##   MIM     MIM number from OMIM 
##   DIS     Diseases or traits associated with this record 
##   ACCN    Accession of the genotype/phenotype assertion 
##   VRID    Public ID of a variant 
##   TRID    Public identifier for a trait (e.g. CUI, HPO) 
##   PROP    Properties of ClinVar record 
##   CDAT    The date on which this record first appeared 
##   PMID    PubMed ids of accessions linked to the record 
##   GID     Gene ID 
##   TID     taxonomy id 
##   CPOS    Chromosome base position 
##   GFN     Gene full name 
##   SBM     Submitter 
##   VRNM    Names used for this allele 
##   VRTP    Type of sequence change/variant call 
##   MCNS    Consequence of the variation at the molecular level. 
##   RVST    Review status 
##   ALID    Unique identifier assigned to a specific sequence change at a location. 
##   ORIG    Origin 
##   SID     Submitter Batch 
##   CYT     Cytogenetic location computed from sequence coordinates 
##   GTRT    Accession for a test registered in GTR 
##   STNM    Study Name 
##   CMPL    Complexity 
##   C37     Chromosome base position for assembly GRCh37 
##   VLEN    Length of the variant defined as the difference between reference and alternate, except when there is no lenght difference (then the length of the allele itself) 
##   ACCN    Nucleotide or protein accession 
##   HGNC    Identifier for Gene assigned by HUGO Gene Nomenclature Committee (HGNC) 
##   VID     Unique number assigned to variation 
##   IMOD    The latest date for each interpretation of a variant

nb in order to construct thorough, complex queries to these databases, you may be best served starting in a web browser and using the graphical “Advanced Search Builder” for the database of choice⁸.

`entrez_search()`

For this example, let’s assume we want to use PubMed to find records for articles published in PLoS Neglected Tropical Diseases in 2015. The two fields of interest in this case will be publication date (PDAT) and journal (JOUR).

The fundamental unit to retrieving data from any of the NCBI databases is the record ID. In fact, this is one of the first items returned in the query process. We’ll use entrez_search() to begin searching NCBI. At minimum, this function takes two arguments:

The database to search
The search term(s)

The search function goes out to an NCBI database (in this example “pubmed”) and returns the “hits” as IDs. The record itself is not contained in this identifier, but rather will be accessed later using entrez_fetch().

nb we have to store the results of the search as an object in order to access the IDs later. This object is a named list, so we can access pieces of the results with the $ operator followed by the name of the attribute.

res <- entrez_search(db = "pubmed", term = "(PLoS Neglected Tropical Diseases[JOUR] AND 2015[PDAT])")

The “count” indicates the number of total results.

res$count

## [1] 888

The “ids” are identifiers for records and will be useful when fetching the individual XML content for the articles in this case.

res$ids

##  [1] "26727257" "26720725" "26720603" "26720414" "26720278" "26720149"
##  [7] "26720002" "26719978" "26719891" "26719888" "26713732" "26713614"
## [13] "26709922" "26709822" "26709695" "26701750" "26701602" "26697878"
## [19] "26694834" "26684831"

By default, entrez_search() only returns 20 identifiers at a time. To change this, we can modify the “retmax” argument. It may be worth setting this to a high value to make sure as many IDs as possible are returned.

res <- entrez_search(db = "pubmed", term = "(PLoS Neglected Tropical Diseases[JOUR] AND 2015[PDAT])", retmax = 9999, use_history = TRUE)

Note that the count doesn’t change … but there are many more IDs.

Have we returned identifiers for all of the articles in which we’re interested?

length(res$ids) == res$count

## [1] TRUE

`entrez_fetch()`

With the IDs for the articles, we can now access their full records. As its name suggests, entrez_fetch() will fetch the data. We’ll define this to be returned in XML format so we can parse the results.

Note that NCBI does place restricitons on the size of queries. We’ll address how to get around this issue by using web_history (see web_history) … but for now we can just use a subset of the IDs.

recs <- entrez_fetch(db = "pubmed", id = res$ids[1:25], rettype = "xml", parsed = TRUE)

Because we’ve set the “rettype” and asked the results to be parsed, if we were to print this object we’d see nicely formatted XML. From here we can use the XML R package to parse out individual features.

library(XML)

Two functions that could be particularly useful when working with this XML are xpathSApply() and xmlToList():

?xpathSApply
?xmlToList

XML documents are arranged as “trees” or “document object models (DOM)” … these must be traversed in order to find the information of interest. We need to define a starting point in that structure using XPath⁹. The // and / syntax helps point the query to the appropriate context in the document.

xpathSApply() uses base R’s sapply() to iterate over a list and return a vector.

xpathSApply(recs, "//MedlineCitation/Article/ArticleTitle", xmlValue)

Antoher workflow for parsing these results is to use entrez_summary() and extract_from_esummary().

esums <- entrez_summary(db = "pubmed", id = res$ids[1:25])

Again, we’ve used just a subset of the PubMed IDs retrieved from our original entrez_search(). Printing esums will show all of the elements that could be extracted.

esums

## List of  25 esummary records. First record:
## 
##  $`26727257`
## esummary result with 42 items:
##  [1] uid               pubdate           epubdate         
##  [4] source            authors           lastauthor       
##  [7] title             sorttitle         volume           
## [10] issue             pages             lang             
## [13] nlmuniqueid       issn              essn             
## [16] pubtype           recordstatus      pubstatus        
## [19] articleids        history           references       
## [22] attributes        pmcrefcount       fulljournalname  
## [25] elocationid       doctype           srccontriblist   
## [28] booktitle         medium            edition          
## [31] publisherlocation publishername     srcdate          
## [34] reportnumber      availablefromurl  locationlabel    
## [37] doccontriblist    docdate           bookname         
## [40] chapter           sortpubdate       sortfirstauthor

extract_from_esummary() will give us a vector named with the associated PubMed ID:

extract_from_esummary(esums, "title")

##                                                                                                                                                                                         26727257 
##                                                                                 "Correction: Carbohydrate Recognition Specificity of Trans-sialidase Lectin Domain from Trypanosoma congolense." 
##                                                                                                                                                                                         26720725 
##                                                                                                         "Molecular Detection of Schistosome Infections with a Disposable Microfluidic Cassette." 
##                                                                                                                                                                                         26720603 
##                                                                                       "Trichinella spiralis Paramyosin Binds Human Complement C1q and Inhibits Classical Complement Activation." 
##                                                                                                                                                                                         26720414 
##                                                                                                                                         "Social Network Analysis As a Tool for Research Policy." 
##                                                                                                                                                                                         26720278 
##                                                                                                                                         "Quantifying Poverty as a Driver of Ebola Transmission." 
##                                                                                                                                                                                         26720149 
##                                                                                "Glycans from Fasciola hepatica Modulate the Host Immune Response and TLR-Induced Maturation of Dendritic Cells." 
##                                                                                                                                                                                         26720002 
##                                                                                    "Plasmodium malariae Infection Associated with a High Burden of Anemia: A Hospital-Based Surveillance Study." 
##                                                                                                                                                                                         26719978 
##                                                                     "Using Co-authorship Networks to Map and Analyse Global Neglected Tropical Disease Research with an Affiliation to Germany." 
##                                                                                                                                                                                         26719891 
##                                                              "Schistosoma mansoni Egg, Adult Male and Female Comparative Gene Expression Analysis and Identification of Novel Genes by RNA-Seq." 
##                                                                                                                                                                                         26719888 
##                                                                                                                                                   "The 1899 United States Kissing Bug Epidemic." 
##                                                                                                                                                                                         26713732 
##                                                                                        "The Schistosoma mansoni Cytochrome P450 (CYP3050A1) Is Essential for Worm Survival and Egg Development." 
##                                                                                                                                                                                         26713614 
##                                                                     "Pregnancy Outcomes after a Mass Vaccination Campaign with an Oral Cholera Vaccine in Guinea: A Retrospective Cohort Study." 
##                                                                                                                                                                                         26709922 
##                                         "Systematic Review and Meta-analysis of the Impact of Chemical-Based Mollusciciding for Control of Schistosoma mansoni and S. haematobium Transmission." 
##                                                                                                                                                                                         26709822 
##                                                                                 "Cissampelos pareira Linn: Natural Source of Potent Antiviral Activity against All Four Dengue Virus Serotypes." 
##                                                                                                                                                                                         26709695 
##                                                                               "Whole Genome Sequencing of Field Isolates Reveals Extensive Genetic Diversity in Plasmodium vivax from Colombia." 
##                                                                                                                                                                                         26701750 
##                                                                                 "Immucillins ImmA and ImmH Are Effective and Non-toxic in the Treatment of Experimental Visceral Leishmaniasis." 
##                                                                                                                                                                                         26701602 
##                                                                  "A Library of Plasmodium vivax Recombinant Merozoite Proteins Reveals New Vaccine Candidates and Protein-Protein Interactions." 
##                                                                                                                                                                                         26697878 
##                                                                         "Identification and Characterization of Microsatellite Markers Derived from the Whole Genome Analysis of Taenia solium." 
##                                                                                                                                                                                         26694834 
##                                                                                                                                      "The Risk of Nosocomial Transmission of Rift Valley Fever." 
##                                                                                                                                                                                         26684831 
##                                                                                             "Genomic and Proteomic Studies on the Mode of Action of Oxaboroles against the African Trypanosome." 
##                                                                                                                                                                                         26684756 
## "Generation of a Novel Bacteriophage Library Displaying scFv Antibody Fragments from the Natural Buffalo Host to Identify Antigens from Adult Schistosoma japonicum for Diagnostic Development." 
##                                                                                                                                                                                         26678393 
##                                                                                                                      "Earth Observation, Spatial Data Quality, and Neglected Tropical Diseases." 
##                                                                                                                                                                                         26678263 
##                                                                                                         "A Spatiotemporal Database to Track Human Scrub Typhus Using the VectorMap Application." 
##                                                                                                                                                                                         26671573 
##                                                                                                                         "Peridomestic Infection as a Determining Factor of Dengue Transmission." 
##                                                                                                                                                                                         26659387 
##                                         "Characterization of the Paracoccidioides Hypoxia Response Reveals New Insights into Pathogenesis Mechanisms of This Important Human Pathogenic Fungus."

`entrez_link()`

One of the most fundamental advantages to using NCBI databases is the ability to connect data across repositories. This idea of “linking” is available in rentrez via the entrez_link() function.

To demonstrate how this works, let’s say we want to retrieve sequence data for Aedes aegypti, which is the mosquito vector for Zika, Dengue fever and yellow fever. The first step is to retrieve the correct organism ID for the “yellow fever mosquito”:

yfm <- entrez_search(db = "taxonomy", term = "yellow fever mosquito")
yfm$ids

## [1] "7159"

If we were interested in the taxonomic record for this species of mosquito, then we could run entrez_fetch() on the “taxonomy” database with this ID. However, since the goal is to retrieve sequences, we have to fetch from NCBI’s “nuccore” nucleotide database. But that will require a linkage of first taxonomy to genome, then genome to nuccore:

yfmlinks <- entrez_link(dbfrom = "taxonomy", id = yfm$ids, db = "genome")
genlinkid <- yfmlinks$links$taxonomy_genome

yfmlinks2 <- entrez_link(dbfrom = "genome", id = genlinkid, db = "nuccore")
nuclinkid <- yfmlinks2$links$genome_nuccore

nb the entrez_link() function requires a single “dbfrom” argument. However, for the “db” argument you can pass either pass a single database or specify “all” to retrieve link IDs from all databases that share links with the record(s) of interest.

If you want to see which databases could possibly have links from a specific database, use entrez_db_links():

entrez_db_links(db = "taxonomy")

## Databases with linked records for database 'taxonomy'
##  [1] assembly        bioproject      biosample       biosystems     
##  [5] books           cdd             clone           dbvar          
##  [9] gds             gene            genome          geoprofiles    
## [13] homologene      mesh            nucest          nucgss         
## [17] nuccore         pcassay         pccompound      pcsubstance    
## [21] pmc             popset          probe           protein        
## [25] protein         proteinclusters pubmed          pubmed         
## [29] snp             sra             structure       unigene

There are 12 records for Aedes aegypti in NCBI’s nucleotide database … entrez_fetch() can pull down each in fasta format:

yfmfasta <- entrez_fetch(db = "nuccore", id = nuclinkid, rettype = "fasta")

strsplit(yfmfasta, ">")[[1]][1]

Other Features

rentrez includes a number of features that extend the search -> link -> fetch workflow.

web_history

As mentioned above, NCBI places limits on large queries to its databases. However, the servers allow users to store long lists of IDs as “web history” objects. When we ran our original entrez_search() we toggled “use_history” to TRUE. Because we did that, our original search object contains a “web_history” element.

res$web_history

## Web history object (QueryKey = 1, WebEnv = NCID_1_31327...)

So if we wanted to retrieve all records from our original search, regardless of how many there are, we can pass the web history object in the “web_history” argument to entrez_fetch() rather than a list of IDs.

recs <- entrez_fetch(db = "pubmed", web_history = res$web_history, rettype = "xml", parsed = TRUE)

Did we get all of the titles?

plosidtitles <- xpathSApply(recs, "//MedlineCitation/Article/ArticleTitle", xmlValue)

length(plosidtitles) == res$count

`entrez_post()`

Another way to store IDs on NCBI servers is with the entrez_post() function. This feature effectively also creats a web history object, but does so by explicitly using a POST protocol¹⁰.

res <- entrez_search(db = "clinvar", term= "atherosclerosis[Disease/Phenotype]")
postids <- entrez_post(db = "clinvar", res$ids)
postids

## Web history object (QueryKey = 1, WebEnv = NCID_1_27120...)

`entrez_info()`

For metadata about a specific database use entrez_info():

entrez_info(db = "mesh")

## <?xml version="1.0" encoding="UTF-8"?>
## <!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
## <eInfoResult>
##   <DbInfo>
##     <DbName>mesh</DbName>
##     <MenuName>MeSH</MenuName>
##     <Description>MeSH Database</Description>
##     <DbBuild>Build171017-0315.1</DbBuild>
##     <Count>273016</Count>
##     <LastUpdate>2017/10/17 03:52</LastUpdate>
##     <FieldList>
##       <Field>
##         <Name>ALL</Name>
##         <FullName>All Fields</FullName>
##         <Description>All terms from all searchable fields</Description>
##         <TermCount>2335278</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>N</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>N</IsHidden>
##       </Field>
##       <Field>
##         <Name>UID</Name>
##         <FullName>UID</FullName>
##         <Description>Unique number assigned to publication</Description>
##         <TermCount>0</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>Y</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>Y</IsHidden>
##       </Field>
##       <Field>
##         <Name>FILT</Name>
##         <FullName>Filter</FullName>
##         <Description>Limits the records</Description>
##         <TermCount>16</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>Y</IsHidden>
##       </Field>
##       <Field>
##         <Name>TN</Name>
##         <FullName>Tree Number</FullName>
##         <Description>Tree Number</Description>
##         <TermCount>302401</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>Y</IsHidden>
##       </Field>
##       <Field>
##         <Name>MESH</Name>
##         <FullName>MeSH Terms</FullName>
##         <Description>MeSH Terms</Description>
##         <TermCount>216343</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>N</IsHidden>
##       </Field>
##       <Field>
##         <Name>SUBS</Name>
##         <FullName>Substance Name</FullName>
##         <Description>Substance Name</Description>
##         <TermCount>582168</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>N</IsHidden>
##       </Field>
##       <Field>
##         <Name>WORD</Name>
##         <FullName>Text Word</FullName>
##         <Description>Free text</Description>
##         <TermCount>1029749</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>N</IsHidden>
##       </Field>
##       <Field>
##         <Name>ALSO</Name>
##         <FullName>See Also</FullName>
##         <Description>See Also</Description>
##         <TermCount>9769</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>Y</IsHidden>
##       </Field>
##       <Field>
##         <Name>PREV</Name>
##         <FullName>Previous Indexing</FullName>
##         <Description>Previous Indexing</Description>
##         <TermCount>8392</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>Y</IsHidden>
##       </Field>
##       <Field>
##         <Name>NOTE</Name>
##         <FullName>Scope Note</FullName>
##         <Description>Scope Note</Description>
##         <TermCount>446780</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>N</IsHidden>
##       </Field>
##       <Field>
##         <Name>REG</Name>
##         <FullName>Registry Number</FullName>
##         <Description>Registry Number</Description>
##         <TermCount>100605</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>N</IsHidden>
##       </Field>
##       <Field>
##         <Name>MULT</Name>
##         <FullName>Multi</FullName>
##         <Description>Multi</Description>
##         <TermCount>798468</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>Y</IsHidden>
##       </Field>
##       <Field>
##         <Name>TYPE</Name>
##         <FullName>Record Type</FullName>
##         <Description>Record type - main heading, subheading, pharmacological action, substance name, publication type</Description>
##         <TermCount>6</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>N</IsHidden>
##       </Field>
##       <Field>
##         <Name>MHUI</Name>
##         <FullName>MeSH Unique ID</FullName>
##         <Description>NLM MeSH Browser Unique ID</Description>
##         <TermCount>272456</TermCount>
##         <IsDate>N</IsDate>
##         <IsNumerical>N</IsNumerical>
##         <SingleToken>Y</SingleToken>
##         <Hierarchy>N</Hierarchy>
##         <IsHidden>N</IsHidden>
##       </Field>
##     </FieldList>
##     <LinkList>
##       <Link>
##         <Name>mesh_gap</Name>
##         <Menu>dbGaP Links</Menu>
##         <Description>dbGaP Links</Description>
##         <DbTo>gap</DbTo>
##       </Link>
##       <Link>
##         <Name>mesh_medgen</Name>
##         <Menu>MedGen</Menu>
##         <Description>Related information in MedGen</Description>
##         <DbTo>medgen</DbTo>
##       </Link>
##       <Link>
##         <Name>mesh_pccompound</Name>
##         <Menu>PubChem Compound Links</Menu>
##         <Description>Related PubChem Compound with MeSH</Description>
##         <DbTo>pccompound</DbTo>
##       </Link>
##       <Link>
##         <Name>mesh_taxonomy</Name>
##         <Menu>Taxonomy Links</Menu>
##         <Description>Related Organisms</Description>
##         <DbTo>taxonomy</DbTo>
##       </Link>
##     </LinkList>
##   </DbInfo>
## </eInfoResult>
##

nb this information comes back as XML, so you’ll have to use XPATH parsing to extract specific fields.

Acknowledgments

Many of the examples and workflows presented in this lesson are heavily inspired by the developer of rentrez, David Winter, and his awesome vignette for the package¹¹.

Harvesting Data From NCBI

Introduction

entrez_dbs()

entrez_search()

entrez_fetch()

entrez_link()