---
title: "Enhancing Chemical Data Access with PubChemR: A Guide to Utilizing PUG View Service"
author: "Selcuk Korkmaz, Bilge Eren Yamasan, Dincer Goksuluk"
date: "`r Sys.Date()`"
output:
  html_document:
    theme: cerulean
    toc: true
    toc_float:
      collapsed: true
vignette: >
  %\VignetteIndexEntry{Enhancing Chemical Data Access with PubChemR: A Guide to Utilizing PUG View Service}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  class.output="scroll-100",
  cache.path = "cached/"
)
library(PubChemR)
```

```{css, echo=FALSE}
pre {
  max-height: 300px;
  overflow-y: auto;
}

pre[class] {
  max-height: 300px;
}
```

```{css, echo=FALSE}
.scroll-100 {
  max-height: 100px;
  overflow-y: auto;
  background-color: inherit;
}
```

## 1. Introduction

The PubChemR package provides the `get_pug_view` function for accessing comprehensive summary reports and additional information not typically found in the primary PubChem Substance, Compound, or BioAssay records. It relies on PUG View, a REST-style web service that PubChem primarily uses to generate the detailed summaries shown on its database record web pages and that can also function as a standalone programmatic web service. PUG View is designed to provide complete summary reports on individual PubChem records, in contrast to the PUG REST service, which delivers smaller pieces of information about one or more PubChem records.

The `get_pug_view` function sends requests to the PubChem PUG View API, allowing users to retrieve various types of data for a given identifier, including annotations and QR codes. It supports multiple output formats such as JSON and SVG, which makes it suitable for users who need comprehensive information from PubChem's database.
This vignette explains the structure and usage of the PUG View service and offers illustrative use cases to help new users understand how it operates and how to construct effective requests.

## 2. Accessing PUG View with `get_pug_view`

PUG View offers a versatile way to access structured information from PubChem's extensive database. The sections below show how to use the `get_pug_view` function in R to access various data formats and specific record summaries.

### 2.1. Data Formats

**Multiple Formats:** PUG View supports various formats. For example, to retrieve JSON data:

```{r}
result <- get_pug_view(annotation = "data", identifier = "1234", domain = "compound", output = "JSON")
```

This code sends a request to the PubChem database for the compound with the identifier "1234" (Gallopamil). The `annotation` parameter is set to "data", indicating that we want the compound data.

```{r}
result
```

The response provides an overview of the PUG View data for the requested compound, including the domain, annotation, identifier, and available sections.

The `retrieve()` function can be used to extract data from specific sections.

To retrieve the record type:

```{r}
retrieve(object = result, .slot = "RecordType", .to.data.frame = FALSE)
```

This code extracts the record type, which is "CID" (Compound Identifier).

To retrieve the record number:

```{r}
retrieve(object = result, .slot = "RecordNumber", .to.data.frame = FALSE)
```

This code extracts the record number, which is "1234".

To retrieve the record title:

```{r}
retrieve(object = result, .slot = "RecordTitle", .to.data.frame = FALSE)
```

This code extracts the record title, which is "Gallopamil".

To retrieve sections of the data:

```{r}
section <- retrieve(object = result, .slot = "Section", .to.data.frame = FALSE)
section
```

This code retrieves the sections of the data available for the compound. The printed output gives an overview of the number of sections and their headings.
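
The nesting behind these slots can be pictured with a small, self-contained sketch. It does not call PubChem or PubChemR: `mock_record` is a hand-built list that imitates the record layout described above (record-level slots plus nested `Section` entries with a `TOCHeading`), and `list_headings()` is a hypothetical helper written only for this illustration. The "S1", "S2", ... labels are an assumption about how section IDs map to positions.

```r
# Hand-built stand-in for a parsed PUG View record (not real PubChem output).
mock_record <- list(
  RecordType = "CID",
  RecordNumber = 1234,
  RecordTitle = "Gallopamil",
  Section = list(
    list(
      TOCHeading = "Structures",
      Section = list(
        list(TOCHeading = "2D Structure"),
        list(TOCHeading = "3D Conformer")
      )
    ),
    list(TOCHeading = "Names and Identifiers")
  )
)

# Hypothetical helper (not part of PubChemR): list the headings one level
# below a node and label them by position as "S1", "S2", ...
list_headings <- function(node) {
  headings <- vapply(node$Section, function(s) s$TOCHeading, character(1))
  data.frame(
    SectionID = paste0("S", seq_along(headings)),
    Headings = headings,
    stringsAsFactors = FALSE
  )
}

# Headings at the top level of the record
top_level <- list_headings(mock_record)
print(top_level)

# Headings one level down, inside the first section
structures_level <- list_headings(mock_record$Section[[1]])
print(structures_level)
```

In this mock, calling `list_headings()` on `mock_record$Section[[1]]` plays the role of descending into one section and listing its subsections; real PUG View records follow the same pattern but carry many more slots per section.
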
To list the available sections:

```{r}
sectionList(section)
```

This code lists all the sections available for the compound data, along with their section IDs and headings.

**S1: Structures**

After retrieving the main sections of the compound data, you can extract detailed information about specific sections. Here, we focus on the "Structures" section (S1).

To retrieve the "Structures" section:

```{r}
structures <- section(object = result, .id = "S1")
```

This code retrieves the "Structures" section of the compound data. The output provides an overview of the section, including its Table of Contents (TOC) heading, description, and subsections.

```{r}
structures
```

To extract the Table of Contents (TOC) heading of the "Structures" section:

```{r}
retrieve(object = structures, .slot = "TOCHeading")
```

This code retrieves the TOC heading, which is "Structures".

To extract the description of the "Structures" section:

```{r}
retrieve(object = structures, .slot = "Description")
```

This code retrieves the description of the section, which summarizes the types of structures included.

To extract the subsections within the "Structures" section:

```{r}
retrieve(object = structures, .slot = "Section")
```

This code retrieves the details of the subsections within the "Structures" section. There are two subsections: "2D Structure" and "3D Conformer".

**Subsection: 2D Structure**

The "2D Structure" subsection provides a two-dimensional representation of the compound:

**TOCHeading:** "2D Structure"

**Description:** A two-dimensional (2D) structure representation of the compound. This structure is processed through chemical structure standardization, meaning it may not be identical to structures provided by individual data contributors.

**URL:** https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0293-8

**Subsection: 3D Conformer**

The "3D Conformer" subsection provides a three-dimensional representation of the compound:

**TOCHeading:** "3D Conformer"

**Description:** A three-dimensional (3D) structure representation of the compound. This structure is computed by PubChem and aims to generate a protein-bound structure, which may differ from the inherent structure in vacuum or gas phase.

**URL:** https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-5-1

Together, these slots cover the structure-related information that PubChem provides for the compound.

### 2.2. Record Summaries

**Specific Headings:** Retrieve data under specific headings for more targeted information. For instance, to get experimental properties:

```{r}
result <- get_pug_view(annotation = "data", identifier = "2244", domain = "compound", output = "JSON", heading = "Experimental Properties")
```

This code requests detailed data for the compound identified by "2244" under the heading "Experimental Properties". The output format specified is JSON. The resulting object, `result`, contains the following details:

```{r}
result
```

The output shows the "Chemical and Physical Properties" section, with references and details specific to the compound identified as "2244".

To extract the "Chemical and Physical Properties" section:

```{r}
s1 <- section(object = result, .id = "S1")
```

This code retrieves the "Chemical and Physical Properties" section from the `result` object. The output contains detailed information about this section:

```{r}
s1
```

This section contains the heading, description, and details about "Experimental Properties".

To extract the "Experimental Properties" subsection:

```{r}
s1_s1 <- section(object = s1, .id = "S1")
```

This code retrieves the "Experimental Properties" subsection from the `s1` object.
The output contains detailed information about this subsection:

```{r}
s1_s1
```

To extract a specific experimental property, such as LogP:

```{r}
s1_s1_s10 <- section(object = s1_s1, .id = "S10")
```

This code retrieves the "LogP" section from the `s1_s1` object. The output contains detailed information about this section:

```{r}
s1_s1_s10
```

This section contains the heading, description, and information related to the LogP property.

To extract the information within the "LogP" section:

```{r}
retrieve(object = s1_s1_s10, .slot = "Information", .to.data.frame = FALSE)
```

The code above retrieves detailed information about the "LogP" property of the compound, including references and values. The output shows four entries for the LogP property, each with a reference number, a reference link or description, and the LogP value (either as a number or as a string with markup).

### 2.3. Accessing Different Record Types

**Compounds, Substances, and BioAssays:** Access these records using their respective identifiers (CID, SID, AID). For example, to access a BioAssay record:

```{r}
result <- get_pug_view(annotation = "data", identifier = "1", domain = "assay", output = "JSON")
```

This code requests the full data for the BioAssay identified by "1". The output format specified is JSON. The resulting object, `result`, contains the following details:

```{r}
result
```

The output shows the structure of the data available for the BioAssay, including 10 sections such as "Record Description" and "Description".
To view these sub-sections:

```{r}
sectionList(object = result)
```

The code returns the list of sub-sections for AID 1.

To access a specific sub-section, such as BioAssay Annotations (S10):

```{r}
record_description <- section(object = result, .id = "S10")
```

The output contains detailed information about the BioAssay Annotations section:

```{r}
record_description
```

We can now use the `retrieve()` function to extract the detailed information from the "BioAssay Annotations" section:

```{r}
retrieve(object = record_description, .slot = "Information")
```

This output shows three entries for the BioAssay Annotations, each with a reference number, name, URL (if available), and value.

**Patents, Genes, Proteins, Pathways, Taxonomies, Cell Lines, Elements:** Each of these can be accessed using their specific identifiers or names. For instance, to retrieve information on a specific gene:

```{r}
result <- get_pug_view(annotation = "data", identifier = "1", domain = "gene", output = "JSON")
```

This code requests full data for the gene identified by "1". The output format specified is JSON. The resulting object, `result`, contains the following details:

```{r}
result
```

The output shows the structure of the data available for the gene, including 13 sections such as "Record Description" and "Gene Information".

To retrieve the title of the gene record:

```{r}
retrieve(object = result, .slot = "RecordTitle")
```

This output shows that the title of the gene identified by "1" is "A1BG - alpha-1-B glycoprotein (human)".

To list the sub-sections for gene ID 1:

```{r}
sectionList(object = result)
```

These examples illustrate how to access different types of records, view sub-sections, and extract specific information from PubChem using PUG View.
This approach is applicable to compounds, substances, BioAssays, genes, and other record types supported by PubChem.

### 2.4. Annotations

**Access by Heading:** Retrieve specific types of information across PubChem's databases. For example, to get all experimental viscosity measurements:

```{r}
result <- get_pug_view(annotation = "annotations", identifier = "Viscosity", domain = "heading", output = "JSON")
```

This code requests all annotations related to viscosity from the PubChem database. The output format specified is JSON. The resulting object, `result`, contains the following details:

```{r}
result
```

To extract the annotations:

```{r}
annotations <- retrieve(object = result, .slot = "Annotation", .to.data.frame = FALSE)
```

To preview the first few viscosity annotations:

```{r}
head(annotations)
```

The output shows detailed information for each annotation entry, including the source, name, description, reference, and values.

Equivalently, the heading can be passed through the `heading` argument, which is useful when the heading contains special characters that are not compatible with URL syntax:

```{r}
result <- get_pug_view(annotation = "annotations", identifier = NULL, domain = "heading", output = "JSON", heading = "Viscosity")
result
```

**Specify Heading Type:** For headings that occur in more than one record type, specify the heading type. For example, boiling point for compounds:

```{r}
result <- get_pug_view(annotation = "annotations", identifier = "Boiling Point", domain = "heading", output = "JSON", headingType = "Compound")
```

This code retrieves all annotations related to the boiling point of compounds.
The resulting object, `result`, contains detailed data:

```{r}
result
```

To extract the annotations:

```{r}
annotations <- retrieve(object = result, .slot = "Annotation", .to.data.frame = FALSE)
```

To preview the first few boiling point annotations:

```{r}
head(annotations)
```

The extracted data contain detailed boiling point information for compounds, along with references and source details.

These examples illustrate how the PUG View service can retrieve specific annotations by heading and heading type.

**Pagination:** When accessing data from PubChem, the results may span multiple pages. In such cases, use the `page` parameter to navigate through the pages.

To access annotations related to CAS on the 10th page, you can use the following code:

```{r}
result <- get_pug_view(annotation = "annotations", identifier = "CAS", domain = "heading", output = "JSON", page = "10")
```

This code retrieves the 10th page of annotations related to CAS. The resulting object, `result`, contains detailed data about the annotations on that page.

```{r}
result
```

To extract the annotations, use the `retrieve()` function:

```{r}
annotations <- retrieve(object = result, .slot = "Annotation", .to.data.frame = FALSE)
```

Calling `head(annotations)` gives a preview of the annotations and their detailed structure:

```{r}
head(annotations)
```

The extracted data provide detailed information on each annotation, including the source, name, description, and CAS number, along with relevant references and linked records.

This approach lets you access specific annotations even when the data span multiple pages.

### 2.5. Source Categories

**Depositors and SIDs:** List all depositors for a given compound, categorized by source type.
The following code retrieves the source category data from the PubChem database for the compound with CID "1234":

```{r}
result <- get_pug_view(annotation = "categories", identifier = "1234", domain = "compound", output = "JSON")
result
```

The next chunk extracts the "Categories" data from the result using the `retrieve()` function and stores it in the `categories` variable.

```{r}
categories <- retrieve(object = result, .slot = "Categories", .to.data.frame = FALSE)
```

The output of `head(categories)` shows a list of categories, where each category contains:

* The name of the category (e.g., "Chemical Vendors")
* A link to a PubChem search for substances in this category
* A list of sources that have deposited data in this category, including:
    * The Substance Identifier (SID)
    * The source's name and website URL
    * The URL of the source's page on PubChem
    * The registry ID of the compound
    * The URL to the detailed record of the source
    * The categories the source belongs to

```{r}
head(categories)
```

### 2.6. Neighbors

**Similar Compounds:** Get a list of compounds with similar structures and associated information:

```{r}
result <- get_pug_view(annotation = "neighbors", identifier = "1234", domain = "compound", output = "JSON")
```

This code fetches data on compounds similar to the compound with CID "1234" and stores it in the `result` variable. The function parameters specify that we are looking for "neighbors" (compounds with similar structures) in the "compound" domain.

```{r}
result
```

The Pug View Details section includes several lists of similar compounds categorized by different types of related information. Each type is followed by a list of IDs corresponding to similar compounds.

```{r}
retrieve(object = result, .slot = "NeighborsOfType", .to.data.frame = FALSE)
```

The output displays lists of similar compounds categorized by various types such as:

* Biological Test Results
* Interactions and Pathways
* Chemical and Physical Properties
* Classification
* Drug and Medication Information
* Identification
* Literature
* Taxonomy
* Patents
* Pharmacology and Biochemistry
* Safety and Hazards
* Toxicity
* Use and Manufacturing
* Associated Disorders and Diseases
* Spectral Information

Each category lists the compound IDs that share similar properties or information with the compound "1234".

### 2.7. Literature

**PubMed URLs:** Retrieve literature associated with a compound, organized by subheading:

```{r}
result <- get_pug_view(annotation = "literature", identifier = "1234", domain = "compound", output = "JSON")
```

This code fetches the literature associated with the compound identified by CID "1234" and stores it in the `result` variable. The function parameters specify that we are looking for "literature" in the "compound" domain.

```{r}
result
```

The Pug View Details section includes the `AllURL` slot, which contains URLs to the relevant literature.

The following code retrieves the detailed information about the literature from the `result` object:

```{r}
retrieve(object = result, .slot = "AllURL", .to.data.frame = FALSE)
```

This URL directs to PubMed and shows the literature related to the compound "Gallopamil" identified by the CID "1234".

### 2.8. Linkout

**NCBI LinkOut Records:** List all LinkOut records for a substance, compound, or assay:

```{r}
result <- get_pug_view(annotation = "linkout", identifier = "1234", domain = "compound", output = "JSON")
result
```

The following code extracts the LinkOut data from the result:

```{r}
retrieve(object = result, .slot = "ObjUrl", .to.data.frame = FALSE)
```

The output shows details such as the URL, subject type, category, attribute, and provider information for each LinkOut record.

### 2.9. PDB/MMDB Structures

**3D Protein Structures:** List 3D protein structures associated with a compound:

```{r}
result <- get_pug_view(annotation = "structure", identifier = "2244", domain = "compound", output = "JSON")
result
```

The following code extracts the 3D structures data from the result:

```{r}
retrieve(object = result, .slot = "Structures", .to.data.frame = FALSE)
```

The extracted structures data include details such as MMDB ID, PDB ID, URLs, descriptions, and taxonomy information for each structure associated with the compound.
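
Each request in this guide corresponds to a PUG View URL made up of an annotation type, a domain, an optional identifier, an output format, and optional query parameters. The sketch below shows that general shape using base R only. `build_pug_view_url()` is a hypothetical helper written for this illustration, not a PubChemR function, and we have not checked how `get_pug_view` assembles its requests internally. The base URL and the `heading` and `page` query names follow our reading of the PubChem PUG View documentation and should be verified there before use.

```r
# Base URL of the PUG View service (assumed from the PubChem documentation).
pug_view_base <- "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view"

# Hypothetical helper (not part of PubChemR): compose a PUG View style URL.
build_pug_view_url <- function(annotation, domain, identifier = NULL,
                               output = "JSON", query = list()) {
  parts <- c(pug_view_base, annotation, domain)
  if (!is.null(identifier)) {
    # Percent-encode the identifier so that spaces and symbols are URL-safe
    parts <- c(parts, utils::URLencode(as.character(identifier), reserved = TRUE))
  }
  url <- paste(c(parts, output), collapse = "/")
  if (length(query) > 0) {
    # Percent-encode each query value and join the pairs as key=value&key=value
    values <- vapply(
      query,
      function(v) utils::URLencode(as.character(v), reserved = TRUE),
      character(1)
    )
    url <- paste0(url, "?", paste(paste0(names(query), "=", values), collapse = "&"))
  }
  url
}

# Full record for a compound
url_record <- build_pug_view_url("data", "compound", "1234")

# Record restricted to one heading (the space is percent-encoded)
url_heading <- build_pug_view_url(
  "data", "compound", "2244",
  query = list(heading = "Experimental Properties")
)

# One page of annotations under a heading
url_page <- build_pug_view_url(
  "annotations", "heading", "CAS",
  query = list(page = 10)
)

print(url_record)
print(url_heading)
print(url_page)
```

Passing the heading as a query value rather than as a path segment mirrors the `identifier = NULL, heading = "Viscosity"` form shown in Section 2.4, where the heading may contain characters that do not fit in a URL path.
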