--- title: "Accessing and Exploring Chemical Data with PubChemR: A Guide to PUG REST Service" author: "Selcuk Korkmaz, Bilge Eren Korkmaz, Dincer Goksuluk" date: "`r Sys.Date()`" output: html_document: toc: true toc_float: collapsed: true vignette: > %\VignetteIndexEntry{Accessing and Exploring Chemical Data with PubChemR: A Guide to PUG REST Service} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", # cache = TRUE, class.output="scroll-100", cache.path = "cached/" ) library(PubChemR) ``` ```{css, echo=FALSE} pre { max-height: 300px; overflow-y: auto; } pre[class] { max-height: 300px; } ``` ```{css, echo=FALSE} .scroll-100 { max-height: 100px; overflow-y: auto; background-color: inherit; } ``` ## 1. Introduction The `PubChemR` package introduces a pivotal function, `get_pug_rest`, designed to facilitate seamless access to the vast chemical data repository of PubChem. This function leverages the capabilities of PubChem's Power User Gateway (PUG) REST service, providing a straightforward and efficient means for users to programmatically interact with PubChem's extensive database. This vignette aims to elucidate the structure and usage of the PUG REST service, offering a range of illustrative use cases to aid new users in understanding its operation and constructing effective requests. ## 2. PUG REST: A Gateway to Chemical Data PUG REST, standing for Power User Gateway RESTful interface, is a simplified access route to PubChem's data and services. It is designed for scripts, web page embedded JavaScript, and third-party applications, eliminating the need for the more complex XML and SOAP envelopes required by other PUG variants. PUG REST's design revolves around the PubChem identifier (SID for substances, CID for compounds, and AID for assays) and is structured into three main request components: input (identifiers), operation (actions on identifiers), and output (desired information format). ### 2.1. Key Features of PUG REST: * **Flexibility and Combinatorial Use:** PUG REST allows a wide range of combinations of inputs, operations, and outputs, enabling users to create customized requests. * **Support for Various Formats:** It supports multiple output formats, including XML, JSON, CSV, PNG, and plain text, catering to different application needs. * **RESTful Design:** The service is based on HTTP requests with details encoded in the URL path, making it RESTful and user-friendly. **Usage Policy:** * PUG REST is not intended for extremely high-volume requests (millions). Users are advised to limit their requests to no more than 5 per second to prevent server overload. * For large data sets, users are encouraged to contact PubChem for optimized query approaches. ## 3. Accessing PUG REST with `get_pug_rest` **Overview** The `get_pug_rest` function in the `PubChemR` package provides a versatile interface to access a wide range of chemical data from the PubChem database. This section of the vignette focuses on various methods to retrieve chemical structure information and other related data using the PUG REST service. The function is designed to be flexible, accommodating different input methods, operations, and output formats. This function sends a request to the PubChem PUG REST API to retrieve various types of data for a given identifier. It supports fetching data in different formats and allows saving the output. ```{r, eval=FALSE} get_pug_rest( identifier = NULL, namespace = "cid", domain = "compound", operation = NULL, output = "JSON", searchtype = NULL, property = NULL, options = NULL, save = FALSE, dpi = 300, path = NULL, file_name = NULL, ... ) ``` **Arguments** * **identifier:** A vector of identifiers for the query, either numeric or character. This is the main input for querying the PubChem database. * **namespace**: A character string specifying the namespace for the request. Default is 'cid'. This defines the type of identifier being used, such as 'cid' (Compound ID), 'sid' (Substance ID), etc. * **domain**: A character string specifying the domain for the request. Default is 'compound'. This indicates the type of entity being queried, such as 'compound', 'substance', etc. * **operation**: An optional character string specifying the operation for the request. This can be used to specify particular actions or methods within the API. * **output**: A character string specifying the output format. Possible values are 'SDF', 'JSON', 'JSONP', 'CSV', 'TXT', and 'PNG'. Default is 'JSON'. This defines the format in which the data will be returned. * **searchtype**: An optional character string specifying the search type. This allows for specifying the nature of the search being performed. * **property**: An optional character string specifying the property for the request. This can be used to filter or specify particular properties of the data being retrieved. * **options**: A list of additional options for the request. This allows for further customization and fine-tuning of the request parameters. * **save**: A logical value indicating whether to save the output as a file or image. Default is FALSE. When set to TRUE, the function will save the retrieved data to a specified file. * **dpi**: An integer specifying the DPI for image output. Default is 300. This is relevant when the output format is an image, determining the resolution of the saved image. * **path**: A character string specifying the directory path where the file will be saved. If not provided, the current working directory is used. * **file_name**: A character string of length 1. Defines the name of the file (without file extension) to save. If NULL, the default file name is set as "files_downloaded". * **...**: Additional arguments to be passed to the request. **Value** The function returns different types of content based on the specified output format: **JSON:** Returns a list. **CSV and TXT:** Returns a data frame. **SDF:** Returns an SDF file of the requested identifier. **PNG:** Returns an image object or saves an image file. ### 3.1. Input Methods In the context of the PubChem PUG REST API, the input methods define how records of interest are specified for a request. There are several ways to define this input, with the most common methods being outlined below: **1. By Identifier:** The most straightforward method to specify input is by using identifiers directly. These identifiers can be Substance IDs (SIDs) or Compound IDs (CIDs). For example, to retrieve the names of a substance with CID 2244, you can use the get_pug_rest function as follows: ```{r} result <- get_pug_rest(identifier = "2244", namespace = "cid", domain = "compound", output = "JSON") result ``` The pubChemData function then processes the result to extract and display the retrieved data. Here's an interpretation of the output for CID 2244: ```{r} pubChemDataResult <- pubChemData(result) ``` The JSON response contains detailed information about the compound identified by CID 2244. The PC_Compounds array holds the compound data, and within it, each element corresponds to a specific compound. For CID 2244, the following information is retrieved: ID: Confirms the compound identifier is CID 2244. ```{r} pubChemDataResult$PC_Compounds[[1]]$id ``` Atoms: Details the atomic composition, with an aid array listing the atom IDs and an element array listing the atomic numbers (e.g., 6 for carbon, 8 for oxygen, 1 for hydrogen). ```{r} pubChemDataResult$PC_Compounds[[1]]$atoms ``` Bonds: Describes the bonds between atoms, including arrays for the IDs of the atoms involved (aid1 and aid2) and the bond order. ```{r} pubChemDataResult$PC_Compounds[[1]]$bonds ``` Coordinates: Provides the spatial coordinates (x and y) for each atom, which can be used to visualize the molecular structure. ```{r} pubChemDataResult$PC_Compounds[[1]]$coords ``` Charge: Indicates the compound's charge, which is 0 in this case. ```{r} pubChemDataResult$PC_Compounds[[1]]$charge ``` Properties: Lists various properties of the compound, including: Compound Complexity: A measure of the molecular complexity. ```{r} pubChemDataResult$PC_Compounds[[1]]$props[[2]] ``` Hydrogen Bond Acceptor/Donor Count: Indicates the number of hydrogen bond acceptors and donors. ```{r} pubChemDataResult$PC_Compounds[[1]]$props[[3]] ``` ```{r} pubChemDataResult$PC_Compounds[[1]]$props[[4]] ``` Rotatable Bond Count: The number of rotatable bonds, which impacts the molecule's flexibility. ```{r} pubChemDataResult$PC_Compounds[[1]]$props[[5]] ``` IUPAC Names: Various standardized names for the compound, such as "2-acetoxybenzoic acid" and "2-acetyloxybenzoic acid". ```{r} pubChemDataResult$PC_Compounds[[1]]$props[[7]] ``` InChI and InChIKey: Standardized identifiers for the chemical structure. ```{r} pubChemDataResult$PC_Compounds[[1]]$props[[13]] ``` ```{r} pubChemDataResult$PC_Compounds[[1]]$props[[14]] ``` Log P: The partition coefficient, indicating the compound's hydrophobicity. ```{r} pubChemDataResult$PC_Compounds[[1]]$props[[15]] ``` Molecular Formula: The chemical formula of the compound, which is C9H8O4. ```{r} pubChemDataResult$PC_Compounds[[1]]$props[[17]] ``` Molecular Weight: The compound's molecular weight, 180.16 g/mol. ```{r} pubChemDataResult$PC_Compounds[[1]]$props[[18]] ``` SMILES: Canonical and isomeric Simplified Molecular Input Line Entry System (SMILES) strings, which are text representations of the chemical structure. Topological Polar Surface Area: A measure of the molecule's surface area that can form hydrogen bonds. ```{r} pubChemDataResult$PC_Compounds[[1]]$props[[19]] ``` For multiple IDs, a vector of IDs can be used. For instance, to retrieve a CSV table of compound properties: ```{r} result <- get_pug_rest(identifier = c("1","2","3","4","5"), namespace = "cid", domain = "compound", property = c("MolecularFormula","MolecularWeight","CanonicalSMILES"), output = "CSV") result ``` The output of this request, when processed with pubChemData, provides a data frame containing the specified properties for each CID: ```{r} pubChemData(result) ``` Each row in the table corresponds to a different CID and lists the requested properties, facilitating easy comparison and further analysis. **2. By Name:** In addition to using direct identifiers, you can refer to a chemical by its common name. This method allows users to search for compounds using familiar names instead of numerical identifiers. It's important to note that a single name might correspond to multiple records in the PubChem database. For example, the name "glucose" can refer to several different compounds or isomers. Here’s how you can retrieve Compound IDs (CIDs) for "glucose": ```{r} result <- get_pug_rest(identifier = "glucose", namespace = "name", domain = "compound", operation = "cids", output = "TXT") result ``` The pubChemData function is then used to process the result and extract the retrieved data. The function retrieves the output in data frame. The output indicates that the search for "glucose" returned a single CID: ```{r} pubChemData(result) ``` This output reveals that the common name "glucose" corresponds to the CID 5793. This CID can then be used in further queries to retrieve detailed information about the compound, such as its molecular structure, properties, and associated bioactivities. Using a common name for searching can simplify the process, especially when the numerical identifiers are not known. However, because a name can map to multiple records, the results might need further filtering or validation to ensure they correspond to the specific compound of interest. **3. By Structure Identity:** Another method to specify a compound in PubChem PUG REST API requests is by using structural identifiers such as SMILES or InChI keys. This approach allows for precise identification of chemical structures by providing a textual representation of the molecule. For example, to retrieve the CID for the SMILES string "CCCC" (which represents butane), you can use the following code: ```{r} result <- get_pug_rest(identifier = "CCCC", namespace = "smiles", domain = "compound", operation = "cids", output = "TXT") result ``` When the pubChemData(result) function is executed, it retrieves the output in data frame. The output indicates that the search for the SMILES string "CCCC" returned a single CID: ```{r} pubChemData(result) ``` This output reveals that the SMILES string "CCCC" corresponds to the CID 7843. This CID can then be used in further queries to gather detailed information about the compound, such as its molecular structure, physical and chemical properties, biological activities, and more. Using structure-based identifiers like SMILES or InChI keys is particularly useful for precise and unambiguous chemical searches, as these identifiers provide a detailed representation of the molecule's structure. This method ensures that the exact compound of interest is identified, reducing the risk of ambiguity that might arise with common names or other identifiers. **4. By Fast (Synchronous) Structure Search:** In PubChem PUG REST API, fast (synchronous) structure search allows for quicker searches by identity, similarity, substructure, and superstructure, often returning results in a single call. This method is efficient for obtaining results quickly and is useful for various types of structural queries. To illustrate, let’s perform a fast identity search for the compound with CID 5793, using the same connectivity option: ```{r} result <- get_pug_rest(identifier = "5793", namespace = "cid", domain = "compound", operation = "cids", output = "TXT", searchtype = "fastidentity", options = list(identity_type = "same_connectivity")) result ``` When the following code executed, it retrieves the output in data frame, listing all CIDs that match the fast identity search criteria. ```{r} pubChemData(result) ``` The output indicates that there are numerous compounds with similar connectivity to the compound identified by CID 5793. Each row in the output represents a different CID that shares the same connectivity pattern as the original compound. This extensive list includes hundreds of CIDs, showcasing the effectiveness of the fast identity search in identifying structurally related compounds quickly. This method is advantageous for researchers needing to identify compounds with similar structures for further study, such as drug development, chemical analysis, or bioactivity screening. The fast response time and comprehensive results make it a valuable tool for various chemical and pharmaceutical applications. **5. By Cross-Reference (XRef):** The cross-reference (XRef) method allows for reverse lookup of records using a cross-reference value. This method is particularly useful for linking external identifiers, such as patent numbers, to records in the PubChem database. For example, to find all SIDs linked to a specific patent identifier, you can use the following code: ```{r} result <- get_pug_rest(identifier = "US20050159403A1", namespace = "xref/PatentID", domain = "substance", operation = "sids", output = "JSON") result ``` pubChemData function retrieves all SIDs that are linked to the specified patent identifier. The output indicates that the specified patent identifier "US20050159403A1" is linked to a large number of SIDs. Each SID represents a different substance that has been referenced in the patent. ```{r} pubChemData(result) ``` This result provides a comprehensive list of substances associated with the specified patent, allowing for further exploration and analysis within the PubChem database. ### 3.2. Available Data Once you’ve specified the records of interest in PUG REST, the next step is to define what information you want to retrieve about these records. PUG REST excels in providing access to specific data points about each record, such as individual properties or cross-references, without the need to download and sift through large datasets. **1. Full Records:** PUG REST allows the retrieval of entire records in various formats like JSON, CSV, TXT, and SDF. For example, to retrieve the record for aspirin (CID 2244) in SDF format: ```{r, eval=FALSE} result <- get_pug_rest(identifier = "2244", namespace = "cid", domain = "compound", output = "SDF") ``` Multiple records can also be requested in a single call, though large lists may be subject to timeouts. **2. Images:** Images of chemical structures can be retrieved by specifying PNG format. This works with various input methods, including chemical names, SMILES strings, and InChI keys. For example, to get an image for the chemical name "lipitor": ```{r} get_pug_rest(identifier = "lipitor", namespace = "name", domain = "compound", output = "PNG") ``` **3. Compound Properties:** Pre-computed properties for PubChem compounds are accessible individually or in tables. For instance, to get the molecular weight of a compound: ```{r} result <- get_pug_rest(identifier = "2244", namespace = "cid", domain = "compound", property = "MolecularWeight", output = "TXT") result ``` ```{r} pubChemData(result) ``` Or to retrieve a CSV table of multiple compounds and properties: ```{r} result <- get_pug_rest(identifier = c("1","2","3","4","5"), namespace = "cid", domain = "compound", property = c("MolecularWeight", "MolecularFormula", "HBondDonorCount", "HBondAcceptorCount", "InChIKey", "InChI"), output = "CSV") ``` ```{r} pubChemData(result) ``` **4. Synonyms:** To view all synonyms of a compound, such as Vioxx: ```{r} result <- get_pug_rest(identifier = "vioxx", namespace = "name", domain = "compound", operation = "synonyms", output = "JSON") result ``` ```{r} pubChemData(result) ``` **5. Cross-References (XRefs):** PUG REST provides access to various cross-references. For example, to retrieve MMDB identifiers for protein structures containing aspirin: ```{r} result <- get_pug_rest(identifier = "2244", namespace = "cid", domain = "compound", operation = c("xrefs","MMDBID"), output = "JSON") result ``` ```{r} pubChemData(result) ``` Or to find all patent identifiers associated with a given SID: ```{r} result <- get_pug_rest(identifier = "137349406", namespace = "sid", domain = "substance", operation = c("xrefs","PatentID"), output = "TXT") result ``` ```{r} pubChemData(result) ``` These examples illustrate the versatility of PUG REST in fetching specific data efficiently. It's an ideal tool for users who need quick access to particular pieces of information from the vast PubChem database without the overhead of processing bulk data. ### 3.3. BioAssays PubChem BioAssays are complex entities containing a wealth of data. PUG REST provides access to both complete assay records and specific components of BioAssay data, allowing users to efficiently retrieve the information they need. **1. Assay Description:** To obtain the description section of a BioAssay, which includes authorship, general description, protocol, and data readout definitions, use a request like: ```{r} result <- get_pug_rest(identifier = "504526", namespace = "aid", domain = "assay", operation = "description", output = "JSON") result ``` ```{r} pubChemData(result) ``` For a simplified summary format that includes target information, active and inactive SID and CID counts: ```{r} result <- get_pug_rest(identifier = "1000", namespace = "aid", domain = "assay", operation = "summary", output = "JSON") result ``` ```{r} pubChemData(result) ``` **2. Assay Data:** To retrieve the entire data set of an assay in CSV format: ```{r} result <- get_pug_rest(identifier = "504526", namespace = "aid", domain = "assay", output = "CSV") result ``` ```{r} pubChemData(result) ``` For a subset of data rows, specify the SIDs: ```{r} result <- get_pug_rest(identifier = "504526", namespace = "aid", domain = "assay", operation = "JSON?sid=104169547,109967232", output = "JSON") result ``` ```{r} result <- pubChemData(result) result$PC_AssaySubmit$assay ``` For concise data (e.g., active concentration readout) with additional information: ```{r} result <- get_pug_rest(identifier = "504526", namespace = "aid", domain = "assay", operation = "concise", output = "JSON") result ``` ```{r} pubChemData(result) ``` For dose-response curve data: ```{r} result <- get_pug_rest(identifier = "504526", namespace = "aid", domain = "assay", operation = "doseresponse/CSV?sid=104169547,109967232", output = "CSV") result ``` ```{r} pubChemData(result) ``` **3. Targets:** To retrieve assay targets, including protein or gene identifiers: ```{r} result <- get_pug_rest(identifier = c("490","1000"), namespace = "aid", domain = "assay", operation = "targets/ProteinGI,ProteinName,GeneID,GeneSymbol", output = "JSON") result ``` ```{r} pubChemData(result) ``` To select assays via target identifier: ```{r} result <- get_pug_rest(identifier = "USP2", namespace = "target/genesymbol", domain = "assay", operation = "aids", output = "TXT") result ``` ```{r} pubChemData(result) ``` **4. Activity Name:** To select BioAssays by the name of the primary activity column: ```{r} result <- get_pug_rest(identifier = "EC50", namespace = "activity", domain = "assay", operation = "aids", output = "JSON") result ``` ```{r} pubChemData(result) ``` These examples demonstrate the flexibility of PUG REST in accessing specific BioAssay data. Users can efficiently retrieve detailed descriptions, comprehensive data sets, concise readouts, and target information, making it a valuable tool for researchers and scientists working with BioAssay data. ### 3.4. Genes PubChem provides various methods to access gene data, making it a valuable resource for genetic research. Here's how you can utilize PUG REST to access gene-related information: **1. Gene Input Methods:** * **By Gene ID:** Access gene data using NCBI Gene identifiers. For example, to get a summary for gene IDs 1956 and 13649 in JSON format: ```{r} result <- get_pug_rest(identifier = "1956,13649", namespace = "geneid", domain = "gene", operation = "summary", output = "JSON") result ``` ```{r} pubChemData(result) ``` * **By Gene Symbol:** Use the official gene symbol, which often maps to multiple genes. For example, to access data for the EGFR gene symbol: ```{r} result <- get_pug_rest(identifier = "EGFR", namespace = "genesymbol", domain = "gene", operation = "summary", output = "JSON") result ``` ```{r} pubChemData(result) ``` * **By Gene Synonym:** Access gene data using synonyms like alternative names. For example, for the ERBB1 synonym: ```{r} result <- get_pug_rest(identifier = "ERBB1", namespace = "synonym", domain = "gene", operation = "summary", output = "JSON") result ``` ```{r} pubChemData(result) ``` **2. Available Gene Data:** * **Gene Summary:** Returns a summary including GeneID, Symbol, Name, TaxonomyID, Description, and Synonyms. For example: ```{r} result <- get_pug_rest(identifier = "1956,13649", namespace = "geneid", domain = "gene", operation = "summary", output = "JSON") result ``` ```{r} pubChemData(result) ``` * **Assays from Gene:** Retrieves a list of AIDs tested against a specific gene. For example, for gene ID 13649: ```{r} result <- get_pug_rest(identifier = "13649", namespace = "geneid", domain = "gene", operation = "aids", output = "TXT") result ``` ```{r} pubChemData(result) ``` * **Bioactivities from Gene:** Returns concise bioactivity data for a specific gene. For example: ```{r} result <- get_pug_rest(identifier = "13649", namespace = "geneid", domain = "gene", operation = "concise", output = "JSON") result ``` ```{r} pubChemData(result) ``` * **Pathways from Gene:** Provides a list of pathways involving a specific gene. For example, for gene ID 13649: ```{r} result <- get_pug_rest(identifier = "13649", namespace = "geneid", domain = "gene", operation = "pwaccs", output = "TXT") result ``` ```{r} pubChemData(result) ``` These methods offer a comprehensive way to access and analyze gene-related data in PubChem, catering to various research needs in genetics and molecular biology. ### 3.5. Proteins PubChem provides a versatile platform for accessing detailed protein data, essential for researchers in biochemistry and molecular biology. Here's how you can utilize PUG REST for protein-related queries: **1. Protein Input Methods:** * **By Protein Accession:** Use NCBI Protein accessions to access protein data. For example, to get a summary for protein accessions P00533 and P01422 in JSON format: ```{r} result <- get_pug_rest(identifier = "P00533,P01422", namespace = "accession", domain = "protein", operation = "summary", output = "JSON") result ``` ```{r} pubChemData(result) ``` **By Protein Synonym:** Access protein data using synonyms or identifiers from external sources. For example, using a ChEMBL ID: ```{r} result <- get_pug_rest(identifier = "ChEMBL:CHEMBL203", namespace = "synonym", domain = "protein", operation = "summary", output = "JSON") result ``` ```{r} pubChemData(result) ``` **2. Available Protein Data:** * **Protein Summary:** Returns a summary including ProteinAccession, Name, TaxonomyID, and Synonyms. For example: ```{r} result <- get_pug_rest(identifier = "P00533,P01422", namespace = "accession", domain = "protein", operation = "summary", output = "JSON") result ``` ```{r} pubChemData(result) ``` **Assays from Protein:** Retrieves a list of AIDs tested against a specific protein. For example, for protein accession P00533: ```{r} result <- get_pug_rest(identifier = "P00533", namespace = "accession", domain = "protein", operation = "aids", output = "TXT") result ``` ```{r} pubChemData(result) ``` **Bioactivities from Protein:** Returns concise bioactivity data for a specific protein. For example: ```{r} result <- get_pug_rest(identifier = "Q01279", namespace = "accession", domain = "protein", operation = "concise", output = "JSON") result ``` ```{r} pubChemData(result) ``` **Pathways from Protein:** Provides a list of pathways involving a specific protein. For example, for protein accession P00533: ```{r} result <- get_pug_rest(identifier = "P00533", namespace = "accession", domain = "protein", operation = "pwaccs", output = "TXT") result ``` ```{r} pubChemData(result) ``` These methods offer a comprehensive approach to accessing and analyzing protein-related data in PubChem, supporting a wide range of research applications in the fields of biochemistry, molecular biology, and pharmacology. ### 3.6. Pathways PubChem's PUG REST service offers a detailed and comprehensive approach to accessing pathway information, crucial for researchers in fields like bioinformatics, pharmacology, and molecular biology. Here's how to utilize PUG REST for pathway-related queries: **1. Pathway Input Methods:** By Pathway Accession: The primary method to access pathway data in PubChem. Pathway Accession is formatted as Source:ID. For example, to get a summary for the Reactome pathway R-HSA-70171 in JSON format: ```{r} result <- get_pug_rest(identifier = "Reactome:R-HSA-70171", namespace = "pwacc", domain = "pathway", operation = "summary", output = "JSON") result ``` ```{r} pubChemData(result) ``` **2. Available Pathway Data:** **Pathway Summary:** Returns a summary including PathwayAccession, SourceName, Name, Type, Category, Description, TaxonomyID, and Taxonomy. For example: ```{r} result <- get_pug_rest(identifier = "Reactome:R-HSA-70171,BioCyc:HUMAN_PWY-4983", namespace = "pwacc", domain = "pathway", operation = "summary", output = "JSON") result ``` ```{r} pubChemData(result) ``` **Compounds from Pathway:** Retrieves a list of compounds involved in a specific pathway. For example, for the Reactome pathway R-HSA-70171: ```{r} result <- get_pug_rest(identifier = "Reactome:R-HSA-70171", namespace = "pwacc", domain = "pathway", operation = "cids", output = "TXT") result ``` ```{r} pubChemData(result) ``` **Genes from Pathway:** Provides a list of genes involved in a specific pathway. For example, for the Reactome pathway R-HSA-70171: ```{r} result <- get_pug_rest(identifier = "Reactome:R-HSA-70171", namespace = "pwacc", domain = "pathway", operation = "geneids", output = "TXT") result ``` ```{r} pubChemData(result) ``` **Proteins from Pathway:** Returns a list of proteins involved in a given pathway. For example, for the Reactome pathway R-HSA-70171: ```{r} result <- get_pug_rest(identifier = "Reactome:R-HSA-70171", namespace = "pwacc", domain = "pathway", operation = "accessions", output = "TXT") result ``` ```{r} pubChemData(result) ``` These methods offer a streamlined and efficient way to access and analyze pathway-related data in PubChem, supporting a wide range of research applications in bioinformatics, molecular biology, and related fields. ### 3.7. Taxonomies PubChem's PUG REST service provides a comprehensive approach to accessing taxonomy information, essential for researchers in fields like biology, pharmacology, and environmental science. Here's how to utilize PUG REST for taxonomy-related queries: **1. Taxonomy Input Methods:** **By Taxonomy ID:** The primary method to access taxonomy data in PubChem using NCBI Taxonomy identifiers. For example, to get a summary for human (Taxonomy ID 9606) and SARS-CoV-2 (Taxonomy ID 2697049) in JSON format: ```{r} result <- get_pug_rest(identifier = "9606,2697049", namespace = "taxid", domain = "taxonomy", operation = "summary", output = "JSON") result ``` ```{r} pubChemData(result) ``` **By Taxonomy Synonym:** Access taxonomy data using synonyms like scientific or common names. For example, for Homo sapiens: ```{r} result <- get_pug_rest(identifier = "Homo sapiens", namespace = "synonym", domain = "taxonomy", operation = "summary", output = "JSON") result ``` ```{r} pubChemData(result) ``` **2. Available Taxonomy Data:** **Taxonomy Summary:** Returns a summary including TaxonomyID, ScientificName, CommonName, Rank, RankedLineage, and Synonyms. For example: ```{r} result <- get_pug_rest(identifier = "9606,10090,10116", namespace = "taxid", domain = "taxonomy", operation = "summary", output = "JSON") result ``` ```{r} pubChemData(result) ``` **Assays and Bioactivities:** Retrieves a list of assays (AIDs) associated with a given taxonomy. For example, for SARS-CoV-2: ```{r} result <- get_pug_rest(identifier = "2697049", namespace = "taxid", domain = "taxonomy", operation = "aids", output = "TXT") result ``` ```{r} pubChemData(result) ``` To aggregate concise bioactivity data from each AID: ```{r} result <- get_pug_rest(identifier = "1409578", namespace = "aid", domain = "assay", operation = "concise", output = "JSON") result ``` ```{r} pubChemData(result) ``` These methods offer a streamlined and efficient way to access and analyze taxonomy-related data in PubChem, supporting a wide range of research applications in biology, pharmacology, environmental science, and related fields. ### 3.8. Cell Lines PubChem's PUG REST service offers a valuable resource for accessing detailed information about various cell lines, crucial for research in cellular biology, pharmacology, and related fields. Here's how to effectively use PUG REST for cell line-related queries: **1. Cell Line Input Methods:** **By Cell Line Accession:** Utilize Cellosaurus and ChEMBL cell line accessions for precise data retrieval. For example: ```{r} result <- get_pug_rest(identifier = "CHEMBL3308376,CVCL_0045", namespace = "cellacc", domain = "cell", operation = "summary", output = "JSON") result ``` ```{r} pubChemData(result) ``` **By Cell Line Synonym:** Access data using cell line names or other synonyms. For instance, for the HeLa cell line: ```{r} result <- get_pug_rest(identifier = "HeLa", namespace = "synonym", domain = "cell", operation = "summary", output = "JSON") result ``` ```{r} pubChemData(result) ``` **2. Available Cell Line Data:** **Cell Line Summary:** Provides a comprehensive overview including CellAccession, Name, Sex, Category, SourceTissue, SourceTaxonomyID, SourceOrganism, and Synonyms. For example: ```{r} result <- get_pug_rest(identifier = "CVCL_0030,CVCL_0045", namespace = "cellacc", domain = "cell", operation = "summary", output = "JSON") result ``` ```{r} pubChemData(result) ``` **Assays and Bioactivities from Cell Line:** Retrieves a list of assays (AIDs) tested on a specific cell line. For example, for HeLa: ```{r} result <- get_pug_rest(identifier = "HeLa", namespace = "synonym", domain = "cell", operation = "aids", output = "TXT") result ``` ```{r} pubChemData(result) ``` To aggregate concise bioactivity data from each AID: ```{r} result <- get_pug_rest(identifier = "79900", namespace = "aid", domain = "assay", operation = "concise", output = "JSON") result ``` ```{r} pubChemData(result) ``` These methods provide an efficient and targeted approach to access and analyze cell line-related data in PubChem, supporting a wide range of research applications in cellular biology, pharmacology, and related fields.