--- title: "Working with the RefSeq database" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Working with the RefSeq database} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Load the library. ```{r setup} library(refseqR) ``` ### 0. Introduction This vignette shows a tutorial of how I have been using `refseqR` to automate some common processes of my research. The package `refseqR` is built on top of `rentrez`, the excellent library written by **David Winter** to query the NCBI's API and fetch the resulting data. In short, `refseqR` provides summary information at three different levels: * gene * transcript * protein   ### 1. Gene Given a gene symbol/name obtained from the 'Gene' database, the `refseqR` library enables us to retrieve the associated mRNA/transcript and protein accessions. ```{r } GeneID <- c("LOC101512347") transcript <- refseq_fromGene(GeneID, sequence = "transcript") protein <- refseq_fromGene(GeneID, sequence = "protein") ``` The mRNA transcript identifier (id) for `r GeneID` = `r transcript`. The protein identifier (id) for `r GeneID` = `r protein`.   Similarly, the function is effective when utilizing gene symbols that encode for multiple transcripts. ```{r} GeneID <- c("LOC105852298") transcript <- refseq_fromGene(GeneID, sequence = "transcript") protein <- refseq_fromGene(GeneID, sequence = "protein") ``` The mRNA transcript ids. for `r GeneID` = `r transcript`. The protein ids. for `r GeneID` = `r protein`.   The function `refseq_description` returns the sequence description corresponding to a given accession. The identifier (id) can be either a transcript, protein, or Gene identifier. ```{r} id <- c("LOC101512347") refseq_description(id) ```   It is important to note that gene symbols (e.x. "LOC105852298") are not unique, and a single gene symbol search may map to multiple sequences. To avoid inconsistencies in function, it is highly recommended to use the actual GeneID (e.x. "105852298) as the first argument.   ### 2. Transcript Using the `rentrez` package, we can fetch data from NCBI. Here, the first 30 lines for accession "XM_004487701" : ```{r} mrna_gb <- rentrez::entrez_fetch(db= 'nuccore', id = "XM_004487701", rettype = 'gp') strsplit(mrna_gb, "\n")[[1]][1:30] ``` The `refseq_mRNAfeat` function serves as a wrapper built on top of `entrez_summary` from the `rentrez` package, designed to extract specific features from the obtained data. Typically, my focus lies on key features like id, accession, title, update, or sequence length (bp). However, you have the flexibility to tailor the function to extract additional features of interest from the `esummary_list` object. ```{r, eval = F} transcript = c("XM_004487701", "XM_004488493", "XM_004501904") feat = c("caption", "moltype", "sourcedb", "slen", "title") refseq_mRNAfeat(transcript, feat) ``` Another interesting function is `refseq_RNA2protein`, which retrieves the protein accession associated with the provided mRNA. ```{r} transcript <- "XM_004487701" refseq_RNA2protein(transcript) ``` The CDS coordinates come in handy when we want to get the fasta sequence. We sometimes do not want the 5'UTR or 3'UTR contained in the mRNA sequence and are interested just in the CDS. The function `refseq_CDScoords` creates an `IRanges` object with the CDS coordinates from an mRNA accession. The output object is the basis for `refseq_CDSseq`, which fetches the NCBI data, uses that coordinates and returns a `DNAString` object with the CDS nucleotide sequence. ```{r} refseq_CDScoords(transcript) refseq_CDSseq(transcript) ``` Here, the first 500 nucleotides of the mRNA 'XM_004487701': ```{r} mrna_fasta = rentrez::entrez_fetch(db="nuccore", id=transcript, rettype="fasta") # take a look at the first 500 chars. cat(strwrap(substr(mrna_fasta, 1, 500)), sep="\n") ``` Here, the first 60 nucleotides of the CDS from the mRNA 'XM_004487701': ```{r} substr(toString(refseq_CDSseq(transcript)), 1, 60) ``` As previously said, the function `refseq_description` returns the sequence description corresponding to a given accession. The identifier (id) can be either a transcript, protein, or Gene identifier. ```{r} id <- "XM_004487701" refseq_description(id) ``` ### 3. Protein Similarly to nucleotide sequences, `refseq_protein2RNA`, retrieves the mRNA associated with the provided protein accession. ```{r} protein <- "XP_020244413" refseq_protein2RNA(protein) ``` Two specific functions prove useful for managing protein accessions: `refseq_AAlen` offers the amino acid length of the sequence, while `refseq_mol.wt` provides the molecular weight in Daltons. ```{r} refseq_AAlen(protein) refseq_AAmol_wt(protein) ``` The `refseq_AAseq` function, fetches the NCBI data, and returns a `DNAString` object with the amino acid sequence. ```{r} refseq_AAseq(protein) ``` As previously mentioned, the `refseq_description` function ultimately provides the sequence description associated with a given accession. The identifier (id) can take the form of either a transcript, protein, or Gene identifier. ```{r} id <- "XP_020244413" refseq_description(id) ``` ### 4. Concluding Remarks The package `refseqR` contains a number of functions to programmatically automatize some common operations. Functions to apply on GeneID accessions * `refseq_description` * `refseq_fromGene` Functions to apply on transcript id. accessions * `refseq_GeneID` * `refseq_CDScoords` * `refseq_CDSseq` * `refseq_RNA2protein` Functions to apply on protein id. accessions * `refseq_GeneID` * `refseq_AAseq` * `refseq_AAmol_wt` * `refseq_AAlen` * `refseq_protein2RNA`   I'd really appreciate your feedback. The whole code used in this tutorial is available from my **[Github](https://github.com/jdieramon)** repository. You can contact me by **[email](mailto:jose.die@uco.es)** or visit my **[website](https://jdieramon.github.io)**.     Córdoba, (Spain), `r Sys.Date()`.