--- title: "Introduction to the GOxploreR package" author: "Kalifa Manjang, Shailesh Tripathi, Olli Yli-Harja, Matthias Dehmer & Frank Emmert-Streib" date: "`r format(Sys.time(), '%d %B %Y')`" output: rmarkdown::pdf_document: toc: true toc_depth: 2 fig_caption: yes csl: biomed-central.csl bibliography: bibliography.bib link-citations: true header-includes: \usepackage{float} \floatplacement{figure}{H} vignette: > %\VignetteIndexEntry{GOxploreR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- Predictive Society and Data Analytics Lab, Tampere University, Tampere, Korkeakoulunkatu 10, 33720, Tampere, Finland \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline ## Citation Kalifa Manjang, Shailesh Tripathi, Olli Yli-Harja, Matthias Dehmer, and Frank Emmert-Streib. 2020. Graph-based exploitation of gene ontology using goxplorer for scrutinizing biological significance. Scientific Reports 10, 1: 16672. https://doi.org/10.1038/s41598-020-73326-3 [@Manjang2020] ```{r include = FALSE} knitr::opts_chunk$set( collapse = FALSE, comment = "#>" ) ``` \newpage \tableofcontents \listoffigures \listoftables \newpage ## Introduction The GOxploreR package is an R package that provides a simple and efficient way to communicate with the gene ontology (GO) database. The gene ontology is a major bioinformatics initiative by the gene ontology consortium. The goal is to categorize the gene and gene product function. The ontology is structured into three distinct aspects of gene function: molecular function (MF), cellular component (CC), and biological process (BP) together with over 45,000 terms and 130,000 relations whereas the majority of information is centered around ten model organisms [@gene2018gene]. In addition, GO includes annotations by linking specific gene products to GO-terms. Currently, GO is the most comprehensive and widely used knowledgebase concerning functional information about genes. \begin{table}[!h] \centering \begin{tabular}{r|r|r} \hline Organism & Name used as options in the package & Number of genes \\ \hline Human & "Homo sapiens" / "Human" & 19155\\ \hline Mouse & "Mus musculus" / "Mouse" & 20929\\ \hline Caenorhabditis elegans & "Caenorhabditis elegans" / "Worm" & 14697 \\ \hline Drosophila melanogaster & "Drosophila melanogaster" / "Fruit fly" & 12683 \\ \hline Rat & "Rattus norvegicus" / "Rat" & 19383 \\ \hline Baker's yeast & "Saccharomyces cerevisiae" / "Yeast" & 5502 \\ \hline Zebrafish & "Danio rerio" / "Zebrafish" & 20718 \\ \hline Arabidopsis thaliana & "Arabidopsis thaliana" / "Cress" & 25891 \\ \hline Schizosaccharomyces pombe & "Schizosaccharomyces pombe" / "Fission yeast" & 5055 \\ \hline Escherichia coli & "Escherichia coli" / "E.coli" & 3449 \\ \hline \end{tabular} \caption{Organisms supported by the GOxploreR package.} \label{tab:caption} \end{table} This vignette gives an overview of the functionality provided by the GOxploreR package. The package is freely available on CRAN and can be installed using the following command: ```{} install.packages("GOxploreR") ``` The package function can be loaded using: ```{r setup, message=FALSE} library(GOxploreR) ``` Note that the package needs to be installed to be loaded. ## Overview of the functionality of the package The following is a brief description of the package functionality. ### Gene2GOTermAndLevel The Gene2GOTermAndLevel function provides information associated with a list of genes. Given a gene or a list of genes, an organism, and a domain (BP, MF or CC) the function provides the Gene Ontology terms (GO-terms) associated with the genes and their respective levels of the DAG. The default argument of the domain is BP. For the arguments of the option 'organism' see Table \ref{tab:caption}. ```{R} # The cellular component gene ontology terms will be retrieve and their levels Gene2GOTermAndLevel(genes = c(10212, 9833, 6713), organism = "Homo sapiens", domain = "CC") # The biological process gene ontology terms will be retrieve and their levels Gene2GOTermAndLevel(genes = c(100000642, 30592, 58153, 794484), organism = "Danio rerio") # The molecular function gene ontology terms will be retrieve and their levels Gene2GOTermAndLevel(genes = c(100009600, 18131, 100017), organism = "Mouse", domain = "MF") ``` ### Gene2GOTermAndLevel_ON This function is similar to the Gene2GOTermAndLevel function, the only difference is that this function queries the Ensembl database online (ON) for GO-terms (making it relatively slow). That means the results from the Gene2GOTermAndLevel_ON function are always up to date but an internet connection is needed for its execution. This function does not provide support for Escherichia coli. ```{} # The cellular component gene ontology terms will be retrieve and their levels Gene2GOTermAndLevel_ON(genes = c(10212, 9833, 6713), organism = "Homo sapiens", domain ="CC") # The biological process gene ontology terms will be retrieve and their levels Gene2GOTermAndLevel_ON(genes = c(100000711, 100000710, 100000277), organism = "Danio rerio") # The molecular function gene ontology terms will be retrieve and their levels Gene2GOTermAndLevel_ON(genes = c(100009609, 100017, 100034361), organism = "Mouse", domain = "MF") ``` ### GOTermXXOnLevel This function gives the level of a GO-term based on a DAG. The results for organism-specific GO-DAGs are the same as for the general GO-DAG. The XX in the name above should be replaced by either BP, MF, or CC. ```{r} # Retrieve the level of a GO biological process term goterms <- c("GO:0009083","GO:0006631","GO:0006629","GO:0014811","GO:0021961") GOTermBPOnLevel(goterm = goterms) # Retrieve the level of a GO molecular function term goterms <- c("GO:0005515","GO:0016835","GO:0046976","GO:0015425","GO:0005261") GOTermMFOnLevel(goterm = goterms) # Retrieve the level of a GO cellular component term goterms <- c("GO:0055044","GO:0030427","GO:0036436","GO:0034980","GO:0048226") GOTermCCOnLevel(goterm = goterms) ``` ### Level2GOTermXX This function gives all the GO-terms from a given GO-level. These GO-terms can be from the general GO-DAG or from an organism-specific GO-DAG. If the "organism" argument is given, the GO-terms will be acquired from the organism's (organism supported by the package) DAG level, However, if no value for the "organism" parameter is given then the general GO-DAG is used (default). The XX in the name should be replaced by either BP, MF, or CC. ```{R} # Retrieve all the GO-terms from a particular GO BP level Level2GOTermBP(level = 1, organism = "Human") # Retrieve all the GO-terms from a particular GO MF level Level2GOTermMF(level = 14, organism = "Rat") # Retrieve all the GO-terms from the general GO CC level Level2GOTermCC(level = 14) ``` ### Level2LeafNodeXX This function gives all the leaf nodes from a particular GO-level. Leaf nodes can also be attained from the organism-specific GO-DAG. The "organism" parameter is optional. If supplied, the leaf node from the respective organism's level will be acquired. The default is the general GO-DAG. The XX should be substituted with either BP, MF, or CC. ```{r} # Get all leaf nodes from a GO BP level Level2LeafNodeBP(level = 2, organism = "Danio rerio") # Get all leaf nodes from a GO MF level Level2LeafNodeMF(level = 12) # Get all leaf nodes from a GO CC level Level2LeafNodeCC(level = 10, organism = "Schizosaccharomyces pombe") ``` ### Level2JumpNodeXX This function gives for a GO-level the GO-terms which correspond to jump Nodes (JNs). The JNs are GO-terms which have at least one child term not present in the level below the parent term. If no organism is given, the default is the general GO-DAG. The XX in the name should be substituted with BP, MF, or CC. ```{r} # All jump nodes from the GO BP level head(Level2JumpNodeBP(level = 2, organism = "Homo sapiens")) # All jump nodes from the GO MF level head(Level2JumpNodeMF(level = 3, organism = "Homo sapiens")) # All jump nodes from the GO CC level head(Level2JumpNodeCC(level = 7, organism = "Homo sapiens")) ``` ### Level2RegularNodeXX This function gives for a GO-level the GO-terms which correspond to regular Nodes (RNs). The RNs are those GO-terms whose child terms are all present in the level right below the parent's level. The XX in the name should be subsitituted with BP, MF, or CC. ```{r} # All regular nodes from the BP level head(Level2RegularNodeBP(level = 9, organism = "Zebrafish")) # All regular nodes from the MF level head(Level2RegularNodeMF(level = 7, organism = "Homo sapiens")) # All jump nodes from the CC level head(Level2RegularNodeCC(level = 7)) ``` ### Level2NoLeafNodeXX This function gives all the GO-terms from a GO-level that are not leaf nodes. Similarly, all non-leaf GO-terms from an organism-specific DAG can be returned by providing the organism of interest. The default is the general GO-DAG. The XX in the name should be substituted with either BP, MF or CC. ```{R} # All GO-terms on a particular GO BP level that are not leaf nodes Level2NoLeafNodeBP(level = 16, organism = "Homo sapiens") # All GO-terms on a particular GO MF level that are not leaf nodes Level2NoLeafNodeMF(level = 10, organism = "Caenorhabditis elegans") # All GO-terms on a particular GO CC level that are not leaf nodes Level2NoLeafNodeCC(level = 12, organism = "Homo sapiens") ``` ### getGOcategory Given a GO-term or a list of GO-terms, this function returns the category of the term. The categories are jump nodes (JN), regular nodes (RN) and leaf nodes (LN). ```{R} goterm <- c("GO:0009083","GO:0006631","GO:0006629","GO:0016835","GO:0046976","GO:0048226") # Returns the categories of the GO-terms in the list getGOcategory(goterm = goterm) ``` ### degreeDistXX This function obtains the degree distribution of the GO-terms on a GO-level. A bar plot is obtained which shows how many nodes in the GO-level have a certain degree k. The XX in the name should be substituted with either BP, MF, or CC. ```{r, fig.height= 3.4, fig.cap= "Degree distribution of the biological process GO-terms on level 4."} # Degree distribution of the GO-terms on a particular GO BP level degreeDistBP(level = 4) ``` ```{r, fig.height= 3.4, fig.cap= "Degree distribution of the molecular function GO-terms on level 2."} # Degree distribution of the GO-terms on a particular GO MF level degreeDistMF(level = 2) ``` ```{r, fig.height= 3.4, fig.cap= "Degree distribution of the cellular component GO-terms on level 10."} # Degree distribution of the GO-terms on a particular GO CC level degreeDistCC(level = 10) ``` ### GOTermXX2ChildLevel For a GO-term it's children level are derived. The XX in the name should be substituted with BP, MF, or CC. ```{R} # Get the level of a GO BP term's children GOTermBP2ChildLevel(goterm = "GO:0007635") # Get the level of a GO MF term's children GOTermMF2ChildLevel(goterm = "GO:0098632") # Get the level of a GO CC term's children GOTermCC2ChildLevel(goterm = "GO:0071735") ``` ### GetLeafNodesXX This function gives all the leaf nodes of a certain organism. If the input value is empty or is "BP", "MF"" or "CC"" the default DAG is the general GO-DAG. The value for XX should be subsituted with BP, MF or CC. ```{} # All leaf nodes from the GO BP tree GetLeafNodesBP("BP") # All leaf nodes from the GO CC tree GetLeafNodesCC(organism = "Caenorhabditis elegans") ``` ### GO2DecXX This function returns all descedant child nodes of a GO-term. That means, we begin from a GO-term and find all the GO-terms of its children and their children until we reach all the way down of the DAG. The XX in the name should be substituted with BP, MF or CC. ```{r} # Biological process GO-term descendant terms GO2DecBP(goterm = "GO:0044582") # Molecular function GO-term descendant terms GO2DecMF(goterm = "GO:0008553") # Cellular component GO-term descendant terms GO2DecCC(goterm = "GO:0031233") ``` ### GetDAG This function gives the GO-terms in the Gene Ontology as an edgelist corresponding to a directed acyclic graph (DAG) for the GO-terms of a certain organism. This can also be obtained for the general GO-DAG (not organism-specific). ```{R} # Represent all the BP gene association GO-terms for human as an edgelist head(GetDAG(organism = "Human", domain = "BP")) # Represent all the MF gene association GO-terms for Mouse as an edgelist head(GetDAG(organism = "Mouse", domain = "MF")) # Represent all the CC gene association GO-terms for Caenorhabditis elegans as an edgelist head(GetDAG(organism = "Caenorhabditis elegans", domain = "CC")) ``` ### visRDAGXX The visualization of a GO-DAG is difficult primarily because of the size of the graphs containing thousands of GO-terms. For this reason, we invented a simple method that combines GO-terms with similar characteristics together. This includes a global summary of all GO-terms in the DAG. Every node in the reduced DAG comprises 1 or more GO-terms and these GO-terms can be accessed by using certain information (i.e. the level and what type of node category they represent for example "RN", "JN" or "LN"). This is what we call the reduced GO-DAG for an organism. Furthermore, the function returns a list which contains the GO-terms in each category and the plot of the reduced DAG. To retrieve just the GO-terms in each category, the "plot" argument can be set to "FALSE" . The XX in the name should be substituted with BP, MF or CC. The total number of GO-terms in each node is represented by the node label. For instance, in Figure \ref{fig:figs1} , Level 0 (i.e "L0 RN") has 1 GO-term present in the node category. The label "J","R" and "L" on the right side of Figure \ref{fig:figs1} gives the number of connections between the regular node (RN) on the level and the nodes right below it (RN are nodes that have all their children nodes represented in the next level). For example, on L1, The label J = 5, R = 9 and L = 6 means that the RNs on the level have 5 of it's children nodes as Jump nodes (JN) on L2, 9 of it's descendant are Regular nodes (RN) and 6 of its children GO-terms on L2 are leaf nodes (LN). ```{r, message=FALSE} # The GO-terms in each node category of the reduced Caenorhabditis elegans GO-DAG head(visRDAGMF(organism = "Caenorhabditis elegans", plot = FALSE)) # RN GO-terms on level 1 can be access as follows visRDAGMF(organism = "Caenorhabditis elegans", plot = FALSE)$"L1 RN" # JN GO-terms on level 9 can be access as follows visRDAGMF(organism = "Caenorhabditis elegans", plot = FALSE)$"L9 JN" # LN GO-terms on level 14 can be access as follows visRDAGMF(organism = "Caenorhabditis elegans", plot = FALSE)$"L11 LN" ``` ```{r, message=FALSE, fig.cap="\\label{fig:figs1}Visualization of a reduced GO-DAG for Caenorhabditis elegans."} # Represent the molecular function GO-DAG for organism Caenorhabditis elegans visRDAGMF(organism = "Caenorhabditis elegans", plot = TRUE)[["plot"]] ``` ### visRsubDAGXX The visRsubDAGXX function is similar to the visRDAGXX function, however, it visualizes an organism-specific sub-GO-DAG. The input of the function is a list of organism-specific GO-terms. If this list contains not all GO-terms of the organism, then category nodes are faded out. The XX in the function can be substituted with BP, MF or CC. ```{r , size= 0.05, fig.cap="Visualization of a reduced sub-GO-DAG of BPs for Human."} Terms <- c("GO:0022403","GO:0000278","GO:0006414","GO:0006415","GO:0006614", "GO:0045047","GO:0072599","GO:0006613","GO:0000279","GO:0000087", "GO:0070972","GO:0000184","GO:0000280","GO:0007067","GO:0006413", "GO:0048285","GO:0006412","GO:0000956","GO:0006612","GO:0019080", "GO:0019083","GO:0016071","GO:0006402","GO:0043624","GO:0043241", "GO:0006401","GO:0072594","GO:0022904","GO:0019058","GO:0032984", "GO:0045333","GO:0006259","GO:0051301","GO:0022900","GO:0006396", "GO:0060337","GO:0071357","GO:0034340","GO:0002682","GO:0051320", "GO:0045087","GO:0051325","GO:0022411","GO:0016032","GO:0044764", "GO:0022415","GO:0051329","GO:0050776","GO:0030198","GO:0043062") # visualization the DAG node categories of the given biological process GO-terms visRsubDAGBP(goterm = Terms, organism = "Human") ``` ### distRankingGO Given a list of GO-terms, the function provides ranking for the GO-terms according to the distance between the GO-terms hierarchy level and the maximal depth of paths in the GO-DAG passing through these GO-terms. The function provides options for "BP", "MF" and "CC" ontology. ```{r , size= 0.05, fig.cap="\\label{fig:rank}The hierarchy levels for a list of GO-terms (y-axis) are shown in purple and the hierarchy levels for the maximal depth of paths in the GO-DAG passing through these GO-terms is shown in red."} Terms <- c("GO:0000278","GO:0006414","GO:0022403","GO:0006415","GO:0006614", "GO:0045047","GO:0072599","GO:0006613","GO:0000184","GO:0070972", "GO:0006413","GO:0000087","GO:0000280","GO:0000279","GO:0006612", "GO:0000956","GO:0048285","GO:0019080","GO:0019083","GO:0043624", "GO:0006402","GO:0032984","GO:0006401","GO:0072594","GO:0019058", "GO:0051301","GO:0016071","GO:0006412","GO:0002682","GO:0022411", "GO:0001775","GO:0046649","GO:0045321","GO:0050776","GO:0007155", "GO:0022610","GO:0060337","GO:0071357","GO:0034340","GO:0016032", "GO:0044764","GO:0006396","GO:0010564","GO:0002684","GO:0006259", "GO:0051249","GO:0045087") # Ordering of the GO-terms in the list distRankingGO(goterm = Terms, domain = "BP", plot = TRUE) ``` The function produced as output the ranked GO-terms, the indices of the ranking (indices corresponding to the original list), the distance between the GO-terms hierarchy level and the maximal depth of paths in the GO-DAG passing through these GO-terms and a visualisation of the ranking. The GO-terms are ranked according to the distance between the two points (purple and red) shown in Figure \ref{fig:rank}. ### scoreRankingGO The function scoreRankingGO is similar to the distRankingGO function because both function provide ordering for a given list of GO-terms. The difference is scoreRankingGO rank the GO-terms according to a score which is computed using Equation \ref{equ:equ1}. \begin{equation} \label{equ:equ1} score = \frac{level(GO)}{level_{max}(GO)} \times \frac{level(GO)}{level_{GO-DAG}(GO)} \end{equation} The function produced as output the ranked GO-terms, the indices of the ranking (indices corresponding to the original list), the scores of each GO-terms and a visualisation of the ranking. ```{r , size= 0.05, fig.cap="\\label{fig:rank2}"} Terms <- c("GO:0000278","GO:0006414","GO:0022403","GO:0006415","GO:0006614", "GO:0045047","GO:0072599","GO:0006613","GO:0000184","GO:0070972", "GO:0006413","GO:0000087","GO:0000280","GO:0000279","GO:0006612", "GO:0000956","GO:0048285","GO:0019080","GO:0019083","GO:0006402", "GO:0032984","GO:0006401","GO:0072594","GO:0019058","GO:0051301", "GO:0016071","GO:0006412","GO:0002682","GO:0022411","GO:0001775", "GO:0046649","GO:0045321","GO:0050776","GO:0007155","GO:0060337", "GO:0071357","GO:0034340","GO:0016032","GO:0006396","GO:0010564", "GO:0002684","GO:0006259","GO:0051249","GO:0045087") # Ordering of the GO-terms in the list scoreRankingGO(goterm = Terms, domain = "BP", plot = FALSE) ``` ### prioritizedGOTerms Given a vector of GO-terms, this function prioritizes the GO-terms by expoiting the structure of a DAG. Starting from the GO-term on the highest level and searching all the paths to the root node iteratively. If the argument "sp" is TRUE, only shortest paths are used, otherwise all paths. If any GO-terms in the input vector are found along this path, these GO-terms are removed. This is because the GO-term at the end of a path is more specific than the GO-terms along the path. For an organism, the GO-terms of that organism are used for the prioritization. If the organism argument is NULL then all the (non-retired) GO-terms from a particular ontology are used in the ranking. ```{r} Terms <- c("GO:0042254", "GO:0022613", "GO:0034470", "GO:0006364", "GO:0016072", "GO:0034660", "GO:0006412", "GO:0006396", "GO:0007005", "GO:0032543", "GO:0044085", "GO:0044281", "GO:0044257", "GO:0030163", "GO:0006082", "GO:0044248", "GO:0006519", "GO:0009056", "GO:0019752", "GO:0043436") # We Prioritize the given biological process GO-terms prioritizedGOTerms(lst = Terms, organism = "Human", sp = TRUE, domain = "BP") ``` ### GO4Organism This function gives all the GO-terms association with an organism and their corresponding GO-levels. ```{R} # All the biological process gene association GO-terms for Human and their GO-level head(GO4Organism(organism = "Human", domain = "BP")) # All the molecular function gene association GO-terms for Mouse and their GO-level head(GO4Organism(organism = "Mouse", domain = "MF")) # All the cellular component gene association GO-terms for Rat and their GO-level head(GO4Organism(organism = "Rat", domain = "CC")) ``` Conclusion --- This vignette gave a brief overview of the functionality provided by the GOxploreR package. We showed all functions and how to use them. ## References ```{r refmgr references, results="asis", echo=FALSE} # PrintBibliography(bib) ```