---
title: "Getting Started with Bolt4jr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with Bolt4jr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

`bolt4jr` is an R package for querying, extracting, and processing network data from Neo4j databases using the Bolt protocol. This vignette walks through installation, configuration, and basic usage of the package.

## Installation

Install the package from GitHub using:

```{r,eval=FALSE}
# Install the remotes package if not already installed
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

# Install bolt4jr
remotes::install_github("Broccolito/bolt4jr")
library(bolt4jr)
```

Alternatively, install the package from CRAN using:

```{r,eval=FALSE}
install.packages("bolt4jr")
library(bolt4jr)
```

## Setting Up Your Environment

Add the Neo4j credentials to your `.Renviron` file:

```{r,eval=FALSE}
usethis::edit_r_environ()
```

Then add the following entries, filling in the values for your own database:

```
NEO4J_URI=bolt://
NEO4J_USER=
NEO4J_PASSWORD=
```

Save the file and restart R so that the new variables are loaded.

## Basic Usage

### Setting Up the Conda Environment

```{r,eval=FALSE}
setup_bolt4jr()
```

This function initializes the Conda environment required by the `bolt4jr` package. If no Conda binary is found, it installs Miniconda. If the required Conda environment (`bolt4jr`) does not exist, it creates the environment and installs the necessary dependencies.

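Before running any query, it can help to confirm that the credentials added to `.Renviron` were actually picked up after restarting R. The sketch below is one way to do that with base R only. The helper `missing_neo4j_vars()` is a hypothetical name introduced for illustration and is not part of `bolt4jr`.

```{r,eval=FALSE}
# Hypothetical helper (not part of bolt4jr): return the names of the
# Neo4j environment variables that are unset or empty.
missing_neo4j_vars = function(vars = c("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")) {
  values = Sys.getenv(vars, unset = "")
  vars[!nzchar(values)]
}

missing = missing_neo4j_vars()
if (length(missing) > 0) {
  message(
    "Not set: ", paste(missing, collapse = ", "),
    ". Edit .Renviron and restart R."
  )
}
```

If the message lists any variable, re-open `.Renviron` with `usethis::edit_r_environ()`, correct the entry, and restart R before continuing.
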
### Querying Nodes

```{r,eval=FALSE}
library(bolt4jr)

# Load credentials from .Renviron
uri = Sys.getenv("NEO4J_URI")
user = Sys.getenv("NEO4J_USER")
password = Sys.getenv("NEO4J_PASSWORD")

# Query nodes
nodes = run_query(
  uri = uri,
  user = user,
  password = password,
  query = "
  MATCH (n)-[r]-(m)
  WHERE type(r) IN ['ISA_AiA', 'PARTOF_ApA']
  RETURN DISTINCT elementId(n) AS node_id, n"
)

# Convert the result to a data frame
nodes_df = convert_df(
  nodes,
  field_names = c("node_id", "n.identifier", "n.name", "n.source")
)
head(nodes_df)
```

#### Example Output (Nodes Data Frame):

| node_id                                  | n.identifier   | n.name                           | n.source |
| ---------------------------------------- | -------------- | -------------------------------- | -------- |
| 4:c77f6410-bc08-43ba-a172-0503ab1c93db:0 | UBERON:0003233 | epithelium of shoulder           | Uberon   |
| 4:c77f6410-bc08-43ba-a172-0503ab1c93db:1 | UBERON:2001901 | ceratobranchial 3 element        | Uberon   |
| 4:c77f6410-bc08-43ba-a172-0503ab1c93db:2 | UBERON:0004321 | middle phalanx of manual digit 3 | Uberon   |
| 4:c77f6410-bc08-43ba-a172-0503ab1c93db:3 | UBERON:0002414 | lumbar vertebra                  | Uberon   |
| 4:c77f6410-bc08-43ba-a172-0503ab1c93db:4 | UBERON:2005118 | middle lateral line primordium   | Uberon   |
| 4:c77f6410-bc08-43ba-a172-0503ab1c93db:5 | UBERON:0034769 | lymphomyeloid tissue             | Uberon   |

### Querying Edges

```{r,eval=FALSE}
# Query edges (reuses uri, user, and password defined above)
edges = run_query(
  uri = uri,
  user = user,
  password = password,
  query = "
  MATCH (n)-[r]-(m)
  WHERE type(r) IN ['ISA_AiA', 'PARTOF_ApA']
  RETURN DISTINCT elementId(r) AS edge_id,
    elementId(startNode(r)) AS start_node_id,
    elementId(endNode(r)) AS end_node_id,
    r
  LIMIT 1000"
)

# Examine the structure of the first record
unlist(edges[[1]])

# Extract specific fields and convert to a data frame
edges = convert_df(
  edges,
  field_names = c("edge_id", "start_node_id", "end_node_id")
)

# View the resulting data frame
head(edges)
```

#### Example Output (Edges Data Frame):

| edge_id                                   | start_node_id                            | end_node_id                              |
| ----------------------------------------- | ---------------------------------------- | ---------------------------------------- |
| 4:c77f6410-bc08-43ba-a172-0503ab1c93db:10 | 4:c77f6410-bc08-43ba-a172-0503ab1c93db:0 | 4:c77f6410-bc08-43ba-a172-0503ab1c93db:1 |
| 4:c77f6410-bc08-43ba-a172-0503ab1c93db:11 | 4:c77f6410-bc08-43ba-a172-0503ab1c93db:2 | 4:c77f6410-bc08-43ba-a172-0503ab1c93db:3 |

### Querying the Network in Batches

For large networks, use the `run_batch_query` function to process data in chunks. It appends each batch of results to a file as it goes, so the full result set never has to be held in memory at once.

#### Extracting Edges in Batches

```{r,eval=FALSE}
run_batch_query(
  uri = uri,
  user = user,
  password = password,
  query = "
  MATCH (n)-[r]-(m)
  WHERE type(r) IN ['ISA_AiA', 'PARTOF_ApA']
  RETURN DISTINCT elementId(r) AS edge_id,
    elementId(startNode(r)) AS start_node_id,
    elementId(endNode(r)) AS end_node_id",
  field_names = c("edge_id", "start_node_id", "end_node_id"),
  filename = "edges.tsv",
  batch_size = 1000
)
```

#### Extracting Nodes in Batches

```{r,eval=FALSE}
run_batch_query(
  uri = uri,
  user = user,
  password = password,
  query = "
  MATCH (n)-[r]-(m)
  WHERE type(r) IN ['ISA_AiA', 'PARTOF_ApA']
  RETURN DISTINCT elementId(n) AS node_id, n",
  field_names = c("node_id", "n.identifier", "n.name", "n.source"),
  filename = "nodes.tsv",
  batch_size = 1000
)
```

## Advanced Features

- Batch processing for large datasets.
- Seamless data conversion into R data frames for downstream analysis.

For more details, refer to the package documentation.
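
As a follow-up to the batch examples, the sketch below shows one way to load the exported files back into R for downstream analysis. It rests on assumptions this vignette does not confirm: that `run_batch_query` writes tab-separated files with a header row, and that the column names match the `field_names` passed to it. The helper `read_batch_output()` is a hypothetical name and is not part of `bolt4jr`.

```{r,eval=FALSE}
# Hypothetical helper (not part of bolt4jr): read a file written by
# run_batch_query(). Assumes tab-separated values with a header row;
# pass header = FALSE if the file turns out to have none.
read_batch_output = function(path, header = TRUE) {
  read.delim(
    path,
    header = header,
    sep = "\t",
    stringsAsFactors = FALSE,
    check.names = FALSE
  )
}

# Only runs when both export files exist in the working directory
if (file.exists("edges.tsv") && file.exists("nodes.tsv")) {
  edges = read_batch_output("edges.tsv")
  nodes = read_batch_output("nodes.tsv")

  # Attach start-node attributes to each edge
  # (column names assumed to match the field_names used for export)
  edges_annotated = merge(
    edges, nodes,
    by.x = "start_node_id", by.y = "node_id"
  )
  head(edges_annotated)
}
```

`check.names = FALSE` keeps column names exactly as they appear in the file rather than letting R rewrite them.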