---
title: "Getting Started"
author: "Dyfan Jones"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[UTF-8]{inputenc}
---

The `RAthena` package aims to make it easier to work with data stored in [`AWS Athena`](https://aws.amazon.com/athena/). It provides three levels of interaction with `AWS Athena`:

* Low-level API: This provides finer control of the `AWS Athena` backend through the AWS SDK [`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html). It ranges from configuring [`AWS Athena Work Groups`](https://aws.amazon.com/about-aws/whats-new/2019/02/athena_workgroups/) to assuming different roles within `AWS` when connecting to `AWS Athena`.
* [DBI interface](https://dbi.r-dbi.org/): This is the primary goal of `RAthena`. By providing a `DBI` interface to `AWS Athena`, it lets users work with `AWS Athena` through the familiar functions and methods they already use for other databases from R.
* [dplyr interface](https://dbplyr.tidyverse.org/): As `dplyr` is becoming more popular, `RAthena` aims to give `dplyr` a seamless interface to `AWS Athena`.

# Installing `RAthena`:

As `RAthena` uses the Python AWS SDK `boto3`, Python 3+ is required. Please install Python 3+ from either [Python](https://www.python.org/downloads/) or [Python Anaconda](https://www.anaconda.com/products/distribution).

To install `RAthena`:

```r
# CRAN version
install.packages("RAthena")

# Dev version
remotes::install_github("dyfanjones/RAthena")
```

Next, install the Python package `boto3`. This can be done either with `RAthena`'s installation method:

```r
RAthena::install_boto()
```

Or with pip:

```
pip install boto3
```

## Python Environments:

If `RAthena` doesn't pick up `boto3` after using `install_boto()`, please consider specifying the Python environment. `install_boto()` creates an environment named `RAthena`. This is either a Python virtual environment or a conda environment, depending on your system.

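
Before forcing a particular environment, it can help to check whether `reticulate` already sees `boto3`. The following is a minimal sketch, not part of `RAthena` itself: it assumes the `reticulate` package is installed, and the variable name `has_boto3` is only illustrative.

```r
# Ask reticulate whether its active Python can import boto3.
# py_module_available() returns TRUE or FALSE rather than raising an error.
has_boto3 <- reticulate::py_module_available("boto3")

if (has_boto3) {
  message("boto3 is visible to reticulate")
} else {
  message("boto3 not found; consider specifying the Python environment")
}
```

If the check reports that `boto3` is missing, pointing `reticulate` at the environment created by `install_boto()` is the next thing to try.
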
To point `reticulate` at that environment explicitly, use one of the two calls below:

```r
library(DBI)

# Specify the Python conda environment and force reticulate to use it
reticulate::use_condaenv("RAthena", required = TRUE)

# Or specify the Python virtual environment and force reticulate to use it
# (use only one of the two calls, matching the environment that exists)
# reticulate::use_virtualenv("RAthena", required = TRUE)

con <- dbConnect(RAthena::athena())
```

**Note:** Specifying a Python environment is not required if `boto3` is installed in the root Python, or if R and Python run together in their own environment (for example, a conda environment).

## Docker Example:

To help users who wish to run `RAthena` in [Docker](https://hub.docker.com/), a simple Dockerfile has been created [here](https://github.com/DyfanJones/RAthena/blob/master/docker/Dockerfile). To set up Docker, please refer to [this guide](https://aws.amazon.com/premiumsupport/knowledge-center/codebuild-temporary-credentials-docker/). For demo purposes we will use the [example Dockerfile](https://github.com/DyfanJones/RAthena/blob/master/docker/Dockerfile) and run it locally:

```console
# build docker image
docker build . -t rathena

# start container with aws credentials passed from local
docker run \
  -e AWS_ACCESS_KEY_ID="$(aws configure get aws_access_key_id)" \
  -e AWS_SECRET_ACCESS_KEY="$(aws configure get aws_secret_access_key)" \
  -e AWS_SESSION_TOKEN="$(aws configure get aws_session_token)" \
  -e AWS_DEFAULT_REGION="$(aws configure get region)" \
  -it rathena
```

When running `RAthena` in the Docker environment, you might need to tell `reticulate` which Python you are using.

```r
# Path to the Python interpreter inside the container
reticulate::use_python("/usr/bin/python3")

library(DBI)

con <- dbConnect(RAthena::athena(),
                 s3_staging_dir = "s3://mybucket/")
```

# Usage:

## Low-Level API:

```r
library(DBI)
library(RAthena)

con <- dbConnect(athena())

# List all current work groups in AWS Athena
list_work_groups(con)

# Create a new work group
create_work_group(con, "demo_work_group",
                  description = "This is a demo work group",
                  tags = tag_options(key = "demo_work_group", value = "demo_01"))
```

## DBI:

```r
library(DBI)

con <- dbConnect(RAthena::athena())

# Get metadata
dbGetInfo(con)

# $profile_name
# [1] "default"
#
# $s3_staging
# [1] ######## NOTE: Please don't share your S3 bucket to the public
#
# $dbms.name
# [1] "default"
#
# $work_group
# [1] "primary"
#
# $poll_interval
# NULL
#
# $encryption_option
# NULL
#
# $kms_key
# NULL
#
# $expiration
# NULL
#
# $region_name
# [1] "eu-west-1"
#
# $boto3
# [1] "1.11.5"
#
# $RAthena
# [1] "1.7.1"

# Write a table to AWS Athena
dbWriteTable(con, "iris", iris)

dbGetQuery(con, "select * from iris limit 10")

# Info: (Data scanned: 860 Bytes)
#     sepal_length sepal_width petal_length petal_width species
#  1:          5.1         3.5          1.4         0.2  setosa
#  2:          4.9         3.0          1.4         0.2  setosa
#  3:          4.7         3.2          1.3         0.2  setosa
#  4:          4.6         3.1          1.5         0.2  setosa
#  5:          5.0         3.6          1.4         0.2  setosa
#  6:          5.4         3.9          1.7         0.4  setosa
#  7:          4.6         3.4          1.4         0.3  setosa
#  8:          5.0         3.4          1.5         0.2  setosa
#  9:          4.4         2.9          1.4         0.2  setosa
# 10:          4.9         3.1          1.5         0.1  setosa
```

## dplyr:

```r
library(dplyr)

# Reuses the connection `con` created in the DBI example
athena_iris <- tbl(con, "iris")

athena_iris %>%
  select(species, sepal_length, sepal_width) %>%
  head(10) %>%
  collect()

# Info: (Data scanned: 860 Bytes)
# # A tibble: 10 x 3
#    species sepal_length sepal_width
#    <chr>          <dbl>       <dbl>
#  1 setosa           5.1         3.5
#  2 setosa           4.9         3
#  3 setosa           4.7         3.2
#  4 setosa           4.6         3.1
#  5 setosa           5           3.6
#  6 setosa           5.4         3.9
#  7 setosa           4.6         3.4
#  8 setosa           5           3.4
#  9 setosa           4.4         2.9
# 10 setosa           4.9         3.1
```

# Useful Links:

* [SQL](https://docs.aws.amazon.com/athena/latest/ug/functions-operators-reference-section.html)
* [AWS Athena performance tips](https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/)
* [AWS Athena User Guide](https://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf)