This R package allows R users to easily import large SAS datasets into Spark tables in parallel.
The package uses the spark-sas7bdat Spark package to read a SAS dataset into Spark. That Spark package imports the data in parallel on the Spark cluster using the Parso library, and the import is launched from R through sparklyr.
More information about the spark-sas7bdat Spark package and sparklyr can be found at:
The following example reads a file called iris.sas7bdat in parallel into a Spark table called sas_example. Try this with bigger data on your own cluster; see the help of the sparklyr package for how to connect to your Spark cluster.
library(sparklyr)
library(spark.sas7bdat)
mysasfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")
sc <- spark_connect(master = "local")
x <- spark_read_sas(sc, path = mysasfile, table = "sas_example")
The resulting pointer to a Spark table can then be used in dplyr statements. These are executed in parallel using the Spark functionality of the spark-sas7bdat package.
library(dplyr)
library(magrittr)
x %>% group_by(Species) %>%
  summarise(count = n(), length = mean(Sepal_Length), width = mean(Sepal_Width))
Need support in big data and Spark analysis? Contact BNOSAC: http://www.bnosac.be