--- title: " Fitting Linear and Generalized Linear Models to out of the memory data sets in Divide and Recombine approach " output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Fitting GLMs to Out of the Memory Data Sets } %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- To fit Generalized Linear Models (GLMs) on large data sets that exceed memory limits, you can use the 'big.drglm' function. The 'biglm' package has been used for such applications, but it doesn't support factors. The 'speedglm' package does support factors, but its application on large CSV files for fitting GLMs hasn't been demonstrated. We'll show how to fit large datasets in chunks for model fitting. ```{r setup, include=FALSE} options(rmarkdown.html_vignette.check_title = FALSE) ``` Now , lets create a toy data set ```{r} set.seed(123) #Number of rows to be generated n <- 1000000 #creating dataset dataset <- data.frame( Var_1 = round(rnorm(n, mean = 50, sd = 10)), Var_2 = round(rnorm(n, mean = 7.5, sd = 2.1)), Var_3 = as.factor(sample(c("0", "1"), n, replace = TRUE)), Var_4 = as.factor(sample(c("0", "1", "2"), n, replace = TRUE)), Var_5 = as.factor(sample(0:15, n, replace = TRUE)), Var_6 = round(rnorm(n, mean = 60, sd = 5)) ) # Save the dataset to a temporary file temp_file <- tempfile(fileext = ".csv") write.csv(dataset, file = temp_file, row.names = FALSE) # Path to the temporary file dataset_path <- temp_file dataset_path # Display the path to the temporary file ``` We've saved the data set to disk, assuming the data set is too large to fit into memory. We'll explore how to handle such data sets. The 'drglm' package offers a function called 'make.data', which is a slight modification of the 'biglm' and 'speedglm' packages. It's specifically designed for large CSV files that can't be loaded into memory. ```{r} # Path to the temporary file dataset_path <- temp_file dataset_path # Display the path to the temporary file # Initialize the data reading function with the data set path and chunk size da <- drglm::make.data(dataset_path, chunksize = 100000) ``` # Fitting MLR Models ```{r} # Fitting MLR Model nmodel <- drglm::big.drglm(da, formula = Var_1 ~ Var_2+ factor(Var_3)+ factor(Var_4)+ factor(Var_5)+ Var_6, 10, family="gaussian") # View the results table print(nmodel) ``` In this similar manner , we can fit other family models using the above data set. # Fitting logistic Regression Model ```{r} # Fitting Logistic Model bmodel <- drglm::big.drglm(da,formula = Var_3 ~ Var_1+ Var_2+ factor(Var_4)+ factor(Var_5)+ Var_6, 10, family="binomial") # View the results table print(bmodel) ``` # Fitting Poisson Regression Model ```{r} # Fitting Poisson Regression Model pmodel <- drglm::big.drglm(da, formula = Var_5 ~ Var_1+ Var_2+ factor(Var_3)+ factor(Var_4)+ Var_6, 10, family="poisson") # View the results table print(pmodel) ``` This package currently allows for fitting GLMs to very large data sets in CSV format. Future updates will enable users to fit GLMs to data sets in other formats.