\documentclass[a4paper,11pt]{article}
%\VignetteIndexEntry{LaF-manual}

\title{{\Huge\texttt{LaF}}\\A package for processing large ASCII files}
\author{D.J. van der Laan}
\date{2011-11-06}

\begin{document}

\maketitle

\section{Introduction}

\texttt{LaF} is an R package for reading large ASCII files. It offers some
functionality that is missing from the regular R routines for processing
ASCII files. First of all, it is optimised for speed. Reading fixed width
files in particular is very slow with the regular R routine
\texttt{read.fwf}. However, since it is optimised for speed, some of the
flexibility of the regular routines is lost. Secondly, it offers random
access: only those rows and columns that are needed are read. With the
regular routines one always has to read all columns.

The problem with big files is that they do not fit into memory. One could
even consider this to be the definition of `big'. To work comfortably with
data in R, the data set needs to fit into memory multiple ($\sim$3) times.
There are roughly two methods for working with data that does not fit into
memory. The first is to read the data in blocks that do fit into memory,
process each of these blocks, and merge the results. More on this can be
found in section~\ref{sec:blockwise}. The second is to read only that part
of the data into memory which is needed for the calculation at hand, hoping
that this subset does fit into memory. For example, to cross-tabulate two
variables one only needs these two variables. As most data sets contain
dozens of variables, this can easily reduce the memory needed for the
operation by a factor of ten. More on this in section~\ref{sec:subsets}.

Why ASCII? Why not use a binary format, as the \texttt{ff} package and
similar packages do? True, binary storage allows for much faster access
since the conversion from ASCII to binary format is not needed, and data can
often be stored much more compactly. The main reason is portability. Almost
every program designed for data processing can read ASCII files. And even if
one wants to use a package like \texttt{ff}, the source files are often
ASCII files and first need to be converted to the \texttt{ff} format.
\texttt{LaF} can also speed up this last process.

\section{Opening a file}
\label{sec:opening}

<<>>=
options(width=54)
library(LaF)
# Generate data
n <- 10000
data <- data.frame(
    id = trunc(runif(n, 1, 1E6)),
    gender = sample(c("M", "F"), n, replace=TRUE),
    postcode = paste(
        trunc(runif(n, 1000, 9999)),
        sample(LETTERS, n, replace=TRUE),
        sample(LETTERS, n, replace=TRUE), sep=""),
    age = round(runif(n, 0, 109)),
    income = round(rexp(n, 1/1000), 2),
    stringsAsFactors=FALSE)
# Generate fwf file
lines <- sprintf("%6.0f%1s%6s%3d%8.2f", data$id,
    data$gender, data$postcode, data$age, data$income)
writeLines(lines, con="file.fwf")
# Generate CSV file
lines <- sprintf("%.0f,%s,%s,%d,%f", data$id,
    data$gender, data$postcode, data$age, data$income)
writeLines(lines, con="file.csv")
@

\subsection{Column types}

\texttt{LaF} currently supports the following column types:
\begin{description}
\item[double] Fields containing floating point numbers. Scientific notation
(e.g. 1.9E-16) is not supported. The character used for the decimal mark can
be specified using the \texttt{dec} option of the functions used to open
files.
\item[integer] Fields containing positive or negative integer numbers (e.g.
42, -100).
\item[categorical] Categorical fields are treated as character fields except
that a table is built mapping all observed values to integers. A factor
vector is returned in R when this type is used. The levels can be read and
set using the \texttt{levels} method.
\item[string] Character fields such as postcodes or identification numbers.
\end{description}
As of version 0.5 of \texttt{LaF} it is also possible to set levels of
non-categorical columns using the \texttt{levels} method. For more
information see paragraph~\ref{sec:levels}.
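The categorical type thus corresponds closely to R's \texttt{factor} type:
each observed value is mapped to an integer code. The mapping can be
illustrated with plain R (this example does not use \texttt{LaF} itself):
<<>>=
x <- factor(c("M", "F", "F", "M"))
as.integer(x)  # the integer codes stored internally
levels(x)      # the table mapping codes to values
@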
\subsection{Fixed width files}
\label{sec:fixedwidth}

In fixed width files columns are defined by character positions in the file.
For example, the first seven characters of each line belong to the first
column, the next two characters belong to the second, etc. Each line
therefore has the same number of characters. This is also a disadvantage of
the format: if there is a column with variable string lengths, the column
has to be wide enough to accommodate the widest field. The main advantage of
the format is that reading large files (and especially random access) can be
very efficient, as the positions of rows and columns can be calculated.

Fixed width files can be opened using the function \texttt{laf\_open\_fwf}.
In order to open a file the following options can be specified:
\begin{description}
\item[filename] Name of the file to be opened.
\item[column\_types] Character vector containing the types of data in each
of the columns. Valid types are: double, integer, categorical and string.
\item[column\_widths] Numeric vector containing the width in number of
characters of each of the columns.
\item[column\_names (optional)] Optional character vector containing the
names of the columns. The default names are `V1', `V2', etc.
\item[dec (optional)] Optional character specifying the decimal mark. The
default value is `.'.
\item[trim (optional)] Optional logical value specifying whether or not
character strings should be trimmed of leading and trailing white space.
This applies to columns of type `string' as well as `categorical'. For fixed
width files the default is true (trim white space).
\end{description}

Suppose the following data is stored in the file `file.fwf' in the current
working directory (showing only the first five lines):
<<>>=
lines <- readLines("file.fwf", n=5)
cat(paste(lines,collapse="\n"), "\n")
@
Then this file can be opened using the following command:
<<>>=
dat <- laf_open_fwf(filename="file.fwf",
    column_types=c("integer", "categorical", "string",
        "integer", "double"),
    column_names=c("id", "gender", "postcode", "age",
        "income"),
    column_widths=c(6, 1, 6, 3, 8))
@
\texttt{dat} is now a \texttt{laf} object. Data can be extracted from this
object using the commands described in sections~\ref{sec:blockwise}
and~\ref{sec:subsets}. For example, to read all data in the file one can use
the following command:
<<>>=
alldata <- dat[ , ]
@
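After opening a file it can be useful to check that the column widths and
types were specified correctly. A simple check, using the indexing described
in section~\ref{sec:subsets}, is to read and inspect the first few rows:
<<>>=
# read the first three rows to check the column definitions
dat[1:3, ]
@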
\subsection{Comma separated files}

In comma separated files each line contains a row of data and the columns
are separated by a separator character. This is usually a comma, although
other separator characters are also used (e.g. the `;' is often used in
Europe, where the comma is often used as the decimal mark). It is a widely
used format. The disadvantage compared to the fixed width format is that the
positions of columns and rows in the file cannot be calculated. Therefore, a
program reading a comma separated file has to scan through the entire file
to find a certain row or column, making random access much slower than with
fixed width files.

A comma separated file for the \texttt{LaF} package has to observe the
following rules, some of which deviate slightly from the `official' rules
for comma separated files:
\begin{itemize}
\item The first row cannot contain the column names. These should be
specified using the option \texttt{column\_names} of the function
\texttt{laf\_open\_csv}. The first line in the file is treated as the first
data row and the columns in this line should be of the correct type.
\item Quotes are treated slightly differently from the way they are normally
treated in CSV files. Only double quotes are accepted. Everything inside
double quotes is considered part of the field, except newline characters and
double quotes. Double quotes in text fields are therefore not possible.
Below are a few examples of how quotes are interpreted:
\begin{itemize}
\item \texttt{12345} = \texttt{12345}
\item \texttt{"12345"} = \texttt{12345}
\item \texttt{"123"45} = \texttt{12345}
\item \texttt{"123""45"} = \texttt{123"45"}
\item \texttt{"123$\setminus$n45"} = ERROR
\item \texttt{12"345"} = \texttt{12"345"}
\end{itemize}
\item Each line in the file should contain exactly one row of data.
Normally, line breaks are allowed inside quoted fields; in order to keep the
code as fast as possible, this is not the case in the \texttt{LaF} package.
\end{itemize}

Comma separated files can be opened using the function
\texttt{laf\_open\_csv}. This function accepts the following arguments:
\begin{description}
\item[filename] Name of the file to be opened.
\item[column\_types] Character vector containing the types of data in each
of the columns. Valid types are: double, integer, categorical and string.
\item[column\_names (optional)] Optional character vector containing the
names of the columns. The default names are `V1', `V2', etc.
\item[sep (optional)] Optional character specifying the separator used
between the columns. The default value is `,'.
\item[dec (optional)] Optional character specifying the decimal mark. The
default value is `.'.
\item[trim (optional)] Optional logical value specifying whether or not
character strings should be trimmed of leading and trailing white space.
This applies to columns of type `string' as well as `categorical'. For comma
separated files the default is false (do not trim white space).
\item[skip (optional)] Optional numeric value specifying the number of lines
at the beginning of the file that should be skipped before starting to read
data. This can be used, for example, to skip the header, as headers are not
supported: the user is required to specify the types and names of the
columns.
\end{description}
As of version 0.5 of the \texttt{LaF} package, there is also the
\texttt{detect\_dm\_csv} routine, which can automatically detect column
types. See paragraph~\ref{sec:datamodels} for more information on how to use
data models to open files.

Suppose the following data is stored in the file `file.csv' in the current
working directory (showing only the first five lines):
<<>>=
lines <- readLines("file.csv", n=5)
cat(paste(lines,collapse="\n"), "\n")
@
Then this file can be opened using the following command:
<<>>=
dat <- laf_open_csv(filename="file.csv",
    column_types=c("integer", "categorical", "string",
        "integer", "double"),
    column_names=c("id", "gender", "postcode", "age",
        "income"))
@
\texttt{dat} is now a \texttt{laf} object. Data can be extracted from this
object using the commands described in sections~\ref{sec:blockwise}
and~\ref{sec:subsets}. For example, to read all data in the file one can use
the following command:
<<>>=
alldata <- dat[ , ]
@
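The \texttt{sep} and \texttt{dec} arguments can be combined to read files in
the format common in Europe. A minimal sketch, assuming a hypothetical file
`file2.csv' that uses `;' as separator and `,' as decimal mark:
<<eval=FALSE>>=
# `file2.csv' is hypothetical; `;' separates the columns
# and `,' is used as the decimal mark
dat2 <- laf_open_csv(filename="file2.csv",
    column_types=c("integer", "categorical", "string",
        "integer", "double"),
    column_names=c("id", "gender", "postcode", "age",
        "income"),
    sep=";", dec=",")
@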
\subsection{Opening using data models}
\label{sec:datamodels}

As of version 0.5, \texttt{LaF} has the ability to store all of the
arguments needed by \texttt{laf\_open\_fwf} and \texttt{laf\_open\_csv} in
so-called data models. These data models can be written to and read from
files using the functions \texttt{write\_dm} and \texttt{read\_dm}
respectively. \texttt{write\_dm} accepts either a data model or a
\texttt{laf} object as its input. To write the data model of the data set
from the previous section to file:
<<>>=
write_dm(dat, "model.yaml")
@
The data model is written in the well-documented and readable YAML format:
<<>>=
lines <- readLines("model.yaml")
cat(paste(lines,collapse="\n"), "\n")
@
The format probably speaks for itself. It is also possible to write these
files manually and read them using \texttt{read\_dm}. To open a file using a
data model the function \texttt{laf\_open} can be used:
<<>>=
dat <- laf_open(read_dm("model.yaml"))
@
Data models can also be generated from CSV files and Blaise data models
using the routines \texttt{detect\_dm\_csv} and \texttt{read\_dm\_blaise}.
See the documentation of these routines for more information.

\section{Blockwise processing}
\label{sec:blockwise}

Blockwise processing of a file usually has the following structure:
\begin{enumerate}
\item Go to the beginning of the file.
\item Read a block of data.
\item Perform calculations on this block, perhaps using results from the
previous block.
\item Store the results.
\item Repeat 2--4 until all data has been processed.
\item If necessary, combine the results of all the blocks.
\end{enumerate}

In order to go to a specific position in the file \texttt{LaF} offers two
methods: \texttt{begin} and \texttt{goto}. The first method simply goes to
the beginning of the file while the second goes to the specified line.
Assume a \texttt{laf} object named \texttt{dat} has been created (see
section~\ref{sec:opening} for this). The only argument needed by
\texttt{begin} is the \texttt{laf} object:
<<>>=
begin(dat)
@
For \texttt{goto} the line number also needs to be specified. The following
command sets the file pointer to the beginning of line 1000. The next call
to \texttt{next\_block} (see below) will therefore return as its first row
the data belonging to line 1000 of the file.
<<>>=
goto(dat, 1000)
@

Blocks of data can be read using \texttt{next\_block}. The first argument
needs to be the reference to the file (the \texttt{laf} object); other
arguments are optional. By default all columns and 5000 lines are read:
<<>>=
d <- next_block(dat)
nrow(d)
@
The number of lines can be specified using the \texttt{nrows} argument and
the columns that should be read can be specified using the \texttt{columns}
argument. The following command reads 100 lines and the first and third
column:
<<>>=
d <- next_block(dat, columns=c(1,3), nrows=100)
dim(d)
@
If possible, the use of the \texttt{columns} argument is advised. This can
significantly speed up the processing of the file. First of all, the amount
of data that needs to be transferred to R is much smaller. Second, the
strings in the unread columns do not need to be converted to numerical
values.

When the end of the file is reached \texttt{next\_block} returns a
\texttt{data.frame} with zero rows. This can be used to detect the end of
the file. The following example shows how \texttt{begin} and
\texttt{next\_block} can be used to count the number of records with gender
equal to `M' using the second column:
<<>>=
n <- 0
begin(dat)
while (TRUE) {
    d <- next_block(dat, 2)
    if (nrow(d) == 0) break
    n <- n + sum(d$gender == 'M')
}
print(n)
@
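This pattern, looping over blocks while accumulating a result, can be
wrapped in a small helper function. The sketch below is illustrative only;
the function \texttt{for\_each\_block} and its argument \texttt{fun} are not
part of \texttt{LaF}:
<<>>=
# illustrative helper (not part of LaF): apply `fun' to
# every block of `laf' and sum the results
for_each_block <- function(laf, fun, nrows=5000) {
    begin(laf)
    result <- 0
    repeat {
        d <- next_block(laf, nrows=nrows)
        if (nrow(d) == 0) break
        result <- result + fun(d)
    }
    result
}
for_each_block(dat, function(d) sum(d$gender == 'M'))
@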
Since processing a file in this way is such a common task, the method
\texttt{process\_blocks} has been defined, which automates this and is
faster. This method accepts as its first argument a \texttt{laf} object. The
second argument should be the function that is to be applied to each of the
blocks. This function should accept as its first argument the data blocks.
The last time the function is called it receives a \texttt{data.frame} with
zero rows; this can be used to do some final calculations. The second
argument of the function is the result of the previous function call. The
first time the function is called the second argument has the value
\texttt{NULL}; this can be used to perform initialisation. Additional
arguments to \texttt{process\_blocks} are passed on to the function. The
previous example can be translated into:
<<>>=
count <- function(d, prev) {
    if (is.null(prev)) prev <- 0
    return(prev + sum(d$gender == 'M'))
}
(n <- process_blocks(dat, count))
@
Using \texttt{process\_blocks} is faster than using \texttt{next\_block}
repeatedly, since in the latter case the \texttt{data.frame} containing the
data that is read in is destroyed and recreated at every iteration, while in
\texttt{process\_blocks} this \texttt{data.frame} is created only once.

Below is an example that calculates the average of the fifth column of the
file and illustrates initialisation and finalisation (note that this is not
how one would normally calculate the average of a column in a large file;
see section~\ref{sec:stats}). Since only this column is needed for the
calculation, the \texttt{columns} option is used, which makes the
calculation much faster.
<<>>=
ave <- function(d, prev) {
    # initialisation
    if (is.null(prev)) {
        prev <- c(sum=0, n=0)
    }
    # finalisation
    if (nrow(d) == 0) {
        return(as.numeric(prev[1]/prev[2]))
    }
    result <- prev + c(sum(d$income), nrow(d))
    return(result)
}
(n <- process_blocks(dat, ave, columns=5))
@

\section{Selecting subsets}
\label{sec:subsets}

Another common way of handling large files is to read only the data that is
needed for the operation at hand. This is feasible when such a subset of the
data does fit into memory. For this, selections can be performed on a
\texttt{laf} object using the same methods one would use for a regular
\texttt{data.frame}. The code below shows several examples:
<<>>=
# select the first 10 rows
result <- dat[1:10, ]
# select the second column
result <- dat[ , 2]
# select the first 10 rows and the second column
result <- dat[1:10, 2]
@
Indexing a \texttt{laf} object always results in a \texttt{data.frame}. For
example, the second and third examples would have resulted in a vector when
applied to a \texttt{data.frame}, while in the examples above a
\texttt{data.frame} with one column is returned.
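This difference can be verified directly using the object \texttt{alldata},
a regular \texttt{data.frame} containing the same data, which was read in
section~\ref{sec:opening}:
<<>>=
# indexing the laf object yields a data.frame
class(dat[1:10, 2])
# the same selection on a data.frame yields a vector
class(alldata[1:10, 2])
@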
Using the \texttt{\$} and \texttt{[[} operators, columns can be selected
from the \texttt{laf} object. The result is an object of type
\texttt{laf\_column}, which is a subclass of \texttt{laf}: a \texttt{laf}
object with a field containing the column number. To get the data inside
these columns, indexing can be used, as is shown in the following examples.
In the first example the records are selected from the file for which the
age is higher than 65:
<<>>=
result <- dat[dat$age[] > 65, ]
@
The same can be done using the column number
<<>>=
result <- dat[dat[[4]][] > 65, ]
@
or
<<>>=
result <- dat[dat[ , 4] > 65, ]
@
or
<<>>=
result <- dat[dat[ , "age"] > 65, ]
@

\section{Setting levels of columns}
\label{sec:levels}

It is possible to set the levels of categorical and non-categorical columns.
Since the file is read-only, it is not possible to recode the columns, as
would happen if we changed a column in a \texttt{data.frame} to
\texttt{factor}. Therefore, we need to specify both the levels and the
corresponding labels as a \texttt{data.frame}. For example, to change the
`age' column to a factor:
<<>>=
levels(dat)[["age"]] <- data.frame(levels=0:100,
    labels=paste(0:100, "years"))
dat$age[1:10]
@
These levels are also written to file when writing a data model using
\texttt{write\_dm} and are read in by \texttt{read\_dm}. You can therefore
also specify the levels of a column in the data model.
<<>>=
write_dm(dat, "model.yaml")
lines <- readLines("model.yaml")
cat(paste(c(lines[1:29], "..."),collapse="\n"), "\n")
@

\section{Calculating column statistics}
\label{sec:stats}

Using \texttt{process\_blocks} one can calculate all kinds of summary
statistics for columns. However, as some summary statistics are very common,
these have been implemented in the package. The available methods are:

\vspace{1em}\noindent\begin{tabular}{p{0.3\textwidth} p{0.65\textwidth}}
\texttt{colsum} & Calculate column sums \\
\texttt{colmean} & Calculate column means \\
\texttt{colfreq} & Calculate frequency tables of columns \\
\texttt{colrange} & Calculate the maximum and minimum value of columns \\
\texttt{colnmissing} & Calculate the number of missing values in columns \\
\end{tabular}\\[1em]
All methods accept as their first argument either a \texttt{laf} or a
\texttt{laf\_column} object. In the case of a \texttt{laf} object, the
second argument should be a vector of column numbers for which the
statistics should be calculated. For a \texttt{laf\_column} this is not
necessary. For example, to calculate the average age the following options
are available:
<<>>=
(m1 <- colmean(dat, columns=4))
(m1 <- colmean(dat$age))
@
Most methods also accept an \texttt{na.rm} argument which, as one would
expect, ignores missing values when calculating the statistics. The method
\texttt{colnmissing} does not have this argument, as it would be meaningless
there. \texttt{colfreq} has the argument \texttt{useNA}, which can take one
of the values `ifany', `always' or `no'.
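As a final illustration, \texttt{colfreq} can be used to tabulate the
`gender' column; a short sketch using the \texttt{useNA} argument:
<<>>=
# frequency table of the gender column, including
# missing values if there are any
colfreq(dat$gender, useNA="ifany")
@

\end{document}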