The zlib package for R aims to offer an R-based equivalent of Python’s built-in zlib module for data compression and decompression. The package provides a suite of functions for working with zlib compression, including utilities for compressing and decompressing data streams, manipulating compressed files, and working with the gzip, zlib, and deflate formats.
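For instance, here is a minimal sketch of the format switch, built only on the `compressobj()` API shown in the example below: the wbits argument selects the container, with `zlib$MAX_WBITS + 16` producing a gzip wrapper and `zlib$MAX_WBITS` a zlib wrapper (Python’s zlib additionally uses a negative wbits for raw deflate; we assume, without verifying here, that this package mirrors that convention).

```r
library(zlib)

raw_input <- charToRaw("hello, world!")

# gzip container: wbits = MAX_WBITS + 16
gz_compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
gzip_bytes <- c(gz_compressor$compress(raw_input), gz_compressor$flush())

# zlib container: wbits = MAX_WBITS
zl_compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS)
zlib_bytes <- c(zl_compressor$compress(raw_input), zl_compressor$flush())

# The magic bytes differ: 1f 8b marks a gzip stream, 78 9c a zlib stream at the default level
head(gzip_bytes, 2)
head(zlib_bytes, 2)
```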
This example demonstrates how to use the zlib Rcpp module for chunked compression and decompression. We will take a string, write it to a temporary file, and then read it back into a raw vector. Then we will compress and decompress the data using the zlib Rcpp module.
To install the zlib package, you can use the following command:

```r
# install.packages("zlib") # Uncomment this line if zlib is hosted on CRAN or a similar repo
```
First, make sure to load the zlib package:
```r
library(zlib)

# Create a temporary file
temp_file <- tempfile(fileext = ".txt")

# Generate example data and write to the temp file
example_data <- "This is an example string. It contains more than just 'hello, world!'"
writeBin(charToRaw(example_data), temp_file)

# Read data from the temp file into a raw vector
file_con <- file(temp_file, "rb")
raw_data <- readBin(file_con, "raw", file.info(temp_file)$size)
close(file_con)

# Create a Compressor object
compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)

# Initialize variables for chunked compression
chunk_size <- 1024
compressed_data <- raw(0)

# Compress the data in chunks
for (i in seq(1, length(raw_data), by = chunk_size)) {
  chunk <- raw_data[i:min(i + chunk_size - 1, length(raw_data))]
  compressed_chunk <- compressor$compress(chunk)
  compressed_data <- c(compressed_data, compressed_chunk)
}

# Flush the compressor buffer
compressed_data <- c(compressed_data, compressor$flush())

# Create a Decompressor object
decompressor <- zlib$decompressobj(zlib$MAX_WBITS + 16)

# Initialize variable for decompressed data
decompressed_data <- raw(0)

# Decompress the data in chunks
for (i in seq(1, length(compressed_data), by = chunk_size)) {
  chunk <- compressed_data[i:min(i + chunk_size - 1, length(compressed_data))]
  decompressed_chunk <- decompressor$decompress(chunk)
  decompressed_data <- c(decompressed_data, decompressed_chunk)
}

# Flush the decompressor buffer
decompressed_data <- c(decompressed_data, decompressor$flush())

# Convert decompressed raw vector back to string
decompressed_str <- rawToChar(decompressed_data)

# Should print TRUE
print(decompressed_str == example_data)
```
By following these steps, you can successfully compress and decompress data in chunks using the zlib Rcpp module.
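Because the compressor above was created with `zlib$MAX_WBITS + 16`, `compressed_data` is a standard gzip stream. As a small interoperability check using only base R, you can write it to a `.gz` file and read it back with `gzfile()`:

```r
# Write the gzip stream from the example above to disk
gz_path <- tempfile(fileext = ".gz")
writeBin(compressed_data, gz_path)

# Base R can read it back; expect a warning about an incomplete final line,
# since the example string has no trailing newline
readLines(gzfile(gz_path))
# [1] "This is an example string. It contains more than just 'hello, world!'"
```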
You might wonder, “Why do we need another zlib package when R already has built-in methods for compression and decompression?” Let me clarify why I decided to develop this package.
R’s built-in functions like memDecompress and memCompress are good for simple tasks, but they lack robustness and flexibility for advanced use cases:
**Handling Corrupt Data**: Functions like memDecompress are unstable when dealing with gzip bytes that may be corrupt. If the data has multiple header blocks or too small a chunk, the function doesn’t just fail cleanly; it can even crash your R session.
**Incomplete Data Stream**: A too-small chunk can hang your system when using memDecompress:

```r
compressed_data <- memCompress(charToRaw(paste0(rep("This is an example string. It contains more than just 'hello, world!'", 1000), collapse = ", ")))
rawToChar(memDecompress(compressed_data[1:300], type = "gzip")) # Caused a hang-up
readLines(gzcon(rawConnection(compressed_data[1:300])))
# Warning in readLines(gzcon(rawConnection(compressed_data[1:300])))
#   incomplete final line found in 'gzcon(compressed_data[1:300])'
# [1] "x\x9c\...."
```
**Multiple Header Blocks**: R’s memDecompress doesn’t handle gzip data with multiple headers well.

```r
multi_header_compressed_data_15wbits <- c(memCompress(charToRaw("Hello World"), type = "gzip"), memCompress(charToRaw("Hello World"), type = "gzip"))
rawToChar(memDecompress(multi_header_compressed_data_15wbits, type = "gzip")) # Returns only the first "Hello World"
```
Whereas using pigz or gzip through pipes returns the expected concatenated string:

```r
tmp_file <- tempfile(fileext = ".gzip")
writeBin(multi_header_compressed_data_15wbits, tmp_file)
readLines(pipe(sprintf("pigz -c -d %s --verbose 2>/dev/null", tmp_file), open = "rb")) # works correctly, but with a warning
# Warning in readLines(pipe(sprintf("pigz -c -d %s --verbose 2>/dev/null",
#   incomplete final line found in 'pigz -c -d /tmp/Rtmp5IRHIZ/file44665a0e2eb2.gzip --verbose 2>/dev/null'
# [1] "Hello WorldHello World"
readLines(pipe(sprintf("gzip -d %s --verbose --stdout", tmp_file), open = "rb")) # not working because of the wrong wbits (see next point)
# character(0)
```
**GZIP File Format Specification**: R’s memCompress doesn’t adhere strictly to the GZIP File Format Specification, particularly regarding the usage of window bits.

```r
memCompress("Hello World", type = "gzip") # Only 15 wbits -> zlib wrapper, no gzip header or checksum
# [1] 78 9c f3 48 cd c9 c9 57 08 cf 2f ca 49 01 00 18 0b 04 1d
```
See the official GZIP File Format Specification (RFC 1952) for details.

**Incorrect Behavior with Different wbits**: The behavior of memCompress is inconsistent when different wbits are used for compression and decompression.
```r
compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
multi_header_compressed_data_31wbits <- c(
  c(compressor$compress(charToRaw("Hello World")), compressor$flush()),
  c(compressor$compress(charToRaw("Hello World")), compressor$flush())
)
readLines(gzcon(rawConnection(multi_header_compressed_data_31wbits))) # returns a single line if the gzip wbits are correct!
# Warning in readLines(gzcon(rawConnection(multi_header_compressed_data_31wbits)))
#   incomplete final line found in 'gzcon(multi_header_compressed_data_31wbits)'
# [1] "Hello World"
readLines(gzcon(rawConnection(multi_header_compressed_data_15wbits)))
# Warning in readLines(gzcon(rawConnection(multi_header_compressed_data_15wbits)))
#   line 1 appears to contain an embedded nul
# Warning in readLines(gzcon(rawConnection(multi_header_compressed_data_15wbits)))
#   incomplete final line found in 'gzcon(multi_header_compressed_data_15wbits)'
# [1] "x\x9c\xf3H\xcd\xc9\xc9W\b\xcf/\xcaI\001"
tmp_file <- tempfile(fileext = ".gzip")
writeBin(multi_header_compressed_data_31wbits, tmp_file)
readLines(pipe(sprintf("gzip -d %s --verbose --stdout", tmp_file), open = "rb")) # the gzip pipe works with correct wbits
# Warning in readLines(pipe(sprintf("gzip -d %s --verbose --stdout", tmp_file),
#   incomplete final line found in 'gzip -d /tmp/RtmpPIZPMP/file6eed29844dc4.gzip --verbose --stdout'
# [1] "Hello WorldHello World"
```
**No Streaming Support**: There’s no native way to handle gzip streams from REST APIs or other data streams without creating temporary files or implementing cumbersome workarounds (e.g. with pipes and temp files); see the Flexibility example below.
In contrast, the zlib package offers:

**Robustness**: Built to handle even corrupted or incomplete gzip data efficiently without causing system failures.

```r
compressed_data <- memCompress(charToRaw(paste0(rep("This is an example string. It contains more than just 'hello, world!'", 1000), collapse = ", ")))
decompressor <- zlib$decompressobj(zlib$MAX_WBITS)
rawToChar(c(decompressor$decompress(compressed_data[1:300]), decompressor$flush())) # Still working
```
**Compliance**: Strict adherence to the GZIP File Format Specification, ensuring compatibility across systems.

```r
compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
c(compressor$compress(charToRaw("Hello World")), compressor$flush()) # Correct 31 wbits (or custom wbits you provide)
# [1] 1f 8b 08 00 00 00 00 00 00 03 f3 48 cd c9 c9 57 08 cf 2f ca 49 01 00 56 b1 17 4a 0b 00 00 00
```
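As a quick round-trip sketch (assuming your R version’s memDecompress() auto-detects gzip-wrapped input, which the examples above suggest), a stream produced this way can be read back with base R:

```r
compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
gzip_bytes <- c(compressor$compress(charToRaw("Hello World")), compressor$flush())

# Decompress with base R to confirm the stream carries a standard gzip header
rawToChar(memDecompress(gzip_bytes, type = "gzip"))
# [1] "Hello World"
```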
**Flexibility**: Ability to manage gzip streams from REST APIs without the need for temporary files or other workarounds, as the following byte-range example shows.
```r
# Byte-range requests and decompression in chunks

# Initialize the decompressor
decompressor <- zlib$decompressobj(zlib$MAX_WBITS + 16)

# Define the URL and initial byte ranges
url <- "https://example.com/api/data.gz"
range_start <- 0
range_increment <- 5000 # Adjust based on desired chunk size

# Placeholder for the decompressed content
decompressed_content <- character(0)

# Loop to make multiple requests and decompress chunk by chunk
for (i in 1:5) { # Adjust the loop count based on the number of chunks you want to retrieve
  range_end <- range_start + range_increment

  # Make a byte-range request
  response <- httr::GET(url, httr::add_headers(`Range` = paste0("bytes=", range_start, "-", range_end)))

  # Check if the request was successful
  if (httr::http_type(response) != "application/octet-stream" || httr::http_status(response)$category != "Success") {
    stop("Failed to retrieve data.")
  }

  # Decompress the received chunk
  compressed_data <- httr::content(response, "raw")
  decompressed_chunk <- decompressor$decompress(compressed_data)
  decompressed_content <- c(decompressed_content, rawToChar(decompressed_chunk))

  # Update the byte range for the next request
  range_start <- range_end + 1
}

# Flush the decompressor after all chunks have been processed
final_data <- decompressor$flush()
decompressed_content <- c(decompressed_content, rawToChar(final_data))
```
In summary, while R’s built-in methods could someday catch up in functionality, my zlib package for now fills an important gap by providing a more robust and flexible way to handle compression and decompression tasks.
The following benchmark compares the performance of the zlib package with the built-in memCompress and memDecompress functions. The benchmark was run on a Latitude 7430 with a 12th Gen Intel(R) Core(TM) i5-1245U (12) @ 4.4 GHz processor and 32 GB of RAM.
```r
library(zlib)

example_data <- charToRaw(paste0(rep("This is an example string. It contains more than just 'hello, world!'", 1000), collapse = "\n"))

microbenchmark::microbenchmark(
  {
    # Built-in memCompress/memDecompress
    compressed_data <- memCompress(example_data, type = "gzip")
    decompressed_data <- memDecompress(compressed_data, type = "gzip")
  },
  {
    # zlib package
    compressor <- zlib$compressobj()
    compressed_data <- c(compressor$compress(example_data), compressor$flush())
    decompressor <- zlib$decompressobj()
    decompressed_data <- c(decompressor$decompress(compressed_data), decompressor$flush())
  },
  times = 5000
)
# (first row: memCompress/memDecompress, second row: zlib)
#      min       lq     mean   median       uq      max neval
#  277.041 323.6640 408.4731 363.7165 395.5025 7931.280  5000
#  203.626 255.7815 308.8654 297.2095 337.2320 6864.512  5000
```
We’ve identified some exciting opportunities for extending the capabilities of this library. While these features are not currently planned for immediate development, we’re open to collaboration or feature requests to bring these ideas to life.
Gztool specializes in indexing, compressing, and data retrieval for GZIP files. With Gztool integration, you could create lightweight indexes for your gzipped files, enabling you to extract data more quickly and randomly. This would eliminate the need to decompress large gzip files entirely just to access specific data at the end of the file.
Pugz offers parallel decompression of gzipped text files. It employs a truly parallel algorithm that works in two passes, significantly accelerating the decompression process. This could be a valuable addition for users dealing with large datasets and seeking more efficient data processing.
If any of these feature enhancements interest you, or if you have other suggestions for improving the library, feel free to reach out for collaboration.
To build the package from source, first install the system dependencies.

On Debian/Ubuntu:

```bash
sudo apt-get update
sudo apt-get install cmake ninja-build r-base libblas-dev liblapack-dev build-essential
```

On RPM-based distributions:

```bash
sudo yum update
sudo yum install cmake ninja-build R libblas-devel liblapack-devel gcc-c++
```
Clone the repository:

```bash
git clone https://github.com/yourusername/zlib.git
```

Install the package locally:

```bash
make install
```

Build the package locally:

```bash
make build
```
This project is licensed under the MIT License - see the LICENSE file for details.