The zlib package for R aims to offer an R-based equivalent of Python’s built-in zlib module for data compression and decompression. The package provides a suite of functions for working with zlib compression, including utilities for compressing and decompressing data streams, manipulating compressed files, and working with the gzip, zlib, and deflate formats.
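For instance, here is a minimal sketch of the format switch, built only on the `compressobj()` API shown in the example below: the wbits argument selects the container, with `zlib$MAX_WBITS + 16` producing a gzip wrapper and `zlib$MAX_WBITS` a zlib wrapper (Python’s zlib additionally uses a negative wbits for raw deflate; we assume, without verifying here, that this package mirrors that convention).

```r
library(zlib)

raw_input <- charToRaw("hello, world!")

# gzip container: wbits = MAX_WBITS + 16
gz_compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
gzip_bytes <- c(gz_compressor$compress(raw_input), gz_compressor$flush())

# zlib container: wbits = MAX_WBITS
zl_compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS)
zlib_bytes <- c(zl_compressor$compress(raw_input), zl_compressor$flush())

# The magic bytes differ: 1f 8b marks a gzip stream, 78 9c a zlib stream at the default level
head(gzip_bytes, 2)
head(zlib_bytes, 2)
```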
This example demonstrates how to use the zlib Rcpp module for chunked compression and decompression. We will take a string, write it to a temporary file, and then read it back into a raw vector. Then we will compress and decompress the data using the zlib Rcpp module.
To install the zlib package, you can use the following command:

```r
# install.packages("zlib") # Uncomment this line if zlib is hosted on CRAN or a similar repo
```
First, make sure to load the zlib package:
```r
library(zlib)

# Create a temporary file
temp_file <- tempfile(fileext = ".txt")

# Generate example data and write to the temp file
example_data <- "This is an example string. It contains more than just 'hello, world!'"
writeBin(charToRaw(example_data), temp_file)

# Read data from the temp file into a raw vector
file_con <- file(temp_file, "rb")
raw_data <- readBin(file_con, "raw", file.info(temp_file)$size)
close(file_con)

# Create a Compressor object
compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)

# Initialize variables for chunked compression
chunk_size <- 1024
compressed_data <- raw(0)

# Compress the data in chunks
for (i in seq(1, length(raw_data), by = chunk_size)) {
  chunk <- raw_data[i:min(i + chunk_size - 1, length(raw_data))]
  compressed_chunk <- compressor$compress(chunk)
  compressed_data <- c(compressed_data, compressed_chunk)
}

# Flush the compressor buffer
compressed_data <- c(compressed_data, compressor$flush())

# Create a Decompressor object
decompressor <- zlib$decompressobj(zlib$MAX_WBITS + 16)

# Initialize variable for decompressed data
decompressed_data <- raw(0)

# Decompress the data in chunks
for (i in seq(1, length(compressed_data), by = chunk_size)) {
  chunk <- compressed_data[i:min(i + chunk_size - 1, length(compressed_data))]
  decompressed_chunk <- decompressor$decompress(chunk)
  decompressed_data <- c(decompressed_data, decompressed_chunk)
}

# Flush the decompressor buffer
decompressed_data <- c(decompressed_data, decompressor$flush())

# Convert decompressed raw vector back to string
decompressed_str <- rawToChar(decompressed_data)

# Should print TRUE
print(decompressed_str == example_data)
```
By following these steps, you can successfully compress and decompress data in chunks using the zlib Rcpp module.
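Because the compressor above was created with `zlib$MAX_WBITS + 16`, `compressed_data` is a standard gzip stream. As a small interoperability check using only base R, you can write it to a `.gz` file and read it back with `gzfile()`:

```r
# Write the gzip stream from the example above to disk
gz_path <- tempfile(fileext = ".gz")
writeBin(compressed_data, gz_path)

# Base R can read it back; expect a warning about an incomplete final line,
# since the example string has no trailing newline
readLines(gzfile(gz_path))
# [1] "This is an example string. It contains more than just 'hello, world!'"
```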
You might wonder, “Why do we need another zlib package when R already has built-in methods for compression and decompression?” Let me clarify why I decided to develop this package.
R’s built-in functions like memDecompress and memCompress are good for simple tasks, but they lack robustness and flexibility for advanced use cases:
**Handling Corrupt Data**: Functions like memDecompress are unstable when dealing with gzip bytes that may be corrupt. If the data has multiple header blocks or too small a chunk, the function doesn’t just fail cleanly; it can even crash your R session.
**Incomplete Data Stream**: A too-small chunk can hang your system when using memDecompress:

```r
compressed_data <- memCompress(charToRaw(paste0(rep("This is an example string. It contains more than just 'hello, world!'", 1000), collapse = ", ")))
rawToChar(memDecompress(compressed_data[1:300], type = "gzip")) # Caused a hang-up
readLines(gzcon(rawConnection(compressed_data[1:300])))
# Warning in readLines(gzcon(rawConnection(compressed_data[1:300])))
#   incomplete final line found in 'gzcon(compressed_data[1:300])'
# [1] "x\x9c\...."
```
**Multiple Header Blocks**: R’s memDecompress doesn’t handle gzip data with multiple headers well.

```r
multi_header_compressed_data_15wbits <- c(memCompress(charToRaw("Hello World"), type = "gzip"), memCompress(charToRaw("Hello World"), type = "gzip"))
rawToChar(memDecompress(multi_header_compressed_data_15wbits, type = "gzip")) # Returns only the first "Hello World"
```
Whereas using pigz or gzip through pipes returns the expected concatenated string:

```r
tmp_file <- tempfile(fileext = ".gzip")
writeBin(multi_header_compressed_data_15wbits, tmp_file)
readLines(pipe(sprintf("pigz -c -d %s --verbose 2>/dev/null", tmp_file), open = "rb")) # works correctly, but with a warning
# Warning in readLines(pipe(sprintf("pigz -c -d %s --verbose 2>/dev/null",
#   incomplete final line found in 'pigz -c -d /tmp/Rtmp5IRHIZ/file44665a0e2eb2.gzip --verbose 2>/dev/null'
# [1] "Hello WorldHello World"
readLines(pipe(sprintf("gzip -d %s --verbose --stdout", tmp_file), open = "rb")) # not working because of the wrong wbits (see next point)
# character(0)
```
**GZIP File Format Specification**: R’s memCompress doesn’t adhere strictly to the GZIP File Format Specification, particularly regarding the usage of window bits.

```r
memCompress("Hello World", type = "gzip") # Only 15 wbits -> zlib wrapper, no gzip header or checksum
# [1] 78 9c f3 48 cd c9 c9 57 08 cf 2f ca 49 01 00 18 0b 04 1d
```
See the official GZIP File Format Specification (RFC 1952) for details.

**Incorrect Behavior with Different wbits**: The behavior of memCompress is inconsistent when different wbits are used for compression and decompression.
```r
compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
multi_header_compressed_data_31wbits <- c(
  c(compressor$compress(charToRaw("Hello World")), compressor$flush()),
  c(compressor$compress(charToRaw("Hello World")), compressor$flush())
)
readLines(gzcon(rawConnection(multi_header_compressed_data_31wbits))) # returns a single line if the gzip wbits are correct!
# Warning in readLines(gzcon(rawConnection(multi_header_compressed_data_31wbits)))
#   incomplete final line found in 'gzcon(multi_header_compressed_data_31wbits)'
# [1] "Hello World"
readLines(gzcon(rawConnection(multi_header_compressed_data_15wbits)))
# Warning in readLines(gzcon(rawConnection(multi_header_compressed_data_15wbits)))
#   line 1 appears to contain an embedded nul
# Warning in readLines(gzcon(rawConnection(multi_header_compressed_data_15wbits)))
#   incomplete final line found in 'gzcon(multi_header_compressed_data_15wbits)'
# [1] "x\x9c\xf3H\xcd\xc9\xc9W\b\xcf/\xcaI\001"
tmp_file <- tempfile(fileext = ".gzip")
writeBin(multi_header_compressed_data_31wbits, tmp_file)
readLines(pipe(sprintf("gzip -d %s --verbose --stdout", tmp_file), open = "rb")) # the gzip pipe works with correct wbits
# Warning in readLines(pipe(sprintf("gzip -d %s --verbose --stdout", tmp_file),
#   incomplete final line found in 'gzip -d /tmp/RtmpPIZPMP/file6eed29844dc4.gzip --verbose --stdout'
# [1] "Hello WorldHello World"
```
**No Streaming Support**: There’s no native way to handle gzip streams from REST APIs or other data streams without creating temporary files or implementing cumbersome workarounds (e.g. with pipes and temp files); see the Flexibility example below.
In contrast, the zlib package offers:

**Robustness**: Built to handle even corrupted or incomplete gzip data efficiently without causing system failures.

```r
compressed_data <- memCompress(charToRaw(paste0(rep("This is an example string. It contains more than just 'hello, world!'", 1000), collapse = ", ")))
decompressor <- zlib$decompressobj(zlib$MAX_WBITS)
rawToChar(c(decompressor$decompress(compressed_data[1:300]), decompressor$flush())) # Still working
```
**Compliance**: Strict adherence to the GZIP File Format Specification, ensuring compatibility across systems.

```r
compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
c(compressor$compress(charToRaw("Hello World")), compressor$flush()) # Correct 31 wbits (or custom wbits you provide)
# [1] 1f 8b 08 00 00 00 00 00 00 03 f3 48 cd c9 c9 57 08 cf 2f ca 49 01 00 56 b1 17 4a 0b 00 00 00
```
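As a quick round-trip sketch (assuming your R version’s memDecompress() auto-detects gzip-wrapped input, which the examples above suggest), a stream produced this way can be read back with base R:

```r
compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
gzip_bytes <- c(compressor$compress(charToRaw("Hello World")), compressor$flush())

# Decompress with base R to confirm the stream carries a standard gzip header
rawToChar(memDecompress(gzip_bytes, type = "gzip"))
# [1] "Hello World"
```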
**Flexibility**: Ability to manage gzip streams from REST APIs without the need for temporary files or other workarounds, as the following byte-range example shows.
```r
# Byte-range requests and decompression in chunks

# Initialize the decompressor
decompressor <- zlib$decompressobj(zlib$MAX_WBITS + 16)

# Define the URL and initial byte ranges
url <- "https://example.com/api/data.gz"
range_start <- 0
range_increment <- 5000 # Adjust based on desired chunk size

# Placeholder for the decompressed content
decompressed_content <- character(0)

# Loop to make multiple requests and decompress chunk by chunk
for (i in 1:5) { # Adjust the loop count based on the number of chunks you want to retrieve
  range_end <- range_start + range_increment

  # Make a byte-range request
  response <- httr::GET(url, httr::add_headers(`Range` = paste0("bytes=", range_start, "-", range_end)))

  # Check if the request was successful
  if (httr::http_type(response) != "application/octet-stream" || httr::http_status(response)$category != "Success") {
    stop("Failed to retrieve data.")
  }

  # Decompress the received chunk
  compressed_data <- httr::content(response, "raw")
  decompressed_chunk <- decompressor$decompress(compressed_data)
  decompressed_content <- c(decompressed_content, rawToChar(decompressed_chunk))

  # Update the byte range for the next request
  range_start <- range_end + 1
}

# Flush the decompressor after all chunks have been processed
final_data <- decompressor$flush()
decompressed_content <- c(decompressed_content, rawToChar(final_data))
```
In summary, while R’s built-in methods could someday catch up in functionality, my zlib package for now fills an important gap by providing a more robust and flexible way to handle compression and decompression tasks.
The following benchmark compares the performance of the zlib package with the built-in memCompress and memDecompress functions. The benchmark was run on a Latitude 7430 with a 12th Gen Intel(R) Core(TM) i5-1245U (12) @ 4.4 GHz processor and 32 GB of RAM.
```r
library(zlib)

example_data <- charToRaw(paste0(rep("This is an example string. It contains more than just 'hello, world!'", 1000), collapse = "\n"))

microbenchmark::microbenchmark(
  {
    # Built-in memCompress/memDecompress
    compressed_data <- memCompress(example_data, type = "gzip")
    decompressed_data <- memDecompress(compressed_data, type = "gzip")
  },
  {
    # zlib package
    compressor <- zlib$compressobj()
    compressed_data <- c(compressor$compress(example_data), compressor$flush())
    decompressor <- zlib$decompressobj()
    decompressed_data <- c(decompressor$decompress(compressed_data), decompressor$flush())
  },
  times = 5000
)
# (first row: memCompress/memDecompress, second row: zlib)
#      min       lq     mean   median       uq      max neval
#  277.041 323.6640 408.4731 363.7165 395.5025 7931.280  5000
#  203.626 255.7815 308.8654 297.2095 337.2320 6864.512  5000
```
We’ve identified some exciting opportunities for extending the capabilities of this library. While these features are not currently planned for immediate development, we’re open to collaboration or feature requests to bring these ideas to life.
Gztool specializes in indexing, compressing, and data retrieval for GZIP files. With Gztool integration, you could create lightweight indexes for your gzipped files, enabling you to extract data more quickly and randomly. This would eliminate the need to decompress large gzip files entirely just to access specific data at the end of the file.
Pugz offers parallel decompression of gzipped text files. It employs a truly parallel algorithm that works in two passes, significantly accelerating the decompression process. This could be a valuable addition for users dealing with large datasets and seeking more efficient data processing.
If any of these feature enhancements interest you, or if you have other suggestions for improving the library, feel free to reach out for collaboration.
To build the package from source, first install the system dependencies.

On Debian/Ubuntu:

```bash
sudo apt-get update
sudo apt-get install cmake ninja-build r-base libblas-dev liblapack-dev build-essential
```

On RPM-based distributions:

```bash
sudo yum update
sudo yum install cmake ninja-build R libblas-devel liblapack-devel gcc-c++
```
Clone the repository:

```bash
git clone https://github.com/yourusername/zlib.git
```

Install the package locally:

```bash
make install
```

Build the package locally:

```bash
make build
```
This project is licensed under the MIT License - see the LICENSE file for details.