Issue
I have a large directory of gzip-compressed files.
I want to use a tool to index these files. The tool works by walking over a folder and reading all text files. Unfortunately, the tool doesn't support reading gzip-compressed text files, and it's not practical for me to temporarily decompress all the files so the tool can access them. It would consume a massive amount of disk space, and even though disk space is cheap, it's still impractical in my use-case.
I also don't have access to the tool to modify it to add support for gzip-compression.
So I was thinking of a way to insert a middle-man, between the tool and my files, that would transparently perform the decompression on the fly.
To that end, is there any way for me, under Linux, to create a sort of symbolic filesystem that mirrors my folder contents, with a "fake" file for each original file so that, when read, it silently calls a script that accesses the original file, pipes it through gunzip, and returns the output? The effect would be that, from the tool's perspective, it's reading uncompressed files without me having to decompress them all at once.
Are there any other solutions that I'm overlooking?
Solution
There are a few approaches that occur to me, each with varying amounts of difficulty. The options are ordered by how easy they would be IMHO.
Option 1 -- A compressed-at-rest filesystem
Several modern file systems support compression at rest -- i.e., the data is stored compressed, and decompressed for you on demand. You could set up a partition of your disk with one of these filesystems (I would recommend ZFS) with compression enabled, and then copy all of your data into the partition, decompressing each file as you copy it so that the filesystem, rather than gzip, handles the compression.
Once you've done that, you'd have the disk usage of compressed data, but would be able to interact with the filesystem as if it were uncompressed.
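As a rough illustration of the copy step, here is a minimal Python sketch that mirrors a directory tree into the new filesystem and expands each .gz file as it goes. The function name copy_decompressed and the paths in the usage note are made up for illustration, and it assumes the destination is already mounted with compression turned on. It streams one file at a time, so memory use stays small and only one file is ever "in flight".

```python
import gzip
import os
import shutil


def copy_decompressed(src_root, dst_root):
    """Mirror src_root into dst_root, expanding every *.gz file on the way.

    Returns the number of files written. Hypothetical helper, not part of
    any tool mentioned in the question.
    """
    copied = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        out_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(out_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            if name.endswith(".gz"):
                # Stream so a large file never has to fit in memory.
                dst = os.path.join(out_dir, name[:-3])
                with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
                    shutil.copyfileobj(fin, fout)
            else:
                # Anything that is not gzip-compressed is copied unchanged.
                shutil.copy2(src, os.path.join(out_dir, name))
            copied += 1
    return copied
```

You would call it as something like copy_decompressed("/data/gz", "/tank/data"), where both paths are placeholders, and then point the indexing tool at the destination directory.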
Option 2 -- FUSE Wrapper
If you're willing to do some coding for this, using FUSE would be an attractive option. FUSE (Filesystem in Userspace) effectively lets you describe a file system in an ordinary user-space program, and implement reading/writing as just callbacks to your own code.
If you weren't worried about performance, it would be relatively straightforward to write a Python script that mirrors a directory tree and wraps all read calls with gunzip.
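To give an idea of what such a script might look like, here is a minimal sketch, not a finished tool. It assumes the third-party fusepy binding (imported as fuse); the class name GunzipMirror and the script name in the comments are made up for illustration. Every foo.gz in the source directory shows up as foo in the mount, and reads are served by decompressing on the fly. It is deliberately naive: each read re-decompresses the file from the beginning up to the requested offset, and the file size is found by decompressing the whole file, which is exactly the performance cost mentioned above.

```python
import errno
import gzip
import os
import sys

try:
    # Assumption: the fusepy package is installed. Any FUSE binding would do.
    from fuse import FUSE, FuseOSError, Operations
except ImportError:
    # Fallback so the class can still be imported and tried without FUSE.
    FUSE = None
    Operations = object
    FuseOSError = OSError


class GunzipMirror(Operations):
    """Read-only mirror of `root` in which every foo.gz appears as foo."""

    def __init__(self, root):
        self.root = os.path.realpath(root)

    def _real(self, path):
        """Map a path inside the mount to (real path, is_gzipped)."""
        real = os.path.join(self.root, path.lstrip("/"))
        if os.path.lexists(real):
            return real, False
        if os.path.isfile(real + ".gz"):
            return real + ".gz", True
        raise FuseOSError(errno.ENOENT)

    def _decompressed_size(self, real):
        # Slow but correct: stream through the file once and count bytes.
        total = 0
        with gzip.open(real, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                total += len(chunk)
        return total

    def getattr(self, path, fh=None):
        real, gz = self._real(path)
        st = os.lstat(real)
        attrs = {key: getattr(st, key) for key in (
            "st_mode", "st_nlink", "st_uid", "st_gid",
            "st_size", "st_atime", "st_mtime", "st_ctime")}
        if gz:
            # Report the uncompressed size so readers do not stop early.
            attrs["st_size"] = self._decompressed_size(real)
        return attrs

    def readdir(self, path, fh):
        real, _ = self._real(path)
        names = [n[:-3] if n.endswith(".gz") else n for n in os.listdir(real)]
        return [".", ".."] + names

    def read(self, path, size, offset, fh):
        real, gz = self._real(path)
        opener = gzip.open if gz else open
        with opener(real, "rb") as f:
            f.seek(offset)  # on gzip this decompresses from the start
            return f.read(size)


if __name__ == "__main__" and FUSE is not None and len(sys.argv) == 3:
    # Usage: python gunzip_mirror.py <source_dir> <mountpoint>
    FUSE(GunzipMirror(sys.argv[1]), sys.argv[2], foreground=True, ro=True)
```

After mounting, you would point the indexing tool at the mountpoint instead of the original folder. A more serious version would cache sizes and keep decompressors open between reads, and would need to decide what to do when both foo and foo.gz exist.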
Between options 1 and 2, I would lean towards 1. It will be more performant than any script you could hack out yourself, and would give you the added convenience of being able to use the data directly.
Answered By - Carson
Answer Checked By - Clifford M. (WPSolving Volunteer)