ZSTD compression support (import and export) #26
LGTM, but seeing the complexity of the handling I feel it might be required to support compressed and uncompressed feeds. I know we said we would, but restarting every step doesn't seem like a huge deal. Maybe we remove the code after the first week/month?
gtfs_loader/__init__.py
Outdated
copy_file_silently(import_filename, export_filename)
continue

with open(import_filename, 'rb') as import_f:
Indentation is pretty intense in this method; I would suggest either a helper method that does the work or using yield.
Do you suggest removing logic for supporting both compressed and uncompressed inputs once the migration is done? I’m just not sure about that considering that we might want to continue running with uncompressed CSVs locally / when testing for better visibility (the way I’m currently implementing compression is by allowing turning on a flag).
Goal: Supporting ZSTD compression (allowing reading both compressed and uncompressed GTFS files, as well as exporting files with compression via patch(...)).
Library: Using zstandard, the most popular as well as the most stable ZSTD library for Python (all the others are either too simplistic or not used by many). Note: Python supports ZSTD natively, but only from 3.14 onward, which would require upgrading across all code-bases. That is way too heavy, and since 3.14 has only recently been adopted, it is also risky.
Approach
Using stream-processing for reading and writing CSV files, integrating compression / decompression into the current setup.
Old: Reading from TEXT file-streams and writing to TEXT file-streams (opening files in text-mode with UTF-8 encoding / decoding done during read / write operations).
New: Additional compression / decompression stream-processing can be added into read / write streams, which means that streams need to support operating on both compressed and uncompressed data, as well as text and binary data.
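- Files are opened in binary mode (open(..., 'b')), which allows writing both UTF-8 data and compressed data.
- UTF-8 encoding / decoding is handled by TextIOWrapper instances.
- Compression / decompression is handled by zstandard streams (ZstdDecompressor / ZstdCompressor).
Utilities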
- Compression-check: check_if_file_zstd_compressed(filepath) allows determining whether a file is ZSTD-compressed or not by inspecting its header (every ZSTD file has a specific magic number in its header).
- UTF encoding/decoding: Using TextIOWrapper, which allows encoding strings into UTF-8 bytes and vice versa. Using utf-8-sig for import and utf-8 for export (compatible with prior implementation + extracted into constants for better reuse).
- Using the with statement to ensure proper closing / flushing (mostly for writing, but better to do it for reading as well for consistency).
Reading
Changing load_csv(...) to allow reading both compressed and uncompressed CSV files (determines compression-state at runtime using check_if_file_zstd_compressed(...)).
Stream:
- open(filepath, 'rb')
- ZstdDecompressor().stream_reader(...) - Reading from the file
- TextIOWrapper() - Reading either from the decompressor (compressed) or from the file directly (uncompressed).
- csv.reader(...) - Reading from the UTF-8 decoder (CSV parsing requires actual strings).
Writing
Updating patch(...) in 2 parts to allow exporting compressed GTFS files (via the new export_compressed flag):
- Copying files (patch)
- Saving CSVs (save_csv)
Copying
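Unlike with uncompressed-only files, which only needed direct copying of all files, it is now possible to run into the following 3 situations:
- Input and output are in the same state (both compressed or both uncompressed): direct copy.
- Compressed input, uncompressed output: decompressing while copying (ZstdDecompressor(...).copy_stream(...)).
- Uncompressed input, compressed output: compressing while copying (ZstdCompressor(...).copy_stream(...)).
Additional: Compression logic is only applied to CSV files, so all other files (JSON) are copied directly with no extra logic (simply expecting them to not be compressed).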
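For illustration, the copying cases could be sketched roughly as follows. This is not the PR's code: copy_gtfs_file and its two boolean parameters are hypothetical names, and the only library calls assumed are ZstdCompressor().copy_stream(...) and ZstdDecompressor().copy_stream(...) from the zstandard package. The import is guarded so the direct-copy path can be tried without the package installed.

```python
import shutil

try:
    import zstandard  # third-party package named in the description
except ImportError:  # the direct-copy path below still works without it
    zstandard = None


def copy_gtfs_file(src, dst, src_compressed, dst_compressed):
    """Copy src to dst, converting between ZSTD and plain bytes if needed.

    Hypothetical helper: the name and the boolean parameters are
    assumptions for illustration, not the PR's actual signature.
    """
    if src_compressed == dst_compressed:
        # Same state on both sides: a byte-for-byte copy is enough.
        shutil.copyfile(src, dst)
        return
    if zstandard is None:
        raise RuntimeError("zstandard is required to convert between states")
    with open(src, "rb") as src_f, open(dst, "wb") as dst_f:
        if src_compressed:
            # Compressed input, uncompressed output.
            zstandard.ZstdDecompressor().copy_stream(src_f, dst_f)
        else:
            # Uncompressed input, compressed output.
            zstandard.ZstdCompressor().copy_stream(src_f, dst_f)
```

Both copy_stream calls work chunk by chunk, so a large file never has to fit in memory, which matches the stream-processing approach described above.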
Saving CSVs
Changing save_csv(...) to allow writing both compressed and uncompressed CSV files, as specified by the export_compressed flag.
Stream:
- open(filepath, 'wb')
- ZstdCompressor(...).stream_writer(...) - Writing to the file
- TextIOWrapper() - Writing either to the binary file directly or to the ZSTD compressor
- csv.writer(...) - Writing to the UTF-8 encoder (CSVs are string-based 100%).
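To make the read and write stream chains concrete, here is a minimal sketch under stated assumptions. It is not the actual load_csv / save_csv code: is_zstd_compressed, read_rows, write_rows and the constant names are hypothetical stand-ins. The magic-number check looks only for the standard ZSTD frame magic (0xFD2FB528, stored little-endian) and ignores skippable frames.

```python
import csv
import io

try:
    import zstandard  # third-party package named in the description
except ImportError:  # the uncompressed paths below still work without it
    zstandard = None

ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # frame magic 0xFD2FB528, little-endian
IMPORT_ENCODING = "utf-8-sig"     # tolerates a leading BOM on import
EXPORT_ENCODING = "utf-8"


def is_zstd_compressed(filepath):
    """Return True if the file starts with the ZSTD frame magic number."""
    with open(filepath, "rb") as f:
        return f.read(len(ZSTD_MAGIC)) == ZSTD_MAGIC


def read_rows(filepath):
    """Read CSV rows from a plain or ZSTD-compressed file."""
    compressed = is_zstd_compressed(filepath)
    if compressed and zstandard is None:
        raise RuntimeError("zstandard is required to read compressed files")
    with open(filepath, "rb") as raw:
        # file -> (optional decompressor) -> UTF-8 decoder -> csv.reader
        source = zstandard.ZstdDecompressor().stream_reader(raw) if compressed else raw
        with io.TextIOWrapper(source, encoding=IMPORT_ENCODING, newline="") as text:
            return list(csv.reader(text))


def write_rows(filepath, rows, export_compressed=False):
    """Write CSV rows, optionally through a ZSTD compressor."""
    if export_compressed and zstandard is None:
        raise RuntimeError("zstandard is required to write compressed files")
    with open(filepath, "wb") as raw:
        # csv.writer -> UTF-8 encoder -> (optional compressor) -> file
        sink = zstandard.ZstdCompressor().stream_writer(raw) if export_compressed else raw
        # Closing the wrapper flushes the text layer and closes the
        # compressor, which is what finishes the ZSTD frame.
        with io.TextIOWrapper(sink, encoding=EXPORT_ENCODING, newline="") as text:
            csv.writer(text).writerows(rows)
```

A round trip such as write_rows(path, rows, export_compressed=True) followed by read_rows(path) should return the same rows, with the reader detecting compression from the file header and not from a flag.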