
ZSTD compression support (import and export)#26

Merged
gregory-pevnev-transit-app merged 7 commits into main from
gregorypevnev/sc-210517/gtfs-loader-python-library-supporting-compression
Apr 2, 2026

Conversation

@gregory-pevnev-transit-app
Contributor

@gregory-pevnev-transit-app gregory-pevnev-transit-app commented Mar 31, 2026

Goal: Support ZSTD compression, allowing reading both compressed and uncompressed GTFS files, as well as exporting files with compression via patch(...).

  • Only TXT / CSV files can be compressed (no support for JSON files), so no extra cases are added for them.
  • There is no way to know up front whether the input-files are compressed (no smooth migration), so each file has to be checked individually.

Library: Using zstandard - the most popular and most stable ZSTD library for Python (the alternatives are either too simplistic or barely used). Note: Python supports ZSTD natively, but only from 3.14 onward, which would require upgrading across all code-bases; that is way too heavy (and the native module has only recently been adopted, so it is also risky).

Approach

Using stream-processing for reading and writing CSV files, integrating compression / decompression into the current setup.

Old: Reading from TEXT file-streams and writing to TEXT file-streams (opening files in text-mode with UTF-8 encoding / decoding done during read / write operations).

New: Additional compression / decompression stream-processing can be inserted into the read / write pipelines, which means the streams need to support both compressed and uncompressed data, as well as both text and binary data.

  • Files: Opening in binary-mode (open(..., 'rb') / open(..., 'wb')) - allows handling both UTF-8 data and compressed data.
  • UTF-8: Encoding to / decoding from UTF-8 separately from file-operations (as they return binary-data) - using TextIOWrapper instances.
  • Compression / Decompression: Performing stream-transforming from binary-files / to binary-files (ZstdDecompressor / ZstdCompressor).

Utilities

Compression-check: check_if_file_zstd_compressed(filepath) determines whether a file is ZSTD-compressed by inspecting its header (every ZSTD-file starts with a specific magic-number).

  • The file has to be opened in binary-mode (gets done anyway).
  • The file-offset is reset back to start after the check is complete (the leading header-bytes can then be re-read for CSV-parsing or the actual ZSTD-decompression).
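A minimal sketch of the magic-number check described above. The PR's helper takes a filepath; this variant operates on an already-open binary stream so the offset-reset behaviour is visible (the function name is taken from the PR, the signature is an assumption):

```python
import io

# First 4 bytes of every ZSTD frame: magic number 0xFD2FB528, little-endian.
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"

def check_if_file_zstd_compressed(binary_file) -> bool:
    """Return True if the stream starts with the ZSTD magic number.

    The stream must be opened in binary mode; the offset is reset to the
    start afterwards so the header bytes can be re-read.
    """
    header = binary_file.read(len(ZSTD_MAGIC))
    binary_file.seek(0)
    return header == ZSTD_MAGIC
```

The same function works unchanged on a file opened with open(filepath, 'rb').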

UTF encoding/decoding: Using TextIOWrapper, which encodes strings into UTF-8 bytes and vice versa. Using utf-8-sig for import and utf-8 for export (compatible with the prior implementation + extracted into constants for better reuse).

  • Has to be opened using a with statement to ensure proper closing / flushing (mostly matters for writing, but better to do it for reading as well for consistency).
  • The buffer-size is only 8KB, so it barely affects memory-usage.
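A small illustration of the encoding asymmetry described above, using in-memory buffers (the constant names are hypothetical; the PR only says the encodings were extracted into constants):

```python
import io

IMPORT_ENCODING = "utf-8-sig"  # tolerates (and strips) a leading BOM
EXPORT_ENCODING = "utf-8"      # never writes a BOM

# Reading: a BOM written by e.g. Excel disappears transparently.
raw = io.BytesIO("\ufeffstop_id,stop_name\n".encode("utf-8"))
with io.TextIOWrapper(raw, encoding=IMPORT_ENCODING, newline="") as text:
    header = text.readline()  # "stop_id,stop_name\n" - no BOM

# Writing: detach() flushes and hands the buffer back instead of closing it.
out = io.BytesIO()
writer = io.TextIOWrapper(out, encoding=EXPORT_ENCODING, newline="")
writer.write(header)
writer.detach()
```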

Reading

Changing load_csv(...) to allow reading both compressed and uncompressed CSV files (determines compression-state at runtime using check_if_file_zstd_compressed(...)).

Stream:

  1. Binary-file: open(filepath, 'rb')
  2. (OPTIONAL) ZSTD decompression: ZstdDecompressor().stream_reader(...) - Reading from the file
  3. UTF-8 decoding: TextIOWrapper() - Reading either from the decompressor (compressed) or from the file directly (uncompressed).
  4. CSV-reader: csv.reader(...) - Reading from the UTF-8 decoder (CSV parsing requires actual strings).

Writing

Updating patch(...) in 2 parts to allow exporting compressed GTFS-files (via the new export_compressed flag):

  1. Copying of the files (body of patch)
  2. Actual saving of CSVs (save_csv)

Copying

Unlike before, when all files were uncompressed and could simply be copied directly, it is now possible to run into the following 3 situations:

  1. Compression-state matches between input and output (copying uncompressed or compressed files directly): matches the existing logic (simply copy).
  2. Input-files are compressed, but export should be uncompressed: requires copying with decompression (ZstdDecompressor(...).copy_stream(...)).
  3. Input-files are uncompressed, but export should be compressed: requires copying with compression (ZstdCompressor(...).copy_stream(...)).

Additional: Compression logic is only applied to CSV-files, so all other files (JSON) are always copied directly with no extra logic (they are simply expected not to be compressed).

Saving CSVs

Changing save_csv(...) to allow writing both compressed and uncompressed CSV files, as specified by export_compressed flag.

Stream:

  1. Binary-file: open(filepath, 'wb')
  2. (OPTIONAL) ZSTD compression: ZstdCompressor(...).stream_writer(...) - Writing to the file
  3. UTF-8 encoding: TextIOWrapper() - Writing either to the binary-file directly or to the ZSTD compressor
  4. CSV-writer: csv.writer(...) - Writing to the UTF-8 encoder (CSV data is entirely string-based).

Member

@gcamp gcamp left a comment


LGTM, but seeing the complexity of the handling I feel it might not be required to support both compressed and uncompressed feeds. I know we said we would, but restarting every step doesn't seem like a huge deal. Maybe we remove the code after the first week/month?

copy_file_silently(import_filename, export_filename)
continue

with open(import_filename, 'rb') as import_f:

Indentation is pretty intense in this method; I would suggest either a helper method that does the work, or a yield.

@gregory-pevnev-transit-app
Contributor Author

LGTM, but seeing the complexity of the handling I feel it might not be required to support both compressed and uncompressed feeds. I know we said we would, but restarting every step doesn't seem like a huge deal. Maybe we remove the code after the first week/month?

Do you suggest removing logic for supporting both compressed and uncompressed inputs once the migration is done?

I’m just not sure about that, considering that we might want to keep running with uncompressed CSVs locally / when testing, for better visibility (the current implementation enables compression via a flag).

@gregory-pevnev-transit-app gregory-pevnev-transit-app merged commit 3e5a454 into main Apr 2, 2026
2 checks passed