ZSTD compression support (import and export) by gregory-pevnev-transit-app · Pull Request #26 · TransitApp/py-gtfs-loader

gregory-pevnev-transit-app · 2026-03-31T21:02:27Z

Goal: Support ZSTD compression: read both compressed and uncompressed GTFS files, and export compressed files via patch(...).

Only TXT / CSV files can be compressed (JSON files are not supported), so no extra handling is added for JSON.
There is no way to know in advance whether the input files are compressed (there is no smooth migration), so each file has to be checked individually.

Library: Using zstandard - the most popular and most stable ZSTD library for Python (the alternatives are either too simplistic or not widely used). Note: Python supports ZSTD natively, but only from 3.14 onward, which would require upgrading all code-bases; that is far too heavy (and 3.14 has only recently been adopted, so it is also risky).

Approach

Using stream-processing for reading and writing CSV files, integrating compression / decompression into the current setup.

Old: Reading from TEXT file-streams and writing to TEXT file-streams (opening files in text-mode with UTF-8 encoding / decoding done during read / write operations).

New: Additional compression / decompression stream-processing can be added into read / write streams, which means that streams need to support operating on both compressed and uncompressed data, as well as text and binary data.

Files: Opening in binary-mode (open(..., 'rb') / open(..., 'wb')) - allows reading and writing both UTF-8 data and compressed data.
UTF-8: Encoding to / decoding from UTF-8 separately from file-operations (as they return binary-data) - using TextIOWrapper instances.
Compression / Decompression: Performing stream-transforming from binary-files / to binary-files (ZstdDecompressor / ZstdCompressor).

Utilities

Compression-check: check_if_file_zstd_compressed(filepath) allows determining whether a file is ZSTD-compressed or not by inspecting its header (every ZSTD-file has a specific magic-number in its header).

The file has to be opened in binary-mode (gets done anyway).
The file-offset is reset back to start after the check is complete (the leading header-bytes can then be re-read for CSV-parsing or the actual ZSTD-decompression).
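
The check described above can be sketched as follows. This is a minimal illustration and not the PR's actual code: the name looks_zstd_compressed and the choice to take an already-open binary stream are assumptions. The magic number itself (0xFD2FB528, stored little-endian) comes from the ZSTD frame format.

```python
import io

# ZSTD frame magic number 0xFD2FB528, as it appears on disk (little-endian).
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def looks_zstd_compressed(binary_file):
    """Return True if the binary stream starts with the ZSTD magic number.

    Hypothetical helper: the stream position is restored afterwards so the
    same bytes can be re-read by the CSV parser or the ZSTD decompressor.
    """
    start = binary_file.tell()
    header = binary_file.read(len(ZSTD_MAGIC))
    binary_file.seek(start)  # rewind to where the caller left the stream
    return header == ZSTD_MAGIC


# Usage with in-memory streams standing in for files opened in 'rb' mode.
compressed_like = io.BytesIO(ZSTD_MAGIC + b"payload")
plain_csv = io.BytesIO(b"stop_id,stop_name\n")
compressed_result = looks_zstd_compressed(compressed_like)
plain_result = looks_zstd_compressed(plain_csv)
```

An empty or very short file simply fails the comparison, so it is treated as uncompressed.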

UTF encoding/decoding: Using TextIOWrapper, which encodes strings into UTF-8 bytes and vice versa. Using utf-8-sig for import and utf-8 for export (compatible with the prior implementation + extracted into constants for better reuse).

The wrapper has to be used in a with statement to ensure proper closing / flushing (mostly for writing, but better to do it for reading as well for consistency).
The buffer-size is only 8KB, so it does not affect memory-usage that much.
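
A small sketch of the encoding split, using in-memory byte streams in place of real files. The constant names IMPORT_ENCODING and EXPORT_ENCODING are assumptions for illustration; the PR only says the encodings were extracted into constants.

```python
import io

# Hypothetical constant names; the values are the ones stated above.
IMPORT_ENCODING = "utf-8-sig"  # strips a leading BOM if one is present
EXPORT_ENCODING = "utf-8"      # never writes a BOM

# Reading: bytes with a BOM come in, clean text comes out.
raw_in = io.BytesIO(b"\xef\xbb\xbfstop_id,stop_name\n")
with io.TextIOWrapper(raw_in, encoding=IMPORT_ENCODING, newline="") as reader:
    decoded = reader.read()

# Writing: text goes in, UTF-8 bytes come out.
raw_out = io.BytesIO()
with io.TextIOWrapper(raw_out, encoding=EXPORT_ENCODING, newline="") as writer:
    writer.write("stop_id,stop_name\n")
    writer.flush()  # push buffered text down into the byte stream
    encoded = raw_out.getvalue()  # must be read before the wrapper closes raw_out
```

Leaving the with block closes the wrapped byte stream as well, which is why the written bytes are captured inside the block here.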

Reading

Changing load_csv(...) to allow reading both compressed and uncompressed CSV files (determines compression-state at runtime using check_if_file_zstd_compressed(...)).

Stream:

Binary-file: open(filepath, 'rb')
(OPTIONAL) ZSTD decompression: ZstdDecompressor().stream_reader(...) - Reading from the file
UTF-8 decoding: TextIOWrapper() - Reading either from the decompressor (compressed) or from the file directly (uncompressed).
CSV-reader: csv.reader(...) - Reading from the UTF-8 decoder (CSV parsing requires actual strings).
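
The read pipeline above could look roughly like this. It is a sketch under assumptions, not the PR's load_csv: the function name read_csv_rows is hypothetical, the magic-number check is inlined, and zstandard is imported lazily only so that the uncompressed path runs without the dependency.

```python
import csv
import io
import os
import tempfile

ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # ZSTD frame magic number (little-endian)


def read_csv_rows(filepath):
    """Return all rows of a CSV file that may or may not be ZSTD-compressed."""
    with open(filepath, "rb") as binary_file:
        is_compressed = binary_file.read(len(ZSTD_MAGIC)) == ZSTD_MAGIC
        binary_file.seek(0)  # rewind so the header bytes are read again below
        if is_compressed:
            import zstandard  # third-party; lazy import is a choice of this sketch
            byte_stream = zstandard.ZstdDecompressor().stream_reader(binary_file)
        else:
            byte_stream = binary_file
        # Decode bytes to text; csv.reader needs strings, not bytes.
        with io.TextIOWrapper(byte_stream, encoding="utf-8-sig", newline="") as text_stream:
            return list(csv.reader(text_stream))


# Usage: an uncompressed file with a BOM, as some GTFS producers emit.
example_path = os.path.join(tempfile.mkdtemp(), "stops.txt")
with open(example_path, "wb") as f:
    f.write(b"\xef\xbb\xbfstop_id,stop_name\n1,Main St\n")
rows = read_csv_rows(example_path)
```

The only branch is the choice of byte source; everything from the text wrapper upward is identical for both cases.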

Writing

Updating patch(...) in 2 parts to allow exporting compressed GTFS-files (via the new export_compressed flag):

Copying of the files (body of patch)
Actual saving of CSVs (save_csv)

Copying

Unlike with uncompressed-only files, which only needed direct copying of all files, it is now possible to run into the following 3 situations:

Compression-state matches between input and output (copying uncompressed or compressed files directly): matches the existing logic (simply copy).
Input-files are compressed, but export should be uncompressed: requires copying with decompression (ZstdDecompressor(...).copy_stream(...)).
Input-files are uncompressed, but export should be compressed: requires copying with compression (ZstdCompressor(...).copy_stream(...)).

Additionally: compression logic is applied only to CSV-files, so all other files (JSON) are copied directly with no extra logic (they are simply expected to be uncompressed).
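
The three copying situations can be sketched as one function. The name copy_csv and its signature are assumptions for illustration; copy_stream is the zstandard call named above, and the lazy import is only there so the plain-copy path does not need the dependency.

```python
import os
import shutil
import tempfile

ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # ZSTD frame magic number (little-endian)


def copy_csv(import_path, export_path, export_compressed):
    """Copy one CSV file, converting its compression state only if needed."""
    with open(import_path, "rb") as src, open(export_path, "wb") as dst:
        import_compressed = src.read(len(ZSTD_MAGIC)) == ZSTD_MAGIC
        src.seek(0)  # rewind so the copy starts from the first byte
        if import_compressed == export_compressed:
            # States match: plain byte-for-byte copy.
            shutil.copyfileobj(src, dst)
            return
        import zstandard  # third-party; lazy import is a choice of this sketch
        if import_compressed:
            codec = zstandard.ZstdDecompressor()  # compressed -> uncompressed
        else:
            codec = zstandard.ZstdCompressor()    # uncompressed -> compressed
        codec.copy_stream(src, dst)


# Usage: uncompressed input and uncompressed export, so a direct copy.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "stops.txt")
target_path = os.path.join(workdir, "stops_copy.txt")
with open(source_path, "wb") as f:
    f.write(b"stop_id,stop_name\n1,Main St\n")
copy_csv(source_path, target_path, export_compressed=False)
```

Non-CSV files such as JSON would bypass this function entirely and go through a direct copy.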

Saving CSVs

Changing save_csv(...) to allow writing both compressed and uncompressed CSV files, as specified by the export_compressed flag.

Stream:

Binary-file: open(filepath, 'wb')
(OPTIONAL) ZSTD compression: ZstdCompressor(...).stream_writer(...) - Writing to the file
UTF-8 encoding: TextIOWrapper() - Writing either to the binary-file directly or to the ZSTD compressor
CSV-writer: csv.writer(...) - Writing to the UTF-8 encoder (CSVs are string-based 100%).
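
A rough sketch of the write pipeline, mirroring the stream order listed above. The function name write_csv_rows is hypothetical and this is not the PR's save_csv; zstandard is imported lazily only so that the uncompressed path runs without the dependency.

```python
import csv
import io
import os
import tempfile


def write_csv_rows(filepath, rows, export_compressed=False):
    """Write rows as CSV, optionally passing the bytes through ZSTD."""
    with open(filepath, "wb") as binary_file:
        if export_compressed:
            import zstandard  # third-party; lazy import is a choice of this sketch
            byte_stream = zstandard.ZstdCompressor().stream_writer(binary_file)
        else:
            byte_stream = binary_file
        # Encode text to UTF-8 bytes; leaving the with block flushes the
        # wrapper and closes the byte stream, which finalizes a ZSTD frame.
        with io.TextIOWrapper(byte_stream, encoding="utf-8", newline="") as text_stream:
            csv.writer(text_stream).writerows(rows)


# Usage: uncompressed export.
output_path = os.path.join(tempfile.mkdtemp(), "stops.txt")
write_csv_rows(output_path, [["stop_id", "stop_name"], ["1", "Main St"]])
with open(output_path, "rb") as f:
    written_bytes = f.read()
```

Passing newline="" leaves line endings to csv.writer, which emits \r\n by default.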

gcamp

LGTM, but seeing the complexity of the handling I feel it might not be required to support both compressed and uncompressed feeds. I know we said we would, but restarting every step doesn't seem like a huge deal. Maybe we remove the code after the first week/month?

gcamp · 2026-04-02T02:39:59Z

gtfs_loader/__init__.py

+            copy_file_silently(import_filename, export_filename)
+            continue
+
+        with open(import_filename, 'rb') as import_f:


Indentation is pretty intense in this method; I would suggest either a helper method that does the work or yield.

gregory-pevnev-transit-app · 2026-04-02T15:29:07Z

LGTM, but seeing the complexity of the handling I feel it might not be required to support both compressed and uncompressed feeds. I know we said we would, but restarting every step doesn't seem like a huge deal. Maybe we remove the code after the first week/month?

Do you suggest removing logic for supporting both compressed and uncompressed inputs once the migration is done?

I’m just not sure about that considering that we might want to continue running with uncompressed CSVs locally / when testing for better visibility (the way I’m currently implementing compression is by allowing turning on a flag).

gregory-pevnev-transit-app added 3 commits March 31, 2026 16:47

ZSTD support WiP

1df6b4d

Testing finished

3a58d87

Cleanup

877e9ec

gregory-pevnev-transit-app requested a review from jsteelz as a code owner March 31, 2026 21:02

gregory-pevnev-transit-app added 3 commits March 31, 2026 17:38

Finished

9663b63

Merged

d261095

Finishing changes

e419b87

gregory-pevnev-transit-app requested a review from gcamp April 1, 2026 16:26

gcamp approved these changes Apr 2, 2026

jsteelz approved these changes Apr 2, 2026

CSV-copying refactoring

3791582

gregory-pevnev-transit-app merged commit 3e5a454 into main Apr 2, 2026
2 checks passed

gregory-pevnev-transit-app merged 7 commits into main from gregorypevnev/sc-210517/gtfs-loader-python-library-supporting-compression
