Provide helper utility API to compute CRC32 across concurrent (map/reduce) chunks #15552

@mikemccand

Lucene uses CRC32 for its end-to-end checksumming. Every index file records its own checksum on write, and when opening a new segment in IndexReader we also validate it where we can. CheckIndex validates all files. IndexWriter validates source segment files when merging.

It's awesome: it catches insidious bit flips for those people still not using ECC RAM. I know at least @rmuir and I, and maybe @uschindler (?), have been hit by intermittent bad RAM. Once it strikes you personally you will never again tolerate non-ECC RAM on your dev boxes, hah.

At Amazon's customer-facing product search, we validate Lucene's checksum at each step of our near-real-time segment replication (via S3), so we can (hopefully) prevent any bit flips from metastasizing into our persistent S3 index snapshots. There are still technically vulnerabilities if you have an adversarial bit-flipping RAM monster lurking in your box (see visuals from Gemini and Grok). This really matters at Amazon's crazy scale: ~10s of PiB replicated per day to/from S3 means many chances for errant bit flips! S3 write/read also does its own checksumming, but that won't catch bit flips on the IndexWriter node before a file is uploaded, so we use and validate both checksums.

One of the delightful properties of CRC32 (thank you MATH!) is you can map/reduce it!

I.e., slice up a large segment (5 GB is Lucene's default max merged segment size) into N chunks, compute the CRC32 of each chunk concurrently, and then use zlib's crc32_combine API to merge the N CRC32s into a single CRC32 that matches the whole file's sequential CRC32 checksum. It's quite simple (says Claude and Grok). This would be awesome because then users like us could use chunked upload/download to improve aggregate S3 throughput and lower the NRT refresh latency during replication.

But, annoyingly, it looks like JDK's CRC32 implementation (java.util.zip.CRC32) does not expose crc32_combine?

Maybe Lucene could provide a utility class to make this simple? Claude provides a Java implementation of crc32_combine; perhaps it is hallucination free? Or maybe we could use FFM to access zlib's crc32_combine, but I'm not sure we can rely on it always being accessible/visible to the JVM.
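
To make the idea concrete, here is a minimal sketch of what such a helper might look like. It is a straightforward Java port of zlib's classic GF(2)-matrix crc32_combine algorithm, plus a small map/reduce driver over an in-memory byte[]. The class name Crc32Combine and the method names combine and chunkedCrc32 are hypothetical (this is not an existing Lucene or JDK API), and a real utility would presumably read chunks from files or IndexInput rather than a single array.

```java
import java.util.stream.IntStream;
import java.util.zip.CRC32;

/** Hypothetical helper: combine per-chunk CRC32 values into a whole-file CRC32. */
public final class Crc32Combine {
  private static final int GF2_DIM = 32; // CRC32 state is a 32-bit vector over GF(2)

  private Crc32Combine() {}

  /** Multiplies the 32x32 GF(2) matrix by the 32-bit vector. */
  private static long gf2MatrixTimes(long[] mat, long vec) {
    long sum = 0;
    for (int i = 0; vec != 0; vec >>>= 1, i++) {
      if ((vec & 1) != 0) {
        sum ^= mat[i];
      }
    }
    return sum;
  }

  /** Stores mat * mat into square. */
  private static void gf2MatrixSquare(long[] square, long[] mat) {
    for (int n = 0; n < GF2_DIM; n++) {
      square[n] = gf2MatrixTimes(mat, mat[n]);
    }
  }

  /**
   * Returns the CRC32 of A followed by B, given crc1 = CRC32(A), crc2 = CRC32(B)
   * and len2 = length of B in bytes. Conceptually: append len2 zero bytes to A's
   * CRC state, then XOR in crc2.
   */
  public static long combine(long crc1, long crc2, long len2) {
    if (len2 <= 0) {
      return crc1; // nothing appended
    }
    long[] even = new long[GF2_DIM];
    long[] odd = new long[GF2_DIM];

    // Operator for appending a single zero bit (reflected CRC-32 polynomial).
    odd[0] = 0xEDB88320L;
    long row = 1;
    for (int n = 1; n < GF2_DIM; n++) {
      odd[n] = row;
      row <<= 1;
    }
    gf2MatrixSquare(even, odd); // operator for two zero bits
    gf2MatrixSquare(odd, even); // operator for four zero bits

    // Square-and-multiply over the bits of len2; the first squaring below
    // yields the operator for eight zero bits, i.e. one zero byte.
    do {
      gf2MatrixSquare(even, odd);
      if ((len2 & 1) != 0) {
        crc1 = gf2MatrixTimes(even, crc1);
      }
      len2 >>= 1;
      if (len2 == 0) {
        break;
      }
      gf2MatrixSquare(odd, even);
      if ((len2 & 1) != 0) {
        crc1 = gf2MatrixTimes(odd, crc1);
      }
      len2 >>= 1;
    } while (len2 != 0);

    return crc1 ^ crc2;
  }

  /** Map: CRC32 each chunk in parallel. Reduce: fold the results left to right. */
  public static long chunkedCrc32(byte[] data, int chunkSize) {
    int numChunks = (int) (((long) data.length + chunkSize - 1) / chunkSize);
    long[] crcs = new long[numChunks];
    IntStream.range(0, numChunks).parallel().forEach(i -> {
      int off = i * chunkSize;
      int len = Math.min(chunkSize, data.length - off);
      CRC32 crc = new CRC32();
      crc.update(data, off, len);
      crcs[i] = crc.getValue();
    });
    long combined = 0; // CRC32 of the empty input
    for (int i = 0; i < numChunks; i++) {
      int len = Math.min(chunkSize, data.length - i * chunkSize);
      combined = combine(combined, crcs[i], len);
    }
    return combined;
  }
}
```

Each combine call costs a fixed number of 32x32 matrix squarings (logarithmic in the chunk length), so the reduce step should be negligible next to the per-chunk CRC32 work. Whatever implementation is chosen, it would be worth randomized testing against java.util.zip.CRC32 computed sequentially over the same bytes.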
