Open
Description
Kees Albers reported offline that the VCF index performs poorly on very large data. In fact, when ChatGPT was asked to write a bgzip parser and the VCF was then queried directly with a simple binary search, that solution was much faster than using tabix.
The test data (available offline)
- is 7.4 GB in size and contains 6807 variants for ~500K samples
- the tabix index size is 180 bytes
Test machine
- 8 cores, 16 GB of RAM, SSD disk
Tabix index performance
- 86 seconds
Direct binary search
- 0.17 seconds
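For illustration, the direct binary search can be sketched roughly as below. This is not the code that was benchmarked: it is a minimal sketch that assumes an uncompressed, position-sorted VCF with a single chromosome and leaves out the bgzip layer entirely. The function names (`find_data_start`, `first_line_at_or_after`, `query`) are hypothetical.

```python
import io


def line_pos(line: bytes) -> int:
    """Return POS (second tab-delimited column) of a VCF data line."""
    return int(line.split(b"\t", 2)[1])


def find_data_start(f) -> int:
    """Return the byte offset of the first non-header line (or EOF)."""
    f.seek(0)
    while True:
        off = f.tell()
        line = f.readline()
        if not line or not line.startswith(b"#"):
            return off


def first_line_at_or_after(f, offset: int):
    """Return (start, line) for the first line starting at or after offset.

    line is b"" when no such line exists (end of file).
    """
    if offset == 0:
        f.seek(0)
    else:
        # Step back one byte: if it is a newline we are already at a line
        # start, otherwise readline() discards the rest of the partial line.
        f.seek(offset - 1)
        f.readline()
    start = f.tell()
    return start, f.readline()


def query(f, file_size: int, begin: int, end: int):
    """Return all data lines with begin <= POS <= end.

    Assumes one chromosome and records sorted by POS (an assumption of
    this sketch, not something checked here).
    """
    lo, hi = find_data_start(f), file_size
    # Bisect over byte offsets for the smallest offset whose next full
    # line has POS >= begin (or is EOF).
    while lo < hi:
        mid = (lo + hi) // 2
        start, line = first_line_at_or_after(f, mid)
        if not line or line_pos(line) >= begin:
            hi = mid
        else:
            lo = start + 1  # every offset up to start maps to a too-small POS
    _, line = first_line_at_or_after(f, lo)
    hits = []
    while line and line_pos(line) <= end:
        hits.append(line)
        line = f.readline()  # continue sequentially from the current position
    return hits
```

Each probe reads at most two lines, so the number of lines touched before the sequential scan grows with the logarithm of the file size rather than with the file size itself. Applying the same idea to the bgzip-compressed file would additionally need a way to locate and decompress individual blocks, which this sketch does not attempt.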