tabix performance

It was reported offline by Kees Albers that the VCF index performs poorly on very large data. In fact, when ChatGPT was asked to write a bgzip parser and the VCF was then queried directly using a simple binary search, this solution was much faster than tabix: 0.17 seconds versus 86 seconds on the test data below, roughly a 500-fold difference.

The test data (available offline)
- 7.4 GB in size and contains 6807 variants for ~500K samples
- the tabix index size is 180 bytes

Test machine
- 8 cores, 16 GB of RAM, SSD disk

Tabix index performance
- 86 seconds

Direct binary search 
- 0.17 seconds

[varseek.txt](https://github.com/user-attachments/files/22093155/varseek.txt)
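
For readers without access to the attachment, here is a minimal sketch of the general idea (binary search over BGZF blocks by the POS of each block's first record). It is not the code from varseek.txt, and all function names (`make_bgzf_block`, `block_offsets`, `read_block`, `find_record`) are hypothetical. It makes strong simplifying assumptions: a single chromosome, records sorted by POS, no header lines, the `BC` subfield is the only gzip extra subfield, and every block starts at the beginning of a record. With ~500K samples a single record spans many blocks, so a real implementation would have to handle records that cross block boundaries.

```python
import struct
import zlib

BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip magic, deflate, FEXTRA flag set


def make_bgzf_block(payload: bytes) -> bytes:
    """Demo helper: wrap payload in one BGZF block (gzip member with a 'BC' extra field)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no zlib header
    deflated = comp.compress(payload) + comp.flush()
    bsize = 18 + len(deflated) + 8 - 1  # total block length minus one
    header = struct.pack("<4BI2BH2BHH", 31, 139, 8, 4, 0, 0, 255, 6, 66, 67, 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + deflated + trailer


def block_offsets(buf: bytes) -> list[int]:
    """Walk the block headers and return the start offset of every non-empty block."""
    offsets, pos = [], 0
    while pos < len(buf):
        if buf[pos:pos + 4] != BGZF_MAGIC:
            raise ValueError(f"no BGZF block at offset {pos}")
        # Assumption: the BC subfield comes first, so BSIZE sits at byte 16.
        bsize = struct.unpack_from("<H", buf, pos + 16)[0]
        isize = struct.unpack_from("<I", buf, pos + bsize + 1 - 4)[0]
        if isize > 0:  # skip empty blocks such as the EOF marker
            offsets.append(pos)
        pos += bsize + 1
    return offsets


def read_block(buf: bytes, offset: int) -> bytes:
    """Decompress the single block starting at offset."""
    bsize = struct.unpack_from("<H", buf, offset + 16)[0]
    deflated = buf[offset + 18:offset + bsize + 1 - 8]  # strip header and CRC32/ISIZE
    return zlib.decompress(deflated, -15)


def record_pos(line: bytes) -> int:
    """POS is the second tab-separated column of a VCF record."""
    return int(line.split(b"\t", 2)[1])


def find_record(buf: bytes, offsets: list[int], target: int):
    """Return the record whose POS equals target, or None if there is none."""
    if not offsets:
        return None
    lo, hi = 0, len(offsets) - 1
    # Find the last block whose first record has POS <= target.
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if record_pos(read_block(buf, offsets[mid])) <= target:
            lo = mid
        else:
            hi = mid - 1
    # Linear scan inside the one candidate block.
    for line in read_block(buf, offsets[lo]).splitlines():
        if record_pos(line) == target:
            return line
    return None


# Demo data: ten toy records at POS 100, 200, ..., 1000, three per block,
# followed by an empty block standing in for the BGZF EOF marker.
records = [b"chr1\t%d\t.\tA\tG\t.\t.\t.\n" % p for p in range(100, 1100, 100)]
buf = b"".join(
    make_bgzf_block(b"".join(records[i:i + 3])) for i in range(0, len(records), 3)
) + make_bgzf_block(b"")
offsets = block_offsets(buf)
print(find_record(buf, offsets, 700))
```

Each probe decompresses only one block, so a lookup touches on the order of log2(number of blocks) blocks plus one final scan. Listing the block offsets up front still reads every block header; a variant that probes byte offsets directly and resynchronises on the block magic would avoid that pass, at the cost of more careful validation.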

tabix performance #1949
