
[Feature Request] TSDB on OpenSearch for large-scale metrics #19461

@yupeng9

Description


Is your feature request related to a problem? Please describe

OpenSearch is widely adopted for logging and tracing analytics in observability, and there is a growing industry trend toward a unified observability solution covering logs, traces, and metrics. However, OpenSearch is not efficient at storing large-scale metrics. We ran a benchmark that ingested metrics directly into OpenSearch, with one sample per document. Compared to a purpose-built TSDB like M3, OpenSearch (1) used more than 3x the storage and (2) delivered 66.5% lower ingestion throughput.

The root cause lies in OpenSearch's underlying storage engine, Lucene. While Lucene is a powerful general-purpose search library, it is not optimized for metrics workloads.
Consider the example shown in the figure below.

[Figure: metric samples stored one per Lucene document]

There are several inefficiencies with this one-sample-per-Lucene-document design:

  • Label Duplication
    Although Lucene is a column store, each sample document repeats the full set of labels, even though all samples from a time series share identical labels.

  • Lack of Sample Encoding
    Metrics data is highly compressible using delta encoding. However, the timestamp and value columns are shared across all time series, so samples ingested at the same time by different series are stored adjacently; consecutive entries in a column therefore belong to different series, which prevents series-level delta encoding (a minimal sketch of this layout follows the list).
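For concreteness, here is a minimal sketch of that baseline layout using Lucene's document API. The metric name, label names, and field names are hypothetical; the sketch only illustrates how every sample carries its full label set while timestamps and values go into columns shared by all series.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;

// Sketch of the one-sample-per-Lucene-document layout benchmarked above.
// All metric, label, and field names are hypothetical.
final class SamplePerDocSketch {
    static Document sampleDoc(long timestampMillis, double value) {
        Document doc = new Document();
        // The full label set is repeated on every single sample of the series.
        doc.add(new StringField("metric", "http_requests_total", Field.Store.NO));
        doc.add(new StringField("service", "checkout", Field.Store.NO));
        doc.add(new StringField("host", "host-42", Field.Store.NO));
        // Timestamp and value columns are shared by all series, so adjacent
        // entries in these doc-values columns usually belong to different series.
        doc.add(new LongPoint("@timestamp", timestampMillis));
        doc.add(new NumericDocValuesField("@timestamp", timestampMillis));
        doc.add(new NumericDocValuesField("value", Double.doubleToLongBits(value)));
        return doc;
    }
}
```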

Describe the solution you'd like

A native time-series index in OpenSearch

To overcome these inefficiencies, we propose a time-series index concept in OpenSearch. Inspired by Prometheus, which groups time series into 2-hour blocks (further split into 20-minute chunks), this design stores each chunk of samples from a series as a single Lucene document.

Key benefits:

  • Shared labels: Labels are stored once per chunk, not per sample.
  • Efficient encoding: Consecutive samples can be compressed using techniques like XOR or delta encoding.

As shown in the figure below, samples from the same series are grouped into a chunk, enabling compression and efficient retrieval. We'll elaborate further on the chunk and block details in the design section below.

[Figure: samples from the same series grouped into chunk documents]
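As a rough sketch of the proposed layout (not the actual on-disk format), a chunk document could store the labels once, the chunk's time range for filtering, and the encoded samples as a single binary doc value. All field names below are hypothetical.

```java
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.util.BytesRef;

// Sketch of a chunk-per-document layout: one Lucene document holds a whole
// chunk of encoded samples for a single series. Field names are hypothetical.
final class ChunkPerDocSketch {
    static Document chunkDoc(String[] labelKeys, String[] labelValues,
                             long minTimestamp, long maxTimestamp,
                             byte[] encodedSamples) {
        Document doc = new Document();
        // Labels are stored once per chunk instead of once per sample.
        for (int i = 0; i < labelKeys.length; i++) {
            doc.add(new StringField(labelKeys[i], labelValues[i], Field.Store.NO));
        }
        // Chunk time range, indexed so that relevant chunks can be filtered
        // during the query phase before any samples are decoded.
        doc.add(new LongPoint("chunk_min_ts", minTimestamp));
        doc.add(new LongPoint("chunk_max_ts", maxTimestamp));
        // Delta/XOR-encoded samples kept as one opaque binary doc value.
        doc.add(new BinaryDocValuesField("chunk_samples", new BytesRef(encodedSamples)));
        return doc;
    }
}
```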

Real-Time Queryability

A critical requirement is that metrics must be queryable in real time. Lucene supports only near-real-time (NRT) search, which requires indexed data to be flushed into a segment before it becomes visible. Waiting for a 20-minute chunk to complete and be flushed is unacceptable: metrics systems must make the latest sample queryable within seconds for monitoring use cases.

To address this, we introduce a LiveSeriesIndex concept: an in-memory data structure that consumes new data and buffers it in memory. The live index allows quick lookups and updates through a map structure, which also helps handle temporarily late arrivals by supporting some out-of-order insertion. The live index can be queried through an IndexReader API that integrates with Lucene.
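A minimal sketch of such a live index is shown below, assuming one sorted map per series; the class and method names are hypothetical, and the real structure would additionally expose a Lucene-compatible IndexReader view as described above.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical in-memory live index: seriesKey -> (timestamp -> value).
final class LiveSeriesIndexSketch {
    private final Map<String, NavigableMap<Long, Double>> series = new ConcurrentHashMap<>();

    // Appends a sample; the sorted map tolerates modestly late arrivals,
    // since insertion order does not have to match timestamp order.
    void append(String seriesKey, long timestamp, double value) {
        series.computeIfAbsent(seriesKey, k -> new ConcurrentSkipListMap<>())
              .put(timestamp, value);
    }

    // Returns the buffered samples of one series within [from, to].
    NavigableMap<Long, Double> read(String seriesKey, long from, long to) {
        NavigableMap<Long, Double> samples = series.get(seriesKey);
        if (samples == null) {
            return new ConcurrentSkipListMap<>(); // nothing buffered for this series
        }
        return samples.subMap(from, true, to, true);
    }
}
```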

When we close a live index, we can optimize the values in the chunk, for example by applying delta-of-delta encoding, so that the closed chunk has an efficient compression format and a smaller storage footprint. Once a chunk is closed, its files become immutable, and it can be lazily loaded into memory via mmap when the chunk is queried.
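To illustrate why closed chunks compress well, here is a sketch of delta-of-delta encoding applied to timestamps only. A real implementation would bit-pack the resulting values (as in Gorilla or Prometheus) and pair this with XOR encoding for the sample values; the sketch just shows that regular scrape intervals collapse into long runs of zeros.

```java
import java.util.ArrayList;
import java.util.List;

// Delta-of-delta over timestamps: the first timestamp is stored raw, the
// second as a delta, and every later one as the change between deltas.
final class DeltaOfDeltaSketch {
    static List<Long> encodeTimestamps(long[] timestamps) {
        List<Long> out = new ArrayList<>();
        long prev = 0;
        long prevDelta = 0;
        for (int i = 0; i < timestamps.length; i++) {
            if (i == 0) {
                out.add(timestamps[0]);          // raw first timestamp
            } else {
                long delta = timestamps[i] - prev;
                out.add(delta - prevDelta);      // 0 for perfectly regular intervals
                prevDelta = delta;
            }
            prev = timestamps[i];
        }
        return out;
    }
}
```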

[Figure: live index alongside closed, immutable chunks]

Ingestion via Metrics Engine

We will also introduce a new MetricsEngine as a plugin that supports the time-series index data structure. For indexing, the MetricsEngine appends data points to the LiveSeriesIndex. When the flush method is invoked on the engine, it checks whether the live index is full and closes it if so. We will reuse the translog for recovery, so the samples in the LiveSeriesIndex can be reconstructed during recovery.
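A minimal sketch of how such an engine could be wired in, assuming the existing EnginePlugin extension point, is shown below. The MetricsEngine constructor and the index.metrics.enabled setting are assumptions made for illustration, not the final design.

```java
import java.util.Optional;

import org.opensearch.index.IndexSettings;
import org.opensearch.index.engine.EngineFactory;
import org.opensearch.plugins.EnginePlugin;
import org.opensearch.plugins.Plugin;

// Sketch of plugin wiring: indices flagged as metrics indices get the
// MetricsEngine, everything else keeps the default engine.
public class TsdbPlugin extends Plugin implements EnginePlugin {

    @Override
    public Optional<EngineFactory> getEngineFactory(IndexSettings indexSettings) {
        // "index.metrics.enabled" is a hypothetical per-index setting.
        if (indexSettings.getSettings().getAsBoolean("index.metrics.enabled", false)) {
            // MetricsEngine is the engine described above; it appends incoming
            // data points to the LiveSeriesIndex and closes full chunks on flush.
            return Optional.of(config -> new MetricsEngine(config));
        }
        return Optional.empty(); // fall back to the default InternalEngine
    }
}
```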

Extending OpenSearch query framework

To retrieve the encoded samples, we first filter the relevant chunks by applying the query conditions to the metric labels during the OpenSearch query phase; we then decode the actual samples during the aggregation phase and run the pipeline of metric transformations. To support this metrics pipeline, we extended the OpenSearch aggregators with two additions:

  • UnfoldAggregator: unfolds the chunks and the LiveSeriesIndex of a time series into samples during the collection phase, then applies a series of Stage functions such as floor or sum (see the sketch after this list). The aggregated results are sent to the coordinator for global aggregation.

    [Figure: UnfoldAggregator]

  • CoordinatorPipelineAggregator: an extension of SiblingPipelineAggregator for coordinator-only Stage transformations. The coordinator pipeline is crucial for supporting multiple bucket paths. It also supports macro definitions, so that named macros can be referenced from the main pipeline. For example, in M3: macros: {e = a | asPercent(b)}, main pipeline: c | asPercent(e).
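The following sketch illustrates the Stage-function idea only; it is not the actual aggregator code, and Stage, FLOOR, SUM, and apply are hypothetical names showing how decoded samples could flow through a chain of transformations before shard-level results are sent to the coordinator.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Illustrative Stage pipeline over the decoded samples of a time series.
final class StagePipelineSketch {

    interface Stage extends UnaryOperator<double[]> {}

    static final Stage FLOOR = samples ->
            Arrays.stream(samples).map(Math::floor).toArray();

    static final Stage SUM = samples ->
            new double[] { Arrays.stream(samples).sum() };

    // Unfolded samples pass through each Stage in order, e.g. floor then sum.
    static double[] apply(double[] decodedSamples, List<Stage> stages) {
        double[] result = decodedSamples;
        for (Stage stage : stages) {
            result = stage.apply(result);
        }
        return result;
    }
}
```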

Metrics Language Support

Since we extended the OpenSearch DSL for metric query execution, we also need to build adapters for existing metrics languages such as PromQL and M3QL. For each language, we provide a parser and a planner that translate queries into the OpenSearch DSL using the extended operators described above.
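Conceptually, each language adapter could expose a small planning interface such as the hypothetical one below; SearchSourceBuilder is the standard OpenSearch search request body builder, while the interface and method names here are assumptions for illustration.

```java
import org.opensearch.search.builder.SearchSourceBuilder;

// Hypothetical adapter contract: parse a PromQL/M3QL query and plan it into
// the extended OpenSearch DSL (unfold + coordinator pipeline aggregations).
interface MetricsQueryPlanner {
    SearchSourceBuilder plan(String metricsQuery, long startMillis, long endMillis);
}
```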

Related component

Extensions

Describe alternatives you've considered

No response

Additional context

TSDB Plugin

We plan to implement this TSDB as a plugin to OpenSearch and open-source it once the core is complete. We will also share more detailed technical specifications as this effort progresses.

Labels: RFC, enhancement, untriaged
