Replies: 3 comments 5 replies
- Hi @lzh010817, thanks a lot for your proposal. I will check the details and get back to you soon.
- Hi @lzh010817, thanks a lot for your proposal. IMO, a distributed cache is quite useful. We currently only have the local cache, which introduces consistency problems when multiple Gravitino nodes are deployed as a federation. But the deployment-complexity concern you mentioned also exists. I would suggest investigating the different cache solutions further, and then we can discuss which one is the best fit. Currently, I can think of 3 options:
  Also looping in @unknowntpo, who has done some initial investigation; maybe we can discuss more here.
- @lzh010817 Thanks for your proposal, please allow me some time to elaborate on my thoughts.

I would like to initiate a discussion regarding the potential integration of Redis-based distributed caching to complement the existing Caffeine local cache. While Caffeine excels in providing low-latency, in-memory caching for single-node deployments, Gravitino may benefit from a distributed caching layer to address challenges in high-concurrency scenarios and multi-node environments.
Key Considerations for Redis Integration:
1. As Gravitino scales horizontally, node-specific local caches (e.g., Caffeine) may lead to data inconsistency during node restarts or parallel operations. Redis, as a distributed cache, could ensure consistent metadata access across all nodes, reducing redundant backend queries and improving throughput.
2. Redis offers features like persistence, replication, and automatic failover, which mitigate risks of cache loss during node failures. This aligns with Gravitino’s need for reliable metadata management in distributed setups.
3. While Caffeine provides nanosecond-level access latency, Redis can handle cross-node cache synchronization with minimal latency penalties using pipelining and cluster-mode operations. For frequently accessed metadata (e.g., catalog details), Redis could serve as a shared L2 cache, while Caffeine remains the L1 node-local cache.
4. Suggested integration approach (a rough sketch follows this list):
4.1 Introduce a cache abstraction layer to support pluggable cache providers (e.g., Caffeine for local, Redis for distributed).
4.2 Leverage Redis Cluster for high availability, with cache warm-up strategies to preload hot metadata on startup.
4.3 Use key-based expiration and invalidation policies to ensure data freshness across nodes.
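To make the pluggable abstraction in 4.1 and the L1/L2 hybrid in point 3 more concrete, here is a minimal sketch in Java. All names (`EntityCache`, `CaffeineEntityCache`, `RedisEntityCache`, `TwoLevelEntityCache`) are hypothetical and not existing Gravitino APIs; the Redis side assumes the Jedis client, entities serialized to strings, and a fixed TTL, purely for illustration.

```java
import java.time.Duration;
import java.util.Optional;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import redis.clients.jedis.JedisPooled;

// Hypothetical pluggable cache abstraction (4.1); not an existing Gravitino interface.
interface EntityCache {
  Optional<String> get(String key);

  void put(String key, String serializedEntity);

  void invalidate(String key);
}

// L1: node-local Caffeine cache with size and TTL bounds.
class CaffeineEntityCache implements EntityCache {
  private final Cache<String, String> cache =
      Caffeine.newBuilder().maximumSize(10_000).expireAfterWrite(Duration.ofMinutes(5)).build();

  public Optional<String> get(String key) {
    return Optional.ofNullable(cache.getIfPresent(key));
  }

  public void put(String key, String serializedEntity) {
    cache.put(key, serializedEntity);
  }

  public void invalidate(String key) {
    cache.invalidate(key);
  }
}

// L2: shared Redis cache with key-based expiration (4.3). Endpoint and TTL are assumptions.
class RedisEntityCache implements EntityCache {
  private static final long TTL_SECONDS = 300;
  private final JedisPooled jedis = new JedisPooled("localhost", 6379);

  public Optional<String> get(String key) {
    return Optional.ofNullable(jedis.get(key));
  }

  public void put(String key, String serializedEntity) {
    jedis.setex(key, TTL_SECONDS, serializedEntity);
  }

  public void invalidate(String key) {
    jedis.del(key);
  }
}

// Hybrid lookup: check the local L1 first, fall back to the shared L2, backfill L1 on a hit.
class TwoLevelEntityCache implements EntityCache {
  private final EntityCache l1 = new CaffeineEntityCache();
  private final EntityCache l2 = new RedisEntityCache();

  public Optional<String> get(String key) {
    Optional<String> local = l1.get(key);
    if (local.isPresent()) {
      return local;
    }
    Optional<String> remote = l2.get(key);
    remote.ifPresent(value -> l1.put(key, value)); // backfill the node-local cache
    return remote;
  }

  public void put(String key, String serializedEntity) {
    l2.put(key, serializedEntity); // write through to the shared cache first
    l1.put(key, serializedEntity);
  }

  public void invalidate(String key) {
    l2.invalidate(key);
    l1.invalidate(key);
  }
}
```

In this shape the backend store is only consulted when both levels miss, and the L1 TTL bounds how long a node can keep serving a value that another node has already refreshed in Redis.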
Open Questions for Community Feedback:
Are there specific use cases in Gravitino where distributed caching would provide the most value (e.g., multi-region deployments, frequent schema updates)?
How might we balance the trade-offs between added infrastructure complexity (Redis cluster management) and performance gains?
Would a hybrid cache architecture (Caffeine + Redis) be feasible, and what strategies could optimize cache coherence?
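On the coherence question, one strategy worth evaluating is pushing invalidations over Redis pub/sub: the node that changes a piece of metadata deletes the Redis entry and publishes the affected key, and every node evicts that key from its local Caffeine cache. A minimal sketch, assuming the Jedis client, string keys, and a hypothetical channel name:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import redis.clients.jedis.JedisPooled;
import redis.clients.jedis.JedisPubSub;

// Sketch of cross-node L1 invalidation over Redis pub/sub; channel name and endpoint are assumptions.
class CacheInvalidationBus {
  private static final String CHANNEL = "gravitino-cache-invalidation";

  private final JedisPooled jedis = new JedisPooled("localhost", 6379);
  private final Cache<String, String> localCache =
      Caffeine.newBuilder().maximumSize(10_000).build();

  // Called by the node that changed the metadata: drop the shared entry, then notify peers.
  void publishInvalidation(String key) {
    jedis.del(key);
    jedis.publish(CHANNEL, key);
  }

  // Each node runs a long-lived subscriber that evicts published keys from its local cache.
  void startSubscriber() {
    Thread subscriber = new Thread(() -> {
      jedis.subscribe(new JedisPubSub() {
        @Override
        public void onMessage(String channel, String invalidatedKey) {
          localCache.invalidate(invalidatedKey);
        }
      }, CHANNEL);
    }, "gravitino-cache-invalidation-subscriber");
    subscriber.setDaemon(true);
    subscriber.start();
  }
}
```

The trade-off is that pub/sub delivery is fire-and-forget, so a node that is momentarily disconnected can miss an invalidation; pairing this with an L1 TTL (as in the sketch above) keeps such staleness bounded.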
I believe exploring Redis integration could strengthen Gravitino’s performance in distributed environments while maintaining backward compatibility. Looking forward to your insights and collaboration!