Currently we have a few major points of failure. Let's discuss potential solutions to address them.
- Hetzner load balancer for our 2 API nodes. If the LB goes down, we go down. A global LB like Cloudflare Load Balancing would potentially be much more reliable, and it would also give us a lot of new stats and more options for anti-abuse and attack protection. It would also allow us to split the nodes between multiple providers, not just Hetzner.
- Redis. We have multiple Redis instances running for different needs, and they all run on a single server. A potential first step would be to move the most critical instances, like probe sync, to a multi-node cluster, preferably not Redis itself: either a fork or even a cloud option like Cloudflare KV, Durable Objects, D1, Queues, or SQLite. Lots of products could fit our needs.
I don't think there is anything else. Our measurement DBs won't impact the API if they go down.
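
To make the multi-provider idea a bit more concrete, here is a rough sketch of the failover logic a health-checked global LB would apply for us. This is not Cloudflare's API or config format, just the selection rule in plain Python. The node names (`api-1`, `api-3`), the `other-provider` label, and the `pick_origins` helper are all made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Origin:
    name: str       # hypothetical node name
    provider: str   # hosting provider the node runs on
    healthy: bool   # result of the most recent health check


def pick_origins(pools: list[list[Origin]]) -> list[Origin]:
    """Return the healthy origins of the first pool that has any.

    Pools are given in priority order. An empty result means every
    origin in every pool failed its health check.
    """
    for pool in pools:
        healthy = [origin for origin in pool if origin.healthy]
        if healthy:
            return healthy
    return []


# Assumed layout: the two current API nodes plus one node elsewhere.
primary = [Origin("api-1", "hetzner", True), Origin("api-2", "hetzner", True)]
fallback = [Origin("api-3", "other-provider", True)]
```

With both Hetzner nodes healthy, `pick_origins([primary, fallback])` returns only the primary pool. If both are marked unhealthy, traffic falls through to the fallback pool, which is the case a single-provider LB cannot cover.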