This repository was archived by the owner on Jun 11, 2024. It is now read-only.
Hydra upgrade #132
Open
Description
We've identified a number of weaknesses in the hydra design and implementation that cause ungraceful failures (worker crashes) and downtime when utilization spikes. The problem was observed between 7/7/2021 and 7/21/2021.
Problem analysis (theory)
The backend Postgres database can become overloaded under a high volume of DHT requests to the hydras.
This increases query times to the database, which causes DHT requests to back up in the provider manager loop, which in turn causes the hydra nodes to crash.
Corrective steps
- Ensure the entire fleet of hydra heads (across machines) always uses the same sequence of balanced IDs:
  Keep Hydra head peerIDs between restarts #128
  Resolved by Allow deterministic key generation from seed #130
- Ensure ID/address mappings persist across restarts (design goal)
- Fix aggregate metrics to use fast approximate Postgres queries (as opposed to slow exact queries)
  Use fast approximate queries to Postgres for metrics collection #133
- Upgrades in DHT provider manager:
  - Use multiple threads in the provider loop (diminishes the effect of individual straggler requests to the datastore)
    Allow the ProviderManager to have more paralleism go-libp2p-kad-dht#729
  - Gracefully decline quality of service when under load
    Degrade provider handling quality gracefully under load go-libp2p-kad-dht#730
  - Fully decline service at a configurable peak level of load
- Monitor (via metrics) the query latency of the backing Postgres database (at the infra level)
- Set up automatic pprof dumps near out-of-memory events, perhaps using https://github.com/ipfs-shipyard/go-dumpotron (at infra level)
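
To illustrate the first corrective step, here is a minimal sketch of deriving per-head keys deterministically from a shared seed, so that every machine and every restart reproduces the same sequence of identities. It is not the implementation from #130: the helper name `deriveHeadKey`, the use of SHA-256 over seed and index, and the choice of Ed25519 are all assumptions made for illustration, and the sketch does not attempt to balance the resulting IDs across the keyspace.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/binary"
)

// deriveHeadKey is a hypothetical helper (not part of hydra-booster).
// It derives the private key for the head at position index from a
// fleet-wide seed. The same (seed, index) pair always yields the same key.
func deriveHeadKey(seed []byte, index uint64) ed25519.PrivateKey {
	h := sha256.New()
	h.Write(seed)

	// Mix in the head index with a fixed-width encoding so that
	// distinct indices always produce distinct hash inputs.
	var idx [8]byte
	binary.BigEndian.PutUint64(idx[:], index)
	h.Write(idx[:])

	// A SHA-256 digest is 32 bytes, which matches ed25519.SeedSize.
	return ed25519.NewKeyFromSeed(h.Sum(nil))
}
```

Calling `deriveHeadKey(seed, i)` for `i = 0, 1, 2, ...` on any machine yields the same key sequence, which is the property the corrective step asks for. The seed would need to be treated as a secret, since anyone holding it can regenerate every head key.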
Acceptance criteria
- Verify that a sustained increased request load at the hydra level does not propagate to the Postgres backing datastore. This should be ensured by measures for graceful degradation of quality (above) at the DHT provider manager.
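
The degradation measures this criterion relies on can be sketched as a bounded worker pool that declines work instead of queueing it without limit. This is a hedged illustration, not the code from go-libp2p-kad-dht#729 or #730: `Pool`, `NewPool`, `Submit`, and `ErrOverloaded` are hypothetical names, and the worker count and queue size stand in for the "configurable peak level of load".

```go
package main

import (
	"errors"
	"sync"
)

// ErrOverloaded is returned when the pool declines a request.
var ErrOverloaded = errors.New("provider manager overloaded: request declined")

// Pool runs jobs on a fixed number of workers fed by a bounded queue.
// All names here are illustrative assumptions.
type Pool struct {
	queue chan func()
	wg    sync.WaitGroup
}

// NewPool starts the given number of workers. Several workers mean one
// slow datastore call does not stall every other request.
func NewPool(workers, queueSize int) *Pool {
	p := &Pool{queue: make(chan func(), queueSize)}
	p.wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer p.wg.Done()
			for job := range p.queue {
				job()
			}
		}()
	}
	return p
}

// Submit never blocks. When the queue is full the job is dropped and
// ErrOverloaded is returned, so the backlog (and memory use) stays bounded.
func (p *Pool) Submit(job func()) error {
	select {
	case p.queue <- job:
		return nil
	default:
		return ErrOverloaded
	}
}

// Close stops accepting jobs and waits for accepted jobs to finish.
func (p *Pool) Close() {
	close(p.queue)
	p.wg.Wait()
}
```

With this shape, the number of concurrent datastore calls can never exceed the worker count, so a sustained spike at the hydra level shows up as declined requests rather than as extra load on Postgres. A load test for the acceptance criterion could then compare the rate of `ErrOverloaded` responses against the observed Postgres query latency.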