Reduce SPIRE Agent memory usage during k8s workload attestation #6243
szvincze wants to merge 2 commits into spiffe:main
Conversation
Signed-off-by: Szilard Vincze <szilard.vincze@est.tech>
Thanks @szvincze. Do you happen to have some data about memory usage before and after this change? Would it be possible to have a benchmark to measure it? That would allow us to pitch some implementations against each other, such as the experimental
Hi @sorindumitru, unfortunately I cannot come up with a reliable benchmark. As you can see in the discussion under the mentioned PR, the tests gave misleading results. But there was some validation via memory profiling: if you check the graph I posted in the original issue, you can see the spike that caused an OOMKill in our case. So, when I saw the PR, I created a fork and started testing with that fastjson library, with much better results: the spikes completely disappeared. Then I found the mentioned fork, which we started using with the same good results. I don't think this fork satisfies all the points listed in #5109, so maybe it will not be the final solution, but it could work as a quick fix for the current situation.
I had an attempt at doing some benchmarking here. I just took the pod list as JSON from a kind cluster and extracted the parsing parts into a function for reuse in the benchmark. Here are some of my results: The … The good news seems to be that … There's some other parsing of objects that happens, so there might be some other benchmarking that needs to be done. I have a branch with the benchmarks at https://github.com/sorindumitru/spire/tree/json-benchmarks. You can run the benchmarks with: … You can run them against your known large pod list by replacing it in
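For readers following along, the parsing that such a benchmark exercises essentially boils down to pulling pod UIDs and container IDs out of the kubelet pod-list JSON. The sketch below is only an illustration of that shape, not SPIRE's actual code: the `podList` struct subset and the `containerIDsByPod` helper are hypothetical, and it uses the standard library `encoding/json` rather than fastjson so it stays self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podList models just the subset of the kubelet pod-list schema that
// this illustration needs (hypothetical field selection).
type podList struct {
	Items []struct {
		Metadata struct {
			UID string `json:"uid"`
		} `json:"metadata"`
		Status struct {
			ContainerStatuses []struct {
				ContainerID string `json:"containerID"`
			} `json:"containerStatuses"`
		} `json:"status"`
	} `json:"items"`
}

// containerIDsByPod parses a pod-list document and maps each pod UID
// to the container IDs reported in its status.
func containerIDsByPod(raw []byte) (map[string][]string, error) {
	var pl podList
	if err := json.Unmarshal(raw, &pl); err != nil {
		return nil, err
	}
	out := make(map[string][]string, len(pl.Items))
	for _, item := range pl.Items {
		for _, cs := range item.Status.ContainerStatuses {
			out[item.Metadata.UID] = append(out[item.Metadata.UID], cs.ContainerID)
		}
	}
	return out, nil
}

func main() {
	raw := []byte(`{"items":[{"metadata":{"uid":"pod-1"},"status":{"containerStatuses":[{"containerID":"containerd://abc"}]}}]}`)
	m, err := containerIDsByPod(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(m), m["pod-1"][0])
}
```

Wrapping a function like this in a `testing.B` benchmark and feeding it a large real-world pod list is one way to compare parsers, though as noted above such micro-benchmarks can give misleading results for heap-spike behavior.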
Hi @sorindumitru, thanks for the benchmarks. I ran them on two different pod-list data files from customers, and here are the results. I just compared the baseline with the fork, and my conclusion is the same as yours. Of course, I understand your concern about the supply chain of the fork. However, the baseline (… Just as a side note, our original problem was that Kubernetes' built-in JSON encoder was extremely slow, so we replaced it with
@szvincze Would it also be possible to get the result from
@sorindumitru, here it is:
Thanks @szvincze. Our current plan is to wait for
Based on the runs you provided, this seems to also hold for your environment, so we hope this is going to be an acceptable path forward for you. If that's the case, could you close this issue?
Yes, it is acceptable. We will go with the fork until upstream switches to
Pull Request check list
Affected functionality
Eliminate spikes in memory usage after the spire-agent restarts during k8s workload attestation.
Description of change
I have replaced the original `valyala/fastjson` with the fork `aperturerobotics/fastjson`, which includes, among other useful additions, the PR that we discussed over a year ago here. We have been using it for more than six months, and our tests show significantly better performance, without the memory spikes that could cause OOMKills of `spire-agent`. It has also been deployed successfully in a production environment. For this reason, I have created this PR to propose using the forked `fastjson` upstream as a replacement for the current version.
Which issue this PR fixes
#5067 spire-agent gets OOMKilled after pod restart
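As an aside, one way such a dependency swap can be consumed without rewriting import paths throughout the tree is a `go.mod` replace directive. The snippet below is only a sketch: the version numbers are illustrative, and the actual PR may instead change the import paths directly.

```
require github.com/valyala/fastjson v1.6.4

// Redirect the upstream module to the fork (illustrative versions).
replace github.com/valyala/fastjson => github.com/aperturerobotics/fastjson v1.6.4
```

A replace directive like this only works cleanly if the fork preserves the original package API, which is the premise of this PR.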