Reduce SPIRE Agent memory usage during k8s workload attestation #6243

Closed
szvincze wants to merge 2 commits into spiffe:main from Nordix:fastjson-fork

Conversation

@szvincze (Contributor)

Pull Request check list

  • Commit conforms to CONTRIBUTING.md?
  • Proper tests/regressions included?
  • Documentation updated?

Affected functionality

Eliminates memory usage spikes during k8s workload attestation after a spire-agent restart.

Description of change

I have replaced the original valyala/fastjson with the fork aperturerobotics/fastjson, which includes, among other useful additions, the PR that we discussed over a year ago here.
We have been using it for more than six months, and our tests show significantly better performance, without the memory spikes that could cause OOMKills of spire-agent. It has also been deployed successfully in a production environment.

For this reason, I have created this PR to propose adopting the forked fastjson upstream as a replacement for the current version.

Which issue this PR fixes

#5067 spire-agent gets OOMKilled after pod restart

Signed-off-by: Szilard Vincze <szilard.vincze@est.tech>
Signed-off-by: Szilard Vincze <szilard.vincze@est.tech>
@sorindumitru (Collaborator)

Thanks @szvincze. Do you happen to have some data about memory usage before and after this change?

Would it be possible to have a benchmark to measure it? That would allow us to pitch some implementations against each other, such as the experimental encoding/json/v2 from Go 1.25 or some of the other implementations mentioned in https://github.com/go-json-experiment/jsonbench, and pick an implementation that covers all the points mentioned in #5109.

@szvincze (Contributor, author)

Hi @sorindumitru,

Unfortunately, I cannot come up with a reliable benchmark; as you can see in the discussion under the mentioned PR, the tests gave misleading results:
valyala/fastjson#101 (comment)

But there was some validation via memory profiling:
valyala/fastjson#101 (comment)

If you check the graph I posted in the original issue, you can see the spike that caused OOMKill in our case:
#5067

So, when I saw the PR, I created a fork and started testing with that fastjson library, with much better results: the spikes completely disappeared. Then I found the aforementioned fork, which we started using with the same good results.

I don't think this fork satisfies all the points listed in #5109, so it may not be the final solution, but it could work as a quick fix for the current situation.

@sorindumitru sorindumitru self-assigned this Aug 14, 2025
@sorindumitru (Collaborator)

sorindumitru commented Aug 31, 2025

I had an attempt at doing some benchmarking here. I took the pod list as JSON from a kind cluster and extracted the parsing parts into a function for reuse in the benchmark. Here are some of my results:

# base line
BenchmarkParsingPodList-16         18428            651655 ns/op         1958250 B/op       3706 allocs/op
# aperturerobotics/fastjson
BenchmarkParsingPodList-16         28106            428963 ns/op          935001 B/op       3704 allocs/op
# encoding/json
BenchmarkParsingPodList-16         14235            843702 ns/op          304613 B/op         44 allocs/op
# encoding/json/v2 (in combination with encoding/json for json.RawMessage)
BenchmarkParsingPodList-16         20342            592776 ns/op          304633 B/op         44 allocs/op

The fastjson fork does seem the best here, but I'm not sure it's something we can reasonably depend on from a supply-chain perspective. We'll discuss these results some more in the maintainer sync.

The good news is that encoding/json seems to be the best option here from the point of view of memory allocations, which is what you cared about. With encoding/json/v2 we seem to be able to get improvements that make it better than valyala/fastjson on paper (although still slower than the fork). So maybe we can just wait for that.

There's some other parsing of objects that happens, so additional benchmarking may be needed. I have a branch with the benchmarks at https://github.com/sorindumitru/spire/tree/json-benchmarks. You can run them with:

GOEXPERIMENT=jsonv2 go test -bench=. -benchmem -benchtime=10s ./pkg/agent/plugin/workloadattestor/k8s

You can run them against your known large pod list by substituting it for test/fixtures/config/podlist.json.

@szvincze (Contributor, author)

szvincze commented Sep 1, 2025

Hi @sorindumitru,

Thanks for the benchmarks. I ran them on two different pod-list data files from customers; here are the results. I only compared the baseline with the fork, and my conclusion is the same as yours.

# baseline
BenchmarkParsingPodList-16          	     297	  38907116 ns/op	65334632 B/op	   93866 allocs/op
# aperturerobotics
BenchmarkParsingPodList-16          	     475	  24430078 ns/op	20171069 B/op	   94116 allocs/op
# baseline
BenchmarkParsingPodList-16          	     315	  37460539 ns/op	50663259 B/op	   83283 allocs/op
# aperturerobotics
BenchmarkParsingPodList-16          	     585	  20591592 ns/op	16107910 B/op	   83516 allocs/op

Of course, I understand your concern about the supply chain of the fork. However, the baseline (valyala/fastjson) also has issues in that regard, since it hasn’t been updated for several years. That’s why I suggested checking this fork, as it implements some of the improvements discussed in the valyala/fastjson issue tracker. So, if you believe the fork is better than the baseline, I would propose using it in the short to mid term, as it only requires changing a dependency. In the meantime, we can still evaluate other options for the long term.

Just as a side note, our original problem was that Kubernetes' built-in JSON handling was extremely slow, so we replaced it with valyala/fastjson. Later, we ran into memory-spike issues that caused spire-agents to be OOMKilled, which this fork fixes.

@sorindumitru (Collaborator)

@szvincze Would it also be possible to get the result from the BenchmarkParsingPodListJSON-16 benchmark? That would be the result using encoding/json/v2; it would be interesting to see if that's still a good option with your larger pod lists.

@szvincze (Contributor, author)

szvincze commented Sep 3, 2025

@sorindumitru, here it is:

# baseline
BenchmarkParsingPodList-16          	     291	  39793890 ns/op	65334613 B/op	   93866 allocs/op
BenchmarkParsingPodListJSON-16      	     358	  33392486 ns/op	 5881160 B/op	    1875 allocs/op

# aperturerobotics
BenchmarkParsingPodList-16          	     486	  24549996 ns/op	20171070 B/op	   94116 allocs/op
BenchmarkParsingPodListJSON-16      	     366	  32863580 ns/op	 5881094 B/op	    1875 allocs/op
# baseline
BenchmarkParsingPodList-16          	     318	  37327507 ns/op	50663245 B/op	   83283 allocs/op
BenchmarkParsingPodListJSON-16      	     367	  32609636 ns/op	 3194823 B/op	    3304 allocs/op

# aperturerobotics
BenchmarkParsingPodList-16          	     573	  20644869 ns/op	16107901 B/op	   83516 allocs/op
BenchmarkParsingPodListJSON-16      	     361	  32671034 ns/op	 3194867 B/op	    3305 allocs/op

@sorindumitru (Collaborator)

Thanks @szvincze. Our current plan is to wait for encoding/json/v2 to become non-experimental and then switch to it. That package seems to be the best option from our point of view:

  • Better performance than valyala/fastjson (although not better than the fork), based on the benchmark.
  • Much better memory usage, based on the benchmark.
  • Part of the standard library, so least susceptible to supply-chain issues.

Based on the runs you provided, this also seems to hold for your environment, so we hope this is an acceptable path forward for you. If that's the case, could you close this pull request?

@szvincze (Contributor, author)

Yes, it is acceptable. We will go with the fork until upstream switches to encoding/json/v2.

@szvincze szvincze closed this Sep 10, 2025
@szvincze szvincze deleted the fastjson-fork branch September 10, 2025 07:00