
fix: retain terminating pod in cache to prevent premature eviction#1719

Open
maishivamhoo123 wants to merge 2 commits into Project-HAMi:master from maishivamhoo123:fix/retain-terminating-pod-cache

Conversation

@maishivamhoo123
Contributor

The Fix:
This PR updates the scheduler to retain pods in the device cache while they are in the Terminating state. The cache now evicts a pod only once the Kubelet reports it fully terminated (reaching the Succeeded or Failed phase).

Fixes #1368

Verification / Testing Performed

  • Ran standard unit tests via make test (Tests for scheduler and util passed).
  • Verified full pod lifecycle state transitions in scheduler_test.go.
  • Ran linter via make verify.

Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>
@hami-robot
Contributor

hami-robot bot commented Mar 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: maishivamhoo123
Once this PR has been reviewed and has the lgtm label, please assign wawa0210 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions github-actions bot added the kind/bug Something isn't working label Mar 29, 2026
@hami-robot hami-robot bot added the size/M label Mar 29, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses issue #1368 by preventing terminating pods from being prematurely removed from the scheduler's cache. It introduces a new IsPodTerminating utility function and includes tests to verify the behavior. Feedback highlights that while the early return preserves the cache entry, it leaves the cached pod object in a stale state because the DeletionTimestamp is not updated, which may cause issues for logic expecting current pod metadata.

@codecov

codecov bot commented Mar 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Flag Coverage Δ
unittests 52.09% <100.00%> (+0.16%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
pkg/device/pods.go 47.87% <100.00%> (+5.51%) ⬆️
pkg/scheduler/scheduler.go 42.49% <100.00%> (+2.02%) ⬆️
pkg/util/util.go 75.17% <100.00%> (+0.35%) ⬆️

Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>
@maishivamhoo123
Contributor Author

@archlitchi @Shouren @FouoF and @team can you please review this PR?

@archlitchi
Member

@maishivamhoo123 could you tell me how this PR fixes the issue #1368?

That issue describes a pending pod that is not retried for scheduling for around 5 minutes, even after another pod terminates and releases resources.

@maishivamhoo123
Contributor Author

@archlitchi @Shouren Thank you for the review.

Before this PR (the flow in onAddPod):

1. Pod A starts terminating and receives a DeletionTimestamp.
2. HAMi immediately removes Pod A from its internal device cache.
3. Because the cache entry is gone, HAMi incorrectly reports to the kube-scheduler that the node's resources are free.
4. The kube-scheduler attempts to schedule and bind pending Pod B to that node.
5. The Kubelet has not finished tearing down Pod A, so the physical resources remain locked and the binding fails.
6. Due to the binding failure, the kube-scheduler places Pod B into its exponential backoff queue, causing the 5-minute delay.

After this PR

1. Pod A starts terminating and receives a DeletionTimestamp.
2. The util.IsPodTerminating(pod) logic detects this, so HAMi retains Pod A in the cache and updates its state instead of deleting it.
3. HAMi accurately reports to the kube-scheduler that the node's resources are still in use.
4. Pending Pod B waits safely in the active scheduling queue without triggering any errors.
5. Pod A fully terminates (reaching the Succeeded or Failed phase), and the Kubelet releases the physical resources.
6. HAMi evicts Pod A from the cache, and the kube-scheduler binds Pod B immediately, bypassing the 5-minute backoff penalty.

@maishivamhoo123 maishivamhoo123 requested a review from Shouren March 30, 2026 12:39

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HAMi Scheduler Not Trying to Schedule Previously Pending Workload for 5mins

3 participants