Skip to content

Reload defunct runners #68

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 18 additions & 4 deletions pkg/inference/scheduling/loader.go
Original file line number Diff line number Diff line change
Expand Up @@ -162,7 +162,13 @@
for r, slot := range l.runners {
unused := l.references[slot] == 0
idle := unused && now.Sub(l.timestamps[slot]) > runnerIdleTimeout
if unused && (!idleOnly || idle) {
defunct := false
select {
case <-l.slots[slot].done:
defunct = true
default:
}
if unused && (!idleOnly || idle || defunct) {
Comment on lines +165 to +171
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This chunk looks good, I would just update the doc comment for evict to reflect that it also evicts defunct runners if possible.

l.log.Infof("Evicting %s backend runner with model %s in %s mode",
r.backend, r.model, r.mode,
)
Expand Down Expand Up @@ -372,9 +378,17 @@
// See if we can satisfy the request with an existing runner.
existing, ok := l.runners[runnerKey{backendName, model, mode}]
if ok {
l.references[existing] += 1
l.timestamps[existing] = time.Time{}
return l.slots[existing], nil
select {
case <-l.slots[existing].done:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd make sense to also run l.evictRunner(backendName, model) so we don't have to evict all runners in order to find a free slot. WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that makes sense.

l.log.Warnf("Will reload defunct %s runner for %s. Runner error: %s.", backendName, model,

Check failure

Code scanning / CodeQL

Log entries created from user input High

This log entry depends on a
user-provided value
.

Copilot Autofix

AI 2 days ago

To fix the issue, the model variable should be sanitized before being used in the log entry on line 383 of loader.go. Specifically, we should remove any newline characters (\n, \r) from the model string to prevent log injection attacks. This can be achieved using strings.ReplaceAll or similar methods.

The sanitization should be applied directly before the log statement to ensure that the logged value is safe. This fix will not alter the functionality of the code but will enhance its security.


Suggested changeset 1
pkg/inference/scheduling/loader.go

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/pkg/inference/scheduling/loader.go b/pkg/inference/scheduling/loader.go
--- a/pkg/inference/scheduling/loader.go
+++ b/pkg/inference/scheduling/loader.go
@@ -12,2 +12,3 @@
 	"github.com/docker/model-runner/pkg/logging"
+	"strings"
 )
@@ -382,3 +383,5 @@
 			case <-l.slots[existing].done:
-				l.log.Warnf("Will reload defunct %s runner for %s. Runner error: %s.", backendName, model,
+				safeModel := strings.ReplaceAll(model, "\n", "")
+				safeModel = strings.ReplaceAll(safeModel, "\r", "")
+				l.log.Warnf("Will reload defunct %s runner for %s. Runner error: %s.", backendName, safeModel,
 					l.slots[existing].err)
EOF
@@ -12,2 +12,3 @@
"github.com/docker/model-runner/pkg/logging"
"strings"
)
@@ -382,3 +383,5 @@
case <-l.slots[existing].done:
l.log.Warnf("Will reload defunct %s runner for %s. Runner error: %s.", backendName, model,
safeModel := strings.ReplaceAll(model, "\n", "")
safeModel = strings.ReplaceAll(safeModel, "\r", "")
l.log.Warnf("Will reload defunct %s runner for %s. Runner error: %s.", backendName, safeModel,
l.slots[existing].err)
Copilot is powered by AI and may make mistakes. Always verify output.
l.slots[existing].err)
// Evict the defunct runner if it is not in use by anyone else.
l.evictRunner(backendName, model)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
l.evictRunner(backendName, model)
// Reset the reference count to zero so that we can evict the runner and then start a new one.
l.references[existing] = 0
l.evictRunner(backendName, model)

Copy link
Member Author

@p1-0tr p1-0tr Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Though I wonder if it would not be safer to let the reference counting work normally, issue and idle check here, and expand the idle check logic to look for defunct or stale runners. WDYT?

Copy link
Contributor

@doringeman doringeman Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expand the idle check logic to look for defunct or stale runners

I like this!

Although, in this specific case, this code which comes right after the code you're changing will evict all (1, currently, but still) runners if all the slots are full and the current one that's attempted to be loaded is defunct and not clean up, right?

// If there's not sufficient memory or all slots are full, then try
// evicting unused runners.
if memory > l.availableMemory || len(l.runners) == len(l.slots) {
	l.evict(false)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure forcing the refcount to 0 does put us at a risk of panicing in loader.release. I've opted not to force the refcount to 0, and added logic in evict to remove defunct runners.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we can't force the refcount to 0 here.

The bigger issue I see with the new logic is that evictRunner in this case might not actually evict if there's a non-zero reference count for the defunct runner (e.g. a client that hasn't realized its backend is defunct yet). The problem is that this code would then continue and override the l.runners entry for runnerKey{backend, model, mode} with a newly created runner, so when that hypothetical outstanding defunct runner is finally released, it will decrement the reference count for the new runner in release (since it uses the same key to look up the slot).

I think what I would do is put a label (say WaitForChange:) just above the last block of code in this loop (grep for "Wait for something to change") and then in the case <-l.slots[existing].done: path, I would goto WaitForChange. Then, in release, add a check for <-runner.done and immediately evict if l.references[slot] == 0. Because realistically any client using a defunct runner will find out quite quickly once the socket connection closes, which means the runner will be release'd quickly, which will call broadcast and break the waiting load call out of its waiting loop.

default:
l.references[existing] += 1
l.timestamps[existing] = time.Time{}
return l.slots[existing], nil
}
}

// If there's not sufficient memory or all slots are full, then try
Expand Down