
Reload defunct runners #68

Open · p1-0tr wants to merge 1 commit into main from ps-reload-defunct-runners

Conversation

p1-0tr (Member) commented Jun 5, 2025

In case a runner becomes defunct, e.g. as a result of a backend crash, it would be neat to be able to reload it. So, if the loader finds an existing runner, have it check whether the runner is still alive, and create a new one if the runner is defunct.

return l.slots[existing], nil
select {
case <-l.slots[existing].done:
l.log.Warnf("Will reload defunct %s runner for %s. Runner error: %s.", backendName, model,

Check failure — Code scanning / CodeQL

Log entries created from user input (High): This log entry depends on a user-provided value.

Copilot Autofix (AI) · 1 day ago

To fix the issue, the model variable should be sanitized before being used in the log entry on line 383 of loader.go. Specifically, we should remove any newline characters (\n, \r) from the model string to prevent log injection attacks. This can be achieved using strings.ReplaceAll or similar methods.

The sanitization should be applied directly before the log statement to ensure that the logged value is safe. This fix will not alter the functionality of the code but will enhance its security.


Suggested changeset 1: pkg/inference/scheduling/loader.go

Autofix patch. Run the following command in your local git repository to apply it:
cat << 'EOF' | git apply
diff --git a/pkg/inference/scheduling/loader.go b/pkg/inference/scheduling/loader.go
--- a/pkg/inference/scheduling/loader.go
+++ b/pkg/inference/scheduling/loader.go
@@ -12,2 +12,3 @@
 	"github.com/docker/model-runner/pkg/logging"
+	"strings"
 )
@@ -382,3 +383,5 @@
 			case <-l.slots[existing].done:
-				l.log.Warnf("Will reload defunct %s runner for %s. Runner error: %s.", backendName, model,
+				safeModel := strings.ReplaceAll(model, "\n", "")
+				safeModel = strings.ReplaceAll(safeModel, "\r", "")
+				l.log.Warnf("Will reload defunct %s runner for %s. Runner error: %s.", backendName, safeModel,
 					l.slots[existing].err)
EOF
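
As an aside, the same sanitization could also be factored into a small helper rather than inlined at the call site. The sketch below is only illustrative: the helper name sanitizeForLog is hypothetical and is not part of the repository or of the autofix above.

package scheduling

import "strings"

// sanitizeForLog strips CR/LF characters so that a user-controlled value,
// such as a model name, cannot forge additional lines in the log output.
// Hypothetical helper, shown only to illustrate the sanitization pattern.
func sanitizeForLog(s string) string {
	s = strings.ReplaceAll(s, "\n", "")
	return strings.ReplaceAll(s, "\r", "")
}

The log call would then read l.log.Warnf("Will reload defunct %s runner for %s. Runner error: %s.", backendName, sanitizeForLog(model), l.slots[existing].err).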
l.timestamps[existing] = time.Time{}
return l.slots[existing], nil
select {
case <-l.slots[existing].done:
Contributor commented:
I think it'd make sense to also run l.evictRunner(backendName, model) so we don't have to evict all runners in order to find a free slot. WDYT?

p1-0tr (Member, Author) replied:

Yep, that makes sense.

p1-0tr force-pushed the ps-reload-defunct-runners branch from c4243a2 to 8d5a74a (June 5, 2025, 13:03)
case <-l.slots[existing].done:
l.log.Warnf("Will reload defunct %s runner for %s. Runner error: %s.", backendName, model,
l.slots[existing].err)
l.evictRunner(backendName, model)
Contributor commented:
Suggested change:
- l.evictRunner(backendName, model)
+ // Reset the reference count to zero so that we can evict the runner and then start a new one.
+ l.references[existing] = 0
+ l.evictRunner(backendName, model)

p1-0tr (Member, Author) commented Jun 5, 2025:

Makes sense. Though I wonder if it would not be safer to let the reference counting work normally, issue an idle check here, and expand the idle check logic to look for defunct or stale runners. WDYT?

doringeman (Contributor) commented Jun 5, 2025:

“expand the idle check logic to look for defunct or stale runners”

I like this!

Although, in this specific case, the code which comes right after the code you're changing will evict all runners (1, currently, but still) if all the slots are full and the runner that's being loaded is defunct and not cleaned up, right?

// If there's not sufficient memory or all slots are full, then try
// evicting unused runners.
if memory > l.availableMemory || len(l.runners) == len(l.slots) {
	l.evict(false)
}

p1-0tr (Member, Author) replied:

I'm pretty sure forcing the refcount to 0 puts us at risk of panicking in loader.release. I've opted not to force the refcount to 0, and instead added logic in evict to remove defunct runners.

xenoscopic (Collaborator) commented:

I agree that we can't force the refcount to 0 here.

The bigger issue I see with the new logic is that evictRunner in this case might not actually evict if there's a non-zero reference count for the defunct runner (e.g. a client that hasn't realized its backend is defunct yet). The problem is that this code would then continue and override the l.runners entry for runnerKey{backend, model, mode} with a newly created runner, so when that hypothetical outstanding defunct runner is finally released, it will decrement the reference count for the new runner in release (since it uses the same key to look up the slot).

I think what I would do is put a label (say WaitForChange:) just above the last block of code in this loop (grep for "Wait for something to change") and then in the case <-l.slots[existing].done: path, I would goto WaitForChange. Then, in release, add a check for <-runner.done and immediately evict if l.references[slot] == 0. Because realistically any client using a defunct runner will find out quite quickly once the socket connection closes, which means the runner will be release'd quickly, which will call broadcast and break the waiting load call out of its waiting loop.
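
To make the suggested control flow concrete, here is a rough, self-contained sketch of the wait-then-evict-then-recreate ordering. It is not the actual loader code: the toy runner/loader types, field names, and the example key are assumptions; the real load loop would use a WaitForChange label with goto rather than the for/continue used here; and the real implementation additionally tracks slots, timestamps, and memory.

package main

import (
	"fmt"
	"sync"
)

// runner stands in for a backend runner; done is closed when the backend
// process exits, marking the runner as defunct.
type runner struct {
	done chan struct{}
}

// loader is a toy stand-in for the scheduling loader: a map of runners,
// per-key reference counts, and a condition variable used to wake waiting
// load calls when something changes.
type loader struct {
	mu      sync.Mutex
	cond    *sync.Cond
	runners map[string]*runner
	refs    map[string]int
}

func newLoader() *loader {
	l := &loader{runners: map[string]*runner{}, refs: map[string]int{}}
	l.cond = sync.NewCond(&l.mu)
	return l
}

// load returns an existing live runner. If the existing runner is defunct,
// it waits for the runner to be fully released and evicted before creating
// a replacement, instead of overwriting the map entry while references remain.
func (l *loader) load(key string) *runner {
	l.mu.Lock()
	defer l.mu.Unlock()
	for {
		if r, ok := l.runners[key]; ok {
			select {
			case <-r.done:
				// Defunct: wait for something to change, then re-evaluate.
				l.cond.Wait()
				continue
			default:
				l.refs[key]++
				return r
			}
		}
		// No runner for this key: create a fresh one.
		r := &runner{done: make(chan struct{})}
		l.runners[key] = r
		l.refs[key] = 1
		return r
	}
}

// release drops a reference; if the runner is defunct and now unreferenced,
// it is evicted immediately and waiters are woken so load can recreate it.
func (l *loader) release(key string, r *runner) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.refs[key]--
	select {
	case <-r.done:
		if l.refs[key] == 0 {
			delete(l.runners, key)
			delete(l.refs, key)
		}
	default:
	}
	l.cond.Broadcast()
}

func main() {
	l := newLoader()
	r := l.load("example-backend/example-model")
	close(r.done) // simulate a backend crash
	go l.release("example-backend/example-model", r)
	r2 := l.load("example-backend/example-model") // waits until the defunct runner is evicted
	fmt.Println(r2 != r)                          // true: a fresh runner was created
}

The important property is that load never overwrites the runners entry while references to the defunct runner remain, so a late release can never decrement the count of a freshly created runner.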

p1-0tr force-pushed the ps-reload-defunct-runners branch from 8d5a74a to 869b389 (June 5, 2025, 13:41)
Commit message:

In case a runner becomes defunct, e.g. as a result of a backend crash, it
would be neat to be able to reload it. So, if the loader finds an existing
runner, have it check whether the runner is still alive, and create a new
one if the runner is defunct.

Signed-off-by: Piotr Stankiewicz <[email protected]>
p1-0tr force-pushed the ps-reload-defunct-runners branch from 869b389 to e69a618 (June 6, 2025, 11:46)
xenoscopic (Collaborator) left a comment:

I like the idea, but I think we'll need a slightly different approach.

Comment on lines +165 to +171
defunct := false
select {
case <-l.slots[slot].done:
	defunct = true
default:
}
if unused && (!idleOnly || idle || defunct) {
xenoscopic (Collaborator) commented:

This chunk looks good; I would just update the doc comment for evict to reflect that it also evicts defunct runners if possible.
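
For instance, the updated doc comment might read something like the lines below. The wording is only a suggestion, and the idleOnly/defunct semantics are inferred from the snippet above rather than taken from the actual evict implementation:

// evict unloads runners that are no longer referenced. If idleOnly is true,
// only runners that have been idle past the timeout, or that have become
// defunct (their done channel has closed), are evicted.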
