[RFC] 146 - Operation Cancellation & Interruption Handling Architecture #10288

arvinxx · 2025-11-18T16:55:24Z

arvinxx
Nov 18, 2025
Maintainer

Summary

This RFC documents the comprehensive architecture for operation cancellation and interruption handling in LobeHub's agent runtime system. It describes the hierarchical operation structure, cancellation propagation mechanism, and interrupt handling behavior across different execution phases.

Motivation

As LobeHub's agent runtime system has evolved with complex multi-step workflows involving LLM calls, tool executions, and human interventions, we needed a robust and consistent cancellation mechanism that:

Properly cleans up pending operations when users cancel mid-execution
Maintains state consistency across the operation tree
Provides clear feedback to users about what was cancelled
Prevents resource leaks from incomplete operations
Follows clear design principles for maintainability

This RFC documents the cancellation architecture as it stands after implementation and refinement.

Operation Type Hierarchy

Core Operation Types

type OperationType =
  // === Message sending ===
  | 'sendMessage'
  | 'createTopic'
  | 'regenerate'
  | 'continue'

  // === AI generation ===
  | 'execAgentRuntime'        // Top-level operation
  | 'createAssistantMessage'  // Sub-operation
  | 'callLLM'                 // Sub-operation
  | 'reasoning'               // Sub-operation

  // === Tool calling ===
  | 'toolCalling'             // Top-level for a tool
  | 'createToolMessage'       // Sub-operation
  | 'executeToolCall'         // Sub-operation
  | 'builtinToolSearch'       // Leaf operation
  | 'builtinToolInterpreter'  // Leaf operation
  | 'builtinToolLocalSystem'  // Leaf operation
  | 'pluginApi'               // Leaf operation

  // === Others ===
  | 'rag'
  | 'searchWorkflow'
  | 'translate'
  | 'topicSummary'
  | 'historySummary'
  | 'supervisorDecision'
  | 'groupAgentGenerate';

Operation Status

type OperationStatus =
  | 'pending'    // Waiting to start (not currently used)
  | 'running'    // Executing
  | 'paused'     // Paused (for user intervention scenarios)
  | 'completed'  // Successfully completed
  | 'cancelled'  // User cancelled
  | 'failed';    // Execution failed
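
The cancellation snippets later in this RFC read several fields from each operation record. As a rough sketch only, the record might look like the following. The field names are inferred from those snippets (`abortController`, `childOperationIds`, `metadata`), while the `Operation` interface itself and the `isTerminal` helper are assumptions for illustration, not the definitions in `types.ts`.

```typescript
// Status union copied from the section above.
type OperationStatus =
  | 'pending' | 'running' | 'paused' | 'completed' | 'cancelled' | 'failed';

// Assumed record shape, inferred from the fields the cancellation snippets read.
interface Operation {
  id: string;
  type: string;
  status: OperationStatus;
  abortController: AbortController;
  parentOperationId?: string;
  childOperationIds?: string[];
  metadata: { startTime: number; cancelReason?: string };
}

// Hypothetical helper: an operation in a terminal state never runs again.
const TERMINAL: ReadonlySet<OperationStatus> = new Set<OperationStatus>([
  'completed',
  'cancelled',
  'failed',
]);

function isTerminal(op: Operation): boolean {
  return TERMINAL.has(op.status);
}

const op: Operation = {
  id: 'op_1',
  type: 'execAgentRuntime',
  status: 'running',
  abortController: new AbortController(),
  metadata: { startTime: Date.now() },
};
```

Treating `completed`, `cancelled`, and `failed` as terminal is one reading of the status list; `paused` is left non-terminal because it is described as waiting for user intervention.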

Operation Tree Structure

Typical Execution Flow

User Action (sendMessage/regenerate/continue)
│
├─ execAgentRuntime (top-level operation)
│  │
│  ├─ Step 1: call_llm
│  │  ├─ createAssistantMessage
│  │  └─ callLLM (streaming)
│  │     └─ reasoning (optional, if model supports)
│  │
│  ├─ Step 2: call_tool (if LLM returns tool calls)
│  │  ├─ toolCalling (top-level for this tool)
│  │  │  ├─ createToolMessage
│  │  │  └─ executeToolCall
│  │  │     ├─ builtinToolSearch (if builtin tool)
│  │  │     ├─ builtinToolInterpreter
│  │  │     ├─ builtinToolLocalSystem
│  │  │     └─ pluginApi (if plugin tool)
│  │  │
│  │  └─ (repeat for each tool call)
│  │
│  ├─ Step 3: call_llm (process tool results)
│  │  └─ ... (same as Step 1)
│  │
│  └─ Step N: finish / resolve_aborted_tools
│
└─ (operation completed)

Example: Simple LLM Chat (No Tool Calls)

execAgentRuntime
└─ call_llm
   └─ finish (completed)

Example: Single Tool Call

execAgentRuntime
├─ call_llm (returns 1 tool call)
├─ call_tool
│  └─ toolCalling
│     ├─ createToolMessage
│     └─ executeToolCall
│        └─ builtinToolSearch
└─ call_llm (process tool results)
   └─ finish (completed)
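
To make the parent/child links in the trees above concrete, here is a minimal sketch that stores operations in a flat map and walks `childOperationIds` to print the hierarchy. The `OpNode` shape and the `renderTree` helper are hypothetical; only the operation type names come from this RFC.

```typescript
// Minimal node shape for illustration; real operation records carry more fields.
interface OpNode {
  type: string;
  childOperationIds?: string[];
}

// Depth-first walk that returns one indented line per operation.
function renderTree(ops: Record<string, OpNode>, id: string, depth = 0): string[] {
  const node = ops[id];
  if (!node) return [];
  const lines = [`${'  '.repeat(depth)}${node.type}`];
  for (const childId of node.childOperationIds ?? []) {
    lines.push(...renderTree(ops, childId, depth + 1));
  }
  return lines;
}

// The "Single Tool Call" example, reduced to its operations (step names omitted).
const ops: Record<string, OpNode> = {
  root: { type: 'execAgentRuntime', childOperationIds: ['tool'] },
  tool: { type: 'toolCalling', childOperationIds: ['msg', 'exec'] },
  msg: { type: 'createToolMessage' },
  exec: { type: 'executeToolCall', childOperationIds: ['search'] },
  search: { type: 'builtinToolSearch' },
};

console.log(renderTree(ops, 'root').join('\n'));
```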

Cancellation Propagation Mechanism

Cancellation Workflow

When a user cancels an operation:

cancelOperation(operationId: string, reason?: string) {
  const operation = operations[operationId];

  // 1. Abort the operation (triggers AbortSignal)
  operation.abortController.abort(reason);

  // 2. Call cancel handler if registered
  if (operation.onCancelHandler) {
    const { type, metadata } = operation;
    operation.onCancelHandler({ operationId, type, reason, metadata });
  }

  // 3. Update status to 'cancelled'
  operation.status = 'cancelled';
  operation.metadata.cancelReason = reason;

  // 4. Recursively cancel child operations
  if (operation.childOperationIds) {
    for (const childId of operation.childOperationIds) {
      cancelOperation(childId, reason);
    }
  }
}
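
The workflow above can be exercised in isolation. The following is a self-contained sketch under stated assumptions, not the store implementation: `addOp` and the `Op` shape are hypothetical, cancel handlers are omitted, and skipping operations that are no longer running is a choice made for this example.

```typescript
type Status = 'running' | 'cancelled' | 'completed';

interface Op {
  id: string;
  status: Status;
  abortController: AbortController;
  childOperationIds: string[];
  cancelReason?: string;
}

const operations = new Map<string, Op>();

// Hypothetical helper that registers an operation and links it to its parent.
function addOp(id: string, parentId?: string): Op {
  const op: Op = {
    id,
    status: 'running',
    abortController: new AbortController(),
    childOperationIds: [],
  };
  operations.set(id, op);
  if (parentId) operations.get(parentId)?.childOperationIds.push(id);
  return op;
}

// Cancel an operation and its running descendants; finished operations are left alone.
function cancelOperation(id: string, reason?: string): void {
  const op = operations.get(id);
  if (!op || op.status !== 'running') return;
  op.abortController.abort(reason);
  op.status = 'cancelled';
  op.cancelReason = reason;
  for (const childId of op.childOperationIds) cancelOperation(childId, reason);
}

addOp('runtime');
addOp('tool', 'runtime');
const search = addOp('search', 'tool');
const finished = addOp('createMsg', 'tool');
finished.status = 'completed'; // finished before the user cancelled

cancelOperation('runtime', 'user_cancelled');
```

After the call, `search` is cancelled and its signal is aborted, while `finished` keeps its `completed` status.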

Propagation Example

User clicks cancel → cancelOperation(execAgentRuntime)
  ↓
1. Abort execAgentRuntime.abortController
  ↓
2. Execute execAgentRuntime.onCancelHandler (if any)
  ↓
3. Update execAgentRuntime.status = 'cancelled'
  ↓
4. Recursively cancel children:
   ├─ cancelOperation(toolCalling)
   │  ├─ Abort toolCalling.abortController
   │  ├─ Execute toolCalling.onCancelHandler
   │  ├─ Update toolCalling.status = 'cancelled'
   │  └─ Recursively cancel:
   │     ├─ cancelOperation(createToolMessage)
   │     │  └─ Execute cancel handler: wait for message creation, mark aborted
   │     └─ cancelOperation(executeToolCall)
   │        ├─ Execute cancel handler: update message to aborted
   │        └─ cancelOperation(builtinToolSearch)
   │           └─ Abort search, signal.aborted = true
   └─ (other children...)

Layered Cancellation Handling

Layer 1: Streaming Executor

Responsibility: Detect cancellation and trigger agent interrupt handling

while (state.status !== 'done' && state.status !== 'error') {
  // Check if operation has been cancelled
  const currentOperation = get().operations[operationId];
  if (currentOperation?.status === 'cancelled') {
    log('[internal_execAgentRuntime] Operation cancelled, marking state as interrupted');

    // Set state.status to 'interrupted' to trigger agent abort handling
    state = { ...state, status: 'interrupted' };

    // Let agent handle the abort (will clean up pending tools if needed)
    const result = await runtime.step(state, nextContext);
    state = result.newState;

    log('[internal_execAgentRuntime] Operation cancelled, stopping loop');
    break;
  }

  // Execute step
  const result = await runtime.step(state, nextContext);
  // ...
}

Key Design:

Does NOT directly handle abort logic - Avoids duplicating cleanup code
Sets state.status = 'interrupted' - Triggers agent's unified abort handling
Calls runtime.step() - Lets agent execute cleanup logic
Agent automatically calls handleAbort() - Cleans up pending tools (if any)
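
The key design points above can be condensed into a small runnable model. This is a simplified sketch: `step` stands in for `runtime.step()` and is synchronous here (the real call is async), `isCancelled` stands in for reading the operation status from the store, and `stepCount` exists only so the result is observable.

```typescript
type AgentStatus = 'running' | 'interrupted' | 'done' | 'error';

interface AgentState {
  status: AgentStatus;
  stepCount: number;
}

// Stand-in for runtime.step(): the real call is async and returns more than state.
function step(state: AgentState): AgentState {
  // Agent-side unified abort check: an interrupted state finishes immediately.
  if (state.status === 'interrupted') return { ...state, status: 'done' };
  return { ...state, stepCount: state.stepCount + 1 };
}

// Runs the loop until done, polling isCancelled() before every step.
function runLoop(isCancelled: () => boolean): AgentState {
  let state: AgentState = { status: 'running', stepCount: 0 };
  while (state.status !== 'done' && state.status !== 'error') {
    if (isCancelled()) {
      // Hand cleanup to the agent instead of duplicating it here.
      state = step({ ...state, status: 'interrupted' });
      break;
    }
    state = step(state);
  }
  return state;
}

// Simulate a user cancelling after two normal steps.
let polls = 0;
const finalState = runLoop(() => ++polls > 2);
console.log(finalState); // { status: 'done', stepCount: 2 }
```

The loop never performs cleanup itself; it only flips the status and gives the agent one more step to react.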

Layer 2: Agent (GeneralChatAgent)

Responsibility: Unified abort checking and cleanup decision

async runner(context: AgentRuntimeContext, state: AgentState) {
  // Unified abort check: before all phase handling
  if (state.status === 'interrupted') {
    return this.handleAbort(context, state);
  }

  // ... phase handling
}

private handleAbort(context: AgentRuntimeContext, state: AgentState): AgentInstruction {
  const { hasToolsCalling, parentMessageId, toolsCalling } =
    this.extractAbortInfo(context, state);

  // If there are pending tool calls, clean them up
  if (hasToolsCalling && toolsCalling.length > 0) {
    return {
      type: 'resolve_aborted_tools',
      payload: { parentMessageId, toolsCalling }
    };
  }

  // No tools to clean up, finish directly
  return {
    type: 'finish',
    reason: 'user_requested',
    reasonDetail: 'Operation cancelled by user'
  };
}

Abort Information Extraction:

Extracts different information based on current phase:

llm_result phase:
- Extract toolsCalling from payload
- Tools haven't created messages yet
tool_result / tools_batch_result phase:
- Find messages with pluginIntervention.status === 'pending' in state.messages
- Extract plugin info as toolsCalling

Layer 3: Executor

Responsibility: Register cancel handlers and cleanup resources

createToolMessage - "Ensure Complete" Strategy

onOperationCancel(createToolMsgOpId, async ({ metadata }) => {
  // Wait for message creation to complete
  const createResult = await metadata?.createMessagePromise;

  if (createResult) {
    // Update message to aborted state
    await Promise.all([
      optimisticUpdateMessageContent(msgId, 'Tool execution was cancelled by user.'),
      optimisticUpdateMessagePlugin(msgId, { intervention: { status: 'aborted' } })
    ]);
  }
});

Rationale: Message creation is async; when cancelled, it might be in progress. Wait for completion then mark as aborted.
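
To illustrate why waiting matters, here is a minimal simulation of the "Ensure Complete" strategy. Everything in it is a stand-in: `createMessage` fakes the async creation with a timer, `store` replaces the chat store, and `onCancel` plays the role of the registered cancel handler.

```typescript
interface Message {
  id: string;
  content: string;
  status: 'ok' | 'aborted';
}

const store = new Map<string, Message>();

// Simulated async message creation that resolves on a later tick.
function createMessage(id: string): Promise<Message> {
  return new Promise((resolve) =>
    setTimeout(() => {
      const msg: Message = { id, content: '', status: 'ok' };
      store.set(id, msg);
      resolve(msg);
    }, 0),
  );
}

// Cancel handler: wait for the in-flight creation, then mark the message aborted.
async function onCancel(createMessagePromise?: Promise<Message>): Promise<void> {
  const created = await createMessagePromise;
  if (!created) return; // creation never started, nothing to mark
  created.content = 'Tool execution was cancelled by user.';
  created.status = 'aborted';
}

async function demo(): Promise<Message | undefined> {
  const pending = createMessage('tool_msg_1');
  // The user cancels while creation is still in flight.
  await onCancel(pending);
  return store.get('tool_msg_1');
}
```

If the handler updated the message without awaiting the promise, the message would not exist yet and the update would have nothing to act on.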

executeToolCall - "Immediate Cleanup" Strategy

onOperationCancel(executeToolOpId, async () => {
  // Update message to aborted state immediately
  await Promise.all([
    optimisticUpdateMessageContent(toolMessageId, 'Tool execution was cancelled by user.'),
    optimisticUpdateMessagePlugin(toolMessageId, { intervention: { status: 'aborted' } })
  ]);
});

Rationale: Message already exists; update state immediately for fast response.

Parent Operation Check

// Check if parent operation was cancelled while creating message
const toolOperation = toolOperationId ? get().operations[toolOperationId] : undefined;
if (toolOperation?.abortController.signal.aborted) {
  log('[call_tool] Parent operation cancelled, skipping tool execution');
  return { events, newState: state };
}

Layer 4: Tool

Responsibility: Check AbortSignal and stop execution

// Builtin tool checks abort
if (abortController.signal.aborted) {
  log('[search] Operation cancelled, stopping');
  return;
}

Rationale: The tool only stops its own work; it relies on the parent operation's cancel handler to update the message status.

Interrupt Behavior by Phase

Phase 1: init / user_input

Interrupt Timing: User message just submitted, LLM not called yet

Behavior:

Agent detects status === 'interrupted'
Calls handleAbort()
No pending tools
Returns finish instruction
state.status becomes 'done'

UI:

No assistant message created
Operation completes directly

Phase 2: llm_result (During LLM Streaming)

Interrupt Timing: LLM is streaming output

Behavior (call_llm executor):

// internal_fetchAIChatMessage detects abort
if (aborted) {
  onFinish('', { finishType: 'abort' });
}

// call_llm executor returns human_abort phase
if (finishType === 'abort') {
  return {
    nextContext: {
      phase: 'human_abort',
      payload: {
        reason: 'user_cancelled',
        hasToolsCalling: false,  // During streaming, no tool calls yet
        toolsCalling: []
      }
    }
  };
}

Agent handles human_abort phase:

Detects no pending tools
Returns finish instruction
state.status becomes 'done'

UI:

Assistant message shows partial content output
Message status normal (no error)
Operation completes

Phase 3: llm_result (LLM Complete, Ready to Execute Tools)

Interrupt Timing: LLM returned tool calls, but tool messages not created yet

Behavior (streaming executor):

// Loop detects operation cancelled
if (currentOperation?.status === 'cancelled') {
  // Set state.status = 'interrupted'
  state = { ...state, status: 'interrupted' };

  // Call runtime.step(), triggers agent abort handling
  const result = await runtime.step(state, nextContext);
  state = result.newState;  // state.status becomes 'done'
  break;
}

Agent handling (unified abort check in runner):

if (state.status === 'interrupted') {
  return this.handleAbort(context, state);
}

// handleAbort extracts abort info:
// - phase = 'llm_result'
// - hasToolsCalling = true
// - toolsCalling = [...] (extracted from payload)

// Returns resolve_aborted_tools instruction
return {
  type: 'resolve_aborted_tools',
  payload: { parentMessageId, toolsCalling }
};

resolve_aborted_tools executor executes:

Creates tool message for each tool call
Sets content: 'Tool execution was aborted by user.'
Sets pluginIntervention: { status: 'aborted' }
Sets state.status = 'done'
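
The four steps above can be sketched as a pure function over state. The `ToolMessage` and `AgentState` shapes and the `resolveAbortedTools` name are hypothetical; the content string and the `aborted` intervention status are the ones listed above.

```typescript
// Hypothetical shapes; the real message and state types carry more fields.
interface ToolCall { id: string; apiName: string }

interface ToolMessage {
  role: 'tool';
  parentId: string;
  toolCallId: string;
  content: string;
  pluginIntervention: { status: 'aborted' };
}

interface AgentState {
  status: string;
  messages: ToolMessage[];
}

function resolveAbortedTools(
  state: AgentState,
  payload: { parentMessageId: string; toolsCalling: ToolCall[] },
): AgentState {
  // One placeholder tool message per tool call that never ran.
  const abortedMessages = payload.toolsCalling.map(
    (tool): ToolMessage => ({
      role: 'tool',
      parentId: payload.parentMessageId,
      toolCallId: tool.id,
      content: 'Tool execution was aborted by user.',
      pluginIntervention: { status: 'aborted' },
    }),
  );
  // Return a new state object rather than mutating the input.
  return { ...state, status: 'done', messages: [...state.messages, ...abortedMessages] };
}

const before: AgentState = { status: 'interrupted', messages: [] };
const after = resolveAbortedTools(before, {
  parentMessageId: 'assistant_1',
  toolsCalling: [
    { id: 'call_1', apiName: 'search' },
    { id: 'call_2', apiName: 'interpreter' },
  ],
});
```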

UI:

Assistant message shows complete content
Tool messages show "Tool execution was aborted by user."
Tool cards show aborted status (gray/disabled style)
Operation completes

Phase 4: tool_result (During Tool Execution)

Interrupt Timing: Tool is executing (e.g., search, code execution)

Behavior (call_tool executor):

Case A: Cancelled During createToolMessage

// createToolMessage cancel handler executes
onOperationCancel(createToolMsgOpId, async ({ metadata }) => {
  // Wait for message creation to complete
  const createResult = await metadata?.createMessagePromise;

  // Update message to aborted state only if the message was actually created
  if (createResult) {
    await optimisticUpdateMessageContent(msgId, 'Tool execution was cancelled by user.');
    await optimisticUpdateMessagePlugin(msgId, { intervention: { status: 'aborted' } });
  }
});

Result:

Tool message creation completes
Message shows "Tool execution was cancelled by user."
Tool doesn't execute
Returns empty events, doesn't affect subsequent flow

Case B: Cancelled During executeToolCall

Builtin tool detects abort:

if (abortController.signal.aborted) {
  log('[search] Operation cancelled, stopping');
  return;
}

executeToolCall cancel handler executes:

onOperationCancel(executeToolOpId, async () => {
  // Update message to aborted state
  await optimisticUpdateMessageContent(toolMessageId, 'Tool execution was cancelled by user.');
  await optimisticUpdateMessagePlugin(toolMessageId, { intervention: { status: 'aborted' } });
});

Result:

Tool stops execution
Tool message updates to aborted state
Returns error event or empty events

Agent runtime loop continues:

Loop detects operation cancelled
Sets state.status = 'interrupted'
Agent handleAbort checks for other pending tools
- If yes: execute resolve_aborted_tools
- If no: finish directly

UI:

Tool message shows "Tool execution was cancelled by user."
Tool card shows aborted status
If other pending tools exist, they're also marked aborted
Operation completes

Phase 5: tool_result (Tool Complete, Ready to Call LLM)

Interrupt Timing: Tool execution complete, ready to call LLM with tool results

Behavior:

Similar to Phase 3, but tool result messages already exist
Agent handleAbort checks pending tools (if multiple tools, some completed)
Marks unexecuted tools as aborted
Sets state.status = 'done'

UI:

Executed tools show normal results
Unexecuted tools show aborted status
No new LLM response
Operation completes

Design Principles

1. Layered Responsibilities

Streaming Executor Layer
- Detect operation cancel
- Set state.status = 'interrupted'
- Does NOT directly handle abort logic
Agent Layer
- Unified abort check (state.status === 'interrupted')
- Extract abort info (extractAbortInfo)
- Decide how to handle (handleAbort)
Executor Layer
- Register cancel handlers
- Clean up resources (messages, state)
- Respond to AbortSignal
Tool Layer
- Check abortController.signal.aborted
- Stop execution
- Rely on parent cancel handler to update state

2. Cancellation Strategies

Ensure Complete Strategy (createToolMessage)
- Wait for async operation to complete
- Then mark as aborted
- Ensures state consistency
Immediate Cleanup Strategy (executeToolCall)
- Message already exists
- Update state directly
- Fast response to cancel
Recursive Cancel Strategy (all parent operations)
- Automatically cancel all child operations
- Guarantee complete cleanup
- Avoid dangling operations

3. State Consistency

Operation State
- Tracked via status field
- 'cancelled' status indicates cancelled
- 'completed' status indicates normal completion
Agent State
- status: 'interrupted' triggers abort handling
- status: 'done' indicates completion
- status: 'waiting_for_human' indicates waiting for approval
Message State
- pluginIntervention.status: 'aborted' indicates tool cancelled
- content shows cancellation reason
- UI displays different styles based on status

Best Practices

Creating Operations

// Always specify parentOperationId to establish hierarchy
const { operationId, abortController } = get().startOperation({
  type: 'toolCalling',
  context: {
    sessionId,
    topicId,
    messageId,  // Auto-associate message
  },
  parentOperationId,
  metadata: {
    startTime: Date.now(),
    // Other metadata...
  },
});

Registering Cancel Handlers

// Register cancel handler if cleanup needed
get().onOperationCancel(operationId, async ({ metadata }) => {
  // Cleanup logic
  await cleanupResources();

  // Update UI state
  await updateMessageState(messageId, 'aborted');
});

Checking Abort

// Periodically check in long-running operations
async function longRunningTask<T>(
  items: T[],
  processItem: (item: T) => Promise<void>,
  abortSignal: AbortSignal,
) {
  for (const item of items) {
    // Check before each item so a cancel takes effect between units of work
    if (abortSignal.aborted) {
      log('Task cancelled');
      return;
    }

    // Process item
    await processItem(item);
  }
}

Completing Operations

// On success
get().completeOperation(operationId);

// On failure
get().failOperation(operationId, {
  type: 'NetworkError',
  message: 'Failed to fetch data',
});
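
One way to make sure every operation ends in exactly one terminal state is to wrap the work in a helper. The sketch below is an assumption-laden illustration: `runOperation` is hypothetical, and `completeOperation` / `failOperation` are local stand-ins with simplified signatures rather than the store actions shown above.

```typescript
type Status = 'running' | 'completed' | 'cancelled' | 'failed';

interface OpRecord {
  status: Status;
  error?: { type: string; message: string };
}

const ops: Record<string, OpRecord> = {};

// Local stand-ins for the store actions.
const completeOperation = (id: string): void => {
  ops[id].status = 'completed';
};
const failOperation = (id: string, error: { type: string; message: string }): void => {
  ops[id].status = 'failed';
  ops[id].error = error;
};

// Runs a task and records its outcome, unless the operation was cancelled meanwhile.
async function runOperation<T>(
  id: string,
  signal: AbortSignal,
  task: () => Promise<T>,
): Promise<T | undefined> {
  try {
    const result = await task();
    // A cancel that landed mid-task already set 'cancelled'; do not overwrite it.
    if (!signal.aborted) completeOperation(id);
    return result;
  } catch (err) {
    if (!signal.aborted) {
      const message = err instanceof Error ? err.message : String(err);
      failOperation(id, { type: 'TaskError', message });
    }
    return undefined;
  }
}
```

Checking `signal.aborted` before recording the outcome keeps a late success or failure from masking a cancellation.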

Implementation Summary

Problems Solved

✅ Problem 1: Loop level didn't check cancellation status

Solution: Check operation.status === 'cancelled' at while loop start

✅ Problem 2: Cancellation didn't clean up pending tools

Solution: Set state.status = 'interrupted', trigger agent's handleAbort()

✅ Problem 3: Cancellation logic scattered across multiple places

Solution: Agent's unified abort check and handling

Current Limitations & Future Improvements

⚠️ Improvement 1: Operation timeout mechanism

Issue: No automatic timeout cleanup
Suggestion: Add operation timeout, auto-cleanup long-running operations
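
As one possible shape for this suggestion (not a committed design), a periodic sweep could compare each running operation's `metadata.startTime` against a limit and hand the expired ids to `cancelOperation`. The `findTimedOut` helper, the `TimedOperation` shape, and the 5-second limit below are all hypothetical.

```typescript
interface TimedOperation {
  status: 'running' | 'completed' | 'cancelled' | 'failed';
  metadata: { startTime: number };
}

// Returns ids of running operations that started more than maxAgeMs before `now`.
function findTimedOut(
  ops: Record<string, TimedOperation>,
  now: number,
  maxAgeMs: number,
): string[] {
  return Object.keys(ops).filter((id) => {
    const op = ops[id];
    return op.status === 'running' && now - op.metadata.startTime > maxAgeMs;
  });
}

const snapshot: Record<string, TimedOperation> = {
  fresh: { status: 'running', metadata: { startTime: 9000 } },
  stale: { status: 'running', metadata: { startTime: 1000 } },
  done: { status: 'completed', metadata: { startTime: 0 } },
};

// With now = 10000 and a 5 s limit, only "stale" has run too long.
const expired = findTimedOut(snapshot, 10000, 5000);
console.log(expired); // [ 'stale' ]
```

Passing `now` in as a parameter keeps the sweep deterministic and easy to test; a caller would typically supply `Date.now()`.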

⚠️ Improvement 2: Operation monitoring and debugging

Issue: Difficult to trace operation tree and execution history
Suggestion: Add DevTools panel to display operation tree and event flow

Reference Files

Core Files

src/store/chat/slices/operation/types.ts - Operation type definitions
src/store/chat/slices/operation/actions.ts - Operation management logic
src/store/chat/agents/GeneralChatAgent.ts - Agent abort handling
src/store/chat/agents/createAgentExecutors.ts - Executor implementation
src/store/chat/slices/aiChat/actions/streamingExecutor.ts - Streaming executor

Tool Files

src/store/chat/slices/builtinTool/actions/search.ts - Search tool
src/store/chat/slices/builtinTool/actions/interpreter.ts - Interpreter tool
src/store/chat/slices/builtinTool/actions/localSystem.ts - LocalSystem tool
src/store/chat/slices/plugin/actions/pluginTypes.ts - Plugin tools

Test Files

src/store/chat/agents/__tests__/GeneralChatAgent.test.ts - Agent tests
src/store/chat/agents/__tests__/createAgentExecutors/ - Executor tests
src/store/chat/slices/aiChat/actions/__tests__/streamingExecutor.test.ts - Streaming tests

Conclusion

This operation cancellation architecture provides:

Clear layered responsibilities - Each layer has well-defined duties
Consistent cancellation behavior - All cancellations follow the same pattern
Proper resource cleanup - No dangling operations or messages
User-friendly feedback - Clear UI indication of what was cancelled
Maintainable design - Easy to understand and extend

The architecture has been battle-tested and proven effective in production. This RFC serves as the definitive documentation for understanding and maintaining the system.
