@adduali1310 adduali1310 commented Nov 17, 2025

This fixes a segmentation fault that occurs when the /metrics endpoint is accessed during Falco shutdown. The crash happens because there is a very short window in which the webserver continues serving /metrics requests after the outputs have been destroyed.

The race condition is timing-dependent and occurs in this window:

  1. process_events() returns, and outputs.reset() runs as part of it
  2. The webserver is still running and can receive /metrics requests
  3. The metrics code accesses the now-destroyed outputs → crash

Changes:

  • Create cleanup_outputs action to handle outputs destruction
  • Reorder teardown steps to stop the webserver before destroying outputs
  • Move outputs.reset() from process_events to cleanup_outputs()

This eliminates the race condition by ensuring the webserver stops accepting requests before any subsystems are destroyed. The synchronisation behaviour of outputs.reset() (blocking until the output queue is flushed) is preserved.

What type of PR is this?

/kind bug

Any specific area of the project related to this PR?

/area engine

What this PR does / why we need it:
Fixes a race condition that causes a segmentation fault during Falco shutdown when the Prometheus metrics endpoint is accessed. The crash occurs in a very short window after the outputs have been destroyed but before the webserver is stopped: accessing the destroyed outputs is an invalid memory access and raises SIGSEGV.
The fix reorders the shutdown sequence to stop the webserver before destroying the subsystems.

Which issue(s) this PR fixes:

Fixes #3739

Special notes for your reviewer:
Created a simple bash script to reproduce the issue.

#!/bin/bash

FALCO_BIN="Binary Path"   # replace with the path to the falco binary
METRICS_URL="http://localhost:3018/metrics"

echo "Starting race condition test..."

$FALCO_BIN -c /etc/falco/falco.yaml &
FALCO_PID=$!

sleep 3

echo "Falco started (PID: $FALCO_PID), launching 20 parallel scrapers..."

for i in {1..20}; do
    while true; do
        curl -s "$METRICS_URL" > /dev/null 2>&1
    done &
done

sleep 2

echo "Sending SIGTERM to Falco..."
kill -SIGTERM $FALCO_PID

sleep 3

echo "Cleaning up scrapers..."

pkill -P $$ curl 2>/dev/null
killall curl 2>/dev/null   # the curls are children of the loop subshells, so also kill by name


wait $FALCO_PID 2>/dev/null
EXIT_CODE=$?

echo "Falco exit code: $EXIT_CODE"

if [ $EXIT_CODE -eq 139 ]; then
    echo "CRASH DETECTED: Segmentation fault (exit code 139)"
    exit 1
elif [ $EXIT_CODE -eq 0 ]; then
    echo "SUCCESS: Clean shutdown"
    exit 0
else
    echo "UNEXPECTED: Exit code $EXIT_CODE"
    exit $EXIT_CODE
fi

Before Fix:

Test run 1
Starting race condition test...
Mon Nov 17 08:08:24 2025: Falco version: 0.42.1-2025.11.2 (x86_64)
Mon Nov 17 08:08:24 2025: Falco initialized with configuration files:
Mon Nov 17 08:08:24 2025:    /etc/falco/falco.yaml | schema validation: ok
Mon Nov 17 08:08:24 2025: Loading rules from:
Mon Nov 17 08:08:24 2025:    /etc/falco/falco_rules.local.yaml | schema validation: none
Mon Nov 17 08:08:24 2025: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Mon Nov 17 08:08:24 2025: Starting health webserver with threadiness 96, listening on 0.0.0.0:3018
Mon Nov 17 08:08:24 2025: Setting metrics interval to 1h, equivalent to 3600000 (ms)
Mon Nov 17 08:08:24 2025: Loaded event sources: syscall
Mon Nov 17 08:08:24 2025: Enabled event sources: syscall
Mon Nov 17 08:08:24 2025: Opening 'syscall' source with modern BPF probe.
Mon Nov 17 08:08:24 2025: One ring buffer every '2' CPUs.
Mon Nov 17 08:08:24 2025: [libs]: Trying to open the right engine!
Falco started (PID: 317455), launching 20 parallel scrapers...
Sending SIGTERM to Falco...
Cleaning up scrapers...
Terminated
Terminated
Terminated
Terminated
Terminated
Mon Nov 17 08:08:44 2025: SIGINT received, exiting...
Syscall event drop monitoring:
   - event drop detected: 0 occurrences
   - num times actions taken: 0
Events detected: 0
Rule counts by severity:
Triggered rules by rule name:
Falco exit code: 139
CRASH DETECTED: Segmentation fault (exit code 139)
FAILED on run 1

After Fix:

Starting race condition test...
...
Mon Nov 17 08:06:07 2025: Falco initialized with configuration files:
Mon Nov 17 08:06:07 2025:    /etc/falco/falco.yaml | schema validation: ok
Mon Nov 17 08:06:07 2025: Loading rules from:
Mon Nov 17 08:06:07 2025:    /etc/falco/falco_rules.local.yaml | schema validation: none
Mon Nov 17 08:06:07 2025: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Mon Nov 17 08:06:07 2025: Starting health webserver with threadiness 96, listening on 0.0.0.0:3018
Mon Nov 17 08:06:07 2025: Setting metrics interval to 1h, equivalent to 3600000 (ms)
Mon Nov 17 08:06:07 2025: Loaded event sources: syscall
Mon Nov 17 08:06:07 2025: Enabled event sources: syscall
Mon Nov 17 08:06:07 2025: Opening 'syscall' source with modern BPF probe.
Mon Nov 17 08:06:07 2025: One ring buffer every '2' CPUs.
Mon Nov 17 08:06:07 2025: [libs]: Trying to open the right engine!
Falco started (PID: 153669), launching 20 parallel scrapers...
Sending SIGTERM to Falco...
Cleaning up scrapers...
Terminated
Terminated
Terminated
Terminated
Terminated
Terminated
Mon Nov 17 08:06:26 2025: SIGINT received, exiting...
Syscall event drop monitoring:
   - event drop detected: 0 occurrences
   - num times actions taken: 0
Events detected: 0
Rule counts by severity:
Triggered rules by rule name:
Falco exit code: 0
SUCCESS: Clean shutdown

After the fix in this PR, I have not been able to reproduce the crash, even when running the script aggressively.

Does this PR introduce a user-facing change?:

None

…on shutdown

This fixes a segmentation fault that occurs when the /metrics endpoint is accessed during Falco shutdown. The crash happens because the webserver continues serving /metrics requests after the outputs and inspectors have been destroyed.

Changes:

- Create cleanup_outputs action to handle outputs destruction
- Reorder teardown steps to stop the webserver before destroying outputs
- Move outputs.reset() from process_events to cleanup_outputs()

This eliminates the race condition by ensuring the webserver stops accepting requests before any subsystems are destroyed. The synchronisation behaviour of outputs.reset() (blocking until the output queue is flushed) is preserved.

Signed-off-by: Adnan Ali <[email protected]>
@poiana
Contributor

poiana commented Nov 17, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adduali1310
Once this PR has been reviewed and has the lgtm label, please assign ekoops for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@poiana poiana requested review from Kaizhe and leogr November 17, 2025 09:49
@poiana poiana added the size/M label Nov 17, 2025
@leogr
Member

leogr commented Nov 17, 2025

good catch! thank you

/milestone 0.43.0



Development

Successfully merging this pull request may close these issues.

Segmentation Fault when accessing outputs_queue_num_drops prometheus metrics

3 participants