@adduali1310 adduali1310 commented Nov 17, 2025

This fixes a segmentation fault that occurs when the /metrics endpoint is accessed during Falco shutdown. The crash happens because there is a very short window in which the webserver continues serving /metrics requests after the outputs have been destroyed.

The race condition is timing-dependent and occurs in this window:

  1. process_events() returns, and outputs.reset() runs as part of it
  2. The webserver is still running and can receive /metrics requests
  3. The metrics code accesses the now-destroyed outputs → crash

Changes:

  • Create cleanup_outputs action to handle outputs destruction
  • Reorder teardown steps to stop the webserver before destroying outputs
  • Move outputs.reset() from process_events to cleanup_outputs()

This eliminates the race condition by ensuring the webserver stops accepting requests before any subsystems are destroyed. The synchronisation behaviour of outputs.reset() (blocking until the output queue is flushed) is preserved.

What type of PR is this?

/kind bug

Any specific area of the project related to this PR?

/area engine

What this PR does / why we need it:
Fixes a race condition that causes a segmentation fault during Falco shutdown when the Prometheus metrics endpoint is accessed. The crash occurs in a very short window after the outputs have been destroyed but before the webserver is stopped: accessing the destroyed outputs is an invalid memory access and raises SIGSEGV.
The fix reorders the shutdown sequence to stop the webserver before destroying the subsystems.

Which issue(s) this PR fixes:

Fixes #3739

Special notes for your reviewer:
Created a simple bash script to reproduce the issue.

#!/bin/bash

FALCO_BIN="Binary Path"   # replace with the path to the falco binary
METRICS_URL="http://localhost:3018/metrics"

echo "Starting race condition test..."

$FALCO_BIN -c /etc/falco/falco.yaml &
FALCO_PID=$!

sleep 3

echo "Falco started (PID: $FALCO_PID), launching 20 parallel scrapers..."

for i in {1..20}; do
    while true; do
        curl -s "$METRICS_URL" > /dev/null 2>&1
    done &
done

sleep 2

echo "Sending SIGTERM to Falco..."
kill -SIGTERM $FALCO_PID

sleep 3

echo "Cleaning up scrapers..."

pkill -P $$ curl 2>/dev/null
killall curl 2>/dev/null   # the curls are children of the loop subshells, so also kill by name


wait $FALCO_PID 2>/dev/null
EXIT_CODE=$?

echo "Falco exit code: $EXIT_CODE"

if [ $EXIT_CODE -eq 139 ]; then
    echo "CRASH DETECTED: Segmentation fault (exit code 139)"
    exit 1
elif [ $EXIT_CODE -eq 0 ]; then
    echo "SUCCESS: Clean shutdown"
    exit 0
else
    echo "UNEXPECTED: Exit code $EXIT_CODE"
    exit $EXIT_CODE
fi

Before Fix:

Test run 1
Starting race condition test...
Mon Nov 17 08:08:24 2025: Falco version: 0.42.1-2025.11.2 (x86_64)
Mon Nov 17 08:08:24 2025: Falco initialized with configuration files:
Mon Nov 17 08:08:24 2025:    /etc/falco/falco.yaml | schema validation: ok
Mon Nov 17 08:08:24 2025: Loading rules from:
Mon Nov 17 08:08:24 2025:    /etc/falco/falco_rules.local.yaml | schema validation: none
Mon Nov 17 08:08:24 2025: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Mon Nov 17 08:08:24 2025: Starting health webserver with threadiness 96, listening on 0.0.0.0:3018
Mon Nov 17 08:08:24 2025: Setting metrics interval to 1h, equivalent to 3600000 (ms)
Mon Nov 17 08:08:24 2025: Loaded event sources: syscall
Mon Nov 17 08:08:24 2025: Enabled event sources: syscall
Mon Nov 17 08:08:24 2025: Opening 'syscall' source with modern BPF probe.
Mon Nov 17 08:08:24 2025: One ring buffer every '2' CPUs.
Mon Nov 17 08:08:24 2025: [libs]: Trying to open the right engine!
Falco started (PID: 317455), launching 20 parallel scrapers...
Sending SIGTERM to Falco...
Cleaning up scrapers...
Terminated
Terminated
Terminated
Terminated
Terminated
Mon Nov 17 08:08:44 2025: SIGINT received, exiting...
Syscall event drop monitoring:
   - event drop detected: 0 occurrences
   - num times actions taken: 0
Events detected: 0
Rule counts by severity:
Triggered rules by rule name:
Falco exit code: 139
CRASH DETECTED: Segmentation fault (exit code 139)
FAILED on run 1

After Fix:

Starting race condition test...
...
Mon Nov 17 08:06:07 2025: Falco initialized with configuration files:
Mon Nov 17 08:06:07 2025:    /etc/falco/falco.yaml | schema validation: ok
Mon Nov 17 08:06:07 2025: Loading rules from:
Mon Nov 17 08:06:07 2025:    /etc/falco/falco_rules.local.yaml | schema validation: none
Mon Nov 17 08:06:07 2025: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Mon Nov 17 08:06:07 2025: Starting health webserver with threadiness 96, listening on 0.0.0.0:3018
Mon Nov 17 08:06:07 2025: Setting metrics interval to 1h, equivalent to 3600000 (ms)
Mon Nov 17 08:06:07 2025: Loaded event sources: syscall
Mon Nov 17 08:06:07 2025: Enabled event sources: syscall
Mon Nov 17 08:06:07 2025: Opening 'syscall' source with modern BPF probe.
Mon Nov 17 08:06:07 2025: One ring buffer every '2' CPUs.
Mon Nov 17 08:06:07 2025: [libs]: Trying to open the right engine!
Falco started (PID: 153669), launching 20 parallel scrapers...
Sending SIGTERM to Falco...
Cleaning up scrapers...
Terminated
Terminated
Terminated
Terminated
Terminated
Terminated
Mon Nov 17 08:06:26 2025: SIGINT received, exiting...
Syscall event drop monitoring:
   - event drop detected: 0 occurrences
   - num times actions taken: 0
Events detected: 0
Rule counts by severity:
Triggered rules by rule name:
Falco exit code: 0
SUCCESS: Clean shutdown

After the fix in this PR, I have not been able to reproduce the crash, even when running the script aggressively.

Does this PR introduce a user-facing change?:

None

…on shutdown

This fixes a segmentation fault that occurs when the /metrics endpoint is accessed during Falco shutdown. The crash happens because the webserver continues serving /metrics requests after the outputs and inspectors have been destroyed.

Changes:

- Create cleanup_outputs action to handle outputs destruction
- Reorder teardown steps to stop the webserver before destroying outputs
- Move outputs.reset() from process_events to cleanup_outputs()

This eliminates the race condition by ensuring the webserver stops accepting requests before any subsystems are destroyed. The synchronisation behaviour of outputs.reset() (blocking until the output queue is flushed) is preserved.

Signed-off-by: Adnan Ali <[email protected]>
@poiana
Contributor

poiana commented Nov 17, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adduali1310
Once this PR has been reviewed and has the lgtm label, please assign ekoops for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@poiana poiana requested review from Kaizhe and leogr November 17, 2025 09:49
@poiana poiana added the size/M label Nov 17, 2025
@leogr
Member

leogr commented Nov 17, 2025

good catch! thank you

/milestone 0.43.0



Development

Successfully merging this pull request may close these issues.

Segmentation Fault when accessing outputs_queue_num_drops prometheus metrics

3 participants