
Fluent-bit crashes with SIGSEGV in v3.2.8 due to memory corruption when handling many connection timeouts and using asynchronous DNS and TLS simultaneously (SIGSEGVs in c-ares and OpenSSL mainly) #11197

@giuliano-sider

Description

Bug Report

Describe the bug
In Google Kubernetes Engine 1.34, we started using fluent-bit v3.2.8 with the stackdriver output plugin to send logs to Google Cloud Logging. We are seeing some instances of fluent-bit crash every hour or so with a SIGSEGV in code that deals with upstream connections (e.g. DNS with the c-ares library, or TLS with OpenSSL). The stack traces and core dumps we have seen so far suggest that fluent-bit may not be able to safely handle a high volume of connection timeouts when asynchronous DNS and TLS are used simultaneously.

To Reproduce
It's not known yet what specific conditions induce this issue.

Expected behavior
Fluent-bit should not be crashing with SIGSEGVs.

Your Environment
We observed this in a Google Kubernetes Engine cluster at version 1.34.1-gke.1829001. The fluent-bit logging agent runs as a Kubernetes DaemonSet on each VM, with version v3.2.8. Here are some of the relevant snippets of the fluent-bit config:

        - name:                   stackdriver
          Alias:                  node-logging-k8s-container
          Match_Regex:            kube_.*
          Resource:               k8s_container
          # k8s_container resource type requires the following fields to be set, even though they are not used.
          # MonitoredResource.labels are set in parser.lua filter and parsed in out_stackdriver plugin.
          # See https://github.com/fluent/fluent-bit/blob/v3.1.4/plugins/out_stackdriver/stackdriver_conf.c#L466
          k8s_cluster_name:       {{.ClusterName}}
          k8s_cluster_location:   {{.ClusterLocation}}
          severity_key:           severity
          http_request_key:       httpRequest
          text_payload_key:       textPayload
          trace_sampled_key:      "logging.googleapis.com/trace_sampled"
          export_to_project_id:   {{.ClusterProjectID}}
          workers:                8
          # fluentbit requires the protocol in the url
          cloud_logging_base_url: https://{{.CloudLoggingEndpoint}}
          Retry_Limit:            3
        - name:                   stackdriver
          Alias:                  node-logging-k8s-pod
          Match_Regex:            kube-pod_.*
          Resource:               k8s_pod
          # k8s_pod resource type requires the following fields to be set, even though they are not used.
          # MonitoredResource.labels are set in parser.lua filter and parsed in out_stackdriver plugin.
          # See https://github.com/fluent/fluent-bit/blob/v3.1.4/plugins/out_stackdriver/stackdriver_conf.c#L466
          k8s_cluster_name:       {{.ClusterName}}
          k8s_cluster_location:   {{.ClusterLocation}}
          severity_key:           severity
          http_request_key:       httpRequest
          text_payload_key:       textPayload
          trace_sampled_key:      "logging.googleapis.com/trace_sampled"
          export_to_project_id:   {{.ClusterProjectID}}
          # fluentbit requires the protocol in the url
          cloud_logging_base_url: https://{{.CloudLoggingEndpoint}}
          Retry_Limit:            3
        - name:                   stackdriver
          Alias:                  node-logging-k8s-node
          Match_Regex:            ^(?!kube(_|-pod_)).*$
          Resource:               k8s_node
          # k8s_node resource type requires the following fields to be set, even though they are not used.
          # MonitoredResource.labels are set in parser.lua filter and parsed in out_stackdriver plugin.
          # See https://github.com/fluent/fluent-bit/blob/v3.1.4/plugins/out_stackdriver/stackdriver_conf.c#L466
          k8s_cluster_name:       {{.ClusterName}}
          k8s_cluster_location:   {{.ClusterLocation}}
          severity_key:           severity
          http_request_key:       httpRequest
          text_payload_key:       textPayload
          trace_sampled_key:      "logging.googleapis.com/trace_sampled"
          export_to_project_id:   {{.ClusterProjectID}}
          # fluentbit requires the protocol in the url
          cloud_logging_base_url: https://{{.CloudLoggingEndpoint}}
          Retry_Limit:            3

Additional context
Examples of some stack traces of SIGSEGV crashes we observed:

This crash happens when cancelling a query due to a timeout:

#0  0x00007d0aa211a50f in abort () from ./lib/x86_64-linux-gnu/libc.so.6
#1  0x000059daa99ad2f8 in flb_signal_handler (signal=<optimized out>) at /src/fluent-bit/src/fluent-bit.c:602
#2  <signal handler called>
#3  0x000059daa9e12899 in ares_slist_node_first (list=0x7d0a9160f230) at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/dsa/ares_slist.c:386
#4  0x000059daa9e038c3 in ares_check_cleanup_conns (channel=channel@entry=0x7d0a91663000) at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_close_sockets.c:92
#5  0x000059daa9e0373e in ares_cancel (channel=0x7d0a91663000) at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_cancel.c:81
#6  0x000059daa9a591cf in flb_net_getaddrinfo_timeout_handler (config=<optimized out>, data=0x7d0a916032a0) at /src/fluent-bit/src/flb_network.c:907
#7  flb_net_getaddrinfo_timeout_handler (config=<optimized out>, data=0x7d0a916032a0) at /src/fluent-bit/src/flb_network.c:885
#8  0x000059daa9a63efc in flb_sched_event_handler (config=0x7d0aa103ca00, event=event@entry=0x7d0a9160a460) at /src/fluent-bit/src/flb_scheduler.c:614
#9  0x000059daa9a50063 in output_thread (data=0x7d0a9e6fb300) at /src/fluent-bit/src/flb_output_thread.c:285
#10 0x000059daa9a69bee in step_callback (data=0x7d0a9fc09880) at /src/fluent-bit/src/flb_worker.c:43
#11 0x00007d0aa217d1f5 in ?? () from ./lib/x86_64-linux-gnu/libc.so.6
#12 0x00007d0aa21fcb40 in clone () from ./lib/x86_64-linux-gnu/libc.so.6

Frame by frame:

#3 ares_slist_node_first: tries to access the first node of a skip list inside c-ares.
#4 ares_check_cleanup_conns: c-ares is iterating through active connections to clean them up.
#5 ares_cancel: called to stop a DNS resolution.
#6 flb_net_getaddrinfo_timeout_handler: Fluent Bit's scheduler decided the DNS lookup took too long.
#8 flb_sched_event_handler: the scheduler fired the timeout event.

(gdb) frame 3
#3  0x000059daa9e12899 in ares_slist_node_first (list=0x7d0a9160f230) at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/dsa/ares_slist.c:386
warning: 386	/src/fluent-bit/lib/c-ares-1.34.4/src/lib/dsa/ares_slist.c: No such file or directory
(gdb) p list
$1 = (const ares_slist_t *) 0x7d0a9160f230
(gdb) x/4gx list
0x7d0a9160f230:	0x00007d0a9dac3a00	0x00007d0a9dac3a00
0x7d0a9160f240:	0x0000000000000000	0x0000000000000000
(gdb) x/2gx 0x00007d0a9dac3a00
0x7d0a9dac3a00:	0x00000000000002a1	0x0000000000000030

https://github.com/fluent/fluent-bit/blob/master/lib/c-ares-1.34.4/src/lib/dsa/ares_slist.c#L33

In a skip list, this first field (0x00000000000002a1) should be a pointer to a node, but the value is far too small to be a valid memory address.

Another stack trace:

#0  0x00007ea580d1850f in abort () from ./lib/x86_64-linux-gnu/libc.so.6
#1  0x000056bb3ad2c2f8 in flb_signal_handler (signal=<optimized out>) at /src/fluent-bit/src/fluent-bit.c:602
#2  <signal handler called>
#3  0x00007ea581126ff0 in ?? () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#4  0x00007ea581133377 in ?? () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#5  0x00007ea5811337e3 in ?? () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#6  0x00007ea581133604 in ?? () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#7  0x00007ea581133785 in ?? () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#8  0x00007ea581133604 in ?? () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#9  0x00007ea5811337e3 in ?? () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#10 0x00007ea581133604 in ?? () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#11 0x00007ea5811336f5 in ASN1_item_free () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#12 0x00007ea5812e55a0 in OPENSSL_sk_pop_free () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#13 0x00007ea5814f86c2 in SSL_free () from ./usr/lib/x86_64-linux-gnu/libssl.so.3
#14 0x000056bb3adfeff2 in tls_session_destroy (session=0x7ea564f6c000) at /src/fluent-bit/src/tls/openssl.c:715
#15 tls_session_destroy (session=0x7ea564f6c000) at /src/fluent-bit/src/tls/openssl.c:699
#16 0x000056bb3ae00b96 in flb_tls_session_destroy (session=0x7ea564f1aa80) at /src/fluent-bit/src/tls/flb_tls.c:691
#17 0x000056bb3ade6aa6 in prepare_destroy_conn (u_conn=u_conn@entry=0x7ea564f63c40) at /src/fluent-bit/src/flb_upstream.c:514
#18 0x000056bb3ade7f50 in flb_upstream_conn_timeouts (list=<optimized out>) at /src/fluent-bit/src/flb_upstream.c:1019
#19 0x000056bb3ade2c56 in flb_sched_event_handler (config=0x7ea57fc3c740, event=event@entry=0x7ea564f100a0) at /src/fluent-bit/src/flb_scheduler.c:624
#20 0x000056bb3adcf063 in output_thread (data=0x7ea57c0f4300) at /src/fluent-bit/src/flb_output_thread.c:285
#21 0x000056bb3ade8bee in step_callback (data=0x7ea57b545000) at /src/fluent-bit/src/flb_worker.c:43
#22 0x00007ea580d7b1f5 in ?? () from ./lib/x86_64-linux-gnu/libc.so.6
#23 0x00007ea580dfab40 in clone () from ./lib/x86_64-linux-gnu/libc.so.6

A similar issue to the one we found initially, this time in tls_session_destroy:

static int tls_session_destroy(void *session)

(gdb) x/4gx 0x7ea564f6c000
0x7ea564f6c000:	0x00007ea564f2c000	0x00000000000002e2
0x7ea564f6c010:	0x0000000000000000	0x0000000000000000
(gdb) x/20gx 0x00007ea564f2c000
0x7ea564f2c000:	0x0000000000000304	0x00007ea58155f100
0x7ea564f2c010:	0x0000000000000000	0x0000000000000000
0x7ea564f2c020:	0x0000000000000000	0x0000000000000001
0x7ea564f2c030:	0x00007ea581523e60	0x0000000000000000
0x7ea564f2c040:	0x0000000100000000	0x0000000100000004
0x7ea564f2c050:	0x0000000100000001	0x0000000100000002
0x7ea564f2c060:	0x0000000000000000	0x0000000000000000
0x7ea564f2c070:	0x0000000000000000	0x0000000000000000
0x7ea564f2c080:	0x0000000000000000	0x0000000000000000
0x7ea564f2c090:	0x00007ea564f9b004	0x0000000000000000

As in the previous case, this is a strong signal that the SSL connection object had already been cleared or reset before SSL_free was called on it.

I suspect the cause is the plugin operating under high throughput while connections are hitting network timeouts.

This can produce a race condition:

Thread A decides a connection has timed out (flb_upstream_conn_timeouts) and destroys it, while thread B is potentially still trying to write to, or has just finished using, that same connection.

One more stack trace:

#0  0x00007f7d04b6950f in abort () from ./lib/x86_64-linux-gnu/libc.so.6
#1  0x00005ad5fbc442f8 in flb_signal_handler (signal=<optimized out>) at /src/fluent-bit/src/fluent-bit.c:602
#2  <signal handler called>
#3  ares_array_len (arr=0x747379732d656275) at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/dsa/ares_array.c:64
#4  0x00005ad5fc0adace in ares_dns_record_query_cnt (dnsrec=dnsrec@entry=0x7f7ced6595e8)
    at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/record/ares_dns_record.c:249
#5  0x00005ad5fc0a02f0 in same_questions (arec=<optimized out>, query=0x7f7ced20a280) at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_process.c:1279
#6  process_answer (now=0x7f7cee836530, conn=0x7f7ced20f1e0, alen=<optimized out>, abuf=<optimized out>, channel=0x7f7ced26a000)
    at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_process.c:740
#7  read_answers (now=0x7f7cee836530, conn=<optimized out>) at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_process.c:553
#8  process_read (now=0x7f7cee836530, read_fd=<optimized out>, channel=0x7f7ced26a000) at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_process.c:587
#9  ares_process_fds_nolock (channel=channel@entry=0x7f7ced26a000, events=events@entry=0x7f7cee8365c0, nevents=nevents@entry=1, flags=flags@entry=0)
    at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_process.c:225
#10 0x00005ad5fc0a0ba9 in ares_process_fds_nolock (flags=0, nevents=1, events=0x7f7cee8365c0, channel=0x7f7ced26a000)
    at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_process.c:260
#11 ares_process_fds (channel=0x7f7ced26a000, events=events@entry=0x7f7cee8365c0, nevents=1, flags=flags@entry=0)
    at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_process.c:257
#12 0x00005ad5fc0a0c0c in ares_process_fd (channel=<optimized out>, read_fd=<optimized out>, write_fd=<optimized out>)
    at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_process.c:284
#13 0x00005ad5fbcf0224 in flb_net_getaddrinfo_event_handler (arg=0x7f7ced2032a0) at /src/fluent-bit/src/flb_network.c:874
#14 0x00005ad5fbce706e in output_thread (data=0x7f7cf61b2600) at /src/fluent-bit/src/flb_output_thread.c:318
#15 0x00005ad5fbd00bee in step_callback (data=0x7f7d01ae9ac0) at /src/fluent-bit/src/flb_worker.c:43
#16 0x00007f7d04bcc1f5 in ?? () from ./lib/x86_64-linux-gnu/libc.so.6
#17 0x00007f7d04c4bb40 in clone () from ./lib/x86_64-linux-gnu/libc.so.6

Frame #3: ares_array_len (arr=0x747379732d656275). Since x86-64 is little-endian, those bytes read 75 62 65 2d 73 79 73 74 in memory, which is the ASCII string "ube-syst". This is almost certainly a fragment of "kube-system".

Memory overwrite: Fluent Bit allocated memory for a DNS array structure (within c-ares).

Corruption: another part of Fluent Bit (likely processing Kubernetes metadata or logs tagged with kube-system) wrote data into the wrong location, overwriting the DNS structure with the text "ube-syst".

Crash: when c-ares tried to read the length of its array (ares_array_len), it dereferenced the text "ube-syst" as a memory address. Since that is not a valid memory location, the kernel raised a segmentation fault and Fluent Bit's signal handler then called abort().
