Bug Report
Describe the bug
In Google Kubernetes Engine 1.34, we started using fluent-bit v3.2.8 with the stackdriver output plugin to send logs to Google Cloud Logging. We are seeing some fluent-bit instances crash roughly every hour with a SIGSEGV in code that deals with upstream connections (e.g. DNS via the c-ares library or TLS via OpenSSL). The stack traces and core dumps we have collected so far suggest that fluent-bit may not safely handle connection timeouts under high throughput when asynchronous DNS and TLS are used together.
To Reproduce
It's not known yet what specific conditions induce this issue.
Expected behavior
Fluent-bit should not be crashing with SIGSEGVs.
Your Environment
We observed this in a Google Kubernetes Engine cluster at version 1.34.1-gke.1829001. The fluent-bit logging agent runs as a Kubernetes DaemonSet on each VM, with version v3.2.8. Here are some of the relevant snippets of the fluent-bit config:
- name: stackdriver
  Alias: node-logging-k8s-container
  Match_Regex: kube_.*
  Resource: k8s_container
  # k8s_container resource type requires the following fields to be set, even though they are not used.
  # MonitoredResource.labels are set in parser.lua filter and parsed in out_stackdriver plugin.
  # See https://github.com/fluent/fluent-bit/blob/v3.1.4/plugins/out_stackdriver/stackdriver_conf.c#L466
  k8s_cluster_name: {{.ClusterName}}
  k8s_cluster_location: {{.ClusterLocation}}
  severity_key: severity
  http_request_key: httpRequest
  text_payload_key: textPayload
  trace_sampled_key: "logging.googleapis.com/trace_sampled"
  export_to_project_id: {{.ClusterProjectID}}
  workers: 8
  # fluentbit requires the protocol in the url
  cloud_logging_base_url: https://{{.CloudLoggingEndpoint}}
  Retry_Limit: 3
- name: stackdriver
  Alias: node-logging-k8s-pod
  Match_Regex: kube-pod_.*
  Resource: k8s_pod
  # k8s_pod resource type requires the following fields to be set, even though they are not used.
  # MonitoredResource.labels are set in parser.lua filter and parsed in out_stackdriver plugin.
  # See https://github.com/fluent/fluent-bit/blob/v3.1.4/plugins/out_stackdriver/stackdriver_conf.c#L466
  k8s_cluster_name: {{.ClusterName}}
  k8s_cluster_location: {{.ClusterLocation}}
  severity_key: severity
  http_request_key: httpRequest
  text_payload_key: textPayload
  trace_sampled_key: "logging.googleapis.com/trace_sampled"
  export_to_project_id: {{.ClusterProjectID}}
  # fluentbit requires the protocol in the url
  cloud_logging_base_url: https://{{.CloudLoggingEndpoint}}
  Retry_Limit: 3
- name: stackdriver
  Alias: node-logging-k8s-node
  Match_Regex: ^(?!kube(_|-pod_)).*$
  Resource: k8s_node
  # k8s_node resource type requires the following fields to be set, even though they are not used.
  # MonitoredResource.labels are set in parser.lua filter and parsed in out_stackdriver plugin.
  # See https://github.com/fluent/fluent-bit/blob/v3.1.4/plugins/out_stackdriver/stackdriver_conf.c#L466
  k8s_cluster_name: {{.ClusterName}}
  k8s_cluster_location: {{.ClusterLocation}}
  severity_key: severity
  http_request_key: httpRequest
  text_payload_key: textPayload
  trace_sampled_key: "logging.googleapis.com/trace_sampled"
  export_to_project_id: {{.ClusterProjectID}}
  # fluentbit requires the protocol in the url
  cloud_logging_base_url: https://{{.CloudLoggingEndpoint}}
  Retry_Limit: 3
Additional context
Here are some example stack traces from the SIGSEGV crashes we observed.
This crash happens when cancelling a query due to a timeout:
#0 0x00007d0aa211a50f in abort () from ./lib/x86_64-linux-gnu/libc.so.6
#1 0x000059daa99ad2f8 in flb_signal_handler (signal=<optimized out>) at /src/fluent-bit/src/fluent-bit.c:602
#2 <signal handler called>
#3 0x000059daa9e12899 in ares_slist_node_first (list=0x7d0a9160f230) at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/dsa/ares_slist.c:386
#4 0x000059daa9e038c3 in ares_check_cleanup_conns (channel=channel@entry=0x7d0a91663000) at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_close_sockets.c:92
#5 0x000059daa9e0373e in ares_cancel (channel=0x7d0a91663000) at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_cancel.c:81
#6 0x000059daa9a591cf in flb_net_getaddrinfo_timeout_handler (config=<optimized out>, data=0x7d0a916032a0) at /src/fluent-bit/src/flb_network.c:907
#7 flb_net_getaddrinfo_timeout_handler (config=<optimized out>, data=0x7d0a916032a0) at /src/fluent-bit/src/flb_network.c:885
#8 0x000059daa9a63efc in flb_sched_event_handler (config=0x7d0aa103ca00, event=event@entry=0x7d0a9160a460) at /src/fluent-bit/src/flb_scheduler.c:614
#9 0x000059daa9a50063 in output_thread (data=0x7d0a9e6fb300) at /src/fluent-bit/src/flb_output_thread.c:285
#10 0x000059daa9a69bee in step_callback (data=0x7d0a9fc09880) at /src/fluent-bit/src/flb_worker.c:43
#11 0x00007d0aa217d1f5 in ?? () from ./lib/x86_64-linux-gnu/libc.so.6
#12 0x00007d0aa21fcb40 in clone () from ./lib/x86_64-linux-gnu/libc.so.6
- #3 ares_slist_node_first: tries to access the first node of a skip list inside c-ares.
- #4 ares_check_cleanup_conns: c-ares is iterating through active connections to clean them up.
- #5 ares_cancel: this function was called to stop a DNS resolution.
- #6 flb_net_getaddrinfo_timeout_handler: Fluent Bit's scheduler decided the DNS lookup took too long.
- #8 flb_sched_event_handler: the scheduler fired the timeout event.
(gdb) frame 3
#3  0x000059daa9e12899 in ares_slist_node_first (list=0x7d0a9160f230) at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/dsa/ares_slist.c:386
warning: 386    /src/fluent-bit/lib/c-ares-1.34.4/src/lib/dsa/ares_slist.c: No such file or directory
(gdb) p list
$1 = (const ares_slist_t *) 0x7d0a9160f230
(gdb) x/4gx list
0x7d0a9160f230: 0x00007d0a9dac3a00      0x00007d0a9dac3a00
0x7d0a9160f240: 0x0000000000000000      0x0000000000000000
(gdb) x/2gx 0x00007d0a9dac3a00
0x7d0a9dac3a00: 0x00000000000002a1      0x0000000000000030
https://github.com/fluent/fluent-bit/blob/master/lib/c-ares-1.34.4/src/lib/dsa/ares_slist.c#L33
In the skip-list layout (see the link above), this value (0x00000000000002a1) should be a pointer to a node, but it is far too small to be a valid memory address.
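To make this failure mode concrete, here is a minimal, self-contained model of the suspected pattern. These are not the real c-ares structures, and the names are placeholders: the point is only that a skip-list head read after its memory has been released and reused by an unrelated allocation will hold a small counter-like value (compare the 0x2a1 above) where a node pointer is expected.

```c
/* Simplified model of the suspected corruption; not actual c-ares code. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct fake_slist {
    void *first_node;   /* what a function like ares_slist_node_first() would follow */
};

int main(void)
{
    /* Memory that used to hold a live skip-list head... */
    unsigned char *block = malloc(16);
    if (block == NULL) {
        return 1;
    }

    /* ...is reused by another allocation that stores a small counter in the
     * same bytes (compare the 0x2a1 value seen in the core dump). */
    uint64_t reused_value = 0x2a1;
    memcpy(block, &reused_value, sizeof(reused_value));

    /* A stale reader still interprets the block as a skip-list head. */
    struct fake_slist stale;
    memcpy(&stale, block, sizeof(stale));
    printf("stale first-node pointer: %p\n", stale.first_node);

    /* An address this low is not a plausible heap pointer; dereferencing it
     * is exactly what would produce the SIGSEGV seen in frame #3. */
    if ((uintptr_t)stale.first_node < 0x1000) {
        printf("not a valid address -> dereference would fault\n");
    }

    free(block);
    return 0;
}
```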
Another stack trace:
#0 0x00007ea580d1850f in abort () from ./lib/x86_64-linux-gnu/libc.so.6
#1 0x000056bb3ad2c2f8 in flb_signal_handler (signal=<optimized out>) at /src/fluent-bit/src/fluent-bit.c:602
#2 <signal handler called>
#3 0x00007ea581126ff0 in ?? () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#4 0x00007ea581133377 in ?? () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#5 0x00007ea5811337e3 in ?? () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#6 0x00007ea581133604 in ?? () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#7 0x00007ea581133785 in ?? () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#8 0x00007ea581133604 in ?? () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#9 0x00007ea5811337e3 in ?? () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#10 0x00007ea581133604 in ?? () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#11 0x00007ea5811336f5 in ASN1_item_free () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#12 0x00007ea5812e55a0 in OPENSSL_sk_pop_free () from ./usr/lib/x86_64-linux-gnu/libcrypto.so.3
#13 0x00007ea5814f86c2 in SSL_free () from ./usr/lib/x86_64-linux-gnu/libssl.so.3
#14 0x000056bb3adfeff2 in tls_session_destroy (session=0x7ea564f6c000) at /src/fluent-bit/src/tls/openssl.c:715
#15 tls_session_destroy (session=0x7ea564f6c000) at /src/fluent-bit/src/tls/openssl.c:699
#16 0x000056bb3ae00b96 in flb_tls_session_destroy (session=0x7ea564f1aa80) at /src/fluent-bit/src/tls/flb_tls.c:691
#17 0x000056bb3ade6aa6 in prepare_destroy_conn (u_conn=u_conn@entry=0x7ea564f63c40) at /src/fluent-bit/src/flb_upstream.c:514
#18 0x000056bb3ade7f50 in flb_upstream_conn_timeouts (list=<optimized out>) at /src/fluent-bit/src/flb_upstream.c:1019
#19 0x000056bb3ade2c56 in flb_sched_event_handler (config=0x7ea57fc3c740, event=event@entry=0x7ea564f100a0) at /src/fluent-bit/src/flb_scheduler.c:624
#20 0x000056bb3adcf063 in output_thread (data=0x7ea57c0f4300) at /src/fluent-bit/src/flb_output_thread.c:285
#21 0x000056bb3ade8bee in step_callback (data=0x7ea57b545000) at /src/fluent-bit/src/flb_worker.c:43
#22 0x00007ea580d7b1f5 in ?? () from ./lib/x86_64-linux-gnu/libc.so.6
#23 0x00007ea580dfab40 in clone () from ./lib/x86_64-linux-gnu/libc.so.6
This is a similar issue to the one we found initially, in tls_session_destroy:
static int tls_session_destroy(void *session)   (line 1088 at commit 4f8c50b)
(gdb) x/4gx 0x7ea564f6c000
0x7ea564f6c000: 0x00007ea564f2c000      0x00000000000002e2
0x7ea564f6c010: 0x0000000000000000      0x0000000000000000
(gdb) x/20gx 0x00007ea564f2c000
0x7ea564f2c000: 0x0000000000000304      0x00007ea58155f100
0x7ea564f2c010: 0x0000000000000000      0x0000000000000000
0x7ea564f2c020: 0x0000000000000000      0x0000000000000001
0x7ea564f2c030: 0x00007ea581523e60      0x0000000000000000
0x7ea564f2c040: 0x0000000100000000      0x0000000100000004
0x7ea564f2c050: 0x0000000100000001      0x0000000100000002
0x7ea564f2c060: 0x0000000000000000      0x0000000000000000
0x7ea564f2c070: 0x0000000000000000      0x0000000000000000
0x7ea564f2c080: 0x0000000000000000      0x0000000000000000
0x7ea564f2c090: 0x00007ea564f9b004      0x0000000000000000
Similar to the previous crash, this is a strong signal that the SSL connection object has likely already been cleared or reset.
I think the trigger could be the plugin handling high throughput and/or network timeouts, which exposes a race condition:
Thread A decides a connection has timed out (flb_upstream_conn_timeouts) and tears it down, while Thread B may still be writing to, or has just finished using, that same connection.
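For illustration only, here is a minimal sketch of the kind of coordination that would avoid such a race. Everything below is hypothetical; the struct and function names are not fluent-bit APIs. The idea is simply that the timeout path must not free the TLS session while a worker still owns the connection, e.g. by deferring the destroy:

```c
/* Hypothetical sketch, not fluent-bit code: defer connection teardown
 * while a worker is still using the connection. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct conn {
    pthread_mutex_t lock;
    bool in_use;            /* a worker is currently reading/writing */
    bool destroy_pending;   /* the timeout path requested teardown */
};

static void conn_free(struct conn *c)
{
    /* Stands in for the real teardown (e.g. freeing the TLS session). */
    printf("destroying connection %p\n", (void *)c);
    pthread_mutex_destroy(&c->lock);
    free(c);
}

/* Timeout path ("Thread A"): only destroy if nobody is using the conn. */
static void conn_timeout(struct conn *c)
{
    pthread_mutex_lock(&c->lock);
    if (c->in_use) {
        c->destroy_pending = true;   /* defer; the worker finishes first */
        pthread_mutex_unlock(&c->lock);
        return;
    }
    pthread_mutex_unlock(&c->lock);
    conn_free(c);
}

/* Worker path ("Thread B"): claim the conn before touching the TLS session. */
static void conn_use(struct conn *c)
{
    pthread_mutex_lock(&c->lock);
    c->in_use = true;
    pthread_mutex_unlock(&c->lock);

    /* ...SSL_write()/SSL_read() would happen here... */

    pthread_mutex_lock(&c->lock);
    c->in_use = false;
    bool destroy = c->destroy_pending;
    pthread_mutex_unlock(&c->lock);
    if (destroy) {
        conn_free(c);                /* finish the deferred teardown */
    }
}

int main(void)
{
    struct conn *c = calloc(1, sizeof(*c));
    if (c == NULL) {
        return 1;
    }
    pthread_mutex_init(&c->lock, NULL);
    conn_use(c);       /* worker finishes cleanly; no destroy pending yet */
    conn_timeout(c);   /* reaper now frees the idle connection safely */
    return 0;
}
```

If the hypothesis is right, the real ownership state would live on the upstream connection, so a fix would presumably belong somewhere around flb_upstream_conn_timeouts() / prepare_destroy_conn() from the backtrace rather than in this sketch.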
One more stack trace:
#0 0x00007f7d04b6950f in abort () from ./lib/x86_64-linux-gnu/libc.so.6
#1 0x00005ad5fbc442f8 in flb_signal_handler (signal=<optimized out>) at /src/fluent-bit/src/fluent-bit.c:602
#2 <signal handler called>
#3 ares_array_len (arr=0x747379732d656275) at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/dsa/ares_array.c:64
#4 0x00005ad5fc0adace in ares_dns_record_query_cnt (dnsrec=dnsrec@entry=0x7f7ced6595e8)
at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/record/ares_dns_record.c:249
#5 0x00005ad5fc0a02f0 in same_questions (arec=<optimized out>, query=0x7f7ced20a280) at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_process.c:1279
#6 process_answer (now=0x7f7cee836530, conn=0x7f7ced20f1e0, alen=<optimized out>, abuf=<optimized out>, channel=0x7f7ced26a000)
at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_process.c:740
#7 read_answers (now=0x7f7cee836530, conn=<optimized out>) at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_process.c:553
#8 process_read (now=0x7f7cee836530, read_fd=<optimized out>, channel=0x7f7ced26a000) at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_process.c:587
#9 ares_process_fds_nolock (channel=channel@entry=0x7f7ced26a000, events=events@entry=0x7f7cee8365c0, nevents=nevents@entry=1, flags=flags@entry=0)
at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_process.c:225
#10 0x00005ad5fc0a0ba9 in ares_process_fds_nolock (flags=0, nevents=1, events=0x7f7cee8365c0, channel=0x7f7ced26a000)
at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_process.c:260
#11 ares_process_fds (channel=0x7f7ced26a000, events=events@entry=0x7f7cee8365c0, nevents=1, flags=flags@entry=0)
at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_process.c:257
#12 0x00005ad5fc0a0c0c in ares_process_fd (channel=<optimized out>, read_fd=<optimized out>, write_fd=<optimized out>)
at /src/fluent-bit/lib/c-ares-1.34.4/src/lib/ares_process.c:284
#13 0x00005ad5fbcf0224 in flb_net_getaddrinfo_event_handler (arg=0x7f7ced2032a0) at /src/fluent-bit/src/flb_network.c:874
#14 0x00005ad5fbce706e in output_thread (data=0x7f7cf61b2600) at /src/fluent-bit/src/flb_output_thread.c:318
#15 0x00005ad5fbd00bee in step_callback (data=0x7f7d01ae9ac0) at /src/fluent-bit/src/flb_worker.c:43
#16 0x00007f7d04bcc1f5 in ?? () from ./lib/x86_64-linux-gnu/libc.so.6
#17 0x00007f7d04c4bb40 in clone () from ./lib/x86_64-linux-gnu/libc.so.6
Frame #3: ares_array_len (arr=0x747379732d656275). On x86-64 (little-endian) those bytes are 75 62 65 2d 73 79 73 74, i.e. "ube-syst", which is almost certainly a fragment of "kube-system".
- Memory overwrite: Fluent Bit allocated memory for a DNS array structure (inside c-ares).
- Corruption: another part of Fluent Bit (likely processing Kubernetes metadata or logs tagged with kube-system) wrote data into the wrong location, overwriting the DNS structure with the text "ube-syst".
- Crash: when c-ares tried to read the length of its array (ares_array_len), it dereferenced the text "ube-syst" as a memory address. Since that is not a valid memory location, the CPU raised a segmentation fault (the abort in frame #0 is fluent-bit's SIGSEGV handler).
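As a quick sanity check on the frame #3 interpretation above, the bogus arr pointer decodes to ASCII exactly as described; a tiny standalone program to confirm:

```c
/* Decode the bogus pointer from frame #3 as little-endian ASCII bytes. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint64_t bogus = 0x747379732d656275ULL;  /* arr=... from the backtrace */
    char text[sizeof(bogus) + 1] = {0};
    memcpy(text, &bogus, sizeof(bogus));     /* x86-64 stores the low byte first */
    printf("%s\n", text);                    /* prints: ube-syst */
    return 0;
}
```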