Fix jq healthcheck to account for nulls #370
eozturk1
left a comment
Can we also test a failing case?
Failing case
Unsubscribe
On Tue, 15 Apr 2025 at 21:52, Peixian Wang wrote:
Merged #370 into main.
# Select the "check_desc" field (Description of the check result)
# and take all results that do NOT equal "Layer4 check passed" from HAProxy
-RESULT=$(tail -n +1 /tmp/stats.txt | jq -R 'split(",")' | jq -c '. | select(.[1] | contains("whatsapp_net"))' | jq --raw-output '.[65]| select(. | test("Layer4 check passed") | not)')
+RESULT=$(tail -n +1 /tmp/stats.txt | jq -R 'split(",")' | jq -c 'select(.[1] != null) | select(.[1] | contains("whatsapp_net"))' | jq --raw-output '.[65]| select(. | test("Layer4 check passed") | not)')
# Then convert the ugly CSV to slightly less ugly JSON
# Filter out the lines for *.whatsapp_net backend status
# Select the "check_desc" field (Description of the check result)
# and take all results that do NOT equal "Layer4 check passed" from HAProxy
On Sat, 15 Nov 2025 at 19:20, dylanreyes0457-ux <***@***.***> wrote:
… ***@***.**** commented on this pull request.
------------------------------
In proxy/src/healthcheck.sh
<#370 (comment)>:
> @@ -11,7 +11,7 @@ curl -s -w 2 "http://127.0.0.1:8199/;csv" > /tmp/stats.txt || exit 1
# Filter out the lines for *.whatsapp_net backend status
# Select the "check_desc" field (Description of the check result)
# and take all results that do NOT equal "Layer4 check passed" from HAProxy
-RESULT=$(tail -n +1 /tmp/stats.txt | jq -R 'split(",")' | jq -c '. | select(.[1] | contains("whatsapp_net"))' | jq --raw-output '.[65]| select(. | test("Layer4 check passed") | not)')
+RESULT=$(tail -n +1 /tmp/stats.txt | jq -R 'split(",")' | jq -c 'select(.[1] != null) | select(.[1] | contains("whatsapp_net"))' | jq --raw-output '.[65]| select(. | test("Layer4 check passed") | not)')
proxy-main.zip
<https://github.com/user-attachments/files/23564853/proxy-main.zip>
This healthcheck has been broken for a while now, although I think we usually just don't notice it.
How this started
I noticed that we were seeing pretty frequent healthcheck errors when this got deployed.
However, the pods themselves were clearly healthy, which was very odd: manual inspection of pod metrics looked fine, and WA was clearly not down.
Debugging
On a pod:
null here implies an odd value, and indeed, by selecting the raw fields, we can see that {"field1":null,"isNull":true} is our problem.
By skipping over nulls:
We can see that this health check now succeeds:
Hasn't this been around forever?
Yep, which meant that pods would sometimes get restarted in Kubernetes for seemingly random reasons. I suspect the root cause is that when HAProxy is under heavy load, the stats output changes slightly and includes this null value. We should switch to Prometheus-based metrics for health checks when they're available.
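For anyone who wants to reproduce the failure mode locally, here is a minimal sketch of the middle jq stage before and after this change. The three-line sample is made up for illustration and is not real HAProxy output (real rows have many more fields); the helper names sample, old_filter and new_filter are hypothetical. It assumes that a row without a comma is one way to end up with a null in .[1], which may or may not be what happens in production.

```shell
#!/bin/sh
# Hypothetical sample: the middle row has no comma, so after split(",")
# it becomes a one-element array and .[1] evaluates to null.
sample() {
  printf '%s\n' 'be1,whatsapp_net,UP' 'short' 'be2,other_net,UP'
}

# Filter as it was before this PR: contains() is applied to .[1] directly.
old_filter() {
  jq -R 'split(",")' | jq -c 'select(.[1] | contains("whatsapp_net"))'
}

# Filter from this PR: rows whose .[1] is null are dropped first.
new_filter() {
  jq -R 'split(",")' | jq -c 'select(.[1] != null) | select(.[1] | contains("whatsapp_net"))'
}

# Only run the demo when jq is installed.
if command -v jq >/dev/null 2>&1; then
  echo "new filter on the full sample:"
  sample | new_filter

  echo "old filter on the short row alone:"
  printf 'short\n' | old_filter || echo "jq exited with status $?"
fi
```

With the null guard in place, only the whatsapp_net row reaches the final stage, while the old filter reports an error on the short row because contains() cannot be applied to null.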