esp_websocket_client_stop() get stuck (IDFGH-16741) #932

@huming2207

Description

Answers checklist.

  • I have read the documentation for esp-protocols components and the issue is not addressed there.
  • I have updated my esp-protocols branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

What component are you using? If you choose Other, provide details in More Information.

esp_websocket_client

component version

git commit c078c36 or current master

IDF version.

v6.0-dev-3218-g811e27118d

More Information.

Hi @david-cermak @gabsuren @euripedesrocha

With ESP-IDF v6.0 and recent versions of esp-websocket-client, calling esp_websocket_client_stop() can get stuck forever under some conditions when there is no Internet connectivity. Here's our test environment:

  • Genuine ESP32-S3-WROOM-1-N16R8 with PSRAM enabled
  • ESP-IDF v6.0
  • Recent esp-websocket-client (e.g. c078c36)
  • Quectel EG800K modem with a SIM that has no data credit left: it can still dial out and obtain an IP address, but has no Internet access. There is also a W5500 NIC with no Ethernet cable plugged in
  • An MQTT client is also running; it talks to an MQTT server on the Internet over the WebSocket protocol
  • esp_websocket_client_config_t->disable_auto_reconnect is set to true so that we can handle the reconnect logic manually
  • esp_websocket_client_config_t->network_timeout_ms is set to 10000 ms
  • No external proxy is used
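For reference, the client configuration described above can be sketched roughly as follows. This is a minimal fragment, not our actual firmware: the helper name create_ws_client and the uri parameter are hypothetical, and only the two settings listed above are taken from our setup.

```c
#include <stdbool.h>
#include "esp_websocket_client.h"

/* Hypothetical helper (not from our firmware); the URI is supplied by the caller. */
esp_websocket_client_handle_t create_ws_client(const char *uri)
{
    const esp_websocket_client_config_t cfg = {
        .uri = uri,
        .disable_auto_reconnect = true, /* reconnect is handled by the application */
        .network_timeout_ms = 10000,    /* 10 s network timeout */
    };

    /* Returns NULL if the client could not be created. */
    return esp_websocket_client_init(&cfg);
}
```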

If I let the device run for a while and attempt to connect to a WS server and an MQTT server over the modem, the connection always fails because there is no Internet access. My firmware then calls esp_websocket_client_stop() followed by esp_websocket_client_destroy(), and tries to create a new WS client. After more than 3-5 such attempts, it gets stuck forever.

We dug into this a bit further. In esp_websocket_client_task(), client->state is 1 (WEBSOCKET_STATE_INIT) and the websocket task is stuck in the preceding esp_transport_connect() call, even though network_timeout_ms is 10000 ms, so the call should have timed out and returned much earlier. I therefore suspect this may be a tcp_transport issue rather than a WS client issue.
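To illustrate the behaviour I would expect from a connect call with a timeout, here is a minimal sketch using plain BSD sockets. It is not taken from tcp_transport and I am not claiming this is how esp_transport_connect() is implemented; connect_with_timeout is a hypothetical helper name used only for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper: connect with an upper bound on the wait time.
 * Returns 0 on success, -1 on failure (errno is ETIMEDOUT on timeout).
 * Note: the socket is left in non-blocking mode. */
int connect_with_timeout(int fd, const struct sockaddr *addr,
                         socklen_t len, int timeout_ms)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        return -1;
    }

    if (connect(fd, addr, len) == 0) {
        return 0;                       /* connected immediately */
    }
    if (errno != EINPROGRESS) {
        return -1;                      /* hard failure, e.g. ECONNREFUSED */
    }

    /* Wait until the socket becomes writable or the timeout expires. */
    fd_set wfds;
    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    struct timeval tv = {
        .tv_sec = timeout_ms / 1000,
        .tv_usec = (timeout_ms % 1000) * 1000,
    };

    int ready = select(fd + 1, NULL, &wfds, NULL, &tv);
    if (ready == 0) {
        errno = ETIMEDOUT;              /* the bound was hit: give up */
        return -1;
    }
    if (ready < 0) {
        return -1;
    }

    /* Writable does not mean connected: check the pending socket error. */
    int so_error = 0;
    socklen_t optlen = sizeof(so_error);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &so_error, &optlen) < 0) {
        return -1;
    }
    if (so_error != 0) {
        errno = so_error;
        return -1;
    }
    return 0;
}
```

With a 10000 ms timeout, a call like this returns after at most about 10 seconds even when the peer never answers, which is what I expected from the connect step here.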

This sort of lockup also occasionally happens with the MQTT client: it may get stuck in esp_mqtt_client_stop() forever as well, although for us that is less likely to happen.

Here are some of our logs:

  • I added the log line ESP_LOGW(TAG, "Client state: %u; state_bit: 0x%x", client->state, xEventGroupGetBits(client->status_bits)); in esp_websocket_client_stop(), just before the xEventGroupGetBits(client->status_bits) & STOPPED_BIT check
  • This is what it looks like when it does not get stuck: the WS client state is 0 (because it never connected successfully)
Image
  • This is what it looks like when it gets stuck: the WS client state is 1
Image

Regards,
Jackson
