Conversation

@zyk6271

zyk6271 commented Jul 14, 2025

Description

Related

Testing


Checklist

Before submitting a Pull Request, please ensure the following:

  • 🚨 This PR does not introduce breaking changes.
  • All CI checks (GH Actions) pass.
  • Documentation is updated as needed.
  • Tests are updated or added as necessary.
  • Code is well-commented, especially in complex areas.
  • Git history is clean — commits are squashed to the minimum necessary.

@CLAassistant

CLAassistant commented Jul 14, 2025

CLA assistant check
All committers have signed the CLA.

@zyk6271
Author

zyk6271 commented Aug 25, 2025

May I ask what conflict there is here?

@alisitsyn
Collaborator

alisitsyn commented Aug 26, 2025

May I ask what conflict there is here?

Thank you for contribution and patience.

Although the fix looks simple and clear, it may cause a serious issue with other configurations.

This behavior depends on the configured priority of the Modbus tasks, user application tasks, and other system tasks, as well as the setting of CONFIG_FMB_MASTER_TIMEOUT_MS_RESPOND. It may not work as expected and can cause a serious issue when some group events are incorrectly cleared after processing in the poll task. I think the existing way of clearing the events, using the xEventGroupWaitBits() function with its xClearOnExit parameter set to pdTRUE, should be kept. This is the most common and safest way to wait for event bits and then automatically clear them in a single, atomic operation, which prevents race conditions.
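
For illustration, a minimal sketch of this atomic wait-and-clear pattern with the FreeRTOS event group API (the group handle, bit names, and timeout here are hypothetical, not the actual stack internals):

    #include "freertos/FreeRTOS.h"
    #include "freertos/event_groups.h"

    #define EV_FRAME_RECEIVED  (1 << 0)   /* hypothetical event bits */
    #define EV_ERROR_PROCESS   (1 << 1)

    static EventGroupHandle_t mb_events;  /* created with xEventGroupCreate() */

    static void wait_transaction_events(void)
    {
        /* Wait for either bit; xClearOnExit = pdTRUE clears the returned bits
         * in the same atomic operation, so no stale bit leaks into the next wait. */
        EventBits_t bits = xEventGroupWaitBits(
            mb_events,
            EV_FRAME_RECEIVED | EV_ERROR_PROCESS,
            pdTRUE,                    /* xClearOnExit: clear bits on return */
            pdFALSE,                   /* xWaitForAllBits: any one bit suffices */
            pdMS_TO_TICKS(1000));      /* response timeout */
        if (bits & EV_ERROR_PROCESS) {
            /* handle the error event */
        } else if (bits & EV_FRAME_RECEIVED) {
            /* handle the received frame */
        }   /* else: timed out, no bits were set */
    }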

@zyk6271
Author

zyk6271 commented Aug 27, 2025

You can try it yourself: when working in master mode and communicating with multiple slaves, if you use an RS485 tool to inject some interference data into the line, all subsequent events become misaligned. See the two screenshots below: in one, communication keeps failing after the interference data is mixed in; the other shows the log of normal communication.
[Two screenshots attached: the log of communication failing continuously after interference is injected, and the log of normal communication]

@alisitsyn
Collaborator

@zyk6271,

Please also send your sdkconfig file. Thanks.

@zyk6271
Author

zyk6271 commented Aug 28, 2025

sdkconfig.txt
It's in the attachment.

@alisitsyn
Collaborator

alisitsyn commented Aug 28, 2025

@zyk6271,

According to the log, your sdkconfig is misleading (in the code the response time is 1000 ms, while in the sdkconfig it is 5000 ms). I guess you override it in the master. The situation is a typical Modbus race condition: you set the response time to 1000 ms, but your actual slave processing time is a bit longer, 1012 ms. The response therefore arrives from the slave slightly after the master has already started the next transaction, so the master drops it; and because the master then receives two responses during the ongoing transaction, the next response is discarded as well. In this situation the fix above is just a local solution: with slightly different settings and priorities, if the slave responds a bit later, the result will be the same.

This is a misconfiguration issue, and if you don't change CONFIG_FMB_MASTER_TIMEOUT_MS_RESPOND to a correct value, you will get the same result.
So, the CONFIG_FMB_MASTER_TIMEOUT_MS_RESPOND parameter in the sdkconfig file, or the corresponding field in the configuration structure, must be set to a value greater than the longest possible processing time of the slave. Setting this to 5000 ms in sdkconfig while the code overrides it to 1000 ms is the direct cause of the problem.
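
As a minimal sketch of the two places this timeout comes from (the fields are taken from the structure posted later in this thread; 2000 ms is only an example value and must exceed the slowest slave's processing time):

    /* sdkconfig default, used unless overridden per instance:
     *   CONFIG_FMB_MASTER_TIMEOUT_MS_RESPOND=2000
     */
    mb_communication_info_t comm = {
        .ser_opts.port = MB_PORT_NUM,
        .ser_opts.mode = MB_RTU,
        .ser_opts.baudrate = MB_DEV_SPEED,
        .ser_opts.parity = MB_PARITY_NONE,
        .ser_opts.uid = 0,
        /* Per-instance override: this value wins over the Kconfig default,
         * so it too must be greater than the worst-case slave response time. */
        .ser_opts.response_tout_ms = 2000,
        .ser_opts.data_bits = UART_DATA_8_BITS,
        .ser_opts.stop_bits = UART_STOP_BITS_1
    };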

Please read the important-configuration-aspects. As I said before, the fix you proposed is just a local solution: it may cause the master to discard the first (expired) response but accept the second one, and in some situations the actual response will be skipped instead. Since the master poll code is shared between all masters, the fix is not safe and can cause additional issues later.
This race condition is typical for Modbus due to the specifics of the protocol. I will think about a safer approach in addition to what is already implemented in master (an injected sequence number corresponding to a timestamp).

@zyk6271
Author

zyk6271 commented Aug 28, 2025

The current parameter configuration is indeed 1000; see the code below:

    mb_communication_info_t comm = {
        .ser_opts.port = MB_PORT_NUM,
        .ser_opts.mode = MB_RTU,
        .ser_opts.baudrate = MB_DEV_SPEED,
        .ser_opts.parity = MB_PARITY_NONE,
        .ser_opts.uid = 0,
        .ser_opts.response_tout_ms = 1000,
        .ser_opts.data_bits = UART_DATA_8_BITS,
        .ser_opts.stop_bits = UART_STOP_BITS_1
    };

Do you mean that response_tout_ms should be changed to 2000, and CONFIG_FMB_MASTER_TIMEOUT_MS_RESPOND should also be changed to 2000?

@alisitsyn
Collaborator

alisitsyn commented Aug 28, 2025

The line .ser_opts.response_tout_ms = 1000, overrides the master timeout to 1000 ms, so it is enough to change it in this line only, to ~2300 ms. This should solve the issue.

@zyk6271
Author

zyk6271 commented Aug 28, 2025

OK, I'll try it, thanks.

@zyk6271
Author

zyk6271 commented Aug 28, 2025

I just tried it: with .ser_opts.response_tout_ms = 2300, and with both CONFIG_FMB_MASTER_TIMEOUT_MS_RESPOND=5000 and CONFIG_FMB_MASTER_TIMEOUT_MS_RESPOND=2000, the problem can still be reproduced. Once this line appears, communication stops working: mb_object.master: 0x3fcb171c, drop data received outside of transaction.

@alisitsyn
Collaborator

alisitsyn commented Aug 28, 2025

Could you send me a debug log from when the issue happens and, if possible, the communication traffic between master and slave?
Which Modbus slave are you using? Does it happen with every request, or only with every second request?

@zyk6271
Author

zyk6271 commented Aug 29, 2025

modbus_log.txt
chip_log.txt
Please see the logs in the attachments. After sending FFFFFFFFFFFFFFF over Modbus to simulate interference, the bug can be reproduced; you should be able to reproduce it as well.

@alisitsyn
Collaborator

@zyk6271,
Thank you for the logs. I will try to check this as soon as possible and will let you know.

@zyk6271
Author

zyk6271 commented Sep 2, 2025

Is there any progress on this?

@alisitsyn
Collaborator

alisitsyn commented Sep 3, 2025

Sorry for the late reply. I have many other activities, and the response may take some time.
I checked your partial logs and can see some stress-testing aspects there. You have some unresponsive nodes, and incorrect frames are injected as well. The stack should be able to skip these frames and recover the communication from the next correct frame.

Request (the 收←◆ prefix is the serial tool's RX marker):

    [08:52:54.544]收←◆02 03 00 00 00 15 84 36

Response + incorrect next request:

    [08:52:54.602]收←◆02 03 2A 00 00 00 00 00 01 00 01 00 00 00 00 00 01 00 01 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 FF AA 00 20 00 06 C5 E0

The trailing "00 20 00 06 C5 E0" belongs to the next request, which should be: "03 03 00 20 00 06 c5 e0".
It is not clear why it was cut like this, because the log says the request was sent out correctly at the port level.
Judging by the target log, the device is not able to recover from the error and reports ESP_ERR_INVALID_RESPONSE in spite of a correct response. Is this correct? Does this apply to all subsequent requests on your side?
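
For reference, the expected frame tail can be checked against the standard Modbus RTU CRC-16 (polynomial 0xA001, init 0xFFFF, low byte transmitted first). A small self-contained sketch, not part of the stack, that reproduces the c5 e0 checksum of the expected request above:

    #include <stdint.h>
    #include <stdio.h>

    /* Standard Modbus RTU CRC-16. */
    static uint16_t mb_crc16(const uint8_t *buf, size_t len)
    {
        uint16_t crc = 0xFFFF;
        for (size_t i = 0; i < len; i++) {
            crc ^= buf[i];
            for (int bit = 0; bit < 8; bit++) {
                crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : (crc >> 1);
            }
        }
        return crc;
    }

    int main(void)
    {
        /* "03 03 00 20 00 06" from the expected request in the log above. */
        const uint8_t req[] = { 0x03, 0x03, 0x00, 0x20, 0x00, 0x06 };
        uint16_t crc = mb_crc16(req, sizeof(req));
        printf("%02X %02X\n", crc & 0xFF, crc >> 8);  /* prints: C5 E0 */
        return 0;
    }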

This issue is clear to me. Your slave sends wrong data outside of the transaction at an arbitrary time (FF FF FF FF, which can also be an expired response from a slave). This causes the EV_ERROR_PROCESS event to be generated outside of the transaction, and the corresponding event bit is set. If this wrong event is not cleared, the next transaction completes with a failure immediately, even if no data comes from the slave at all. That failure should then clear all event bits, and the following transaction should be handled normally if proper data comes from the slave. If the slave response timeout is set incorrectly, this sequence with expired responses can repeat forever.

I agree this issue should be solved and approve your solution. The only note is that the current approach may cause some additional issues. I will try to find an appropriate solution that covers all possible cases. Thank you for your contribution; I will return to it once possible.
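
To illustrate the failure mode in code (a conceptual sketch only, with hypothetical handle and bit names, not the actual fix in the stack):

    #include "freertos/FreeRTOS.h"
    #include "freertos/event_groups.h"

    #define EV_FRAME_RECEIVED  (1 << 0)   /* hypothetical event bits */
    #define EV_ERROR_PROCESS   (1 << 1)

    static EventGroupHandle_t mb_events;  /* created with xEventGroupCreate() */

    static void start_transaction(void)
    {
        /* Data received outside of any transaction (noise like "FF FF FF FF",
         * or an expired slave response) may already have set EV_ERROR_PROCESS.
         * If the stale bit is not cleared before the request is sent, the next
         * xEventGroupWaitBits() returns immediately with the old error and the
         * transaction fails before any slave data arrives. */
        xEventGroupClearBits(mb_events, EV_ERROR_PROCESS | EV_FRAME_RECEIVED);
        /* ... send the request, then wait for the transaction events ... */
    }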

github-actions bot changed the title from "fix port event bug" to "fix port event bug (EPROT-6)" on Oct 1, 2025
@alisitsyn
Collaborator

alisitsyn commented Oct 6, 2025

@zyk6271 ,

Sorry for the late response. A workaround for the issue has been implemented and should solve the behavior you observed.
The fix is included in component v2.1.1.
It is recommended to wait for the feature freeze of esp-idf v6.0 for stable results.
Please let me know if you have any comments.
Thanks.

@zyk6271
Author

zyk6271 commented Oct 21, 2025 via email
