Skip to content

Fixed a bug in the channel that was not detecting out of seq messages #663

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

vidhyav
Copy link
Contributor

@vidhyav vidhyav commented Jul 27, 2025

Summary:
This was because we need to check for out of seq messages
only if it is not the first message of the connection (even if it is
for subsequent sessions).

Differential Revision: D79044526

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 27, 2025
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79044526

vidhyav added a commit to vidhyav/monarch that referenced this pull request Jul 28, 2025
…pytorch-labs#663)

Summary:

This was because we need to check for out of seq messages
only if it is not the first message of the connection (even if it is
for subsequent sessions).

Differential Revision: D79044526
@vidhyav vidhyav force-pushed the export-D79044526 branch from 9717486 to bbc9258 Compare July 28, 2025 00:25
vidhyav added a commit to vidhyav/monarch that referenced this pull request Jul 28, 2025
…pytorch-labs#663)

Summary:

This was because we need to check for out of seq messages
only if it is not the first message of the connection (even if it is
for subsequent sessions).

Differential Revision: D79044526
@vidhyav vidhyav force-pushed the export-D79044526 branch from bbc9258 to f0b8b45 Compare July 28, 2025 00:25
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79044526

vidhyav added a commit to vidhyav/monarch that referenced this pull request Jul 28, 2025
…pytorch-labs#663)

Summary:
Pull Request resolved: pytorch-labs#663

This was because we need to check for out of seq messages
only if it is not the first message of the connection (even if it is
for subsequent sessions).

Differential Revision: D79044526
@vidhyav vidhyav force-pushed the export-D79044526 branch from f0b8b45 to ab59d0e Compare July 28, 2025 00:29
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79044526

vidhyav added a commit to vidhyav/monarch that referenced this pull request Jul 28, 2025
…pytorch-labs#663)

Summary:
Pull Request resolved: pytorch-labs#663

1) This was because we need to check for out of seq messages
only if it is not the first message of the connection (even if it is
for subsequent sessions).

2) There was another bug in which we don't exactly mutate session state into the session manager correctly. We clone the session state and don't mutate. This particular case is NOT taken care of the tcp_reconnect test: in this test, we shut down the entire server and reenter the listen loop thereby resetting the entire session manager.

The only way we can test (2) is if we can disconnect the underlying connection and reconnect altogether while the listen stays alive.

Differential Revision: D79044526
@vidhyav vidhyav force-pushed the export-D79044526 branch from ab59d0e to 6682e28 Compare July 28, 2025 00:34
vidhyav added a commit to vidhyav/monarch that referenced this pull request Jul 28, 2025
…pytorch-labs#663)

Summary:

1) This was because we need to check for out of seq messages
only if it is not the first message of the connection (even if it is
for subsequent sessions).

2) There was another bug in which we don't exactly mutate session state into the session manager correctly. We clone the session state and don't mutate. This particular case is NOT taken care of the tcp_reconnect test: in this test, we shut down the entire server and reenter the listen loop thereby resetting the entire session manager. 

The only way we can test (2) is if we can disconnect the underlying connection and reconnect altogether while the listen stays alive.

Differential Revision: D79044526
@vidhyav vidhyav force-pushed the export-D79044526 branch from 6682e28 to 2a97f6d Compare July 28, 2025 14:46
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79044526

@vidhyav vidhyav force-pushed the export-D79044526 branch from 2a97f6d to fd2696c Compare July 29, 2025 00:32
vidhyav added a commit to vidhyav/monarch that referenced this pull request Jul 29, 2025
…pytorch-labs#663)

Summary:

1) This was because we need to check for out of seq messages
only if it is not the first message of the connection (even if it is
for subsequent sessions).

2) There was another bug in which we don't exactly mutate session state into the session manager correctly. We clone the session state and don't mutate. This particular case is NOT taken care of the tcp_reconnect test: in this test, we shut down the entire server and reenter the listen loop thereby resetting the entire session manager. 

The only way we can test (2) is if we can disconnect the underlying connection and reconnect altogether while the listen stays alive.

Differential Revision: D79044526
@vidhyav vidhyav force-pushed the export-D79044526 branch from fd2696c to d6d8df2 Compare July 29, 2025 00:32
vidhyav added a commit to vidhyav/monarch that referenced this pull request Jul 29, 2025
…pytorch-labs#663)

Summary:

1) This was because we need to check for out of seq messages
only if it is not the first message of the connection (even if it is
for subsequent sessions).

2) There was another bug in which we don't exactly mutate session state into the session manager correctly. We clone the session state and don't mutate. This particular case is NOT taken care of the tcp_reconnect test: in this test, we shut down the entire server and reenter the listen loop thereby resetting the entire session manager. 

The only way we can test (2) is if we can disconnect the underlying connection and reconnect altogether while the listen stays alive.

Differential Revision: D79044526
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79044526

vidhyav added a commit to vidhyav/monarch that referenced this pull request Jul 29, 2025
…pytorch-labs#663)

Summary:
Pull Request resolved: pytorch-labs#663

1) This was because we need to check for out of seq messages
only if it is not the first message of the connection (even if it is
for subsequent sessions).

2) There was another bug in which we don't exactly mutate session state into the session manager correctly. We clone the session state and don't mutate. This particular case is NOT taken care of the tcp_reconnect test: in this test, we shut down the entire server and reenter the listen loop thereby resetting the entire session manager.

The only way we can test (2) is if we can disconnect the underlying connection and reconnect altogether while the listen stays alive.

Differential Revision: D79044526
@vidhyav vidhyav force-pushed the export-D79044526 branch from d6d8df2 to b0a21c7 Compare July 29, 2025 00:36
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79044526

vidhyav added a commit to vidhyav/monarch that referenced this pull request Jul 29, 2025
…pytorch-labs#663)

Summary:
Pull Request resolved: pytorch-labs#663

1) This was because we need to check for out of seq messages
only if it is not the first message of the connection (even if it is
for subsequent sessions).

2) There was another bug in which we don't exactly mutate session state into the session manager correctly. We clone the session state and don't mutate. This particular case is NOT taken care of the tcp_reconnect test: in this test, we shut down the entire server and reenter the listen loop thereby resetting the entire session manager.

The only way we can test (2) is if we can disconnect the underlying connection and reconnect altogether while the listen stays alive.

Differential Revision: D79044526
@vidhyav vidhyav force-pushed the export-D79044526 branch 2 times, most recently from ed44d80 to 9576e50 Compare July 29, 2025 14:31
vidhyav added a commit to vidhyav/monarch that referenced this pull request Jul 29, 2025
…pytorch-labs#663)

Summary:

1) This was because we need to check for out of seq messages
only if it is not the first message of the connection (even if it is
for subsequent sessions).

2) There was another bug in which we don't exactly mutate session state into the session manager correctly. We clone the session state and don't mutate. This particular case is NOT taken care of the tcp_reconnect test: in this test, we shut down the entire server and reenter the listen loop thereby resetting the entire session manager. 

The only way we can test (2) is if we can disconnect the underlying connection and reconnect altogether while the listen stays alive.

Differential Revision: D79044526
vidhyav added a commit to vidhyav/monarch that referenced this pull request Jul 29, 2025
…pytorch-labs#663)

Summary:

1) This was because we need to check for out of seq messages
only if it is not the first message of the connection (even if it is
for subsequent sessions).

2) There was another bug in which we don't exactly mutate session state into the session manager correctly. We clone the session state and don't mutate. This particular case is NOT taken care of the tcp_reconnect test: in this test, we shut down the entire server and reenter the listen loop thereby resetting the entire session manager. 

The only way we can test (2) is if we can disconnect the underlying connection and reconnect altogether while the listen stays alive.

Differential Revision: D79044526
@vidhyav vidhyav force-pushed the export-D79044526 branch from 9576e50 to 2162ba4 Compare July 29, 2025 14:31
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79044526

vidhyav added a commit to vidhyav/monarch that referenced this pull request Jul 29, 2025
…pytorch-labs#663)

Summary:
Pull Request resolved: pytorch-labs#663

Now, we break the session if we detect an out of seq message. This was disabled because of a failing test. But it turns out the test is wrong. The test stops NetRx but that should kill NetTx because netrx failure results in the entire session state nullified on  the receiver side.

Differential Revision: D79044526
@vidhyav vidhyav force-pushed the export-D79044526 branch from 2162ba4 to 7ad4261 Compare July 29, 2025 14:36
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79044526

vidhyav added a commit to vidhyav/monarch that referenced this pull request Jul 29, 2025
…pytorch-labs#663)

Summary:
Pull Request resolved: pytorch-labs#663

Now, we break the session if we detect an out of seq message. This was disabled because of a failing test. But it turns out the test is wrong. The test stops NetRx but that should kill NetTx because netrx failure results in the entire session state nullified on  the receiver side.

Differential Revision: D79044526
@vidhyav vidhyav force-pushed the export-D79044526 branch 2 times, most recently from 4c50da7 to b79a68d Compare July 31, 2025 13:59
vidhyav added a commit to vidhyav/monarch that referenced this pull request Jul 31, 2025
…pytorch-labs#663)

Summary:


Now, we break the session if we detect an out of seq message. This was disabled because of a failing test. But it turns out the test is wrong. The test stops NetRx but that should kill NetTx because netrx failure results in the entire session state nullified on  the receiver side.

Reviewed By: pzhan9, mariusae

Differential Revision: D79044526
@vidhyav vidhyav force-pushed the export-D79044526 branch from b79a68d to e0e7fe2 Compare July 31, 2025 14:00
vidhyav added a commit to vidhyav/monarch that referenced this pull request Jul 31, 2025
…pytorch-labs#663)

Summary:


Now, we break the session if we detect an out of seq message. This was disabled because of a failing test. But it turns out the test is wrong. The test stops NetRx but that should kill NetTx because netrx failure results in the entire session state nullified on  the receiver side.

Reviewed By: pzhan9, mariusae

Differential Revision: D79044526
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79044526

@vidhyav vidhyav force-pushed the export-D79044526 branch from e0e7fe2 to 9297e9e Compare July 31, 2025 14:02
vidhyav added a commit to vidhyav/monarch that referenced this pull request Jul 31, 2025
…pytorch-labs#663)

Summary:
Pull Request resolved: pytorch-labs#663

Now, we break the session if we detect an out of seq message. This was disabled because of a failing test. But it turns out the test is wrong. The test stops NetRx but that should kill NetTx because netrx failure results in the entire session state nullified on  the receiver side.

Reviewed By: pzhan9, mariusae

Differential Revision: D79044526
…pytorch-labs#663)

Summary:
Pull Request resolved: pytorch-labs#663

Now, we break the session if we detect an out of seq message. This was disabled because of a failing test. But it turns out the test is wrong. The test stops NetRx but that should kill NetTx because netrx failure results in the entire session state nullified on  the receiver side.

Reviewed By: pzhan9, mariusae

Differential Revision: D79044526
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79044526

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 434137a.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot. fb-exported Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants