Signal handler for RemoteProcessAlloc #540

Closed
wants to merge 1 commit into from


thomasywang

Summary:
What's going on here:

  1. RemoteProcessAlloc is instantiated in the client code (e.g. https://fburl.com/code/p4t5aewo).
  2. RemoteProcessAlloc::new() now spawns a signal handler task and holds onto its JoinHandle. A tx-rx pair is created so that the handler task learns the addresses of the hosts in RemoteProcessAlloc::host_states as they are added and removed.
  3. RemoteProcessAlloc::host_states is now wrapped in a struct, HostStates, which holds the tx side and aims to expose the same interface as a HashMap while sending every update to the map over the tx. When a RemoteProcessAllocHostState is inserted, the HostId and address are sent over the tx. When one is removed, only the HostId is sent (the address is None).
  4. When the handler receives a HostId and Some(ChannelAddr), it dials this address and inserts the resulting ChannelTx into its own HashMap, keyed by the HostId.
  5. When the handler receives a HostId and None, it removes the corresponding entry from its HashMap.
  6. When the handler receives a signal, it iterates over all ChannelTxs in the HashMap and sends RemoteProcessAllocatorMessage::Signal(signal) over each ChannelTx to the RemoteProcessAllocator running on the remote machine.
  7. The RemoteProcessAllocator receives the message. If the signal is SIGINT, it calls ensure_previous_alloc_stopped to stop gracefully, then re-raises the signal.
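
Steps 3 to 6 above can be sketched roughly as follows. This is a standard-library-only illustration, not the code from this PR: `HostId` and `ChannelAddr` are plain `String` stand-ins, "dialing" just stores the address, the signal arrives as a hypothetical `Event::Signal` on the same channel as host updates (the real handler presumably listens for OS signals separately), and `Event`, `run_handler`, and `demo` are names invented for the example.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

// Stand-ins for the real types (assumption: the real ones are richer).
type HostId = String;
type ChannelAddr = String;

/// Everything the handler task can receive in this sketch.
enum Event {
    /// Some(addr) means the host was inserted, None means it was removed.
    HostUpdate(HostId, Option<ChannelAddr>),
    /// Hypothetical: a signal number delivered as a message.
    Signal(i32),
}

/// HashMap-like wrapper that mirrors every update over the tx (step 3).
struct HostStates<V> {
    inner: HashMap<HostId, V>,
    tx: Sender<Event>,
}

impl<V> HostStates<V> {
    fn new(tx: Sender<Event>) -> Self {
        Self { inner: HashMap::new(), tx }
    }

    fn insert(&mut self, id: HostId, addr: ChannelAddr, state: V) -> Option<V> {
        // Tell the handler about the new host before storing its state.
        let _ = self.tx.send(Event::HostUpdate(id.clone(), Some(addr)));
        self.inner.insert(id, state)
    }

    fn remove(&mut self, id: &str) -> Option<V> {
        // An address of None tells the handler to drop its entry.
        let _ = self.tx.send(Event::HostUpdate(id.to_string(), None));
        self.inner.remove(id)
    }
}

/// Handler loop (steps 4 to 6). Returns a log of (address, signal) pairs
/// in place of real sends over a ChannelTx.
fn run_handler(rx: Receiver<Event>) -> Vec<(ChannelAddr, i32)> {
    let mut dialed: HashMap<HostId, ChannelAddr> = HashMap::new();
    let mut sent = Vec::new();
    // The loop ends once every Sender has been dropped.
    for event in rx {
        match event {
            Event::HostUpdate(id, Some(addr)) => {
                dialed.insert(id, addr);
            }
            Event::HostUpdate(id, None) => {
                dialed.remove(&id);
            }
            Event::Signal(sig) => {
                // Sort only to make the log order deterministic.
                let mut addrs: Vec<ChannelAddr> = dialed.values().cloned().collect();
                addrs.sort();
                for addr in addrs {
                    sent.push((addr, sig));
                }
            }
        }
    }
    sent
}

/// Two hosts are added, one is removed, then signal 2 (SIGINT) is forwarded.
fn demo() -> Vec<(ChannelAddr, i32)> {
    let (tx, rx) = channel::<Event>();
    let handler = thread::spawn(move || run_handler(rx));
    let mut hosts = HostStates::new(tx.clone());
    hosts.insert("host0".into(), "tcp://a:1".into(), ());
    hosts.insert("host1".into(), "tcp://b:1".into(), ());
    hosts.remove("host0");
    tx.send(Event::Signal(2)).unwrap();
    drop(hosts);
    drop(tx);
    handler.join().unwrap()
}
```

Calling `demo()` forwards the signal only to `host1`, since `host0` was removed before the signal arrived. Routing updates through a channel keeps the handler's map private to the handler task, so no lock is shared between it and `host_states`.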

Differential Revision: D78097380

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 15, 2025
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D78097380

thomasywang added a commit to thomasywang/monarch-1 that referenced this pull request Jul 18, 2025
Summary:
Pull Request resolved: pytorch-labs#540

Reviewed By: moonli

Differential Revision: D78097380
@thomasywang thomasywang force-pushed the export-D78097380 branch 2 times, most recently from 7386ca5 to 4a12d9e Compare July 21, 2025 16:20
thomasywang added a commit to thomasywang/monarch-1 that referenced this pull request Jul 21, 2025

thomasywang added a commit to thomasywang/monarch-1 that referenced this pull request Jul 22, 2025

@facebook-github-bot

This pull request has been merged in 1c96f6d.

Labels
CLA Signed This label is managed by the Meta Open Source bot. fb-exported Merged
2 participants