# Update Weights

LMDeploy supports updating model weights online, which is useful in scenarios such as RL training. Here are the steps to do so.

## Step 1: Launch server

For the PyTorch backend, you have to add `--distributed-executor-backend ray`:

```shell
lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333 --distributed-executor-backend ray  # for the pytorch backend
```
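
Once the server is up, you can verify it is reachable before proceeding. The snippet below is a minimal sketch that queries the OpenAI-compatible `/v1/models` endpoint exposed by `api_server` (host and port follow the launch command above):

```python
import requests

BASE_URL = 'http://0.0.0.0:23333'

# list the served models; a 200 response means the server is ready
response = requests.get(f'{BASE_URL}/v1/models')
assert response.status_code == 200, response.status_code
print(response.json())
```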

## Step 2: Offload weights & kv cache

Before updating the model weights, the server should offload its weights and kv cache.

```python
import requests

from lmdeploy.utils import serialize_state_dict  # used in Step 3

BASE_URL = 'http://0.0.0.0:23333'
api_key = 'sk-xxx'

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}

# offload both weights and kv cache with level=2
response = requests.post(f"{BASE_URL}/sleep", headers=headers, params=dict(tags=['weights', 'kv_cache'], level=2))
assert response.status_code == 200, response.status_code

# wake up the weights so the server is ready for the weight update
response = requests.post(f"{BASE_URL}/wakeup", headers=headers, params=dict(tags=['weights']))
assert response.status_code == 200, response.status_code
```
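
Only the `weights` tag is woken up here: the weights need to be resident again before they can be updated, while the kv cache can stay offloaded until serving resumes in Step 4.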

## Step 3: Update weights

Split the model weights into multiple segments and upload them through the `update_weights` endpoint.

```python
from typing import Dict, List

import torch

segmented_state_dict: List[Dict[str, torch.Tensor]] = ...
num_segment = len(segmented_state_dict)
for seg_idx in range(num_segment):
    serialized_data = serialize_state_dict(segmented_state_dict[seg_idx])
    # mark the last segment with finished=True so the server can finalize the update
    data = dict(serialized_named_tensors=serialized_data, finished=seg_idx == num_segment - 1)
    response = requests.post(f"{BASE_URL}/update_weights", headers=headers, json=data)
    assert response.status_code == 200, f"response.status_code = {response.status_code}"
```
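
How you build `segmented_state_dict` depends on your training setup. As one illustration, the sketch below splits a model's `state_dict()` into a fixed number of roughly equal-sized segments; the helper `split_state_dict` and the segment count are hypothetical, not part of LMDeploy:

```python
from typing import Dict, List

import torch


def split_state_dict(state_dict: Dict[str, torch.Tensor],
                     num_segments: int) -> List[Dict[str, torch.Tensor]]:
    """Hypothetical helper: split a state dict into roughly equal chunks."""
    names = list(state_dict.keys())
    seg_size = max(1, (len(names) + num_segments - 1) // num_segments)  # ceil division
    return [{name: state_dict[name] for name in names[i:i + seg_size]}
            for i in range(0, len(names), seg_size)]


# e.g. segmented_state_dict = split_state_dict(model.state_dict(), num_segments=8)
```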

**Note**: For the PyTorch backend, LMDeploy also supports flattened bucket tensors:

```python
from lmdeploy.utils import FlattenedTensorBucket, serialize_state_dict

segmented_state_dict: List[Dict[str, torch.Tensor]] = ...
num_segment = len(segmented_state_dict)
for seg_idx in range(num_segment):
    named_tensors = list(segmented_state_dict[seg_idx].items())
    # pack the segment's tensors into a single flattened buffer plus per-tensor metadata
    bucket = FlattenedTensorBucket(named_tensors=named_tensors)
    metadata = bucket.get_metadata()
    flattened_tensor_data = dict(flattened_tensor=bucket.get_flattened_tensor(), metadata=metadata)
    serialized_data = serialize_state_dict(flattened_tensor_data)
    data = dict(serialized_named_tensors=serialized_data, finished=seg_idx == num_segment - 1, load_format='flattened_bucket')
    response = requests.post(f"{BASE_URL}/update_weights", headers=headers, json=data)
    assert response.status_code == 200, f"response.status_code = {response.status_code}"
```
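
Flattening a segment's tensors into one contiguous buffer can reduce per-tensor serialization and transfer overhead when a segment contains many small tensors, at the cost of an extra copy into the bucket on the client side.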

## Step 4: Wake up server

After the model weights are updated, the server should onload the kv cache and resume serving with the updated weights.

```python
response = requests.post(f"{BASE_URL}/wakeup", headers=headers, params=dict(tags=['kv_cache']))
assert response.status_code == 200, response.status_code
```
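
To confirm the update took effect end to end, you can send a quick request to the OpenAI-compatible chat endpoint. This is a minimal sketch reusing the session from the steps above; the model name is a placeholder, so query `/v1/models` for the name your deployment actually serves:

```python
payload = dict(
    model='internlm/internlm2_5-7b-chat',  # placeholder; check /v1/models for the served name
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
response = requests.post(f'{BASE_URL}/v1/chat/completions', headers=headers, json=payload)
assert response.status_code == 200, response.status_code
print(response.json()['choices'][0]['message']['content'])
```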