[TransferEngine] heterogeneous_ascend support kv-cache transfer between npu and gpu #759
base: main
Conversation
Co-authored-by: AscendTransport<[email protected]>
mooncake-common/common.cmake
Outdated
@@ -61,6 +61,7 @@ option(USE_NVMEOF "option for using NVMe over Fabric" OFF)
option(USE_TCP "option for using TCP transport" ON)
option(USE_ASCEND "option for using npu with HCCL" OFF)
option(USE_ASCEND_DIRECT "option for using ascend npu with adxl engine" OFF)
option(USE_ASCEND_HETEROGENEOUS "option for using Heterogeneous npu" OFF)
This description seems inconsistent with the title. How about changing it to "option for transferring between ascend npu and gpu"?
OK, Done
@@ -130,7 +130,7 @@ int TransferEnginePy::initializeExt(const char *local_hostname,
}

free_list_.resize(kSlabSizeKBTabLen);
#if !defined(USE_ASCEND) && !defined(USE_ASCEND_DIRECT)
#if !defined(USE_ASCEND) && !defined(USE_ASCEND_DIRECT) && !defined(USE_HETEROGENEOUS)
Need to fix code format: https://github.com/kvcache-ai/Mooncake/actions/runs/17121655553/job/48565109863?pr=759
done
@@ -130,7 +130,8 @@ int TransferEnginePy::initializeExt(const char *local_hostname,
}

free_list_.resize(kSlabSizeKBTabLen);
#if !defined(USE_ASCEND) && !defined(USE_ASCEND_DIRECT)
#if !defined(USE_ASCEND) && !defined(USE_ASCEND_DIRECT) && \
    !defined(USE_HETEROGENEOUS)
Should this be USE_ASCEND_HETEROGENEOUS?
done
@@ -30,6 +30,14 @@ if (USE_ASCEND)
)
endif()

if (USE_HETEROGENEOUS)
ditto
done
LGTM. Since I don't have an env to verify this, and the CI cannot cover this part either, please CC @alogfans to double-check the changes.
@@ -1,18 +0,0 @@
【替换命令】
why remove these files? by accident?
The removal of these files was not an accident. Here's the detailed explanation:
The files in the pkg directory were additional dependencies required by users of the CANN 8.1 version. However, the community has since released the CANN 8.2 version, which natively includes all the content that was previously in the pkg directory, so its contents are no longer necessary.
The previous AscendTransport PR, [TransferEngine] Update to support CANN 8.2.RC1 #714, already implemented support for CANN 8.2. Therefore, in the current PR we took the opportunity to remove the obsolete pkg files as part of the cleanup.
Users who need these dependencies can follow the official instructions to download the CANN 8.2 version from the community. The functionality remains fully consistent with what the pkg directory provided for CANN 8.1.
firstSubmit_ = false;
}

memcpy_mutex_.lock();
Advise using std::lock_guard<> instead of calling lock()/unlock() manually.
done
// - Target side directly reuses RDMA Transport
// - Initiator side uses heterogeneous_rdma_transport
if (target_segment_desc->protocol == "rdma") {
    proto = "ascend";
I don't think this approach is a good idea. Our test program simply treats one node as the target and the other as the initiator. This is fine for P/D disaggregation. However, in a broader scenario, a node can act as both source and target. Thus we can simply assume HeterogeneousRdmaTransport is loaded on both sides.
The ADXL Transport team has identified this issue and will address it in subsequent iterations. For HeterogeneousRdmaTransport, this PR first adapts and enables the path where the 910B actively writes to the H20, and the README explains this limitation. The read semantics are still under development and not yet supported, so for now we reuse the RDMA Transport on the target side. A unified HeterogeneousRdmaTransport will be introduced once the read-semantics adaptation is complete; at that point the change will be made as you requested. Please be informed.
Co-authored-by: AscendTransport<[email protected]>
Heterogeneous Ascend Transport Feature Implementation

Overview

This PR introduces the Heterogeneous Ascend Transport, a high-performance data transmission library designed for heterogeneous inference scenarios. Key features include:

Collaborative Heterogeneous Computing
- 910B NPU: executes PREFILL operations
- H20 GPU: handles DECODE operations

Cross-Device KVCACHE Transfer
- Efficient data exchange between NPU (910B) and GPU (H20) memory
- The current version supports WRITE semantics (READ semantics will follow in future updates)

Key Changes

Build System:
- Added the USE_ASCEND_HETEROGENEOUS compilation flag to toggle the feature
- Separate build configurations for the PREFILL (910B) and DECODE (H20) sides

Core Functionality:
- Implemented RDMA-based heterogeneous memory transfer
- Added GPU Direct support for VRAM access
- Configuration parameters: source, target_offset, opcode

Testing Framework:
- New initiator test: transfer_engine_heterogeneous_ascend_perf_initiator.cpp
- Reused rdma_transport_test.cpp as the target-side test
- P2P handshake protocol with auto-port selection
Usage

Compilation Notes
- PREFILL side (910B): enable USE_ASCEND_HETEROGENEOUS and rebuild
- DECODE side (H20): use the existing RDMA Transport with GPU Direct

Test Commands

```bash
# Target (H20)
./rdma_transport_test --mode=target --local_server_name=10.10.10.10 --metadata_server=P2PHANDSHAKE --operation=write --protocol=rdma --device_name=mlx5_1 --use_vram=true --gpu_id=0

# Initiator (910B)
./transfer_engine_heterogeneous_ascend_perf_initiator --mode=initiator --local_server_name=10.10.10.10 --metadata_server=P2PHANDSHAKE --operation=write --npu_id=1 --segment_id=10.10.10.10:12345 --device_name=mlx5_1 --block_size=65536 --batch_size=128
```
Roadmap
- Add READ semantics support
- Optimize cross-device transfer performance
- Extend to more heterogeneous computing scenarios