[Draft] Qualcomm AI Engine Direct - Unexpected graph for mutable buffer after export during Quantization #11309
Conversation
Hi @cccclai, while enabling GA mode, we encountered an issue where the copy node was not inserted for the mutable buffer node after export during quantization. This results in the mutable buffer being treated as a constant buffer after the graph is compiled. I noticed that you applied a patch to fix this issue in Llama. I have two questions regarding this issue:
|
I think the general flow today is: export -> high-level non-functional IR (quant happens here) -> edge -> decomposes to functional IR + smaller operator lib -> to_executorch -> where we will try to reinject some mutation. Export used to spit out functional IR, but it hasn't for some time. It might be possible for a quantizer to functionalize the graph on its own. Right now I believe the logic is coupled with the decompose API, though I'm not totally sure. Asking around. |
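For reference, a minimal sketch of that flow as it is typically wired up with PT2E quantization and ExecuTorch lowering; the model, example_inputs, and quantizer names are placeholders, and exact API details vary by version:

import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from executorch.exir import to_edge

# 1. export -> high-level, non-functional ATen IR (in-place ops such as index_put_ remain)
ep = torch.export.export(model, example_inputs)

# 2. quantization happens on this pre-decomposition graph
gm = prepare_pt2e(ep.module(), quantizer)  # insert observers
gm(*example_inputs)                        # calibrate
gm = convert_pt2e(gm)                      # insert q/dq, fold constants by default

# 3. re-export and lower to edge, which decomposes to the functional core ATen op set
edge = to_edge(torch.export.export(gm, example_inputs))

# 4. to_executorch tries to re-inject mutation (e.g. buffer copy_) where possible
et_program = edge.to_executorch()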
Asking compiler team (PoC for torch.export) and quantization team (PoC for quantization) to help... |
In my view, the main challenge is integrating the mutable buffer feature with the quantization flow:
export -> high-level non-functional IR (issue: missing mutable buffer information in the graph signature)
prepare_pt2e
edge -> decomposes to functional IR + smaller operator lib
to_executorch -> where we will try to reinject some mutation. |
If we have a way to get a functional IR during quantization, does it solve the issue? |
I'm not very clear about the relationship between functional IR and the mutable buffer issue. As far as I know, functional IR refers to operators that do not mutate or alias inputs. However, there seem to be two issues here:
Also, I'm curious if other backends have encountered this problem? |
If we have a functional IR during quantization, then instead of |
Got it. Is it similar to using legacy export, i.e. using the following patch? Actually, I currently use this patch to work around the mutable buffer issue, and it seems to work. But it still doesn't resolve another issue: the input of index_put is replaced by a frozen param after convert_pt2e. As a result, we cannot compare with the CPU results after convert_pt2e, since the input of index_put is fixed. Maybe I can try replacing the frozen param with the original mutable buffer after convert_pt2e to check whether that is workable.
|
Oh, I remember that issue, and I thought you sent a patch.. I would like to check whether this is to support Llama and similar decoder-only models in optimum. If so, maybe we can propose the option of using static llama instead of the optimum model definition, given that the latter will be much slower... |
Yes, I have a workaround PR from before.
Yes, it is to support LLMs and similar decoder models in the GA model list. Actually, we are also trying to use static llama instead of the Hugging Face model definition now. |
I see, I will bring it up to the team. Using static llama is likely the easier solution |
Thanks a lot.
Yes, or modify the model definition to move the KV cache into the model inputs, as in mimi, whisper, T5, etc. |
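As a small illustration of that alternative (names are hypothetical, not taken from mimi/whisper/T5): the KV cache becomes an explicit input and output of the model, so no mutable buffer is needed in the exported program.

import torch

class CacheAsInput(torch.nn.Module):
    # Hypothetical module: the KV cache is passed in and returned,
    # instead of being registered as a buffer and mutated in place.
    def forward(self, k_cache, input_pos, k_val):
        # Functional update: returns a new cache tensor.
        return torch.index_copy(k_cache, 1, input_pos, k_val)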
The answer I got from the compiler team is that:
If it's resolved, then only the second issue needs to be solved. |
I also remember there is a way to selectively fuse operators, @jerryzh168 any chance you know? |
Just discussed with the team; it's still unclear if it's okay, but if the effort to enable them in static llama is minimal, it would be great to have them working there first while we're trying to figure out this issue. |
Regarding the second issue, @jerryzh168 was proposing
I feel like it's reasonable, does it work for you? |
@tugsbayasgalan made another suggestion
I feel like it's also reasonable, and less work. I will try to make a PR |
Here is the example PR pytorch/ao#2345. I don't have a better idea than re-tracing, and it also looks like we need to set
such that they won't be frozen. @jerryzh168 may have a better idea on this issue. |
I have tried not quantizing the mutable buffer of in-place ops, and it seems to work well. I avoid annotating the input node because mutable buffers would otherwise be folded during the convert_pt2e process. During compilation, I can fill in the quant_attr of the mutable buffer from the quant_attr of the node, because index_copy should not affect quant_attr.
|
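A minimal sketch of the idea above, under assumptions: the helper and the annotate_input/annotate_output callbacks are hypothetical, and the graph-signature lookup is only one possible way to detect a mutable buffer; this is not the actual Qualcomm annotation code.

import torch

def _is_mutable_buffer(node: torch.fx.Node, exported_program) -> bool:
    # Hypothetical check: the placeholder maps to a buffer in the graph
    # signature and that buffer is listed as mutated.
    sig = exported_program.graph_signature
    buf = sig.inputs_to_buffers.get(node.name)
    return buf is not None and buf in sig.buffers_to_mutate.values()

def annotate_index_put(node: torch.fx.Node, exported_program, annotate_input, annotate_output):
    cache, indices, values = node.args[0], node.args[1], node.args[2]
    # Skip annotating the cache when it is a mutable buffer, so convert_pt2e
    # does not fold it into a frozen constant; its quant params are filled in
    # later from the index_put output during compilation.
    if not _is_mutable_buffer(cache, exported_program):
        annotate_input(cache)
    annotate_input(values)
    annotate_output(node)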
I've also tried using run_decompositions({}), and it appears to yield the same results as with the patch. However, I'm uncertain whether this affects other test cases. |
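For context, a minimal sketch of that variant (model and example_inputs are placeholders). As I understand it, run_decompositions with an empty table keeps the high-level ops but functionalizes the graph, so the in-place cache update shows up as a buffer mutation in the graph signature rather than being hidden:

import torch

ep = torch.export.export(model, example_inputs)   # may still contain in-place ops
ep = ep.run_decompositions({})                    # functionalize without decomposing
gm = ep.module()                                  # graph module handed to prepare_pt2e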
Do you mean that you avoid quantizing the index_put_ op? And you rely on the other ops next to the For
I'm not quite following this question, if the buffer is not folded, like this
Then we can see the |
In fact, I am trying not to quantize the mutable buffer input (args[0]) of the index_put operation. Because I have annotated the output of the index_put, I assign the quant_attr of the mutable buffer from the quant_attr of the index_put in the operation builder. Therefore, I'm considering whether there is a way to identify that an input is a mutable buffer, so I can skip it during annotation. As you say, I think this approach cannot be used for compute ops.
class IndexPutVisitor(NodeVisitor):
    target = ["aten.index_put.default"]

    def __init__(self, *args) -> None:
        super().__init__(*args)

    def define_node(self, node, nodes_to_wrappers):
        input_node = self.get_node(node.args[0])
        # Propagate the quant_attrs of the index_put output to its mutable
        # buffer input, since index_put should not change the quant params.
        if quant_attrs := node.meta.get(QCOM_QUANT_ATTRS):
            quant_attrs = quant_attrs.copy()
            input_node.meta[QCOM_QUANT_ATTRS] = quant_attrs
        input_tensor = self.get_tensor(input_node, node)
        ... |
Yes, I agree with ep.run_decompositions({}) as a pass in transform_for_quantization. It would be easy for users to use. |
By the way, does ExecuTorch initialize the mutable buffer to zero?
# test model
class IndexCopy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer(
            "k_cache",
            torch.ones((1, 1024, 12, 64), dtype=torch.float32),
            persistent=False,
        )

    def forward(self, input_pos, k_val):
        k_out = self.k_cache
        k_out.index_copy_(1, input_pos, k_val)
        return k_out + 0

# test
def test_qnn_backend_index_put(self):
    test_comb = [
        {
            QCOM_MODULE: IndexCopy(),  # noqa: F405
            QCOM_SAMPLE_INPUTS: [
                (
                    torch.tensor([2], dtype=torch.int64),
                    torch.randn([1, 1, 12, 64]),
                ),
                (
                    torch.tensor([1], dtype=torch.int64),
                    torch.randn([1, 1, 12, 64]),
                ),
            ],
        },
    ]
    for i, test in enumerate(test_comb):
        with self.subTest(i=i):
            module = self.get_qdq_module(
                test[QCOM_MODULE], test[QCOM_SAMPLE_INPUTS]
            )
            # module.reset
            self.lower_module_and_test_output(module, test[QCOM_SAMPLE_INPUTS])
# ref_output from nn.module:
tensor([[[[0.0218, 0.8506, 1.0905, ..., 0.9814, 0.6325, 1.0905],
[0.3271, 0.0000, 0.3708, ..., 0.0000, 0.0000, 0.0000],
[0.2835, 0.0000, 0.0436, ..., 0.1963, 1.0905, 0.0000],
...,
[1.0033, 1.0905, 0.5016, ..., 0.2835, 0.0000, 1.0905],
[0.0000, 1.0905, 1.0905, ..., 0.0000, 0.0000, 1.0905],
[0.2617, 0.6543, 0.6761, ..., 0.0000, 1.0905, 0.0000]],
[[0.2617, 0.0654, 0.3053, ..., 0.0000, 0.0000, 0.7852],
[0.7852, 0.3708, 0.0000, ..., 0.2181, 0.4144, 1.0905],
[0.0000, 1.0905, 0.0000, ..., 0.0000, 1.0905, 1.0905],
...,
[0.0000, 0.0000, 0.1090, ..., 0.5234, 0.0000, 0.7197],
[0.3490, 0.0000, 0.0000, ..., 0.0218, 0.7197, 0.0000],
[0.1745, 0.4798, 0.0000, ..., 0.1309, 0.2399, 0.0000]],
[[1.0033, 1.0033, 1.0033, ..., 1.0033, 1.0033, 1.0033],
[1.0033, 1.0033, 1.0033, ..., 1.0033, 1.0033, 1.0033],
[1.0033, 1.0033, 1.0033, ..., 1.0033, 1.0033, 1.0033],
...,
[1.0033, 1.0033, 1.0033, ..., 1.0033, 1.0033, 1.0033],
[1.0033, 1.0033, 1.0033, ..., 1.0033, 1.0033, 1.0033],
[1.0033, 1.0033, 1.0033, ..., 1.0033, 1.0033, 1.0033]],
[[1.0033, 1.0033, 1.0033, ..., 1.0033, 1.0033, 1.0033],
[1.0033, 1.0033, 1.0033, ..., 1.0033, 1.0033, 1.0033],
[1.0033, 1.0033, 1.0033, ..., 1.0033, 1.0033, 1.0033],
...,
[1.0033, 1.0033, 1.0033, ..., 1.0033, 1.0033, 1.0033],
[1.0033, 1.0033, 1.0033, ..., 1.0033, 1.0033, 1.0033],
[1.0033, 1.0033, 1.0033, ..., 1.0033, 1.0033, 1.0033]]]])
# runner_output from qnn_executor_runner:
tensor([[[[0.0259, 0.8576, 1.0989, ..., 0.9826, 0.6292, 1.0989],
[0.3318, 0.0000, 0.3749, ..., 0.0000, 0.0000, 0.0000],
[0.2844, 0.0000, 0.0474, ..., 0.2025, 1.0989, 0.0000],
...,
[1.0041, 1.0989, 0.5042, ..., 0.2758, 0.0000, 1.0989],
[0.0000, 1.0989, 1.0989, ..., 0.0000, 0.0000, 1.0946],
[0.2543, 0.6594, 0.6809, ..., 0.0000, 1.0989, 0.0000]],
[[0.2715, 0.0646, 0.3146, ..., 0.0000, 0.0000, 0.7800],
[0.7887, 0.3663, 0.0000, ..., 0.2284, 0.4094, 1.0989],
[0.0000, 1.0989, 0.0000, ..., 0.0000, 1.0989, 1.0989],
...,
[0.0000, 0.0000, 0.1164, ..., 0.5128, 0.0000, 0.7111],
[0.3405, 0.0000, 0.0000, ..., 0.0215, 0.7111, 0.0000],
[0.1810, 0.4741, 0.0000, ..., 0.1250, 0.2327, 0.0000]],
[[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
...,
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]],
[[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
...,
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]]]]) |
Let me try to summarize the current results.
Approach 1: Use run_decompositions as a pass before annotation, and convert_pt2e(m, fold_quantized=False)
Follow-up question:
Approach 2: Avoid quantizing the mutable buffer
Would this be in line with your thoughts? |
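A rough end-to-end sketch of approach 1, under assumptions: model, example_inputs, and quantizer are placeholders, and the keyword for keeping quantize ops unfolded is written here as fold_quantize, the name I know from torch.ao.quantization.quantize_pt2e (the thread writes fold_quantized; the exact name may differ by version):

import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

# Functionalize the exported program before annotation (empty decomposition table).
ep = torch.export.export(model, example_inputs)
ep = ep.run_decompositions({})

gm = prepare_pt2e(ep.module(), quantizer)
gm(*example_inputs)  # calibration
# Keep quantize ops unfolded so the mutable buffer is not frozen into a constant.
gm = convert_pt2e(gm, fold_quantize=False)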
Yes, it's aligned. Thank you for summarizing.
The pass will run under the hood, so users won't notice the change.
I think the graph input will be a buffer instead of a placeholder in this case? We may need to trace up the graph
Yeah, agreed that re-tracing might not be the best idea; maybe we can add a pass to manually convert in-place ops to functional ops, just so we're not tied to retracing. It seems like the general recommendation from the compiler team is to trace to be safe, but I feel like with passes we have more control, though possibly more work (the effort to add passes). What do you think?
I'm still discussing with Jerry about it; I think we should have a way to choose which ops to fold.
I think in general, we want to avoid annotating non-compute ops, as it might slightly affect accuracy. So we should avoid annotating index_put_ anyway, and it will be an improvement. This will unblock us for decoder-only models. For the general solution (option 1), we haven't hit a GA model like this, so it may be safe to go with approach 2 first, while we're working on a proper solution for in-place compute ops in option 1. |
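For the pass-based idea mentioned above, a minimal sketch of converting in-place ops to their functional counterparts (the op mapping is an assumption, and a real pass would still have to re-inject the write-back to the buffer or update the graph signature):

import torch
from torch.fx import GraphModule

# Assumed mapping from in-place ops to their functional counterparts.
INPLACE_TO_FUNCTIONAL = {
    torch.ops.aten.index_copy_.default: torch.ops.aten.index_copy.default,
    torch.ops.aten.index_put_.default: torch.ops.aten.index_put.default,
}

def convert_inplace_to_functional(gm: GraphModule) -> GraphModule:
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target in INPLACE_TO_FUNCTIONAL:
            # The in-place op returns the mutated tensor, so existing users of
            # this node keep working; only the target changes here.
            node.target = INPLACE_TO_FUNCTIONAL[node.target]
    gm.recompile()
    return gm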
Are you suggesting that a buffer should be a get_attr node? If so, shouldn't weights or other constant nodes also be get_attr nodes?
Yes, I also intend to use a pass to convert them. I think we need to put some effort into testing this pass.
Thanks for your help. The main effort for this task comes from the fact that, with fold_quantized=False, we need to make some changes in annotate_quant_attr.py and other related passes.
It makes sense to me. I will put up a PR to fix the mutable buffer issue with approach 2, including this change to improve performance. By the way, we have good news: the Qwen model weights can be loaded into our static LLaMA structure. If other LLM-based models are compatible in the same way, it would be very helpful. |
@shewu-quic @haowhsu-quic |
Sorry for my late reply. I have created a PR to go with approach 2 and added an option to choose whether or not to delegate the mutable buffer. Based on this PR, I can successfully enable the whisper model. Thanks for your PR; I will take a shot.
Issue
We observed that a copy node was not inserted for the mutable buffer after export during quantization.
The following graph can be reproduced using this PR:
export.svg
Background
Given that GA models are currently being enabled, some models use index_put/index_copy to update the key-value cache, similar to Llama in ExecuTorch.
In a previous PR, we observed that a copy node would be inserted after the mutable buffer (b_k_cache), even though the input of the index_put node was frozen as b__frozen_param0. Therefore, we added a workaround pass to replace the input of index_put, as sketched below.
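A minimal sketch of the kind of rewiring such a workaround pass does (the placeholder name b_k_cache and the use of aten.index_put.default follow the example in this thread; how the buffer node is located is an assumption):

import torch
from torch.fx import GraphModule

def replace_frozen_cache_input(gm: GraphModule) -> GraphModule:
    # Point index_put's first argument back at the mutable buffer placeholder
    # (e.g. b_k_cache) instead of the frozen constant (e.g. b__frozen_param0).
    placeholders = {n.name: n for n in gm.graph.nodes if n.op == "placeholder"}
    buffer_node = placeholders.get("b_k_cache")  # assumed placeholder name
    if buffer_node is None:
        return gm
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target == torch.ops.aten.index_put.default:
            frozen_input = node.args[0]
            if frozen_input is not buffer_node:
                node.replace_input_with(frozen_input, buffer_node)
    gm.recompile()
    return gm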
In the past
Seems workaround
I am curious why the Llama model is not affected. I found that you applied a patch for Llama export, which seems to reproduce the previous result.
Is this the expected solution for the mutable buffer issue?
export_with_patch.svg
cc @haowhsu-quic @winskuo-quic @DannyYuyang-quic