Unit Tests for On Device Sampling #463


Open
wants to merge 23 commits into base: main

Conversation

quic-sanising
Contributor

@quic-sanising quic-sanising commented Jun 18, 2025

This PR adds the following Unit Tests for On Device Sampling:

  1. test_sampler_transform: Tests whether SamplerTransform adds nodes at the output of a QEffForCausalLM model to enable sampling of next tokens on the device (instead of the host) and returns the next tokens and/or probability distributions.
  2. test_greedy_sampling: Tests greedy sampling with a QPC compiled with and without On Device Sampling (a rough sketch of this comparison appears after the list).
  3. test_random_sampling: Tests random sampling with a QPC compiled with and without On Device Sampling.
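
Not part of the PR itself, but for illustration, here is a minimal pytest-style sketch of the greedy comparison; generate_with_qpc is a hypothetical stand-in for running the compiled QPC (for example via QAICInferenceSession):

def generate_with_qpc(prompts, include_sampler):
    # Placeholder for illustration only: the real test runs the compiled QPC
    # with On Device Sampling enabled or disabled and returns generated token ids.
    return [[1, 2, 3] for _ in prompts]

def test_greedy_sampling_sketch():
    prompts = ["Hello", "Hi"]
    host_greedy = generate_with_qpc(prompts, include_sampler=False)
    device_greedy = generate_with_qpc(prompts, include_sampler=True)
    # Greedy decoding is deterministic, so both paths should produce identical token ids.
    assert host_greedy == device_greedy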

Signed-off-by: quic-sanising <[email protected]>
@quic-rishinr
Contributor

@quic-sanising Can you add a small feature description under the supported features section of /docs/source/quick_start.md? Also, please provide a link to the example script in the description.

@quic-sanising
Contributor Author

@quic-sanising Can you add a small feature description under the supported features section of /docs/source/quick_start.md? Also, please provide a link to the example script in the description.

Done

Contributor

@quic-amitraj quic-amitraj left a comment

Please fix lint error.

sanising added 3 commits July 3, 2025 13:44
@quic-sanising quic-sanising marked this pull request as ready for review July 3, 2025 19:08
Signed-off-by: sanising <[email protected]>
@quic-sanising
Contributor Author

quic-sanising commented Jul 3, 2025

Please fix lint error.

@quic-amitraj The lint failures were happening because the linter installs ruff v0.12.2, whereas the .pre-commit-config.yaml file pins an older version, v0.5.2.

To fix the errors, we need to either install ruff v0.5.2 in the linter or update the .pre-commit-config.yaml file to version v0.12.2.
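
For reference, the relevant hook entry in .pre-commit-config.yaml would look roughly like this (assuming the standard astral-sh/ruff-pre-commit hook; the exact repo and hook ids in this project may differ):

repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.12.2  # bumped from v0.5.2 so the hook matches the ruff version the linter installs
    hooks:
      - id: ruff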

Contributor

@ochougul ochougul left a comment

Everything else LGTM; just see if we can use a single-layer model in our tests.

"sampling support is not available. Please check the QPC and try again."
)

if include_sampler and not self.include_sampler:
Contributor

Can we include this check in line 489 itself?

Contributor Author

Could you clarify this?

if len(logits.shape) == 2:
    logits = np.expand_dims(logits, 1)
next_token_id = logits.argmax(2)
next_token_id = self._fetch_next_token_id(outputs)
Contributor

Since this is inside the decode loop, would it create any performance drop? Have you done any tests to check for performance deviation?

Contributor Author

I do not follow. How would this cause a performance drop? Instead of performing argmax on the host CPU, we are simply reading the next_token provided by the QAIC device. In my opinion, this would lead to a performance improvement instead. Please let me know if you are talking about something else.

Contributor

Due to the additional reshape, I just wanted to check whether it would create any performance deviation. It's good to have a performance number for a smaller model, just for reference.

Contributor Author

@quic-sanising quic-sanising Aug 6, 2025

I still do not follow. Are you talking about this reshape?

return outputs["next_tokens"].reshape(outputs["next_tokens"].shape[0], outputs["next_tokens"].shape[1])

If yes, this converts a 3D tensor of shape (batch_size, 1, 1) to a 2D tensor of shape (batch_size, 1). This operation doesn't cause a drop in performance.

Otherwise, if you are curious about the overall performance gains, please reach out to me for a complete performance report.
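
For reference, a tiny numpy illustration of that reshape (shapes are arbitrary and not taken from the PR):

import numpy as np

next_tokens = np.array([[[17]], [[42]]])  # (batch_size, 1, 1) buffer read from the device
flattened = next_tokens.reshape(next_tokens.shape[0], next_tokens.shape[1])
assert flattened.shape == (2, 1)  # (batch_size, 1); a view over the same data, no extra compute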

)

# Compare generated texts
golden_texts = {
Contributor

Can we make this test more robust? The goal is to verify whether the random sampler is working as expected, right? One idea is to check whether the sampler produces the same output when given the same random seed or random values, and different outputs when the seed or values change. Secondly, if we want to compare outputs with and without the sampler, can we use cosine similarity with a threshold instead of exact matching? I'm open to suggestions.

Contributor Author

@quic-sanising quic-sanising Aug 4, 2025

Thanks for the suggestion! The current test checks for an exact output match, which we can reproduce with a fixed seed value.

As for cosine similarity or other metrics like perplexity or ROUGE score, they are useful for semantic comparison, but they can be vague since there are no clearly defined or universally accepted thresholds.

So, for this test, exact matching keeps things deterministic and easier to maintain, in my opinion. Happy to explore alternatives if you'd like.
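
For illustration only, here is a rough sketch of the seed-based check described above, using numpy's Generator as a stand-in for the on-device random sampler:

import numpy as np

def sample_tokens(logits, seed, num_tokens=5):
    # Stand-in sampler: softmax over the logits, then draw token ids with a seeded RNG.
    rng = np.random.default_rng(seed)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), size=num_tokens, p=probs)

logits = np.linspace(0.0, 1.0, 8)
assert np.array_equal(sample_tokens(logits, seed=0), sample_tokens(logits, seed=0))  # same seed, same tokens
assert not np.array_equal(sample_tokens(logits, seed=0), sample_tokens(logits, seed=1))  # different seed, (almost surely) different tokens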

Signed-off-by: sanising <[email protected]>

# Load QPC
self._session = QAICInferenceSession(qpc_path, device_id, enable_debug_logs=enable_debug_logs)

# Validate sampler inputs for On-Device Sampling
self.include_sampler = validate_sampler_inputs(
Contributor

Can we have something like the following?

self.include_sampler = validate_sampler_inputs(set(self._session.input_names), include_sampler if include_sampler is not False else False)

Contributor Author

@quic-sanising quic-sanising Aug 6, 2025

I don't think we need to do this. include_sampler is a mandatory boolean variable with a default value of False; the only other possible value is True. Both of these scenarios are handled in the validate_sampler_inputs() function. Additionally, the function handles the case of include_sampler=None, so that we can reuse it in other places if we want to.
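
For context, a rough sketch of the behavior described above (this is not the library's actual code, and the sampler input names are assumed for illustration):

from typing import Optional, Set

ASSUMED_SAMPLER_INPUTS = {"temperatures", "top_ks", "top_ps"}  # hypothetical input names

def validate_sampler_inputs(session_inputs: Set[str], include_sampler: Optional[bool] = False) -> bool:
    qpc_has_sampler = ASSUMED_SAMPLER_INPUTS.issubset(session_inputs)
    if include_sampler and not qpc_has_sampler:
        raise ValueError(
            "The QPC does not expose sampler inputs, so On Device "
            "Sampling support is not available. Please check the QPC and try again."
        )
    if include_sampler is None:
        # Reusable path: infer sampling support from the QPC when the caller does not specify.
        return qpc_has_sampler
    return bool(include_sampler)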
