[DRAFT] FEAT: Chain-of-Thought Hijacking Attack Strategy and Test Coverage by riyosha · Pull Request #1438 · microsoft/PyRIT

riyosha · 2026-03-05T01:05:33Z

Description

[DRAFT] FEAT: Chain-of-Thought Hijacking Attack Strategy and Test Coverage

Related to issue #897

This PR introduces the Chain-of-Thought (CoT) Hijacking attack strategy, as described in Zhao et al. (2025). The changes include:

ADDED: Implementation of the CoT Hijacking attack strategy - pyrit/executor/attack/multi_turn/cot_hijacking.py
ADDED: YAML prompt templates for 6 puzzle types from the paper - pyrit/datasets/executors/cot_hijacking/puzzle_generation_{puzzle_type}.yaml
ADDED: Unit tests for CoT Hijacking attack - tests/unit/executor/attack/multi_turn/test_cot_hijacking.py

Related issues: #897

Tests and Documentation

Added unit tests in tests/unit/executor/attack/multi_turn/test_cot_hijacking.py
Tested the attack with local llama3:8b as the target model and attacker model as mistral:7b. (These LLMs lack advanced reasoning; suggestions for better, locally accessible LRM models are welcome!)

This is a draft PR and I want to get your thoughts on the implementation so far. I have planned these updates:

Currently I'm relying on the _fallback_score_response function to use pattern matching for generating a score. I want to replace this with either another auxiliary model as a scorer or use float scale scoring using Azure Content Safety API.
Currently the iterative feedback given to the attacker model ( _generate_attack_prompt_async) only includes the harm score and a static prompt to make the puzzle more complex. I'll update it to include the target's previous safe response as well.

Question:

I noticed a few other multi_attack strategies define async def _teardown_async even if unused. Should I also add it?

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

romanlutz

This is really good! While reading it, I couldn't shake the feeling that this is very similar to RedTeamingAttack with the big difference that it cycles through the system prompt templates, of course. I haven't had time to compare with it in detail to see if that would be doable. My hunch is that it would introduce considerable complexity and is probably not worth it but I'd like to be sure...

Other things:

needs mentioning in api.rst
needs example notebook (both ipynb and py files) somewhere in doc/executor/attack, which in turn needs to be mentioned in TOC file. Example notebook doesn't need to be elaborate.
needs integration test, perhaps just one that runs the example notebook. This may be auto-created by test_executor_notebooks.py I think...

romanlutz · 2026-03-06T06:20:26Z

+    def __init__(
+        self,
+        *,
+        objective_target: PromptTarget = REQUIRED_VALUE,  # type: ignore[assignment]


Most likely, this assumes we're dealing with a target that has reasoning capabilities, right? @hannahwestra25 is currently working on expanding TargetCapabilities so that could come in handy here for validation purposes.

TargetCapabilities doesn't including reasoning models yet. I commented out a note to add this when it has been included!

riyosha · 2026-03-24T08:49:20Z

@romanlutz

Hi, I incorporated the feedback on the PR and added test notebooks as well. I used the Open AI Chat Model, and in order to bypass the safety guardrails for the testing notebooks, I used a non-harmful objective and prompt template.

Also, CoTHijackingAttack supports two scorer types:

TrueFalseScorer - passed via attack_scoring_config. Returns a boolean directly, used as-is for success determination.
FloatScaleScorer - passed via float_scale_scorer. Wrapped internally in FloatScaleThresholdScorer with the configurable success_threshold parameter, converting the float score to a boolean.

Both paths converge to the same success check: bool(score_obj.get_value()), keeping _perform_async scorer-agnostic.

Let me know what you think, thanks!

Isolated the 13 CoT-hijacking-specific files from riyosha/h-cot branch onto a clean main base for reviewability. Original PR had 473 changed files due to branch drift. Also added myst.yml entry for the doc notebook. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

romanlutz

Thanks for working on this, @riyosha! Implementing CoT Hijacking from Zhao et al. (2025) is a valuable addition to PyRIT's attack library. I've done a thorough read of the implementation against the reference paper and code. Feedback below as inline comments, organized by severity.

Note: I cleaned up the branch to isolate the 14 CoT-hijacking-specific files (it previously had 473 changed files from branch drift). The original work and authorship are preserved.

Summary: The structure follows MultiTurnAttackStrategy patterns well. Priority fixes: (1) JSON format mismatch across templates, (2) attacker conversation history. Parallel streams could be a documented follow-up.

romanlutz · 2026-04-24T06:02:14Z

+            parsed = json.loads(response_text)
+
+            # Step 3: extract the jailbreak prompt P from the JSON
+            return str(parsed.get("prompt", meta_prompt))


🔴 Broken JSON parsing for 2 of 6 real puzzle types

parsed.get("prompt", meta_prompt) assumes every template tells the attacker to return a prompt key, but:

logic_grid.yaml and sudoku.yaml instruct the attacker to return {puzzle_framework, operation_mapping, practical_example_request} — there is no prompt key.

category_theory.yaml / test.yaml → {prompt} ✅

logic_grid_enhanced.yaml / skyscrapers_memetic.yaml → {improvement, prompt} ✅

For logic_grid and sudoku, this always falls back to the raw meta-prompt (the instruction to the attacker, not a crafted adversarial prompt from the attacker). These two puzzle types are silently broken.

Fix: Either standardize all templates to include a prompt key in their FORMAT section, or update the parsing to handle the different response schemas (e.g., compose a prompt from the three fields).

romanlutz · 2026-04-24T06:02:14Z

+
+        # Step 2: send meta-prompt to adversarial model to generate the actual jailbreak prompt
+
+        message = Message.from_prompt(prompt=meta_prompt, role="user")


🔴 No conversation history for the attacker model

The paper's reference implementation maintains a multi-turn conversation with the attacker LLM, keeping the last N turns (keep_last_n=3) so it can refine based on accumulated history. See core/workflow.py:
python for conv in convs_list: conv.messages = conv.messages[-2 * args.keep_last_n:] ``n Here, a fresh single-turn message is created each iteration. The attacker can't learn from prior attempts beyond the raw template variables. The enhanced templates (logic_grid_enhanced, skyscrapers_memetic`) even explicitly say 'The user output you receive is composed of four parts, LANGUAGE MODEL OUTPUT, OBJECTIVE, SCORE and STEP NUMBER' — assuming multi-turn context exists.

Fix: Maintain a separate conversation context for the adversarial model and append each round's feedback as conversation turns, similar to how CrescendoAttack or RedTeamingAttack manage adversarial conversations.

romanlutz · 2026-04-24T06:02:15Z

+
+        for iteration in range(self._max_iterations):
+            context.iteration = iteration + 1
+            puzzle_type = self._puzzle_types[iteration % len(self._puzzle_types)]


🔴 Missing parallel streams (architectural gap vs. paper)

The reference runs 6 concurrent streams (n_streams=6) with different puzzle-type system prompts, taking the best result per iteration. This runs a single sequential stream, cycling puzzle types one at a time. This is the paper's core strategy — diversity across puzzle types in parallel, not sequentially.

Suggestion: At minimum, document this as a known simplification. Ideally, add a parameter (e.g., n_streams: int = 1) with best-of-N selection per iteration.

romanlutz · 2026-04-24T06:02:15Z

+        # Step 1: render the meta-prompt for the adversarial model
+        prompt_path = PUZZLE_PROMPT_PATHS[puzzle_type]
+        import yaml
+        from jinja2 import Environment


🟠 Inline imports — style guide violation

yaml and jinja2.Environment are imported inside a method body. The style guide mandates all imports at the top of the file. Move these to the top-level import section.

romanlutz · 2026-04-24T06:02:15Z

+)
+from pyrit.prompt_normalizer import PromptNormalizer
+from pyrit.prompt_target import PromptChatTarget, PromptTarget
+from pyrit.prompt_target.common.prompt_chat_target import PromptChatTarget


🟠 Duplicate import

PromptChatTarget is already imported on line 52 via from pyrit.prompt_target import PromptChatTarget, PromptTarget. This specific-path import is redundant and violates the style guide's import rules (prefer the higher-level module import). Remove this line.

romanlutz · 2026-04-24T06:02:15Z

+    "skyscrapers",
+    "logic_grid_enhanced",
+    "skyscrapers_memetic",
+    "test",


🟠 test in default puzzle types — This entry uses a softened template without the adversarial framing of the real puzzle types. Since SUPPORTED_PUZZLE_TYPES is the default list, test will be included in production runs. Consider removing it from this list (keep the YAML template for testing, but don't include it in defaults).

romanlutz · 2026-04-24T06:02:15Z

+    last_target_response: str = ""
+
+
+class CoTHijackingAttack(MultiTurnAttackStrategy[CoTHijackingAttackContext, AttackResult]):


🟠 Missing _build_identifier() override — Other multi-turn attacks (e.g. CrescendoAttack, TAPAttack) override _build_identifier() to include behavioral parameters like max_iterations, puzzle_types, etc. in the component identifier. This is important for experiment tracking and memory labeling.

romanlutz · 2026-04-24T06:02:15Z

+                continue
+
+            # Store response text for next iteration's feedback
+            context.last_target_response = str(response.get_value()) if response else ""


🟡 No CoT length tracking — The paper's key mechanistic insight is that CoT length determines attack success. The reference feeds STEP NUMBER (reasoning steps count) back to the attacker model as a critical feedback signal. Consider extracting reasoning token count from the response (if available from the target API) and including it in the feedback loop.

romanlutz · 2026-04-24T06:02:15Z

+            # TrueFalseScorer from config — use directly, no wrapping needed
+            self._objective_scorer = attack_scoring_config.objective_scorer
+        else:
+            self._objective_scorer = None


🟡 No scorer = silent all-iterations failure — If no scorer is configured (the default path), _score_response_async returns None every iteration, so the attack always exhausts max_iterations and returns FAILURE with no early stopping. Consider either requiring a scorer (raise ValueError) or at least logging a warning here that scoring is disabled.

romanlutz · 2026-04-24T06:02:15Z

+
+"""
+Unit tests for CoT Hijacking Attack implementation.
+"""


🔵 Tests only exercise the loop skeleton — All 30 tests mock the three core methods (_generate_attack_prompt_async, _send_prompt_to_target_async, _score_response_async). This is good for verifying the iteration/scoring/early-exit structure, but there's no coverage of:

Template rendering via Jinja2 (would catch the broken JSON format issue)

JSON parsing and fallback behavior when the adversarial model returns malformed output

The actual _generate_attack_prompt_async pipeline with a mocked adversarial model response

Edge case: adversarial model returns valid JSON but without the expected prompt key

Consider adding at least one test that exercises _generate_attack_prompt_async end-to-end with a mocked PromptNormalizer.send_prompt_async returning a controlled JSON string.

Copilot AI review requested due to automatic review settings March 5, 2026 01:05

Copilot started reviewing on behalf of riyosha March 5, 2026 01:06 View session

Copilot AI reviewed Mar 5, 2026

View reviewed changes

romanlutz reviewed Mar 6, 2026

View reviewed changes

romanlutz force-pushed the h-cot branch from 4664680 to 2e01ced Compare April 24, 2026 05:51

romanlutz requested changes Apr 24, 2026

View reviewed changes


		# Step 2: send meta-prompt to adversarial model to generate the actual jailbreak prompt

		message = Message.from_prompt(prompt=meta_prompt, role="user")

		last_target_response: str = ""


		class CoTHijackingAttack(MultiTurnAttackStrategy[CoTHijackingAttackContext, AttackResult]):

Conversation

riyosha commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Tests and Documentation

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

romanlutz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

riyosha commented Mar 24, 2026

Uh oh!

romanlutz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

riyosha commented Mar 5, 2026 •

edited

Loading