Skip to content

[DRAFT] FEAT: Chain-of-Thought Hijacking Attack Strategy and Test Coverage#1438

Open
riyosha wants to merge 1 commit intomicrosoft:mainfrom
riyosha:h-cot
Open

[DRAFT] FEAT: Chain-of-Thought Hijacking Attack Strategy and Test Coverage#1438
riyosha wants to merge 1 commit intomicrosoft:mainfrom
riyosha:h-cot

Conversation

@riyosha
Copy link
Copy Markdown
Contributor

@riyosha riyosha commented Mar 5, 2026

Description

[DRAFT] FEAT: Chain-of-Thought Hijacking Attack Strategy and Test Coverage

Related to issue #897

This PR introduces the Chain-of-Thought (CoT) Hijacking attack strategy, as described in Zhao et al. (2025). The changes include:

  • ADDED: Implementation of the CoT Hijacking attack strategy - pyrit/executor/attack/multi_turn/cot_hijacking.py
  • ADDED: YAML prompt templates for 6 puzzle types from the paper - pyrit/datasets/executors/cot_hijacking/puzzle_generation_{puzzle_type}.yaml
  • ADDED: Unit tests for CoT Hijacking attack - tests/unit/executor/attack/multi_turn/test_cot_hijacking.py

Related issues: #897


Tests and Documentation

  • Added unit tests in tests/unit/executor/attack/multi_turn/test_cot_hijacking.py
  • Tested the attack with local llama3:8b as the target model and attacker model as mistral:7b. (These LLMs lack advanced reasoning; suggestions for better, locally accessible LRM models are welcome!)

This is a draft PR and I want to get your thoughts on the implementation so far. I have planned these updates:

  • Currently I'm relying on the _fallback_score_response function to use pattern matching for generating a score. I want to replace this with either another auxiliary model as a scorer or use float scale scoring using Azure Content Safety API.
  • Currently the iterative feedback given to the attacker model ( _generate_attack_prompt_async) only includes the harm score and a static prompt to make the puzzle more complex. I'll update it to include the target's previous safe response as well.

Question:

  • I noticed a few other multi_attack strategies define async def _teardown_async even if unused. Should I also add it?

Copilot AI review requested due to automatic review settings March 5, 2026 01:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown
Contributor

@romanlutz romanlutz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really good! While reading it, I couldn't shake the feeling that this is very similar to RedTeamingAttack with the big difference that it cycles through the system prompt templates, of course. I haven't had time to compare with it in detail to see if that would be doable. My hunch is that it would introduce considerable complexity and is probably not worth it but I'd like to be sure...

Other things:

  • needs mentioning in api.rst
  • needs example notebook (both ipynb and py files) somewhere in doc/executor/attack, which in turn needs to be mentioned in TOC file. Example notebook doesn't need to be elaborate.
  • needs integration test, perhaps just one that runs the example notebook. This may be auto-created by test_executor_notebooks.py I think...

Comment thread pyrit/executor/attack/multi_turn/cot_hijacking.py Outdated
def __init__(
self,
*,
objective_target: PromptTarget = REQUIRED_VALUE, # type: ignore[assignment]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most likely, this assumes we're dealing with a target that has reasoning capabilities, right? @hannahwestra25 is currently working on expanding TargetCapabilities so that could come in handy here for validation purposes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TargetCapabilities doesn't including reasoning models yet. I commented out a note to add this when it has been included!

Comment thread pyrit/executor/attack/multi_turn/cot_hijacking.py Outdated
Comment thread pyrit/executor/attack/multi_turn/cot_hijacking.py Outdated
Comment thread pyrit/executor/attack/multi_turn/cot_hijacking.py Outdated
@riyosha
Copy link
Copy Markdown
Contributor Author

riyosha commented Mar 24, 2026

@romanlutz

Hi, I incorporated the feedback on the PR and added test notebooks as well. I used the Open AI Chat Model, and in order to bypass the safety guardrails for the testing notebooks, I used a non-harmful objective and prompt template.

Also, CoTHijackingAttack supports two scorer types:

  • TrueFalseScorer - passed via attack_scoring_config. Returns a boolean directly, used as-is for success determination.
  • FloatScaleScorer - passed via float_scale_scorer. Wrapped internally in FloatScaleThresholdScorer with the configurable success_threshold parameter, converting the float score to a boolean.

Both paths converge to the same success check: bool(score_obj.get_value()), keeping _perform_async scorer-agnostic.

Let me know what you think, thanks!

Isolated the 13 CoT-hijacking-specific files from riyosha/h-cot branch
onto a clean main base for reviewability. Original PR had 473 changed
files due to branch drift. Also added myst.yml entry for the doc notebook.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@romanlutz romanlutz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this, @riyosha! Implementing CoT Hijacking from Zhao et al. (2025) is a valuable addition to PyRIT's attack library. I've done a thorough read of the implementation against the reference paper and code. Feedback below as inline comments, organized by severity.

Note: I cleaned up the branch to isolate the 14 CoT-hijacking-specific files (it previously had 473 changed files from branch drift). The original work and authorship are preserved.

Summary: The structure follows MultiTurnAttackStrategy patterns well. Priority fixes: (1) JSON format mismatch across templates, (2) attacker conversation history. Parallel streams could be a documented follow-up.

parsed = json.loads(response_text)

# Step 3: extract the jailbreak prompt P from the JSON
return str(parsed.get("prompt", meta_prompt))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Broken JSON parsing for 2 of 6 real puzzle types

parsed.get("prompt", meta_prompt) assumes every template tells the attacker to return a prompt key, but:

  • logic_grid.yaml and sudoku.yaml instruct the attacker to return {puzzle_framework, operation_mapping, practical_example_request} — there is no prompt key.
  • category_theory.yaml / test.yaml{prompt}
  • logic_grid_enhanced.yaml / skyscrapers_memetic.yaml{improvement, prompt}

For logic_grid and sudoku, this always falls back to the raw meta-prompt (the instruction to the attacker, not a crafted adversarial prompt from the attacker). These two puzzle types are silently broken.

Fix: Either standardize all templates to include a prompt key in their FORMAT section, or update the parsing to handle the different response schemas (e.g., compose a prompt from the three fields).


# Step 2: send meta-prompt to adversarial model to generate the actual jailbreak prompt

message = Message.from_prompt(prompt=meta_prompt, role="user")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 No conversation history for the attacker model

The paper's reference implementation maintains a multi-turn conversation with the attacker LLM, keeping the last N turns (keep_last_n=3) so it can refine based on accumulated history. See core/workflow.py:
python for conv in convs_list: conv.messages = conv.messages[-2 * args.keep_last_n:] ``n Here, a fresh single-turn message is created each iteration. The attacker can't learn from prior attempts beyond the raw template variables. The enhanced templates (logic_grid_enhanced, skyscrapers_memetic`) even explicitly say 'The user output you receive is composed of four parts, LANGUAGE MODEL OUTPUT, OBJECTIVE, SCORE and STEP NUMBER' — assuming multi-turn context exists.

Fix: Maintain a separate conversation context for the adversarial model and append each round's feedback as conversation turns, similar to how CrescendoAttack or RedTeamingAttack manage adversarial conversations.


for iteration in range(self._max_iterations):
context.iteration = iteration + 1
puzzle_type = self._puzzle_types[iteration % len(self._puzzle_types)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Missing parallel streams (architectural gap vs. paper)

The reference runs 6 concurrent streams (n_streams=6) with different puzzle-type system prompts, taking the best result per iteration. This runs a single sequential stream, cycling puzzle types one at a time. This is the paper's core strategy — diversity across puzzle types in parallel, not sequentially.

Suggestion: At minimum, document this as a known simplification. Ideally, add a parameter (e.g., n_streams: int = 1) with best-of-N selection per iteration.

# Step 1: render the meta-prompt for the adversarial model
prompt_path = PUZZLE_PROMPT_PATHS[puzzle_type]
import yaml
from jinja2 import Environment
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Inline imports — style guide violation

yaml and jinja2.Environment are imported inside a method body. The style guide mandates all imports at the top of the file. Move these to the top-level import section.

)
from pyrit.prompt_normalizer import PromptNormalizer
from pyrit.prompt_target import PromptChatTarget, PromptTarget
from pyrit.prompt_target.common.prompt_chat_target import PromptChatTarget
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Duplicate import

PromptChatTarget is already imported on line 52 via from pyrit.prompt_target import PromptChatTarget, PromptTarget. This specific-path import is redundant and violates the style guide's import rules (prefer the higher-level module import). Remove this line.

"skyscrapers",
"logic_grid_enhanced",
"skyscrapers_memetic",
"test",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 test in default puzzle types — This entry uses a softened template without the adversarial framing of the real puzzle types. Since SUPPORTED_PUZZLE_TYPES is the default list, test will be included in production runs. Consider removing it from this list (keep the YAML template for testing, but don't include it in defaults).

last_target_response: str = ""


class CoTHijackingAttack(MultiTurnAttackStrategy[CoTHijackingAttackContext, AttackResult]):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Missing _build_identifier() override — Other multi-turn attacks (e.g. CrescendoAttack, TAPAttack) override _build_identifier() to include behavioral parameters like max_iterations, puzzle_types, etc. in the component identifier. This is important for experiment tracking and memory labeling.

continue

# Store response text for next iteration's feedback
context.last_target_response = str(response.get_value()) if response else ""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 No CoT length tracking — The paper's key mechanistic insight is that CoT length determines attack success. The reference feeds STEP NUMBER (reasoning steps count) back to the attacker model as a critical feedback signal. Consider extracting reasoning token count from the response (if available from the target API) and including it in the feedback loop.

# TrueFalseScorer from config — use directly, no wrapping needed
self._objective_scorer = attack_scoring_config.objective_scorer
else:
self._objective_scorer = None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 No scorer = silent all-iterations failure — If no scorer is configured (the default path), _score_response_async returns None every iteration, so the attack always exhausts max_iterations and returns FAILURE with no early stopping. Consider either requiring a scorer (raise ValueError) or at least logging a warning here that scoring is disabled.


"""
Unit tests for CoT Hijacking Attack implementation.
"""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Tests only exercise the loop skeleton — All 30 tests mock the three core methods (_generate_attack_prompt_async, _send_prompt_to_target_async, _score_response_async). This is good for verifying the iteration/scoring/early-exit structure, but there's no coverage of:

  • Template rendering via Jinja2 (would catch the broken JSON format issue)
  • JSON parsing and fallback behavior when the adversarial model returns malformed output
  • The actual _generate_attack_prompt_async pipeline with a mocked adversarial model response
  • Edge case: adversarial model returns valid JSON but without the expected prompt key

Consider adding at least one test that exercises _generate_attack_prompt_async end-to-end with a mocked PromptNormalizer.send_prompt_async returning a controlled JSON string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants