[DRAFT] FEAT: Chain-of-Thought Hijacking Attack Strategy and Test Coverage #1438
riyosha wants to merge 1 commit into microsoft:main from riyosha/h-cot
Conversation
romanlutz left a comment
This is really good! While reading it, I couldn't shake the feeling that this is very similar to RedTeamingAttack with the big difference that it cycles through the system prompt templates, of course. I haven't had time to compare with it in detail to see if that would be doable. My hunch is that it would introduce considerable complexity and is probably not worth it but I'd like to be sure...
Other things:
- needs mentioning in api.rst
- needs an example notebook (both .ipynb and .py files) somewhere in doc/executor/attack, which in turn needs to be mentioned in the TOC file. The example notebook doesn't need to be elaborate.
- needs an integration test, perhaps just one that runs the example notebook. I think this may be auto-created by test_executor_notebooks.py...
```python
def __init__(
    self,
    *,
    objective_target: PromptTarget = REQUIRED_VALUE,  # type: ignore[assignment]
```
Most likely, this assumes we're dealing with a target that has reasoning capabilities, right? @hannahwestra25 is currently working on expanding TargetCapabilities so that could come in handy here for validation purposes.
TargetCapabilities doesn't include reasoning models yet. I left a commented-out note to add this once it has been included!
Hi, I incorporated the feedback on the PR and added test notebooks as well. I used the OpenAI Chat Model, and in order to bypass the safety guardrails for the testing notebooks, I used a non-harmful objective and prompt template. Also, CoTHijackingAttack supports two scorer types; both paths converge to the same success check, `bool(score_obj.get_value())`, keeping `_perform_async` scorer-agnostic. Let me know what you think, thanks!
Isolated the 13 CoT-hijacking-specific files from riyosha/h-cot branch onto a clean main base for reviewability. Original PR had 473 changed files due to branch drift. Also added myst.yml entry for the doc notebook. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
romanlutz left a comment
Thanks for working on this, @riyosha! Implementing CoT Hijacking from Zhao et al. (2025) is a valuable addition to PyRIT's attack library. I've done a thorough read of the implementation against the reference paper and code. Feedback below as inline comments, organized by severity.
Note: I cleaned up the branch to isolate the 14 CoT-hijacking-specific files (it previously had 473 changed files from branch drift). The original work and authorship are preserved.
Summary: The structure follows MultiTurnAttackStrategy patterns well. Priority fixes: (1) JSON format mismatch across templates, (2) attacker conversation history. Parallel streams could be a documented follow-up.
```python
parsed = json.loads(response_text)

# Step 3: extract the jailbreak prompt P from the JSON
return str(parsed.get("prompt", meta_prompt))
```
🔴 Broken JSON parsing for 2 of 6 real puzzle types
parsed.get("prompt", meta_prompt) assumes every template tells the attacker to return a prompt key, but:
logic_grid.yamlandsudoku.yamlinstruct the attacker to return{puzzle_framework, operation_mapping, practical_example_request}— there is nopromptkey.category_theory.yaml/test.yaml→{prompt}✅logic_grid_enhanced.yaml/skyscrapers_memetic.yaml→{improvement, prompt}✅
For `logic_grid` and `sudoku`, this always falls back to the raw meta-prompt (the instruction to the attacker, not a crafted adversarial prompt from the attacker). These two puzzle types are silently broken.
Fix: Either standardize all templates to include a prompt key in their FORMAT section, or update the parsing to handle the different response schemas (e.g., compose a prompt from the three fields).
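A sketch of the second option, assuming the field names above; the helper name and the newline-joined composition are illustrative, not from the PR:

```python
import json


def extract_attack_prompt(response_text: str, meta_prompt: str) -> str:
    """Compose the jailbreak prompt from whichever schema the attacker returned.

    Sketch only: handles both the {"prompt": ...} schema and the
    logic_grid/sudoku three-field schema; falls back to the meta-prompt
    on anything else instead of silently returning the wrong text.
    """
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        return meta_prompt
    if not isinstance(parsed, dict):
        return meta_prompt

    if "prompt" in parsed:
        return str(parsed["prompt"])

    # logic_grid / sudoku schema: stitch the three fields into one prompt
    three_fields = ("puzzle_framework", "operation_mapping", "practical_example_request")
    if all(field in parsed for field in three_fields):
        return "\n\n".join(str(parsed[field]) for field in three_fields)

    return meta_prompt
```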
```python
# Step 2: send meta-prompt to adversarial model to generate the actual jailbreak prompt

message = Message.from_prompt(prompt=meta_prompt, role="user")
```
🔴 No conversation history for the attacker model
The paper's reference implementation maintains a multi-turn conversation with the attacker LLM, keeping the last N turns (`keep_last_n=3`) so it can refine based on accumulated history. See core/workflow.py:

```python
for conv in convs_list:
    conv.messages = conv.messages[-2 * args.keep_last_n:]
```

Here, a fresh single-turn message is created each iteration. The attacker can't learn from prior attempts beyond the raw template variables. The enhanced templates (`logic_grid_enhanced`, `skyscrapers_memetic`) even explicitly say 'The user output you receive is composed of four parts, LANGUAGE MODEL OUTPUT, OBJECTIVE, SCORE and STEP NUMBER' — assuming multi-turn context exists.
Fix: Maintain a separate conversation context for the adversarial model and append each round's feedback as conversation turns, similar to how CrescendoAttack or RedTeamingAttack manage adversarial conversations.
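A rough sketch of the rolling-history mechanics, assuming the feedback fields the enhanced templates expect; the helper names, the plain-dict message format, and `context.last_score`/`context.objective` are illustrative, not PyRIT APIs:

```python
KEEP_LAST_N = 3  # mirrors the reference implementation's keep_last_n


def build_attacker_feedback(context) -> str:
    """Format the previous round's outcome as the attacker's next user turn."""
    return (
        f"LANGUAGE MODEL OUTPUT: {context.last_target_response}\n"
        f"OBJECTIVE: {context.objective}\n"
        f"SCORE: {context.last_score}\n"
        f"STEP NUMBER: {context.iteration}"
    )


def append_and_truncate(history: list[dict], role: str, content: str) -> None:
    """Append one turn, then keep only the last 2 * KEEP_LAST_N messages."""
    history.append({"role": role, "content": content})
    del history[: max(0, len(history) - 2 * KEEP_LAST_N)]
```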
```python
for iteration in range(self._max_iterations):
    context.iteration = iteration + 1
    puzzle_type = self._puzzle_types[iteration % len(self._puzzle_types)]
```
🔴 Missing parallel streams (architectural gap vs. paper)
The reference runs 6 concurrent streams (`n_streams=6`) with different puzzle-type system prompts, taking the best result per iteration. This implementation runs a single sequential stream, cycling puzzle types one at a time. Parallel diversity across puzzle types is the paper's core strategy; applying them one at a time is not the same thing.
Suggestion: At minimum, document this as a known simplification. Ideally, add a parameter (e.g., n_streams: int = 1) with best-of-N selection per iteration.
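A minimal sketch of the best-of-N variant; `_run_single_stream_async` and the numeric `.score` attribute are hypothetical, since the PR has no per-stream helper yet:

```python
import asyncio


async def run_streams_best_of_n(attack, context, n_streams: int = 6):
    """Sketch: one concurrent stream per puzzle type, best result per iteration.

    Assumes a hypothetical per-stream helper (_run_single_stream_async)
    returning an object with a numeric `score` attribute.
    """
    puzzle_types = attack._puzzle_types[:n_streams]
    results = await asyncio.gather(
        *(attack._run_single_stream_async(context, p) for p in puzzle_types)
    )
    return max(results, key=lambda r: r.score)
```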
```python
# Step 1: render the meta-prompt for the adversarial model
prompt_path = PUZZLE_PROMPT_PATHS[puzzle_type]
import yaml
from jinja2 import Environment
```
🟠 Inline imports — style guide violation
`yaml` and `jinja2.Environment` are imported inside a method body. The style guide mandates all imports at the top of the file. Move these to the top-level import section.
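I.e., hoisted to module scope:

```python
# At the top of cot_hijacking.py, with the other module-level imports:
import yaml
from jinja2 import Environment
```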
```python
)
from pyrit.prompt_normalizer import PromptNormalizer
from pyrit.prompt_target import PromptChatTarget, PromptTarget
from pyrit.prompt_target.common.prompt_chat_target import PromptChatTarget
```
🟠 Duplicate import
`PromptChatTarget` is already imported on line 52 via `from pyrit.prompt_target import PromptChatTarget, PromptTarget`. This specific-path import is redundant and violates the style guide's import rules (prefer the higher-level module import). Remove this line.
| "skyscrapers", | ||
| "logic_grid_enhanced", | ||
| "skyscrapers_memetic", | ||
| "test", |
🟠 `test` in default puzzle types — This entry uses a softened template without the adversarial framing of the real puzzle types. Since `SUPPORTED_PUZZLE_TYPES` is the default list, `test` will be included in production runs. Consider removing it from this list (keep the YAML template for testing, but don't include it in defaults).
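For example, assuming the six real puzzle types named elsewhere in this review:

```python
# Suggested default: the six real puzzle types, with "test" left out
SUPPORTED_PUZZLE_TYPES = [
    "category_theory",
    "logic_grid",
    "sudoku",
    "skyscrapers",
    "logic_grid_enhanced",
    "skyscrapers_memetic",
]
```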
| last_target_response: str = "" | ||
|
|
||
|
|
||
| class CoTHijackingAttack(MultiTurnAttackStrategy[CoTHijackingAttackContext, AttackResult]): |
🟠 Missing `_build_identifier()` override — Other multi-turn attacks (e.g. CrescendoAttack, TAPAttack) override `_build_identifier()` to include behavioral parameters like `max_iterations`, `puzzle_types`, etc. in the component identifier. This is important for experiment tracking and memory labeling.
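A possible override; the base-class signature and dict shape are assumed from the reviewer's description of CrescendoAttack/TAPAttack, not verified against PyRIT:

```python
def _build_identifier(self) -> dict:
    """Sketch: fold behavioral parameters into the component identifier."""
    identifier = super()._build_identifier()
    identifier.update(
        {
            "max_iterations": self._max_iterations,
            "puzzle_types": list(self._puzzle_types),
        }
    )
    return identifier
```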
```python
continue

# Store response text for next iteration's feedback
context.last_target_response = str(response.get_value()) if response else ""
```
🟡 No CoT length tracking — The paper's key mechanistic insight is that CoT length determines attack success. The reference feeds STEP NUMBER (reasoning steps count) back to the attacker model as a critical feedback signal. Consider extracting reasoning token count from the response (if available from the target API) and including it in the feedback loop.
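If the target API doesn't expose reasoning token usage, a crude text-based proxy could still feed the STEP NUMBER signal back. A heuristic sketch only; the regex and helper name are illustrative:

```python
import re


def count_reasoning_steps(response_text: str) -> int:
    """Heuristic proxy for CoT length when no reasoning-token usage is available.

    Counts "Step N" headers and numbered-list markers in the visible output;
    a reasoning-token count from the target API, if available, is preferable.
    """
    return len(re.findall(r"(?mi)^(?:step\s+\d+|\d+[.)])", response_text))
```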
```python
# TrueFalseScorer from config — use directly, no wrapping needed
self._objective_scorer = attack_scoring_config.objective_scorer
else:
    self._objective_scorer = None
```
🟡 No scorer = silent all-iterations failure — If no scorer is configured (the default path), _score_response_async returns None every iteration, so the attack always exhausts max_iterations and returns FAILURE with no early stopping. Consider either requiring a scorer (raise ValueError) or at least logging a warning here that scoring is disabled.
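A sketch of the warning variant, with the constructor shape simplified to show only the scorer branch:

```python
import logging

logger = logging.getLogger(__name__)


class CoTHijackingAttack:  # excerpt; the rest of the class is elided
    def __init__(self, *, attack_scoring_config=None):
        if attack_scoring_config is not None and attack_scoring_config.objective_scorer:
            self._objective_scorer = attack_scoring_config.objective_scorer
        else:
            # Alternative: raise ValueError("CoTHijackingAttack requires an objective scorer")
            self._objective_scorer = None
            logger.warning(
                "No objective scorer configured; the attack will exhaust "
                "max_iterations and always return FAILURE with no early stopping."
            )
```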
```python
"""
Unit tests for CoT Hijacking Attack implementation.
"""
```
🔵 Tests only exercise the loop skeleton — All 30 tests mock the three core methods (`_generate_attack_prompt_async`, `_send_prompt_to_target_async`, `_score_response_async`). This is good for verifying the iteration/scoring/early-exit structure, but there's no coverage of:
- Template rendering via Jinja2 (would catch the broken JSON format issue)
- JSON parsing and fallback behavior when the adversarial model returns malformed output
- The actual `_generate_attack_prompt_async` pipeline with a mocked adversarial model response
- Edge case: adversarial model returns valid JSON but without the expected `prompt` key
Consider adding at least one test that exercises `_generate_attack_prompt_async` end-to-end with a mocked `PromptNormalizer.send_prompt_async` returning a controlled JSON string.
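Something along these lines; the `attack`/`context` fixtures and the `_prompt_normalizer` attribute name are assumptions about the PR's internals:

```python
import json
from unittest.mock import AsyncMock, MagicMock

import pytest

# Sketch only: assumes pytest-asyncio plus `attack` and `context` fixtures
# that build a CoTHijackingAttack and its context; not verified against the PR.


@pytest.mark.asyncio
async def test_generate_attack_prompt_parses_prompt_key(attack, context):
    response = MagicMock()
    response.get_value.return_value = json.dumps({"prompt": "crafted adversarial prompt"})
    attack._prompt_normalizer.send_prompt_async = AsyncMock(return_value=response)

    result = await attack._generate_attack_prompt_async(context)

    assert result == "crafted adversarial prompt"
```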
Description
Related to issue #897
This PR introduces the Chain-of-Thought (CoT) Hijacking attack strategy, as described in Zhao et al. (2025). The changes include:
- `pyrit/executor/attack/multi_turn/cot_hijacking.py`
- `pyrit/datasets/executors/cot_hijacking/puzzle_generation_{puzzle_type}.yaml`
- `tests/unit/executor/attack/multi_turn/test_cot_hijacking.py`
Tests and Documentation
`tests/unit/executor/attack/multi_turn/test_cot_hijacking.py`

This is a draft PR and I want to get your thoughts on the implementation so far. I have planned these updates:
Question:
Other attack strategies implement `async def _teardown_async` even if unused. Should I also add it?