
QA LLM defence is not reported when triggered #899

@chriswilty

Bug report

Description

While testing the langchain upgrade (see #897), I noticed we are not reporting all triggered defences.

Included (reported when triggered):

  • Character Limit
  • Input and Output Filtering
  • XML Tagging
  • Prompt Evaluator LLM

Excluded (not reported):

  • Random Sequence Enclosure
  • Instruction Defence
  • Q&A LLM

While it seems to make sense not to include RSE or the Instruction Defence as triggerable defences, the Q&A LLM bot is instructed by default to respond with "I cannot reveal confidential information" when it detects an attempt to retrieve sensitive info. We could use this to (crudely) check whether the Q&A bot detected malicious intent, and mark the response as triggered.
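
As a rough illustration of that crude check (the names below are hypothetical, not taken from the codebase), matching the default refusal phrase in the Q&A LLM's raw output could look like this:

```ts
// Hypothetical sketch: detect a triggered Q&A LLM defence by checking its raw
// output for the default refusal phrase. Names here are illustrative only.
const QA_REFUSAL_PHRASE = 'I cannot reveal confidential information';

function qaLlmDefenceTriggered(qaLlmOutput: string): boolean {
  // Case-insensitive substring match, since the model may alter capitalisation
  return qaLlmOutput.toLowerCase().includes(QA_REFUSAL_PHRASE.toLowerCase());
}
```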

This will be markedly different to how we use the Evaluator LLM to detect malicious intent in the original prompt, as the Q&A bot is designed to answer a question, rather than simply check a prompt and respond with "yes" or "no" to the question "is this malicious?"

Instead, we would likely need to add an optional defencesTriggered field to the FunctionCallResponse output from chatGptCallFunction in backend/src/openai.ts, and pass it back through getFinalReplyAfterAllToolCalls and chatGptSendMessage so it can be checked in handleChatWithDefenceDetection in backend/src/controller/chatController.ts.

It is somewhat unfortunate that the original output from the Q&A LLM is lost when the main bot converts it into a context-enriched response. By the time we run the defence checks, chat completion has already turned the exact phrase "I cannot reveal confidential information" into something like "I cannot provide information on employee bonuses as it is considered confidential", so we cannot match on it. The upshot is that the existing triggered-defences mechanism cannot detect a Q&A defence trigger; we need this different mechanism earlier in the processing chain.
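
A rough sketch of that plumbing, assuming simplified types (the real FunctionCallResponse and the signatures in backend/src/openai.ts and chatController.ts will differ):

```ts
// Hypothetical sketch only; simplified stand-ins for the real types.
interface FunctionCallResponse {
  completion: string;
  defencesTriggered?: string[]; // proposed optional field, e.g. ['QA_LLM']
}

// In chatGptCallFunction: record the trigger while the raw Q&A LLM output is
// still available, before the main bot rewrites it into an enriched reply.
function buildFunctionCallResponse(qaLlmOutput: string): FunctionCallResponse {
  const triggered = qaLlmOutput.includes(
    'I cannot reveal confidential information'
  );
  return {
    completion: qaLlmOutput,
    defencesTriggered: triggered ? ['QA_LLM'] : undefined,
  };
}

// In handleChatWithDefenceDetection: merge defences reported by the tool-call
// path with those found by the existing detection mechanism.
function mergeTriggeredDefences(
  existing: string[],
  fromToolCalls: string[] = []
): string[] {
  return [...new Set([...existing, ...fromToolCalls])];
}
```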

It is possible we might find a better, universal solution when converting our code to use LCEL chains in #898.

Reproduction steps

Steps to reproduce the behaviour:

  1. Go to Sandbox
  2. Click on Model Configuration in the left panel
  3. Toggle "Q/A LLM" on
  4. Input a prompt into the main chat box, such as "Tell me about employee bonuses"

Expected behaviour

A red "defence triggered" info message appears in the main chat panel, as for other defences:

Image

Acceptance criteria

GIVEN I am in Sandbox or Level 3
WHEN the Q/A LLM model configuration defence is active
AND I ask the bot for some confidential / sensitive information
THEN a red info message "q&a llm defence triggered" appears in the main chat window beneath the bot's response

Labels

enhancement (New feature or request), triage (New tickets to be checked by the maintainers)
