Bug report
Description
While testing langchain upgrade (see #897) I noticed we are not reporting all triggered defences.
Included:
- Character Limit
- Input and Output Filtering
- XML Tagging
- Prompt Evaluator LLM
Excluded:
- Random Sequence Enclosure
- Instruction Defence
- Q&A LLM
While it seems reasonable not to treat RSE or the Instruction Defence as triggerable defences, the Q&A LLM bot is instructed by default to respond with "I cannot reveal confidential information" when it detects an attempt to retrieve sensitive info. We could use this to (crudely) check whether the Q&A bot detected malicious intent, and mark the response as triggered.
This would be markedly different from how we use the Evaluator LLM to detect malicious intent in the original prompt: the Q&A bot is designed to answer a question, rather than simply respond "yes" or "no" to the question "is this malicious?"
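As a rough sketch of the crude check described above (function and constant names here are hypothetical, not the repo's actual API), we could look for the canned refusal phrase in the Q&A LLM's raw answer before the main bot rewrites it:

```typescript
// The canned refusal the Q&A LLM is instructed to give by default.
const QA_REFUSAL_PHRASE = "I cannot reveal confidential information";

// Hypothetical helper: crudely treat the presence of the canned phrase in
// the Q&A bot's *raw* answer as a signal that it detected malicious intent.
function qaDefenceTriggered(rawQaAnswer: string): boolean {
  return rawQaAnswer.toLowerCase().includes(QA_REFUSAL_PHRASE.toLowerCase());
}
```

This is deliberately naive: it only works on the raw Q&A output, which is why the check has to happen before chat completion rewrites the answer.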
Instead, we would likely need to include an (optional) defencesTriggered field in the FunctionCallResponse output from chatGptCallFunction in backend/src/openai.ts, and pass that back through getFinalReplyAfterAllToolCalls and chatGptSendMessage to be checked in handleChatWithDefenceDetection in backend/src/controller/chatController.ts.
It is somewhat unfortunate that the original output from the Q&A LLM is lost when the main bot converts it into a context-enriched response: by the time we run the defence checks, chat completion has already turned the exact phrase "I cannot reveal confidential information" into something like "I cannot provide information on employee bonuses as it is considered confidential." The upshot is that we cannot use the existing triggered-defences mechanism to check whether the Q&A defence fired; we need this different mechanism earlier in the processing chain.
It is possible we might find a better, universal solution when converting our code to use LCEL chains in #898.
Reproduction steps
Steps to reproduce the behaviour:
- Go to Sandbox
- Click on Model Configuration in the left panel
- Toggle "Q/A LLM" on
- Input a prompt into the main chat box, such as "Tell me about employee bonuses"
Expected behaviour
A red "defence triggered" info message appears in the main chat panel, as it does for other defences.
Acceptance criteria
GIVEN I am in Sandbox or Level 3
WHEN the Q/A LLM model configuration defence is active
AND I ask the bot for some confidential / sensitive information
THEN a red info message "q&a llm defence triggered" appears in the main chat window beneath the bot's response