-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Added max_images to Agents & AgentMemory. #1439
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
The max_images parameter solves the problem that images "build up" in step memory and get sent to the LLM, but many LLMs have a (fairly low) limit on the number of images they will accept. e.g. Claude Sonnet 3.5 v2 only accepts 20 images. If I pass in 5 images to an agent run that has a max of 10 steps it's likely that each generated step will include image observations and the run will quickly exceed the 20-image limit in the step->message conversion.
…g images from memory step history.
src/smolagents/agents.py
Outdated
@@ -255,6 +255,7 @@ class MultiStepAgent(ABC): | |||
Each function should: | |||
- Take the final answer and the agent's memory as arguments. | |||
- Return a boolean indicating whether the final answer is valid. | |||
max_images (`int`, *optional*): Maximum number of images to retain in the agent's memory. If `-1`, there is no limit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think making the default None
would be clearer when there's no limit! So the type would be int | None
src/smolagents/memory.py
Outdated
|
||
def reset(self): | ||
"""Reset the agent's memory, clearing all steps and keeping the system prompt.""" | ||
self.steps = [] | ||
|
||
def append(self, step: TaskStep | ActionStep | PlanningStep): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of not adding images to memory (with the risk to lose precious information), better remove this custom append
method.
To handle the max count of images, you could instead change agent.write_memory_to_messages
to just not add an image to the list of input messages being created if we're over the image threshold, or to remove it in postprocessing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I was under the misunderstanding that each Agent implementation had its own method of converting the memory state to messages. I'll re-write this as soon as I get the chance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mkrauklis, thank you for proposing this change, it makes a lot of sense!
Re: the implementation, I think it's better to keep all images in memory, and to modify write_memory_to_messages instead to cap image count.
Then we can merge!
…emented in MultiStepAgent.write_memory_to_messages.
If removing images removes all content add a new text content noting the images are removed to not leave content empty (which breaks things downstream).
Re-implemented in |
@@ -734,10 +738,62 @@ def write_memory_to_messages( | |||
Reads past llm_outputs, actions, and observations or errors from the memory into a series of messages | |||
that can be used as input to the LLM. Adds a number of keywords (such as PLAN, error, etc) to help | |||
the LLM. | |||
If max_images is set, it will limit the number of images in the messages keeping the newest images in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If max_images is set, it will limit the number of images in the messages keeping the newest images in | |
If max_images is set, only the last n images in memory will be written to the message list. |
@@ -734,10 +738,62 @@ def write_memory_to_messages( | |||
Reads past llm_outputs, actions, and observations or errors from the memory into a series of messages | |||
that can be used as input to the LLM. Adds a number of keywords (such as PLAN, error, etc) to help | |||
the LLM. | |||
If max_images is set, it will limit the number of images in the messages keeping the newest images in | |||
the message chain. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the message chain. |
""" | ||
messages = self.memory.system_prompt.to_messages(summary_mode=summary_mode) | ||
for memory_step in self.memory.steps: | ||
messages.extend(memory_step.to_messages(summary_mode=summary_mode)) | ||
|
||
if self.max_images is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be better to just do it in one loop: remove the images_to_remove variables, and just process the message list backwards starting from the end and keeping a counter, once this counter goes over self.max_images, you remove all the images that remain on your way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial implementation was a single backwards loop (as was the original implementation in memory), and it worked but was much more complicated and harder to conceptualize and therefore harder to maintain. I switched to this approach expecting the low cardinality of messages to mitigate any performance impact and placing precedence on maintainability.
If you really think it's worth reverting I'll do it, but know this was a conscious decision on my part. Let me know what you think.
hi @aymeric-roucher, I created a similar PR #1601 that is a working implementation for my specific use case, but I do believe it makes sense in the broader context too. Please have a look if you have some time. Thank you. |
The max_images parameter solves the problem that images "build up" in step memory and get sent to the LLM, but many LLMs have a (fairly low) limit on the number of images they will accept.
e.g. Claude Sonnet 3.5 v2 only accepts 20 images. If I pass in 5 images to an agent run that has a max of 10 steps it's likely that each generated step will include image observations and the run will quickly exceed the 20-image limit in the step->message conversion.