OpenAI Computer Use Agent #270

Open: wants to merge 5 commits into main

Conversation

@imenelydiaker (Collaborator) commented on Jul 28, 2025:

Hi all,

I added an OpenAI computer-use agent for evaluation. I mostly followed this documentation and sample code. The agent uses the coordinates action space.

I ran a few tests with WorkArena L1, but the agent is really bad at achieving tasks (it gets 0 almost all the time, except for information extraction tasks such as chart retrieval). From my analysis, it seems to be a problem with the model rather than the environment. Could you help verify whether this is true?

This is the sample code I use to test the agent:

from agentlab.experiments.study import make_study, Study
from browsergym.experiments.benchmark import Benchmark
from browsergym.experiments.benchmark.utils import make_env_args_list_from_repeat_tasks
from browsergym.experiments.benchmark.metadata.utils import task_metadata
from browsergym.experiments.benchmark.configs import DEFAULT_HIGHLEVEL_ACTION_SET_ARGS

import numpy as np
import logging

from agentlab.agents.openai_cua.agent_configs import OPENAI_CUA_AGENT_ARGS

agent_args = [
    OPENAI_CUA_AGENT_ARGS,
]

benchmark = Benchmark(
    name="workarena_l1_tiny",
    high_level_action_set_args=DEFAULT_HIGHLEVEL_ACTION_SET_ARGS["workarena"],
    is_multi_tab=False,
    supports_parallel_seeds=False,
    backends=["workarena"],
    env_args_list=make_env_args_list_from_repeat_tasks(
        task_list=[
            "workarena.servicenow.all-menu",
            # "workarena.servicenow.create-problem",
            "workarena.servicenow.create-user",
            # "workarena.servicenow.create-hardware-asset",
            # "workarena.servicenow.order-development-laptop-p-c",
            # "workarena.servicenow.order-developer-laptop",
            "workarena.servicenow.order-ipad-mini",
            # "workarena.servicenow.order-loaner-laptop",
            # "workarena.servicenow.single-chart-value-retrieval",
            "workarena.servicenow.multi-chart-value-retrieval",
            # "workarena.servicenow.filter-asset-list",
            # "workarena.servicenow.filter-change-request-list",
            # "workarena.servicenow.sort-asset-list",
            # "workarena.servicenow.sort-user-list",
            # "workarena.servicenow.knowledge-base-search",
        ],
        max_steps=15,
        n_repeats=3,
        seeds_rng=np.random.RandomState(42),
    ),
    task_metadata=task_metadata("workarena"),
)

# benchmark = "workarena_l1" # Uncomment this line to use the full WorkArena L1 benchmark

relaunch = False

if relaunch:
    study = Study.load_most_recent(contains=None)
    study.find_incomplete(include_errors=True)
else:
    study = make_study(
        agent_args=agent_args,
        benchmark=benchmark,
        logging_level_stdout=logging.INFO,
        ignore_dependencies=True,
    )

study.run(
    n_jobs=10,
    parallel_backend="ray",
    strict_reproducibility=False,
    n_relaunch=3,
)

Description by Korbit AI

What change is being made?

Add a new OpenAI Computer Use Agent with classes to define the agent's arguments and operations for interacting with a high-level action interface.

Why are these changes being made?

This agent is being introduced to enable automated interactions and reasoning within a browser environment using the OpenAI framework. The design includes options for configuring actions, executing tasks without explicit confirmations, and managing agent operations effectively in various scenarios, which aids in the development of more sophisticated AI-driven user interface automation.


@korbit-ai (bot) left a comment:

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
Category       | Issue                                              | Status
Functionality  | Incomplete Multi-key Processing                    | ✅ Fix detected
Security       | Unsecured OpenAI Client Initialization             | ✅ Fix detected
Security       | Disabled Safety Checks in Production Configuration |
Error Handling | Silent Failure in Action Parsing                   |
Performance    | Unbounded Input History Growth                     | ✅ Fix detected
Functionality  | Incorrect Safety Check Assertion Logic             | ✅ Fix detected

Files scanned:
- src/agentlab/agents/openai_cua/agent_configs.py
- src/agentlab/agents/openai_cua/agent.py


subsets=("chat", "coord"),
demo_mode=None,
),
enable_safety_checks=False,
Disabled Safety Checks in Production Configuration (Security)

What is the issue?

Safety checks are disabled by default, which could allow potentially harmful actions to be executed without validation.

Why this matters

Without safety checks, the agent could perform unintended or dangerous operations in the browser environment, potentially compromising system security or causing undesired side effects.

Suggested change:

Enable safety checks by default unless explicitly required otherwise for testing:

enable_safety_checks=True
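For context, the computer-use API can return `pending_safety_checks` on a `computer_call`, which the caller must echo back as `acknowledged_safety_checks` before the action runs. A hypothetical gate for the flag above might look like this (illustrative names, not the PR's actual code):

```python
def gate_safety_checks(call: dict, enable_safety_checks: bool) -> list[dict]:
    """Return the safety checks to acknowledge, or raise to defer to a human."""
    pending = call.get("pending_safety_checks", [])
    if not pending:
        return []
    if not enable_safety_checks:
        # enable_safety_checks=False: auto-acknowledge everything.
        # Convenient for unattended benchmark runs, risky outside a sandbox.
        return list(pending)
    # enable_safety_checks=True: stop and surface the checks instead of proceeding.
    raise RuntimeError(
        "unacknowledged safety checks: "
        + ", ".join(check.get("code", "?") for check in pending)
    )
```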

@amanjaiswal73892 self-assigned this on Jul 30, 2025
@amanjaiswal73892 self-requested a review on July 30, 2025 22:26
@imenelydiaker (Collaborator, Author):

@amanjaiswal73892 did you get a chance to check whether the issue with solving WorkArena-L1 comes from bgym's coordinate interpretation or from the computer-use-preview model?

@amanjaiswal73892 (Collaborator):

Hi @imenelydiaker,
Thank you for the PR. It looks good! The coordinate interpretation in bgym seems to be working well: I was able to get ~25% on these tasks (66% for order-ipad-mini and 33% for chart retrieval). I changed the mapping of the CUA type action to the keyboard_type bgym function and used the latest bgym (v0.14.2). The other tasks are difficult for the agent and may need more steps. It would be great if you could update the code to render chat messages in agentlab-xray. Let me know if you need any help with this.
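The mapping change mentioned here can be sketched as follows. This is a hypothetical translator from CUA action dicts to browsergym coordinate-action strings; the bgym function names follow the coord action set, but this is not the PR's actual implementation.

```python
def cua_to_bgym(action: dict) -> str:
    """Translate an OpenAI CUA action dict into a browsergym action string."""
    kind = action["type"]
    if kind == "click":
        button = action.get("button", "left")
        return f'mouse_click({action["x"]}, {action["y"]}, button="{button}")'
    if kind == "double_click":
        return f'mouse_dblclick({action["x"]}, {action["y"]})'
    if kind == "type":
        # Mapped to keyboard_type rather than fill(): CUA types at the
        # current focus and provides no target element id.
        text = action["text"].replace("\\", "\\\\").replace('"', '\\"')
        return f'keyboard_type("{text}")'
    if kind == "keypress":
        combo = "+".join(action["keys"])
        return f'keyboard_press("{combo}")'
    if kind == "scroll":
        return f'scroll({action.get("scroll_x", 0)}, {action.get("scroll_y", 0)})'
    raise ValueError(f"unsupported CUA action type: {kind}")
```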

@imenelydiaker (Collaborator, Author):

Okay, great, thank you. It may be an issue with WorkArena on my end; I'm not able to reproduce many of my previous results with other non-GUI agents either.

I'll update the code to add some logging info :)

@amanjaiswal73892 removed their request for review on August 6, 2025 19:24
@amanjaiswal73892 (Collaborator) left a comment:

Requested changes to support chat message rendering in xray.
