Skip to content

feat(core,mcp,web-integration): enhance aiKeyboardPress to support key combinations #799

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 8 additions & 3 deletions apps/site/docs/en/API.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -162,16 +162,16 @@ await agent.aiInput('Hello World', 'The search input box');

### `agent.aiKeyboardPress()`

Press a keyboard key.
Press a keyboard key or key combination.

* Type

```typescript
function aiKeyboardPress(key: string, locate?: string, options?: Object): Promise<void>;
function aiKeyboardPress(key: string | string[], locate?: string, options?: Object): Promise<void>;
```

* Parameters:
* `key: string` - The web key to press, e.g. 'Enter', 'Tab', 'Escape', etc. Key Combination is not supported.
* `key: string | string[]` - The web key(s) to press. Can be a single key like 'Enter', 'Tab', 'Escape', or an array of keys for combinations like ['Ctrl', 'Shift'] or ['Meta', 'a'].
* `locate?: string` - Optional, a natural language description of the element to press the key on.
* `options?: Object` - Optional, a configuration object containing:
* `deepThink?: boolean` - If true, Midscene will call AI model twice to precisely locate the element.
Expand All @@ -183,7 +183,12 @@ function aiKeyboardPress(key: string, locate?: string, options?: Object): Promis
* Examples:

```typescript
// Single key press
await agent.aiKeyboardPress('Enter', 'The search input box');

// Key combinations
await agent.aiKeyboardPress(['Ctrl', 'a']); // Select all
await agent.aiKeyboardPress(['Ctrl', 'Shift', 'I']); // Open developer tools
```

### `agent.aiScroll()`
Expand Down
11 changes: 8 additions & 3 deletions apps/site/docs/zh/API.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -159,16 +159,16 @@ await agent.aiInput('Hello World', '搜索框');

### `agent.aiKeyboardPress()`

按下键盘上的某个键
按下键盘上的某个键或组合键

* 类型

```typescript
function aiKeyboardPress(key: string, locate?: string, options?: Object): Promise<void>;
function aiKeyboardPress(key: string | string[], locate?: string, options?: Object): Promise<void>;
```

* 参数:
* `key: string` - 要按下的键, `Enter`、`Tab`、`Escape` 等。不支持组合键
* `key: string | string[]` - 要按下的键,可以是单个键如 `Enter`、`Tab`、`Escape` 等,或者是组合键数组如 `['Ctrl', 'Shift']`
* `locate?: string` - 用自然语言描述的元素定位。
* `options?: Object` - 可选,一个配置对象,包含:
* `deepThink?: boolean` - 是否开启深度思考。如果为 true,Midscene 会调用 AI 模型两次以精确定位元素。
Expand All @@ -180,7 +180,12 @@ function aiKeyboardPress(key: string, locate?: string, options?: Object): Promis
* 示例:

```typescript
// 单个按键
await agent.aiKeyboardPress('Enter', '搜索框');

// 组合键
await agent.aiKeyboardPress(['Ctrl', 'a']); // 全选
await agent.aiKeyboardPress(['Ctrl', 'Shift', 'I']); // 打开开发者工具
```

### `agent.aiScroll()`
Expand Down
7 changes: 4 additions & 3 deletions packages/core/src/ai-model/common.ts
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,8 @@ import type {
ElementTreeNode,
MidsceneYamlFlowItem,
PlanningAction,
PlanningActionParamInputOrKeyPress,
PlanningActionParamInput,
PlanningActionParamKeyPress,
PlanningActionParamScroll,
PlanningActionParamSleep,
Rect,
Expand Down Expand Up @@ -348,13 +349,13 @@ export function buildYamlFlowFromPlans(
aiHover: locate!,
});
} else if (type === 'Input') {
const param = plan.param as PlanningActionParamInputOrKeyPress;
const param = plan.param as PlanningActionParamInput;
flow.push({
aiInput: param.value,
locate,
});
} else if (type === 'KeyboardPress') {
const param = plan.param as PlanningActionParamInputOrKeyPress;
const param = plan.param as PlanningActionParamKeyPress;
flow.push({
aiKeyboardPress: param.value,
locate,
Expand Down
6 changes: 5 additions & 1 deletion packages/core/src/types.ts
Original file line number Diff line number Diff line change
Expand Up @@ -306,10 +306,14 @@ export interface PlanningAIResponse {
export type PlanningActionParamTap = null;
export type PlanningActionParamHover = null;
export type PlanningActionParamRightClick = null;
export interface PlanningActionParamInputOrKeyPress {
export interface PlanningActionParamInput {
value: string;
}

export interface PlanningActionParamKeyPress {
value: string | string[];
}

export type PlanningActionParamScroll = scrollParam;

export interface PlanningActionParamAssert {
Expand Down
2 changes: 1 addition & 1 deletion packages/core/src/yaml.ts
Original file line number Diff line number Diff line change
Expand Up @@ -142,7 +142,7 @@ export interface MidsceneYamlFlowItemAIInput extends LocateOption {
}

export interface MidsceneYamlFlowItemAIKeyboardPress extends LocateOption {
aiKeyboardPress: string;
aiKeyboardPress: string | string[];
locate?: string; // where to press, optional
}

Expand Down
7 changes: 4 additions & 3 deletions packages/mcp/src/midscene.ts
Original file line number Diff line number Diff line change
Expand Up @@ -241,9 +241,9 @@ export class MidsceneManager {
tools.midscene_aiKeyboardPress.description,
{
key: z
.string()
.union([z.string(), z.array(z.string())])
.describe(
"The web key to press, e.g. 'Enter', 'Tab', 'Escape', etc.",
"The web key(s) to press. Can be a single key like 'Enter', 'Tab', 'Escape', or an array of keys for combinations like ['Ctrl', 'Shift'] or ['Meta', 'a'].",
),
locate: z
.string()
Expand All @@ -264,11 +264,12 @@ export class MidsceneManager {
const options = deepThink ? { deepThink } : undefined;
await agent.aiKeyboardPress(key, locate, options);

const keyDesc = Array.isArray(key) ? key.join('+') : key;
const targetDesc = locate ? ` on element "${locate}"` : '';

return {
content: [
{ type: 'text', text: `Pressed key '${key}'${targetDesc}` },
{ type: 'text', text: `Pressed key(s) '${keyDesc}'${targetDesc}` },
{ type: 'text', text: `report file: ${agent.reportFile}` },
],
isError: false,
Expand Down
2 changes: 1 addition & 1 deletion packages/web-integration/src/common/agent.ts
Original file line number Diff line number Diff line change
Expand Up @@ -326,7 +326,7 @@ export class PageAgent<PageType extends WebPage = WebPage> {
}

async aiKeyboardPress(
keyName: string,
keyName: string | string[],
locatePrompt?: string,
opt?: LocateOption,
) {
Expand Down
34 changes: 23 additions & 11 deletions packages/web-integration/src/common/plan-builder.ts
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,8 @@ import type {
DetailedLocateParam,
MidsceneYamlFlowItem,
PlanningAction,
PlanningActionParamInputOrKeyPress,
PlanningActionParamInput,
PlanningActionParamKeyPress,
PlanningActionParamScroll,
PlanningActionParamSleep,
PlanningActionParamTap,
Expand All @@ -17,7 +18,8 @@ export function buildPlans(
type: PlanningAction['type'],
locateParam?: DetailedLocateParam,
param?:
| PlanningActionParamInputOrKeyPress
| PlanningActionParamInput
| PlanningActionParamKeyPress
| PlanningActionParamScroll
| PlanningActionParamSleep,
): PlanningAction[] {
Expand All @@ -42,23 +44,33 @@ export function buildPlans(

returnPlans = [locatePlan, tapPlan];
}
if (type === 'Input' || type === 'KeyboardPress') {
if (type === 'Input') {
assert(locateParam, `missing locate info for action "${type}"`);
}
if (type === 'Input') {
assert(locateParam, `missing locate info for action "${type}"`);
assert(param, `missing param for action "${type}"`);

const inputPlan: PlanningAction<PlanningActionParamInputOrKeyPress> = {
const inputPlan: PlanningAction<PlanningActionParamInput> = {
type,
param: param as PlanningActionParamInputOrKeyPress,
param: param as PlanningActionParamInput,
thought: '',
locate: locateParam!,
locate: locateParam,
};

returnPlans = [locatePlan!, inputPlan];
}
if (type === 'KeyboardPress') {
assert(param, `missing param for action "${type}"`);

const keyboardPressPlan: PlanningAction<PlanningActionParamKeyPress> = {
type,
param: param as PlanningActionParamKeyPress,
thought: '',
locate: locateParam,
};

if (locatePlan) {
returnPlans = [locatePlan, inputPlan];
returnPlans = [locatePlan, keyboardPressPlan];
} else {
returnPlans = [inputPlan];
returnPlans = [keyboardPressPlan];
}
}

Expand Down
7 changes: 4 additions & 3 deletions packages/web-integration/src/common/tasks.ts
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,8 @@ import {
type PlanningActionParamAssert,
type PlanningActionParamError,
type PlanningActionParamHover,
type PlanningActionParamInputOrKeyPress,
type PlanningActionParamInput,
type PlanningActionParamKeyPress,
type PlanningActionParamScroll,
type PlanningActionParamSleep,
type PlanningActionParamTap,
Expand Down Expand Up @@ -379,7 +380,7 @@ export class PageTaskExecutor {
};
tasks.push(taskAssert);
} else if (plan.type === 'Input') {
const taskActionInput: ExecutionTaskActionApply<PlanningActionParamInputOrKeyPress> =
const taskActionInput: ExecutionTaskActionApply<PlanningActionParamInput> =
{
type: 'Action',
subType: 'Input',
Expand All @@ -402,7 +403,7 @@ export class PageTaskExecutor {
};
tasks.push(taskActionInput);
} else if (plan.type === 'KeyboardPress') {
const taskActionKeyboardPress: ExecutionTaskActionApply<PlanningActionParamInputOrKeyPress> =
const taskActionKeyboardPress: ExecutionTaskActionApply<PlanningActionParamKeyPress> =
{
type: 'Action',
subType: 'KeyboardPress',
Expand Down
2 changes: 1 addition & 1 deletion packages/web-integration/src/common/ui-utils.ts
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ export function getKeyCommands(
value: string | string[],
): Array<{ key: string; command?: string }> {
// Ensure value is an array of keys
const keys = Array.isArray(value) ? value : [value];
const keys = Array.isArray(value) ? value : value.split('+'); // Compatible with input format 'Meta+A';

// Process each key to attach a corresponding command if needed, based on the presence of 'Meta' or 'Control' in the keys array.
// ref: https://github.com/puppeteer/puppeteer/pull/9357/files#diff-32cf475237b000f980eb214a0a823e45a902bddb7d2426d677cae96397aa0ae4R94
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,23 @@ exports[`build plans > keyboardPress 1`] = `
]
`;

exports[`build plans > keyboardPress with combination keys 1`] = `
[
{
"locate": undefined,
"param": {
"value": [
"Ctrl",
"Shift",
"I",
],
},
"thought": "",
"type": "KeyboardPress",
},
]
`;

exports[`build plans > rightClick 1`] = `
[
{
Expand Down
7 changes: 7 additions & 0 deletions packages/web-integration/tests/unit-test/plan-builder.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,13 @@ describe('build plans', () => {
expect(result).toMatchSnapshot();
});

it('keyboardPress with combination keys', async () => {
const result = await buildPlans('KeyboardPress', undefined, {
value: ['Ctrl', 'Shift', 'I'],
});
expect(result).toMatchSnapshot();
});

it('scroll', async () => {
const result = await buildPlans('Scroll', undefined, {
direction: 'down',
Expand Down
23 changes: 23 additions & 0 deletions packages/web-integration/tests/unit-test/util.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -68,4 +68,27 @@ describe('getKeyCommands', () => {
{ key: 'V', command: 'Paste' },
]);
});

it('should handle combination keys like Ctrl+Shift', () => {
const result = getKeyCommands(['Control', 'Shift']);
expect(result).toEqual([{ key: 'Control' }, { key: 'Shift' }]);
});

it('should handle complex combinations like Ctrl+Shift+A', () => {
const result = getKeyCommands(['Control', 'Shift', 'A']);
expect(result).toEqual([
{ key: 'Control' },
{ key: 'Shift' },
{ key: 'A', command: 'SelectAll' },
]);
});

it('should handle Meta+Shift+V combination', () => {
const result = getKeyCommands(['Meta', 'Shift', 'V']);
expect(result).toEqual([
{ key: 'Meta' },
{ key: 'Shift' },
{ key: 'V', command: 'Paste' },
]);
});
});