A Harbor task for evaluating the perceptual reasoning abilities of frontier model agents. Agents are presented with orthographic projection problems drawn from the Perceptual Ability Test (PAT) and must identify the correct third view of a 3D solid, given two of its three orthographic views (top, end, and front).
Each question is rendered as a single composite image in standard PAT format. No code execution, terminal access, or internet access is allowed. The agent reasons only from vision and perceptual ability.
Two question sources are supported:
- Official questions — a fixed set of 15 freely available PAT problems from Bohr Prep in `environment/official_questions/`
- Generated questions — synthetic PAT problems produced by `question-gen/question_gen.py`, written to `question-gen/generated_questions/`
Three frontier models are evaluated:
| Key | Model |
|---|---|
| `gpt` | `openai/gpt-5.4` |
| `opus` | `anthropic/claude-opus-4.6` |
| `gemini` | `google/gemini-3.1-pro-preview` |
All models are routed through OpenRouter.
- Python 3.9 or later
- Harbor installed and available on `PATH`
- Docker
Install Python dependencies:

```
pip install -r requirements.txt
```
Provide your OpenRouter API key in a `.env` file one directory above the project root:

```
OPENROUTER_API_KEY=sk-or-...
```
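Libraries such as python-dotenv load this file automatically. As a rough illustration of what that involves, here is a minimal stdlib-only sketch of a `.env` parser (the `load_env` helper is hypothetical, not part of this repo):

```python
from pathlib import Path


def load_env(path: Path) -> dict:
    """Minimal .env parser using only the stdlib (a sketch, not python-dotenv)."""
    env = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        # Skip blank lines and comments; split on the first '=' only,
        # so values containing '=' are preserved intact.
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env
```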
```
view-recognition/
├── vision_agent.py           # Custom Harbor agent (multi-turn vision + reasoning loop)
├── config.py                 # Question selection and Docker environment preparation
├── run_custom.py             # Single-model trial runner
├── run_analytics.py          # Multi-model parallel runner with pass@k statistics
├── analytics.py              # Reanalyze all accumulated jobs in jobs/
├── task.toml                 # Harbor task configuration
├── instruction.md            # Task description passed to the agent
├── requirements.txt
├── environment/
│   ├── Dockerfile
│   ├── official_questions/   # Hand-authored PAT questions (not in Docker image)
│   └── selected_questions/   # Prepared subset copied into Docker image
├── question-gen/
│   ├── question_gen.py       # Synthetic PAT question generator
│   ├── make_composite.py     # PAT-format composite image renderer
│   └── generated_questions/  # Output of question_gen.py (gitignored)
├── solution/
│   ├── solve.sh              # Oracle solution script
│   └── solutions.json        # Answer key synced from tests/ at prep time
└── tests/
    ├── test.py               # Verifier: scores sol.txt against solutions.json
    └── solutions.json        # Ground-truth answers (not in Docker image)
```
To generate a new set of synthetic PAT questions:

```
cd question-gen
python question_gen.py
```
Output is written to `question-gen/generated_questions/`, with one subdirectory per question (`q01/`, `q02/`, ...) and a `solutions.json` answer key. Each question directory contains:

- `composite.png` — PAT-format image shown to the agent
- `full_question.png` — analysis composite with isometric view and marked answer
- `input/top_view.png`, `input/end_view.png` — orthographic input views
- `answers/A.png` … `answers/D.png` — front-view answer choices
The number of questions generated is controlled by `N_QUESTIONS` at the top of `question_gen.py` (default: 10).
To control which questions are included in a run, edit `config.py`:

```python
# Any subset of 1–15 in any order, or None for all
OFFICIAL_QUESTIONS = None
```

```
python run_custom.py <model> [--max-turns N]
```
Arguments:
| Argument | Description |
|---|---|
| `model` | One of `gpt`, `opus`, `gemini` |
| `--max-turns N` | Cap the agent's reasoning turns (default: unlimited) |
Examples:
```
python run_custom.py opus
python run_custom.py gemini
python run_custom.py opus --max-turns 10
```
Results are written to `jobs/<timestamp>/`.
Runs all three models in parallel for k trials each and computes pass@k statistics per question.
```
python run_analytics.py --runs <k> [--max-turns N]
```

`--runs` must be at least 2 to compute pass@2.
Examples:
```
python run_analytics.py --runs 5
python run_analytics.py --runs 7 --max-turns 10
```
Results are written to `analytics/`:

- `raw_results.json` — per-job scores
- `pass_at_k.json` — pass@k values per model per question
- `summary.txt` — formatted table
pass@k is the probability that at least one of k randomly sampled runs is correct, estimated using the unbiased estimator from Chen et al. (2021):
```
pass@k = 1 - C(n-c, k) / C(n, k)
```
where n is the total number of runs and c is the number of correct runs.
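The estimator is only a few lines of Python. This sketch mirrors the formula above using `math.comb`; the `pass_at_k` function name is illustrative and not necessarily what `run_analytics.py` uses internally:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total number of runs, c: number of correct runs, k: sample size.
    """
    if n - c < k:
        # Fewer than k incorrect runs exist, so every size-k sample
        # must contain at least one correct run.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note the guard for `n - c < k`: without it, `comb` would return 0 and the formula still evaluates to 1.0, but the explicit branch documents the edge case.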
VisionAgent operates in a multi-turn loop with no access to a terminal or code execution environment. Each turn:
1. The model receives the composite question images and is asked to respond with a structured JSON block containing `analysis`, `plan`, and `answers` fields.
2. The loop continues until the model sets `answers` to a complete list of letters, or `max_turns` is reached.
3. Answers are written to `/home/user/sol.txt` inside the Docker container, which the verifier scores.
A trajectory in ATIF-v1.6 format is written to the job log directory after each run.
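The loop's stopping condition can be sketched as follows. This is a hypothetical illustration, not the repo's actual implementation: it assumes the model wraps its structured response in a fenced ```json block, and the `extract_answers` helper name is invented for this example:

```python
import json
import re


def extract_answers(reply: str):
    """Return the answer list if the reply contains a complete structured
    JSON block, otherwise None (meaning the loop should continue)."""
    match = re.search(r"```json\s*(\{.*?\})\s*```", reply, re.DOTALL)
    if not match:
        return None
    block = json.loads(match.group(1))
    answers = block.get("answers")
    # Stop only once answers is a complete list of valid letters.
    if isinstance(answers, list) and answers and all(
        a in {"A", "B", "C", "D"} for a in answers
    ):
        return answers
    return None
```

A driver loop would call this on each model reply and, once it returns a list, write the letters to `/home/user/sol.txt` for the verifier.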
To reanalyze all accumulated jobs in jobs/ without running new trials:
```
python analytics.py [--k N]
```
Arguments:
| Argument | Description |
|---|---|
| `--k N` | Maximum k for pass@k statistics (default: 5) |
This reads every job directory regardless of when it was created, allowing results to be accumulated across multiple batched runs and reanalyzed at any time. Output is written to `analytics/`.
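The accumulation step amounts to globbing every job directory and merging whatever scores each one recorded. A rough sketch, assuming each `jobs/<timestamp>/` directory holds a `result.json` (the file name and its fields are assumptions for illustration; the real layout is defined by the runner scripts):

```python
import json
from pathlib import Path


def collect_scores(jobs_dir: Path = Path("jobs")) -> list[dict]:
    """Gather per-job score records from every job directory, old or new.

    Assumes each jobs/<timestamp>/result.json is a JSON object with at
    least a "model" field -- a hypothetical layout for this sketch.
    """
    results = []
    for result_file in sorted(jobs_dir.glob("*/result.json")):
        with result_file.open() as f:
            results.append(json.load(f))
    return results
```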
The default agent timeout is 2700 seconds (45 minutes), set in task.toml. Adjust as needed:
```toml
[agent]
timeout_sec = 2700.0
```

To verify the task is solvable and the verifier is correct, run the oracle agent:
```
harbor run -p . -a oracle --force-build
```
The oracle reads `solution/solutions.json` (kept in sync with `tests/solutions.json` by `config.py`) and writes the correct answers to `sol.txt`. A reward of 1.000 confirms the verifier is working correctly.