This toolkit is used for benchmark evaluation in the PyVision-RL project.

Create the environment and install dependencies:

conda create -n pyvision-eval python=3.10
pip install -r requirements.txt

# api_config.json
{
  "api_key": [
    "sk-**",
    "sk-**"
  ],
  "base_url": "xx"
}

First, modify the 'evaluation/run_eval_pl.sh' script.
#!/usr/bin/env bash
# evaluation/run_eval_pl.sh
# Single-image benchmark evaluation (V* bench, 32 repeats) of the PyVision-RL
# model served behind an OpenAI-compatible API endpoint.
set -euo pipefail

API_NAME_MODEL=pyvision-rl        # model name sent to the API client
API_NAME=pyvision-image-rl-7b     # label embedded in output file names
CLIENT_TYPE=openai
PROMPT_TYPE=vistool_with_img_info_v2

# Sampling parameters (also embedded in the output file names).
temp=0.6
top_p=1.0
top_k=20
min_p=0.0
repetition_penalty=1.0

BENCH_TYPE=vstar_32
DATA_FILE_PATH=./dataset/vstar/vstar_bench_repeat32.json
DATA_DIR_PATH=./dataset/vstar/results

# Make sure the results directory exists before the evaluator writes into it.
mkdir -p "${DATA_DIR_PATH}"

# Shared suffix so the results file and the messages dir always stay in sync.
RUN_TAG="${API_NAME}_${PROMPT_TYPE}_t${temp}_top_p${top_p}_top_k${top_k}_min_p${min_p}_rp${repetition_penalty}"

python ./evaluation/eval_gpt_pl.py \
  --bench_type "${BENCH_TYPE}" \
  --data_file_path "${DATA_FILE_PATH}" \
  --output_file_path "${DATA_DIR_PATH}/results_${RUN_TAG}.json" \
  --prompt_template ./prompt_template/prompt_template_vis.json \
  --prompt "${PROMPT_TYPE}" \
  --max_tokens 16000 \
  --temperature "${temp}" \
  --top_p "${top_p}" \
  --top_k "${top_k}" \
  --min_p "${min_p}" \
  --repetition_penalty "${repetition_penalty}" \
  --messages_save_dir "${DATA_DIR_PATH}/test_generated_messages_${RUN_TAG}" \
  --api_config_path ./api_config_intern.json \
  --client_type "${CLIENT_TYPE}" \
  --api_name "${API_NAME_MODEL}" \
  --exe_code \
  --max_images 50
Then run it:

bash evaluation/run_eval_pl.sh

First, modify the 'evaluation/run_eval_video_pl.sh' script.
#!/usr/bin/env bash
# evaluation/run_eval_video_pl.sh
# Video benchmark evaluation (VSI-Bench) of the PyVision-RL model served
# behind an OpenAI-compatible API endpoint.
set -euo pipefail

API_NAME_MODEL=pyvision-rl        # model name sent to the API client
API_NAME=pyvision-rl-wohint-vsi-std-sort-enable-bs16-multitask-verifier-1113-000001-wo-stdnorm-steps750
CLIENT_TYPE=openai
PROMPT_TYPE=vis_tool_with_img_info_video_v4
BENCH_TYPE=vsi_bench
DATA_FILE_PATH=./dataset/VSI-Bench/vsi_bench.json
DATA_DIR_PATH=./dataset/VSI-Bench/results
NUM_FRAMES=0                      # 0 = let the evaluator pick its default frame sampling

# Sampling parameters (also embedded in the output file names).
temp=0.01
top_p=1.0
top_k=0
min_p=0.0

# Make sure the results directory exists before the evaluator writes into it.
mkdir -p "${DATA_DIR_PATH}"

# Shared suffix so the results file and the messages dir always stay in sync.
# (Fixes a bug where the messages dir hardcoded "t0.6" while temp=0.01.)
RUN_TAG="${API_NAME}_${PROMPT_TYPE}_t${temp}_${NUM_FRAMES}frames_top_p${top_p}_top_k${top_k}_min_p${min_p}"

python ./evaluation/eval_gpt_video_pl.py \
  --bench_type "${BENCH_TYPE}" \
  --data_file_path "${DATA_FILE_PATH}" \
  --output_file_path "${DATA_DIR_PATH}/results_${RUN_TAG}.json" \
  --prompt_template ./prompt_template/prompt_template_vis.json \
  --prompt "${PROMPT_TYPE}" \
  --max_tokens 10000 \
  --temperature "${temp}" \
  --top_p "${top_p}" \
  --top_k "${top_k}" \
  --min_p "${min_p}" \
  --messages_save_dir "${DATA_DIR_PATH}/generated_messages_${RUN_TAG}" \
  --api_config_path ./api_config_intern.json \
  --client_type "${CLIENT_TYPE}" \
  --api_name "${API_NAME_MODEL}" \
  --exe_code \
  --sampled_frames_num "${NUM_FRAMES}"
Then run it:

bash evaluation/run_eval_video_pl.sh

Upload the results here: https://huggingface.co/spaces/Agents-X/data-view
@article{zhao2026pyvisionrl,
  title={PyVision-RL: Forging Open Agentic Vision Models via RL},
  author={Zhao, Shitian and Lin, Shaoheng and Li, Ming and Zhang, Haoquan and Peng, Wenshuo and Zhang, Kaipeng and Wei, Chen},
  journal={arXiv preprint arXiv:2602.20739},
  year={2026}
}