Scripts Directory

This directory contains all the run scripts for OSWorld, organized by type.

Structure

scripts/
├── python/          # Python run scripts for various models
│   ├── run_*.py     # Individual model run scripts
│   ├── run_multienv_*.py  # Multi-environment run scripts
│   └── manual_examine.py  # Manual task examination tool
└── bash/            # Bash scripts
    └── run_*.sh     # Shell scripts for running models

Python Scripts

The python/ directory contains Python scripts for running different models and agents:

Single-model scripts: run_autoglm.py, run_coact.py, run_maestro.py
Multi-environment scripts: run_multienv_*.py - Run models across multiple environments
Manual examination: manual_examine.py - Tool for manually verifying and examining specific benchmark tasks

Bash Scripts

The bash/ directory contains shell scripts for running specific models:

run_dart_gui.sh - Run DART GUI model
run_os_symphony.sh - Run OS Symphony model
run_manual_examine.sh - Example script for manual task examination with sample task IDs

Note: Due to a previous oversight, many bash scripts were not preserved during the reorganization. More will be added gradually in future updates. Community contributions are welcome: if you have bash scripts for running specific models or workflows, please submit a pull request.

Usage

Important: All scripts should be run from the project root directory (not from within the scripts/ directory).
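
A script can guard against being launched from the wrong directory with a small check at startup. The sketch below is only an illustration and is not known to be part of the existing scripts: the helper names looks_like_project_root and ensure_project_root are hypothetical, and the choice of marker directories is an assumption based on the names desktop_env, mm_agents, and scripts that appear in this README.

```python
import os

# Assumed markers: directories this README mentions as living at the project root.
ROOT_MARKERS = ("desktop_env", "mm_agents", "scripts")


def looks_like_project_root(path, markers=ROOT_MARKERS):
    """Return True if every marker directory exists directly under `path`."""
    return all(os.path.isdir(os.path.join(path, name)) for name in markers)


def ensure_project_root(path=None):
    """Exit with a readable message when `path` (default: cwd) is not the root."""
    path = os.getcwd() if path is None else path
    if not looks_like_project_root(path):
        raise SystemExit(
            f"Run this script from the OSWorld root directory, not {path}"
        )
```

Calling ensure_project_root() before any other work would turn a confusing missing-file error into an immediate, explicit message.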

Running Python Scripts

# From the OSWorld root directory
python scripts/python/run_multienv.py [args]

# Example: Run with OpenAI GPT-4o
python scripts/python/run_multienv.py \
    --provider_name docker \
    --headless \
    --observation_type screenshot \
    --model gpt-4o \
    --max_steps 15 \
    --num_envs 10 \
    --client_password password
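
For readers writing their own run script, the flags in the example above could be declared with argparse roughly as follows. This is a sketch under assumptions, not the parser used by run_multienv.py: the flag names come from the command above, but the types and default values shown here are guesses, and the real script may define more options or different defaults.

```python
import argparse


def build_parser():
    """Declare the flags from the example command (types and defaults assumed)."""
    parser = argparse.ArgumentParser(description="Hypothetical run-script options")
    parser.add_argument("--provider_name", type=str, default="docker")
    parser.add_argument("--headless", action="store_true")
    parser.add_argument("--observation_type", type=str, default="screenshot")
    parser.add_argument("--model", type=str, default="gpt-4o")
    parser.add_argument("--max_steps", type=int, default=15)
    parser.add_argument("--num_envs", type=int, default=1)
    parser.add_argument("--client_password", type=str, default="")
    return parser


# Parse an explicit list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--headless", "--model", "gpt-4o", "--max_steps", "15", "--num_envs", "10"]
)
print(args.model, args.max_steps, args.num_envs, args.headless)
# prints: gpt-4o 15 10 True
```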

Running Bash Scripts

# From the OSWorld root directory
bash scripts/bash/run_dart_gui.sh [args]

Manual Task Examination

To manually verify and examine a specific benchmark task:

# From the OSWorld root directory
python scripts/python/manual_examine.py \
    --headless \
    --observation_type screenshot \
    --result_dir ./results_human_examine \
    --test_all_meta_path evaluation_examples/test_all.json \
    --domain libreoffice_impress \
    --example_id a669ef01-ded5-4099-9ea9-25e99b569840 \
    --max_steps 3

This tool allows you to:

Manually execute tasks in the environment
Verify task correctness and evaluation metrics
Record the execution process with screenshots and videos
Examine specific problematic tasks

See scripts/bash/run_manual_examine.sh for example task IDs across different domains.
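
The --domain and --example_id flags narrow a run to a single task listed in the meta file. The sketch below shows one way such a selection could work. It assumes the meta file is a JSON object mapping each domain name to a list of example IDs, which this README does not confirm, and the helper names load_meta and select_examples are hypothetical.

```python
import json


def load_meta(path):
    """Read the meta file; assumed layout is {domain: [example_id, ...]}."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def select_examples(meta, domain, example_id=None):
    """Return (domain, example_id) pairs matching the given filters."""
    if domain not in meta:
        raise KeyError(f"unknown domain: {domain}")
    ids = meta[domain]
    if example_id is not None:
        if example_id not in ids:
            raise KeyError(f"{example_id} is not listed under {domain}")
        ids = [example_id]
    return [(domain, i) for i in ids]
```

With this layout, select_examples(load_meta("evaluation_examples/test_all.json"), "libreoffice_impress") would return every task in that domain, and passing an example ID would return just that one.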

Technical Details

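Every Python script in this directory adds the project root to sys.path at startup, so project modules can be imported even though the scripts live two levels below the root. This means:

Scripts automatically add the project root to sys.path
Imports of project modules (such as lib_run_single, desktop_env, mm_agents) resolve without extra configuration
You should still run scripts from the project root directory, because relative paths passed as arguments (for example evaluation_examples/test_all.json) are resolved against the current working directory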
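
The path resolution described above amounts to walking two directories up from the script's own location. The sketch below spells that out as a helper so the logic is easy to inspect. The function names project_root_for and add_to_sys_path are hypothetical and do not exist in the repository.

```python
import os
import sys


def project_root_for(script_path, levels_up=2):
    """Return the absolute directory `levels_up` levels above the script's folder."""
    root = os.path.dirname(os.path.abspath(script_path))
    for _ in range(levels_up):
        root = os.path.dirname(root)
    return root


def add_to_sys_path(root):
    """Put `root` at the front of sys.path unless it is already listed."""
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

For a script at scripts/python/run_example.py (an illustrative name), project_root_for(__file__) yields the directory that contains scripts/, which is where modules such as lib_run_single are expected to live.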

Adding New Scripts

If you create a new run script, make sure to include the following path setup at the beginning (after standard library imports but before project imports):

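# Add project root to path for imports
import os
import sys

# Use an absolute path so the entry stays valid regardless of the working directory
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "..")))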

# Now you can import project modules
import lib_run_single
from desktop_env.desktop_env import DesktopEnv
from mm_agents.your_agent import YourAgent