[English](README.md) | [δΈ­ζ–‡](README-zh.md)

---

# UNO Evaluation Framework

To facilitate generalized evaluation across Omni benchmarks, we have built a lightweight Omni evaluation framework and released a high-performance scoring model to support it. You can freely and easily add new datasets or evaluation models on top of this framework. Below, we use **UNO-Bench** and **Qwen-2.5-Omni-7B** as examples to demonstrate how to run the framework.

# πŸš€ Quick Start

## πŸ› οΈ Environment Preparation

Before running, please ensure the following core Python dependencies are installed. Note: since installing vLLM pulls in PyTorch, CUDA, and other complex dependencies, we recommend using a fresh virtual environment to avoid conflicts.

```bash
pip install -r requirements.txt
```

Download the necessary models and datasets with the following commands:

```bash
huggingface-cli download xxx --repo-type dataset --local-dir /path/to/UNO-Bench
huggingface-cli download xxx --local-dir /path/to/UNO-Scorer
huggingface-cli download Qwen/Qwen2.5-Omni-7B --local-dir /path/to/Qwen2.5-Omni
```

## 🎯 Reproducing Experimental Results

By executing the following script, you can reproduce the experimental results of **Qwen-2.5-Omni-7B** reported in the paper. Remember to replace **MODEL_PATH**, **DATASET_LOCAL_DIR**, and **SCORER_MODEL_PATH** with your local paths.

```bash
bash examples/run_unobench_qwen_omni_hf.sh
```

For better performance, we recommend running the vLLM version of the inference service:

```bash
bash examples/run_unobench_qwen_omni_vllm.sh
```

* The program evaluates sequentially, executing in the following order: `Start Inference Service -> Generate Results -> Release Resources -> Start Scoring Service -> Calculate Scores -> Release Resources`.
* It supports **resumable execution** (checkpointing): both inference progress and scoring progress are saved locally at regular intervals.

## πŸ“ˆ Compositional Law

Use the following command to fit the Compositional Law curve:

```bash
python3 compositional_law.py
```

## πŸ€– Using Only the Scoring Model

We recommend using vLLM for higher efficiency. You can refer to:

```bash
bash examples/test_scorer_vllm.sh
```

Alternatively, use the transformers-based approach, which is less efficient:

```bash
python3 examples/test_scorer_hf.py
```
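You can also query a running scorer service directly. vLLM exposes an OpenAI-compatible API, so the standard `openai` Python client works. Below is a minimal sketch, assuming the service was started (e.g., via `examples/test_scorer_vllm.sh`) on port `8001` (the default `SCORER_PORT` in the Configuration Guide below) with the served model name `UNO-Scorer`; the scoring prompt shown is purely illustrative, so follow the official scorer examples for the exact format.

```python
# A minimal sketch: querying the scorer through vLLM's OpenAI-compatible API.
# The port, served model name, and prompt format below are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8001/v1",  # vLLM serves an OpenAI-compatible API under /v1
    api_key="EMPTY",                      # vLLM does not require a real key by default
)

response = client.chat.completions.create(
    model="UNO-Scorer",  # must match the model name the vLLM service was started with
    messages=[
        {
            "role": "user",
            "content": (
                "Question: ...\n"
                "Reference answer: ...\n"
                "Model answer: ...\n"
                "Please judge whether the model answer is correct."
            ),
        }
    ],
    temperature=0.0,  # deterministic judgments are usually preferred for scoring
)
print(response.choices[0].message.content)
```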
## βš™οΈ Configuration Guide

Before running, you **must** modify the configuration section at the top of `run_unobench_qwen_omni_*.sh` to match your environment.

### 1. Inference Model Configuration (Target Model)

| Variable Name | Description | Example |
| :--- | :--- | :--- |
| `MODEL_NAME` | Model registration name (corresponds to the name defined in the `models` code) | `"Qwen-2.5-Omni-7B"` / `"VLLMClient"` |
| `MODEL_PATH` | Local absolute path to the model weights | `/path/to/Qwen2.5-Omni` |
| `INFERENCE_BACKEND` | Inference backend: `"vllm"` or `"hf"` | `"vllm"` |
| `TARGET_GPU_IDS` | GPU IDs used for the inference stage | `"0,1"` |
| `TARGET_TP_SIZE` | Tensor parallelism size for the inference model | `2` |
| `TARGET_PORT` | vLLM service port | `8000` |

### 2. Scorer Model Configuration

| Variable Name | Description | Example |
| :--- | :--- | :--- |
| `SCORER_MODEL_PATH` | Path to the scoring model (e.g., UNO-Scorer) | `/path/to/UNO-Scorer` |
| `SCORER_GPU_IDS` | GPU IDs used for the scoring stage | `"0,1"` |
| `SCORER_PORT` | vLLM service port for the scorer | `8001` |

### 3. Dataset and Paths

| Variable Name | Description |
| :--- | :--- |
| `DATASET_NAME` | Evaluation dataset name (e.g., `"UNO-Bench"`) |
| `HF_CACHE_DIR` | HuggingFace cache or multimedia data directory; automatically downloaded datasets are saved here |
| `DATASET_LOCAL_DIR` | Local path to the dataset. The program reads from `DATASET_LOCAL_DIR` first; otherwise, it automatically downloads to `HF_CACHE_DIR` |
| `EXP_MARKING` | Experiment marking suffix (e.g., `_20251024`), used to distinguish experimental settings and output filenames |

## πŸŒ€ Running Evaluation

After configuration, run the script:

```bash
bash run_eval.sh
```

### Detailed Script Execution Flow

1. **Stage 1: Inference**
   * If `vllm` mode is selected, the script starts the target model's API server in the background.
   * Runs `eval.py --mode inference` to perform data inference.
   * **Key step**: after inference is complete, the script automatically kills the target model's vLLM process to fully release GPU memory.
2. **Stage 2: Scorer Setup**
   * Starts the scoring model's (Scorer) vLLM service in the background.
3. **Stage 3: Evaluation (Scoring)**
   * Runs `eval.py --mode scoring` to send the generated results to the scoring model for evaluation.
4. **Cleanup**
   * Upon task completion, automatically shuts down the scoring model service.

## πŸ“Š Output Results

Evaluation results are written as JSON files, saved by default in the `./eval_results/` directory.

* **Filename format**: `{MODEL_NAME}{EXP_MARKING}:{DATASET_NAME}.json`

## πŸ“‚ Minimalist Development Guide

```text
.
β”œβ”€β”€ run_eval.sh     # [Main Program] Manages config parameters, service lifecycle, and flow control
β”œβ”€β”€ eval.py         # [Execution Script] Handles data loading, API interaction, and result storage
β”œβ”€β”€ utils/          # [Dependencies] General utility functions
β”œβ”€β”€ models/         # [Dependencies] Model registration and loading
└── benchmarks/     # [Dependencies] Dataset registration and loading
```

The project is mainly divided into benchmarks (evaluation sets) and evaluation models. You can register new datasets in `benchmarks/` and new models in `models/`.

### Adding New Datasets

1. Create a new dataset `.py` file in `benchmarks/`, such as `unobench.py`. Inherit from the `BaseDataset` class and implement the abstract methods (see the first sketch at the end of this README):
   * `load_and_prepare`: Download and load the dataset, organizing each item into the `utils.EvaluationRecord` format.
   * `build_message`: Construct the message sent to the target model (OpenAI Chat Message format).
   * `build_score_message`: Construct the message sent to the scoring model (OpenAI Chat Message format).
   * `compute_score`: Calculate the score for a single data item.
   * `compute_metrics`: Calculate metrics for the entire dataset.
2. Register the dataset in `__init__.py`.

### Adding New Models

1. Create a new model `.py` file in `models/`, such as `qwen_2d5_omni_7b.py`. Inherit from the `BaseModel` class and implement the abstract methods (see the second sketch at the end of this README):
   * `load_model`: Load the model.
   * `generate`: Call the model interface once to generate text.
   * `generate_batch`: Call the model interface in batches to generate text.
2. Register the model in `__init__.py`.

## ⚠️ Precautions

* **Path Check**: Please ensure that the paths in the script have been modified to match the actual paths on your server.
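For reference, here is a minimal, hypothetical sketch of the `BaseDataset` subclass described under "Adding New Datasets". The import paths, method signatures, and `EvaluationRecord` fields are assumptions for illustration; consult `benchmarks/unobench.py` for the authoritative reference.

```python
# benchmarks/my_bench.py -- a hypothetical sketch, not a drop-in implementation.
# Import paths, method signatures, and EvaluationRecord fields are assumptions.
from benchmarks import BaseDataset   # assumed import path
from utils import EvaluationRecord   # assumed record type


class MyBench(BaseDataset):
    def load_and_prepare(self):
        # Download/load the raw data and wrap each item as an EvaluationRecord.
        raw_items = [{"question": "1+1=?", "answer": "2"}]  # placeholder data
        return [
            EvaluationRecord(question=it["question"], answer=it["answer"])  # assumed fields
            for it in raw_items
        ]

    def build_message(self, record):
        # OpenAI Chat Message sent to the target model.
        return [{"role": "user", "content": record.question}]

    def build_score_message(self, record, model_output):
        # OpenAI Chat Message sent to the scoring model.
        prompt = (
            f"Question: {record.question}\n"
            f"Reference answer: {record.answer}\n"
            f"Model answer: {model_output}\n"
            "Is the model answer correct?"
        )
        return [{"role": "user", "content": prompt}]

    def compute_score(self, record, scorer_output):
        # Parse the scorer's judgment into a per-item score.
        return 1.0 if "correct" in scorer_output.lower() else 0.0

    def compute_metrics(self, scored_records):
        # Aggregate per-item scores into dataset-level metrics.
        scores = [r.score for r in scored_records]  # assumes a .score field
        return {"accuracy": sum(scores) / max(len(scores), 1)}
```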
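And a corresponding hypothetical sketch of the `BaseModel` subclass described under "Adding New Models"; again, the signatures are assumptions, so consult `models/qwen_2d5_omni_7b.py` for the real interface.

```python
# models/my_model.py -- a hypothetical sketch, not a drop-in implementation.
# BaseModel's method signatures are assumptions.
from models import BaseModel  # assumed import path


class MyModel(BaseModel):
    def load_model(self):
        # Initialize weights or a client for your inference backend here.
        self.client = None  # e.g. a transformers model or an OpenAI-compatible client

    def generate(self, messages):
        # One call to the model: OpenAI Chat Messages in, generated text out.
        raise NotImplementedError("wire this to your backend")

    def generate_batch(self, batch_messages):
        # Naive fallback: batch by looping over single-item generation.
        return [self.generate(messages) for messages in batch_messages]
```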