[English](README.md) | [δΈ­ζ–‡](README-zh.md)
---
# UNO Evaluation Framework
To facilitate generalized evaluation across Omni benchmarks, we have built a lightweight Omni evaluation framework and released a high-performance scoring model to support it. You can easily add new datasets or evaluation models on top of this framework. Below, we use **UNO-Bench** and **Qwen-2.5-Omni-7B** as examples to demonstrate how to run the framework.
# πŸš€ Quick Start
## πŸ› οΈ Environment Preparation
Before running, please ensure the following core Python dependencies are installed. Note: since installing vLLM pulls in PyTorch, CUDA, and other complex dependencies, we recommend working in a fresh virtual environment to avoid conflicts.
```bash
pip install -r requirements.txt
```
Download the required models and datasets with the following commands (replace `xxx` with the corresponding Hugging Face repository IDs):
```bash
huggingface-cli download xxx --repo-type dataset --local-dir /path/to/UNO-Bench
huggingface-cli download xxx --local-dir /path/to/UNO-Scorer
huggingface-cli download Qwen/Qwen2.5-Omni-7B --local-dir /path/to/Qwen2.5-Omni
```
## 🎯 Reproducing Experimental Results
By executing the following code, you can reproduce the experimental results of **Qwen-2.5-Omni-7B** presented in the paper. Remember to replace **MODEL_PATH**, **DATASET_LOCAL_DIR**, and **SCORER_MODEL_PATH** with your local paths.
```bash
bash examples/run_unobench_qwen_omni_hf.sh
```
For better performance, we recommend running the vLLM version of the inference service.
```bash
bash examples/run_unobench_qwen_omni_vllm.sh
```
* The pipeline runs sequentially: `Start Inference Service -> Generate Results -> Release Resources -> Start Scoring Service -> Calculate Scores -> Release Resources`.
* It supports **resumable runs** (checkpointing): both inference and scoring progress are saved locally at regular intervals.
## πŸ“ˆ Compositional Law
To reproduce the Compositional Law fitting curve, run:
```bash
python3 compositional_law.py
```
## πŸ€– Using Only the Scoring Model
We recommend serving the scorer with vLLM for higher efficiency:
```bash
bash examples/test_scorer_vllm.sh
```
Alternatively, use the Transformers-based approach, which is slower:
```bash
python3 examples/test_scorer_hf.py
```
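Since the scorer is served behind vLLM's OpenAI-compatible API, you can also query it directly from Python. Below is a minimal sketch assuming the scorer listens on `SCORER_PORT` 8001; the prompt shown is a placeholder, not the official scoring template (see `examples/test_scorer_vllm.sh` for that):
```python
# Minimal sketch: query a locally served UNO-Scorer through vLLM's
# OpenAI-compatible endpoint. The prompt below is a PLACEHOLDER, not the
# official scoring template -- consult examples/test_scorer_vllm.sh for that.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

prompt = (
    "Question: What color is the sky?\n"
    "Reference answer: Blue.\n"
    "Model answer: The sky is blue.\n"
    "Judge whether the model answer matches the reference."
)

response = client.chat.completions.create(
    model="/path/to/UNO-Scorer",  # vLLM registers the served model under its path/name
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```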
## βš™οΈ Configuration Guide
Before running, you **must** modify the configuration section at the top of `run_unobench_qwen_omni_*.sh` to adapt to your environment.
### 1. Inference Model Configuration (Target Model)
| Variable Name | Description | Example |
| :--- | :--- | :--- |
| `MODEL_NAME` | Model registration name (corresponds to a name defined in `models/`) | `"Qwen-2.5-Omni-7B"`, `"VLLMClient"` |
| `MODEL_PATH` | Local absolute path to the model weights | `/path/to/Qwen2.5-Omni` |
| `INFERENCE_BACKEND` | Inference backend selection: `"vllm"` or `"hf"` | `"vllm"` |
| `TARGET_GPU_IDS` | GPU IDs used for the inference stage | `"0,1"` |
| `TARGET_TP_SIZE` | Tensor Parallelism size for the inference model | `2` |
| `TARGET_PORT` | vLLM service port | `8000` |
### 2. Scoring Model Configuration (Scorer)
| Variable Name | Description | Example |
| :--- | :--- | :--- |
| `SCORER_MODEL_PATH` | Path to the scoring model (e.g., UNO-Scorer) | `/path/to/UNO-Scorer` |
| `SCORER_GPU_IDS` | GPU IDs used for the scoring stage | `"0,1"` |
| `SCORER_PORT` | vLLM service port for the scorer | `8001` |
### 3. Dataset and Paths
| Variable Name | Description |
| :--- | :--- |
| `DATASET_NAME` | Evaluation dataset name (e.g., `"UNO-Bench"`) |
| `HF_CACHE_DIR` | HuggingFace cache or multimedia data directory; automatically downloaded datasets will be saved here |
| `DATASET_LOCAL_DIR` | Local path for the dataset. The program prioritizes reading from `DATASET_LOCAL_DIR`; otherwise, it automatically downloads to `HF_CACHE_DIR` |
| `EXP_MARKING` | Experiment marking suffix (e.g., `_20251024`), used to distinguish experimental settings and output filenames |
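To make the `DATASET_LOCAL_DIR` / `HF_CACHE_DIR` priority concrete, here is a minimal sketch of the resolution rule (the function name and arguments are hypothetical, not the framework's actual API):
```python
# Hypothetical sketch of the path-resolution rule above: prefer
# DATASET_LOCAL_DIR when it exists, otherwise download into HF_CACHE_DIR.
import os
from huggingface_hub import snapshot_download

def resolve_dataset_dir(repo_id: str, local_dir: str, hf_cache_dir: str) -> str:
    if local_dir and os.path.isdir(local_dir):
        return local_dir  # local copy takes priority
    # Fall back to downloading the dataset into the cache directory.
    return snapshot_download(repo_id=repo_id, repo_type="dataset",
                             cache_dir=hf_cache_dir)
```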
## πŸŒ€ Running Evaluation
After configuration, run the script:
```bash
bash run_eval.sh
```
### Detailed Script Execution Flow
1. **Stage 1: Inference**
* If `vllm` mode is selected, the script starts the target model's API Server in the background.
* Runs `eval.py --mode inference` to perform data inference.
    * **Key Step**: After inference completes, the script automatically kills the target model's vLLM process to fully release GPU memory (see the lifecycle sketch after this list).
2. **Stage 2: Scorer Setup**
* Starts the Scoring Model's (Scorer) vLLM service in the background.
3. **Stage 3: Evaluation (Scoring)**
* Runs `eval.py --mode scoring` to send the generated results to the scoring model for evaluation.
4. **Cleanup**
* Upon task completion, automatically shuts down the scoring model service.
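For reference, the same start -> run -> kill lifecycle can be sketched in Python, assuming vLLM's `vllm serve` CLI; the actual orchestration lives in `run_eval.sh`:
```python
# Illustrative Python rendering of run_eval.sh's start -> run -> kill
# pattern (the real orchestration is the bash script itself).
import subprocess
import time

def run_stage(model_path: str, port: int, stage_cmd: list) -> None:
    # Start a vLLM OpenAI-compatible server in the background.
    server = subprocess.Popen(["vllm", "serve", model_path, "--port", str(port)])
    try:
        time.sleep(60)  # crude readiness wait; the real script polls the service
        subprocess.run(stage_cmd, check=True)
    finally:
        server.terminate()  # kill the server to fully release GPU memory
        server.wait()

# Stage 1 (inference), then Stages 2-3 (scoring) on a fresh server.
run_stage("/path/to/Qwen2.5-Omni", 8000, ["python3", "eval.py", "--mode", "inference"])
run_stage("/path/to/UNO-Scorer", 8001, ["python3", "eval.py", "--mode", "scoring"])
```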
## πŸ“Š Output Results
Evaluation results will be generated as JSON files, saved by default in the `./eval_results/` directory.
* **Filename Format**: `{MODEL_NAME}{EXP_MARKING}:{DATASET_NAME}.json` (see the loading snippet below)
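For example, to locate and load a results file under this naming scheme (the values below are illustrative, and the JSON schema is defined by the framework):
```python
# Build the documented output filename and load the results for inspection.
import json
import os

MODEL_NAME, EXP_MARKING, DATASET_NAME = "Qwen-2.5-Omni-7B", "_20251024", "UNO-Bench"
path = os.path.join("eval_results", f"{MODEL_NAME}{EXP_MARKING}:{DATASET_NAME}.json")

with open(path, encoding="utf-8") as f:
    results = json.load(f)
# The JSON schema is defined by the framework; just peek at what is inside.
print(type(results).__name__)
```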
## πŸ“‚ Minimalist Development Guide
```text
.
β”œβ”€β”€ run_eval.sh # [Main Program] Manages config parameters, service lifecycle, and flow control
β”œβ”€β”€ eval.py # [Execution Script] Handles data loading, API interaction, and result storage
β”œβ”€β”€ utils/ # [Dependencies] General utility functions
β”œβ”€β”€ models/ # [Dependencies] Model registration and loading
└── benchmarks/ # [Dependencies] Dataset registration and loading
```
The project is mainly divided into benchmarks (evaluation sets) and evaluation models. You can register new datasets in `benchmarks/` and new models in `models/`.
### Adding New Datasets
1. Create a new dataset `.py` file in `benchmarks/`, such as `unobench.py`. Inherit from the `BaseDataset` class and implement the abstract methods (a skeleton follows this list):
* `load_and_prepare`: Download and load the dataset, organizing each item into the `utils.EvaluationRecord` format.
* `build_message`: Construct the message sent to the model side (OpenAI Chat Message format).
* `build_score_message`: Construct the message sent to the scoring model (OpenAI Chat Message format).
* `compute_score`: Calculate the score for a single data item.
* `compute_metrics`: Calculate metrics for the entire dataset.
2. Register the dataset in `__init__.py`.
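A skeletal example is shown below; the import paths and method signatures are assumptions for illustration, so follow the actual `BaseDataset` definition in the repository:
```python
# benchmarks/my_bench.py -- skeletal dataset (illustrative: import paths and
# method signatures are assumptions; match the real BaseDataset definition).
from benchmarks import BaseDataset      # assumed import path
from utils import EvaluationRecord      # record format mentioned above

class MyBench(BaseDataset):
    def load_and_prepare(self):
        # Download/load raw data; wrap each item as an EvaluationRecord.
        ...

    def build_message(self, record):
        # OpenAI Chat Message format sent to the target model.
        return [{"role": "user", "content": "..."}]

    def build_score_message(self, record):
        # OpenAI Chat Message format sent to the scoring model.
        ...

    def compute_score(self, record):
        # Score a single data item from the scorer's output.
        ...

    def compute_metrics(self, records):
        # Aggregate per-item scores into dataset-level metrics.
        ...
```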
### Adding New Models
1. Create a new model `.py` file in `models/`, such as `qwen_2d5_omni_7b.py`. Inherit from the `BaseModel` class and implement the abstract methods (a skeleton follows this list):
* `load_model`: Load the model.
* `generate`: Call the model interface once to generate text.
* `generate_batch`: Batch call the model interface to generate text.
2. Register the model in `__init__.py`.
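Analogously, a skeletal model file might look like this (import path and signatures are again illustrative):
```python
# models/my_model.py -- skeletal model (illustrative: import path and method
# signatures are assumptions; match the real BaseModel definition).
from models import BaseModel  # assumed import path

class MyModel(BaseModel):
    def load_model(self):
        # Load weights or initialize the backend client.
        ...

    def generate(self, messages):
        # Single call: OpenAI-style messages in, generated text out.
        ...

    def generate_batch(self, batch_of_messages):
        # Batched calls for throughput; returns a list of generated texts.
        ...
```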
## ⚠️ Precautions
* **Path Check**: Please ensure that the paths in the script have been modified to match the actual paths on your server.