[English](README.md) | [δΈζ](README-zh.md)
---
# UNO Evaluation Framework
To support generalized evaluation across Omni benchmarks, we provide a lightweight Omni evaluation framework together with a high-performance scoring model. New datasets and evaluation models can easily be added on top of this framework. Below, we use **UNO-Bench** and **Qwen-2.5-Omni-7B** as examples to show how to run it.
# π Quick Start
## π οΈ Environment Preparation
Before running, install the core Python dependencies listed below. Note: because installing vLLM pulls in PyTorch, CUDA, and other complex dependencies, we recommend using a fresh virtual environment to avoid conflicts.
```bash
pip install -r requirements.txt
```
Download the necessary models and datasets using the following commands:
```bash
huggingface-cli download xxx --repo-type dataset --local-dir /path/to/UNO-Bench
huggingface-cli download xxx --local-dir /path/to/UNO-Scorer
huggingface-cli download Qwen/Qwen2.5-Omni-7B --local-dir /path/to/Qwen2.5-Omni
```
## π― Reproducing Experimental Results
Running the following script reproduces the **Qwen-2.5-Omni-7B** results reported in the paper. Remember to replace **MODEL_PATH**, **DATASET_LOCAL_DIR**, and **SCORER_MODEL_PATH** with your local paths.
```bash
bash examples/run_unobench_qwen_omni_hf.sh
```
For better performance, we recommend running the vLLM version of the inference service:
```bash
bash examples/run_unobench_qwen_omni_vllm.sh
```
* The program employs sequential logic for evaluation, executing in the following order: `Start Inference Service -> Generate Results -> Release Resources -> Start Scoring Service -> Calculate Scores -> Release Resources`.
* It supports **resuming from interruptions** (checkpointing); both inference and scoring progress are saved locally at regular intervals.
## π Compositional Law
To reproduce the fitted curve of the Compositional Law, run:
```bash
python3 compositional_law.py
```
## π€ Using Only the Scoring Model
We recommend using vLLM for higher efficiency. You can refer to:
```bash
bash examples/test_scorer_vllm.sh
```
Or use the Transformers-based approach (lower efficiency):
```bash
python3 examples/test_scorer_hf.py
```
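If you host the scorer with vLLM (as in the script above), it exposes an OpenAI-compatible API, so you can also query it directly from Python. The snippet below is only a minimal sketch: the port (`8001`, the default `SCORER_PORT`), the served model name, and the prompt format are assumptions to adapt to your setup.
```python
# Minimal sketch: query UNO-Scorer via vLLM's OpenAI-compatible API.
# The base_url/port, served model name, and prompt format are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

prompt = (
    "Question: What instrument is playing in the audio?\n"
    "Reference answer: piano\n"
    "Model answer: It sounds like a piano.\n"
    "Judge whether the model answer is correct."
)

response = client.chat.completions.create(
    model="UNO-Scorer",  # must match the model name the vLLM server serves
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```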
## βοΈ Configuration Guide
Before running, you **must** modify the configuration section at the top of `run_unobench_qwen_omni_*.sh` to adapt to your environment.
### 1. Inference Model Configuration (Target Model)
| Variable Name | Description | Example |
| :--- | :--- | :--- |
| `MODEL_NAME` | Model registration name (corresponds to the name defined in `models` code) | `"Qwen-2.5-Omni-7B"` `"VLLMClient"` |
| `MODEL_PATH` | Local absolute path to the model weights | `/path/to/Qwen2.5-Omni` |
| `INFERENCE_BACKEND` | Inference backend selection: `"vllm"` or `"hf"` | `"vllm"` |
| `TARGET_GPU_IDS` | GPU IDs used for the inference stage | `"0,1"` |
| `TARGET_TP_SIZE` | Tensor Parallelism size for the inference model | `2` |
| `TARGET_PORT` | vLLM service port | `8000` |
### 2. Scoring Model Configuration (Scorer)
| Variable Name | Description | Example |
| :--- | :--- | :--- |
| `SCORER_MODEL_PATH` | Path to the scoring model (e.g., UNO-Scorer) | `/path/to/UNO-Scorer` |
| `SCORER_GPU_IDS` | GPU IDs used for the scoring stage | `"0,1"` |
| `SCORER_PORT` | vLLM service port for the scorer | `8001` |
### 3. Dataset and Paths
| Variable Name | Description |
| :--- | :--- |
| `DATASET_NAME` | Evaluation dataset name (e.g., `"UNO-Bench"`) |
| `HF_CACHE_DIR` | HuggingFace cache or multimedia data directory; automatically downloaded datasets will be saved here |
| `DATASET_LOCAL_DIR` | Local path for the dataset. The program prioritizes reading from `DATASET_LOCAL_DIR`; otherwise, it automatically downloads to `HF_CACHE_DIR` |
| `EXP_MARKING` | Experiment marking suffix (e.g., `_20251024`), used to distinguish experimental settings and output filenames |
## π Running Evaluation
After configuration, run the evaluation script:
```bash
bash run_eval.sh
```
### Detailed Script Execution Flow
1. **Stage 1: Inference**
* If `vllm` mode is selected, the script starts the target model's API Server in the background.
* Runs `eval.py --mode inference` to perform data inference.
* **Key Step**: After inference is complete, the script automatically kills the target model's vLLM process to fully release GPU memory.
2. **Stage 2: Scorer Setup**
* Starts the Scoring Model's (Scorer) vLLM service in the background.
3. **Stage 3: Evaluation (Scoring)**
* Runs `eval.py --mode scoring` to send the generated results to the scoring model for evaluation.
4. **Cleanup**
* Upon task completion, automatically shuts down the scoring model service.
## π Output Results
Evaluation results will be generated as JSON files, saved by default in the `./eval_results/` directory.
* **Filename Format**: `{MODEL_NAME}{EXP_MARKING}:{DATASET_NAME}.json`
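* **Example**: with the example settings above (`MODEL_NAME="Qwen-2.5-Omni-7B"`, `EXP_MARKING="_20251024"`, `DATASET_NAME="UNO-Bench"`), the output file would be `Qwen-2.5-Omni-7B_20251024:UNO-Bench.json`.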
## π Minimalist Development Guide
```text
.
βββ run_eval.sh # [Main Program] Manages config parameters, service lifecycle, and flow control
βββ eval.py # [Execution Script] Handles data loading, API interaction, and result storage
βββ utils/ # [Dependencies] General utility functions
βββ models/ # [Dependencies] Model registration and loading
βββ benchmarks/ # [Dependencies] Dataset registration and loading
```
The project is mainly divided into benchmarks (evaluation sets) and evaluation models. You can register new datasets in `benchmarks/` and new models in `models/`.
### Adding New Datasets
1. Create a new dataset `.py` file in `benchmarks/`, such as `unobench.py`. Inherit from the `BaseDataset` class and implement the abstract methods (a minimal sketch follows these steps):
* `load_and_prepare`: Download and load the dataset, organizing each item into the `utils.EvaluationRecord` format.
* `build_message`: Construct the message sent to the model side (OpenAI Chat Message format).
* `build_score_message`: Construct the message sent to the scoring model (OpenAI Chat Message format).
* `compute_score`: Calculate the score for a single data item.
* `compute_metrics`: Calculate metrics for the entire dataset.
2. Register the dataset in `__init__.py`.
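The following is a minimal, hypothetical sketch of such a dataset class. The import paths, the `EvaluationRecord` fields, and the method signatures are assumptions for illustration only; follow the actual definitions in `benchmarks/` and `utils/` when implementing.
```python
# Hypothetical sketch of a new dataset; import paths, EvaluationRecord fields,
# and method signatures are assumptions, not the exact framework API.
from benchmarks.base import BaseDataset  # assumed location of BaseDataset
from utils import EvaluationRecord       # assumed record container

class MyBench(BaseDataset):
    def load_and_prepare(self):
        # Download/load the raw data and wrap each item as an EvaluationRecord.
        raw = [{"id": 0, "question": "What color is the sky?", "answer": "blue"}]
        return [
            EvaluationRecord(id=x["id"], question=x["question"], reference=x["answer"])
            for x in raw
        ]

    def build_message(self, record):
        # OpenAI Chat Message sent to the target (inference) model.
        return [{"role": "user", "content": record.question}]

    def build_score_message(self, record, prediction):
        # OpenAI Chat Message sent to the scoring model.
        prompt = (
            f"Question: {record.question}\n"
            f"Reference: {record.reference}\n"
            f"Prediction: {prediction}\n"
            "Is the prediction correct? Answer yes or no."
        )
        return [{"role": "user", "content": prompt}]

    def compute_score(self, record, scorer_output):
        # Score for a single item, parsed from the scorer's reply.
        return 1.0 if "yes" in scorer_output.lower() else 0.0

    def compute_metrics(self, scores):
        # Aggregate metric over the whole dataset (here: simple accuracy).
        return {"accuracy": sum(scores) / max(len(scores), 1)}
```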
### Adding New Models
1. Create a new model `.py` file in `models/`, such as `qwen_2d5_omni_7b.py`. Inherit from the `BaseModel` class and implement the abstract methods (a minimal sketch follows these steps):
* `load_model`: Load the model.
* `generate`: Call the model interface once to generate text.
* `generate_batch`: Batch call the model interface to generate text.
2. Register the model in `__init__.py`.
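The following is a minimal, hypothetical sketch of such a model class. The import path and method signatures are assumptions for illustration only; follow the actual `BaseModel` definition in `models/` when implementing.
```python
# Hypothetical sketch of a new model wrapper; the import path and method
# signatures are assumptions, not the exact framework API.
from models.base import BaseModel  # assumed location of BaseModel

class MyOmniModel(BaseModel):
    def load_model(self, model_path):
        # Load weights and processor from the local path (e.g. with transformers),
        # or initialize a client for a vLLM API server.
        self.model_path = model_path
        self.model = None  # placeholder for the actual loaded model object

    def generate(self, messages, **gen_kwargs):
        # Single call: take an OpenAI-style message list, return the generated text.
        # Replace the placeholder with a real call to the loaded model or client.
        return "placeholder response"

    def generate_batch(self, batch_messages, **gen_kwargs):
        # Batched call: one response per message list.
        return [self.generate(messages, **gen_kwargs) for messages in batch_messages]
```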
## β οΈ Precautions
* **Path Check**: Please ensure that the paths in the script have been modified to match the actual paths on your server.