tsinghua-ee/video-SALMONN-2_plus_3B


Changli committed on Sep 28, 2025

Commit 7a4fb89 (verified) · 1 Parent(s): cec43c6

Update README.md


Files changed (1)

README.md +5 -3


@@ -21,13 +21,15 @@ video-SALMONN 2+ is built on Qwen 2.5-VL using a similar pipeline of video-SALMO
 ## How to Use
 1. Prepare the dataset following `scripts/example_av.json`, `scripts/example_v.json`, `scripts/example_dpo.json`, and `scripts/example_a.json`
 2. Prepare base audio model through modifying the path in `gen_audio_model.py`
 3. To conduct audio alignment, use the following script:
    ```bash
    bash scripts/train.sh --interval 0.1 --run_name audio_alignment --dataset path_to_dataset --lr 2e-5 --train_qformer --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --bs 16 --epoch 5 --save_steps 5000
    ```
-4. To conduct audio visual SFT, use the following script:
     ```bash
     bash scripts/train.sh --interval 0.1 --run_name av_sft --dataset path_to_dataset --lr 2e-5 --train_qformer --train_proj --max_frames 768 --max_pixels 61250 --model audio_align_model --model_base path_to_audio_model --epoch 5 --save_steps 2000 --use_lora --lora_r 128 --lora_alpha 256
     ```
@@ -35,11 +37,11 @@ video-SALMONN 2+ is built on Qwen 2.5-VL using a similar pipeline of video-SALMO
     ```bash
     bash scripts/train.sh --interval 0.1 --run_name dpo --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model audio_visual_base --model_base audio_align_model --lora_ckpt audio_visual_checkpoint --train_type gdpo --use_lora --lora_r 128 --lora_alpha 256 --lr 5e-6 --epoch 1 --save_steps 200 --train_qformer --train_proj
     ```
-6. To evaluate 3B/7B model, use the following script:
    ```bash
    bash scripts/test.sh --interval 0.1 --run_name eval --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --lora_ckpt model_ckpt
    ```
 7. To evaluate 72B model, use the following script:
    ```bash
    bash scripts/test_8.sh --interval 0.1 --run_name eval --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --lora_ckpt model_ckpt
-   ```

 ## How to Use
+**IMPORTANT**: To reproduce our evaluation results, use `--max_frames 768 --max_pixels 61250`. Evaluating at a much higher resolution or frame rate can produce more input tokens than the model handles well, which may degrade performance.
 1. Prepare the dataset following `scripts/example_av.json`, `scripts/example_v.json`, `scripts/example_dpo.json`, and `scripts/example_a.json`
 2. Prepare base audio model through modifying the path in `gen_audio_model.py`
 3. To conduct audio alignment, use the following script:
    ```bash
    bash scripts/train.sh --interval 0.1 --run_name audio_alignment --dataset path_to_dataset --lr 2e-5 --train_qformer --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --bs 16 --epoch 5 --save_steps 5000
    ```
+4. To conduct audio-visual SFT, use the following script:
     ```bash
     bash scripts/train.sh --interval 0.1 --run_name av_sft --dataset path_to_dataset --lr 2e-5 --train_qformer --train_proj --max_frames 768 --max_pixels 61250 --model audio_align_model --model_base path_to_audio_model --epoch 5 --save_steps 2000 --use_lora --lora_r 128 --lora_alpha 256
     ```
     ```bash
     bash scripts/train.sh --interval 0.1 --run_name dpo --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model audio_visual_base --model_base audio_align_model --lora_ckpt audio_visual_checkpoint --train_type gdpo --use_lora --lora_r 128 --lora_alpha 256 --lr 5e-6 --epoch 1 --save_steps 200 --train_qformer --train_proj
     ```
+6. To evaluate 7B model, use the following script:
    ```bash
    bash scripts/test.sh --interval 0.1 --run_name eval --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --lora_ckpt model_ckpt
    ```
 7. To evaluate 72B model, use the following script:
    ```bash
    bash scripts/test_8.sh --interval 0.1 --run_name eval --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --lora_ckpt model_ckpt
+   ```
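
Since the note added in this commit ties reproducible evaluation to `--max_frames 768 --max_pixels 61250`, one way to avoid drifting from those values is to define them once and generate the evaluation command from them. The following is a minimal sketch, not part of the repository: `build_eval_cmd` is a hypothetical helper, and it only prints the command (a dry run) rather than invoking `scripts/test.sh` or `scripts/test_8.sh`. The flag names and values are copied from the README commands above; the path arguments are the same placeholders the README uses.

```bash
#!/usr/bin/env bash
# Sketch: keep the README's evaluation settings in one place.
set -euo pipefail

# Values copied from the README's evaluation commands.
INTERVAL=0.1
MAX_FRAMES=768
MAX_PIXELS=61250

# build_eval_cmd is a hypothetical helper (an assumption, not a repo script).
#   $1: evaluation script (scripts/test.sh or scripts/test_8.sh)
#   $2: dataset path
#   $3: model path, passed to both --model and --model_base as in the README
#   $4: LoRA checkpoint path
build_eval_cmd() {
  local script="$1" dataset="$2" model="$3" lora_ckpt="$4"
  printf 'bash %s --interval %s --run_name eval --dataset %s --max_frames %s --max_pixels %s --model %s --model_base %s --lora_ckpt %s\n' \
    "$script" "$INTERVAL" "$dataset" "$MAX_FRAMES" "$MAX_PIXELS" \
    "$model" "$model" "$lora_ckpt"
}

# Dry run: print the command for inspection instead of executing it.
build_eval_cmd scripts/test.sh path_to_dataset path_to_audio_model model_ckpt
```

Passing `scripts/test_8.sh` as the first argument yields the corresponding 72B command; piping the printed line to `bash` would execute it once the placeholder paths are replaced with real ones.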