Update README.md
README.md
CHANGED
@@ -21,13 +21,15 @@ video-SALMONN 2+ is built on Qwen 2.5-VL using a similar pipeline of video-SALMO

## How to Use

1. Prepare the dataset following `scripts/example_av.json`, `scripts/example_v.json`, `scripts/example_dpo.json`, and `scripts/example_a.json`
2. Prepare the base audio model by modifying the path in `gen_audio_model.py`
3. To conduct audio alignment, use the following script:
```bash
bash scripts/train.sh --interval 0.1 --run_name audio_alignment --dataset path_to_dataset --lr 2e-5 --train_qformer --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --bs 16 --epoch 5 --save_steps 5000
```
-4. To conduct audio
```bash
bash scripts/train.sh --interval 0.1 --run_name av_sft --dataset path_to_dataset --lr 2e-5 --train_qformer --train_proj --max_frames 768 --max_pixels 61250 --model audio_align_model --model_base path_to_audio_model --epoch 5 --save_steps 2000 --use_lora --lora_r 128 --lora_alpha 256
```
@@ -35,11 +37,11 @@ video-SALMONN 2+ is built on Qwen 2.5-VL using a similar pipeline of video-SALMO
```bash
bash scripts/train.sh --interval 0.1 --run_name dpo --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model audio_visual_base --model_base audio_align_model --lora_ckpt audio_visual_checkpoint --train_type gdpo --use_lora --lora_r 128 --lora_alpha 256 --lr 5e-6 --epoch 1 --save_steps 200 --train_qformer --train_proj
```
-6. To evaluate
```bash
bash scripts/test.sh --interval 0.1 --run_name eval --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --lora_ckpt model_ckpt
```
7. To evaluate the 72B model, use the following script:
```bash
bash scripts/test_8.sh --interval 0.1 --run_name eval --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --lora_ckpt model_ckpt
-
```
## How to Use

+**IMPORTANT**: To reproduce the reported evaluation results, use `--max_frames 768 --max_pixels 61250`. Evaluating at a much higher resolution or frame rate can push the input token count beyond what the model handles well, which may degrade performance.
+
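As a rough illustration of why these limits matter, the sketch below estimates the visual token count implied by the two flags. It assumes the usual Qwen2.5-VL accounting (one token per 28x28-pixel area after 2x2 patch merging, and two frames per temporal patch) and that `--max_pixels` applies per frame. Neither assumption is confirmed for this repository, and the function name `estimate_visual_tokens` is hypothetical. Audio and text tokens are not counted.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository.
# Rough upper bound on visual tokens for a given frame and pixel budget.
# Assumptions: 28x28 pixels per visual token, 2 frames per temporal patch,
# and --max_pixels being a per-frame limit.
estimate_visual_tokens() {
  local max_frames=$1 max_pixels=$2
  local pixels_per_token=$((28 * 28))
  # Integer division: tokens contributed by one frame at the pixel cap.
  local tokens_per_frame=$((max_pixels / pixels_per_token))
  # Frames are grouped in pairs along the time axis (rounded up).
  local temporal_groups=$(((max_frames + 1) / 2))
  echo $((tokens_per_frame * temporal_groups))
}

estimate_visual_tokens 768 61250   # prints 29952
```

Under these assumptions, doubling `--max_pixels` or `--max_frames` roughly doubles the visual token count, which is the growth the note above warns about.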
1. Prepare the dataset following `scripts/example_av.json`, `scripts/example_v.json`, `scripts/example_dpo.json`, and `scripts/example_a.json`
2. Prepare the base audio model by modifying the path in `gen_audio_model.py`
3. To conduct audio alignment, use the following script:
```bash
bash scripts/train.sh --interval 0.1 --run_name audio_alignment --dataset path_to_dataset --lr 2e-5 --train_qformer --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --bs 16 --epoch 5 --save_steps 5000
```
+4. To conduct audio-visual SFT, use the following script:
```bash
bash scripts/train.sh --interval 0.1 --run_name av_sft --dataset path_to_dataset --lr 2e-5 --train_qformer --train_proj --max_frames 768 --max_pixels 61250 --model audio_align_model --model_base path_to_audio_model --epoch 5 --save_steps 2000 --use_lora --lora_r 128 --lora_alpha 256
```
```bash
bash scripts/train.sh --interval 0.1 --run_name dpo --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model audio_visual_base --model_base audio_align_model --lora_ckpt audio_visual_checkpoint --train_type gdpo --use_lora --lora_r 128 --lora_alpha 256 --lr 5e-6 --epoch 1 --save_steps 200 --train_qformer --train_proj
```
+6. To evaluate the 7B model, use the following script:
```bash
bash scripts/test.sh --interval 0.1 --run_name eval --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --lora_ckpt model_ckpt
```
7. To evaluate the 72B model, use the following script:
```bash
bash scripts/test_8.sh --interval 0.1 --run_name eval --dataset path_to_dataset --max_frames 768 --max_pixels 61250 --model path_to_audio_model --model_base path_to_audio_model --lora_ckpt model_ckpt
+
```
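The two evaluation commands differ only in the script name, so one way to keep the recommended frame and pixel limits consistent is to define them once. The sketch below is a hypothetical dry-run helper, not part of the repository: `print_eval_cmd` and `FRAME_ARGS` are illustrative names, and it only prints the command rather than running it. All flags and script paths are copied from the commands above, and it follows them in passing the same value to `--model` and `--model_base`.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository.
# Prints (does not execute) an evaluation command for the chosen model size,
# so the recommended limits are defined in exactly one place.
FRAME_ARGS=(--max_frames 768 --max_pixels 61250)

# Usage: print_eval_cmd <7b|72b> <dataset> <model> <lora_ckpt>
print_eval_cmd() {
  local size=$1 dataset=$2 model=$3 lora_ckpt=$4 script
  case "$size" in
    7b)  script=scripts/test.sh ;;    # 7B evaluation script
    72b) script=scripts/test_8.sh ;;  # 72B evaluation script
    *)   echo "unknown model size: $size" >&2; return 1 ;;
  esac
  # Same flag order as the documented commands.
  echo bash "$script" --interval 0.1 --run_name eval --dataset "$dataset" \
    "${FRAME_ARGS[@]}" --model "$model" --model_base "$model" \
    --lora_ckpt "$lora_ckpt"
}

print_eval_cmd 7b path_to_dataset path_to_audio_model model_ckpt
```

Once the printed command looks right, it can be copied into a shell and run as usual.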