# Training Guide

This guide walks you through preparing the data, configuring your training setup, and launching GRPO training for the ThinkSound model. For best results, read through all steps before starting.

---

## Step 1: Prepare the Dataset

Before training, you must preprocess the dataset following the instructions in [Dataset.md](./Dataset.md). This includes:

1. Converting raw videos and CoT annotations into structured feature `.npz` files.
2. Constructing a valid dataset configuration JSON that points to all precomputed features.

Make sure your extracted dataset includes all required modalities and is organized correctly.
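The exact schema of the dataset configuration JSON is defined in [Dataset.md](./Dataset.md). As a purely illustrative sketch of the kind of file Step 1 produces, it might look like the fragment below; the key names and paths here are assumptions, not the real schema:

```json
{
  "dataset_type": "precomputed_features",
  "datasets": [
    {
      "id": "my_dataset",
      "path": "data/features/train",
      "split_path": "data/features/split_train.txt"
    }
  ]
}
```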

---

## Step 2: Configure Training Script

Open `scripts/PrismAudio/grpo_1node8gpus.sh` and customize the following items:

Under the `grpo/config` section, set the paths to your model and configuration files:

* `model_config`: Path to the model architecture config (e.g., `ThinkSound/configs/model_configs/prismaudio.json`)
* `pretransform_ckpt_path`: Path to the pretrained model checkpoint (e.g., `ckpts/prismaudio.ckpt`)
* `dataset_config`: Path to your dataset configuration JSON prepared in Step 1

Also modify distributed training settings as needed:

* `num_gpus`, `num_nodes`, `node_rank`, `MASTER_PORT`, etc.

(Optional) Enable debug mode by adding the `--debug` flag when running the script.

### 🔍 Tip

If you're using a multi-GPU setup, ensure the `WORLD_SIZE`, `NODE_RANK`, and `MASTER_PORT` are correctly set for your environment. These are critical for DistributedDataParallel (DDP) training.
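Concretely, the edited section of the script might look like the sketch below. The variable names are illustrative, so match them to what your copy of `grpo_1node8gpus.sh` actually defines; only `model_config`, `pretransform_ckpt_path`, `dataset_config`, `WORLD_SIZE`, `NODE_RANK`, and `MASTER_PORT` are named in this guide.

```shell
# Illustrative sketch -- align names and paths with your copy of
# scripts/PrismAudio/grpo_1node8gpus.sh.
model_config=ThinkSound/configs/model_configs/prismaudio.json
pretransform_ckpt_path=ckpts/prismaudio.ckpt
dataset_config=data/my_dataset_config.json   # from Step 1

# Distributed (DDP) settings for one node with 8 GPUs:
num_gpus=8
num_nodes=1
NODE_RANK=0          # rank of this node (0 on a single node)
MASTER_PORT=29500    # any free port on the rank-0 node
WORLD_SIZE=$((num_gpus * num_nodes))
```

For multi-node jobs, launch the script once per node with the same `MASTER_PORT` and a distinct `NODE_RANK` on each node.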

---

## Step 3: Configure Reward Functions *(Optional)*

ThinkSound supports two optional reward functions during GRPO training. To enable them, provide the corresponding reference paths when extracting features (see [Dataset.md](./Dataset.md)):

| Reward | Required Argument | Description |
|--------|------------------|-------------|
| **Synchformer** | `--add_video_path` | Enables audio-visual synchronization reward |
| **ITD** | `--add_audio_path` | Enables inter-track distance reward using reference audio |

These paths are embedded into the extracted `.npz` feature files during dataset preparation and, when present, are used automatically during GRPO training.
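For illustration, enabling both rewards during feature extraction might look like the following. The script name and data paths are placeholders ([Dataset.md](./Dataset.md) documents the real extraction command); only the `--add_video_path` and `--add_audio_path` flags come from this guide. The command is stored in a variable so you can inspect it before running.

```shell
# "extract_features.py" and the data paths below are placeholders;
# see Dataset.md for the real entry point.
# --add_video_path -> enables the Synchformer audio-visual sync reward
# --add_audio_path -> enables the ITD reward against reference audio
EXTRACT_CMD="python extract_features.py \
  --add_video_path data/raw/videos \
  --add_audio_path data/raw/reference_audio"
echo "$EXTRACT_CMD"
```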

---

## Step 4: Launch Training

Make the script executable (if not already) and start training:

```bash
chmod +x scripts/PrismAudio/grpo_1node8gpus.sh
./scripts/PrismAudio/grpo_1node8gpus.sh
```

Logs will be written to the specified log directory (`log_dir`).
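To watch training progress, you can follow the newest file in that directory; `logs` below is a stand-in for whatever `log_dir` is set to in your script.

```shell
# LOG_DIR is a placeholder for the script's log_dir setting.
LOG_DIR=${LOG_DIR:-logs}
mkdir -p "$LOG_DIR"
# Print the most recently modified log file (follow it with `tail -f`):
latest=$(ls -t "$LOG_DIR" | head -n 1)
echo "latest log: $LOG_DIR/$latest"
```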

---

## Step 5: Customize Model and Training Parameters

To modify model architecture or training strategy, open the model config file specified in `grpo/config`.
You can adjust a wide range of parameters, such as:

* Model size (number of layers, hidden width, and hence parameter count)
* Optimizer type
* Learning rate
* Latent dimension
* GRPO-specific reward weights

Be sure to keep a backup of your config for reproducibility.
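As a purely illustrative sketch, such a config might group these parameters like the fragment below; the key names are assumptions, not the real schema of `prismaudio.json`:

```json
{
  "model": {
    "latent_dim": 64,
    "num_layers": 24
  },
  "training": {
    "optimizer": "AdamW",
    "learning_rate": 1e-5
  },
  "grpo": {
    "reward_weights": {
      "synchformer": 1.0,
      "itd": 0.5
    }
  }
}
```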

---



Happy training! 🚀  
If you run into issues, check the documentation for details or open an issue on the repository.