| ## Training Framework | |
| The training code uses [ms-swift](https://github.com/modelscope/ms-swift), a scalable lightweight infrastructure for fine-tuning large language models. | |
| ## Model Configuration | |
| ### `MODEL_PATH` Parameter | |
| The `MODEL_PATH` in `train.sh` should point to the base model. Download the model from [HuggingFace](https://huggingface.co/datasets/bolshyC/qwen3-0.6B-music): | |
| ```bash | |
| # Download the model using huggingface_hub | |
| huggingface-cli download bolshyC/qwen3-0.6B-music --local-dir ./qwen3-0.6B-music | |
| ``` | |
| Then modify `MODEL_PATH` in `train.sh` to point to the local path: | |
| ```bash | |
| MODEL_PATH="./qwen3-0.6B-music" # or absolute path | |
| ``` | |
| ## Dataset Configuration | |
| ### `--dataset` Parameter | |
| **Note:** The current script `train.sh` uses `train_demo.jsonl` (for demonstration purposes). For actual training, you need to use the full dataset. | |
| ### Actual Training Data | |
| For actual training, please use the following two files from the [HuggingFace dataset](https://huggingface.co/datasets/bolshyC/Muse_train): | |
| - **`train_cn.jsonl`** - Chinese training data | |
| - **`train_en.jsonl`** - English training data | |
| ### Usage | |
| 1. Download the dataset from HuggingFace: | |
| ```bash | |
| # Using huggingface_hub to download | |
| huggingface-cli download bolshyC/Muse_train train_cn.jsonl --local-dir ./data | |
| huggingface-cli download bolshyC/Muse_train train_en.jsonl --local-dir ./data | |
| ``` | |
| 2. Modify the `--dataset` parameter in `train.sh`: | |
| ```bash | |
| # If using Chinese data only | |
| --dataset 'data/train_cn.jsonl' | |
| # If using both Chinese and English data (comma-separated, no spaces) | |
| --dataset 'data/train_cn.jsonl,data/train_en.jsonl' | |
| ``` | |
| **Note:** In ms-swift, multiple dataset files should be comma-separated without spaces. | |
| ## Building Custom Training Data | |
| If you want to build your own training dataset, you need to encode audio files into discrete tokens using MuCodec. | |
| ### Audio Encoding | |
| Use `train/encode_audio.py` to encode audio files into discrete tokens: | |
| 1. **Prepare input data file**: Create a JSONL file where each line contains a dictionary with an audio file path: | |
| ```json | |
| {"path": "path/to/audio1.wav"} | |
| {"path": "path/to/audio2.mp3"} | |
| ``` | |
| 2. **Modify paths in `encode_audio.py`**: | |
| - Set `DATA_PATH` to your input JSONL file path | |
| - Set `SAVE_DIR` to the directory where encoded tokens will be saved | |
| 3. **Run encoding**: | |
| ```bash | |
| python train/encode_audio.py | |
| ``` | |
| The script will: | |
| - Load audio files from the paths specified in the JSONL file | |
| - Encode each audio file into discrete tokens using MuCodec | |
| - Save the encoded tokens as `.pt` files in the `SAVE_DIR` directory | |
| - Skip files that have already been encoded | |
| **Note:** The audio files should be in WAV or MP3 format and will be automatically resampled to 48kHz if needed. | |
| ## Training Performance | |
| ### Training Time | |
| On 8Γ H200 GPUs, training one epoch takes approximately **150 minutes**. | |