
	## Training Framework

	The training code uses [ms-swift](https://github.com/modelscope/ms-swift), a scalable lightweight infrastructure for fine-tuning large language models.

	## Model Configuration

	### `MODEL_PATH` Parameter

	The `MODEL_PATH` in `train.sh` should point to the base model. Download the model from [HuggingFace](https://huggingface.co/datasets/bolshyC/qwen3-0.6B-music):

	```bash
	# Download the model using huggingface_hub
	huggingface-cli download bolshyC/qwen3-0.6B-music --local-dir ./qwen3-0.6B-music
	```

	Then modify `MODEL_PATH` in `train.sh` to point to the local path:
	```bash
	MODEL_PATH="./qwen3-0.6B-music" # or absolute path
	```
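Before launching training, it can help to confirm that `MODEL_PATH` points at a usable download. The following is a minimal sketch, not part of `train.sh`: the helper name `check_model_dir` is hypothetical, and it assumes the downloaded checkpoint has a `config.json` at its top level, as Hugging Face Transformers checkpoints normally do.

```bash
# Hypothetical helper (not part of train.sh): sanity-check a local model directory.
# Assumption: a usable checkpoint has a config.json at its top level.
check_model_dir() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    if [ ! -f "$dir/config.json" ]; then
        echo "no config.json in: $dir" >&2
        return 1
    fi
    echo "ok: $dir"
}
```

For example, `check_model_dir "./qwen3-0.6B-music"` prints `ok: ./qwen3-0.6B-music` when the directory looks complete and returns a non-zero status otherwise.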

	## Dataset Configuration

	### `--dataset` Parameter

	Note: As shipped, `train.sh` uses `train_demo.jsonl`, which is for demonstration only. For actual training, use the full dataset described below.

	### Actual Training Data

	For actual training, please use the following two files from the [HuggingFace dataset](https://huggingface.co/datasets/bolshyC/Muse_train):

	- `train_cn.jsonl` - Chinese training data
	- `train_en.jsonl` - English training data

	### Usage

	1. Download the dataset from HuggingFace:
	```bash
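	# Download both files with huggingface-cli (part of huggingface_hub).
	# Muse_train is a dataset repository, so --repo-type dataset is required.
	huggingface-cli download bolshyC/Muse_train train_cn.jsonl --repo-type dataset --local-dir ./data
	huggingface-cli download bolshyC/Muse_train train_en.jsonl --repo-type dataset --local-dir ./data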
	```

	2. Modify the `--dataset` parameter in `train.sh`:
	```bash
	# If using Chinese data only
	--dataset 'data/train_cn.jsonl'

	# If using both Chinese and English data (comma-separated, no spaces)
	--dataset 'data/train_cn.jsonl,data/train_en.jsonl'
	```

	Note: In ms-swift, multiple dataset files should be comma-separated without spaces.
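The comma-separated value can be assembled from a list of files rather than typed by hand. This is a minimal sketch, assuming the files were downloaded to `./data` as above; `join_datasets` is a hypothetical helper, not an ms-swift command.

```bash
# Hypothetical helper: join file paths with commas (no spaces),
# failing early if any file is missing.
join_datasets() {
    local joined="" f
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "dataset file not found: $f" >&2
            return 1
        fi
        # Add a comma only when something has already been collected.
        joined="${joined:+$joined,}$f"
    done
    printf '%s\n' "$joined"
}
```

One possible use is `DATASETS=$(join_datasets data/train_cn.jsonl data/train_en.jsonl)`, followed by passing `--dataset "$DATASETS"` in `train.sh`. Checking for missing files up front avoids discovering a typo only after the training job has started.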

	## Building Custom Training Data

	If you want to build your own training dataset, you need to encode audio files into discrete tokens using MuCodec.

	### Audio Encoding

	Use `train/encode_audio.py` to encode audio files into discrete tokens:

	1. Prepare the input data file: create a JSONL file in which each line is a JSON object containing an audio file path:
	```json
	{"path": "path/to/audio1.wav"}
	{"path": "path/to/audio2.mp3"}
	```

	2. Modify paths in `encode_audio.py`:
	- Set `DATA_PATH` to your input JSONL file path
	- Set `SAVE_DIR` to the directory where encoded tokens will be saved

	3. Run encoding:
	```bash
	python train/encode_audio.py
	```
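The input file from step 1 does not have to be written by hand. The sketch below generates it from a directory of audio files; `make_audio_jsonl` is a hypothetical helper (not part of this repository), and it assumes file paths contain no double quotes, backslashes, or newlines, so they can be embedded in JSON without escaping.

```bash
# Hypothetical helper: write one {"path": ...} line per WAV/MP3 file under a directory.
# Assumption: paths contain no double quotes, backslashes, or newlines.
make_audio_jsonl() {
    local audio_dir="$1" out_file="$2"
    find "$audio_dir" -type f \( -iname '*.wav' -o -iname '*.mp3' \) | sort |
    while IFS= read -r path; do
        printf '{"path": "%s"}\n' "$path"
    done > "$out_file"
}
```

For example, `make_audio_jsonl ./my_audio ./input.jsonl` writes the list, after which `DATA_PATH` in `encode_audio.py` can be set to `./input.jsonl`. The directory and file names here are placeholders.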

	The script will:
	- Load audio files from the paths specified in the JSONL file
	- Encode each audio file into discrete tokens using MuCodec
	- Save the encoded tokens as `.pt` files in the `SAVE_DIR` directory
	- Skip files that have already been encoded

	Note: Audio files should be in WAV or MP3 format; they are automatically resampled to 48 kHz if needed.
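Because already-encoded files are skipped, an interrupted run can simply be restarted. To see how far encoding has progressed, the input lines can be compared with the saved `.pt` files. This is a rough sketch: `encoding_progress` is a hypothetical helper, and it assumes `encode_audio.py` writes exactly one `.pt` file per input line somewhere under `SAVE_DIR`, which the script itself should be checked against.

```bash
# Hypothetical helper: compare the number of input lines with the number of .pt files.
# Assumption: one .pt file is written per input line, somewhere under SAVE_DIR.
encoding_progress() {
    local data_path="$1" save_dir="$2" total encoded
    # Count non-empty lines; "|| true" keeps an empty file from aborting under "set -e".
    total=$(grep -c . "$data_path" || true)
    encoded=$(find "$save_dir" -type f -name '*.pt' | wc -l | tr -d ' ')
    echo "$encoded/$total encoded"
}
```

Called as `encoding_progress ./input.jsonl ./tokens` (both paths are placeholders for your `DATA_PATH` and `SAVE_DIR`), it prints a line such as `1/2 encoded`.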

	## Training Performance

	### Training Time

	On 8× H200 GPUs, training one epoch takes approximately 150 minutes.
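For planning, this figure can be turned into a rough estimate for longer runs. The sketch below assumes time scales linearly with the number of epochs on the same 8× H200 setup, which is an assumption rather than a measured result; `estimate_training` is a hypothetical helper.

```bash
# Hypothetical helper: rough wall-clock and GPU-hour estimate.
# Uses the reported ~150 minutes per epoch on 8 GPUs; linear scaling is an assumption.
estimate_training() {
    local epochs="$1" minutes_per_epoch=150 gpus=8 total_min
    total_min=$((epochs * minutes_per_epoch))
    echo "$epochs epoch(s): ~$((total_min / 60))h $((total_min % 60))m wall-clock, ~$((total_min * gpus / 60)) GPU-hours"
}
```

For instance, `estimate_training 3` reports about 7h 30m of wall-clock time and about 60 GPU-hours.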