# Plapre Simple - Phoneme-based TTS Model

A simplified phoneme-based text-to-speech model built on Qwen3-0.6B-Base, trained to generate audio tokens from phoneme sequences.
## Model Overview

This model is trained to perform phoneme-to-audio-token generation using a causal language modeling approach. It takes phoneme sequences as input and generates audio tokens that can be decoded by a neural audio codec (e.g., NeuCodec).
## Tokenization

The model uses a custom phoneme-based tokenizer with the following vocabulary structure:

### Vocabulary Composition (66,192 tokens total)

1. **Standard tokens (4)**: `<pad>`, `<unk>`, `<bos>`, `<eos>`
2. **Phonemes (109)**: IPA phoneme characters from `phoneme_list.json`
3. **Audio tokens (65,536)**: `<audio_0>` to `<audio_65535>` (representing neural codec codes)
4. **Special structure tokens (8)**:
   - `<phoneme_start>`, `<phoneme_end>`
   - `<audio_start>`, `<audio_end>`
   - `<ref_audio_start>`, `<ref_audio_end>`
   - `<ref_text_start>`, `<ref_text_end>`
5. **Placeholder tokens (128)**: `<placeholder_0>` to `<placeholder_127>` (reserved for future use)
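The vocabulary order above can be sketched as follows. This is an illustrative reconstruction (the actual `PhonemeTokenizer` in `train_simple.py` is the source of truth); `build_vocab` and its two-phoneme sample input are assumptions for the example.

```python
# Illustrative sketch of the vocabulary layout described above.
# Token IDs are assigned in the order the groups are listed.
def build_vocab(phonemes):
    vocab = {}

    def add(tok):
        vocab[tok] = len(vocab)

    for tok in ["<pad>", "<unk>", "<bos>", "<eos>"]:      # standard tokens
        add(tok)
    for p in phonemes:                                    # IPA symbols from phoneme_list.json
        add(p)
    for i in range(65536):                                # neural codec codes
        add(f"<audio_{i}>")
    for tok in ["<phoneme_start>", "<phoneme_end>",       # structure tokens
                "<audio_start>", "<audio_end>",
                "<ref_audio_start>", "<ref_audio_end>",
                "<ref_text_start>", "<ref_text_end>"]:
        add(tok)
    for i in range(128):                                  # reserved placeholders
        add(f"<placeholder_{i}>")
    return vocab

vocab = build_vocab(["a", "b"])  # the real list has 109 IPA symbols
```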
## Training Sequence Format

The model is trained on sequences with the following structure:

```
<phoneme_start> + [phoneme tokens] + <phoneme_end> + <audio_start> + [audio tokens] + <audio_end>
```
### Example Sequence

For the text "You know, when":

1. **Text**: "You know, when"
2. **Phonemes**: `juː nˈoʊ, wˌɛn`
3. **Training sequence**:

```
<phoneme_start>
j u ː n ˈ o ʊ , w ˌ ɛ n
<phoneme_end>
<audio_start>
<audio_2151> <audio_43235> <audio_56802> ... (audio tokens)
<audio_end>
```
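Assembling a sequence in this format can be sketched as below. This is an assumed helper, not code from the repository:

```python
# Build one training sequence (as token strings) in the format shown above:
# <phoneme_start> phonemes <phoneme_end> <audio_start> audio codes <audio_end>
def build_sequence(phoneme_tokens, audio_codes):
    return (["<phoneme_start>"] + list(phoneme_tokens) + ["<phoneme_end>"]
            + ["<audio_start>"]
            + [f"<audio_{c}>" for c in audio_codes]
            + ["<audio_end>"])

seq = build_sequence(list("juː"), [2151, 43235, 56802])
```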
### Training Objective

- The model uses causal language modeling (next-token prediction)
- **Phoneme tokens are masked** in the loss (labels set to -100)
- **Only audio tokens are trained** to be predicted from the phoneme context
- This teaches the model to generate audio tokens conditioned on phoneme input
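A minimal sketch of this masking, assuming everything up to and including `<audio_start>` is ignored by the loss (Hugging Face's cross-entropy ignores label `-100`); `make_labels` is a hypothetical helper:

```python
# Mask the phoneme context so only audio-token positions contribute to the loss.
def make_labels(input_ids, audio_start_id):
    labels = list(input_ids)
    boundary = input_ids.index(audio_start_id)
    for i in range(boundary + 1):
        labels[i] = -100  # masked: phoneme tokens, structure tokens, <audio_start>
    return labels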
## Phoneme Encoding

Text is converted to phonemes using espeak-ng with the following settings:

- Language: `en-us`
- Preserve punctuation: `True`
- With stress markers: `True`

Phonemes are then tokenized character-by-character (each IPA symbol is a separate token).
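The character-by-character split can be sketched as below. In the real pipeline the IPA string would come from espeak-ng; here it is hard-coded, and `tokenize_phonemes` (including the assumption that whitespace is dropped) is illustrative:

```python
# Character-level phoneme tokenization: each non-space character
# (IPA symbol, stress mark, punctuation) becomes one token.
def tokenize_phonemes(ipa: str):
    return [ch for ch in ipa if not ch.isspace()]

tokens = tokenize_phonemes("juː nˈoʊ, wˌɛn")  # IPA for "You know, when"
```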
## Audio Token Encoding

Audio codes from the neural codec (range 0-65535) are mapped to vocabulary tokens:

- Audio code `n` → token `<audio_n>` → token ID `(audio_token_start_id + n)`
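A sketch of this mapping, where `AUDIO_TOKEN_START_ID` (the ID of `<audio_0>`) is an assumed constant derived from the vocabulary order above (4 standard + 109 phoneme tokens precede the audio block):

```python
AUDIO_TOKEN_START_ID = 113  # assumption: ID of <audio_0> given the vocab layout above

def audio_code_to_id(code: int) -> int:
    # Audio code n -> token <audio_n> -> token ID (audio_token_start_id + n)
    assert 0 <= code <= 65535
    return AUDIO_TOKEN_START_ID + code

def id_to_audio_code(token_id: int) -> int:
    return token_id - AUDIO_TOKEN_START_ID
```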
## Model Details

- **Base model**: Qwen3-0.6B-Base
- **Vocabulary size**: 66,192 tokens
- **Training dataset**: neuphonic/emilia-yodas-english-neucodec
- **Batch size**: 16 (effective)
- **Precision**: bfloat16
- **Attention**: Flash Attention 2
## Usage

To use this model, you'll need:

1. The custom `PhonemeTokenizer` class (see `train_simple.py`)
2. espeak-ng for phonemization
3. A neural audio codec decoder for converting audio tokens to waveforms

```python
from transformers import AutoModelForCausalLM
from train_simple import PhonemeTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("syvai/plapre-simple")
tokenizer = PhonemeTokenizer.from_pretrained("syvai/plapre-simple")

# Your inference code here
```
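After generation, the codec codes must be recovered from the token IDs between `<audio_start>` and `<audio_end>`. A hedged sketch, where all three ID constants are assumptions derived from the vocabulary order described above:

```python
# Assumed token IDs, given the vocab layout: 4 standard + 109 phonemes
# + 65536 audio tokens, then the structure tokens in listed order.
AUDIO_TOKEN_START_ID = 113    # assumed ID of <audio_0>
AUDIO_START_ID = 65651        # assumed ID of <audio_start>
AUDIO_END_ID = 65652          # assumed ID of <audio_end>

def extract_audio_codes(generated_ids):
    # Take the span between <audio_start> and <audio_end>,
    # then map token IDs back to codec codes for the decoder.
    start = generated_ids.index(AUDIO_START_ID) + 1
    end = generated_ids.index(AUDIO_END_ID, start)
    return [t - AUDIO_TOKEN_START_ID for t in generated_ids[start:end]]
```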
## Files in Repository

- `config.json` - Model configuration
- `model.safetensors` / `pytorch_model.bin` - Model weights
- `tokenizer_config.json` - Tokenizer configuration and vocabulary
- `phoneme_list.json` - List of phonemes used in vocabulary
- `README.md` - This file
## Training Details

Trained using the Hugging Face Transformers `Trainer` with:

- Learning rate: 0.0002
- Warmup steps: 1000
- Gradient accumulation: 4
- Per-device batch size: 4
- Optimizer: AdamW
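An equivalent `TrainingArguments` fragment might look like this. The hyperparameters come from this section; `output_dir` is a placeholder and other arguments are omitted:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="plapre-simple",       # placeholder
    learning_rate=2e-4,
    warmup_steps=1000,
    gradient_accumulation_steps=4,
    per_device_train_batch_size=4,    # effective batch = 4 x 4 = 16 on one device
    bf16=True,
)
```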
## License

Inherits license from Qwen3-0.6B-Base.