# Plapre Simple - Phoneme-based TTS Model

A simplified phoneme-based text-to-speech model built on Qwen3-0.6B-Base, trained to generate audio tokens from phoneme sequences.
## Model Overview

This model is trained to perform phoneme-to-audio-token generation using a causal language modeling approach. It takes phoneme sequences as input and generates audio tokens that can be decoded by a neural audio codec (e.g., NeuCodec).
## Tokenization

The model uses a custom phoneme-based tokenizer with the following vocabulary structure:

### Vocabulary Composition (66,192 tokens total)

1. **Standard tokens (4)**: `<pad>`, `<unk>`, `<bos>`, `<eos>`
2. **Phonemes (109)**: IPA phoneme characters from `phoneme_list.json`
3. **Audio tokens (65,536)**: `<audio_0>` to `<audio_65535>` (representing neural codec codes)
4. **Special structure tokens (8)**:
   - `<phoneme_start>`, `<phoneme_end>`
   - `<audio_start>`, `<audio_end>`
   - `<ref_audio_start>`, `<ref_audio_end>`
   - `<ref_text_start>`, `<ref_text_end>`
5. **Placeholder tokens (128)**: `<placeholder_0>` to `<placeholder_127>` (reserved for future use)
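The vocabulary order above can be sketched as follows. This is an illustrative reconstruction (the actual `PhonemeTokenizer` in `train_simple.py` is the source of truth); `build_vocab` and its two-phoneme sample input are assumptions for the example.

```python
# Illustrative sketch of the vocabulary layout described above.
# Token IDs are assigned in the order the groups are listed.
def build_vocab(phonemes):
    vocab = {}

    def add(tok):
        vocab[tok] = len(vocab)

    for tok in ["<pad>", "<unk>", "<bos>", "<eos>"]:      # standard tokens
        add(tok)
    for p in phonemes:                                    # IPA symbols from phoneme_list.json
        add(p)
    for i in range(65536):                                # neural codec codes
        add(f"<audio_{i}>")
    for tok in ["<phoneme_start>", "<phoneme_end>",       # structure tokens
                "<audio_start>", "<audio_end>",
                "<ref_audio_start>", "<ref_audio_end>",
                "<ref_text_start>", "<ref_text_end>"]:
        add(tok)
    for i in range(128):                                  # reserved placeholders
        add(f"<placeholder_{i}>")
    return vocab

vocab = build_vocab(["a", "b"])  # the real list has 109 IPA symbols
```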
## Training Sequence Format

The model is trained on sequences with the following structure:

```
<phoneme_start> + [phoneme tokens] + <phoneme_end> + <audio_start> + [audio tokens] + <audio_end>
```
### Example Sequence

For the text "You know, when":

1. **Text**: "You know, when"
2. **Phonemes**: `juː nˈoʊ, wˌɛn`
3. **Training sequence**:

```
<phoneme_start>
j u ː n ˈ o ʊ , w ˌ ɛ n
<phoneme_end>
<audio_start>
<audio_2151> <audio_43235> <audio_56802> ... (audio tokens)
<audio_end>
```
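Assembling a sequence in this format can be sketched as below. This is an assumed helper, not code from the repository:

```python
# Build one training sequence (as token strings) in the format shown above:
# <phoneme_start> phonemes <phoneme_end> <audio_start> audio codes <audio_end>
def build_sequence(phoneme_tokens, audio_codes):
    return (["<phoneme_start>"] + list(phoneme_tokens) + ["<phoneme_end>"]
            + ["<audio_start>"]
            + [f"<audio_{c}>" for c in audio_codes]
            + ["<audio_end>"])

seq = build_sequence(list("juː"), [2151, 43235, 56802])
```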
### Training Objective

- The model uses causal language modeling (next-token prediction)
- **Phoneme tokens are masked** in the loss (labels set to -100)
- **Only audio tokens are trained** to be predicted from the phoneme context
- This teaches the model to generate audio tokens conditioned on phoneme input
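A minimal sketch of this masking, assuming everything up to and including `<audio_start>` is ignored by the loss (Hugging Face's cross-entropy ignores label `-100`); `make_labels` is a hypothetical helper:

```python
# Mask the phoneme context so only audio-token positions contribute to the loss.
def make_labels(input_ids, audio_start_id):
    labels = list(input_ids)
    boundary = input_ids.index(audio_start_id)
    for i in range(boundary + 1):
        labels[i] = -100  # masked: phoneme tokens, structure tokens, <audio_start>
    return labels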
## Phoneme Encoding

Text is converted to phonemes using espeak-ng with the following settings:

- Language: `en-us`
- Preserve punctuation: `True`
- With stress markers: `True`

Phonemes are then tokenized character-by-character (each IPA symbol is a separate token).
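The character-by-character split can be sketched as below. In the real pipeline the IPA string would come from espeak-ng; here it is hard-coded, and `tokenize_phonemes` (including the assumption that whitespace is dropped) is illustrative:

```python
# Character-level phoneme tokenization: each non-space character
# (IPA symbol, stress mark, punctuation) becomes one token.
def tokenize_phonemes(ipa: str):
    return [ch for ch in ipa if not ch.isspace()]

tokens = tokenize_phonemes("juː nˈoʊ, wˌɛn")  # IPA for "You know, when"
```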
## Audio Token Encoding

Audio codes from the neural codec (range 0-65535) are mapped to vocabulary tokens:

- Audio code `n` → token `<audio_n>` → token ID `(audio_token_start_id + n)`
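A sketch of this mapping, where `AUDIO_TOKEN_START_ID` (the ID of `<audio_0>`) is an assumed constant derived from the vocabulary order above (4 standard + 109 phoneme tokens precede the audio block):

```python
AUDIO_TOKEN_START_ID = 113  # assumption: ID of <audio_0> given the vocab layout above

def audio_code_to_id(code: int) -> int:
    # Audio code n -> token <audio_n> -> token ID (audio_token_start_id + n)
    assert 0 <= code <= 65535
    return AUDIO_TOKEN_START_ID + code

def id_to_audio_code(token_id: int) -> int:
    return token_id - AUDIO_TOKEN_START_ID
```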
## Model Details

- **Base model**: Qwen3-0.6B-Base
- **Vocabulary size**: 66,192 tokens
- **Training dataset**: neuphonic/emilia-yodas-english-neucodec
- **Batch size**: 16 (effective)
- **Precision**: bfloat16
- **Attention**: Flash Attention 2
## Usage

To use this model, you'll need:

1. The custom `PhonemeTokenizer` class (see `train_simple.py`)
2. espeak-ng for phonemization
3. A neural audio codec decoder for converting audio tokens to waveforms

```python
from transformers import AutoModelForCausalLM
from train_simple import PhonemeTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("syvai/plapre-simple")
tokenizer = PhonemeTokenizer.from_pretrained("syvai/plapre-simple")

# Your inference code here
```
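After generation, the codec codes must be recovered from the token IDs between `<audio_start>` and `<audio_end>`. A hedged sketch, where all three ID constants are assumptions derived from the vocabulary order described above:

```python
# Assumed token IDs, given the vocab layout: 4 standard + 109 phonemes
# + 65536 audio tokens, then the structure tokens in listed order.
AUDIO_TOKEN_START_ID = 113    # assumed ID of <audio_0>
AUDIO_START_ID = 65651        # assumed ID of <audio_start>
AUDIO_END_ID = 65652          # assumed ID of <audio_end>

def extract_audio_codes(generated_ids):
    # Take the span between <audio_start> and <audio_end>,
    # then map token IDs back to codec codes for the decoder.
    start = generated_ids.index(AUDIO_START_ID) + 1
    end = generated_ids.index(AUDIO_END_ID, start)
    return [t - AUDIO_TOKEN_START_ID for t in generated_ids[start:end]]
```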
## Files in Repository

- `config.json` - Model configuration
- `model.safetensors` / `pytorch_model.bin` - Model weights
- `tokenizer_config.json` - Tokenizer configuration and vocabulary
- `phoneme_list.json` - List of phonemes used in vocabulary
- `README.md` - This file
## Training Details

Trained using the Hugging Face Transformers `Trainer` with:

- Learning rate: 0.0002
- Warmup steps: 1000
- Gradient accumulation: 4
- Per-device batch size: 4
- Optimizer: AdamW
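An equivalent `TrainingArguments` fragment might look like this. The hyperparameters come from this section; `output_dir` is a placeholder and other arguments are omitted:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="plapre-simple",       # placeholder
    learning_rate=2e-4,
    warmup_steps=1000,
    gradient_accumulation_steps=4,
    per_device_train_batch_size=4,    # effective batch = 4 x 4 = 16 on one device
    bf16=True,
)
```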
## License

Inherits license from Qwen3-0.6B-Base.