# Plapre Simple - Phoneme-based TTS Model

A simplified phoneme-based text-to-speech model built on Qwen3-0.6B-Base, trained to generate audio tokens from phoneme sequences.

## Model Overview

This model is trained to perform phoneme-to-audio-token generation using a causal language modeling approach. It takes phoneme sequences as input and generates audio tokens that can be decoded by a neural audio codec (e.g., NeuCodec).

## Tokenization

The model uses a custom phoneme-based tokenizer with the following vocabulary structure:

### Vocabulary Composition (66,192 tokens total)

1. **Standard tokens (4)**: ``, ``, ``, ``
2. **Phonemes (109)**: IPA phoneme characters from `phoneme_list.json`
3. **Audio tokens (65,536)**: `` to `` (representing neural codec codes)
4. **Special structure tokens (8)**:
   - ``, ``
   - ``, ``
   - ``, ``
   - ``, ``
5. **Placeholder tokens (128)**: `` to `` (reserved for future use)

## Training Sequence Format

The model is trained on sequences with the following structure:

```
 + [phoneme tokens] +  +  + [audio tokens] + 
```

### Example Sequence

For the text "You know, when":

1. **Text**: "You know, when"
2. **Phonemes**: `juː nˈoʊ, wˌɛn`
3. **Training sequence**:

```
j u ː n ˈ o ʊ , w ˌ ɛ n ... (audio tokens)
```

### Training Objective

- The model uses causal language modeling (next-token prediction)
- **Phoneme tokens are masked** in the loss (labels set to -100)
- **Only audio tokens are trained** to be predicted from the phoneme context
- This teaches the model to generate audio tokens conditioned on phoneme input

## Phoneme Encoding

Text is converted to phonemes using espeak-ng with the following settings:

- Language: `en-us`
- Preserve punctuation: `True`
- With stress markers: `True`

Phonemes are then tokenized character by character (each IPA symbol is a separate token).
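As a rough illustration of the character-by-character step, the sketch below splits an already-phonemized string into individual IPA symbols and maps them to IDs. The function name, the toy vocabulary, and the choice to skip spaces are all hypothetical here; the real mapping is defined by the `PhonemeTokenizer` in `train_simple.py`.

```python
# Illustrative sketch only: the real vocabulary and space handling come
# from the PhonemeTokenizer in train_simple.py, not from this toy mapping.

def tokenize_phonemes(phoneme_string, vocab):
    """Split a phoneme string into individual IPA symbols and map each to an ID."""
    ids = []
    for ch in phoneme_string:
        if ch == " ":
            continue  # assumption: word boundaries are not separate tokens
        ids.append(vocab[ch])
    return ids

# Toy vocabulary covering the example phonemes of "You know, when"
vocab = {ch: i for i, ch in enumerate("juːnˈoʊ,wˌɛ")}

print(tokenize_phonemes("juː nˈoʊ, wˌɛn", vocab))
# → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 3]
```

Note that `n` maps to the same ID (3) both times it appears, which is exactly the property character-level tokenization relies on: each IPA symbol has one fixed entry in the vocabulary.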
## Audio Token Encoding

Audio codes from the neural codec (range 0-65535) are mapped to vocabulary tokens:

- Audio code `n` → token `` → token ID `(audio_token_start_id + n)`

## Model Details

- **Base model**: Qwen3-0.6B-Base
- **Vocabulary size**: 66,192 tokens
- **Training dataset**: neuphonic/emilia-yodas-english-neucodec
- **Batch size**: 16 (effective)
- **Precision**: bfloat16
- **Attention**: Flash Attention 2

## Usage

To use this model, you'll need:

1. The custom `PhonemeTokenizer` class (see `train_simple.py`)
2. espeak-ng for phonemization
3. A neural audio codec decoder for converting audio tokens to waveforms

```python
from transformers import AutoModelForCausalLM

from train_simple import PhonemeTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("syvai/plapre-simple")
tokenizer = PhonemeTokenizer.from_pretrained("syvai/plapre-simple")

# Your inference code here
```

## Files in Repository

- `config.json` - Model configuration
- `model.safetensors` / `pytorch_model.bin` - Model weights
- `tokenizer_config.json` - Tokenizer configuration and vocabulary
- `phoneme_list.json` - List of phonemes used in the vocabulary
- `README.md` - This file

## Training Details

Trained using the Hugging Face Transformers `Trainer` with:

- Learning rate: 0.0002
- Warmup steps: 1000
- Gradient accumulation: 4
- Per-device batch size: 4
- Optimizer: AdamW

## License

Inherits its license from Qwen3-0.6B-Base.
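The audio-code-to-token-ID mapping above is a simple fixed offset, so it can be sketched as a pair of inverse functions. The `AUDIO_TOKEN_START_ID` value below is an assumption based on the vocabulary order listed earlier (4 standard tokens, then 109 phonemes, then the audio tokens); the authoritative offset lives in the tokenizer configuration.

```python
# Sketch of the audio-code <-> token-ID mapping. The start offset is an
# ASSUMPTION derived from the listed vocabulary order; read the real value
# from the tokenizer config rather than hard-coding it.

AUDIO_TOKEN_START_ID = 4 + 109  # 4 standard tokens + 109 phonemes (assumed layout)
NUM_AUDIO_CODES = 65536

def audio_code_to_token_id(code: int) -> int:
    """Map a neural codec code (0-65535) to its vocabulary token ID."""
    if not 0 <= code < NUM_AUDIO_CODES:
        raise ValueError(f"audio code out of range: {code}")
    return AUDIO_TOKEN_START_ID + code

def token_id_to_audio_code(token_id: int) -> int:
    """Inverse mapping: recover the codec code from a vocabulary token ID."""
    code = token_id - AUDIO_TOKEN_START_ID
    if not 0 <= code < NUM_AUDIO_CODES:
        raise ValueError(f"token ID is not an audio token: {token_id}")
    return code
```

Because the mapping is a pure offset, decoding generated audio tokens back to codec codes is just a subtraction, and the range checks catch any non-audio tokens (e.g. structure tokens) that slip into the generated sequence.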