Model Description
A lightweight, permissively licensed Transformer-based Grapheme-to-Phoneme (G2P) model for Hungarian text-to-speech (TTS) phonemization.
It converts Hungarian text into IPA phoneme sequences and can serve as a drop-in replacement for eSpeak-NG.
Key Features
- ⚡ Fast and lightweight — Small Transformer model (~2MB checkpoint)
- 🧠 End-to-end text → phoneme prediction using CTC loss
- 📱 Fully offline — Runs on mobile and embedded devices
- 🔄 Drop-in replacement for eSpeak-NG in Piper-style TTS pipelines
- ⚖️ MIT licensed — Safe for closed-source and commercial apps (no GPL dependencies)
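Because the model is trained with CTC loss, its raw output is one phoneme ID per frame, and these must be collapsed into a phoneme string. The following is a minimal sketch of greedy CTC decoding, not the model's actual inference code. The blank index (`0`), the toy vocabulary, and the function name `ctc_greedy_decode` are assumptions made for illustration.

```python
from itertools import groupby

BLANK_ID = 0  # assumption: index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_ids, id_to_phoneme, blank_id=BLANK_ID):
    """Collapse consecutive repeated IDs, then drop blank symbols."""
    collapsed = [key for key, _ in groupby(frame_ids)]
    return "".join(id_to_phoneme[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary; the real model has 105 phoneme symbols.
toy_vocab = {1: "k", 2: "ɛ", 3: "z"}

# Per-frame argmax IDs as they might come out of the network.
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3]
print(ctc_greedy_decode(frames, toy_vocab))  # kɛz
```

A blank between two identical IDs keeps both symbols, which is how CTC represents a doubled phoneme.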
Model Architecture
| Parameter | Value |
|---|---|
| Architecture | Transformer (2 layers, 4 attention heads) |
| Hidden Size | 128 |
| FFN Hidden Size | 640 |
| Dropout | 0.1 |
| Max Position Embeddings | 320 |
| Vocabulary Size | 38 graphemes, 105 phonemes |
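As a rough sanity check, the table above can be turned into a weight-count estimate. The sketch below assumes an encoder-only Transformer with learned position embeddings and a linear output layer over the phoneme vocabulary, and it ignores biases and layer-norm parameters. None of these details are confirmed by this card, and `G2PConfig` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class G2PConfig:
    """Hypothetical container for the values in the architecture table."""
    num_layers: int = 2
    num_heads: int = 4
    hidden_size: int = 128
    ffn_hidden_size: int = 640
    dropout: float = 0.1
    max_position_embeddings: int = 320
    grapheme_vocab_size: int = 38
    phoneme_vocab_size: int = 105


def estimate_weight_count(cfg):
    """Count weight-matrix entries only (no biases, no layer norms)."""
    d, f = cfg.hidden_size, cfg.ffn_hidden_size
    attention = 4 * d * d      # query, key, value and output projections
    feed_forward = 2 * d * f   # up- and down-projection
    embeddings = (cfg.grapheme_vocab_size + cfg.max_position_embeddings) * d
    output_head = d * cfg.phoneme_vocab_size
    return cfg.num_layers * (attention + feed_forward) + embeddings + output_head


cfg = G2PConfig()
weights = estimate_weight_count(cfg)
print(weights)            # 518016
print(weights * 4 / 1e6)  # size in MB if stored as 32-bit floats
```

Under these assumptions the estimate is about 0.52 million weights, or roughly 2.07 MB at 4 bytes per weight, which is consistent with the ~2MB checkpoint size quoted in the feature list.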
Training Data
- Source: Hungarian text phonemized with eSpeak-NG (hu_HU voice)
- Training samples: 450,000 sentences
- Validation samples: 25,000 sentences
- Test samples: 25,000 sentences
- Max sequence length: 200 characters
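The sample counts above correspond to a 90/5/5 split of 500,000 sentences. A minimal sketch of how such a dataset could be filtered and split is shown below. It assumes the 200-character limit applies to the input text and that the split is a seeded random shuffle. The card does not describe the actual procedure, and `filter_and_split` is a hypothetical helper.

```python
import random

MAX_CHARS = 200  # assumption: the limit applies to the input text


def filter_and_split(pairs, val_frac=0.05, test_frac=0.05, seed=0):
    """Drop over-long sentences, shuffle, and split into train/val/test.

    `pairs` is a list of (text, ipa) tuples, for example text paired
    with phonemizer output.
    """
    kept = [(text, ipa) for text, ipa in pairs if len(text) <= MAX_CHARS]
    random.Random(seed).shuffle(kept)
    n_val = round(len(kept) * val_frac)
    n_test = round(len(kept) * test_frac)
    val = kept[:n_val]
    test = kept[n_val:n_val + n_test]
    train = kept[n_val + n_test:]
    return train, val, test


# Toy data: 300 dummy sentences of length 1..300, so 200 pass the filter.
pairs = [("a" * n, "x") for n in range(1, 301)]
train, val, test = filter_and_split(pairs)
print(len(train), len(val), len(test))  # 180 10 10
```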
Performance
- Accuracy: ~98.83% match with eSpeak-NG IPA output
- Epochs trained: 25 of 45
- Validation loss: 0.134
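The card does not state whether the accuracy figure is measured per sentence or per phoneme symbol. The sketch below shows both ways of measuring agreement with a reference phonemizer: exact sentence match, and symbol-level accuracy based on Levenshtein distance. The function names are hypothetical, and this is not necessarily the metric used to produce the number above.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def agreement(predictions, references):
    """Return sentence-level and symbol-level agreement scores."""
    exact = sum(p == r for p, r in zip(predictions, references))
    errors = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    ref_symbols = sum(len(r) for r in references)
    return {
        "exact_match": exact / len(references),
        "symbol_accuracy": 1 - errors / ref_symbols,
    }


# Toy example: the second prediction differs in one symbol.
print(agreement(["vˈɛti", "kˈɛz"], ["vˈɛti", "kˈɛs"]))
```

Sentence-level exact match is much stricter than symbol-level accuracy, so the two numbers can differ widely on the same test set.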
Input/Output Format
- Input: Hungarian text (e.g., "A kezében levő lándzsát a töröknek a szügyébe veti.")
- Output: IPA phonemes (e.g., "ˌɑ kˈɛzeːbɛn lˈɛvøː lˈaːndʒaːt ˌɑ tˈørøknɛk ˌɑ sˈyɟeːbɛ vˈɛti")