---
license: gemma
license_name: license
license_link: LICENSE
base_model:
- ModelSpace/GemmaX2-28-2B-v0.1
pipeline_tag: translation
library_name: transformers
tags:
- text-generation
language:
- de
- en
- fr
- es
datasets:
- iisys-hof/olaph-data
---

## Model Summary
OLaPh is a large language model for phonemization, finetuned from GemmaX2-28-2B-v0.1.
Its tokenizer was extended with 1,024 phoneme tokens, derived from a BPE tokenizer trained on phoneme sequences generated by the [OLaPh framework](https://github.com/iisys-hof/olaph).
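The extension step itself is not shipped with this repository; the following is a minimal sketch of how such a tokenizer extension could look, assuming a hypothetical `phoneme_sequences.txt` file with one phonemized sentence per line (the file name, pre-tokenization, and special tokens are illustrative and may differ from the actual procedure).

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import AutoModelForCausalLM, AutoTokenizer

# Train a 1,024-token BPE tokenizer on phoneme sequences (hypothetical input file).
phoneme_bpe = Tokenizer(models.BPE(unk_token="[UNK]"))
phoneme_bpe.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=1024, special_tokens=["[UNK]"])
phoneme_bpe.train(["phoneme_sequences.txt"], trainer)

# Add the learned phoneme tokens to the base tokenizer and resize the embedding matrix.
base_id = "ModelSpace/GemmaX2-28-2B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

new_tokens = [tok for tok in phoneme_bpe.get_vocab() if tok not in tokenizer.get_vocab()]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
```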
The model was then finetuned for grapheme-to-phoneme conversion on a multilingual dataset (English, German, French, Spanish), created by phonemizing text from HuggingFaceFW/fineweb and HuggingFaceFW/fineweb-2 using the OLaPh framework.
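If you want to inspect the training data, the dataset referenced above can be loaded directly from the Hub. The snippet below is a minimal sketch; the `train` split name is an assumption, and the record layout should be checked against the dataset card rather than taken from here.

```python
from datasets import load_dataset

# Load the OLaPh phonemization dataset (assuming a "train" split exists).
ds = load_dataset("iisys-hof/olaph-data", split="train")

# Print the first record to inspect the actual fields.
print(ds[0])
```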
- **Finetuned By**: Institute for Information Systems at Hof University
- **Model type**: Text-To-Text
- **Dataset**: [OLaPh Phonemization Dataset](https://huggingface.co/datasets/iisys-hof/olaph-data)
- **Language(s)**: English, French, German, Spanish
- **License**: Gemma (Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms)
- **Release Date**: September 25, 2025

## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

lang = "English"  # or "German", "French", "Spanish"
sentence = "But we are not sorry, for the rain is delightful."

model_id = "iisys-hof/olaph"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda")

# Stop generation at either the EOS token or the first full stop.
stop_tokens = [tokenizer.eos_token_id, tokenizer.encode(".", add_special_tokens=False)[0]]

# Prompt format expected by the model.
prompt = f"Translate this from {lang} to Phones:\n{lang}: "

inputs = tokenizer(f"{prompt}{sentence}\nPhones:", return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=stop_tokens)
phonemized = tokenizer.decode(outputs[0], skip_special_tokens=False)

# Keep only the generated phoneme sequence after the "Phones:" marker.
phonemized = phonemized.split("\n")[-1].replace("Phones:", "")

print(phonemized)
```
## Caveats
The model may keep generating full stops until max_new_tokens is reached instead of emitting an EOS token; this behaviour is currently being examined.
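Note that the Usage snippet above already includes the full-stop token in `eos_token_id`, so generation for single-sentence inputs stops at the first `.`. If you generate without that stop token, a small post-processing helper like the following sketch (purely illustrative, not part of the model) can trim the repeated full stops:

```python
import re

def trim_trailing_full_stops(text: str) -> str:
    # Collapse a run of trailing full stops into a single one.
    return re.sub(r"\.{2,}$", ".", text.strip())

# e.g. applied to the decoded output from the Usage snippet:
# phonemized = trim_trailing_full_stops(phonemized)
```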
### Citation
```bibtex
@misc{wirth2025olaphoptimallanguagephonemizer,
      title={OLaPh: Optimal Language Phonemizer},
      author={Johannes Wirth},
      year={2025},
      eprint={2509.20086},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.20086},
}
```