---
license: gemma
license_name: license
license_link: LICENSE
base_model:
- ModelSpace/GemmaX2-28-2B-v0.1
pipeline_tag: translation
library_name: transformers
tags:
- text-generation
language:
- de
- en
- fr
- es
datasets:
- iisys-hof/olaph-data
---

## Model Summary

OLaPh is a large language model for phonemization, finetuned from GemmaX2-28-2B-v0.1. Its tokenizer was extended with 1,024 phoneme tokens, derived from a BPE tokenizer trained on phoneme sequences generated by the [OLaPh framework](https://github.com/iisys-hof/olaph). The model was then finetuned for grapheme-to-phoneme conversion on a multilingual dataset (English, German, French, Spanish), created by phonemizing text from HuggingFaceFW/fineweb and HuggingFaceFW/fineweb-2 using the OLaPh framework.

- **Finetuned by**: Institute for Information Systems at Hof University
- **Model type**: Text-to-Text
- **Dataset**: [OLaPh Phonemization Dataset](https://huggingface.co/datasets/iisys-hof/olaph-data)
- **Language(s)**: English, French, German, Spanish
- **License**: Gemma (Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms)
- **Release Date**: September 25, 2025

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

lang = "English"  # or "German", "French", "Spanish"
sentence = "But we are not sorry, for the rain is delightful."
model_id = "iisys-hof/olaph"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda")

# Stop generation at either the EOS token or a full stop.
stop_tokens = [tokenizer.eos_token_id, tokenizer.encode(".", add_special_tokens=False)[0]]

prompt = f"Translate this from {lang} to Phones:\n{lang}: "
inputs = tokenizer(f"{prompt}{sentence}\nPhones:", return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=stop_tokens)

# Keep only the generated phoneme line.
phonemized = tokenizer.decode(outputs[0], skip_special_tokens=False)
phonemized = phonemized.split("\n")[-1].replace("Phones:", "")
print(phonemized)
```

## Caveats

Instead of emitting an EOS token, the model may keep generating full stops until `max_new_tokens` is reached. This behaviour is currently being examined.

### Citation

```bibtex
@misc{wirth2025olaphoptimallanguagephonemizer,
  title={OLaPh: Optimal Language Phonemizer},
  author={Johannes Wirth},
  year={2025},
  eprint={2509.20086},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.20086},
}
```
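Until the repeated-full-stop caveat is resolved, one possible workaround is to post-process the decoded output. The helper below is a minimal sketch (not part of the model's official tooling) that collapses a trailing run of full stops into a single one:

```python
import re

def clean_phonemes(text: str) -> str:
    """Collapse a trailing run of full stops into a single full stop
    and strip surrounding whitespace."""
    return re.sub(r"\.{2,}\s*$", ".", text).strip()

# Example: a phoneme string padded with repeated full stops is trimmed,
# while a string ending in a single full stop is left unchanged.
print(clean_phonemes("bʌt wiː ɑː nɒt sɒɹi...."))
print(clean_phonemes("bʌt wiː ɑː nɒt sɒɹi."))
```

Applied to the `phonemized` variable from the usage example above, this keeps the transcription intact while removing the spurious padding.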