---
license: gemma
license_name: license
license_link: LICENSE
base_model:
- ModelSpace/GemmaX2-28-2B-v0.1
pipeline_tag: translation
library_name: transformers
tags:
- text-generation
language:
- de
- en
- fr
- es
datasets:
- iisys-hof/olaph-data
---

## Model Summary
OLaPh is a large language model for phonemization, finetuned from GemmaX2-28-2B-v0.1.
Its tokenizer was extended with 1,024 phoneme tokens, derived from a BPE tokenizer trained on phoneme sequences generated by the [OLaPh framework](https://github.com/iisys-hof/olaph).
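The extension step itself is not shipped with this repository; the following is a minimal sketch of how such a tokenizer extension could look, assuming a hypothetical `phoneme_sequences.txt` file with one phonemized sentence per line (the file name, pre-tokenization, and special tokens are illustrative and may differ from the actual procedure).

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import AutoModelForCausalLM, AutoTokenizer

# Train a 1,024-token BPE tokenizer on phoneme sequences (hypothetical input file).
phoneme_bpe = Tokenizer(models.BPE(unk_token="[UNK]"))
phoneme_bpe.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=1024, special_tokens=["[UNK]"])
phoneme_bpe.train(["phoneme_sequences.txt"], trainer)

# Add the learned phoneme tokens to the base tokenizer and resize the embedding matrix.
base_id = "ModelSpace/GemmaX2-28-2B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

new_tokens = [tok for tok in phoneme_bpe.get_vocab() if tok not in tokenizer.get_vocab()]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
```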
The model was then finetuned for grapheme-to-phoneme conversion on a multilingual dataset (English, German, French, Spanish), created by phonemizing text from HuggingFaceFW/fineweb and HuggingFaceFW/fineweb-2 using the OLaPh framework.
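If you want to inspect the training data, the dataset referenced above can be loaded directly from the Hub. The snippet below is a minimal sketch; the `train` split name is an assumption, and the record layout should be checked against the dataset card rather than taken from here.

```python
from datasets import load_dataset

# Load the OLaPh phonemization dataset (assuming a "train" split exists).
ds = load_dataset("iisys-hof/olaph-data", split="train")

# Print the first record to inspect the actual fields.
print(ds[0])
```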
- **Finetuned By**: Institute for Information Systems at Hof University
- **Model type**: Text-To-Text
- **Dataset**: [OLaPh Phonemization Dataset](https://huggingface.co/datasets/iisys-hof/olaph-data)
- **Language(s)**: English, French, German, Spanish
- **License**: Gemma (Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms)
- **Release Date**: September 25, 2025

## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

lang = "English"  # or "German", "French", "Spanish"
sentence = "But we are not sorry, for the rain is delightful."

model_id = "iisys-hof/olaph"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda")

# Stop generation at either the EOS token or the first full stop.
stop_tokens = [tokenizer.eos_token_id, tokenizer.encode(".", add_special_tokens=False)[0]]

# Prompt format expected by the model.
prompt = f"Translate this from {lang} to Phones:\n{lang}: "

inputs = tokenizer(f"{prompt}{sentence}\nPhones:", return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=stop_tokens)
phonemized = tokenizer.decode(outputs[0], skip_special_tokens=False)

# Keep only the generated phoneme sequence after the "Phones:" marker.
phonemized = phonemized.split("\n")[-1].replace("Phones:", "")

print(phonemized)
```
## Caveats
The model may keep generating full stops until max_new_tokens is reached instead of emitting an EOS token; this behaviour is currently being examined.
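Note that the Usage snippet above already includes the full-stop token in `eos_token_id`, so generation for single-sentence inputs stops at the first `.`. If you generate without that stop token, a small post-processing helper like the following sketch (purely illustrative, not part of the model) can trim the repeated full stops:

```python
import re

def trim_trailing_full_stops(text: str) -> str:
    # Collapse a run of trailing full stops into a single one.
    return re.sub(r"\.{2,}$", ".", text.strip())

# e.g. applied to the decoded output from the Usage snippet:
# phonemized = trim_trailing_full_stops(phonemized)
```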
### Citation
```bibtex
@misc{wirth2025olaphoptimallanguagephonemizer,
      title={OLaPh: Optimal Language Phonemizer},
      author={Johannes Wirth},
      year={2025},
      eprint={2509.20086},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.20086},
}
```