	---
	language: la
	library_name: transformers
	license: cc-by-sa-4.0
	base_model: google/byt5-large
	pipeline_tag: text2text-generation
	tags:
	- latin
	- medieval-latin
	- normalization
	- legal-history
	- digital-humanities
	- ocr-postprocessing
	widget:
	- text: "viiii vt in sabbato sancto ieiunium ante noctis initium non soluatur"
	  example_title: "Medieval Legal Latin"
	---

	# Medieval Latin Normalizer (ByT5-Large)

	This model is a ByT5-Large transformer fine-tuned to normalize medieval Latin text. It transforms diplomatic transcriptions or noisy HTR/OCR output into a standardized orthography, which makes downstream processing such as POS tagging, lemmatization, and linguistic analysis more reliable. The model was developed as part of the research projects "Embedding the Past" (LOEWE-Exploration, TU Darmstadt) and "Burchards Dekret Digital" (Academy of Sciences and Literature \| Mainz).

	## Model Logic
	Medieval Latin normalization involves handling inconsistent orthography (e.g., `u/v`, `i/j`, or `ae/e` variations) and resolving phonetic spellings common in legal and ecclesiastical manuscripts.

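	Because it is built on ByT5-Large, the model operates directly on UTF-8 bytes rather than on a learned vocabulary. This is a significant advantage for medieval Latin: non-standard characters are processed as ordinary byte sequences, without the information loss that can occur when subword tokenizers (such as those used by BERT or standard T5) meet characters or spellings outside their vocabulary.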

	- Input: Raw/Diplomatic medieval Latin text.
	- Output: Standardized/Normalized Latin text.
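
To make the byte-level view concrete, the following sketch shows how a short string with characters outside basic ASCII maps to UTF-8 bytes. The offset of 3 mirrors the ByT5 tokenizer convention of reserving the first IDs for padding, end-of-sequence, and unknown tokens; the helper name `to_byt5_like_ids` and the sample string are illustrative assumptions, not part of this model's code.

```python
# Illustrative sketch: how byte-level input looks for a ByT5-style model.
# ASSUMPTION: IDs are "byte value + 3" (0=pad, 1=eos, 2=unk), following the
# ByT5 tokenizer convention; the helper name is hypothetical.
SPECIAL_TOKEN_OFFSET = 3
EOS_ID = 1


def to_byt5_like_ids(text: str) -> list[int]:
    """Map text to byte-based IDs and append an end-of-sequence ID."""
    return [b + SPECIAL_TOKEN_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


# Long s and e caudata (encoded here as e with ogonek), both outside ASCII.
sample = "ſanctę"
raw_bytes = sample.encode("utf-8")
ids = to_byt5_like_ids(sample)

print(len(sample), "characters ->", len(raw_bytes), "bytes")
print(ids)
```

Each non-ASCII character simply becomes two bytes, so no character falls outside the model's input space.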

	## Technical Specifications
	- Architecture: [ByT5-Large](https://huggingface.co/google/byt5-large) (~1.2B parameters).
	- Hardware: Trained on NVIDIA Blackwell GPUs using `bf16` precision and `adamw_torch_fused` optimization.
	- Training Parameters:
	  - Learning Rate: 2e-4
	  - Epochs: 20
	  - Label Smoothing: 0.1 (to improve robustness against transcription noise)
	  - Batch Size: 48
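
The listed hyperparameters could be collected as keyword arguments for the `transformers` class `Seq2SeqTrainingArguments`, roughly as sketched below. The original training script is not shown here, so this is only an assumed reconstruction: whether the batch size of 48 is per device or a global total is not stated, and the dictionary name `training_kwargs` is hypothetical.

```python
# Sketch only: the hyperparameters from the list above as keyword arguments.
# ASSUMPTION: these map onto Seq2SeqTrainingArguments fields as shown; the
# actual training script may differ.
training_kwargs = {
    "learning_rate": 2e-4,
    "num_train_epochs": 20,
    "label_smoothing_factor": 0.1,
    # ASSUMPTION: 48 is treated as the per-device batch size here.
    "per_device_train_batch_size": 48,
    "bf16": True,
    "optim": "adamw_torch_fused",
}

# With transformers installed, these could be passed on like this
# (the output directory is a placeholder):
# from transformers import Seq2SeqTrainingArguments
# args = Seq2SeqTrainingArguments(output_dir="out", **training_kwargs)

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```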

	## Performance (Test Set)
	The model was evaluated on a held-out test set (85 samples) from medieval legal corpora:

	| Metric | Value |
	| :--- | :--- |
	| Character Error Rate (CER) | 1.62% |
	| Word-Level F1-Score | 94.12% |
	| Evaluation Loss | 0.143 |
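
For readers unfamiliar with the metric, character error rate is commonly computed as the Levenshtein edit distance between prediction and reference, divided by the reference length. The sketch below follows that common definition. The evaluation code behind the reported figures is not shown, so the function names and example strings are assumptions for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong character in an 8-character reference.
print(character_error_rate("solvatur", "soluatur"))  # 0.125
```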

	## Usage
	Use the model through the Hugging Face `pipeline` API:

	```python
	from transformers import pipeline

	# Initialize the normalizer
	normalizer = pipeline("text2text-generation", model="mschonhardt/latin-normalizer")

	# Example input
	raw_text = "viiii vt in sabbato sancto ieiunium ante noctis initium non soluatur"
	result = normalizer(raw_text, max_length=128)

	print(f"Normalized: {result[0]['generated_text']}")

	```
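
Because ByT5 counts sequence length in bytes rather than words, `max_length=128` in the call above limits the output to roughly 128 bytes, so longer passages may need to be split before normalization. The following helper is a hypothetical sketch of whitespace-based chunking under a byte budget; the name `chunk_by_bytes` and the byte budgets used are assumptions, not settings documented for this model.

```python
def chunk_by_bytes(text: str, max_bytes: int = 100) -> list[str]:
    """Split text at whitespace into chunks of at most max_bytes UTF-8 bytes.

    A single word longer than max_bytes is kept whole in its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if current and len(candidate.encode("utf-8")) > max_bytes:
            # Adding this word would exceed the budget: close the chunk.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


raw_text = "viiii vt in sabbato sancto ieiunium ante noctis initium non soluatur"
chunks = chunk_by_bytes(raw_text, max_bytes=30)
for chunk in chunks:
    # Each chunk could then be normalized separately,
    # e.g. normalizer(chunk, max_length=128).
    print(chunk)
```

Splitting at whitespace keeps words intact, though it ignores sentence boundaries, which may matter for context-dependent normalizations.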

	## Citation

	If you use this model in your research, please cite:

	```bibtex
	@software{schonhardt_michael_2026_normalization,
	  author = "Schonhardt, Michael",
	  title = "Medieval Latin Normalizer",
	  year = "2026",
	  publisher = "Zenodo",
	  doi = "10.5281/zenodo.18416639",
	  url = "https://doi.org/10.5281/zenodo.18416639"
	}

	@article{xue-etal-2022-byt5,
	  title = "{B}y{T}5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models",
	  author = "Xue, Linting and
	    Barua, Aditya and
	    Constant, Noah and
	    Al-Rfou, Rami and
	    Narang, Sharan and
	    Kale, Mihir and
	    Roberts, Adam and
	    Raffel, Colin",
	  editor = "Roark, Brian and
	    Nenkova, Ani",
	  journal = "Transactions of the Association for Computational Linguistics",
	  volume = "10",
	  year = "2022",
	  address = "Cambridge, MA",
	  publisher = "MIT Press",
	  url = "https://aclanthology.org/2022.tacl-1.17/",
	  doi = "10.1162/tacl_a_00461",
	  pages = "291--306"
	}

	```