AYVU-Talian: Character-Level Language Modeling for Linguistic Preservation
AYVU-Talian is a lightweight, high-precision character-level transformer trained to generate text in Talian (Brazilian Venetian), a minority language spoken in southern Brazil. By modeling text at the character level, AYVU captures the unique phonological and orthographic nuances of this low-resource dialect, serving as a foundational tool for linguistic preservation and research.
This project demonstrates a custom Transformer architecture built for low-resource NLP challenges.
Key Features
- Character-Level Architecture: Specifically designed to handle the morphosyntactic variations of Talian without the constraints of subword tokenization.
- Modern Transformer Design: Leveraging a decoder-only architecture with learned positional encodings and causal masking.
- Robust Training Pipeline: Includes custom data cleaning, validation-tracked checkpointing, and dynamic learning rate scheduling.
- Interactive Evaluation: Built-in Jupyter integration for real-time text generation with sampling controls (Temperature, Top-K, Nucleus Sampling).
- Linguistic Insight: Includes PCA-based embedding visualizations to analyze how the model "understands" the language's character relationships.
Technical Architecture
AYVU-Talian is built on a 5.39M parameter Decoder-Only Transformer:
| Hyperparameter | Value |
|---|---|
| Layers (`n_layer`) | 4 |
| Attention Heads (`n_head`) | 4 |
| Embedding Dimension (`n_embd`) | 256 |
| Context Window (`block_size`) | 256 |
| Dropout | 0.1 |
| Vocabulary Size | 120 chars |
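As a rough sanity check, these values can be collected into a config object and used to estimate the model size. This is an illustrative sketch, not the project's actual code: the class name `AyvuConfig`, the helper `approx_param_count`, and the 4x MLP expansion factor are assumptions (the real configuration lives in `ayvu-talian-base_config.yaml`):

```python
from dataclasses import dataclass

@dataclass
class AyvuConfig:
    """Hyperparameters from the table above (field names mirror the table)."""
    n_layer: int = 4
    n_head: int = 4
    n_embd: int = 256
    block_size: int = 256
    dropout: float = 0.1
    vocab_size: int = 120

def approx_param_count(cfg: AyvuConfig, mlp_mult: int = 4) -> int:
    """Rough decoder-only parameter estimate (biases and LayerNorms ignored).

    The exact total depends on details such as the MLP expansion factor
    and whether the output head shares weights with the embedding table.
    """
    embed = cfg.vocab_size * cfg.n_embd + cfg.block_size * cfg.n_embd
    attn = 4 * cfg.n_embd * cfg.n_embd              # Q, K, V, output projection
    mlp = 2 * cfg.n_embd * (mlp_mult * cfg.n_embd)  # up- and down-projection
    return embed + cfg.n_layer * (attn + mlp)
```

The estimate is in the low millions; the gap to the reported 5.39M total comes from the unmodeled details noted in the docstring.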
Data Pipeline
The model is trained on a cleaned subset of the Talian corpus (Garcia & Guzzo, 2021). The preprocessing logic (notebooks/ayvu-talian-preprocess.ipynb) performs:
- Deduplication: Ensuring unique sentence representation.
- Standardization: Normalizing punctuation (quotes, apostrophes) and removing OCR-related metadata.
- Splitting: A strict 85/15 train/test split for reliable evaluation of generalization.
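The three steps above can be sketched as follows. This is a minimal illustration assuming one sentence per line; the function name and the exact normalization table are assumptions, not the notebook's actual logic:

```python
import random

def preprocess(lines, seed=42, train_frac=0.85):
    """Deduplicate, normalize punctuation, and split into train/test."""
    # Standardization: fold curly quotes/apostrophes to ASCII equivalents.
    table = str.maketrans({"\u201c": '"', "\u201d": '"',
                           "\u2018": "'", "\u2019": "'"})
    cleaned = [ln.strip().translate(table) for ln in lines if ln.strip()]
    # Deduplication: keep the first occurrence, preserving order.
    seen, unique = set(), []
    for ln in cleaned:
        if ln not in seen:
            seen.add(ln)
            unique.append(ln)
    # Splitting: shuffled 85/15 train/test split.
    rng = random.Random(seed)
    rng.shuffle(unique)
    cut = int(len(unique) * train_frac)
    return unique[:cut], unique[cut:]
```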
Repository Structure
- `notebooks/`:
  - `ayvu-talian-preprocess.ipynb`: Data cleaning and file preparation.
  - `ayvu-talian-base.ipynb`: Core model definition, training loop, and interactive UI.
- `checkpoints/`: Contains `ayvu_talian_best.pt`, the best-performing weights for this version.
- `data/`: Processed training and test text files.
- `ayvu-talian-base_config.yaml`: Centralized configuration for reproducible training runs.
Quick Start
1. Requirements
Install dependencies using pip:

```bash
pip install -r requirements.txt
```
2. Training
Training can be adjusted via `ayvu-talian-base_config.yaml`. To run from the notebook:
- Ensure `device: "cuda"` is set for GPU usage.
- Run the "Training Loop" cells in `ayvu-talian-base.ipynb`.
3. Inference & UI
Open `ayvu-talian-base.ipynb` and navigate to the "AYVU-Talian: Console de Geração Avançado" (Advanced Generation Console) section. You can adjust:
- Temperature: Controls randomness (default: 0.9).
- Top-K: Limits sampling to the K most likely characters (default: 40).
- Top-P: Filters candidates by cumulative probability (nucleus sampling; default: 0.9).
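To illustrate how the three controls interact, here is a minimal NumPy sketch of one sampling step (the function name and the order of filters are assumptions for illustration, not the notebook's exact code):

```python
import numpy as np

def sample_char(logits, temperature=0.9, top_k=40, top_p=0.9, rng=None):
    """Sample one character id: temperature, then top-k, then nucleus filter."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Top-K: keep only the K highest-scoring characters.
    if top_k and top_k < len(logits):
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-P (nucleus): keep the smallest set whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    mask /= mask.sum()
    return int(rng.choice(len(probs), p=mask))
```

Lower temperatures and smaller K/P make output more conservative; the defaults above mirror the UI defaults listed in this section.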
Visualizations
The model includes deep-dive analysis tools:
- Embedding PCA: A 2D mapping of character relationships.
- Positional Similarity Matrix: A heatmap visualization of the learned positional encodings, showing how the model perceives sequence structure.
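The embedding PCA can be reproduced with a few lines of NumPy. This sketch assumes a `(vocab_size, n_embd)` embedding matrix, i.e. `(120, 256)` per the hyperparameter table; the function name is illustrative:

```python
import numpy as np

def embedding_pca_2d(emb):
    """Project a (vocab_size, n_embd) embedding matrix onto its top
    two principal components for 2D plotting."""
    centered = emb - emb.mean(axis=0)
    # SVD of the centered matrix: right singular vectors are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # shape: (vocab_size, 2)
```

Scatter-plotting the result with each point labeled by its character typically clusters vowels, consonants, and punctuation separately.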
Roadmap
- Instruction Tuning: Adapting the base model for specific Q&A tasks.
- RLHF: Integrating feedback from native speakers to reduce syntactic drift.
- Expanded Context: Increasing the `block_size` to capture longer-range dependencies.
Citation
Corpus
```bibtex
@misc{Garcia_Guzzo_2021,
  title={Talian corpus: a written corpus of Brazilian Veneto},
  url={osf.io/63nrx},
  doi={10.17605/OSF.IO/63NRX},
  author={Garcia, Guilherme D and Guzzo, Natália B},
  year={2021}
}
```