AYVU-Talian: Character-Level Language Modeling for Linguistic Preservation
AYVU-Talian is a lightweight, high-precision character-level transformer trained to generate text in Talian (Brazilian Venetian), a minority language spoken in southern Brazil. By modeling text at the character level, AYVU captures the unique phonological and orthographic nuances of this low-resource dialect, serving as a foundational tool for linguistic preservation and research.
This project demonstrates a custom Transformer architecture built for low-resource NLP challenges.
Key Features
- Character-Level Architecture: Specifically designed to handle the morphosyntactic variations of Talian without the constraints of subword tokenization.
- Modern Transformer Design: Leveraging a decoder-only architecture with learned positional encodings and causal masking.
- Robust Training Pipeline: Includes custom data cleaning, validation-tracked checkpointing, and dynamic learning rate scheduling.
- Interactive Evaluation: Built-in Jupyter integration for real-time text generation with sampling controls (Temperature, Top-K, Nucleus Sampling).
- Linguistic Insight: Includes PCA-based embedding visualizations to analyze how the model "understands" the language's character relationships.
Technical Architecture
AYVU-Talian is built on a 5.39M parameter Decoder-Only Transformer:
| Hyperparameter | Value |
|---|---|
| Layers (`n_layer`) | 4 |
| Attention Heads (`n_head`) | 4 |
| Embedding Dimension (`n_embd`) | 256 |
| Context Window (`block_size`) | 256 |
| Dropout | 0.1 |
| Vocabulary Size | 120 chars |
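As a rough sanity check, these values can be collected into a config object and used to estimate the model size. This is an illustrative sketch, not the project's actual code: the class name `AyvuConfig`, the helper `approx_param_count`, and the 4x MLP expansion factor are assumptions (the real configuration lives in `ayvu-talian-base_config.yaml`):

```python
from dataclasses import dataclass

@dataclass
class AyvuConfig:
    """Hyperparameters from the table above (field names mirror the table)."""
    n_layer: int = 4
    n_head: int = 4
    n_embd: int = 256
    block_size: int = 256
    dropout: float = 0.1
    vocab_size: int = 120

def approx_param_count(cfg: AyvuConfig, mlp_mult: int = 4) -> int:
    """Rough decoder-only parameter estimate (biases and LayerNorms ignored).

    The exact total depends on details such as the MLP expansion factor
    and whether the output head shares weights with the embedding table.
    """
    embed = cfg.vocab_size * cfg.n_embd + cfg.block_size * cfg.n_embd
    attn = 4 * cfg.n_embd * cfg.n_embd              # Q, K, V, output projection
    mlp = 2 * cfg.n_embd * (mlp_mult * cfg.n_embd)  # up- and down-projection
    return embed + cfg.n_layer * (attn + mlp)
```

The estimate is in the low millions; the gap to the reported 5.39M total comes from the unmodeled details noted in the docstring.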
Data Pipeline
The model is trained on a cleaned subset of the Talian corpus (Garcia & Guzzo, 2021). The preprocessing logic (notebooks/ayvu-talian-preprocess.ipynb) performs:
- Deduplication: Ensuring unique sentence representation.
- Standardization: Normalizing punctuation (quotes, apostrophes) and removing OCR-related metadata.
- Splitting: A strict 85/15 train/test split for reliable evaluation of generalization.
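The three steps above can be sketched as follows. This is a minimal illustration assuming one sentence per line; the function name and the exact normalization table are assumptions, not the notebook's actual logic:

```python
import random

def preprocess(lines, seed=42, train_frac=0.85):
    """Deduplicate, normalize punctuation, and split into train/test."""
    # Standardization: fold curly quotes/apostrophes to ASCII equivalents.
    table = str.maketrans({"\u201c": '"', "\u201d": '"',
                           "\u2018": "'", "\u2019": "'"})
    cleaned = [ln.strip().translate(table) for ln in lines if ln.strip()]
    # Deduplication: keep the first occurrence, preserving order.
    seen, unique = set(), []
    for ln in cleaned:
        if ln not in seen:
            seen.add(ln)
            unique.append(ln)
    # Splitting: shuffled 85/15 train/test split.
    rng = random.Random(seed)
    rng.shuffle(unique)
    cut = int(len(unique) * train_frac)
    return unique[:cut], unique[cut:]
```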
Repository Structure
- `notebooks/`:
  - `ayvu-talian-preprocess.ipynb`: Data cleaning and file preparation.
  - `ayvu-talian-base.ipynb`: Core model definition, training loop, and interactive UI.
- `checkpoints/`: Contains `ayvu_talian_best.pt`, the best-performing weights for this version.
- `data/`: Processed training and test text files.
- `ayvu-talian-base_config.yaml`: Centralized configuration for reproducible training runs.
Quick Start
1. Requirements
Install dependencies using pip:

```bash
pip install -r requirements.txt
```
2. Training
Training can be adjusted via `ayvu-talian-base_config.yaml`. To run from the notebook:
- Ensure `device: "cuda"` is set for GPU usage.
- Run the "Training Loop" cells in `ayvu-talian-base.ipynb`.
3. Inference & UI
Open `ayvu-talian-base.ipynb` and navigate to the "AYVU-Talian: Console de Geração Avançado" (Advanced Generation Console) section. You can adjust:
- Temperature: Controls randomness (default: 0.9).
- Top-K: Limits sampling to the K most likely characters (default: 40).
- Top-P: Filters candidates by cumulative probability (nucleus sampling; default: 0.9).
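To illustrate how the three controls interact, here is a minimal NumPy sketch of one sampling step (the function name and the order of filters are assumptions for illustration, not the notebook's exact code):

```python
import numpy as np

def sample_char(logits, temperature=0.9, top_k=40, top_p=0.9, rng=None):
    """Sample one character id: temperature, then top-k, then nucleus filter."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Top-K: keep only the K highest-scoring characters.
    if top_k and top_k < len(logits):
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-P (nucleus): keep the smallest set whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    mask /= mask.sum()
    return int(rng.choice(len(probs), p=mask))
```

Lower temperatures and smaller K/P make output more conservative; the defaults above mirror the UI defaults listed in this section.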
Visualizations
The model includes deep-dive analysis tools:
- Embedding PCA: A 2D mapping of character relationships.
- Positional Similarity Matrix: A heatmap visualization of the learned positional encodings, showing how the model perceives sequence structure.
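The embedding PCA can be reproduced with a few lines of NumPy. This sketch assumes a `(vocab_size, n_embd)` embedding matrix, i.e. `(120, 256)` per the hyperparameter table; the function name is illustrative:

```python
import numpy as np

def embedding_pca_2d(emb):
    """Project a (vocab_size, n_embd) embedding matrix onto its top
    two principal components for 2D plotting."""
    centered = emb - emb.mean(axis=0)
    # SVD of the centered matrix: right singular vectors are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # shape: (vocab_size, 2)
```

Scatter-plotting the result with each point labeled by its character typically clusters vowels, consonants, and punctuation separately.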
Roadmap
- Instruction Tuning: Adapting the base model for specific Q&A tasks.
- RLHF: Integrating feedback from native speakers to reduce syntactic drift.
- Expanded Context: Increasing the `block_size` to capture longer-range dependencies.
Citation
Corpus
```bibtex
@misc{Garcia_Guzzo_2021,
  title={Talian corpus: a written corpus of Brazilian Veneto},
  url={osf.io/63nrx},
  doi={10.17605/OSF.IO/63NRX},
  author={Garcia, Guilherme D and Guzzo, Natália B},
  year={2021}
}
```