AYVU-Talian: Character-Level Language Modeling for Linguistic Preservation

AYVU-Talian is a lightweight, high-precision character-level transformer trained to generate text in Talian (Brazilian Venetian), a minority language spoken in southern Brazil. By modeling text at the character level, AYVU captures the unique phonological and orthographic nuances of this low-resource dialect, serving as a foundational tool for linguistic preservation and research.

This project demonstrates a custom Transformer architecture applied to a low-resource NLP setting.

Key Features

  • Character-Level Architecture: Specifically designed to handle the morphosyntactic variations of Talian without the constraints of subword tokenization.
  • Modern Transformer Design: Leveraging a decoder-only architecture with learned positional encodings and causal masking.
  • Robust Training Pipeline: Includes custom data cleaning, validation-tracked checkpointing, and dynamic learning rate scheduling.
  • Interactive Evaluation: Built-in Jupyter integration for real-time text generation with sampling controls (Temperature, Top-K, Nucleus Sampling).
  • Linguistic Insight: Includes PCA-based embedding visualizations to analyze how the model "understands" the language's character relationships.
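One practical benefit of character-level modeling is that the vocabulary is tiny and built directly from the corpus, so no subword tokenizer is needed. A minimal sketch of the encode/decode round trip (the sample sentence and variable names are illustrative, not taken from the project's code):

```python
# Minimal sketch of character-level tokenization.
# "text" stands in for the Talian corpus; the real vocabulary (~120 chars)
# is built the same way from the full training text.
text = "El ze rivà tardi."
chars = sorted(set(text))                       # character vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}    # char -> id
itos = {i: ch for ch, i in stoi.items()}        # id -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("rivà")
assert decode(ids) == "rivà"                    # lossless round trip
```

Because accented characters such as "à" are first-class vocabulary items, orthographic detail survives tokenization exactly.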

Technical Architecture

AYVU-Talian is built on a 5.39M parameter Decoder-Only Transformer:

Hyperparameter                Value
Layers (n_layer)              4
Attention Heads (n_head)      4
Embedding Dimension (n_embd)  256
Context Window (block_size)   256
Dropout                       0.1
Vocabulary Size               120 characters
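The hyperparameters above can be wired into a decoder-only model in a few lines. The sketch below uses stock PyTorch modules and is a hedged approximation of the architecture (the notebook's actual module layout may differ); it does reproduce the stated design choices: learned positional encodings and causal masking.

```python
import torch
import torch.nn as nn

# Hyperparameters from the table above.
n_layer, n_head, n_embd = 4, 4, 256
block_size, dropout, vocab_size = 256, 0.1, 120

class CharTransformer(nn.Module):
    """Sketch of a decoder-only char-level transformer (not the exact code)."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)   # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model=n_embd, nhead=n_head, dim_feedforward=4 * n_embd,
            dropout=dropout, batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layer)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask: position t may only attend to positions <= t.
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        x = self.blocks(x, mask=mask)
        return self.lm_head(x)                            # (B, T, vocab_size)

model = CharTransformer()
logits = model(torch.zeros(2, 16, dtype=torch.long))
assert logits.shape == (2, 16, vocab_size)
```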

Data Pipeline

The model is trained on a cleaned subset of the Talian corpus (Garcia & Guzzo, 2021). The preprocessing logic (notebooks/ayvu-talian-preprocess.ipynb) performs:

  1. Deduplication: Ensuring unique sentence representation.
  2. Standardization: Normalizing punctuation (quotes, apostrophes) and removing OCR-related metadata.
  3. Splitting: A strict 85/15 train/test split so that generalization to unseen Talian text can be measured.
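The three steps above can be sketched in a few lines of pure Python. This is a hedged approximation of the notebook's logic (the exact normalization rules and shuffling live in ayvu-talian-preprocess.ipynb); the function name and sample lines are illustrative:

```python
import random

def preprocess(lines, test_frac=0.15, seed=0):
    """Sketch of the pipeline: dedup, punctuation normalization, 85/15 split."""
    # 1. Deduplication, preserving first-seen order.
    seen, unique = set(), []
    for line in lines:
        s = line.strip()
        if s and s not in seen:
            seen.add(s)
            unique.append(s)
    # 2. Standardization: map curly quotes/apostrophes to ASCII equivalents.
    table = str.maketrans({"\u201c": '"', "\u201d": '"',
                           "\u2018": "'", "\u2019": "'"})
    unique = [s.translate(table) for s in unique]
    # 3. Splitting: strict 85/15 train/test.
    random.Random(seed).shuffle(unique)
    cut = int(len(unique) * (1 - test_frac))
    return unique[:cut], unique[cut:]

train, test = preprocess(["a\u201db", "a\u201db", "c\u2019d", "e f", "g h"] * 5)
assert len(train) == 3 and len(test) == 1   # 4 unique lines, split 85/15
```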

Repository Structure

  • notebooks/:
    • ayvu-talian-preprocess.ipynb: Data cleaning and file preparation.
    • ayvu-talian-base.ipynb: Core model definition, training loop, and interactive UI.
  • checkpoints/: Contains ayvu_talian_best.pt, the best validation-loss weights for this version.
  • data/: Processed training and test text files.
  • ayvu-talian-base_config.yaml: Centralized configuration for reproducible training runs.

Quick Start

1. Requirements

Install dependencies using pip:

pip install -r requirements.txt

2. Training

Training can be adjusted via ayvu-talian-base_config.yaml. To run from the notebook:

  • Ensure device: "cuda" is set for GPU usage.
  • Run the "Training Loop" cells in ayvu-talian-base.ipynb.
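A fragment like the following gives a sense of what ayvu-talian-base_config.yaml controls. The key names shown here (other than device) are illustrative assumptions, not the file's verbatim contents; consult the actual YAML for the authoritative list:

```yaml
# Illustrative sketch of ayvu-talian-base_config.yaml (key names are assumed).
device: "cuda"        # set to "cpu" when no GPU is available
n_layer: 4
n_head: 4
n_embd: 256
block_size: 256
dropout: 0.1
```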

3. Inference & UI

Open ayvu-talian-base.ipynb and navigate to the "AYVU-Talian: Console de Geração Avançado". You can adjust:

  • Temperature: Control randomness (default: 0.9).
  • Top-K: Limit sampling to top K likely characters (default: 40).
  • Top-P: Filter tokens based on cumulative probability (default: 0.9).
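The three controls compose in a standard way: temperature rescales the logits, Top-K truncates to the K most likely characters, and Top-P keeps the smallest set whose cumulative probability reaches p. A hedged pure-Python sketch (the UI's actual implementation may order or combine these differently); the function name is illustrative:

```python
import math
import random

def sample_char(logits, temperature=0.9, top_k=40, top_p=0.9, rng=random):
    """Sketch of temperature + top-k + nucleus sampling over character logits."""
    # Temperature: rescale logits, then softmax (numerically stable).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    z = sum(exps)
    probs = sorted(((i, e / z) for i, e in enumerate(exps)),
                   key=lambda p: p[1], reverse=True)
    # Top-K: keep only the K most likely characters.
    probs = probs[:top_k]
    # Top-P: keep the smallest prefix whose cumulative probability >= top_p.
    kept, cum = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize over the kept set and draw one character id.
    z = sum(p for _, p in kept)
    r = rng.random() * z
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]

assert sample_char([10.0, 0.0, 0.0], top_k=1) == 0   # K=1 is greedy
```

Lower temperature and smaller K/p make generation more conservative; the defaults above mirror the console's.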

Visualizations

The model includes deep-dive analysis tools:

  • Embedding PCA: A 2D mapping of character relationships.
  • Positional Similarity Matrix: A heatmap visualization of the learned positional encodings, showing how the model perceives sequence structure.
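The Embedding PCA view boils down to projecting the learned character-embedding matrix (vocab_size × n_embd) onto its top two principal components. A minimal NumPy sketch, using random weights as a stand-in for the trained embeddings:

```python
import numpy as np

def embedding_pca_2d(emb):
    """Project an embedding matrix onto its top-2 principal components.

    Sketch of the Embedding PCA visualization; the notebook may use
    sklearn's PCA instead, which yields the same subspace.
    """
    centered = emb - emb.mean(axis=0)                 # center each dimension
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T                        # (vocab_size, 2) coords

rng = np.random.default_rng(0)
coords = embedding_pca_2d(rng.normal(size=(120, 256)))  # toy stand-in weights
assert coords.shape == (120, 2)
```

Plotting these 2D coordinates labeled by character typically shows vowels, consonants, and punctuation clustering separately once the model has trained.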

Roadmap

  • Instruction Tuning: Adapting the base model for specific Q&A tasks.
  • RLHF: Integrating feedback from native speakers to reduce syntactic drift.
  • Expanded Context: Increasing the block_size to capture longer-range dependencies.

Citation

Corpus

@misc{Garcia_Guzzo_2021,
  title={Talian corpus: a written corpus of Brazilian Veneto},
  url={osf.io/63nrx},
  doi={10.17605/OSF.IO/63NRX},
  author={Garcia, Guilherme D. and Guzzo, Natália B.},
  year={2021}
}