jorgeguberte committed ee3733a (verified) · 1 parent: c858292

Update README.md

Files changed (1): README.md (+96 −92)
---
license: mit
pipeline_tag: text-generation
---

# AYVU-Talian: Character-Level Language Modeling for Linguistic Preservation

**AYVU-Talian** is a lightweight, high-precision character-level transformer trained to generate text in **Talian** (Brazilian Venetian), a minority language spoken in southern Brazil. By modeling text at the character level, AYVU captures the unique phonological and orthographic nuances of this low-resource dialect, serving as a foundational tool for linguistic preservation and research.

This project demonstrates a custom Transformer architecture built for low-resource NLP challenges.

## Key Features

- **Character-Level Architecture**: Specifically designed to handle the morphosyntactic variations of Talian without the constraints of subword tokenization.
- **Modern Transformer Design**: A decoder-only architecture with learned positional encodings and causal masking.
- **Robust Training Pipeline**: Custom data cleaning, validation-tracked checkpointing, and dynamic learning rate scheduling.
- **Interactive Evaluation**: Built-in Jupyter integration for real-time text generation with sampling controls (temperature, top-k, nucleus sampling).
- **Linguistic Insight**: PCA-based embedding visualizations to analyze how the model "understands" the language's character relationships.

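In a character-level setup like this, tokenization reduces to a bijection between characters and integer ids. A minimal sketch, assuming the vocabulary is built directly from the training text (the names `build_vocab`, `encode`, and `decode` are illustrative, not the repo's actual API):

```python
# Minimal sketch of character-level tokenization; names are illustrative.
def build_vocab(text):
    chars = sorted(set(text))                      # unique characters, stable order
    stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
    itos = {i: ch for ch, i in stoi.items()}       # integer id -> character
    return stoi, itos

def encode(text, stoi):
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    return "".join(itos[i] for i in ids)

stoi, itos = build_vocab("ciao mondo")
ids = encode("ciao", stoi)
assert decode(ids, itos) == "ciao"  # round-trip is lossless
```

Because every character is its own token, accented Talian graphemes are modeled directly rather than being split by a subword tokenizer.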

---

## Technical Architecture

AYVU-Talian is built on a 5.39M-parameter **decoder-only Transformer**:

| Hyperparameter | Value |
| :--- | :--- |
| Layers (`n_layer`) | 4 |
| Attention heads (`n_head`) | 4 |
| Embedding dimension (`n_embd`) | 256 |
| Context window (`block_size`) | 256 |
| Dropout | 0.1 |
| Vocabulary size | 120 characters |

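As a sketch, the table above can be grouped into a single config object; field names mirror the table's identifiers, though the actual notebook may organize them differently:

```python
from dataclasses import dataclass

# Hedged sketch of the hyperparameter table as one config object.
@dataclass
class AyvuConfig:
    n_layer: int = 4        # transformer blocks
    n_head: int = 4         # attention heads per block
    n_embd: int = 256       # embedding / model dimension
    block_size: int = 256   # maximum context length, in characters
    dropout: float = 0.1
    vocab_size: int = 120   # distinct characters in the corpus

cfg = AyvuConfig()
# Each attention head operates in a subspace of size n_embd // n_head.
assert cfg.n_embd % cfg.n_head == 0
head_dim = cfg.n_embd // cfg.n_head  # 64
```

Keeping all hyperparameters in one place mirrors the role of `ayvu-talian-base_config.yaml` in the repo.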

### Data Pipeline
The model is trained on a cleaned subset of the **Talian corpus (Garcia & Guzzo, 2021)**. The preprocessing logic (`notebooks/ayvu-talian-preprocess.ipynb`) performs:
1. **Deduplication**: Ensuring each sentence appears only once.
2. **Standardization**: Normalizing punctuation (quotes, apostrophes) and removing OCR-related metadata.
3. **Splitting**: A strict 85/15 train/test split so generalization is measured on unseen text.

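The three steps above can be sketched in a few lines; this is an illustrative stand-in, not the notebook's actual implementation:

```python
# Illustrative sketch of the preprocessing steps: standardization,
# deduplication, and an 85/15 split. The real notebook may differ.
def preprocess(lines, train_frac=0.85):
    # 1. Standardization: normalize curly quotes/apostrophes, drop empty lines.
    table = str.maketrans({"\u201c": '"', "\u201d": '"',
                           "\u2018": "'", "\u2019": "'"})
    cleaned = [ln.translate(table).strip() for ln in lines if ln.strip()]
    # 2. Deduplication, preserving first-seen order.
    unique = list(dict.fromkeys(cleaned))
    # 3. Strict train/test split.
    cut = int(len(unique) * train_frac)
    return unique[:cut], unique[cut:]

train, test = preprocess(
    ["\u201cBond\u00ec!\u201d", "\u201cBond\u00ec!\u201d",
     "Come stala?", "Tuto ben.", "", "Grassie."]
)
```

Note the duplicate greeting collapses to one entry and the empty line is dropped before splitting.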

---

## Repository Structure

- `notebooks/`:
  - `ayvu-talian-preprocess.ipynb`: Data cleaning and file preparation.
  - `ayvu-talian-base.ipynb`: Core model definition, training loop, and interactive UI.
- `checkpoints/`: Contains `ayvu_talian_best.pt`, the best-performing weights for this version.
- `data/`: Processed training and test text files.
- `ayvu-talian-base_config.yaml`: Centralized configuration for reproducible training runs.


---

## Quick Start

### 1. Requirements
Install dependencies using `pip`:
```bash
pip install -r requirements.txt
```

### 2. Training
Training can be adjusted via `ayvu-talian-base_config.yaml`. To run from the notebook:
- Ensure `device: "cuda"` is set for GPU usage.
- Run the "Training Loop" cells in `ayvu-talian-base.ipynb`.

### 3. Inference & UI
Open `ayvu-talian-base.ipynb` and navigate to the **"AYVU-Talian: Console de Geração Avançado"** (Advanced Generation Console). You can adjust:
- **Temperature**: Controls randomness (default: 0.9).
- **Top-K**: Limits sampling to the K most likely characters (default: 40).
- **Top-P**: Filters candidates by cumulative probability (nucleus sampling; default: 0.9).

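For intuition, the three controls compose as follows on a toy next-character distribution; this is a plain-Python sketch of the standard techniques, not the console's internal code:

```python
import math

# Hedged sketch: temperature scaling, top-k truncation, and top-p (nucleus)
# filtering applied to toy character logits.
def filter_logits(logits, temperature=0.9, top_k=40, top_p=0.9):
    # Temperature: divide logits before softmax (lower -> more deterministic).
    scaled = {c: l / temperature for c, l in logits.items()}
    # Top-K: keep only the K highest-scoring characters.
    kept = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the survivors.
    z = sum(math.exp(l) for _, l in kept)
    probs = [(c, math.exp(l) / z) for c, l in kept]
    # Top-P: keep the smallest prefix whose cumulative mass reaches top_p.
    out, mass = [], 0.0
    for c, p in probs:
        out.append((c, p))
        mass += p
        if mass >= top_p:
            break
    total = sum(p for _, p in out)
    return {c: p / total for c, p in out}  # renormalized distribution

dist = filter_logits({"a": 2.0, "e": 1.5, "o": 0.5, "x": -3.0},
                     top_k=3, top_p=0.8)
```

Here `top_k=3` discards the unlikely "x" outright, and the nucleus cut then drops "o" as well, leaving a renormalized distribution over "a" and "e".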

---

## Visualizations
The model includes deep-dive analysis tools:
- **Embedding PCA**: A 2D mapping of character relationships.
- **Positional Similarity Matrix**: A heatmap visualization of the learned positional encodings, showing how the model perceives sequence structure.

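An embedding PCA of this kind takes only a few lines. The sketch below uses random embeddings as a stand-in for the trained weight matrix (which would be loaded from `ayvu_talian_best.pt`):

```python
import numpy as np

# Hedged sketch: 2-D PCA of a (vocab_size, n_embd) character-embedding
# matrix via SVD. Random data stands in for the trained weights.
rng = np.random.default_rng(0)
emb = rng.normal(size=(120, 256))        # shape matches the model's table

centered = emb - emb.mean(axis=0)        # PCA requires mean-centered data
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ Vt[:2].T             # project onto the top-2 components

assert coords.shape == (120, 2)          # one 2-D point per character
```

Each row of `coords` is then one character's position in the 2-D map; with trained embeddings, related characters (e.g. accented vowels) tend to cluster.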

---

## Roadmap
- [ ] **Instruction Tuning**: Adapting the base model for specific Q&A tasks.
- [ ] **RLHF**: Integrating feedback from native speakers to curb syntactic drift.
- [ ] **Expanded Context**: Increasing `block_size` to capture longer-range dependencies.

## Citation
Corpus:
```bibtex
@misc{Garcia_Guzzo_2021,
  title={Talian corpus: a written corpus of Brazilian Veneto},
  url={https://osf.io/63nrx},
  doi={10.17605/OSF.IO/63NRX},
  author={Garcia, Guilherme D. and Guzzo, Nat\'alia B.},
  year={2021}
}
```