---
license: mit
pipeline_tag: text-generation
---
# AYVU-Talian: Character-Level Language Modeling for Linguistic Preservation

**AYVU-Talian** is a lightweight, high-precision character-level transformer trained to generate text in **Talian** (Brazilian Venetian), a minority language spoken in southern Brazil. By modeling text at the character level, AYVU captures the unique phonological and orthographic nuances of this low-resource dialect, serving as a foundational tool for linguistic preservation and research.

This project demonstrates how a custom Transformer architecture can be applied to low-resource NLP challenges.

## Key Features

- **Character-Level Architecture**: Specifically designed to handle the morphosyntactic variations of Talian without the constraints of subword tokenization.
- **Modern Transformer Design**: A decoder-only architecture with learned positional encodings and causal masking.
- **Robust Training Pipeline**: Custom data cleaning, validation-tracked checkpointing, and dynamic learning rate scheduling.
- **Interactive Evaluation**: Built-in Jupyter integration for real-time text generation with sampling controls (Temperature, Top-K, Nucleus Sampling).
- **Linguistic Insights**: PCA-based embedding visualizations that analyze how the model "understands" the language's character relationships.

---

## Technical Architecture

AYVU-Talian is built on a 5.39M-parameter **decoder-only Transformer**:

| Hyperparameter | Value |
| :--- | :--- |
| Layers (`n_layer`) | 4 |
| Attention Heads (`n_head`) | 4 |
| Embedding Dimension (`n_embd`) | 256 |
| Context Window (`block_size`) | 256 |
| Dropout | 0.1 |
| Vocabulary Size | 120 characters |

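For reference, these hyperparameters can be collected into a single config object. The dataclass below is an illustrative sketch (the `ModelConfig` name is an assumption, not the notebook's actual code):

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layer: int = 4       # transformer blocks
    n_head: int = 4        # attention heads per block
    n_embd: int = 256      # embedding dimension
    block_size: int = 256  # context window, in characters
    dropout: float = 0.1
    vocab_size: int = 120  # distinct characters in the corpus

cfg = ModelConfig()
# Each attention head works on n_embd / n_head = 64 dimensions.
assert cfg.n_embd % cfg.n_head == 0
```
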
### Data Pipeline

The model is trained on a cleaned subset of the **Talian corpus (Garcia & Guzzo, 2021)**. The preprocessing logic (`notebooks/ayvu-talian-preprocess.ipynb`) performs:

1. **Deduplication**: Ensuring each sentence appears only once.
2. **Standardization**: Normalizing punctuation (quotes, apostrophes) and removing OCR-related metadata.
3. **Splitting**: A strict 85/15 train/test split, so generalization is measured on unseen sentences.

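The three steps can be sketched as follows. This is a simplified stand-in for the notebook's cells; the `preprocess` function and details such as the shuffle seed are illustrative:

```python
import random

def preprocess(sentences, seed=42):
    # 1. Deduplication: keep one copy of each sentence, preserving order.
    seen, unique = set(), []
    for s in sentences:
        if s not in seen:
            seen.add(s)
            unique.append(s)
    # 2. Standardization: normalize curly quotes/apostrophes to ASCII.
    table = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})
    unique = [s.translate(table).strip() for s in unique]
    # 3. Splitting: strict 85/15 train/test split after shuffling.
    random.Random(seed).shuffle(unique)
    cut = int(len(unique) * 0.85)
    return unique[:cut], unique[cut:]

train, test = preprocess(["“Bondì!”", "“Bondì!”", "Come stala?", "Tuto ben."])
```
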
---

## Repository Structure

- `notebooks/`:
  - `ayvu-talian-preprocess.ipynb`: Data cleaning and file preparation.
  - `ayvu-talian-base.ipynb`: Core model definition, training loop, and interactive UI.
- `checkpoints/`: Contains `ayvu_talian_best.pt`, the best-performing weights (lowest validation loss) for this version.
- `data/`: Processed training and test text files.
- `ayvu-talian-base_config.yaml`: Centralized configuration for reproducible training runs.

---

## Quick Start

### 1. Requirements

Install dependencies using `pip`:

```bash
pip install -r requirements.txt
```

### 2. Training

Training can be adjusted via `ayvu-talian-base_config.yaml`. To run from the notebook:

- Ensure `device: "cuda"` is set for GPU usage.
- Run the "Training Loop" cells in `ayvu-talian-base.ipynb`.

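A config in this style might look like the following. The field names mirror the hyperparameter table above; the exact schema of `ayvu-talian-base_config.yaml` may differ, and the training values marked below are purely illustrative:

```yaml
# Hypothetical sketch of ayvu-talian-base_config.yaml — not the actual file.
model:
  n_layer: 4
  n_head: 4
  n_embd: 256
  block_size: 256
  dropout: 0.1
training:
  device: "cuda"        # set for GPU usage, per the step above
  batch_size: 64        # illustrative; not stated in this README
  learning_rate: 3.0e-4 # illustrative; not stated in this README
```
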
### 3. Inference & UI

Open `ayvu-talian-base.ipynb` and navigate to the **"AYVU-Talian: Console de Geração Avançado"** (Advanced Generation Console). You can adjust:

- **Temperature**: Controls randomness (default: 0.9).
- **Top-K**: Limits sampling to the K most likely characters (default: 40).
- **Top-P**: Restricts sampling to the smallest set of characters whose cumulative probability exceeds P (default: 0.9).

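These three controls act as successive filters on the model's next-character distribution. A minimal, self-contained NumPy sketch of the idea (not the notebook's implementation):

```python
import numpy as np

def sample_char(logits, temperature=0.9, top_k=40, top_p=0.9, rng=None):
    """Pick one character index from raw logits using temperature, top-k, top-p."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature  # temperature scaling
    probs = np.exp(logits - logits.max())                        # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]   # most likely character first
    order = order[:top_k]             # top-k cutoff
    cum = np.cumsum(probs[order])
    # Nucleus (top-p) cutoff: smallest prefix whose cumulative probability >= top_p.
    keep = order[: max(1, np.searchsorted(cum, top_p) + 1)]
    p = probs[keep] / probs[keep].sum()  # renormalize the survivors
    return int(rng.choice(keep, p=p))

idx = sample_char([2.0, 1.0, 0.1, -1.0], top_k=3, top_p=0.9)
```

Lower temperature and smaller K/P make generation more conservative; raising them increases diversity at the cost of orthographic errors.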
---

## Visualizations

The model includes deep-dive analysis tools:

- **Embedding PCA**: A 2D projection of the character embedding space, showing which characters the model treats as similar.
- **Positional Similarity Matrix**: A heatmap of the learned positional encodings, showing how the model perceives sequence structure.

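The embedding PCA boils down to projecting the 120 × 256 embedding table onto its top two principal components. A self-contained sketch of the technique, using random stand-in weights rather than the trained model's:

```python
import numpy as np

def embedding_pca_2d(emb):
    """Project a (vocab_size, n_embd) embedding matrix onto its top 2 principal components."""
    centered = emb - emb.mean(axis=0)  # center each embedding dimension
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T         # (vocab_size, 2) plot coordinates

rng = np.random.default_rng(0)
emb = rng.normal(size=(120, 256))  # stand-in for the trained embedding table
coords = embedding_pca_2d(emb)
```
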
---

## Roadmap

- [ ] **Instruction Tuning**: Adapting the base model for specific Q&A tasks.
- [ ] **RLHF**: Integrating feedback from native speakers to reduce syntactic drift.
- [ ] **Expanded Context**: Increasing the `block_size` to capture longer-range dependencies.

## Citation

If you use this model, please cite the underlying corpus:

```bibtex
@misc{Garcia_Guzzo_2021,
  title  = {Talian corpus: a written corpus of Brazilian Veneto},
  url    = {osf.io/63nrx},
  doi    = {10.17605/OSF.IO/63NRX},
  author = {Garcia, Guilherme D and Guzzo, Natália B},
  year   = {2021}
}
```