jorgeguberte committed ee3733a (verified) · 1 parent: c858292

Update README.md

Files changed (1): README.md (+96 −92)
---
license: mit
pipeline_tag: text-generation
---

# AYVU-Talian: Character-Level Language Modeling for Linguistic Preservation

**AYVU-Talian** is a lightweight, high-precision character-level transformer trained to generate text in **Talian** (Brazilian Venetian), a minority language spoken in southern Brazil. By modeling text at the character level, AYVU captures the unique phonological and orthographic nuances of this low-resource dialect, serving as a foundational tool for linguistic preservation and research.

This project demonstrates a custom Transformer architecture built for low-resource NLP challenges.

## Key Features

- **Character-Level Architecture**: Specifically designed to handle the morphosyntactic variations of Talian without the constraints of subword tokenization.
- **Modern Transformer Design**: A decoder-only architecture with learned positional encodings and causal masking.
- **Robust Training Pipeline**: Custom data cleaning, validation-tracked checkpointing, and dynamic learning rate scheduling.
- **Interactive Evaluation**: Built-in Jupyter integration for real-time text generation with sampling controls (temperature, top-k, nucleus sampling).
- **Linguistic Insight**: PCA-based embedding visualizations to analyze how the model "understands" the language's character relationships.

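In a character-level setup like this, tokenization reduces to a bijection between characters and integer ids. A minimal sketch, assuming the vocabulary is built directly from the training text (the names `build_vocab`, `encode`, and `decode` are illustrative, not the repo's actual API):

```python
# Minimal sketch of character-level tokenization; names are illustrative.
def build_vocab(text):
    chars = sorted(set(text))                      # unique characters, stable order
    stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
    itos = {i: ch for ch, i in stoi.items()}       # integer id -> character
    return stoi, itos

def encode(text, stoi):
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    return "".join(itos[i] for i in ids)

stoi, itos = build_vocab("ciao mondo")
ids = encode("ciao", stoi)
assert decode(ids, itos) == "ciao"  # round-trip is lossless
```

Because every character is its own token, accented Talian graphemes are modeled directly rather than being split by a subword tokenizer.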

---

## Technical Architecture

AYVU-Talian is built on a 5.39M-parameter **decoder-only Transformer**:

| Hyperparameter | Value |
| :--- | :--- |
| Layers (`n_layer`) | 4 |
| Attention heads (`n_head`) | 4 |
| Embedding dimension (`n_embd`) | 256 |
| Context window (`block_size`) | 256 |
| Dropout | 0.1 |
| Vocabulary size | 120 characters |

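As a sketch, the table above can be grouped into a single config object; field names mirror the table's identifiers, though the actual notebook may organize them differently:

```python
from dataclasses import dataclass

# Hedged sketch of the hyperparameter table as one config object.
@dataclass
class AyvuConfig:
    n_layer: int = 4        # transformer blocks
    n_head: int = 4         # attention heads per block
    n_embd: int = 256       # embedding / model dimension
    block_size: int = 256   # maximum context length, in characters
    dropout: float = 0.1
    vocab_size: int = 120   # distinct characters in the corpus

cfg = AyvuConfig()
# Each attention head operates in a subspace of size n_embd // n_head.
assert cfg.n_embd % cfg.n_head == 0
head_dim = cfg.n_embd // cfg.n_head  # 64
```

Keeping all hyperparameters in one place mirrors the role of `ayvu-talian-base_config.yaml` in the repo.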

### Data Pipeline
The model is trained on a cleaned subset of the **Talian corpus (Garcia & Guzzo, 2021)**. The preprocessing logic (`notebooks/ayvu-talian-preprocess.ipynb`) performs:
1. **Deduplication**: Ensuring each sentence appears only once.
2. **Standardization**: Normalizing punctuation (quotes, apostrophes) and removing OCR-related metadata.
3. **Splitting**: A strict 85/15 train/test split so generalization is measured on unseen text.

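The three steps above can be sketched in a few lines; this is an illustrative stand-in, not the notebook's actual implementation:

```python
# Illustrative sketch of the preprocessing steps: standardization,
# deduplication, and an 85/15 split. The real notebook may differ.
def preprocess(lines, train_frac=0.85):
    # 1. Standardization: normalize curly quotes/apostrophes, drop empty lines.
    table = str.maketrans({"\u201c": '"', "\u201d": '"',
                           "\u2018": "'", "\u2019": "'"})
    cleaned = [ln.translate(table).strip() for ln in lines if ln.strip()]
    # 2. Deduplication, preserving first-seen order.
    unique = list(dict.fromkeys(cleaned))
    # 3. Strict train/test split.
    cut = int(len(unique) * train_frac)
    return unique[:cut], unique[cut:]

train, test = preprocess(
    ["\u201cBond\u00ec!\u201d", "\u201cBond\u00ec!\u201d",
     "Come stala?", "Tuto ben.", "", "Grassie."]
)
```

Note the duplicate greeting collapses to one entry and the empty line is dropped before splitting.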

---

## Repository Structure

- `notebooks/`:
  - `ayvu-talian-preprocess.ipynb`: Data cleaning and file preparation.
  - `ayvu-talian-base.ipynb`: Core model definition, training loop, and interactive UI.
- `checkpoints/`: Contains `ayvu_talian_best.pt`, the best-performing weights for this version.
- `data/`: Processed training and test text files.
- `ayvu-talian-base_config.yaml`: Centralized configuration for reproducible training runs.


---

## Quick Start

### 1. Requirements
Install dependencies using `pip`:
```bash
pip install -r requirements.txt
```

### 2. Training
Training can be adjusted via `ayvu-talian-base_config.yaml`. To run from the notebook:
- Ensure `device: "cuda"` is set for GPU usage.
- Run the "Training Loop" cells in `ayvu-talian-base.ipynb`.

### 3. Inference & UI
Open `ayvu-talian-base.ipynb` and navigate to the **"AYVU-Talian: Console de Geração Avançado"** (Advanced Generation Console). You can adjust:
- **Temperature**: Controls randomness (default: 0.9).
- **Top-K**: Limits sampling to the K most likely characters (default: 40).
- **Top-P**: Filters candidates by cumulative probability (nucleus sampling; default: 0.9).

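For intuition, the three controls compose as follows on a toy next-character distribution; this is a plain-Python sketch of the standard techniques, not the console's internal code:

```python
import math

# Hedged sketch: temperature scaling, top-k truncation, and top-p (nucleus)
# filtering applied to toy character logits.
def filter_logits(logits, temperature=0.9, top_k=40, top_p=0.9):
    # Temperature: divide logits before softmax (lower -> more deterministic).
    scaled = {c: l / temperature for c, l in logits.items()}
    # Top-K: keep only the K highest-scoring characters.
    kept = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the survivors.
    z = sum(math.exp(l) for _, l in kept)
    probs = [(c, math.exp(l) / z) for c, l in kept]
    # Top-P: keep the smallest prefix whose cumulative mass reaches top_p.
    out, mass = [], 0.0
    for c, p in probs:
        out.append((c, p))
        mass += p
        if mass >= top_p:
            break
    total = sum(p for _, p in out)
    return {c: p / total for c, p in out}  # renormalized distribution

dist = filter_logits({"a": 2.0, "e": 1.5, "o": 0.5, "x": -3.0},
                     top_k=3, top_p=0.8)
```

Here `top_k=3` discards the unlikely "x" outright, and the nucleus cut then drops "o" as well, leaving a renormalized distribution over "a" and "e".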

---

## Visualizations
The model includes deep-dive analysis tools:
- **Embedding PCA**: A 2D mapping of character relationships.
- **Positional Similarity Matrix**: A heatmap visualization of the learned positional encodings, showing how the model perceives sequence structure.

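An embedding PCA of this kind takes only a few lines. The sketch below uses random embeddings as a stand-in for the trained weight matrix (which would be loaded from `ayvu_talian_best.pt`):

```python
import numpy as np

# Hedged sketch: 2-D PCA of a (vocab_size, n_embd) character-embedding
# matrix via SVD. Random data stands in for the trained weights.
rng = np.random.default_rng(0)
emb = rng.normal(size=(120, 256))        # shape matches the model's table

centered = emb - emb.mean(axis=0)        # PCA requires mean-centered data
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ Vt[:2].T             # project onto the top-2 components

assert coords.shape == (120, 2)          # one 2-D point per character
```

Each row of `coords` is then one character's position in the 2-D map; with trained embeddings, related characters (e.g. accented vowels) tend to cluster.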

---

## Roadmap
- [ ] **Instruction Tuning**: Adapting the base model for specific Q&A tasks.
- [ ] **RLHF**: Integrating feedback from native speakers to curb syntactic drift.
- [ ] **Expanded Context**: Increasing `block_size` to capture longer-range dependencies.

## Citation
Corpus:
```bibtex
@misc{Garcia_Guzzo_2021,
  title={Talian corpus: a written corpus of Brazilian Veneto},
  url={https://osf.io/63nrx},
  doi={10.17605/OSF.IO/63NRX},
  author={Garcia, Guilherme D. and Guzzo, Nat\'alia B.},
  year={2021}
}
```