Update README.md
README.md CHANGED
@@ -24,6 +24,23 @@ Overview
 This translation model was built from the ground up, starting with raw text data and ending with a fully functional bilingual
 English ↔ Russian language model. No pretrained translation models were used at any stage. The project emphasizes data curation,
 tokenizer design, architectural discipline, and alignment through supervised fine-tuning.
+
+### Training Procedure
+
+The model was trained end-to-end from random initialization using a causal language modeling objective.
+
+- Optimizer: AdamW
+- Loss: Cross-entropy (next-token prediction)
+- Training strategy:
+  - Stage 1: Monolingual English + Russian language modeling
+  - Stage 2: Interleaved bilingual language modeling
+  - Stage 3: Bidirectional translation alignment fine-tuning
+- Gradient accumulation used to simulate larger batch sizes
+- Checkpoints saved periodically and manually evaluated
+- Training conducted on consumer-grade GPUs
+
+The focus was on stability, coherence, and alignment rather than maximum scale.
+
 1. Data Collection & Cleaning

 The process began with a corpus of approximately 40 public-domain English books, which were:
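
The training-loop mechanics added in this hunk (AdamW, next-token cross-entropy, gradient accumulation) are described but not shown in the README. A minimal sketch of how such a loop might look in PyTorch, using toy stand-ins for the model and data (`vocab_size`, `accum_steps`, and the tiny `model` are illustrative placeholders, not the project's actual values):

```python
import torch
import torch.nn.functional as F

# Placeholder stand-ins; the README does not expose the real model or data.
vocab_size, seq_len, batch_size = 8000, 128, 4
model = torch.nn.Sequential(                    # toy causal-LM stand-in
    torch.nn.Embedding(vocab_size, 256),
    torch.nn.Linear(256, vocab_size),
)
dataloader = [torch.randint(0, vocab_size, (batch_size, seq_len))
              for _ in range(32)]               # fake token batches

accum_steps = 8                                 # hypothetical; simulates an 8x batch
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

model.train()
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    # Causal LM objective: predict token t+1 from tokens up to t.
    input_ids, targets = batch[:, :-1], batch[:, 1:]
    logits = model(input_ids)                   # (B, T, vocab)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                           targets.reshape(-1))
    (loss / accum_steps).backward()             # scale before accumulating
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Dividing the loss by `accum_steps` before `backward()` makes the accumulated gradient equal to the average over the simulated larger batch, which is the usual motivation for this technique.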
@@ -130,6 +147,19 @@ and technical texts.
 - No safety fine-tuning
 - Not suitable for production or legal/medical use

+### Alignment Perspective
+
+This project explores how structured data curation and constrained supervision can reduce hallucination and improve faithfulness in small-to-mid-scale language models.
+
+Key alignment-relevant aspects:
+- Conditioning without natural-language instructions
+- Strict source-target alignment
+- Avoidance of RLHF or preference modeling
+- Observation of failure modes under ambiguity
+
+The model is intentionally not instruction-tuned to preserve interpretability of its learned representations.
+
+
 ## Reproducibility
 Inference example:
 ```python
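
The second hunk's context window ends at the opening fence of the README's inference example, so the example itself is not visible in this diff. A hedged sketch of what tag-conditioned inference could look like, assuming a Hugging Face-style checkpoint and hypothetical `<en2ru>`/`<sep>` control tokens (the checkpoint path, tag names, and `transformers` compatibility are all assumptions, not confirmed by the source):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint path and control tokens; neither is confirmed
# by the diff, which cuts off at the opening of the real example.
ckpt = "./checkpoints/en-ru-final"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

# Direction is signaled with a control tag rather than a natural-language
# instruction, matching the README's "conditioning without instructions" design.
prompt = "<en2ru> The weather is beautiful today. <sep>"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

model.eval()
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Greedy decoding (`do_sample=False`) is used here only to make the sketch deterministic; the project's actual example may decode differently.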