LoganResearch committed
Commit 79c8b45 · verified · 1 Parent(s): 01c6fee

Update README.md

Files changed (1):
  1. README.md +30 -0
README.md CHANGED
@@ -24,6 +24,23 @@ Overview
  This translation model was built from the ground up, starting with raw text data and ending with a fully functional bilingual
  English ↔ Russian language model. No pretrained translation models were used at any stage. The project emphasizes data curation,
  tokenizer design, architectural discipline, and alignment through supervised fine-tuning.
+
+ ### Training Procedure
+
+ The model was trained end-to-end from random initialization using a causal language modeling objective.
+
+ - Optimizer: AdamW
+ - Loss: Cross-entropy (next-token prediction)
+ - Training strategy:
+   - Stage 1: Monolingual English + Russian language modeling
+   - Stage 2: Interleaved bilingual language modeling
+   - Stage 3: Bidirectional translation alignment fine-tuning
+ - Gradient accumulation used to simulate larger batch sizes
+ - Checkpoints saved periodically and manually evaluated
+ - Training conducted on consumer-grade GPUs
+
+ The focus was on stability, coherence, and alignment rather than maximum scale.
+
  1. Data Collection & Cleaning

  The process began with a corpus of approximately 40 public-domain English books, which were:
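The training procedure added in the hunk above amounts to a standard accumulated-gradient causal-LM loop. A minimal sketch of one such stage, assuming PyTorch; the model, dataloader, and hyperparameters below are illustrative placeholders, not code from this repository:

```python
# Minimal sketch of one training stage, assuming PyTorch: causal LM trained
# with AdamW, next-token cross-entropy, and gradient accumulation.
# Model, loader, and hyperparameter names are illustrative assumptions.
import torch
import torch.nn.functional as F

ACCUM_STEPS = 8  # accumulate gradients to simulate an 8x larger batch


def train_stage(model, loader, lr=3e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        # batch: (B, T) token ids; predict token t+1 from tokens <= t
        inputs, targets = batch[:, :-1], batch[:, 1:]
        logits = model(inputs)  # (B, T-1, vocab_size)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )
        (loss / ACCUM_STEPS).backward()  # scale so accumulated grads average
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```

Under this framing, the three stages would differ only in the loader passed in: monolingual English and Russian text, interleaved bilingual batches, and finally paired source-target sequences for the alignment stage.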
 
@@ -130,6 +147,19 @@ and technical texts.
  - No safety fine-tuning
  - Not suitable for production or legal/medical use

+ ### Alignment Perspective
+
+ This project explores how structured data curation and constrained supervision can reduce hallucination and improve faithfulness in small-to-mid-scale language models.
+
+ Key alignment-relevant aspects:
+ - Conditioning without natural-language instructions
+ - Strict source-target alignment
+ - Avoidance of RLHF or preference modeling
+ - Observation of failure modes under ambiguity
+
+ The model is intentionally not instruction-tuned to preserve interpretability of its learned representations.
+
+
  ## Reproducibility
  Inference example:
  ```python
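# Hedged sketch only: the diff excerpt cuts off before the repository's actual
# inference example. This version assumes a decoder-only checkpoint conditioned
# with special language tags (e.g. "<en>" / "<ru>") rather than natural-language
# instructions, consistent with the Alignment Perspective section above. The
# tag strings and tokenizer interface are assumptions, not a documented API.
import torch


def translate(model, tokenizer, text, src_tag="<en>", tgt_tag="<ru>",
              max_new_tokens=128):
    # Condition with tags only: "<en> {source} <ru>", then let the causal LM
    # continue with the target-language translation (greedy decoding).
    prompt_ids = tokenizer.encode(f"{src_tag} {text} {tgt_tag}")
    ids = torch.tensor([prompt_ids])
    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(ids)                    # (1, T, vocab_size)
            next_id = int(logits[0, -1].argmax())  # greedy next token
            if next_id == tokenizer.eos_id:
                break
            ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
    return tokenizer.decode(ids[0, len(prompt_ids):].tolist())

# Example (hypothetical; checkpoint loading omitted):
# print(translate(model, tokenizer, "The weather is fine today."))
```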