LoganResearch committed on
Commit 01c6fee · verified · 1 Parent(s): 0f896bd

Update README.md

Files changed (1):
  1. README.md +78 -4
README.md CHANGED
@@ -19,11 +19,85 @@ tags:
  # Cyborg Translator (EN ↔ RU)
 
- ## Overview
- Cyborg Translator is a custom-trained English ↔ Russian neural machine translation model.
- The project focuses on **data quality, tokenizer design, and bidirectional translation robustness**
- rather than scale.
+ ## Overview
+
+ This translation model was built from the ground up, starting with raw text data and ending with a fully
+ functional bilingual English ↔ Russian language model. No pretrained translation models were used at any
+ stage. The project emphasizes data curation, tokenizer design, architectural discipline, and alignment
+ through supervised fine-tuning.
+
+ ## 1. Data Collection & Cleaning
+
+ The process began with a corpus of approximately 40 public-domain English books, which were:
+
+ - Cleaned via custom Bash and Python scripts
+ - Normalized for punctuation, casing, and encoding
+ - Sentence-segmented and deduplicated
+ - Filtered to remove metadata, headers, and non-linguistic content
+
+ This produced a clean English monolingual corpus suitable for language modeling and later bilingual alignment.
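+
+ A minimal sketch of what such a cleaning pass can look like (illustrative only: the actual Bash/Python
+ scripts are not published in this card, and the regexes and filters below are assumptions):
+
+ ```python
+ import re
+ import unicodedata
+
+ def clean_lines(raw_lines):
+     """Illustrative cleaning pass: normalize, segment, and deduplicate raw book text."""
+     seen = set()
+     for line in raw_lines:
+         line = unicodedata.normalize("NFC", line).strip()  # normalize encoding artifacts
+         line = re.sub(r"\s+", " ", line)                   # collapse runs of whitespace
+         if not line or line.isupper():                     # drop blanks and ALL-CAPS headers
+             continue
+         for sent in re.split(r"(?<=[.!?])\s+", line):      # naive sentence segmentation
+             if sent and sent not in seen:                  # exact-match deduplication
+                 seen.add(sent)
+                 yield sent
+ ```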
+
+ ## 2. Parallel Data Creation (English ↔ Russian)
+
+ To create bilingual supervision:
+
+ - Clean English sentences were translated into Russian using a controlled automated pipeline
+ - Sentence alignment was preserved strictly (1 English sentence ↔ 1 Russian sentence)
+ - Additional filtering removed mistranslations, malformed pairs, and length mismatches
+
+ The result was a high-quality parallel English–Russian dataset suitable for supervised translation training.
+
+ > ⚠️ No pretrained translation models were fine-tuned; the translations were used only as supervision data.
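+
+ As an illustration, the length-mismatch filter can be a simple ratio test (the 3.0 ratio and 128-token
+ cap below are assumed values, not the project's documented thresholds):
+
+ ```python
+ def keep_pair(en: str, ru: str, max_ratio: float = 3.0, max_len: int = 128) -> bool:
+     """Illustrative pair filter: reject empty, over-long, or length-mismatched pairs."""
+     n_en, n_ru = len(en.split()), len(ru.split())
+     if min(n_en, n_ru) == 0 or max(n_en, n_ru) > max_len:
+         return False
+     return max(n_en, n_ru) / min(n_en, n_ru) <= max_ratio
+ ```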
+
+ ## 3. Tokenizer Built From Scratch
+
+ A SentencePiece BPE tokenizer (32k vocab) was trained from scratch on the combined English + Russian corpus.
+
+ Key properties:
+
+ - Shared multilingual vocabulary
+ - Unicode-safe
+ - No language-specific hacks
+ - Enables cross-lingual parameter sharing
+
+ This tokenizer was frozen and reused across all training stages.
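+
+ Training a tokenizer like this is a single call to the SentencePiece Python API (the file names and
+ character_coverage value below are assumptions; the 32k BPE vocabulary is from the description above):
+
+ ```python
+ import sentencepiece as spm
+
+ # Train a shared 32k BPE vocabulary on the combined EN+RU corpus.
+ spm.SentencePieceTrainer.train(
+     input="combined_en_ru.txt",    # assumed corpus file name
+     model_prefix="cyborg_bpe32k",  # assumed output prefix
+     vocab_size=32000,
+     model_type="bpe",
+     character_coverage=1.0,        # full Unicode coverage for Cyrillic + Latin
+ )
+ ```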
+
+ ## 4. Base Language Model Initialization
+
+ The model architecture is a GPT-style causal transformer (~300M parameters) initialized randomly (no pretrained weights).
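+
+ For scale, one hyperparameter combination that lands in the ~300M class (these exact values are
+ assumptions; only the parameter count and GPT-style design are stated here):
+
+ ```python
+ from dataclasses import dataclass
+
+ @dataclass
+ class ModelConfig:
+     vocab_size: int = 32000  # matches the shared BPE tokenizer
+     n_layers: int = 24       # assumed depth
+     d_model: int = 1024      # assumed width
+     n_heads: int = 16        # assumed head count
+     max_seq_len: int = 1024  # strict context limit tied to positional embeddings
+
+ def approx_params(c: ModelConfig) -> int:
+     """Rough GPT-style count: token embeddings + ~12 * d_model^2 per block."""
+     return c.vocab_size * c.d_model + c.n_layers * 12 * c.d_model ** 2
+
+ print(approx_params(ModelConfig()))  # ~335M, i.e. the ~300M class
+ ```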
+ Training began with:
+
+ - Standard causal language modeling objective
+ - Mixed English and Russian text
+ - Strict context length limits aligned with positional embeddings
+
+ This stage taught the model:
+
+ - Syntax
+ - Word order
+ - Grammar
+ - Cross-lingual token co-occurrence patterns
+
+ ## 5. Bilingual Language Modeling
+
+ After initial convergence, the model was trained on interleaved English and Russian data, allowing it to:
+
+ - Develop internal multilingual representations
+ - Share semantic structure across languages
+ - Build fluency independently in both languages
+
+ At this stage, the model was not yet a translator, but a bilingual language model.
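+
+ The interleaving itself can be as simple as alternating examples from the two monolingual streams
+ (a sketch; the actual mixing ratio and schedule are not documented here):
+
+ ```python
+ def interleave(en_stream, ru_stream):
+     """Alternate English and Russian examples so every batch sees both languages."""
+     for en_ex, ru_ex in zip(en_stream, ru_stream):
+         yield en_ex
+         yield ru_ex
+ ```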
+
+ ## 6. Translation Alignment Fine-Tuning
+
+ To convert the bilingual LM into a translator, a structured alignment phase was introduced.
+
+ Training examples were formatted as:
+
+ ```
+ <EN> source sentence
+ <RU> target sentence
+ ```
+
+ and the reverse direction:
+
+ ```
+ <RU> source sentence
+ <EN> target sentence
+ ```
+
+ Key alignment principles:
+
+ - No natural-language instructions
+ - No chat formatting
+ - Strict conditional generation (see the masking sketch below)
+ - Low learning rate to preserve fluency
+
+ This taught the model to:
+
+ - Condition generation on the source language
+ - Maintain semantic faithfulness
+ - Reduce hallucination while preserving coherence
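+
+ One common reading of "strict conditional generation" is that the loss covers only the target half of
+ each example; the sketch below assumes that. The <EN>/<RU> tags come from the format above, everything
+ else is illustrative:
+
+ ```python
+ def format_pair(src: str, tgt: str, src_lang: str, tgt_lang: str) -> tuple[str, str]:
+     """Render one alignment example; loss should cover only the target segment."""
+     prompt = f"<{src_lang}> {src}\n<{tgt_lang}> "
+     return prompt, tgt  # tokens in `prompt` are masked out of the loss
+
+ # Both directions are generated from the same sentence pair:
+ en, ru = "The cat sat on the mat.", "Кот сидел на коврике."
+ print(format_pair(en, ru, "EN", "RU"))
+ print(format_pair(ru, en, "RU", "EN"))
+ ```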
+
+ ## 7. Evaluation & Iterative Refinement
+
+ The model was evaluated manually and iteratively on:
+
+ - Literal translation accuracy
+ - Sentence-level coherence
+ - Hallucination frequency
+ - Directional stability (EN→RU vs. RU→EN)
+
+ At ~300M parameters, the model achieves:
+
+ - Coherent sentence-level translation
+ - Stable grammar
+ - Occasional lexical or semantic drift (expected at this scale)
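+
+ At inference time, translation follows the same tagged format used in training: feed the source sentence
+ with its language tag and decode after the target tag. A hedged usage sketch, assuming the checkpoint is
+ compatible with the Hugging Face transformers API (the repo id below is a placeholder, not confirmed by
+ this card):
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ repo = "LoganResearch/cyborg-translator"  # placeholder repo id
+ tok = AutoTokenizer.from_pretrained(repo)
+ model = AutoModelForCausalLM.from_pretrained(repo)
+
+ prompt = "<EN> The cat sat on the mat.\n<RU> "
+ inputs = tok(prompt, return_tensors="pt")
+ out = model.generate(**inputs, max_new_tokens=64)
+ print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+ ```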
 
  This model was trained end-to-end on a self-curated parallel corpus derived from cleaned literary
  and technical texts.
103