# Cyborg Translator (EN ↔ RU)

Cyborg Translator is a custom-trained English ↔ Russian neural machine translation model. The project focuses on **data quality, tokenizer design, and bidirectional translation robustness** rather than scale.

## Overview

This translation model was built from the ground up, starting with raw text data and ending with a fully functional bilingual English ↔ Russian language model. No pretrained translation models were used at any stage. The project emphasizes data curation, tokenizer design, architectural discipline, and alignment through supervised fine-tuning.

## 1. Data Collection & Cleaning

The process began with a corpus of approximately 40 public-domain English books, which were:

- Cleaned via custom Bash and Python scripts
- Normalized for punctuation, casing, and encoding
- Sentence-segmented and deduplicated
- Filtered to remove metadata, headers, and non-linguistic content

This produced a clean English monolingual corpus suitable for language modeling and later bilingual alignment.
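
The original cleaning scripts are not included in this card; a minimal Python sketch of the kind of pass described above (patterns and thresholds are illustrative assumptions, not the project's actual values) might look like:

```python
import re
import unicodedata

def clean_corpus(lines):
    """Normalize, segment, deduplicate, and filter raw book text.

    A simplified stand-in for the custom Bash/Python pipeline;
    every pattern and threshold here is an illustrative guess.
    """
    seen = set()
    for line in lines:
        # Normalize encoding and common punctuation variants.
        text = unicodedata.normalize("NFKC", line).strip()
        text = text.replace("\u201c", '"').replace("\u201d", '"')
        # Skip obvious non-linguistic content: page numbers, roman-numeral
        # headers, and other all-digit/punctuation lines.
        if not text or re.fullmatch(r"[\dIVXLC\s.\-]+", text):
            continue
        # Naive sentence segmentation on terminal punctuation.
        for sent in re.split(r"(?<=[.!?])\s+", text):
            sent = sent.strip()
            # Exact-match deduplication across the whole corpus.
            if len(sent) > 2 and sent not in seen:
                seen.add(sent)
                yield sent
```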

## 2. Parallel Data Creation (English ↔ Russian)

To create bilingual supervision:

- Clean English sentences were translated into Russian using a controlled automated pipeline
- Sentence alignment was preserved strictly (1 English sentence ↔ 1 Russian sentence)
- Additional filtering removed mistranslations, malformed pairs, and length mismatches

The result was a high-quality parallel English–Russian dataset suitable for supervised translation training.

⚠️ No pretrained translation models were fine-tuned; the translations were used only as supervision data.
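
The filtering criteria are described only at a high level; one plausible implementation of the length-mismatch check (the ratio bound is an assumption, not the project's actual value) is:

```python
def keep_pair(en: str, ru: str, max_ratio: float = 2.0) -> bool:
    """Reject malformed or length-mismatched sentence pairs.

    The 2.0 character-length ratio bound is an illustrative guess.
    """
    if not en.strip() or not ru.strip():
        return False  # drop pairs with an empty or whitespace-only side
    ratio = len(en) / len(ru)
    return 1 / max_ratio <= ratio <= max_ratio

# A well-matched pair passes; a truncated one does not.
assert keep_pair("The cat sat on the mat.", "Кот сидел на коврике.")
assert not keep_pair("The cat sat on the mat.", "Кот.")
```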

## 3. Tokenizer Built From Scratch

A SentencePiece BPE tokenizer (32k vocab) was trained from scratch on the combined English + Russian corpus.

Key properties:

- Shared multilingual vocabulary
- Unicode-safe
- No language-specific hacks
- Enables cross-lingual parameter sharing

This tokenizer was frozen and reused across all training stages.
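
Training such a tokenizer is reproducible with the standard SentencePiece Python API; the file names, coverage setting, and tag registration below are assumptions rather than the project's recorded configuration:

```python
import sentencepiece as spm

# Train a shared 32k BPE vocabulary over the combined EN+RU text
# (one sentence per line). File names here are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus_en_ru.txt",
    model_prefix="cyborg_enru",      # writes cyborg_enru.model / .vocab
    vocab_size=32000,
    model_type="bpe",
    character_coverage=1.0,          # full Latin + Cyrillic coverage
    user_defined_symbols=["<EN>", "<RU>"],  # tags used later for alignment
)

sp = spm.SentencePieceProcessor(model_file="cyborg_enru.model")
print(sp.encode("Привет, мир!", out_type=str))
```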

## 4. Base Language Model Initialization

The model architecture is a GPT-style causal transformer (~300M parameters) initialized randomly (no pretrained weights).

Training began with:

- Standard causal language modeling objective
- Mixed English and Russian text
- Strict context length limits aligned with positional embeddings

This stage taught the model:

- Syntax
- Word order
- Grammar
- Cross-lingual token co-occurrence patterns
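
The card states only the rough scale; as a sanity check, here is a back-of-the-envelope parameter count for one hypothetical configuration that lands near 300M (the depth and width below are not taken from the source):

```python
def approx_gpt_params(n_layers: int, d_model: int, vocab: int) -> int:
    """Rough decoder-only transformer parameter count.

    Per block: ~4*d^2 (attention) + ~8*d^2 (MLP with 4x expansion) = 12*d^2.
    Input embeddings are assumed tied with the output projection.
    """
    return 12 * n_layers * d_model**2 + vocab * d_model

# A 20-layer, d_model=1024 model with the 32k vocab: ~284M parameters.
print(approx_gpt_params(n_layers=20, d_model=1024, vocab=32000))
```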

## 5. Bilingual Language Modeling

After initial convergence, the model was trained on interleaved English and Russian data, allowing it to:

- Develop internal multilingual representations
- Share semantic structure across languages
- Build fluency independently in both languages

At this stage, the model was not yet a translator, but a bilingual language model.
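
"Interleaved" is not pinned down further; the simplest reading is a round-robin mix of the two monolingual streams, sketched below (the strict 1:1 alternation is an assumption):

```python
def interleave(en_stream, ru_stream):
    """Alternate English and Russian examples one-for-one.

    The actual mixing schedule is not documented; strict 1:1 is assumed.
    """
    for en, ru in zip(en_stream, ru_stream):
        yield en
        yield ru

mixed = list(interleave(["A cat.", "A dog."], ["Кошка.", "Собака."]))
# -> ['A cat.', 'Кошка.', 'A dog.', 'Собака.']
```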

## 6. Translation Alignment Fine-Tuning

To convert the bilingual LM into a translator, a structured alignment phase was introduced.

Training examples were formatted as:

```
<EN> source sentence
<RU> target sentence
```

and the reverse direction:

```
<RU> source sentence
<EN> target sentence
```

Key alignment principles:

- No natural-language instructions
- No chat formatting
- Strict conditional generation
- Low learning rate to preserve fluency

This taught the model to:

- Condition generation on the source language
- Maintain semantic faithfulness
- Reduce hallucination while preserving coherence
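
Given that format, serializing the aligned pairs reduces to a small helper; the newline separator and per-pair direction duplication are assumptions (the `<EN>`/`<RU>` tags come directly from the format above):

```python
def make_example(src_tag: str, src: str, tgt_tag: str, tgt: str) -> str:
    # Serialize one aligned pair into the tagged training format.
    return f"<{src_tag}> {src}\n<{tgt_tag}> {tgt}"

def both_directions(en: str, ru: str):
    # Each aligned pair contributes one EN->RU and one RU->EN example.
    yield make_example("EN", en, "RU", ru)
    yield make_example("RU", ru, "EN", en)
```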

## 7. Evaluation & Iterative Refinement

The model was evaluated manually and iteratively on:

- Literal translation accuracy
- Sentence-level coherence
- Hallucination frequency
- Directional stability (EN→RU vs RU→EN)

At ~300M parameters, the model achieves:

- Coherent sentence-level translation
- Stable grammar
- Occasional lexical or semantic drift (expected at this scale)
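
One lightweight way to probe directional stability alongside such manual review is a round-trip check; `translate` below is a hypothetical callable wrapping the model, not an interface this repository is known to expose:

```python
def round_trip_overlap(sentence: str, translate) -> float:
    """Score EN -> RU -> EN drift with a crude token-overlap metric.

    `translate(text, direction)` is a hypothetical wrapper; the project's
    actual evaluation was manual, so this is only a supplementary probe.
    """
    ru = translate(sentence, direction="en-ru")
    back = translate(ru, direction="ru-en")
    src, hyp = set(sentence.lower().split()), set(back.lower().split())
    return len(src & hyp) / max(len(src), 1)  # 1.0 = perfect overlap
```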

This model was trained end-to-end on a self-curated parallel corpus derived from cleaned literary and technical texts.