LoganResearch committed on
Commit 01c6fee · verified · 1 Parent(s): 0f896bd

Update README.md

Files changed (1):
  1. README.md +78 -4
README.md CHANGED
@@ -19,11 +19,85 @@ tags:
  # Cyborg Translator (EN ↔ RU)
 
- ## Overview
- Cyborg Translator is a custom-trained English ↔ Russian neural machine translation model.
- The project focuses on **data quality, tokenizer design, and bidirectional translation robustness**
- rather than scale.
+ ## Overview
+
+ This translation model was built from the ground up, starting with raw text data and ending with a fully
+ functional bilingual English ↔ Russian language model. No pretrained translation models were used at any
+ stage. The project emphasizes data curation, tokenizer design, architectural discipline, and alignment
+ through supervised fine-tuning.
+
+ ## 1. Data Collection & Cleaning
+
+ The process began with a corpus of approximately 40 public-domain English books, which were:
+
+ - Cleaned via custom Bash and Python scripts
+ - Normalized for punctuation, casing, and encoding
+ - Sentence-segmented and deduplicated
+ - Filtered to remove metadata, headers, and non-linguistic content
+
+ This produced a clean English monolingual corpus suitable for language modeling and later bilingual alignment.
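+
+ A minimal sketch of what such a cleaning pass can look like (illustrative only: the actual Bash/Python
+ scripts are not published in this card, and the regexes and filters below are assumptions):
+
+ ```python
+ import re
+ import unicodedata
+
+ def clean_lines(raw_lines):
+     """Illustrative cleaning pass: normalize, segment, and deduplicate raw book text."""
+     seen = set()
+     for line in raw_lines:
+         line = unicodedata.normalize("NFC", line).strip()  # normalize encoding artifacts
+         line = re.sub(r"\s+", " ", line)                   # collapse runs of whitespace
+         if not line or line.isupper():                     # drop blanks and ALL-CAPS headers
+             continue
+         for sent in re.split(r"(?<=[.!?])\s+", line):      # naive sentence segmentation
+             if sent and sent not in seen:                  # exact-match deduplication
+                 seen.add(sent)
+                 yield sent
+ ```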
+
+ ## 2. Parallel Data Creation (English ↔ Russian)
+
+ To create bilingual supervision:
+
+ - Clean English sentences were translated into Russian using a controlled automated pipeline
+ - Sentence alignment was preserved strictly (1 English sentence ↔ 1 Russian sentence)
+ - Additional filtering removed mistranslations, malformed pairs, and length mismatches
+
+ The result was a high-quality parallel English–Russian dataset suitable for supervised translation training.
+
+ > ⚠️ No pretrained translation models were fine-tuned; the translations were used only as supervision data.
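+
+ As an illustration, the length-mismatch filter can be a simple ratio test (the 3.0 ratio and 128-token
+ cap below are assumed values, not the project's documented thresholds):
+
+ ```python
+ def keep_pair(en: str, ru: str, max_ratio: float = 3.0, max_len: int = 128) -> bool:
+     """Illustrative pair filter: reject empty, over-long, or length-mismatched pairs."""
+     n_en, n_ru = len(en.split()), len(ru.split())
+     if min(n_en, n_ru) == 0 or max(n_en, n_ru) > max_len:
+         return False
+     return max(n_en, n_ru) / min(n_en, n_ru) <= max_ratio
+ ```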
+
+ ## 3. Tokenizer Built From Scratch
+
+ A SentencePiece BPE tokenizer (32k vocab) was trained from scratch on the combined English + Russian corpus.
+
+ Key properties:
+
+ - Shared multilingual vocabulary
+ - Unicode-safe
+ - No language-specific hacks
+ - Enables cross-lingual parameter sharing
+
+ This tokenizer was frozen and reused across all training stages.
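+
+ Training a tokenizer like this is a single call to the SentencePiece Python API (the file names and
+ character_coverage value below are assumptions; the 32k BPE vocabulary is from the description above):
+
+ ```python
+ import sentencepiece as spm
+
+ # Train a shared 32k BPE vocabulary on the combined EN+RU corpus.
+ spm.SentencePieceTrainer.train(
+     input="combined_en_ru.txt",    # assumed corpus file name
+     model_prefix="cyborg_bpe32k",  # assumed output prefix
+     vocab_size=32000,
+     model_type="bpe",
+     character_coverage=1.0,        # full Unicode coverage for Cyrillic + Latin
+ )
+ ```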
+
+ ## 4. Base Language Model Initialization
+
+ The model architecture is a GPT-style causal transformer (~300M parameters) initialized randomly (no pretrained weights).
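+
+ For scale, one hyperparameter combination that lands in the ~300M class (these exact values are
+ assumptions; only the parameter count and GPT-style design are stated here):
+
+ ```python
+ from dataclasses import dataclass
+
+ @dataclass
+ class ModelConfig:
+     vocab_size: int = 32000  # matches the shared BPE tokenizer
+     n_layers: int = 24       # assumed depth
+     d_model: int = 1024      # assumed width
+     n_heads: int = 16        # assumed head count
+     max_seq_len: int = 1024  # strict context limit tied to positional embeddings
+
+ def approx_params(c: ModelConfig) -> int:
+     """Rough GPT-style count: token embeddings + ~12 * d_model^2 per block."""
+     return c.vocab_size * c.d_model + c.n_layers * 12 * c.d_model ** 2
+
+ print(approx_params(ModelConfig()))  # ~335M, i.e. the ~300M class
+ ```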
+ Training began with:
+
+ - Standard causal language modeling objective
+ - Mixed English and Russian text
+ - Strict context length limits aligned with positional embeddings
+
+ This stage taught the model:
+
+ - Syntax
+ - Word order
+ - Grammar
+ - Cross-lingual token co-occurrence patterns
+
+ ## 5. Bilingual Language Modeling
+
+ After initial convergence, the model was trained on interleaved English and Russian data, allowing it to:
+
+ - Develop internal multilingual representations
+ - Share semantic structure across languages
+ - Build fluency independently in both languages
+
+ At this stage, the model was not yet a translator, but a bilingual language model.
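+
+ The interleaving itself can be as simple as alternating examples from the two monolingual streams
+ (a sketch; the actual mixing ratio and schedule are not documented here):
+
+ ```python
+ def interleave(en_stream, ru_stream):
+     """Alternate English and Russian examples so every batch sees both languages."""
+     for en_ex, ru_ex in zip(en_stream, ru_stream):
+         yield en_ex
+         yield ru_ex
+ ```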
+
+ ## 6. Translation Alignment Fine-Tuning
+
+ To convert the bilingual LM into a translator, a structured alignment phase was introduced.
+
+ Training examples were formatted as:
+
+ ```
+ <EN> source sentence
+ <RU> target sentence
+ ```
+
+ and the reverse direction:
+
+ ```
+ <RU> source sentence
+ <EN> target sentence
+ ```
+
+ Key alignment principles:
+
+ - No natural-language instructions
+ - No chat formatting
+ - Strict conditional generation (see the masking sketch below)
+ - Low learning rate to preserve fluency
+
+ This taught the model to:
+
+ - Condition generation on the source language
+ - Maintain semantic faithfulness
+ - Reduce hallucination while preserving coherence
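+
+ One common reading of "strict conditional generation" is that the loss covers only the target half of
+ each example; the sketch below assumes that. The <EN>/<RU> tags come from the format above, everything
+ else is illustrative:
+
+ ```python
+ def format_pair(src: str, tgt: str, src_lang: str, tgt_lang: str) -> tuple[str, str]:
+     """Render one alignment example; loss should cover only the target segment."""
+     prompt = f"<{src_lang}> {src}\n<{tgt_lang}> "
+     return prompt, tgt  # tokens in `prompt` are masked out of the loss
+
+ # Both directions are generated from the same sentence pair:
+ en, ru = "The cat sat on the mat.", "Кот сидел на коврике."
+ print(format_pair(en, ru, "EN", "RU"))
+ print(format_pair(ru, en, "RU", "EN"))
+ ```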
+
+ ## 7. Evaluation & Iterative Refinement
+
+ The model was evaluated manually and iteratively on:
+
+ - Literal translation accuracy
+ - Sentence-level coherence
+ - Hallucination frequency
+ - Directional stability (EN→RU vs. RU→EN)
+
+ At ~300M parameters, the model achieves:
+
+ - Coherent sentence-level translation
+ - Stable grammar
+ - Occasional lexical or semantic drift (expected at this scale)
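+
+ At inference time, translation follows the same tagged format used in training: feed the source sentence
+ with its language tag and decode after the target tag. A hedged usage sketch, assuming the checkpoint is
+ compatible with the Hugging Face transformers API (the repo id below is a placeholder, not confirmed by
+ this card):
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ repo = "LoganResearch/cyborg-translator"  # placeholder repo id
+ tok = AutoTokenizer.from_pretrained(repo)
+ model = AutoModelForCausalLM.from_pretrained(repo)
+
+ prompt = "<EN> The cat sat on the mat.\n<RU> "
+ inputs = tok(prompt, return_tensors="pt")
+ out = model.generate(**inputs, max_new_tokens=64)
+ print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+ ```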
 
  This model was trained end-to-end on a self-curated parallel corpus derived from cleaned literary
  and technical texts.
103