Cyborg Translator (EN ↔ RU)
Overview
This translation model was built from the ground up, starting with raw text data and ending with a fully functional bilingual English ↔ Russian language model. No pretrained translation models were used at any stage. The project emphasizes data curation, tokenizer design, architectural discipline, and alignment through supervised fine-tuning.
Training Procedure
The model was trained end-to-end from random initialization using a causal language modeling objective.
- Optimizer: AdamW
- Loss: Cross-entropy (next-token prediction)
- Training strategy:
  - Stage 1: Monolingual English + Russian language modeling
  - Stage 2: Interleaved bilingual language modeling
  - Stage 3: Bidirectional translation alignment fine-tuning
- Gradient accumulation used to simulate larger batch sizes
- Checkpoints saved periodically and manually evaluated
- Training conducted on consumer-grade GPUs
The focus was on stability, coherence, and alignment rather than maximum scale.
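The Stage 2 interleaving above can be sketched as a round-robin merge of the two monolingual streams, so every stretch of training data contains both languages. This is a minimal illustration, not the project's actual batching code; `en_stream` and `ru_stream` are hypothetical names.

```python
from itertools import zip_longest

def interleave(en_stream, ru_stream):
    """Round-robin merge of English and Russian examples so the model
    sees both languages throughout training (illustrative sketch)."""
    for en, ru in zip_longest(en_stream, ru_stream):
        if en is not None:
            yield en
        if ru is not None:
            yield ru

mixed = list(interleave(["en1", "en2"], ["ru1", "ru2", "ru3"]))
# mixed == ["en1", "ru1", "en2", "ru2", "ru3"]
```

In practice such a generator would feed a tokenizer and batcher; the round-robin order simply guarantees neither language dominates any contiguous span of the data.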
Data & Training Pipeline
1. Data Collection & Cleaning
The process began with a corpus of approximately 40 public-domain English books, which were:
- Cleaned via custom Bash and Python scripts
- Normalized for punctuation, casing, and encoding
- Sentence-segmented and deduplicated
- Filtered to remove metadata, headers, and non-linguistic content
This produced a clean English monolingual corpus suitable for language modeling and later bilingual alignment.
2. Parallel Data Creation (English ↔ Russian)
To create bilingual supervision:
- Clean English sentences were translated into Russian using a controlled automated pipeline
- Sentence alignment was preserved strictly (1 English sentence ↔ 1 Russian sentence)
- Additional filtering removed mistranslations, malformed pairs, and length mismatches
The result was a high-quality parallel English–Russian dataset suitable for supervised translation training.
⚠️ No pretrained translation models were fine-tuned; the translations were used only as supervision data.
3. Tokenizer Built From Scratch
A SentencePiece BPE tokenizer (32k vocab) was trained from scratch on the combined English + Russian corpus. Key properties:
- Shared multilingual vocabulary
- Unicode-safe
- No language-specific hacks
- Enables cross-lingual parameter sharing
This tokenizer was frozen and reused across all training stages.
4. Base Language Model Initialization
The model architecture is a GPT-style causal transformer (~300M parameters) initialized randomly (no pretrained weights). Training began with:
- A standard causal language modeling objective
- Mixed English and Russian text
- Strict context length limits aligned with positional embeddings
This stage taught the model syntax, word order, grammar, and cross-lingual token co-occurrence patterns.
5. Bilingual Language Modeling
After initial convergence, the model was trained on interleaved English and Russian data, allowing it to:
- Develop internal multilingual representations
- Share semantic structure across languages
- Build fluency independently in both languages
At this stage, the model was not yet a translator, but a bilingual language model.
6. Translation Alignment Fine-Tuning
To convert the bilingual LM into a translator, a structured alignment phase was introduced. Training examples were formatted as:
source sentence → target sentence
and likewise for the reverse direction (RU → EN).
Key alignment principles:
- No natural-language instructions
- No chat formatting
- Strict conditional generation
- Low learning rate to preserve fluency
This taught the model to:
- Condition generation on the source language
- Maintain semantic faithfulness
- Reduce hallucination while preserving coherence
7. Evaluation & Iterative Refinement
The model was evaluated manually and iteratively on:
- Literal translation accuracy
- Sentence-level coherence
- Hallucination frequency
- Directional stability (EN→RU vs RU→EN)
At ~300M parameters, the model achieves:
- Coherent sentence-level translation
- Stable grammar
- Occasional lexical or semantic drift (expected at this scale)
This model was trained end-to-end on a self-curated parallel corpus derived from cleaned literary and technical texts.
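The length-mismatch filter applied during parallel data creation can be approximated with a simple character-length ratio check. This is a sketch under assumed thresholds; the project's actual filtering criteria and cutoff values are not published.

```python
def keep_pair(en: str, ru: str, max_ratio: float = 2.0, min_chars: int = 3) -> bool:
    """Reject malformed or length-mismatched sentence pairs.
    Thresholds are illustrative assumptions, not the project's real values."""
    if len(en.strip()) < min_chars or len(ru.strip()) < min_chars:
        return False  # drop empty or near-empty sides
    ratio = len(en) / len(ru)
    # Drop pairs whose lengths differ by more than max_ratio in either direction
    return 1 / max_ratio <= ratio <= max_ratio

print(keep_pair("Hello world", "Привет, мир"))  # True: comparable lengths
print(keep_pair("Hello world", ""))             # False: empty target
```

A production pipeline would combine this with language identification and punctuation checks, as the Training Data section describes.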
Model Details
- Architecture: Transformer (GPT-style causal LM adapted for translation)
- Parameters: ~300M
- Precision: FP32
- Tokenizer: Custom BPE (32k vocab)
- Framework: PyTorch / Hugging Face Transformers
- Training Style: Supervised bilingual sequence modeling
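The exact configuration behind the ~300M figure is not published, but it is consistent with a standard GPT-style shape. As a worked estimate under one assumed configuration (24 layers, d_model = 1024, 32k vocabulary, tied embeddings — hypothetical values):

```python
def gpt_param_count(n_layer: int, d_model: int, vocab: int) -> int:
    """Approximate parameter count of a GPT-style decoder:
    each block contributes ~12 * d_model^2 weights (attention
    projections plus a 4x-wide MLP), plus one tied input/output
    embedding matrix of vocab * d_model."""
    per_block = 12 * d_model ** 2
    embeddings = vocab * d_model
    return n_layer * per_block + embeddings

print(gpt_param_count(24, 1024, 32_000))  # ~335M, in the ~300M range
```

Biases, layer norms, and positional embeddings add a small correction on top of this estimate.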
Training Data
- ~40 public-domain English books
- Extensive text normalization and deduplication
- Sentence-level alignment
- Russian translations generated and filtered programmatically
- Multiple cleaning passes (length, language ID, punctuation, encoding)
Emphasis was placed on corpus hygiene and alignment fidelity.
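The normalization and deduplication passes can be sketched as follows. This is a minimal illustration; the project's actual Bash and Python scripts are not published.

```python
import unicodedata

def normalize(line: str) -> str:
    """Unify Unicode encodings and whitespace before deduplication."""
    line = unicodedata.normalize("NFC", line)  # fold encoding variants
    return " ".join(line.split())              # collapse runs of whitespace

def dedupe(lines):
    """Drop exact duplicates while preserving first-seen order."""
    seen = set()
    for line in map(normalize, lines):
        if line and line not in seen:
            seen.add(line)
            yield line

print(list(dedupe(["Hello  world", "Hello world", ""])))  # ['Hello world']
```

Normalizing before comparing is what makes the dedup pass catch near-identical lines that differ only in spacing or Unicode form.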
Intended Use
- English ↔ Russian translation research
- Studying effects of tokenizer choice on MT quality
- Low-resource MT experimentation
- Educational purposes
Limitations
- Not instruction-tuned
- May hallucinate under ambiguous input
- No safety fine-tuning
- Not suitable for production or legal/medical use
Alignment Perspective
This project explores how structured data curation and constrained supervision can reduce hallucination and improve faithfulness in small-to-mid-scale language models.
Key alignment-relevant aspects:
- Conditioning without natural-language instructions
- Strict source-target alignment
- Avoidance of RLHF or preference modeling
- Observation of failure modes under ambiguity
The model is intentionally not instruction-tuned to preserve interpretability of its learned representations.
Reproducibility
Inference example:
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("LoganResearch/cyborg-translator-en-ru")
model = AutoModelForCausalLM.from_pretrained("LoganResearch/cyborg-translator-en-ru")

# Encode the source sentence and generate a continuation
inputs = tok("Hello world", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=100)
print(tok.decode(out[0], skip_special_tokens=True))