Cyborg Translator (EN ↔ RU)
Overview
This translation model was built from the ground up, starting with raw text data and ending with a fully functional bilingual English ↔ Russian language model. No pretrained translation models were used at any stage. The project emphasizes data curation, tokenizer design, architectural discipline, and alignment through supervised fine-tuning.
Training Procedure
The model was trained end-to-end from random initialization using a causal language modeling objective.
- Optimizer: AdamW
- Loss: Cross-entropy (next-token prediction)
- Training strategy:
  - Stage 1: Monolingual English + Russian language modeling
  - Stage 2: Interleaved bilingual language modeling
  - Stage 3: Bidirectional translation alignment fine-tuning
- Gradient accumulation used to simulate larger batch sizes
- Checkpoints saved periodically and manually evaluated
- Training conducted on consumer-grade GPUs
The focus was on stability, coherence, and alignment rather than maximum scale.
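The Stage 2 interleaving above can be sketched as a round-robin merge of the two monolingual streams, so every stretch of training data contains both languages. This is a minimal illustration, not the project's actual batching code; `en_stream` and `ru_stream` are hypothetical names.

```python
from itertools import zip_longest

def interleave(en_stream, ru_stream):
    """Round-robin merge of English and Russian examples so the model
    sees both languages throughout training (illustrative sketch)."""
    for en, ru in zip_longest(en_stream, ru_stream):
        if en is not None:
            yield en
        if ru is not None:
            yield ru

mixed = list(interleave(["en1", "en2"], ["ru1", "ru2", "ru3"]))
# mixed == ["en1", "ru1", "en2", "ru2", "ru3"]
```

In practice such a generator would feed a tokenizer and batcher; the round-robin order simply guarantees neither language dominates any contiguous span of the data.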
Data & Training Pipeline
1. Data Collection & Cleaning
The process began with a corpus of approximately 40 public-domain English books, which were:
- Cleaned via custom Bash and Python scripts
- Normalized for punctuation, casing, and encoding
- Sentence-segmented and deduplicated
- Filtered to remove metadata, headers, and non-linguistic content
This produced a clean English monolingual corpus suitable for language modeling and later bilingual alignment.
2. Parallel Data Creation (English ↔ Russian)
To create bilingual supervision:
- Clean English sentences were translated into Russian using a controlled automated pipeline
- Sentence alignment was preserved strictly (1 English sentence ↔ 1 Russian sentence)
- Additional filtering removed mistranslations, malformed pairs, and length mismatches
The result was a high-quality parallel English–Russian dataset suitable for supervised translation training.
⚠️ No pretrained translation models were fine-tuned; the translations were used only as supervision data.
3. Tokenizer Built From Scratch
A SentencePiece BPE tokenizer (32k vocab) was trained from scratch on the combined English + Russian corpus. Key properties:
- Shared multilingual vocabulary
- Unicode-safe
- No language-specific hacks
- Enables cross-lingual parameter sharing
This tokenizer was frozen and reused across all training stages.
4. Base Language Model Initialization
The model architecture is a GPT-style causal transformer (~300M parameters) initialized randomly (no pretrained weights). Training began with:
- A standard causal language modeling objective
- Mixed English and Russian text
- Strict context length limits aligned with positional embeddings
This stage taught the model syntax, word order, grammar, and cross-lingual token co-occurrence patterns.
5. Bilingual Language Modeling
After initial convergence, the model was trained on interleaved English and Russian data, allowing it to:
- Develop internal multilingual representations
- Share semantic structure across languages
- Build fluency independently in both languages
At this stage, the model was not yet a translator, but a bilingual language model.
6. Translation Alignment Fine-Tuning
To convert the bilingual LM into a translator, a structured alignment phase was introduced. Training examples were formatted as:
source sentence → target sentence
and likewise for the reverse direction (RU → EN).
Key alignment principles:
- No natural-language instructions
- No chat formatting
- Strict conditional generation
- Low learning rate to preserve fluency
This taught the model to:
- Condition generation on the source language
- Maintain semantic faithfulness
- Reduce hallucination while preserving coherence
7. Evaluation & Iterative Refinement
The model was evaluated manually and iteratively on:
- Literal translation accuracy
- Sentence-level coherence
- Hallucination frequency
- Directional stability (EN→RU vs RU→EN)
At ~300M parameters, the model achieves:
- Coherent sentence-level translation
- Stable grammar
- Occasional lexical or semantic drift (expected at this scale)
This model was trained end-to-end on a self-curated parallel corpus derived from cleaned literary and technical texts.
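The length-mismatch filter applied during parallel data creation can be approximated with a simple character-length ratio check. This is a sketch under assumed thresholds; the project's actual filtering criteria and cutoff values are not published.

```python
def keep_pair(en: str, ru: str, max_ratio: float = 2.0, min_chars: int = 3) -> bool:
    """Reject malformed or length-mismatched sentence pairs.
    Thresholds are illustrative assumptions, not the project's real values."""
    if len(en.strip()) < min_chars or len(ru.strip()) < min_chars:
        return False  # drop empty or near-empty sides
    ratio = len(en) / len(ru)
    # Drop pairs whose lengths differ by more than max_ratio in either direction
    return 1 / max_ratio <= ratio <= max_ratio

print(keep_pair("Hello world", "Привет, мир"))  # True: comparable lengths
print(keep_pair("Hello world", ""))             # False: empty target
```

A production pipeline would combine this with language identification and punctuation checks, as the Training Data section describes.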
Model Details
- Architecture: Transformer (GPT-style causal LM adapted for translation)
- Parameters: ~300M
- Precision: FP32
- Tokenizer: Custom BPE (32k vocab)
- Framework: PyTorch / Hugging Face Transformers
- Training Style: Supervised bilingual sequence modeling
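The exact configuration behind the ~300M figure is not published, but it is consistent with a standard GPT-style shape. As a worked estimate under one assumed configuration (24 layers, d_model = 1024, 32k vocabulary, tied embeddings — hypothetical values):

```python
def gpt_param_count(n_layer: int, d_model: int, vocab: int) -> int:
    """Approximate parameter count of a GPT-style decoder:
    each block contributes ~12 * d_model^2 weights (attention
    projections plus a 4x-wide MLP), plus one tied input/output
    embedding matrix of vocab * d_model."""
    per_block = 12 * d_model ** 2
    embeddings = vocab * d_model
    return n_layer * per_block + embeddings

print(gpt_param_count(24, 1024, 32_000))  # ~335M, in the ~300M range
```

Biases, layer norms, and positional embeddings add a small correction on top of this estimate.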
Training Data
- ~40 public-domain English books
- Extensive text normalization and deduplication
- Sentence-level alignment
- Russian translations generated and filtered programmatically
- Multiple cleaning passes (length, language ID, punctuation, encoding)
Emphasis was placed on corpus hygiene and alignment fidelity.
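The normalization and deduplication passes can be sketched as follows. This is a minimal illustration; the project's actual Bash and Python scripts are not published.

```python
import unicodedata

def normalize(line: str) -> str:
    """Unify Unicode encodings and whitespace before deduplication."""
    line = unicodedata.normalize("NFC", line)  # fold encoding variants
    return " ".join(line.split())              # collapse runs of whitespace

def dedupe(lines):
    """Drop exact duplicates while preserving first-seen order."""
    seen = set()
    for line in map(normalize, lines):
        if line and line not in seen:
            seen.add(line)
            yield line

print(list(dedupe(["Hello  world", "Hello world", ""])))  # ['Hello world']
```

Normalizing before comparing is what makes the dedup pass catch near-identical lines that differ only in spacing or Unicode form.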
Intended Use
- English ↔ Russian translation research
- Studying effects of tokenizer choice on MT quality
- Low-resource MT experimentation
- Educational purposes
Limitations
- Not instruction-tuned
- May hallucinate under ambiguous input
- No safety fine-tuning
- Not suitable for production or legal/medical use
Alignment Perspective
This project explores how structured data curation and constrained supervision can reduce hallucination and improve faithfulness in small-to-mid-scale language models.
Key alignment-relevant aspects:
- Conditioning without natural-language instructions
- Strict source-target alignment
- Avoidance of RLHF or preference modeling
- Observation of failure modes under ambiguity
The model is intentionally not instruction-tuned to preserve interpretability of its learned representations.
Reproducibility
Inference example:
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("LoganResearch/cyborg-translator-en-ru")
model = AutoModelForCausalLM.from_pretrained("LoganResearch/cyborg-translator-en-ru")

# Encode the source sentence and generate a continuation
inputs = tok("Hello world", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=100)
print(tok.decode(out[0], skip_special_tokens=True))