Cyborg Translator (EN ↔ RU)

Overview

This translation model was built from the ground up, starting with raw text data and ending with a fully functional bilingual English ↔ Russian language model. No pretrained translation models were used at any stage. The project emphasizes data curation, tokenizer design, architectural discipline, and alignment through supervised fine-tuning.

Training Procedure

The model was trained end-to-end from random initialization using a causal language modeling objective.

  • Optimizer: AdamW
  • Loss: Cross-entropy (next-token prediction)
  • Training strategy:
    • Stage 1: Monolingual English + Russian language modeling
    • Stage 2: Interleaved bilingual language modeling
    • Stage 3: Bidirectional translation alignment fine-tuning
  • Gradient accumulation used to simulate larger batch sizes (sketched below)
  • Checkpoints saved periodically and manually evaluated
  • Training conducted on consumer-grade GPUs

The focus was on stability, coherence, and alignment rather than maximum scale.
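
The optimizer, loss, and gradient-accumulation choices above can be summarized in a minimal PyTorch sketch. This illustrates the stated recipe, not the project's actual training script; the model, dataloader, learning rate, and accumulation window are placeholder assumptions.

import torch

def train_epoch(model, dataloader, accum_steps=8, lr=3e-4):
    # AdamW + next-token cross-entropy, as listed above. The learning
    # rate and accumulation window are assumptions, not published values.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step, batch in enumerate(dataloader):
        # HF-style causal LMs shift labels internally and return the
        # cross-entropy loss for next-token prediction.
        loss = model(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
        (loss / accum_steps).backward()  # average gradients over the window
        if (step + 1) % accum_steps == 0:  # simulate a larger batch
            optimizer.step()
            optimizer.zero_grad()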

  1. Data Collection & Cleaning

The process began with a corpus of approximately 40 public-domain English books, which were:

  • Cleaned via custom Bash and Python scripts
  • Normalized for punctuation, casing, and encoding
  • Sentence-segmented and deduplicated
  • Filtered to remove metadata, headers, and non-linguistic content

This produced a clean English monolingual corpus suitable for language modeling and later bilingual alignment.
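
A condensed sketch of such a cleaning pass is below. The custom Bash/Python scripts are not published, so the normalization, segmentation, and filtering heuristics here are illustrative assumptions.

import re
import unicodedata

def clean_corpus(raw_text):
    # Normalize encoding and punctuation variants (e.g., curly quotes) via NFKC.
    text = unicodedata.normalize("NFKC", raw_text)
    # Naive sentence segmentation on terminal punctuation (illustrative only).
    sentences = re.split(r"(?<=[.!?])\s+", text)
    seen, cleaned = set(), []
    for s in sentences:
        s = s.strip()
        # Drop headers, metadata, and non-linguistic lines (heuristic).
        if not s or s.isupper() or not re.search(r"[A-Za-z]", s):
            continue
        if s not in seen:  # exact-match deduplication
            seen.add(s)
            cleaned.append(s)
    return cleaned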

  2. Parallel Data Creation (English ↔ Russian)

To create bilingual supervision:

  • Clean English sentences were translated into Russian using a controlled automated pipeline
  • Sentence alignment was preserved strictly (1 English sentence ↔ 1 Russian sentence)
  • Additional filtering removed mistranslations, malformed pairs, and length mismatches

The result was a high-quality parallel English–Russian dataset suitable for supervised translation training.

⚠️ No pretrained translation models were fine-tuned; the translations were used only as supervision data.
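
The length-mismatch filter can be sketched as a simple ratio check; the actual thresholds and mistranslation heuristics are not published, so the values below are assumptions.

def keep_pair(en, ru, max_ratio=2.0):
    # Reject empty or malformed pairs outright.
    if not en.strip() or not ru.strip():
        return False
    # Length-mismatch filter: word counts should be roughly comparable.
    n_en, n_ru = len(en.split()), len(ru.split())
    return max(n_en, n_ru) / max(1, min(n_en, n_ru)) <= max_ratio

# Usage: filtered = [(en, ru) for en, ru in pairs if keep_pair(en, ru)]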

  3. Tokenizer Built From Scratch

A SentencePiece BPE tokenizer (32k vocab) was trained from scratch on the combined English + Russian corpus. Key properties:

  • Shared multilingual vocabulary
  • Unicode-safe
  • No language-specific hacks
  • Enables cross-lingual parameter sharing

This tokenizer was frozen and reused across all training stages.
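
Training such a tokenizer with the SentencePiece library looks roughly like the sketch below; only the vocabulary size (32k) and BPE model type are stated above, so the file names and remaining options are assumptions.

import sentencepiece as spm

# Train a shared 32k BPE vocabulary on the combined EN+RU corpus.
spm.SentencePieceTrainer.train(
    input="combined_en_ru.txt",    # one sentence per line, both languages
    model_prefix="cyborg_bpe32k",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=1.0,        # keep full Latin + Cyrillic coverage
)

sp = spm.SentencePieceProcessor(model_file="cyborg_bpe32k.model")
print(sp.encode("Привет, мир!", out_type=str))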

  4. Base Language Model Initialization

The model architecture is a GPT-style causal transformer (~300M parameters) initialized randomly (no pretrained weights). Training began with:

  • A standard causal language modeling objective
  • Mixed English and Russian text
  • Strict context length limits aligned with the positional embeddings

This stage taught the model:

  • Syntax
  • Word order
  • Grammar
  • Cross-lingual token co-occurrence patterns
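
A randomly initialized GPT-style model of roughly this size can be expressed with a Hugging Face config. The released model's exact depth and width are not stated, so the numbers below are a plausible ~300M-parameter configuration, not the published one.

from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32000,   # matches the custom 32k BPE tokenizer
    n_positions=1024,   # strict context limit tied to positional embeddings
    n_embd=1024,        # width/depth here are assumptions (~300M total)
    n_layer=22,
    n_head=16,
)
model = GPT2LMHeadModel(config)  # random initialization, no pretrained weights
print(f"{model.num_parameters() / 1e6:.0f}M parameters")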

  5. Bilingual Language Modeling

After initial convergence, the model was trained on interleaved English and Russian data, allowing it to:

  • Develop internal multilingual representations
  • Share semantic structure across languages
  • Build fluency independently in both languages

At this stage, the model was not yet a translator, but a bilingual language model.

  6. Translation Alignment Fine-Tuning

To convert the bilingual LM into a translator, a structured alignment phase was introduced. Training examples were formatted as a source sentence followed by its target sentence:

    English source sentence → Russian target sentence

and, for the reverse direction:

    Russian source sentence → English target sentence
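
Concretely, each pair might be serialized as in the sketch below. The actual separator and any boundary tokens are not published, so the newline separator and example sentences here are placeholder assumptions.

def format_pair(src, tgt, sep="\n"):
    # The source conditions the generation; the target follows with no
    # natural-language instruction or chat wrapper.
    return f"{src}{sep}{tgt}"

en, ru = "The cat sleeps.", "Кошка спит."
examples = [format_pair(en, ru), format_pair(ru, en)]  # both directions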

Key alignment principles:

  • No natural-language instructions
  • No chat formatting
  • Strict conditional generation
  • Low learning rate to preserve fluency

This taught the model to:

  • Condition generation on the source language
  • Maintain semantic faithfulness
  • Reduce hallucination while preserving coherence

  7. Evaluation & Iterative Refinement

The model was evaluated manually and iteratively on:

  • Literal translation accuracy
  • Sentence-level coherence
  • Hallucination frequency
  • Directional stability (EN→RU vs RU→EN)

At ~300M parameters, the model achieves:

  • Coherent sentence-level translation
  • Stable grammar
  • Occasional lexical or semantic drift (expected at this scale)

This model was trained end-to-end on a self-curated parallel corpus derived from cleaned literary and technical texts.

Model Details

  • Architecture: Transformer (GPT-style causal LM adapted for translation)
  • Parameters: ~300M
  • Precision: FP32
  • Tokenizer: Custom BPE (32k vocab)
  • Framework: PyTorch / Hugging Face Transformers
  • Training Style: Supervised bilingual sequence modeling

Training Data

  • ~40 public-domain English books
  • Extensive text normalization and deduplication
  • Sentence-level alignment
  • Russian translations generated and filtered programmatically
  • Multiple cleaning passes (length, language ID, punctuation, encoding)

Emphasis was placed on corpus hygiene and alignment fidelity.

Intended Use

  • English ↔ Russian translation research
  • Studying effects of tokenizer choice on MT quality
  • Low-resource MT experimentation
  • Educational purposes

Limitations

  • Not instruction-tuned
  • May hallucinate under ambiguous input
  • No safety fine-tuning
  • Not suitable for production or legal/medical use

Alignment Perspective

This project explores how structured data curation and constrained supervision can reduce hallucination and improve faithfulness in small-to-mid-scale language models.

Key alignment-relevant aspects:

  • Conditioning without natural-language instructions
  • Strict source-target alignment
  • Avoidance of RLHF or preference modeling
  • Observation of failure modes under ambiguity

The model is intentionally not instruction-tuned to preserve interpretability of its learned representations.

Reproducibility

Inference example:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the custom 32k BPE tokenizer and the trained weights.
tok = AutoTokenizer.from_pretrained("LoganResearch/cyborg-translator-en-ru")
model = AutoModelForCausalLM.from_pretrained("LoganResearch/cyborg-translator-en-ru")

# The model expects a bare source sentence (no instructions or chat format).
inputs = tok("Hello world", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=100)
print(tok.decode(out[0], skip_special_tokens=True))