MoE Multilingual Translator - Stage 2 Fine-tuned

A Mixture-of-Experts (MoE) transformer fine-tuned for translating French, Hindi, and Bengali to English.

🎯 Quick Info

Supports: French → English | Hindi → English | Bengali → English

Base Model: arka7/moe-multilingual-translator

📊 Performance

Metric            Value
Validation Loss   3.8833
Token Accuracy    35.95%
Perplexity        48.58
Training Loss     3.9530
Epochs            3
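
The reported perplexity is simply the exponential of the validation loss, which you can check in one line:

import math
print(math.exp(3.8833))  # 48.58..., matching the perplexity above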

Training History

{
  "train_loss": [
    5.081450140173895,
    4.325329969776386,
    3.95300766737378
  ],
  "val_loss": [
    4.531953684556713,
    4.124982544608208,
    3.8832832201203304
  ],
  "perplexity": [
    92.93997192382812,
    61.86671829223633,
    48.583457946777344
  ],
  "accuracy": [
    29.0423772315063,
    33.302914504078025,
    35.949352649289914
  ],
  "epochs": [
    1,
    2,
    3
  ]
}
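
The same history ships in training_metrics.json (listed under Files below). A minimal sketch for printing it per epoch, assuming that file holds exactly the JSON above:

import json
from huggingface_hub import hf_hub_download

metrics_path = hf_hub_download(
    repo_id="arka7/moe-multilingual-translator-stage2",
    filename="training_metrics.json"
)
with open(metrics_path) as f:
    hist = json.load(f)

for i, epoch in enumerate(hist["epochs"]):
    print(f"epoch {epoch}: train={hist['train_loss'][i]:.3f} "
          f"val={hist['val_loss'][i]:.3f} ppl={hist['perplexity'][i]:.2f} "
          f"acc={hist['accuracy'][i]:.2f}%")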

๐Ÿ—๏ธ Architecture

  • Type: Encoder-Decoder Transformer with MoE routing
  • Vocabulary: 32,000 tokens (SentencePiece)
  • Model Dimension: 512
  • Attention Heads: 8
  • Layers: 6 encoder + 6 decoder
  • Experts: 4 (in encoder)
  • Max Sequence: 256 tokens
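
For a rough sense of scale from these numbers alone (the feed-forward width is not listed, so this is a floor, not a total parameter count):

vocab, d_model = 32_000, 512
print(vocab * d_model)        # 16,384,000 embedding parameters
print(4 * d_model * d_model)  # ~1.05M per attention block (Q, K, V, output projections)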

🚀 Usage

Installation

pip install torch sentencepiece huggingface_hub

Load Model

import torch
import sentencepiece as spm
from huggingface_hub import hf_hub_download
import json

# Download files
model_path = hf_hub_download(
    repo_id="arka7/moe-multilingual-translator-stage2", 
    filename="pytorch_model.pt"
)
tokenizer_path = hf_hub_download(
    repo_id="arka7/moe-multilingual-translator-stage2", 
    filename="tokenizer.model"
)
config_path = hf_hub_download(
    repo_id="arka7/moe-multilingual-translator-stage2", 
    filename="config.json"
)

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load(tokenizer_path)

# Load config
with open(config_path) as f:
    cfg = json.load(f)

# Load checkpoint
checkpoint = torch.load(model_path, map_location='cpu')

# You need to define the model architecture first
# See: https://huggingface.co/arka7/moe-multilingual-translator for architecture code

Translate Text

# After loading model (see architecture in base model)

def translate(text, src_lang='fr', max_len=256):
    # Prepend the source-language token (<fr>, <hi>, or <bn>)
    input_ids = sp.encode(f"<{src_lang}> {text}")
    src = torch.tensor([input_ids])

    # Greedy decoding. NOTE: this assumes the usual encoder-decoder
    # interface, model(src, tgt) -> per-position vocab logits, and a
    # tokenizer with BOS/EOS ids; adapt to the base model's actual API.
    tgt = [sp.bos_id()]
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(src, torch.tensor([tgt]))
            next_id = logits[0, -1].argmax().item()
            if next_id == sp.eos_id():
                break
            tgt.append(next_id)

    return sp.decode(tgt[1:])

# Examples
translate("Bonjour, comment allez-vous?", "fr")
# → "Hello, how are you?"

translate("नमस्ते, आप कैसे हैं?", "hi")
# → "Hello, how are you?"

translate("আপনি কেমন আছেন?", "bn")
# → "How are you?"

📚 Training

Stage 1: Pre-training

  • Self-supervised language modeling
  • Wikipedia data (4 languages)
  • Learned multilingual representations

Stage 2: Translation Fine-tuning ⭐

  • This model - fine-tuned on parallel translation data
  • ~150K translation pairs (50K per language)
  • Languages: French, Hindi, Bengali → English
  • Datasets: OPUS-100 parallel corpora (see the loading sketch below)
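
The exact preprocessing pipeline is not published with this card. The sketch below shows one plausible way to build language-token-prefixed pairs from OPUS-100 with the Hugging Face datasets library; the dataset id and the 50K slice are assumptions:

from datasets import load_dataset

# French-English pair; OPUS-100 configs use alphabetically ordered codes
ds = load_dataset("Helsinki-NLP/opus-100", "en-fr", split="train[:50000]")

pair = ds[0]["translation"]
src = f"<fr> {pair['fr']}"  # language token + source sentence
tgt = pair["en"]            # English target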

🎓 Model Architecture Code

import torch
import torch.nn as nn

class MoE(nn.Module):
    def __init__(self, d_model, num_experts=4):
        super().__init__()
        self.num_experts = num_experts
        # Router scores each sequence; experts are simple linear maps here
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Linear(d_model, d_model)
            for _ in range(num_experts)
        ])
        self.balance_loss = 0.0

    def forward(self, x):
        # Sequence-level soft routing: mean-pool tokens, then softmax weights
        seq_repr = x.mean(dim=1)                  # (batch, d_model)
        logits = self.router(seq_repr)            # (batch, num_experts)
        weights = torch.softmax(logits, dim=-1)
        # Run every expert and mix their outputs by the routing weights
        expert_outputs = torch.stack(
            [exp(x) for exp in self.experts], dim=-1
        )                                         # (batch, seq, d_model, experts)
        out = torch.einsum('bsde,be->bsd', expert_outputs, weights)
        # Load-balancing penalty: squared deviation from uniform expert usage
        usage = weights.mean(dim=0)
        self.balance_loss = ((usage - 1 / self.num_experts) ** 2).sum()
        return out

# See base model for full architecture
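
A quick shape check of the layer above, as standalone usage:

moe = MoE(d_model=512)
x = torch.randn(2, 10, 512)  # (batch, seq_len, d_model)
y = moe(x)
print(y.shape)               # torch.Size([2, 10, 512])
print(moe.balance_loss)      # scalar penalty to add to the training loss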

โš ๏ธ Limitations

  • Only translates TO English (not FROM English)
  • Best on general domain text
  • May struggle with:
    • Technical/specialized vocabulary
    • Very long sentences (>256 tokens)
    • Code-mixed text
    • Rare dialects

🔮 Improvements

To get better performance:

  • Train longer (more epochs)
  • Larger model (increase d_model, layers)
  • More data (additional parallel corpora)
  • Beam search decoding (see the sketch below)
  • Learning rate scheduling
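
As a starting point for the beam-search item above, here is a minimal sketch. It reuses the assumed model(src, tgt) interface from translate() and is illustrative, not this model's shipped decoder:

def beam_translate(text, src_lang='fr', beam_size=4, max_len=256):
    src = torch.tensor([sp.encode(f"<{src_lang}> {text}")])
    beams = [([sp.bos_id()], 0.0)]  # (token ids, cumulative log-prob)
    finished = []
    with torch.no_grad():
        for _ in range(max_len):
            candidates = []
            for ids, score in beams:
                logits = model(src, torch.tensor([ids]))
                logp = torch.log_softmax(logits[0, -1], dim=-1)
                top = torch.topk(logp, beam_size)
                for tok, lp in zip(top.indices.tolist(), top.values.tolist()):
                    candidates.append((ids + [tok], score + lp))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = []
            for ids, score in candidates[:beam_size]:
                (finished if ids[-1] == sp.eos_id() else beams).append((ids, score))
            if not beams:  # every surviving beam has emitted EOS
                break
    best = max(finished or beams, key=lambda c: c[1])
    return sp.decode([t for t in best[0] if t not in (sp.bos_id(), sp.eos_id())])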

📄 Files

  • pytorch_model.pt - Trained model weights
  • tokenizer.model - SentencePiece tokenizer
  • tokenizer.vocab - Vocabulary
  • config.json - Configuration
  • training_metrics.json - Training history

📖 Citation

@misc{moe_translator_stage2,
  author = {arka7},
  title = {MoE Multilingual Translator - Stage 2},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/arka7/moe-multilingual-translator-stage2}
}

📜 License

MIT License

Built with PyTorch • Trained for 3 epochs • Ready for translation!
