MoE Multilingual Translator - Stage 2 Fine-tuned

A Mixture-of-Experts (MoE) transformer fine-tuned for translating French, Hindi, and Bengali to English.

🎯 Quick Info

Supports: French → English | Hindi → English | Bengali → English

Base Model: arka7/moe-multilingual-translator

📊 Performance

Metric            Value
Validation Loss   3.8833
Token Accuracy    35.95%
Perplexity        48.58
Training Loss     3.9530
Epochs            3
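
The reported perplexity is simply the exponential of the validation loss, which you can check in one line:

import math
print(math.exp(3.8833))  # 48.58..., matching the perplexity above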

Training History

{
  "train_loss": [
    5.081450140173895,
    4.325329969776386,
    3.95300766737378
  ],
  "val_loss": [
    4.531953684556713,
    4.124982544608208,
    3.8832832201203304
  ],
  "perplexity": [
    92.93997192382812,
    61.86671829223633,
    48.583457946777344
  ],
  "accuracy": [
    29.0423772315063,
    33.302914504078025,
    35.949352649289914
  ],
  "epochs": [
    1,
    2,
    3
  ]
}
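
The same history ships in training_metrics.json (listed under Files below). A minimal sketch for printing it per epoch, assuming that file holds exactly the JSON above:

import json
from huggingface_hub import hf_hub_download

metrics_path = hf_hub_download(
    repo_id="arka7/moe-multilingual-translator-stage2",
    filename="training_metrics.json"
)
with open(metrics_path) as f:
    hist = json.load(f)

for i, epoch in enumerate(hist["epochs"]):
    print(f"epoch {epoch}: train={hist['train_loss'][i]:.3f} "
          f"val={hist['val_loss'][i]:.3f} ppl={hist['perplexity'][i]:.2f} "
          f"acc={hist['accuracy'][i]:.2f}%")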

๐Ÿ—๏ธ Architecture

  • Type: Encoder-Decoder Transformer with MoE routing
  • Vocabulary: 32,000 tokens (SentencePiece)
  • Model Dimension: 512
  • Attention Heads: 8
  • Layers: 6 encoder + 6 decoder
  • Experts: 4 (in encoder)
  • Max Sequence: 256 tokens
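
For a rough sense of scale from these numbers alone (the feed-forward width is not listed, so this is a floor, not a total parameter count):

vocab, d_model = 32_000, 512
print(vocab * d_model)        # 16,384,000 embedding parameters
print(4 * d_model * d_model)  # ~1.05M per attention block (Q, K, V, output projections)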

🚀 Usage

Installation

pip install torch sentencepiece huggingface_hub

Load Model

import torch
import sentencepiece as spm
from huggingface_hub import hf_hub_download
import json

# Download files
model_path = hf_hub_download(
    repo_id="arka7/moe-multilingual-translator-stage2", 
    filename="pytorch_model.pt"
)
tokenizer_path = hf_hub_download(
    repo_id="arka7/moe-multilingual-translator-stage2", 
    filename="tokenizer.model"
)
config_path = hf_hub_download(
    repo_id="arka7/moe-multilingual-translator-stage2", 
    filename="config.json"
)

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load(tokenizer_path)

# Load config
with open(config_path) as f:
    cfg = json.load(f)

# Load checkpoint
checkpoint = torch.load(model_path, map_location='cpu')

# You need to define the model architecture first
# See: https://huggingface.co/arka7/moe-multilingual-translator for architecture code

Translate Text

# After loading model (see architecture in base model)

def translate(text, src_lang='fr', max_len=256):
    # Prepend the source-language token (<fr>, <hi>, or <bn>)
    input_ids = sp.encode(f"<{src_lang}> {text}")
    src = torch.tensor([input_ids])

    # Greedy decoding. NOTE: this assumes the usual encoder-decoder
    # interface, model(src, tgt) -> per-position vocab logits, and a
    # tokenizer with BOS/EOS ids; adapt to the base model's actual API.
    tgt = [sp.bos_id()]
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(src, torch.tensor([tgt]))
            next_id = logits[0, -1].argmax().item()
            if next_id == sp.eos_id():
                break
            tgt.append(next_id)

    return sp.decode(tgt[1:])

# Examples
translate("Bonjour, comment allez-vous?", "fr")
# → "Hello, how are you?"

translate("नमस्ते, आप कैसे हैं?", "hi")
# → "Hello, how are you?"

translate("আপনি কেমন আছেন?", "bn")
# → "How are you?"

📚 Training

Stage 1: Pre-training

  • Self-supervised language modeling
  • Wikipedia data (4 languages)
  • Learned multilingual representations

Stage 2: Translation Fine-tuning ⭐

  • This model - fine-tuned on parallel translation data
  • ~150K translation pairs (50K per language)
  • Languages: French, Hindi, Bengali → English
  • Datasets: OPUS-100 parallel corpora (see the loading sketch below)
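
The exact preprocessing pipeline is not published with this card. The sketch below shows one plausible way to build language-token-prefixed pairs from OPUS-100 with the Hugging Face datasets library; the dataset id and the 50K slice are assumptions:

from datasets import load_dataset

# French-English pair; OPUS-100 configs use alphabetically ordered codes
ds = load_dataset("Helsinki-NLP/opus-100", "en-fr", split="train[:50000]")

pair = ds[0]["translation"]
src = f"<fr> {pair['fr']}"  # language token + source sentence
tgt = pair["en"]            # English target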

🎓 Model Architecture Code

import torch
import torch.nn as nn

class MoE(nn.Module):
    def __init__(self, d_model, num_experts=4):
        super().__init__()
        self.num_experts = num_experts
        # Router scores each sequence; experts are simple linear maps here
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Linear(d_model, d_model)
            for _ in range(num_experts)
        ])
        self.balance_loss = 0.0

    def forward(self, x):
        # Sequence-level soft routing: mean-pool tokens, then softmax weights
        seq_repr = x.mean(dim=1)                  # (batch, d_model)
        logits = self.router(seq_repr)            # (batch, num_experts)
        weights = torch.softmax(logits, dim=-1)
        # Run every expert and mix their outputs by the routing weights
        expert_outputs = torch.stack(
            [exp(x) for exp in self.experts], dim=-1
        )                                         # (batch, seq, d_model, experts)
        out = torch.einsum('bsde,be->bsd', expert_outputs, weights)
        # Load-balancing penalty: squared deviation from uniform expert usage
        usage = weights.mean(dim=0)
        self.balance_loss = ((usage - 1 / self.num_experts) ** 2).sum()
        return out

# See base model for full architecture
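
A quick shape check of the layer above, as standalone usage:

moe = MoE(d_model=512)
x = torch.randn(2, 10, 512)  # (batch, seq_len, d_model)
y = moe(x)
print(y.shape)               # torch.Size([2, 10, 512])
print(moe.balance_loss)      # scalar penalty to add to the training loss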

โš ๏ธ Limitations

  • Only translates TO English (not FROM English)
  • Best on general domain text
  • May struggle with:
    • Technical/specialized vocabulary
    • Very long sentences (>256 tokens)
    • Code-mixed text
    • Rare dialects

🔮 Improvements

To get better performance:

  • Train longer (more epochs)
  • Larger model (increase d_model, layers)
  • More data (additional parallel corpora)
  • Beam search decoding (see the sketch below)
  • Learning rate scheduling
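
As a starting point for the beam-search item above, here is a minimal sketch. It reuses the assumed model(src, tgt) interface from translate() and is illustrative, not this model's shipped decoder:

def beam_translate(text, src_lang='fr', beam_size=4, max_len=256):
    src = torch.tensor([sp.encode(f"<{src_lang}> {text}")])
    beams = [([sp.bos_id()], 0.0)]  # (token ids, cumulative log-prob)
    finished = []
    with torch.no_grad():
        for _ in range(max_len):
            candidates = []
            for ids, score in beams:
                logits = model(src, torch.tensor([ids]))
                logp = torch.log_softmax(logits[0, -1], dim=-1)
                top = torch.topk(logp, beam_size)
                for tok, lp in zip(top.indices.tolist(), top.values.tolist()):
                    candidates.append((ids + [tok], score + lp))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = []
            for ids, score in candidates[:beam_size]:
                (finished if ids[-1] == sp.eos_id() else beams).append((ids, score))
            if not beams:  # every surviving beam has emitted EOS
                break
    best = max(finished or beams, key=lambda c: c[1])
    return sp.decode([t for t in best[0] if t not in (sp.bos_id(), sp.eos_id())])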

📄 Files

  • pytorch_model.pt - Trained model weights
  • tokenizer.model - SentencePiece tokenizer
  • tokenizer.vocab - Vocabulary
  • config.json - Configuration
  • training_metrics.json - Training history

📖 Citation

@misc{moe_translator_stage2,
  author = {arka7},
  title = {MoE Multilingual Translator - Stage 2},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/arka7/moe-multilingual-translator-stage2}
}

📜 License

MIT License

Built with PyTorch • Trained for 3 epochs • Ready for translation!
