metadata
language:
  - id
  - en
license: apache-2.0
base_model: Helsinki-NLP/opus-mt-id-en
tags:
  - translation
  - indonesian
  - english
  - marian
  - fine-tuned
pipeline_tag: translation
datasets:
  - ted_talks_iwslt
library_name: transformers

MarianMT Indonesian-English Translation (Fine-Tuned)

This model is a fine-tuned version of Helsinki-NLP/opus-mt-id-en specialized for translating Indonesian to English, particularly within contexts found in TED Talks.

🎯 Model Highlights

  • Specialized Context: Fine-tuned on the TED Talks parallel corpus for better performance on formal and presentation-style language.
  • Optimized Training: Fine-tuned with early encoder layers frozen and a cosine annealing scheduler with warmup, which keeps updates small and stable.
  • Easy Integration: Loads directly with the transformers library (MarianMTModel and MarianTokenizer), so it can be dropped into existing applications.

🚀 Model Details

  • Base Model: Helsinki-NLP/opus-mt-id-en
  • Fine-tuned Dataset: Cleaned and aligned TED Talks parallel corpus (Indonesian-English).
  • Training Date: 2025-06-12
  • Languages: Indonesian (id) → English (en)

⚙️ Training Configuration

Hyperparameters

  • Learning Rate: 5e-6
  • Weight Decay: 0.001
  • Gradient Clipping: 0.5
  • Max Sequence Length: 96-128 tokens
  • Scheduler: Cosine Annealing with Warmup
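The scheduler entry above can be illustrated with a small, dependency-free sketch of linear warmup followed by cosine decay. Only the base learning rate (5e-6) comes from this card; `WARMUP_STEPS` and `TOTAL_STEPS` are placeholder assumptions, since the actual step counts are not stated here.

```python
import math

BASE_LR = 5e-6        # learning rate listed in the hyperparameters above
WARMUP_STEPS = 500    # assumption: not stated in this model card
TOTAL_STEPS = 10_000  # assumption: not stated in this model card


def lr_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Return a factor in [0, 1]: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # clamp if training runs past total_steps
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step under the assumed schedule."""
    return BASE_LR * lr_multiplier(step, WARMUP_STEPS, TOTAL_STEPS)


if __name__ == "__main__":
    for s in (0, 250, 500, 5_250, 10_000):
        print(f"step {s:>6}: lr = {lr_at(s):.3e}")
```

In a transformers training loop, the same curve shape is available through `get_cosine_schedule_with_warmup`; whether that exact helper was used for this model is not confirmed by the card.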

Architecture Optimizations

  • Layer Freezing: Early encoder layers were frozen to preserve foundational language knowledge from the base model.
  • Memory Optimization: Utilized gradient accumulation to simulate a larger batch size.
  • Early Stopping: Implemented with a patience of 5 epochs to prevent overfitting.
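As a rough sketch of two of the items above, the following shows one way to freeze early encoder layers and to apply early stopping with a patience of 5. It is not the training script used for this model: the number of frozen layers (`num_frozen`), the `min_delta` threshold, and the choice of validation loss as the monitored metric are all assumptions. The attribute path `model.model.encoder.layers` assumes the Marian layout in transformers.

```python
def freeze_early_encoder_layers(model, num_frozen: int) -> int:
    """Disable gradients for the first `num_frozen` encoder layers.

    Assumes encoder blocks live at model.model.encoder.layers (the Marian
    layout in transformers). Returns the number of parameter tensors frozen.
    """
    frozen = 0
    for layer in model.model.encoder.layers[:num_frozen]:
        for param in layer.parameters():
            param.requires_grad = False
            frozen += 1
    return frozen


class EarlyStopping:
    """Signal a stop when validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience    # 5 matches the value listed above
        self.min_delta = min_delta  # assumption: minimum improvement that counts
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

With a loaded `MarianMTModel`, a call such as `freeze_early_encoder_layers(model, 2)` would freeze the first two encoder blocks; the value 2 is purely illustrative. Frozen parameters are skipped by the optimizer as long as it is built from parameters with `requires_grad=True`.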

🛠️ Usage Example

from transformers import MarianMTModel, MarianTokenizer

model_name = "dhintech/marian-tedtalks_clean-id-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Move the model to the GPU if one is available
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def translate(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
    with torch.no_grad():
        outputs = model.generate(**inputs, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
indonesian_text = "Selamat pagi, mari kita mulai rapat hari ini."
english_translation = translate(indonesian_text)
print(f"ID: {indonesian_text}")
print(f"EN: {english_translation}")

🎯 Intended Use Cases

  • Presentation Translation: Translating presentation scripts and materials.
  • Formal Content: Translating articles, reports, and other formal documents.
  • Educational Content: Assisting with the translation of academic and educational materials.

⚡ Performance Metrics

Performance metrics such as BLEU score, inference time, and human evaluation will be added here after the model has been fully trained and evaluated.

🚨 Limitations and Considerations

  • Domain Specificity: While trained on a broad corpus, performance is best on formal language similar to TED Talks. It may not perform as well on very casual slang or regional dialects.
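  • Long Sequences: Quality may degrade for inputs significantly longer than the maximum length used in training (128 tokens). The usage example also truncates at 128 tokens, so any text beyond that limit is dropped rather than translated.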
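One possible workaround for the long-sequence limitation is to split the input on sentence boundaries and translate it chunk by chunk. The helper below is a minimal sketch, not part of the model or of transformers: `split_for_translation` and its `count_tokens` argument are hypothetical names, and the regex-based sentence split is a simplification that will mis-split abbreviations.

```python
import re
from typing import Callable, List


def split_for_translation(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 128,
) -> List[str]:
    """Group sentences into chunks whose token count stays within max_tokens.

    A single sentence that already exceeds max_tokens is kept as its own
    chunk; it would still be truncated by the tokenizer downstream.
    """
    # Naive split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)  # close the current chunk
            current = sentence      # start a new one with this sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Whitespace word count stands in for a real tokenizer in this demo.
    demo = "Satu dua tiga. Empat lima enam. Tujuh delapan."
    print(split_for_translation(demo, lambda s: len(s.split()), max_tokens=6))
```

With the tokenizer from the usage example, `count_tokens` could be `lambda s: len(tokenizer(s)["input_ids"])`, and the chunks could then be passed one at a time to `translate` and joined with spaces. Translating chunks independently loses cross-sentence context, so this trades some coherence for coverage.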

🤝 Contributing

Feedback and contributions are welcome! Please use the Community tab or open an issue on the repository if you encounter any problems or have suggestions for improvement.