dhintech

Initial upload of fine-tuned MarianMT ID-EN model

15e7978 verified 10 months ago

metadata

language:
  - id
  - en
license: apache-2.0
base_model: Helsinki-NLP/opus-mt-id-en
tags:
  - translation
  - indonesian
  - english
  - marian
  - fine-tuned
pipeline_tag: translation
datasets:
  - ted_talks_iwslt
library_name: transformers

MarianMT Indonesian-English Translation (Fine-Tuned)

This model is a fine-tuned version of Helsinki-NLP/opus-mt-id-en for Indonesian-to-English translation, specialized for the formal, presentation-style language of TED Talks.

🎯 Model Highlights

Specialized Context: Fine-tuned on the TED Talks parallel corpus for better performance on formal and presentation-style language.
Optimized Training: Uses layer freezing and a cosine annealing scheduler with warmup for stable fine-tuning.
Easy Integration: Loads directly with the MarianMT classes in the transformers library.

🚀 Model Details

Base Model: Helsinki-NLP/opus-mt-id-en
Fine-tuned Dataset: Cleaned and aligned TED Talks parallel corpus (Indonesian-English).
Training Date: 2025-06-12
Languages: Indonesian (id) → English (en)

⚙️ Training Configuration

Hyperparameters

Learning Rate: 5e-6
Weight Decay: 0.001
Gradient Clipping: 0.5
Max Sequence Length: 96-128 tokens
Scheduler: Cosine Annealing with Warmup
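
As an illustration of the scheduler listed above, the following is a minimal sketch of a cosine annealing schedule with linear warmup, written in plain Python. Only the peak learning rate (5e-6) comes from this card; the function name `lr_at_step` and the `total_steps` and `warmup_steps` values are assumptions chosen for the example, not the settings used in training. In practice a library scheduler such as `transformers.get_cosine_schedule_with_warmup` would normally be used instead.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr=5e-6):
    """Learning rate at a given optimizer step (hypothetical helper).

    Linear warmup from 0 to peak_lr, then cosine decay from peak_lr to 0.
    """
    if step < warmup_steps:
        # Warmup phase: scale linearly with the step count
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: progress runs from 0.0 (end of warmup) to 1.0 (last step)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Assumed values for illustration only: 1000 total steps, 100 warmup steps
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000, warmup_steps=100))
```

The rate peaks exactly when warmup ends and falls to half of the peak midway through the decay phase.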

Architecture Optimizations

Layer Freezing: Early encoder layers were frozen to preserve foundational language knowledge from the base model.
Memory Optimization: Utilized gradient accumulation to simulate a larger batch size.
Early Stopping: Implemented with a patience of 5 epochs to prevent overfitting.
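
The early-stopping rule above can be sketched as a small helper class. This is a minimal illustration, not the training code used for this model: the class name `EarlyStopping`, the `min_delta` parameter, and the choice of validation loss as the monitored metric are all assumptions. Only the patience of 5 epochs comes from this card.

```python
class EarlyStopping:
    """Stop training after `patience` epochs without improvement (hypothetical helper)."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best = float("inf")      # best (lowest) validation loss seen so far
        self.bad_epochs = 0           # consecutive epochs without improvement

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Illustrative loss values only, not measured results
stopper = EarlyStopping(patience=5)
for epoch, loss in enumerate([1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99]):
    if stopper.step(loss):
        print(f"Stopping after epoch {epoch}; best validation loss {stopper.best}")
        break
```

A training loop would call `step` once per epoch after evaluation and typically restore the checkpoint that produced the best loss.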

🛠️ Usage Example

import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "dhintech/marian-tedtalks_clean-id-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def translate(text):
    # Tokenize the input; anything beyond 128 tokens is truncated
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
    # Generate with beam search; no gradients are needed for inference
    with torch.no_grad():
        outputs = model.generate(**inputs, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
indonesian_text = "Selamat pagi, mari kita mulai rapat hari ini."
english_translation = translate(indonesian_text)
print(f"ID: {indonesian_text}")
print(f"EN: {english_translation}")

🎯 Intended Use Cases

Presentation Translation: Translating presentation scripts and materials.
Formal Content: Translating articles, reports, and other formal documents.
Educational Content: Assisting with the translation of academic and educational materials.

⚡ Performance Metrics

Performance metrics such as BLEU score, inference time, and human evaluation will be added here after the model has been fully trained and evaluated.

🚨 Limitations and Considerations

Domain Specificity: The model is fine-tuned on TED Talks, so it performs best on formal, presentation-style language. It may perform worse on casual slang or regional dialects.
Long Sequences: Training used a maximum sequence length of 96-128 tokens, so quality may degrade on longer sentences. The usage example above also truncates inputs at 128 tokens, so any text beyond that limit is not translated.
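
One possible workaround for the length limit is to split long text into sentences and translate each one separately. The sketch below is an assumption-laden illustration, not part of this model's tooling: `split_sentences` and `translate_long` are hypothetical helpers, and the punctuation-based splitting rule is naive (it will split incorrectly after abbreviations such as "Dr."). The `translate_fn` argument is expected to be a single-sentence translator such as the `translate` function from the usage example.

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def translate_long(text, translate_fn):
    """Translate sentence by sentence so each input stays short (hypothetical helper)."""
    return " ".join(translate_fn(sentence) for sentence in split_sentences(text))

# Demonstration with a stand-in translator that only tags each sentence;
# in real use, pass the `translate` function from the usage example instead.
print(translate_long("Selamat pagi. Mari kita mulai!", lambda s: f"<{s}>"))
```

Translating sentences independently discards cross-sentence context, so results may differ from translating the passage as a whole.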

🤝 Contributing

Feedback and contributions are welcome! Please use the Community tab or open an issue on the repository if you encounter any problems or have suggestions for improvement.