Vietnamese Translation Module

This module adds English-to-Vietnamese translation to the MedAI Processing application, using the Helsinki-NLP/opus-mt-en-vi model.

Features

  • English to Vietnamese Translation: Translates English text to Vietnamese using the Helsinki-NLP/opus-mt-en-vi model
  • Batch Processing: Translates a list of texts in a single call
  • Dictionary Translation: Translates specific fields in data dictionaries
  • Integration: Works with both the SFT and RAG processing workflows
  • Error Handling: Falls back to the original text if translation fails
  • Logging: Comprehensive logging for debugging and monitoring
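
The fallback behaviour listed above can be illustrated with a minimal sketch. It is not the module's actual code: `translate_with_fallback` and the `translate_fn` parameter are hypothetical names, and the assumption is that any exception raised during translation should leave the input text unchanged and be logged.

```python
import logging

logger = logging.getLogger("vi.translator.sketch")  # hypothetical logger name


def translate_with_fallback(text, translate_fn):
    """Translate text with translate_fn; return the original text on failure.

    translate_fn is any callable taking and returning a string. It stands in
    for the real model call, which is not shown here.
    """
    # Nothing to translate for non-strings or blank strings.
    if not isinstance(text, str) or not text.strip():
        return text
    try:
        return translate_fn(text)
    except Exception:
        # Log the full traceback, then keep the untranslated text.
        logger.exception("Translation failed; keeping original text")
        return text
```

Passing the translation call in as a parameter keeps the sketch independent of the model, so the fallback path can be exercised without loading any weights.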

Configuration

Add the following environment variable to your .env file:

EN_VI=Helsinki-NLP/opus-mt-en-vi
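
As a rough sketch of how such a setting could be read in Python, the helper below looks up EN_VI and falls back to the same model ID when the variable is unset or blank. The name `resolve_model_name` and the fallback behaviour are assumptions for illustration. The sketch also assumes the .env file has already been loaded into the process environment by other means (for example by Docker or a dotenv loader).

```python
import os

# Assumed default: the model ID named in this README.
DEFAULT_EN_VI_MODEL = "Helsinki-NLP/opus-mt-en-vi"


def resolve_model_name(env=None):
    """Return the model ID from EN_VI, or the default when unset or blank."""
    # Allow a plain dict to be passed in so the lookup is easy to test.
    env = os.environ if env is None else env
    value = env.get("EN_VI", "").strip()
    return value or DEFAULT_EN_VI_MODEL
```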

Usage

Basic Translation

from vi.translator import VietnameseTranslator

# Initialize translator
translator = VietnameseTranslator()

# Load the model
translator.load_model()

# Translate single text
translated = translator.translate_text("Hello, how are you?")

# Translate batch of texts
texts = ["Text 1", "Text 2", "Text 3"]
translated_batch = translator.translate_batch(texts)

Dictionary Translation

# Translate specific fields in a dictionary
data = {
    "instruction": "Answer the question",
    "input": "What is diabetes?",
    "output": "Diabetes is a metabolic disorder..."
}

translated_data = translator.translate_dict(data, ["instruction", "input", "output"])
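
To make the field selection concrete, here is a minimal sketch of what a call like the one above might do internally. It is an assumption, not the module's implementation: `translate_fields` and `translate_fn` are hypothetical names, and whether the real translate_dict copies or mutates its input is not stated in this README.

```python
def translate_fields(data, fields, translate_fn):
    """Return a copy of data with the named string fields translated.

    Fields that are missing, empty, or not strings are left as they are.
    translate_fn is a stand-in for the real translation call.
    """
    result = dict(data)  # shallow copy so the caller's dict is not modified
    for field in fields:
        value = result.get(field)
        if isinstance(value, str) and value.strip():
            result[field] = translate_fn(value)
    return result
```

Returning a copy is a design choice made for this sketch, since it lets the original English row be kept alongside the translated one.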

Integration

The translation functionality is automatically integrated into the processing workflows:

  1. UI Toggle: Users can enable Vietnamese translation via the checkbox in the web interface
  2. SFT Processing: All text fields in SFT format are translated when enabled
  3. RAG Processing: All text fields in RAG format are translated when enabled
  4. Metadata: Translated rows are marked with vietnamese_translated: true in metadata
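
The metadata step could look roughly like the sketch below. The row layout (a dict with a "metadata" key holding another dict) and the helper name `mark_translated` are assumptions, since this README only states that translated rows carry vietnamese_translated: true.

```python
def mark_translated(row):
    """Return a copy of row whose metadata records that it was translated."""
    marked = dict(row)
    # Copy the existing metadata (if any) so the input row is left untouched.
    metadata = dict(marked.get("metadata") or {})
    metadata["vietnamese_translated"] = True
    marked["metadata"] = metadata
    return marked
```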

Model Information

  • Model: Helsinki-NLP/opus-mt-en-vi
  • Source Language: English
  • Target Language: Vietnamese
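  • BLEU Score: 37.2 (Tatoeba-test.eng.vie, as reported on the model card)
  • chrF Score: 0.542 (Tatoeba-test.eng.vie, as reported on the model card)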
  • License: Apache 2.0

Testing

Run the test script to verify translation functionality:

python test_translation.py

Files

  • translator.py: Main translation class
  • download.py: Model download script for Docker
  • processing_utils.py: Utility functions for processing integration
  • __init__.py: Module initialization
  • README.md: This documentation

Notes

  • The model is downloaded automatically during the Docker build
  • Translation runs on the CPU by default and can use a GPU if one is available
  • The model expects the target language token >>vie<< at the start of each input text
  • All translation operations include comprehensive error handling and logging
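
The target language token mentioned in the notes can be added with a small helper before the text is passed to the tokenizer. This is a sketch under assumptions: `with_target_token` is a hypothetical name, and the real translator may add the token differently.

```python
TARGET_TOKEN = ">>vie<<"


def with_target_token(text, token=TARGET_TOKEN):
    """Prefix text with the target language token unless it is already there."""
    stripped = text.lstrip()
    if stripped.startswith(token):
        return stripped
    return f"{token} {stripped}"
```

Checking for an existing token keeps the helper idempotent, so applying it twice to the same text does not produce a doubled prefix.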