---
title: MedVietAI Processing
emoji: ⚕️
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
license: apache-2.0
short_description: Data processing with en-vi translation. Derived from 500k mi
---

## 🚀 Quick Access

[HF Space](https://huggingface.co/spaces/MedVietAI/processing)

[MedDialog-100k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-100k)

[MedDialog-10k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-10k)

[PubMedQA-Labelled](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-L)

[PubMedQA-Unlabelled](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-U)

[PubMedQA-Mapper](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-MAP)

## 🎯 Features

### 🏠 Dual Mode Operation
- **Local Mode**: MedAlpaca-13b model running locally for privacy and cost efficiency
- **Cloud Mode**: NVIDIA + Gemini API integration for scalable processing
- **Dynamic Switching**: Toggle between modes via environment variables
- **Medical Specialization**: MedAlpaca-13b specifically fine-tuned for medical tasks

### 🔄 Advanced Data Augmentation
- **Paraphrasing**: Multi-model rotation (NVIDIA + Gemini) with easy/hard difficulty levels
- **Backtranslation**: Vietnamese pivot language for semantic preservation
- **Style Standardization**: Clinical voice enforcement and professional medical tone
- **Response Validation**: Invalid response detection and retry logic (max 3 attempts)
- **Quality Guards**: Length/semantic validation for backtranslation outputs
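
The quality guards for backtranslation can be pictured as a pair of cheap checks applied before a round-tripped text is accepted. The sketch below is illustrative only: the function name `passes_backtranslation_guard`, the length-ratio bounds, and the overlap threshold are assumptions, and word overlap is a crude stand-in for whatever semantic validation the pipeline actually performs.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased word tokens, used for a rough overlap score."""
    return set(re.findall(r"\w+", text.lower()))


def passes_backtranslation_guard(
    original: str,
    backtranslated: str,
    min_len_ratio: float = 0.5,  # assumed bound, not taken from the project
    max_len_ratio: float = 2.0,  # assumed bound, not taken from the project
    min_overlap: float = 0.3,    # assumed threshold, not taken from the project
) -> bool:
    """Return True if the backtranslated text looks like a faithful rewrite."""
    if not original.strip() or not backtranslated.strip():
        return False

    # Length guard: reject outputs that shrank or ballooned too much.
    ratio = len(backtranslated) / len(original)
    if not (min_len_ratio <= ratio <= max_len_ratio):
        return False

    # Semantic proxy: Jaccard overlap of the two word sets.
    a, b = _tokens(original), _tokens(backtranslated)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= min_overlap
```

A sample that fails the guard would keep its original text rather than the backtranslated one.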

### 🇻🇳 Vietnamese Translation
- **Complete Translation**: All text fields translated when Vietnamese mode is enabled
- **Quality Validation**: Translation quality checks with fallback to original text
- **SFT Format**: `instruction`, `input`, `output` fields translated
- **RAG Format**: `question`, `answer`, `context` fields translated
- **Sanitization**: Repetition reduction and whitespace normalization
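
As a rough illustration of sanitization and the fallback to the original text, the sketch below normalizes whitespace, collapses a word repeated three or more times in a row, and returns the source text when a translation fails or looks unusable. The names `sanitize` and `translate_with_fallback`, the `translate` callable, and the 0.3 length threshold are hypothetical and not taken from the project.

```python
import re
from typing import Callable


def sanitize(text: str) -> str:
    """Normalize whitespace and collapse long runs of a repeated word."""
    # Collapse any run of whitespace (including newlines) into one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Collapse a word repeated 3+ times in a row. Doubled words are kept,
    # since reduplication is legitimate in Vietnamese.
    return re.sub(r"\b(\w+)(?: \1\b){2,}", r"\1", text, flags=re.IGNORECASE)


def translate_with_fallback(text: str, translate: Callable[[str], str]) -> str:
    """Translate `text`, returning the original if the result looks unusable."""
    try:
        translated = sanitize(translate(text))
    except Exception:
        return text
    # Assumed quality check: non-empty and not drastically shorter than the source.
    if not translated or len(translated) < 0.3 * len(text):
        return text
    return translated
```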

### 📊 SFT Data Enrichment
- **Multiple Answer Variants**: 2-3 different answers per question for better reasoning
- **Multiple Question Variants**: 2-3 different questions per answer for diverse training
- **Cross Combinations**: All question × answer variant combinations (up to 9 per sample)
- **Vietnamese Variants**: Translated versions of enriched combinations
- **Reasoning Enhancement**: Multiple reasoning paths for improved model training
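
The "up to 9 per sample" figure follows from pairing at most 3 question variants with at most 3 answer variants. A minimal sketch of that cross product, using a hypothetical helper `cross_combinations` and plain `(question, answer)` tuples rather than the project's actual record type:

```python
from itertools import product


def cross_combinations(
    question_variants: list[str],
    answer_variants: list[str],
    max_variants: int = 3,
) -> list[tuple[str, str]]:
    """Pair every question variant with every answer variant."""
    questions = question_variants[:max_variants]
    answers = answer_variants[:max_variants]
    # With 3 questions and 3 answers this yields 3 * 3 = 9 pairs.
    return list(product(questions, answers))
```

With 2 question variants and 3 answer variants the same call yields 6 pairs.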

### 🔍 Quality Assurance
- **Invalid Response Detection**: Catches "Fail", "Invalid", "I can't", "Sorry", etc.
- **Retry Logic**: Up to 3 attempts with different paraphrasing difficulties
- **Drop Strategy**: Samples are dropped if all retries fail (no fallback answers)
- **Consistency Checking**: LLM-based validation of answer quality
- **De-identification**: PHI removal with configurable strictness
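
The detect, retry, and drop flow described above could look roughly like the following. This is a sketch under stated assumptions: the marker list only covers the examples named in this README, the `paraphrase` callable and its `(text, difficulty)` signature are hypothetical, and the order of difficulties is a guess. Markers are matched only at the start of a response so that clinical phrases such as "heart failure" are not flagged.

```python
import re
from typing import Callable, Optional

# Markers from the list above, anchored at the start of the response.
_INVALID_RE = re.compile(r"^(fail|invalid|i can't|sorry)\b", re.IGNORECASE)


def is_invalid_response(text: str) -> bool:
    """Flag empty responses and ones that open with a refusal or failure marker."""
    stripped = text.strip()
    return not stripped or _INVALID_RE.match(stripped) is not None


def paraphrase_with_retry(
    text: str,
    paraphrase: Callable[[str, str], str],
    difficulties: tuple[str, ...] = ("easy", "hard", "easy"),  # assumed order
) -> Optional[str]:
    """Try once per difficulty (3 attempts); None tells the caller to drop the sample."""
    for difficulty in difficulties:
        candidate = paraphrase(text, difficulty)
        if not is_invalid_response(candidate):
            return candidate
    return None  # no fallback answer is substituted
```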

### 🎯 RAG Optimization
- **Embedding-Friendly**: Concise, direct text optimized for dense retrieval
- **Context Generation**: Synthetic context creation when missing
- **Content Cleaning**: Conversational element removal for medical focus
- **Length Control**: Hard caps on question/answer/context lengths
- **Quality Filtering**: Invalid response cleaning for RAG corpora
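
Hard length caps can be applied per field before a record is written to the RAG corpus. The sketch below is one way to do this; the cap values in `RAG_CAPS` and both function names are hypothetical, since the project's actual limits are not stated here.

```python
def cap_length(text: str, max_chars: int) -> str:
    """Hard-cap `text` at `max_chars`, cutting at a word boundary when possible."""
    text = text.strip()
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # If the cut landed mid-word, back up to the previous space.
    if not text[max_chars].isspace() and " " in clipped:
        clipped = clipped.rsplit(" ", 1)[0]
    return clipped.rstrip()


# Hypothetical per-field caps, in characters.
RAG_CAPS = {"question": 256, "answer": 1024, "context": 2048}


def cap_rag_record(record: dict[str, str]) -> dict[str, str]:
    """Apply the caps to the known RAG fields and leave other fields untouched."""
    return {
        key: cap_length(value, RAG_CAPS[key]) if key in RAG_CAPS else value
        for key, value in record.items()
    }
```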

## 📋 Supported Datasets

### Medical Dialogue
- **HealthCareMagic**: 100k medical conversations
- **iCliniq**: 10k derived medical Q&A

### Biomedical QA
- **PubMedQA-L**: Labeled biomedical questions
- **PubMedQA-U**: Unlabeled biomedical questions
- **PubMedQA-MAP**: Mapped biomedical Q&A pairs

## ⚙️ Configuration

### Mode Selection
```bash
# Local Mode (MedAlpaca-13b)
IS_LOCAL=true
HF_TOKEN=your_huggingface_token

# Cloud Mode (NVIDIA/Gemini APIs)
IS_LOCAL=false
NVIDIA_API_1=your_nvidia_key
GEMINI_API_1=your_gemini_key
```
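
For reference, a Python process could resolve the mode from these variables roughly as follows. This is a sketch, not the project's loader: the function name `resolve_mode`, the default of `false` when `IS_LOCAL` is unset, the accepted truthy spellings, and the assumption that cloud mode needs both keys are all guesses.

```python
import os
from typing import Mapping, Optional


def resolve_mode(env: Optional[Mapping[str, str]] = None) -> tuple[str, list[str]]:
    """Return the selected mode and any credentials that appear to be missing."""
    env = os.environ if env is None else env
    # Assumed: unset means cloud mode; "1", "true", "yes" enable local mode.
    is_local = env.get("IS_LOCAL", "false").strip().lower() in {"1", "true", "yes"}
    required = ["HF_TOKEN"] if is_local else ["NVIDIA_API_1", "GEMINI_API_1"]
    missing = [name for name in required if not env.get(name)]
    return ("local" if is_local else "cloud"), missing


if __name__ == "__main__":
    mode, missing = resolve_mode()
    print(f"mode={mode}, missing={missing}")
```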

### Augmentation Parameters
```python
class AugmentOptions:
    paraphrase_ratio: float = 0.2          # 0.0-1.0
    paraphrase_outputs: bool = True         # Augment model answers
    backtranslate_ratio: float = 0.1        # 0.0-1.0 (Vietnamese pivot)
    style_standardize: bool = True          # Enforce clinical style
    deidentify: bool = True                 # Remove PHI
    dedupe: bool = True                     # Remove duplicates
    max_chars: int = 5000                   # Text length limit
    consistency_check_ratio: float = 0.05   # 0.0-1.0
    expand: bool = True                     # Enable enrichment
    max_aug_per_sample: int = 2             # 1-3 variants
```
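
One plausible reading of the ratio parameters is that each sample is selected for an augmentation step independently with the given probability, so `paraphrase_ratio = 0.2` touches roughly one sample in five. The sketch below illustrates that interpretation only; the helper `select_for_augmentation` is hypothetical, and the pipeline may select samples differently.

```python
import random


def select_for_augmentation(n_samples: int, ratio: float, seed: int = 0) -> list[int]:
    """Return the indices of samples chosen for an augmentation step."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    rng = random.Random(seed)  # seeded so repeated runs pick the same samples
    # Each sample is picked independently with probability `ratio`.
    return [i for i in range(n_samples) if rng.random() < ratio]
```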

### Processing Modes
- **SFT Processing**: Supervised Fine-Tuning format with enrichment
- **RAG Processing**: Question-Context-Answer format for retrieval
- **Vietnamese Mode**: Complete translation of all text fields

## 📈 Output Statistics

The system tracks the following counters for each run:
- `written`: Successfully processed samples
- `paraphrased_input`, `paraphrased_output`: Paraphrasing counts
- `backtranslated_input`, `backtranslated_output`: Backtranslation counts
- `dropped_invalid`: Samples dropped due to failed retries
- `vietnamese_variants`: Vietnamese variants created
- `dedup_skipped`: Duplicate samples removed
- `consistency_failed`: Samples flagged for quality issues
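
Counters like these are convenient to hold in a `collections.Counter` and to summarize after a run. The sketch below assumes the key names listed above; `new_stats` and `drop_rate` are hypothetical helpers, and defining the drop rate as dropped samples over written plus dropped samples is an assumption about how the counters relate.

```python
from collections import Counter

STAT_KEYS = (
    "written",
    "paraphrased_input",
    "paraphrased_output",
    "backtranslated_input",
    "backtranslated_output",
    "dropped_invalid",
    "vietnamese_variants",
    "dedup_skipped",
    "consistency_failed",
)


def new_stats() -> Counter:
    """Start every counter at zero so all keys appear in the final report."""
    return Counter({key: 0 for key in STAT_KEYS})


def drop_rate(stats: Counter) -> float:
    """Share of samples dropped after failed retries (assumed definition)."""
    total = stats["written"] + stats["dropped_invalid"]
    return stats["dropped_invalid"] / total if total else 0.0
```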

## 🔧 Usage

### Web Interface
1. Visit the [HF Space](https://huggingface.co/spaces/MedVietAI/processing)
2. Select dataset and processing mode (SFT/RAG)
3. Enable Vietnamese translation if needed
4. Click the process button to start the run

### API Usage
```bash
# SFT Processing with Vietnamese translation
curl -X POST "https://huggingface.co/spaces/MedVietAI/processing/process/healthcaremagic" \
  -H "Content-Type: application/json" \
  -d '{
    "augment": {
      "paraphrase_ratio": 0.2,
      "backtranslate_ratio": 0.1,
      "paraphrase_outputs": true,
      "style_standardize": true,
      "deidentify": true,
      "dedupe": true,
      "expand": true
    },
    "vietnamese_translation": true
  }'

# RAG Processing
curl -X POST "https://huggingface.co/spaces/MedVietAI/processing/rag/healthcaremagic" \
  -H "Content-Type: application/json" \
  -d '{
    "vietnamese_translation": true
  }'
```
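
The same SFT request can be prepared from Python with the standard library. This sketch mirrors the first curl call above and copies its base URL unchanged; the helper name `build_process_request` is hypothetical, and the response format is not documented here, so the example stops at building the request.

```python
import json
import urllib.request

# Base URL copied from the curl example above.
BASE_URL = "https://huggingface.co/spaces/MedVietAI/processing"


def build_process_request(
    dataset: str, augment: dict, vietnamese: bool
) -> urllib.request.Request:
    """Build (but do not send) a POST request for SFT processing."""
    payload = {"augment": augment, "vietnamese_translation": vietnamese}
    return urllib.request.Request(
        url=f"{BASE_URL}/process/{dataset}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_process_request(
    "healthcaremagic",
    {"paraphrase_ratio": 0.2, "backtranslate_ratio": 0.1, "expand": True},
    vietnamese=True,
)
# To send it: urllib.request.urlopen(req, timeout=30)
```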

## 📚 Documentation

- [Request Documentation](docs/REQUEST.md)
- [Data Processing Guide](docs/DATA_PROCESSING.md)
- [Local Mode Guide](docs/LOCAL_MODE.md)

## 📄 License

[Apache-2.0 LICENSE](docs/LICENSE.txt)