---
title: MedVietAI Processing
emoji: ⚕️
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
license: apache-2.0
short_description: Data processing with en-vi translation. Derived from 500k mi
---

## 🚀 Quick Access

[HF Space](https://huggingface.co/spaces/MedVietAI/processing)

[MedDialog-100k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-100k)

[MedDialog-10k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-10k)

[PubMedQA-Labelled](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-L)

[PubMedQA-Unlabelled](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-U)

[PubMedQA-Mapper](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-MAP)

## 🎯 Features

### 🏠 Dual Mode Operation
- **Local Mode**: MedAlpaca-13b model running locally for privacy and cost efficiency
- **Cloud Mode**: NVIDIA + Gemini API integration for scalable processing
- **Dynamic Switching**: Toggle between modes via environment variables
- **Medical Specialization**: MedAlpaca-13b specifically fine-tuned for medical tasks

### 🔄 Advanced Data Augmentation
- **Paraphrasing**: Multi-model rotation (NVIDIA + Gemini) with easy/hard difficulty levels
- **Backtranslation**: Vietnamese pivot language for semantic preservation
- **Style Standardization**: Clinical voice enforcement and professional medical tone
- **Response Validation**: Invalid response detection and retry logic (max 3 attempts)
- **Quality Guards**: Length/semantic validation for backtranslation outputs
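
The length/semantic guard for backtranslation could look roughly like the sketch below. It is an illustration only: the function names, the thresholds, and the use of word overlap as a stand-in for semantic similarity are all assumptions, not the project's actual implementation.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (a crude semantic proxy)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def accept_backtranslation(
    original: str,
    backtranslated: str,
    min_len_ratio: float = 0.5,   # hypothetical threshold
    max_len_ratio: float = 2.0,   # hypothetical threshold
    min_overlap: float = 0.3,     # hypothetical threshold
) -> bool:
    """Return True if the backtranslated text passes both guards."""
    if not original.strip() or not backtranslated.strip():
        return False
    # Length guard: reject outputs that shrank or grew too much.
    ratio = len(backtranslated) / len(original)
    if not (min_len_ratio <= ratio <= max_len_ratio):
        return False
    # Semantic guard (approximation): require some shared vocabulary.
    return token_overlap(original, backtranslated) >= min_overlap
```

A production guard would more likely compare sentence embeddings than raw word overlap; the overlap check is used here only to keep the sketch dependency-free.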

### 🇻🇳 Vietnamese Translation
- **Complete Translation**: All text fields translated when Vietnamese mode is enabled
- **Quality Validation**: Translation quality checks with fallback to original text
- **SFT Format**: `instruction`, `input`, `output` fields translated
- **RAG Format**: `question`, `answer`, `context` fields translated
- **Sanitization**: Repetition reduction and whitespace normalization
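
As a rough illustration of field-wise translation with fallback, the sketch below translates the listed SFT or RAG fields and keeps the original text whenever the translation fails or comes back empty. `translate_record` and the `translate` callable are hypothetical names, and the "non-empty" check is a placeholder for whatever quality validation the project really applies.

```python
SFT_FIELDS = ("instruction", "input", "output")
RAG_FIELDS = ("question", "answer", "context")


def translate_record(record: dict, fields: tuple, translate) -> dict:
    """Translate selected fields; fall back to the original on failure.

    `translate` is any callable str -> str (hypothetical; for example a
    wrapper around an en-vi translation model).
    """
    out = dict(record)
    for field in fields:
        original = record.get(field)
        if not original:
            continue  # nothing to translate for empty or missing fields
        try:
            translated = translate(original)
        except Exception:
            translated = None
        # Placeholder quality check: accept only non-empty translations.
        if translated and translated.strip():
            out[field] = translated.strip()
        else:
            out[field] = original
    return out
```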

### 📊 SFT Data Enrichment
- **Multiple Answer Variants**: 2-3 different answers per question for better reasoning
- **Multiple Question Variants**: 2-3 different questions per answer for diverse training
- **Cross Combinations**: All question × answer variant combinations (up to 9 per sample)
- **Vietnamese Variants**: Translated versions of enriched combinations
- **Reasoning Enhancement**: Multiple reasoning paths for improved model training
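
The cross-combination step can be pictured as a Cartesian product of question and answer variants. The sketch below is a minimal version under assumed names (`cross_combinations`, the `question`/`answer` keys); with three variants on each side it yields the nine pairs mentioned above.

```python
from itertools import product


def cross_combinations(questions, answers, max_variants=3):
    """Pair every question variant with every answer variant."""
    # Drop exact duplicates while preserving order, then cap each side.
    qs = list(dict.fromkeys(questions))[:max_variants]
    ans = list(dict.fromkeys(answers))[:max_variants]
    return [{"question": q, "answer": a} for q, a in product(qs, ans)]


pairs = cross_combinations(
    ["What causes migraines?", "Why do migraines occur?", "What triggers a migraine?"],
    ["Answer A", "Answer B", "Answer C"],
)
print(len(pairs))  # 9
```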

### 🔍 Quality Assurance
- **Invalid Response Detection**: Catches "Fail", "Invalid", "I can't", "Sorry", etc.
- **Retry Logic**: Up to 3 attempts with different paraphrasing difficulties
- **Drop Strategy**: Samples dropped if retry fails (no fallback answers)
- **Consistency Checking**: LLM-based validation of answer quality
- **De-identification**: PHI removal with configurable strictness
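
The detect, retry, and drop behaviour described above could be sketched as follows. The marker list comes from the bullet on invalid responses; the prefix-matching rule, the easy/hard alternation, and the names `is_invalid` and `paraphrase_with_retry` are assumptions made for illustration.

```python
INVALID_MARKERS = ("fail", "invalid", "i can't", "sorry")


def is_invalid(response: str) -> bool:
    """Flag empty responses or ones that open with a refusal marker."""
    text = response.strip().lower()
    return not text or any(text.startswith(m) for m in INVALID_MARKERS)


def paraphrase_with_retry(text, paraphrase, max_attempts=3):
    """Try up to `max_attempts` times, alternating difficulty levels.

    `paraphrase` is a hypothetical callable (text, difficulty) -> str.
    Returns None when every attempt is invalid, so the caller can drop
    the sample instead of writing a fallback answer.
    """
    difficulties = ("easy", "hard")
    for attempt in range(max_attempts):
        difficulty = difficulties[attempt % len(difficulties)]
        candidate = paraphrase(text, difficulty)
        if not is_invalid(candidate):
            return candidate
    return None
```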

### 🎯 RAG Optimization
- **Embedding-Friendly**: Concise, direct text optimized for dense retrieval
- **Context Generation**: Synthetic context creation when missing
- **Content Cleaning**: Conversational element removal for medical focus
- **Length Control**: Hard caps on question/answer/context lengths
- **Quality Filtering**: Invalid response cleaning for RAG corpora
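
A hard length cap can be as simple as the sketch below. The limits in `MAX_CHARS` are placeholder values chosen for the example (the actual caps are not stated here), and `cap_text` / `cap_record` are hypothetical helper names.

```python
# Hypothetical caps in characters; the project's real limits are not stated.
MAX_CHARS = {"question": 300, "answer": 1200, "context": 2000}


def cap_text(text: str, limit: int) -> str:
    """Collapse whitespace, then truncate to at most `limit` characters."""
    text = " ".join(text.split())
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Back up to the last space so the result does not end mid-word.
    if " " in cut:
        cut = cut.rsplit(" ", 1)[0]
    return cut


def cap_record(record: dict, limits: dict = MAX_CHARS) -> dict:
    """Apply the per-field caps to every string field that has a limit."""
    return {
        key: cap_text(value, limits[key])
        if key in limits and isinstance(value, str)
        else value
        for key, value in record.items()
    }
```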

## 📋 Supported Datasets

### Medical Dialogue
- **HealthCareMagic**: 100k medical conversations
- **iCliniq**: 10k derived medical Q&A

### Biomedical QA
- **PubMedQA-L**: Labeled biomedical questions
- **PubMedQA-U**: Unlabeled biomedical questions  
- **PubMedQA-MAP**: Mapped biomedical Q&A pairs

## ⚙️ Configuration

### Mode Selection
```bash
# Local Mode (MedAlpaca-13b)
IS_LOCAL=true
HF_TOKEN=your_huggingface_token

# Cloud Mode (NVIDIA/Gemini APIs)
IS_LOCAL=false
NVIDIA_API_1=your_nvidia_key
GEMINI_API_1=your_gemini_key
```
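
For illustration, a process could resolve its mode from these variables along the lines of the sketch below. `resolve_mode` is a hypothetical helper, and the rule that cloud mode needs at least one of the two API keys is an assumption rather than documented behaviour.

```python
import os


def resolve_mode(env=None) -> str:
    """Return "local" or "cloud" based on the IS_LOCAL variable."""
    env = os.environ if env is None else env
    is_local = env.get("IS_LOCAL", "false").strip().lower() == "true"
    if is_local:
        # Local mode lists HF_TOKEN alongside IS_LOCAL=true.
        if not env.get("HF_TOKEN"):
            raise RuntimeError("IS_LOCAL=true but HF_TOKEN is not set")
        return "local"
    # Assumption: cloud mode needs at least one provider key.
    if not (env.get("NVIDIA_API_1") or env.get("GEMINI_API_1")):
        raise RuntimeError("Cloud mode needs NVIDIA_API_1 or GEMINI_API_1")
    return "cloud"
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests.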

### Augmentation Parameters
```python
class AugmentOptions:
    paraphrase_ratio: float = 0.2          # 0.0-1.0
    paraphrase_outputs: bool = True         # Augment model answers
    backtranslate_ratio: float = 0.1        # 0.0-1.0 (Vietnamese pivot)
    style_standardize: bool = True          # Enforce clinical style
    deidentify: bool = True                 # Remove PHI
    dedupe: bool = True                     # Remove duplicates
    max_chars: int = 5000                   # Text length limit
    consistency_check_ratio: float = 0.05   # 0.0-1.0
    expand: bool = True                     # Enable enrichment
    max_aug_per_sample: int = 2             # 1-3 variants
```
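
One plausible reading of the ratio parameters is that each is the probability that a given sample receives that augmentation. The sketch below illustrates that interpretation only; it is an assumption about the semantics, and `should_apply` is a hypothetical helper.

```python
import random


def should_apply(ratio: float, rng: random.Random) -> bool:
    """Return True with probability `ratio` (expected range 0.0-1.0)."""
    return rng.random() < ratio


rng = random.Random(0)  # seeded so runs are reproducible
paraphrased = sum(should_apply(0.2, rng) for _ in range(10_000))
# With paraphrase_ratio = 0.2, roughly one sample in five is selected.
```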

### Processing Modes
- **SFT Processing**: Supervised Fine-Tuning format with enrichment
- **RAG Processing**: Question-Context-Answer format for retrieval
- **Vietnamese Mode**: Complete translation of all text fields

## 📈 Output Statistics

The system tracks the following statistics for each run:
- `written`: Successfully processed samples
- `paraphrased_input/output`: Paraphrasing counts
- `backtranslated_input/output`: Backtranslation counts
- `dropped_invalid`: Samples dropped due to failed retries
- `vietnamese_variants`: Vietnamese variants created
- `dedup_skipped`: Duplicate samples removed
- `consistency_failed`: Samples flagged for quality issues
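
Counters like these are straightforward to keep with `collections.Counter`. The sketch below shows the idea for `written` and `dedup_skipped` only; the `write_unique` function and its whitespace/case-insensitive duplicate key are assumptions for the example, not the project's actual deduplication rule.

```python
from collections import Counter


def write_unique(samples):
    """Keep the first occurrence of each question and count the outcomes."""
    stats = Counter()
    seen = set()
    kept = []
    for sample in samples:
        # Assumed duplicate key: lowercased text with collapsed whitespace.
        key = " ".join(sample["question"].lower().split())
        if key in seen:
            stats["dedup_skipped"] += 1
            continue
        seen.add(key)
        kept.append(sample)
        stats["written"] += 1
    return kept, stats


kept, stats = write_unique([
    {"question": "What is asthma?"},
    {"question": "what is  asthma?"},
    {"question": "What is eczema?"},
])
print(dict(stats))  # {'written': 2, 'dedup_skipped': 1}
```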

## 🔧 Usage

### Web Interface
1. Visit the [HF Space](https://huggingface.co/spaces/MedVietAI/processing)
2. Select dataset and processing mode (SFT/RAG)
3. Enable Vietnamese translation if needed
4. Click the process button

### API Usage
```bash
# SFT Processing with Vietnamese translation
curl -X POST "https://huggingface.co/spaces/MedVietAI/processing/process/healthcaremagic" \
  -H "Content-Type: application/json" \
  -d '{
    "augment": {
      "paraphrase_ratio": 0.2,
      "backtranslate_ratio": 0.1,
      "paraphrase_outputs": true,
      "style_standardize": true,
      "deidentify": true,
      "dedupe": true,
      "expand": true
    },
    "vietnamese_translation": true
  }'

# RAG Processing
curl -X POST "https://huggingface.co/spaces/MedVietAI/processing/rag/healthcaremagic" \
  -H "Content-Type: application/json" \
  -d '{
    "vietnamese_translation": true
  }'
```
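
The same SFT request can be assembled from Python with only the standard library. This is a sketch: the base URL and the `/process/<dataset>` path are copied from the curl example above, while `build_process_request` is a hypothetical helper. It only builds the request object; nothing is sent until it is passed to `urllib.request.urlopen`.

```python
import json
from urllib.request import Request

# Base URL copied from the curl example above.
BASE_URL = "https://huggingface.co/spaces/MedVietAI/processing"


def build_process_request(dataset, augment, vietnamese=False, base_url=BASE_URL):
    """Build (but do not send) a POST request for SFT processing."""
    payload = {"augment": augment, "vietnamese_translation": vietnamese}
    return Request(
        url=f"{base_url}/process/{dataset}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_process_request(
    "healthcaremagic",
    {"paraphrase_ratio": 0.2, "backtranslate_ratio": 0.1, "dedupe": True},
    vietnamese=True,
)
# To send it: urllib.request.urlopen(req)
```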

## 📚 Documentation

- [Request Documentation](docs/REQUEST.md)  
- [Data Processing Guide](docs/DATA_PROCESSING.md)  
- [Local Mode Guide](docs/LOCAL_MODE.md)  

## 📄 License

[Apache-2.0 LICENSE](docs/LICENSE.txt)