---
title: MedVietAI Processing
emoji: ⚕️
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
license: apache-2.0
short_description: Data processing with en-vi translation. Derived from 500k mi
---
## 🚀 Quick Access
- [HF Space](https://huggingface.co/spaces/MedVietAI/processing)
- [MedDialog-100k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-100k)
- [MedDialog-10k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-10k)
- [PubMedQA-Labelled](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-L)
- [PubMedQA-Unlabelled](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-U)
- [PubMedQA-Mapper](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-MAP)
## 🎯 Features
### 🏠 Dual Mode Operation
- **Local Mode**: MedAlpaca-13b model running locally for privacy and cost efficiency
- **Cloud Mode**: NVIDIA + Gemini API integration for scalable processing
- **Dynamic Switching**: Toggle between modes via environment variables
- **Medical Specialization**: MedAlpaca-13b specifically fine-tuned for medical tasks
### 🔄 Advanced Data Augmentation
- **Paraphrasing**: Multi-model rotation (NVIDIA + Gemini) with easy/hard difficulty levels
- **Backtranslation**: Vietnamese pivot language for semantic preservation
- **Style Standardization**: Clinical voice enforcement and professional medical tone
- **Response Validation**: Invalid response detection and retry logic (max 3 attempts)
- **Quality Guards**: Length/semantic validation for backtranslation outputs
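The quality guard for backtranslation could look roughly like the sketch below. It is an illustration only: the function name, the length-ratio bounds, and the word-overlap threshold are all assumptions, and word overlap is a crude stand-in for whatever semantic check the project actually uses.

```python
def passes_backtranslation_guard(
    original: str,
    backtranslated: str,
    min_len_ratio: float = 0.5,  # assumed lower bound on length ratio
    max_len_ratio: float = 2.0,  # assumed upper bound on length ratio
    min_overlap: float = 0.3,    # assumed minimum word overlap
) -> bool:
    """Return True if the backtranslated text looks like a usable variant."""
    if not original.strip() or not backtranslated.strip():
        return False

    # Length check: reject outputs that shrank or grew too much.
    ratio = len(backtranslated) / len(original)
    if not (min_len_ratio <= ratio <= max_len_ratio):
        return False

    # Crude semantic proxy: Jaccard overlap of lowercase word sets.
    a = set(original.lower().split())
    b = set(backtranslated.lower().split())
    overlap = len(a & b) / len(a | b)
    return overlap >= min_overlap
```

A variant that fails the guard would simply be discarded, leaving the original text in place.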
### 🇻🇳 Vietnamese Translation
- **Complete Translation**: All text fields translated when Vietnamese mode is enabled
- **Quality Validation**: Translation quality checks with fallback to original text
- **SFT Format**: `instruction`, `input`, `output` fields translated
- **RAG Format**: `question`, `answer`, `context` fields translated
- **Sanitization**: Repetition reduction and whitespace normalization
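As a rough sketch of the translate-with-fallback behaviour described above, the example below translates the three SFT fields and keeps the original text whenever translation fails or returns nothing. The `translate` callable, `sanitize`, and `translate_record` are hypothetical names, not the project's API, and the repetition rule (collapsing immediately repeated words) is an assumed interpretation of "repetition reduction".

```python
import re
from typing import Callable, Dict

SFT_FIELDS = ("instruction", "input", "output")


def sanitize(text: str) -> str:
    """Normalize whitespace, then collapse immediately repeated words."""
    text = re.sub(r"\s+", " ", text).strip()
    # "pain pain pain relief" -> "pain relief"
    return re.sub(r"\b(\w+)( \1\b)+", r"\1", text, flags=re.IGNORECASE)


def translate_record(
    record: Dict[str, str],
    translate: Callable[[str], str],  # hypothetical en->vi translator
) -> Dict[str, str]:
    """Translate SFT fields, falling back to the original text on failure."""
    out = dict(record)
    for field in SFT_FIELDS:
        source = record.get(field, "")
        if not source:
            continue
        try:
            candidate = sanitize(translate(source))
        except Exception:
            candidate = ""  # treat translator errors as a failed translation
        out[field] = candidate if candidate else source
    return out
```

The RAG format would work the same way with `question`, `answer`, and `context` as the field list.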
### 📊 SFT Data Enrichment
- **Multiple Answer Variants**: 2-3 different answers per question for better reasoning
- **Multiple Question Variants**: 2-3 different questions per answer for diverse training
- **Cross Combinations**: All question × answer variant combinations (up to 9 per sample)
- **Vietnamese Variants**: Translated versions of enriched combinations
- **Reasoning Enhancement**: Multiple reasoning paths for improved model training
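The cross-combination step can be illustrated with `itertools.product`: three question variants times three answer variants gives the nine samples mentioned above. This is a minimal sketch; `cross_combine` and its parameters are hypothetical names.

```python
from itertools import product
from typing import Dict, List


def cross_combine(
    instruction: str,
    question_variants: List[str],
    answer_variants: List[str],
    max_variants: int = 3,  # assumed cap, matching "2-3 variants" above
) -> List[Dict[str, str]]:
    """Pair every question variant with every answer variant."""
    questions = question_variants[:max_variants]
    answers = answer_variants[:max_variants]
    return [
        {"instruction": instruction, "input": q, "output": a}
        for q, a in product(questions, answers)
    ]
```

With two question variants and three answer variants the function yields six samples, so the count per original sample ranges from four to nine.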
### 🔍 Quality Assurance
- **Invalid Response Detection**: Catches "Fail", "Invalid", "I can't", "Sorry", etc.
- **Retry Logic**: Up to 3 attempts with different paraphrasing difficulties
- **Drop Strategy**: Samples dropped if retry fails (no fallback answers)
- **Consistency Checking**: LLM-based validation of answer quality
- **De-identification**: PHI removal with configurable strictness
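The detection, retry, and drop rules above can be sketched as follows. All names here are hypothetical; matching markers at the start of the response and alternating `easy`/`hard` across attempts are assumptions about details the list does not specify.

```python
from typing import Callable, Optional

# Markers taken from the list above; the "etc." is deliberately not filled in.
INVALID_MARKERS = ("fail", "invalid", "i can't", "sorry")
# Assumed rotation of difficulties across the three attempts.
DIFFICULTIES = ("easy", "hard", "easy")


def is_invalid(response: str) -> bool:
    """Flag empty responses and those that open with a refusal marker."""
    text = response.strip().lower()
    return not text or any(text.startswith(m) for m in INVALID_MARKERS)


def paraphrase_with_retry(
    text: str,
    paraphrase: Callable[[str, str], str],  # hypothetical (text, difficulty) -> str
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first valid paraphrase, or None so the caller drops the sample."""
    for attempt in range(max_attempts):
        difficulty = DIFFICULTIES[attempt % len(DIFFICULTIES)]
        candidate = paraphrase(text, difficulty)
        if not is_invalid(candidate):
            return candidate
    return None  # no fallback answer is substituted
```

Returning `None` rather than the original text mirrors the drop strategy: a sample whose retries all fail is excluded from the output.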
### 🎯 RAG Optimization
- **Embedding-Friendly**: Concise, direct text optimized for dense retrieval
- **Context Generation**: Synthetic context creation when missing
- **Content Cleaning**: Conversational element removal for medical focus
- **Length Control**: Hard caps on question/answer/context lengths
- **Quality Filtering**: Invalid response cleaning for RAG corpora
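A hard length cap might be applied along these lines. The cap values and function names are assumptions chosen for illustration; the sketch cuts at a word boundary where possible so that truncated text does not end mid-word.

```python
from typing import Dict

# Assumed character caps; the real limits are not stated above.
CAPS = {"question": 256, "answer": 1024, "context": 2048}


def cap_text(text: str, limit: int) -> str:
    """Normalize whitespace and truncate to at most `limit` characters."""
    text = " ".join(text.split())
    if len(text) <= limit:
        return text
    cut = text[:limit]
    if text[limit] == " ":
        return cut  # the cut already falls on a word boundary
    # Otherwise back up to the last space inside the limit, if there is one.
    head, sep, _ = cut.rpartition(" ")
    return head if sep else cut


def cap_record(record: Dict[str, str], caps: Dict[str, int] = CAPS) -> Dict[str, str]:
    """Apply the per-field caps; fields without a cap pass through unchanged."""
    return {k: cap_text(v, caps[k]) if k in caps else v for k, v in record.items()}
```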
## 📋 Supported Datasets
### Medical Dialogue
- **HealthCareMagic**: 100k medical conversations
- **iCliniq**: 10k derived medical Q&A
### Biomedical QA
- **PubMedQA-L**: Labeled biomedical questions
- **PubMedQA-U**: Unlabeled biomedical questions
- **PubMedQA-MAP**: Mapped biomedical Q&A pairs
## ⚙️ Configuration
### Mode Selection
```bash
# Local Mode (MedAlpaca-13b)
IS_LOCAL=true
HF_TOKEN=your_huggingface_token
# Cloud Mode (NVIDIA/Gemini APIs)
IS_LOCAL=false
NVIDIA_API_1=your_nvidia_key
GEMINI_API_1=your_gemini_key
```
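For illustration, a mode switch driven by these variables might be read as in the sketch below. The function name, the accepted truthy spellings, the default of cloud mode when `IS_LOCAL` is unset, and the rule that one cloud key is enough are all assumptions.

```python
import os
from typing import Mapping


def select_mode(env: Mapping[str, str] = os.environ) -> str:
    """Return "local" or "cloud" based on IS_LOCAL, checking required keys."""
    is_local = env.get("IS_LOCAL", "false").strip().lower() in {"1", "true", "yes"}
    if is_local:
        if not env.get("HF_TOKEN"):
            raise RuntimeError("IS_LOCAL=true requires HF_TOKEN")
        return "local"
    # Assumption: at least one cloud provider key must be present.
    if not (env.get("NVIDIA_API_1") or env.get("GEMINI_API_1")):
        raise RuntimeError("Cloud mode requires NVIDIA_API_1 or GEMINI_API_1")
    return "cloud"
```

Passing a plain dictionary instead of `os.environ` makes the selection easy to test without touching the real environment.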
### Augmentation Parameters
```python
from dataclasses import dataclass


@dataclass
class AugmentOptions:
    """Augmentation settings; each ratio is a fraction of samples in [0.0, 1.0]."""
    paraphrase_ratio: float = 0.2          # 0.0-1.0
    paraphrase_outputs: bool = True        # Augment model answers
    backtranslate_ratio: float = 0.1       # 0.0-1.0 (Vietnamese pivot)
    style_standardize: bool = True         # Enforce clinical style
    deidentify: bool = True                # Remove PHI
    dedupe: bool = True                    # Remove duplicates
    max_chars: int = 5000                  # Text length limit
    consistency_check_ratio: float = 0.05  # 0.0-1.0
    expand: bool = True                    # Enable enrichment
    max_aug_per_sample: int = 2            # 1-3 variants
```
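One plausible reading of the ratio parameters is that each sample is augmented with that probability. The sketch below shows this interpretation; it is an assumption, since the processor could equally select a fixed fraction of the dataset. `should_augment` and `plan_augmentations` are hypothetical names.

```python
import random
from typing import Dict, List


def should_augment(ratio: float, rng: random.Random) -> bool:
    """Return True with probability `ratio` (0.0 never, 1.0 always)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError(f"ratio must be in [0.0, 1.0], got {ratio}")
    return rng.random() < ratio


def plan_augmentations(
    n_samples: int,
    paraphrase_ratio: float = 0.2,
    backtranslate_ratio: float = 0.1,
    seed: int = 0,  # fixed seed so a run can be reproduced
) -> List[Dict[str, bool]]:
    """Decide independently, per sample, which augmentations to apply."""
    rng = random.Random(seed)
    return [
        {
            "paraphrase": should_augment(paraphrase_ratio, rng),
            "backtranslate": should_augment(backtranslate_ratio, rng),
        }
        for _ in range(n_samples)
    ]
```

Under this reading, the defaults would paraphrase about one sample in five and backtranslate about one in ten.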
### Processing Modes
- **SFT Processing**: Supervised Fine-Tuning format with enrichment
- **RAG Processing**: Question-Context-Answer format for retrieval
- **Vietnamese Mode**: Complete translation of all text fields
## 📈 Output Statistics
The system tracks the following counters:
- `written`: Successfully processed samples
- `paraphrased_input`, `paraphrased_output`: Paraphrasing counts for inputs and outputs
- `backtranslated_input`, `backtranslated_output`: Backtranslation counts for inputs and outputs
- `dropped_invalid`: Samples dropped due to failed retries
- `vietnamese_variants`: Vietnamese variants created
- `dedup_skipped`: Duplicate samples removed
- `consistency_failed`: Samples flagged for quality issues
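Counters like these are straightforward to accumulate with `collections.Counter`. The sketch below is illustrative only: `tally` is a hypothetical helper, and the split of the input/output counters into separate keys is an assumed reading of the names listed above.

```python
from collections import Counter
from typing import Dict, Iterable

# Key names follow the list above; the input/output split is an assumption.
STAT_KEYS = (
    "written",
    "paraphrased_input",
    "paraphrased_output",
    "backtranslated_input",
    "backtranslated_output",
    "dropped_invalid",
    "vietnamese_variants",
    "dedup_skipped",
    "consistency_failed",
)


def tally(events: Iterable[str]) -> Dict[str, int]:
    """Count processing events and report every known key, including zeros."""
    counts = Counter(events)
    unknown = set(counts) - set(STAT_KEYS)
    if unknown:
        raise KeyError(f"unknown stat keys: {sorted(unknown)}")
    return {key: counts[key] for key in STAT_KEYS}
```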
## 🔧 Usage
### Web Interface
1. Visit the [HF Space](https://huggingface.co/spaces/MedVietAI/processing)
2. Select dataset and processing mode (SFT/RAG)
3. Enable Vietnamese translation if needed
4. Click process button
### API Usage
```bash
# SFT Processing with Vietnamese translation
curl -X POST "https://huggingface.co/spaces/MedVietAI/processing/process/healthcaremagic" \
  -H "Content-Type: application/json" \
  -d '{
    "augment": {
      "paraphrase_ratio": 0.2,
      "backtranslate_ratio": 0.1,
      "paraphrase_outputs": true,
      "style_standardize": true,
      "deidentify": true,
      "dedupe": true,
      "expand": true
    },
    "vietnamese_translation": true
  }'

# RAG Processing
curl -X POST "https://huggingface.co/spaces/MedVietAI/processing/rag/healthcaremagic" \
  -H "Content-Type: application/json" \
  -d '{
    "vietnamese_translation": true
  }'
```
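The same requests can be built from Python with the standard library. This is a minimal sketch: the base URL and paths are copied from the curl examples above, `build_request` and `send` are hypothetical helper names, and the response is assumed to be JSON.

```python
import json
import urllib.request
from typing import Any, Dict

# Base URL copied from the curl examples above.
BASE_URL = "https://huggingface.co/spaces/MedVietAI/processing"


def build_request(path: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Build a JSON POST request for a path such as "process/healthcaremagic"."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{path.lstrip('/')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> Any:
    """Send the request and decode the body (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


sft_request = build_request(
    "process/healthcaremagic",
    {
        "augment": {"paraphrase_ratio": 0.2, "backtranslate_ratio": 0.1},
        "vietnamese_translation": True,
    },
)
rag_request = build_request("rag/healthcaremagic", {"vietnamese_translation": True})
```

Calling `send(sft_request)` would submit the job; building and sending are kept separate so the payload can be inspected first.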
## 📚 Documentation
- [Request Documentation](docs/REQUEST.md)
- [Data Processing Guide](docs/DATA_PROCESSING.md)
- [Local Mode Guide](docs/LOCAL_MODE.md)
## 📄 License
[Apache-2.0 LICENSE](docs/LICENSE.txt)