---
title: Medical Processing
emoji: ⚕️
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
license: apache-2.0
short_description: Data processing. Derived from 500k medical knowledge mix
---
## 🚀 Quick Access
- [HF Space](https://huggingface.co/spaces/MedSwin/medai-processing)
- [MedDialog-100k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-100k)
- [MedDialog-10k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-10k)
- [PubMedQA-Labelled](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-L)
- [PubMedQA-Unlabelled](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-U)
- [PubMedQA-Mapper](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-MAP)
## 🎯 Features
### 🏠 Dual Mode Operation
- **Local Mode**: MedAlpaca-13b model running locally for privacy and cost efficiency
- **Cloud Mode**: NVIDIA + Gemini API integration for scalable processing
- **Dynamic Switching**: Toggle between modes via environment variables
- **Medical Specialization**: MedAlpaca-13b specifically fine-tuned for medical tasks
### 🔄 Advanced Data Augmentation
- **Paraphrasing**: Multi-model rotation (NVIDIA + Gemini) with easy/hard difficulty levels
- **Backtranslation**: Vietnamese pivot language for semantic preservation
- **Style Standardization**: Clinical voice enforcement and professional medical tone
- **Response Validation**: Invalid response detection and retry logic (max 3 attempts)
- **Quality Guards**: Length/semantic validation for backtranslation outputs
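
As an illustration of the length/semantic quality guard for backtranslation outputs, here is a minimal sketch. The function name `passes_backtranslation_guard`, the three thresholds, and the use of token overlap as a stand-in for semantic similarity are all assumptions made for this example, not the project's actual checks.

```python
import re


def _tokens(text: str) -> set:
    """Lowercased word tokens, used here as a crude proxy for meaning."""
    return set(re.findall(r"\w+", text.lower()))


def passes_backtranslation_guard(
    original: str,
    candidate: str,
    min_len_ratio: float = 0.5,  # hypothetical threshold
    max_len_ratio: float = 2.0,  # hypothetical threshold
    min_overlap: float = 0.3,    # hypothetical threshold
) -> bool:
    """Return True if the backtranslated text is plausibly a faithful variant."""
    if not original.strip() or not candidate.strip():
        return False

    # Length guard: reject outputs that shrank or ballooned too much.
    ratio = len(candidate) / len(original)
    if not (min_len_ratio <= ratio <= max_len_ratio):
        return False

    # Semantic guard (approximation): Jaccard overlap of word sets.
    a, b = _tokens(original), _tokens(candidate)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= min_overlap
```

A real pipeline would more likely compare embeddings than word sets; the token overlap is only there to keep the sketch dependency-free.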
<!-- ### 🇻🇳 Vietnamese Translation
- **Complete Translation**: All text fields translated when Vietnamese mode is enabled
- **Quality Validation**: Translation quality checks with fallback to original text
- **SFT Format**: `instruction`, `input`, `output` fields translated
- **RAG Format**: `question`, `answer`, `context` fields translated
- **Sanitization**: Repetition reduction and whitespace normalization -->
### 📊 SFT Data Enrichment
- **Multiple Answer Variants**: 2-3 different answers per question for better reasoning
- **Multiple Question Variants**: 2-3 different questions per answer for diverse training
- **Cross Combinations**: All question × answer variant combinations (up to 9 per sample)
- **Vietnamese Variants**: Translated versions of enriched combinations
- **Reasoning Enhancement**: Multiple reasoning paths for improved model training
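
The cross-combination step can be sketched as a Cartesian product of question and answer variants. The helper name `cross_combinations` and the order-preserving de-duplication are assumptions for illustration; with three variants on each side the product yields the stated maximum of 9 pairs.

```python
from itertools import product


def cross_combinations(questions, answers, max_variants=3):
    """Pair every question variant with every answer variant.

    Duplicate variants are removed (keeping first occurrence) and each side
    is capped at `max_variants`, so at most max_variants**2 pairs come back.
    """
    unique_questions = list(dict.fromkeys(questions))[:max_variants]
    unique_answers = list(dict.fromkeys(answers))[:max_variants]
    return list(product(unique_questions, unique_answers))


pairs = cross_combinations(
    ["What causes migraines?", "Why do migraines occur?", "What triggers a migraine?"],
    ["Answer variant A", "Answer variant B", "Answer variant C"],
)
print(len(pairs))  # 3 question variants x 3 answer variants
```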
### 🔍 Quality Assurance
- **Invalid Response Detection**: Catches "Fail", "Invalid", "I can't", "Sorry", etc.
- **Retry Logic**: Up to 3 attempts with different paraphrasing difficulties
- **Drop Strategy**: Samples are dropped if all retries fail (no fallback answers are substituted)
- **Consistency Checking**: LLM-based validation of answer quality
- **De-identification**: PHI removal with configurable strictness
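
The detection, retry, and drop behaviour described above might look like the following sketch. The marker tuple copies only the phrases listed above (the "etc." is left unfilled), and the prefix matching, the `paraphrase` callable, and the difficulty order are assumptions for illustration.

```python
from typing import Callable, Optional

# Phrases taken from the list above; the real list is longer ("etc.").
INVALID_MARKERS = ("fail", "invalid", "i can't", "sorry")


def is_invalid_response(text: str) -> bool:
    """Flag empty outputs and outputs that open with a refusal/failure marker."""
    lowered = text.strip().lower()
    if not lowered:
        return True
    # Assumption: match at the start only, so that legitimate medical text
    # such as "Heart failure ..." is not rejected for containing "fail".
    return lowered.startswith(INVALID_MARKERS)


def paraphrase_with_retry(
    text: str,
    paraphrase: Callable[[str, str], str],   # hypothetical: (text, difficulty) -> str
    difficulties=("easy", "hard", "easy"),   # assumed order, max 3 attempts
) -> Optional[str]:
    """Try up to three paraphrases; return None so the caller can drop the sample."""
    for difficulty in difficulties[:3]:
        candidate = paraphrase(text, difficulty)
        if not is_invalid_response(candidate):
            return candidate
    return None
```

Returning `None` rather than the original text mirrors the drop strategy: the caller skips the sample instead of writing a fallback answer.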
### 🎯 RAG Optimization
- **Embedding-Friendly**: Concise, direct text optimized for dense retrieval
- **Context Generation**: Synthetic context creation when missing
- **Content Cleaning**: Conversational element removal for medical focus
- **Length Control**: Hard caps on question/answer/context lengths
- **Quality Filtering**: Invalid response cleaning for RAG corpora
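
A hard length cap on the `question`, `answer`, and `context` fields could be applied as in the sketch below. The cap values in `RAG_CAPS` and both helper names are hypothetical; the actual limits are not stated in this README.

```python
# Hypothetical per-field caps in characters.
RAG_CAPS = {"question": 300, "answer": 1000, "context": 1500}


def cap_text(text: str, max_chars: int) -> str:
    """Normalize whitespace, then truncate to max_chars at a word boundary."""
    text = " ".join(text.split())
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the cut lands inside a word, back up to the previous space.
    if text[max_chars] != " " and " " in cut:
        cut = cut[: cut.rfind(" ")]
    return cut.rstrip()


def cap_rag_record(record: dict, caps: dict = RAG_CAPS) -> dict:
    """Apply the cap to each known field and pass other fields through."""
    return {
        key: cap_text(value, caps[key]) if key in caps else value
        for key, value in record.items()
    }
```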
## 📋 Supported Datasets
### Medical Dialogue
- **HealthCareMagic**: 100k medical conversations
- **iCliniq**: 10k derived medical Q&A
### Biomedical QA
- **PubMedQA-L**: Labeled biomedical questions
- **PubMedQA-U**: Unlabeled biomedical questions
- **PubMedQA-MAP**: Mapped biomedical Q&A pairs
## ⚙️ Configuration
### Mode Selection
```bash
# Local Mode (MedAlpaca-13b)
IS_LOCAL=true
HF_TOKEN=your_huggingface_token
# Cloud Mode (NVIDIA/Gemini APIs)
IS_LOCAL=false
NVIDIA_API_1=your_nvidia_key
GEMINI_API_1=your_gemini_key
```
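
For orientation, the environment variables above could be read at startup roughly as follows. This is a sketch, not the Space's actual loader: the function names, the default of cloud mode when `IS_LOCAL` is unset, and the idea that keys are numbered `_1`, `_2`, ... for rotation are assumptions inferred from the variable names.

```python
import os

TRUTHY = {"1", "true", "yes"}


def _numbered_keys(env, prefix):
    """Collect PREFIX_1, PREFIX_2, ... until the first missing or empty one."""
    keys, index = [], 1
    while env.get(f"{prefix}_{index}"):
        keys.append(env[f"{prefix}_{index}"])
        index += 1
    return keys


def load_mode_config(env=None):
    """Return a small dict describing which backend to use."""
    env = os.environ if env is None else env
    is_local = str(env.get("IS_LOCAL", "false")).strip().lower() in TRUTHY
    if is_local:
        return {"mode": "local", "hf_token": env.get("HF_TOKEN")}
    return {
        "mode": "cloud",
        "nvidia_keys": _numbered_keys(env, "NVIDIA_API"),
        "gemini_keys": _numbered_keys(env, "GEMINI_API"),
    }
```

Passing a plain dict as `env` makes the function easy to exercise without touching the real environment.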
### Augmentation Parameters
```python
class AugmentOptions:
    paraphrase_ratio: float = 0.2          # 0.0-1.0
    paraphrase_outputs: bool = True        # Augment model answers
    backtranslate_ratio: float = 0.1       # 0.0-1.0 (Vietnamese pivot)
    style_standardize: bool = True         # Enforce clinical style
    deidentify: bool = True                # Remove PHI
    dedupe: bool = True                    # Remove duplicates
    max_chars: int = 5000                  # Text length limit
    consistency_check_ratio: float = 0.05  # 0.0-1.0
    expand: bool = True                    # Enable enrichment
    max_aug_per_sample: int = 2            # 1-3 variants
```
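
One plausible reading of the ratio parameters is that each sample is selected for an augmentation with that probability. The sketch below illustrates this interpretation only; the helper name `select_for_augmentation` and the seeded per-sample draw are assumptions, and the pipeline may sample differently.

```python
import random


def select_for_augmentation(n_samples: int, ratio: float, seed: int = 0) -> list:
    """Return indices of samples chosen with independent probability `ratio`."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    rng = random.Random(seed)  # seeded so runs are reproducible
    return [i for i in range(n_samples) if rng.random() < ratio]


# With paraphrase_ratio = 0.2, roughly one sample in five is selected.
chosen = select_for_augmentation(n_samples=1000, ratio=0.2)
print(len(chosen))
```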
### Processing Modes
- **SFT Processing**: Supervised Fine-Tuning format with enrichment
- **RAG Processing**: Question-Context-Answer format for retrieval
- **Vietnamese Mode**: Complete translation of all text fields
## 📈 Output Statistics
The pipeline reports the following counters for each run:
- `written`: Successfully processed samples
- `paraphrased_input/output`: Paraphrasing counts
- `backtranslated_input/output`: Backtranslation counts
- `dropped_invalid`: Samples dropped due to failed retries
- `vietnamese_variants`: Vietnamese variants created
- `dedup_skipped`: Duplicate samples removed
- `consistency_failed`: Samples flagged for quality issues
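
To show how counters such as `written` and `dedup_skipped` might be maintained, here is a small sketch built on `collections.Counter`. The helper `write_unique` and the choice of a SHA-256 hash over whitespace-normalized, lowercased text as the duplicate key are assumptions for illustration.

```python
import hashlib
from collections import Counter


def write_unique(samples, stats: Counter) -> list:
    """Keep the first occurrence of each sample and update the counters."""
    seen = set()
    kept = []
    for sample in samples:
        normalized = " ".join(sample.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            stats["dedup_skipped"] += 1
            continue
        seen.add(key)
        kept.append(sample)
        stats["written"] += 1
    return kept


stats = Counter()
write_unique(["Take with food.", "take  with food.", "Avoid alcohol."], stats)
print(dict(stats))
```

A `Counter` returns 0 for keys that were never incremented, so counters like `dropped_invalid` can be read safely even when nothing was dropped.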
## 🔧 Usage
### Web Interface
1. Visit the [HF Space](https://huggingface.co/spaces/MedSwin/medai-processing)
2. Select dataset and processing mode (SFT/RAG)
3. Enable Vietnamese translation if needed
4. Click the process button
### API Usage
```bash
# SFT Processing with Vietnamese translation
curl -X POST "https://huggingface.co/spaces/MedSwin/medai-processing/process/healthcaremagic" \
  -H "Content-Type: application/json" \
  -d '{
    "augment": {
      "paraphrase_ratio": 0.2,
      "backtranslate_ratio": 0.1,
      "paraphrase_outputs": true,
      "style_standardize": true,
      "deidentify": true,
      "dedupe": true,
      "expand": true
    },
    "vietnamese_translation": true
  }'

# RAG Processing
curl -X POST "https://huggingface.co/spaces/MedSwin/medai-processing/rag/healthcaremagic" \
  -H "Content-Type: application/json" \
  -d '{
    "vietnamese_translation": true
  }'
```
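
The SFT request above can also be built from Python using only the standard library. This is a sketch: the base URL, the `/process/<dataset>` path, and the payload fields are copied from the curl example, while the helper name `build_process_request` is hypothetical and the response format is not documented here.

```python
import json
import urllib.request

# Base URL copied from the curl examples above.
BASE_URL = "https://huggingface.co/spaces/MedSwin/medai-processing"


def build_process_request(dataset, augment, vietnamese_translation=False,
                          base_url=BASE_URL):
    """Build (but do not send) the POST request for SFT processing."""
    payload = {
        "augment": augment,
        "vietnamese_translation": vietnamese_translation,
    }
    return urllib.request.Request(
        url=f"{base_url}/process/{dataset}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_process_request(
    "healthcaremagic",
    augment={"paraphrase_ratio": 0.2, "backtranslate_ratio": 0.1, "dedupe": True},
    vietnamese_translation=True,
)
# To send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.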
## 📚 Documentation
- [Request Documentation](docs/REQUEST.md)
- [Data Processing Guide](docs/DATA_PROCESSING.md)
- [Local Mode Guide](docs/LOCAL_MODE.md)
## 📄 License
[Apache-2.0 LICENSE](docs/LICENSE.txt)