MedAI_Processing / README.md

LiamKhoaLe

Upd syntax

fb6b1e8 2 months ago


	---
	title: Medical Processing
	emoji: ⚕️
	colorFrom: green
	colorTo: pink
	sdk: docker
	pinned: false
	license: apache-2.0
	short_description: Data processing. Derived from 500k medical knowledge mix
	---

	## 🚀 Quick Access

	[HF Space](https://huggingface.co/spaces/MedSwin/medai-processing)

	[MedDialog-100k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-100k)

	[MedDialog-10k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-10k)

	[PubMedQA-Labelled](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-L)

	[PubMedQA-Unlabelled](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-U)

	[PubMedQA-Mapper](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-MAP)

	## 🎯 Features

	### 🏠 Dual Mode Operation
	- Local Mode: MedAlpaca-13b model running locally for privacy and cost efficiency
	- Cloud Mode: NVIDIA + Gemini API integration for scalable processing
	- Dynamic Switching: Toggle between modes via environment variables
	- Medical Specialization: MedAlpaca-13b specifically fine-tuned for medical tasks

	### 🔄 Advanced Data Augmentation
	- Paraphrasing: Multi-model rotation (NVIDIA + Gemini) with easy/hard difficulty levels
	- Backtranslation: Vietnamese pivot language for semantic preservation
	- Style Standardization: Clinical voice enforcement and professional medical tone
	- Response Validation: Invalid response detection and retry logic (max 3 attempts)
	- Quality Guards: Length/semantic validation for backtranslation outputs
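
The quality guards for backtranslation could look roughly like the sketch below. It is an illustration only: the function name `passes_backtranslation_guard`, the three thresholds, and the use of word-set overlap as a stand-in for semantic validation are all assumptions, not the project's actual implementation.

```python
import re


def passes_backtranslation_guard(
    original: str,
    backtranslated: str,
    min_len_ratio: float = 0.5,  # assumed threshold, not taken from the project
    max_len_ratio: float = 2.0,  # assumed threshold, not taken from the project
    min_overlap: float = 0.3,    # assumed threshold, not taken from the project
) -> bool:
    """Return True if the backtranslated text plausibly preserves the original."""
    if not original.strip() or not backtranslated.strip():
        return False

    # Length guard: reject outputs that shrank or grew too much.
    ratio = len(backtranslated) / len(original)
    if not (min_len_ratio <= ratio <= max_len_ratio):
        return False

    # Crude semantic guard: Jaccard overlap of lowercase word sets.
    words_a = set(re.findall(r"\w+", original.lower()))
    words_b = set(re.findall(r"\w+", backtranslated.lower()))
    if not words_a or not words_b:
        return False
    return len(words_a & words_b) / len(words_a | words_b) >= min_overlap
```

An embedding-based similarity score would be a stronger semantic check; word overlap is used here only to keep the sketch dependency-free.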

	<!-- ### 🇻🇳 Vietnamese Translation
	- Complete Translation: All text fields translated when Vietnamese mode is enabled
	- Quality Validation: Translation quality checks with fallback to original text
	- SFT Format: `instruction`, `input`, `output` fields translated
	- RAG Format: `question`, `answer`, `context` fields translated
	- Sanitization: Repetition reduction and whitespace normalization -->

	### 📊 SFT Data Enrichment
	- Multiple Answer Variants: 2-3 different answers per question for better reasoning
	- Multiple Question Variants: 2-3 different questions per answer for diverse training
	- Cross Combinations: All question × answer variant combinations (up to 9 per sample)
	- Vietnamese Variants: Translated versions of enriched combinations
	- Reasoning Enhancement: Multiple reasoning paths for improved model training
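
The cross-combination step can be pictured as a Cartesian product of question and answer variants. The sketch below assumes a hypothetical helper `cross_combine` and caps each side at three variants, which gives the "up to 9 per sample" figure; the `instruction` and `output` keys follow the SFT field names used by this project.

```python
from itertools import product


def cross_combine(question_variants, answer_variants, max_variants=3):
    """Pair every question variant with every answer variant."""
    questions = question_variants[:max_variants]
    answers = answer_variants[:max_variants]
    return [
        {"instruction": question, "output": answer}
        for question, answer in product(questions, answers)
    ]


pairs = cross_combine(
    ["What causes migraines?", "Why do migraines occur?", "What triggers a migraine?"],
    ["Answer A", "Answer B", "Answer C"],
)
print(len(pairs))  # 3 questions x 3 answers
```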

	### 🔍 Quality Assurance
	- Invalid Response Detection: Catches "Fail", "Invalid", "I can't", "Sorry", etc.
	- Retry Logic: Up to 3 attempts with different paraphrasing difficulties
	- Drop Strategy: Samples are dropped if all retries fail (no fallback answers)
	- Consistency Checking: LLM-based validation of answer quality
	- De-identification: PHI removal with configurable strictness
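
A minimal sketch of the detect, retry, and drop flow described above. The names `is_invalid_response` and `paraphrase_with_retry` are hypothetical, the prefix-matching rule is an assumption (the real check may match anywhere in the text), and the order of difficulties across attempts is also assumed.

```python
from typing import Callable, Optional, Sequence

# Markers taken from the list above; the "etc." is deliberately not expanded.
INVALID_MARKERS = ("fail", "invalid", "i can't", "sorry")


def is_invalid_response(text: str) -> bool:
    """Treat empty output or output starting with a refusal marker as invalid."""
    lowered = text.strip().lower()
    return not lowered or any(lowered.startswith(m) for m in INVALID_MARKERS)


def paraphrase_with_retry(
    text: str,
    paraphrase: Callable[[str, str], str],  # hypothetical (text, difficulty) -> str
    difficulties: Sequence[str] = ("easy", "hard", "easy"),  # assumed order, 3 attempts
) -> Optional[str]:
    """Return the first valid paraphrase, or None so the caller drops the sample."""
    for difficulty in difficulties:
        candidate = paraphrase(text, difficulty)
        if not is_invalid_response(candidate):
            return candidate
    return None  # no fallback answer: the sample is dropped
```

Returning `None` rather than the original text mirrors the drop strategy: a failed sample never reaches the output with an unvalidated answer.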

	### 🎯 RAG Optimization
	- Embedding-Friendly: Concise, direct text optimized for dense retrieval
	- Context Generation: Synthetic context creation when missing
	- Content Cleaning: Conversational element removal for medical focus
	- Length Control: Hard caps on question/answer/context lengths
	- Quality Filtering: Invalid response cleaning for RAG corpora
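
Hard length caps for RAG records might be applied as in the sketch below. The cap values in `RAG_CAPS` and the helper names are illustrative assumptions; the source only states that question, answer, and context lengths are capped.

```python
# Hypothetical per-field character caps; the real limits are not documented here.
RAG_CAPS = {"question": 300, "answer": 1000, "context": 2000}


def cap_text(text: str, max_chars: int) -> str:
    """Normalise whitespace and truncate at a word boundary within max_chars."""
    text = " ".join(text.split())
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    if text[max_chars] == " ":
        return cut  # the cap already falls on a word boundary
    space = cut.rfind(" ")
    return cut[:space] if space > 0 else cut


def cap_record(record: dict, caps: dict = RAG_CAPS) -> dict:
    """Apply the per-field caps and leave other fields untouched."""
    return {
        key: cap_text(value, caps[key]) if key in caps else value
        for key, value in record.items()
    }
```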

	## 📋 Supported Datasets

	### Medical Dialogue
	- HealthCareMagic: 100k medical conversations
	- iCliniq: 10k derived medical Q&A

	### Biomedical QA
	- PubMedQA-L: Labeled biomedical questions
	- PubMedQA-U: Unlabeled biomedical questions
	- PubMedQA-MAP: Mapped biomedical Q&A pairs

	## ⚙️ Configuration

	### Mode Selection
	```bash
	# Local Mode (MedAlpaca-13b)
	IS_LOCAL=true
	HF_TOKEN=your_huggingface_token

	# Cloud Mode (NVIDIA/Gemini APIs)
	IS_LOCAL=false
	NVIDIA_API_1=your_nvidia_key
	GEMINI_API_1=your_gemini_key
	```
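
As a rough illustration of how these variables could drive mode switching, the sketch below resolves the mode from an environment mapping. The function `resolve_mode`, the accepted truthy strings, the default of cloud mode when `IS_LOCAL` is unset, and the rule that a single cloud key is enough are all assumptions.

```python
import os


def resolve_mode(env=None) -> str:
    """Return "local" or "cloud" based on IS_LOCAL and the matching credentials."""
    env = os.environ if env is None else env
    flag = str(env.get("IS_LOCAL", "false")).strip().lower()
    if flag in {"1", "true", "yes"}:
        if not env.get("HF_TOKEN"):
            raise RuntimeError("Local mode requires HF_TOKEN")
        return "local"
    # Assumption: at least one cloud key must be present.
    if not (env.get("NVIDIA_API_1") or env.get("GEMINI_API_1")):
        raise RuntimeError("Cloud mode requires NVIDIA_API_1 or GEMINI_API_1")
    return "cloud"
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.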

	### Augmentation Parameters
	```python
	class AugmentOptions:
	    paraphrase_ratio: float = 0.2          # 0.0-1.0
	    paraphrase_outputs: bool = True        # Augment model answers
	    backtranslate_ratio: float = 0.1       # 0.0-1.0 (Vietnamese pivot)
	    style_standardize: bool = True         # Enforce clinical style
	    deidentify: bool = True                # Remove PHI
	    dedupe: bool = True                    # Remove duplicates
	    max_chars: int = 5000                  # Text length limit
	    consistency_check_ratio: float = 0.05  # 0.0-1.0
	    expand: bool = True                    # Enable enrichment
	    max_aug_per_sample: int = 2            # 1-3 variants
	```
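
One plausible reading of the ratio parameters is a per-sample random draw: with `paraphrase_ratio = 0.2`, roughly one sample in five is paraphrased. The sketch below shows that interpretation only; `should_apply` and `plan_augmentations` are hypothetical names, and the project may select samples differently.

```python
import random


def should_apply(ratio: float, rng: random.Random) -> bool:
    """Bernoulli draw: True with probability `ratio`."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be within 0.0-1.0")
    return rng.random() < ratio


def plan_augmentations(n_samples, paraphrase_ratio=0.2, backtranslate_ratio=0.1, seed=0):
    """Decide per sample which augmentations to run (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [
        {
            "paraphrase": should_apply(paraphrase_ratio, rng),
            "backtranslate": should_apply(backtranslate_ratio, rng),
        }
        for _ in range(n_samples)
    ]
```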

	### Processing Modes
	- SFT Processing: Supervised Fine-Tuning format with enrichment
	- RAG Processing: Question-Context-Answer format for retrieval
	- Vietnamese Mode: Complete translation of all text fields

	## 📈 Output Statistics

	The system tracks the following statistics:
	- `written`: Successfully processed samples
	- `paraphrased_input/output`: Paraphrasing counts
	- `backtranslated_input/output`: Backtranslation counts
	- `dropped_invalid`: Samples dropped due to failed retries
	- `vietnamese_variants`: Vietnamese variants created
	- `dedup_skipped`: Duplicate samples removed
	- `consistency_failed`: Samples flagged for quality issues
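
To show how counters such as `written` and `dedup_skipped` might be accumulated, here is a small sketch of hash-based deduplication with a `collections.Counter`. The function `dedupe_with_stats` and the normalisation rule (lowercasing and collapsing whitespace) are assumptions, not the project's code.

```python
import hashlib
from collections import Counter


def dedupe_with_stats(samples):
    """Drop duplicate instruction/output pairs and count what happened."""
    stats = Counter(written=0, dedup_skipped=0)
    seen = set()
    kept = []
    for sample in samples:
        joined = f"{sample['instruction']}\n{sample['output']}"
        normalised = " ".join(joined.lower().split())
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if digest in seen:
            stats["dedup_skipped"] += 1
            continue
        seen.add(digest)
        kept.append(sample)
        stats["written"] += 1
    return kept, stats
```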

	## 🔧 Usage

	### Web Interface
	1. Visit the [HF Space](https://huggingface.co/spaces/MedSwin/medai-processing)
	2. Select dataset and processing mode (SFT/RAG)
	3. Enable Vietnamese translation if needed
	4. Click the process button

	### API Usage
	```bash
	# SFT Processing with Vietnamese translation
	curl -X POST "https://huggingface.co/spaces/MedSwin/medai-processing/process/healthcaremagic" \
	  -H "Content-Type: application/json" \
	  -d '{
	    "augment": {
	      "paraphrase_ratio": 0.2,
	      "backtranslate_ratio": 0.1,
	      "paraphrase_outputs": true,
	      "style_standardize": true,
	      "deidentify": true,
	      "dedupe": true,
	      "expand": true
	    },
	    "vietnamese_translation": true
	  }'

	# RAG Processing
	curl -X POST "https://huggingface.co/spaces/MedSwin/medai-processing/rag/healthcaremagic" \
	  -H "Content-Type: application/json" \
	  -d '{
	    "vietnamese_translation": true
	  }'
	```
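
The same SFT request can be prepared from Python with only the standard library. This is a sketch: `build_process_request` is a hypothetical helper, the base URL and the `/process/<dataset>` path are copied from the curl example above, and the response format is not documented here, so the code only constructs the request.

```python
import json
from urllib.request import Request

# Base URL copied from the curl example; adjust if the Space is served elsewhere.
BASE_URL = "https://huggingface.co/spaces/MedSwin/medai-processing"


def build_process_request(dataset, augment, vietnamese_translation=False, base_url=BASE_URL):
    """Build (but do not send) a POST request for SFT processing."""
    payload = {"augment": augment, "vietnamese_translation": vietnamese_translation}
    return Request(
        url=f"{base_url}/process/{dataset}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_process_request(
    "healthcaremagic",
    {"paraphrase_ratio": 0.2, "backtranslate_ratio": 0.1, "dedupe": True},
    vietnamese_translation=True,
)
```

To send it, pass `req` to `urllib.request.urlopen` and handle the response according to the Request Documentation.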

	## 📚 Documentation

	- [Request Documentation](docs/REQUEST.md)
	- [Data Processing Guide](docs/DATA_PROCESSING.md)
	- [Local Mode Guide](docs/LOCAL_MODE.md)

	## 📄 License

	[Apache-2.0 LICENSE](docs/LICENSE.txt)