| title: Medical Processing | |
| emoji: ⚕️ | |
| colorFrom: green | |
| colorTo: pink | |
| sdk: docker | |
| pinned: false | |
| license: apache-2.0 | |
| short_description: Data processing. Derived from 500k medical knowledge mix | |
| ## 🚀 Quick Access | |
| [HF Space](https://huggingface.co/spaces/MedSwin/medai-processing) | |
| [MedDialog-100k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-100k) | |
| [MedDialog-10k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-10k) | |
| [PubMedQA-Labelled](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-L) | |
| [PubMedQA-Unlabelled](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-U) | |
| [PubMedQA-Mapper](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-MAP) | |
| ## 🎯 Features | |
| ### 🏠 Dual Mode Operation | |
| - **Local Mode**: MedAlpaca-13b model running locally for privacy and cost efficiency | |
| - **Cloud Mode**: NVIDIA + Gemini API integration for scalable processing | |
| - **Dynamic Switching**: Toggle between modes via environment variables | |
| - **Medical Specialization**: MedAlpaca-13b specifically fine-tuned for medical tasks | |
| ### 🔄 Advanced Data Augmentation | |
| - **Paraphrasing**: Multi-model rotation (NVIDIA + Gemini) with easy/hard difficulty levels | |
| - **Backtranslation**: Vietnamese pivot language for semantic preservation | |
| - **Style Standardization**: Clinical voice enforcement and professional medical tone | |
| - **Response Validation**: Invalid response detection and retry logic (max 3 attempts) | |
| - **Quality Guards**: Length/semantic validation for backtranslation outputs | |
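The quality guards for backtranslation could look roughly like the sketch below. It is not the project's implementation: the function names, the thresholds, the word-overlap heuristic standing in for the semantic check, and the choice to fall back to the original text are all assumptions made for illustration.

```python
def passes_length_guard(original: str, candidate: str,
                        min_ratio: float = 0.5, max_ratio: float = 2.0) -> bool:
    """Reject empty outputs and outputs whose length drifts too far from the source.

    The 0.5-2.0 ratio window is a hypothetical default, not a documented value.
    """
    if not original.strip() or not candidate.strip():
        return False
    ratio = len(candidate) / len(original)
    return min_ratio <= ratio <= max_ratio


def token_overlap(original: str, candidate: str) -> float:
    """Jaccard overlap of lowercased word sets: a crude stand-in for a semantic check."""
    a = set(original.lower().split())
    b = set(candidate.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def accept_backtranslation(original: str, candidate: str,
                           min_overlap: float = 0.3) -> str:
    """Return the candidate if it passes both guards, otherwise keep the original."""
    if not passes_length_guard(original, candidate):
        return original
    if token_overlap(original, candidate) < min_overlap:
        return original
    return candidate
```

A real pipeline would more likely compare sentence embeddings than word sets; the overlap score is used here only because it needs no external model.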
| <!-- ### 🇻🇳 Vietnamese Translation | |
| - **Complete Translation**: All text fields translated when Vietnamese mode is enabled | |
| - **Quality Validation**: Translation quality checks with fallback to original text | |
| - **SFT Format**: `instruction`, `input`, `output` fields translated | |
| - **RAG Format**: `question`, `answer`, `context` fields translated | |
| - **Sanitization**: Repetition reduction and whitespace normalization --> | |
| ### 📊 SFT Data Enrichment | |
| - **Multiple Answer Variants**: 2-3 different answers per question for better reasoning | |
| - **Multiple Question Variants**: 2-3 different questions per answer for diverse training | |
| - **Cross Combinations**: All question × answer variant combinations (up to 9 per sample) | |
| - **Vietnamese Variants**: Translated versions of enriched combinations | |
| - **Reasoning Enhancement**: Multiple reasoning paths for improved model training | |
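The cross-combination step is a Cartesian product of question variants and answer variants, which is why three of each yields at most nine pairs. A minimal sketch, assuming variants arrive as plain lists of strings (the function name and the `max_pairs` parameter are hypothetical):

```python
from itertools import product
from typing import List, Tuple


def cross_combinations(question_variants: List[str],
                       answer_variants: List[str],
                       max_pairs: int = 9) -> List[Tuple[str, str]]:
    """Pair every question variant with every answer variant, capped at max_pairs."""
    pairs = list(product(question_variants, answer_variants))
    return pairs[:max_pairs]


questions = ["What causes migraines?", "Why do migraines occur?", "What triggers a migraine?"]
answers = ["Common triggers include stress.", "Stress is a frequent trigger.", "Triggers often involve stress."]
pairs = cross_combinations(questions, answers)
print(len(pairs))
```

With two question variants and three answer variants the same call returns six pairs, so the cap only matters when more than three variants of either kind are generated.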
| ### 🔍 Quality Assurance | |
| - **Invalid Response Detection**: Catches "Fail", "Invalid", "I can't", "Sorry", etc. | |
| - **Retry Logic**: Up to 3 attempts with different paraphrasing difficulties | |
| - **Drop Strategy**: Samples are dropped if all retries fail (no fallback answers are substituted) | |
| - **Consistency Checking**: LLM-based validation of answer quality | |
| - **De-identification**: PHI removal with configurable strictness | |
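Invalid response detection, retry, and the drop strategy fit together as in the sketch below. The marker list comes from the bullet above; everything else is an assumption for illustration: matching on the start of the response, alternating between the easy and hard difficulties, and the `paraphrase` callable, which stands in for whatever model call the pipeline actually makes.

```python
from typing import Callable, Optional

# Markers taken from the feature list; prefix matching is an assumption.
INVALID_MARKERS = ("fail", "invalid", "i can't", "sorry")
# Hypothetical schedule: alternate difficulty on each attempt.
DIFFICULTIES = ("easy", "hard")


def is_invalid_response(text: str) -> bool:
    """Treat empty output or output starting with a refusal marker as invalid."""
    lowered = text.strip().lower()
    return not lowered or lowered.startswith(INVALID_MARKERS)


def paraphrase_with_retry(text: str,
                          paraphrase: Callable[[str, str], str],
                          max_attempts: int = 3) -> Optional[str]:
    """Try up to max_attempts times; return None so the caller can drop the sample."""
    for attempt in range(max_attempts):
        difficulty = DIFFICULTIES[attempt % len(DIFFICULTIES)]
        candidate = paraphrase(text, difficulty)
        if not is_invalid_response(candidate):
            return candidate
    return None  # no fallback answer: the sample is dropped
```

Returning `None` instead of the original text mirrors the drop strategy: a caller that receives `None` skips the sample and would increment a counter such as `dropped_invalid`.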
| ### 🎯 RAG Optimization | |
| - **Embedding-Friendly**: Concise, direct text optimized for dense retrieval | |
| - **Context Generation**: Synthetic context creation when missing | |
| - **Content Cleaning**: Conversational element removal for medical focus | |
| - **Length Control**: Hard caps on question/answer/context lengths | |
| - **Quality Filtering**: Invalid response cleaning for RAG corpora | |
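Length control for RAG records can be sketched as a whitespace cleanup followed by a hard cap per field. The cap values below are placeholders, not the limits the Space uses, and the helper names are hypothetical; only the `question`, `answer`, and `context` field names come from this README.

```python
import re
from typing import Dict

# Hypothetical character caps per field; the real limits are not documented here.
CAPS = {"question": 256, "answer": 1024, "context": 2048}


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def hard_cap(text: str, limit: int) -> str:
    """Truncate to at most `limit` characters, preferring a word boundary."""
    text = normalize_whitespace(text)
    if len(text) <= limit:
        return text
    cut = text[:limit]
    head, sep, _ = cut.rpartition(" ")
    return head if sep else cut


def apply_caps(record: Dict[str, str]) -> Dict[str, str]:
    """Cap the known RAG fields and pass any other keys through unchanged."""
    return {key: hard_cap(value, CAPS[key]) if key in CAPS else value
            for key, value in record.items()}
```

Cutting at the last space keeps embeddings from seeing a half word at the end of a passage, at the cost of returning slightly less than the cap.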
| ## 📋 Supported Datasets | |
| ### Medical Dialogue | |
| - **HealthCareMagic**: 100k medical conversations | |
| - **iCliniq**: 10k derived medical Q&A | |
| ### Biomedical QA | |
| - **PubMedQA-L**: Labeled biomedical questions | |
| - **PubMedQA-U**: Unlabeled biomedical questions | |
| - **PubMedQA-MAP**: Mapped biomedical Q&A pairs | |
| ## ⚙️ Configuration | |
| ### Mode Selection | |
| ```bash | |
| # Local Mode (MedAlpaca-13b) | |
| IS_LOCAL=true | |
| HF_TOKEN=your_huggingface_token | |
| # Cloud Mode (NVIDIA/Gemini APIs) | |
| IS_LOCAL=false | |
| NVIDIA_API_1=your_nvidia_key | |
| GEMINI_API_1=your_gemini_key | |
| ``` | |
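On the Python side, switching modes from these variables might look like the sketch below. The variable names come from the block above; the accepted truthy spellings, the default to cloud mode when `IS_LOCAL` is unset, and the idea that cloud mode needs both keys are assumptions, and the function names are hypothetical.

```python
import os
from typing import List, Mapping, Optional


def resolve_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "local" or "cloud" based on IS_LOCAL (assumed default: cloud)."""
    env = os.environ if env is None else env
    flag = env.get("IS_LOCAL", "false").strip().lower()
    return "local" if flag in {"1", "true", "yes"} else "cloud"


def missing_credentials(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """List the variables that the selected mode needs but that are unset or empty."""
    env = os.environ if env is None else env
    if resolve_mode(env) == "local":
        required = ["HF_TOKEN"]
    else:
        required = ["NVIDIA_API_1", "GEMINI_API_1"]
    return [name for name in required if not env.get(name)]
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in tests, for example `missing_credentials({"IS_LOCAL": "true"})`.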
| ### Augmentation Parameters | |
| ```python | |
| class AugmentOptions: | |
|     paraphrase_ratio: float = 0.2 # 0.0-1.0 | |
|     paraphrase_outputs: bool = True # Augment model answers | |
|     backtranslate_ratio: float = 0.1 # 0.0-1.0 (Vietnamese pivot) | |
|     style_standardize: bool = True # Enforce clinical style | |
|     deidentify: bool = True # Remove PHI | |
|     dedupe: bool = True # Remove duplicates | |
|     max_chars: int = 5000 # Text length limit | |
|     consistency_check_ratio: float = 0.05 # 0.0-1.0 | |
|     expand: bool = True # Enable enrichment | |
|     max_aug_per_sample: int = 2 # 1-3 variants | |
| ``` | |
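The ranges in the comments above can be checked before a run, and a ratio can be read as a per-sample probability. Both readings are assumptions: the README does not say whether the service validates these fields or how it samples, and `validate_options` and `should_apply` are hypothetical helpers.

```python
import random
from typing import Dict, List

RATIO_FIELDS = ("paraphrase_ratio", "backtranslate_ratio", "consistency_check_ratio")


def validate_options(opts: Dict[str, object]) -> List[str]:
    """Return human-readable errors for values outside the documented ranges."""
    errors = []
    for name in RATIO_FIELDS:
        value = opts.get(name, 0.0)
        if not 0.0 <= value <= 1.0:
            errors.append(f"{name} must be within 0.0-1.0, got {value}")
    variants = opts.get("max_aug_per_sample", 2)
    if not 1 <= variants <= 3:
        errors.append(f"max_aug_per_sample must be within 1-3, got {variants}")
    return errors


def should_apply(ratio: float, rng: random.Random) -> bool:
    """Decide per sample whether an augmentation fires (assumed interpretation)."""
    return rng.random() < ratio


rng = random.Random(0)  # seeded so a run is reproducible
selected = sum(should_apply(0.2, rng) for _ in range(1000))
```

Under this reading, `paraphrase_ratio=0.2` paraphrases about one sample in five, so `selected` lands near 200 for 1000 samples.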
| ### Processing Modes | |
| - **SFT Processing**: Supervised Fine-Tuning format with enrichment | |
| - **RAG Processing**: Question-Context-Answer format for retrieval | |
| - **Vietnamese Mode**: Complete translation of all text fields | |
| ## 📈 Output Statistics | |
| Each run reports the following counters: | |
| - `written`: Successfully processed samples | |
| - `paraphrased_input` / `paraphrased_output`: Paraphrasing counts for inputs and outputs | |
| - `backtranslated_input` / `backtranslated_output`: Backtranslation counts for inputs and outputs | |
| - `dropped_invalid`: Samples dropped due to failed retries | |
| - `vietnamese_variants`: Vietnamese variants created | |
| - `dedup_skipped`: Duplicate samples removed | |
| - `consistency_failed`: Samples flagged for quality issues | |
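Counters like these are convenient to keep in a `collections.Counter`. The sketch below assumes the `input/output` shorthand expands to separate `_input` and `_output` keys, and the `summarize` helper with its drop-rate formula is a hypothetical addition, not something the Space reports.

```python
from collections import Counter
from typing import Dict

STAT_KEYS = (
    "written",
    "paraphrased_input", "paraphrased_output",
    "backtranslated_input", "backtranslated_output",
    "dropped_invalid", "vietnamese_variants",
    "dedup_skipped", "consistency_failed",
)


def new_stats() -> Counter:
    """Start every counter at zero so the report always lists all keys."""
    return Counter({key: 0 for key in STAT_KEYS})


def summarize(stats: Counter) -> Dict[str, float]:
    """Derive the share of samples dropped after failed retries."""
    seen = stats["written"] + stats["dropped_invalid"] + stats["dedup_skipped"]
    drop_rate = stats["dropped_invalid"] / seen if seen else 0.0
    return {"seen": seen, "drop_rate": drop_rate}


stats = new_stats()
stats["written"] += 8
stats["dropped_invalid"] += 1
stats["dedup_skipped"] += 1
report = summarize(stats)
```

A rising drop rate is a quick signal that the paraphrasing models are refusing more often than expected.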
| ## 🔧 Usage | |
| ### Web Interface | |
| 1. Visit the [HF Space](https://huggingface.co/spaces/MedSwin/medai-processing) | |
| 2. Select dataset and processing mode (SFT/RAG) | |
| 3. Enable Vietnamese translation if needed | |
| 4. Click the process button | |
| ### API Usage | |
| ```bash | |
| # SFT Processing with Vietnamese translation | |
| curl -X POST "https://huggingface.co/spaces/MedSwin/medai-processing/process/healthcaremagic" \ | |
|   -H "Content-Type: application/json" \ | |
|   -d '{ | |
|     "augment": { | |
|       "paraphrase_ratio": 0.2, | |
|       "backtranslate_ratio": 0.1, | |
|       "paraphrase_outputs": true, | |
|       "style_standardize": true, | |
|       "deidentify": true, | |
|       "dedupe": true, | |
|       "expand": true | |
|     }, | |
|     "vietnamese_translation": true | |
|   }' | |
| # RAG Processing | |
| curl -X POST "https://huggingface.co/spaces/MedSwin/medai-processing/rag/healthcaremagic" \ | |
|   -H "Content-Type: application/json" \ | |
|   -d '{ | |
|     "vietnamese_translation": true | |
|   }' | |
| ``` | |
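The same requests can be prepared from Python with the standard library. This sketch only builds the request object and sends nothing; the base URL and the `process` / `rag` path segments are copied from the curl examples above, while `build_request` is a hypothetical helper and the response format is not covered here.

```python
import json
import urllib.request
from typing import Any, Dict

# Base URL as written in the curl examples above.
BASE_URL = "https://huggingface.co/spaces/MedSwin/medai-processing"


def build_request(mode: str, dataset: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Build a JSON POST request for the given mode ("process" or "rag") and dataset."""
    if mode not in {"process", "rag"}:
        raise ValueError(f"unknown mode: {mode}")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/{mode}/{dataset}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


sft_request = build_request(
    "process",
    "healthcaremagic",
    {"augment": {"paraphrase_ratio": 0.2, "backtranslate_ratio": 0.1},
     "vietnamese_translation": True},
)
```

Passing `sft_request` to `urllib.request.urlopen` would send it; that step is left out so the example runs without network access.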
| ## 📚 Documentation | |
| - [Request Documentation](docs/REQUEST.md) | |
| - [Data Processing Guide](docs/DATA_PROCESSING.md) | |
| - [Local Mode Guide](docs/LOCAL_MODE.md) | |
| ## 📄 License | |
| [Apache-2.0 LICENSE](docs/LICENSE.txt) | |