| a. LLM-Based Paraphrasing | |
| - **Multi-model approach**: Llama-8B (same architecture) and Gemini (Flash/Pro) models for reliability | |
| - **Difficulty levels**: Easy and Hard paraphrasing modes, so that different models can be used for each level, with auditing | |
| - **Medical context preservation**: Maintains clinical terminology accuracy | |
| - **Configurable ratios**: User-defined augmentation percentages | |
| b. Back-Translation Augmentation | |
| - **Pivot languages**: EN-VI-EN-VI... | |
| - **Quality control**: Length and semantic similarity validation | |
| - **Meaning preservation**: Maintains semantic accuracy through translation cycles | |
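
The quality-control step for back-translation can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the function name `passes_quality_check` and all thresholds are hypothetical, and `difflib.SequenceMatcher` is used only as a lexical stand-in for a real semantic similarity model (for example, embedding cosine similarity).

```python
from difflib import SequenceMatcher


def passes_quality_check(
    original: str,
    back_translated: str,
    min_len_ratio: float = 0.5,   # hypothetical threshold
    max_len_ratio: float = 2.0,   # hypothetical threshold
    min_similarity: float = 0.6,  # hypothetical threshold
) -> bool:
    """Return True if a back-translated text looks like a usable paraphrase."""
    if not original.strip() or not back_translated.strip():
        return False

    # Length check: reject outputs that shrank or grew too much.
    len_ratio = len(back_translated) / len(original)
    if not (min_len_ratio <= len_ratio <= max_len_ratio):
        return False

    # Similarity check: a character-level proxy, NOT true semantic similarity.
    similarity = SequenceMatcher(
        None, original.lower(), back_translated.lower()
    ).ratio()
    return similarity >= min_similarity
```

A sample that fails either check would simply be dropped or retried, which keeps truncated or drifting translations out of the augmented set.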
| c. Style Standardization | |
| - **Clinical voice enforcement**: Neutral, professional medical tone | |
| - **Absolute language removal**: Replaces guarantees with probabilistic language | |
| - **Forum sign-off removal**: Eliminates informal communication patterns | |
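
As an illustration of the style rules above, the sketch below replaces a few absolute phrases with probabilistic wording and strips trailing forum sign-offs. The phrase lists, the regular expressions, and the function name `standardize_style` are all assumptions made for this example; the actual rules used by the pipeline may differ.

```python
import re

# Hypothetical mapping from absolute phrases to hedged alternatives.
ABSOLUTE_TO_HEDGED = [
    (r"\bwill definitely\b", "is likely to"),
    (r"\bguaranteed to\b", "likely to"),
    (r"\b100% effective\b", "often effective"),
]

# Hypothetical sign-off pattern, matched only at the end of the text.
SIGN_OFF = re.compile(
    r"\s*(?:hope this helps|best regards|regards|take care|thanks)"
    r"[\s,.!]*(?:dr\.?\s+\w+)?[\s,.!]*$",
    re.IGNORECASE,
)


def standardize_style(text: str) -> str:
    """Soften absolute claims and remove trailing forum sign-offs."""
    for pattern, replacement in ABSOLUTE_TO_HEDGED:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)

    # Strip sign-offs repeatedly in case several are stacked at the end.
    previous = None
    while previous != text:
        previous = text
        text = SIGN_OFF.sub("", text)
    return text.strip()
```

Rule-based replacement like this is easy to audit, but it cannot judge context, so an LLM pass or manual review would still be needed for phrases where softening changes the clinical meaning.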
| d. Multi-Variant Generation (for reasoning) | |
| - **Answer variants**: Concise, detailed, clinical, patient-friendly styles | |
| - **Question variants**: Clarifying, follow-up, symptom-focused, treatment-focused | |
| - **Cross combinations**: All question × answer variant combinations (up to 9 per sample) | |
| e. Clinical Scenario Creation | |
| - **Context variations**: Emergency room, routine checkup, chronic conditions, family member perspectives | |
| - **Enhanced diversity**: Multiple reasoning paths for improved model training | |
| f. Quality Assurance | |
| f1. Data Cleaning | |
| - **PHI removal**: Email, phone, URL, IP address redaction | |
| - **Deduplication**: MD5-based content hashing with normalized comparison | |
| - **Invalid response handling**: Detection and retry logic for failed responses | |
| - **Conversational element cleaning**: Removal of greetings and non-medical content | |
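
The PHI redaction and deduplication steps could look something like the sketch below. The regular expressions are deliberately simplified and hypothetical (real PHI detection needs much broader coverage), as are the placeholder tokens and the helper names `redact_phi`, `content_hash`, and `deduplicate`. The normalization shown (lowercasing and whitespace collapsing) is one plausible reading of "normalized comparison".

```python
import hashlib
import re

# Simplified, hypothetical patterns. Order matters: IP addresses are
# redacted before phone numbers so the phone pattern cannot consume them.
PHI_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"https?://\S+|www\.\S+"), "[URL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]


def redact_phi(text: str) -> str:
    """Replace emails, URLs, IP addresses, and phone numbers with tags."""
    for pattern, tag in PHI_PATTERNS:
        text = pattern.sub(tag, text)
    return text


def content_hash(text: str) -> str:
    """MD5 of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def deduplicate(samples: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized sample."""
    seen: set[str] = set()
    unique: list[str] = []
    for sample in samples:
        digest = content_hash(sample)
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique
```

MD5 is used here only as a fast content fingerprint, not for security, so its known collision weaknesses are not a practical concern for deduplication.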
| f2. Validation | |
| - **Medical accuracy validation**: LLM-based consistency checking | |
| - **Length control**: Configurable maximum character limits | |
| - **Language detection**: English validation for content quality | |
| g. Output Formats: SFT Format | |
| - **Instruction**: Task description | |
| - **Input**: User question/context | |
| - **Output**: Model response | |
| - **Metadata**: Augmentation tags and source information |
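
For illustration, one SFT record with the four fields above might be built and serialized like this. The lowercase key names, the JSON Lines serialization, and the metadata keys `augmentations` and `source` are assumptions for this sketch; the source only names the four fields.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SFTRecord:
    instruction: str  # task description
    input: str        # user question or context
    output: str       # model response
    metadata: dict = field(default_factory=dict)  # tags and source info


def to_jsonl_line(record: SFTRecord) -> str:
    """Serialize one record as a single JSON line (assumed output format)."""
    return json.dumps(asdict(record), ensure_ascii=False)


example = SFTRecord(
    instruction="Answer the medical question.",
    input="What are common causes of iron-deficiency anemia?",
    output="Common causes may include blood loss and low dietary iron.",
    metadata={"augmentations": ["paraphrase"], "source": "example"},
)
line = to_jsonl_line(example)
```

Keeping augmentation tags in `metadata` makes it possible to filter or reweight augmented samples later without re-running the pipeline.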