a. LLM-Based Paraphrasing
- **Multi-model approach**: Llama-8B (same architecture) and Gemini (Flash/Pro) models for reliability
- **Difficulty levels**: Easy and Hard paraphrasing modes, so that different models can be used for each, with auditing
- **Medical context preservation**: Maintains clinical terminology accuracy
- **Configurable ratios**: User-defined augmentation percentages
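
A minimal sketch of how a user-defined augmentation ratio could pick the samples sent for paraphrasing. The function name `select_for_augmentation`, the `seed` parameter, and the rounding rule are illustrative assumptions, not the project's actual API.

```python
import random


def select_for_augmentation(samples, ratio, seed=0):
    """Return sorted indices of the samples chosen for paraphrasing.

    `ratio` is the user-defined fraction (0.0 to 1.0) of the dataset to augment.
    The name and signature are hypothetical, for illustration only.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    rng = random.Random(seed)  # seeded so repeated runs select the same samples
    k = round(len(samples) * ratio)
    return sorted(rng.sample(range(len(samples)), k))


if __name__ == "__main__":
    questions = [f"question {i}" for i in range(10)]
    print(select_for_augmentation(questions, ratio=0.3))
```

Sampling indices rather than the samples themselves keeps the original records untouched, so paraphrases can be appended alongside their sources.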
b. Back-Translation Augmentation
- **Pivot languages**: EN-VI-EN-VI...
- **Quality control**: Length and semantic similarity validation
- **Meaning preservation**: Maintains semantic accuracy through translation cycles
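
The quality-control step above can be sketched as a length check plus a similarity check. Token-level Jaccard overlap is used here only as a lightweight stand-in for semantic similarity. The thresholds and function names are assumptions, and a real pipeline might compare sentence embeddings instead.

```python
import re


def length_ratio_ok(original, candidate, low=0.5, high=2.0):
    """Reject candidates that are much shorter or longer than the original."""
    if not original or not candidate:
        return False
    ratio = len(candidate) / len(original)
    return low <= ratio <= high


def token_jaccard(a, b):
    """Word-overlap similarity in [0, 1]; a crude proxy for semantic similarity."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def accept_back_translation(original, candidate, min_similarity=0.5):
    """Keep a back-translated text only if it passes both checks (assumed thresholds)."""
    return (
        length_ratio_ok(original, candidate)
        and token_jaccard(original, candidate) >= min_similarity
    )
```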
c. Style Standardization
- **Clinical voice enforcement**: Neutral, professional medical tone
- **Absolute language removal**: Replaces guarantees with probabilistic language
- **Forum sign-off removal**: Eliminates informal communication patterns
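
As a rough illustration of the two rule-based steps above, the sketch below rewrites a few absolute phrases into probabilistic ones and strips trailing forum sign-offs. The phrase lists are invented examples, not the project's actual rules, and a production version would need a much broader vocabulary.

```python
import re

# Hypothetical phrase map: absolute claims -> probabilistic wording.
ABSOLUTE_TO_HEDGED = {
    r"\bwill definitely\b": "is likely to",
    r"\bis guaranteed to\b": "is expected to",
    r"\balways works\b": "often works",
}

# Matches one or more sign-off sentences at the very end of the text.
# The lookbehind requires the sign-off to start a new sentence, so phrases
# such as "take care" inside a sentence are left alone.
SIGN_OFF = re.compile(
    r"(?:(?<=[.!?\n])|^)"
    r"(?:\s*(?:hope this helps|best wishes|take care|kind regards|regards)\b"
    r"[^.!?\n]*[.!?]?\s*)+$",
    re.IGNORECASE,
)


def standardize_style(text):
    """Apply hedging replacements, then drop trailing sign-offs."""
    for pattern, replacement in ABSOLUTE_TO_HEDGED.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return SIGN_OFF.sub("", text).strip()


if __name__ == "__main__":
    raw = "This treatment will definitely cure the infection. Hope this helps. Take care!"
    print(standardize_style(raw))
```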
d. Multi-Variant Generation (for reasoning)
- **Answer variants**: Concise, detailed, clinical, patient-friendly styles
- **Question variants**: Clarifying, follow-up, symptom-focused, treatment-focused
- **Cross combinations**: All question × answer variant combinations (up to 9 per sample)
e. Clinical Scenario Creation
- **Context variations**: Emergency room, routine checkup, chronic conditions, family member perspectives
- **Enhanced diversity**: Multiple reasoning paths for improved model training
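
The cross combinations from the multi-variant generation step (d) can be sketched with a Cartesian product capped at 9 pairs per sample. How the cap is actually enforced is not stated here, so the seeded random subsampling below is an assumption, as are the function and parameter names.

```python
import random
from itertools import product


def cross_combinations(question_variants, answer_variants, limit=9, seed=0):
    """Pair every question variant with every answer variant, keeping at most `limit` pairs.

    Subsampling with a fixed seed is an assumed policy; truncating or using
    fewer variants per side would satisfy the same cap.
    """
    pairs = list(product(question_variants, answer_variants))
    if len(pairs) <= limit:
        return pairs
    return random.Random(seed).sample(pairs, limit)


if __name__ == "__main__":
    questions = ["clarifying", "follow-up", "symptom-focused", "treatment-focused"]
    answers = ["concise", "detailed", "clinical", "patient-friendly"]
    for q_style, a_style in cross_combinations(questions, answers):
        print(q_style, "x", a_style)
```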
f. Quality Assurance
f1. Data Cleaning
- **PHI removal**: Email, phone, URL, IP address redaction
- **Deduplication**: MD5-based content hashing with normalized comparison
- **Invalid response handling**: Detection and retry logic for failed responses
- **Conversational element cleaning**: Removal of greetings and non-medical content
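
A minimal sketch of the PHI redaction and MD5-based deduplication described above. The regular expressions, placeholder tags, and normalization rule (collapse whitespace, lowercase) are simplified assumptions and will miss many real-world formats.

```python
import hashlib
import re

# Order matters: IP addresses are redacted before phone numbers, because the
# loose phone pattern would otherwise also match dotted digit groups.
PHI_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"https?://\S+|www\.\S+"), "[URL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]


def redact_phi(text):
    """Replace emails, URLs, IP addresses, and phone numbers with placeholder tags."""
    for pattern, tag in PHI_PATTERNS:
        text = pattern.sub(tag, text)
    return text


def content_key(text):
    """MD5 hash of the text after whitespace and case normalization."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def deduplicate(texts):
    """Keep the first occurrence of each normalized text, preserving order."""
    seen = set()
    unique = []
    for text in texts:
        key = content_key(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

MD5 is used here only as a fast content fingerprint, not for security, so its known collision weaknesses are not a practical concern for deduplication.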
f2. Validation
- **Medical accuracy validation**: LLM-based consistency checking
- **Length control**: Configurable maximum character limits
- **Language detection**: English validation for content quality
g. Output Formats: SFT Format
- **Instruction**: Task description
- **Input**: User question/context
- **Output**: Model response
- **Metadata**: Augmentation tags and source information
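
For illustration, one SFT record with the four fields above could be built and serialized as a JSON Lines entry like this. The lowercase key names and the `augmentation_tags` and `source` metadata keys are assumptions, not a confirmed schema.

```python
import json


def make_sft_record(instruction, user_input, output, tags, source):
    """Assemble one SFT sample; the key names are hypothetical."""
    return {
        "instruction": instruction,
        "input": user_input,
        "output": output,
        "metadata": {"augmentation_tags": list(tags), "source": source},
    }


def to_jsonl_line(record):
    """Serialize a record as a single JSON line, keeping non-ASCII text readable."""
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    record = make_sft_record(
        instruction="Answer the medical question.",
        user_input="What can cause a persistent dry cough?",
        output="Common causes include asthma, reflux, and certain medications.",
        tags=["paraphrase_easy"],
        source="example_dataset",
    )
    print(to_jsonl_line(record))
```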