
a. LLM-Based Paraphrasing

  • Multi-model approach: Llama-8B (same architecture) and Gemini (Flash/Pro) models for reliability
  • Difficulty levels: Easy and Hard paraphrasing modes make effective use of the different models, with auditing
  • Medical context preservation: Maintains clinical terminology accuracy
  • Configurable ratios: User-defined augmentation percentages
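
As a rough illustration of a user-defined augmentation ratio, the sketch below picks a reproducible subset of samples to send to paraphrasing. The function name `select_for_augmentation`, the `ratio` and `seed` parameters, and the rounding rule are assumptions made for this example, not the project's actual interface.

```python
import random


def select_for_augmentation(samples, ratio, seed=0):
    """Return the subset of samples chosen for augmentation.

    ratio is the fraction of samples to augment (0.0 to 1.0).
    Hypothetical helper: the real pipeline may select differently.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)  # seeded so runs are reproducible
    k = round(len(samples) * ratio)
    chosen = set(rng.sample(range(len(samples)), k))
    # Preserve the original ordering of the selected samples.
    return [s for i, s in enumerate(samples) if i in chosen]
```

Seeding the generator keeps the augmented subset stable across runs, which makes it easier to compare datasets built with different ratios.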

b. Back-Translation Augmentation

  • Pivot languages: EN-VI-EN-VI...
  • Quality control: Length and semantic similarity validation
  • Meaning preservation: Maintains semantic accuracy through translation cycles
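
The length and similarity checks on a round-trip translation could look roughly like the sketch below. It uses token-set Jaccard overlap as a cheap lexical stand-in for semantic similarity; the document does not say which similarity measure is used, and the function name and all thresholds here are assumptions.

```python
import re


def token_set(text):
    """Lowercased alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def passes_back_translation_check(original, round_trip,
                                  min_len_ratio=0.6, max_len_ratio=1.6,
                                  min_overlap=0.5):
    """Accept a round-trip translation only if its length and wording stay close.

    Jaccard overlap is a lexical proxy, not true semantic similarity;
    the thresholds are illustrative defaults.
    """
    if not original.strip() or not round_trip.strip():
        return False
    len_ratio = len(round_trip) / len(original)
    if not min_len_ratio <= len_ratio <= max_len_ratio:
        return False
    a, b = token_set(original), token_set(round_trip)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= min_overlap
```

An embedding-based similarity score could replace the Jaccard step without changing the surrounding logic.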

c. Style Standardization

  • Clinical voice enforcement: Neutral, professional medical tone
  • Absolute language removal: Replaces guarantees with probabilistic language
  • Forum sign-off removal: Eliminates informal communication patterns
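
A minimal regex-based sketch of the two rewriting rules above (softening absolute claims and dropping trailing sign-offs) follows. The phrase lists, the replacement wording, and the 40-character tail limit are all hypothetical choices for illustration; the project may well rely on an LLM prompt instead.

```python
import re

# Hypothetical phrase list: absolute wording and a probabilistic substitute.
ABSOLUTE_PHRASES = [
    (re.compile(r"\bwill definitely\b", re.IGNORECASE), "is likely to"),
    (re.compile(r"\bis guaranteed to\b", re.IGNORECASE), "is likely to"),
    (re.compile(r"\balways\b", re.IGNORECASE), "often"),
    (re.compile(r"\bnever\b", re.IGNORECASE), "rarely"),
]

# A sign-off phrase that starts a line or follows sentence punctuation,
# with at most 40 trailing characters before the end of the text.
SIGN_OFF = re.compile(
    r"(?:^|(?<=[.!?]))\s*(?:hope this helps|take care|best wishes|regards)\b"
    r"[\s\S]{0,40}\Z",
    re.IGNORECASE | re.MULTILINE,
)


def standardize_style(text):
    """Strip a trailing forum sign-off, then soften absolute language."""
    text = SIGN_OFF.sub("", text)
    for pattern, replacement in ABSOLUTE_PHRASES:
        text = pattern.sub(replacement, text)
    return text.strip()
```

The heuristic is deliberately conservative: a sign-off phrase is removed only near the end of the text, so a sentence such as "Take care to avoid heavy lifting" earlier in a long answer is left alone.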

d. Multi-Variant Generation (for reasoning)

  • Answer variants: Concise, detailed, clinical, patient-friendly styles
  • Question variants: Clarifying, follow-up, symptom-focused, treatment-focused
  • Cross combinations: All question × answer variant combinations (up to 9 per sample)

e. Clinical Scenario Creation

  • Context variations: Emergency room, routine checkup, chronic conditions, family member perspectives
  • Enhanced diversity: Multiple reasoning paths for improved model training
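
The question × answer cross combinations listed under multi-variant generation can be sketched with `itertools.product`. The document gives a cap of 9 per sample but not how those 9 are chosen, so this example simply keeps the first pairs in iteration order; the constant and function names are assumptions.

```python
from itertools import islice, product

ANSWER_STYLES = ["concise", "detailed", "clinical", "patient-friendly"]
QUESTION_STYLES = ["clarifying", "follow-up", "symptom-focused", "treatment-focused"]


def cross_combinations(question_styles, answer_styles, max_combinations=9):
    """Pair every question style with every answer style, capped per sample.

    Truncating in iteration order is an assumption; the real selection
    strategy is not described in the source.
    """
    pairs = product(question_styles, answer_styles)
    return list(islice(pairs, max_combinations))
```

For example, `cross_combinations(QUESTION_STYLES, ANSWER_STYLES)` yields 9 `(question_style, answer_style)` tuples, starting with `("clarifying", "concise")`.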

f. Quality Assurance

f1. Data Cleaning

  • PHI removal: Email, phone, URL, IP address redaction
  • Deduplication: MD5-based content hashing with normalized comparison
  • Invalid response handling: Detection and retry logic for failed responses
  • Conversational element cleaning: Removal of greetings and non-medical content
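
The redaction and deduplication steps could be approximated as below. The regular expressions, the placeholder tags such as `[EMAIL]`, and the normalization rule (lowercasing plus whitespace collapsing) are assumptions for illustration; the phone pattern in particular only covers simple US-style numbers.

```python
import hashlib
import re

# Order matters: URLs are redacted first so their contents are not
# partially matched by the later patterns.
PHI_PATTERNS = [
    (re.compile(r"https?://\S+"), "[URL]"),
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    (re.compile(r"\(?\b\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}\b"), "[PHONE]"),
]


def redact_phi(text):
    """Replace URLs, emails, IP addresses, and phone numbers with tags."""
    for pattern, tag in PHI_PATTERNS:
        text = pattern.sub(tag, text)
    return text


def content_hash(text):
    """MD5 of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def deduplicate(texts):
    """Keep the first occurrence of each normalized text."""
    seen = set()
    unique = []
    for text in texts:
        digest = content_hash(text)
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

MD5 is used here only as a fast content fingerprint, not for security; any stable hash would serve the same purpose.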

f2. Validation

  • Medical accuracy validation: LLM-based consistency checking
  • Length control: Configurable maximum character limits
  • Language detection: English validation for content quality
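
The length and language checks might be combined into a single validator along these lines. The stopword-ratio test is a crude stand-in for a real language detector, and `validate_sample`, `looks_english`, the stopword list, and the default limits are all hypothetical; the LLM-based consistency check is not shown.

```python
import re

# Small illustrative stopword list; a real detector would be more robust.
ENGLISH_STOPWORDS = {"the", "and", "is", "of", "to", "in", "a", "for", "with", "that"}


def looks_english(text, min_stopword_ratio=0.1):
    """Heuristic: enough of the words are common English stopwords."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= min_stopword_ratio


def validate_sample(text, max_chars=2000):
    """Return a list of issue labels; an empty list means the sample passes."""
    issues = []
    if len(text) > max_chars:
        issues.append("too_long")
    if not looks_english(text):
        issues.append("not_english")
    return issues
```

Returning issue labels rather than a single boolean makes it possible to log why a sample was rejected.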

g. Output Formats: SFT Format

  • Instruction: Task description
  • Input: User question/context
  • Output: Model response
  • Metadata: Augmentation tags and source information
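
A record in this SFT layout could be built and serialized as one JSON line per sample, as in the sketch below. The lowercase key names, the nested `metadata` fields, and the helper names are assumptions; only the four top-level fields come from the list above.

```python
import json


def make_sft_record(instruction, user_input, output, source, augmentation_tags):
    """Assemble one SFT sample with the four fields listed above."""
    return {
        "instruction": instruction,
        "input": user_input,
        "output": output,
        # Hypothetical metadata layout: source name plus augmentation tags.
        "metadata": {"source": source, "augmentation": list(augmentation_tags)},
    }


def to_jsonl_line(record):
    """Serialize a record as a single JSON line, keeping non-ASCII text readable."""
    return json.dumps(record, ensure_ascii=False)
```

For instance, `to_jsonl_line(make_sft_record("Answer the medical question.", "What causes anemia?", "Common causes include iron deficiency.", "example_source", ["paraphrase"]))` produces a line that can be appended to a `.jsonl` training file.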