a. LLM-Based Paraphrasing
- Multi-model approach: Llama-8B (same architecture) and Gemini (Flash/Pro) models for reliability
- Difficulty levels: Easy and Hard paraphrasing modes, so that the different models are used effectively, with auditing
- Medical context preservation: Maintains clinical terminology accuracy
- Configurable ratios: User-defined augmentation percentages
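As an illustration of the configurable ratio, here is a minimal sketch that picks a user-defined fraction of samples to send to a paraphrasing model. The helper name `select_for_augmentation`, the `ratio` and `seed` parameters, and the use of seeded random sampling are all assumptions made for this example, not the project's actual code.

```python
import random


def select_for_augmentation(samples, ratio, seed=0):
    """Return sorted indices of the samples chosen for augmentation.

    Hypothetical helper: `ratio` is the user-defined augmentation
    fraction in [0, 1]; `seed` makes the selection reproducible.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)
    count = round(len(samples) * ratio)
    # Sample without replacement so no item is augmented twice.
    return sorted(rng.sample(range(len(samples)), count))


questions = [f"question {i}" for i in range(10)]
chosen = select_for_augmentation(questions, ratio=0.3)
print(len(chosen))
```

With ten samples and a ratio of 0.3, three indices are selected; the same seed yields the same selection on every run.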
b. Back-Translation Augmentation
- Pivot languages: EN-VI-EN-VI...
- Quality control: Length and semantic similarity validation
- Meaning preservation: Maintains semantic accuracy through translation cycles
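The quality-control step above could look roughly like the sketch below. It assumes the back-translated text is already available and only shows the acceptance check. The thresholds are made-up defaults, and `difflib.SequenceMatcher` is a purely lexical stand-in for whatever semantic similarity measure the pipeline actually uses (an embedding-based score would be a closer match to "semantic").

```python
from difflib import SequenceMatcher


def passes_quality_check(original, back_translated,
                         min_len_ratio=0.6, max_len_ratio=1.6,
                         min_similarity=0.5):
    """Accept a back-translation only if length and similarity look sane.

    All thresholds are illustrative assumptions, not project settings.
    """
    if not original or not back_translated:
        return False
    # Length validation: reject outputs that shrank or grew too much.
    len_ratio = len(back_translated) / len(original)
    if not (min_len_ratio <= len_ratio <= max_len_ratio):
        return False
    # Similarity validation: character-level ratio as a lexical proxy.
    similarity = SequenceMatcher(
        None, original.lower(), back_translated.lower()
    ).ratio()
    return similarity >= min_similarity


print(passes_quality_check("Take the medication twice daily.",
                           "Take the medicine twice a day."))
print(passes_quality_check("Take the medication twice daily.", "Yes."))
```

The first call is accepted because the texts are close in length and wording; the second is rejected by the length check alone.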
c. Style Standardization
- Clinical voice enforcement: Neutral, professional medical tone
- Absolute language removal: Replaces guarantees with probabilistic language
- Forum sign-off removal: Eliminates informal communication patterns
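A rule-based version of two of these steps (absolute language removal and sign-off removal) might be sketched as follows. The phrase lists and replacement wording are hypothetical examples chosen for illustration; the actual rules used by the pipeline are not documented here. The sketch is deliberately naive: it does not restore capitalization at sentence starts and only removes a sign-off found on the last one or two lines.

```python
import re

# Hypothetical mapping from absolute phrasing to probabilistic phrasing.
ABSOLUTE_REPLACEMENTS = {
    r"\bwill definitely\b": "is likely to",
    r"\bguaranteed to\b": "likely to",
    r"\balways\b": "usually",
    r"\bnever\b": "rarely",
}

# A sign-off phrase that starts a line or follows sentence-ending
# punctuation, plus at most one trailing line (for example a name).
SIGN_OFF = re.compile(
    r"(?:^|(?<=[.!?]))\s*(?:regards|best wishes|take care|hope this helps)\b"
    r"[^\n]*(?:\n[^\n]*)?\Z",
    re.IGNORECASE | re.MULTILINE,
)


def standardize_style(text):
    """Replace absolute claims and strip a trailing forum sign-off."""
    for pattern, replacement in ABSOLUTE_REPLACEMENTS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    text = SIGN_OFF.sub("", text)
    return text.strip()


print(standardize_style(
    "This treatment will definitely cure the infection. Take care, Dr. Smith"
))
```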
d. Multi-Variant Generation (for reasoning)
- Answer variants: Concise, detailed, clinical, patient-friendly styles
- Question variants: Clarifying, follow-up, symptom-focused, treatment-focused
- Cross combinations: All question × answer variant combinations (up to 9 per sample)
e. Clinical Scenario Creation
- Context variations: Emergency room, routine checkup, chronic conditions, family member perspectives
- Enhanced diversity: Multiple reasoning paths for improved model training
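The cross-combination step from the multi-variant generation section can be sketched with a Cartesian product. The function name `cross_combinations` and the `max_pairs` parameter are assumptions; the cap of 9 comes from the text, but enforcing it by simple truncation is a guess at the behavior.

```python
from itertools import product


def cross_combinations(question_variants, answer_variants, max_pairs=9):
    """Pair every question variant with every answer variant.

    Truncating to `max_pairs` is an assumed way of enforcing the
    "up to 9 per sample" limit.
    """
    pairs = list(product(question_variants, answer_variants))
    return pairs[:max_pairs]


questions = ["clarifying", "follow-up", "symptom-focused"]
answers = ["concise", "detailed", "clinical"]
for question_style, answer_style in cross_combinations(questions, answers):
    print(question_style, "x", answer_style)
```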
f. Quality Assurance
f1. Data Cleaning
- PHI removal: Email, phone, URL, IP address redaction
- Deduplication: MD5-based content hashing with normalized comparison
- Invalid response handling: Detection and retry logic for failed responses
- Conversational element cleaning: Removal of greetings and non-medical content
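The PHI redaction and deduplication steps above can be approximated as below. The regular expressions, placeholder tags, and normalization (lowercasing and whitespace collapsing) are assumptions for this sketch; the patterns are intentionally simple and would both miss some identifiers and occasionally flag long digit sequences that are not phone numbers.

```python
import hashlib
import re

# Illustrative patterns only. Order matters: IP addresses are replaced
# before the looser phone pattern gets a chance to match them.
PHI_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"https?://\S+|www\.\S+"), "[URL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]


def redact_phi(text):
    """Replace emails, URLs, IP addresses, and phone numbers with tags."""
    for pattern, tag in PHI_PATTERNS:
        text = pattern.sub(tag, text)
    return text


def content_hash(text):
    """MD5 digest of the text after case and whitespace normalization."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def deduplicate(texts):
    """Keep the first occurrence of each normalized text."""
    seen = set()
    unique = []
    for text in texts:
        digest = content_hash(text)
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


print(redact_phi("Email me at jane.doe@example.com or call 555-123-4567."))
print(deduplicate(["What is asthma?", "what  is ASTHMA?", "What is COPD?"]))
```

MD5 is used here only as a fast content fingerprint, as in the text; it is not relied on for any security property.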
f2. Validation
- Medical accuracy validation: LLM-based consistency checking
- Length control: Configurable maximum character limits
- Language detection: English validation for content quality
g. Output Formats: SFT Format
- Instruction: Task description
- Input: User question/context
- Output: Model response
- Metadata: Augmentation tags and source information
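A single SFT record with these four fields might be assembled as in the sketch below. The lowercase key names, the nested `metadata` layout, the tag values, and the `example_dataset` source name are assumptions for illustration; the exact schema is not specified above.

```python
import json


def to_sft_record(instruction, question, answer, tags, source):
    """Build one SFT-style record (hypothetical schema)."""
    return {
        "instruction": instruction,
        "input": question,
        "output": answer,
        "metadata": {"augmentation": list(tags), "source": source},
    }


record = to_sft_record(
    instruction="Answer the medical question accurately and professionally.",
    question="What are common symptoms of iron deficiency?",
    answer="Common symptoms may include fatigue, pale skin, and shortness of breath.",
    tags=["paraphrase", "style_standardized"],
    source="example_dataset",
)
# One JSON object per line is a common way to store SFT data (JSONL).
print(json.dumps(record, ensure_ascii=False))
```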