a. LLM-Based Paraphrasing 
- **Multi-model approach**: Llama-8B (same architecture) and Gemini (Flash/Pro) models for reliability 
- **Difficulty levels**: Easy and Hard paraphrasing modes, so that different models can be used effectively, with auditing
- **Medical context preservation**: Maintains clinical terminology accuracy 
- **Configurable ratios**: User-defined augmentation percentages 
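
As a rough illustration of the configurable ratio, the sketch below picks a random fraction of samples to send to the paraphraser. The function name `pick_for_augmentation`, the `seed` parameter, and the rounding rule are assumptions for illustration, not the project's actual API.

```python
import random

def pick_for_augmentation(samples, ratio, seed=0):
    """Return a random subset of `samples` sized at `ratio` of the total.

    Hypothetical helper: the name, seed handling, and rounding are assumptions.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)          # seeded so runs are reproducible
    k = round(len(samples) * ratio)    # number of samples to augment
    return rng.sample(samples, k)

selected = pick_for_augmentation(list(range(10)), ratio=0.3)
print(len(selected))
```

Seeding the generator keeps the selected subset stable across runs, which makes augmented datasets easier to compare.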

b. Back-Translation Augmentation 
- **Pivot languages**: EN-VI-EN-VI...
- **Quality control**: Length and semantic similarity validation
- **Meaning preservation**: Maintains semantic accuracy through translation cycles 
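
The quality-control step could look roughly like the sketch below, which compares the original text with its round-trip translation. Token-set Jaccard overlap stands in for semantic similarity here (an embedding-based score would be a closer fit), and the function names and thresholds are assumptions for illustration.

```python
import re

def token_set(text):
    """Lowercased alphanumeric tokens of `text`, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def passes_quality_control(original, round_trip,
                           min_len_ratio=0.5, max_len_ratio=2.0,
                           min_overlap=0.5):
    """Accept a back-translated text only if length and overlap look sane.

    Hypothetical thresholds; Jaccard overlap is a crude proxy for
    semantic similarity, not the project's actual metric.
    """
    if not original or not round_trip:
        return False
    len_ratio = len(round_trip) / len(original)
    if not (min_len_ratio <= len_ratio <= max_len_ratio):
        return False
    a, b = token_set(original), token_set(round_trip)
    if not a or not b:
        return False
    jaccard = len(a & b) / len(a | b)
    return jaccard >= min_overlap

print(passes_quality_control(
    "Take the medication twice daily with food.",
    "Take the medicine twice daily with food.",
))
```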

c. Style Standardization 
- **Clinical voice enforcement**: Neutral, professional medical tone 
- **Absolute language removal**: Replaces guarantees with probabilistic language 
- **Forum sign-off removal**: Eliminates informal communication patterns 
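
A minimal regex-based sketch of two of these rules follows. The phrase lists are hypothetical and deliberately short, and the replacements do not preserve capitalization; a production version would likely rely on a reviewed list or an LLM pass.

```python
import re

# Hypothetical phrase lists for illustration only.
ABSOLUTE_REPLACEMENTS = [
    (re.compile(r"\bwill definitely\b", re.IGNORECASE), "is likely to"),
    (re.compile(r"\bis guaranteed to\b", re.IGNORECASE), "is likely to"),
    (re.compile(r"\balways\b", re.IGNORECASE), "usually"),
]

# Matches a trailing sentence that starts with a common forum sign-off.
SIGN_OFF = re.compile(
    r"\s*\b(?:hope this helps|best regards|regards|take care)\b[^.!?]*[.!?]?\s*$",
    re.IGNORECASE,
)

def standardize_style(text):
    """Swap absolute claims for probabilistic wording, then drop a sign-off."""
    for pattern, replacement in ABSOLUTE_REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return SIGN_OFF.sub("", text).strip()

print(standardize_style(
    "This treatment will definitely cure the infection. Hope this helps!"
))
```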

d. Multi-Variant Generation (for reasoning) 
- **Answer variants**: Concise, detailed, clinical, patient-friendly styles 
- **Question variants**: Clarifying, follow-up, symptom-focused, treatment-focused 
- **Cross combinations**: All question × answer variant combinations (up to 9 per sample)

e. Clinical Scenario Creation
- **Context variations**: Emergency room, routine checkup, chronic conditions, family member perspectives
- **Enhanced diversity**: Multiple reasoning paths for improved model training 
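
The question × answer cross combinations from the multi-variant step can be sketched with `itertools.product`. The record field names and the way the cap of 9 is applied (simple truncation) are assumptions; the source only states "up to 9 per sample".

```python
from itertools import islice, product

def cross_combinations(question_variants, answer_variants, limit=9):
    """Pair every question variant with every answer variant, capped at `limit`.

    Both arguments map a style name to the generated text.
    Hypothetical helper: field names and truncation rule are assumptions.
    """
    pairs = product(question_variants.items(), answer_variants.items())
    return [
        {
            "question_style": q_style,
            "question": question,
            "answer_style": a_style,
            "answer": answer,
        }
        for (q_style, question), (a_style, answer) in islice(pairs, limit)
    ]

questions = {"clarifying": "Q1", "follow-up": "Q2", "symptom-focused": "Q3"}
answers = {"concise": "A1", "detailed": "A2", "clinical": "A3"}
print(len(cross_combinations(questions, answers)))
```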

f. Quality Assurance 
f1. Data Cleaning 
- **PHI removal**: Email, phone, URL, IP address redaction 
- **Deduplication**: MD5-based content hashing with normalized comparison 
- **Invalid response handling**: Detection and retry logic for failed responses 
- **Conversational element cleaning**: Removal of greetings and non-medical content 
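
The redaction and deduplication steps might be sketched as below. The regular expressions, placeholder tokens, and normalization (lowercasing plus whitespace collapsing) are assumptions for illustration; the actual patterns used by the pipeline are not shown in this document.

```python
import hashlib
import re

# Hypothetical patterns, applied in order. IP runs before phone because
# the loose phone pattern would otherwise also match dotted IP addresses.
PHI_PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[URL]", re.compile(r"\bhttps?://\S+|\bwww\.\S+", re.IGNORECASE)),
    ("[IP]", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("[PHONE]", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]

def redact_phi(text):
    """Replace emails, URLs, IP addresses, and phone numbers with placeholders."""
    for placeholder, pattern in PHI_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

def content_key(text):
    """MD5 hex digest of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def deduplicate(texts):
    """Keep the first occurrence of each normalized text, preserving order."""
    seen, unique = set(), []
    for text in texts:
        key = content_key(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

print(redact_phi("Contact jane.doe@example.com or 555-123-4567"))
print(deduplicate(["Take  rest.", "take rest.", "Drink water."]))
```

MD5 is used here only as a fast content fingerprint, not for security.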

f2. Validation 
- **Medical accuracy validation**: LLM-based consistency checking 
- **Length control**: Configurable maximum character limits 
- **Language detection**: English validation for content quality 

g. Output Formats: SFT Format 
- **Instruction**: Task description 
- **Input**: User question/context 
- **Output**: Model response
- **Metadata**: Augmentation tags and source information
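
A single SFT record with these four fields could be built and serialized as follows. The top-level keys mirror the list above; the keys inside `metadata` (`augmentations`, `source`) and the helper names are assumptions, since the exact schema is not given here.

```python
import json

def make_sft_record(instruction, user_input, output, augmentation_tags, source):
    """Build one SFT example. The metadata sub-keys are hypothetical."""
    return {
        "instruction": instruction,
        "input": user_input,
        "output": output,
        "metadata": {
            "augmentations": list(augmentation_tags),
            "source": source,
        },
    }

def to_jsonl_line(record):
    """Serialize a record as one JSON Lines row, keeping non-ASCII text readable."""
    return json.dumps(record, ensure_ascii=False)

record = make_sft_record(
    instruction="Answer the medical question.",
    user_input="What can cause a persistent dry cough?",
    output="Common causes include asthma, acid reflux, and some medications.",
    augmentation_tags=["paraphrase_easy"],
    source="example_dataset",
)
print(to_jsonl_line(record))
```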