| You are a medical domain expert and dataset generator for claim verification tasks. |
|
|
| TASK: |
| Given a medical passage, generate a high-quality synthetic dataset for training a medical claim verification model. |
|
|
| GOAL: |
| 1. Extract multiple atomic subclaims from the passage. |
| 2. Create both kinds of subclaim: |
| - supported subclaims (fully supported by the text) |
| - not_supported subclaims (contradicted by the passage, not mentioned in it, or partially incorrect) |
| 3. Ensure diversity in claim types: |
| - definition claims |
| - causal claims |
| - treatment effectiveness claims |
| - dosage-related claims |
| - statistical claims |
| - risk factor claims |
| - diagnostic claims |
| - prognosis claims |
| 4. Claims must be medically realistic and plausible. |
| 5. Do NOT hallucinate extreme or absurd facts. |
| 6. Keep claims atomic (single fact per claim). |
| 7. Do not copy sentences verbatim from the passage; rephrase them. |
| 8. Maintain balanced classes (~50% supported, ~50% not_supported). |
|
|
| OUTPUT FORMAT (STRICT JSON): |
| |
| { |
| "passage_id": "<unique_id>", |
| "passage": "<original passage>", |
| "subclaims": [ |
| { |
| "claim_id": "C1", |
| "claim_text": "<atomic subclaim>", |
| "label": "<supported | not_supported>" |
| } |
| ] |
| } |
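
A generated record can be checked against the format above before it enters a training set. The following is a minimal sketch, not part of the prompt itself: the function name `validate_record` and the specific error messages are illustrative assumptions, and only the field names and the two label values come from the format above.

```python
import json

# The two label values named in the output format.
ALLOWED_LABELS = {"supported", "not_supported"}


def validate_record(raw: str) -> list[str]:
    """Return a list of schema problems; an empty list means none were found."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value must be an object"]

    problems = []
    for key in ("passage_id", "passage"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' must be a non-empty string")

    subclaims = record.get("subclaims")
    if not isinstance(subclaims, list) or not subclaims:
        problems.append("'subclaims' must be a non-empty list")
        return problems

    seen_ids = set()
    for i, claim in enumerate(subclaims):
        if not isinstance(claim, dict):
            problems.append(f"subclaims[{i}] must be an object")
            continue
        claim_id = claim.get("claim_id")
        if not isinstance(claim_id, str) or not claim_id:
            problems.append(f"subclaims[{i}] needs a string 'claim_id'")
        elif claim_id in seen_ids:
            problems.append(f"duplicate claim_id '{claim_id}'")
        else:
            seen_ids.add(claim_id)
        text = claim.get("claim_text")
        if not isinstance(text, str) or not text.strip():
            problems.append(f"subclaims[{i}] needs a non-empty 'claim_text'")
        label = claim.get("label")
        if not isinstance(label, str) or label not in ALLOWED_LABELS:
            problems.append(f"subclaims[{i}] has an invalid 'label'")
    return problems
```

Calling `validate_record(model_output)` on the raw model response and rejecting any record with a non-empty result is one way to enforce the "STRICT JSON" requirement; the uniqueness check on `claim_id` is an added assumption, since the format only shows a single example ID.
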
|
|
| LABELING RULES: |
| |
| SUPPORTED: |
| - The claim must be directly entailed by the passage. |
|
|
| NOT_SUPPORTED cases: |
| - Contradiction: the passage states the opposite. |
| - Missing_info: the passage does not mention the claim. |
| - Exaggeration: the claim is stronger than the passage's statement. |
| - Wrong_dosage: a dose or other numeric value from the passage is altered. |
| - Wrong_population: the claim names the wrong age, gender, or group. |
| - Temporal_distortion: the claim gives the wrong duration or timeline. |
| - Fabricated_statistic: the claim cites a number that is not in the passage. |
|
|
| QUALITY CONTROL: |
| - Minimum 12 subclaims per passage. |
| - Draw not_supported claims from several of the reasons listed above, not just one. |
| - Keep every claim medically realistic, including the not_supported ones. |
| - Ensure linguistic diversity in claims. |
| - Do not include explanations outside JSON. |
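
Some of the quality-control rules can be checked mechanically after generation. The sketch below is an illustration under stated assumptions, not part of the prompt: `quality_issues` is a hypothetical helper, the 40-60% band is one possible reading of "~50%", and the verbatim check is a simple substring match that will miss near-copies.

```python
MIN_SUBCLAIMS = 12        # from the quality-control list
BALANCE_TOLERANCE = 0.10  # assumption: "~50%" is read as 40-60% supported


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())


def quality_issues(record: dict) -> list[str]:
    """Return human-readable quality problems for one parsed record."""
    issues = []
    claims = record["subclaims"]

    # Minimum number of subclaims per passage.
    if len(claims) < MIN_SUBCLAIMS:
        issues.append(f"only {len(claims)} subclaims; need at least {MIN_SUBCLAIMS}")

    # Class balance between supported and not_supported.
    if claims:
        supported = sum(1 for c in claims if c["label"] == "supported")
        share = supported / len(claims)
        if abs(share - 0.5) > BALANCE_TOLERANCE:
            issues.append(f"supported share {share:.0%} is outside the assumed 40-60% band")

    # Claims should rephrase the passage, not copy it.
    passage = _normalize(record["passage"])
    for c in claims:
        text = _normalize(c["claim_text"]).rstrip(".")
        if text and text in passage:
            issues.append(f"{c['claim_id']} copies the passage verbatim")
    return issues
```

Rules such as medical plausibility, diversity of not_supported reasons, and linguistic diversity are not covered here; they need human or model review.
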
|
|