readCtrl_lambda / prompts /support_check_data_generate

mshahidul

Initial commit of readCtrl code without large models


	You are a medical domain expert and dataset generator for claim verification tasks.

	TASK:
	Given a medical passage, generate a high-quality synthetic dataset for training a medical claim verification model.

	GOAL:
	1. Extract multiple atomic subclaims from the passage.
	2. Create both:
	- supported subclaims (fully supported by the text)
	- not_supported subclaims (contradicted OR not mentioned OR partially incorrect)
	3. Ensure diversity in claim types:
	- definition claims
	- causal claims
	- treatment effectiveness claims
	- dosage-related claims
	- statistical claims
	- risk factor claims
	- diagnostic claims
	- prognosis claims
	4. Claims must be medically realistic and plausible.
	5. Do NOT hallucinate extreme or absurd facts.
	6. Keep claims atomic (single fact per claim).
	7. Do not copy sentences verbatim from the passage; rephrase them.
	8. Maintain balanced classes (~50% supported, ~50% not_supported).

	OUTPUT FORMAT (STRICT JSON):

	{
	"passage_id": "<unique_id>",
	"passage": "<original passage>",
	"subclaims": [
	{
	"claim_id": "C1",
	"claim_text": "<atomic subclaim>",
	"label": "supported" | "not_supported"
	}
	]
	}

	LABELING RULES:

	SUPPORTED:
	- The claim must be directly entailed by the passage.

	NOT_SUPPORTED cases:
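	- Contradiction: the passage states the opposite of the claim
	- Missing_info: the claim is not mentioned in the passage
	- Exaggeration: the claim is stronger than the statement in the passage
	- Wrong_dosage: a dose or other numeric value is altered
	- Wrong_population: the claim names the wrong age, gender, or patient group
	- Temporal_distortion: the claim gives the wrong duration or timeline
	- Fabricated_statistic: the claim cites a number not present in the passage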

	QUALITY CONTROL:
	- Minimum 12 subclaims per passage.
	- Include diverse not_supported reasons.
	- Keep every claim medically plausible, including not_supported ones.
	- Ensure linguistic diversity in claims.
	- Do not include explanations outside JSON.
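
The output format and quality-control rules above can be checked mechanically on each model response. The following is a minimal sketch, not part of the original prompt or repository: the function name `check_record` and the `balance_tolerance` parameter are hypothetical, and the 0.2 tolerance is an assumed reading of "~50%".

```python
import json

VALID_LABELS = {"supported", "not_supported"}
MIN_SUBCLAIMS = 12  # taken from the QUALITY CONTROL section of the prompt


def check_record(raw: str, balance_tolerance: float = 0.2) -> list[str]:
    """Return a list of problems found in one generated record (empty if none).

    balance_tolerance is an assumption: the prompt only asks for roughly
    50% supported claims, so the allowed deviation is a free choice.
    """
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for key in ("passage_id", "passage", "subclaims"):
        if key not in record:
            problems.append(f"missing key: {key}")

    subclaims = record.get("subclaims")
    if not isinstance(subclaims, list):
        problems.append("subclaims is not a list")
        return problems

    if len(subclaims) < MIN_SUBCLAIMS:
        problems.append(f"fewer than {MIN_SUBCLAIMS} subclaims: {len(subclaims)}")

    seen_ids = set()
    supported = 0
    for index, claim in enumerate(subclaims):
        if not isinstance(claim, dict):
            problems.append(f"subclaim {index} is not an object")
            continue
        claim_id = claim.get("claim_id")
        if claim_id in seen_ids:
            problems.append(f"duplicate claim_id: {claim_id}")
        seen_ids.add(claim_id)
        text = claim.get("claim_text")
        if not isinstance(text, str) or not text.strip():
            problems.append(f"subclaim {index} has no claim_text")
        label = claim.get("label")
        if label not in VALID_LABELS:
            problems.append(f"subclaim {index} has invalid label: {label!r}")
        elif label == "supported":
            supported += 1

    # Class balance: share of supported claims should stay near one half.
    if subclaims:
        share = supported / len(subclaims)
        if abs(share - 0.5) > balance_tolerance:
            problems.append(f"unbalanced labels: {share:.0%} supported")
    return problems
```

A response passes when `check_record(response_text)` returns an empty list. The check covers only structure, count, and balance; whether each label is actually correct for the passage still needs a human or model review.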