	-- Build MediBill-Env pitch deck directly in Keynote via AppleScript.
	-- Run: osascript scripts/build_deck.applescript

	on run
	set slidesData to {}

	-- Slide 1
	set end of slidesData to {¬
	title:"180 minutes to close the claim.", ¬
	bullets:"IRDAI mandate (May 2024): 1 hour pre-auth, 3 hours discharge
	Miss the 3-hour clock — insurer eats the cost from shareholder funds
	FY24: ~Rs 26,000 cr in health claims disallowed
	~13% of pre-auths still miss the window", ¬
	notes:"In India, IRDAI gives hospitals one hour for pre-authorization and three hours for final discharge on every cashless claim. Miss the three-hour clock, and the overrun comes out of the insurer's shareholder fund. Industry estimates put FY24 disallowed health-claim value around twenty-six thousand crore rupees. Roughly thirteen percent of pre-auths still miss the one-hour window. The bottleneck is a human coder racing a clock, and the policies keep changing on them."}

	-- Slide 2
	set end of slidesData to {¬
	title:"Why agents fail here", ¬
	bullets:"Rules engines handle static schema validation
	They do NOT handle staleness: yesterday's correct rule is wrong today
	Agents that imitate one month's trajectories fail quietly the next month
	We need an agent that knows to re-check before submitting", ¬
	notes:"Most agent benchmarks check whether the agent can fill a form correctly. That is schema validation, and rules engines already do it. The real failure mode in this domain is staleness — the policy changed, the agent did not notice, the claim is wrong. An agent that learned by imitating last month's expert trajectories will reproduce last month's rules. We want an agent that knows to re-check before submitting."}

	-- Slide 3
	set end of slidesData to {¬
	title:"MediBill-Env: 5 tools, 3 task tiers, 6-axis grader", ¬
	bullets:"Tools: ehr_query, insurance_lookup, coding_engine, escalate_to_human, submit_claim
	Tasks: easy_cashless, medium_multi_payer, hard_drift
	6-axis deterministic grader with disjoint identity/policy partition
	Asserted at module import time — no overlap possible
	Five reward-hacking attacks neutralised in the gate suite", ¬
	notes:"The agent has five tools: query the patient record, look up the insurer's active policy, write fields, escalate when uncertain, and submit. Three task tiers — easy, medium, and hard, where the policy mutates mid-episode. The grader has six axes with a disjoint field partition asserted at import time, so identity correctness and policy compliance never overlap."}

	-- Slide 4
	set end of slidesData to {¬
	title:"Silent multi-field policy drift", ¬
	bullets:"Active policy mutates 3-7 fields at a seed-randomized step
	No announcement — no observation flag, no metadata key, no event
	submit_claim is graded against the policy AT SUBMIT TIME
	Only path to new rules: a fresh insurance_lookup after the drift step
	12 claim types x 3 tiers x random drift = 12k+ unique trajectories
	Scripted baseline: 1.00 on easy, 0.7611 on drift — the 0.24 gap is the signal", ¬
	notes:"On hard_drift tasks the active policy mutates mid-episode across three to seven fields — pre-auth thresholds, required signatures, narrative requirements, discharge attachment rules. Multi-field mutation, not a boolean. No announcement, no flag, no event. The only path to the new rules is a fresh insurance_lookup after the unknown drift step. Submissions are graded against the policy at submit time. Twelve claim types, three tiers, seed-randomized drift = over twelve thousand unique trajectories. Scripted baseline drops from one-zero on easy to zero-seven-six on drift. That zero-two-four gap is the trainable signal."}

	-- Slide 5 — HEADLINE
	set end of slidesData to {¬
	title:"Base 0.00 to SFT v2 0.9999 avg. Teacher beat GRPO saturation.", ¬
	bullets:"easy_cashless: Base 0.0000 to SFT v2 1.0000 (lift +1.000)
	medium_multi_payer: Base 0.0000 to SFT v2 1.0000 (lift +1.000)
	hard_drift: Base 0.0000 to SFT v2 0.9996 (lift +0.9996)
	AVERAGE: Base 0.0000 to SFT v2 0.9999 (lift +0.9999)

	Iteration: SFT v1 0.7573 -> GRPO 0.7575 (saturated) -> SFT v2 0.9996
	Pivot was teacher engineering, not RL. +0.2423 lift in 33 min retrain.
	5 exploit patterns neutralised, all score <= no_op.", ¬
	notes:"Six bars on hard_drift, left to right: base Qwen at zero, random at eleven, no-op at eight, scripted at seventy-six, SFT v1 at seventy-six, our final SFT v2 at zero-point-nine-nine-nine-six. Untrained, the 3B model scores literal zero — zero parse failures across fifteen episodes — it can format JSON, it just has no policy reasoning. SFT v1 hit scripted-teacher parity. Then GRPO with five reward functions saturated — delta two ten-thousandths, gradient ten-to-minus-seven. Diagnosis: SFT extracts everything the rewards can grip on. So we engineered a stronger teacher — Scripted plus plus, which escalates ambiguous cells and does a fresh insurance lookup before each submit. Ninety new trajectories, thirty-three minutes of retraining. SFT v2: one-zero-zero on easy and medium, zero-point-nine-nine-nine-six on hard. Average lift base to SFT v2: zero-point-nine-nine-nine-nine."}

	-- Slide 6 — Close
	set end of slidesData to {¬
	title:"Environment-first submission under Theme 3.1", ¬
	bullets:"Shipping: env + grader + 5-attack exploit suite + scripted baseline + SFT v2 (0.9999)
	Two of six axes are RL-only by design (abstention_quality, drift_bonus)
	Code enforces it: disjoint partition at import, 5 exploit tests, prompt-version handshake
	Theme 3.1 — DataOps Copilot. Enterprise reasoning under shifting business rules.
	Repo: github.com/Algoace1403/METAHackthon2026
	HF Space (LIVE): huggingface.co/spaces/Anuj424614/medibill-env", ¬
	notes:"We submit under Theme 3.1, DataOps Copilot. Shipping today: the environment, six-axis deterministic grader, silent drift mechanic, five-attack exploit suite, scripted baseline, and a trained SFT v2 adapter that hits zero-point-nine-nine-nine-nine average across all three difficulty tiers — table on slide five. Two axes — abstention and drift_bonus — are RL-only by design. Disjoint partition at import, five exploit tests, prompt-version handshake. Repo and live HF Space on screen. Thank you."}
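
	-- Optional sanity check (editorial sketch, not part of the original script):
	-- assuming every slide needs a non-empty title and bullet text, stop before
	-- Keynote launches if a record is missing either one.
	repeat with d in slidesData
		if (title of d) is "" or (bullets of d) is "" then error "slidesData contains a slide with an empty title or bullets"
	end repeat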

	tell application "Keynote"
		activate
		-- Create a new document with the White theme
		set thisDoc to make new document with properties {document theme:theme "White"}
		delay 1 -- give Keynote a moment to finish creating the document

		-- The new document already has one slide. Reuse it for the first entry,
		-- then append a new slide for each remaining entry.
		set slideCount to 0
		repeat with d in slidesData
			set slideCount to slideCount + 1
			set theTitle to title of d
			set theBullets to bullets of d
			set theNotes to notes of d

			if slideCount is 1 then
				-- Use the existing first slide, switched to the "Title & Bullets" layout
				tell thisDoc
					set base slide of slide 1 to master slide "Title & Bullets"
					set theSlide to slide 1
				end tell
			else
				tell thisDoc
					set theSlide to make new slide with properties {base slide:master slide "Title & Bullets"}
				end tell
			end if

			-- Fill in the title, the bullet body, and the presenter notes
			tell theSlide
				set object text of default title item to theTitle
				set object text of default body item to theBullets
				set presenter notes to theNotes
			end tell
		end repeat

		-- Save the document (hard-coded absolute path; adjust it for your machine)
		set savePath to (POSIX file "/Users/aks/METAHackthon2026/docs/medibill_pitch.key") as text
		save thisDoc in file savePath
	end tell

	return "Deck built. Saved to docs/medibill_pitch.key"
	end run