bshepp committed · Commit 5d53fbf · 1 Parent(s): 1f36481

docs: full documentation vs reality audit

- README.md: Updated Tech Stack (MedGemma via HF Endpoint, not Gemma 3 via Google AI Studio),
added 50-case MedQA validation results, updated Quick Start for HF token, added
deploy guide to doc index and project structure, added analyze/check_progress to tree
- docs/architecture.md: Updated LLM label, model config, API dependencies, configuration
table defaults, system prompt section (TGI supports system role natively), known limitations
- docs/test_results.md: Added 50-case MedQA results with question-type breakdown, updated header
- docs/writeup_draft.md: Added MedQA 50-case results to performance table, updated validation
section from 'in progress' to actual numbers, updated latency note
- DEVELOPMENT_LOG.md: Fixed environment config table defaults, added Phase 11 (MedGemma HF
Endpoint deployment) and Phase 12 (50-case MedQA validation with full results/analysis)
- TODO.md: Complete rewrite - marked validation as done, updated model info from gemma-3 to
MedGemma, updated project state table, reprioritized tasks (video + writeup now #1-2)
- SUBMISSION_GUIDE.md: Updated days-remaining estimate
- docs/deploy_medgemma_hf.md: Fixed recommended MAX_INPUT/MAX_TOTAL tokens to match actual
deployed values (12288/16384), added note about default 4096 causing 422 errors
- SECURITY.md: Updated API key references (HF token + Google AI Studio)
- orchestrator.py: Fixed docstring to list all 6 pipeline steps including conflict detection

DEVELOPMENT_LOG.md CHANGED
@@ -258,9 +258,10 @@ All config via `.env` (template in `.env.template`):
 
 | Variable | Required | Default | Description |
 |----------|----------|---------|-------------|
-| `MEDGEMMA_API_KEY` | Yes | — | Google AI Studio API key |
-| `MEDGEMMA_BASE_URL` | No | `https://generativelanguage.googleapis.com/v1beta/openai/` | LLM endpoint |
-| `MEDGEMMA_MODEL_ID` | No | `gemma-3-27b-it` | Model identifier |
+| `MEDGEMMA_API_KEY` | Yes | — | HuggingFace API token or Google AI Studio API key |
+| `MEDGEMMA_BASE_URL` | No | `""` (empty) | LLM endpoint (HF Endpoint URL/v1 or Google AI Studio URL) |
+| `MEDGEMMA_MODEL_ID` | No | `google/medgemma` | Model identifier (`tgi` for HF Endpoints, or full model name) |
+| `HF_TOKEN` | No | `""` | HuggingFace token for dataset downloads |
 | `CHROMA_PERSIST_DIR` | No | `./data/chroma` | ChromaDB storage |
 | `EMBEDDING_MODEL` | No | `sentence-transformers/all-MiniLM-L6-v2` | RAG embeddings |
 | `MAX_GUIDELINES` | No | `5` | Guidelines per RAG query |
@@ -314,6 +315,104 @@ Full validation runs (50–100+ cases) are planned for the next session.
 
 ---
 
+## Phase 11: MedGemma HuggingFace Dedicated Endpoint
+
+### Motivation
+
+The competition requires using HAI-DEF models (MedGemma). Google AI Studio served `gemma-3-27b-it` for development, but for the final submission we needed the actual `google/medgemma-27b-text-it` model. HuggingFace Dedicated Endpoints provide an OpenAI-compatible TGI server with scale-to-zero billing.
+
+### Deployment
+
+- **Endpoint name:** `medgemma-27b-cds`
+- **Model:** `google/medgemma-27b-text-it`
+- **Instance:** 1× NVIDIA A100 80 GB (AWS `us-east-1`)
+- **Container:** Text Generation Inference (TGI) with `DTYPE=bfloat16`
+- **Scale-to-zero:** Enabled (15 min idle timeout)
+- **Cost:** ~$2.50/hr when running
+
+### Key Configuration
+
+After initial deployment, the default TGI token limits (`MAX_INPUT_TOKENS=4096`) caused 422 errors on longer synthesis prompts. Updated endpoint environment:
+
+- `MAX_INPUT_TOKENS=12288`
+- `MAX_TOTAL_TOKENS=16384`
+
+Also reduced per-step `max_tokens` to stay within limits:
+- `patient_parser.py`: 1500
+- `clinical_reasoning.py`: 3072
+- `conflict_detection.py`: 2000
+- `synthesis.py`: 3000
+
+### Code Changes
+
+- **`medgemma.py`:** Updated to send `role: "system"` natively (TGI supports it), with automatic fallback to folding the system prompt into the user message for Google AI Studio compatibility.
+- **`.env`:** Updated `MEDGEMMA_BASE_URL` to the HF endpoint URL, `MEDGEMMA_API_KEY` to the HF token, `MEDGEMMA_MODEL_ID=tgi`.
+- **`.env.template`:** Updated with the MedGemma model name and HF Endpoint instructions.
+
+### Verification
+
+Single-case test: Chikungunya question → correct diagnosis appeared at rank 5 in the differential. All 6 pipeline steps completed in 281 s.
+
+**Deployment guide:** `docs/deploy_medgemma_hf.md`
+
+---
+
+## Phase 12: 50-Case MedQA Validation
+
+### Setup
+
+Ran 50 MedQA (USMLE) cases through the full pipeline using the MedGemma HF Endpoint:
+
+```bash
+cd src/backend
+python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2
+```
+
+### Results
+
+| Metric | Value |
+|--------|-------|
+| Cases run | 50 |
+| Pipeline success | 94% (47/50) |
+| Top-1 diagnostic accuracy | 36% |
+| Top-3 diagnostic accuracy | 38% |
+| Differential accuracy | 10% |
+| Mentioned in report | 38% |
+| Avg pipeline time | 204 s/case |
+| Total run time | ~60 min |
+
+### Question Type Breakdown
+
+Used `analyze_results.py` to categorize the 50 cases:
+
+| Type | Count | Mentioned | Differential |
+|------|-------|-----------|--------------|
+| Diagnostic | 36 | 14 (39%) | 5 (14%) |
+| Treatment | 6 | — | — |
+| Pathophysiology | 6 | — | — |
+| Statistics | 1 | — | — |
+| Anatomy | 1 | — | — |
+
+### Key Observations
+
+1. **MedQA includes many non-diagnostic questions** (treatment, mechanism, stats) that the CDS pipeline is not designed to answer — it generates differential diagnoses, not multiple-choice answers.
+2. **On diagnostic questions specifically**, 39% mentioned accuracy is reasonable for a pipeline that wasn't optimized for exam-style questions.
+3. **Pipeline failures (3/50)** were caused by the HF endpoint scaling to zero mid-run. The `--resume` flag successfully continued from the checkpoint.
+4. **Improved clinical reasoning prompt** to demand disease-level diagnoses rather than symptom categories (e.g., "Chikungunya" not "viral arthritis").
+
+### Infrastructure Improvements
+
+- **Incremental JSONL checkpoints:** Each case result is appended to `medqa_checkpoint.jsonl` as it completes.
+- **`--resume` flag:** Skips already-completed cases, enabling graceful recovery from endpoint failures.
+- **`check_progress.py`:** Utility to monitor checkpoint progress during long runs.
+- **`analyze_results.py`:** Categorizes MedQA results by question type for more meaningful accuracy analysis.
+- **Unicode fixes:** Replaced box-drawing characters (`╔═╗║╚╝`) and symbols (`✓✗─`) with ASCII equivalents for Windows console compatibility.
+
+**Files created:** `validation/analyze_results.py`, `validation/check_progress.py`
+**Files modified:** `validation/base.py`, `validation/harness_medqa.py`, `validation/run_validation.py`, `app/tools/clinical_reasoning.py`, `app/tools/synthesis.py`, `app/tools/conflict_detection.py`, `app/tools/patient_parser.py`
+
+---
+
 ## Phase 10: Final Documentation Audit & Cleanup
 
 Performed a full accuracy audit of all 5 documentation files and `test_e2e.py`.
README.md CHANGED
@@ -97,6 +97,20 @@ A validation framework tests the pipeline against real-world clinical datasets:
 
 Initial smoke test (3 MedQA cases): 100% parse success, 66.7% top-1 diagnostic accuracy.
 
+**50-case MedQA validation (MedGemma 27B via HF Endpoint):**
+
+| Metric | Value |
+|--------|-------|
+| Cases run | 50 |
+| Pipeline success | 94% (47/50) |
+| Top-1 diagnostic accuracy | 36% |
+| Top-3 diagnostic accuracy | 38% |
+| Differential accuracy | 10% |
+| Mentioned in report | 38% |
+| Avg pipeline time | 204 s/case |
+
+Of the 50 cases, 36 were diagnostic questions — on those, 39% mentioned the correct diagnosis and 14% placed it in the differential.
+
 See [docs/test_results.md](docs/test_results.md) for full details and reproduction steps.
 
 ---
@@ -137,7 +151,8 @@ medgemma_impact_challenge/
 ├── docs/
 │   ├── architecture.md           # System architecture & design decisions
 │   ├── test_results.md           # Detailed test results & benchmarks
-│   └── writeup_draft.md          # Project writeup / summary
+│   ├── writeup_draft.md          # Project writeup / summary
+│   └── deploy_medgemma_hf.md     # MedGemma HF Endpoint deployment guide
 ├── src/
 │   ├── backend/                  # Python FastAPI backend
 │   │   ├── .env.template         # Environment config template
@@ -152,7 +167,9 @@ medgemma_impact_challenge/
 │   │   │   ├── harness_medqa.py      # MedQA (USMLE) diagnostic accuracy harness
 │   │   │   ├── harness_mtsamples.py  # MTSamples parse quality harness
 │   │   │   ├── harness_pmc.py        # PMC Case Reports diagnostic harness
-│   │   │   └── run_validation.py     # Unified CLI runner
+│   │   │   ├── run_validation.py     # Unified CLI runner
+│   │   │   ├── analyze_results.py    # Question-type categorization & analysis
+│   │   │   └── check_progress.py     # Checkpoint progress monitor
 │   │   └── app/
 │   │       ├── main.py       # FastAPI entry (CORS, routers, lifespan)
 │   │       ├── config.py     # Pydantic Settings (ports, models, dirs)
@@ -204,7 +221,7 @@ medgemma_impact_challenge/
 
 - **Python 3.10+** (tested with Python 3.10)
 - **Node.js 18+** (tested with Node.js 18)
-- **API Key:** Google AI Studio API key for Gemma model access
+- **API Key:** HuggingFace API token (for MedGemma endpoint) or Google AI Studio API key
 
 ### Backend Setup
 
@@ -221,7 +238,9 @@ pip install -r requirements.txt
 
 # Configure environment
 copy .env.template .env  # Windows (or: cp .env.template .env)
-# Edit .env — set MEDGEMMA_API_KEY to your Google AI Studio key
+# Edit .env — set MEDGEMMA_API_KEY and MEDGEMMA_BASE_URL
+# For HF Endpoints: see docs/deploy_medgemma_hf.md
+# For Google AI Studio: set MEDGEMMA_API_KEY to your Google AI Studio key
 
 # Start the backend
 uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
@@ -281,7 +300,7 @@ python -m validation.run_validation --all --max-cases 10  # All 3 datasets
 |-------|-----------|---------|
 | Frontend | Next.js 14, React 18, TypeScript, Tailwind CSS | Patient input, pipeline visualization, report display |
 | API | FastAPI, WebSocket, Pydantic v2 | REST endpoints + real-time streaming |
-| LLM | Gemma 3 27B IT (via Google AI Studio) | Clinical reasoning + synthesis |
+| LLM | MedGemma 27B Text IT (via HuggingFace Dedicated Endpoint) | Clinical reasoning + synthesis |
 | RAG | ChromaDB, sentence-transformers (all-MiniLM-L6-v2) | Clinical guideline retrieval |
 | Drug Data | OpenFDA API, RxNorm / NLM API | Drug interactions, medication normalization |
 | Validation | Pydantic | Structured output validation across all pipeline steps |
@@ -326,6 +345,7 @@ curl -X POST http://localhost:8000/api/cases/submit \
 | [SECURITY.md](SECURITY.md) | Security policy and responsible disclosure |
 | [TODO.md](TODO.md) | Next-session action items and project state |
 | [SUBMISSION_GUIDE.md](SUBMISSION_GUIDE.md) | Competition submission strategy |
+| [docs/deploy_medgemma_hf.md](docs/deploy_medgemma_hf.md) | MedGemma HuggingFace Endpoint deployment guide |
 
 ---
 
SECURITY.md CHANGED
@@ -37,12 +37,12 @@ We will acknowledge receipt within 48 hours and aim to provide a fix or mitigati
 - This system processes clinical text that could contain protected health information (PHI)
 - **No real patient data should ever be used** with this demonstration system
 - In a production deployment, HIPAA compliance would require: encrypted storage, audit logging, access controls, and BAAs with all third-party services
-- The Gemma model can be self-hosted on-premises to avoid sending data to external APIs
+- The MedGemma model can be self-hosted on-premises to avoid sending data to external APIs
 
 ### API Keys
 
-- The Google AI Studio API key is stored in `.env` (gitignored)
-- Never commit `.env` or any file containing API keys
+- API keys/tokens (HuggingFace token, Google AI Studio key) are stored in `.env` (gitignored)
+- Never commit `.env` or any file containing API keys or tokens
 - The `.env.template` file shows required variables without actual values
 
 ### LLM-Specific Risks
SUBMISSION_GUIDE.md CHANGED
@@ -8,7 +8,7 @@ Jan 13 ─────────────────────── Feb
 ◄────── Build & Iterate ──────►
 ```
 
-**⏰ Days remaining as of Feb 13, 2026: ~11 days**
+**⏰ Days remaining as of Feb 15, 2026: ~9 days**
 
 ---
 
TODO.md CHANGED
@@ -1,39 +1,13 @@
 # TODO — Next Session Action Items
 
-> **Last updated:** End of validation framework + documentation audit session.
+> **Last updated:** After 50-case MedQA validation, MedGemma HF Endpoint deployment, and documentation audit.
 > **Read this first** if you're a new AI instance picking up this project.
 
 ---
 
 ## High Priority (Do Next)
 
-### 1. Run Full-Scale Validation (~2 hours total)
-
-The validation framework is built and tested with a 3-case smoke test. It needs a proper run:
-
-```bash
-cd src/backend
-
-# MedQA — 50 cases, ~45 min
-python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2
-
-# MTSamples — 50 cases, ~45 min
-python -m validation.run_validation --mtsamples --max-cases 50 --seed 42 --delay 2
-
-# PMC Case Reports — 10-20 cases (smaller pool), ~15-30 min
-python -m validation.run_validation --pmc --max-cases 20 --seed 42 --delay 2
-```
-
-Results save to `validation/results/`. After running, update:
-- `docs/test_results.md` Section 6 with real numbers (replace smoke test placeholder)
-- `docs/writeup_draft.md` validation methodology section with actual metrics
-- `README.md` "External Dataset Validation" section
-
-### 2. Update Writeup with Actual Validation Metrics
-
-`docs/writeup_draft.md` currently says "initial smoke test" and "in progress." Once full validation is done, replace with actual numbers (top-1 accuracy, parse success rates, etc.).
-
-### 3. Record a Demo Video
+### 1. Record a Demo Video
 
 The writeup says "Video: [To be recorded]". Record a ~3 min screencast showing:
 1. Pasting a patient case
@@ -41,6 +15,23 @@ The writeup says "Video: [To be recorded]". Record a ~3 min screencast showing:
 3. Reviewing the CDS report (especially conflicts section)
 4. Showing validation results
 
+**Note:** Resume the HF Endpoint first (`medgemma-27b-cds` on HuggingFace). It costs ~$2.50/hr and is currently **paused**. Allow 5–15 min for cold start.
+
+### 2. Finalize Submission Writeup
+
+`docs/writeup_draft.md` has been updated with 50-case MedQA results. Still needs:
+- Team name / member info filled in
+- Final polish for 3-page limit
+- Links to video and live demo (once recorded/deployed)
+
+### 3. Improve Diagnostic Accuracy (Optional)
+
+Current 50-case MedQA accuracy: 36% top-1, 38% mentioned. Potential improvements:
+- **Specialist agents (Option B):** Route to domain-specific reasoning agents for cardiology, neurology, etc.
+- **Better prompting:** Further refine the `clinical_reasoning.py` system prompt
+- **Multi-turn reasoning:** Add a self-critique / verification step before synthesis
+- **Run MTSamples + PMC validation** for additional metrics
+
 ---
 
 ## Medium Priority
@@ -67,12 +58,12 @@ We deliberately removed numeric confidence scores (see Phase 8 in DEVELOPMENT_LO
 
 ## Low Priority / Future
 
-### 7. Model Upgrade Path
+### 7. Model Optimization
 
-Currently using `gemma-3-27b-it`. When available, evaluate:
-- MedGemma (medical-specific Gemma fine-tune) if released
-- Smaller/distilled models for latency reduction
+Currently using `google/medgemma-27b-text-it` on A100 80 GB. Options:
+- Smaller/quantized models for latency reduction (medgemma-4b-it for lighter steps)
 - Specialized models for individual pipeline steps (e.g., a parse-only model)
+- Batch inference optimizations
 
 ### 8. EHR Integration Prototype
 
@@ -95,15 +86,24 @@ Current input is manual text paste. A FHIR client could auto-populate patient da
 | Frontend (Next.js) | ✅ Complete | Real-time pipeline viz, CDS report with conflicts |
 | RAG (62 guidelines) | ✅ Complete | 30/30 quality test, 100% top-1 accuracy |
 | Conflict Detection | ✅ Complete | Integrated into pipeline, frontend, and docs |
-| Validation Framework | ✅ Built | Smoke-tested only; needs full-scale runs |
-| Documentation (5 files) | ✅ Audited | All docs updated and cross-checked |
+| MedGemma HF Endpoint | ✅ Deployed | `medgemma-27b-cds`, A100 80 GB, scale-to-zero, **currently paused** |
+| MedQA Validation (50 cases) | ✅ Complete | 36% top-1, 38% mentioned, 94% pipeline success |
+| Validation Framework | ✅ Complete | MedQA done; MTSamples + PMC harnesses built but not yet run at scale |
+| Documentation (8+ files) | ✅ Audited | All docs updated and cross-checked |
 | test_e2e.py | ✅ Fixed | Now asserts 6 steps + conflict_detection |
 | GitHub | ✅ Pushed | `bshepp/clinical-decision-support-agent` (master) |
+| Demo Video | ⬜ Not started | Required for submission |
+| Submission Writeup | 🔄 In progress | Template filled, needs final polish |
 
 **Key files:**
 - Backend entry: `src/backend/app/main.py`
 - Orchestrator: `src/backend/app/agent/orchestrator.py`
+- MedGemma service: `src/backend/app/services/medgemma.py`
 - Validation CLI: `src/backend/validation/run_validation.py`
+- HF Endpoint guide: `docs/deploy_medgemma_hf.md`
 - All docs: `README.md`, `docs/architecture.md`, `docs/test_results.md`, `docs/writeup_draft.md`, `DEVELOPMENT_LOG.md`
 
-**Dev ports:** Backend = 8002 (not 8000 — zombie process issue), Frontend = 3000
+**Infrastructure:**
+- HF Endpoint: `medgemma-27b-cds` at `https://lisvpf8if1yhgxn2.us-east-1.aws.endpoints.huggingface.cloud`
+- Dev ports: Backend = 8002 (not 8000 — zombie process issue), Frontend = 3000
+- Virtual env: `src/backend/venv/`
docs/architecture.md CHANGED
@@ -50,8 +50,8 @@ structured clinical decision support report — all in seconds.
 │  └─────────────────┘                                             │
 └──────────────────────────────────────────────────────────────────┘
 
-LLM: gemma-3-27b-it via Google AI Studio
-     (OpenAI-compatible endpoint)
+LLM: google/medgemma-27b-text-it via HuggingFace Dedicated Endpoint
+     (OpenAI-compatible TGI, 1× A100 80 GB, bfloat16)
 ```
 
 ---
@@ -138,17 +138,18 @@ LLM: gemma-3-27b-it via Google AI Studio
 
 ### Model Configuration
 
-- **Model:** `gemma-3-27b-it`
-- **API:** Google AI Studio (OpenAI-compatible endpoint)
-- **Base URL:** `https://generativelanguage.googleapis.com/v1beta/openai/`
+- **Model:** `google/medgemma-27b-text-it` (MedGemma from HAI-DEF)
+- **API:** HuggingFace Dedicated Endpoint (TGI), with Google AI Studio as fallback
+- **Base URL:** `https://lisvpf8if1yhgxn2.us-east-1.aws.endpoints.huggingface.cloud/v1` (HF Endpoint)
 - **Client:** OpenAI Python SDK (`openai==1.51.0`)
 - **Service:** `medgemma.py` wraps all LLM calls
+- **Endpoint config:** `MAX_INPUT_TOKENS=12288`, `MAX_TOTAL_TOKENS=16384`, `DTYPE=bfloat16`
 
-### Gemma System Prompt Workaround
+### Gemma System Prompt Handling
 
-**Problem discovered during development:** Gemma models accessed via the Google AI Studio OpenAI-compatible endpoint return a 400 error if you include a `role: "system"` message. The API does not support the system role.
+**MedGemma via TGI** natively supports `role: "system"` messages, so we send system/user messages properly.
 
-**Solution implemented:** `medgemma.py`'s `_generate_api` method detects system messages and folds them into the first user message with a `[System Instructions]` prefix:
+**Fallback for Google AI Studio:** If the backend is plain Gemma on Google AI Studio (which rejects the system role), the code automatically catches the error and falls back to folding the system prompt into the first user message:
 
 ```python
 # If system message exists, fold it into the first user message
@@ -225,7 +226,8 @@ All pipeline data is strongly typed via Pydantic models in `schemas.py` (~280 li
 
 | API | Purpose | Authentication | Rate Limits |
 |-----|---------|---------------|-------------|
-| Google AI Studio | Gemma 3 27B IT LLM inference | API key | Per-key quota |
+| HuggingFace Dedicated Endpoint | MedGemma 27B Text IT LLM inference | HF API token | Dedicated GPU (no shared limits) |
+| Google AI Studio (fallback) | Gemma 3 27B IT LLM inference | API key | Per-key quota |
 | OpenFDA | Drug adverse event data | None (public) | 240 req/min (with key), 40/min (without) |
 | RxNorm / NLM | Drug normalization (name → RxCUI), pairwise interactions | None (public) | 20 req/sec |
 
@@ -269,9 +271,10 @@ All configuration lives in `config.py` (Pydantic Settings) and `.env`:
 
 | Setting | Default | Description |
 |---------|---------|-------------|
-| `MEDGEMMA_API_KEY` | (required) | Google AI Studio API key |
-| `MEDGEMMA_BASE_URL` | `https://generativelanguage.googleapis.com/v1beta/openai/` | LLM API endpoint |
-| `MEDGEMMA_MODEL_ID` | `gemma-3-27b-it` | Model identifier |
+| `MEDGEMMA_API_KEY` | (required) | HuggingFace API token or Google AI Studio API key |
+| `MEDGEMMA_BASE_URL` | `""` (empty) | LLM API endpoint (HF Endpoint URL with /v1, or Google AI Studio URL) |
+| `MEDGEMMA_MODEL_ID` | `google/medgemma` | Model identifier (`tgi` for HF Endpoints, or full model name) |
+| `HF_TOKEN` | `""` | HuggingFace token for dataset downloads |
 | `CHROMA_PERSIST_DIR` | `./data/chroma` | ChromaDB storage directory |
 | `EMBEDDING_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | Embedding model for RAG |
 | `MAX_GUIDELINES` | `5` | Number of guidelines to retrieve per query |
@@ -283,7 +286,7 @@ All configuration lives in `config.py` (Pydantic Settings) and `.env`:
 
 - **LLM latency:** Full pipeline takes ~75 s due to multiple sequential LLM calls. Could be improved with smaller models or parallel LLM calls.
 - **No authentication:** No user auth — designed as a local demo / research tool.
-- **Single-model:** Uses only Gemma 3 27B IT. Could benefit from specialized models for different steps.
+- **Single-model:** Uses only MedGemma 27B Text IT. Could benefit from specialized models for different steps.
 - **Guideline currency:** Guidelines are a static snapshot. A production system would need automated updates.
 - **No EHR integration:** Input is manual text paste. A production system would integrate with EHR FHIR APIs.
 
docs/deploy_medgemma_hf.md CHANGED
@@ -35,10 +35,14 @@ OpenAI-compatible API.
 - GCP: ~$3.60/hr
 6. **Container type**: Text Generation Inference (TGI) — this is the default.
 7. **Advanced Settings**:
-   - **Max Input Length**: `32768`
-   - **Max Total Tokens**: `40960`
+   - **Max Input Length**: `12288` (default 4096 is too small for synthesis prompts)
+   - **Max Total Tokens**: `16384`
    - **Quantization**: `none` (bfloat16 fits in 80 GB)
    - **Scale-to-zero**: **Enable** (idle timeout: 15 min recommended)
+
+   > **Note:** The default TGI `MAX_INPUT_TOKENS=4096` will cause 422 errors
+   > on longer pipeline prompts (especially synthesis). We found `12288` /
+   > `16384` to be sufficient for all 6 pipeline steps.
 8. Click **Create Endpoint**.
 
 ### 2. Wait for the endpoint to become ready
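The 422 errors mentioned in the deploy-guide note come from TGI's request validation: a request fails when the prompt exceeds `MAX_INPUT_TOKENS`, or when prompt plus `max_new_tokens` exceeds `MAX_TOTAL_TOKENS`. A pre-flight check can be sketched as below (illustrative only; the limits are the values this guide deploys, and the per-step budgets come from DEVELOPMENT_LOG Phase 11).

```python
MAX_INPUT_TOKENS = 12288   # TGI endpoint limit set in this guide
MAX_TOTAL_TOKENS = 16384

def fits_budget(input_tokens: int, max_new_tokens: int) -> bool:
    """True if a request stays within the TGI endpoint's token limits."""
    return (input_tokens <= MAX_INPUT_TOKENS
            and input_tokens + max_new_tokens <= MAX_TOTAL_TOKENS)

# Per-step max_tokens used by the pipeline (from DEVELOPMENT_LOG Phase 11)
STEP_MAX_TOKENS = {
    "patient_parser": 1500,
    "clinical_reasoning": 3072,
    "conflict_detection": 2000,
    "synthesis": 3000,
}
```

With these limits, even a maximal 12288-token prompt leaves room for every step's output budget, which is why the raised values eliminated the 422s.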
docs/test_results.md CHANGED
@@ -1,6 +1,6 @@
 # Test Results — CDS Agent

-> Last updated after RAG expansion to 62 guidelines across 14 specialties.
+> Last updated after 50-case MedQA validation with MedGemma 27B via HuggingFace Dedicated Endpoint.

 ---

@@ -221,7 +221,34 @@ Tests use only the standard library + `httpx` (for REST calls) and the backend's
 | Top-3 diagnostic accuracy | 66.7% (2/3) |
 | Avg pipeline time | ~94 s per case |

-> **Note:** This is a smoke test only. A full validation run (50–100 cases per dataset) is planned but takes ~45 min per dataset.
+### 50-Case MedQA Validation (MedGemma 27B Text IT via HF Endpoint)
+
+Run with: `python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2`
+
+| Metric | Value |
+|--------|-------|
+| Cases run | 50 |
+| Pipeline success | 94% (47/50) |
+| Top-1 diagnostic accuracy | 36% |
+| Top-3 diagnostic accuracy | 38% |
+| Differential accuracy | 10% |
+| Mentioned in report | 38% |
+| Avg pipeline time | 204 s per case |
+| Total run time | ~60 min |
+
+**Breakdown by question type (50 cases):**
+
+| Type | Count | Mentioned | Differential |
+|------|-------|-----------|--------------|
+| Diagnostic | 36 | 14 (39%) | 5 (14%) |
+| Treatment | 6 | — | — |
+| Pathophysiology | 6 | — | — |
+| Statistics | 1 | — | — |
+| Anatomy | 1 | — | — |
+
+> **Notes:** MedQA questions include many non-diagnostic question types (treatment selection, mechanism of action, etc.) which the CDS pipeline is not designed to answer. On diagnostic-only questions, the pipeline mentioned the correct diagnosis 39% of the time. Pipeline failures (3/50) were due to HF endpoint scale-to-zero mid-run.
+
+> Full validation was run on Feb 15, 2026 using the `medgemma-27b-cds` HuggingFace Dedicated Endpoint (1× A100 80 GB, bfloat16). Incremental checkpoints saved to `validation/results/medqa_checkpoint.jsonl` with `--resume` support.

 ### How to Reproduce
docs/writeup_draft.md CHANGED
@@ -107,6 +107,8 @@ No fine-tuning was performed in the current version. The base MedGemma model (`m
 | RAG retrieval quality | 30/30 queries passed (100%), avg relevance 0.639 |
 | Clinical test suite | 22 scenarios across 14 specialties |
 | Top-1 RAG accuracy | 100% — correct guideline ranked #1 for all queries |
+| **MedQA 50-case validation** | **36% top-1, 38% top-3, 38% mentioned, 94% pipeline success** |
+| MedQA diagnostic-only (36 cases) | 39% mentioned, 14% differential |

 **Application stack:**

@@ -121,7 +123,7 @@ No fine-tuning was performed in the current version. The base MedGemma model (`m
 **Deployment considerations:**

 - **HIPAA compliance:** MedGemma is an open-weight model that can be self-hosted on-premises, eliminating the need to send patient data to external APIs. This is critical for healthcare deployment.
-- **Latency:** Current pipeline takes ~75 s end-to-end. For production, this could be reduced with: smaller/distilled models, parallel LLM calls, or GPU-accelerated inference.
+- **Latency:** Current pipeline takes ~75 s for a single E2E case (local), or ~204 s avg on the HuggingFace Dedicated Endpoint (50-case MedQA validation). For production, this could be reduced with: smaller/distilled models, parallel LLM calls, or GPU-accelerated inference with higher throughput.
 - **Scalability:** FastAPI + uvicorn supports async request handling. For high-throughput deployment, add worker processes and a task queue (e.g., Celery).
 - **EHR integration:** Current input is manual text paste. A production system would integrate with EHR systems via FHIR APIs for automatic patient data extraction.

@@ -139,7 +141,7 @@ The validation harness calls the `Orchestrator` directly (no HTTP server), enabl

 **Initial smoke test (3 MedQA cases):** 100% parse success, 66.7% top-1 diagnostic accuracy, ~94 s avg per case.

-Full-scale validation (50–100+ cases per dataset) is in progress.
+**50-case MedQA validation (MedGemma 27B via HF Endpoint):** 94% pipeline success, 36% top-1 diagnostic accuracy, 38% mentioned in report, 204 s avg per case. On diagnostic-only questions (36/50), 39% mentioned the correct diagnosis. Full results in [docs/test_results.md](docs/test_results.md).

 **Practical usage:**

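The scalability note in the writeup (async handling plus a background task queue) can be sketched without Celery. This is a toy shape under stated assumptions, not the repo's implementation: jobs enter a queue, workers drain it, and the caller collects results by job id.

```python
import asyncio


async def worker(queue: asyncio.Queue, results: dict) -> None:
    """Drain jobs from the queue; each job stands in for a slow pipeline run."""
    while True:
        job_id, case = await queue.get()
        await asyncio.sleep(0)  # placeholder for the ~200 s pipeline call
        results[job_id] = f"report for {case}"
        queue.task_done()


async def main() -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    results: dict = {}
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(2)]
    for i, case in enumerate(["case A", "case B", "case C"]):
        await queue.put((i, case))
    await queue.join()  # block until every queued job is processed
    for w in workers:
        w.cancel()
    return results
```

With a real deployment the queue would be external (Celery/Redis) so the FastAPI process can return a job id immediately instead of holding the connection open for the full pipeline.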
src/backend/analyze_checkpoint.py ADDED
@@ -0,0 +1,105 @@
+"""Quick analysis of MedQA checkpoint data."""
+import json
+
+path = "validation/results/medqa_checkpoint.jsonl"
+with open(path) as f:
+    results = [json.loads(line) for line in f]
+
+print(f"Cases completed: {len(results)}\n")
+
+# ── Table view ──
+fmt = "{:<12} {:>3} {:>3} {:>4} {:>7} {:>3} {:>4} {:<15} {:<42} {}"
+print(fmt.format("ID", "t1", "t3", "diff", "ms", "#dx", "rnk", "match_loc", "correct_answer", "top_diagnosis"))
+print("-" * 145)
+
+for r in results:
+    d = r["details"]
+    t1 = "Y" if r["scores"]["top1_accuracy"] else "N"
+    t3 = "Y" if r["scores"]["top3_accuracy"] else "N"
+    da = "Y" if r["scores"].get("differential_accuracy") else "N"
+    rank = d.get("found_at_rank", -1)
+    loc = d.get("match_location", "?")
+    ca = d["correct_answer"][:42]
+    td = d.get("top_diagnosis", "?")[:45]
+    print(fmt.format(r["case_id"], t1, t3, da, r["pipeline_time_ms"], d.get("num_diagnoses", 0), rank, loc, ca, td))
+
+print()
+
+# ── Timing analysis ──
+correct = [r for r in results if r["scores"]["top1_accuracy"]]
+wrong = [r for r in results if not r["scores"]["top1_accuracy"]]
+mentioned = [r for r in results if r["scores"].get("mentioned_accuracy")]
+top3 = [r for r in results if r["scores"]["top3_accuracy"]]
+diff_only = [r for r in results if r["scores"].get("differential_accuracy")]
+
+if correct:
+    avg = sum(r["pipeline_time_ms"] for r in correct) / len(correct)
+    print(f"Correct (top1) avg time: {avg:.0f}ms ({len(correct)}/{len(results)} = {len(correct)/len(results)*100:.0f}%)")
+if top3:
+    avg = sum(r["pipeline_time_ms"] for r in top3) / len(top3)
+    print(f"Correct (top3) avg time: {avg:.0f}ms ({len(top3)}/{len(results)} = {len(top3)/len(results)*100:.0f}%)")
+if diff_only:
+    avg = sum(r["pipeline_time_ms"] for r in diff_only) / len(diff_only)
+    print(f"Differential only: {avg:.0f}ms ({len(diff_only)}/{len(results)} = {len(diff_only)/len(results)*100:.0f}%)")
+if wrong:
+    avg = sum(r["pipeline_time_ms"] for r in wrong) / len(wrong)
+    print(f"Wrong (top1) avg time: {avg:.0f}ms ({len(wrong)}/{len(results)} = {len(wrong)/len(results)*100:.0f}%)")
+if mentioned:
+    print(f"Mentioned anywhere: {len(mentioned)}/{len(results)}")
+
+# ── Match location breakdown ──
+print("\n=== MATCH LOCATION BREAKDOWN ===")
+loc_counts = {}
+for r in results:
+    loc = r["details"].get("match_location", "not_found")
+    loc_counts[loc] = loc_counts.get(loc, 0) + 1
+for loc, count in sorted(loc_counts.items()):
+    print(f"  {loc:<20} {count:>3} ({count/len(results)*100:.0f}%)")
+
+# ── Detailed per-case (new fields if available) ──
+print("\n=== PER-CASE DETAIL ===")
+for r in results:
+    d = r["details"]
+    cid = r["case_id"]
+    loc = d.get("match_location", "?")
+    ca = d["correct_answer"]
+    td = d.get("top_diagnosis", "?")
+    all_dx = d.get("all_diagnoses", [td])
+    all_next = d.get("all_next_steps", [])
+    all_recs = d.get("all_recommendations", [])
+    t1 = "Y" if r["scores"]["top1_accuracy"] else "N"
+
+    print(f"\n  {cid} [t1={t1}, loc={loc}]")
+    print(f"    Expected: {ca}")
+    print(f"    Differential: {', '.join(all_dx)}")
+    if all_next:
+        print(f"    Next steps: {'; '.join(all_next[:3])}")
+    if all_recs:
+        print(f"    Recommendations: {'; '.join(str(rec)[:60] for rec in all_recs[:3])}")
+
+# ── Answer type vs accuracy ──
+print("\n=== ANSWER TYPE vs ACCURACY ===")
+dx_correct = dx_total = mgmt_correct = mgmt_total = 0
+action_words = ["start", "stop", "give", "prescribe", "perform", "order", "refer",
+                "increase", "decrease", "switch", "add", "monitor", "observation",
+                "reassure", "discharge", "admit", "excess", "adaptation", "exclusion",
+                "it is", "right-sided", "affective", "exploratory", "lytic"]
+for r in results:
+    ca = r["details"]["correct_answer"]
+    is_dx = not any(w.lower() in ca.lower() for w in action_words)
+    if is_dx:
+        dx_total += 1
+        if r["scores"]["top1_accuracy"]:
+            dx_correct += 1
+    else:
+        mgmt_total += 1
+        if r["scores"]["top1_accuracy"]:
+            mgmt_correct += 1
+
+if dx_total:
+    print(f"  Diagnosis questions: {dx_correct}/{dx_total} = {dx_correct/dx_total*100:.0f}%")
+if mgmt_total:
+    print(f"  Mgmt/concept questions: {mgmt_correct}/{mgmt_total} = {mgmt_correct/mgmt_total*100:.0f}%")
+
+dx_counts = [r["details"].get("num_diagnoses", 0) for r in results]
+print(f"\nDiagnoses generated: min={min(dx_counts)}, max={max(dx_counts)}, avg={sum(dx_counts)/len(dx_counts):.1f}")
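The script above reads checkpoint records shaped roughly like the following. The field names are taken from what `analyze_checkpoint.py` accesses; the values are hypothetical:

```python
import json

# Hypothetical checkpoint record matching the fields analyze_checkpoint.py reads.
record = {
    "case_id": "medqa_0001",
    "pipeline_time_ms": 204000,
    "scores": {
        "top1_accuracy": True,
        "top3_accuracy": True,
        "mentioned_accuracy": True,
        "differential_accuracy": False,
    },
    "details": {
        "correct_answer": "Acute pancreatitis",
        "top_diagnosis": "Acute pancreatitis",
        "num_diagnoses": 4,
        "found_at_rank": 1,
        "match_location": "top_diagnosis",
    },
}

# One JSON object per line in the JSONL file; the analyzer round-trips
# these with json.loads, so the record must survive a dumps/loads cycle.
line = json.dumps(record)
assert json.loads(line)["scores"]["top1_accuracy"] is True
```

Optional fields like `all_diagnoses`, `all_next_steps`, and `all_recommendations` are accessed with `.get()`, so older checkpoints without them still parse.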
src/backend/app/agent/orchestrator.py CHANGED
@@ -2,11 +2,12 @@
 Agent Orchestrator — the brain of the CDS Agent.

 Controls the multi-step pipeline:
-1. Parse patient data
-2. Clinical reasoning (MedGemma)
-3. Drug interaction check
-4. Guideline retrieval (RAG)
-5. Synthesis (MedGemma)
+1. Parse patient data (MedGemma)
+2. Clinical reasoning / differential diagnosis (MedGemma)
+3. Drug interaction check (OpenFDA + RxNorm APIs)
+4. Guideline retrieval (RAG over ChromaDB)
+5. Conflict detection (MedGemma)
+6. Synthesis into CDS report (MedGemma)

 Each step is a tool call. The orchestrator manages state, handles errors,
 and streams step updates to the frontend via a callback.
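The six-step control flow the docstring describes can be sketched as a loop over tool callables with per-step error handling and a progress callback. This is a toy shape only; the real `Orchestrator` class and its tool signatures differ:

```python
from typing import Callable, Dict


def run_pipeline(case_text: str, tools: Dict[str, Callable],
                 on_step: Callable[[str, str], None] = lambda *_: None) -> dict:
    """Run each pipeline step in order, accumulating results in a state dict.

    `on_step(name, status)` mimics the orchestrator's streaming callback;
    a failed step records its error and the pipeline keeps going.
    """
    steps = ["parse", "reason", "drug_check", "retrieve_guidelines",
             "detect_conflicts", "synthesize"]
    state: dict = {"input": case_text}
    for name in steps:
        on_step(name, "started")
        try:
            state[name] = tools[name](state)
            on_step(name, "done")
        except Exception as exc:  # degrade gracefully instead of aborting
            state[name] = {"error": str(exc)}
            on_step(name, "failed")
    return state
```

Each tool receives the accumulated state, so synthesis can see the differential, the drug-check output, and the retrieved guidelines, which is what steps 5 and 6 in the docstring rely on.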
src/backend/validation_test_output.txt ADDED
@@ -0,0 +1,507 @@
+==========================================================
+Clinical Decision Support Agent - Validation Suite
+==========================================================
+
+Datasets: MedQA
+Cases/dataset: 1
+Drug check: Yes
+Guidelines: Yes
+Resume: No
+Fetch only: No
+
+============================================================
+DATASET 1: MedQA (USMLE-style diagnostic accuracy)
+============================================================
+Loading MedQA from cache: F:\kaggle\medgemma_impact_challenge\src\backend\validation\data\medqa_test.jsonl
+Loaded 1 MedQA cases

[Remainder of the added file: a PowerShell `NativeCommandError` notice from stderr redirection of `.\venv\Scripts\python.exe -m validation.run_validation`, followed by several hundred mis-encoded `Loading weights: N%` progress-bar lines ("Materializing param=…") emitted while a local model loaded. Truncated here; no further substantive output.]
327
+ 3109.48it/s, Materializing param=encoder.layer.4.attention.output.dense.weight]
328
+ Loading weights: 72%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÅ | 74/103 [00:00<00:00,
329
+ 3143.01it/s, Materializing param=encoder.layer.4.attention.self.key.bias]
330
+ Loading weights: 72%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÅ | 74/103 [00:00<00:00,
331
+ 3138.78it/s, Materializing param=encoder.layer.4.attention.self.key.bias]
332
+ Loading weights: 73%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÄ | 75/103 [00:00<00:00,
333
+ 3171.48it/s, Materializing param=encoder.layer.4.attention.self.key.weight]
334
+ Loading weights: 73%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÄ | 75/103 [00:00<00:00,
335
+ 3166.28it/s, Materializing param=encoder.layer.4.attention.self.key.weight]
336
+ Loading weights: 74%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûì | 76/103 [00:00<00:00,
337
+ 3196.62it/s, Materializing param=encoder.layer.4.attention.self.query.bias]
338
+ Loading weights: 74%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûì | 76/103 [00:00<00:00,
339
+ 3190.96it/s, Materializing param=encoder.layer.4.attention.self.query.bias]
340
+ Loading weights: 75%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûì | 77/103 [00:00<00:00,
341
+ 3220.69it/s, Materializing param=encoder.layer.4.attention.self.query.weight]
342
+ Loading weights: 75%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûì | 77/103 [00:00<00:00,
343
+ 3216.07it/s, Materializing param=encoder.layer.4.attention.self.query.weight]
344
+ Loading weights: 76%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûî | 78/103 [00:00<00:00,
345
+ 3246.78it/s, Materializing param=encoder.layer.4.attention.self.value.bias]
346
+ Loading weights: 76%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûî | 78/103 [00:00<00:00,
347
+ 3241.80it/s, Materializing param=encoder.layer.4.attention.self.value.bias]
348
+ Loading weights: 77%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûï | 79/103 [00:00<00:00,
349
+ 3271.33it/s, Materializing param=encoder.layer.4.attention.self.value.weight]
350
+ Loading weights: 77%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûï | 79/103 [00:00<00:00,
351
+ 3266.37it/s, Materializing param=encoder.layer.4.attention.self.value.weight]
352
+ Loading weights: 78%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûè | 80/103 [00:00<00:00,
353
+ 3296.53it/s, Materializing param=encoder.layer.4.intermediate.dense.bias]
354
+ Loading weights: 78%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûè | 80/103 [00:00<00:00,
355
+ 3291.42it/s, Materializing param=encoder.layer.4.intermediate.dense.bias]
356
+ Loading weights: 79%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûè | 81/103 [00:00<00:00,
357
+ 3318.57it/s, Materializing param=encoder.layer.4.intermediate.dense.weight]
358
+ Loading weights: 79%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûè | 81/103 [00:00<00:00,
359
+ 3312.52it/s, Materializing param=encoder.layer.4.intermediate.dense.weight]
360
+ Loading weights: 80%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûë | 82/103 [00:00<00:00,
361
+ 3339.93it/s, Materializing param=encoder.layer.4.output.LayerNorm.bias]
362
+ Loading weights: 80%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûë | 82/103 [00:00<00:00,
363
+ 3334.33it/s, Materializing param=encoder.layer.4.output.LayerNorm.bias]
364
+ Loading weights: 81%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûê | 83/103 [00:00<00:00,
365
+ 3363.48it/s, Materializing param=encoder.layer.4.output.LayerNorm.weight]
366
+ Loading weights: 81%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûê | 83/103 [00:00<00:00,
367
+ 3358.91it/s, Materializing param=encoder.layer.4.output.LayerNorm.weight]
368
+ Loading weights: 82%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÅ | 84/103 [00:00<00:00,
369
+ 3389.40it/s, Materializing param=encoder.layer.4.output.dense.bias]
370
+ Loading weights: 82%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÅ | 84/103 [00:00<00:00,
371
+ 3384.62it/s, Materializing param=encoder.layer.4.output.dense.bias]
372
+ Loading weights: 83%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÄ | 85/103 [00:00<00:00,
373
+ 3415.69it/s, Materializing param=encoder.layer.4.output.dense.weight]
374
+ Loading weights: 83%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÄ | 85/103 [00:00<00:00,
375
+ 3410.72it/s, Materializing param=encoder.layer.4.output.dense.weight]
376
+ Loading weights: 83%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÄ | 86/103 [00:00<00:00,
377
+ 3440.61it/s, Materializing
378
+ param=encoder.layer.5.attention.output.LayerNorm.bias]
379
+ Loading weights: 83%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÄ | 86/103 [00:00<00:00,
380
+ 3436.19it/s, Materializing
381
+ param=encoder.layer.5.attention.output.LayerNorm.bias]
382
+ Loading weights: 84%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûì | 87/103 [00:00<00:00,
383
+ 3467.59it/s, Materializing
384
+ param=encoder.layer.5.attention.output.LayerNorm.weight]
385
+ Loading weights: 84%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûì | 87/103 [00:00<00:00,
386
+ 3463.04it/s, Materializing
387
+ param=encoder.layer.5.attention.output.LayerNorm.weight]
388
+ Loading weights: 85%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûî | 88/103 [00:00<00:00,
389
+ 3493.53it/s, Materializing param=encoder.layer.5.attention.output.dense.bias]
390
+
391
+ Loading weights: 85%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûî | 88/103 [00:00<00:00,
392
+ 3489.07it/s, Materializing param=encoder.layer.5.attention.output.dense.bias]
393
+ Loading weights: 86%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûï | 89/103 [00:00<00:00,
394
+ 3520.50it/s, Materializing param=encoder.layer.5.attention.output.dense.weight]
395
+ Loading weights: 86%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûï | 89/103 [00:00<00:00,
396
+ 3515.86it/s, Materializing param=encoder.layer.5.attention.output.dense.weight]
397
+ Loading weights: 87%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûï | 90/103 [00:00<00:00,
398
+ 3545.01it/s, Materializing param=encoder.layer.5.attention.self.key.bias]
399
+ Loading weights: 87%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûï | 90/103 [00:00<00:00,
400
+ 3540.39it/s, Materializing param=encoder.layer.5.attention.self.key.bias]
401
+ Loading weights: 88%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûè | 91/103 [00:00<00:00,
402
+ 3569.62it/s, Materializing param=encoder.layer.5.attention.self.key.weight]
403
+ Loading weights: 88%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûè | 91/103 [00:00<00:00,
404
+ 3564.89it/s, Materializing param=encoder.layer.5.attention.self.key.weight]
405
+ Loading weights: 89%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûë | 92/103 [00:00<00:00,
406
+ 3595.66it/s, Materializing param=encoder.layer.5.attention.self.query.bias]
407
+ Loading weights: 89%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûë | 92/103 [00:00<00:00,
408
+ 3591.18it/s, Materializing param=encoder.layer.5.attention.self.query.bias]
409
+ Loading weights: 90%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûê | 93/103 [00:00<00:00,
410
+ 3620.11it/s, Materializing param=encoder.layer.5.attention.self.query.weight]
411
+ Loading weights: 90%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûê | 93/103 [00:00<00:00,
412
+ 3615.31it/s, Materializing param=encoder.layer.5.attention.self.query.weight]
413
+ Loading weights: 91%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÅ| 94/103 [00:00<00:00,
414
+ 3644.19it/s, Materializing param=encoder.layer.5.attention.self.value.bias]
415
+ Loading weights: 91%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÅ| 94/103 [00:00<00:00,
416
+ 3639.34it/s, Materializing param=encoder.layer.5.attention.self.value.bias]
417
+ Loading weights: 92%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÅ| 95/103 [00:00<00:00,
418
+ 3668.92it/s, Materializing param=encoder.layer.5.attention.self.value.weight]
419
+ Loading weights: 92%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÅ| 95/103 [00:00<00:00,
420
+ 3664.36it/s, Materializing param=encoder.layer.5.attention.self.value.weight]
421
+ Loading weights: 93%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÄ| 96/103 [00:00<00:00,
422
+ 3693.59it/s, Materializing param=encoder.layer.5.intermediate.dense.bias]
423
+ Loading weights: 93%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÄ| 96/103 [00:00<00:00,
424
+ 3688.52it/s, Materializing param=encoder.layer.5.intermediate.dense.bias]
425
+ Loading weights: 94%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûì| 97/103 [00:00<00:00,
426
+ 3718.18it/s, Materializing param=encoder.layer.5.intermediate.dense.weight]
427
+ Loading weights: 94%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûì| 97/103 [00:00<00:00,
428
+ 3713.03it/s, Materializing param=encoder.layer.5.intermediate.dense.weight]
429
+ Loading weights: 95%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûî| 98/103 [00:00<00:00,
430
+ 3742.63it/s, Materializing param=encoder.layer.5.output.LayerNorm.bias]
431
+ Loading weights: 95%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûî| 98/103 [00:00<00:00,
432
+ 3738.00it/s, Materializing param=encoder.layer.5.output.LayerNorm.bias]
433
+ Loading weights: 96%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûî| 99/103 [00:00<00:00,
434
+ 3766.31it/s, Materializing param=encoder.layer.5.output.LayerNorm.weight]
435
+ Loading weights: 96%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûî| 99/103 [00:00<00:00,
436
+ 3761.57it/s, Materializing param=encoder.layer.5.output.LayerNorm.weight]
437
+ Loading weights: 97%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûï| 100/103 [00:00<00:00,
438
+ 3789.95it/s, Materializing param=encoder.layer.5.output.dense.bias]
439
+ Loading weights: 97%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûï| 100/103 [00:00<00:00,
440
+ 3785.13it/s, Materializing param=encoder.layer.5.output.dense.bias]
441
+ Loading weights: 98%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûè| 101/103 [00:00<00:00,
442
+ 3813.96it/s, Materializing param=encoder.layer.5.output.dense.weight]
443
+ Loading weights: 98%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûè| 101/103 [00:00<00:00,
444
+ 3809.23it/s, Materializing param=encoder.layer.5.output.dense.weight]
445
+ Loading weights: 99%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûë| 102/103 [00:00<00:00,
446
+ 3837.97it/s, Materializing param=pooler.dense.bias]
447
+ Loading weights: 99%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûë| 102/103 [00:00<00:00,
448
+ 3833.68it/s, Materializing param=pooler.dense.bias]
449
+ Loading weights: 100%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûê| 103/103 [00:00<00:00,
450
+ 3862.33it/s, Materializing param=pooler.dense.weight]
451
+ Loading weights: 100%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûê| 103/103 [00:00<00:00,
452
+ 3857.26it/s, Materializing param=pooler.dense.weight]
453
+ Loading weights: 100%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûê| 103/103 [00:00<00:00,
454
+ 3842.34it/s, Materializing param=pooler.dense.weight]
455
+ BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
+ Key                     | Status     |
+ ------------------------+------------+
+ embeddings.position_ids | UNEXPECTED |
+
+ Notes:
+ - UNEXPECTED: can be ignored when loading from a different task/architecture; not ok if you expect an identical arch.
+ [1/1] medqa_0000: ✓ top1=N top3=N diff=Y [differential] (281547ms)
+
+ ============================================================
+ Validation Results: MEDQA
+ ============================================================
+ Total cases: 1
+ Successful:  1
+ Failed:      0
+ Duration:    281.5s
+
+ Metrics:
+   avg_pipeline_time_ms   281547ms
+   differential_accuracy  100.0%
+   mentioned_accuracy     100.0%
+   parse_success          100.0%
+   top1_accuracy          0.0%
+   top3_accuracy          0.0%
+ ============================================================
+
+ ======================================================================
+ COMBINED VALIDATION REPORT
+ ======================================================================
+
+ Dataset          Cases   Success   Key Metric                  Value
+ ---------------  ------  --------  -------------------------   --------
+ medqa            1       1         top3_accuracy               0.0%
+
+ ──────────────────────────────────────────────────────────────────
+
+ MEDQA metrics:
+   avg_pipeline_time_ms   281547ms
+   differential_accuracy  100.0%
+   mentioned_accuracy     100.0%
+   parse_success          100.0%
+   top1_accuracy          0.0%
+   top3_accuracy          0.0%
+
+ Total cases:    1
+ Total success:  1
+ Total duration: 281.6s (4.7min)
+ Timestamp:      2026-02-15T06:15:42.932073+00:00
+ ======================================================================
+
+ Combined report saved to: F:\kaggle\medgemma_impact_challenge\src\backend\validation\results\combined_20260215_061542.json
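
For reference, a minimal sketch of how top-k accuracy figures like `top1_accuracy` / `top3_accuracy` in the report above are commonly computed over a ranked differential. This is illustrative only: the field names `gold` and `predictions` are assumptions, not the actual schema of this repo's validation harness.

```python
def topk_accuracy(cases, k):
    """Percent of cases whose gold diagnosis appears among the first k
    ranked predictions (case-insensitive exact match)."""
    hits = sum(
        1 for c in cases
        if c["gold"].lower() in (p.lower() for p in c["predictions"][:k])
    )
    return 100.0 * hits / len(cases)

# One case where the gold answer is ranked 2nd: top-1 misses, top-3 hits.
cases = [
    {"gold": "Pneumonia", "predictions": ["Bronchitis", "Pneumonia", "Asthma"]},
]
print(f"top1_accuracy {topk_accuracy(cases, 1):.1f}%")  # 0.0%
print(f"top3_accuracy {topk_accuracy(cases, 3):.1f}%")  # 100.0%
```

With n=1 a single miss swings top-1 between 0% and 100%, which is why the single-case run above reports 0.0% while `mentioned_accuracy` is 100%.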