bshepp committed · Commit 5d53fbf · 1 Parent(s): 1f36481

docs: full documentation vs reality audit

- README.md: Updated Tech Stack (MedGemma via HF Endpoint, not Gemma 3 via Google AI Studio),
added 50-case MedQA validation results, updated Quick Start for HF token, added
deploy guide to doc index and project structure, added analyze/check_progress to tree
- docs/architecture.md: Updated LLM label, model config, API dependencies, configuration
table defaults, system prompt section (TGI supports system role natively), known limitations
- docs/test_results.md: Added 50-case MedQA results with question-type breakdown, updated header
- docs/writeup_draft.md: Added MedQA 50-case results to performance table, updated validation
section from 'in progress' to actual numbers, updated latency note
- DEVELOPMENT_LOG.md: Fixed environment config table defaults, added Phase 11 (MedGemma HF
Endpoint deployment) and Phase 12 (50-case MedQA validation with full results/analysis)
- TODO.md: Complete rewrite - marked validation as done, updated model info from gemma-3 to
MedGemma, updated project state table, reprioritized tasks (video + writeup now #1-2)
- SUBMISSION_GUIDE.md: Updated days-remaining estimate
- docs/deploy_medgemma_hf.md: Fixed recommended MAX_INPUT/MAX_TOTAL tokens to match actual
deployed values (12288/16384), added note about default 4096 causing 422 errors
- SECURITY.md: Updated API key references (HF token + Google AI Studio)
- orchestrator.py: Fixed docstring to list all 6 pipeline steps including conflict detection

DEVELOPMENT_LOG.md CHANGED
@@ -258,9 +258,10 @@ All config via `.env` (template in `.env.template`):
 
 | Variable | Required | Default | Description |
 |----------|----------|---------|-------------|
-| `MEDGEMMA_API_KEY` | Yes | — | Google AI Studio API key |
-| `MEDGEMMA_BASE_URL` | No | `https://generativelanguage.googleapis.com/v1beta/openai/` | LLM endpoint |
-| `MEDGEMMA_MODEL_ID` | No | `gemma-3-27b-it` | Model identifier |
+| `MEDGEMMA_API_KEY` | Yes | — | HuggingFace API token or Google AI Studio API key |
+| `MEDGEMMA_BASE_URL` | No | `""` (empty) | LLM endpoint (HF Endpoint URL/v1 or Google AI Studio URL) |
+| `MEDGEMMA_MODEL_ID` | No | `google/medgemma` | Model identifier (`tgi` for HF Endpoints, or full model name) |
+| `HF_TOKEN` | No | `""` | HuggingFace token for dataset downloads |
 | `CHROMA_PERSIST_DIR` | No | `./data/chroma` | ChromaDB storage |
 | `EMBEDDING_MODEL` | No | `sentence-transformers/all-MiniLM-L6-v2` | RAG embeddings |
 | `MAX_GUIDELINES` | No | `5` | Guidelines per RAG query |
@@ -314,6 +315,104 @@ Full validation runs (50–100+ cases) are planned for the next session.
 
 ---
 
+## Phase 11: MedGemma HuggingFace Dedicated Endpoint
+
+### Motivation
+
+The competition requires using HAI-DEF models (MedGemma). Google AI Studio served `gemma-3-27b-it` for development, but for the final submission we needed the actual `google/medgemma-27b-text-it` model. HuggingFace Dedicated Endpoints provide an OpenAI-compatible TGI server with scale-to-zero billing.
+
+### Deployment
+
+- **Endpoint name:** `medgemma-27b-cds`
+- **Model:** `google/medgemma-27b-text-it`
+- **Instance:** 1× NVIDIA A100 80 GB (AWS `us-east-1`)
+- **Container:** Text Generation Inference (TGI) with `DTYPE=bfloat16`
+- **Scale-to-zero:** Enabled (15 min idle timeout)
+- **Cost:** ~$2.50/hr when running
+
+### Key Configuration
+
+After initial deployment, the default TGI token limits (`MAX_INPUT_TOKENS=4096`) caused 422 errors on longer synthesis prompts. Updated endpoint environment:
+
+- `MAX_INPUT_TOKENS=12288`
+- `MAX_TOTAL_TOKENS=16384`
+
+Also reduced per-step `max_tokens` to stay within limits:
+- `patient_parser.py`: 1500
+- `clinical_reasoning.py`: 3072
+- `conflict_detection.py`: 2000
+- `synthesis.py`: 3000
+
+### Code Changes
+
+- **`medgemma.py`:** Updated to send `role: "system"` natively (TGI supports it), with automatic fallback to folding the system prompt into the user message for Google AI Studio compatibility.
+- **`.env`:** Updated `MEDGEMMA_BASE_URL` to the HF endpoint URL, `MEDGEMMA_API_KEY` to the HF token, `MEDGEMMA_MODEL_ID=tgi`.
+- **`.env.template`:** Updated with the MedGemma model name and HF Endpoint instructions.
+
+### Verification
+
+Single-case test: Chikungunya question → correct diagnosis appeared at rank 5 in the differential. All 6 pipeline steps completed in 281 s.
+
+**Deployment guide:** `docs/deploy_medgemma_hf.md`
+
+---
+
+## Phase 12: 50-Case MedQA Validation
+
+### Setup
+
+Ran 50 MedQA (USMLE) cases through the full pipeline using the MedGemma HF Endpoint:
+
+```bash
+cd src/backend
+python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2
+```
+
+### Results
+
+| Metric | Value |
+|--------|-------|
+| Cases run | 50 |
+| Pipeline success | 94% (47/50) |
+| Top-1 diagnostic accuracy | 36% |
+| Top-3 diagnostic accuracy | 38% |
+| Differential accuracy | 10% |
+| Mentioned in report | 38% |
+| Avg pipeline time | 204 s/case |
+| Total run time | ~60 min |
+
+### Question Type Breakdown
+
+Used `analyze_results.py` to categorize the 50 cases:
+
+| Type | Count | Mentioned | Differential |
+|------|-------|-----------|--------------|
+| Diagnostic | 36 | 14 (39%) | 5 (14%) |
+| Treatment | 6 | — | — |
+| Pathophysiology | 6 | — | — |
+| Statistics | 1 | — | — |
+| Anatomy | 1 | — | — |
+
+### Key Observations
+
+1. **MedQA includes many non-diagnostic questions** (treatment, mechanism, stats) that the CDS pipeline is not designed to answer — it generates differential diagnoses, not multiple-choice answers.
+2. **On diagnostic questions specifically**, 39% mentioned accuracy is reasonable for a pipeline that wasn't optimized for exam-style questions.
+3. **Pipeline failures (3/50)** were caused by the HF endpoint scaling to zero mid-run. The `--resume` flag successfully continued from the checkpoint.
+4. **Improved clinical reasoning prompt** to demand disease-level diagnoses rather than symptom categories (e.g., "Chikungunya" not "viral arthritis").
+
+### Infrastructure Improvements
+
+- **Incremental JSONL checkpoints:** Each case result is appended to `medqa_checkpoint.jsonl` as it completes.
+- **`--resume` flag:** Skips already-completed cases, enabling graceful recovery from endpoint failures.
+- **`check_progress.py`:** Utility to monitor checkpoint progress during long runs.
+- **`analyze_results.py`:** Categorizes MedQA results by question type for more meaningful accuracy analysis.
+- **Unicode fixes:** Replaced box-drawing characters (`╔═╗║╚╝`) and symbols (`✓✗─`) with ASCII equivalents for Windows console compatibility.
+
+**Files created:** `validation/analyze_results.py`, `validation/check_progress.py`
+**Files modified:** `validation/base.py`, `validation/harness_medqa.py`, `validation/run_validation.py`, `app/tools/clinical_reasoning.py`, `app/tools/synthesis.py`, `app/tools/conflict_detection.py`, `app/tools/patient_parser.py`
+
+---
+
 ## Phase 10: Final Documentation Audit & Cleanup
 
 Performed a full accuracy audit of all 5 documentation files and `test_e2e.py`.
README.md CHANGED
@@ -97,6 +97,20 @@ A validation framework tests the pipeline against real-world clinical datasets:
 
 Initial smoke test (3 MedQA cases): 100% parse success, 66.7% top-1 diagnostic accuracy.
 
+**50-case MedQA validation (MedGemma 27B via HF Endpoint):**
+
+| Metric | Value |
+|--------|-------|
+| Cases run | 50 |
+| Pipeline success | 94% (47/50) |
+| Top-1 diagnostic accuracy | 36% |
+| Top-3 diagnostic accuracy | 38% |
+| Differential accuracy | 10% |
+| Mentioned in report | 38% |
+| Avg pipeline time | 204 s/case |
+
+Of the 50 cases, 36 were diagnostic questions — on those, 39% mentioned the correct diagnosis and 14% placed it in the differential.
+
 See [docs/test_results.md](docs/test_results.md) for full details and reproduction steps.
 
 ---
@@ -137,7 +151,8 @@ medgemma_impact_challenge/
 ├── docs/
 │   ├── architecture.md           # System architecture & design decisions
 │   ├── test_results.md           # Detailed test results & benchmarks
-│   └── writeup_draft.md          # Project writeup / summary
+│   ├── writeup_draft.md          # Project writeup / summary
+│   └── deploy_medgemma_hf.md     # MedGemma HF Endpoint deployment guide
 ├── src/
 │   ├── backend/                  # Python FastAPI backend
 │   │   ├── .env.template         # Environment config template
@@ -152,7 +167,9 @@ medgemma_impact_challenge/
 │   │   │   ├── harness_medqa.py      # MedQA (USMLE) diagnostic accuracy harness
 │   │   │   ├── harness_mtsamples.py  # MTSamples parse quality harness
 │   │   │   ├── harness_pmc.py        # PMC Case Reports diagnostic harness
-│   │   │   └── run_validation.py     # Unified CLI runner
+│   │   │   ├── run_validation.py     # Unified CLI runner
+│   │   │   ├── analyze_results.py    # Question-type categorization & analysis
+│   │   │   └── check_progress.py     # Checkpoint progress monitor
 │   │   └── app/
 │   │       ├── main.py       # FastAPI entry (CORS, routers, lifespan)
 │   │       ├── config.py     # Pydantic Settings (ports, models, dirs)
@@ -204,7 +221,7 @@ medgemma_impact_challenge/
 
 - **Python 3.10+** (tested with Python 3.10)
 - **Node.js 18+** (tested with Node.js 18)
-- **API Key:** Google AI Studio API key for Gemma model access
+- **API Key:** HuggingFace API token (for MedGemma endpoint) or Google AI Studio API key
 
 ### Backend Setup
 
@@ -221,7 +238,9 @@ pip install -r requirements.txt
 
 # Configure environment
 copy .env.template .env  # Windows (or: cp .env.template .env)
-# Edit .env — set MEDGEMMA_API_KEY to your Google AI Studio key
+# Edit .env — set MEDGEMMA_API_KEY and MEDGEMMA_BASE_URL
+# For HF Endpoints: see docs/deploy_medgemma_hf.md
+# For Google AI Studio: set MEDGEMMA_API_KEY to your Google AI Studio key
 
 # Start the backend
 uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
@@ -281,7 +300,7 @@ python -m validation.run_validation --all --max-cases 10  # All 3 datasets
 |-------|-----------|---------|
 | Frontend | Next.js 14, React 18, TypeScript, Tailwind CSS | Patient input, pipeline visualization, report display |
 | API | FastAPI, WebSocket, Pydantic v2 | REST endpoints + real-time streaming |
-| LLM | Gemma 3 27B IT (via Google AI Studio) | Clinical reasoning + synthesis |
+| LLM | MedGemma 27B Text IT (via HuggingFace Dedicated Endpoint) | Clinical reasoning + synthesis |
 | RAG | ChromaDB, sentence-transformers (all-MiniLM-L6-v2) | Clinical guideline retrieval |
 | Drug Data | OpenFDA API, RxNorm / NLM API | Drug interactions, medication normalization |
 | Validation | Pydantic | Structured output validation across all pipeline steps |
@@ -326,6 +345,7 @@ curl -X POST http://localhost:8000/api/cases/submit \
 | [SECURITY.md](SECURITY.md) | Security policy and responsible disclosure |
 | [TODO.md](TODO.md) | Next-session action items and project state |
 | [SUBMISSION_GUIDE.md](SUBMISSION_GUIDE.md) | Competition submission strategy |
+| [docs/deploy_medgemma_hf.md](docs/deploy_medgemma_hf.md) | MedGemma HuggingFace Endpoint deployment guide |
 
 ---
 
SECURITY.md CHANGED
@@ -37,12 +37,12 @@ We will acknowledge receipt within 48 hours and aim to provide a fix or mitigati
 - This system processes clinical text that could contain protected health information (PHI)
 - **No real patient data should ever be used** with this demonstration system
 - In a production deployment, HIPAA compliance would require: encrypted storage, audit logging, access controls, and BAAs with all third-party services
-- The Gemma model can be self-hosted on-premises to avoid sending data to external APIs
+- The MedGemma model can be self-hosted on-premises to avoid sending data to external APIs
 
 ### API Keys
 
-- The Google AI Studio API key is stored in `.env` (gitignored)
-- Never commit `.env` or any file containing API keys
+- API keys/tokens (HuggingFace token, Google AI Studio key) are stored in `.env` (gitignored)
+- Never commit `.env` or any file containing API keys or tokens
 - The `.env.template` file shows required variables without actual values
 
 ### LLM-Specific Risks
SUBMISSION_GUIDE.md CHANGED
@@ -8,7 +8,7 @@ Jan 13 ─────────────────────── Feb
 ◄────── Build & Iterate ──────►
 ```
 
-**⏰ Days remaining as of Feb 13, 2026: ~11 days**
+**⏰ Days remaining as of Feb 15, 2026: ~9 days**
 
 ---
 
TODO.md CHANGED
@@ -1,39 +1,13 @@
 # TODO — Next Session Action Items
 
-> **Last updated:** End of validation framework + documentation audit session.
+> **Last updated:** After 50-case MedQA validation, MedGemma HF Endpoint deployment, and documentation audit.
 > **Read this first** if you're a new AI instance picking up this project.
 
 ---
 
 ## High Priority (Do Next)
 
-### 1. Run Full-Scale Validation (~2 hours total)
-
-The validation framework is built and tested with a 3-case smoke test. It needs a proper run:
-
-```bash
-cd src/backend
-
-# MedQA — 50 cases, ~45 min
-python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2
-
-# MTSamples — 50 cases, ~45 min
-python -m validation.run_validation --mtsamples --max-cases 50 --seed 42 --delay 2
-
-# PMC Case Reports — 10-20 cases (smaller pool), ~15-30 min
-python -m validation.run_validation --pmc --max-cases 20 --seed 42 --delay 2
-```
-
-Results save to `validation/results/`. After running, update:
-- `docs/test_results.md` Section 6 with real numbers (replace smoke test placeholder)
-- `docs/writeup_draft.md` validation methodology section with actual metrics
-- `README.md` "External Dataset Validation" section
-
-### 2. Update Writeup with Actual Validation Metrics
-
-`docs/writeup_draft.md` currently says "initial smoke test" and "in progress." Once full validation is done, replace with actual numbers (top-1 accuracy, parse success rates, etc.).
-
-### 3. Record a Demo Video
+### 1. Record a Demo Video
 
 The writeup says "Video: [To be recorded]". Record a ~3 min screencast showing:
 1. Pasting a patient case
@@ -41,6 +15,23 @@ The writeup says "Video: [To be recorded]". Record a ~3 min screencast showing:
 3. Reviewing the CDS report (especially conflicts section)
 4. Showing validation results
 
+**Note:** Resume the HF Endpoint first (`medgemma-27b-cds` on HuggingFace). It costs ~$2.50/hr and is currently **paused**. Allow 5–15 min for cold start.
+
+### 2. Finalize Submission Writeup
+
+`docs/writeup_draft.md` has been updated with 50-case MedQA results. Still needs:
+- Team name / member info filled in
+- Final polish for 3-page limit
+- Links to video and live demo (once recorded/deployed)
+
+### 3. Improve Diagnostic Accuracy (Optional)
+
+Current 50-case MedQA accuracy: 36% top-1, 38% mentioned. Potential improvements:
+- **Specialist agents (Option B):** Route to domain-specific reasoning agents for cardiology, neurology, etc.
+- **Better prompting:** Further refine the `clinical_reasoning.py` system prompt
+- **Multi-turn reasoning:** Add a self-critique / verification step before synthesis
+- **Run MTSamples + PMC validation** for additional metrics
+
 ---
 
 ## Medium Priority
@@ -67,12 +58,12 @@ We deliberately removed numeric confidence scores (see Phase 8 in DEVELOPMENT_LO
 
 ## Low Priority / Future
 
-### 7. Model Upgrade Path
+### 7. Model Optimization
 
-Currently using `gemma-3-27b-it`. When available, evaluate:
-- MedGemma (medical-specific Gemma fine-tune) if released
-- Smaller/distilled models for latency reduction
+Currently using `google/medgemma-27b-text-it` on A100 80 GB. Options:
+- Smaller/quantized models for latency reduction (medgemma-4b-it for lighter steps)
 - Specialized models for individual pipeline steps (e.g., a parse-only model)
+- Batch inference optimizations
 
 ### 8. EHR Integration Prototype
 
@@ -95,15 +86,24 @@ Current input is manual text paste. A FHIR client could auto-populate patient da
 | Frontend (Next.js) | ✅ Complete | Real-time pipeline viz, CDS report with conflicts |
 | RAG (62 guidelines) | ✅ Complete | 30/30 quality test, 100% top-1 accuracy |
 | Conflict Detection | ✅ Complete | Integrated into pipeline, frontend, and docs |
-| Validation Framework | ✅ Built | Smoke-tested only; needs full-scale runs |
-| Documentation (5 files) | ✅ Audited | All docs updated and cross-checked |
+| MedGemma HF Endpoint | ✅ Deployed | `medgemma-27b-cds`, A100 80 GB, scale-to-zero, **currently paused** |
+| MedQA Validation (50 cases) | ✅ Complete | 36% top-1, 38% mentioned, 94% pipeline success |
+| Validation Framework | ✅ Complete | MedQA done; MTSamples + PMC harnesses built but not yet run at scale |
+| Documentation (8+ files) | ✅ Audited | All docs updated and cross-checked |
 | test_e2e.py | ✅ Fixed | Now asserts 6 steps + conflict_detection |
 | GitHub | ✅ Pushed | `bshepp/clinical-decision-support-agent` (master) |
+| Demo Video | ⬜ Not started | Required for submission |
+| Submission Writeup | 🔄 In progress | Template filled, needs final polish |
 
 **Key files:**
 - Backend entry: `src/backend/app/main.py`
 - Orchestrator: `src/backend/app/agent/orchestrator.py`
+- MedGemma service: `src/backend/app/services/medgemma.py`
 - Validation CLI: `src/backend/validation/run_validation.py`
+- HF Endpoint guide: `docs/deploy_medgemma_hf.md`
 - All docs: `README.md`, `docs/architecture.md`, `docs/test_results.md`, `docs/writeup_draft.md`, `DEVELOPMENT_LOG.md`
 
-**Dev ports:** Backend = 8002 (not 8000 — zombie process issue), Frontend = 3000
+**Infrastructure:**
+- HF Endpoint: `medgemma-27b-cds` at `https://lisvpf8if1yhgxn2.us-east-1.aws.endpoints.huggingface.cloud`
+- Dev ports: Backend = 8002 (not 8000 — zombie process issue), Frontend = 3000
+- Virtual env: `src/backend/venv/`
docs/architecture.md CHANGED
@@ -50,8 +50,8 @@ structured clinical decision support report — all in seconds.
 │  └─────────────────┘                                             │
 └──────────────────────────────────────────────────────────────────┘
 
-LLM: gemma-3-27b-it via Google AI Studio
-     (OpenAI-compatible endpoint)
+LLM: google/medgemma-27b-text-it via HuggingFace Dedicated Endpoint
+     (OpenAI-compatible TGI, 1× A100 80 GB, bfloat16)
 ```
 
 ---
@@ -138,17 +138,18 @@ LLM: gemma-3-27b-it via Google AI Studio
 
 ### Model Configuration
 
-- **Model:** `gemma-3-27b-it`
-- **API:** Google AI Studio (OpenAI-compatible endpoint)
-- **Base URL:** `https://generativelanguage.googleapis.com/v1beta/openai/`
+- **Model:** `google/medgemma-27b-text-it` (MedGemma from HAI-DEF)
+- **API:** HuggingFace Dedicated Endpoint (TGI), with Google AI Studio as fallback
+- **Base URL:** `https://lisvpf8if1yhgxn2.us-east-1.aws.endpoints.huggingface.cloud/v1` (HF Endpoint)
 - **Client:** OpenAI Python SDK (`openai==1.51.0`)
 - **Service:** `medgemma.py` wraps all LLM calls
+- **Endpoint config:** `MAX_INPUT_TOKENS=12288`, `MAX_TOTAL_TOKENS=16384`, `DTYPE=bfloat16`
 
-### Gemma System Prompt Workaround
+### Gemma System Prompt Handling
 
-**Problem discovered during development:** Gemma models accessed via the Google AI Studio OpenAI-compatible endpoint return a 400 error if you include a `role: "system"` message. The API does not support the system role.
+**MedGemma via TGI** natively supports `role: "system"` messages, so we send system/user messages properly.
 
-**Solution implemented:** `medgemma.py`'s `_generate_api` method detects system messages and folds them into the first user message with a `[System Instructions]` prefix:
+**Fallback for Google AI Studio:** If the backend is plain Gemma on Google AI Studio (which rejects the system role), the code automatically catches the error and falls back to folding the system prompt into the first user message:
 
 ```python
 # If system message exists, fold it into the first user message
@@ -225,7 +226,8 @@ All pipeline data is strongly typed via Pydantic models in `schemas.py` (~280 li
 
 | API | Purpose | Authentication | Rate Limits |
 |-----|---------|---------------|-------------|
-| Google AI Studio | Gemma 3 27B IT LLM inference | API key | Per-key quota |
+| HuggingFace Dedicated Endpoint | MedGemma 27B Text IT LLM inference | HF API token | Dedicated GPU (no shared limits) |
+| Google AI Studio (fallback) | Gemma 3 27B IT LLM inference | API key | Per-key quota |
 | OpenFDA | Drug adverse event data | None (public) | 240 req/min (with key), 40/min (without) |
 | RxNorm / NLM | Drug normalization (name → RxCUI), pairwise interactions | None (public) | 20 req/sec |
 
@@ -269,9 +271,10 @@ All configuration lives in `config.py` (Pydantic Settings) and `.env`:
 
 | Setting | Default | Description |
 |---------|---------|-------------|
-| `MEDGEMMA_API_KEY` | (required) | Google AI Studio API key |
-| `MEDGEMMA_BASE_URL` | `https://generativelanguage.googleapis.com/v1beta/openai/` | LLM API endpoint |
-| `MEDGEMMA_MODEL_ID` | `gemma-3-27b-it` | Model identifier |
+| `MEDGEMMA_API_KEY` | (required) | HuggingFace API token or Google AI Studio API key |
+| `MEDGEMMA_BASE_URL` | `""` (empty) | LLM API endpoint (HF Endpoint URL with /v1, or Google AI Studio URL) |
+| `MEDGEMMA_MODEL_ID` | `google/medgemma` | Model identifier (`tgi` for HF Endpoints, or full model name) |
+| `HF_TOKEN` | `""` | HuggingFace token for dataset downloads |
 | `CHROMA_PERSIST_DIR` | `./data/chroma` | ChromaDB storage directory |
 | `EMBEDDING_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | Embedding model for RAG |
 | `MAX_GUIDELINES` | `5` | Number of guidelines to retrieve per query |
@@ -283,7 +286,7 @@ All configuration lives in `config.py` (Pydantic Settings) and `.env`:
 
 - **LLM latency:** Full pipeline takes ~75 s due to multiple sequential LLM calls. Could be improved with smaller models or parallel LLM calls.
 - **No authentication:** No user auth — designed as a local demo / research tool.
-- **Single-model:** Uses only Gemma 3 27B IT. Could benefit from specialized models for different steps.
+- **Single-model:** Uses only MedGemma 27B Text IT. Could benefit from specialized models for different steps.
 - **Guideline currency:** Guidelines are a static snapshot. A production system would need automated updates.
 - **No EHR integration:** Input is manual text paste. A production system would integrate with EHR FHIR APIs.
 
docs/deploy_medgemma_hf.md CHANGED
@@ -35,10 +35,14 @@ OpenAI-compatible API.
 - GCP: ~$3.60/hr
 6. **Container type**: Text Generation Inference (TGI) — this is the default.
 7. **Advanced Settings**:
-   - **Max Input Length**: `32768`
-   - **Max Total Tokens**: `40960`
+   - **Max Input Length**: `12288` (default 4096 is too small for synthesis prompts)
+   - **Max Total Tokens**: `16384`
    - **Quantization**: `none` (bfloat16 fits in 80 GB)
    - **Scale-to-zero**: **Enable** (idle timeout: 15 min recommended)
+
+   > **Note:** The default TGI `MAX_INPUT_TOKENS=4096` will cause 422 errors
+   > on longer pipeline prompts (especially synthesis). We found `12288` /
+   > `16384` to be sufficient for all 6 pipeline steps.
 8. Click **Create Endpoint**.
 
 ### 2. Wait for the endpoint to become ready
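The 422 errors mentioned in the deploy-guide note come from TGI's request validation: a request fails when the prompt exceeds `MAX_INPUT_TOKENS`, or when prompt plus `max_new_tokens` exceeds `MAX_TOTAL_TOKENS`. A pre-flight check can be sketched as below (illustrative only; the limits are the values this guide deploys, and the per-step budgets come from DEVELOPMENT_LOG Phase 11).

```python
MAX_INPUT_TOKENS = 12288   # TGI endpoint limit set in this guide
MAX_TOTAL_TOKENS = 16384

def fits_budget(input_tokens: int, max_new_tokens: int) -> bool:
    """True if a request stays within the TGI endpoint's token limits."""
    return (input_tokens <= MAX_INPUT_TOKENS
            and input_tokens + max_new_tokens <= MAX_TOTAL_TOKENS)

# Per-step max_tokens used by the pipeline (from DEVELOPMENT_LOG Phase 11)
STEP_MAX_TOKENS = {
    "patient_parser": 1500,
    "clinical_reasoning": 3072,
    "conflict_detection": 2000,
    "synthesis": 3000,
}
```

With these limits, even a maximal 12288-token prompt leaves room for every step's output budget, which is why the raised values eliminated the 422s.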
docs/test_results.md CHANGED
@@ -1,6 +1,6 @@
 # Test Results — CDS Agent

-> Last updated after RAG expansion to 62 guidelines across 14 specialties.
+> Last updated after 50-case MedQA validation with MedGemma 27B via HuggingFace Dedicated Endpoint.

 ---

@@ -221,7 +221,34 @@ Tests use only the standard library + `httpx` (for REST calls) and the backend's
 | Top-3 diagnostic accuracy | 66.7% (2/3) |
 | Avg pipeline time | ~94 s per case |

-> **Note:** This is a smoke test only. A full validation run (50–100 cases per dataset) is planned but takes ~45 min per dataset.
+### 50-Case MedQA Validation (MedGemma 27B Text IT via HF Endpoint)
+
+Run with: `python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2`
+
+| Metric | Value |
+|--------|-------|
+| Cases run | 50 |
+| Pipeline success | 94% (47/50) |
+| Top-1 diagnostic accuracy | 36% |
+| Top-3 diagnostic accuracy | 38% |
+| Differential accuracy | 10% |
+| Mentioned in report | 38% |
+| Avg pipeline time | 204 s per case |
+| Total run time | ~60 min |
+
+**Breakdown by question type (50 cases):**
+
+| Type | Count | Mentioned | Differential |
+|------|-------|-----------|--------------|
+| Diagnostic | 36 | 14 (39%) | 5 (14%) |
+| Treatment | 6 | — | — |
+| Pathophysiology | 6 | — | — |
+| Statistics | 1 | — | — |
+| Anatomy | 1 | — | — |
+
+> **Notes:** MedQA questions include many non-diagnostic question types (treatment selection, mechanism of action, etc.) which the CDS pipeline is not designed to answer. On diagnostic-only questions, the pipeline mentioned the correct diagnosis 39% of the time. Pipeline failures (3/50) were due to HF endpoint scale-to-zero mid-run.
+
+> Full validation was run on Feb 15, 2026 using the `medgemma-27b-cds` HuggingFace Dedicated Endpoint (1× A100 80 GB, bfloat16). Incremental checkpoints saved to `validation/results/medqa_checkpoint.jsonl` with `--resume` support.

 ### How to Reproduce
docs/writeup_draft.md CHANGED
@@ -107,6 +107,8 @@ No fine-tuning was performed in the current version. The base MedGemma model (`m
 | RAG retrieval quality | 30/30 queries passed (100%), avg relevance 0.639 |
 | Clinical test suite | 22 scenarios across 14 specialties |
 | Top-1 RAG accuracy | 100% — correct guideline ranked #1 for all queries |
+| **MedQA 50-case validation** | **36% top-1, 38% top-3, 38% mentioned, 94% pipeline success** |
+| MedQA diagnostic-only (36 cases) | 39% mentioned, 14% differential |

 **Application stack:**

@@ -121,7 +123,7 @@ No fine-tuning was performed in the current version. The base MedGemma model (`m
 **Deployment considerations:**

 - **HIPAA compliance:** MedGemma is an open-weight model that can be self-hosted on-premises, eliminating the need to send patient data to external APIs. This is critical for healthcare deployment.
-- **Latency:** Current pipeline takes ~75 s end-to-end. For production, this could be reduced with: smaller/distilled models, parallel LLM calls, or GPU-accelerated inference.
+- **Latency:** Current pipeline takes ~75 s for a single E2E case (local), or ~204 s avg on the HuggingFace Dedicated Endpoint (50-case MedQA validation). For production, this could be reduced with: smaller/distilled models, parallel LLM calls, or GPU-accelerated inference with higher throughput.
 - **Scalability:** FastAPI + uvicorn supports async request handling. For high-throughput deployment, add worker processes and a task queue (e.g., Celery).
 - **EHR integration:** Current input is manual text paste. A production system would integrate with EHR systems via FHIR APIs for automatic patient data extraction.

@@ -139,7 +141,7 @@ The validation harness calls the `Orchestrator` directly (no HTTP server), enabl

 **Initial smoke test (3 MedQA cases):** 100% parse success, 66.7% top-1 diagnostic accuracy, ~94 s avg per case.

-Full-scale validation (50–100+ cases per dataset) is in progress.
+**50-case MedQA validation (MedGemma 27B via HF Endpoint):** 94% pipeline success, 36% top-1 diagnostic accuracy, 38% mentioned in report, 204 s avg per case. On diagnostic-only questions (36/50), 39% mentioned the correct diagnosis. Full results in [docs/test_results.md](docs/test_results.md).

 **Practical usage:**

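The scalability note in the writeup (async handling plus a background task queue) can be sketched without Celery. This is a toy shape under stated assumptions, not the repo's implementation: jobs enter a queue, workers drain it, and the caller collects results by job id.

```python
import asyncio


async def worker(queue: asyncio.Queue, results: dict) -> None:
    """Drain jobs from the queue; each job stands in for a slow pipeline run."""
    while True:
        job_id, case = await queue.get()
        await asyncio.sleep(0)  # placeholder for the ~200 s pipeline call
        results[job_id] = f"report for {case}"
        queue.task_done()


async def main() -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    results: dict = {}
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(2)]
    for i, case in enumerate(["case A", "case B", "case C"]):
        await queue.put((i, case))
    await queue.join()  # block until every queued job is processed
    for w in workers:
        w.cancel()
    return results
```

With a real deployment the queue would be external (Celery/Redis) so the FastAPI process can return a job id immediately instead of holding the connection open for the full pipeline.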
src/backend/analyze_checkpoint.py ADDED
@@ -0,0 +1,105 @@
+"""Quick analysis of MedQA checkpoint data."""
+import json
+
+path = "validation/results/medqa_checkpoint.jsonl"
+with open(path) as f:
+    results = [json.loads(line) for line in f]
+
+print(f"Cases completed: {len(results)}\n")
+
+# ── Table view ──
+fmt = "{:<12} {:>3} {:>3} {:>4} {:>7} {:>3} {:>4} {:<15} {:<42} {}"
+print(fmt.format("ID", "t1", "t3", "diff", "ms", "#dx", "rnk", "match_loc", "correct_answer", "top_diagnosis"))
+print("-" * 145)
+
+for r in results:
+    d = r["details"]
+    t1 = "Y" if r["scores"]["top1_accuracy"] else "N"
+    t3 = "Y" if r["scores"]["top3_accuracy"] else "N"
+    da = "Y" if r["scores"].get("differential_accuracy") else "N"
+    rank = d.get("found_at_rank", -1)
+    loc = d.get("match_location", "?")
+    ca = d["correct_answer"][:42]
+    td = d.get("top_diagnosis", "?")[:45]
+    print(fmt.format(r["case_id"], t1, t3, da, r["pipeline_time_ms"], d.get("num_diagnoses", 0), rank, loc, ca, td))
+
+print()
+
+# ── Timing analysis ──
+correct = [r for r in results if r["scores"]["top1_accuracy"]]
+wrong = [r for r in results if not r["scores"]["top1_accuracy"]]
+mentioned = [r for r in results if r["scores"].get("mentioned_accuracy")]
+top3 = [r for r in results if r["scores"]["top3_accuracy"]]
+diff_only = [r for r in results if r["scores"].get("differential_accuracy")]
+
+if correct:
+    avg = sum(r["pipeline_time_ms"] for r in correct) / len(correct)
+    print(f"Correct (top1) avg time: {avg:.0f}ms ({len(correct)}/{len(results)} = {len(correct)/len(results)*100:.0f}%)")
+if top3:
+    avg = sum(r["pipeline_time_ms"] for r in top3) / len(top3)
+    print(f"Correct (top3) avg time: {avg:.0f}ms ({len(top3)}/{len(results)} = {len(top3)/len(results)*100:.0f}%)")
+if diff_only:
+    avg = sum(r["pipeline_time_ms"] for r in diff_only) / len(diff_only)
+    print(f"Differential only: {avg:.0f}ms ({len(diff_only)}/{len(results)} = {len(diff_only)/len(results)*100:.0f}%)")
+if wrong:
+    avg = sum(r["pipeline_time_ms"] for r in wrong) / len(wrong)
+    print(f"Wrong (top1) avg time: {avg:.0f}ms ({len(wrong)}/{len(results)} = {len(wrong)/len(results)*100:.0f}%)")
+if mentioned:
+    print(f"Mentioned anywhere: {len(mentioned)}/{len(results)}")
+
+# ── Match location breakdown ──
+print("\n=== MATCH LOCATION BREAKDOWN ===")
+loc_counts = {}
+for r in results:
+    loc = r["details"].get("match_location", "not_found")
+    loc_counts[loc] = loc_counts.get(loc, 0) + 1
+for loc, count in sorted(loc_counts.items()):
+    print(f"  {loc:<20} {count:>3} ({count/len(results)*100:.0f}%)")
+
+# ── Detailed per-case (new fields if available) ──
+print("\n=== PER-CASE DETAIL ===")
+for r in results:
+    d = r["details"]
+    cid = r["case_id"]
+    loc = d.get("match_location", "?")
+    ca = d["correct_answer"]
+    td = d.get("top_diagnosis", "?")
+    all_dx = d.get("all_diagnoses", [td])
+    all_next = d.get("all_next_steps", [])
+    all_recs = d.get("all_recommendations", [])
+    t1 = "Y" if r["scores"]["top1_accuracy"] else "N"
+
+    print(f"\n  {cid} [t1={t1}, loc={loc}]")
+    print(f"    Expected: {ca}")
+    print(f"    Differential: {', '.join(all_dx)}")
+    if all_next:
+        print(f"    Next steps: {'; '.join(all_next[:3])}")
+    if all_recs:
+        print(f"    Recommendations: {'; '.join(str(rec)[:60] for rec in all_recs[:3])}")
+
+# ── Answer type vs accuracy ──
+print("\n=== ANSWER TYPE vs ACCURACY ===")
+dx_correct = dx_total = mgmt_correct = mgmt_total = 0
+action_words = ["start", "stop", "give", "prescribe", "perform", "order", "refer",
+                "increase", "decrease", "switch", "add", "monitor", "observation",
+                "reassure", "discharge", "admit", "excess", "adaptation", "exclusion",
+                "it is", "right-sided", "affective", "exploratory", "lytic"]
+for r in results:
+    ca = r["details"]["correct_answer"]
+    is_dx = not any(w.lower() in ca.lower() for w in action_words)
+    if is_dx:
+        dx_total += 1
+        if r["scores"]["top1_accuracy"]:
+            dx_correct += 1
+    else:
+        mgmt_total += 1
+        if r["scores"]["top1_accuracy"]:
+            mgmt_correct += 1
+
+if dx_total:
+    print(f"  Diagnosis questions: {dx_correct}/{dx_total} = {dx_correct/dx_total*100:.0f}%")
+if mgmt_total:
+    print(f"  Mgmt/concept questions: {mgmt_correct}/{mgmt_total} = {mgmt_correct/mgmt_total*100:.0f}%")
+
+dx_counts = [r["details"].get("num_diagnoses", 0) for r in results]
+print(f"\nDiagnoses generated: min={min(dx_counts)}, max={max(dx_counts)}, avg={sum(dx_counts)/len(dx_counts):.1f}")
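The script above reads checkpoint records shaped roughly like the following. The field names are taken from what `analyze_checkpoint.py` accesses; the values are hypothetical:

```python
import json

# Hypothetical checkpoint record matching the fields analyze_checkpoint.py reads.
record = {
    "case_id": "medqa_0001",
    "pipeline_time_ms": 204000,
    "scores": {
        "top1_accuracy": True,
        "top3_accuracy": True,
        "mentioned_accuracy": True,
        "differential_accuracy": False,
    },
    "details": {
        "correct_answer": "Acute pancreatitis",
        "top_diagnosis": "Acute pancreatitis",
        "num_diagnoses": 4,
        "found_at_rank": 1,
        "match_location": "top_diagnosis",
    },
}

# One JSON object per line in the JSONL file; the analyzer round-trips
# these with json.loads, so the record must survive a dumps/loads cycle.
line = json.dumps(record)
assert json.loads(line)["scores"]["top1_accuracy"] is True
```

Optional fields like `all_diagnoses`, `all_next_steps`, and `all_recommendations` are accessed with `.get()`, so older checkpoints without them still parse.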
src/backend/app/agent/orchestrator.py CHANGED
@@ -2,11 +2,12 @@
 Agent Orchestrator — the brain of the CDS Agent.

 Controls the multi-step pipeline:
-1. Parse patient data
-2. Clinical reasoning (MedGemma)
-3. Drug interaction check
-4. Guideline retrieval (RAG)
-5. Synthesis (MedGemma)
+1. Parse patient data (MedGemma)
+2. Clinical reasoning / differential diagnosis (MedGemma)
+3. Drug interaction check (OpenFDA + RxNorm APIs)
+4. Guideline retrieval (RAG over ChromaDB)
+5. Conflict detection (MedGemma)
+6. Synthesis into CDS report (MedGemma)

 Each step is a tool call. The orchestrator manages state, handles errors,
 and streams step updates to the frontend via a callback.
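The six-step control flow the docstring describes can be sketched as a loop over tool callables with per-step error handling and a progress callback. This is a toy shape only; the real `Orchestrator` class and its tool signatures differ:

```python
from typing import Callable, Dict


def run_pipeline(case_text: str, tools: Dict[str, Callable],
                 on_step: Callable[[str, str], None] = lambda *_: None) -> dict:
    """Run each pipeline step in order, accumulating results in a state dict.

    `on_step(name, status)` mimics the orchestrator's streaming callback;
    a failed step records its error and the pipeline keeps going.
    """
    steps = ["parse", "reason", "drug_check", "retrieve_guidelines",
             "detect_conflicts", "synthesize"]
    state: dict = {"input": case_text}
    for name in steps:
        on_step(name, "started")
        try:
            state[name] = tools[name](state)
            on_step(name, "done")
        except Exception as exc:  # degrade gracefully instead of aborting
            state[name] = {"error": str(exc)}
            on_step(name, "failed")
    return state
```

Each tool receives the accumulated state, so synthesis can see the differential, the drug-check output, and the retrieved guidelines, which is what steps 5 and 6 in the docstring rely on.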
src/backend/validation_test_output.txt ADDED
@@ -0,0 +1,507 @@
+==========================================================
+Clinical Decision Support Agent - Validation Suite
+==========================================================
+
+Datasets: MedQA
+Cases/dataset: 1
+Drug check: Yes
+Guidelines: Yes
+Resume: No
+Fetch only: No
+
+============================================================
+DATASET 1: MedQA (USMLE-style diagnostic accuracy)
+============================================================
+Loading MedQA from cache: F:\kaggle\medgemma_impact_challenge\src\backend\validation\data\medqa_test.jsonl
+Loaded 1 MedQA cases

[Remainder of the added file: a PowerShell `NativeCommandError` notice from stderr redirection of `.\venv\Scripts\python.exe -m validation.run_validation`, followed by several hundred mis-encoded `Loading weights: N%` progress-bar lines ("Materializing param=…") emitted while a local model loaded. Truncated here; no further substantive output.]
327
+ 3109.48it/s, Materializing param=encoder.layer.4.attention.output.dense.weight]
328
+ Loading weights: 72%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÅ | 74/103 [00:00<00:00,
329
+ 3143.01it/s, Materializing param=encoder.layer.4.attention.self.key.bias]
330
+ Loading weights: 72%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÅ | 74/103 [00:00<00:00,
331
+ 3138.78it/s, Materializing param=encoder.layer.4.attention.self.key.bias]
332
+ Loading weights: 73%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÄ | 75/103 [00:00<00:00,
333
+ 3171.48it/s, Materializing param=encoder.layer.4.attention.self.key.weight]
334
+ Loading weights: 73%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÄ | 75/103 [00:00<00:00,
335
+ 3166.28it/s, Materializing param=encoder.layer.4.attention.self.key.weight]
336
+ Loading weights: 74%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûì | 76/103 [00:00<00:00,
337
+ 3196.62it/s, Materializing param=encoder.layer.4.attention.self.query.bias]
338
+ Loading weights: 74%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûì | 76/103 [00:00<00:00,
339
+ 3190.96it/s, Materializing param=encoder.layer.4.attention.self.query.bias]
340
+ Loading weights: 75%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûì | 77/103 [00:00<00:00,
341
+ 3220.69it/s, Materializing param=encoder.layer.4.attention.self.query.weight]
342
+ Loading weights: 75%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûì | 77/103 [00:00<00:00,
343
+ 3216.07it/s, Materializing param=encoder.layer.4.attention.self.query.weight]
344
+ Loading weights: 76%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûî | 78/103 [00:00<00:00,
345
+ 3246.78it/s, Materializing param=encoder.layer.4.attention.self.value.bias]
346
+ Loading weights: 76%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûî | 78/103 [00:00<00:00,
347
+ 3241.80it/s, Materializing param=encoder.layer.4.attention.self.value.bias]
348
+ Loading weights: 77%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûï | 79/103 [00:00<00:00,
349
+ 3271.33it/s, Materializing param=encoder.layer.4.attention.self.value.weight]
350
+ Loading weights: 77%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûï | 79/103 [00:00<00:00,
351
+ 3266.37it/s, Materializing param=encoder.layer.4.attention.self.value.weight]
352
+ Loading weights: 78%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûè | 80/103 [00:00<00:00,
353
+ 3296.53it/s, Materializing param=encoder.layer.4.intermediate.dense.bias]
354
+ Loading weights: 78%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûè | 80/103 [00:00<00:00,
355
+ 3291.42it/s, Materializing param=encoder.layer.4.intermediate.dense.bias]
356
+ Loading weights: 79%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûè | 81/103 [00:00<00:00,
357
+ 3318.57it/s, Materializing param=encoder.layer.4.intermediate.dense.weight]
358
+ Loading weights: 79%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûè | 81/103 [00:00<00:00,
359
+ 3312.52it/s, Materializing param=encoder.layer.4.intermediate.dense.weight]
360
+ Loading weights: 80%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûë | 82/103 [00:00<00:00,
361
+ 3339.93it/s, Materializing param=encoder.layer.4.output.LayerNorm.bias]
362
+ Loading weights: 80%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûë | 82/103 [00:00<00:00,
363
+ 3334.33it/s, Materializing param=encoder.layer.4.output.LayerNorm.bias]
364
+ Loading weights: 81%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûê | 83/103 [00:00<00:00,
365
+ 3363.48it/s, Materializing param=encoder.layer.4.output.LayerNorm.weight]
366
+ Loading weights: 81%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûê | 83/103 [00:00<00:00,
367
+ 3358.91it/s, Materializing param=encoder.layer.4.output.LayerNorm.weight]
368
+ Loading weights: 82%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÅ | 84/103 [00:00<00:00,
369
+ 3389.40it/s, Materializing param=encoder.layer.4.output.dense.bias]
370
+ Loading weights: 82%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÅ | 84/103 [00:00<00:00,
371
+ 3384.62it/s, Materializing param=encoder.layer.4.output.dense.bias]
372
+ Loading weights: 83%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÄ | 85/103 [00:00<00:00,
373
+ 3415.69it/s, Materializing param=encoder.layer.4.output.dense.weight]
374
+ Loading weights: 83%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÄ | 85/103 [00:00<00:00,
375
+ 3410.72it/s, Materializing param=encoder.layer.4.output.dense.weight]
376
+ Loading weights: 83%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÄ | 86/103 [00:00<00:00,
377
+ 3440.61it/s, Materializing
378
+ param=encoder.layer.5.attention.output.LayerNorm.bias]
379
+ Loading weights: 83%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÄ | 86/103 [00:00<00:00,
380
+ 3436.19it/s, Materializing
381
+ param=encoder.layer.5.attention.output.LayerNorm.bias]
382
+ Loading weights: 84%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûì | 87/103 [00:00<00:00,
383
+ 3467.59it/s, Materializing
384
+ param=encoder.layer.5.attention.output.LayerNorm.weight]
385
+ Loading weights: 84%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûì | 87/103 [00:00<00:00,
386
+ 3463.04it/s, Materializing
387
+ param=encoder.layer.5.attention.output.LayerNorm.weight]
388
+ Loading weights: 85%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûî | 88/103 [00:00<00:00,
389
+ 3493.53it/s, Materializing param=encoder.layer.5.attention.output.dense.bias]
390
+
391
+ Loading weights: 85%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûî | 88/103 [00:00<00:00,
392
+ 3489.07it/s, Materializing param=encoder.layer.5.attention.output.dense.bias]
393
+ Loading weights: 86%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûï | 89/103 [00:00<00:00,
394
+ 3520.50it/s, Materializing param=encoder.layer.5.attention.output.dense.weight]
395
+ Loading weights: 86%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûï | 89/103 [00:00<00:00,
396
+ 3515.86it/s, Materializing param=encoder.layer.5.attention.output.dense.weight]
397
+ Loading weights: 87%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûï | 90/103 [00:00<00:00,
398
+ 3545.01it/s, Materializing param=encoder.layer.5.attention.self.key.bias]
399
+ Loading weights: 87%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûï | 90/103 [00:00<00:00,
400
+ 3540.39it/s, Materializing param=encoder.layer.5.attention.self.key.bias]
401
+ Loading weights: 88%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûè | 91/103 [00:00<00:00,
402
+ 3569.62it/s, Materializing param=encoder.layer.5.attention.self.key.weight]
403
+ Loading weights: 88%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûè | 91/103 [00:00<00:00,
404
+ 3564.89it/s, Materializing param=encoder.layer.5.attention.self.key.weight]
405
+ Loading weights: 89%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûë | 92/103 [00:00<00:00,
406
+ 3595.66it/s, Materializing param=encoder.layer.5.attention.self.query.bias]
407
+ Loading weights: 89%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûë | 92/103 [00:00<00:00,
408
+ 3591.18it/s, Materializing param=encoder.layer.5.attention.self.query.bias]
409
+ Loading weights: 90%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûê | 93/103 [00:00<00:00,
410
+ 3620.11it/s, Materializing param=encoder.layer.5.attention.self.query.weight]
411
+ Loading weights: 90%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûê | 93/103 [00:00<00:00,
412
+ 3615.31it/s, Materializing param=encoder.layer.5.attention.self.query.weight]
413
+ Loading weights: 91%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÅ| 94/103 [00:00<00:00,
414
+ 3644.19it/s, Materializing param=encoder.layer.5.attention.self.value.bias]
415
+ Loading weights: 91%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÅ| 94/103 [00:00<00:00,
416
+ 3639.34it/s, Materializing param=encoder.layer.5.attention.self.value.bias]
417
+ Loading weights: 92%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÅ| 95/103 [00:00<00:00,
418
+ 3668.92it/s, Materializing param=encoder.layer.5.attention.self.value.weight]
419
+ Loading weights: 92%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÅ| 95/103 [00:00<00:00,
420
+ 3664.36it/s, Materializing param=encoder.layer.5.attention.self.value.weight]
421
+ Loading weights: 93%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÄ| 96/103 [00:00<00:00,
422
+ 3693.59it/s, Materializing param=encoder.layer.5.intermediate.dense.bias]
423
+ Loading weights: 93%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûÄ| 96/103 [00:00<00:00,
424
+ 3688.52it/s, Materializing param=encoder.layer.5.intermediate.dense.bias]
425
+ Loading weights: 94%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûì| 97/103 [00:00<00:00,
426
+ 3718.18it/s, Materializing param=encoder.layer.5.intermediate.dense.weight]
427
+ Loading weights: 94%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûì| 97/103 [00:00<00:00,
428
+ 3713.03it/s, Materializing param=encoder.layer.5.intermediate.dense.weight]
429
+ Loading weights: 95%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûî| 98/103 [00:00<00:00,
430
+ 3742.63it/s, Materializing param=encoder.layer.5.output.LayerNorm.bias]
431
+ Loading weights: 95%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûî| 98/103 [00:00<00:00,
432
+ 3738.00it/s, Materializing param=encoder.layer.5.output.LayerNorm.bias]
433
+ Loading weights: 96%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûî| 99/103 [00:00<00:00,
434
+ 3766.31it/s, Materializing param=encoder.layer.5.output.LayerNorm.weight]
435
+ Loading weights: 96%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûî| 99/103 [00:00<00:00,
436
+ 3761.57it/s, Materializing param=encoder.layer.5.output.LayerNorm.weight]
437
+ Loading weights: 97%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûï| 100/103 [00:00<00:00,
438
+ 3789.95it/s, Materializing param=encoder.layer.5.output.dense.bias]
439
+ Loading weights: 97%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûï| 100/103 [00:00<00:00,
440
+ 3785.13it/s, Materializing param=encoder.layer.5.output.dense.bias]
441
+ Loading weights: 98%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûè| 101/103 [00:00<00:00,
442
+ 3813.96it/s, Materializing param=encoder.layer.5.output.dense.weight]
443
+ Loading weights: 98%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûè| 101/103 [00:00<00:00,
444
+ 3809.23it/s, Materializing param=encoder.layer.5.output.dense.weight]
445
+ Loading weights: 99%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûë| 102/103 [00:00<00:00,
446
+ 3837.97it/s, Materializing param=pooler.dense.bias]
447
+ Loading weights: 99%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûë| 102/103 [00:00<00:00,
448
+ 3833.68it/s, Materializing param=pooler.dense.bias]
449
+ Loading weights: 100%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûê| 103/103 [00:00<00:00,
450
+ 3862.33it/s, Materializing param=pooler.dense.weight]
451
+ Loading weights: 100%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûê| 103/103 [00:00<00:00,
452
+ 3857.26it/s, Materializing param=pooler.dense.weight]
453
+ Loading weights: 100%|ΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûêΓûê| 103/103 [00:00<00:00,
454
+ 3842.34it/s, Materializing param=pooler.dense.weight]
455
+ BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
+ Key                     | Status     |
+ ------------------------+------------+
+ embeddings.position_ids | UNEXPECTED |
+
+ Notes:
+ - UNEXPECTED: can be ignored when loading from a different task/architecture; not ok if you expect an identical arch.
+ [1/1] medqa_0000: ✓ top1=N top3=N diff=Y [differential] (281547ms)
+
+ ============================================================
+ Validation Results: MEDQA
+ ============================================================
+ Total cases: 1
+ Successful:  1
+ Failed:      0
+ Duration:    281.5s
+
+ Metrics:
+   avg_pipeline_time_ms   281547ms
+   differential_accuracy  100.0%
+   mentioned_accuracy     100.0%
+   parse_success          100.0%
+   top1_accuracy          0.0%
+   top3_accuracy          0.0%
+ ============================================================
+
+ ======================================================================
+ COMBINED VALIDATION REPORT
+ ======================================================================
+
+ Dataset          Cases   Success   Key Metric                  Value
+ ---------------  ------  --------  -------------------------   --------
+ medqa            1       1         top3_accuracy               0.0%
+
+ ──────────────────────────────────────────────────────────────────
+
+ MEDQA metrics:
+   avg_pipeline_time_ms   281547ms
+   differential_accuracy  100.0%
+   mentioned_accuracy     100.0%
+   parse_success          100.0%
+   top1_accuracy          0.0%
+   top3_accuracy          0.0%
+
+ Total cases:    1
+ Total success:  1
+ Total duration: 281.6s (4.7min)
+ Timestamp:      2026-02-15T06:15:42.932073+00:00
+ ======================================================================
+
+ Combined report saved to: F:\kaggle\medgemma_impact_challenge\src\backend\validation\results\combined_20260215_061542.json
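
For reference, a minimal sketch of how top-k accuracy figures like `top1_accuracy` / `top3_accuracy` in the report above are commonly computed over a ranked differential. This is illustrative only: the field names `gold` and `predictions` are assumptions, not the actual schema of this repo's validation harness.

```python
def topk_accuracy(cases, k):
    """Percent of cases whose gold diagnosis appears among the first k
    ranked predictions (case-insensitive exact match)."""
    hits = sum(
        1 for c in cases
        if c["gold"].lower() in (p.lower() for p in c["predictions"][:k])
    )
    return 100.0 * hits / len(cases)

# One case where the gold answer is ranked 2nd: top-1 misses, top-3 hits.
cases = [
    {"gold": "Pneumonia", "predictions": ["Bronchitis", "Pneumonia", "Asthma"]},
]
print(f"top1_accuracy {topk_accuracy(cases, 1):.1f}%")  # 0.0%
print(f"top3_accuracy {topk_accuracy(cases, 3):.1f}%")  # 100.0%
```

With n=1 a single miss swings top-1 between 0% and 100%, which is why the single-case run above reports 0.0% while `mentioned_accuracy` is 100%.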