fix: Mercury-2/Gemini refs in report_draft.md
report_draft.md CHANGED (+9 -9)
|
@@ -52,7 +52,7 @@ Lewis et al. (2020) introduced RAG as a general architecture combining a dense r
|
|
| 52 |
|
| 53 |
### 2.5 Multimodal Vision-Language Models
|
| 54 |
|
| 55 |
-
Liu et al. (2024) introduced LLaVA (Large Language and Vision Assistant), demonstrating that visual instruction tuning on GPT-4-generated multimodal data produces strong VLMs capable of spatial and scientific visual reasoning. Achiam et al. (2023) describe GPT-4's multimodal capabilities including diagram interpretation. For medical domains specifically, Sellergren et al. (2025) introduce MedGemma, an open-source medical VLM (4B/27B parameters) achieving competitive performance on medical reasoning benchmarks; UnMask uses MedGemma 4B as its primary diagram analysis model, with
|
| 56 |
|
| 57 |
### 2.6 Groundedness Verification
|
| 58 |
|
|
@@ -101,7 +101,7 @@ The critical property: in `context_only` mode, the `must_not` filter executes se
|
|
| 101 |
|
| 102 |
### 3.5 Layer 4: Structured Output – Dual Knowledge Masking
|
| 103 |
|
| 104 |
-
The Socratic Generator calls
|
| 105 |
|
| 106 |
```python
|
| 107 |
class InternalAnalysis(BaseModel):
|
|
@@ -117,7 +117,7 @@ class VisibleResponse(BaseModel):
|
|
| 117 |
|
| 118 |
`InternalAnalysis` is stripped before rendering. A post-generation leak guard additionally checks for ≥4 significant-word overlap between `socratic_question` and `correct_answer`, triggering a retry with explicit non-reveal instructions if fired.
|
| 119 |
|
| 120 |
-
**Cost routing:** Rapport and Wrapup phases route to local Llama 3.1 8B via Ollama (65–75% of turns).
|
| 121 |
|
| 122 |
### 3.6 Concept Prerequisite Graph (NetworkX DAG)
|
| 123 |
|
|
@@ -145,8 +145,8 @@ Rather than flat mastery scores, we maintain a directed acyclic graph (e.g., `br
|
|
| 145 |
|---------|-------------|----------------------|--------|
|
| 146 |
| **Task 1: Content Retrieval + Masking** | RAG pipeline; mask answers; progressive hints | Hybrid Qdrant retrieval (dense+BM25+RRF); PCR excludes answer chunks at retrieval; structured output schema separates `internal_analysis`/`visible_response`; hint calibration via concept graph mastery | ✅ |
|
| 147 |
| **Task 2: Adaptive Conversation** | Rapport → Tutoring phases; Manager Agent | LangGraph 4-phase state machine; pure-Python orchestrator; diagnostic probe initializes mastery; phase transitions by learning events + time ceilings | ✅ |
|
| 148 |
-
| **Task 3: Synthesis & Assessment** | Clinical scenario; compare to gold-standard; mastery summary | Assessment phase triggers at coverage ≥ 80% or t ≥ 12 min;
|
| 149 |
-
| **Task 4: Multimodal Diagram Tutoring** | VLM for anatomy diagrams; identify → ask function/insertion | Chainlit UI accepts image uploads; VLM backend (MedGemma 4B /
|
| 150 |
| **Task 5: Interactive Memory** | Session memory; proactively revisit mistakes | NetworkX concept DAG tracks mastery across session; Pedagogy Agent flags `weak_topics` (mastery < 0.4); proactive revisit scheduled after 8 min | ✅ |
|
| 151 |
| **Bonus: Personalization Dashboard** | Show "weak spots" | Per-turn backend panel shows mastery scores (🔴/🟡/🟢); session-end weak topic summary | ✅ |
|
| 152 |
| **Generalizability** | Swap vector DB for different subject | Single `config.yaml` field change; zero code changes required | ✅ |
|
|
@@ -157,7 +157,7 @@ Rather than flat mastery scores, we maintain a directed acyclic graph (e.g., `br
|
|
| 157 |
|
| 158 |
### 6.1 Metrics
|
| 159 |
|
| 160 |
-
**Socratic Purity** (target ≥ 4.0/5.0): two-layer combined score. Layer 1 (rule-based): does the response end with "?"? does it contain ≥4 significant-word overlap with the gold answer (keyword leak)? is cosine similarity > 0.92 (semantic leak)? Confirmed leak (both layers) hard-caps at 2.0; no "?" penalizes −1.0. Layer 2 (LLM-as-Judge):
|
| 161 |
|
| 162 |
**Answer Leak Rate** (target = 0): fraction of responses where both leak layers fire simultaneously. Single-layer fires are reported as "soft flags."
|
| 163 |
|
|
@@ -169,7 +169,7 @@ Rather than flat mastery scores, we maintain a directed acyclic graph (e.g., `br
|
|
| 169 |
|
| 170 |
### 6.2 Controlled Ablation Study
|
| 171 |
|
| 172 |
-
Four variants on the same 30 questions, identical cold-start mastery (0.20), identical generation (
|
| 173 |
|
| 174 |
| Variant | PCR | CRAG | Concept Graph |
|
| 175 |
|---------|-----|------|---------------|
|
|
@@ -212,7 +212,7 @@ Adversarial results: all 20 prompts deflected across four attack categories
|
|
| 212 |
|
| 213 |
**PCR works as designed.** The full system's 0.000 reach rate confirms that PCR's server-side filter excludes answer chunks at cold-start mastery. This is the architectural guarantee PCR was built to provide.
|
| 214 |
|
| 215 |
-
**Zero leaks across all variants, but not equivalently safe.** All four variants show 0.000 leak rate under benign evaluation. This is precisely the TutorRL failure mode:
|
| 216 |
|
| 217 |
**Purity cost of safety: 0.23 points.** The full system scores 4.70/5 vs. no_graph at 4.93/5. When the answer chunk is excluded from context, the model generates slightly broader guiding questions (it cannot see precisely what to guide toward). The no_pcr variant achieves 4.83, paradoxically better purity, because seeing the answer enables more targeted Socratic scaffolding. This 0.23-point delta is the measurable price of architectural safety over prompt-level safety.
|
| 218 |
|
|
@@ -241,7 +241,7 @@ These results reveal a measurement mismatch between RAGAS (designed for factual
|
|
| 241 |
|
| 242 |
**CRAG's educational motivation.** In open-domain QA, CRAG prevents factually irrelevant documents from grounding incorrect answers. In educational tutoring, an additional failure mode applies: irrelevant retrievals produce off-topic Socratic questions that break the clinical reasoning thread. A student asking about the axillary nerve should not receive a Socratic question about the median nerve due to a retrieval misfire. CRAG prevents this. The ablation timing evidence (186s stall at q18) confirms CRAG fires in realistic deployment, not only in theory.
|
| 243 |
|
| 244 |
-
**Multimodal gap and path forward.** Task 4 (VLM diagram tutoring) is the primary outstanding gap. The Chainlit interface accepts image uploads; the missing piece is a VLM backend (MedGemma 4B local or
|
| 245 |
|
| 246 |
---
|
| 247 |
|
|
|
|
| 52 |
|
| 53 |
### 2.5 Multimodal Vision-Language Models
|
| 54 |
|
| 55 |
+
Liu et al. (2024) introduced LLaVA (Large Language and Vision Assistant), demonstrating that visual instruction tuning on GPT-4-generated multimodal data produces strong VLMs capable of spatial and scientific visual reasoning. Achiam et al. (2023) describe GPT-4's multimodal capabilities including diagram interpretation. For medical domains specifically, Sellergren et al. (2025) introduce MedGemma, an open-source medical VLM (4B/27B parameters) achieving competitive performance on medical reasoning benchmarks; UnMask uses MedGemma 4B as its primary diagram analysis model, with Gemini 2.0 Flash Lite as fallback.
|
| 56 |
|
| 57 |
### 2.6 Groundedness Verification
|
| 58 |
|
|
|
|
| 101 |
|
| 102 |
### 3.5 Layer 4: Structured Output – Dual Knowledge Masking
|
| 103 |
|
| 104 |
+
The Socratic Generator calls Mercury-2 with `response_format=SocraticOutput` enforcing a two-envelope structure:
|
| 105 |
|
| 106 |
```python
|
| 107 |
class InternalAnalysis(BaseModel):
|
|
|
|
| 117 |
|
| 118 |
`InternalAnalysis` is stripped before rendering. A post-generation leak guard additionally checks for ≥4 significant-word overlap between `socratic_question` and `correct_answer`, triggering a retry with explicit non-reveal instructions if fired.
|
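The leak guard described above can be sketched as a set-overlap check. This is a minimal illustration, not the project's implementation: the report does not say how "significant" words are chosen, so the stopword list, the tokenizer, and the function names below are assumptions. Only the threshold of 4 comes from the text.

```python
import re

# Assumption: a small illustrative stopword list. The report does not
# specify how "significant" words are selected.
STOPWORDS = {
    "the", "a", "an", "of", "and", "or", "to", "in", "is", "are", "it",
    "what", "which", "that", "this", "by", "for", "on", "with",
}

LEAK_THRESHOLD = 4  # from the report: >= 4 significant-word overlap


def significant_words(text: str) -> set[str]:
    """Lowercase the text, split into alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def leaks_answer(socratic_question: str, correct_answer: str) -> bool:
    """Return True when the question shares >= LEAK_THRESHOLD significant words
    with the gold answer, i.e. when a retry should be triggered."""
    overlap = significant_words(socratic_question) & significant_words(correct_answer)
    return len(overlap) >= LEAK_THRESHOLD
```

A caller would regenerate the response with explicit non-reveal instructions whenever `leaks_answer(...)` returns `True`.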
| 119 |
|
| 120 |
+
**Cost routing:** Rapport and Wrapup phases route to local Llama 3.1 8B via Ollama (65–75% of turns). Mercury-2 handles Tutoring and Assessment. Total session cost: ~$0.08–0.10.
|
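The cost-routing rule above amounts to a lookup from conversation phase to model. The sketch below assumes a `Phase` enum and two model identifier strings; these names are hypothetical and stand in for whatever the codebase actually uses.

```python
from enum import Enum


class Phase(Enum):
    """The four phases named in the report (enum itself is hypothetical)."""
    RAPPORT = "rapport"
    TUTORING = "tutoring"
    ASSESSMENT = "assessment"
    WRAPUP = "wrapup"


# Assumption: placeholder identifiers, not confirmed by the report.
LOCAL_MODEL = "llama3.1:8b"   # local Llama 3.1 8B served via Ollama
HOSTED_MODEL = "mercury-2"    # hosted model for the reasoning-heavy phases

ROUTES = {
    Phase.RAPPORT: LOCAL_MODEL,
    Phase.WRAPUP: LOCAL_MODEL,
    Phase.TUTORING: HOSTED_MODEL,
    Phase.ASSESSMENT: HOSTED_MODEL,
}


def route_model(phase: Phase) -> str:
    """Pick the model for a turn based only on the current phase."""
    return ROUTES[phase]
```

Keeping the mapping in one table means a phase can be moved to a different model without touching the orchestrator logic.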
| 121 |
|
| 122 |
### 3.6 Concept Prerequisite Graph (NetworkX DAG)
|
| 123 |
|
|
|
|
| 145 |
|---------|-------------|----------------------|--------|
|
| 146 |
| **Task 1: Content Retrieval + Masking** | RAG pipeline; mask answers; progressive hints | Hybrid Qdrant retrieval (dense+BM25+RRF); PCR excludes answer chunks at retrieval; structured output schema separates `internal_analysis`/`visible_response`; hint calibration via concept graph mastery | ✅ |
|
| 147 |
| **Task 2: Adaptive Conversation** | Rapport → Tutoring phases; Manager Agent | LangGraph 4-phase state machine; pure-Python orchestrator; diagnostic probe initializes mastery; phase transitions by learning events + time ceilings | ✅ |
|
| 148 |
+
| **Task 3: Synthesis & Assessment** | Clinical scenario; compare to gold-standard; mastery summary | Assessment phase triggers at coverage ≥ 80% or t ≥ 12 min; Mercury-2 evaluates against retrieved textbook chunk; concept graph exports mastery + prerequisite gaps + weak topics | ✅ |
|
| 149 |
+
| **Task 4: Multimodal Diagram Tutoring** | VLM for anatomy diagrams; identify → ask function/insertion | Chainlit UI accepts image uploads; VLM backend (MedGemma 4B / Gemini 2.0 Flash Lite) planned for final milestone; PCR architecture supports image-chunk metadata | ⚠️ Partial |
|
| 150 |
| **Task 5: Interactive Memory** | Session memory; proactively revisit mistakes | NetworkX concept DAG tracks mastery across session; Pedagogy Agent flags `weak_topics` (mastery < 0.4); proactive revisit scheduled after 8 min | ✅ |
|
| 151 |
| **Bonus: Personalization Dashboard** | Show "weak spots" | Per-turn backend panel shows mastery scores (🔴/🟡/🟢); session-end weak topic summary | ✅ |
|
| 152 |
| **Generalizability** | Swap vector DB for different subject | Single `config.yaml` field change; zero code changes required | ✅ |
|
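Two thresholds in the compliance table (the Assessment trigger in Task 3 and the `weak_topics` flag in Task 5) reduce to simple predicates. The sketch below uses a plain dict in place of the NetworkX graph, and the function names and sample concept names are hypothetical. Only the numeric thresholds come from the table.

```python
COVERAGE_THRESHOLD = 0.80      # from the table: coverage >= 80%
TIME_CEILING_MIN = 12.0        # from the table: t >= 12 min
WEAK_MASTERY_THRESHOLD = 0.4   # from the table: mastery < 0.4


def should_enter_assessment(coverage: float, elapsed_min: float) -> bool:
    """Assessment starts when either the coverage or the time condition holds."""
    return coverage >= COVERAGE_THRESHOLD or elapsed_min >= TIME_CEILING_MIN


def weak_topics(mastery: dict[str, float]) -> list[str]:
    """Return concepts whose mastery is strictly below the weak threshold,
    sorted by name so the output is deterministic."""
    return sorted(c for c, m in mastery.items() if m < WEAK_MASTERY_THRESHOLD)


if __name__ == "__main__":
    # Hypothetical session state for illustration only.
    session_mastery = {"axillary_nerve": 0.35, "median_nerve": 0.70}
    print(weak_topics(session_mastery))
    print(should_enter_assessment(coverage=0.5, elapsed_min=12.0))
```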
|
|
|
| 157 |
|
| 158 |
### 6.1 Metrics
|
| 159 |
|
| 160 |
+
**Socratic Purity** (target ≥ 4.0/5.0): two-layer combined score. Layer 1 (rule-based): does the response end with "?"? does it contain ≥4 significant-word overlap with the gold answer (keyword leak)? is cosine similarity > 0.92 (semantic leak)? Confirmed leak (both layers) hard-caps at 2.0; no "?" penalizes −1.0. Layer 2 (LLM-as-Judge): Mercury-2 rates 1–5 on a rubric where 5 = perfect Socratic (gold answer absent, student must think) and 1 = direct answer stated.
|
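One way to read the Socratic Purity definition above is as a judge score adjusted by the rule-based signals. The report does not spell out how the two layers are combined, so the order of operations here (penalty first, then cap) is an assumption, as are all class and function names. The thresholds 4, 0.92, 2.0, and 1.0 come from the text.

```python
from dataclasses import dataclass

KEYWORD_LEAK_THRESHOLD = 4       # >= 4 significant-word overlap
SEMANTIC_LEAK_THRESHOLD = 0.92   # cosine similarity > 0.92
LEAK_CAP = 2.0                   # confirmed leak hard-caps the score
NO_QUESTION_PENALTY = 1.0        # response does not end with "?"


@dataclass
class Layer1Signals:
    """Rule-based signals for one tutor response (hypothetical container)."""
    ends_with_question: bool
    keyword_overlap: int
    cosine_similarity: float

    @property
    def keyword_leak(self) -> bool:
        return self.keyword_overlap >= KEYWORD_LEAK_THRESHOLD

    @property
    def semantic_leak(self) -> bool:
        return self.cosine_similarity > SEMANTIC_LEAK_THRESHOLD

    @property
    def confirmed_leak(self) -> bool:
        # "Confirmed" means both leak checks fire on the same response.
        return self.keyword_leak and self.semantic_leak


def socratic_purity(judge_score: float, signals: Layer1Signals) -> float:
    """Combine the 1-5 judge score with the Layer 1 rules.

    Assumption: the penalty is applied before the cap.
    """
    score = judge_score
    if not signals.ends_with_question:
        score -= NO_QUESTION_PENALTY
    if signals.confirmed_leak:
        score = min(score, LEAK_CAP)
    return score
```

Under this reading, a response that trips only one leak check keeps its judge score and is reported as a soft flag rather than capped.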
| 161 |
|
| 162 |
**Answer Leak Rate** (target = 0): fraction of responses where both leak layers fire simultaneously. Single-layer fires are reported as "soft flags."
|
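The Answer Leak Rate and the soft-flag count can be computed from per-response pairs of leak flags. This is a sketch under the assumption that each response yields one `(keyword_leak, semantic_leak)` pair; the function name and return keys are hypothetical.

```python
def leak_summary(flags: list[tuple[bool, bool]]) -> dict[str, float]:
    """Summarize leak flags over a set of responses.

    Each item is (keyword_leak, semantic_leak) for one response.
    A confirmed leak needs both flags; a soft flag is exactly one of them.
    """
    if not flags:
        raise ValueError("need at least one response to compute a rate")
    total = len(flags)
    confirmed = sum(1 for keyword, semantic in flags if keyword and semantic)
    soft = sum(1 for keyword, semantic in flags if keyword != semantic)
    return {
        "answer_leak_rate": confirmed / total,
        "soft_flag_rate": soft / total,
    }
```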
| 163 |
|
|
|
|
| 169 |
|
| 170 |
### 6.2 Controlled Ablation Study
|
| 171 |
|
| 172 |
+
Four variants on the same 30 questions, identical cold-start mastery (0.20), identical generation (Mercury-2, structured output):
|
| 173 |
|
| 174 |
| Variant | PCR | CRAG | Concept Graph |
|
| 175 |
|---------|-----|------|---------------|
|
|
|
|
| 212 |
|
| 213 |
**PCR works as designed.** The full system's 0.000 reach rate confirms that PCR's server-side filter excludes answer chunks at cold-start mastery. This is the architectural guarantee PCR was built to provide.
|
| 214 |
|
| 215 |
+
**Zero leaks across all variants, but not equivalently safe.** All four variants show 0.000 leak rate under benign evaluation. This is precisely the TutorRL failure mode: strong LLM instruction following holds under benign conditions, making prompt-based suppression appear sufficient. Our adversarial battery (100% hold rate, not included in the ablation) demonstrates the difference: under active attack, architectural enforcement holds; prompt-based enforcement would degrade.
|
| 216 |
|
| 217 |
**Purity cost of safety: 0.23 points.** The full system scores 4.70/5 vs. no_graph at 4.93/5. When the answer chunk is excluded from context, the model generates slightly broader guiding questions (it cannot see precisely what to guide toward). The no_pcr variant achieves 4.83, paradoxically better purity, because seeing the answer enables more targeted Socratic scaffolding. This 0.23-point delta is the measurable price of architectural safety over prompt-level safety.
|
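The purity deltas quoted above follow directly from the three reported scores. The short calculation below uses only those figures; the dictionary keys mirror the variant names in the text, and the variable names are illustrative.

```python
# Purity scores as reported in the paragraph above (out of 5).
purity = {"full": 4.70, "no_graph": 4.93, "no_pcr": 4.83}

# Gap between the full system and each ablated variant, rounded to 2 places
# to avoid floating-point noise in the subtraction.
delta_vs_no_graph = round(purity["no_graph"] - purity["full"], 2)
delta_vs_no_pcr = round(purity["no_pcr"] - purity["full"], 2)

print(f"full vs. no_graph: {delta_vs_no_graph:.2f}")
print(f"full vs. no_pcr:   {delta_vs_no_pcr:.2f}")
```

The headline 0.23 figure is the gap to `no_graph`; the gap to `no_pcr`, the variant that differs by PCR alone, is smaller.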
| 218 |
|
|
|
|
| 241 |
|
| 242 |
**CRAG's educational motivation.** In open-domain QA, CRAG prevents factually irrelevant documents from grounding incorrect answers. In educational tutoring, an additional failure mode applies: irrelevant retrievals produce off-topic Socratic questions that break the clinical reasoning thread. A student asking about the axillary nerve should not receive a Socratic question about the median nerve due to a retrieval misfire. CRAG prevents this. The ablation timing evidence (186s stall at q18) confirms CRAG fires in realistic deployment, not only in theory.
|
| 243 |
|
| 244 |
+
**Multimodal gap and path forward.** Task 4 (VLM diagram tutoring) is the primary outstanding gap. The Chainlit interface accepts image uploads; the missing piece is a VLM backend (MedGemma 4B local or Gemini 2.0 Flash Lite) that identifies anatomical structures and generates image-grounded Socratic follow-ups. PCR applies identically to image-associated chunks via `concept` metadata, with no architectural changes required.
|
| 245 |
|
| 246 |
---
|
| 247 |
|