QVAC MedPsy: State-of-the-Art Medical and Healthcare Language Models for Edge Devices
KEY HIGHLIGHTS
Tether Data’s AI Research group introduces QVAC MedPsy, a family of state-of-the-art, text-only medical and healthcare language models purpose-built for edge deployment. At 1.7B and 4B parameters, these models deliver medical reasoning capabilities previously exclusive to models 2–7x their size, setting a new benchmark for efficient medical AI.
Unprecedented Parameter Efficiency (1.7B Surpasses 4B): Our text-only QVAC MedPsy-1.7B model achieves an average score of 62.62 across seven closed-ended medical benchmarks, decisively outperforming Google's MedGemma-1.5-4B-it (51.20) by +11.42 points despite being less than half its size, and matching Qwen3-4B-Thinking-2507 (63.10), a model 2.4x larger. In realistic health scenarios, it scores 70.33 on HealthBench and 54.33 on HealthBench Hard, beating even MedGemma-27B-text-it (65.00 / 42.00), a model 16x larger. This represents a paradigm shift in what compact medical models can achieve, enabling clinical-grade AI on smartphones, on wearables, and in resource-constrained healthcare settings.
Surpassing Frontier Models at a Fraction of the Size (4B Beats 27B): Our QVAC MedPsy-4B model scores 70.54 on closed-ended medical benchmarks, surpassing MedGemma-27B-text-it (69.95) despite being nearly 7x smaller. The gap widens dramatically on realistic health scenarios: HealthBench Hard (58.00 vs 42.00, +16.00 points), HealthBench (74.00 vs 65.00, +9.00 points), and MedXpertQA (30.61 vs 25.18, +5.43 points). These are the benchmarks closest to actual clinical decision-making, demonstrating that carefully curated training data and methodology can match or outperform larger competing state-of-the-art models, achieving top-tier results on realistic medical and health clinical assessments.
Up to 3.2x Token Efficiency, Superior Results with Fewer Tokens: Beyond parameter efficiency, our models achieve dramatic reductions in generation length during evaluation. Measured as a weighted average across all benchmarks (weighted by the number of samples per benchmark), QVAC MedPsy-4B produces accurate medical answers in approximately 909 tokens compared to 2,953 tokens for Qwen3-4B-Thinking-2507, a 3.2x reduction. QVAC MedPsy-1.7B averages ~1,110 tokens compared to ~1,901 tokens for Qwen3-1.7B (Thinking), a 1.7x reduction. This improvement in token efficiency translates directly to lower latency, reduced compute costs, and significantly faster inference on edge devices, making it a critical advantage for real-time clinical decision support.
GGUF Models for Private On-Device Inference: We publish GGUF repositories for both MedPsy sizes, including an unquantized BF16 GGUF export and seven quantized variants per model compatible with llama.cpp and the QVAC SDK. The recommended quantized tiers retain almost all benchmark performance while sharply reducing disk usage: Q5_K_M cuts file size by 64% with only −0.29 / −0.02 AVG Score loss for 4B / 1.7B, while Q4_K_M cuts file size by 69% with only −0.81 / −0.73 AVG Score loss. This makes the same medical models practical for private deployment on laptops, high-end mobile devices, and smartphone-class applications.
Comprehensive Evaluation Across Eight Benchmark Suites: We evaluate on a diverse suite spanning clinical knowledge (MedQA-USMLE, MedMCQA), health literacy (MMLU Health, MMLU-Pro Health), expert-level reasoning (MedXpertQA), biomedical research (PubMedQA), underserved contexts (AfriMedQA), and realistic health scenarios (HealthBench & HealthBench Hard), providing the most thorough assessment of edge-scale medical models to date.
Democratizing Medical AI for Edge and Privacy-Sensitive Deployment: We are making QVAC MedPsy models available under the Apache 2.0 license for research and educational purposes. These models are specifically designed for deployment on consumer hardware and edge devices, providing the potential to enable medical AI in bandwidth-constrained environments, privacy-sensitive clinical workflows, and low-resource healthcare settings where data must never leave the device.
Copyright Complaints: We will take appropriate actions in response to notice of copyright infringement. If you believe your work has been used or copied in a manner that infringes upon your intellectual property rights, please email data-apps@tether.io identifying and describing both the copyrighted work and alleged infringing content to file a notice of infringement.
🚀 MedPsy on Hugging Face
All MedPsy models, GGUF files, quantized variants, and resources in one place.
🔗 Open the Collection
🩺 MedPsy-4B: Higher-quality edge model. Surpasses MedGemma-27B-text-it on closed-ended medical benchmarks at ~7× smaller. 🔗 Open the model card
📱 MedPsy-1.7B: Smartphone-class medical model. Beats MedGemma-1.5-4B-it by +11.42 points on closed-ended; matches Qwen3-4B-Thinking-2507. 🔗 Open the model card
📦 MedPsy-4B-GGUF: GGUF repo with an unquantized BF16 export and seven quantized files. Q5_K_M (3.16 GB) adds a high-quality 5-bit tier; Q4_K_M (2.72 GB) remains the recommended size/quality trade-off. 🔗 Open the GGUF repo
📦 MedPsy-1.7B-GGUF: Smartphone-ready GGUF repo with an unquantized BF16 export and seven quantized files. Q5_K_M (1.47 GB) is nearly lossless; Q4_K_M (1.28 GB) is the best mobile trade-off. 🔗 Open the GGUF repo
HEADLINE RESULTS
The two figures below summarize how MedPsy compares to its backbones and to MedGemma on closed-ended medical benchmarks. Detailed results, methodology, and additional evaluations (HealthBench, token efficiency, ablations) are in Section 4.
Figure 1: Benchmark overview for the 4B model class. Per-benchmark scores for MedPsy-4B against MedGemma-27B-text-it, the Qwen3-4B-Thinking-2507 backbone, and MedGemma-1.5-4B-it. The top-left panel summarizes the closed-ended Average, the top-middle panel reports HealthBench and HealthBench Hard side by side, and the remaining panels show per-benchmark closed-ended results. MedPsy-4B leads on Average and on the most reasoning-intensive benchmarks (MedQA-USMLE, MedXpertQA, PubMedQA) despite being ~7× smaller than MedGemma-27B-text-it, and posts the largest gaps on HealthBench and HealthBench Hard.
Figure 2: Benchmark overview for the 1.7B model class. Per-benchmark scores for MedPsy-1.7B against MedGemma-1.5-4B-it, the Qwen3-1.7B (Thinking) backbone, and LFM2.5-1.2B-Thinking. The top-left panel summarizes the closed-ended Average, the top-middle panel reports HealthBench and HealthBench Hard side by side, and the remaining panels show per-benchmark closed-ended results. MedPsy-1.7B beats MedGemma-1.5-4B-it on the closed-ended Average by +11.42 points despite being less than half its size, and surpasses both MedGemma-1.5-4B-it and MedGemma-27B-text-it on HealthBench and HealthBench Hard.
1. Introduction
Medical LLMs have advanced rapidly, but deployment has stayed centralized. The best models (MedGemma-27B-text-it, Med-PaLM, GPT-4-based systems) all require cloud infrastructure or very expensive setups that conflict with the privacy, latency, and reliability needs of clinical environments. At the same time, medicine demands high accuracy and safety: a hallucinated drug interaction or fabricated clinical recommendation has real consequences. The challenge is not just making medical AI smaller; it must also be more accurate, safer, and runnable on the devices where healthcare happens.
This work addresses that challenge directly. Tether Data, S.A. de C.V. (Tether Data, we, us, our) presents MedPsy, a family of text-only medical language models at 1.7B and 4B parameters that achieve state-of-the-art results on a comprehensive suite of medical benchmarks while being purpose-built for edge deployment through the QVAC ecosystem.
1.1 The Challenge of Medical AI at the Edge
Medical data is uniquely sensitive. Patient records, diagnostic queries, symptom descriptions, and clinical notes contain protected health information (PHI) governed by strict regulatory frameworks, including HIPAA in the United States, GDPR in Europe, and equivalent legislation across jurisdictions worldwide. The dominant paradigm of cloud-hosted medical AI requires this data to leave the user's device, traverse network infrastructure, and be processed on remote servers, creating attack surfaces, compliance burdens, and a fundamental tension between AI capability and patient privacy.
The QVAC SDK, Tether Operations, S.A. de C.V.'s open-source, cross-platform AI development kit, was built precisely to solve this problem. QVAC SDK enables developers to run, fine-tune, and deploy AI models locally on any device and operating system, from smartphones to servers, with a single consistent API. The MedPsy models are designed from the ground up to operate within this ecosystem, enabling fully private, on-device medical intelligence.
1.2 Limitations of Existing Medical LLMs
The current landscape of medical LLMs presents a stark trade-off between capability and deployability. Google's MedGemma-27B-text-it delivers strong performance across medical benchmarks, but at 27 billion parameters it is entirely infeasible for edge deployment, requiring GPUs with tens of gigabytes of VRAM. Even MedGemma-1.5-4B-it, while technically runnable on a high-end laptop, remains impractical for smartphone or tablet deployment and delivers underwhelming medical performance (51.20 average across our benchmark suite). No existing model in the 1–4B parameter range achieves the medical accuracy required for meaningful clinical utility.
This gap is not merely a matter of model compression. Smaller models trained with conventional approaches suffer from catastrophic quality degradation on knowledge-intensive medical tasks. The medical domain demands not only factual precision across pharmacology, pathology, anatomy, and clinical reasoning, but also the ability to produce safe, well-structured responses that clinicians can use. Bridging this gap requires purpose-built training methodologies, not just parameter reduction.
Furthermore, most existing medical LLMs are multimodal or general-purpose systems adapted for medicine. While multimodality is valuable for specific use cases such as radiology, the core of clinical decision support (differential diagnosis, treatment reasoning, drug interaction analysis, patient education) is fundamentally text-based. A focused, text-only approach allows us to dedicate the full parameter budget to medical language understanding and reasoning, rather than distributing capacity across modalities.
1.3 Our Contributions
Our work makes the following key contributions:
State-of-the-art medical models at edge scale. We present two text-only medical language models, MedPsy-1.7B and MedPsy-4B, built on the Qwen3 architecture and post-trained with a multi-stage training pipeline including supervised fine-tuning and reinforcement learning. The 1.7B model outperforms MedGemma-1.5-4B-it by +11.42 points on average, and the 4B model surpasses MedGemma-27B-text-it (70.54 vs 69.95) while being 6.75x smaller.
Smartphone-grade medical AI. MedPsy-1.7B is the first model to deliver medical performance surpassing MedGemma-1.5-4B-it while being small enough to run efficiently on a smartphone. At 62.62 average, it matches Qwen3-4B-Thinking-2507 (63.10) despite being 2.4x smaller. MedPsy-1.7B can be combined with the QVAC SDK and QVAC Fabric to create fully private, on-device medical intelligence on the devices people already carry, a capability previously out of reach.
Up to 3.2x token efficiency. Our models produce accurate medical answers with significantly fewer tokens than their backbones. Measured as a weighted average across all evaluation benchmarks, MedPsy-4B averages ~909 tokens per response compared to ~2,953 for Qwen3-4B-Thinking-2507 (3.2x reduction), while MedPsy-1.7B averages ~1,110 tokens compared to ~1,901 for Qwen3-1.7B (Thinking) (1.7x reduction). These reductions translate directly to lower latency, reduced compute, and faster inference on resource-constrained devices (see Section 4.6).
Comprehensive evaluation. We evaluate across eight benchmark suites spanning clinical knowledge, expert reasoning, biomedical research, realistic health scenarios, and underserved-region contexts, providing one of the most thorough assessments of edge-scale medical models to date.
2. Data Methodology
This section describes, at a high level, the data used to post-train MedPsy and the teacher selection process behind it. We have not yet released the training corpus.
2.1 Training Data Overview
We explored several data mixtures and post-training methodologies before settling on the recipe used to train the released MedPsy models. In aggregate, more than 30M synthetic rows of medical and healthcare supervision were generated for these experiments. The final, best-performing recipe organizes the data into a two-stage curriculum: a broad-coverage corpus (Corpus 1) followed by a smaller, higher-value corpus (Corpus 2), described in Section 3.
Two principles guided our synthetic data construction:
- Synthetic, controlled prompt-side supervision. Question-side material is sourced from Genesis II–style synthetic medical seeds [6][18] (covering biology, medicine, and a new health domain that has not yet been publicly released) and from publicly available open-source medical QA prompts. These sources are used purely as questions.
- A single, controlled reasoning teacher. Every long-form reasoning target used for supervision (chain-of-thought traces, extended rationales, decision-oriented answers) is freshly generated by Baichuan-M3-235B [19], the teacher selected in Section 2.2 below. No open-source reasoning traces or public CoT corpora are used as supervision. This is a deliberate choice: the teacher's reasoning style and clinical nuance are the dominant levers on the final model's behavior, so we want every trace in our SFT data to come from a single, controlled, medically strong source (a minimal generation sketch follows this list).
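As a hedged illustration of how question-side prompts are turned into single-teacher supervision, the sketch below assumes the teacher is served behind an OpenAI-compatible endpoint (for example via vLLM); the endpoint URL, model identifier, file names, prompting, and sampling settings are placeholders, not the production pipeline.

```python
# Hypothetical sketch of the trace-generation loop: the teacher is assumed to be
# served behind an OpenAI-compatible endpoint (e.g. via vLLM); the endpoint URL,
# model name, system prompt, and file names are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://teacher-host:8000/v1", api_key="EMPTY")

SYSTEM = (
    "You are a careful clinical reasoning assistant. Think step by step inside "
    "<think>...</think>, then give a concise final answer."
)

def generate_trace(question: str) -> dict:
    """Ask the teacher for a fresh reasoning trace for one synthetic question."""
    resp = client.chat.completions.create(
        model="Baichuan-M3-235B",          # teacher selected in Section 2.2
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": question},
        ],
        temperature=0.7,
        max_tokens=4096,
    )
    return {"question": question, "response": resp.choices[0].message.content}

if __name__ == "__main__":
    with open("synthetic_medical_prompts.jsonl") as fin, open("sft_traces.jsonl", "w") as fout:
        for line in fin:
            row = generate_trace(json.loads(line)["question"])
            fout.write(json.dumps(row) + "\n")
```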
2.2 Teacher Selection
Based on a detailed literature review and analysis of public benchmarks, we identified three strong candidates for teacher models: Baichuan-M3-235B [19], GPT-OSS-120B [10], and Fleming-R1-32B [20]. These models were chosen for their demonstrated strength in medical reasoning and open performance across key medical AI leaderboards.
To validate this shortlist, we ran all three candidates through our own benchmark suite, using the same closed-ended medical benchmarks and HealthBench framework as the final model evaluation. Based on the results, we selected Baichuan-M3-235B as the teacher model for the generation of all final synthetic data in this work, due to its clear lead across the most relevant evaluation criteria.
Closed-ended benchmarks. Baichuan-M3-235B leads with an average of 74.83, outperforming GPT-OSS-120B (72.73) and Fleming-R1-32B (72.55) by approximately 2 points. Its advantages are strongest on MedXpertQA (+5.61 / +10.98 over the other candidates) and MMLU-Pro Health (+5.05 / +2.73), the benchmarks that most reward expert-level medical reasoning. Fleming-R1-32B leads on PubMedQA (79.20), but as discussed in Section 4.2, teacher performance on PubMedQA does not translate proportionally to student gains due to a performance ceiling in the low-to-mid 70s for distilled models.
| Teacher Model | Average | AfriMedQA | MMLU (Health) | MedQA (USMLE) | PubMedQA | MedMCQA | MedXpertQA | MMLU-Pro Health |
|---|---|---|---|---|---|---|---|---|
| Baichuan-M3-235B | 74.83 | 74.58 | 93.08 | 88.9 | 73.27 | 76.88 | 40.91 | 76.2 |
| GPT-OSS-120B | 72.73 | 73.28 | 91.85 | 89.03 | 75.2 | 73.26 | 35.3 | 71.15 |
| Fleming-R1-32B | 72.55 | 72.68 | 92.61 | 85.91 | 79.2 | 74.02 | 29.93 | 73.47 |
Table: Teacher model closed-ended benchmark results. Bold indicates best per column.
HealthBench. The advantage becomes more decisive on open-ended clinical evaluation. Under all three independent judges (described in Section 4.1.2), Baichuan-M3-235B leads by roughly 6–8 points over GPT-OSS-120B and 10–12 points over Fleming-R1-32B, with consistent advantages across all seven HealthBench dimensions.
| Teacher Model | HealthBench (CompassJudger) | HealthBench (Llama-3.3-70B) | HealthBench (GPT-OSS-120B) |
|---|---|---|---|
| Baichuan-M3-235B | 77 | 71 | 58 |
| GPT-OSS-120B | 69.67 | 63 | 52.33 |
| Fleming-R1-32B | 67 | 61 | 45.67 |
Table: Teacher model HealthBench overall scores across three judges.
This gap on open-ended clinical evaluation was the decisive factor in teacher selection: since our training data consists of teacher-generated medical reasoning traces, the teacher's ability to produce nuanced clinical communication, structured rationales, and safe medical advice directly determines the quality ceiling of the student's supervision. Based on these results, all final MedPsy training data was generated using Baichuan-M3-235B.
The output of this stage is a curated medical and healthcare post-training corpus that combines synthetic expansion, reasoning-focused data, and public medical QA supervision.
3. Post-Training Methodology
3.1 Backbone Models
All reported MedPsy models are built on the Qwen3 model family. We focus on two edge-oriented backbone sizes:
| Model | Backbone | Positioning |
|---|---|---|
| MedPsy-1.7B | Qwen3-1.7B (Thinking) | Smartphone-class and low-memory edge deployment |
| MedPsy-4B | Qwen3-4B-Thinking-2507 | Higher-quality edge deployment on laptops, workstations, and high-end mobile devices |
Both are text-only medical models. This is a deliberate design choice: the target use cases in this report (medical reasoning, clinical Q&A, health literacy, exam-style knowledge, and plain-language communication) are primarily text problems. Keeping the models text-only allows the available parameter budget to be concentrated on medical language understanding and reasoning rather than split across modalities.
The backbone architectures are summarized below.
| Parameter | Qwen3-1.7B | Qwen3-4B |
|---|---|---|
| Hidden size | 2,048 | 2,560 |
| FFN hidden size | 6,144 | 9,728 |
| Layers | 28 | 36 |
| Attention heads | 16 | 32 |
| KV groups (GQA) | 8 | 8 |
| Vocab size | 151,936 | 151,936 |
| Position embedding | RoPE | RoPE |
| Normalization | RMSNorm | RMSNorm |
| Activation | SwiGLU (SiLU + Gated Linear Unit) | SwiGLU (SiLU + Gated Linear Unit) |
Backbone selection. The choice of Qwen3 as the backbone family was informed by a systematic evaluation of candidate models at both target sizes, evaluated on our full medical benchmark suite before any post-training.
4B class. We evaluated five backbones at the ~3–4B parameter scale on both closed-ended benchmarks and HealthBench.
Closed-ended benchmarks.
| Model | Average | AfriMedQA | MMLU (Health) | MedQA (USMLE) | PubMedQA | MedMCQA | MedXpertQA | MMLU-Pro Health |
|---|---|---|---|---|---|---|---|---|
| Qwen3-4B-Thinking-2507 | 63.10 | 64.12 | 85.92 | 70.91 | 74.53 | 61.78 | 16.69 | 67.73 |
| Qwen3-4B (Thinking) | 60.64 | 63.15 | 83.89 | 67.64 | 72.80 | 60.21 | 15.01 | 61.82 |
| Llama-3.2-3B-Instruct | 49.67 | 52.39 | 66.81 | 49.51 | 74.20 | 52.35 | 12.43 | 39.97 |
| SmolLM3-3B | 48.99 | 47.24 | 72.64 | 46.40 | 71.93 | 46.20 | 11.32 | 47.23 |
| gemma-3-4b-it | 42.59 | 45.46 | 62.72 | 40.51 | 59.13 | 45.08 | 10.52 | 34.68 |
Table: 4B-class backbone closed-ended benchmark results. Average is computed over the seven closed-ended benchmarks. Bold indicates best per column.
HealthBench.
| Model | HealthBench (CompassJudger) | HealthBench (Llama-3.3-70B) | HealthBench (GPT-OSS-120B) |
|---|---|---|---|
| Qwen3-4B-Thinking-2507 | 63 | 56 | 36.67 |
| Qwen3-4B (Thinking) | 62 | 55 | 37.67 |
| gemma-3-4b-it | 59 | 53 | 33.33 |
| SmolLM3-3B | 50 | 43.67 | 24 |
| Llama-3.2-3B-Instruct | 37.33 | 32 | 14.67 |
Table: 4B-class backbone HealthBench results across three judges.
Qwen3-4B-Thinking-2507 leads the field on closed-ended benchmarks by +13.43 over Llama-3.2-3B-Instruct, +14.11 over SmolLM3-3B, and +20.51 over gemma-3-4b-it. On HealthBench, it leads Llama-3.2-3B-Instruct by +25.67 (CompassJudger), +24 (Llama-3.3-70B), and +22.00 (GPT-OSS-120B), topping two of the three judges; on the GPT-OSS-120B judge the hybrid Qwen3-4B (Thinking) edges it out by 1 point (37.67 vs 36.67). The two non-Qwen candidates with the strongest HealthBench scores, gemma-3-4b-it (59) and SmolLM3-3B (50), still trail Qwen3-4B-Thinking-2507 by 4 and 13 points respectively on CompassJudger, and gemma-3-4b-it's strong open-ended performance does not carry over to closed-ended medical knowledge, where it ranks last (42.59 average). Even the hybrid Qwen3-4B checkpoint (operated in thinking mode) outperforms Llama-3.2-3B-Instruct by +10.97 on average and by +24.67 / +23 / +23.00 across the three HealthBench judges, trailing the dedicated 2507 variant on CompassJudger and Llama-3.3-70B but slightly ahead of it on GPT-OSS-120B. The consistency of Qwen3's lead across independent judges confirms that its advantage on clinical reasoning and communication is robust, not an artifact of a single evaluator. Based on these results, Qwen3-4B-Thinking-2507 was selected as the 4B backbone for all subsequent post-training; the sub-2B comparison below motivates the corresponding choice of Qwen3-1.7B (Thinking).
Sub-2B class. We evaluated four backbones at the ~1–2B parameter scale on both closed-ended benchmarks and HealthBench.
Closed-ended benchmarks.
| Model | Average | AfriMedQA | MMLU (Health) | MedQA (USMLE) | PubMedQA | MedMCQA | MedXpertQA | MMLU-Pro Health |
|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7B (Thinking) | 49.95 | 51.87 | 72.49 | 47.18 | 72.33 | 49.14 | 11.60 | 45.07 |
| LFM2.5-1.2B-Thinking | 44.15 | 45.07 | 63.48 | 39.85 | 69.20 | 42.11 | 11.54 | 37.81 |
| Llama-3.2-1B-Instruct | 36.18 | 36.84 | 49.34 | 34.04 | 61.00 | 37.89 | 10.25 | 23.88 |
| SmolLM2-1.7B-Instruct | 33.32 | 31.39 | 49.27 | 29.22 | 59.00 | 33.80 | 9.86 | 20.70 |
Table: Sub-2B-class backbone closed-ended benchmark results. Average is computed over the seven closed-ended benchmarks. Bold indicates best per column.
HealthBench.
| Model | HealthBench (CompassJudger) | HealthBench (Llama-3.3-70B) | HealthBench (GPT-OSS-120B) |
|---|---|---|---|
| Qwen3-1.7B (Thinking) | 53 | 47.33 | 27.67 |
| LFM2.5-1.2B-Thinking | 49 | 41.67 | 22.33 |
| Llama-3.2-1B-Instruct | 25 | 22 | 5.33 |
| SmolLM2-1.7B-Instruct | 23 | 18.67 | 6 |
Table: Sub-2B-class backbone HealthBench results across three judges.
Qwen3-1.7B (Thinking) leads by +5.81 over the next-best backbone (LFM2.5-1.2B-Thinking) and by +13.78 / +16.64 over Llama-3.2-1B-Instruct and SmolLM2-1.7B-Instruct on closed-ended benchmarks. The HealthBench gap is consistent across all three judges: Qwen3-1.7B (Thinking) scores 53 / 47.33 / 27.67 (Compass / Llama / GPT-OSS) versus 25 / 22 / 5.33 for Llama-3.2-1B and 23 / 18.67 / 6 for SmolLM2. The near-zero GPT-OSS scores for Llama-3.2-1B-Instruct and SmolLM2-1.7B-Instruct indicate that their clinical outputs are essentially non-functional under the strictest reasoning judge, further confirming Qwen3 as the only viable backbone at this scale.
These results confirmed Qwen3 as the strongest backbone family at both target sizes, providing the highest-quality starting point for medical post-training.
Thinking mode. Both selected backbones are operated in thinking mode for backbone evaluation, all post-training stages, and final evaluation. Qwen3-4B-Thinking-2507 is a dedicated thinking-only checkpoint released by the Qwen team. Qwen3-1.7B is a hybrid checkpoint that supports both thinking and non-thinking modes via the enable_thinking flag in its chat template; we always set enable_thinking=True, which is why it is consistently referred to as Qwen3-1.7B (Thinking) throughout this report.
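For clarity on the thinking-mode setup, the snippet below shows how the hybrid Qwen3-1.7B checkpoint is kept in thinking mode through the chat template's enable_thinking flag, following the publicly documented Qwen3 usage in transformers; the prompt and generation settings are illustrative only (Qwen3-4B-Thinking-2507 is thinking-only and needs no flag).

```python
# Illustrative: forcing the hybrid Qwen3-1.7B checkpoint into thinking mode via
# the chat template's enable_thinking flag. Sampling settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "A 54-year-old presents with crushing chest pain. First steps?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,   # always True for Qwen3-1.7B (Thinking) in this report
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```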
3.2 Post-Training Overview
We describe MedPsy as a family of post-trained models rather than simple fine-tuned models. Starting from compact Qwen3 backbones, we apply a multi-stage post-training recipe that combines supervised learning and reinforcement learning over a medical-specialized data mixture.
At a high level, the recipe follows four stages:
- SFT Stage 1 (Corpus 1). Broad medical adaptation on the large-scale synthetic corpus, building wide medical, health and biology coverage.
- SFT Stage 2 (Corpus 2). Reasoning specialization on a smaller, higher-value clinical QA corpus with teacher-generated reasoning.
- AlphaMedQA RL (Stage 1). Reinforcement Learning on the easy and moderate subset of AlphaMedQA [21], as annotated by the SFT model checkpoint. This reinforces correct reasoning patterns where the model already has partial competence.
- Hard-enriched AlphaMedQA RL (Stage 2). A second RL stage on a hard-enriched subset, constructed by re-annotating the full dataset with the best Stage 1 checkpoint and oversampling the cases the model still fails on.
This staged recipe is important for compact edge models: broad SFT builds domain coverage, narrower SFT improves reasoning quality, and RL further sharpens clinical behavior where pure imitation learning is often insufficient.
Figure 3: Overview of the MedPsy post-training schedule. The model is first trained on Corpus 1, then on Corpus 2, and finally refined through two RL stages based on AlphaMedQA and hard AlphaMedQA samples.
3.3 Multi-Stage Supervised Fine-Tuning
Rather than a single monolithic fine-tune, SFT follows a curriculum-style recipe in which data scope narrows and quality density increases across stages:
- Stage 1: broad medical adaptation (Corpus 1). The model absorbs large-scale synthetic supervision spanning biology, medicine, and health topics. This stage builds wide factual coverage and medical vocabulary.
- Stage 2: reasoning specialization (Corpus 2). The model shifts to a smaller set of high-value clinical QA examples with teacher-generated CoT reasoning. This stage sharpens answer structure, clinical reasoning depth, and response quality.
The ordering matters: broad coverage must come first; a model that has not yet seen enough medical material will not benefit from high-quality reasoning examples. Ending on curated reasoning data ensures the model's final behavior reflects the strongest supervision available. Operational details such as exact learning rates and batch sizes will be added in a later revision.
3.4 Multi-Stage Reinforcement Learning
Reinforcement Learning is applied in two stages that progressively increase difficulty, using DAPO (a variant of GRPO without KL penalty) as the optimization algorithm. Two-stage curriculum RL has shown strong results in medical reasoning, notably in Fleming-R1 [20], which uses GRPO with hard-sample mining across successive stages. We adopt a similar staged design but differ in the specifics of difficulty annotation, dataset construction, and reward shaping.
Before training begins, the dataset is annotated for difficulty by running each sample through the SFT checkpoint multiple times (N=5 attempts) and classifying based on correctness:
- Easy: correct on all N attempts
- Moderate: correct on some but not all attempts
- Difficult: correct on none
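A minimal sketch of this annotation pass is shown below; sample_answer (one sampled generation from the SFT checkpoint) and is_correct (exact-match against the gold option) are hypothetical helpers, not the actual pipeline code.

```python
# Minimal sketch of the difficulty-annotation pass. `sample_answer` and
# `is_correct` are hypothetical helpers, not the actual pipeline code.
N_ATTEMPTS = 5

def annotate_difficulty(sample, sample_answer, is_correct, n_attempts=N_ATTEMPTS):
    """Label one AlphaMedQA sample as easy / moderate / difficult."""
    n_right = sum(
        is_correct(sample_answer(sample["question"]), sample["gold"])
        for _ in range(n_attempts)
    )
    if n_right == n_attempts:
        return "easy"        # correct on all N attempts
    if n_right > 0:
        return "moderate"    # correct on some but not all attempts
    return "difficult"       # correct on none
```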
The reward function incentivizes both structured reasoning and answer correctness:
| Condition | Reward |
|---|---|
| Correct answer + `<think>` reasoning | 1.0 |
| Correct answer, no reasoning | 0.5 |
| Reasoning present + valid format, wrong answer | 0.1 |
| No valid structure | 0.0 |
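The table above maps directly onto a small rule-based reward. The sketch below is a minimal illustration, assuming the completion wraps its reasoning in `<think>...</think>` tags and that extract_final_answer and is_correct are hypothetical parsing helpers.

```python
# Minimal sketch of the rule-based reward from the table above. The parsing
# helpers passed in are hypothetical, not the actual training code.
import re

def medpsy_reward(completion: str, gold: str, extract_final_answer, is_correct) -> float:
    has_reasoning = bool(re.search(r"<think>.*?</think>", completion, flags=re.DOTALL))
    answer = extract_final_answer(completion)      # returns None if no parseable answer
    well_formed = answer is not None

    if well_formed and is_correct(answer, gold):
        return 1.0 if has_reasoning else 0.5       # correct, with / without reasoning
    if well_formed and has_reasoning:
        return 0.1                                 # valid structure, wrong answer
    return 0.0                                     # no valid structure
```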
RL Stage 1 trains on easy and moderate samples (~14-16K of the 18K AlphaMedQA set) for 4 epochs, using the SFT checkpoint as initialization. This stage builds reliable reasoning patterns across clinical scenarios where the model already has partial competence, reinforcing correct chains of thought while discouraging shortcut answers.
After Stage 1, the best checkpoint is selected via held-out evaluation. The full dataset is then re-annotated using this improved checkpoint to obtain an updated difficulty distribution; what was previously difficult may now be moderate, and what remains difficult represents genuinely hard cases.
RL Stage 2 constructs a hard-enriched dataset from the re-annotation: all samples the model still gets wrong, combined with a smaller sample of correctly-answered ones at a 1:2 right-to-wrong ratio (~2–4K samples depending on model size). Training runs for ~500 steps from the best Stage 1 checkpoint, and the best checkpoint is selected via held-out evaluation.
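A minimal sketch of the hard-enriched mixture is shown below, reusing the hypothetical annotate_difficulty helper sketched earlier in this section; the sampling logic is illustrative rather than the exact implementation.

```python
# Sketch of the Stage 2 hard-enriched mixture: keep every sample the re-annotated
# checkpoint still gets wrong, mixed with correctly-answered samples at a
# 1 right : 2 wrong ratio. `annotate` is the hypothetical helper sketched above.
import random

def build_hard_enriched(dataset, annotate, seed=0):
    wrong = [s for s in dataset if annotate(s) == "difficult"]          # still failing
    right = [s for s in dataset if annotate(s) != "difficult"]          # now solved
    rng = random.Random(seed)
    kept_right = rng.sample(right, min(len(right), len(wrong) // 2))    # 1:2 right-to-wrong
    mix = wrong + kept_right
    rng.shuffle(mix)
    return mix
```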
This two-stage curriculum (broad-then-focused) mirrors the SFT design. It is especially important in medicine, where strong average accuracy can mask weaknesses in rare or complex cases. By concentrating Stage 2 on the persistent failure modes, we push the model on exactly the clinical scenarios that matter most.
3.5 Training Infrastructure and Implementation
Cluster
| Component | Specification |
|---|---|
| Worker nodes | 30 nodes (worker-0 through worker-29) |
| GPUs per node | 8x NVIDIA H100 80GB HBM3 |
| Total GPU capacity | 240 H100s |
| Interconnect | InfiniBand (Mellanox ConnectX, 8 HCA ports per node) |
| Job scheduler | SLURM |
| Container | NVIDIA NeMo 25.09 (Enroot), PyTorch 2.5.1, CUDA 12.1 |
Distributed Training
SFT jobs are launched via SLURM + Enroot containers. Each node runs one torchrun launcher that spawns 8 GPU workers. The parallelism strategy per model size is:
| Model Size | Nodes | GPUs | TP | PP | DP |
|---|---|---|---|---|---|
| 1.7B | 4 | 32 | 1 | 1 | 32 |
| 4B | 4 | 32 | 1 | 1 | 32 |
No tensor or pipeline parallelism is used for SFT; both model sizes train with data parallelism across 32 H100 GPUs. Communication uses NCCL over InfiniBand with GPU Direct RDMA. Key optimizations include ZeRO-style distributed optimizer, overlapped gradient reduce/parameter gather, and gradient accumulation fusion.
Throughput
SFT uses 3 epochs, sequence length 4,096 with packed sequence size 4,096, global batch size 512 packed sequences, and bf16 mixed precision. The primary final pipeline required approximately 8,250 H100 GPU hours, with most of the budget spent on data generation (~8,000 GPU hours), followed by SFT (~100 GPU hours) and RL (~150 GPU hours). Including preliminary ablations, failed runs, and evaluation, the total project compute is estimated at approximately 30,000 H100 GPU hours. Training is tracked via Weights & Biases and TensorBoard. Checkpoints are saved in Megatron torch_dist format with fully parallel save, and converted to HuggingFace SafeTensors for evaluation and deployment.
The central claim of this report is that data quality, staged post-training, and alignment design, rather than model scale alone, are the primary reasons these compact models close the gap to or surpass much larger medical baselines.
4. Evaluation
4.1 Evaluation Methodology
We use two evaluation approaches depending on the task type.
4.1.1 Closed-Ended Benchmarks (MCQA and Classification)
For all closed-ended benchmarks, including MMLU (Health), MMLU-Pro Health, MedMCQA, MedQA (USMLE), MedXpertQA, AfriMedQA, and PubMedQA, we adopt the LLM-as-a-Parser evaluation methodology introduced in the QVAC Genesis project [6]. This approach addresses a fundamental reliability problem with conventional evaluation techniques. Traditional methods either constrain the model to output only an option letter (suppressing its reasoning process) or attempt to extract the selected option from free-form responses using fragile regex patterns that frequently fail on edge cases, producing false negatives.
Our approach works in two stages: (1) the model generates its full, unconstrained response including complete reasoning, and (2) a separate reasoning model acts as a parser to extract the final option selected by the model, which is then compared via exact match against the ground truth. This decouples generation from evaluation, ensuring that the model's reasoning ability is exercised during generation while answer extraction remains robust and deterministic. For a comprehensive discussion of why this methodology produces more reliable results than both log-likelihood evaluation and regex-based extraction, we refer the reader to Section 4.2 of the QVAC Genesis II technical report [6].
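As an illustration of the two-stage flow, the sketch below separates unconstrained generation from parser-based answer extraction; query_model and query_parser are hypothetical wrappers around the candidate model and the parser model, and the parser prompt is illustrative rather than the exact template used in the QVAC Genesis evaluation harness.

```python
# Conceptual sketch of LLM-as-a-Parser MCQA evaluation: stage 1 lets the candidate
# model answer freely (with full reasoning), stage 2 asks a separate parser model
# to extract the chosen option letter for exact-match scoring.
PARSER_PROMPT = (
    "Below is a model's full answer to a multiple-choice medical question.\n"
    "Reply with ONLY the single option letter the model finally selected.\n\n"
    "Question:\n{question}\n\nModel answer:\n{response}\n\nSelected option letter:"
)

def evaluate_mcqa(samples, query_model, query_parser):
    correct = 0
    for s in samples:
        # Stage 1: unconstrained generation, including the model's full reasoning.
        response = query_model(s["question_with_options"])
        # Stage 2: a reasoning model acts as parser and extracts the final option.
        parsed = query_parser(
            PARSER_PROMPT.format(question=s["question_with_options"], response=response)
        ).strip().upper()[:1]
        correct += int(parsed == s["gold_letter"])
    return 100.0 * correct / len(samples)
```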
4.1.2 HealthBench (Open-Ended Clinical Evaluation)
For HealthBench [7], OpenAI's open-ended benchmark designed to evaluate LLMs on realistic, complex clinical scenarios that reflect real-world patient-doctor interactions, we employ an LLM-as-a-Judge methodology. HealthBench evaluates models on nuanced clinical reasoning, safety-critical decision making, and communication quality through open-ended responses that cannot be reduced to a single correct option.
To ensure robustness and reduce single-judge bias, we evaluate using a panel of three independent judge models, each selected to bring a distinct evaluation perspective:
- CompassJudger-2-32B-Instruct [8], a judge model trained with verifiable reward-guided reasoning, top-performing on judge and reward benchmarks.
- Llama-3.3-70B-Instruct [9], Meta's instruction-tuned model as a strong general-purpose judging baseline.
- GPT-OSS-120B [10], OpenAI's open-weight Mixture-of-Experts model with strong chain-of-thought reasoning, included as a dedicated reasoning judge for health and medical evaluation.
All three judge scores are computed and tracked internally to monitor judge agreement. Section 4.3 reports the overall HealthBench and HealthBench Hard results under the full three-judge panel, while the per-dimension breakdowns use CompassJudger-2-32B-Instruct as the judge model.
4.2 Closed-Ended Benchmark Results
Table 1 compares all models on closed-ended benchmarks (MCQA, classification, and biomedical QA). Models are sorted by average score.
| Model Name | Average | MMLU (Health) | AfriMedQA | MMLU-Pro Health | MedMCQA | MedQA (USMLE) | MedXpertQA | PubMedQA |
|---|---|---|---|---|---|---|---|---|
| MedPsy-4B | 70.54 | 89.70 | 71.50 | 70.45 | 72.15 | 84.39 | 30.61 | 75.00 |
| MedGemma-27B-text-it | 69.95 | 90.48 | 73.07 | 72.94 | 72.77 | 83.29 | 25.18 | 71.93 |
| Qwen3-4B-Thinking-2507 | 63.10 | 85.92 | 64.12 | 67.73 | 61.78 | 70.91 | 16.69 | 74.53 |
| MedPsy-1.7B | 62.62 | 82.72 | 64.84 | 61.37 | 63.56 | 75.05 | 21.28 | 69.53 |
| MedGemma-1.5-4B-it | 51.20 | 67.69 | 54.38 | 47.31 | 50.08 | 64.39 | 15.80 | 58.73 |
| Qwen3-1.7B (Thinking) | 49.95 | 72.49 | 51.87 | 45.07 | 49.14 | 47.18 | 11.60 | 72.33 |
Table 1: Closed-ended benchmark results across all models, sorted by average. Average is computed over the seven closed-ended benchmarks. Bold indicates best score per column within each size class.
Our 4B model ranks first overall (70.54 vs 69.95 for MedGemma-27B-text-it), beating a model 7x its size. It leads on MedQA-USMLE (+1.10), MedXpertQA (+5.43), and PubMedQA (+3.07). Compared to its backbone (Qwen3-4B-Thinking-2507), our training adds +7.44 points on average. At 1.7B parameters, our model scores 62.62, beating MedGemma-1.5-4B-it (51.20) by +11.42 points despite being less than half its size, and matching Qwen3-4B-Thinking-2507 (63.10), a model 2.4x larger. The largest 1.7B gains are on MedQA-USMLE (+10.66 vs MedGemma-1.5-4B-it) and MMLU Health (+15.03).
PubMedQA is a notable case: the Qwen3-1.7B (Thinking) backbone already scores 72.33, close to the teacher model's own performance on this benchmark. Our post-training slightly reduces this to 69.53, a known trade-off when specializing compact models for medical reasoning. Teacher ablations confirmed there was limited headroom for improvement on PubMedQA, as even substantially larger models plateau in the low-to-mid 70s. The 4B model, with its larger capacity, absorbs the medical specialization without this regression and improves PubMedQA to 75.00.
For a per-benchmark visual breakdown of Table 1, including the closed-ended Average panel (top-left of each figure), see Figures 1 and 2 at the top of this report. Those figures also include a HealthBench / HealthBench Hard panel that previews the open-ended results discussed in Section 4.3.
4.3 HealthBench Results
HealthBench [7] evaluates models on realistic, open-ended clinical scenarios across seven dimensions. Unlike closed-ended benchmarks, these require coherent medical communication, safety-critical decision making, and nuanced uncertainty handling. We first report the overall Standard and Hard scores under all three independent judges, then provide detailed per-dimension results using CompassJudger-2-32B-Instruct [8].
Overall scores across three judges
| Model | HealthBench (CompassJudger-2-32B) | HB Hard (CompassJudger-2-32B) | HealthBench (Llama-3.3-70B) | HB Hard (Llama-3.3-70B) | HealthBench (GPT-OSS-120B) | HB Hard (GPT-OSS-120B) |
|---|---|---|---|---|---|---|
| MedPsy-4B | 74.00 | 58.00 | 66.33 | 48.33 | 51.33 | 28.67 |
| MedPsy-1.7B | 70.33 | 54.33 | 63.00 | 46.00 | 46.00 | 24.67 |
| MedGemma-27B-text-it | 65.00 | 42.67 | 59.00 | 36.00 | 44.67 | 13.00 |
| Qwen3-4B-Thinking-2507 | 63.00 | 42.00 | 56.00 | 34.00 | 36.67 | 9.33 |
| MedGemma-1.5-4B-it | 54.00 | 29.67 | 48.00 | 24.67 | 31.00 | 2.00 |
Table 2: HealthBench overall scores under three independent judges for the Standard and Hard splits. MedPsy-4B ranks first across every judge and split, and MedPsy-1.7B ranks second across the board.
HealthBench
| Model Name | Overall | Expertise-Tailored Communication | Response Depth | Context Seeking | Emergency Referrals | Global Health | Health Data Tasks | Responding Under Uncertainty |
|---|---|---|---|---|---|---|---|---|
| MedPsy-4B | 74.00 | 79.33 | 63.67 | 71.67 | 81.67 | 73.67 | 60.67 | 76.33 |
| MedPsy-1.7B | 70.33 | 76.33 | 56.33 | 69.33 | 80.00 | 68.33 | 57.00 | 74.00 |
| MedGemma-27B-text-it | 65.00 | 73.00 | 61.33 | 58.67 | 73.00 | 61.00 | 56.67 | 66.33 |
| Qwen3-4B-Thinking-2507 | 63.00 | 71.00 | 58.00 | 57.67 | 74.00 | 59.00 | 54.67 | 64.33 |
| MedGemma-1.5-4B-it | 54.00 | 62.67 | 48.67 | 46.00 | 64.00 | 47.67 | 44.67 | 58.33 |
| Qwen3-1.7B (Thinking) | 53.00 | 63.67 | 49.67 | 48.33 | 64.67 | 45.67 | 42.33 | 56.33 |
Table 3: HealthBench results by dimension, evaluated using CompassJudger-2-32B-Instruct. MedPsy-4B leads on all seven dimensions. Both MedPsy models surpass MedGemma-27B-text-it on every dimension.
HealthBench Hard
HealthBench Hard is the hardest subset: cases requiring multi-step reasoning, complex safety judgments, and expert-level clinical knowledge.
| Model Name | Overall | Expertise-Tailored Communication | Response Depth | Context Seeking | Emergency Referrals | Global Health | Health Data Tasks | Responding Under Uncertainty |
|---|---|---|---|---|---|---|---|---|
| MedPsy-4B | 58.00 | 55.33 | 47.67 | 63.33 | 62.33 | 60.00 | 46.67 | 61.00 |
| MedPsy-1.7B | 54.33 | 52.33 | 40.33 | 61.00 | 60.33 | 55.00 | 43.33 | 58.33 |
| Qwen3-4B-Thinking-2507 | 42.67 | 45.00 | 38.67 | 43.00 | 47.33 | 43.33 | 39.67 | 42.00 |
| MedGemma-27B-text-it | 42.00 | 44.67 | 38.67 | 42.00 | 39.67 | 42.67 | 39.33 | 42.67 |
| MedGemma-1.5-4B-it | 29.67 | 31.67 | 29.00 | 28.00 | 29.00 | 29.00 | 23.67 | 35.00 |
| Qwen3-1.7B (Thinking) | 28.33 | 31.67 | 28.33 | 32.00 | 27.67 | 26.67 | 21.33 | 31.00 |
Table 4: HealthBench Hard results by dimension, evaluated using CompassJudger-2-32B-Instruct. Bold indicates best score per column. MedPsy-4B leads MedGemma-27B-text-it by +16.00 points overall despite being 6.75x smaller, and MedPsy-1.7B still leads MedGemma-27B-text-it by +12.33 points despite being ~16x smaller.
Both MedPsy models lead on all seven dimensions in both standard and hard evaluations. The 4B model's strongest results are on Emergency Referrals (81.67 / 62.33), Expertise-Tailored Communication (79.33 / 55.33), and Responding Under Uncertainty (76.33 / 61.00). The 1.7B model also beats MedGemma-27B-text-it on every dimension, with the largest gaps on Context Seeking (+10.66), Emergency Referrals (+7.00), and Responding Under Uncertainty (+7.67). On HealthBench Hard, the gaps widen: our 4B model (58.00) outperforms MedGemma-27B-text-it (42.00) by +16.00 points, and our 1.7B model (54.33) beats it by +12.33 points, a model 16x smaller performing better on the hardest clinical cases. The fact that both models maintain and widen their lead as difficulty increases suggests our training produces deeper clinical reasoning, not surface-level pattern matching.
4.4 Analysis
Three patterns stand out from the evaluation results. First, the largest gains appear on HealthBench and HealthBench Hard, which suggests the improvement is not limited to exam memorization but extends to clinically relevant reasoning, communication, and uncertainty handling. Second, the 1.7B model closes most of the performance gap to much larger models, showing that carefully designed medical post-training data can matter more than parameter count in the edge regime. Third, the 4B model provides the strongest overall trade-off for higher-quality edge deployment, while the 1.7B model targets the smallest devices where memory and latency are the dominant constraints.
The most plausible explanation for these gains is the combination of (i) the curriculum-style two-stage post-training recipe described in Sections 2.1 and 3.3, built from a large-scale Genesis II–style synthetic medical mixture, and (ii) the use of Baichuan-M3-235B as the single reasoning teacher for every long-form supervision target, with no external CoT corpora mixed in. Together, these give the model broad medical coverage and a single, controlled reasoning style, rather than a heterogeneous mixture of reasoning traces, without sacrificing response structure or answer extractability.
4.5 Training Stage Progression
To quantify the contribution of each post-training stage, we evaluate after every stage of the pipeline described in Section 3.2. Tables 5 and 6 show the cumulative progression from the Qwen3 backbone checkpoint through SFT Stage 1, SFT Stage 2, RL Stage 1, and RL Stage 2.
MedPsy-1.7B
| Training Stage | Average | MMLU (Health) | HealthBench | AfriMedQA | MMLU-Pro Health | MedMCQA | MedQA (USMLE) | MedXpertQA | PubMedQA |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7B (Thinking) backbone | 49.95 | 72.49 | 53.00 | 51.87 | 45.07 | 49.14 | 47.18 | 11.60 | 72.33 |
| + SFT Stage 1 (Corpus 1) | 57.48 | 79.70 | 70.33 | 60.90 | 57.05 | 57.26 | 63.97 | 17.09 | 66.40 |
| Δ SFT Stage 1 | +7.53 | +7.21 | +17.33 | +9.03 | +11.98 | +8.12 | +16.79 | +5.49 | −5.93 |
| + SFT Stage 2 (Corpus 2) | 59.70 | 80.14 | 70.33 | 62.45 | 60.27 | 58.54 | 68.76 | 18.89 | 68.87 |
| Δ SFT Stage 2 | +2.22 | +0.44 | 0.00 | +1.55 | +3.22 | +1.28 | +4.79 | +1.80 | +2.47 |
| + RL Stage 1 (AlphaMedQA easy/moderate) | 60.00 | 80.92 | 70.33 | 63.90 | 59.66 | 61.33 | 68.21 | 16.26 | 69.73 |
| Δ RL Stage 1 | +0.30 | +0.78 | 0.00 | +1.45 | −0.61 | +2.79 | −0.55 | −2.63 | +0.86 |
| + RL Stage 2 (hard-enriched) | 62.62 | 82.72 | 70.33 | 64.84 | 61.37 | 63.56 | 75.05 | 21.28 | 69.53 |
| Δ RL Stage 2 | +2.62 | +1.80 | 0.00 | +0.94 | +1.71 | +2.23 | +6.84 | +5.02 | −0.20 |
| Total Δ | +12.67 | +10.23 | +17.33 | +12.97 | +16.30 | +14.42 | +27.87 | +9.68 | −2.80 |
Table 5: Training stage progression for MedPsy-1.7B. Δ rows show the gain from each stage. Average is over 7 closed-ended benchmarks. SFT Stage 1 contributes +7.53, SFT Stage 2 adds +2.22, RL Stage 1 adds +0.30, and hard-enriched RL Stage 2 adds +2.62.
MedPsy-4B
| Training Stage | Average | MMLU (Health) | HealthBench | AfriMedQA | MMLU-Pro Health | MedMCQA | MedQA (USMLE) | MedXpertQA | PubMedQA |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B-Thinking-2507 backbone | 63.04 | 85.78 | 63.00 | 64.30 | 66.62 | 61.89 | 71.56 | 17.30 | 73.80 |
| + SFT Stage 1 (Corpus 1) | 67.93 | 88.77 | 74.00 | 69.35 | 68.87 | 67.72 | 82.38 | 26.22 | 72.20 |
| Δ SFT Stage 1 | +4.89 | +2.99 | +11.00 | +5.05 | +2.25 | +5.83 | +10.82 | +8.92 | −1.60 |
| + SFT Stage 2 (Corpus 2) | 69.29 | 89.58 | 74.00 | 70.47 | 69.77 | 69.21 | 84.32 | 28.71 | 73.00 |
| Δ SFT Stage 2 | +1.36 | +0.81 | 0.00 | +1.12 | +0.90 | +1.49 | +1.94 | +2.49 | +0.80 |
| + RL Stage 1 (AlphaMedQA easy/moderate) | 68.85 | 89.32 | 74.00 | 71.23 | 69.56 | 71.21 | 84.13 | 25.47 | 71.07 |
| Δ RL Stage 1 | −0.44 | −0.26 | 0.00 | +0.76 | −0.21 | +2.00 | −0.19 | −3.24 | −1.93 |
| + RL Stage 2 (hard-enriched) | 70.54 | 89.70 | 74.00 | 71.50 | 70.45 | 72.15 | 84.39 | 30.61 | 75.00 |
| Δ RL Stage 2 | +1.69 | +0.38 | 0.00 | +0.27 | +0.89 | +0.94 | +0.26 | +5.14 | +3.93 |
| Total Δ | +7.50 | +3.92 | +11.00 | +7.20 | +3.83 | +10.26 | +12.83 | +13.31 | +1.20 |
Table 6: Training stage progression for MedPsy-4B. Δ rows show the gain from each stage. Average is over 7 closed-ended benchmarks. SFT Stage 1 contributes +4.89, SFT Stage 2 adds +1.36, RL Stage 1 is approximately neutral on average, and hard-enriched RL Stage 2 adds the final +1.69. The backbone row here and the Qwen3-4B-Thinking-2507 row in Table 1 correspond to the same checkpoint; minor sub-point differences reflect independent evaluation runs.
Several patterns are consistent across both model sizes. SFT Stage 1 delivers the largest single improvement (+7.53 for 1.7B, +4.89 for 4B), confirming that broad medical coverage from Corpus 1 is the foundation of the pipeline. The gains are most dramatic on HealthBench (+17.33 for 1.7B, +11.00 for 4B) and MedQA-USMLE (+16.79 for 1.7B, +10.82 for 4B). SFT Stage 2 adds targeted improvements across clinical reasoning benchmarks: MedQA-USMLE (+4.79 for 1.7B), MMLU-Pro Health (+3.22 for 1.7B), and MedXpertQA (+2.49 for 4B) all improve, showing that the smaller, higher-quality Corpus 2 refines the capabilities that Stage 1 established. The two RL stages provide the final sharpening: RL Stage 1 consolidates easy and moderate AlphaMedQA cases, while the hard-enriched RL Stage 2 recovers and expands performance on the hardest benchmarks, especially MedXpertQA (+5.02 for 1.7B and +5.14 for 4B from RL Stage 1 to RL Stage 2).
Two additional observations emerge. First, HealthBench saturates early: the entire HealthBench gain is captured in SFT Stage 1 (+17.33 for 1.7B, +11.00 for 4B), with no further change from Stage 2 or RL. This suggests that communication quality in realistic health scenarios is primarily driven by broad medical exposure rather than narrow reasoning refinement. Second, PubMedQA shows a small regression for the 1.7B model (−2.80 total), driven primarily by SFT Stage 1 (−5.93) with partial recovery in later stages. Teacher ablations revealed that even the strongest teacher models plateau in the low-to-mid 70s on PubMedQA, leaving minimal headroom for distillation-based improvement; the Qwen3-1.7B (Thinking) backbone already achieves 72.33, comparable to the teacher's own ceiling on this benchmark. The 4B model, with its larger capacity, avoids this regression entirely and improves PubMedQA to 75.00 (+1.20).
4.6 Token Efficiency
Beyond accuracy, a key advantage of the MedPsy models is their token efficiency, the ability to produce correct, well-structured medical answers using significantly fewer tokens than the corresponding Qwen3 backbones. For edge deployment, response length directly impacts latency, memory bandwidth, and energy consumption per query, making token efficiency as important as accuracy for practical clinical applications.
We measure the weighted average response length in tokens across all evaluation benchmarks, weighted by the number of samples in each benchmark, and compare directly against the Qwen3 backbones.
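The weighting itself is straightforward; the sketch below illustrates the computation with placeholder per-benchmark values (the sample counts and mean lengths shown are not the measured numbers).

```python
# Sketch of the sample-weighted average response length: per-benchmark mean
# lengths are weighted by the number of samples in each benchmark.
def weighted_avg_length(per_benchmark: dict) -> float:
    """per_benchmark maps name -> (num_samples, mean_response_tokens)."""
    total = sum(n for n, _ in per_benchmark.values())
    return sum(n * mean_tokens for n, mean_tokens in per_benchmark.values()) / total

per_benchmark = {           # placeholder values, not the measured numbers
    "MedQA (USMLE)": (1000, 800),
    "MedMCQA": (4000, 700),
    "PubMedQA": (500, 600),
}
print(f"Weighted average: {weighted_avg_length(per_benchmark):.0f} tokens")
```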
4B model class.
| Metric | Qwen3-4B-Thinking-2507 | MedPsy-4B |
|---|---|---|
| Weighted Avg. Response Length (Tokens) | 2,953 | 909 |
| Δ Reduction | — | 3.2x fewer tokens |
The 4B model shows the most dramatic improvement: MedPsy-4B generates answers in approximately 909 tokens on average, compared to 2,953 tokens for Qwen3-4B-Thinking-2507. As shown in Figure 4, this gap is consistent across every benchmark. The largest absolute reductions appear on the most reasoning-intensive tasks, MedXpertQA, MedQA-USMLE, and MMLU-Pro Health, where the backbone's extended thinking process produces substantially longer outputs without corresponding accuracy gains over our post-trained model. Even on HealthBench, the open-ended clinical evaluation where longer responses are often necessary for thorough clinical communication, MedPsy-4B remains significantly more concise than the backbone.
Figure 4: Average response length (tokens) per benchmark for the 4B model class. Lower is better. MedPsy-4B consistently produces substantially shorter responses than Qwen3-4B-Thinking-2507 across all benchmarks while achieving higher overall accuracy.
1.7B model class.
| Metric | Qwen3-1.7B (Thinking) | MedPsy-1.7B |
|---|---|---|
| Weighted Avg. Response Length (Tokens) | 1,901 | 1,110 |
| Δ Reduction | — | 1.7x fewer tokens |
The 1.7B model achieves a 1.7x reduction, generating approximately 1,110 tokens compared to 1,901 tokens for Qwen3-1.7B (Thinking). While the relative reduction is smaller than at the 4B scale, this is partly because the Qwen3-1.7B (Thinking) backbone already generates relatively concise outputs compared to its 4B counterpart. The per-benchmark breakdown in Figure 5 shows that MedPsy-1.7B achieves large reductions on MedQA-USMLE, MedXpertQA, MMLU (Health), and MMLU-Pro Health. Notably, on HealthBench, MedPsy-1.7B generates slightly longer responses than its backbone, reflecting the richer, more clinically detailed answers that drive its strong HealthBench performance (+17.33 points over the Qwen3-1.7B (Thinking) backbone).
Figure 5: Average response length (tokens) per benchmark for the 1.7B model class. Lower is better. MedPsy-1.7B produces shorter responses than Qwen3-1.7B (Thinking) on most benchmarks. On HealthBench, the slightly longer responses reflect improved clinical communication quality.
Implications for edge deployment. These efficiency gains are a direct consequence of our multi-stage post-training pipeline. The supervised fine-tuning stages teach the model to produce structured, focused medical reasoning without the verbose exploratory chains that characterize backbone outputs, while reinforcement learning further sharpens response conciseness. In our RL ablations, the main compression comes from DAPO's drift-enabling modifications, especially removing the KL anchor and using asymmetric clip-higher, followed by a soft overlong-response penalty that prevents the long-response tail from re-expanding. The ordering also matters: RL Stage 1 installs a low-token-budget reasoning habit on easier and moderate cases, and RL Stage 2 then pushes hard-case accuracy while preserving that shorter reasoning style. Combined with the accuracy gains documented in Sections 4.2–4.4, this means that MedPsy models not only answer medical questions more accurately than their backbones, but do so using significantly fewer tokens, a compound advantage that is critical for real-time clinical decision support on resource-constrained devices.
4.7 Quantization for Mobile Deployment
Accuracy and token efficiency only matter on mobile if the weights themselves fit. Smartphones, tablets, and other consumer hardware typically expose a small RAM budget to a single application and have no dedicated VRAM, so the BF16 checkpoints used in Sections 4.2–4.6 must be compressed before they can be deployed. We deployed through the QVAC SDK and converted both MedPsy checkpoints to the GGUF format using llama.cpp [22]. The GGUF repositories include an unquantized BF16 GGUF export for llama.cpp-native use plus seven quantized GGUF files per model, spanning legacy, K-quant, and I-quant formats at four nominal bit counts (8, 5, 4, and 3). The BF16 GGUF export is not a quantization and has not been separately evaluated with llama.cpp; the quantization results below evaluate only the quantized files against the BF16 source-model baseline.
Quantization methodology
We evaluate three quantized GGUF format groups:
- Legacy block quantization (`Q8_0`). `Q8_0` is the 8-bit legacy quantization format.
- K-quants (`Q5_K_M`, `Q4_K_M`). `Q5_K_M` adds a high-quality 5-bit option with near-lossless performance, while `Q4_K_M` is the long-standing default for 4-bit llama.cpp deployments and offers the best size/quality trade-off in our evaluation.
- I-quants (`IQ4_NL`, `IQ4_XS`, `IQ3_M`, `IQ3_XXS`). Newer non-linear quantization formats designed for low-bit deployments. The 3-bit variants (`IQ3_M` and `IQ3_XXS`) exist only in the I-quant family.
For sub-8-bit quantization we use importance-matrix (imatrix) calibration: per-tensor activation statistics are computed from a representative corpus and used to allocate quantization precision asymmetrically across channels, preserving the directions that matter most for the output distribution. We compared imatrix and non-imatrix builds at every precision level. At Q8_0 the two were indistinguishable on both model sizes (within 0.3 on closed-ended Average and within 1 on any HealthBench dimension), so we publish the non-imatrix Q8_0 file as it is simpler to reproduce. At Q5_K_M and below, imatrix calibration consistently reduces degradation, so all published sub-8-bit variants use imatrix calibration. We quantify the imatrix benefit at 4-bit explicitly in the imatrix ablation below.
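For readers reproducing the GGUF exports, the sketch below outlines the conversion and calibrated quantization flow with the llama.cpp tools; the paths, calibration corpus, and model directory are placeholders, and the tool names and flags follow the llama.cpp CLI as documented at the time of writing, so they may differ across versions.

```python
# Sketch of the GGUF export + imatrix-calibrated quantization flow with llama.cpp.
# Paths and the calibration text file are placeholders; tool names/flags follow
# llama.cpp's documented CLI and may differ by version.
import subprocess

MODEL_DIR = "MedPsy-4B"                     # HF checkpoint (SafeTensors), placeholder path
BF16_GGUF = "medpsy-4b-bf16.gguf"
IMATRIX = "medpsy-4b.imatrix"
CALIB_TEXT = "medical_calibration.txt"      # representative calibration corpus (placeholder)

# 1) Export the HF checkpoint to an unquantized BF16 GGUF.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", MODEL_DIR, "--outtype", "bf16", "--outfile", BF16_GGUF],
    check=True,
)

# 2) Compute the importance matrix from the calibration corpus.
subprocess.run(
    ["./llama-imatrix", "-m", BF16_GGUF, "-f", CALIB_TEXT, "-o", IMATRIX],
    check=True,
)

# 3) Produce the calibrated sub-8-bit quantizations (Q8_0 is built without imatrix).
for qtype in ["Q5_K_M", "Q4_K_M", "IQ4_NL", "IQ4_XS", "IQ3_M", "IQ3_XXS"]:
    subprocess.run(
        ["./llama-quantize", "--imatrix", IMATRIX, BF16_GGUF, f"medpsy-4b-{qtype}.gguf", qtype],
        check=True,
    )
subprocess.run(["./llama-quantize", BF16_GGUF, "medpsy-4b-Q8_0.gguf", "Q8_0"], check=True)
```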
We re-run the full closed-ended benchmark suite (7 medical benchmarks, averaged) and HealthBench (Standard and Hard, with CompassJudger-2-32B-Instruct as judge) on every quantized variant. The reference row in each table is the BF16 source model evaluated with vLLM, not the unquantized BF16 GGUF file. AVG Score is the mean of HealthBench Overall and Closed-Ended Average, and is used as a single quality summary throughout this section. Δ Score is the absolute change in AVG Score vs the BF16 source-model baseline (all scores in this section are on a 0–100 scale, so deltas are reported as bare numbers, e.g. −0.81 means the AVG Score drops from 72.27 to 71.46). Δ Score (rel %) reports the same loss as a percentage of the BF16 baseline.
4B model class
| Variant | Imatrix | Size (GB) | Δ Size | HealthBench | HB Hard | Closed-Ended Avg | AVG Score | Δ Score | Δ Score (rel %) |
|---|---|---|---|---|---|---|---|---|---|
| MedPsy-4B BF16 | — | 8.83 | 0% | 74 | 58 | 70.54 | 72.27 | 0.00 | 0.00% |
| MedPsy-4B-Q8_0 | no | 4.69 | −47% | 74 | 57 | 70.25 | 72.13 | −0.15 | −0.20% |
| MedPsy-4B-Q5_K_M | yes | 3.16 | −64% | 74 | 58 | 69.96 | 71.98 | −0.29 | −0.40% |
| MedPsy-4B-Q4_K_M | yes | 2.72 | −69% | 73 | 56 | 69.92 | 71.46 | −0.81 | −1.12% |
| MedPsy-4B-IQ4_NL | yes | 2.60 | −71% | 73 | 57 | 69.50 | 71.25 | −1.02 | −1.41% |
| MedPsy-4B-IQ4_XS | yes | 2.48 | −72% | 73 | 57 | 69.39 | 71.20 | −1.08 | −1.49% |
| MedPsy-4B-IQ3_M | yes | 2.13 | −76% | 73 | 58 | 68.55 | 70.78 | −1.50 | −2.07% |
| MedPsy-4B-IQ3_XXS | yes | 1.84 | −79% | 69 | 51 | 64.42 | 66.71 | −5.56 | −7.69% |
Table 7: Quantized MedPsy-4B GGUF variants compared against the BF16 source-model vLLM baseline. The BF16 row is a reference baseline, not a GGUF quantization result. Δ Size is the relative file-size change vs the BF16 reference size; Δ Score is the absolute change in AVG Score on the 0–100 scale. AVG Score = (HealthBench Overall + Closed-Ended Average) / 2. HealthBench evaluated with CompassJudger-2-32B-Instruct.
The 4B model is remarkably robust to aggressive quantization. Q8_0 is effectively lossless (−0.15 AVG Score at less than half the size), and Q5_K_M adds a high-quality 5-bit option with only −0.29 AVG Score while cutting file size by 64%. Q4_K_M remains the best mobile/laptop trade-off (−0.81 at 69% smaller). The I-quants compress further with only modest additional cost: IQ4_NL (2.60 GB) and IQ4_XS (2.48 GB) lose just 1.0–1.1 in AVG Score, and IQ3_M (2.13 GB) loses only 1.50 while exactly matching the BF16 HealthBench Hard score (58), a remarkable result for a 3-bit format. The collapse only happens at IQ3_XXS (1.84 GB, −5.56), where HealthBench Hard drops from 58 to 51 and Closed-Ended Average from 70.54 to 64.42. Crucially, even this worst configuration still scores 64.42 closed-ended / 69 HealthBench, well above the unquantized Qwen3-4B-Thinking-2507 backbone (63.10 / 63) and unquantized MedGemma-1.5-4B-it (51.20 / 54).
1.7B model class
| Variant | Imatrix | Size (GB) | Δ Size | HealthBench | HB Hard | Closed-Ended Avg | AVG Score | Δ Score | Δ Score (rel %) |
|---|---|---|---|---|---|---|---|---|---|
| MedPsy-1.7B BF16 | — | 4.07 | 0% | 70 | 54 | 62.62 | 66.31 | 0.00 | 0.00% |
| MedPsy-1.7B-Q8_0 | no | 2.16 | −47% | 70 | 55 | 62.62 | 66.31 | 0.00 | 0.00% |
| MedPsy-1.7B-Q5_K_M | yes | 1.47 | −64% | 70 | 55 | 62.58 | 66.29 | −0.02 | −0.03% |
| MedPsy-1.7B-Q4_K_M | yes | 1.28 | −69% | 69 | 52 | 62.16 | 65.58 | −0.73 | −1.10% |
| MedPsy-1.7B-IQ4_NL | yes | 1.23 | −70% | 69 | 51 | 60.22 | 64.61 | −1.70 | −2.56% |
| MedPsy-1.7B-IQ4_XS | yes | 1.18 | −71% | 69 | 53 | 60.05 | 64.53 | −1.79 | −2.69% |
| MedPsy-1.7B-IQ3_M | yes | 1.03 | −75% | 67 | 49 | 58.46 | 62.73 | −3.58 | −5.40% |
| MedPsy-1.7B-IQ3_XXS | yes | 0.89 | −78% | 59 | 40 | 48.71 | 53.86 | −12.46 | −18.78% |
Table 8: Quantized MedPsy-1.7B GGUF variants compared against the BF16 source-model vLLM baseline. The BF16 row is a reference baseline, not a GGUF quantization result. Δ Size is the relative file-size change vs the BF16 reference size; Δ Score is the absolute change in AVG Score on the 0–100 scale. AVG Score = (HealthBench Overall + Closed-Ended Average) / 2.
The 1.7B model is far less forgiving under aggressive quantization. Q8_0 is exactly lossless (AVG Score 66.31 vs 66.31, 47% smaller), Q5_K_M is effectively unchanged (−0.02 AVG Score at 64% smaller), and Q4_K_M is again a near-free win (−0.73 AVG Score, 69% smaller, 1.28 GB). The I-quants, however, degrade noticeably faster than at 4B scale: IQ4_NL and IQ4_XS lose 1.70 and 1.79 in AVG Score (vs 1.02 and 1.08 on the 4B model), IQ3_M loses 3.58 (vs 1.50 on 4B), and IQ3_XXS collapses by −12.46 (vs −5.56 on 4B), with HealthBench Hard falling from 54 to 40 and the Closed-Ended Average from 62.62 to 48.71, a regression that erases most of the post-training gains documented in Section 4.5. We therefore do not recommend any 3-bit variant of the 1.7B model for medical use, and ship them only as research artifacts.
Capacity vs aggressive quantization: 4B is markedly more robust than 1.7B
The contrast between Tables 7 and 8 is the central new finding of this section. At 8-bit and 5-bit, both models are effectively unchanged. For the same nominal precision below 4-bit, however, the 1.7B model loses roughly 2× more quality than the 4B model:
| Format | Δ Score 4B | Δ Score 1.7B | 1.7B / 4B ratio |
|---|---|---|---|
| Q8_0 | −0.15 | 0.00 | — |
| Q5_K_M | −0.29 | −0.02 | near-lossless |
| Q4_K_M | −0.81 | −0.73 | ≈1× |
| IQ4_NL | −1.02 | −1.70 | 1.7× |
| IQ4_XS | −1.08 | −1.79 | 1.7× |
| IQ3_M | −1.50 | −3.58 | 2.4× |
| IQ3_XXS | −5.56 | −12.46 | 2.2× |
Table 9: Per-format AVG Score degradation by model size (absolute change vs BF16 baseline, on the 0–100 scale). Q8_0 and Q5_K_M are near-lossless, and at Q4_K_M the two models degrade by essentially the same amount. Below 4-bit, the 1.7B model is roughly 2× more sensitive to quantization than the 4B model.
This pattern is consistent with the intuition that smaller models have less weight redundancy: every channel carries more of the model's behavior, so the precision lost when channels are aggressively quantized translates more directly into degraded reasoning. The 4B model, with more than 2× the parameters, can absorb the same quantization error across many more channels and emerges nearly intact even at IQ3_M. A practical consequence is that at the same on-disk size, the 4B model is the better choice: in the ~2 GB band, IQ3_M-4B (2.13 GB, AVG Score 70.78, HealthBench Hard 58) outperforms the best 1.7B variant of comparable size, Q8_0-1.7B (2.16 GB, AVG Score 66.31, HealthBench Hard 55), by +4.47 AVG Score while being slightly smaller. In other words, on devices with a few GB of memory available to the model, spending those bytes on more parameters at lower precision (4B at 3-bit) buys more medical capability than spending them on fewer parameters at higher precision (1.7B at 8-bit).
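The same trade-off can be framed as a simple selection rule: under a fixed on-disk budget, pick the variant with the highest AVG Score that fits. The helper below is purely illustrative (not shipped code), with sizes and scores copied from Tables 7 and 8.

```python
# Illustrative helper: pick the best GGUF variant that fits a disk/memory budget.
# Sizes (GB) and AVG Scores are copied from Tables 7 and 8; this is not shipped code.
VARIANTS = [
    ("MedPsy-4B-Q8_0",     4.69, 72.13),
    ("MedPsy-4B-Q5_K_M",   3.16, 71.98),
    ("MedPsy-4B-Q4_K_M",   2.72, 71.46),
    ("MedPsy-4B-IQ3_M",    2.13, 70.78),
    ("MedPsy-1.7B-Q8_0",   2.16, 66.31),
    ("MedPsy-1.7B-Q5_K_M", 1.47, 66.29),
    ("MedPsy-1.7B-Q4_K_M", 1.28, 65.58),
]

def best_under_budget(budget_gb: float):
    """Return the variant with the highest AVG Score whose file fits the budget."""
    fitting = [v for v in VARIANTS if v[1] <= budget_gb]
    return max(fitting, key=lambda v: v[2]) if fitting else None

print(best_under_budget(2.2))  # ('MedPsy-4B-IQ3_M', 2.13, 70.78): 4B at 3-bit wins the ~2 GB band
print(best_under_budget(1.5))  # ('MedPsy-1.7B-Q5_K_M', 1.47, 66.29)
```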
Imatrix calibration ablation at 4-bit
To isolate the contribution of imatrix calibration at 4-bit, we re-quantized both models in Q4_K_M with and without imatrix calibration. The results below show that imatrix is the dominant lever for preserving 1.7B quality at 4-bit, while it is helpful but not critical at 4B.
| Model | Variant | Closed-Ended Avg | Δ vs BF16 | HealthBench | Δ vs BF16 |
|---|---|---|---|---|---|
| 4B | Q4_K_M with imatrix | 69.92 | −0.62 | 73 | −1.00 |
| 4B | Q4_K_M without imatrix | 69.60 | −0.94 | 73 | −1.00 |
| 1.7B | Q4_K_M with imatrix | 62.16 | −0.46 | 69 | −1.00 |
| 1.7B | Q4_K_M without imatrix | 60.58 | −2.04 | 68 | −2.00 |
Table 10: Imatrix calibration ablation at Q4_K_M. Imatrix gives a small +0.32 closed-ended boost at 4B (within evaluation noise on HealthBench), but a large +1.58 closed-ended boost at 1.7B together with a +1 HealthBench gain.
On the 1.7B model, the closed-ended drop without imatrix is concentrated on the most reasoning-intensive tasks: MedQA-USMLE drops by 4.69, MMLU-Pro Health by 3.59, and MMLU (Health) by 2.07, the same benchmarks that benefited most from post-training (Section 4.5). This makes imatrix calibration essential for any 4-bit-or-lower 1.7B deployment, while at 4B it remains worthwhile but optional. We ship every sub-8-bit GGUF file with imatrix calibration applied.
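For readers who want to reproduce this kind of ablation on their own GGUF exports, the sketch below shows the typical llama.cpp imatrix workflow: compute an importance matrix on a calibration text file, then pass it to the quantizer. The file names and calibration corpus are placeholders, and the tool invocations follow the llama.cpp imatrix documentation [22]; exact flags may differ across llama.cpp versions.

```python
# Sketch of the llama.cpp imatrix quantization workflow (placeholder paths; tool
# names and flags follow the llama.cpp imatrix/quantize docs [22] and may vary by version).
import subprocess

SRC_GGUF  = "MedPsy-1.7B-BF16.gguf"    # unquantized BF16 GGUF export (placeholder name)
CALIB_TXT = "calibration_medical.txt"  # calibration text for the importance matrix (placeholder)
IMATRIX   = "medpsy-1.7b.imatrix"

# 1) Compute the importance matrix from the BF16 GGUF on the calibration corpus.
subprocess.run(
    ["llama-imatrix", "-m", SRC_GGUF, "-f", CALIB_TXT, "-o", IMATRIX],
    check=True,
)

# 2) Quantize to Q4_K_M with the imatrix applied...
subprocess.run(
    ["llama-quantize", "--imatrix", IMATRIX, SRC_GGUF, "MedPsy-1.7B-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)

# ...and once more without it, to reproduce the Table 10 ablation.
subprocess.run(
    ["llama-quantize", SRC_GGUF, "MedPsy-1.7B-Q4_K_M-noimatrix.gguf", "Q4_K_M"],
    check=True,
)
```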
Implications for mobile deployment
Combining the results above, the recommended on-device configurations are:
- **Best quality, near-lossless.** Q8_0 on either model size (4.69 GB for the 4B model, 2.16 GB for the 1.7B model). Statistically indistinguishable from BF16, no imatrix needed.
- **High-quality 5-bit tier.** Q5_K_M with imatrix (3.16 GB for 4B, 1.47 GB for 1.7B) gives extra quality headroom over 4-bit with almost no measured loss (−0.29 at 4B, −0.02 at 1.7B).
- **Best size/quality trade-off.** Q4_K_M with imatrix on either model (2.72 GB for 4B, 1.28 GB for 1.7B). Sub-1 AVG Score loss (−0.81 at 4B, −0.73 at 1.7B), comfortably fitting on high-end mobile devices (4B) or any modern smartphone (1.7B).
- **Around 2 GB on the 4B model.** IQ3_M with imatrix (2.13 GB) is an excellent compact option for the 4B model, matching the BF16 HealthBench Hard score at a 76% size reduction.
- **Smartphone-class deployment under 1.5 GB.** Q4_K_M (1.28 GB) on the 1.7B model is the right choice; use Q5_K_M (1.47 GB) if you can spend the extra memory for near-lossless quality. We do not recommend going below 4-bit on the 1.7B model for medical use.
In every recommended quantized configuration, the MedPsy models retain a substantial accuracy lead over the unquantized open-weight baselines in their respective size class, including their own Qwen3 backbones, MedGemma-1.5-4B-it, and, for MedPsy-4B, even MedGemma-27B-text-it.
All published GGUF files are compatible with llama.cpp and designed for deployment through the QVAC SDK, enabling private local inference without sending patient or health data to a remote model endpoint.
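As a concrete illustration of that deployment path, the snippet below loads a quantized MedPsy GGUF locally with the third-party llama-cpp-python bindings. The file path, generation settings, and prompt are placeholders, and this stands in for, rather than reproduces, the QVAC SDK integration.

```python
# Minimal local-inference sketch with llama-cpp-python (pip install llama-cpp-python).
# The GGUF path, settings, and prompt are placeholders; the QVAC SDK exposes its own API.
from llama_cpp import Llama

llm = Llama(
    model_path="MedPsy-4B-Q4_K_M.gguf",  # local file, nothing leaves the device
    n_ctx=4096,                          # context window for the session
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "A 58-year-old on metformin reports persistent nausea. "
                       "What should be considered before the next dose adjustment?",
        },
    ],
    max_tokens=512,
    temperature=0.2,
)

print(response["choices"][0]["message"]["content"])
```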
5. Future Work
MedPsy is an ongoing research initiative, and future work will broaden both the scope and depth of evaluation. We plan to incorporate more open-ended medical benchmarks, as well as expand assessments to address safety, error detection, and general robustness across diverse clinical situations.
We also plan to evaluate MedPsy on broader general-domain benchmarks. This will help quantify how medical specialization affects general reasoning, instruction-following behavior, and everyday assistant capabilities, especially under edge-device constraints and quantized deployment settings.
6. Conclusion
MedPsy shows that compact, text-only medical models can deliver strong clinical reasoning and healthcare performance without relying on frontier-scale parameter counts. Across closed-ended medical benchmarks, HealthBench, HealthBench Hard, token-efficiency analysis, and quantized deployment experiments, both MedPsy-1.7B and MedPsy-4B demonstrate that high-quality medical post-training can make edge-scale models competitive with, and in several cases stronger than, much larger medical baselines.
The key result is practical: medical AI can move closer to where healthcare data already lives, on local devices, with lower latency, stronger privacy, and reduced infrastructure requirements. MedPsy is a step toward clinically useful, deployable, and privacy-preserving medical intelligence for the QVAC ecosystem.
7. References
[1] "QVAC SDK: Decentralized, Local AI in a Single API." Tether Data, S.A. de C.V., 2026. https://qvac.tether.io/
[2] Subash SN, Nambiar, A., Lambert, P., Gritta, M., Cordella, G., and Nurman, A. "An Edge-First Generalized LLM LoRA Fine-Tuning Framework for Heterogeneous GPUs." Tether AI Research, 2025. https://huggingface.co/blog/qvac/fabric-llm-finetune
[3] Yang, A. et al. "Qwen3 Technical Report." arXiv preprint arXiv:2505.09388, 2025. https://arxiv.org/abs/2505.09388
[4] Sellergren, A. et al. "MedGemma Technical Report." arXiv preprint arXiv:2507.05201, 2026. https://arxiv.org/abs/2507.05201
[5] Ardoino, P. "Tether Launches QVAC SDK as the AI Universal Building Block that Runs, Trains, and Evolves Intelligence Across any Device and Platform." Tether.io, April 9, 2026. https://tether.io/news/tether-launches-qvac-sdk-as-the-ai-universal-building-block-that-runs-trains-and-evolves-intelligence-across-any-device-and-platform/
[6] Subash SN, Vitabile, D., Nambiar, A., and Nurman, A. "QVAC Genesis II: Expanding the Largest and Highest-Quality Multi-domain Educational Synthetic Dataset for LLM Pre-training." Tether AI Research, 2025. https://huggingface.co/blog/qvac/genesis-ii
[7] Arora, R. K., Wei, J., Soskin Hicks, R., Bowman, P., Quiñonero-Candela, J., Tsimpourlas, F., Sharman, M., Shah, M., Vallone, A., Beutel, A., Heidecke, J., and Singhal, K. "HealthBench: Evaluating Large Language Models Towards Improved Human Health." arXiv preprint arXiv:2505.08775, 2025. https://arxiv.org/abs/2505.08775. Code and data: https://github.com/openai/simple-evals
[8] Zhang, T., Cao, M., Lam, A., Zhang, S., and Chen, K. "CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards." arXiv preprint arXiv:2507.09104, 2025. https://arxiv.org/abs/2507.09104
[9] Grattafiori, A. et al. "The Llama 3 Herd of Models." arXiv preprint arXiv:2407.21783, 2024. https://arxiv.org/abs/2407.21783
[10] OpenAI. "gpt-oss-120b & gpt-oss-20b Model Card." arXiv preprint arXiv:2508.10925, 2025. https://arxiv.org/abs/2508.10925
[11] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. "Measuring Massive Multitask Language Understanding." arXiv preprint arXiv:2009.03300, 2021. https://arxiv.org/abs/2009.03300
[12] Wang, Y. et al. "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark." arXiv preprint arXiv:2406.01574, 2024. https://arxiv.org/abs/2406.01574
[13] Pal, A., Umapathi, L. K., and Sankarasubbu, M. "MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering." arXiv preprint arXiv:2203.14371, 2022. https://arxiv.org/abs/2203.14371
[14] Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H., and Szolovits, P. "What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams." Applied Sciences 11(14):6421, 2021. https://doi.org/10.3390/app11146421
[15] Zuo, Y., Qu, S., Li, Y., Chen, Z., Zhu, X., Hua, E., Zhang, K., Ding, N., and Zhou, B. "MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding." arXiv preprint arXiv:2501.18362, 2025. https://arxiv.org/abs/2501.18362
[16] Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W., and Lu, X. "PubMedQA: A Dataset for Biomedical Research Question Answering." arXiv preprint arXiv:1909.06146, 2019. https://arxiv.org/abs/1909.06146
[17] Olatunji, T. et al. "AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset." Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025. https://doi.org/10.18653/v1/2025.acl-long.96
[18] Subash SN, Nambiar, A., Vitabile, D., Gupta, K., and Nurman, A. "QVAC Genesis I: the Largest and Highest-Quality Multi-domain Educational Synthetic Dataset for Pre-training." Tether AI Research, 2025. https://huggingface.co/blog/qvac/genesis-i
[19] M3 Team et al. "Baichuan-M3: Modeling Clinical Inquiry for Reliable Medical Decision-Making." arXiv preprint arXiv:2602.06570, 2026. https://arxiv.org/abs/2602.06570
[20] Liu, C., Li, D., Shu, Y., Chen, R., Duan, D., Fang, T., and Dai, B. "Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning." arXiv preprint arXiv:2509.15279, 2025. https://arxiv.org/abs/2509.15279
[21] Liu, C., Wang, H., Pan, J., Wan, Z., Dai, Y., Lin, F., Bai, W., Rueckert, D., and Arcucci, R. "Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL." arXiv preprint arXiv:2505.17952, 2025. https://arxiv.org/abs/2505.17952
[22] Gerganov, G. et al. "llama.cpp: LLM inference in C/C++." 2023–2026. https://github.com/ggml-org/llama.cpp. Importance-matrix (imatrix) quantization documentation: https://github.com/ggml-org/llama.cpp/tree/master/tools/imatrix.