How to use from the llama-cpp-python library
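# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Download (if needed) and load one GGUF file from the repo.
# Q4_K_M is the file this card recommends for most users.
llm = Llama.from_pretrained(
    repo_id="qvac/MedPsy-4B-GGUF",
    filename="medpsy-4b-q4_k_m-imat.gguf",
)
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)
# The reply text is in the first choice of the returned dict.
print(response["choices"][0]["message"]["content"])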

MedPsy-4B-GGUF

MedPsy-4B-GGUF provides GGUF weights of MedPsy-4B for fast, fully on-device inference via llama.cpp and the QVAC SDK. An unquantized BF16 GGUF file (about 8.83 GB) is included alongside seven quantization formats, ranging from near-lossless 8-bit (about 4.69 GB) through a high-quality 5-bit option (about 3.16 GB) down to ultra-compact 3-bit (about 1.84 GB), making the same model deployable across everything from workstations to high-end mobile devices.

Developed by: Tether AI Research
Model type: Text-only causal language model (decoder-only transformer), GGUF quantized
Base (BF16) model: MedPsy-4B
Backbone: Qwen3-4B-Thinking-2507
Language: English
License: Apache 2.0
Quantization tool: llama.cpp
Technical report: MedPsy Technical Report
Collection: MedPsy on Hugging Face
All MedPsy variants: MedPsy-4B · MedPsy-1.7B · MedPsy-4B-GGUF · MedPsy-1.7B-GGUF

Available Files

All published files are produced with llama.cpp. The BF16 GGUF file is unquantized: no quantization is applied. We have not separately re-evaluated the BF16 GGUF with llama.cpp; because it preserves the same BF16 tensor precision as the source checkpoint, performance is expected to match the BF16 source model evaluated with vLLM, aside from small backend or runtime differences. Q8_0 does not use imatrix calibration (we verified that imatrix provided no measurable benefit at 8-bit). All sub-8-bit variants use importance-matrix (imatrix) calibration, which consistently reduces quality degradation. See the MedPsy Technical Report (Section 4.7) for the full quantization methodology, including the K-quants vs I-quants comparison and the per-bit-count imatrix ablation.

| File | Format | Imatrix | Size | Δ Size | Δ AVG (pts) | Δ AVG (rel %) | Recommended For |
|---|---|---|---|---|---|---|---|
| medpsy-4b-bf16.gguf | BF16 | n/a | 8.83 GB | 0% | ≈0.00 | ≈0.00% | Unquantized GGUF (same performance expected) |
| medpsy-4b-q8_0.gguf | Q8_0 | no (not needed) | 4.69 GB | -47% | -0.15 | -0.20% | Best quality, near-lossless |
| medpsy-4b-q5_k_m-imat.gguf | Q5_K_M | yes | 3.16 GB | -64% | -0.29 | -0.40% | Recommended high-quality 5-bit option |
| medpsy-4b-q4_k_m-imat.gguf | Q4_K_M | yes | 2.72 GB | -69% | -0.81 | -1.12% | Recommended for mobile/laptop (best size/quality trade-off) |
| medpsy-4b-iq4_nl-imat.gguf | IQ4_NL | yes | 2.60 GB | -71% | -1.02 | -1.41% | Alternative 4-bit (slightly worse than Q4_K_M) |
| medpsy-4b-iq4_xs-imat.gguf | IQ4_XS | yes | 2.48 GB | -72% | -1.08 | -1.49% | Smaller 4-bit alternative |
| medpsy-4b-iq3_m-imat.gguf | IQ3_M | yes | 2.13 GB | -76% | -1.50 | -2.07% | Strong compact option for tight memory budgets |
| medpsy-4b-iq3_xxs-imat.gguf | IQ3_XXS | yes | 1.84 GB | -79% | -5.56 | -7.69% | ⚠ Significant degradation - not recommended |

Two ways to read quality loss. Δ AVG (pts) is the absolute change in AVG Score vs the BF16/vLLM source-model baseline: the raw points lost. Δ AVG (rel %) is the same change expressed relative to the baseline, (variant − baseline) / baseline, so a loss is negative in both columns. They convey complementary information: the absolute delta is the easiest "how much score did I actually lose?" reading, while the relative delta normalizes by baseline so quality degradation is comparable across models with different starting scores. AVG Score = mean of HealthBench Overall and Closed-Ended Average.
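As a worked example, the two deltas for Q4_K_M can be reproduced from the HealthBench and Closed-Ended scores reported in this card. This is a minimal sketch: the function names are illustrative, and the sign convention is variant minus baseline, so losses come out negative as in the tables.

```python
def avg_score(healthbench: float, closed_ended: float) -> float:
    """AVG Score = mean of HealthBench Overall and Closed-Ended Average."""
    return (healthbench + closed_ended) / 2


def deltas(baseline: float, variant: float) -> tuple[float, float]:
    """Return (absolute delta in points, relative delta in percent)."""
    abs_delta = variant - baseline          # negative means quality was lost
    rel_delta = abs_delta / baseline * 100  # same change, relative to baseline
    return round(abs_delta, 2), round(rel_delta, 2)


baseline = avg_score(74, 70.54)  # BF16 source model (vLLM): 72.27
q4_k_m = avg_score(73, 69.92)    # Q4_K_M: 71.46
print(deltas(baseline, q4_k_m))  # (-0.81, -1.12)
```

Recomputing from the rounded scores in the tables can differ from the published deltas by about 0.01, since the published values were presumably derived from unrounded run averages.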

Quick Recommendation

| Your constraint | Choose |
|---|---|
| You want a llama.cpp-native unquantized file | BF16 - no quantization applied; same quality as the source BF16 checkpoint expected, GGUF format (8.83 GB) |
| You want the best possible quality at smaller size | Q8_0 - near-lossless (-0.15 pts on AVG Score) at roughly half the size (47% smaller than BF16) |
| You want extra quality headroom over 4-bit | Q5_K_M (imatrix) - 64% smaller than BF16, only -0.29 pts (-0.40% rel) on AVG Score |
| You want the best size/quality trade-off (most users) | Q4_K_M (imatrix) - 69% smaller, only -0.81 pts (-1.12% rel) on AVG Score |
| You need the smallest recommended 4B file | IQ3_M (imatrix) - 76% smaller, only -1.50 pts (-2.07% rel) (excellent) |
| You want maximum compression and can tolerate quality loss | IQ4_XS or IQ3_M (avoid IQ3_XXS) |
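The recommendation logic above can also be expressed as a small lookup. The sketch below is illustrative only: `pick_variant` and `VARIANTS` are hypothetical names, the numbers are copied from the Available Files table, and the budget covers file size alone. Actual memory use at run time will be higher because of the KV cache and runtime overhead.

```python
from typing import Optional

# (file name, size in GB, Δ AVG in points), copied from the Available Files table.
# IQ3_XXS is deliberately left out because this card advises against it.
VARIANTS = [
    ("medpsy-4b-bf16.gguf", 8.83, 0.00),
    ("medpsy-4b-q8_0.gguf", 4.69, -0.15),
    ("medpsy-4b-q5_k_m-imat.gguf", 3.16, -0.29),
    ("medpsy-4b-q4_k_m-imat.gguf", 2.72, -0.81),
    ("medpsy-4b-iq4_nl-imat.gguf", 2.60, -1.02),
    ("medpsy-4b-iq4_xs-imat.gguf", 2.48, -1.08),
    ("medpsy-4b-iq3_m-imat.gguf", 2.13, -1.50),
]


def pick_variant(max_file_gb: float) -> Optional[str]:
    """Return the file with the smallest quality loss that fits the size budget."""
    fitting = [v for v in VARIANTS if v[1] <= max_file_gb]
    if not fitting:
        return None  # no recommended file is small enough
    # Δ AVG is negative for losses, so the maximum is the least degraded file.
    return max(fitting, key=lambda v: v[2])[0]


print(pick_variant(3.0))  # medpsy-4b-q4_k_m-imat.gguf
```

A budget below 2.13 GB returns `None` rather than falling back to IQ3_XXS, mirroring the advice to avoid that variant.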

Benchmark Results

The comparison uses the BF16 source model evaluated with vLLM as the reference baseline. Quantized GGUF variants were evaluated on the full closed-ended benchmark suite (7 medical benchmarks, averaged) and HealthBench (CompassJudger-2-32B-Instruct as judge), using the same benchmark protocol. The BF16 GGUF file is the unquantized GGUF export and has not been separately re-run with llama.cpp; since no quantization is applied, its performance is expected to match the BF16 source model aside from small backend or runtime differences. AVG Score is the average of HealthBench Overall and Closed-Ended Average.

| Variant | Size (GB) | HealthBench | HB Hard | CE Avg | AVG Score | Δ AVG (pts) | Δ AVG (rel %) | Δ Size |
|---|---|---|---|---|---|---|---|---|
| MedPsy-4B (BF16, vLLM baseline) | 8.83 | 74 | 58 | 70.54 | 72.27 | 0.00 | 0.00% | 0% |
| Q8_0 ★ | 4.69 | 74 | 57 | 70.25 | 72.13 | -0.15 | -0.20% | -47% |
| Q5_K_M ★ | 3.16 | 74 | 58 | 69.96 | 71.98 | -0.29 | -0.40% | -64% |
| Q4_K_M ★ | 2.72 | 73 | 56 | 69.92 | 71.46 | -0.81 | -1.12% | -69% |
| IQ4_NL | 2.60 | 73 | 57 | 69.50 | 71.25 | -1.02 | -1.41% | -71% |
| IQ4_XS | 2.48 | 73 | 57 | 69.39 | 71.20 | -1.08 | -1.49% | -72% |
| IQ3_M ★ | 2.13 | 73 | 58 | 68.55 | 70.78 | -1.50 | -2.07% | -76% |
| IQ3_XXS ⚠ | 1.84 | 69 | 51 | 64.42 | 66.71 | -5.56 | -7.69% | -79% |

HB = HealthBench; CE = Closed-Ended Average (7 medical benchmarks). Δ AVG (pts) is the absolute point change in AVG Score vs the BF16/vLLM source-model baseline (e.g. 71.46 - 72.27 = -0.81). Δ AVG (rel %) is the relative change as a fraction of the baseline (e.g. -0.81 / 72.27 = -1.12%). Δ Size is the relative file-size change vs BF16. ★ Recommended variants (best of class). ⚠ Not recommended due to significant quality degradation. AVG Score = (HealthBench Overall + Closed-Ended Average) / 2. Results averaged over 3 runs with generation parameters: temperature=0.6, top_k=20, top_p=0.95, max_output_tokens=16384. Quantized GGUF variants were evaluated with llama.cpp; the BF16 baseline was evaluated with vLLM. HealthBench evaluated using CompassJudger-2-32B-Instruct.
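The Δ Size column can be checked directly from the file sizes. The sketch below also derives a rough effective bits-per-weight figure. That figure is not reported in this card: it is an approximation that assumes the BF16 file stores every weight at 16 bits and ignores GGUF metadata. The function names are illustrative.

```python
BF16_GB = 8.83  # unquantized file, assumed to be 16 bits per weight throughout

# File sizes in GB, copied from the benchmark table.
SIZES_GB = {
    "Q8_0": 4.69,
    "Q5_K_M": 3.16,
    "Q4_K_M": 2.72,
    "IQ4_NL": 2.60,
    "IQ4_XS": 2.48,
    "IQ3_M": 2.13,
    "IQ3_XXS": 1.84,
}


def size_delta_pct(size_gb: float) -> int:
    """Relative file-size change vs the BF16 file, in whole percent."""
    return round((size_gb / BF16_GB - 1) * 100)


def approx_bits_per_weight(size_gb: float) -> float:
    """Rough effective bits per weight, scaled from the 16-bit file size."""
    return round(16 * size_gb / BF16_GB, 2)


for name, size in SIZES_GB.items():
    print(f"{name:8s} {size_delta_pct(size):4d}%  ~{approx_bits_per_weight(size)} bits/weight")
```

Under these assumptions Q8_0 works out to roughly 8.5 bits per weight, which is why it lands at about half the BF16 size rather than exactly half.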

Key Findings

  • Q8_0 is effectively lossless: -0.15 pts (-0.20% relative) AVG Score (72.13 vs 72.27) at 47% smaller (4.69 GB vs 8.83 GB), with no need for imatrix calibration.
  • Q5_K_M is a recommended high-quality option: -0.29 pts (-0.40% relative) AVG Score at 64% smaller (3.16 GB), while matching BF16 on HealthBench and HealthBench Hard.
  • Q4_K_M is the sweet spot: only -0.81 pts (-1.12% relative) AVG Score loss for a 69% size reduction (2.72 GB), comfortably fitting on high-end mobile and laptop devices.
  • IQ3_M is exceptionally efficient at this scale: -1.50 pts (-2.07% relative) at 76% size reduction (2.13 GB). It even matches the BF16 HealthBench Hard score (58) - a remarkably strong compact result.
  • IQ3_XXS is too aggressive: HealthBench Hard drops from 58 to 51 and AVG Score by -5.56 pts (-7.69% relative). Avoid this variant for medical use cases unless extreme size constraints leave no alternative.
  • Even the worst quantization beats unquantized peers: IQ3_XXS still scores 64.42 closed-ended / 69 HealthBench, ahead of the unquantized Qwen3-4B-Thinking-2507 backbone (63.10 / 63.00; a narrow 1.32-point lead on closed-ended, 6 points on HealthBench) and well above unquantized MedGemma-1.5-4B-it (51.20 / 54.00).

How Quantized Variants Compare to Other Models

Even the most aggressive recommended quantization (IQ3_M at 2.13 GB) retains a substantial accuracy lead over the unquantized open-weight baselines in this size class:

| Model | Size (GB) | Closed-Ended Avg | HealthBench |
|---|---|---|---|
| MedPsy-4B (BF16) | 8.83 | 70.54 | 74 |
| MedPsy-4B Q8_0 ★ | 4.69 | 70.25 | 74 |
| MedPsy-4B Q5_K_M ★ | 3.16 | 69.96 | 74 |
| MedPsy-4B Q4_K_M ★ | 2.72 | 69.92 | 73 |
| MedPsy-4B IQ3_M ★ | 2.13 | 68.55 | 73 |
| Qwen3-4B-Thinking-2507 (BF16, backbone) | 8.83 | 63.10 | 63 |
| MedGemma-1.5-4B-it (BF16) | 8.0 | 51.20 | 54 |

Usage

llama.cpp

# Download the recommended file (Q4_K_M with imatrix - best size/quality for mobile/laptop)
huggingface-cli download qvac/MedPsy-4B-GGUF medpsy-4b-q4_k_m-imat.gguf --local-dir .

# Run interactively
./llama-cli -m medpsy-4b-q4_k_m-imat.gguf \
    -p "What are the common symptoms and first-line treatments for community-acquired pneumonia?" \
    --temp 0.6 --top-k 20 --top-p 0.95 -n 1024
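For Python callers, roughly the same run can be sketched with llama-cpp-python, reusing the sampling settings from the command above. This is a minimal sketch rather than an official recipe: `ask`, `load_model`, and `SAMPLING` are illustrative names, and `n_ctx=4096` is an assumed context size, not a value this card prescribes for llama.cpp.

```python
# Sampling settings matching the llama-cli flags above (--temp, --top-k, --top-p).
SAMPLING = {"temperature": 0.6, "top_k": 20, "top_p": 0.95}


def load_model(filename: str = "medpsy-4b-q4_k_m-imat.gguf"):
    """Download (if needed) and load one GGUF file from the Hub repo."""
    # Imported lazily; requires `pip install llama-cpp-python huggingface-hub`.
    from llama_cpp import Llama

    return Llama.from_pretrained(
        repo_id="qvac/MedPsy-4B-GGUF",
        filename=filename,
        n_ctx=4096,  # assumed context size; adjust to your memory budget
    )


def ask(llm, question: str, max_tokens: int = 1024) -> str:
    """Send one user message and return the assistant's reply text."""
    response = llm.create_chat_completion(
        messages=[{"role": "user", "content": question}],
        max_tokens=max_tokens,  # same cap as `-n 1024` above
        **SAMPLING,
    )
    return response["choices"][0]["message"]["content"]
```

Typical use would be `print(ask(load_model(), "What are the common symptoms and first-line treatments for community-acquired pneumonia?"))`. Keeping the model load separate from `ask` lets one loaded model serve many questions.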

QVAC SDK

These GGUF files are designed for deployment through the QVAC SDK, enabling fully private on-device inference on smartphones, tablets, and edge devices. See the QVAC documentation for integration guides.


# 1. Project setup
mkdir medpsy && cd medpsy
npm init -y && npm pkg set type=module

# 2. Install SDK + matching Bare runtime binary
#    (swap linux-x64 for: linux-arm64 | darwin-arm64 | darwin-x64 | win32-x64 | win32-arm64)
npm i @qvac/sdk bare-runtime-linux-x64

# 3. Authenticate with Hugging Face (one-time) and download the recommended quant
hf auth login
hf download qvac/MedPsy-4B-GGUF medpsy-4b-q4_k_m-imat.gguf --local-dir ./models

# 4. Run inference (streamed, GPU)
node --input-type=module -e '
import { loadModel, completion, unloadModel, VERBOSITY } from "@qvac/sdk";
import { resolve } from "node:path";
const id = await loadModel({
  modelSrc: resolve("./models/medpsy-4b-q4_k_m-imat.gguf"),
  modelType: "llamacpp-completion",
  modelConfig: { device: "gpu", ctx_size: 4096, verbosity: VERBOSITY.ERROR },
});
const r = completion({
  modelId: id,
  history: [{ role: "user", content: "First-line treatment for community-acquired pneumonia in 2 sentences." }],
  stream: true,
  generationParams: { temp: 0.6, top_p: 0.95, top_k: 20, predict: 2048 },
});
for await (const t of r.tokenStream) process.stdout.write(t);
const s = await r.stats;
console.log(`\n[${s.tokensPerSecond.toFixed(2)} tok/s, TTFT ${s.timeToFirstToken.toFixed(0)}ms, ${s.backendDevice}]`);
await unloadModel({ modelId: id });
'

Use and Limitations

Intended Use

MedPsy-4B-GGUF is intended as a starting point for developers and researchers building downstream healthcare applications involving medical text on-device. Developers are expected to validate, adapt, and make meaningful modifications to the model for their specific use cases.

Appropriate use cases include:

  • On-device medical information retrieval for privacy-sensitive environments
  • Building developer tools and prototypes for health-related applications running on edge devices
  • Research on medical language understanding and reasoning under quantization

All of these uses should carry appropriate disclaimers.

Limitations

This model is NOT a substitute for professional medical judgment and the model outputs are NOT a substitute for proper clinical diagnosis. Always consult with a certified physician. Despite strong benchmark performance, MedPsy-4B is a compact 4B-parameter language model that will make errors. Quantization may further amplify rare-case failure modes that are not captured by aggregate benchmark numbers. Medical AI systems can produce outputs that appear confident and authoritative while being factually incorrect, incomplete, or clinically inappropriate.

Known limitations include:

  • Hallucinations: The model may generate plausible-sounding but incorrect medical information.
  • Quantization artifacts: Quantized models can occasionally produce subtly degraded outputs (rare-token drops, less stable formatting on long generations) that aggregate benchmarks may not capture. Effects grow at lower bit counts; we strongly recommend Q4_K_M or higher for production deployment and advise against IQ3_XXS for any medical use case.
  • English only: The model was trained and evaluated primarily in English. Performance in other languages is not validated.
  • Text only: This model processes text inputs only. It cannot interpret medical images, lab results in non-text formats, or other modalities.
  • No real-time knowledge: The model's knowledge has a training data cutoff and does not reflect the latest medical guidelines, drug approvals, or clinical evidence.
  • Bias in training data: As with any model trained on synthetic and public medical data, biases in the source material may propagate to model outputs. Developers should validate performance across diverse patient populations, demographics, and clinical contexts.
  • Not designed for emergencies: This model should never be used as the sole decision-making tool in emergency or life-threatening situations.

Safety Recommendations

When integrating this model into any application:

  1. Always include visible disclaimers informing users that outputs are AI-generated and not a substitute for professional medical advice
  2. Do not use for direct clinical diagnosis or treatment without oversight by qualified healthcare professionals
  3. Monitor for harmful outputs and implement appropriate safety filters in production systems
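Recommendations 1 and 3 can be sketched as a thin wrapper around whatever function produces the model's reply. Everything here is hypothetical and for illustration only: the disclaimer wording, the keyword list, and the function names are assumptions, and a keyword match is far too crude to serve as a real safety filter without clinical review.

```python
DISCLAIMER = (
    "This response is AI-generated and is not a substitute for "
    "professional medical advice."
)

# Hypothetical, deliberately tiny keyword list; a production filter needs
# clinical review and much broader coverage than substring matching.
EMERGENCY_TERMS = ("chest pain", "overdose", "suicid", "can't breathe")


def is_possible_emergency(user_text: str) -> bool:
    """Crude check for messages that should not be answered by the model alone."""
    text = user_text.lower()
    return any(term in text for term in EMERGENCY_TERMS)


def wrap_output(user_text: str, model_text: str) -> str:
    """Attach a visible disclaimer, or replace the reply for possible emergencies."""
    if is_possible_emergency(user_text):
        return (
            "This may be an emergency. Please contact local emergency "
            "services or a qualified clinician now.\n\n" + DISCLAIMER
        )
    return f"{model_text}\n\n{DISCLAIMER}"


print(wrap_output("What is a fever?", "A fever is a raised body temperature."))
```

Putting the disclaimer in the wrapper rather than in the prompt means it appears even when the model ignores instructions; oversight by qualified professionals (recommendation 2) still has to happen outside the code.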

Citation

@article{medpsy2026,
  title={MedPsy: State-of-the-Art Medical and Healthcare Language Models for Edge Devices},
  author={Vitabile, Davide and Buffa, Alexandro and Nambiar, Akshay and Nazir, Amril},
  year={2026},
  url={https://huggingface.co/blog/qvac/medpsy},
  institution={Tether AI Research}
}

Copyright

We will take appropriate actions in response to notices of copyright infringement. If you believe your work has been used or copied in a manner that infringes upon your intellectual property rights, please email data-apps@tether.io identifying and describing both the copyrighted work and alleged infringing content.

Licensing

This model, which was trained as described in the MedPsy Technical Report, is licensed by Tether Data, S.A. de C.V. under the Apache 2.0 license for research and educational purposes. As described above, this model is a version of Qwen3-4B-Thinking-2507, which is also under the Apache 2.0 license.

As described above, a subset of the Genesis I and Genesis II datasets was used by the Baichuan-M3-235B model (which itself is also available under the Apache 2.0 license) to generate synthetic data for training this model. The Genesis I dataset is made available under the CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0) license. The Genesis II dataset is also made available under the CC-BY-NC 4.0 license.
