MedPsy-4B-GGUF
MedPsy-4B-GGUF provides GGUF weights of MedPsy-4B for fast, fully on-device inference via llama.cpp and the QVAC SDK. An unquantized BF16 GGUF file (about 8.83 GB) is included alongside seven quantization formats, ranging from near-lossless 8-bit (about 4.69 GB) through a high-quality 5-bit option (about 3.16 GB) down to ultra-compact 3-bit (about 1.84 GB), making the same model deployable across everything from workstations to high-end mobile devices.
| Developed by | Tether AI Research |
| Model type | Text-only causal language model (decoder-only transformer), GGUF quantized |
| Base (BF16) model | MedPsy-4B |
| Backbone | Qwen3-4B-Thinking-2507 |
| Language | English |
| License | Apache 2.0 |
| Quantization tool | llama.cpp |
| Technical report | MedPsy Technical Report |
| Collection | MedPsy on Hugging Face |
| All MedPsy variants | MedPsy-4B · MedPsy-1.7B · MedPsy-4B-GGUF · MedPsy-1.7B-GGUF |
Available Files
All published files are produced with llama.cpp. The BF16 GGUF file is unquantized: no quantization is applied. We have not separately re-evaluated the BF16 GGUF with llama.cpp; because it preserves the same BF16 tensor precision as the source checkpoint, performance is expected to match the BF16 source model evaluated with vLLM, aside from small backend or runtime differences. Q8_0 does not use imatrix calibration (we verified that imatrix provided no measurable benefit at 8-bit). All sub-8-bit variants use importance-matrix (imatrix) calibration, which consistently reduces quality degradation. See the MedPsy Technical Report (Section 4.7) for the full quantization methodology, including the K-quants vs I-quants comparison and the per-bit-count imatrix ablation.
| File | Format | Imatrix | Size | Δ Size | Δ AVG (pts) | Δ AVG (rel %) | Recommended For |
|---|---|---|---|---|---|---|---|
| medpsy-4b-bf16.gguf | BF16 | n/a | 8.83 GB | 0% | ≈0.00 | ≈0.00% | Unquantized GGUF (same performance expected) |
| medpsy-4b-q8_0.gguf | Q8_0 | no (not needed) | 4.69 GB | -47% | -0.15 | -0.20% | Best quality, near-lossless |
| medpsy-4b-q5_k_m-imat.gguf | Q5_K_M | yes | 3.16 GB | -64% | -0.29 | -0.40% | Recommended high-quality 5-bit option |
| medpsy-4b-q4_k_m-imat.gguf | Q4_K_M | yes | 2.72 GB | -69% | -0.81 | -1.12% | Recommended for mobile/laptop (best size/quality trade-off) |
| medpsy-4b-iq4_nl-imat.gguf | IQ4_NL | yes | 2.60 GB | -71% | -1.02 | -1.41% | Alternative 4-bit (slightly worse than Q4_K_M) |
| medpsy-4b-iq4_xs-imat.gguf | IQ4_XS | yes | 2.48 GB | -72% | -1.08 | -1.49% | Smaller 4-bit alternative |
| medpsy-4b-iq3_m-imat.gguf | IQ3_M | yes | 2.13 GB | -76% | -1.50 | -2.07% | Strong compact option for tight memory budgets |
| medpsy-4b-iq3_xxs-imat.gguf | IQ3_XXS | yes | 1.84 GB | -79% | -5.56 | -7.69% | ⚠ Significant degradation - not recommended |
Two ways to read quality loss. Δ AVG (pts) is the absolute change in AVG Score vs the BF16/vLLM source-model baseline - the raw points lost. Δ AVG (rel %) is the relative change as a fraction of the baseline, (variant − baseline) / baseline, so a loss is reported as a negative value. They convey complementary information: the absolute delta is the easiest "how much score did I actually lose?" reading, while the relative delta normalizes by baseline so quality degradation is comparable across models with different starting scores. AVG Score = mean of HealthBench Overall and Closed-Ended Average.
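As a worked example of the two readings, the sketch below recomputes AVG Score and both deltas for Q4_K_M from the HealthBench and Closed-Ended values reported in this card. The function names are illustrative, and the sign convention (variant minus baseline, so losses come out negative) is assumed from the values in the table.

```python
from typing import Tuple


def avg_score(healthbench: float, closed_ended: float) -> float:
    """AVG Score = mean of HealthBench Overall and Closed-Ended Average."""
    return (healthbench + closed_ended) / 2


def deltas(baseline: float, variant: float) -> Tuple[float, float]:
    """Return (absolute delta in points, relative delta in percent)."""
    abs_pts = variant - baseline          # negative when the variant is worse
    rel_pct = 100 * abs_pts / baseline    # normalized by the baseline score
    return abs_pts, rel_pct


baseline = avg_score(74, 70.54)   # BF16 source model, evaluated with vLLM
q4_k_m = avg_score(73, 69.92)     # Q4_K_M, evaluated with llama.cpp
abs_pts, rel_pct = deltas(baseline, q4_k_m)
print(f"{abs_pts:+.2f} pts, {rel_pct:+.2f}%")  # prints: -0.81 pts, -1.12%
```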
Quick Recommendation
| Your constraint | Choose |
|---|---|
| You want a llama.cpp-native unquantized file | BF16 - no quantization applied; same quality as the source BF16 checkpoint expected, GGUF format (8.83 GB) |
| You want the best possible quality at smaller size | Q8_0 - statistically indistinguishable from BF16, about half the size (47% smaller) |
| You want extra quality headroom over 4-bit | Q5_K_M (imatrix) - 64% smaller than BF16, only -0.29 pts (-0.40% rel) on AVG Score |
| You want the best size/quality trade-off (most users) | Q4_K_M (imatrix) - 69% smaller, only -0.81 pts (-1.12% rel) on AVG Score |
| You need the smallest recommended 4B file | IQ3_M (imatrix) - 76% smaller, only -1.50 pts (-2.07% rel) (excellent) |
| You want maximum compression and can tolerate quality loss | IQ4_XS or IQ3_M (avoid IQ3_XXS) |
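The recommendations above can be read as a lookup by size budget. The sketch below picks the largest listed file that fits a given budget, using the file sizes from the Available Files table. The `pick_variant` helper is hypothetical, and budgeting by file size alone is a simplifying assumption: actual memory use also depends on context length and runtime overhead.

```python
from typing import Optional

# (format, file size in GB) taken from the Available Files table, largest first.
# IQ3_XXS is deliberately left out because the card advises against it.
CANDIDATES = [
    ("BF16", 8.83),
    ("Q8_0", 4.69),
    ("Q5_K_M", 3.16),
    ("Q4_K_M", 2.72),
    ("IQ4_XS", 2.48),
    ("IQ3_M", 2.13),
]


def pick_variant(budget_gb: float) -> Optional[str]:
    """Return the largest listed format whose file fits within budget_gb."""
    for name, size_gb in CANDIDATES:
        if size_gb <= budget_gb:
            return name
    return None  # nothing on the recommended list fits


print(pick_variant(3.0))  # Q4_K_M
print(pick_variant(2.0))  # None
```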
Benchmark Results
The comparison uses the BF16 source model evaluated with vLLM as the reference baseline. Quantized GGUF variants were evaluated on the full closed-ended benchmark suite (7 medical benchmarks, averaged) and HealthBench (CompassJudger-2-32B-Instruct as judge), using the same benchmark protocol. The BF16 GGUF file is the unquantized GGUF export and has not been separately re-run with llama.cpp; since no quantization is applied, its performance is expected to match the BF16 source model aside from small backend or runtime differences. AVG Score is the average of HealthBench Overall and Closed-Ended Average.
| Variant | Size (GB) | HealthBench | HB Hard | CE Avg | AVG Score | Δ AVG (pts) | Δ AVG (rel %) | Δ Size |
|---|---|---|---|---|---|---|---|---|
| MedPsy-4B (BF16, vLLM baseline) | 8.83 | 74 | 58 | 70.54 | 72.27 | 0.00 | 0.00% | 0% |
| Q8_0 ★ | 4.69 | 74 | 57 | 70.25 | 72.13 | -0.15 | -0.20% | -47% |
| Q5_K_M ★ | 3.16 | 74 | 58 | 69.96 | 71.98 | -0.29 | -0.40% | -64% |
| Q4_K_M ★ | 2.72 | 73 | 56 | 69.92 | 71.46 | -0.81 | -1.12% | -69% |
| IQ4_NL | 2.60 | 73 | 57 | 69.50 | 71.25 | -1.02 | -1.41% | -71% |
| IQ4_XS | 2.48 | 73 | 57 | 69.39 | 71.20 | -1.08 | -1.49% | -72% |
| IQ3_M ★ | 2.13 | 73 | 58 | 68.55 | 70.78 | -1.50 | -2.07% | -76% |
| IQ3_XXS ⚠ | 1.84 | 69 | 51 | 64.42 | 66.71 | -5.56 | -7.69% | -79% |
HB = HealthBench; CE = Closed-Ended Average (7 medical benchmarks). Δ AVG (pts) is the absolute point change in AVG Score vs the BF16/vLLM source-model baseline (e.g. 71.46 - 72.27 = -0.81). Δ AVG (rel %) is the relative change as a fraction of the baseline (e.g. -0.81 / 72.27 = -1.12%). Δ Size is the relative file-size change vs BF16. ★ Recommended variants (best of class). ⚠ Not recommended due to significant quality degradation. AVG Score = (HealthBench Overall + Closed-Ended Average) / 2. Results averaged over 3 runs with generation parameters: temperature=0.6, top_k=20, top_p=0.95, max_output_tokens=16384. Quantized GGUF variants were evaluated with llama.cpp; the BF16 baseline was evaluated with vLLM. HealthBench evaluated using CompassJudger-2-32B-Instruct.
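The Δ Size column follows directly from the file sizes. A minimal sketch that recomputes it is shown below; rounding to the nearest whole percent is an assumption, since the card does not state its rounding rule, and `size_delta_pct` is an illustrative name.

```python
BF16_SIZE_GB = 8.83

# File sizes in GB, as listed in the results table.
SIZES_GB = {
    "Q8_0": 4.69,
    "Q5_K_M": 3.16,
    "Q4_K_M": 2.72,
    "IQ4_NL": 2.60,
    "IQ4_XS": 2.48,
    "IQ3_M": 2.13,
    "IQ3_XXS": 1.84,
}


def size_delta_pct(size_gb: float, baseline_gb: float = BF16_SIZE_GB) -> int:
    """Relative file-size change vs the BF16 file, in whole percent."""
    return round(100 * (size_gb - baseline_gb) / baseline_gb)


for name, size_gb in SIZES_GB.items():
    print(f"{name}: {size_delta_pct(size_gb)}%")
```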
Key Findings
- Q8_0 is effectively lossless: -0.15 pts (-0.20% relative) AVG Score (72.13 vs 72.27) at 47% smaller (4.69 GB vs 8.83 GB), with no need for imatrix calibration.
- Q5_K_M is a recommended high-quality option: -0.29 pts (-0.40% relative) AVG Score at 64% smaller (3.16 GB), while matching BF16 on HealthBench and HealthBench Hard.
- Q4_K_M is the sweet spot: only -0.81 pts (-1.12% relative) AVG Score loss for a 69% size reduction (2.72 GB), comfortably fitting on high-end mobile and laptop devices.
- IQ3_M is exceptionally efficient at this scale: -1.50 pts (-2.07% relative) at 76% size reduction (2.13 GB). It even matches the BF16 HealthBench Hard score (58) - a remarkably strong compact result.
- IQ3_XXS is too aggressive: HealthBench Hard drops from 58 to 51 and AVG Score by -5.56 pts (-7.69% relative). Avoid this variant for medical use cases unless extreme size constraints leave no alternative.
- Even the worst quantization beats unquantized peers: IQ3_XXS still scores 64.42 closed-ended / 69 HealthBench, well above the unquantized Qwen3-4B-Thinking-2507 backbone (63.10 / 63.00) and unquantized MedGemma-1.5-4B-it (51.20 / 54.00).
How Quantized Variants Compare to Other Models
Even the most aggressive recommended quantization (IQ3_M at 2.13 GB) retains a substantial accuracy lead over the unquantized open-weight baselines in this size class:
| Model | Size (GB) | Closed-Ended Avg | HealthBench |
|---|---|---|---|
| MedPsy-4B (BF16) | 8.83 | 70.54 | 74 |
| MedPsy-4B Q8_0 ★ | 4.69 | 70.25 | 74 |
| MedPsy-4B Q5_K_M ★ | 3.16 | 69.96 | 74 |
| MedPsy-4B Q4_K_M ★ | 2.72 | 69.92 | 73 |
| MedPsy-4B IQ3_M ★ | 2.13 | 68.55 | 73 |
| Qwen3-4B-Thinking-2507 (BF16, backbone) | 8.83 | 63.10 | 63 |
| MedGemma-1.5-4B-it (BF16) | 8.0 | 51.20 | 54 |
Usage
llama.cpp
# Download the recommended file (Q4_K_M with imatrix - best size/quality for mobile/laptop)
huggingface-cli download qvac/MedPsy-4B-GGUF medpsy-4b-q4_k_m-imat.gguf --local-dir .
# Run interactively
./llama-cli -m medpsy-4b-q4_k_m-imat.gguf \
-p "What are the common symptoms and first-line treatments for community-acquired pneumonia?" \
--temp 0.6 --top-k 20 --top-p 0.95 -n 1024
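For Python callers, a roughly equivalent call through the llama-cpp-python bindings might look like the sketch below. It assumes the `llama-cpp-python` and `huggingface_hub` packages are installed. The helper names `sampling_kwargs` and `ask` are illustrative, the `n_ctx` value is an assumption, and the sampling values mirror the CLI flags above.

```python
def sampling_kwargs(max_tokens: int = 1024) -> dict:
    """Sampling settings mirroring the llama-cli flags shown above."""
    return {
        "temperature": 0.6,
        "top_k": 20,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def ask(question: str) -> str:
    """Send one user message to the Q4_K_M file and return the reply text."""
    # Imported lazily so sampling_kwargs() works without the package installed.
    from llama_cpp import Llama

    # Fetches the GGUF file from the Hugging Face Hub on first use.
    llm = Llama.from_pretrained(
        repo_id="qvac/MedPsy-4B-GGUF",
        filename="medpsy-4b-q4_k_m-imat.gguf",
        n_ctx=4096,  # assumed context size; adjust to your memory budget
    )
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": question}],
        **sampling_kwargs(),
    )
    return result["choices"][0]["message"]["content"]


# Example call (downloads the 2.72 GB file on first run):
# print(ask("What are the common symptoms of community-acquired pneumonia?"))
```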
QVAC SDK
These GGUF files are designed for deployment through the QVAC SDK, enabling fully private on-device inference on smartphones, tablets, and edge devices. See the QVAC documentation for integration guides.
# 1. Project setup
mkdir medpsy && cd medpsy
npm init -y && npm pkg set type=module
# 2. Install SDK + matching Bare runtime binary
# (swap linux-x64 for: linux-arm64 | darwin-arm64 | darwin-x64 | win32-x64 | win32-arm64)
npm i @qvac/sdk bare-runtime-linux-x64
# 3. Authenticate with Hugging Face (one-time) and download the recommended quant
hf auth login
hf download qvac/MedPsy-4B-GGUF medpsy-4b-q4_k_m-imat.gguf --local-dir ./models
# 4. Run inference (streamed, GPU)
node --input-type=module -e '
import { loadModel, completion, unloadModel, VERBOSITY } from "@qvac/sdk";
import { resolve } from "node:path";
const id = await loadModel({
modelSrc: resolve("./models/medpsy-4b-q4_k_m-imat.gguf"),
modelType: "llamacpp-completion",
modelConfig: { device: "gpu", ctx_size: 4096, verbosity: VERBOSITY.ERROR },
});
const r = completion({
modelId: id,
history: [{ role: "user", content: "First-line treatment for community-acquired pneumonia in 2 sentences." }],
stream: true,
generationParams: { temp: 0.6, top_p: 0.95, top_k: 20, predict: 2048 },
});
for await (const t of r.tokenStream) process.stdout.write(t);
const s = await r.stats;
console.log(`\n[${s.tokensPerSecond.toFixed(2)} tok/s, TTFT ${s.timeToFirstToken.toFixed(0)}ms, ${s.backendDevice}]`);
await unloadModel({ modelId: id });
'
Use and Limitations
Intended Use
MedPsy-4B-GGUF is intended as a starting point for developers and researchers building downstream healthcare applications involving medical text on-device. Developers are expected to validate, adapt, and make meaningful modifications to the model for their specific use cases.
Appropriate use cases include:
- On-device medical information retrieval for privacy-sensitive environments
- Building developer tools and prototypes for health-related applications running on edge devices
- Research on medical language understanding and reasoning under quantization
All of these uses should be accompanied by appropriate disclaimers.
Limitations
This model is NOT a substitute for professional medical judgment and the model outputs are NOT a substitute for proper clinical diagnosis. Always consult with a certified physician. Despite strong benchmark performance, MedPsy-4B is a compact 4B-parameter language model that will make errors. Quantization may further amplify rare-case failure modes that are not captured by aggregate benchmark numbers. Medical AI systems can produce outputs that appear confident and authoritative while being factually incorrect, incomplete, or clinically inappropriate.
Known limitations include:
- Hallucinations: The model may generate plausible-sounding but incorrect medical information.
- Quantization artifacts: Quantized models can occasionally produce subtly degraded outputs (rare-token drops, less stable formatting on long generations) that aggregate benchmarks may not capture. Effects grow at lower bit counts; we strongly recommend Q4_K_M or higher for production deployment, and we recommend avoiding IQ3_XXS for any medical use case.
- English only: The model was trained and evaluated primarily in English. Performance in other languages is not validated.
- Text only: This model processes text inputs only. It cannot interpret medical images, lab results in non-text formats, or other modalities.
- No real-time knowledge: The model's knowledge has a training data cutoff and does not reflect the latest medical guidelines, drug approvals, or clinical evidence.
- Bias in training data: As with any model trained on synthetic and public medical data, biases in the source material may propagate to model outputs. Developers should validate performance across diverse patient populations, demographics, and clinical contexts.
- Not designed for emergencies: This model should never be used as the sole decision-making tool in emergency or life-threatening situations.
Safety Recommendations
When integrating this model into any application:
- Always include visible disclaimers informing users that outputs are AI-generated and not a substitute for professional medical advice
- Do not use for direct clinical diagnosis or treatment without oversight by qualified healthcare professionals
- Monitor for harmful outputs and implement appropriate safety filters in production systems
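One lightweight way to apply the first recommendation is to attach the disclaimer in code, so that no response reaches the user without it. The sketch below is illustrative only: the disclaimer wording and the `with_disclaimer` helper are assumptions, not part of MedPsy or the QVAC SDK, and it does not replace safety filtering or oversight by qualified professionals.

```python
# Hypothetical wording; adapt it to your application's requirements.
DISCLAIMER = (
    "This response is AI-generated and is not a substitute for "
    "professional medical advice."
)


def with_disclaimer(model_output: str) -> str:
    """Append the disclaimer to a model response, without duplicating it."""
    text = model_output.strip()
    if text.endswith(DISCLAIMER):
        return text
    return f"{text}\n\n{DISCLAIMER}"
```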
Related Resources
- MedPsy Collection: All MedPsy models, datasets, and resources in one place
- MedPsy-4B (BF16): Full-precision source model
- MedPsy-1.7B-GGUF: Smaller GGUF sibling for smartphone-class deployment
- MedPsy Technical Report: Full quantization methodology and results (Section 4.7)
- QVAC SDK: On-device AI deployment framework
- llama.cpp: Inference engine and quantization toolchain
Citation
@article{medpsy2026,
title={MedPsy: State-of-the-Art Medical and Healthcare Language Models for Edge Devices},
author={Vitabile, Davide and Buffa, Alexandro and Nambiar, Akshay and Nazir, Amril},
year={2026},
url={https://huggingface.co/blog/qvac/medpsy},
institution={Tether AI Research}
}
Copyright
We will take appropriate actions in response to notices of copyright infringement. If you believe your work has been used or copied in a manner that infringes upon your intellectual property rights, please email data-apps@tether.io identifying and describing both the copyrighted work and alleged infringing content.
Licensing
This model, which was trained as described in the MedPsy Technical Report, is licensed by Tether Data, S.A. de C.V. under the Apache 2.0 license for research and educational purposes. As described above, this model is a version of Qwen3-4B-Thinking-2507, which is also under the Apache 2.0 license.
As described above, a subset of the Genesis I and Genesis II datasets was used by the Baichuan-M3-235B model (which is itself available under the Apache 2.0 license) to generate synthetic data for training this model. The Genesis I dataset is made available under the CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0) license. The Genesis II dataset is also made available under the CC-BY-NC 4.0 license.