Instructions for using qvac/MedPsy-4B-GGUF with libraries and local apps.

Libraries

How to use qvac/MedPsy-4B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="qvac/MedPsy-4B-GGUF",
	filename="medpsy-4b-bf16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
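The call above returns an OpenAI-style completion dictionary rather than plain text. The following is a minimal sketch of extracting the reply while passing the sampling settings this card reports for its benchmarks (temperature 0.6, top_k 20, top_p 0.95). The helper name `ask` and the `max_tokens` cap are illustrative assumptions, not part of llama-cpp-python or of the authors' setup.

```python
def ask(llm, question, max_tokens=1024):
    """Send one user message and return only the assistant's text.

    `llm` is expected to be a llama_cpp.Llama instance, or any object with a
    compatible create_chat_completion method.
    """
    response = llm.create_chat_completion(
        messages=[{"role": "user", "content": question}],
        # Sampling values taken from the benchmark settings reported in this card.
        temperature=0.6,
        top_k=20,
        top_p=0.95,
        max_tokens=max_tokens,  # illustrative cap, not a recommended value
    )
    # The response follows the OpenAI chat-completion layout.
    return response["choices"][0]["message"]["content"]


# Usage with the `llm` object created above:
# print(ask(llm, "What is the capital of France?"))
```

Keeping the extraction in one helper means the rest of an application only ever handles plain strings.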

Local Apps

llama.cpp

How to use qvac/MedPsy-4B-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf qvac/MedPsy-4B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf qvac/MedPsy-4B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf qvac/MedPsy-4B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf qvac/MedPsy-4B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf qvac/MedPsy-4B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf qvac/MedPsy-4B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf qvac/MedPsy-4B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf qvac/MedPsy-4B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/qvac/MedPsy-4B-GGUF:Q4_K_M

vLLM

How to use qvac/MedPsy-4B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "qvac/MedPsy-4B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "qvac/MedPsy-4B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
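The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl payload above. The function names are illustrative, and the base URL assumes the server is listening on `localhost:8000` as in the curl example.

```python
import json
import urllib.request


def build_chat_request(base_url, model, question):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, question, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, question)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, assuming the server started above is running:
# print(chat("http://localhost:8000/v1", "qvac/MedPsy-4B-GGUF",
#            "What is the capital of France?"))
```

Splitting request construction from sending makes the payload easy to inspect without a running server.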

Use Docker

docker model run hf.co/qvac/MedPsy-4B-GGUF:Q4_K_M

Ollama
How to use qvac/MedPsy-4B-GGUF with Ollama:
```
ollama run hf.co/qvac/MedPsy-4B-GGUF:Q4_K_M
```

Unsloth Studio

How to use qvac/MedPsy-4B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for qvac/MedPsy-4B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for qvac/MedPsy-4B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for qvac/MedPsy-4B-GGUF to start chatting

Pi

How to use qvac/MedPsy-4B-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf qvac/MedPsy-4B-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "qvac/MedPsy-4B-GGUF:Q4_K_M"
        }
      ]
    }
  }
}
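If `models.json` already lists other providers, pasting the snippet above by hand risks overwriting them. The following is a minimal sketch of merging the entry instead. It assumes only the JSON layout shown above; the function name and the merge behaviour are illustrative and not part of Pi.

```python
import json
from pathlib import Path


def add_llama_cpp_provider(config_path, model_id,
                           base_url="http://localhost:8080/v1"):
    """Merge the llama-cpp provider entry into a Pi models.json file."""
    path = Path(config_path)
    # Start from the existing file if there is one, so other providers survive.
    config = json.loads(path.read_text()) if path.exists() else {}
    provider = config.setdefault("providers", {}).setdefault("llama-cpp", {})
    provider.update({
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
    })
    models = provider.setdefault("models", [])
    # Avoid duplicate entries when the function is run more than once.
    if not any(m.get("id") == model_id for m in models):
        models.append({"id": model_id})
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Usage:
# add_llama_cpp_provider(Path.home() / ".pi/agent/models.json",
#                        "qvac/MedPsy-4B-GGUF:Q4_K_M")
```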

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use qvac/MedPsy-4B-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf qvac/MedPsy-4B-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default qvac/MedPsy-4B-GGUF:Q4_K_M

Run Hermes

hermes

OpenClaw

How to use qvac/MedPsy-4B-GGUF with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf qvac/MedPsy-4B-GGUF:Q4_K_M

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "qvac/MedPsy-4B-GGUF:Q4_K_M" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use qvac/MedPsy-4B-GGUF with Docker Model Runner:
```
docker model run hf.co/qvac/MedPsy-4B-GGUF:Q4_K_M
```

Lemonade

How to use qvac/MedPsy-4B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull qvac/MedPsy-4B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.MedPsy-4B-GGUF-Q4_K_M

List all available models

lemonade list

MedPsy-4B-GGUF

MedPsy-4B-GGUF provides GGUF weights of MedPsy-4B for fast, fully on-device inference via llama.cpp and the QVAC SDK. An unquantized BF16 GGUF file (about 8.83 GB) is included alongside seven quantization formats, ranging from near-lossless 8-bit (about 4.69 GB) through a high-quality 5-bit option (about 3.16 GB) down to ultra-compact 3-bit (about 1.84 GB), making the same model deployable across everything from workstations to high-end mobile devices.


Developed by	Tether AI Research
Model type	Text-only causal language model (decoder-only transformer), GGUF quantized
Base (BF16) model	MedPsy-4B
Backbone	Qwen3-4B-Thinking-2507
Language	English
License	Apache 2.0
Quantization tool	llama.cpp
Technical report	MedPsy Technical Report
Collection	MedPsy on Hugging Face
All MedPsy variants	MedPsy-4B · MedPsy-1.7B · MedPsy-4B-GGUF · MedPsy-1.7B-GGUF

Available Files

All published files are produced with llama.cpp. The BF16 GGUF file is unquantized. We have not separately re-evaluated it with llama.cpp; because it preserves the same BF16 tensor precision as the source checkpoint, its performance is expected to match the BF16 source model evaluated with vLLM, aside from small backend or runtime differences. Q8_0 does not use imatrix calibration (we verified that imatrix provided no measurable benefit at 8-bit). All sub-8-bit variants use importance-matrix (imatrix) calibration, which consistently reduces quality degradation. See the MedPsy Technical Report (Section 4.7) for the full quantization methodology, including the K-quants vs I-quants comparison and the per-bit-count imatrix ablation.

| File | Format | Imatrix | Size | Δ Size | Δ AVG (pts) | Δ AVG (rel %) | Recommended For |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `medpsy-4b-bf16.gguf` | BF16 | n/a | 8.83 GB | 0% | ≈0.00 | ≈0.00% | Unquantized GGUF (same performance expected) |
| `medpsy-4b-q8_0.gguf` | Q8_0 | no (not needed) | 4.69 GB | -47% | -0.15 | -0.20% | Best quality, near-lossless |
| `medpsy-4b-q5_k_m-imat.gguf` | Q5_K_M | yes | 3.16 GB | -64% | -0.29 | -0.40% | Recommended high-quality 5-bit option |
| `medpsy-4b-q4_k_m-imat.gguf` | Q4_K_M | yes | 2.72 GB | -69% | -0.81 | -1.12% | Recommended for mobile/laptop (best size/quality trade-off) |
| `medpsy-4b-iq4_nl-imat.gguf` | IQ4_NL | yes | 2.60 GB | -71% | -1.02 | -1.41% | Alternative 4-bit (slightly worse than Q4_K_M) |
| `medpsy-4b-iq4_xs-imat.gguf` | IQ4_XS | yes | 2.48 GB | -72% | -1.08 | -1.49% | Smaller 4-bit alternative |
| `medpsy-4b-iq3_m-imat.gguf` | IQ3_M | yes | 2.13 GB | -76% | -1.50 | -2.07% | Strong compact option for tight memory budgets |
| `medpsy-4b-iq3_xxs-imat.gguf` | IQ3_XXS | yes | 1.84 GB | -79% | -5.56 | -7.69% | ⚠ Significant degradation - not recommended |

Two ways to read quality loss. Δ AVG (pts) is the absolute change in AVG Score relative to the BF16/vLLM source-model baseline (variant − baseline), i.e. the raw points lost. Δ AVG (rel %) is the same change expressed as a percentage of the baseline ((variant − baseline) / baseline). In both columns a negative value means the variant scores below the baseline. They convey complementary information: the absolute delta is the easiest "how much score did I actually lose?" reading, while the relative delta normalizes by baseline so quality degradation is comparable across models with different starting scores. AVG Score = mean of HealthBench Overall and Closed-Ended Average.
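As a worked example, the sketch below recomputes both deltas for the Q4_K_M row from the HealthBench and Closed-Ended figures published in the Benchmark Results table. The function names are illustrative. The sign convention is variant minus baseline, so negative values mean lost quality, as in the tables.

```python
def avg_score(healthbench, closed_ended):
    """AVG Score: mean of HealthBench Overall and Closed-Ended Average."""
    return (healthbench + closed_ended) / 2


def delta_points(variant_avg, baseline_avg):
    """Absolute change in AVG Score, in points (negative = worse)."""
    return variant_avg - baseline_avg


def delta_relative_pct(variant_avg, baseline_avg):
    """Change in AVG Score as a percentage of the baseline (negative = worse)."""
    return 100 * (variant_avg - baseline_avg) / baseline_avg


baseline = avg_score(74, 70.54)  # MedPsy-4B BF16, vLLM baseline
q4_k_m = avg_score(73, 69.92)    # Q4_K_M row

print(round(baseline, 2), round(q4_k_m, 2))            # 72.27 71.46
print(round(delta_points(q4_k_m, baseline), 2))        # -0.81
print(round(delta_relative_pct(q4_k_m, baseline), 2))  # -1.12
```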

Quick Recommendation

| Your constraint | Choose |
| --- | --- |
| You want a llama.cpp-native unquantized file | BF16 - no quantization applied; same quality as the source BF16 checkpoint expected, GGUF format (8.83 GB) |
| You want the best possible quality at smaller size | Q8_0 - within 0.15 pts (-0.20% rel) of BF16 on AVG Score at roughly half the size (47% smaller) |
| You want extra quality headroom over 4-bit | Q5_K_M (imatrix) - 64% smaller than BF16, only -0.29 pts (-0.40% rel) on AVG Score |
| You want the best size/quality trade-off (most users) | Q4_K_M (imatrix) - 69% smaller, only -0.81 pts (-1.12% rel) on AVG Score |
| You need the smallest recommended 4B file | IQ3_M (imatrix) - 76% smaller, only -1.50 pts (-2.07% rel) on AVG Score |
| You want maximum compression and can tolerate quality loss | IQ4_XS or IQ3_M (avoid IQ3_XXS) |
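The recommendations above can also be expressed as a simple selection rule: among the files that fit a size budget, take the one with the smallest quality loss. The sketch below is one hypothetical way to encode that, using the sizes and Δ AVG values from the Available Files table. The function name and the rule itself are illustrative assumptions. File size is only a lower bound on runtime memory, which also includes the KV cache and runtime overhead.

```python
# (file name, size in GB, delta AVG in points, flagged as not recommended)
# Values copied from the Available Files table; BF16 is listed there as ≈0.00.
VARIANTS = [
    ("medpsy-4b-bf16.gguf", 8.83, 0.00, False),
    ("medpsy-4b-q8_0.gguf", 4.69, -0.15, False),
    ("medpsy-4b-q5_k_m-imat.gguf", 3.16, -0.29, False),
    ("medpsy-4b-q4_k_m-imat.gguf", 2.72, -0.81, False),
    ("medpsy-4b-iq4_nl-imat.gguf", 2.60, -1.02, False),
    ("medpsy-4b-iq4_xs-imat.gguf", 2.48, -1.08, False),
    ("medpsy-4b-iq3_m-imat.gguf", 2.13, -1.50, False),
    ("medpsy-4b-iq3_xxs-imat.gguf", 1.84, -5.56, True),
]


def pick_variant(max_file_gb, allow_not_recommended=False):
    """Return the best-scoring file that fits the budget, or None if none fits."""
    candidates = [
        v for v in VARIANTS
        if v[1] <= max_file_gb and (allow_not_recommended or not v[3])
    ]
    if not candidates:
        return None
    # The highest (least negative) delta means the smallest quality loss.
    return max(candidates, key=lambda v: v[2])[0]


print(pick_variant(3.0))  # medpsy-4b-q4_k_m-imat.gguf
print(pick_variant(2.0))  # None
```

By default the rule refuses to return IQ3_XXS, in line with the guidance to avoid it; callers must opt in explicitly.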

Benchmark Results

The comparison uses the BF16 source model evaluated with vLLM as the reference baseline. Quantized GGUF variants were evaluated on the full closed-ended benchmark suite (7 medical benchmarks, averaged) and HealthBench (CompassJudger-2-32B-Instruct as judge), using the same benchmark protocol. The BF16 GGUF file is the unquantized GGUF export and has not been separately re-run with llama.cpp; since no quantization is applied, its performance is expected to match the BF16 source model aside from small backend or runtime differences. AVG Score is the average of HealthBench Overall and Closed-Ended Average.

| Variant | Size (GB) | HealthBench | HB Hard | CE Avg | AVG Score | Δ AVG (pts) | Δ AVG (rel %) | Δ Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MedPsy-4B (BF16, vLLM baseline) | 8.83 | 74 | 58 | 70.54 | 72.27 | 0.00 | 0.00% | 0% |
| Q8_0 ★ | 4.69 | 74 | 57 | 70.25 | 72.13 | -0.15 | -0.20% | -47% |
| Q5_K_M ★ | 3.16 | 74 | 58 | 69.96 | 71.98 | -0.29 | -0.40% | -64% |
| Q4_K_M ★ | 2.72 | 73 | 56 | 69.92 | 71.46 | -0.81 | -1.12% | -69% |
| IQ4_NL | 2.60 | 73 | 57 | 69.50 | 71.25 | -1.02 | -1.41% | -71% |
| IQ4_XS | 2.48 | 73 | 57 | 69.39 | 71.20 | -1.08 | -1.49% | -72% |
| IQ3_M ★ | 2.13 | 73 | 58 | 68.55 | 70.78 | -1.50 | -2.07% | -76% |
| IQ3_XXS ⚠ | 1.84 | 69 | 51 | 64.42 | 66.71 | -5.56 | -7.69% | -79% |

HB = HealthBench; CE = Closed-Ended Average (7 medical benchmarks). Δ AVG (pts) is the absolute point change in AVG Score vs the BF16/vLLM source-model baseline (e.g. 71.46 - 72.27 = -0.81). Δ AVG (rel %) is that change as a percentage of the baseline (e.g. -0.81 / 72.27 = -1.12%). Δ Size is the relative file-size change vs BF16. ★ Recommended variants (best of class). ⚠ Not recommended due to significant quality degradation. AVG Score = (HealthBench Overall + Closed-Ended Average) / 2. Results averaged over 3 runs with generation parameters: temperature=0.6, top_k=20, top_p=0.95, max_output_tokens=16384. Quantized GGUF variants were evaluated with llama.cpp; the BF16 baseline was evaluated with vLLM. HealthBench evaluated using CompassJudger-2-32B-Instruct.
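The derived columns can be cross-checked against the raw ones. The sketch below recomputes AVG Score and Δ Size for each quantized row and reports any disagreement with the published values. It assumes the published figures were derived from the rounded numbers shown in the table, so agreement is checked to within half a unit in the last printed digit. The names are illustrative.

```python
BASELINE_SIZE_GB = 8.83  # BF16 file size from the table

# (variant, size GB, HealthBench, CE Avg, published AVG Score, published size change %)
ROWS = [
    ("Q8_0", 4.69, 74, 70.25, 72.13, -47),
    ("Q5_K_M", 3.16, 74, 69.96, 71.98, -64),
    ("Q4_K_M", 2.72, 73, 69.92, 71.46, -69),
    ("IQ4_NL", 2.60, 73, 69.50, 71.25, -71),
    ("IQ4_XS", 2.48, 73, 69.39, 71.20, -72),
    ("IQ3_M", 2.13, 73, 68.55, 70.78, -76),
    ("IQ3_XXS", 1.84, 69, 64.42, 66.71, -79),
]


def size_change_pct(size_gb, baseline_gb=BASELINE_SIZE_GB):
    """Relative file-size change vs BF16, in percent (negative = smaller)."""
    return 100 * (size_gb - baseline_gb) / baseline_gb


def check_table(rows=ROWS):
    """Return a list of mismatch descriptions (empty if every row agrees)."""
    problems = []
    for name, size_gb, hb, ce_avg, avg_published, size_published in rows:
        avg = (hb + ce_avg) / 2
        # Tolerate rounding of the published value to two decimals.
        if abs(avg - avg_published) > 0.005 + 1e-9:
            problems.append(f"{name}: AVG Score {avg:.3f} vs {avg_published}")
        if round(size_change_pct(size_gb)) != size_published:
            problems.append(f"{name}: size change {size_change_pct(size_gb):.1f}%")
    return problems


print(check_table())  # [] when every row is internally consistent
```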

Key Findings

Q8_0 is effectively lossless: -0.15 pts (-0.20% relative) AVG Score (72.13 vs 72.27) at 47% smaller (4.69 GB vs 8.83 GB), with no need for imatrix calibration.
Q5_K_M is a recommended high-quality option: -0.29 pts (-0.40% relative) AVG Score at 64% smaller (3.16 GB), while matching BF16 on HealthBench and HealthBench Hard.
Q4_K_M is the sweet spot: only -0.81 pts (-1.12% relative) AVG Score loss for a 69% size reduction (2.72 GB), comfortably fitting on high-end mobile and laptop devices.
IQ3_M is exceptionally efficient at this scale: -1.50 pts (-2.07% relative) at 76% size reduction (2.13 GB). It even matches the BF16 HealthBench Hard score (58) - a remarkably strong compact result.
IQ3_XXS is too aggressive: HealthBench Hard drops from 58 to 51 and AVG Score by -5.56 pts (-7.69% relative). Avoid this variant for medical use cases unless extreme size constraints leave no alternative.
Even the worst quantization beats unquantized peers: IQ3_XXS still scores 64.42 closed-ended / 69 HealthBench, ahead of the unquantized Qwen3-4B-Thinking-2507 backbone (63.10 / 63.00, a margin of 1.32 and 6 points respectively) and well above unquantized MedGemma-1.5-4B-it (51.20 / 54.00).

How Quantized Variants Compare to Other Models

Even the most aggressive recommended quantization (IQ3_M at 2.13 GB) retains a substantial accuracy lead over the unquantized open-weight baselines in this size class:

| Model | Size (GB) | Closed-Ended Avg | HealthBench |
| --- | --- | --- | --- |
| MedPsy-4B (BF16) | 8.83 | 70.54 | 74 |
| MedPsy-4B Q8_0 ★ | 4.69 | 70.25 | 74 |
| MedPsy-4B Q5_K_M ★ | 3.16 | 69.96 | 74 |
| MedPsy-4B Q4_K_M ★ | 2.72 | 69.92 | 73 |
| MedPsy-4B IQ3_M ★ | 2.13 | 68.55 | 73 |
| Qwen3-4B-Thinking-2507 (BF16, backbone) | 8.83 | 63.10 | 63 |
| MedGemma-1.5-4B-it (BF16) | 8.0 | 51.20 | 54 |

Usage

llama.cpp

# Download the recommended file (Q4_K_M with imatrix - best size/quality for mobile/laptop)
# (older huggingface_hub releases ship this command as `huggingface-cli download`)
hf download qvac/MedPsy-4B-GGUF medpsy-4b-q4_k_m-imat.gguf --local-dir .

# Run interactively
./llama-cli -m medpsy-4b-q4_k_m-imat.gguf \
    -p "What are the common symptoms and first-line treatments for community-acquired pneumonia?" \
    --temp 0.6 --top-k 20 --top-p 0.95 -n 1024
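When the same command is launched from a script, passing the arguments as a list avoids shell-quoting problems with prompts that contain quotes or newlines. The sketch below builds exactly the flags shown above. The helper names are illustrative, and the binary path assumes `llama-cli` sits in the current directory as in the example.

```python
import subprocess


def llama_cli_command(model_path, prompt, binary="./llama-cli",
                      max_new_tokens=1024):
    """Build the llama-cli argument list used in the example above."""
    return [
        binary,
        "-m", model_path,
        "-p", prompt,
        "--temp", "0.6",
        "--top-k", "20",
        "--top-p", "0.95",
        "-n", str(max_new_tokens),
    ]


def run_llama_cli(model_path, prompt, **kwargs):
    """Run llama-cli and raise CalledProcessError if it exits with an error."""
    return subprocess.run(llama_cli_command(model_path, prompt, **kwargs),
                          check=True)


# Usage (requires the binary and the downloaded model file):
# run_llama_cli("medpsy-4b-q4_k_m-imat.gguf",
#               "What are the common symptoms of community-acquired pneumonia?")
```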

QVAC SDK

These GGUF files are designed for deployment through the QVAC SDK, enabling fully private on-device inference on smartphones, tablets, and edge devices. See the QVAC documentation for integration guides.


# 1. Project setup
mkdir medpsy && cd medpsy
npm init -y && npm pkg set type=module

# 2. Install SDK + matching Bare runtime binary
#    (swap linux-x64 for: linux-arm64 | darwin-arm64 | darwin-x64 | win32-x64 | win32-arm64)
npm i @qvac/sdk bare-runtime-linux-x64

# 3. Authenticate with Hugging Face (one-time) and download the recommended quant
hf auth login
hf download qvac/MedPsy-4B-GGUF medpsy-4b-q4_k_m-imat.gguf --local-dir ./models

# 4. Run inference (streamed, GPU)
node --input-type=module -e '
import { loadModel, completion, unloadModel, VERBOSITY } from "@qvac/sdk";
import { resolve } from "node:path";
const id = await loadModel({
  modelSrc: resolve("./models/medpsy-4b-q4_k_m-imat.gguf"),
  modelType: "llamacpp-completion",
  modelConfig: { device: "gpu", ctx_size: 4096, verbosity: VERBOSITY.ERROR },
});
const r = completion({
  modelId: id,
  history: [{ role: "user", content: "First-line treatment for community-acquired pneumonia in 2 sentences." }],
  stream: true,
  generationParams: { temp: 0.6, top_p: 0.95, top_k: 20, predict: 2048 },
});
for await (const t of r.tokenStream) process.stdout.write(t);
const s = await r.stats;
console.log(`\n[${s.tokensPerSecond.toFixed(2)} tok/s, TTFT ${s.timeToFirstToken.toFixed(0)}ms, ${s.backendDevice}]`);
await unloadModel({ modelId: id });
'

Use and Limitations

Intended Use

MedPsy-4B-GGUF is intended as a starting point for developers and researchers building downstream healthcare applications involving medical text on-device. Developers are expected to validate, adapt, and make meaningful modifications to the model for their specific use cases.

Appropriate use cases include:

On-device medical information retrieval for privacy-sensitive environments
Building developer tools and prototypes for health-related applications running on edge devices
Research on medical language understanding and reasoning under quantization

In every case, deployments should carry appropriate disclaimers.

Limitations

This model is NOT a substitute for professional medical judgment and the model outputs are NOT a substitute for proper clinical diagnosis. Always consult with a certified physician. Despite strong benchmark performance, MedPsy-4B is a compact 4B-parameter language model that will make errors. Quantization may further amplify rare-case failure modes that are not captured by aggregate benchmark numbers. Medical AI systems can produce outputs that appear confident and authoritative while being factually incorrect, incomplete, or clinically inappropriate.

Known limitations include:

Hallucinations: The model may generate plausible-sounding but incorrect medical information.
Quantization artifacts: Quantized models can occasionally produce subtly degraded outputs (rare-token drops, less stable formatting on long generations) that aggregate benchmarks may not capture. Effects grow at lower bit counts; we strongly recommend Q4_K_M or higher for production deployment and advise against IQ3_XXS for any medical use case.
English only: The model was trained and evaluated primarily in English. Performance in other languages is not validated.
Text only: This model processes text inputs only. It cannot interpret medical images, lab results in non-text formats, or other modalities.
No real-time knowledge: The model's knowledge has a training data cutoff and does not reflect the latest medical guidelines, drug approvals, or clinical evidence.
Bias in training data: As with any model trained on synthetic and public medical data, biases in the source material may propagate to model outputs. Developers should validate performance across diverse patient populations, demographics, and clinical contexts.
Not designed for emergencies: This model should never be used as the sole decision-making tool in emergency or life-threatening situations.

Safety Recommendations

When integrating this model into any application:

Always include visible disclaimers informing users that outputs are AI-generated and not a substitute for professional medical advice
Do not use for direct clinical diagnosis or treatment without oversight by qualified healthcare professionals
Monitor for harmful outputs and implement appropriate safety filters in production systems
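As an illustration of the first recommendation, the sketch below appends a visible disclaimer to every model output before it reaches the user. The disclaimer wording and the function name are hypothetical placeholders. A real application would choose text suited to its own legal and clinical context, and would add safety filtering separately.

```python
# Hypothetical wording; replace with text reviewed for your deployment.
DISCLAIMER = (
    "This response was generated by an AI model and is not a substitute "
    "for professional medical advice."
)


def with_disclaimer(model_output, disclaimer=DISCLAIMER):
    """Return the model output with the disclaimer appended exactly once."""
    text = model_output.strip()
    if disclaimer in text:
        # Already present, so do not duplicate it.
        return text
    return f"{text}\n\n{disclaimer}"


print(with_disclaimer("Rest and fluids are commonly advised for a mild cold."))
```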

Related Resources

MedPsy Collection: All MedPsy models, datasets, and resources in one place
MedPsy-4B (BF16): Full-precision source model
MedPsy-1.7B-GGUF: Smaller GGUF sibling for smartphone-class deployment
MedPsy Technical Report: Full quantization methodology and results (Section 4.7)
QVAC SDK: On-device AI deployment framework
llama.cpp: Inference engine and quantization toolchain

Citation

@article{medpsy2026,
  title={MedPsy: State-of-the-Art Medical and Healthcare Language Models for Edge Devices},
  author={Vitabile, Davide and Buffa, Alexandro and Nambiar, Akshay and Nazir, Amril},
  year={2026},
  url={https://huggingface.co/blog/qvac/medpsy},
  institution={Tether AI Research}
}

Copyright

We will take appropriate actions in response to notices of copyright infringement. If you believe your work has been used or copied in a manner that infringes upon your intellectual property rights, please email data-apps@tether.io identifying and describing both the copyrighted work and alleged infringing content.

Licensing

This model, which was trained as described in the MedPsy Technical Report, is licensed by Tether Data, S.A. de C.V. under the Apache 2.0 license for research and educational purposes. As described above, this model is a version of Qwen3-4B-Thinking-2507, which is also under the Apache 2.0 license.

As described above, a subset of the Genesis I and Genesis II datasets was used by the Baichuan-M3-235B model (which is itself available under the Apache 2.0 license) to generate synthetic data for training this model. The Genesis I and Genesis II datasets are each made available under the CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0) license.
