MedPsy-1.7B
MedPsy-1.7B is a state-of-the-art, text-only medical and healthcare language model purpose-built for edge and smartphone deployment. Built on top of Qwen3-1.7B (operated in thinking mode, i.e. with enable_thinking=True) and post-trained with a multi-stage pipeline (supervised fine-tuning + reinforcement learning) on curated medical data, it delivers medical reasoning capabilities previously exclusive to models 2–16x its size.
| Developed by | Tether AI Research |
| Model type | Text-only causal language model (decoder-only transformer) |
| Base model | Qwen3-1.7B |
| Language | English |
| License | Apache 2.0 |
| Technical report | MedPsy Technical Report |
| Collection | MedPsy on Hugging Face |
| All MedPsy variants | MedPsy-4B · MedPsy-1.7B · MedPsy-4B-GGUF · MedPsy-1.7B-GGUF |
Key Highlights
- Smartphone-class medical AI: At only 1.7B parameters, small enough to run efficiently on mobile and edge devices
- Outperforms models 2–16x larger: Scores 62.62 on closed-ended medical benchmarks, beating MedGemma-1.5-4B (51.20) by +11.42 points and matching Qwen3-4B Thinking (63.10)
- Beats MedGemma-27B on real-world clinical tasks: Achieves 70.33 on HealthBench and 54.33 on HealthBench Hard, surpassing MedGemma-27B (65.00 / 42.00), a model 16x larger
- 1.7x token efficiency: Produces accurate medical answers in ~1,110 tokens vs ~1,901 for Qwen3-1.7B (Thinking), reducing latency and compute cost
- Privacy-first: Enables fully on-device inference via the QVAC SDK and QVAC Fabric; patient data never leaves the device
Benchmark Results
| | MedPsy-1.7B | MedGemma-1.5-4B-it | Qwen3-1.7B (Thinking) | LFM2.5-1.2B-Thinking |
|---|---|---|---|---|
| Closed-Ended Medical Benchmarks | | | | |
| Average | 62.62 | 51.20 | 49.95 | 44.15 |
| MMLU (Health) | 82.72 | 67.69 | 72.49 | 63.48 |
| AfriMedQA | 64.84 | 54.38 | 51.87 | 45.07 |
| MMLU-Pro Health | 61.37 | 47.31 | 45.07 | 37.81 |
| MedMCQA | 63.56 | 50.08 | 49.14 | 42.11 |
| MedQA (USMLE) | 75.05 | 64.39 | 47.18 | 39.85 |
| MedXpertQA | 21.28 | 15.80 | 11.60 | 11.54 |
| PubMedQA | 69.53 | 58.73 | 72.33 | 69.20 |
| HealthBench | | | | |
| Overall | 70.33 | 54.00 | 53.00 | 49.00 |
| Expertise-Tailored Communication | 76.33 | 62.67 | 63.67 | 60.00 |
| Response Depth | 56.33 | 48.67 | 49.67 | 43.00 |
| Context Seeking | 69.33 | 46.00 | 48.33 | 45.00 |
| Emergency Referrals | 80.00 | 64.00 | 64.67 | 60.00 |
| Global Health | 68.33 | 47.67 | 45.67 | 41.33 |
| Health Data Tasks | 57.00 | 44.67 | 42.33 | 35.33 |
| Responding Under Uncertainty | 74.00 | 58.33 | 56.33 | 51.00 |
| HealthBench Hard | | | | |
| Overall | 54.33 | 29.67 | 28.33 | 24.67 |
| Expertise-Tailored Communication | 52.33 | 31.67 | 31.67 | 30.67 |
| Response Depth | 40.33 | 29.00 | 28.33 | 23.33 |
| Context Seeking | 61.00 | 28.00 | 32.00 | 27.67 |
| Emergency Referrals | 60.33 | 29.00 | 27.67 | 22.00 |
| Global Health | 55.00 | 29.00 | 26.67 | 25.33 |
| Health Data Tasks | 43.33 | 23.67 | 21.33 | 15.33 |
| Responding Under Uncertainty | 58.33 | 35.00 | 31.00 | 25.67 |
* MMLU (Health): averaged accuracy across 6 sub-domains: anatomy, clinical_knowledge, college_biology, college_medicine, medical_genetics, professional_medicine.
* HealthBench evaluated using CompassJudger-2-32B-Instruct as judge.
* All results are averaged over 3 runs with generation parameters: temperature=0.6, top_k=20, top_p=0.95, max_output_tokens=16384.
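The reported closed-ended average can be reproduced from the seven per-benchmark scores in the table above. This sketch assumes an unweighted mean over the seven benchmarks, which matches the reported figure:

```python
# MedPsy-1.7B scores from the closed-ended benchmark table
scores = {
    "MMLU (Health)": 82.72,
    "AfriMedQA": 64.84,
    "MMLU-Pro Health": 61.37,
    "MedMCQA": 63.56,
    "MedQA (USMLE)": 75.05,
    "MedXpertQA": 21.28,
    "PubMedQA": 69.53,
}

# Unweighted mean across the seven benchmarks
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 62.62, matching the reported average
```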
Token Efficiency
Beyond raw accuracy, MedPsy-1.7B achieves a 1.7x reduction in average response length compared to its base model (Qwen3-1.7B (Thinking)). Shorter responses translate directly to faster inference, lower memory bandwidth usage, and reduced energy consumption - critical for smartphone and low-power edge deployment.
| | Qwen3-1.7B (Thinking) | MedPsy-1.7B |
|---|---|---|
| Avg. Response Length (Tokens) | 1,901 | 1,110 |
| Δ Reduction | | 1.7x fewer tokens |
The chart below shows per-benchmark response lengths. MedPsy-1.7B achieves large reductions on MedQA-USMLE, MedXpertQA, MMLU, and MMLU-Pro Health. On HealthBench, the model generates slightly longer responses than its base, reflecting the richer, more clinically detailed answers that drive its strong HealthBench performance (+17.33 points over base Qwen3-1.7B (Thinking)).
Average response length (tokens) per benchmark. Lower is better. MedPsy-1.7B produces shorter responses than Qwen3-1.7B (Thinking) on most benchmarks while achieving significantly higher accuracy.
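The headline 1.7x figure follows directly from the average response lengths reported above:

```python
base_tokens = 1901    # Qwen3-1.7B (Thinking) average response length
medpsy_tokens = 1110  # MedPsy-1.7B average response length

reduction = base_tokens / medpsy_tokens
print(f"{reduction:.2f}x fewer tokens")  # 1.71x, reported as ~1.7x
```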
Model Details
| Parameter | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Parameters | 1.7B |
| Hidden size | 2,048 |
| FFN hidden size | 6,144 |
| Layers | 28 |
| Attention heads | 16 |
| KV groups (GQA) | 8 |
| Vocab size | 151,936 |
| Max position embeddings | 40,960 |
| Precision | bfloat16 |
| Position embedding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
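The figures in the table above roughly account for the stated 1.7B parameter count. The back-of-the-envelope estimate below assumes a head dimension of 128 and tied input/output embeddings, and ignores small normalization terms; these assumptions follow the public Qwen3-1.7B configuration but are not stated in the table:

```python
# Values from the Model Details table
hidden = 2048
ffn = 6144
layers = 28
heads = 16
kv_heads = 8     # GQA key/value groups
vocab = 151936
head_dim = 128   # assumption: matches the public Qwen3-1.7B config

# Attention projections per layer: Q, K, V, O (K and V shrunk by GQA)
attn = hidden * heads * head_dim           # Q projection
attn += 2 * hidden * kv_heads * head_dim   # K and V projections
attn += heads * head_dim * hidden          # O projection

# SwiGLU MLP: gate, up, and down projections
mlp = 3 * hidden * ffn

# Assumption: embedding matrix tied with the LM head, so counted once
embed = vocab * hidden

total = layers * (attn + mlp) + embed
print(f"{total / 1e9:.2f}B parameters")  # ~1.72B, consistent with the stated 1.7B
```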
Usage
Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "qvac/MedPsy-1.7B"

# Load tokenizer and model (device_map="auto" places weights on GPU if available)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "What are the common symptoms and first-line treatments for community-acquired pneumonia?"}
]

# Apply the Qwen3 chat template; thinking mode is enabled by default
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
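Because the model runs in thinking mode, the decoded response typically contains a `<think>...</think>` reasoning block before the final answer. A minimal way to separate the two, assuming the standard Qwen3 tag format:

```python
def split_thinking(response: str) -> tuple[str, str]:
    """Split a thinking-mode response into (reasoning, final_answer)."""
    end_tag = "</think>"
    if end_tag in response:
        reasoning, _, answer = response.partition(end_tag)
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", response.strip()  # no reasoning block emitted

# Illustrative example output, not an actual model response
reasoning, answer = split_thinking(
    "<think>CAP is usually bacterial...</think>Common symptoms include fever and cough."
)
print(answer)  # Common symptoms include fever and cough.
```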
Training
The model was post-trained through a multi-stage pipeline on the Qwen3-1.7B (Thinking) backbone:
- SFT Stage 1 (Corpus 1): Broad medical adaptation on a large-scale synthetic corpus spanning biology, medicine, and health (including a new health domain not yet publicly released), built from Genesis II–style medical seeds and open-source medical QA prompts used purely as questions, with all reasoning targets freshly generated by Baichuan-M3-235B.
- SFT Stage 2 (Corpus 2): Reasoning specialization on a smaller, higher-value clinical QA corpus with teacher-generated chain-of-thought reasoning from Baichuan-M3-235B.
- RL Stage 1: Reinforcement learning (DAPO) on the easy/moderate subset of AlphaMedQA (Liu et al., 2025), annotated with the SFT checkpoint.
- RL Stage 2: Focused RL on a hard-enriched AlphaMedQA subset re-annotated with the best Stage 1 checkpoint, targeting persistent failure modes.
For full methodology details, see the MedPsy Technical Report.
Use and Limitations
Intended Use
MedPsy-1.7B is an open language model intended as a starting point for developers and researchers building downstream healthcare applications involving medical text. Developers are expected to validate, adapt, and make meaningful modifications to the model for their specific use cases.
Appropriate use cases include:
- Research on medical language understanding and reasoning
- Building developer tools and prototypes for health-related applications
- On-device medical information retrieval for privacy-sensitive environments
All of these use cases require appropriate disclaimers.
Limitations
This model is NOT a substitute for professional medical judgment, and its outputs are NOT a substitute for proper clinical diagnosis. Always consult a certified physician. Despite strong benchmark performance, MedPsy-1.7B is a compact 1.7B-parameter language model, one of the smallest in its class, and it will make errors. Its small size makes it particularly susceptible to mistakes on complex, multi-step clinical reasoning tasks. Medical AI systems can produce outputs that appear confident and authoritative while being factually incorrect, incomplete, or clinically inappropriate.
Known limitations include:
- Hallucinations: The model may generate plausible-sounding but incorrect medical information.
- Compact model trade-offs: At 1.7B parameters, the model has inherently less capacity than larger models. It may struggle with rare conditions, complex multi-step reasoning, or nuanced clinical scenarios that require deep domain knowledge.
- English only: The model was trained and evaluated primarily in English. Performance in other languages is not validated.
- Text only: This model processes text inputs only. It cannot interpret medical images, lab results in non-text formats, or other modalities.
- No real-time knowledge: The model's knowledge has a training data cutoff and does not reflect the latest medical guidelines, drug approvals, or clinical evidence.
- Bias in training data: As with any model trained on synthetic and public medical data, biases in the source material may propagate to model outputs. Developers should validate performance across diverse patient populations, demographics, and clinical contexts.
- Not designed for emergencies: This model should never be used as the sole decision-making tool in emergency or life-threatening situations.
Safety Recommendations
When integrating this model into any application:
- Always include visible disclaimers informing users that outputs are AI-generated and not a substitute for professional medical advice
- Do not use for direct clinical diagnosis or treatment without oversight by qualified healthcare professionals
- Monitor for harmful outputs and implement appropriate safety filters in production systems
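As one concrete pattern for the first two recommendations, an application can wrap model outputs with a visible disclaimer and route emergency-sounding inputs to a fixed referral message. The keyword list and wording below are purely illustrative assumptions, not part of the model or SDK:

```python
DISCLAIMER = (
    "\n\n---\nThis response is AI-generated and is not a substitute for "
    "professional medical advice. Consult a qualified clinician."
)

# Illustrative keyword triage; production systems need far more robust detection
EMERGENCY_KEYWORDS = ("chest pain", "can't breathe", "overdose", "suicid")

def wrap_response(user_input: str, model_output: str) -> str:
    """Append a disclaimer, and short-circuit likely emergencies to a referral."""
    lowered = user_input.lower()
    if any(kw in lowered for kw in EMERGENCY_KEYWORDS):
        return ("This may be a medical emergency. "
                "Call your local emergency number now.") + DISCLAIMER
    return model_output + DISCLAIMER

print(wrap_response("I have mild seasonal allergies", "Antihistamines may help..."))
```

A real deployment would replace the keyword check with a dedicated safety classifier and localize the emergency guidance.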
Ethics and Safety
The model was evaluated on medical safety dimensions through the HealthBench evaluation framework, which assesses Emergency Referrals, Responding Under Uncertainty, and Context Seeking, all critical safety dimensions for medical AI. However, no dedicated red-teaming or adversarial safety testing has been conducted on this model to date. Developers deploying this model in production should conduct their own safety evaluations appropriate to their use case.
Related Resources
- MedPsy Collection: All MedPsy models, datasets, and resources in one place
- MedPsy Technical Report: Full methodology and ablation details
- MedPsy-4B: Larger sibling model for higher-quality edge deployment
- MedPsy-1.7B-GGUF: Quantized GGUF weights for smartphone-class deployment via llama.cpp / QVAC SDK
- QVAC SDK: On-device AI deployment framework
- QVAC Genesis II: Underlying data generation methodology
Citation
```bibtex
@article{medpsy2026,
  title={MedPsy: State-of-the-Art Medical and Healthcare Language Models for Edge Devices},
  author={Vitabile, Davide and Buffa, Alexandro and Nambiar, Akshay and Nazir, Amril},
  institution={Tether AI Research},
  year={2026},
  url={https://huggingface.co/blog/qvac/medpsy}
}
```
Copyright
We will take appropriate actions in response to notices of copyright infringement. If you believe your work has been used or copied in a manner that infringes upon your intellectual property rights, please email data-apps@tether.io identifying and describing both the copyrighted work and alleged infringing content.
Licensing
This model, which was trained as described in the MedPsy Technical Report, is licensed by Tether Data, S.A. de C.V. under the Apache 2.0 license for research and educational purposes. As described above, this model is a version of Qwen3-1.7B, which is also available under the Apache 2.0 license.
As described above, a subset of the Genesis I and Genesis II datasets was used, via the Baichuan-M3-235B model (itself available under the Apache 2.0 license), to generate synthetic data for training this model. Both the Genesis I and Genesis II datasets are made available under the CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0) license.