🏥 BalastMed-4B — Local Medical Assistant for Clinicians
A fine-tuned version of Qwen/Qwen3.5-4B designed to run fully locally as a clinical decision support assistant for doctors and healthcare professionals.
Specialized in emergency triage, ESI scoring, differential diagnosis, and medical situation management — without sending any patient data to external servers.
⚠️ Disclaimer: This model is for research and clinical support purposes only. It is NOT a substitute for professional medical judgment. Final decisions always rest with licensed medical professionals.
🎯 Model Overview
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-4B |
| Fine-tuning Method | LoRA + SFT (Thinking pipeline re-training) |
| Task | Medical Triage / Clinical Decision Support |
| Language | English |
| License | CC-BY-NC 4.0 |
| Parameters | ~4B |
| Quantization | Q4_K_M (GGUF) — 2.78 GB |
📊 Evaluation Results
| Benchmark | Score |
|---|---|
| MedQA (USMLE-style) | 77.6% |
MedQA tests clinical reasoning across USMLE-style multiple choice questions covering diagnosis, treatment, and medical knowledge.
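As a rough illustration of how a multiple-choice score like the one above is computed, the sketch below compares predicted option letters against an answer key. The function name `multiple_choice_accuracy` and the toy data are hypothetical; this is not the evaluation harness used for the reported result.

```python
def multiple_choice_accuracy(predictions, answer_key):
    """Return the fraction of questions whose predicted option letter matches the key."""
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must have the same length")
    if not answer_key:
        raise ValueError("answer_key must not be empty")
    # Compare case-insensitively and ignore surrounding whitespace.
    correct = sum(
        p.strip().upper() == k.strip().upper()
        for p, k in zip(predictions, answer_key)
    )
    return correct / len(answer_key)


# Toy example with made-up answers (not real benchmark data).
toy_predictions = ["A", "c", "B", "D"]
toy_key = ["A", "C", "D", "D"]
toy_score = multiple_choice_accuracy(toy_predictions, toy_key)
print(f"{toy_score:.1%}")
```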
🧠 Training Details
- Method: LoRA fine-tuning + full SFT for clinical thinking pipeline re-training
- Base Model: Qwen/Qwen3.5-4B
- Hardware: 1× NVIDIA A100 40GB
- Training Data: Proprietary clinical dataset (not publicly available)
- Thinking Pipeline: The model's reasoning chain was completely re-trained via SFT to follow structured clinical logic — differentials, missing data identification, emergency flagging
- Focus Areas:
- ESI (Emergency Severity Index) levels 1–5
- Symptom assessment and chief complaint classification
- Differential diagnosis support
- Medical situation management for clinical staff
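Downstream tooling may want the ESI level as an integer instead of free text. The sketch below pulls the first ESI level out of a response with a regular expression. The helper `extract_esi_level`, the `ESI_LEVELS` summaries, and the assumption that the model writes something like "ESI level 2" or "ESI: 2" are illustrative and not part of the model's documented output format.

```python
import re

# One-line summaries of the five ESI levels (1 = most urgent, 5 = least urgent).
ESI_LEVELS = {
    1: "Immediate life-saving intervention required",
    2: "High-risk situation; should not wait",
    3: "Stable; two or more resources expected",
    4: "Stable; one resource expected",
    5: "Stable; no resources expected",
}

# Matches "ESI 2", "ESI level 2", "ESI Level: 2", "ESI=2" (assumed phrasings).
_ESI_PATTERN = re.compile(r"\bESI\s*(?:level)?\s*[:=]?\s*([1-5])\b", re.IGNORECASE)


def extract_esi_level(text):
    """Return the first ESI level (1-5) mentioned in text, or None if none is found."""
    match = _ESI_PATTERN.search(text)
    return int(match.group(1)) if match else None
```

A `None` result is worth surfacing to the clinician explicitly instead of being mapped to a default level.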
💬 Recommended System Prompt
You are a clinical medical assistant. Think through clinical reasoning, consider differentials, identify what data is missing, and flag emergencies. State uncertainty when evidence is insufficient. Defer final decisions to clinicians.
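One way to apply the prompt consistently is to keep it in a single constant and always place it first in the message list. The sketch below does that for single-turn and multi-turn conversations; `SYSTEM_PROMPT` and `build_messages` are hypothetical names chosen for illustration.

```python
SYSTEM_PROMPT = (
    "You are a clinical medical assistant. Think through clinical reasoning, "
    "consider differentials, identify what data is missing, and flag emergencies. "
    "State uncertainty when evidence is insufficient. Defer final decisions to clinicians."
)


def build_messages(case_description, history=None):
    """Return a chat message list: system prompt, optional prior turns, then the new case."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    if history:
        # Prior turns are dicts with "role" and "content" keys.
        messages.extend(history)
    messages.append({"role": "user", "content": case_description})
    return messages
```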
⚙️ Recommended Parameters
| Parameter | Value | Notes |
|---|---|---|
| `temperature` | 0.72 | Balanced between consistency and nuanced clinical reasoning |
| `top_p` | 0.94 | Wide token probability coverage |
| `top_k` | 60 | For rare conditions and broader differential evaluation |
| `top_k` | 20–40 | For focused, high-confidence diagnosis |
| `repetition_penalty` | 1.08 | Prevents output looping without over-constraining |
| `max_new_tokens` | 512–1024 | Higher range recommended for thinking mode |
Tip: Use `top_k: 60` when exploring broad differentials or rare presentations. Use `top_k: 20–40` when you need a clear, direct clinical answer. The thinking pipeline produces higher-quality output when `max_new_tokens` is set generously (≥1024).
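The two `top_k` regimes can be captured as presets so callers do not hard-code numbers. This is a minimal sketch: `sampling_kwargs`, the mode names `"broad"` and `"focused"`, and the choice of 40 as the focused value (the upper end of the 20–40 range) are assumptions made for illustration. The keys follow the keyword arguments of `model.generate` in transformers.

```python
# Values shared by both presets, taken from the recommended parameters.
BASE_SAMPLING = {
    "temperature": 0.72,
    "top_p": 0.94,
    "repetition_penalty": 1.08,
    "do_sample": True,
}


def sampling_kwargs(mode="focused", thinking=True):
    """Return generation keyword arguments for a 'broad' or 'focused' query."""
    if mode == "broad":
        top_k = 60  # rare conditions, wide differentials
    elif mode == "focused":
        top_k = 40  # upper end of the 20-40 range (illustrative choice)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Thinking mode benefits from the generous end of the token budget.
    max_new_tokens = 1024 if thinking else 512
    return {**BASE_SAMPLING, "top_k": top_k, "max_new_tokens": max_new_tokens}
```

With a loaded model this would be used as `model.generate(**inputs, **sampling_kwargs("broad"))`.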
🚀 Quick Start
With Ollama (Recommended for local use)
ollama run hf.co/balastml/balastmed-4B:Q4_K_M
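Once the model has been pulled, Ollama also exposes it over a local HTTP API, which keeps requests on the machine. The sketch below assumes Ollama is listening on its default address (`http://localhost:11434`) and uses its `/api/chat` endpoint with the recommended sampling values mapped to Ollama's option names. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # assumes Ollama's default address
MODEL_NAME = "hf.co/balastml/balastmed-4B:Q4_K_M"


def build_ollama_payload(user_text, system_prompt=None):
    """Build the JSON body for a single non-streaming chat request."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return {
        "model": MODEL_NAME,
        "messages": messages,
        "stream": False,  # return one JSON object instead of a stream of chunks
        "options": {
            "temperature": 0.72,
            "top_p": 0.94,
            "top_k": 40,
            "repeat_penalty": 1.08,  # Ollama's name for the repetition penalty
            "num_predict": 1024,     # Ollama's name for the new-token limit
        },
    }


def chat_with_ollama(user_text, system_prompt=None, timeout=300):
    """Send the request to the local Ollama server and return the reply text."""
    payload = build_ollama_payload(user_text, system_prompt)
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["message"]["content"]
```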
With llama.cpp
brew install llama.cpp
llama-server -hf balastml/balastmed-4B:Q4_K_M
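`llama-server` speaks an OpenAI-compatible chat API, so it can be queried with the standard library alone. The sketch below assumes the server is running on its default port (8080) and sends only standard OpenAI fields; `top_k` and the repetition penalty are left out because they are not part of that schema. The helper names are hypothetical.

```python
import json
import urllib.request

LLAMA_SERVER_URL = "http://localhost:8080/v1/chat/completions"  # assumes the default port


def build_chat_request(user_text, system_prompt=None):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    payload = {
        "model": "balastml/balastmed-4B:Q4_K_M",
        "messages": messages,
        "temperature": 0.72,
        "top_p": 0.94,
        "max_tokens": 1024,
    }
    return urllib.request.Request(
        LLAMA_SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from a parsed chat completions response."""
    return response_body["choices"][0]["message"]["content"]


def ask_llama_server(user_text, system_prompt=None, timeout=300):
    """Send one chat request to the local server and return the reply text."""
    request = build_chat_request(user_text, system_prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```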
With LM Studio
Search for balastml/balastmed-4B in LM Studio's model browser and download the Q4_K_M variant.
With Python (transformers)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "balastml/balastmed-4B"

# Load the tokenizer and the model weights in half precision.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

system_prompt = "You are a clinical medical assistant. Think through clinical reasoning, consider differentials, identify what data is missing, and flag emergencies. State uncertainty when evidence is insufficient. Defer final decisions to clinicians."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "58yo male, crushing chest pain radiating to left arm, diaphoresis, BP 90/60. ESI level and immediate actions?"},
]

# Render the chat template to text, then tokenize it on the model's device.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample with the recommended parameters.
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.72,
    top_p=0.94,
    top_k=40,
    repetition_penalty=1.08,
    do_sample=True,
)

# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
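Because the model produces a reasoning chain before its answer, it can be useful to separate the two for display. The sketch below assumes the reasoning is delimited by `<think>` and `</think>` tags, as in Qwen-family thinking models, and that those tags survive decoding. Neither is confirmed for this model, so check a raw output first. `split_reasoning` is a hypothetical helper.

```python
def split_reasoning(generated_text, open_tag="<think>", close_tag="</think>"):
    """Split generated text into (reasoning, answer).

    Returns ("", text) when no closing tag is present. The opening tag is
    optional because some chat templates emit it as part of the prompt.
    """
    head, sep, tail = generated_text.partition(close_tag)
    if not sep:
        return "", generated_text.strip()
    # Keep only what follows the opening tag, if there is one.
    reasoning = head.split(open_tag, 1)[-1].strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning(
    "<think>Hypotension with chest pain: consider ACS.</think>\nESI level 1."
)
```

Keeping the reasoning available to the clinician, instead of discarding it, makes it easier to audit how the answer was reached.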
🩺 Example Use Cases
Emergency Triage:
22yo female, sudden onset severe dyspnea, SpO2 82%, stridor present.
→ ESI level and initial management?
Differential Diagnosis:
45yo male, 3-week history of progressive fatigue, night sweats,
unintentional 8kg weight loss, palpable cervical lymphadenopathy.
→ Top differentials and recommended workup?
Medical Situation Management:
ICU patient, post-op day 2 after bowel resection. Sudden fever 39.8°C,
HR 118, BP dropping to 88/55, rising lactate. Current antibiotics: piperacillin-tazobactam.
→ Assessment and management priorities?
🔒 Privacy & Local Deployment
BalastMed-4B is designed for fully offline, local deployment. No patient data is sent to external servers. This makes it suitable for:
- Hospital internal networks
- Clinics with strict data privacy requirements
- GDPR / HIPAA-conscious environments (with appropriate institutional validation)
Minimum hardware for local use: 8GB RAM (Q4_K_M quantization, ~2.78 GB)
⚠️ Limitations
- Not validated for autonomous clinical deployment — requires physician oversight
- Trained primarily on English-language clinical data
- Training dataset is proprietary and not available for public inspection
- Performance may vary across sub-specialties, particularly highly specialized ones
- Should be used only by or under supervision of licensed medical professionals
🔗 Related Models
| Model | MedQA | Languages | Notes |
|---|---|---|---|
| BalastMed-4B | 77.6% | EN | This model |
| BalastMed-9B | 88.2% | EN + TR | Larger, bilingual |
📬 Contact & Feedback
For questions, collaborations, or clinical feedback, open a discussion on the Community tab.