# SwarmMed-14B-v1.2
A 14-billion parameter medical language model trained on 14,174 independently verified clinical QA pairs across 80+ medical specialties.
Every training example has been fact-checked using Chain-of-Verification (CoVe) — each factual claim independently verified by a 235B parameter model without access to the original answer. No unverified data touches the training loop.
This is the merged, ready-to-deploy version (bfloat16, 28GB). Load it with any standard transformers pipeline — no adapters or quantization libraries required.
Built by Swarm & Bee — sovereign compute infrastructure for specialized AI.
## Results

Evaluated on 50 expert-crafted clinical questions across 10 specialties, scored on a 6-dimension rubric (maximum 15 points per question):
| Specialty | Score | Grade |
|---|---|---|
| Cardiology | 11.0/15 (73%) | A- |
| Pediatrics | 10.8/15 (72%) | A- |
| Oncology | 10.6/15 (71%) | B+ |
| Internal Medicine | 10.2/15 (68%) | B+ |
| Emergency Medicine | 10.0/15 (67%) | B+ |
| Neurology | 9.4/15 (63%) | B |
| Psychiatry | 9.0/15 (60%) | B |
| Radiology | 9.0/15 (60%) | B |
| Endocrinology | 8.6/15 (57%) | B- |
| Pharmacology | 7.8/15 (52%) | C+ |
| Composite | 9.64/15 (64%) | B+ |
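The composite is consistent with an unweighted mean of the ten specialty scores, which can be sanity-checked directly:

```python
# Per-specialty scores from the results table (out of 15)
scores = {
    "Cardiology": 11.0,
    "Pediatrics": 10.8,
    "Oncology": 10.6,
    "Internal Medicine": 10.2,
    "Emergency Medicine": 10.0,
    "Neurology": 9.4,
    "Psychiatry": 9.0,
    "Radiology": 9.0,
    "Endocrinology": 8.6,
    "Pharmacology": 7.8,
}

# Composite = unweighted mean across the 10 specialties
composite = round(sum(scores.values()) / len(scores), 2)
print(composite)                      # 9.64
print(round(composite / 15 * 100))    # 64 (percent)
```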
## Version Trajectory
| Version | Training Data | Composite | Delta |
|---|---|---|---|
| v1.0 | 5,070 platinum | 7.6/15 | — |
| v1.1 | 10,008 platinum | 9.0/15 | +1.4 |
| v1.2 | 14,174 platinum | 9.64/15 | +0.64 |
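The trajectory also shows the marginal return per added training pair, computable from the table above (a simple back-of-envelope, not a published metric):

```python
# (version, platinum pairs, composite score) from the trajectory table
versions = [("v1.0", 5070, 7.6), ("v1.1", 10008, 9.0), ("v1.2", 14174, 9.64)]

for (_, n0, s0), (tag, n1, s1) in zip(versions, versions[1:]):
    # Composite points gained per 1,000 new platinum pairs
    gain_per_1k = (s1 - s0) / ((n1 - n0) / 1000)
    print(f"{tag}: +{s1 - s0:.2f} composite over {n1 - n0} new pairs "
          f"({gain_per_1k:.2f} pts / 1k pairs)")
```

The marginal gain roughly halves between releases (~0.28 then ~0.15 points per 1k pairs), the usual diminishing-returns curve for data scaling.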
## Scoring Rubric
| Dimension | Max | What It Measures |
|---|---|---|
| Concept Depth | 3 | Pathophysiology, mechanisms, differential diagnosis |
| Guidelines | 3 | Current evidence-based clinical recommendations |
| Numerical Accuracy | 3 | Drug doses, lab values, vital sign thresholds |
| Disclaimer | 2 | Appropriate safety and consultation language |
| Syndrome Naming | 2 | Correct medical terminology and eponyms |
| Urgency Triage | 2 | Appropriate escalation and referral language |
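A minimal sketch of how a per-question score could be aggregated under this rubric. The dimension keys are illustrative; the actual grading harness is not published:

```python
# Maximum points per rubric dimension (caps total 15, matching the table)
RUBRIC_MAX = {
    "concept_depth": 3,
    "guidelines": 3,
    "numerical_accuracy": 3,
    "disclaimer": 2,
    "syndrome_naming": 2,
    "urgency_triage": 2,
}

def score_question(dims: dict) -> int:
    """Sum dimension scores, validating each against its cap."""
    total = 0
    for name, cap in RUBRIC_MAX.items():
        s = dims.get(name, 0)
        if not 0 <= s <= cap:
            raise ValueError(f"{name} must be in [0, {cap}], got {s}")
        total += s
    return total

print(score_question(dict(RUBRIC_MAX)))  # 15 (full marks)
```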
## Quick Start

### Inference with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SwarmOS/SwarmMed-14B-v1.2-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # loads the bfloat16 weights as shipped
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a board-certified physician. Provide evidence-based clinical reasoning with appropriate safety disclaimers."},
    {"role": "user", "content": "A 62-year-old male presents with acute chest pain, ST elevation in leads II, III, and aVF, and troponin I of 15.2 ng/mL. BP 88/54, HR 48. What is the diagnosis and immediate management?"},
]

# Build the chat prompt, then generate
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.3, do_sample=True)

# Decode only the newly generated tokens, not the echoed prompt
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
### Inference with vLLM (Production)
```bash
vllm serve SwarmOS/SwarmMed-14B-v1.2-merged \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90
```

Then query the server with any OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="SwarmOS/SwarmMed-14B-v1.2-merged",
    messages=[
        {"role": "system", "content": "You are a board-certified physician."},
        {"role": "user", "content": "Your clinical question here..."},
    ],
    temperature=0.3,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```
Production throughput: ~35 tokens/second on RTX PRO 6000 Blackwell (96GB).
## Training Details

### Data Pipeline
This model was trained exclusively on platinum-tier data — every training example has passed a multi-stage verification pipeline:
```
Medical Literature             18 Specialty Templates
(Harrison's, Robbins,      ×   (cardiology, oncology,
 Katzung, Nelson's, etc.)       neurology, emergency,
                                pharma, psych, etc.)
         │                               │
         └───────────────┬───────────────┘
                         │
                   ┌─────▼─────┐
                   │   GRIND   │  Generate structured clinical QA
                   └─────┬─────┘
                         │  24,000+ raw pairs
                   ┌─────▼─────┐
                   │   CoVe    │  Chain-of-Verification:
                   │  VERIFY   │  235B checks each claim independently
                   └─────┬─────┘
                         │  93.6% survive verification
           ┌─────────────┼─────────────┐
           │             │             │
      PASS (57%)    FLAG (36%)    FAIL (6.4%)
           │             │             │
           │       235B rewrite     Rejected
           │      (verified facts)
           └──────┬──────┘
                  │
            ┌─────▼─────┐
            │ PLATINUM  │  14,174 verified pairs
            │   VAULT   │  80+ specialties
            └───────────┘
```
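The PASS/FLAG/FAIL routing above can be sketched as a triage over per-claim verification verdicts. This is an illustrative reconstruction, not the production pipeline; the threshold is an assumption:

```python
def route_pair(claim_verdicts: list[bool]) -> str:
    """Triage a QA pair from independent per-claim verification verdicts.

    Assumed policy (illustrative): every claim verified -> PASS;
    a minority of claims failed -> FLAG for a rewrite that keeps
    only verified facts; otherwise -> FAIL (rejected).
    """
    if not claim_verdicts:
        return "FAIL"  # nothing verifiable to keep
    failed = claim_verdicts.count(False)
    if failed == 0:
        return "PASS"   # goes straight to the platinum vault
    if failed / len(claim_verdicts) < 0.5:
        return "FLAG"   # rewritten around the verified claims
    return "FAIL"       # rejected outright

print(route_pair([True, True, True]))    # PASS
print(route_pair([True, True, False]))   # FLAG
print(route_pair([False, False, True]))  # FAIL
```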
Key insight from our experiments: Platinum-verified data is 4.6x more efficient per training pair than unverified gold data. 1,191 platinum pairs outperform 5,000 gold pairs on clinical benchmarks.
### Hyperparameters
| Parameter | Value |
|---|---|
| Base model | Qwen2.5-14B-Instruct |
| Method | LoRA (PEFT) |
| LoRA rank (r) | 128 |
| LoRA alpha | 256 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | ~2.5B of 14.8B total |
| Training pairs | 14,174 platinum |
| Evaluation pairs | 224 |
| Epochs | 3 |
| Effective batch size | 32 (per-device batch 8 × gradient accumulation 4) |
| Learning rate | 8e-5 |
| Max sequence length | 4,096 tokens |
| Final training loss | 0.219 |
| Final eval loss | 0.223 |
| Optimizer | AdamW (8-bit) |
| Precision | bfloat16 |
| Framework | Unsloth + TRL |
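For intuition on the trainable-parameter figure: LoRA adapts each target weight matrix W of shape (d_out, d_in) with two low-rank factors, B (d_out × r) and A (r × d_in), adding r · (d_in + d_out) parameters per matrix. A toy estimator (the shapes below are placeholders, not Qwen2.5-14B's real layer dimensions):

```python
def lora_param_count(r: int, shapes: list[tuple[int, int]]) -> int:
    """Trainable params LoRA adds over weights of shape (d_out, d_in)."""
    # Each adapted W gains B (d_out x r) and A (r x d_in)
    return sum(r * (d_out + d_in) for d_out, d_in in shapes)

# Toy example: a single 4096x4096 attention projection at r=128
print(lora_param_count(128, [(4096, 4096)]))  # 1048576
```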
### Compute
| Resource | Value |
|---|---|
| GPU | NVIDIA RTX PRO 6000 Blackwell (96GB) |
| Training time | 7 hours 25 minutes |
| Energy | 2.23 kWh |
| Weights hash | sha256:7dcf97d5... |
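Downloaded weights can be checked against the published hash with a streaming SHA-256; the filename below is a placeholder for whichever weight file you are verifying:

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 without loading it into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

# digest = sha256_file("model.safetensors")  # placeholder filename
# Compare the digest against the published sha256:7dcf97d5... value
```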
## The Swarm & Bee Thesis

### Why Verified Data Matters
The medical AI space has a quality problem. Thousands of medical QA datasets exist on HuggingFace. Most are LLM-generated, unverified, and contain hallucinations that compound through fine-tuning. A model trained on hallucinated drug doses will confidently generate hallucinated drug doses.
Our approach inverts this: verify first, train second.
The cost of verification is amortized across every model version. The platinum vault grows daily. Each new model version trains on a strictly larger, strictly cleaner dataset. The trajectory is monotonically improving.
### How We Build
Swarm & Bee is a sovereign compute infrastructure firm. We operate our own GPU fleet, run our own inference stack, and control the full pipeline from data harvesting to model deployment.
Infrastructure:
- Multi-node GPU cluster (RTX 3090 Ti, RTX PRO 6000 Blackwell)
- vLLM inference serving (~35 tok/s per node)
- 22 production services running 24/7
- Together.ai Qwen3-235B for verification (factored CoVe)
- Proof-of-Pair attestation with Ethereum L1 anchoring
- On-chain agent identity (ERC-8004 #17493 on Base)
Data assets (as of Feb 21, 2026):
- 15,025 platinum-verified clinical QA pairs
- 9,456 gold-tier pairs
- 80+ medical specialties covered
- 47 distinct specialty classifiers
- Growing 24/7 across 4 compute nodes
### The Roadmap
| Phase | Status | Description |
|---|---|---|
| Phase 1 | Complete | Anchor models (7B v1-v5, initial datasets) |
| Phase 2 | In Progress | Specialty depth (cardiology, ER, oncology, pharma) |
| Phase 3 | Planned | Cross-vertical expansion (aviation, legal, finance) |
| Phase 4 | Planned | Next-gen base models + Blackwell hardware fleet |
Target: 100,000 platinum pairs. 50+ specialized models. Sovereign deployment for every vertical.
## Limitations
- Not a diagnostic tool. This model is for research and development. It does not constitute medical advice and should not be used for clinical decision-making without professional oversight.
- English only. Training data and clinical guidelines are primarily US/English-language. Performance on non-English queries or jurisdiction-specific guidelines is untested.
- Pharmacology is weakest. The model scores 52% on pharmacology questions — drug interaction and dosing queries should be independently verified.
- Point-in-time knowledge. Clinical guidelines evolve. The model reflects medical knowledge current as of February 2026.
- Verification reduces but does not eliminate error. CoVe significantly reduces hallucination (Meta AI reports -77% in their paper), but no verification system is perfect.
## Training Data
This model was trained on the Swarm & Bee platinum vault — a proprietary collection of 14,174 verified clinical QA pairs.
A free, open-source sample of 500 pairs is available for inspection and research: SwarmMed Platinum 500 — 500 CoVe-verified pairs across 25 specialties, Apache-2.0 licensed.
## Verification Reference
The CoVe methodology is described in:
Dhuliawala, S., et al. (2023). "Chain-of-Verification Reduces Hallucination in Large Language Models." arXiv:2309.11495. Meta AI.
## Citation

```bibtex
@misc{swarmmed_14b_v1.2,
  title  = {SwarmMed-14B-v1.2: Verified Clinical Language Model},
  author = {{Swarm and Bee}},
  year   = {2026},
  url    = {https://huggingface.co/SwarmOS/SwarmMed-14B-v1.2-merged},
  note   = {Base model: Qwen/Qwen2.5-14B-Instruct. Apache-2.0 license. 14,174 CoVe-verified platinum training pairs, 80+ specialties.}
}
```
## Related Resources
| Resource | Link |
|---|---|
| LoRA Adapter (2.2GB) | SwarmMed-14B-v1.2 |
| Training Data Sample | SwarmMed Platinum 500 |
| CoVe Paper | arXiv:2309.11495 |
| Swarm & Bee | swarmandbee.com |
| All Models & Data | SwarmOS on HuggingFace |
Last mile intelligence. Sovereign compute. Your data never leaves your rack.