SwarmMed-14B-v1.2

A 14-billion parameter medical language model trained on 14,174 independently verified clinical QA pairs across 80+ medical specialties.

Every training example has been fact-checked using Chain-of-Verification (CoVe) — each factual claim independently verified by a 235B parameter model without access to the original answer. No unverified data touches the training loop.

This is the merged, ready-to-deploy version (bfloat16, 28GB). Load it with any standard transformers pipeline — no adapters or quantization libraries required.

Built by Swarm & Bee — sovereign compute infrastructure for specialized AI.


Results

Evaluated on 50 expert-crafted clinical questions across 10 specialties, scored on a 6-dimension rubric (max 15 points per question):

Specialty            Score            Grade
Cardiology           11.0/15 (73%)    A-
Pediatrics           10.8/15 (72%)    A-
Oncology             10.6/15 (71%)    B+
Internal Medicine    10.2/15 (68%)    B+
Emergency Medicine   10.0/15 (67%)    B+
Neurology             9.4/15 (63%)    B
Psychiatry            9.0/15 (60%)    B
Radiology             9.0/15 (60%)    B
Endocrinology         8.6/15 (57%)    B-
Pharmacology          7.8/15 (52%)    C+
Composite             9.64/15 (64%)   B+

Version Trajectory

Version   Training Data     Composite   Delta
v1.0      5,070 platinum    7.6/15      -
v1.1      10,008 platinum   9.0/15      +1.4
v1.2      14,174 platinum   9.64/15     +0.64

Scoring Rubric

Dimension            Max   What It Measures
Concept Depth        3     Pathophysiology, mechanisms, differential diagnosis
Guidelines           3     Current evidence-based clinical recommendations
Numerical Accuracy   3     Drug doses, lab values, vital sign thresholds
Disclaimer           2     Appropriate safety and consultation language
Syndrome Naming      2     Correct medical terminology and eponyms
Urgency Triage       2     Appropriate escalation and referral language
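
Scores roll up simply: each question's score is the sum of its six dimension grades, and the composite is the mean across questions. A minimal sketch of that roll-up, assuming the composite is a plain mean (dimension names and point caps come from the table above; the per-dimension grades themselves are assigned by expert reviewers and are not shown):

# Roll-up of rubric scores into per-question and composite totals.
# Dimension names and point caps come from the rubric table above;
# the individual grades are assigned by expert reviewers.

RUBRIC_MAX = {
    "concept_depth": 3,
    "guidelines": 3,
    "numerical_accuracy": 3,
    "disclaimer": 2,
    "syndrome_naming": 2,
    "urgency_triage": 2,
}  # 15 points per question

def question_score(grades: dict[str, float]) -> float:
    """Sum the six dimensions for one question, capping each at its max."""
    return sum(min(grades.get(dim, 0.0), cap) for dim, cap in RUBRIC_MAX.items())

def composite_score(question_scores: list[float]) -> float:
    """Mean per-question score across the evaluation set (max 15)."""
    return sum(question_scores) / len(question_scores)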

Quick Start

Inference with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SwarmOS/SwarmMed-14B-v1.2-merged"

# Merged bfloat16 checkpoint: no PEFT adapters or quantization libraries needed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # load in bfloat16 as stored
    device_map="auto",    # place layers across available GPUs
)

messages = [
    {"role": "system", "content": "You are a board-certified physician. Provide evidence-based clinical reasoning with appropriate safety disclaimers."},
    {"role": "user", "content": "A 62-year-old male presents with acute chest pain, ST elevation in leads II, III, and aVF, and troponin I of 15.2 ng/mL. BP 88/54, HR 48. What is the diagnosis and immediate management?"}
]

# Render the chat template, generate, and decode only the newly generated tokens.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.3, do_sample=True)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

Inference with vLLM (Production)

vllm serve SwarmOS/SwarmMed-14B-v1.2-merged \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90

Then query the server through its OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="SwarmOS/SwarmMed-14B-v1.2-merged",
    messages=[
        {"role": "system", "content": "You are a board-certified physician."},
        {"role": "user", "content": "Your clinical question here..."}
    ],
    temperature=0.3,
    max_tokens=1024,
)
print(response.choices[0].message.content)

Production throughput: ~35 tokens/second on RTX PRO 6000 Blackwell (96GB).


Training Details

Data Pipeline

This model was trained exclusively on platinum-tier data — every training example has passed a multi-stage verification pipeline:

Medical Literature          18 Specialty Templates
    (Harrison's,       ×     (cardiology, oncology,
     Robbins, Katzung,        neurology, emergency,
     Nelson's, etc.)          pharma, psych, etc.)
         │                        │
         └──────────┬─────────────┘
                    │
              ┌─────▼─────┐
              │   GRIND   │  Generate structured clinical QA
              └─────┬─────┘
                    │ 24,000+ raw pairs
              ┌─────▼─────┐
              │   CoVe    │  Chain-of-Verification:
              │  VERIFY   │  235B checks each claim independently
              └─────┬─────┘
                    │ 93.6% survive verification
         ┌──────────┼───────────┐
         │          │           │
    PASS (57%)  FLAG (36%)  FAIL (6.4%)
         │          │           │
         │     235B rewrite     │
         │   (verified facts)   ▼
         │          │        Rejected
         └─────┬────┘
               │
        ┌──────▼──────┐
        │  PLATINUM   │  14,174 verified pairs
        │   VAULT     │  80+ specialties
        └─────────────┘

Key insight from our experiments: Platinum-verified data is 4.6x more efficient per training pair than unverified gold data. 1,191 platinum pairs outperform 5,000 gold pairs on clinical benchmarks.
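
For intuition, factored CoVe boils down to three steps: extract the answer's factual claims, verify each claim without showing the verifier the original answer, then route the pair to PASS, FLAG (rewrite), or FAIL. A minimal sketch of that loop; the complete() helper, the prompts, and the routing threshold are illustrative assumptions, not our production pipeline:

# Minimal sketch of factored Chain-of-Verification over one QA pair.
# complete() stands in for a call to the 235B verifier; the prompts and
# the FLAG threshold are illustrative assumptions, not the production code.

def complete(prompt: str) -> str:
    raise NotImplementedError("wire this to your verifier model endpoint")

def cove_route(question: str, answer: str) -> str:
    # 1. Plan: extract the discrete factual claims made by the answer.
    claims = complete(
        "List each discrete factual claim in this answer, one per line:\n" + answer
    ).splitlines()

    # 2. Execute (factored): check each claim WITHOUT showing the original
    #    answer, so the verifier cannot simply repeat the answer's mistakes.
    verdicts = [
        complete(
            f"Question: {question}\nClaim: {c}\nIs this claim correct? Reply PASS or FAIL."
        )
        for c in claims
    ]

    # 3. Route: clean pairs pass; partially wrong pairs are rewritten from
    #    the verified claims; heavily wrong pairs are rejected outright.
    fails = sum("FAIL" in v for v in verdicts)
    if fails == 0:
        return "PASS"
    if fails <= len(claims) // 3:   # illustrative threshold
        return "FLAG"               # sent to the 235B rewrite step
    return "FAIL"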

Hyperparameters

Parameter              Value
Base model             Qwen2.5-14B-Instruct
Method                 LoRA (PEFT)
LoRA rank (r)          128
LoRA alpha             256
Target modules         q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable parameters   ~2.5B of 14.8B total
Training pairs         14,174 platinum
Evaluation pairs       224
Epochs                 3
Effective batch size   32 (per-device batch 8 × 4 gradient accumulation steps)
Learning rate          8e-5
Max sequence length    4,096 tokens
Final training loss    0.219
Final eval loss        0.223
Optimizer              AdamW (8-bit)
Precision              bfloat16
Framework              Unsloth + TRL
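
A sketch of the corresponding PEFT configuration (r, alpha, and target modules taken from the table above; lora_dropout and task_type are assumptions, not confirmed values):

# LoRA configuration implied by the hyperparameter table above.
# r, lora_alpha, and target_modules come from the table; lora_dropout
# and task_type are reasonable assumptions.
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,      # assumption
    task_type="CAUSAL_LM",
)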

Compute

Resource        Value
GPU             NVIDIA RTX PRO 6000 Blackwell (96GB)
Training time   7 hours 25 minutes
Energy          2.23 kWh
Weights hash    sha256:7dcf97d5...
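
To compare a downloaded weights file against the published digest, hash it locally (a minimal sketch; the shard filename below is a placeholder, and only a truncated hash appears above):

# Compute the sha256 of a downloaded weights file for comparison against
# the published digest. The filename is a placeholder; the hash in the
# table above is truncated.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("model-00001-of-00006.safetensors"))  # placeholder filename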

The Swarm & Bee Thesis

Why Verified Data Matters

The medical AI space has a quality problem. Thousands of medical QA datasets exist on HuggingFace. Most are LLM-generated, unverified, and contain hallucinations that compound through fine-tuning. A model trained on hallucinated drug doses will confidently generate hallucinated drug doses.

Our approach inverts this: verify first, train second.

The cost of verification is amortized across every model version. The platinum vault grows daily. Each new model version trains on a strictly larger, strictly cleaner dataset. The trajectory is monotonically improving.

How We Build

Swarm & Bee is a sovereign compute infrastructure firm. We operate our own GPU fleet, run our own inference stack, and control the full pipeline from data harvesting to model deployment.

Infrastructure:

  • Multi-node GPU cluster (RTX 3090 Ti, RTX PRO 6000 Blackwell)
  • vLLM inference serving (~35 tok/s per node)
  • 22 production services running 24/7
  • Together.ai Qwen3-235B for verification (factored CoVe)
  • Proof-of-Pair attestation with Ethereum L1 anchoring
  • On-chain agent identity (ERC-8004 #17493 on Base)

Data assets (as of Feb 21, 2026):

  • 15,025 platinum-verified clinical QA pairs
  • 9,456 gold-tier pairs
  • 80+ medical specialties covered
  • 47 distinct specialty classifiers
  • Growing 24/7 across 4 compute nodes

The Roadmap

Phase     Status        Description
Phase 1   Complete      Anchor models (7B v1-v5, initial datasets)
Phase 2   In Progress   Specialty depth (cardiology, ER, oncology, pharma)
Phase 3   Planned       Cross-vertical expansion (aviation, legal, finance)
Phase 4   Planned       Next-gen base models + Blackwell hardware fleet

Target: 100,000 platinum pairs. 50+ specialized models. Sovereign deployment for every vertical.


Limitations

  • Not a diagnostic tool. This model is for research and development. It does not constitute medical advice and should not be used for clinical decision-making without professional oversight.
  • English only. Training data and clinical guidelines are primarily US/English-language. Performance on non-English queries or jurisdiction-specific guidelines is untested.
  • Pharmacology is weakest. The model scores 52% on pharmacology questions — drug interaction and dosing queries should be independently verified.
  • Point-in-time knowledge. Clinical guidelines evolve. The model reflects medical knowledge current as of February 2026.
  • Verification reduces but does not eliminate error. CoVe significantly reduces hallucination (Meta AI reports a 77% reduction in the paper cited below), but no verification system is perfect.

Training Data

This model was trained on the Swarm & Bee platinum vault — a proprietary collection of 14,174 verified clinical QA pairs.

A free, open-source sample of 500 pairs is available for inspection and research: SwarmMed Platinum 500 — 500 CoVe-verified pairs across 25 specialties, Apache-2.0 licensed.
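
To pull the sample down for inspection (a sketch; the repo id is a guess based on the dataset name above and may differ):

# Load the open 500-pair sample for inspection.
# The repo id is a guess from the dataset name above; check the SwarmOS
# organization page on HuggingFace for the exact identifier.
from datasets import load_dataset

sample = load_dataset("SwarmOS/SwarmMed-Platinum-500", split="train")
print(sample[0])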

Verification Reference

The CoVe methodology is described in:

Dhuliawala, S., et al. (2023). "Chain-of-Verification Reduces Hallucination in Large Language Models." arXiv:2309.11495. Meta AI.


Citation

@misc{swarmmed_14b_v1.2,
  title={SwarmMed-14B-v1.2: Verified Clinical Language Model},
  author={{Swarm \& Bee}},
  year={2026},
  url={https://huggingface.co/SwarmOS/SwarmMed-14B-v1.2-merged},
  note={14,174 CoVe-verified platinum training pairs, 80+ specialties.
        Base model: Qwen/Qwen2.5-14B-Instruct. License: Apache-2.0.}
}

Related Resources

Resource               Link
LoRA Adapter (2.2GB)   SwarmMed-14B-v1.2
Training Data Sample   SwarmMed Platinum 500
CoVe Paper             arXiv:2309.11495
Swarm & Bee            swarmandbee.com
All Models & Data      SwarmOS on HuggingFace

Last mile intelligence. Sovereign compute. Your data never leaves your rack.

Swarm & Bee | swarmandbee.com | SwarmOS on HuggingFace
