SwarmMed-14B-v1.2

A 14-billion parameter medical language model trained on 14,174 independently verified clinical QA pairs across 80+ medical specialties.

Every training example has been fact-checked using Chain-of-Verification (CoVe) — each factual claim independently verified by a 235B parameter model without access to the original answer. No unverified data touches the training loop.

This is the merged, ready-to-deploy version (bfloat16, 28GB). Load it with any standard transformers pipeline — no adapters or quantization libraries required.

Built by Swarm & Bee — sovereign compute infrastructure for specialized AI.


Results

Evaluated on 50 expert-crafted clinical questions across 10 specialties, scored on a 6-dimension rubric (max 15 points per question):

Specialty            Score            Grade
Cardiology           11.0/15 (73%)    A-
Pediatrics           10.8/15 (72%)    A-
Oncology             10.6/15 (71%)    B+
Internal Medicine    10.2/15 (68%)    B+
Emergency Medicine   10.0/15 (67%)    B+
Neurology             9.4/15 (63%)    B
Psychiatry            9.0/15 (60%)    B
Radiology             9.0/15 (60%)    B
Endocrinology         8.6/15 (57%)    B-
Pharmacology          7.8/15 (52%)    C+
Composite             9.64/15 (64%)   B+

Version Trajectory

Version   Training Data     Composite   Delta
v1.0      5,070 platinum    7.6/15      -
v1.1      10,008 platinum   9.0/15      +1.4
v1.2      14,174 platinum   9.64/15     +0.64

Scoring Rubric

Dimension            Max   What It Measures
Concept Depth        3     Pathophysiology, mechanisms, differential diagnosis
Guidelines           3     Current evidence-based clinical recommendations
Numerical Accuracy   3     Drug doses, lab values, vital sign thresholds
Disclaimer           2     Appropriate safety and consultation language
Syndrome Naming      2     Correct medical terminology and eponyms
Urgency Triage       2     Appropriate escalation and referral language
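
Scores roll up simply: each question's score is the sum of its six dimension grades, and the composite is the mean across questions. A minimal sketch of that roll-up, assuming the composite is a plain mean (dimension names and point caps come from the table above; the per-dimension grades themselves are assigned by expert reviewers and are not shown):

# Roll-up of rubric scores into per-question and composite totals.
# Dimension names and point caps come from the rubric table above;
# the individual grades are assigned by expert reviewers.

RUBRIC_MAX = {
    "concept_depth": 3,
    "guidelines": 3,
    "numerical_accuracy": 3,
    "disclaimer": 2,
    "syndrome_naming": 2,
    "urgency_triage": 2,
}  # 15 points per question

def question_score(grades: dict[str, float]) -> float:
    """Sum the six dimensions for one question, capping each at its max."""
    return sum(min(grades.get(dim, 0.0), cap) for dim, cap in RUBRIC_MAX.items())

def composite_score(question_scores: list[float]) -> float:
    """Mean per-question score across the evaluation set (max 15)."""
    return sum(question_scores) / len(question_scores)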

Quick Start

Inference with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SwarmOS/SwarmMed-14B-v1.2-merged"

# Merged bfloat16 checkpoint: no PEFT adapters or quantization libraries needed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # load in bfloat16 as stored
    device_map="auto",    # place layers across available GPUs
)

messages = [
    {"role": "system", "content": "You are a board-certified physician. Provide evidence-based clinical reasoning with appropriate safety disclaimers."},
    {"role": "user", "content": "A 62-year-old male presents with acute chest pain, ST elevation in leads II, III, and aVF, and troponin I of 15.2 ng/mL. BP 88/54, HR 48. What is the diagnosis and immediate management?"}
]

# Render the chat template, generate, and decode only the newly generated tokens.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.3, do_sample=True)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

Inference with vLLM (Production)

vllm serve SwarmOS/SwarmMed-14B-v1.2-merged \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90

Then query the server through its OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="SwarmOS/SwarmMed-14B-v1.2-merged",
    messages=[
        {"role": "system", "content": "You are a board-certified physician."},
        {"role": "user", "content": "Your clinical question here..."}
    ],
    temperature=0.3,
    max_tokens=1024,
)
print(response.choices[0].message.content)

Production throughput: ~35 tokens/second on RTX PRO 6000 Blackwell (96GB).


Training Details

Data Pipeline

This model was trained exclusively on platinum-tier data — every training example has passed a multi-stage verification pipeline:

Medical Literature          18 Specialty Templates
    (Harrison's,       ×     (cardiology, oncology,
     Robbins, Katzung,        neurology, emergency,
     Nelson's, etc.)          pharma, psych, etc.)
         │                        │
         └──────────┬─────────────┘
                    │
              ┌─────▼─────┐
              │   GRIND   │  Generate structured clinical QA
              └─────┬─────┘
                    │ 24,000+ raw pairs
              ┌─────▼─────┐
              │   CoVe    │  Chain-of-Verification:
              │  VERIFY   │  235B checks each claim independently
              └─────┬─────┘
                    │ 93.6% survive verification
         ┌──────────┼───────────┐
         │          │           │
    PASS (57%)  FLAG (36%)  FAIL (6.4%)
         │          │           │
         │     235B rewrite     │
         │   (verified facts)   ▼
         │          │        Rejected
         └─────┬────┘
               │
        ┌──────▼──────┐
        │  PLATINUM   │  14,174 verified pairs
        │   VAULT     │  80+ specialties
        └─────────────┘

Key insight from our experiments: Platinum-verified data is 4.6x more efficient per training pair than unverified gold data. 1,191 platinum pairs outperform 5,000 gold pairs on clinical benchmarks.
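
For intuition, factored CoVe boils down to three steps: extract the answer's factual claims, verify each claim without showing the verifier the original answer, then route the pair to PASS, FLAG (rewrite), or FAIL. A minimal sketch of that loop; the complete() helper, the prompts, and the routing threshold are illustrative assumptions, not our production pipeline:

# Minimal sketch of factored Chain-of-Verification over one QA pair.
# complete() stands in for a call to the 235B verifier; the prompts and
# the FLAG threshold are illustrative assumptions, not the production code.

def complete(prompt: str) -> str:
    raise NotImplementedError("wire this to your verifier model endpoint")

def cove_route(question: str, answer: str) -> str:
    # 1. Plan: extract the discrete factual claims made by the answer.
    claims = complete(
        "List each discrete factual claim in this answer, one per line:\n" + answer
    ).splitlines()

    # 2. Execute (factored): check each claim WITHOUT showing the original
    #    answer, so the verifier cannot simply repeat the answer's mistakes.
    verdicts = [
        complete(
            f"Question: {question}\nClaim: {c}\nIs this claim correct? Reply PASS or FAIL."
        )
        for c in claims
    ]

    # 3. Route: clean pairs pass; partially wrong pairs are rewritten from
    #    the verified claims; heavily wrong pairs are rejected outright.
    fails = sum("FAIL" in v for v in verdicts)
    if fails == 0:
        return "PASS"
    if fails <= len(claims) // 3:   # illustrative threshold
        return "FLAG"               # sent to the 235B rewrite step
    return "FAIL"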

Hyperparameters

Parameter              Value
Base model             Qwen2.5-14B-Instruct
Method                 LoRA (PEFT)
LoRA rank (r)          128
LoRA alpha             256
Target modules         q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable parameters   ~2.5B of 14.8B total
Training pairs         14,174 platinum
Evaluation pairs       224
Epochs                 3
Effective batch size   32 (per-device batch 8 × 4 gradient accumulation steps)
Learning rate          8e-5
Max sequence length    4,096 tokens
Final training loss    0.219
Final eval loss        0.223
Optimizer              AdamW (8-bit)
Precision              bfloat16
Framework              Unsloth + TRL
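
A sketch of the corresponding PEFT configuration (r, alpha, and target modules taken from the table above; lora_dropout and task_type are assumptions, not confirmed values):

# LoRA configuration implied by the hyperparameter table above.
# r, lora_alpha, and target_modules come from the table; lora_dropout
# and task_type are reasonable assumptions.
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,      # assumption
    task_type="CAUSAL_LM",
)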

Compute

Resource        Value
GPU             NVIDIA RTX PRO 6000 Blackwell (96GB)
Training time   7 hours 25 minutes
Energy          2.23 kWh
Weights hash    sha256:7dcf97d5...
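
To compare a downloaded weights file against the published digest, hash it locally (a minimal sketch; the shard filename below is a placeholder, and only a truncated hash appears above):

# Compute the sha256 of a downloaded weights file for comparison against
# the published digest. The filename is a placeholder; the hash in the
# table above is truncated.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("model-00001-of-00006.safetensors"))  # placeholder filename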

The Swarm & Bee Thesis

Why Verified Data Matters

The medical AI space has a quality problem. Thousands of medical QA datasets exist on HuggingFace. Most are LLM-generated, unverified, and contain hallucinations that compound through fine-tuning. A model trained on hallucinated drug doses will confidently generate hallucinated drug doses.

Our approach inverts this: verify first, train second.

The cost of verification is amortized across every model version. The platinum vault grows daily. Each new model version trains on a strictly larger, strictly cleaner dataset. The trajectory is monotonically improving.

How We Build

Swarm & Bee is a sovereign compute infrastructure firm. We operate our own GPU fleet, run our own inference stack, and control the full pipeline from data harvesting to model deployment.

Infrastructure:

  • Multi-node GPU cluster (RTX 3090 Ti, RTX PRO 6000 Blackwell)
  • vLLM inference serving (~35 tok/s per node)
  • 22 production services running 24/7
  • Together.ai Qwen3-235B for verification (factored CoVe)
  • Proof-of-Pair attestation with Ethereum L1 anchoring
  • On-chain agent identity (ERC-8004 #17493 on Base)

Data assets (as of Feb 21, 2026):

  • 15,025 platinum-verified clinical QA pairs
  • 9,456 gold-tier pairs
  • 80+ medical specialties covered
  • 47 distinct specialty classifiers
  • Growing 24/7 across 4 compute nodes

The Roadmap

Phase     Status        Description
Phase 1   Complete      Anchor models (7B v1-v5, initial datasets)
Phase 2   In Progress   Specialty depth (cardiology, ER, oncology, pharma)
Phase 3   Planned       Cross-vertical expansion (aviation, legal, finance)
Phase 4   Planned       Next-gen base models + Blackwell hardware fleet

Target: 100,000 platinum pairs. 50+ specialized models. Sovereign deployment for every vertical.


Limitations

  • Not a diagnostic tool. This model is for research and development. It does not constitute medical advice and should not be used for clinical decision-making without professional oversight.
  • English only. Training data and clinical guidelines are primarily US/English-language. Performance on non-English queries or jurisdiction-specific guidelines is untested.
  • Pharmacology is weakest. The model scores 52% on pharmacology questions — drug interaction and dosing queries should be independently verified.
  • Point-in-time knowledge. Clinical guidelines evolve. The model reflects medical knowledge current as of February 2026.
  • Verification reduces but does not eliminate error. CoVe significantly reduces hallucination (Meta AI reports a 77% reduction in the paper cited below), but no verification system is perfect.

Training Data

This model was trained on the Swarm & Bee platinum vault — a proprietary collection of 14,174 verified clinical QA pairs.

A free, open-source sample of 500 pairs is available for inspection and research: SwarmMed Platinum 500 — 500 CoVe-verified pairs across 25 specialties, Apache-2.0 licensed.
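
To pull the sample down for inspection (a sketch; the repo id is a guess based on the dataset name above and may differ):

# Load the open 500-pair sample for inspection.
# The repo id is a guess from the dataset name above; check the SwarmOS
# organization page on HuggingFace for the exact identifier.
from datasets import load_dataset

sample = load_dataset("SwarmOS/SwarmMed-Platinum-500", split="train")
print(sample[0])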

Verification Reference

The CoVe methodology is described in:

Dhuliawala, S., et al. (2023). "Chain-of-Verification Reduces Hallucination in Large Language Models." arXiv:2309.11495. Meta AI.


Citation

@misc{swarmmed_14b_v1.2,
  title={SwarmMed-14B-v1.2: Verified Clinical Language Model},
  author={{Swarm \& Bee}},
  year={2026},
  url={https://huggingface.co/SwarmOS/SwarmMed-14B-v1.2-merged},
  note={14,174 CoVe-verified platinum training pairs, 80+ specialties.
        Base model: Qwen/Qwen2.5-14B-Instruct. License: Apache-2.0.}
}

Related Resources

Resource               Link
LoRA Adapter (2.2GB)   SwarmMed-14B-v1.2
Training Data Sample   SwarmMed Platinum 500
CoVe Paper             arXiv:2309.11495
Swarm & Bee            swarmandbee.com
All Models & Data      SwarmOS on HuggingFace

Last mile intelligence. Sovereign compute. Your data never leaves your rack.

Swarm & Bee | swarmandbee.com | SwarmOS on HuggingFace
