Trustcat committed on
Commit 7cb8d5a · verified · 1 Parent(s): f8c62b4

Add comprehensive model card with eval results, training details, and S&B pipeline documentation

Files changed (1)
  1. README.md +312 -30
README.md CHANGED
@@ -1,57 +1,339 @@
1
  ---
2
  license: apache-2.0
3
- base_model: unsloth/Qwen2.5-14B-Instruct
 
4
  tags:
5
  - medical
6
  - healthcare
7
  - clinical-reasoning
8
- - swarm-and-bee
9
  - platinum-pairs
10
  datasets:
11
- - SwarmOS/SwarmMedQA
12
- language:
13
- - en
14
  pipeline_tag: text-generation
15
  ---
16
 
17
  # SwarmMed-14B-v1.2
18
 
19
- Medical AI model fine-tuned on 14,174 platinum-verified QA pairs across 80+ medical specialties.
20
 
21
- ## Training
22
 
23
- - **Base**: Qwen2.5-14B-Instruct (4-bit quantized)
24
- - **Method**: LoRA r=128, alpha=256, all projection layers
25
- - **Data**: 14,174 platinum pairs (CoVe-verified + 235B rewritten)
26
- - **Epochs**: 3, effective batch size 32
27
- - **Loss**: 1.058 → 0.169 (final train), 0.223 (eval)
28
- - **GPU**: NVIDIA RTX PRO 6000 Blackwell 96GB
29
- - **Time**: 7h 25min, 2.23 kWh
30
- - **Rig**: swarmrails (Xeon w9-3475X, 256GB RAM)
31
 
32
- ## Improvements over v1.1
 
33
 
34
- - **+40% more training data** (14,174 vs 10,008 pairs)
35
- - **Lower final loss** (0.169 vs 0.198)
36
- - **Broader specialty coverage** (80+ specialties)
 
37
 
38
- ## Usage
39
 
40
  ```python
41
- from unsloth import FastLanguageModel
 
42
 
43
- model, tokenizer = FastLanguageModel.from_pretrained(
44
- model_name="SwarmOS/SwarmMed-14B-v1.2",
45
- max_seq_length=4096,
46
- load_in_4bit=True,
47
  )
48
- FastLanguageModel.for_inference(model)
49
  ```
50
 
51
- ## Safety
52
 
53
- This model is for research and educational purposes only. Not for clinical decision-making. Always consult qualified healthcare professionals.
54
 
55
- ## Built by
56
 
57
- [Swarm & Bee](https://swarmandbee.com): Last mile intelligence. Sovereign compute. Specialized models.
 
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - en
5
  tags:
6
  - medical
7
  - healthcare
8
  - clinical-reasoning
 
9
  - platinum-pairs
10
+ - cove-verified
11
+ - chain-of-thought
12
+ - cardiology
13
+ - oncology
14
+ - neurology
15
+ - emergency-medicine
16
+ - psychiatry
17
+ - pediatrics
18
+ - pharmacology
19
+ - endocrinology
20
+ - radiology
21
+ - internal-medicine
22
+ - sft
23
+ - lora
24
+ - fine-tuned
25
+ - swarm-and-bee
26
+ base_model: Qwen/Qwen2.5-14B-Instruct
27
  datasets:
28
+ - SwarmOS/SwarmMed-Platinum-500
 
 
29
  pipeline_tag: text-generation
30
+ model-index:
31
+ - name: SwarmMed-14B-v1.2
32
+ results:
33
+ - task:
34
+ type: text-generation
35
+ name: Clinical Reasoning
36
+ dataset:
37
+ name: SwarmMed Platinum Eval (50 questions, 10 specialties)
38
+ type: custom
39
+ metrics:
40
+ - type: custom
41
+ name: Composite Score
42
+ value: 9.64
43
+ verified: false
44
+ - type: custom
45
+ name: Cardiology
46
+ value: 11.0
47
+ verified: false
48
+ - type: custom
49
+ name: Pediatrics
50
+ value: 10.8
51
+ verified: false
52
+ - type: custom
53
+ name: Oncology
54
+ value: 10.6
55
+ verified: false
56
+ - type: custom
57
+ name: Internal Medicine
58
+ value: 10.2
59
+ verified: false
60
+ - type: custom
61
+ name: Emergency Medicine
62
+ value: 10.0
63
+ verified: false
64
  ---
65
 
66
  # SwarmMed-14B-v1.2
67
 
68
+ **A 14-billion-parameter medical language model trained on 14,174 independently verified clinical QA pairs across 80+ medical specialties.**
69
+
70
+ Every training example has been fact-checked using Chain-of-Verification (CoVe): each factual claim is independently verified by a 235B-parameter model without access to the original answer. No unverified data touches the training loop.
71
+
72
+ This is the merged, ready-to-deploy version (bfloat16, 28GB). Load it with any standard `transformers` pipeline; no adapters or quantization libraries are required.
73
+
74
+ **Built by [Swarm & Bee](https://swarmandbee.com)**: sovereign compute infrastructure for specialized AI.
75
+
76
+ ---
77
+
78
+ ## Results
79
+
80
+ Evaluated on 50 expert-crafted clinical questions across 10 specialties, scored on a 6-dimension rubric (max 15 points per question):
81
+
82
+ | Specialty | Score | Grade |
83
+ |-----------|-------|-------|
84
+ | **Cardiology** | **11.0/15 (73%)** | A- |
85
+ | **Pediatrics** | **10.8/15 (72%)** | A- |
86
+ | **Oncology** | **10.6/15 (71%)** | B+ |
87
+ | **Internal Medicine** | **10.2/15 (68%)** | B+ |
88
+ | **Emergency Medicine** | **10.0/15 (67%)** | B+ |
89
+ | Neurology | 9.4/15 (63%) | B |
90
+ | Psychiatry | 9.0/15 (60%) | B |
91
+ | Radiology | 9.0/15 (60%) | B |
92
+ | Endocrinology | 8.6/15 (57%) | B- |
93
+ | Pharmacology | 7.8/15 (52%) | C+ |
94
+ | **Composite** | **9.64/15 (64%)** | **B+** |
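
As a quick arithmetic check on the table above, the sketch below recomputes the composite from the ten per-specialty scores. It assumes the composite is the unweighted mean across specialties (the card does not state the aggregation rule); the variable and function names are illustrative only.

```python
# Per-specialty scores copied from the Results table (points out of 15).
scores = {
    "Cardiology": 11.0,
    "Pediatrics": 10.8,
    "Oncology": 10.6,
    "Internal Medicine": 10.2,
    "Emergency Medicine": 10.0,
    "Neurology": 9.4,
    "Psychiatry": 9.0,
    "Radiology": 9.0,
    "Endocrinology": 8.6,
    "Pharmacology": 7.8,
}
MAX_POINTS = 15  # rubric maximum per question


def composite(per_specialty):
    """Unweighted mean across specialties (assumed aggregation rule)."""
    return sum(per_specialty.values()) / len(per_specialty)


comp = composite(scores)
print(f"Composite: {comp:.2f}/{MAX_POINTS} ({comp / MAX_POINTS:.0%})")
# prints "Composite: 9.64/15 (64%)"
```

Under that assumption the mean reproduces the reported 9.64/15.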
95
+
96
+ ### Version Trajectory
97
+
98
+ | Version | Training Data | Composite | Delta |
99
+ |---------|--------------|-----------|-------|
100
+ | v1.0 | 5,070 platinum | 7.6/15 | n/a |
101
+ | v1.1 | 10,008 platinum | 9.0/15 | +1.4 |
102
+ | **v1.2** | **14,174 platinum** | **9.64/15** | **+0.64** |
103
+
104
+ ### Scoring Rubric
105
+
106
+ | Dimension | Max | What It Measures |
107
+ |-----------|-----|------------------|
108
+ | Concept Depth | 3 | Pathophysiology, mechanisms, differential diagnosis |
109
+ | Guidelines | 3 | Current evidence-based clinical recommendations |
110
+ | Numerical Accuracy | 3 | Drug doses, lab values, vital sign thresholds |
111
+ | Disclaimer | 2 | Appropriate safety and consultation language |
112
+ | Syndrome Naming | 2 | Correct medical terminology and eponyms |
113
+ | Urgency Triage | 2 | Appropriate escalation and referral language |
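
The rubric can be expressed as a small scoring helper. This is a minimal sketch, not the evaluation code used for the card: the dictionary keys, the function name, and the example scores are hypothetical, and only the per-dimension maxima come from the table.

```python
# Maximum points per dimension, taken from the Scoring Rubric table.
RUBRIC_MAX = {
    "concept_depth": 3,
    "guidelines": 3,
    "numerical_accuracy": 3,
    "disclaimer": 2,
    "syndrome_naming": 2,
    "urgency_triage": 2,
}


def total_score(dimension_scores):
    """Sum per-dimension scores after checking each against its maximum."""
    total = 0
    for name, max_points in RUBRIC_MAX.items():
        value = dimension_scores.get(name, 0)  # missing dimensions score 0
        if not 0 <= value <= max_points:
            raise ValueError(f"{name} must be between 0 and {max_points}")
        total += value
    return total


# The dimension maxima add up to the 15-point ceiling quoted in the card.
rubric_ceiling = sum(RUBRIC_MAX.values())

# Hypothetical scores for one answer, for illustration only.
example = {
    "concept_depth": 2,
    "guidelines": 2,
    "numerical_accuracy": 2,
    "disclaimer": 2,
    "syndrome_naming": 1,
    "urgency_triage": 2,
}
print(total_score(example), "/", rubric_ceiling)
```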
114
+
115
+ ---
116
+
117
+ ## Quick Start
118
+
119
+ ### Inference with Transformers
120
+
121
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer

+ model_id = "SwarmOS/SwarmMed-14B-v1.2-merged"

+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype="auto",
+     device_map="auto",
+ )
+
+ messages = [
+     {"role": "system", "content": "You are a board-certified physician. Provide evidence-based clinical reasoning with appropriate safety disclaimers."},
+     {"role": "user", "content": "A 62-year-old male presents with acute chest pain, ST elevation in leads II, III, and aVF, and troponin I of 15.2 ng/mL. BP 88/54, HR 48. What is the diagnosis and immediate management?"}
+ ]

+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)

+ outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.3, do_sample=True)
+ response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
+ print(response)
+ ```
145
 
146
+ ### Inference with vLLM (Production)
147
+
148
+ ```bash
+ vllm serve SwarmOS/SwarmMed-14B-v1.2-merged \
+     --max-model-len 4096 \
+     --gpu-memory-utilization 0.90
+ ```
153
 
154
  ```python
+ from openai import OpenAI
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

+ response = client.chat.completions.create(
+     model="SwarmOS/SwarmMed-14B-v1.2-merged",
+     messages=[
+         {"role": "system", "content": "You are a board-certified physician."},
+         {"role": "user", "content": "Your clinical question here..."}
+     ],
+     temperature=0.3,
+     max_tokens=1024,
  )
+ print(response.choices[0].message.content)
+ ```
169
+
170
+ **Production throughput**: ~35 tokens/second on RTX PRO 6000 Blackwell (96GB).
171
+
172
+ ---
173
+
174
+ ## Training Details
175
+
176
+ ### Data Pipeline
177
+
178
+ This model was trained exclusively on **platinum-tier** data. Every training example has passed a multi-stage verification pipeline:
179
+
180
+ ```
181
+ Medical Literature        18 Specialty Templates
+ (Harrison's,         ×    (cardiology, oncology,
+  Robbins, Katzung,         neurology, emergency,
+  Nelson's, etc.)           pharma, psych, etc.)
+         │                             │
+         └───────────┬─────────────────┘
+                     │
+               ┌─────▼─────┐
+               │   GRIND   │  Generate structured clinical QA
+               └─────┬─────┘
+                     │  24,000+ raw pairs
+               ┌─────▼─────┐
+               │   CoVe    │  Chain-of-Verification
+               │  VERIFY   │  235B checks each claim independently
+               └─────┬─────┘
+                     │  93.6% survive verification
+          ┌──────────┼──────────┐
+     PASS (57%) FLAG (36%) FAIL (6.4%)
+          │     235B Rewrite    │
+          │   (verified facts)  Rejected
+          └─────┬────┘
+                │
+          ┌─────▼─────┐
+          │ PLATINUM  │  14,174 verified pairs
+          │   VAULT   │  80+ specialties
+          └───────────┘
207
  ```
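
The routing step in the diagram (PASS goes straight to the vault, FLAG is rewritten from verified facts, FAIL is rejected) can be sketched as follows. This is an illustrative sketch only: the `verdict` field, the `build_vault` function, and the `rewrite` callback are hypothetical names, and the real pipeline's data structures are not described in this card.

```python
def build_vault(pairs, rewrite):
    """Route each QA pair by its CoVe verdict, as in the diagram.

    pairs   : iterable of dicts with a 'verdict' key (hypothetical schema)
    rewrite : callable that returns a corrected copy of a flagged pair
    """
    vault = []
    for pair in pairs:
        verdict = pair["verdict"]
        if verdict == "PASS":
            vault.append(pair)           # verified as-is
        elif verdict == "FLAG":
            vault.append(rewrite(pair))  # rewritten from verified facts
        elif verdict == "FAIL":
            continue                     # rejected, never reaches training
        else:
            raise ValueError(f"unknown verdict: {verdict!r}")
    return vault


# Toy data for illustration; not real training pairs.
toy_pairs = [
    {"id": 1, "verdict": "PASS"},
    {"id": 2, "verdict": "FLAG"},
    {"id": 3, "verdict": "FAIL"},
]
toy_vault = build_vault(toy_pairs, rewrite=lambda p: {**p, "rewritten": True})
print([p["id"] for p in toy_vault])
```

Keeping the rewrite step as a callback leaves the verifier model behind it swappable, which is one way to keep the routing logic testable without model access.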
208
 
209
+ **Key insight from our experiments**: Platinum-verified data is **4.6x more efficient** per training pair than unverified gold data. 1,191 platinum pairs outperform 5,000 gold pairs on clinical benchmarks.
210
 
211
+ ### Hyperparameters
212
+
213
+ | Parameter | Value |
214
+ |-----------|-------|
215
+ | Base model | Qwen2.5-14B-Instruct |
216
+ | Method | LoRA (PEFT) |
217
+ | LoRA rank (r) | 128 |
218
+ | LoRA alpha | 256 |
219
+ | Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
220
+ | Trainable parameters | ~2.5B of 14.8B total |
221
+ | Training pairs | 14,174 platinum |
222
+ | Evaluation pairs | 224 |
223
+ | Epochs | 3 |
224
+ | Effective batch size | 32 (8 × 4 gradient accumulation) |
225
+ | Learning rate | 8e-5 |
226
+ | Max sequence length | 4,096 tokens |
227
+ | Final training loss | 0.219 |
228
+ | Final eval loss | 0.223 |
229
+ | Optimizer | AdamW (8-bit) |
230
+ | Precision | bfloat16 |
231
+ | Framework | Unsloth + TRL |
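
For readers who want to sanity-check the adapter size, the sketch below counts LoRA parameters for the rank and target modules in the table. The model dimensions (hidden size 5120, intermediate size 13824, 48 layers, 40 attention heads, 8 key/value heads) are assumptions about the Qwen2.5-14B architecture and are not stated in this card. Each adapted linear layer of shape `d_in x d_out` adds `r * (d_in + d_out)` parameters.

```python
# Assumed Qwen2.5-14B dimensions (not stated in this card).
HIDDEN = 5120
INTERMEDIATE = 13824
NUM_LAYERS = 48
NUM_HEADS = 40
NUM_KV_HEADS = 8
HEAD_DIM = HIDDEN // NUM_HEADS    # 128
KV_DIM = NUM_KV_HEADS * HEAD_DIM  # 1024, grouped-query attention

RANK = 128  # LoRA rank from the Hyperparameters table

# (input dim, output dim) of each target module listed in the table.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, rank):
    """A is (rank x d_in) and B is (d_out x rank), so rank * (d_in + d_out)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters (~{total / 1e9:.2f}B)")
# prints "550,502,400 trainable LoRA parameters (~0.55B)"
```

Under these assumptions the count is roughly 0.55B, which at 4 bytes per parameter is about 2.2GB. It does not reproduce the ~2.5B figure in the table, so the actual run may differ from the assumptions made here.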
232
+
233
+ ### Compute
234
+
235
+ | Resource | Value |
236
+ |----------|-------|
237
+ | GPU | NVIDIA RTX PRO 6000 Blackwell (96GB) |
238
+ | Training time | 7 hours 25 minutes |
239
+ | Energy | 2.23 kWh |
240
+ | Weights hash | `sha256:7dcf97d5...` |
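
A published weights hash can be checked locally by streaming the file through SHA-256. The sketch below is a generic approach, not the attestation tooling used by the authors. Which file the hash covers is not stated in the card, the helper names are hypothetical, and only the truncated prefix shown in the table is available for comparison.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large weight files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(hex_digest, published):
    """Compare against a possibly truncated value such as 'sha256:7dcf97d5...'."""
    prefix = published.removeprefix("sha256:").rstrip(".")
    return hex_digest.startswith(prefix)


# Demonstration on a throwaway file; replace with the path to real weights.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"demo weights")
    demo_path = tmp.name
demo_digest = sha256_of_file(demo_path)
os.remove(demo_path)
print(matches_prefix(demo_digest, "sha256:7dcf97d5..."))
```

A prefix match on eight hex characters is only a weak check; a full comparison needs the complete digest from the publisher.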
241
+
242
+ ---
243
+
244
+ ## The Swarm & Bee Thesis
245
+
246
+ ### Why Verified Data Matters
247
+
248
+ The medical AI space has a quality problem. Thousands of medical QA datasets exist on HuggingFace. Most are LLM-generated, unverified, and contain hallucinations that compound through fine-tuning. A model trained on hallucinated drug doses will confidently generate hallucinated drug doses.
249
+
250
+ Our approach inverts this: **verify first, train second.**
251
+
252
+ The cost of verification is amortized across every model version. The platinum vault grows daily. Each new model version trains on a strictly larger, strictly cleaner dataset. The trajectory is monotonically improving.
253
+
254
+ ### How We Build
255
+
256
+ Swarm & Bee is a sovereign compute infrastructure firm. We operate our own GPU fleet, run our own inference stack, and control the full pipeline from data harvesting to model deployment.
257
+
258
+ **Infrastructure:**
259
+ - Multi-node GPU cluster (RTX 3090 Ti, RTX PRO 6000 Blackwell)
260
+ - vLLM inference serving (~35 tok/s per node)
261
+ - 22 production services running 24/7
262
+ - Together.ai Qwen3-235B for verification (factored CoVe)
263
+ - Proof-of-Pair attestation with Ethereum L1 anchoring
264
+ - On-chain agent identity (ERC-8004 #17493 on Base)
265
+
266
+ **Data assets (as of Feb 21, 2026):**
267
+ - 15,025 platinum-verified clinical QA pairs
268
+ - 9,456 gold-tier pairs
269
+ - 80+ medical specialties covered
270
+ - 47 distinct specialty classifiers
271
+ - Growing 24/7 across 4 compute nodes
272
+
273
+ ### The Roadmap
274
+
275
+ | Phase | Status | Description |
276
+ |-------|--------|-------------|
277
+ | Phase 1 | Complete | Anchor models (7B v1-v5, initial datasets) |
278
+ | Phase 2 | **In Progress** | Specialty depth (cardiology, ER, oncology, pharma) |
279
+ | Phase 3 | Planned | Cross-vertical expansion (aviation, legal, finance) |
280
+ | Phase 4 | Planned | Next-gen base models + Blackwell hardware fleet |
281
+
282
+ **Target**: 100,000 platinum pairs. 50+ specialized models. Sovereign deployment for every vertical.
283
+
284
+ ---
285
+
286
+ ## Limitations
287
+
288
+ - **Not a diagnostic tool.** This model is for research and development. It does not constitute medical advice and should not be used for clinical decision-making without professional oversight.
289
+ - **English only.** Training data and clinical guidelines are primarily US/English-language. Performance on non-English queries or jurisdiction-specific guidelines is untested.
290
+ - **Pharmacology is weakest.** The model scores 52% on pharmacology questions, so drug interaction and dosing queries should be independently verified.
291
+ - **Point-in-time knowledge.** Clinical guidelines evolve. The model reflects medical knowledge current as of February 2026.
292
+ - **Verification reduces but does not eliminate error.** CoVe significantly reduces hallucination (Meta AI reports -77% in their paper), but no verification system is perfect.
293
+
294
+ ---
295
+
296
+ ## Training Data
297
+
298
+ This model was trained on the Swarm & Bee platinum vault, a proprietary collection of 14,174 verified clinical QA pairs.
299
+
300
+ A free, open-source sample of 500 pairs is available for inspection and research:
301
+ **[SwarmMed Platinum 500](https://huggingface.co/datasets/SwarmOS/SwarmMed-Platinum-500)**: 500 CoVe-verified pairs across 25 specialties, Apache-2.0 licensed.
302
+
303
+ ### Verification Reference
304
+
305
+ The CoVe methodology is described in:
306
+
307
+ > Dhuliawala, S., et al. (2023). "Chain-of-Verification Reduces Hallucination in Large Language Models." *arXiv:2309.11495*. Meta AI.
308
+
309
+ ---
310
+
311
+ ## Citation
312
+
313
+ ```bibtex
314
+ @model{swarmmed_14b_v1.2,
315
+ title={SwarmMed-14B-v1.2: Verified Clinical Language Model},
316
+ author={Swarm and Bee},
317
+ year={2026},
318
+ url={https://huggingface.co/SwarmOS/SwarmMed-14B-v1.2-merged},
319
+ base_model={Qwen/Qwen2.5-14B-Instruct},
320
+ license={Apache-2.0},
321
+ note={14,174 CoVe-verified platinum training pairs, 80+ specialties}
322
+ }
323
+ ```
324
+
325
+ ## Related Resources
326
+
327
+ | Resource | Link |
328
+ |----------|------|
329
+ | LoRA Adapter (2.2GB) | [SwarmMed-14B-v1.2](https://huggingface.co/SwarmOS/SwarmMed-14B-v1.2) |
330
+ | Training Data Sample | [SwarmMed Platinum 500](https://huggingface.co/datasets/SwarmOS/SwarmMed-Platinum-500) |
331
+ | CoVe Paper | [arXiv:2309.11495](https://arxiv.org/abs/2309.11495) |
332
+ | Swarm & Bee | [swarmandbee.com](https://swarmandbee.com) |
333
+ | All Models & Data | [SwarmOS on HuggingFace](https://huggingface.co/SwarmOS) |
334
+
335
+ ---
336
 
337
+ *Last mile intelligence. Sovereign compute. Your data never leaves your rack.*
338
 
339
+ **Swarm & Bee** | [swarmandbee.com](https://swarmandbee.com) | SwarmOS on HuggingFace