ClinicDx V1
ClinicDx V1 is a fine-tuned multimodal clinical decision support (CDS) model based on google/medgemma-4b-it. It is trained to generate structured, evidence-grounded clinical assessments from patient presentations, integrating a retrieval-augmented knowledge base (KB) pipeline and an audio input pathway for voice-driven clinical observation extraction.
ClinicDx is an open-source trimodal inference system for edge clinical AI, combining a medical ASR encoder, a learned audio projector, and a fine-tuned 4B clinical LLM in a single llama.cpp binary, deployable fully offline on consumer hardware. For research contributions, open problems, and comparison to prior work, see RESEARCH.md on GitHub.
GitHub · Website · npm Package
This repository contains all four artifacts needed to run the full system with llama-server:
| File | Size | Description |
|---|---|---|
| clinicdx-v1-q8.gguf | 3.9 GB | ClinicDx V1 language model (Q8_0 quantisation) |
| medasr-encoder.gguf | 401 MB | MedASR Conformer encoder (frozen, 105M params) |
| audio-projector-v3-best.gguf | 46 MB | AudioProjector v3, best checkpoint (step 40000, val LM loss 0.1042) |
| who_knowledge_vec_v2.mv2 | 1.1 GB | WHO/MSF knowledge base v2.1 (27,860 chunks, BM25 + semantic hybrid) |
Knowledge Base
The who_knowledge_vec_v2.mv2 index contains 27,860 chunks from WHO and MSF clinical guidelines, built via a Docling + HybridChunker pipeline with safety keyword detection for life-threatening conditions.
The retrieval pipeline uses hybrid search (BM25 + EmbedGemma 300M semantic, merged via Reciprocal Rank Fusion) followed by a 4-slot clinical intent reranker that extracts condition, severity, population, and task from each query and rescores hits with multiplicative penalties (×0.12 for off-condition, ×0.35 for severity mismatch, ×0.30 for wrong population) and additive boosts for slot alignment. The reranker covers 30+ clinical conditions with inclusion/exclusion patterns and condition-specific overrides.
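The hybrid merge and penalty rescoring described above can be sketched as follows. This is a minimal illustration, not the actual ClinicDx code; function names are invented, and the RRF constant k=60 is the common default, not a value stated in this card. Only the three penalty multipliers come from the card itself.

```python
def rrf_merge(bm25_ranking, semantic_ranking, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in (bm25_ranking, semantic_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Multiplicative slot penalties quoted in this card
PENALTIES = {"off_condition": 0.12, "severity_mismatch": 0.35, "wrong_population": 0.30}

def rescore(score, mismatched_slots):
    """Apply one multiplicative penalty per mismatched intent slot."""
    for slot in mismatched_slots:
        score *= PENALTIES[slot]
    return score
```

A hit that matches the fused ranking but targets the wrong condition is thus down-weighted far more aggressively (×0.12) than one with only a severity mismatch (×0.35).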
During CDS inference, the model controls its own retrieval via a multi-turn ReAct loop: it emits <KB_QUERY> tags that the middleware resolves against this index and injects as <KB_RESULT> context, for up to 5 retrieval turns per request.
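A middleware driver for this loop might look like the sketch below. The exact tag grammar (including whether a closing </KB_QUERY> tag is emitted) is assumed, and `generate` / `retrieve` are caller-supplied stand-ins for the LLM and KB index, not real ClinicDx APIs.

```python
import re

MAX_KB_TURNS = 5  # retrieval-turn cap stated in this card

def react_cds(generate, retrieve, prompt):
    """Resolve <KB_QUERY> tags against the KB and feed back <KB_RESULT> context
    until the model stops querying or the turn cap is reached."""
    transcript = prompt
    for _ in range(MAX_KB_TURNS):
        output = generate(transcript)
        m = re.search(r"<KB_QUERY>(.*?)</KB_QUERY>", output, re.S)
        if not m:
            return output  # final structured assessment, no further retrieval
        hits = retrieve(m.group(1).strip())
        transcript += output + f"<KB_RESULT>{hits}</KB_RESULT>"
    return generate(transcript)  # force a final answer after the 5th turn
```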
Model Description
ClinicDx V1 is a LoRA-fine-tuned and fully merged version of MedGemma 4B Instruct. It was trained with input masking (only model output turns are trained; user and KB turns are masked) on a quality-filtered dataset of 27,592 clinical conversations augmented with KB retrieval.
The model generates structured 6-section responses with inline citations sourced from retrieved knowledge base content.
Training Details
| Parameter | Value |
|---|---|
| Base model | google/medgemma-4b-it |
| Training method | LoRA (r=64, α=128), fully merged |
| Training cycle | Cycle 2 (masked) |
| Dataset | production_run_v2 (quality-filtered) |
| Train samples | 27,592 |
| Validation samples | 1,452 |
| Input masking | 64% masked (user + KB turns), 36% trainable (model turns only) |
| Best eval loss | 0.4758 |
| Best eval accuracy | 86.25% |
| Best checkpoint | Step 4000 |
| Epochs trained | 2.32 (early stopped, patience=5) |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA dropout | 0.05 |
| Max sequence length | 8192 |
| Precision | bfloat16 |
Output Schema
Each response is structured into 6 sections:
- Alert Level – Urgency classification (e.g. Routine / Urgent / Emergency)
- Clinical Assessment – Summary of presenting findings and clinical reasoning
- Differential Considerations – Ranked differential diagnoses with rationale
- Recommended Actions – Investigations and immediate management steps
- Safety Alerts – Red-flag signs, drug interactions, contraindications
- Key Points – Concise summary for handover or documentation
KB integration uses an EXTRACTED/BANKED think-block pattern with inline [WHO: source] citations in the final response.
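Downstream consumers will typically want the six sections as structured data. A minimal splitter is sketched below; it assumes each section begins on its own line as "Name:", which is an assumption about the surface form, since this card does not specify the exact output formatting.

```python
import re

# Section names from the output schema above
SECTIONS = ["Alert Level", "Clinical Assessment", "Differential Considerations",
            "Recommended Actions", "Safety Alerts", "Key Points"]

def split_sections(response):
    """Split a 6-section response into {section: body}, assuming each section
    starts a line as 'Name:' (the model's actual surface form may differ)."""
    pattern = r"^(%s):" % "|".join(re.escape(s) for s in SECTIONS)
    parts = re.split(pattern, response, flags=re.M)
    # re.split with a capture group yields [preamble, name, body, name, body, ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}
```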
Architecture
- Base: Gemma3ForConditionalGeneration (4.3B parameters)
- LoRA adapters: merged into base weights; no adapter files needed at inference
- Vision tower: Present (inherited from MedGemma base, frozen, not used in CDS or Scribe)
- Audio projector: included in this repository as audio-projector-v3-best.gguf
Audio Projector & Voice Input
The ClinicDx production server combines this model with a MedASR encoder and a lightweight AudioProjector to enable voice-to-CDS inference. The architecture mirrors how Gemma3 integrates vision: a frozen encoder feeds a trainable projector whose output is injected into the LLM's embedding sequence.
Full System Architecture
Patient audio (16kHz mono WAV)
|
v
MedASR Conformer Encoder (frozen, 105M params)
Mel spectrogram (128 bins, hop=160, n_fft=512)
Natural log + 1e-5 clamp normalisation
-> 17-layer Conformer encoder, 512 hidden dim
-> [B, T_enc, 512] (T_enc ≈ audio_seconds × 50)
|
v
AudioProjector v3 (trainable, 11,806,720 params)
Frame stacking k=4: [B, T_enc, 512] -> [B, T_enc/4, 2048]
Linear(2048 -> 2560, bias=False)
RMSNorm(2560)
GELU
Linear(2560 -> 2560, bias=False)
LayerNorm(2560) [ln_final, added in v3]
Pad (learned padding embedding) or truncate to 64 tokens
-> [B, 64, 2560] (MedGemma embedding space)
|
v
ClinicDx V1 Language Model (4.3B params)
<image_soft_token> Γ 64 placeholders in the text sequence
are replaced with projected audio embeddings via masked_scatter
(reuses Gemma3's image token injection mechanism)
|
v
Structured medical observations (key: value format)
AudioProjector Architecture Detail
The Gemma3AudioProjector is the only trainable component during audio projector training. It is a 2-layer MLP with frame stacking and a final LayerNorm:
class Gemma3AudioProjector(nn.Module):
    # Input: [B, T_enc, 512] (MedASR encoder output)
    # Step 1: Frame stacking (k=4): [B, T_enc/4, 2048]
    # Step 2: proj = Sequential(
    #     Linear(2048 -> 2560, bias=False),
    #     RMSNorm(2560),
    #     GELU(),
    #     Linear(2560 -> 2560, bias=False),
    # )
    # Step 3: ln_final = LayerNorm(2560)
    # Step 4: Pad (learned audio_padding_emb) or truncate to 64 tokens
    # Output: [B, 64, 2560]
Trainable parameters breakdown (11,806,720 total):
| Tensor | Shape | Parameters |
|---|---|---|
| audio_padding_emb | [1, 1, 2560] | 2,560 |
| proj.0.weight (Linear 1) | [2560, 2048] | 5,242,880 |
| proj.1.weight (RMSNorm) | [2560] | 2,560 |
| proj.3.weight (Linear 2) | [2560, 2560] | 6,553,600 |
| ln_final.weight (LayerNorm) | [2560] | 2,560 |
| ln_final.bias (LayerNorm) | [2560] | 2,560 |
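The per-tensor counts and the 11,806,720 total follow directly from the shapes and can be sanity-checked in a few lines:

```python
from math import prod

# Tensor shapes from the table above
shapes = {
    "audio_padding_emb": (1, 1, 2560),
    "proj.0.weight": (2560, 2048),
    "proj.1.weight": (2560,),
    "proj.3.weight": (2560, 2560),
    "ln_final.weight": (2560,),
    "ln_final.bias": (2560,),
}
counts = {name: prod(shape) for name, shape in shapes.items()}

assert counts["proj.0.weight"] == 5_242_880   # 2560 x 2048
assert counts["proj.3.weight"] == 6_553_600   # 2560 x 2560
assert sum(counts.values()) == 11_806_720     # matches the stated total
```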
Token budget per audio duration (16 kHz input, hop=160, ~50 frames/sec encoder output, 4× stacking):
| Audio length | Encoder frames | After stacking | After pad/trunc |
|---|---|---|---|
| 1 second | ~50 | ~13 | 64 (padded) |
| 3 seconds | ~150 | ~38 | 64 (padded) |
| 5 seconds | ~250 | ~63 | 64 (padded) |
| 10 seconds | ~500 | ~125 | 64 (truncated) |
| 20 seconds | ~1000 | ~250 | 64 (truncated) |
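The table's arithmetic can be reproduced with the small helper below. The ceiling on the stacked count is inferred from the table rows (e.g. 50 frames → 13 stacked tokens implies the remainder frames are padded rather than dropped); that inference is an assumption, not a documented detail.

```python
import math

FRAMES_PER_SEC = 50   # MedASR encoder output rate (post-subsampling)
STACK = 4             # frame stacking factor k
NUM_TOKENS = 64       # fixed pad/truncate budget

def token_budget(seconds):
    """Return (encoder_frames, stacked_tokens, pad_or_truncate) for a clip."""
    enc = round(seconds * FRAMES_PER_SEC)
    stacked = math.ceil(enc / STACK)  # remainder frames assumed padded, not dropped
    mode = "padded" if stacked <= NUM_TOKENS else "truncated"
    return enc, stacked, mode
```

Clips longer than roughly 5 seconds exceed the 64-token budget, so later audio content is truncated away.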
AudioProjector Training
The projector was trained independently from the CDS LoRA on a clinical audio dataset:
| Parameter | Value |
|---|---|
| Config | train_config_mvp.yaml |
| Trainable params | 11,806,720 (projector only) |
| Frozen params | ~4.5B (base model + MedASR encoder) |
| Training data | 47,500 pre-computed audio clips (MVP audio dataset) |
| Validation data | 2,500 clips (5% held-out split) |
| Batch size | 8 |
| Learning rate | 5.0e-4 (AdamW) |
| Warmup steps | 500 |
| Gradient clipping | 1.0 |
| Epochs | 10 |
| Best checkpoint | Step 40000 (epoch 6 of 10) |
| Best val LM loss | 0.1042 |
| Best val key accuracy | 84.0% |
| Precision | bfloat16 |
Validation history:
| Step | Val LM Loss | Key Accuracy |
|---|---|---|
| 5,000 | 0.1496 | 77.1% |
| 10,000 | 0.1262 | 77.9% |
| 15,000 | 0.1198 | 77.9% |
| 20,000 | 0.1169 | 79.4% |
| 25,000 | 0.1115 | 84.7% |
| 30,000 | 0.1094 | 82.4% |
| 35,000 | 0.1062 | 80.9% |
| 40,000 | 0.1042 (best) | 84.0% |
| 45,000 | 0.1123 | 84.7% |
| 50,000 | 0.1200 | 86.3% |
Step 40,000 produced the best generalisation (lowest val LM loss). Training continued beyond this point but val LM loss did not improve, indicating overfitting onset. The audio-projector-v3-best.gguf file contains the weights from this step.
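The selection criterion is worth making explicit in code: the best checkpoint minimises val LM loss, even though later checkpoints (steps 45,000 and 50,000) score higher on key accuracy. The data below is copied from the validation table above.

```python
# (step, val_lm_loss, key_accuracy) from the validation history table
history = [
    (5_000, 0.1496, 0.771), (10_000, 0.1262, 0.779), (15_000, 0.1198, 0.779),
    (20_000, 0.1169, 0.794), (25_000, 0.1115, 0.847), (30_000, 0.1094, 0.824),
    (35_000, 0.1062, 0.809), (40_000, 0.1042, 0.840), (45_000, 0.1123, 0.847),
    (50_000, 0.1200, 0.863),
]

# Checkpoint selection: minimise val LM loss, not maximise key accuracy
best_step, best_loss, best_acc = min(history, key=lambda row: row[1])
```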
Loss functions:
L_lm          : cross-entropy on target output tokens (main loss)
L_contrastive : cosine similarity between projected audio embeddings
                and concept text embeddings (single-phrase clips only)
L_total = L_lm + 0.1 × L_contrastive
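Since the contrastive term exists only for single-phrase clips, the combined objective is conditional; a trivial sketch (illustrative only, not the training code):

```python
def total_loss(l_lm, l_contrastive=None, weight=0.1):
    """L_total = L_lm + 0.1 * L_contrastive; the contrastive term is only
    available for single-phrase clips, so it is optional here."""
    if l_contrastive is None:
        return l_lm
    return l_lm + weight * l_contrastive
```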
Audio Token Details
The audio pathway reuses Gemma3's existing image token injection mechanism. The audio soft token (<image_soft_token>, ID 262144) is repurposed as the audio placeholder. No new tokens are added to the vocabulary.
| Token | ID | Purpose |
|---|---|---|
| <start_of_image> | 255,999 | Begin-of-audio delimiter (reused for audio) |
| <end_of_image> | 256,000 | End-of-audio delimiter (reused for audio) |
| <image_soft_token> | 262,144 | Audio embedding placeholder (×64 per clip) |
The system prompt uses <start_of_audio> / <end_of_audio> as human-readable markers in the text, while the tokenised form uses the image token IDs above for embedding injection.
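Putting the token IDs together, the tokenised placeholder span for one audio clip looks like the sketch below. The function name is illustrative; the IDs and the 64-slot budget come from the table above, and at inference the 64 soft-token embedding positions are overwritten (via masked_scatter) with the projector output.

```python
START_OF_IMAGE = 255_999    # reused as begin-of-audio delimiter
END_OF_IMAGE = 256_000      # reused as end-of-audio delimiter
IMAGE_SOFT_TOKEN = 262_144  # reused as audio embedding placeholder
NUM_AUDIO_TOKENS = 64       # fixed projector output length

def audio_placeholder_ids():
    """Token IDs for one audio clip: delimiter, 64 placeholder slots, delimiter.
    The 64 placeholder embeddings are later replaced with projected audio."""
    return [START_OF_IMAGE] + [IMAGE_SOFT_TOKEN] * NUM_AUDIO_TOKENS + [END_OF_IMAGE]
```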
Running with llama-server (GGUF, recommended)
The fastest deployment path uses the three GGUF files in this repository with a CUDA-enabled llama-server build that includes --medasr-encoder and --audio-proj support.
Prerequisites
- NVIDIA GPU with ≥8 GB VRAM (≥12 GB recommended for full Q8 + encoder + projector)
- llama-server built with CUDA and MedASR/audio-projector support
- ffmpeg available on the host (for audio transcoding to 16 kHz PCM-16 WAV)
Download GGUFs
pip install huggingface_hub
python - <<'EOF'
from huggingface_hub import snapshot_download
snapshot_download(
repo_id="ClinicDx1/ClinicDx",
allow_patterns=["*.gguf"],
local_dir="./clinicdx-gguf"
)
EOF
Start the Server
llama-server \
--model ./clinicdx-gguf/clinicdx-v1-q8.gguf \
--medasr-encoder ./clinicdx-gguf/medasr-encoder.gguf \
--audio-proj ./clinicdx-gguf/audio-projector-v3-best.gguf \
--n-gpu-layers 999 \
--ctx-size 8192 \
--parallel 1 \
--threads 8 \
--host 0.0.0.0 \
--port 8180
Note: --parallel 1 is required. The audio extraction endpoint (/v1/audio/extract) performs a blocking llama_decode on the shared context; parallel slots cause assertion failures.
Audio Inference (via REST API)
# Transcode browser audio to required format first
ffmpeg -i input.webm -ar 16000 -ac 1 -c:a pcm_s16le output.wav
# Send to the audio extraction endpoint
curl -X POST http://localhost:8180/v1/audio/extract \
-H "Content-Type: audio/wav" \
--data-binary @output.wav
The endpoint returns structured key: value observations matching the clinical manifest provided in the system prompt.
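The returned observations are plain "key: value" lines, which client code can fold into a dict. This parser is a minimal illustration; the actual manifest keys are defined by the system prompt, and the sample keys in the test are invented.

```python
def parse_observations(text):
    """Parse 'key: value' lines from /v1/audio/extract into a dict.
    Assumes one observation per line; lines without a colon are skipped."""
    observations = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            observations[key.strip()] = value.strip()
    return observations
```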
Usage (Text CDS, no audio)
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "ClinicDx1/ClinicDx"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto"
)
prompt = """<start_of_turn>user
Patient: 45-year-old male, 2 days of fever (39.2°C), productive cough, right-sided pleuritic chest pain, decreased breath sounds right base.
<end_of_turn>
<start_of_turn>model
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.1, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Intended Use
- Clinical decision support for trained healthcare professionals
- Structured differential diagnosis generation
- Evidence-grounded treatment planning with KB citations
- Voice-driven clinical observation extraction in low-resource clinical settings
- Not intended for direct patient-facing use or autonomous clinical decision making
Limitations and Open Problems
- No formal clinical validation. Accuracy metrics (86.25% CDS, 84% Scribe key accuracy) are measured on held-out synthetic data, not on real clinical encounters. A prospective evaluation with practicing clinicians is the highest-priority gap.
- English only. All CDS outputs, Scribe extraction, and KB retrieval operate in English. Multilingual support (Swahili, Amharic, Hausa, Yoruba) is not yet implemented.
- Audio token norm mismatch. Projected audio token norms are 360-620x larger than text token norms in the LLM embedding space. The current mitigation uses adaptive norm alignment loss, but this remains an active area of investigation.
- Synthetic training data. Trained on curated synthetic/augmented clinical data; real-world performance may vary.
- KB-dependent for best results. Knowledge base integration requires the ClinicDx retrieval pipeline; standalone use generates structure but without live KB citations.
- Audio projector trained on synthetic speech. Accuracy on natural conversational speech, accented English, or noisy clinical environments may be lower.
- Q8 quantization chosen empirically. Q8_0 was selected over Q4 variants because lower quantization degraded structured output behavior (THINK block coherence, KB query emission) more than perplexity. No systematic ablation study has been conducted.
- No ARM / low-power benchmarks. Validated on x86_64 with NVIDIA GPUs and CPU-only mode. Latency on ARM edge devices is unknown.
- The model may produce plausible-sounding but incorrect clinical information; always verify outputs with a qualified clinician.
License
This model is released under the Gemma Terms of Use. Use is subject to those terms.