chest2vec CT Report Labeler (0.6B)

A weakly-supervised multi-label classifier that reads a free-text chest-CT report and predicts a 137-leaf chest-imaging taxonomy, with a ternary status per label (negative / uncertain / positive).

It also provides a CheXbert / SRR-BERT-style report-comparison F1: label a list of ground-truth reports and a list of generated/predicted reports, then score them against each other (micro / macro / weighted F1) — useful for evaluating radiology report generation.

Base architecture: Qwen/Qwen3-Embedding-0.6B (Apache-2.0)
Adaptation: LoRA (r=16, α=32) merged into the weights + last-token (EOS) pooling + L2-norm + a linear ternary head (1024 → 137 × 3)
Self-contained: the full model (encoder + head) ships in model.safetensors. Loading does not download Qwen3-Embedding weights — the architecture is rebuilt from the bundled config and our weights are loaded in. Tokenizer is bundled too.
Params: ~596M · weights in float32
Training labels: chest2vec/chest2vec_labels (revised CT-RATE, 137-leaf taxonomy)
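
The adaptation line above (last-token pooling, L2-norm, linear ternary head) can be sketched in plain Python. This is a minimal illustration with toy sizes, not the shipped implementation: the function names are hypothetical, the weights are made up, and the label-major reshape of the flat logits into [n_labels][3] is an assumption.

```python
import math

def last_token_pool(hidden_states):
    """With left-padding, the final position of a sequence holds its EOS token."""
    return hidden_states[-1]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def ternary_head(embedding, weight, bias, n_labels):
    """Linear map to n_labels * 3 logits, reshaped to [n_labels][3] (assumed layout)."""
    flat = [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weight, bias)]
    return [flat[i * 3:(i + 1) * 3] for i in range(n_labels)]

# Toy sizes for illustration; the card states 1024 -> 137 x 3 for the real head.
hidden = [[0.0, 0.0, 0.0, 0.0],   # left pad
          [1.0, 2.0, 0.0, 0.0],   # report token
          [3.0, 0.0, 4.0, 0.0]]   # EOS, the position that gets pooled
n_labels = 2
weight = [[0.1 * (r + c) for c in range(4)] for r in range(n_labels * 3)]
bias = [0.0] * (n_labels * 3)

emb = l2_normalize(last_token_pool(hidden))
logits = ternary_head(emb, weight, bias, n_labels)
```

Pooling the last position only works as intended because inputs are left-padded, so every sequence in a batch ends on its own EOS token.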

Label space

The model head predicts 137 leaf labels. They roll up through the chest-imaging hierarchy into 38 upper/container groups and 10 anatomy sections (the label_hierarchy in config.json), so predictions and report-comparison F1 can be reported at leaf, upper, or anatomy granularity.

The model outputs all 137 leaves. In the training data, 136 of them have at least one positive example; the single exception is IVC filter (kept for taxonomy completeness, but it had no positives, so the model effectively never predicts it).
The exact label list is in config.json (labels). Full definitions and per-split counts are in the chest2vec/chest2vec_labels dataset's LABEL_HIERARCHY.md.
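
The card describes the upper and anatomy levels as a max-over-children roll-up of the leaf predictions. A minimal sketch of that idea follows, assuming the hierarchy can be read as a mapping from parent name to child leaf names. The group names and "Pleural thickening" are placeholders chosen for illustration; the real mapping is the label_hierarchy in config.json, whose exact structure is not shown here.

```python
def roll_up(leaf_scores, hierarchy):
    """Parent score = max over the scores of its child leaves."""
    return {parent: max(leaf_scores[child] for child in children)
            for parent, children in hierarchy.items()}

# Hypothetical grouping, for illustration only.
hierarchy = {
    "Pleura (hypothetical group)": ["Pleural effusion", "Pleural thickening"],
    "Heart (hypothetical group)": ["Cardiomegaly", "Coronary artery calcification"],
}

# Binary leaf predictions for one report (1 = positive).
leaf_positive = {
    "Pleural effusion": 1,
    "Pleural thickening": 0,
    "Cardiomegaly": 0,
    "Coronary artery calcification": 0,
}

upper_positive = roll_up(leaf_positive, hierarchy)
```

The same function works on probabilities instead of 0/1 flags, since a parent is then as likely as its most likely child under this rule.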

This model was trained and evaluated on the chest2vec/chest2vec_labels dataset (revised CT-RATE, 137-leaf taxonomy).

Ternary head — softmax(logits, dim=-1) over class indices [0, 1, 2]:

class index	meaning	value
0	negative	0
1	uncertain	-1
2	positive	1

A label is reported positive when P(class=2) ≥ threshold (default 0.5).
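
As a rough sketch of the decoding for a single label: the class-to-value mapping and the positive rule come from the table and sentence above, while taking the ternary status as the argmax class is an assumption about how the ternary output is derived. Function names are hypothetical.

```python
import math

# class index -> ternary value, as in the table above
CLASS_TO_VALUE = {0: 0, 1: -1, 2: 1}  # negative, uncertain, positive

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_label(logits, threshold=0.5):
    """Return (P(positive), binary positive flag, ternary value) for one label."""
    probs = softmax(logits)
    p_positive = probs[2]
    is_positive = int(p_positive >= threshold)
    # Assumption: the ternary status is the most probable of the three classes.
    ternary = CLASS_TO_VALUE[max(range(3), key=lambda i: probs[i])]
    return p_positive, is_positive, ternary

p, pos, tern = decode_label([0.2, 0.1, 2.5])      # confidently positive
p2, pos2, tern2 = decode_label([1.0, 1.2, 1.1])   # near-uniform, uncertain wins
```

Note that the binary flag and the ternary value can disagree under this sketch: lowering the threshold below the argmax margin marks a label positive even when another class is more probable.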

Usage

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("chest2vec/chest2vec_labeler", trust_remote_code=True).eval()
tok   = AutoTokenizer.from_pretrained("chest2vec/chest2vec_labeler", trust_remote_code=True)

reports = ["Bibasilar atelectasis with small bilateral pleural effusions. Cardiomegaly. Coronary artery calcification."]

# 1) human-readable positive labels per report
print(model.label_reports(reports, tokenizer=tok))
# [{'Subsegmental / linear atelectasis': 'positive', 'Pleural effusion': 'positive',
#   'Cardiomegaly': 'positive', 'Coronary artery calcification': 'positive'}]

# 2) full prediction matrices
out = model.predict(reports, tokenizer=tok, threshold=0.5, return_ternary=True)
out["labels"]      # list of 137 label names
out["proba"]       # [N, 137] P(positive)
out["positive"]    # [N, 137] in {0,1}
out["ternary"]     # [N, 137] in {-1,0,1}

CheXbert / SRR-BERT-style report comparison

Label both ground-truth and predicted reports, then compute label-level F1 (GT-labels treated as truth):

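# illustrative inputs: paired lists, gt_reports[i] is the reference for pred_reports[i]
gt_reports   = ["Small right pleural effusion. No pneumothorax."]
pred_reports = ["No pleural effusion or pneumothorax."]

res = model.score_reports(gt_reports, pred_reports, tokenizer=tok)   # equal-length lists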
# scores are reported at three hierarchy levels:
for level in ("leaf", "upper", "anatomy"):
    b = res[level]
    print(level, b["n_labels"], b["micro"]["f1"], b["macro"]["f1"], b["weighted"]["f1"])
print(res["leaf"]["per_label"]["Pleural effusion"])   # {'precision':..,'recall':..,'f1':..,'support_gt':..}

# or a module-level helper that loads the model for you
# (needs modeling_chest2vec_labeler.py from the model repo on your Python path):
from modeling_chest2vec_labeler import report_f1
report_f1(gt_reports, pred_reports, tokenizer=tok)

Each level (leaf = 137 labels, upper = 38 container groups, anatomy = 10 sections) returns micro / macro / weighted precision/recall/F1 plus per_label. Upper/anatomy scores are the max-over-children roll-up of the leaf predictions (model.aggregate_hierarchy(...)). Coarser levels are easier to match, so upper/anatomy F1 are typically higher than leaf.
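
For readers unfamiliar with the three averages, here is a minimal sketch of micro, macro, and weighted F1 computed from two binary label matrices. It is not the code behind score_reports. It assumes a zero score when a denominator is zero and a macro average over every label column, both of which are common conventions rather than documented behavior of this model.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from counts; 0.0 when a denominator is zero."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def f1_summary(gt, pred):
    """gt, pred: equal-shaped lists of rows, each row a list of 0/1 per label."""
    n_labels = len(gt[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for g, p in zip(gt, pred) if g[j] == 1 and p[j] == 1)
        fp = sum(1 for g, p in zip(gt, pred) if g[j] == 0 and p[j] == 1)
        fn = sum(1 for g, p in zip(gt, pred) if g[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    per_label_f1 = [prf(*c)[2] for c in counts]
    support = [tp + fn for tp, _, fn in counts]          # GT positives per label
    # micro: pool the counts over all labels, then compute one F1
    micro = prf(sum(c[0] for c in counts),
                sum(c[1] for c in counts),
                sum(c[2] for c in counts))[2]
    # macro: unweighted mean of per-label F1
    macro = sum(per_label_f1) / n_labels
    # weighted: mean of per-label F1 weighted by GT support
    total = sum(support)
    weighted = (sum(f * s for f, s in zip(per_label_f1, support)) / total
                if total else 0.0)
    return {"micro": micro, "macro": macro, "weighted": weighted}

# Four reports, two labels; GT labels are treated as truth.
gt = [[1, 0], [1, 0], [1, 1], [0, 0]]
pred = [[1, 0], [1, 1], [0, 1], [0, 0]]
scores = f1_summary(gt, pred)
```

Macro treats rare and common labels equally, which is why sparse-tail labels pull it down more than they pull micro or weighted F1.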

Per-label best F1 (threshold tuning)

The default decision threshold is a single global value, but the F1-optimal threshold differs per label. To get the best achievable F1 per label (and the threshold that achieves it) against a ground-truth label set:

# gt: a DataFrame with the 137 label columns (ternary; positive == 1), or a binary array
res = model.per_label_best_f1(reports, gt, tokenizer=tok, level="leaf", min_pos=30)
res["macro_best_f1_min_pos"]                 # macro best-F1 over labels with >= min_pos positives
res["per_label"]["Pleural effusion"]         # {'best_f1':.., 'best_threshold':.., 'n_pos':..}

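Per-label threshold tuning lifts macro-F1 over the fixed 0.5 threshold by about 4 points on the public test set (0.866 → 0.907) and about 6 points on the private set (0.761 → 0.820), measured over labels with ≥30 positives (see Evaluation).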
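
The idea behind a per-label sweep can be sketched as follows. This is an illustration under stated assumptions, not the implementation of per_label_best_f1: the threshold grid (0.01 to 0.99), the tie-breaking (lowest threshold wins), and the function names are all hypothetical.

```python
def f1_at(threshold, probs, truth):
    """F1 of the positive class when predicting positive for p >= threshold."""
    tp = sum(1 for p, t in zip(probs, truth) if p >= threshold and t == 1)
    fp = sum(1 for p, t in zip(probs, truth) if p >= threshold and t == 0)
    fn = sum(1 for p, t in zip(probs, truth) if p < threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1(probs, truth, grid=None):
    """Return (best F1, threshold); ties go to the lowest threshold in the grid."""
    grid = grid or [i / 100 for i in range(1, 100)]
    return max(((f1_at(t, probs, truth), t) for t in grid), key=lambda x: x[0])

def sweep_all_labels(proba, truth, labels, min_pos=30):
    """Per-label best F1, plus the macro over labels with >= min_pos positives."""
    per_label = {}
    for j, name in enumerate(labels):
        col_p = [row[j] for row in proba]
        col_t = [row[j] for row in truth]
        f1, thr = best_f1(col_p, col_t)
        per_label[name] = {"best_f1": f1, "best_threshold": thr, "n_pos": sum(col_t)}
    kept = [v["best_f1"] for v in per_label.values() if v["n_pos"] >= min_pos]
    macro = sum(kept) / len(kept) if kept else None
    return per_label, macro

# Toy data: 4 reports x 2 labels; min_pos lowered so the toy example has a macro.
labels = ["Pleural effusion", "Cardiomegaly"]
proba = [[0.90, 0.20], [0.40, 0.60], [0.35, 0.10], [0.10, 0.05]]
truth = [[1, 0], [1, 1], [0, 0], [0, 0]]
per_label, macro = sweep_all_labels(proba, truth, labels, min_pos=2)
fixed = f1_at(0.5, [row[0] for row in proba], [row[0] for row in truth])
```

In the toy data the first label scores F1 of about 0.67 at the 0.5 threshold but 1.0 once the threshold drops just below 0.4, which is the kind of gain the sweep is meant to surface.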

Inputs & conventions

Input is the findings text; the model was trained on CT-RATE findings and their refined, section-structured form. Internally, each report is wrapped as Instruct: Given the following chest CT report, extract the presence/absence of entities\nQuery: <report>, truncated to 512 tokens, given a trailing EOS token, and left-padded.
For best fidelity, run in float32 (default). bf16 is fine for throughput with negligible drift.
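
The input conventions above can be sketched on plain token-id lists. The prompt string is taken from the card; everything else is an assumption made for illustration: the helper names, the toy token ids, and in particular the choice to truncate to max_len - 1 so that the EOS token fits inside the 512-token budget. The model does this internally, so the sketch is only meant to show what happens to a report.

```python
INSTRUCTION = ("Given the following chest CT report, "
               "extract the presence/absence of entities")

def build_prompt(report):
    """Wrap a report in the instruct/query template quoted in the card."""
    return f"Instruct: {INSTRUCTION}\nQuery: {report}"

def truncate_with_eos(token_ids, eos_id, max_len=512):
    """Keep at most max_len - 1 tokens, then append EOS (assumed budget handling)."""
    return token_ids[:max_len - 1] + [eos_id]

def left_pad(batch, pad_id):
    """Pad on the left so every sequence ends on its own EOS token."""
    width = max(len(ids) for ids in batch)
    input_ids = [[pad_id] * (width - len(ids)) + ids for ids in batch]
    attention_mask = [[0] * (width - len(ids)) + [1] * len(ids) for ids in batch]
    return input_ids, attention_mask

prompt = build_prompt("Small right pleural effusion.")

# Toy ids stand in for real tokenizer output; -1 plays the role of EOS, 0 of PAD.
long_ids = truncate_with_eos(list(range(1, 601)), eos_id=-1)
input_ids, attention_mask = left_pad(
    [truncate_with_eos([11, 12], eos_id=-1), truncate_with_eos([13], eos_id=-1)],
    pad_id=0,
)
```

Left-padding matters here because the embedding is read from the last position; right-padding would put a pad token there instead of EOS.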

Evaluation

How these numbers were produced: run_para_v2_eval.sh runs the model in direct-paragraph mode (full report, max_len 512) and writes eval_ctrate_test_direct.json (public) and eval_sample1000_private.json (private). The metric is macro-F1 of the positive class, obtained by thresholding the softmax probability of class 2 at 0.33 (not the 0.5 default used by predict). Because the all-labels macro is dragged down by sparse-tail labels, the headline rows are restricted to leaf labels with ≥30 positive examples in that eval set: 53 of the 131 evaluated leaves on the public set and 29 of the 60 on the private set. Upper and anatomy rows are the hierarchy roll-up.

CT-RATE revised test (public, 1,464 reports) — from chest2vec/chest2vec_labels test split:

Level	# labels	macro-F1 @0.33	macro-AUC
leaf (≥30 positives)	53	0.875	0.989
leaf (all evaluated)	131	0.749	—
upper (≥30 positives)	27	0.938	0.994
anatomy	10	0.956	0.993

Private evaluation set (1,000 reports) — a held-out internal set, not released:

Level	# labels	macro-F1 @0.33	macro-AUC
leaf (≥30 positives)	29	0.766	0.972
leaf (all evaluated)	60	0.731	—
upper (≥30 positives)	19	0.837	—
anatomy	10	0.869	—

Per-label best F1 (threshold swept per label to maximize F1; macro over leaf labels with ≥30 positives, via model.per_label_best_f1):

Eval set	macro best-F1 (≥30)	macro-F1 @0.5 (≥30)	macro best-F1 (all evaluated)
CT-RATE public	0.907	0.866	0.844
Private	0.820	0.761	0.795

F1-optimal thresholds vary widely by label (~0.04–0.75), so per-label tuning recovers ~4–6 macro-F1 points over the single global 0.5 threshold.

Leaf macro-AUC barely moves from public to private (0.989 → 0.972), which suggests that label ranking transfers to the unseen set and that the F1 gap stems mostly from thresholds and labeling conventions rather than a domain failure. Separately, a radiologist spot-checked the public test labels for 966 reports (857 fully accepted, 60 imperfect but acceptable, 49 failed; see the dataset card).
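
To illustrate why a stable AUC alongside a lower F1 points at thresholds rather than ranking, here is a small sketch with made-up scores. The AUC is computed as the fraction of positive/negative pairs ranked correctly (ties count one half), which is the standard rank interpretation of ROC AUC; the function names and data are hypothetical, and this is not the evaluation script.

```python
def roc_auc(scores, truth):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, t in zip(scores, truth) if t == 1]
    neg = [s for s, t in zip(scores, truth) if t == 0]
    if not pos or not neg:
        return None  # undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_at(threshold, scores, truth):
    """F1 of the positive class when predicting positive for s >= threshold."""
    tp = sum(1 for s, t in zip(scores, truth) if s >= threshold and t == 1)
    fp = sum(1 for s, t in zip(scores, truth) if s >= threshold and t == 0)
    fn = sum(1 for s, t in zip(scores, truth) if s < threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up scores: positives rank above negatives, but all sit below 0.5.
scores = [0.30, 0.20, 0.10, 0.05]
truth = [1, 1, 0, 0]

auc = roc_auc(scores, truth)
f1_default = f1_at(0.5, scores, truth)
f1_shifted = f1_at(0.15, scores, truth)
```

Here the ranking is perfect, yet F1 at 0.5 is zero because no score clears the threshold; moving the threshold to 0.15 recovers a perfect F1 without changing the model's outputs.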

Caveats

Weakly supervised — trained on LLM-generated labels (not radiologist ground truth) derived from report text, not images. Not a medical device; not for clinical use.
IVC filter is in the taxonomy for completeness but had no training positives.
score_reports measures label agreement between two reports as judged by this labeler; like CheXbert-F1 it inherits the labeler's own error modes.

License & attribution

Released under CC-BY-NC-SA-4.0. Built on Qwen/Qwen3-Embedding-0.6B (Apache-2.0) and trained using labels derived from CT-RATE (CC-BY-NC-SA-4.0). If you use this model, cite the CT-RATE paper (arXiv:2403.17834) and acknowledge Qwen3-Embedding. See the dataset card for the full citation.

Paper: A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities (arXiv:2403.17834, published Mar 26, 2024)