chest2vec CT Report Labeler (0.6B)

A weakly supervised multi-label classifier that reads a free-text chest-CT report and predicts every leaf of a 137-leaf chest-imaging taxonomy, assigning each label a ternary status (negative / uncertain / positive).

It also provides a CheXbert / SRR-BERT-style report-comparison F1: it labels a list of ground-truth reports and a list of generated/predicted reports, then scores the two label sets against each other (micro / macro / weighted F1). This is useful for evaluating radiology report generation.

  • Base architecture: Qwen/Qwen3-Embedding-0.6B (Apache-2.0)
  • Adaptation: LoRA (r=16, α=32) merged into the weights + last-token (EOS) pooling + L2-norm + a linear ternary head (1024 → 137 × 3)
  • Self-contained: the full model (encoder + head) ships in model.safetensors. Loading does not download Qwen3-Embedding weights — the architecture is rebuilt from the bundled config and our weights are loaded in. Tokenizer is bundled too.
  • Params: ~596M · weights in float32
  • Training labels: chest2vec/chest2vec_labels (revised CT-RATE, 137-leaf taxonomy)
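The adaptation bullet above (last-token pooling, L2-norm, linear ternary head) can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the shipped implementation: the random arrays stand in for real encoder outputs and weights, and the presence of a bias term and the label-major reshape order are guesses.

```python
import numpy as np

HIDDEN, N_LABELS, N_CLASSES = 1024, 137, 3  # sizes stated in the list above


def ternary_head(hidden_states, weight, bias):
    """Map encoder outputs [N, T, 1024] to ternary logits [N, 137, 3].

    Assumes left-padded inputs, so position -1 is always the EOS token.
    """
    pooled = hidden_states[:, -1, :]                      # last-token (EOS) pooling
    norm = np.linalg.norm(pooled, axis=-1, keepdims=True)
    pooled = pooled / np.clip(norm, 1e-12, None)          # L2 normalisation
    logits = pooled @ weight + bias                       # [N, 137 * 3]
    # Assumption: logits are laid out label-major, three classes per label.
    return logits.reshape(-1, N_LABELS, N_CLASSES)


rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2, 5, HIDDEN))           # random stand-in for encoder output
weight = rng.normal(size=(HIDDEN, N_LABELS * N_CLASSES)) * 0.02
bias = np.zeros(N_LABELS * N_CLASSES)                     # assumption: the head has a bias

logits = ternary_head(hidden_states, weight, bias)
print(logits.shape)  # (2, 137, 3)
```

Because of the L2 normalisation, rescaling the pooled vector leaves the logits unchanged; only its direction matters to the head.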

Label space

The model head predicts 137 leaf labels. They roll up through the chest-imaging hierarchy into 38 upper/container groups and 10 anatomy sections (the label_hierarchy in config.json), so predictions and report-comparison F1 can be reported at leaf, upper, or anatomy granularity.

  • The model outputs all 137 leaves. In the training data, 136 of them have at least one positive example; the single exception is IVC filter (kept for taxonomy completeness, but it had no positives, so the model effectively never predicts it).
  • The exact label list is in config.json (labels). Full definitions and per-split counts are in the chest2vec/chest2vec_labels dataset's LABEL_HIERARCHY.md.

This model was trained and evaluated on the chest2vec/chest2vec_labels dataset (revised CT-RATE, 137-leaf taxonomy).

Ternary head — softmax(logits, dim=-1) over class indices [0, 1, 2]:

| class index | meaning   | value |
|-------------|-----------|-------|
| 0           | negative  | 0     |
| 1           | uncertain | -1    |
| 2           | positive  | 1     |

A label is reported positive when P(class=2) ≥ threshold (default 0.5).
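A minimal sketch of the decoding rule described above, using toy logits. The softmax, the class-to-value mapping and the P(class=2) threshold come from the text; taking the argmax class as the ternary status is an assumption about how the ternary output is derived.

```python
import numpy as np

# class index 0 / 1 / 2 -> negative / uncertain / positive, encoded as 0 / -1 / 1
CLASS_VALUES = np.array([0, -1, 1])


def decode(logits, threshold=0.5):
    """logits: [N, L, 3]. Returns P(positive), a binary mask and a ternary status."""
    z = logits - logits.max(axis=-1, keepdims=True)       # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    proba = probs[..., 2]                                  # P(class=2)
    positive = (proba >= threshold).astype(int)
    # Assumption: the ternary status is the argmax class mapped to {0, -1, 1}.
    ternary = CLASS_VALUES[probs.argmax(axis=-1)]
    return proba, positive, ternary


# One report, three toy labels, each dominated by a different class.
logits = np.array([[[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]])
proba, positive, ternary = decode(logits)
print(positive.tolist())  # [[0, 0, 1]]
print(ternary.tolist())   # [[0, -1, 1]]
```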

Usage

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("chest2vec/chest2vec_labeler", trust_remote_code=True).eval()
tok   = AutoTokenizer.from_pretrained("chest2vec/chest2vec_labeler", trust_remote_code=True)

reports = ["Bibasilar atelectasis with small bilateral pleural effusions. Cardiomegaly. Coronary artery calcification."]

# 1) human-readable positive labels per report
print(model.label_reports(reports, tokenizer=tok))
# [{'Subsegmental / linear atelectasis': 'positive', 'Pleural effusion': 'positive',
#   'Cardiomegaly': 'positive', 'Coronary artery calcification': 'positive'}]

# 2) full prediction matrices
out = model.predict(reports, tokenizer=tok, threshold=0.5, return_ternary=True)
out["labels"]      # list of 137 label names
out["proba"]       # [N, 137] P(positive)
out["positive"]    # [N, 137] in {0,1}
out["ternary"]     # [N, 137] in {-1,0,1}

CheXbert / SRR-BERT-style report comparison

Label both ground-truth and predicted reports, then compute label-level F1 (GT-labels treated as truth):

res = model.score_reports(gt_reports, pred_reports, tokenizer=tok)   # equal-length lists
# scores are reported at three hierarchy levels:
for level in ("leaf", "upper", "anatomy"):
    b = res[level]
    print(level, b["n_labels"], b["micro"]["f1"], b["macro"]["f1"], b["weighted"]["f1"])
print(res["leaf"]["per_label"]["Pleural effusion"])   # {'precision':..,'recall':..,'f1':..,'support_gt':..}

# or one-liner that loads the model for you:
from modeling_chest2vec_labeler import report_f1
report_f1(gt_reports, pred_reports, tokenizer=tok)

Each level (leaf = 137 labels, upper = 38 container groups, anatomy = 10 sections) returns micro / macro / weighted precision/recall/F1 plus per_label. Upper/anatomy scores are the max-over-children roll-up of the leaf predictions (model.aggregate_hierarchy(...)). Coarser levels are easier to match, so upper/anatomy F1 are typically higher than leaf.
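The roll-up and the three F1 averages can be illustrated on a toy example. This is a sketch, not the code behind score_reports or aggregate_hierarchy: the helper names are hypothetical, and the two-group hierarchy below is made up for illustration (the real one is label_hierarchy in config.json).

```python
import numpy as np


def roll_up(leaf_pred, leaf_names, groups):
    """Max-over-children: a group is positive if any of its child leaves is."""
    index = {name: i for i, name in enumerate(leaf_names)}
    cols = [leaf_pred[:, [index[c] for c in children]].max(axis=1)
            for children in groups.values()]
    return np.stack(cols, axis=1)


def f1_summary(gt, pred):
    """Micro / macro / weighted F1 for binary [N, L] matrices (gt is truth)."""
    tp = ((gt == 1) & (pred == 1)).sum(axis=0)
    fp = ((gt == 0) & (pred == 1)).sum(axis=0)
    fn = ((gt == 1) & (pred == 0)).sum(axis=0)
    denom = 2 * tp + fp + fn
    per_label = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    support = gt.sum(axis=0)                               # GT positives per label
    return {
        "micro": float(2 * tp.sum() / denom.sum()) if denom.sum() else 0.0,
        "macro": float(per_label.mean()),
        "weighted": float((per_label * support).sum() / support.sum())
        if support.sum() else 0.0,
    }


leaf_names = ["Subsegmental / linear atelectasis", "Pleural effusion",
              "Cardiomegaly", "Coronary artery calcification"]
# Hypothetical grouping, for illustration only.
groups = {"Lung and pleura (toy)": leaf_names[:2], "Cardiovascular (toy)": leaf_names[2:]}

gt_leaf = np.array([[1, 0, 1, 0], [0, 1, 0, 0]])     # labels of the GT reports
pred_leaf = np.array([[0, 1, 1, 0], [0, 1, 0, 1]])   # labels of the generated reports

leaf = f1_summary(gt_leaf, pred_leaf)
group = f1_summary(roll_up(gt_leaf, leaf_names, groups),
                   roll_up(pred_leaf, leaf_names, groups))
print(round(leaf["micro"], 3), round(group["micro"], 3))  # 0.571 0.857
```

In this toy case the first generated report names the wrong lung/pleura finding, which costs leaf-level F1 but still matches at the group level, which is why coarser levels tend to score higher.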

Per-label best F1 (threshold tuning)

The default decision threshold is a single global value, but the F1-optimal threshold differs per label. To get the best achievable F1 per label (and the threshold that achieves it) against a ground-truth label set:

# gt: a DataFrame with the 137 label columns (ternary; positive == 1), or a binary array
res = model.per_label_best_f1(reports, gt, tokenizer=tok, level="leaf", min_pos=30)
res["macro_best_f1_min_pos"]                 # macro best-F1 over labels with >= min_pos positives
res["per_label"]["Pleural effusion"]         # {'best_f1':.., 'best_threshold':.., 'n_pos':..}

Per-label threshold tuning lifts macro-F1 by ~4–6 points over the fixed-0.5 threshold (see the per-label best-F1 table under Evaluation).
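The idea behind a per-label threshold sweep can be sketched as below. This is not the implementation of per_label_best_f1: the function name, the 0.01-step threshold grid and the tie-breaking rule (lowest threshold wins) are assumptions made for illustration.

```python
import numpy as np


def f1_at(scores, y, t):
    """F1 of the positive class when predicting positive for scores >= t."""
    pred = scores >= t
    tp = int(np.sum(pred & y))
    fp = int(np.sum(pred & ~y))
    fn = int(np.sum(~pred & y))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_best_f1(proba, gt, thresholds=None, min_pos=30):
    """proba, gt: [N, L]. Returns per-label results and the macro over labels
    with at least min_pos positives."""
    if thresholds is None:
        thresholds = np.linspace(0.01, 0.99, 99)           # assumed grid
    per_label = []
    for j in range(gt.shape[1]):
        y = gt[:, j] == 1
        best_f1, best_t = 0.0, float(thresholds[0])
        for t in thresholds:
            f1 = f1_at(proba[:, j], y, t)
            if f1 > best_f1:                                # first (lowest) best threshold wins
                best_f1, best_t = f1, float(t)
        per_label.append({"best_f1": best_f1, "best_threshold": best_t,
                          "n_pos": int(y.sum())})
    kept = [r["best_f1"] for r in per_label if r["n_pos"] >= min_pos]
    macro = float(np.mean(kept)) if kept else float("nan")
    return per_label, macro


# Toy label whose positives score 0.9 and 0.4: a 0.5 cut-off misses one of them.
proba = np.array([[0.9], [0.4], [0.3], [0.1]])
gt = np.array([[1], [1], [0], [0]])

fixed = f1_at(proba[:, 0], gt[:, 0] == 1, 0.5)
per_label, macro = sweep_best_f1(proba, gt, min_pos=1)
print(round(fixed, 3), per_label[0]["best_f1"])  # 0.667 1.0
```

Note that a best-F1 tuned on the same set it is evaluated on is an optimistic upper bound; thresholds meant for deployment would normally be chosen on a separate validation split.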

Inputs & conventions

  • Input is the findings text (the model was trained on CT-RATE findings and their refined, section-structured form). Reports are formatted internally as "Instruct: Given the following chest CT report, extract the presence/absence of entities\nQuery: <report>", truncated to 512 tokens, then given an appended EOS token and left-padded.
  • For best fidelity, run in float32 (default). bf16 is fine for throughput with negligible drift.
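The input convention above can be sketched without the tokenizer. The prompt template is taken from the text; the helper names and the token ids are hypothetical, and counting the appended EOS inside the 512-token budget is an assumption. The model does all of this internally, so the sketch is only meant to show why left-padding matters for last-token pooling.

```python
INSTRUCTION = ("Given the following chest CT report, "
               "extract the presence/absence of entities")


def format_report(report):
    """Wrap a report in the instruction template quoted above."""
    return f"Instruct: {INSTRUCTION}\nQuery: {report}"


def left_pad_with_eos(sequences, eos_id, pad_id, max_len=512):
    """Truncate, append EOS, then pad on the left so EOS is always last."""
    # Assumption: the EOS token counts toward max_len.
    rows = [list(seq[: max_len - 1]) + [eos_id] for seq in sequences]
    width = max(len(r) for r in rows)
    input_ids = [[pad_id] * (width - len(r)) + r for r in rows]
    attention_mask = [[0] * (width - len(r)) + [1] * len(r) for r in rows]
    return input_ids, attention_mask


text = format_report("Cardiomegaly. Small bilateral pleural effusions.")

# Made-up token ids for two reports of different length.
input_ids, attention_mask = left_pad_with_eos([[11, 12, 13], [21]], eos_id=2, pad_id=0)
print(input_ids)       # [[11, 12, 13, 2], [0, 0, 21, 2]]
print(attention_mask)  # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

With left-padding, the last column holds the EOS token for every row, so pooling can simply read position -1 regardless of report length.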

Evaluation

How these numbers were produced: run_para_v2_eval.sh runs the model in direct-paragraph mode (full report, max_len 512) and writes eval_ctrate_test_direct.json (public) and eval_sample1000_private.json (private). The metric is macro-F1 of the positive class (softmax probability of class 2) at threshold 0.33; note that this differs from the 0.5 default used by predict. Because the all-labels macro is dragged down by sparse-tail labels, the headline figure is restricted to leaf labels with ≥30 positive examples in the given eval set: 53 of the evaluated leaves on the public set and 29 on the private set. Upper/anatomy rows are the hierarchy roll-up.

CT-RATE revised test (public, 1,464 reports) — from chest2vec/chest2vec_labels test split:

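| Level                 | # labels | macro-F1 @0.33 | macro-AUC |
|-----------------------|----------|----------------|-----------|
| leaf (≥30 positives)  | 53       | 0.875          | 0.989     |
| leaf (all evaluated)  | 131      | 0.749          | n/a       |
| upper (≥30 positives) | 27       | 0.938          | 0.994     |
| anatomy               | 10       | 0.956          | 0.993     |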

Private evaluation set (1,000 reports) — a held-out internal set, not released:

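| Level                 | # labels | macro-F1 @0.33 | macro-AUC |
|-----------------------|----------|----------------|-----------|
| leaf (≥30 positives)  | 29       | 0.766          | 0.972     |
| leaf (all evaluated)  | 60       | 0.731          | n/a       |
| upper (≥30 positives) | 19       | 0.837          | n/a       |
| anatomy               | 10       | 0.869          | n/a       |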

Per-label best F1 (threshold swept per label to maximize F1; macro over leaf labels with ≥30 positives, via model.per_label_best_f1):

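| Eval set       | macro best-F1 (≥30) | macro-F1 @0.5 (≥30) | macro best-F1 (all evaluated) |
|----------------|---------------------|---------------------|-------------------------------|
| CT-RATE public | 0.907               | 0.866               | 0.844                         |
| Private        | 0.820               | 0.761               | 0.795                         |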

F1-optimal thresholds vary widely by label (~0.04–0.75), so per-label tuning recovers ~4–6 macro-F1 points over a single global threshold.

Leaf macro-AUC barely moves from public to private (0.989 → 0.972), i.e. label ranking transfers to the unseen set; the F1 gap mostly reflects threshold and labeling-convention differences rather than a domain failure. Separately, a radiologist spot-checked the public test labels for 966 reports (857 fully accepted / 60 imperfect-but-acceptable / 49 failed; see the dataset card).
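The distinction drawn above between ranking quality (AUC) and thresholded F1 can be made concrete with a toy label. This is an illustrative sketch with made-up scores, not the evaluation script; the pairwise AUC below is the standard rank-based definition, with ties counted as half.

```python
import numpy as np


def auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")                                # undefined without both classes
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def f1_at(scores, labels, t):
    """F1 of the positive class when predicting positive for scores >= t."""
    pred = scores >= t
    tp = int(np.sum(pred & (labels == 1)))
    fp = int(np.sum(pred & (labels == 0)))
    fn = int(np.sum(~pred & (labels == 1)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up scores: positives are ranked above negatives, but all scores are low.
scores = np.array([0.30, 0.20, 0.10, 0.05])
labels = np.array([1, 1, 0, 0])

print(auc(scores, labels))          # 1.0
print(f1_at(scores, labels, 0.33))  # 0.0
print(f1_at(scores, labels, 0.15))  # 1.0
```

A perfectly ranked label can therefore still score zero F1 at a fixed cut-off, which is the pattern a stable AUC combined with a lower F1 points to.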

Caveats

  • Weakly supervised — trained on LLM-generated labels (not radiologist ground truth) derived from report text, not images. Not a medical device; not for clinical use.
  • IVC filter is in the taxonomy for completeness but had no training positives.
  • score_reports measures label agreement between two reports as judged by this labeler; like CheXbert-F1 it inherits the labeler's own error modes.

License & attribution

Released under CC-BY-NC-SA-4.0. Built on Qwen/Qwen3-Embedding-0.6B (Apache-2.0) and trained using labels derived from CT-RATE (CC-BY-NC-SA-4.0). If you use this model, cite the CT-RATE paper (arXiv:2403.17834) and acknowledge Qwen3-Embedding. See the dataset card for the full citation.
