---
license: cc-by-nc-sa-4.0
language:
- en
base_model:
- Qwen/Qwen3-Embedding-0.6B
datasets:
- chest2vec/chest2vec_labels
pipeline_tag: text-classification
library_name: transformers
tags:
- radiology
- chest-ct
- report-labeling
- multi-label
- ct-rate
- chexbert-style-f1
---

# chest2vec CT Report Labeler (0.6B)

A weakly-supervised **multi-label classifier** that reads a free-text **chest-CT report** and
predicts a **137-leaf chest-imaging taxonomy**, with a **ternary** status per label
(*negative / uncertain / positive*).

It also provides a **CheXbert / SRR-BERT-style report-comparison F1**: label a list of
ground-truth reports and a list of generated/predicted reports, then score them against each
other (micro / macro / weighted F1) — useful for evaluating radiology report generation.

- **Base architecture:** [`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) (Apache-2.0)
- **Adaptation:** LoRA (r=16, α=32) **merged into the weights** + last-token (EOS) pooling + L2-norm + a linear ternary head (`1024 → 137 × 3`)
- **Self-contained:** the full model (encoder + head) ships in `model.safetensors`. Loading does **not** download Qwen3-Embedding weights — the architecture is rebuilt from the bundled config and our weights are loaded in. Tokenizer is bundled too.
- **Params:** ~596M · weights in float32
- **Training labels:** [`chest2vec/chest2vec_labels`](https://huggingface.co/datasets/chest2vec/chest2vec_labels) (revised CT-RATE, 137-leaf taxonomy)
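
The adaptation bullet above can be pictured as a short forward pass over the encoder's hidden states. The following is a minimal NumPy sketch, not the shipped implementation: the function name `ternary_head_logits`, the random weights, and the presence of a bias term are assumptions made for illustration, and it assumes left-padded inputs so that the last position holds the EOS token.

```python
import numpy as np

HIDDEN, N_LABELS, N_CLASSES = 1024, 137, 3  # sizes stated in this card

def ternary_head_logits(hidden_states, weight, bias):
    """hidden_states: [N, T, HIDDEN] encoder outputs for left-padded inputs."""
    pooled = hidden_states[:, -1, :]                                  # last-token (EOS) pooling
    pooled = pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)  # L2-normalize
    logits = pooled @ weight + bias                                   # [N, N_LABELS * N_CLASSES]
    return logits.reshape(-1, N_LABELS, N_CLASSES)                    # one logit triple per label

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2, 8, HIDDEN))                   # stand-in for encoder output
weight = rng.normal(size=(HIDDEN, N_LABELS * N_CLASSES)) * 0.02   # illustrative random weights
bias = np.zeros(N_LABELS * N_CLASSES)                             # ASSUMPTION: head has a bias

logits = ternary_head_logits(hidden_states, weight, bias)
print(logits.shape)  # (2, 137, 3)
```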

## Label space

The model head predicts **137 leaf labels**. They roll up through the chest-imaging hierarchy
into **38 upper/container groups** and **10 anatomy sections** (the `label_hierarchy` in
`config.json`), so predictions and report-comparison F1 can be reported at leaf, upper, or
anatomy granularity.

- The model outputs all **137** leaves. In the training data, **136** of them have at least one
  positive example; the single exception is **`IVC filter`** (kept for taxonomy completeness,
  but it had no positives, so the model effectively never predicts it).
- The exact label list is in `config.json` (`labels`). Full definitions and per-split counts are
  in the **[chest2vec/chest2vec_labels](https://huggingface.co/datasets/chest2vec/chest2vec_labels)**
  dataset's [`LABEL_HIERARCHY.md`](https://huggingface.co/datasets/chest2vec/chest2vec_labels/blob/main/LABEL_HIERARCHY.md).

This model was **trained on the
[chest2vec/chest2vec_labels](https://huggingface.co/datasets/chest2vec/chest2vec_labels)
dataset** (revised CT-RATE, 137-leaf taxonomy), and the public evaluation below uses its test split.

**Ternary head** — `softmax(logits, dim=-1)` over class indices `[0, 1, 2]`:

| class index | meaning | ternary value |
|---:|---|---:|
| 0 | negative | 0 |
| 1 | uncertain | -1 |
| 2 | positive | 1 |

A label is reported **positive** when `P(class=2) ≥ threshold` (default **0.5**).
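
To make the decoding rule concrete, here is a minimal NumPy sketch that turns a `[N, n_labels, 3]` logit tensor into positive-class probabilities, binary positives, and ternary values. The helper name `decode_ternary` is hypothetical, and deriving the ternary value from the argmax class is an assumption; the card only specifies the thresholding rule for positives.

```python
import numpy as np

CLASS_TO_VALUE = np.array([0, -1, 1])  # class index 0/1/2 -> negative/uncertain/positive

def decode_ternary(logits, threshold=0.5):
    """logits: [N, n_labels, 3]. Returns P(positive), binary positives, ternary values."""
    shifted = logits - logits.max(axis=-1, keepdims=True)    # numerically stable softmax
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    proba_pos = probs[..., 2]                                 # P(class = 2)
    positive = (proba_pos >= threshold).astype(int)           # rule stated in this card
    ternary = CLASS_TO_VALUE[probs.argmax(axis=-1)]           # ASSUMPTION: argmax decoding
    return proba_pos, positive, ternary

logits = np.array([[[2.0, 0.0, 0.0],    # leaning negative
                    [0.0, 0.0, 3.0],    # leaning positive
                    [0.0, 2.0, 0.5]]])  # leaning uncertain
proba_pos, positive, ternary = decode_ternary(logits)
print(positive.tolist())  # [[0, 1, 0]]
print(ternary.tolist())   # [[0, 1, -1]]
```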

## Usage

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("chest2vec/chest2vec_labeler", trust_remote_code=True).eval()
tok   = AutoTokenizer.from_pretrained("chest2vec/chest2vec_labeler", trust_remote_code=True)

reports = ["Bibasilar atelectasis with small bilateral pleural effusions. Cardiomegaly. Coronary artery calcification."]

# 1) human-readable positive labels per report
print(model.label_reports(reports, tokenizer=tok))
# [{'Subsegmental / linear atelectasis': 'positive', 'Pleural effusion': 'positive',
#   'Cardiomegaly': 'positive', 'Coronary artery calcification': 'positive'}]

# 2) full prediction matrices
out = model.predict(reports, tokenizer=tok, threshold=0.5, return_ternary=True)
out["labels"]      # list of 137 label names
out["proba"]       # [N, 137] P(positive)
out["positive"]    # [N, 137] in {0,1}
out["ternary"]     # [N, 137] in {-1,0,1}
```

### CheXbert / SRR-BERT-style report comparison

Label both ground-truth and predicted reports, then compute label-level F1 (GT-labels treated
as truth):

```python
res = model.score_reports(gt_reports, pred_reports, tokenizer=tok)   # equal-length lists
# scores are reported at three hierarchy levels:
for level in ("leaf", "upper", "anatomy"):
    b = res[level]
    print(level, b["n_labels"], b["micro"]["f1"], b["macro"]["f1"], b["weighted"]["f1"])
print(res["leaf"]["per_label"]["Pleural effusion"])   # {'precision':..,'recall':..,'f1':..,'support_gt':..}

# or one-liner that loads the model for you:
from modeling_chest2vec_labeler import report_f1
report_f1(gt_reports, pred_reports, tokenizer=tok)
```

Each level (`leaf` = 137 labels, `upper` = 38 container groups, `anatomy` = 10 sections) returns
`micro` / `macro` / `weighted` precision/recall/F1 plus `per_label`. Upper/anatomy scores are the
max-over-children roll-up of the leaf predictions (`model.aggregate_hierarchy(...)`). Coarser
levels are easier to match, so upper/anatomy F1 scores are typically higher than leaf scores.
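
The max-over-children roll-up can be sketched in a few lines of NumPy. This is an illustration rather than the body of `model.aggregate_hierarchy`: the helper `rollup_max` and the two group names are hypothetical, and the real parent/child mapping lives in `label_hierarchy` in `config.json`.

```python
import numpy as np

def rollup_max(leaf_scores, leaf_names, groups):
    """leaf_scores: [N, n_leaves]; groups: parent name -> list of child leaf names."""
    col = {name: i for i, name in enumerate(leaf_names)}
    parents = list(groups)
    # A parent's score is the maximum over its children's scores.
    rolled = np.stack(
        [leaf_scores[:, [col[c] for c in groups[p]]].max(axis=1) for p in parents],
        axis=1,
    )
    return parents, rolled

leaf_names = ["Pleural effusion", "Cardiomegaly", "Coronary artery calcification"]
groups = {  # hypothetical grouping, for illustration only
    "pleura_group": ["Pleural effusion"],
    "cardiac_group": ["Cardiomegaly", "Coronary artery calcification"],
}
leaf_proba = np.array([[0.9, 0.1, 0.7],
                       [0.2, 0.3, 0.05]])

parents, rolled = rollup_max(leaf_proba, leaf_names, groups)
print(parents)          # ['pleura_group', 'cardiac_group']
print(rolled.tolist())  # [[0.9, 0.7], [0.2, 0.3]]
```

Because a parent fires whenever any child fires, two reports that disagree on the exact leaf can still agree at the parent level, which is one reason coarser levels tend to score higher.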

### Per-label best F1 (threshold tuning)

The default decision threshold is a single global value, but the F1-optimal threshold differs per
label. To get the **best achievable F1 per label** (and the threshold that achieves it) against a
ground-truth label set:

```python
# gt: a DataFrame with the 137 label columns (ternary; positive == 1), or a binary array
res = model.per_label_best_f1(reports, gt, tokenizer=tok, level="leaf", min_pos=30)
res["macro_best_f1_min_pos"]                 # macro best-F1 over labels with >= min_pos positives
res["per_label"]["Pleural effusion"]         # {'best_f1':.., 'best_threshold':.., 'n_pos':..}
```

Per-label threshold tuning lifts macro-F1 by about 4 to 6 points over the fixed 0.5 threshold on labels with ≥30 positives (see the per-label best F1 table under Evaluation).
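
For intuition, a per-label threshold sweep can be sketched as below. This is not the code behind `model.per_label_best_f1`; the helper names, the 0.01 to 0.99 grid, and the tie-breaking rule (lowest threshold wins) are assumptions, and the toy data is invented.

```python
import numpy as np

def f1_at(y_true, proba, threshold):
    """Binary F1 for one label at a given decision threshold."""
    pred = proba >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def best_threshold(y_true, proba, grid=np.linspace(0.01, 0.99, 99)):
    """Sweep the grid and return (threshold, F1) for the first maximum."""
    scores = [f1_at(y_true, proba, t) for t in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), scores[best]

# Toy label: the positives score lower than 0.5 but still rank above the negatives.
y_true = np.array([1, 1, 1, 0, 0, 0])
proba = np.array([0.90, 0.45, 0.35, 0.303, 0.10, 0.05])

print(f1_at(y_true, proba, 0.5))   # 0.5
t, f1 = best_threshold(y_true, proba)
print(round(t, 2), f1)             # 0.31 1.0
```

The toy label has a perfect ranking but a poor F1 at 0.5, which is the situation where a tuned threshold helps most.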

## Inputs & conventions

- Input is the **findings** text (the model was trained on CT-RATE findings + their refined
  section-structured form). Reports are formatted internally as
  `Instruct: Given the following chest CT report, extract the presence/absence of entities\nQuery: <report>`,
  truncated to **512** tokens, with an EOS token appended, and left-padded.
- For best fidelity, run in float32 (default). bf16 is fine for throughput with negligible drift.
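
The formatting convention above can be sketched without any tokenizer dependency. The model applies this internally, so the code is only for illustration: `format_report` and `encode_batch` are hypothetical helpers, the token IDs are made up, and counting the EOS token toward the 512-token budget is an assumption.

```python
INSTRUCTION = "Given the following chest CT report, extract the presence/absence of entities"
MAX_LEN = 512

def format_report(report):
    """Wrap a findings text in the instruction template described above."""
    return f"Instruct: {INSTRUCTION}\nQuery: {report}"

def encode_batch(token_id_lists, eos_id, pad_id, max_len=MAX_LEN):
    """Truncate, append EOS, then left-pad so EOS is the last position of every row."""
    # ASSUMPTION: the EOS token counts toward max_len.
    rows = [ids[: max_len - 1] + [eos_id] for ids in token_id_lists]
    width = max(len(r) for r in rows)
    input_ids = [[pad_id] * (width - len(r)) + r for r in rows]
    attention_mask = [[0] * (width - len(r)) + [1] * len(r) for r in rows]
    return input_ids, attention_mask

print(format_report("No pleural effusion."))
ids, mask = encode_batch([[11, 12, 13], [21]], eos_id=2, pad_id=0)  # made-up token IDs
print(ids)   # [[11, 12, 13, 2], [0, 0, 21, 2]]
print(mask)  # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

Left-padding keeps the EOS token at index `-1` for every row, which is what makes last-token pooling a simple slice.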

## Evaluation

**How these numbers were produced:** `run_para_v2_eval.sh` runs the model in **direct-paragraph**
mode (full report, max_len 512) and writes `eval_ctrate_test_direct.json` (public) and
`eval_sample1000_private.json` (private). Metric = **macro-F1** of the **positive class** (softmax
probability of class 2) at **threshold 0.33**. Because the all-labels macro is dragged down by
sparse-tail labels, the **headline restricts to leaf labels with ≥30 positive examples** in that
eval set — **53 of the evaluated leaves on the public set, 29 on the private set**. Upper/anatomy
rows are the hierarchy roll-up.
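
The headline metric can be illustrated with a small NumPy sketch. It is not the code in `run_para_v2_eval.sh`: the helper `macro_f1_min_pos` is hypothetical, the toy arrays are invented, and labels with no predictions and no positives are scored as F1 = 0 by assumption.

```python
import numpy as np

def macro_f1_min_pos(y_true, proba, threshold=0.33, min_pos=30):
    """y_true: [N, L] binary truth; proba: [N, L] P(positive).
    Returns (macro-F1 over labels with >= min_pos positives, number of labels kept)."""
    pred = proba >= threshold
    truth = y_true == 1
    tp = (pred & truth).sum(axis=0)
    fp = (pred & ~truth).sum(axis=0)
    fn = (~pred & truth).sum(axis=0)
    denom = 2 * tp + fp + fn
    # Per-label F1; labels with an empty denominator get 0 (assumption).
    f1 = np.divide(2 * tp, denom, out=np.zeros(len(denom)), where=denom > 0)
    keep = truth.sum(axis=0) >= min_pos
    return float(f1[keep].mean()), int(keep.sum())  # mean is NaN if no label is kept

# Toy data: label 0 has two positives, label 1 is a sparse-tail label with one.
y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])
proba = np.array([[0.9, 0.1], [0.4, 0.2], [0.5, 0.2], [0.1, 0.1]])

print(macro_f1_min_pos(y_true, proba, min_pos=2))  # (0.8, 1)
print(macro_f1_min_pos(y_true, proba, min_pos=0))  # (0.4, 2)
```

Dropping the sparse label raises the toy macro-F1 from 0.4 to 0.8, mirroring why the restricted and all-labels rows in the tables differ.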

**CT-RATE revised test (public, 1,464 reports)** — from `chest2vec/chest2vec_labels` test split:

| Level | # labels | macro-F1 @0.33 | macro-AUC |
|---|--:|--:|--:|
| leaf (≥30 positives) | 53 | **0.875** | 0.989 |
| leaf (all evaluated) | 131 | 0.749 | — |
| upper (≥30 positives) | 27 | 0.938 | 0.994 |
| anatomy | 10 | 0.956 | 0.993 |

**Private evaluation set (1,000 reports)** — a held-out internal set, not released:

| Level | # labels | macro-F1 @0.33 | macro-AUC |
|---|--:|--:|--:|
| leaf (≥30 positives) | 29 | **0.766** | 0.972 |
| leaf (all evaluated) | 60 | 0.731 | — |
| upper (≥30 positives) | 19 | 0.837 | — |
| anatomy | 10 | 0.869 | — |

**Per-label best F1** (threshold swept per label to maximize F1; macro over leaf labels with ≥30
positives, via `model.per_label_best_f1`):

| Eval set | macro best-F1 (≥30) | macro-F1 @0.5 (≥30) | macro best-F1 (all evaluated) |
|---|--:|--:|--:|
| CT-RATE public | **0.907** | 0.866 | 0.844 |
| Private | **0.820** | 0.761 | 0.795 |

F1-optimal thresholds vary widely by label (~0.04–0.75), so per-label tuning recovers ~4–6 macro-F1
points over a single global threshold.

Leaf macro-AUC barely moves from public to private (**0.989 → 0.972**), i.e. label ranking transfers
to the unseen set; the F1 gap mostly reflects threshold choice and labeling-convention differences
rather than a domain failure. Separately, a radiologist spot-checked the labels of **966** public
test reports (857 fully accepted / 60 imperfect-but-acceptable / 49 failed; see the [dataset card](https://huggingface.co/datasets/chest2vec/chest2vec_labels)).

## Caveats

- **Weakly supervised** — trained on LLM-generated labels (not radiologist ground truth) derived
  from report **text**, not images. Not a medical device; not for clinical use.
- `IVC filter` is in the taxonomy for completeness but had no training positives.
- `score_reports` measures **label agreement** between two reports as judged by this labeler;
  like CheXbert-F1 it inherits the labeler's own error modes.

## License & attribution

Released under **CC-BY-NC-SA-4.0**. Built on **`Qwen/Qwen3-Embedding-0.6B`** (Apache-2.0) and
trained using labels derived from **[CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE)**
(CC-BY-NC-SA-4.0). **If you use this model, cite the CT-RATE paper** (arXiv:2403.17834) and
acknowledge Qwen3-Embedding. See the [dataset card](https://huggingface.co/datasets/chest2vec/chest2vec_labels)
for the full citation.