---
license: cc-by-nc-sa-4.0
language:
- en
base_model:
- Qwen/Qwen3-Embedding-0.6B
datasets:
- chest2vec/chest2vec_labels
pipeline_tag: text-classification
library_name: transformers
tags:
- radiology
- chest-ct
- report-labeling
- multi-label
- ct-rate
- chexbert-style-f1
---

# chest2vec CT Report Labeler (0.6B)

A weakly-supervised **multi-label classifier** that reads a free-text **chest-CT report** and
predicts a **137-leaf chest-imaging taxonomy**, with a **ternary** status per label
(*negative / uncertain / positive*).

It also provides a **CheXbert / SRR-BERT-style report-comparison F1**: label a list of
ground-truth reports and a list of generated/predicted reports, then score them against each
other (micro / macro / weighted F1) — useful for evaluating radiology report generation.

- **Base architecture:** [`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) (Apache-2.0)
- **Adaptation:** LoRA (r=16, α=32) **merged into the weights** + last-token (EOS) pooling + L2-norm + a linear ternary head (`1024 → 137 × 3`)
- **Self-contained:** the full model (encoder + head) ships in `model.safetensors`. Loading does **not** download Qwen3-Embedding weights — the architecture is rebuilt from the bundled config and our weights are loaded in. Tokenizer is bundled too.
- **Params:** ~596M · weights in float32
- **Training labels:** [`chest2vec/chest2vec_labels`](https://huggingface.co/datasets/chest2vec/chest2vec_labels) (revised CT-RATE, 137-leaf taxonomy)
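
To make the adaptation bullet concrete, here is a minimal pure-Python sketch of the pooling and head steps on toy dimensions (hidden size 4 and 2 labels instead of 1024 and 137). The function names, the toy weights, the presence of a bias term, and the label-major reshape order are all assumptions for illustration; the bundled modeling code is the reference.

```python
import math

def last_token_pool(hidden_states):
    """Pick the final position of one left-padded sequence (the EOS token)."""
    return hidden_states[-1]

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def ternary_head(embedding, weight, bias, n_labels):
    """Linear map hidden_dim -> n_labels * 3, reshaped to [n_labels][3].

    Assumption: each label's three logits are contiguous (label-major order).
    """
    flat = [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weight, bias)]
    return [flat[i * 3:(i + 1) * 3] for i in range(n_labels)]

# Toy inputs: 3 token positions, hidden size 4, 2 labels (all values made up).
hidden_states = [[0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 2.0, 0.0], [3.0, 0.0, 4.0, 0.0]]
weight = [
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],   # label 0: neg / unc / pos
    [0, 0, 0, 1], [1, 0, 1, 0], [0, 0, 0, 0],   # label 1: neg / unc / pos
]
bias = [0.0] * 6

embedding = l2_normalize(last_token_pool(hidden_states))
logits = ternary_head(embedding, weight, bias, n_labels=2)
```

Because the batch is left-padded, the last position of every row is the EOS token, which is why a plain `[-1]` index is enough for pooling in this sketch.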

## Label space

The model head predicts **137 leaf labels**. They roll up through the chest-imaging hierarchy
into **38 upper/container groups** and **10 anatomy sections** (the `label_hierarchy` in
`config.json`), so predictions and report-comparison F1 can be reported at leaf, upper, or
anatomy granularity.

- The model outputs all **137** leaves. In the training data, **136** of them have at least one
  positive example; the single exception is **`IVC filter`** (kept for taxonomy completeness,
  but it had no positives, so the model effectively never predicts it).
- The exact label list is in `config.json` (`labels`). Full definitions and per-split counts are
  in the **[chest2vec/chest2vec_labels](https://huggingface.co/datasets/chest2vec/chest2vec_labels)**
  dataset's [`LABEL_HIERARCHY.md`](https://huggingface.co/datasets/chest2vec/chest2vec_labels/blob/main/LABEL_HIERARCHY.md).

This model was **trained and evaluated on the
[chest2vec/chest2vec_labels](https://huggingface.co/datasets/chest2vec/chest2vec_labels)
dataset** (revised CT-RATE, 137-leaf taxonomy).

**Ternary head:** `softmax(logits, dim=-1)` over class indices `[0, 1, 2]`:

| class index | meaning | value |
|---:|---|---:|
| 0 | negative | 0 |
| 1 | uncertain | -1 |
| 2 | positive | 1 |

A label is reported **positive** when `P(class=2) ≥ threshold` (default **0.5**).
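
The decision rule above can be sketched for a single label as follows. The logit values are made up, and deriving the ternary value from the argmax class is an assumption: the card states only the class-to-value table and the positive threshold, not how `out["ternary"]` is computed.

```python
import math

# Class index -> ternary value, as given in the table above.
CLASS_TO_VALUE = {0: 0, 1: -1, 2: 1}

def softmax(logits):
    """Numerically stable softmax over one label's three logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_label(logits, threshold=0.5):
    """Return (p_positive, is_positive, ternary_value) for one label.

    Assumption: the ternary value is the argmax class mapped through
    CLASS_TO_VALUE; the positive flag follows the card's threshold rule.
    """
    probs = softmax(logits)
    p_pos = probs[2]
    is_positive = int(p_pos >= threshold)
    argmax_class = max(range(3), key=lambda i: probs[i])
    return p_pos, is_positive, CLASS_TO_VALUE[argmax_class]

# Made-up logits in [negative, uncertain, positive] order.
example = {
    "Pleural effusion": [-1.0, 0.0, 3.0],   # positive class dominates
    "Cardiomegaly": [0.2, 2.5, 0.1],        # uncertain class dominates
    "IVC filter": [4.0, -1.0, -2.0],        # negative class dominates
}
decoded = {name: decode_label(logits) for name, logits in example.items()}
```

Note that the positive flag and the ternary value can disagree under this assumption when `threshold` is lowered below 0.5, since a label can clear the threshold without class 2 being the argmax.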

## Usage

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("chest2vec/chest2vec_labeler", trust_remote_code=True).eval()
tok   = AutoTokenizer.from_pretrained("chest2vec/chest2vec_labeler", trust_remote_code=True)

reports = ["Bibasilar atelectasis with small bilateral pleural effusions. Cardiomegaly. Coronary artery calcification."]

# 1) human-readable positive labels per report
print(model.label_reports(reports, tokenizer=tok))
# [{'Subsegmental / linear atelectasis': 'positive', 'Pleural effusion': 'positive',
#   'Cardiomegaly': 'positive', 'Coronary artery calcification': 'positive'}]

# 2) full prediction matrices
out = model.predict(reports, tokenizer=tok, threshold=0.5, return_ternary=True)
out["labels"]      # list of 137 label names
out["proba"]       # [N, 137] P(positive)
out["positive"]    # [N, 137] in {0,1}
out["ternary"]     # [N, 137] in {-1,0,1}
```

### CheXbert / SRR-BERT-style report comparison

Label both the ground-truth and the predicted reports, then compute label-level F1, treating the
labels extracted from the ground-truth reports as truth:

```python
res = model.score_reports(gt_reports, pred_reports, tokenizer=tok)   # equal-length lists
# scores are reported at three hierarchy levels:
for level in ("leaf", "upper", "anatomy"):
    b = res[level]
    print(level, b["n_labels"], b["micro"]["f1"], b["macro"]["f1"], b["weighted"]["f1"])
print(res["leaf"]["per_label"]["Pleural effusion"])   # {'precision':..,'recall':..,'f1':..,'support_gt':..}

# or one-liner that loads the model for you:
from modeling_chest2vec_labeler import report_f1
report_f1(gt_reports, pred_reports, tokenizer=tok)
```

Each level (`leaf` = 137 labels, `upper` = 38 container groups, `anatomy` = 10 sections) returns
`micro` / `macro` / `weighted` precision/recall/F1 plus `per_label`. Upper/anatomy scores are the
max-over-children roll-up of the leaf predictions (`model.aggregate_hierarchy(...)`). Coarser
levels are easier to match, so upper/anatomy F1 are typically higher than leaf.
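
A max-over-children roll-up can be sketched in a few lines. The group and leaf names below are hypothetical placeholders (the real mapping is the `label_hierarchy` in `config.json`), and `aggregate_max` is an illustrative helper, not the signature of `model.aggregate_hierarchy`.

```python
def aggregate_max(leaf_scores, hierarchy):
    """Roll leaf scores up to parent groups by taking the max over children.

    leaf_scores: dict mapping leaf name -> score (probability or 0/1 flag)
    hierarchy:   dict mapping parent name -> list of its leaf names
    """
    return {
        parent: max(leaf_scores[child] for child in children)
        for parent, children in hierarchy.items()
    }

# Hypothetical two-group hierarchy, for illustration only.
toy_hierarchy = {"group_A": ["leaf_a1", "leaf_a2"], "group_B": ["leaf_b1"]}

# Ground truth and prediction disagree on which sibling leaf is positive.
gt_leaf = {"leaf_a1": 1, "leaf_a2": 0, "leaf_b1": 0}
pred_leaf = {"leaf_a1": 0, "leaf_a2": 1, "leaf_b1": 0}

gt_upper = aggregate_max(gt_leaf, toy_hierarchy)
pred_upper = aggregate_max(pred_leaf, toy_hierarchy)
```

In this toy case the leaf vectors differ, but both roll up to `group_A` positive and `group_B` negative, which is the mechanism behind coarser levels being easier to match.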

### Per-label best F1 (threshold tuning)

The default decision threshold is a single global value, but the F1-optimal threshold differs per
label. To get the **best achievable F1 per label** (and the threshold that achieves it) against a
ground-truth label set:

```python
# gt: a DataFrame with the 137 label columns (ternary; positive == 1), or a binary array
res = model.per_label_best_f1(reports, gt, tokenizer=tok, level="leaf", min_pos=30)
res["macro_best_f1_min_pos"]                 # macro best-F1 over labels with >= min_pos positives
res["per_label"]["Pleural effusion"]         # {'best_f1':.., 'best_threshold':.., 'n_pos':..}
```

Per-label threshold tuning lifts macro-F1 by ~4–6 points over the fixed-0.5 threshold (see the
per-label best F1 table under Evaluation).
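
For intuition, a per-label threshold sweep can be sketched as below. This is a simplified stand-in, not the implementation behind `model.per_label_best_f1`: the 0.01 to 0.99 grid, the tie-breaking (lowest threshold wins), and the toy probabilities are assumptions.

```python
def f1_at_threshold(probs, truth, threshold):
    """F1 of the positive class; any truth value other than 1 counts as non-positive."""
    tp = sum(1 for p, t in zip(probs, truth) if p >= threshold and t == 1)
    fp = sum(1 for p, t in zip(probs, truth) if p >= threshold and t != 1)
    fn = sum(1 for p, t in zip(probs, truth) if p < threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1(probs, truth, grid=None):
    """Sweep thresholds for one label and keep the F1-maximizing one."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]   # assumed grid: 0.01 .. 0.99
    best_threshold = max(grid, key=lambda th: f1_at_threshold(probs, truth, th))
    return {
        "best_f1": f1_at_threshold(probs, truth, best_threshold),
        "best_threshold": best_threshold,
        "n_pos": sum(1 for t in truth if t == 1),
    }

# Toy label whose positives score below 0.5 (made-up numbers).
probs = [0.9, 0.4, 0.35, 0.2, 0.1]
truth = [1, 1, 1, 0, 0]

f1_default = f1_at_threshold(probs, truth, 0.5)
tuned = best_f1(probs, truth)
```

Here the fixed 0.5 threshold misses two of the three positives, while any threshold between the highest negative score and the lowest positive score recovers all of them.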

## Inputs & conventions

- Input is the **findings** text (the model was trained on CT-RATE findings + their refined
  section-structured form). Reports are formatted internally as
  `Instruct: Given the following chest CT report, extract the presence/absence of entities\nQuery: <report>`,
  truncated to **512** tokens, with an EOS token appended and left-padding.
- For best fidelity, run in float32 (default). bf16 is fine for throughput with negligible drift.
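
The input convention can be sketched on toy token ids as follows. The prompt template is taken from the bullet above; the helper names, the token ids, and the choice to truncate to `max_len - 1` tokens before appending EOS (so the total stays within 512) are assumptions, since the card does not say whether the EOS token counts toward the limit.

```python
INSTRUCTION = (
    "Given the following chest CT report, "
    "extract the presence/absence of entities"
)

def format_report(report):
    """Wrap a findings text in the instruction template described above."""
    return f"Instruct: {INSTRUCTION}\nQuery: {report}"

def truncate_and_append_eos(token_ids, eos_id, max_len=512):
    """Assumption: keep at most max_len - 1 tokens, then append EOS."""
    return list(token_ids[: max_len - 1]) + [eos_id]

def left_pad(batch, pad_id):
    """Left-pad every row to the longest row; also return an attention mask."""
    width = max(len(ids) for ids in batch)
    padded = [[pad_id] * (width - len(ids)) + ids for ids in batch]
    mask = [[0] * (width - len(ids)) + [1] * len(ids) for ids in batch]
    return padded, mask

# Toy token ids (hypothetical; a real tokenizer would produce these).
EOS_ID, PAD_ID = 2, 0
short = truncate_and_append_eos([5, 6, 7], EOS_ID)
long = truncate_and_append_eos(list(range(10, 1000)), EOS_ID)
padded, mask = left_pad([short, long], PAD_ID)
prompt = format_report("No pleural effusion.")
```

With left-padding, every row ends in the EOS token, so last-token pooling reads the same position for all rows in the batch.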

## Evaluation

**How these numbers were produced:** `run_para_v2_eval.sh` runs the model in **direct-paragraph**
mode (full report, max_len 512) and writes `eval_ctrate_test_direct.json` (public) and
`eval_sample1000_private.json` (private). Metric = **macro-F1** of the **positive class** (softmax
probability of class 2) at **threshold 0.33**. Because the all-labels macro is dragged down by
sparse-tail labels, the **headline restricts to leaf labels with ≥30 positive examples** in that
eval set — **53 of the evaluated leaves on the public set, 29 on the private set**. Upper/anatomy
rows are the hierarchy roll-up.
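
The headline metric can be sketched as follows: threshold P(positive), compute F1 per label, and average only over labels with enough positives. This is an illustrative reimplementation, not the code in `run_para_v2_eval.sh`; the function names and toy data are assumptions, and the toy example uses `min_pos=2` instead of 30 to stay small.

```python
def binary_f1(pred, truth):
    """F1 of the positive class; truth values other than 1 count as non-positive."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t != 1)
    fn = sum(1 for p, t in zip(pred, truth) if p != 1 and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1_min_pos(probs_by_label, truth_by_label, threshold=0.33, min_pos=30):
    """Macro-F1 over labels that have at least `min_pos` positives."""
    scores = {}
    for label, probs in probs_by_label.items():
        truth = truth_by_label[label]
        if sum(1 for t in truth if t == 1) < min_pos:
            continue  # sparse-tail label: excluded from the headline
        pred = [int(p >= threshold) for p in probs]
        scores[label] = binary_f1(pred, truth)
    macro = sum(scores.values()) / len(scores) if scores else float("nan")
    return macro, scores

# Toy data (made up): one well-supported label, one sparse label.
probs_by_label = {"common": [0.9, 0.4, 0.2, 0.1], "rare": [0.1, 0.0, 0.0, 0.0]}
truth_by_label = {"common": [1, 1, 0, 0], "rare": [1, 0, 0, 0]}

headline, kept = macro_f1_min_pos(probs_by_label, truth_by_label, min_pos=2)
all_labels, _ = macro_f1_min_pos(probs_by_label, truth_by_label, min_pos=1)
```

In the toy data a single missed positive on the sparse label halves the all-labels macro, which is the effect the ≥30-positives restriction is meant to remove from the headline number.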

**CT-RATE revised test (public, 1,464 reports)** — from `chest2vec/chest2vec_labels` test split:

| Level | # labels | macro-F1 @0.33 | macro-AUC |
|---|--:|--:|--:|
| leaf (≥30 positives) | 53 | **0.875** | 0.989 |
| leaf (all evaluated) | 131 | 0.749 | — |
| upper (≥30 positives) | 27 | 0.938 | 0.994 |
| anatomy | 10 | 0.956 | 0.993 |

**Private evaluation set (1,000 reports)** — a held-out internal set, not released:

| Level | # labels | macro-F1 @0.33 | macro-AUC |
|---|--:|--:|--:|
| leaf (≥30 positives) | 29 | **0.766** | 0.972 |
| leaf (all evaluated) | 60 | 0.731 | — |
| upper (≥30 positives) | 19 | 0.837 | — |
| anatomy | 10 | 0.869 | — |

**Per-label best F1** (threshold swept per label to maximize F1; macro over leaf labels with ≥30
positives, via `model.per_label_best_f1`):

| Eval set | macro best-F1 (≥30) | macro-F1 @0.5 (≥30) | macro best-F1 (all evaluated) |
|---|--:|--:|--:|
| CT-RATE public | **0.907** | 0.866 | 0.844 |
| Private | **0.820** | 0.761 | 0.795 |

F1-optimal thresholds vary widely by label (~0.04–0.75), so per-label tuning recovers ~4–6 macro-F1
points over a single global threshold.

Leaf macro-AUC barely moves from the public to the private set (**0.989 → 0.972**), so label
ranking transfers to the unseen set. This suggests the F1 gap comes mostly from threshold choice
and labeling-convention differences rather than a domain failure.
Separately, a radiologist spot-checked the labels of **966** public test reports (857 fully
accepted / 60 imperfect-but-acceptable / 49 failed; see the [dataset card](https://huggingface.co/datasets/chest2vec/chest2vec_labels)).
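
The distinction between ranking quality and threshold choice can be illustrated with a small sketch: AUC depends only on how positives rank against negatives, while F1 at a fixed threshold also depends on where the scores sit. The numbers below are made up for illustration and are not model outputs.

```python
def roc_auc(probs, truth):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [p for p, t in zip(probs, truth) if t == 1]
    neg = [p for p, t in zip(probs, truth) if t != 1]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_at_threshold(probs, truth, threshold):
    """F1 of the positive class at a fixed decision threshold."""
    tp = sum(1 for p, t in zip(probs, truth) if p >= threshold and t == 1)
    fp = sum(1 for p, t in zip(probs, truth) if p >= threshold and t != 1)
    fn = sum(1 for p, t in zip(probs, truth) if p < threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

truth = [1, 1, 0, 0]
probs_a = [0.90, 0.80, 0.20, 0.10]   # well-separated scores
probs_b = [0.30, 0.25, 0.05, 0.02]   # same ranking, lower score scale

auc_a, auc_b = roc_auc(probs_a, truth), roc_auc(probs_b, truth)
f1_a = f1_at_threshold(probs_a, truth, 0.33)
f1_b = f1_at_threshold(probs_b, truth, 0.33)
```

Both score sets rank every positive above every negative, so their AUC is identical, yet the second set falls entirely below the 0.33 threshold and its F1 collapses.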

## Caveats

- **Weakly supervised** — trained on LLM-generated labels (not radiologist ground truth) derived
  from report **text**, not images. Not a medical device; not for clinical use.
- `IVC filter` is in the taxonomy for completeness but had no training positives.
- `score_reports` measures **label agreement** between two reports as judged by this labeler;
  like CheXbert-F1 it inherits the labeler's own error modes.

## License & attribution

Released under **CC-BY-NC-SA-4.0**. Built on **`Qwen/Qwen3-Embedding-0.6B`** (Apache-2.0) and
trained using labels derived from **[CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE)**
(CC-BY-NC-SA-4.0). **If you use this model, cite the CT-RATE paper** (arXiv:2403.17834) and
acknowledge Qwen3-Embedding. See the [dataset card](https://huggingface.co/datasets/chest2vec/chest2vec_labels)
for the full citation.