---
license: cc-by-nc-sa-4.0
language:
- en
base_model:
- Qwen/Qwen3-Embedding-0.6B
datasets:
- chest2vec/chest2vec_labels
pipeline_tag: text-classification
library_name: transformers
tags:
- radiology
- chest-ct
- report-labeling
- multi-label
- ct-rate
- chexbert-style-f1
---
# chest2vec CT Report Labeler (0.6B)
A weakly-supervised **multi-label classifier** that reads a free-text **chest-CT report** and
assigns each label in a **137-leaf chest-imaging taxonomy** a **ternary** status
(*negative / uncertain / positive*).
It also provides a **CheXbert / SRR-BERT-style report-comparison F1**: label a list of
ground-truth reports and a list of generated/predicted reports, then score them against each
other (micro / macro / weighted F1) — useful for evaluating radiology report generation.
- **Base architecture:** [`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) (Apache-2.0)
- **Adaptation:** LoRA (r=16, α=32) **merged into the weights** + last-token (EOS) pooling + L2-norm + a linear ternary head (`1024 → 137 × 3`)
- **Self-contained:** the full model (encoder + head) ships in `model.safetensors`. Loading does **not** download Qwen3-Embedding weights — the architecture is rebuilt from the bundled config and our weights are loaded in. Tokenizer is bundled too.
- **Params:** ~596M · weights in float32
- **Training labels:** [`chest2vec/chest2vec_labels`](https://huggingface.co/datasets/chest2vec/chest2vec_labels) (revised CT-RATE, 137-leaf taxonomy)
## Label space
The model head predicts **137 leaf labels**. They roll up through the chest-imaging hierarchy
into **38 upper/container groups** and **10 anatomy sections** (the `label_hierarchy` in
`config.json`), so predictions and report-comparison F1 can be reported at leaf, upper, or
anatomy granularity.
- The model outputs all **137** leaves. In the training data, **136** of them have at least one
positive example; the single exception is **`IVC filter`** (kept for taxonomy completeness,
but it had no positives, so the model effectively never predicts it).
- The exact label list is in `config.json` (`labels`). Full definitions and per-split counts are
in the **[chest2vec/chest2vec_labels](https://huggingface.co/datasets/chest2vec/chest2vec_labels)**
dataset's [`LABEL_HIERARCHY.md`](https://huggingface.co/datasets/chest2vec/chest2vec_labels/blob/main/LABEL_HIERARCHY.md).
This model was **trained and evaluated on the
[chest2vec/chest2vec_labels](https://huggingface.co/datasets/chest2vec/chest2vec_labels)
dataset** (revised CT-RATE, 137-leaf taxonomy).
**Ternary head** — `softmax(logits, dim=-1)` over class indices `[0, 1, 2]`:
| class index | meaning | ternary value |
|---:|---|---:|
| 0 | negative | 0 |
| 1 | uncertain | -1 |
| 2 | positive | 1 |
A label is reported **positive** when `P(class=2) ≥ threshold` (default **0.5**).
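
The decoding rule above can be sketched in plain Python for a single label. This is a minimal illustration, not the model's actual code: the card only specifies the positive rule, so deriving the ternary value from the argmax class is an assumption, and `decode_label` and the example logits are hypothetical.

```python
import math

# Class index -> ternary value, as given in the table above.
CLASS_TO_VALUE = {0: 0, 1: -1, 2: 1}


def softmax(logits):
    """Numerically stable softmax over one label's three class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_label(logits, threshold=0.5):
    """Turn one label's [negative, uncertain, positive] logits into outputs.

    Assumption: the ternary value is taken from the argmax class. The card
    only states the positive rule, P(class=2) >= threshold.
    """
    probs = softmax(logits)
    p_positive = probs[2]
    argmax_class = max(range(3), key=lambda i: probs[i])
    return {
        "proba": p_positive,
        "positive": int(p_positive >= threshold),
        "ternary": CLASS_TO_VALUE[argmax_class],
    }


# Hypothetical logits for three labels of one report.
example = [[-2.0, 0.1, 3.0], [4.0, -1.0, -3.0], [0.2, 1.5, 0.1]]
decoded = [decode_label(row) for row in example]
print([d["ternary"] for d in decoded])   # [1, 0, -1]
print([d["positive"] for d in decoded])  # [1, 0, 0]
```

Lowering `threshold` flips more labels to positive without changing `proba`, which is why the evaluation section can report the same model at different thresholds.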
## Usage
```python
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("chest2vec/chest2vec_labeler", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("chest2vec/chest2vec_labeler", trust_remote_code=True)
reports = ["Bibasilar atelectasis with small bilateral pleural effusions. Cardiomegaly. Coronary artery calcification."]
# 1) human-readable positive labels per report
print(model.label_reports(reports, tokenizer=tok))
# [{'Subsegmental / linear atelectasis': 'positive', 'Pleural effusion': 'positive',
# 'Cardiomegaly': 'positive', 'Coronary artery calcification': 'positive'}]
# 2) full prediction matrices
out = model.predict(reports, tokenizer=tok, threshold=0.5, return_ternary=True)
out["labels"] # list of 137 label names
out["proba"] # [N, 137] P(positive)
out["positive"] # [N, 137] in {0,1}
out["ternary"] # [N, 137] in {-1,0,1}
```
### CheXbert / SRR-BERT-style report comparison
Label both ground-truth and predicted reports, then compute label-level F1 (GT-labels treated
as truth):
```python
res = model.score_reports(gt_reports, pred_reports, tokenizer=tok) # equal-length lists
# scores are reported at three hierarchy levels:
for level in ("leaf", "upper", "anatomy"):
    b = res[level]
    print(level, b["n_labels"], b["micro"]["f1"], b["macro"]["f1"], b["weighted"]["f1"])
print(res["leaf"]["per_label"]["Pleural effusion"]) # {'precision':..,'recall':..,'f1':..,'support_gt':..}
# or one-liner that loads the model for you:
from modeling_chest2vec_labeler import report_f1
report_f1(gt_reports, pred_reports, tokenizer=tok)
```
Each level (`leaf` = 137 labels, `upper` = 38 container groups, `anatomy` = 10 sections) returns
`micro` / `macro` / `weighted` precision/recall/F1 plus `per_label`. Upper/anatomy scores are the
max-over-children roll-up of the leaf predictions (`model.aggregate_hierarchy(...)`). Coarser
levels are easier to match, so upper/anatomy F1 are typically higher than leaf.
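
To make the max-over-children roll-up concrete, here is a small self-contained sketch. It does not call `model.aggregate_hierarchy`; `roll_up`, the two parent groups, and the leaves `Pleural thickening` and `Pericardial effusion` are hypothetical stand-ins for the real `label_hierarchy` in `config.json`.

```python
def roll_up(leaf_scores, hierarchy):
    """Max-over-children roll-up of one report's leaf scores.

    leaf_scores: dict mapping leaf name -> P(positive) or a 0/1 decision.
    hierarchy:   dict mapping parent name -> list of child leaf names.
    """
    return {
        parent: max(leaf_scores[child] for child in children)
        for parent, children in hierarchy.items()
    }


# Hypothetical two-group hierarchy, for illustration only.
toy_hierarchy = {
    "Pleura": ["Pleural effusion", "Pleural thickening"],
    "Heart": ["Cardiomegaly", "Pericardial effusion"],
}

# Ground truth and prediction disagree on which pleural leaf is positive.
gt_leaf = {
    "Pleural effusion": 1,
    "Pleural thickening": 0,
    "Cardiomegaly": 0,
    "Pericardial effusion": 0,
}
pred_leaf = {
    "Pleural effusion": 0,
    "Pleural thickening": 1,
    "Cardiomegaly": 0,
    "Pericardial effusion": 0,
}

gt_upper = roll_up(gt_leaf, toy_hierarchy)
pred_upper = roll_up(pred_leaf, toy_hierarchy)
print(gt_leaf == pred_leaf, gt_upper == pred_upper)  # False True
```

The two reports disagree at leaf level but agree once both leaves collapse into their parent, which is the mechanism behind the higher upper/anatomy F1.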
### Per-label best F1 (threshold tuning)
The default decision threshold is a single global value, but the F1-optimal threshold differs per
label. To get the **best achievable F1 per label** (and the threshold that achieves it) against a
ground-truth label set:
```python
# gt: a DataFrame with the 137 label columns (ternary; positive == 1), or a binary array
res = model.per_label_best_f1(reports, gt, tokenizer=tok, level="leaf", min_pos=30)
res["macro_best_f1_min_pos"] # macro best-F1 over labels with >= min_pos positives
res["per_label"]["Pleural effusion"] # {'best_f1':.., 'best_threshold':.., 'n_pos':..}
```
Per-label threshold tuning lifts macro-F1 by ~4–6 points over the fixed-0.5 threshold (see below).
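
The idea behind per-label tuning can be sketched as a plain threshold sweep for one label. This is an illustration under assumptions, not the implementation of `model.per_label_best_f1`: the 0.01 to 0.99 grid, the lowest-threshold tie-break, and the helper names are all hypothetical.

```python
def f1_at_threshold(probs, truth, threshold):
    """Binary F1 for one label, predicting positive when prob >= threshold."""
    tp = sum(1 for p, t in zip(probs, truth) if p >= threshold and t == 1)
    fp = sum(1 for p, t in zip(probs, truth) if p >= threshold and t == 0)
    fn = sum(1 for p, t in zip(probs, truth) if p < threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, truth, grid=None):
    """Return (best_f1, threshold), keeping the lowest threshold on ties."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # assumed grid: 0.01 .. 0.99
    best_f1, best_t = -1.0, None
    for t in grid:
        score = f1_at_threshold(probs, truth, t)
        if score > best_f1:  # strict '>' keeps the first (lowest) maximizer
            best_f1, best_t = score, t
    return best_f1, best_t


# Hypothetical P(positive) values and binary ground truth for one label.
probs = [0.9, 0.4, 0.35, 0.2, 0.1]
truth = [1, 1, 1, 0, 0]

print(f1_at_threshold(probs, truth, 0.5))  # 0.5
print(best_threshold(probs, truth))        # (1.0, 0.21)
```

Because the threshold is chosen on the same labels it is scored against, a best-F1 number of this kind is an upper bound for that set rather than an estimate of held-out performance.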
## Inputs & conventions
- Input is the **findings** text (the model was trained on CT-RATE findings + their refined
section-structured form). Reports are formatted internally as
`Instruct: Given the following chest CT report, extract the presence/absence of entities\nQuery: <report>`,
truncated to **512** tokens, with an EOS token appended and left-padding.
- For best fidelity, run in float32 (the default). bf16 gives higher throughput with negligible drift in the outputs.
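
The input conventions above can be sketched without loading the model. The template string is taken from this card; everything else is an assumption for illustration: `prepare_batch` is a hypothetical helper, the token ids are made up, and whether the EOS token counts toward the 512-token limit is not stated here.

```python
INSTRUCTION = (
    "Given the following chest CT report, "
    "extract the presence/absence of entities"
)


def build_prompt(report):
    """Wrap a findings text in the instruction template quoted above."""
    return f"Instruct: {INSTRUCTION}\nQuery: {report}"


def prepare_batch(token_id_lists, eos_id, pad_id, max_len=512):
    """Truncate, append EOS, and left-pad a batch of token-id lists.

    Assumption: EOS counts toward max_len, so the text is cut to
    max_len - 1 tokens before EOS is appended.
    """
    rows = [ids[: max_len - 1] + [eos_id] for ids in token_id_lists]
    width = max(len(r) for r in rows)
    input_ids = [[pad_id] * (width - len(r)) + r for r in rows]
    attention_mask = [[0] * (width - len(r)) + [1] * len(r) for r in rows]
    return input_ids, attention_mask


print(build_prompt("No pleural effusion."))

# Made-up token ids for two reports of different length.
input_ids, attention_mask = prepare_batch([[5, 6, 7], [8]], eos_id=2, pad_id=0)
print(input_ids)       # [[5, 6, 7, 2], [0, 0, 8, 2]]
print(attention_mask)  # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

With left-padding the EOS token always sits in the last column, so last-token pooling can read position `-1` for every row without consulting the attention mask.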
## Evaluation
**How these numbers were produced:** `run_para_v2_eval.sh` runs the model in **direct-paragraph**
mode (full report, max_len 512) and writes `eval_ctrate_test_direct.json` (public) and
`eval_sample1000_private.json` (private). Metric = **macro-F1** of the **positive class** (softmax
probability of class 2) at **threshold 0.33**. Because the all-labels macro is dragged down by
sparse-tail labels, the **headline restricts to leaf labels with ≥30 positive examples** in that
eval set — **53 of the evaluated leaves on the public set, 29 on the private set**. Upper/anatomy
rows are the hierarchy roll-up.
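
The effect of restricting the macro average to well-supported labels can be shown on toy data. The sketch below is not `run_para_v2_eval.sh`; `macro_f1`, the label `Rare finding`, and all numbers are hypothetical, and the toy run uses `min_pos=2` only because it has four reports.

```python
def binary_f1(pred, truth):
    """F1 of the positive class for one label (0/1 lists over reports)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t != 1)
    fn = sum(1 for p, t in zip(pred, truth) if p != 1 and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(proba, truth, threshold=0.33, min_pos=30):
    """Macro-F1 over labels with at least min_pos positives in truth.

    proba, truth: dicts mapping label name -> list of values over reports.
    Returns (macro_f1, list_of_labels_kept).
    """
    kept = [
        name for name, col in truth.items()
        if sum(1 for t in col if t == 1) >= min_pos
    ]
    scores = [
        binary_f1([int(p >= threshold) for p in proba[name]], truth[name])
        for name in kept
    ]
    return (sum(scores) / len(scores) if scores else 0.0), kept


# Toy data: one well-supported label and one sparse-tail label.
proba = {
    "Pleural effusion": [0.9, 0.4, 0.8, 0.1],
    "Rare finding": [0.05, 0.02, 0.01, 0.03],  # hypothetical label
}
truth = {
    "Pleural effusion": [1, 1, 1, 0],
    "Rare finding": [0, 0, 0, 1],
}

macro_all, kept_all = macro_f1(proba, truth, min_pos=0)
macro_head, kept_head = macro_f1(proba, truth, min_pos=2)
print(macro_all, macro_head)  # 0.5 1.0
```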
**CT-RATE revised test (public, 1,464 reports)** — from `chest2vec/chest2vec_labels` test split:
| Level | # labels | macro-F1 @0.33 | macro-AUC |
|---|--:|--:|--:|
| leaf (≥30 positives) | 53 | **0.875** | 0.989 |
| leaf (all evaluated) | 131 | 0.749 | — |
| upper (≥30 positives) | 27 | 0.938 | 0.994 |
| anatomy | 10 | 0.956 | 0.993 |
**Private evaluation set (1,000 reports)** — a held-out internal set, not released:
| Level | # labels | macro-F1 @0.33 | macro-AUC |
|---|--:|--:|--:|
| leaf (≥30 positives) | 29 | **0.766** | 0.972 |
| leaf (all evaluated) | 60 | 0.731 | — |
| upper (≥30 positives) | 19 | 0.837 | — |
| anatomy | 10 | 0.869 | — |
**Per-label best F1** (threshold swept per label to maximize F1; macro over leaf labels with ≥30
positives, via `model.per_label_best_f1`):
| Eval set | macro best-F1 (≥30) | macro-F1 @0.5 (≥30) | macro best-F1 (all evaluated) |
|---|--:|--:|--:|
| CT-RATE public | **0.907** | 0.866 | 0.844 |
| Private | **0.820** | 0.761 | 0.795 |
F1-optimal thresholds vary widely by label (~0.04–0.75), so per-label tuning recovers ~4–6 macro-F1
points over a single global threshold.
Leaf macro-AUC barely moves from the public to the private set (**0.989 → 0.972**), so label ranking
transfers to the unseen set; the F1 gap mostly reflects threshold choice and labeling-convention
differences rather than a domain failure.
Separately, a radiologist spot-checked the labels of **966** public test reports (857 fully
accepted / 60 imperfect-but-acceptable / 49 failed; see the [dataset card](https://huggingface.co/datasets/chest2vec/chest2vec_labels)).
## Caveats
- **Weakly supervised** — trained on LLM-generated labels (not radiologist ground truth) derived
from report **text**, not images. Not a medical device; not for clinical use.
- `IVC filter` is in the taxonomy for completeness but had no training positives.
- `score_reports` measures **label agreement** between two reports as judged by this labeler;
like CheXbert-F1 it inherits the labeler's own error modes.
## License & attribution
Released under **CC-BY-NC-SA-4.0**. Built on **`Qwen/Qwen3-Embedding-0.6B`** (Apache-2.0) and
trained using labels derived from **[CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE)**
(CC-BY-NC-SA-4.0). **If you use this model, cite the CT-RATE paper** (arXiv:2403.17834) and
acknowledge Qwen3-Embedding. See the [dataset card](https://huggingface.co/datasets/chest2vec/chest2vec_labels)
for the full citation.