---
license: cc-by-nc-sa-4.0
language:
- en
base_model:
- Qwen/Qwen3-Embedding-0.6B
datasets:
- chest2vec/chest2vec_labels
pipeline_tag: text-classification
library_name: transformers
tags:
- radiology
- chest-ct
- report-labeling
- multi-label
- ct-rate
- chexbert-style-f1
---

# chest2vec CT Report Labeler (0.6B)

A weakly-supervised **multi-label classifier** that reads a free-text **chest-CT report** and
predicts labels from a **137-leaf chest-imaging taxonomy**, with a **ternary** status per label
(*negative / uncertain / positive*).

It also provides a **CheXbert / SRR-BERT-style report-comparison F1**: label a list of
ground-truth reports and a list of generated/predicted reports, then score them against each
other (micro / macro / weighted F1). This is useful for evaluating radiology report generation.

- **Base architecture:** [`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) (Apache-2.0)
- **Adaptation:** LoRA (r=16, α=32) **merged into the weights**, followed by last-token (EOS) pooling, L2 normalization, and a linear ternary head (`1024 → 137 × 3`)
- **Self-contained:** the full model (encoder + head) ships in `model.safetensors`. Loading does **not** download Qwen3-Embedding weights: the architecture is rebuilt from the bundled config and the fine-tuned weights are loaded into it. The tokenizer is bundled too.
- **Params:** ~596M, stored in float32
- **Training labels:** [`chest2vec/chest2vec_labels`](https://huggingface.co/datasets/chest2vec/chest2vec_labels) (revised CT-RATE, 137-leaf taxonomy)

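The adaptation bullet describes three steps after the encoder: last-token pooling, L2 normalization, and a linear head that emits three logits per label. The sketch below illustrates those steps on toy sizes with plain Python lists. It is not the shipped implementation: the function names, the toy weights, and the label-major reshape of the head output are all assumptions made for illustration.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last real (unmasked) token per sequence.

    hidden_states: [batch][seq_len][dim]; attention_mask: [batch][seq_len] of 0/1.
    With left-padding the last real token is always the final position.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        last = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(states[last])
    return pooled


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def ternary_head(vec, weight, bias, n_labels):
    """Linear map dim -> n_labels * 3, reshaped to [n_labels][3].

    The label-major layout of the flat output is an assumption.
    """
    flat = [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
    return [flat[3 * j:3 * j + 3] for j in range(n_labels)]


# Toy sizes: dim=2 and 2 labels instead of 1024 and 137.
hidden = [[[0.0, 0.0], [3.0, 4.0]]]  # one left-padded sequence of length 2
mask = [[0, 1]]
pooled = last_token_pool(hidden, mask)
emb = l2_normalize(pooled[0])
weight = [[1, 0], [0, 1], [1, 1], [0, 0], [-1, 0], [0, -1]]  # made-up weights
bias = [0.0] * 6
logits = ternary_head(emb, weight, bias, n_labels=2)
print(logits)  # one row of 3 logits per label
```

In the real model the same shapes would be `dim = 1024` and `n_labels = 137`, giving a `137 × 3` logit matrix per report.
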
## Label space

The model head predicts **137 leaf labels**. They roll up through the chest-imaging hierarchy
into **38 upper/container groups** and **10 anatomy sections** (the `label_hierarchy` in
`config.json`), so predictions and report-comparison F1 can be reported at leaf, upper, or
anatomy granularity.

- The model outputs all **137** leaves. In the training data, **136** of them have at least one
  positive example; the single exception is **`IVC filter`** (kept for taxonomy completeness,
  but it had no positives, so the model effectively never predicts it).
- The exact label list is in `config.json` (`labels`). Full definitions and per-split counts are
  in the **[chest2vec/chest2vec_labels](https://huggingface.co/datasets/chest2vec/chest2vec_labels)**
  dataset's [`LABEL_HIERARCHY.md`](https://huggingface.co/datasets/chest2vec/chest2vec_labels/blob/main/LABEL_HIERARCHY.md).

This model was **trained and evaluated on the
[chest2vec/chest2vec_labels](https://huggingface.co/datasets/chest2vec/chest2vec_labels)
dataset** (revised CT-RATE, 137-leaf taxonomy).

**Ternary head:** `softmax(logits, dim=-1)` over class indices `[0, 1, 2]`:

| class index | meaning | ternary value |
|---:|---|---:|
| 0 | negative | 0 |
| 1 | uncertain | -1 |
| 2 | positive | 1 |

A label is reported **positive** when `P(class=2) ≥ threshold` (default **0.5**).

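To make the decision rule concrete, here is a minimal sketch that decodes the three logits of a single label. The softmax and the `P(class=2) ≥ threshold` rule follow the description above. Deriving the ternary value from the argmax class is an assumption about how the model fills `out["ternary"]`, and `decode_label` is a hypothetical helper, not part of the model's API.

```python
import math

# Class index -> ternary value, taken from the table above.
CLASS_TO_VALUE = {0: 0, 1: -1, 2: 1}


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_label(logits, threshold=0.5):
    """Decode one label's [negative, uncertain, positive] logits."""
    probs = softmax(logits)
    p_pos = probs[2]
    # Assumption: the ternary status is the argmax class mapped through the table.
    top_class = max(range(3), key=probs.__getitem__)
    return {
        "proba": p_pos,
        "positive": int(p_pos >= threshold),
        "ternary": CLASS_TO_VALUE[top_class],
    }


confident_positive = decode_label([0.0, 0.0, 2.0])
confident_negative = decode_label([2.0, 0.0, 0.0])
uncertain = decode_label([0.0, 2.0, 0.0])
print(confident_positive, confident_negative, uncertain)
```

Note that with a threshold below 0.5 the binary `positive` flag and the argmax-based ternary status can disagree, because class 2 can clear the threshold without being the most probable class.
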
## Usage

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("chest2vec/chest2vec_labeler", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("chest2vec/chest2vec_labeler", trust_remote_code=True)

reports = ["Bibasilar atelectasis with small bilateral pleural effusions. Cardiomegaly. Coronary artery calcification."]

# 1) human-readable positive labels per report
print(model.label_reports(reports, tokenizer=tok))
# [{'Subsegmental / linear atelectasis': 'positive', 'Pleural effusion': 'positive',
#   'Cardiomegaly': 'positive', 'Coronary artery calcification': 'positive'}]

# 2) full prediction matrices
out = model.predict(reports, tokenizer=tok, threshold=0.5, return_ternary=True)
out["labels"]    # list of 137 label names
out["proba"]     # [N, 137] P(positive)
out["positive"]  # [N, 137] in {0,1}
out["ternary"]   # [N, 137] in {-1,0,1}
```

### CheXbert / SRR-BERT-style report comparison

Label both ground-truth and predicted reports, then compute label-level F1 (GT-labels treated
as truth):

```python
# gt_reports and pred_reports are equal-length lists of report strings
res = model.score_reports(gt_reports, pred_reports, tokenizer=tok)

# scores are reported at three hierarchy levels:
for level in ("leaf", "upper", "anatomy"):
    b = res[level]
    print(level, b["n_labels"], b["micro"]["f1"], b["macro"]["f1"], b["weighted"]["f1"])

print(res["leaf"]["per_label"]["Pleural effusion"])  # {'precision':..,'recall':..,'f1':..,'support_gt':..}

# or one-liner that loads the model for you:
from modeling_chest2vec_labeler import report_f1
report_f1(gt_reports, pred_reports, tokenizer=tok)
```

Each level (`leaf` = 137 labels, `upper` = 38 container groups, `anatomy` = 10 sections) returns
`micro` / `macro` / `weighted` precision/recall/F1 plus `per_label`. Upper/anatomy scores are the
max-over-children roll-up of the leaf predictions (`model.aggregate_hierarchy(...)`). Coarser
levels are easier to match, so upper/anatomy F1 are typically higher than leaf.

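The max-over-children roll-up can be sketched in a few lines. This is only an illustration of the idea, not the body of `model.aggregate_hierarchy`: the `roll_up` helper, the `{group: [leaves]}` layout, and the toy group and leaf names are hypothetical, and the actual structure of `label_hierarchy` in `config.json` may differ.

```python
def roll_up(leaf_scores, hierarchy):
    """Max-over-children roll-up.

    leaf_scores: {leaf_name: score}, where score is a probability or a 0/1 prediction.
    hierarchy:   {group_name: [leaf_name, ...]}  (hypothetical layout)
    """
    return {
        group: max(leaf_scores[leaf] for leaf in leaves)
        for group, leaves in hierarchy.items()
    }


# Toy hierarchy with made-up names.
toy_hierarchy = {"group A": ["leaf a1", "leaf a2"], "group B": ["leaf b1"]}

leaf_positive = {"leaf a1": 0, "leaf a2": 1, "leaf b1": 0}
leaf_proba = {"leaf a1": 0.10, "leaf a2": 0.85, "leaf b1": 0.20}

group_positive = roll_up(leaf_positive, toy_hierarchy)  # max over 0/1 acts as a logical OR
group_proba = roll_up(leaf_proba, toy_hierarchy)
print(group_positive, group_proba)
```

A group counts as positive as soon as any one of its leaves is positive, which is why two reports that disagree on the exact leaf can still agree at the upper or anatomy level.
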
### Per-label best F1 (threshold tuning)

The default decision threshold is a single global value, but the F1-optimal threshold differs per
label. To get the **best achievable F1 per label** (and the threshold that achieves it) against a
ground-truth label set:

```python
# gt: a DataFrame with the 137 label columns (ternary; positive == 1), or a binary array
res = model.per_label_best_f1(reports, gt, tokenizer=tok, level="leaf", min_pos=30)
res["macro_best_f1_min_pos"]          # macro best-F1 over labels with >= min_pos positives
res["per_label"]["Pleural effusion"]  # {'best_f1':.., 'best_threshold':.., 'n_pos':..}
```

Per-label threshold tuning lifts macro-F1 by ~4–6 points over the fixed-0.5 threshold (see below).

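The idea behind per-label tuning is a simple threshold sweep. The sketch below shows it for one label under stated assumptions: the 0.01-step grid, the tie-breaking rule (lowest threshold wins), and the helper names are illustrative choices and are not taken from `per_label_best_f1`. The returned keys mirror the ones shown above.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for 0/1 lists; defined as 0.0 when there is nothing to score."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, proba, grid=None):
    """Sweep thresholds for one label and keep the first one with the highest F1."""
    grid = grid or [i / 100 for i in range(1, 100)]  # assumed grid: 0.01 .. 0.99
    best_f1, best_t = -1.0, None
    for t in grid:
        preds = [int(p >= t) for p in proba]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_f1, best_t = score, t
    return {"best_f1": best_f1, "best_threshold": best_t, "n_pos": sum(y_true)}


# Toy label: one true positive sits below the default 0.5 threshold.
y_true = [1, 1, 0, 0, 1]
proba = [0.90, 0.40, 0.35, 0.10, 0.80]

f1_at_default = f1_score(y_true, [int(p >= 0.5) for p in proba])
tuned = best_threshold(y_true, proba)
print(f1_at_default, tuned)
```

Because the threshold is chosen on the same data it is scored on, the resulting number is an upper bound for that evaluation set rather than a held-out estimate.
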
## Inputs & conventions

- Input is the **findings** text (the model was trained on CT-RATE findings + their refined
  section-structured form). Reports are formatted internally as
  `Instruct: Given the following chest CT report, extract the presence/absence of entities\nQuery: <report>`,
  truncated to **512** tokens, with an EOS token appended and left-padding.
- For best fidelity, run in float32 (default). bf16 is fine for throughput with negligible drift.

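The model applies this formatting internally, so callers do not need to do it themselves. The sketch below only makes the convention explicit. The prompt template is copied from the bullet above; `format_report` and `build_batch` are hypothetical helpers, the token ids are made up, and the order of operations (truncate first, then append EOS) is an assumption, since the card does not say whether the 512-token limit includes the EOS token.

```python
INSTRUCTION = "Given the following chest CT report, extract the presence/absence of entities"


def format_report(report: str) -> str:
    """Wrap a findings text in the instruction template quoted above."""
    return f"Instruct: {INSTRUCTION}\nQuery: {report}"


def build_batch(token_lists, eos_id, pad_id, max_len=512):
    """Truncate, append EOS, and left-pad a batch of token-id lists.

    Assumption: truncation happens before EOS is appended.
    """
    seqs = [ids[:max_len] + [eos_id] for ids in token_lists]
    width = max(len(s) for s in seqs)
    input_ids = [[pad_id] * (width - len(s)) + s for s in seqs]
    attention_mask = [[0] * (width - len(s)) + [1] * len(s) for s in seqs]
    return input_ids, attention_mask


prompt = format_report("Small bilateral pleural effusions.")

# Made-up token ids; max_len=2 keeps the example readable.
input_ids, attention_mask = build_batch([[5, 6, 7], [8]], eos_id=2, pad_id=0, max_len=2)
print(prompt)
print(input_ids, attention_mask)
```

Left-padding places the EOS token at the final position of every row, which is what makes last-token pooling a simple "take the last column" operation.
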
## Evaluation

**How these numbers were produced:** `run_para_v2_eval.sh` runs the model in **direct-paragraph**
mode (full report, max_len 512) and writes `eval_ctrate_test_direct.json` (public) and
`eval_sample1000_private.json` (private). The metric is **macro-F1** of the **positive class**
(softmax probability of class 2) at **threshold 0.33**; note that this differs from the default
inference threshold of 0.5. Because the all-labels macro is dragged down by sparse-tail labels,
the **headline number is restricted to leaf labels with ≥30 positive examples** in that
eval set: **53 of the evaluated leaves on the public set, 29 on the private set**. Upper/anatomy
rows are the hierarchy roll-up.

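The effect of the ≥30-positives filter on a macro average can be illustrated with a small sketch. It is not the evaluation script: `macro_f1` and `label_f1` are hypothetical helpers, the data are invented, and the toy call uses `min_pos=2` instead of 30 so the example stays short. Treating "all evaluated" as labels with at least one positive is also an assumption.

```python
def label_f1(y_true, y_pred):
    """Binary F1 of the positive class for one label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true_by_label, proba_by_label, threshold=0.33, min_pos=30):
    """Unweighted mean of per-label F1 over labels with at least min_pos positives."""
    scores = {}
    for name, y_true in y_true_by_label.items():
        if sum(y_true) < min_pos:
            continue  # sparse-tail label: excluded from this average
        preds = [int(p >= threshold) for p in proba_by_label[name]]
        scores[name] = label_f1(y_true, preds)
    mean = sum(scores.values()) / len(scores) if scores else float("nan")
    return mean, scores


# Invented data: one well-supported label and one rare label.
truth = {"common": [1, 1, 0, 0], "rare": [1, 0, 0, 0]}
proba = {"common": [0.9, 0.4, 0.2, 0.1], "rare": [0.2, 0.5, 0.1, 0.1]}

headline, kept = macro_f1(truth, proba, threshold=0.33, min_pos=2)
all_labels, _ = macro_f1(truth, proba, threshold=0.33, min_pos=1)
print(headline, all_labels)
```

A single missed case on a rare label sends that label's F1 to zero, which is how a handful of sparse labels can pull the all-labels macro well below the filtered headline.
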
**CT-RATE revised test (public, 1,464 reports)**, from the `chest2vec/chest2vec_labels` test split:

| Level | # labels | macro-F1 @0.33 | macro-AUC |
|---|--:|--:|--:|
| leaf (≥30 positives) | 53 | **0.875** | 0.989 |
| leaf (all evaluated) | 131 | 0.749 | n/a |
| upper (≥30 positives) | 27 | 0.938 | 0.994 |
| anatomy | 10 | 0.956 | 0.993 |

**Private evaluation set (1,000 reports)**, a held-out internal set that is not released:

| Level | # labels | macro-F1 @0.33 | macro-AUC |
|---|--:|--:|--:|
| leaf (≥30 positives) | 29 | **0.766** | 0.972 |
| leaf (all evaluated) | 60 | 0.731 | n/a |
| upper (≥30 positives) | 19 | 0.837 | n/a |
| anatomy | 10 | 0.869 | n/a |

**Per-label best F1** (threshold swept per label to maximize F1; macro over leaf labels with ≥30
positives, via `model.per_label_best_f1`):

| Eval set | macro best-F1 (≥30) | macro-F1 @0.5 (≥30) | macro best-F1 (all evaluated) |
|---|--:|--:|--:|
| CT-RATE public | **0.907** | 0.866 | 0.844 |
| Private | **0.820** | 0.761 | 0.795 |

F1-optimal thresholds vary widely by label (~0.04–0.75), so per-label tuning recovers ~4–6 macro-F1
points over a single global threshold.

Leaf macro-AUC drops only slightly from the public to the private set (**0.989 → 0.972**), so
label ranking transfers to the unseen set. The leaf F1 gap (0.875 → 0.766 at threshold 0.33) is
therefore attributed mostly to threshold choice and labeling-convention differences rather than
to a domain failure.

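The distinction between ranking quality (AUC) and thresholded agreement (F1) can be shown on invented scores. The sketch below uses the pairwise definition of ROC-AUC, the fraction of positive/negative pairs ranked correctly with ties counted as half. Both helpers are illustrative and are not the evaluation code behind the tables.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")  # undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_at(y_true, scores, threshold):
    """Binary F1 after thresholding the scores."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented scores: positives always outrank negatives, but all scores are low.
y_true = [1, 1, 0, 0]
scores = [0.40, 0.30, 0.20, 0.10]

auc = roc_auc(y_true, scores)
f1_high_threshold = f1_at(y_true, scores, 0.5)
f1_low_threshold = f1_at(y_true, scores, 0.25)
print(auc, f1_high_threshold, f1_low_threshold)
```

When the ranking is intact but the score scale shifts, AUC stays high while F1 at a fixed threshold drops, which is the pattern the paragraph above describes.
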
Separately, a radiologist spot-checked the labels of **966** public test reports (857 fully
accepted / 60 imperfect-but-acceptable / 49 failed; see the [dataset card](https://huggingface.co/datasets/chest2vec/chest2vec_labels)).

## Caveats

- **Weakly supervised:** trained on LLM-generated labels (not radiologist ground truth) derived
  from report **text**, not images. Not a medical device; not for clinical use.
- `IVC filter` is in the taxonomy for completeness but had no training positives.
- `score_reports` measures **label agreement** between two reports as judged by this labeler;
  like CheXbert-F1 it inherits the labeler's own error modes.

## License & attribution

Released under **CC-BY-NC-SA-4.0**. Built on **`Qwen/Qwen3-Embedding-0.6B`** (Apache-2.0) and
trained using labels derived from **[CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE)**
(CC-BY-NC-SA-4.0). **If you use this model, cite the CT-RATE paper** (arXiv:2403.17834) and
acknowledge Qwen3-Embedding. See the [dataset card](https://huggingface.co/datasets/chest2vec/chest2vec_labels)
for the full citation.