---
license: cc-by-nc-sa-4.0
language:
- en
base_model:
- Qwen/Qwen3-Embedding-0.6B
datasets:
- chest2vec/chest2vec_labels
pipeline_tag: text-classification
library_name: transformers
tags:
- radiology
- chest-ct
- report-labeling
- multi-label
- ct-rate
- chexbert-style-f1
---

# chest2vec CT Report Labeler (0.6B)

A weakly-supervised **multi-label classifier** that reads a free-text **chest-CT report** and predicts a **137-leaf chest-imaging taxonomy**, with a **ternary** status per label (*negative / uncertain / positive*).

It also provides a **CheXbert / SRR-BERT-style report-comparison F1**: label a list of ground-truth reports and a list of generated/predicted reports, then score them against each other (micro / macro / weighted F1). This is useful for evaluating radiology report generation.

- **Base architecture:** [`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) (Apache-2.0)
- **Adaptation:** LoRA (r=16, α=32) **merged into the weights** + last-token (EOS) pooling + L2-norm + a linear ternary head (`1024 → 137 × 3`)
- **Self-contained:** the full model (encoder + head) ships in `model.safetensors`. Loading does **not** download Qwen3-Embedding weights: the architecture is rebuilt from the bundled config and our weights are loaded in. Tokenizer is bundled too.
- **Params:** ~596M · weights in float32
- **Training labels:** [`chest2vec/chest2vec_labels`](https://huggingface.co/datasets/chest2vec/chest2vec_labels) (revised CT-RATE, 137-leaf taxonomy)

## Label space

The model head predicts **137 leaf labels**. They roll up through the chest-imaging hierarchy into **38 upper/container groups** and **10 anatomy sections** (the `label_hierarchy` in `config.json`), so predictions and report-comparison F1 can be reported at leaf, upper, or anatomy granularity.

- The model outputs all **137** leaves.
  In the training data, **136** of them have at least one positive example; the single exception is **`IVC filter`** (kept for taxonomy completeness, but it had no positives, so the model effectively never predicts it).
- The exact label list is in `config.json` (`labels`). Full definitions and per-split counts are in the **[chest2vec/chest2vec_labels](https://huggingface.co/datasets/chest2vec/chest2vec_labels)** dataset's [`LABEL_HIERARCHY.md`](https://huggingface.co/datasets/chest2vec/chest2vec_labels/blob/main/LABEL_HIERARCHY.md).

This model was **trained and evaluated on the [chest2vec/chest2vec_labels](https://huggingface.co/datasets/chest2vec/chest2vec_labels) dataset** (revised CT-RATE, 137-leaf taxonomy).

**Ternary head** (`softmax(logits, dim=-1)` over class indices `[0, 1, 2]`):

| class index | meaning | value |
|---:|---|---:|
| 0 | negative | 0 |
| 1 | uncertain | -1 |
| 2 | positive | 1 |

A label is reported **positive** when `P(class=2) ≥ threshold` (default **0.5**).

## Usage

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("chest2vec/chest2vec_labeler", trust_remote_code=True).eval()
tok = AutoTokenizer.from_pretrained("chest2vec/chest2vec_labeler", trust_remote_code=True)

reports = ["Bibasilar atelectasis with small bilateral pleural effusions. Cardiomegaly. Coronary artery calcification."]

# 1) human-readable positive labels per report
print(model.label_reports(reports, tokenizer=tok))
# [{'Subsegmental / linear atelectasis': 'positive', 'Pleural effusion': 'positive',
#   'Cardiomegaly': 'positive', 'Coronary artery calcification': 'positive'}]

# 2) full prediction matrices
out = model.predict(reports, tokenizer=tok, threshold=0.5, return_ternary=True)
out["labels"]    # list of 137 label names
out["proba"]     # [N, 137] P(positive)
out["positive"]  # [N, 137] in {0,1}
out["ternary"]   # [N, 137] in {-1,0,1}
```

### CheXbert / SRR-BERT-style report comparison

Label both ground-truth and predicted reports, then compute label-level F1 (GT-labels treated as truth):

```python
res = model.score_reports(gt_reports, pred_reports, tokenizer=tok)  # equal-length lists

# scores are reported at three hierarchy levels:
for level in ("leaf", "upper", "anatomy"):
    b = res[level]
    print(level, b["n_labels"], b["micro"]["f1"], b["macro"]["f1"], b["weighted"]["f1"])

print(res["leaf"]["per_label"]["Pleural effusion"])  # {'precision':..,'recall':..,'f1':..,'support_gt':..}

# or one-liner that loads the model for you:
from modeling_chest2vec_labeler import report_f1
report_f1(gt_reports, pred_reports, tokenizer=tok)
```

Each level (`leaf` = 137 labels, `upper` = 38 container groups, `anatomy` = 10 sections) returns `micro` / `macro` / `weighted` precision/recall/F1 plus `per_label`. Upper/anatomy scores are the max-over-children roll-up of the leaf predictions (`model.aggregate_hierarchy(...)`). Coarser levels are easier to match, so upper/anatomy F1 are typically higher than leaf.

### Per-label best F1 (threshold tuning)

The default decision threshold is a single global value, but the F1-optimal threshold differs per label.
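
As a rough illustration of why one global threshold can be suboptimal, the sketch below sweeps a threshold grid for a single label and keeps the value that maximizes F1. It is not the implementation behind the model's own tuning method; the helper names `f1_binary` and `best_f1_threshold`, the 0.01 to 0.99 grid, and the toy data are assumptions made for this example.

```python
def f1_binary(y_true, y_pred):
    """F1 for one label from 0/1 lists; returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(proba, y_true, grid=None):
    """Sweep thresholds for a single label; return (best_f1, best_threshold)."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # assumed grid: 0.01 .. 0.99
    best_f1, best_t = -1.0, None
    for t in grid:
        y_pred = [1 if p >= t else 0 for p in proba]
        f1 = f1_binary(y_true, y_pred)
        if f1 > best_f1:  # strict '>' keeps the lowest threshold on ties
            best_f1, best_t = f1, t
    return best_f1, best_t


# Toy label (hypothetical data) whose positives mostly score below 0.5.
proba = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60]
y_true = [0, 0, 1, 1, 1, 1]

f1_default = f1_binary(y_true, [1 if p >= 0.5 else 0 for p in proba])
f1_best, t_best = best_f1_threshold(proba, y_true)
print(f1_default, f1_best, t_best)
```

On this toy label the fixed 0.5 threshold recovers only one of the four positives, while a much lower threshold separates the classes cleanly. Real labels are rarely this separable, and a threshold tuned on the same set it is scored on gives an optimistic upper bound.
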
To get the **best achievable F1 per label** (and the threshold that achieves it) against a ground-truth label set:

```python
# gt: a DataFrame with the 137 label columns (ternary; positive == 1), or a binary array
res = model.per_label_best_f1(reports, gt, tokenizer=tok, level="leaf", min_pos=30)
res["macro_best_f1_min_pos"]          # macro best-F1 over labels with >= min_pos positives
res["per_label"]["Pleural effusion"]  # {'best_f1':.., 'best_threshold':.., 'n_pos':..}
```

Per-label threshold tuning lifts macro-F1 by ~4–6 points over the fixed-0.5 threshold (see below).

## Inputs & conventions

- Input is the **findings** text (the model was trained on CT-RATE findings + their refined section-structured form). Reports are formatted internally as `Instruct: Given the following chest CT report, extract the presence/absence of entities\nQuery: ` followed by the report text, truncated to **512** tokens, with an EOS token appended and left-padding.
- For best fidelity, run in float32 (default). bf16 is fine for throughput with negligible drift.

## Evaluation

**How these numbers were produced:** `run_para_v2_eval.sh` runs the model in **direct-paragraph** mode (full report, max_len 512) and writes `eval_ctrate_test_direct.json` (public) and `eval_sample1000_private.json` (private). The metric is **macro-F1** of the **positive class** (softmax probability of class 2) at **threshold 0.33**. Because the all-labels macro is dragged down by sparse-tail labels, the **headline number is restricted to leaf labels with ≥30 positive examples** in that eval set: **53 of the evaluated leaves on the public set, 29 on the private set**. Upper/anatomy rows are the hierarchy roll-up.
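
The headline metric described above can be sketched in a few lines: binarize P(positive) at a fixed threshold, compute F1 per label, and average only over labels that have enough positives in the eval set. This is an illustrative reimplementation, not the code in `run_para_v2_eval.sh`; the function name `macro_f1_min_pos`, its arguments, and the toy data are assumptions.

```python
def macro_f1_min_pos(proba, gt, threshold=0.33, min_pos=30):
    """Macro-F1 of the positive class over labels with >= min_pos positives.

    proba: rows of P(positive) per label; gt: rows of 0/1 ground truth.
    Returns (macro_f1, n_labels_kept); macro_f1 is None if no label qualifies.
    """
    n_labels = len(gt[0])
    f1s = []
    for j in range(n_labels):
        y_true = [row[j] for row in gt]
        if sum(y_true) < min_pos:
            continue  # sparse-tail label: left out of the headline average
        y_pred = [1 if row[j] >= threshold else 0 for row in proba]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    if not f1s:
        return None, 0
    return sum(f1s) / len(f1s), len(f1s)


# Toy data (hypothetical): 4 reports x 3 labels.
gt = [[1, 0, 1],
      [1, 0, 0],
      [0, 1, 1],
      [1, 0, 0]]
proba = [[0.90, 0.10, 0.40],
         [0.35, 0.20, 0.05],
         [0.50, 0.80, 0.10],
         [0.20, 0.05, 0.02]]

# min_pos is lowered to 2 here only so the filter is visible on tiny data.
filtered, kept = macro_f1_min_pos(proba, gt, threshold=0.33, min_pos=2)
all_labels, n_all = macro_f1_min_pos(proba, gt, threshold=0.33, min_pos=1)
print(filtered, kept, all_labels, n_all)
```

The filter changes which labels enter the average, not how each label is scored, so the filtered and all-labels rows in the tables below are not directly comparable label for label.
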

**CT-RATE revised test (public, 1,464 reports)**, from the `chest2vec/chest2vec_labels` test split:

| Level | # labels | macro-F1 @0.33 | macro-AUC |
|---|--:|--:|--:|
| leaf (≥30 positives) | 53 | **0.875** | 0.989 |
| leaf (all evaluated) | 131 | 0.749 | n/a |
| upper (≥30 positives) | 27 | 0.938 | 0.994 |
| anatomy | 10 | 0.956 | 0.993 |

**Private evaluation set (1,000 reports)**, a held-out internal set that is not released:

| Level | # labels | macro-F1 @0.33 | macro-AUC |
|---|--:|--:|--:|
| leaf (≥30 positives) | 29 | **0.766** | 0.972 |
| leaf (all evaluated) | 60 | 0.731 | n/a |
| upper (≥30 positives) | 19 | 0.837 | n/a |
| anatomy | 10 | 0.869 | n/a |

**Per-label best F1** (threshold swept per label to maximize F1; macro over leaf labels with ≥30 positives, via `model.per_label_best_f1`):

| Eval set | macro best-F1 (≥30) | macro-F1 @0.5 (≥30) | macro best-F1 (all evaluated) |
|---|--:|--:|--:|
| CT-RATE public | **0.907** | 0.866 | 0.844 |
| Private | **0.820** | 0.761 | 0.795 |

F1-optimal thresholds vary widely by label (~0.04–0.75), so per-label tuning recovers ~4–6 macro-F1 points over a single global threshold. Leaf macro-AUC barely moves from public to private (**0.989 → 0.972**), i.e. label ranking transfers to the unseen set; the F1 gap mostly reflects threshold and labeling-convention differences rather than a domain failure.

Separately, a radiologist spot-checked the labels of **966** reports from the public test set (857 fully accepted / 60 imperfect-but-acceptable / 49 failed; see the [dataset card](https://huggingface.co/datasets/chest2vec/chest2vec_labels)).

## Caveats

- **Weakly supervised:** trained on LLM-generated labels (not radiologist ground truth) derived from report **text**, not images. Not a medical device; not for clinical use.
- `IVC filter` is in the taxonomy for completeness but had no training positives.
- `score_reports` measures **label agreement** between two reports as judged by this labeler; like CheXbert-F1 it inherits the labeler's own error modes.

## License & attribution

Released under **CC-BY-NC-SA-4.0**. Built on **`Qwen/Qwen3-Embedding-0.6B`** (Apache-2.0) and trained using labels derived from **[CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE)** (CC-BY-NC-SA-4.0).

**If you use this model, cite the CT-RATE paper** (arXiv:2403.17834) and acknowledge Qwen3-Embedding. See the [dataset card](https://huggingface.co/datasets/chest2vec/chest2vec_labels) for the full citation.