	---
	license: cc-by-nc-sa-4.0
	language:
	- en
	base_model:
	- Qwen/Qwen3-Embedding-0.6B
	datasets:
	- chest2vec/chest2vec_labels
	pipeline_tag: text-classification
	library_name: transformers
	tags:
	- radiology
	- chest-ct
	- report-labeling
	- multi-label
	- ct-rate
	- chexbert-style-f1
	---

	# chest2vec CT Report Labeler (0.6B)

	A weakly-supervised multi-label classifier that reads a free-text chest-CT report and
	predicts each of the 137 leaf labels of a chest-imaging taxonomy, assigning every label a
	ternary status (negative / uncertain / positive).

	It also provides a CheXbert / SRR-BERT-style report-comparison F1: label a list of
	ground-truth reports and a list of generated/predicted reports, then score them against each
	other (micro / macro / weighted F1) — useful for evaluating radiology report generation.

	- Base architecture: [`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) (Apache-2.0)
	- Adaptation: LoRA (r=16, α=32) merged into the weights + last-token (EOS) pooling + L2-norm + a linear ternary head (`1024 → 137 × 3`)
	- Self-contained: the full model (encoder + head) ships in `model.safetensors`. Loading does not download Qwen3-Embedding weights — the architecture is rebuilt from the bundled config and our weights are loaded in. Tokenizer is bundled too.
	- Params: ~596M · weights in float32
	- Training labels: [`chest2vec/chest2vec_labels`](https://huggingface.co/datasets/chest2vec/chest2vec_labels) (revised CT-RATE, 137-leaf taxonomy)

	## Label space

	The model head predicts 137 leaf labels. They roll up through the chest-imaging hierarchy
	into 38 upper/container groups and 10 anatomy sections (the `label_hierarchy` in
	`config.json`), so predictions and report-comparison F1 can be reported at leaf, upper, or
	anatomy granularity.

	- The model outputs all 137 leaves. In the training data, 136 of them have at least one
	positive example; the single exception is `IVC filter` (kept for taxonomy completeness,
	but it had no positives, so the model effectively never predicts it).
	- The exact label list is in `config.json` (`labels`). Full definitions and per-split counts are
	in the [chest2vec/chest2vec_labels](https://huggingface.co/datasets/chest2vec/chest2vec_labels)
	dataset's [`LABEL_HIERARCHY.md`](https://huggingface.co/datasets/chest2vec/chest2vec_labels/blob/main/LABEL_HIERARCHY.md).

	This model was **trained and evaluated on the
	[chest2vec/chest2vec_labels](https://huggingface.co/datasets/chest2vec/chest2vec_labels)
	dataset** (revised CT-RATE, 137-leaf taxonomy).

	Ternary head — `softmax(logits, dim=-1)` over class indices `[0, 1, 2]`:

	| class index | meaning | value |
	|---:|---|---:|
	| 0 | negative | 0 |
	| 1 | uncertain | -1 |
	| 2 | positive | 1 |

	A label is reported positive when `P(class=2) ≥ threshold` (default 0.5).
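
The decoding rule above can be sketched for a single label as follows. This is a minimal illustration, not the bundled implementation: `decode_label` is a hypothetical helper, and taking the argmax class as the ternary value is an assumption (the model's own `ternary` output may be derived differently).

```python
import math

# Class index -> ternary value, as in the table above.
CLASS_TO_VALUE = {0: 0, 1: -1, 2: 1}


def softmax(logits):
    """Numerically stable softmax over one label's three logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_label(logits, threshold=0.5):
    """Decode one label's [negative, uncertain, positive] logits.

    Assumption: the ternary value is the argmax class; the model's own
    rule for `ternary` may differ.
    """
    probs = softmax(logits)
    p_positive = probs[2]
    argmax_class = max(range(3), key=lambda i: probs[i])
    return {
        "p_positive": p_positive,
        "positive": int(p_positive >= threshold),
        "ternary": CLASS_TO_VALUE[argmax_class],
    }


example = decode_label([0.1, 0.3, 2.5])
```

Note that under this sketch the binary `positive` flag and the ternary value can disagree when the threshold is lowered: a label may pass a threshold of 0.33 on `P(class=2)` without class 2 being the argmax.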

	## Usage

	```python
	from transformers import AutoModel, AutoTokenizer

	model = AutoModel.from_pretrained("chest2vec/chest2vec_labeler", trust_remote_code=True).eval()
	tok = AutoTokenizer.from_pretrained("chest2vec/chest2vec_labeler", trust_remote_code=True)

	reports = ["Bibasilar atelectasis with small bilateral pleural effusions. Cardiomegaly. Coronary artery calcification."]

	# 1) human-readable positive labels per report
	print(model.label_reports(reports, tokenizer=tok))
	# [{'Subsegmental / linear atelectasis': 'positive', 'Pleural effusion': 'positive',
	# 'Cardiomegaly': 'positive', 'Coronary artery calcification': 'positive'}]

	# 2) full prediction matrices
	out = model.predict(reports, tokenizer=tok, threshold=0.5, return_ternary=True)
	out["labels"] # list of 137 label names
	out["proba"] # [N, 137] P(positive)
	out["positive"] # [N, 137] in {0,1}
	out["ternary"] # [N, 137] in {-1,0,1}
	```

	### CheXbert / SRR-BERT-style report comparison

	Label both ground-truth and predicted reports, then compute label-level F1 (GT-labels treated
	as truth):

	```python
	res = model.score_reports(gt_reports, pred_reports, tokenizer=tok) # equal-length lists
	# scores are reported at three hierarchy levels:
	for level in ("leaf", "upper", "anatomy"):
	    b = res[level]
	    print(level, b["n_labels"], b["micro"]["f1"], b["macro"]["f1"], b["weighted"]["f1"])
	print(res["leaf"]["per_label"]["Pleural effusion"]) # {'precision':..,'recall':..,'f1':..,'support_gt':..}

	# or one-liner that loads the model for you:
	from modeling_chest2vec_labeler import report_f1
	report_f1(gt_reports, pred_reports, tokenizer=tok)
	```

	Each level (`leaf` = 137 labels, `upper` = 38 container groups, `anatomy` = 10 sections) returns
	`micro` / `macro` / `weighted` precision/recall/F1 plus `per_label`. Upper/anatomy scores are the
	max-over-children roll-up of the leaf predictions (`model.aggregate_hierarchy(...)`). Coarser
	levels are easier to match, so upper/anatomy F1 are typically higher than leaf.
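
To make the max-over-children roll-up concrete, here is a small sketch. The `roll_up` helper, the dictionary shape of the hierarchy, and the toy group names are assumptions made for illustration; the real structure is whatever `label_hierarchy` in `config.json` defines.

```python
def roll_up(leaf_scores, hierarchy):
    """Max-over-children roll-up of leaf scores to parent groups.

    leaf_scores: dict of leaf label -> score (a probability or a 0/1 flag).
    hierarchy:   dict of parent name -> list of child leaf labels
                 (hypothetical shape; the real `label_hierarchy` may differ).
    """
    return {
        parent: max(leaf_scores[child] for child in children)
        for parent, children in hierarchy.items()
    }


# Toy hierarchy; the group names and groupings are invented for illustration.
toy_hierarchy = {
    "Pleura (toy group)": ["Pleural effusion", "Pleural thickening"],
    "Heart (toy group)": ["Cardiomegaly"],
}
toy_leaf = {"Pleural effusion": 1, "Pleural thickening": 0, "Cardiomegaly": 0}
rolled = roll_up(toy_leaf, toy_hierarchy)
```

A parent is positive as soon as any child is positive, so two reports that disagree on which child finding is present can still agree at the parent level. This is one reason coarser levels tend to score higher.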

	### Per-label best F1 (threshold tuning)

	The default decision threshold is a single global value, but the F1-optimal threshold differs per
	label. To get the best achievable F1 per label (and the threshold that achieves it) against a
	ground-truth label set:

	```python
	# gt: a DataFrame with the 137 label columns (ternary; positive == 1), or a binary array
	res = model.per_label_best_f1(reports, gt, tokenizer=tok, level="leaf", min_pos=30)
	res["macro_best_f1_min_pos"] # macro best-F1 over labels with >= min_pos positives
	res["per_label"]["Pleural effusion"] # {'best_f1':.., 'best_threshold':.., 'n_pos':..}
	```

	Per-label threshold tuning lifts macro-F1 by ~4–6 points over the fixed-0.5 threshold (see below).
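
The idea behind a per-label sweep can be sketched in a few lines. This is an illustration of the technique, not the code behind `per_label_best_f1`: the helper names and the 0.01 to 0.99 grid are assumptions, and ground truth is treated as positive only when it equals 1.

```python
def f1_at(probs, truth, threshold):
    """Binary F1 for one label when predicting positive at P >= threshold."""
    tp = sum(1 for p, t in zip(probs, truth) if p >= threshold and t == 1)
    fp = sum(1 for p, t in zip(probs, truth) if p >= threshold and t != 1)
    fn = sum(1 for p, t in zip(probs, truth) if p < threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, truth, grid=None):
    """Sweep a threshold grid and return (best_f1, best_threshold).

    The 0.01..0.99 grid is an assumption; ties keep the lowest threshold.
    """
    grid = grid or [i / 100 for i in range(1, 100)]
    best_f1, best_t = -1.0, None
    for t in grid:
        f1 = f1_at(probs, truth, t)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_f1, best_t


# Toy label whose true positives mostly score below 0.5.
probs = [0.05, 0.10, 0.30, 0.35, 0.45, 0.90]
truth = [0, 0, 1, 1, 1, 1]
f1_default = f1_at(probs, truth, 0.5)
f1_best, t_best = best_threshold(probs, truth)
```

On this toy label the fixed 0.5 threshold gives F1 = 0.4, while the sweep reaches F1 = 1.0 at a threshold of 0.11. Thresholds tuned this way are fitted to the evaluation data, so the resulting best-F1 is an upper bound rather than a held-out estimate.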

	## Inputs & conventions

	- Input is the findings text (the model was trained on CT-RATE findings + their refined
	section-structured form). Reports are formatted internally as
	`Instruct: Given the following chest CT report, extract the presence/absence of entities\nQuery: <report>`,
	truncated to 512 tokens, with an EOS token appended and left-padding.
	- For best fidelity, run in float32 (default). bf16 is fine for throughput with negligible drift.
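
The input conventions above can be sketched as follows. The model applies this formatting internally, so the snippet is only meant to show what happens to a report; `format_report` and `truncate_and_pad` are hypothetical helpers, the token ids are toy values, and counting the EOS token within the 512-token budget is an assumption.

```python
INSTRUCTION = (
    "Given the following chest CT report, "
    "extract the presence/absence of entities"
)


def format_report(report):
    """Wrap a report in the instruction template quoted above."""
    return f"Instruct: {INSTRUCTION}\nQuery: {report}"


def truncate_and_pad(batch_ids, eos_id, pad_id, max_len=512):
    """Truncate to max_len, append EOS, and left-pad to a common length.

    Assumption: EOS is counted within max_len; the bundled code may
    budget it differently.
    """
    rows = [ids[: max_len - 1] + [eos_id] for ids in batch_ids]
    width = max(len(r) for r in rows)
    padded = [[pad_id] * (width - len(r)) + r for r in rows]
    mask = [[0] * (width - len(r)) + [1] * len(r) for r in rows]
    return padded, mask


# Toy token ids; a real tokenizer supplies these.
ids, mask = truncate_and_pad([[11, 12, 13], [21]], eos_id=2, pad_id=0)
```

With left-padding, the EOS token sits in the final column of every row, so last-token pooling can read position `-1` for the whole batch without per-row offsets.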

	## Evaluation

	How these numbers were produced: `run_para_v2_eval.sh` runs the model in direct-paragraph
	mode (full report, max_len 512) and writes `eval_ctrate_test_direct.json` (public) and
	`eval_sample1000_private.json` (private). The metric is macro-F1 of the positive class, with
	the softmax probability of class 2 thresholded at 0.33 (lower than the 0.5 default decision
	threshold). Because the all-labels macro is dragged down by sparse-tail labels, the headline
	restricts to leaf labels with ≥30 positive examples in that eval set: 53 of the evaluated
	leaves on the public set, 29 on the private set. Upper/anatomy rows are the hierarchy roll-up.
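
As a rough sketch of the headline metric (not the evaluation script itself), macro-F1 restricted to labels with enough positives could be computed like this. The function names and the dictionary layout are assumptions, and the ground truth is taken to be already binarized to 0/1.

```python
def binary_f1(pred, truth):
    """F1 of the positive class for one label (0/1 lists)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1_min_pos(proba, truth, labels, threshold=0.33, min_pos=30):
    """Macro-F1 over labels with at least min_pos positives in truth.

    proba, truth: dict of label -> list of per-report values
    (P(positive) and 0/1 ground truth respectively).
    """
    kept = [name for name in labels if sum(truth[name]) >= min_pos]
    scores = []
    for name in kept:
        pred = [int(p >= threshold) for p in proba[name]]
        scores.append(binary_f1(pred, truth[name]))
    macro = sum(scores) / len(scores) if scores else float("nan")
    return macro, kept


# Toy data with one common and one rare label (names are placeholders).
labels = ["common label", "rare label"]
proba = {"common label": [0.9, 0.4, 0.2, 0.1], "rare label": [0.1, 0.1, 0.1, 0.2]}
truth = {"common label": [1, 1, 1, 0], "rare label": [0, 0, 0, 1]}
macro_common, kept_common = macro_f1_min_pos(proba, truth, labels, min_pos=2)
macro_all, kept_all = macro_f1_min_pos(proba, truth, labels, min_pos=1)
```

In the toy data the single missed rare positive halves the macro score (0.8 with the filter, 0.4 without), which is the sparse-tail effect described above.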

	CT-RATE revised test (public, 1,464 reports) — from `chest2vec/chest2vec_labels` test split:

	| Level | # labels | macro-F1 @0.33 | macro-AUC |
	|---|--:|--:|--:|
	| leaf (≥30 positives) | 53 | 0.875 | 0.989 |
	| leaf (all evaluated) | 131 | 0.749 | n/a |
	| upper (≥30 positives) | 27 | 0.938 | 0.994 |
	| anatomy | 10 | 0.956 | 0.993 |

	Private evaluation set (1,000 reports) — a held-out internal set, not released:

	| Level | # labels | macro-F1 @0.33 | macro-AUC |
	|---|--:|--:|--:|
	| leaf (≥30 positives) | 29 | 0.766 | 0.972 |
	| leaf (all evaluated) | 60 | 0.731 | n/a |
	| upper (≥30 positives) | 19 | 0.837 | n/a |
	| anatomy | 10 | 0.869 | n/a |

	Per-label best F1 (threshold swept per label to maximize F1; macro over leaf labels with ≥30
	positives, via `model.per_label_best_f1`):

	| Eval set | macro best-F1 (≥30) | macro-F1 @0.5 (≥30) | macro best-F1 (all evaluated) |
	|---|--:|--:|--:|
	| CT-RATE public | 0.907 | 0.866 | 0.844 |
	| Private | 0.820 | 0.761 | 0.795 |

	F1-optimal thresholds vary widely by label (~0.04–0.75), so per-label tuning recovers ~4–6 macro-F1
	points over a single global threshold.

	Leaf macro-AUC changes little from the public to the private set (0.989 → 0.972), i.e. label
	ranking transfers to the unseen set; the F1 gap is mostly a matter of threshold choice and
	labeling convention, not a domain failure. Separately, a radiologist spot-checked the public
	test labels for 966 reports: 857 were fully accepted, 60 were imperfect but acceptable, and 49
	failed, so about 95% were at least acceptable (see the [dataset card](https://huggingface.co/datasets/chest2vec/chest2vec_labels)).

	## Caveats

	- Weakly supervised — trained on LLM-generated labels (not radiologist ground truth) derived
	from report text, not images. Not a medical device; not for clinical use.
	- `IVC filter` is in the taxonomy for completeness but had no training positives.
	- `score_reports` measures label agreement between two reports as judged by this labeler;
	like CheXbert-F1 it inherits the labeler's own error modes.

	## License & attribution

	Released under CC-BY-NC-SA-4.0. Built on `Qwen/Qwen3-Embedding-0.6B` (Apache-2.0) and
	trained using labels derived from [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE)
	(CC-BY-NC-SA-4.0). If you use this model, cite the CT-RATE paper (arXiv:2403.17834) and
	acknowledge Qwen3-Embedding. See the [dataset card](https://huggingface.co/datasets/chest2vec/chest2vec_labels)
	for the full citation.