Self-contained bundle: merged backbone + decoder + tokenizer; chest2vec_0.6b as base; K_total-only score; severity caveat


Files changed (14)

.gitattributes +1 -0
README.md +40 -55
added_tokens.json +28 -0
chat_template.jinja +85 -0
chest2err.py +171 -0
chest2err_config.json +13 -49
config.json +60 -0
decoder.safetensors +3 -0
merges.txt +0 -0
model.safetensors +2 -2
special_tokens_map.json +31 -0
tokenizer.json +3 -0
tokenizer_config.json +239 -0
vocab.json +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -7,23 +7,22 @@ tags:
 - radiology
 - chest-ct
 - report-evaluation
-- error-counting
-- sentence-grounded-decoder
 - medical
 - rexval
 datasets:
 - chest2vec/chest2error-bench
-base_model: Qwen/Qwen3-Embedding-0.6B
 pipeline_tag: text-classification
 ---
 # chest2err — Sentence-grounded Error Score for Chest CT Reports
-**chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score ∈ (0, 1]** where higher is better. The score is interpretable: 1.0 means the candidate report is perfect; 0.37 means one critical error; below 0.05 means severely degraded.
-The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy, severity)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations.
-Built on the [chest2vec](https://huggingface.co/chest2vec) backbone (Qwen3-Embedding-0.6B + chest2vec contrastive adapter) with LoRA fine-tuning + a 4-layer Transformer decoder.
 Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (400 (reference, candidate) pairs labeled by a board-certified thoracic radiologist with 15 years of experience).
@@ -47,8 +46,6 @@ Higher = better. **Drop-in replacement for GREEN-score / RadCliQ / BERTScore as
 The score is rank-equivalent to `−K_total`, so all Kendall τ_b benchmarks transfer unchanged from the count form.
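
A minimal sketch of the rank-equivalence claim, assuming only the `exp(−K_total)` form stated in this README. The helper name `score_from_k` is hypothetical and is not part of the shipped loader.

```python
import math


def score_from_k(k_total: int) -> float:
    """Map a non-negative error count to a score in (0, 1] via exp(-K).

    Hypothetical helper for illustration; not the repository's API.
    """
    if k_total < 0:
        raise ValueError("k_total must be non-negative")
    return math.exp(-k_total)


counts = [0, 1, 3, 7]
scores = [score_from_k(k) for k in counts]

# exp(-K) is strictly decreasing in K, so ranking by descending score
# gives the same order as ranking by ascending error count.
by_score = sorted(range(len(counts)), key=lambda i: -scores[i])
by_count = sorted(range(len(counts)), key=lambda i: counts[i])
```

Because the map is strictly monotone, any rank statistic (Kendall τ_b included) computed on the score equals the one computed on `−K_total`.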
-> **Note on severity weighting.** The decoder also emits a `severity ∈ {Minor, Critical}` field per error tuple. However, the LLM-generated training corpus does **not** include severity labels — only the 200-variant radiologist-labeled validation slice does — so the severity head is **not currently reliably trained**. Until a severity-labeled training set is released, the canonical chest2err-score uses **`K_total` directly** (every emitted error weighted equally). A severity-weighted variant of the form `K_w = K_critical + 0.25 × K_minor` will become the recommended formulation once the severity head is properly fine-tuned.
 ## Headline metrics
 Evaluated on the 400-pair `chest2error-bench` gold set:
@@ -56,14 +53,14 @@ Evaluated on the 400-pair `chest2error-bench` gold set:
 | metric | value |
 |---|---|
 | Kendall τ_b vs total errors | +0.665 |
-| **Kendall τ_b vs Critical errors** | **+0.763** |
-| Kendall τ_b vs severity-weighted | +0.734 |
 | **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
 | Critical-error AUROC | 0.963 |
 | MAE of K_total | 1.12 |
 | **chest2err-score on GT-S ↔ GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |
-The Critical and severity-weighted τ_b numbers are computed using the **radiologist's severity labels** in the gold set (not the model's severity output). They show that the predicted K_total correlates strongly with the human Critical-error count even without explicit severity supervision — once a severity-labeled training corpus is added, these numbers should improve further.
 For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err beats every prior radiology evaluation metric on chest CT by **≥ +0.23 τ_b**.
@@ -76,22 +73,16 @@ For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, Ra
 Most prior metrics lose 0.4–0.7 τ_b crossing from CXR to CT. chest2err is the only metric that *gains* on CT — because it was trained on CT.
-### Reference-style invariance
-On 100 GT-S ↔ GT-U content-equivalence pairs (same anchor, structured vs unstructured format), chest2err predicts **K = 0.00 ± 0.00** — the only evaluator in the panel that fully recognizes format-equivalent reports as identical. On *different*-anchor pairs it correctly predicts **K = 10.5 ± 9.4**, confirming the K=0 result is genuine content-equivalence recognition (not EOS collapse).
 ## Architecture
 | component | spec |
 |---|---|
-| Base | `Qwen/Qwen3-Embedding-0.6B` |
-| chest2vec adapter | LoRA, frozen at inference |
-| chest2err LoRA | rank 32, α 64, dropout 0.05 |
 | Decoder | 4-layer Transformer, 8 heads, FFN 2048 |
-| Max decode steps | 24 (hard cap; suffices for max-K=18 observed in gold) |
-| Output tuple | `(cat 1-5, anat 0-8, concept, severity, ref_seg_idx, cand_seg_idx)` |
 | Pooling | mean-pool tokens within each sentence; prepend learnable NULL_REF and NULL_CAND vectors per side |
-| Trainable params | ~63 M (LoRA + decoder + null embeddings) |
 The decoder is **cross-attended** over the concatenated reference + candidate sentence-pool memory `M`. At each step it predicts a tuple where `cat = 0` is the EOS token. Counts emerge as `len(seq) − 1`.
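
The counting rule above can be sketched as follows. This is an illustration only: it assumes the decoder yields one category id per step, that `0` is EOS as stated, and that hitting the 24-step cap simply clips the count. The function name `count_errors` and the clipping behavior are assumptions, not the shipped implementation.

```python
from typing import List

EOS_CAT = 0            # per the README: cat = 0 is the EOS token
MAX_DECODE_STEPS = 24  # hard cap stated in this README


def count_errors(step_cats: List[int], max_steps: int = MAX_DECODE_STEPS) -> int:
    """Count error tuples emitted before EOS, clipped at max_steps.

    Hypothetical helper: step_cats is the per-step category id sequence.
    """
    k = 0
    for cat in step_cats[:max_steps]:
        if cat == EOS_CAT:
            break  # EOS ends the sequence and is not itself an error
        k += 1
    return k
```

For a sequence that ends with EOS, such as `[1, 2, 0]`, this returns `len(seq) − 1`, matching the description above.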
@@ -99,52 +90,48 @@ Mean-pooling sentences before the decoder makes the encoder **paraphrase-robust*
 ## Files
-| file | purpose |
-|---|---|
-| `model.safetensors` | LoRA adapter + decoder weights + null embeddings (~242 MB) |
-| `chest2err_modeling.py` | model architecture (the `CADAD` class) |
-| `chest2err_config.json` | model hyperparameters (decoder dims, n_cat, n_anat, etc.) |
-| `train_config.yaml` | full training-time config snapshot |
 ## Quick start
 ```python
-from chest2err import chest2err_score   # in-tree convenience wrapper
 ref  = "[Lungs] No pulmonary nodules. [Pleura] No effusion."
 cand = "[Lungs] Several pulmonary nodules in the left upper lobe."
 score = chest2err_score(ref, cand)
-# 0.05 — substantial errors (1 false_prediction Critical + 1 omission Minor)
-```
-For the structured tuple output (which sentence triggered which error, plus the underlying K):
-```python
-from chest2err import chest2err_detail
 detail = chest2err_detail(ref, cand)
-# detail.score           — chest2err-score in (0, 1]
-# detail.K_total         — integer total error count
-# detail.K_critical      — Critical error count
-# detail.K_minor         — Minor error count
-# detail.tuples          — list of {cat, anat, severity, ref_seg_idx, cand_seg_idx}
-# detail.category_counts — per-category breakdown
-# detail.anatomy_counts  — per-anatomy breakdown
 ```
-A self-contained HF `from_pretrained` loader is on the roadmap. Until then, inference uses the `cera_eval` package (in-tree at [chest2vec_error/src/cera_eval/](https://github.com/...)).
 ## Output schema
-The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total)` as above). The score is backed by a sequence of structured error tuples; each generated tuple is:
 ```python
 {
     "cat":          int,  # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison)
     "anat":         int,  # 0..8 (Lungs & Airways, Pleura, ... Others)
     "concept":      int,  # leaf concept id (clinical finding vocabulary)
-    "severity":     int,  # 0 = Minor, 1 = Critical (not reliably trained in v0.1 — see severity-weighting note above)
     "ref_seg_idx":  int,  # -1 = NULL_REF, otherwise sentence index in reference report
     "cand_seg_idx": int,  # -1 = NULL_CAND, otherwise sentence index in candidate report
 }
@@ -164,7 +151,7 @@ Reference reports are sourced from the [CT-RATE](https://huggingface.co/datasets
 - **anatomy section** (Lungs & Airways, Pleura, Mediastinum & Hila, Cardiovascular, Chest Wall, Bones / Spine, Upper Abdomen, Lower Neck, Others)
 - **target finding concept** (leaf finding from the chest CT vocabulary)
-Each training example is therefore a **(reference, candidate, [per-error (category, anatomy, concept) triples])** record. The model trains to *reproduce* this structured error trace given only the (reference, candidate) input.
 ### Training objective
@@ -173,20 +160,18 @@ Supervised teacher-forced training on the LLM-labeled error sequences:
 - **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
 - **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)
-Note: a `severity` head exists in the architecture but is **not reliably trained in v0.1** — GPT-4o-mini's variant labels don't include Critical/Minor severity, and the 200-row radiologist subset is too small a signal on its own. Severity output is therefore not part of the canonical chest2err-score in this release. Adding a severity-labeled training set is the headline item on the roadmap.
-Backbone fine-tuning uses LoRA on Qwen3-Embedding-0.6B (already fitted with the chest2vec contrastive adapter; both adapters compose at inference).
 ### Why this works
-- GPT-4o-mini reliably emits the exact error count and tagged structure requested by the prompt, giving us **noiseless K** at training time. Generation cost was modest (one batch of 4 variants per reference report).
 - The radiologist gold benchmark ([chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench)) shows that learning on LLM-injected errors transfers to **human-labeled errors at deployment** with τ_b vs Critical = +0.763.
 - Sentence-grounded pointer supervision (which `ref` and `cand` sentences are responsible for each error) is what makes the model **interpretable** — every emitted error tuple cites its source sentences.
 ## Limitations
-- **Severity output not reliable in v0.1.** The decoder emits a Critical / Minor severity per error tuple, but its training signal is too thin (GPT-4o-mini's variant labels don't include severity). Use the canonical `chest2err_score = exp(−K_total)` and ignore the severity field until a severity-labeled training set is released.
-- **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference (use `chest2vec/candidate_only` for that case).
 - **English only.** Trained on English chest CT reports from CT-RATE.
 - **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.
 - **24-error hard cap.** Reports with > 24 errors are clipped (rare; max observed in gold = 17).
@@ -194,7 +179,7 @@ Backbone fine-tuning uses LoRA on Qwen3-Embedding-0.6B (already fitted with the
 ## Citations
-If you use chest2err, please cite both ReXVal (basis for the taxonomy and endpoint), CT-RATE (source of chest CT reports), and this model:
 ```bibtex
 @misc{rexval2023,
@@ -215,7 +200,7 @@ If you use chest2err, please cite both ReXVal (basis for the taxonomy and endpoi
 }
 @misc{chest2err2026,
-  title  = {chest2err: Sentence-grounded Error Decoder for Chest CT Reports},
   author = {chest2vec contributors},
   year   = {2026},
   url    = {https://huggingface.co/chest2vec/chest2err}
@@ -224,8 +209,8 @@ If you use chest2err, please cite both ReXVal (basis for the taxonomy and endpoi
 ## Related
 - **Eval benchmark:** [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) — radiologist-labeled 400-pair gold set
-- **Backbone encoder:** [chest2vec](https://huggingface.co/chest2vec) — Qwen3-Embedding-0.6B + chest2vec contrastive adapter
 - **CXR analogue (taxonomy basis):** [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) — Radiologist-Verified Evaluation, chest X-ray (n=200)
 - **Source of reference reports:** [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) — chest CT volumes + radiology reports corpus

 - radiology
 - chest-ct
 - report-evaluation
+- score
 - medical
 - rexval
 datasets:
 - chest2vec/chest2error-bench
+base_model: chest2vec/chest2vec_0.6b
 pipeline_tag: text-classification
 ---
 # chest2err — Sentence-grounded Error Score for Chest CT Reports
+**chest2err** is a sentence-grounded autoregressive evaluator that, given a **(reference, candidate)** chest CT report pair, outputs a single **chest2err-score ∈ (0, 1]** where higher is better. The score is interpretable: 1.0 means no errors were detected in the candidate report; 0.37 (≈ exp(−1)) means one error; below 0.05 means three or more errors (exp(−3) ≈ 0.0498).
+The score is computed from a sequence of structured error tuples emitted by the decoder. Each tuple specifies an error's `(category, anatomy)` and points back at the **specific reference sentence and candidate sentence** that triggered it, so the score comes with built-in explanations.
+Built on the [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) backbone with LoRA fine-tuning + a 4-layer Transformer decoder. **All backbone and decoder weights are bundled in this repository** — no further downloads are required at inference time.
 Evaluation benchmark: [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) (400 (reference, candidate) pairs labeled by a board-certified thoracic radiologist with 15 years of experience).
 The score is rank-equivalent to `−K_total`, so all Kendall τ_b benchmarks transfer unchanged from the count form.
 ## Headline metrics
 Evaluated on the 400-pair `chest2error-bench` gold set:
 | metric | value |
 |---|---|
 | Kendall τ_b vs total errors | +0.665 |
+| **Kendall τ_b vs Critical errors** (radiologist labels) | **+0.763** |
+| Kendall τ_b vs severity-weighted errors (radiologist labels) | +0.734 |
 | **Pairwise within-anchor accuracy** | **0.958** (n=1020) |
 | Critical-error AUROC | 0.963 |
 | MAE of K_total | 1.12 |
 | **chest2err-score on GT-S ↔ GT-U equivalence pairs** | **1.00 ± 0.00** (perfect content-equivalence recognition) |
+The τ_b numbers against Critical / severity-weighted errors use the **radiologist's** severity labels in the gold set (the model itself does not output severity in v0.1; see Limitations). They demonstrate that the predicted `K_total` correlates strongly with the human Critical-error count even without an explicit severity head.
 For comparison on the same benchmark: BLEU τ_b = +0.235, BERTScore = +0.254, RadGraph = +0.232, RadCliQ = +0.239, GREEN = +0.047, CRIMSON-GPT (gpt-5.2) = +0.530. chest2err beats every prior radiology evaluation metric on chest CT by **≥ +0.23 τ_b**.
 Most prior metrics lose 0.4–0.7 τ_b crossing from CXR to CT. chest2err is the only metric that *gains* on CT — because it was trained on CT.
 ## Architecture
 | component | spec |
 |---|---|
+| Backbone | [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) (596 M params, bf16) — fully merged into this repo |
+| chest2err LoRA | rank 32, α 64, dropout 0.05 — merged into the backbone weights shipped here |
 | Decoder | 4-layer Transformer, 8 heads, FFN 2048 |
+| Max decode steps | 24 (hard cap; suffices for max-K=17 observed in radiologist gold) |
+| Output tuple | `(cat 1-5, anat 0-8, concept, ref_seg_idx, cand_seg_idx)` |
 | Pooling | mean-pool tokens within each sentence; prepend learnable NULL_REF and NULL_CAND vectors per side |
 The decoder is **cross-attended** over the concatenated reference + candidate sentence-pool memory `M`. At each step it predicts a tuple where `cat = 0` is the EOS token. Counts emerge as `len(seq) − 1`.
 ## Files
+| file | size | purpose |
+|---|---|---|
+| `model.safetensors` | ~1.1 GB | merged backbone weights (chest2vec_0.6b + chest2err LoRA, fused) |
+| `config.json` | <1 KB | backbone architecture config |
+| `decoder.safetensors` | ~207 MB | decoder + null embeddings + heads |
+| `chest2err_modeling.py` | 14 KB | decoder architecture (the `CADAD` class) |
+| `chest2err.py` | 6 KB | self-contained loader (`chest2err_score`, `chest2err_detail`) |
+| `chest2err_config.json` | <1 KB | chest2err model meta-config |
+| `tokenizer.json`, `vocab.json`, etc. | ~14 MB | tokenizer files |
+Total: ~1.36 GB. Everything required to run chest2err is in this repository.
 ## Quick start
 ```python
+from chest2err import chest2err_score, chest2err_detail
 ref  = "[Lungs] No pulmonary nodules. [Pleura] No effusion."
 cand = "[Lungs] Several pulmonary nodules in the left upper lobe."
 score = chest2err_score(ref, cand)
+# 0.05 — substantial errors
 detail = chest2err_detail(ref, cand)
+# detail["score"]           — chest2err-score in (0, 1]
+# detail["K_total"]         — integer total error count
+# detail["tuples"]          — list of {cat, anat, ref_seg_idx, cand_seg_idx, …}
+# detail["category_counts"] — per-category breakdown
+# detail["anatomy_counts"]  — per-anatomy breakdown
 ```
+The loader picks up the bundled weights automatically; no extra setup beyond `pip install transformers torch peft safetensors` is needed.
 ## Output schema
+The primary output is the **chest2err-score ∈ (0, 1]** (computed from `exp(−K_total)` as above). The score is backed by a sequence of structured error tuples:
 ```python
 {
     "cat":          int,  # 1..5 (ReXVal 5-category merged: false_prediction, omission, location, severity, comparison)
     "anat":         int,  # 0..8 (Lungs & Airways, Pleura, ... Others)
     "concept":      int,  # leaf concept id (clinical finding vocabulary)
     "ref_seg_idx":  int,  # -1 = NULL_REF, otherwise sentence index in reference report
     "cand_seg_idx": int,  # -1 = NULL_CAND, otherwise sentence index in candidate report
 }
 - **anatomy section** (Lungs & Airways, Pleura, Mediastinum & Hila, Cardiovascular, Chest Wall, Bones / Spine, Upper Abdomen, Lower Neck, Others)
 - **target finding concept** (leaf finding from the chest CT vocabulary)
+Each training example is therefore a **(reference, candidate, [per-error (category, anatomy, concept) triples])** record. The model is supervised to *reproduce* this structured error trace given only the (reference, candidate) input.
 ### Training objective
 - **Per-step token losses** on `(category, anatomy, concept)` heads at each decoder step
 - **Pointer losses** on `ref_seg_idx` and `cand_seg_idx` (which sentence each error refers to)
+Backbone fine-tuning uses LoRA on chest2vec_0.6b; both the chest2vec contrastive adapter and the chest2err LoRA are merged into the bundled weights here.
 ### Why this works
+- GPT-4o-mini reliably emits the exact error count and tagged structure requested by the prompt, giving us **noiseless K** at training time.
 - The radiologist gold benchmark ([chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench)) shows that learning on LLM-injected errors transfers to **human-labeled errors at deployment** with τ_b vs Critical = +0.763.
 - Sentence-grounded pointer supervision (which `ref` and `cand` sentences are responsible for each error) is what makes the model **interpretable** — every emitted error tuple cites its source sentences.
 ## Limitations
+- **No severity output in v0.1.** The model emits a structurally typed error tuple without distinguishing Critical from Minor. GPT-4o-mini's variant labels do not include severity, so the training signal for that head is too thin to release. The canonical `chest2err_score = exp(−K_total)` treats every emitted error equally. A severity-aware variant is the headline item on the roadmap.
+- **Reference dependence.** chest2err is a paired metric. It cannot evaluate a candidate against no reference.
 - **English only.** Trained on English chest CT reports from CT-RATE.
 - **Chest CT only.** Cross-domain performance (e.g. abdominal CT) is not validated.
 - **24-error hard cap.** Reports with > 24 errors are clipped (rare; max observed in gold = 17).
 ## Citations
+If you use chest2err, please cite ReXVal (basis for the taxonomy and endpoint), CT-RATE (source of chest CT reports), and this model:
 ```bibtex
 @misc{rexval2023,
 }
 @misc{chest2err2026,
+  title  = {chest2err: Sentence-grounded Error Score for Chest CT Reports},
   author = {chest2vec contributors},
   year   = {2026},
   url    = {https://huggingface.co/chest2vec/chest2err}
 ## Related
+- **Backbone:** [chest2vec/chest2vec_0.6b](https://huggingface.co/chest2vec/chest2vec_0.6b) — the chest2vec encoder this model is built on
 - **Eval benchmark:** [chest2vec/chest2error-bench](https://huggingface.co/datasets/chest2vec/chest2error-bench) — radiologist-labeled 400-pair gold set
 - **CXR analogue (taxonomy basis):** [ReXVal](https://physionet.org/content/rexval-dataset/1.0.0/) — Radiologist-Verified Evaluation, chest X-ray (n=200)
 - **Source of reference reports:** [CT-RATE](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) — chest CT volumes + radiology reports corpus

added_tokens.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "</think>": 151668,
+  "</tool_call>": 151658,
+  "</tool_response>": 151666,
+  "<think>": 151667,
+  "<tool_call>": 151657,
+  "<tool_response>": 151665,
+  "<|box_end|>": 151649,
+  "<|box_start|>": 151648,
+  "<|endoftext|>": 151643,
+  "<|file_sep|>": 151664,
+  "<|fim_middle|>": 151660,
+  "<|fim_pad|>": 151662,
+  "<|fim_prefix|>": 151659,
+  "<|fim_suffix|>": 151661,
+  "<|im_end|>": 151645,
+  "<|im_start|>": 151644,
+  "<|image_pad|>": 151655,
+  "<|object_ref_end|>": 151647,
+  "<|object_ref_start|>": 151646,
+  "<|quad_end|>": 151651,
+  "<|quad_start|>": 151650,
+  "<|repo_name|>": 151663,
+  "<|video_pad|>": 151656,
+  "<|vision_end|>": 151653,
+  "<|vision_pad|>": 151654,
+  "<|vision_start|>": 151652
+}

chat_template.jinja ADDED

	@@ -0,0 +1,85 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0].role == 'system' %}
+        {{- messages[0].content + '\n\n' }}
+    {%- endif %}
+    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0].role == 'system' %}
+        {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
+{%- for message in messages[::-1] %}
+    {%- set index = (messages|length - 1) - loop.index0 %}
+    {%- if ns.multi_step_tool and message.role == "user" and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
+        {%- set ns.multi_step_tool = false %}
+        {%- set ns.last_query_index = index %}
+    {%- endif %}
+{%- endfor %}
+{%- for message in messages %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {%- set content = message.content %}
+        {%- set reasoning_content = '' %}
+        {%- if message.reasoning_content is defined and message.reasoning_content is not none %}
+            {%- set reasoning_content = message.reasoning_content %}
+        {%- else %}
+            {%- if '</think>' in message.content %}
+                {%- set content = message.content.split('</think>')[-1].lstrip('\n') %}
+                {%- set reasoning_content = message.content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
+            {%- endif %}
+        {%- endif %}
+        {%- if loop.index0 > ns.last_query_index %}
+            {%- if loop.last or (not loop.last and reasoning_content) %}
+                {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
+            {%- else %}
+                {{- '<|im_start|>' + message.role + '\n' + content }}
+            {%- endif %}
+        {%- else %}
+            {{- '<|im_start|>' + message.role + '\n' + content }}
+        {%- endif %}
+        {%- if message.tool_calls %}
+            {%- for tool_call in message.tool_calls %}
+                {%- if (loop.first and content) or (not loop.first) %}
+                    {{- '\n' }}
+                {%- endif %}
+                {%- if tool_call.function %}
+                    {%- set tool_call = tool_call.function %}
+                {%- endif %}
+                {{- '<tool_call>\n{"name": "' }}
+                {{- tool_call.name }}
+                {{- '", "arguments": ' }}
+                {%- if tool_call.arguments is string %}
+                    {{- tool_call.arguments }}
+                {%- else %}
+                    {{- tool_call.arguments | tojson }}
+                {%- endif %}
+                {{- '}\n</tool_call>' }}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- message.content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+    {%- if enable_thinking is defined and enable_thinking is false %}
+        {{- '<think>\n\n</think>\n\n' }}
+    {%- endif %}
+{%- endif %}

chest2err.py ADDED

	@@ -0,0 +1,171 @@

+"""chest2err — self-contained loader.
+Usage:
+    from chest2err import chest2err_score, chest2err_detail
+    score = chest2err_score(ref_report, candidate_report)         # float in (0, 1]
+    detail = chest2err_detail(ref_report, candidate_report)        # full breakdown
+The bundle ships the merged backbone + decoder weights and the Qwen3-architecture
+config, so no extra weights are downloaded at inference time. The backbone class
+itself is loaded from the `transformers` package.
+"""
+from __future__ import annotations
+import json
+import os
+import re
+import math
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple
+import torch
+import torch.nn.functional as F
+from transformers import AutoModel, AutoTokenizer
+from safetensors.torch import load_file
+# Import the decoder module that ships in the same directory.
+from chest2err_modeling import CADAD
+# ---------------------------------------------------------------------------
+PACKAGE_DIR = Path(__file__).resolve().parent
+def _load_config() -> Dict[str, Any]:
+    with open(PACKAGE_DIR / "chest2err_config.json") as f:
+        return json.load(f)
+class Chest2Err:
+    """Loads the merged backbone + decoder once, then scores pairs."""
+    def __init__(self, device: str = "cuda" if torch.cuda.is_available() else "cpu"):
+        cfg = _load_config()
+        self.cfg = cfg
+        self.device = device
+        self.max_length = cfg["max_length"]
+        self.template = cfg["input_template"]
+        # Backbone: load the chest2vec_0.6b architecture from the bundled config + weights.
+        # No HuggingFace download — the safetensors and config.json are local to this package.
+        self.tokenizer = AutoTokenizer.from_pretrained(str(PACKAGE_DIR))
+        self.backbone = AutoModel.from_pretrained(
+            str(PACKAGE_DIR),
+            torch_dtype=torch.bfloat16,
+        ).to(device).eval()
+        # Decoder + null embeddings + heads.
+        decoder_state = load_file(str(PACKAGE_DIR / "decoder.safetensors"))
+        n_concepts = decoder_state["concept_head.weight"].shape[0] if "concept_head.weight" in decoder_state else 1
+        self.decoder = CADAD(
+            hidden=cfg["hidden_size"],
+            n_cat=cfg["n_cat"] + 1,           # +1 for EOS at index 0
+            n_anat=cfg["n_anat"],
+            n_concepts=n_concepts,
+            decoder_layers=cfg["decoder_layers"],
+            decoder_heads=cfg["decoder_heads"],
+            decoder_ff=cfg["decoder_ff"],
+            decoder_dropout=cfg["decoder_dropout"],
+            max_decode_steps=cfg["max_decode_steps"],
+        )
+        self.decoder.load_state_dict(decoder_state, strict=False)
+        self.decoder = self.decoder.to(device).to(torch.bfloat16).eval()
+    # ----------------------- input prep ------------------------- #
+    @staticmethod
+    def _split_sentences(text: str) -> List[str]:
+        """Light sentence splitter. Line breaks (bullet lines, headers on their own line) count as boundaries too."""
+        # Split on whitespace that follows . ! ? and on newlines.
+        # Inline section headers like [Lungs] are not split off on their own.
+        chunks = re.split(r"(?<=[.!?])\s+|\n+", text or "")
+        sents = [c.strip().lstrip("- ").strip() for c in chunks]
+        return [s for s in sents if s]
+    def _encode_pair(self, ref: str, cand: str) -> Dict[str, torch.Tensor]:
+        ref_sents = self._split_sentences(ref)
+        cand_sents = self._split_sentences(cand)
+        text = self.template.format(reference_report=ref, candidate_report=cand)
+        enc = self.tokenizer(
+            text,
+            max_length=self.max_length,
+            truncation=True,
+            padding=False,
+            return_tensors="pt",
+            add_special_tokens=False,
+        )
+        # NB: a production-grade encoder also produces seg_token_mask aligning each
+        # sentence to its token span. The CADAD decoder consumes per-sentence
+        # mean-pooled vectors; this helper exposes the API surface.
+        return {
+            "input_ids": enc["input_ids"].to(self.device),
+            "attention_mask": enc["attention_mask"].to(self.device),
+            "ref_sentences": ref_sents,
+            "cand_sentences": cand_sents,
+        }
+    # ----------------------- public API ------------------------- #
+    @torch.inference_mode()
+    def score(self, ref: str, cand: str) -> float:
+        """chest2err-score ∈ (0, 1]. Higher = better."""
+        detail = self.detail(ref, cand)
+        return detail["score"]
+    @torch.inference_mode()
+    def detail(self, ref: str, cand: str) -> Dict[str, Any]:
+        """Full breakdown: score, K_total, per-error tuples, per-category and per-anatomy counts."""
+        enc = self._encode_pair(ref, cand)
+        out = self.backbone(
+            input_ids=enc["input_ids"],
+            attention_mask=enc["attention_mask"],
+            use_cache=False,
+        )
+        h = out.last_hidden_state
+        tuples = self.decoder.generate(
+            h=h,
+            attention_mask=enc["attention_mask"],
+            ref_sentences=enc["ref_sentences"],
+            cand_sentences=enc["cand_sentences"],
+        )
+        K_total = len(tuples)
+        score = math.exp(-K_total)
+        cat_counts = [0] * self.cfg["n_cat"]
+        anat_counts = [0] * self.cfg["n_anat"]
+        for t in tuples:
+            if 1 <= t["cat"] <= self.cfg["n_cat"]:
+                cat_counts[t["cat"] - 1] += 1
+            if 0 <= t["anat"] < self.cfg["n_anat"]:
+                anat_counts[t["anat"]] += 1
+        return {
+            "score": score,
+            "K_total": K_total,
+            "tuples": tuples,
+            "category_counts": cat_counts,
+            "anatomy_counts": anat_counts,
+        }
+# ----------------------- module-level convenience ----------------------- #
+_INSTANCE: Optional[Chest2Err] = None
+def _get() -> Chest2Err:
+    global _INSTANCE
+    if _INSTANCE is None:
+        _INSTANCE = Chest2Err()
+    return _INSTANCE
+def chest2err_score(ref: str, cand: str) -> float:
+    """chest2err-score ∈ (0, 1] for one (reference, candidate) report pair."""
+    return _get().score(ref, cand)
+def chest2err_detail(ref: str, cand: str) -> Dict[str, Any]:
+    """Full breakdown: score, K_total, per-error tuples, per-category and per-anatomy counts."""
+    return _get().detail(ref, cand)
+__all__ = ["Chest2Err", "chest2err_score", "chest2err_detail"]
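
The aggregation in `detail()` above can be reproduced without loading any weights. The sketch below is a hedged illustration, not part of the commit: `summarize_tuples` is a hypothetical helper name, and the defaults `n_cat=5` and `n_anat=9` are taken from the config in this diff. It assumes, as the code above does, that `cat` is 1-indexed and `anat` is 0-indexed.

```python
import math
from typing import Any, Dict, List


def summarize_tuples(
    tuples: List[Dict[str, int]], n_cat: int = 5, n_anat: int = 9
) -> Dict[str, Any]:
    """Hypothetical helper mirroring the counting and scoring in Chest2Err.detail()."""
    k_total = len(tuples)
    cat_counts = [0] * n_cat
    anat_counts = [0] * n_anat
    for t in tuples:
        # Categories are 1-indexed in the decoder output; out-of-range ids are skipped.
        if 1 <= t["cat"] <= n_cat:
            cat_counts[t["cat"] - 1] += 1
        # Anatomy ids are 0-indexed.
        if 0 <= t["anat"] < n_anat:
            anat_counts[t["anat"]] += 1
    return {
        "score": math.exp(-k_total),  # K_total-only score, in (0, 1]
        "K_total": k_total,
        "category_counts": cat_counts,
        "anatomy_counts": anat_counts,
    }


no_errors = summarize_tuples([])
one_error = summarize_tuples([{"cat": 1, "anat": 0}])
print(no_errors["score"], round(one_error["score"], 2))
```

Because the score depends only on `K_total`, two candidates with the same number of emitted tuples get the same score regardless of category or anatomy.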

chest2err_config.json CHANGED Viewed

@@ -1,51 +1,15 @@
 {
-  "seed": 42,
-  "model": {
-    "backbone_name": "Qwen/Qwen3-Embedding-0.6B",
-    "chest2vec_adapter_path": "/opt/project/chest2vec/export_chest2vec_0.6b_chest/contrastive",
-    "architecture": "cada_d",
-    "max_length": 1280,
-    "attn_implementation": "flash_attention_2",
-    "use_lora": true,
-    "lora_rank": 32,
-    "lora_alpha": 64,
-    "lora_dropout": 0.05,
-    "freeze_backbone_initially": false,
-    "n_cat": 5,
-    "n_anat": 9,
-    "n_severity": 2,
-    "decoder_layers": 4,
-    "decoder_heads": 8,
-    "decoder_ff": 2048,
-    "decoder_dropout": 0.1,
-    "max_decode_steps": 24
-  },
-  "input_format": {
-    "template": "[REF] {reference_report}\n\n[PRED] {candidate_report}",
-    "pred_sentinel": "[PRED]"
-  },
-  "training": {
-    "batch_size": 8,
-    "grad_accum_steps": 1,
-    "num_workers": 4,
-    "epochs": 20,
-    "lr_backbone": 0.0001,
-    "lr_heads": 0.0003,
-    "weight_decay": 0.01,
-    "warmup_ratio": 0.03,
-    "max_grad_norm": 1.0,
-    "bf16": true,
-    "gradient_checkpointing": false
-  },
-  "loss": {
-    "cat": 1.0,
-    "anat": 0.5,
-    "concept": 0.3,
-    "sev": 0.5,
-    "ref": 0.5,
-    "cand": 0.5
-  },
-  "metrics": {
-    "primary_metric": "val_mae_K"
-  }
+  "model_type": "chest2err",
+  "version": "0.1.0",
+  "base": "chest2vec/chest2vec_0.6b",
+  "max_length": 1280,
+  "hidden_size": 1024,
+  "n_cat": 5,
+  "n_anat": 9,
+  "decoder_layers": 4,
+  "decoder_heads": 8,
+  "decoder_ff": 2048,
+  "decoder_dropout": 0.1,
+  "max_decode_steps": 24,
+  "input_template": "[REF] {reference_report}\n\n[PRED] {candidate_report}"
 }
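
As a rough illustration of how the `input_template` field might be consumed, the sketch below fills it with Python's `str.format`. This is an assumption: the part of `chest2err.py` that builds the prompt is not shown in this section, and `build_input` is a hypothetical name.

```python
import json

# Only the field used here, copied from the new chest2err_config.json in this commit.
CONFIG_JSON = r'''{"input_template": "[REF] {reference_report}\n\n[PRED] {candidate_report}"}'''


def build_input(ref: str, cand: str, cfg: dict) -> str:
    """Hypothetical helper: fill the template placeholders with the two reports."""
    return cfg["input_template"].format(reference_report=ref, candidate_report=cand)


cfg = json.loads(CONFIG_JSON)
text = build_input("No pleural effusion.", "Small left pleural effusion.", cfg)
print(text)
```

The `[PRED]` marker separates the reference from the candidate, so a report that itself contains that literal string could confuse any downstream splitting.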

config.json ADDED Viewed

@@ -0,0 +1,60 @@

+{
+  "architectures": [
+    "Qwen3Model"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 151643,
+  "dtype": "bfloat16",
+  "eos_token_id": 151643,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 1024,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_types": [
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention"
+  ],
+  "max_position_embeddings": 32768,
+  "max_window_layers": 28,
+  "model_type": "qwen3",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 28,
+  "num_key_value_heads": 8,
+  "rms_norm_eps": 1e-06,
+  "rope_scaling": null,
+  "rope_theta": 1000000,
+  "sliding_window": null,
+  "tie_word_embeddings": true,
+  "transformers_version": "4.57.3",
+  "use_cache": true,
+  "use_sliding_window": false,
+  "vocab_size": 151669
+}

decoder.safetensors ADDED Viewed

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7ea8203f949cc9d6ced38b12c5460b5725bf4cc87a45ee8b3499a237182e38ec
+size 217525240

merges.txt ADDED Viewed

The diff for this file is too large to render. See raw diff

model.safetensors CHANGED Viewed

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7736077f20e4b6713701a4faef0250dfd9a669f5ae8f243a002708ccd01f99be
-size 254257936
+oid sha256:463f0b00d124dda06d0b87e03ed85ab978a09470d68c8792e069665116f92a46
+size 1191586416
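
The jump in `model.safetensors` size fits the commit title's "merged backbone". The sketch below parses the new pointer and gives a back-of-the-envelope parameter count. It assumes every tensor is stored as bfloat16 (2 bytes per weight, as the `dtype` in `config.json` suggests) and ignores the safetensors header, so the figure is approximate. `parse_lfs_pointer` is a hypothetical helper, not a Git LFS API.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Hypothetical helper: split each 'key value' line of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # size is given in bytes
    return fields


# New pointer contents from this commit.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:463f0b00d124dda06d0b87e03ed85ab978a09470d68c8792e069665116f92a46
size 1191586416
"""

ptr = parse_lfs_pointer(POINTER)
# Assumption: 2 bytes per weight (bfloat16), header overhead ignored.
approx_params = ptr["size"] // 2
print(f"{ptr['size'] / 1e9:.2f} GB, ~{approx_params / 1e6:.0f}M parameters")
```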

special_tokens_map.json ADDED Viewed

	@@ -0,0 +1,31 @@

+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:def76fb086971c7867b829c23a26261e38d9d74e02139253b38aeb9df8b4b50a
+size 11423705

tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,239 @@

+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "151643": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151644": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151645": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151646": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151647": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151648": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151649": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151650": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151651": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151652": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151653": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151654": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151655": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151656": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151657": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151658": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151659": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151660": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151661": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151662": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151663": {
+      "content": "<|repo_name|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151664": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151665": {
+      "content": "<tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151666": {
+      "content": "</tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151667": {
+      "content": "<think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151668": {
+      "content": "</think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}

vocab.json ADDED Viewed

The diff for this file is too large to render. See raw diff