dipta007 committed on
Commit 486896f · verified · 1 Parent(s): 188754b

Update model card: 7 rewards, OOD coverbench, pretty-print helper, fix max-len

Files changed (1)
  1. README.md +108 -16
README.md CHANGED
@@ -17,12 +17,13 @@ tags:
17
 
18
  # DecomposeRL-7B
19
 
20
- **DecomposeRL-7B** is a fact-verification model that learns to decompose claims into atomic sub-questions, iteratively answer them from an evidence document, and produce a final `Supported` / `Refuted` judgment. It is trained from `Qwen2.5-7B-Instruct` with **GRPO + LoRA** using rewards over format validity, sub-claim coverage, answer necessity, and question diversity.
21
 
22
  ## Highlights
23
 
24
  - **84.5% micro-average balanced accuracy** across 9 in-domain claim-verification benchmarks (sample-weighted)
25
  - **84.6% macro-average balanced accuracy** across the same 9 benchmarks
 
26
  - Strong on long-form evidence: 87% on Ex-FEVER, 92% on FEVEROUS, 76% on HoVer
27
  - Reasoning is **fully transparent**: the model emits its sub-claim checklist, every question it asked, every quote from evidence, and a final label
28
 
@@ -35,7 +36,7 @@ tags:
35
  | **Parameters** | 7B |
36
  | **Training** | GRPO + LoRA (r=64, α=128) |
37
  | **LoRA Targets** | q, k, v, o, gate, up, down projections |
38
- | **Context Length** | 4,096 tokens |
39
  | **Language** | English |
40
 
41
  ## Method
@@ -47,15 +48,32 @@ DecomposeRL trains the policy to follow a **decompose-question-answer-verify** l
47
  3. **Sufficiency check** (`<think>`): track which sub-claims are resolved; loop until every sub-claim is addressed.
48
  4. **Final verdict** (`<verification>`): `Supported` or `Refuted`.
49
 
50
- ### Training Rewards
51
 
52
- GRPO is supervised with a composite reward over generated trajectories:
53
 
54
- - **Format reward**: well-formed `<think>`/`<question>`/`<answer>`/`<verification>` structure
55
- - **Verification reward**: correct final label
56
- - **Necessity reward**: generated sub-questions are necessary to verify the claim
57
- - **Diversity reward**: sub-questions cover distinct aspects (MMR-based)
58
- - **Coverage reward**: sub-questions jointly cover the claim
59
 
60
  ## Quickstart
61
 
@@ -108,33 +126,99 @@ messages = [{"role": "user", "content": user_prompt}]
108
  text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
109
  inputs = tokenizer([text], return_tensors="pt").to(model.device)
110
 
111
- out = model.generate(**inputs, max_new_tokens=2048, temperature=0.7, do_sample=True)
 
112
  response = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
113
  print(response)
114
  ```
115
 
116
  The full training-time prompt template (with extended instructions, a worked example, and sub-claim classification guidance) lives in `decomposer/prompts.py` of the source repo and is what gives the strongest performance.
117
 
118
- ### Parsing the Output
119
 
120
- The final label is between the last `<verification>` and `</verification>` tags:
121
 
122
  ```python
123
  import re
124
 
125
- match = re.search(r"<verification>\s*(Supported|Refuted)\s*</verification>", response)
126
- label = match.group(1) if match else None
127
  ```
128
 
129
  ### Using vLLM
130
 
131
  ```bash
132
- vllm serve dipta007/decomposeRL-7b --max-model-len 4096
133
  ```
134
 
 
 
135
  ## Performance
136
 
137
- Balanced accuracy on 9 in-domain claim-verification benchmarks (best checkpoint, `step 5400`):
138
 
139
  | Dataset | # Examples | Balanced Acc |
140
  |---|---:|---:|
@@ -150,6 +234,14 @@ Balanced accuracy on 9 in-domain claim-verification benchmarks (best checkpoint,
150
  | **Micro-average** (sample-weighted) | 14,503 | **0.8445** |
151
  | **Macro-average** | n/a | **0.8417** |
152
 
153
  ## Intended Use
154
 
155
  - **In-scope**: verifying factual claims against a *provided* evidence document (open-book fact verification, retrieval-augmented fact-checking pipelines).
 
17
 
18
  # DecomposeRL-7B
19
 
20
+ **DecomposeRL-7B** is a fact-verification model that learns to *decompose* a claim into atomic sub-questions, iteratively answer them from an evidence document, and produce a final `Supported` / `Refuted` judgment. It is trained from `Qwen2.5-7B-Instruct` with **GRPO + LoRA** under a stack of **seven complementary rewards** that shape the reward landscape around three axes: structural correctness, per-question quality, and set-level sufficiency.
21
 
22
  ## Highlights
23
 
24
  - **84.5% micro-average balanced accuracy** across 9 in-domain claim-verification benchmarks (sample-weighted)
25
  - **84.6% macro-average balanced accuracy** across the same 9 benchmarks
26
+ - **60.2% balanced accuracy on Coverbench** (out-of-domain, long-evidence)
27
  - Strong on long-form evidence: 87% on Ex-FEVER, 92% on FEVEROUS, 76% on HoVer
28
  - Reasoning is **fully transparent**: the model emits its sub-claim checklist, every question it asked, every quote from evidence, and a final label
29
 
 
36
  | **Parameters** | 7B |
37
  | **Training** | GRPO + LoRA (r=64, α=128) |
38
  | **LoRA Targets** | q, k, v, o, gate, up, down projections |
39
+ | **Max Sequence Length** | 16,016 tokens (training-time) |
40
  | **Language** | English |
41
 
42
  ## Method
 
48
  3. **Sufficiency check** (`<think>`): track which sub-claims are resolved; loop until every sub-claim is addressed.
49
  4. **Final verdict** (`<verification>`): `Supported` or `Refuted`.
50
 
51
+ ### Reward Stack: seven complementary signals
52
 
53
+ GRPO is supervised with a sum of seven rewards $R(\tau) = \sum_k R_k(\tau)$, grouped into three families:
54
 
55
+ **Programmatic anchors** (no judge call)
56
+
57
+ 1. **Format** ($R_\text{fmt}$): fraction of structural conditions satisfied (well-formed XML, `<question>`→`<answer>` alternation, valid final verification label). Treated as a gating prerequisite.
58
+ 2. **Question count** ($R_\text{qc}$): triangular kernel on $r = n/n^\star$ (predicted vs. silver decomposition length), $\max(0, 1-|r-1|)$, peaking at $r=1$ and vanishing for $r\ge 2$.
59
+ 3. **Diversity** ($R_\text{div}$): MMR-style penalty for redundancy across $\{q_1,\dots,q_n\}$, computed over `Qwen3-Embedding-8B` embeddings as $-\tfrac{1}{n}\sum_{i=2}^{n}\max_{j<i}\cos(q_i, q_j)$.
60
+
61
+ **Set-level signals**
62
+
63
+ 4. **Coverage** ($R_\text{cov}$): a judge predicts the verdict from the answers alone (without the evidence document), and the policy is rewarded when that prediction matches the gold label. This is the cleanest test of whether the decomposition is *collectively sufficient*.
64
+ 5. **Verification** ($R_\text{ver}$): a sparse but unambiguous outcome anchor that checks whether the final `<verification>` label matches the gold label.
65
+
66
+ **Leave-one-out and per-question composites**
67
+
68
+ 6. **Necessity** ($R_\text{nec}$): for each $q_i$, run the coverage judge with and without it. Score on a four-state matrix (necessary $+1$ / redundant $+\tfrac{1}{2}$ / neutral $0$ / harmful $-1$) and aggregate via the minimum. The harmful $-1$ case is the only negative signal in the stack; it lets the policy *remove* misleading questions.
69
+ 7. **Joint multiplicative quality** ($R_\text{joint}$): three judge-based sub-signals composed multiplicatively per question:
70
+    - **(7a) Answerability** ($R_\text{ans}^{(i)}$): is $q_i$ answerable from the evidence alone?
71
+    - **(7b) Atomicity** ($R_\text{atom}^{(i)}$): five-criterion checklist (a real question, single-focus, no compound conjunctions, verifiable, claim-grounded).
72
+    - **(7c) Answer correctness** ($R_\text{corr}^{(i)}$): is $a_i$ faithful to the document (no contradictions, no extrinsic info)?
73
+
74
+ $$R_\text{joint} = \tfrac{1}{n}\sum_{i=1}^{n} R_\text{ans}^{(i)} \cdot R_\text{atom}^{(i)} \cdot R_\text{corr}^{(i)}$$
75
+
76
+ A failure on any single axis drives that question's term to zero. For honest abstentions (`a_i = "I don't know"`) the undefined $R_\text{corr}$ factor is dropped so a calibrated abstention is rewarded by question quality alone.
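
The programmatic and compositional parts of the reward stack can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the training code: the function names are hypothetical, embeddings are passed in as plain lists of floats rather than computed with `Qwen3-Embedding-8B`, judge scores are assumed to arrive as numbers in $[0, 1]$, and an abstention is marked by passing `None` for the correctness score.

```python
import math
from typing import Optional, Sequence


def question_count_reward(n_pred: int, n_silver: int) -> float:
    """Triangular kernel max(0, 1 - |r - 1|) with r = n_pred / n_silver.

    Assumes n_silver >= 1 (a silver decomposition always has a question).
    """
    r = n_pred / n_silver
    return max(0.0, 1.0 - abs(r - 1.0))


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def diversity_reward(embeddings: Sequence[Sequence[float]]) -> float:
    """-(1/n) * sum_{i=2..n} max_{j<i} cos(q_i, q_j); 0.0 when n <= 1."""
    n = len(embeddings)
    if n <= 1:
        return 0.0
    penalty = sum(
        max(_cosine(embeddings[i], embeddings[j]) for j in range(i))
        for i in range(1, n)
    )
    return -penalty / n


def necessity_reward(per_question_scores: Sequence[float]) -> float:
    """Aggregate per-question necessity scores (+1, +0.5, 0, -1) via the minimum."""
    # Returning 0.0 for an empty trace is an assumption made for this sketch.
    return min(per_question_scores) if per_question_scores else 0.0


def joint_reward(
    answerable: Sequence[float],
    atomic: Sequence[float],
    correct: Sequence[Optional[float]],
) -> float:
    """Mean over questions of ans * atom * corr.

    A `None` correctness score marks an abstention, in which case the
    correctness factor is dropped and only question quality counts.
    """
    n = len(answerable)
    if n == 0:
        return 0.0  # assumption: no questions means no joint reward
    total = 0.0
    for ans, atom, corr in zip(answerable, atomic, correct):
        term = ans * atom
        if corr is not None:
            term *= corr
        total += term
    return total / n
```

For example, `question_count_reward(3, 2)` evaluates $r = 1.5$ and returns `0.5`, and a single non-atomic question in a two-question trace halves `joint_reward` even when every other score is perfect.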
77
 
78
  ## Quickstart
79
 
 
126
  text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
127
  inputs = tokenizer([text], return_tensors="pt").to(model.device)
128
 
129
+ # max_new_tokens matches training-time max_completion_length
130
+ out = model.generate(**inputs, max_new_tokens=4500, temperature=0.7, do_sample=True)
131
  response = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
132
  print(response)
133
  ```
134
 
135
  The full training-time prompt template (with extended instructions, a worked example, and sub-claim classification guidance) lives in `decomposer/prompts.py` of the source repo and is what gives the strongest performance.
136
 
137
+ ### Pretty-print the trace
138
 
139
+ The model produces an iterative `<think>` / `<question>` / `<answer>` / `<verification>` trace. The helper below parses it into a structured form and prints it as a readable conversation:
140
 
141
  ```python
142
  import re
143
 
144
+ TAG_RE = re.compile(r"<(think|question|answer|verification)>(.*?)</\1>", re.DOTALL)
145
+
146
+ def parse_trace(text: str):
147
+     """Return a list of (tag, content) tuples in the order they appear."""
148
+     return [(tag, body.strip()) for tag, body in TAG_RE.findall(text)]
149
+
150
+ def pretty_print(text: str) -> None:
151
+     cycle_idx = 0
152
+     pending_q = None
153
+     for tag, body in parse_trace(text):
154
+         if tag == "think":
155
+             print("─" * 78)
156
+             print("🧠 THINK")
157
+             print("─" * 78)
158
+             print(body)
159
+             print()
160
+         elif tag == "question":
161
+             cycle_idx += 1
162
+             pending_q = body
163
+         elif tag == "answer":
164
+             print(f"🔸 Q{cycle_idx}: {pending_q}")
165
+             print(f"💬 A{cycle_idx}: {body}")
166
+             print()
167
+             pending_q = None
168
+         elif tag == "verification":
169
+             print("=" * 78)
170
+             print(f"✅ VERIFICATION: {body}")
171
+             print("=" * 78)
172
+
173
+ # usage:
174
+ pretty_print(response)
175
+
176
+ # extract just the final label:
177
+ label_match = re.search(r"<verification>\s*(Supported|Refuted)\s*</verification>", response)
178
+ label = label_match.group(1) if label_match else None
179
+ print("Label:", label)
180
+ ```
181
+
182
+ **Example output:**
183
+
184
+ ```
185
+ ──────────────────────────────────────────────────────────────────────────────
186
+ 🧠 THINK
187
+ ──────────────────────────────────────────────────────────────────────────────
188
+ The claim has two atomic sub-claims:
189
+ 1. The Eiffel Tower was completed in 1887 (temporal, independently falsifiable)
190
+ 2. The Eiffel Tower stands 330 metres tall (quantitative)
191
+ I will verify each in turn against the evidence document.
192
+
193
+ 🔸 Q1: When was the Eiffel Tower completed?
194
+ 💬 A1: The evidence states the tower was built "from 1887 to 1889", so it was
195
+ completed in 1889, not 1887.
196
+
197
+ 🔸 Q2: What is the height of the Eiffel Tower?
198
+ 💬 A2: The evidence states "330 metres (1,083 ft) tall." - 330 m confirmed.
199
+
200
+ ──────────────────────────────────────────────────────────────────────────────
201
+ 🧠 THINK
202
+ ──────────────────────────────────────────────────────────────────────────────
203
+ Sub-claim 1 is refuted (1889 ≠ 1887). Since it is independently falsifiable,
204
+ the overall claim is refuted regardless of sub-claim 2.
205
+
206
+ ==============================================================================
207
+ ✅ VERIFICATION: Refuted
208
+ ==============================================================================
209
  ```
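
The label extraction in the helper uses `re.search`, which returns the first `<verification>` block it finds. If a trace ever contains more than one such block, or is cut off by `max_new_tokens` before reaching one, a slightly more defensive variant may be useful. The sketch below takes the last well-formed verdict and returns `None` otherwise; `VERDICT_RE` and `final_label` are hypothetical names, not part of the source repo.

```python
import re
from typing import Optional

# Same pattern as the label extraction in the helper, compiled once.
VERDICT_RE = re.compile(r"<verification>\s*(Supported|Refuted)\s*</verification>")


def final_label(response: str) -> Optional[str]:
    """Return the last well-formed verdict in the trace, or None if there is none."""
    # With a single capture group, findall returns a list of the captured strings.
    verdicts = VERDICT_RE.findall(response)
    return verdicts[-1] if verdicts else None
```

A `None` result is worth treating as its own outcome (for instance, retrying with a larger generation budget) rather than silently mapping it to one of the two labels.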
210
 
211
  ### Using vLLM
212
 
213
  ```bash
214
+ vllm serve dipta007/decomposeRL-7b --max-model-len 16016
215
  ```
216
 
217
+ The `--max-model-len` matches the training-time `max_seq_length=16016` (with `max_prompt_length=11500` + `max_completion_length=4500`).
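
As a rough guide to how these limits interact at inference time, the sketch below computes how many new tokens can be requested for a given prompt length. The three constants are the values quoted above (the prompt and completion limits sum to 16,000, slightly under the 16,016 sequence length); the function name `completion_budget` is hypothetical.

```python
MAX_SEQ_LENGTH = 16016        # training-time max_seq_length / --max-model-len
MAX_PROMPT_LENGTH = 11500     # training-time max_prompt_length
MAX_COMPLETION_LENGTH = 4500  # training-time max_completion_length


def completion_budget(n_prompt_tokens: int, max_model_len: int = MAX_SEQ_LENGTH) -> int:
    """Tokens available for generation, capped at the training-time completion length."""
    remaining = max_model_len - n_prompt_tokens
    return max(0, min(MAX_COMPLETION_LENGTH, remaining))
```

In the Transformers example, `n_prompt_tokens` would be `inputs.input_ids.shape[1]`, and the result could be passed as `max_new_tokens`. Prompts longer than `MAX_PROMPT_LENGTH` are outside what the model saw during training, so results there may degrade even if the request fits.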
218
+
219
  ## Performance
220
 
221
+ ### In-domain: balanced accuracy on 9 claim-verification benchmarks
222
 
223
  | Dataset | # Examples | Balanced Acc |
224
  |---|---:|---:|
 
234
  | **Micro-average** (sample-weighted) | 14,503 | **0.8445** |
235
  | **Macro-average** | n/a | **0.8417** |
236
 
237
+ ### Out-of-domain: Coverbench
238
+
239
+ | Dataset | # Examples | Balanced Acc | Accuracy | F1 |
240
+ |---|---:|---:|---:|---:|
241
+ | Coverbench (long-evidence, OOD) | 728 | **0.6021** | 0.5989 | 0.6086 |
242
+
243
+ Coverbench evaluates claims paired with substantially longer evidence than the training distribution, so it is a stress test of how well the decomposition behavior transfers when the document grows.
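
For readers comparing against their own runs, the metrics in these tables can be reproduced along the following lines. This is a sketch under assumptions: the card does not spell out the formulas, so balanced accuracy is taken in its standard sense (the mean of per-class recall), the micro-average is read as a mean of per-dataset scores weighted by example count, and the macro-average as an unweighted mean. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def balanced_accuracy(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    labels: Sequence[str] = ("Supported", "Refuted"),
) -> float:
    """Mean of per-class recall over the classes that occur in y_true."""
    recalls = []
    for label in labels:
        idx = [i for i, y in enumerate(y_true) if y == label]
        if not idx:
            continue  # skip classes with no gold examples
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def micro_macro(per_dataset: Sequence[Tuple[int, float]]) -> Tuple[float, float]:
    """Given (n_examples, score) pairs, return (sample-weighted mean, unweighted mean)."""
    total = sum(n for n, _ in per_dataset)
    micro = sum(n * score for n, score in per_dataset) / total
    macro = sum(score for _, score in per_dataset) / len(per_dataset)
    return micro, macro
```

With made-up inputs, two datasets of 100 and 300 examples scoring 0.9 and 0.7 give a micro-average of 0.75 and a macro-average of 0.8, which shows how a large dataset pulls the sample-weighted figure toward its own score.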
244
+
245
  ## Intended Use
246
 
247
  - **In-scope**: verifying factual claims against a *provided* evidence document (open-book fact verification, retrieval-augmented fact-checking pipelines).