---
library_name: transformers
license: apache-2.0
base_model: unsloth/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
language:
  - en
tags:
  - fact-verification
  - claim-verification
  - reasoning
  - grpo
  - lora
  - decomposition
  - qwen2
---

# DecomposeRL-7B

<p align="center">
  <a href="https://arxiv.org/abs/0000.00000">
    <img src="https://img.shields.io/badge/%F0%9F%93%84_Paper-Coming_Soon-b12a00?style=for-the-badge&labelColor=ffb300" alt="Paper Coming Soon">
  </a>
</p>

[![Paper](https://img.shields.io/badge/arXiv-coming--soon-red)](https://arxiv.org/abs/0000.00000)
[![Project Page](https://img.shields.io/badge/Project-Page-green)](https://dipta007.github.io/DecomposeRL/)
[![Dataset](https://img.shields.io/badge/HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/dipta007/decomposeRL)
[![Models](https://img.shields.io/badge/HuggingFace-Models-orange)](https://huggingface.co/collections/dipta007/decomposerl)
[![GitHub](https://img.shields.io/badge/GitHub-Code-blue)](https://github.com/dipta007/DecomposeRL)

**DecomposeRL-7B** is a fact-verification model that learns to *decompose* a claim into atomic sub-questions, iteratively answer them from an evidence document, and produce a final `Supported` / `Refuted` judgment. It is trained from `Qwen2.5-7B-Instruct` with **GRPO + LoRA** under a stack of **seven complementary rewards** that shape the reward landscape around three axes: structural correctness, per-question quality, and set-level sufficiency.

## Highlights

- **84.4% micro-average balanced accuracy** across 9 in-domain claim-verification benchmarks (pooled over all samples)
- **86.3% macro-average balanced accuracy** across the same 9 benchmarks
- Out-of-domain: **60.2% balanced accuracy on Coverbench**, **77.0% on LLM-AggreFact**
- Strong on long-form evidence: 87.6% on Ex-FEVER, 93.1% on FEVEROUS, 76.4% on HoVer
- Reasoning is **fully transparent**: the model emits its sub-claim checklist, every question it asked, every quote from evidence, and a final label

## Model Overview

| Property | Value |
|----------|-------|
| **Model Type** | Causal Language Model |
| **Base Model** | unsloth/Qwen2.5-7B-Instruct |
| **Parameters** | 7B |
| **Training** | GRPO + LoRA (r=64, α=128) |
| **LoRA Targets** | q, k, v, o, gate, up, down projections |
| **Max Sequence Length** | 16,016 tokens (training-time) |
| **Language** | English |

## Method

DecomposeRL trains the policy to follow a **decompose-question-answer-verify** loop:

1. **Initial analysis** (`<think>`): identify atomic sub-claims, classify them (entity / relational / quantitative / causal / temporal / comparative), and flag independently falsifiable sub-claims.
2. **Iterative QA cycle** (`<question>` → `<answer>`): for each sub-claim or ambiguity, ask a single targeted question and answer it **only** from the evidence document, quoting passages directly (or saying *"I don't know"* if the evidence is silent).
3. **Sufficiency check** (`<think>`): track which sub-claims are resolved; loop until every sub-claim is addressed.
4. **Final verdict** (`<verification>`): `Supported` or `Refuted`.
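
As a rough illustration of the loop above, the sketch below checks that a trace opens with `<think>`, alternates `<question>` and `<answer>`, and closes with exactly one `<verification>`. The helper name `follows_loop` is hypothetical and not part of the released code; only the four tag names come from this card.

```python
import re

# Same four tags the model is trained to emit.
TAG_RE = re.compile(r"<(think|question|answer|verification)>(.*?)</\1>", re.DOTALL)


def follows_loop(trace: str) -> bool:
    """Return True if the trace looks like think -> (question -> answer)* -> verification."""
    tags = [tag for tag, _ in TAG_RE.findall(trace)]
    if not tags or tags[0] != "think" or tags[-1] != "verification":
        return False
    if tags.count("verification") != 1:
        return False
    # Ignore <think> blocks, then require strict question/answer alternation.
    cycle = [t for t in tags[:-1] if t != "think"]
    if len(cycle) % 2 != 0:
        return False
    return all(t == ("question" if i % 2 == 0 else "answer") for i, t in enumerate(cycle))


good = (
    "<think>Two sub-claims.</think>"
    "<question>When was it completed?</question><answer>1889.</answer>"
    "<think>Sub-claim 1 resolved.</think>"
    "<verification>Refuted</verification>"
)
print(follows_loop(good))
```

A check like this only validates ordering; it says nothing about whether the questions or answers are any good.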

### Reward Stack: seven complementary signals

GRPO optimizes the sum of seven rewards, grouped into three families:

**Programmatic anchors** (no judge call)

1. **Format**: ensures the trace is parseable; a gating prerequisite without which no other reward can be computed.
2. **Question count**: discourages collapsing the decomposition into one mega-question or padding it with filler.
3. **Diversity**: penalizes redundant questions so the policy covers distinct sub-claims instead of rewording the same one.
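
This card does not give the exact formulas for these programmatic rewards, so the following is only a minimal sketch of how question-count and diversity signals *could* be computed without a judge call. The bounds `lo`/`hi` and the use of token-level Jaccard overlap are assumptions made for illustration, not the training configuration.

```python
def _tokens(question):
    """Lowercased whitespace tokens of a question."""
    return set(question.lower().split())


def question_count_reward(questions, lo=2, hi=8):
    """1.0 if the number of questions lies in [lo, hi], else 0.0 (bounds are hypothetical)."""
    return 1.0 if lo <= len(questions) <= hi else 0.0


def diversity_reward(questions):
    """1 minus the largest pairwise Jaccard overlap between question token sets."""
    worst = 0.0
    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            a, b = _tokens(questions[i]), _tokens(questions[j])
            union = a | b
            if union:
                worst = max(worst, len(a & b) / len(union))
    return 1.0 - worst


questions = ["When was the tower completed?", "How tall is the tower?"]
print(question_count_reward(questions), diversity_reward(questions))
```

Two reworded copies of the same question drive the diversity score towards zero, which is the behaviour the third reward is described as penalizing.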

**Set-level signals**

4. **Coverage**: checks whether the verdict can be recovered from the answers alone; tests if the decomposition is *collectively sufficient*.
5. **Verification**: direct outcome anchor; did the final label match the gold label?
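
To make the two set-level signals concrete, here is a hedged sketch. The verification check follows directly from the description (label equals gold). The coverage check assumes a judge that sees only the claim and the question/answer pairs; `judge` is a hypothetical callable, and `toy_judge` is a stand-in so the example runs without a model.

```python
import re
from typing import Callable, List, Tuple

QA = Tuple[str, str]  # (question, answer)
LABEL_RE = re.compile(r"<verification>\s*(Supported|Refuted)\s*</verification>")


def verification_reward(trace: str, gold: str) -> float:
    """1.0 if the emitted label equals the gold label, else 0.0."""
    match = LABEL_RE.search(trace)
    return 1.0 if match and match.group(1) == gold else 0.0


def coverage_reward(claim: str, qa_pairs: List[QA], gold: str,
                    judge: Callable[[str, List[QA]], str]) -> float:
    """1.0 if a judge that never sees the evidence recovers the gold label from the Q/A pairs."""
    return 1.0 if judge(claim, qa_pairs) == gold else 0.0


def toy_judge(claim: str, qa_pairs: List[QA]) -> str:
    """Hypothetical stand-in for a judge model: refute when any answer cites 1889."""
    return "Refuted" if any("1889" in answer for _, answer in qa_pairs) else "Supported"


claim = "The Eiffel Tower was completed in 1887 and stands 330 metres tall."
pairs = [("When was the tower completed?", "It was built from 1887 to 1889.")]
trace = "<verification>Refuted</verification>"
print(verification_reward(trace, "Refuted"), coverage_reward(claim, pairs, "Refuted", toy_judge))
```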

**Leave-one-out and per-question composites**

6. **Necessity (leave-one-out)**: the only signal that can push the policy to *remove* misleading questions; a question is necessary iff its removal would change the verdict.
7. **Joint multiplicative quality**: composes three per-question sub-signals so a question must clear *all* of them simultaneously rather than scoring partial credit:
   - **(7a) Answerability**: is the question answerable from the evidence?
   - **(7b) Atomicity**: is it a single-focus, verifiable question grounded in the claim?
   - **(7c) Answer correctness**: is the answer faithful to the document (no contradictions, no extrinsic info)?
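
The sketch below illustrates the two ideas in this family under stated assumptions: leave-one-out necessity re-runs a judge with each question removed, and the joint quality score multiplies the three sub-signals so that a zero on any one of them zeroes the whole score. The `judge` callable and the score values are hypothetical; how the actual rewards are scaled or aggregated is not specified in this card.

```python
from typing import Callable, List, Tuple

QA = Tuple[str, str]  # (question, answer)


def necessity_flags(claim: str, qa_pairs: List[QA],
                    judge: Callable[[str, List[QA]], str]) -> List[bool]:
    """A question is necessary iff dropping it changes the judge's verdict."""
    full_verdict = judge(claim, qa_pairs)
    return [
        judge(claim, qa_pairs[:i] + qa_pairs[i + 1:]) != full_verdict
        for i in range(len(qa_pairs))
    ]


def joint_quality(answerability: float, atomicity: float, correctness: float) -> float:
    """Multiplicative composition: every sub-signal must be high for the product to be high."""
    return answerability * atomicity * correctness


def toy_judge(claim: str, qa_pairs: List[QA]) -> str:
    """Hypothetical stand-in for a judge model."""
    return "Refuted" if any("1889" in answer for _, answer in qa_pairs) else "Supported"


claim = "The Eiffel Tower was completed in 1887 and stands 330 metres tall."
pairs = [
    ("When was the tower completed?", "It was built from 1887 to 1889."),
    ("How tall is the tower?", "The tower is 330 metres tall."),
]
print(necessity_flags(claim, pairs, toy_judge))
print(joint_quality(1.0, 1.0, 0.0), (1.0 + 1.0 + 0.0) / 3)
```

In this toy case only the first question flips the verdict when removed, and an unfaithful answer scores zero under the product even though an average of the same three values would still award two thirds.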

## Quickstart

A complete runnable script is included in the repo as [`example.py`](./example.py) (download it [here](https://huggingface.co/dipta007/decomposeRL-7b/resolve/main/example.py)):

```bash
python example.py
```

DecomposeRL expects a specific verification prompt around your `claim` + `evidence_doc`. The `build_prompt` helper below wraps them for you so you don't have to construct the full instruction block every time.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dipta007/decomposeRL-7b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)


PROMPT_TEMPLATE = """You are tasked with systematically verifying the accuracy of a claim. You will be provided with a claim to verify and an evidence document to consult.

Here is the evidence document you should consult:

<evidence_document>
{evidence_doc}
</evidence_document>

Here is the claim you need to verify:

<claim>
{claim}
</claim>

Your task is to verify whether this claim is Supported or Refuted through an iterative process of asking questions and gathering information.

# Verification Process

Begin by analyzing the claim in <think> tags, then enter an iterative cycle of <question>/<answer> pairs answered ONLY from the evidence document. When every sub-claim is addressed, output your final label inside <verification> tags. The label must be exactly one of: Supported, Refuted.

Stop immediately after the closing </verification> tag.

Begin your verification process now."""


def build_prompt(claim: str, evidence_doc: str) -> str:
    """Wrap a claim + evidence document in the DecomposeRL verification prompt."""
    return PROMPT_TEMPLATE.format(claim=claim, evidence_doc=evidence_doc)


def verify(claim: str, evidence_doc: str, max_new_tokens: int = 4500, temperature: float = 0.7) -> str:
    """Run the model end-to-end on a (claim, evidence_doc) pair and return the raw trace."""
    messages = [{"role": "user", "content": build_prompt(claim, evidence_doc)}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # matches training-time max_completion_length
        temperature=temperature,
        do_sample=True,
    )
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)


# Usage
evidence_doc = (
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, "
    "France. It is named after the engineer Gustave Eiffel, whose company designed and "
    "built the tower from 1887 to 1889. Locally nicknamed 'La dame de fer', it was "
    "constructed as the centerpiece of the 1889 World's Fair. The tower is 330 metres "
    "(1,083 ft) tall."
)
claim = "The Eiffel Tower was completed in 1887 and stands 330 metres tall."

response = verify(claim, evidence_doc)
print(response)
```

### Pretty-print the trace

The model produces an iterative `<think>` / `<question>` / `<answer>` / `<verification>` trace. The helper below parses it into a structured form and prints it as a readable conversation:

```python
import re

TAG_RE = re.compile(r"<(think|question|answer|verification)>(.*?)</\1>", re.DOTALL)

def parse_trace(text: str):
    """Return a list of (tag, content) tuples in the order they appear."""
    return [(tag, body.strip()) for tag, body in TAG_RE.findall(text)]

def pretty_print(text: str) -> None:
    parsed = parse_trace(text)
    tags = {tag for tag, _ in parsed}
    if not parsed or "verification" not in tags:
        print("⚠️  Could not parse output into the expected "
              "think/question/answer/verification structure.")
        print("Raw output:")
        print("─" * 78)
        print(text)
        print("─" * 78)
        return

    cycle_idx = 0
    pending_q = None
    for tag, body in parsed:
        if tag == "think":
            print("─" * 78)
            print("🧠  THINK")
            print("─" * 78)
            print(body)
            print()
        elif tag == "question":
            cycle_idx += 1
            pending_q = body
        elif tag == "answer":
            print(f"🔸  Q{cycle_idx}: {pending_q}")
            print(f"💬  A{cycle_idx}: {body}")
            print()
            pending_q = None
        elif tag == "verification":
            print("=" * 78)
            print(f"✅  VERIFICATION: {body}")
            print("=" * 78)

pretty_print(response)

# extract just the final label:
label_match = re.search(r"<verification>\s*(Supported|Refuted)\s*</verification>", response)
label = label_match.group(1) if label_match else None
print("Label:", label)
```

**Example output on a real trace** (claim: *"New reports show Kevin Nicholson made over \$1 million advising companies that shut down Wisconsin plants and laid off 1,900 people."*):

```
──────────────────────────────────────────────────────────────────────────────
🧠  THINK
──────────────────────────────────────────────────────────────────────────────
The claim can be broken down into several sub-claims:
1. Nicholson made over $1 million
2. The companies were in Wisconsin
3. The companies shut down plants and laid off 1,900 people
4. Nicholson's work is causally tied to those layoffs and shutdowns
I will verify each in turn against the evidence document.

🔸  Q1: What does the evidence document say about Nicholson's earnings?
💬  A1: The evidence document states that Nicholson received more than $1 million
       for his consulting work between January 2016 and November 2017. However,
       it does not say how much he was paid for consulting for the companies
       that did layoffs and shutdowns.

──────────────────────────────────────────────────────────────────────────────
🧠  THINK
──────────────────────────────────────────────────────────────────────────────
Sub-claim 1 ($1 million) is partially supported, but the amount is not tied to
the specific companies. Next, verify the companies and the layoffs.

🔸  Q2: What does the evidence document say about the companies and layoffs?
💬  A2: The evidence document states that Nicholson worked as a consultant for
       companies that laid off nearly 1,900 people since 2015, shutting down
       plants in Wisconsin and other states. But it also says Baldwin cites no
       evidence that Nicholson's work caused the layoffs and shutdowns, only
       some element of truth, our definition of Mostly False.

──────────────────────────────────────────────────────────────────────────────
🧠  THINK
──────────────────────────────────────────────────────────────────────────────
The causal link between Nicholson's consulting and the layoffs is unsupported.
The document explicitly rates the claim Mostly False, so the overall claim is
refuted.

==============================================================================
✅  VERIFICATION: Refuted
==============================================================================
```

### Using vLLM

```bash
vllm serve dipta007/decomposeRL-7b --max-model-len 16016
```

The `--max-model-len` value matches the training-time `max_seq_length=16016`, which accommodates `max_prompt_length=11500` plus `max_completion_length=4500` (16,000 tokens in total).
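
Once the server is running, it can be queried over vLLM's OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch, assuming the server listens on the default `http://localhost:8000`; the helper names `build_request` and `send` are hypothetical, and `prompt` should be the full verification prompt produced by `build_prompt` in the Quickstart.

```python
import json
import urllib.request

# Assumption: `vllm serve` is running locally on its default port.
VLLM_URL = "http://localhost:8000/v1/chat/completions"


def build_request(prompt: str, max_tokens: int = 4500, temperature: float = 0.7) -> dict:
    """Build a chat-completions payload using the same sampling settings as the Quickstart."""
    return {
        "model": "dipta007/decomposeRL-7b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def send(payload: dict, url: str = VLLM_URL) -> str:
    """POST the payload and return the generated trace (requires a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_request("<full verification prompt goes here>")
# trace = send(payload)  # uncomment once the vLLM server is up
```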

## Performance

### In-domain: balanced accuracy (%) on 9 claim-verification benchmarks

DecomposeRL-7B is compared against every same-size (Qwen-7B) baseline plus MiniCheck. *Micro* is pooled balanced accuracy across all in-domain samples; *Macro* is the uniform mean across the 9 datasets. **Bold** marks the column winner; *italic* marks the second-best.

| Method | FEVER | ClaimDecomp | HoVer | FEVEROUS | WiCE | Ex-FEVER | PubHealth | PubMedClaim | FoolMeTwice | Micro | Macro |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| **DecomposeRL-7B (ours)** | **74.1** | **98.6** | **76.4** | *93.1* | *86.5* | **87.6** | *87.5* | **85.5** | **87.7** | **84.4** | **86.3** |
| Simple (7B) | *72.7* | 94.9 | 71.0 | **93.5** | 83.2 | 82.7 | 84.2 | *84.1* | *86.6* | *82.0* | *83.7* |
| CoT (7B) | 70.0 | 95.5 | 70.9 | 92.2 | 85.6 | *83.8* | 83.8 | 83.2 | 85.0 | 81.8 | 83.3 |
| DecomP (7B) | 65.5 | 95.3 | 69.0 | 91.9 | 85.0 | 78.0 | 85.7 | 82.5 | 84.1 | 79.3 | 81.9 |
| HiSS (7B) | 67.7 | 92.8 | 70.2 | 92.7 | 83.6 | 82.4 | 79.2 | 77.0 | 84.5 | 80.7 | 81.1 |
| MiniCheck | 69.9 | 77.5 | *73.8* | 89.2 | **87.2** | 82.9 | 76.3 | 83.0 | 84.5 | 81.9 | 80.5 |
| Self-Ask (7B) | 66.5 | 92.7 | 66.9 | 91.9 | 82.5 | 71.7 | 84.2 | 82.6 | 82.8 | 76.7 | 80.2 |
| FOLK (7B) | 65.0 | 90.8 | 68.2 | 91.0 | 83.6 | 80.2 | 80.5 | 77.8 | 83.1 | 79.0 | 80.0 |
| QACheck (7B) | 65.4 | *97.3* | 59.1 | 92.7 | 83.0 | 65.4 | **91.0** | 78.0 | 81.6 | 73.1 | 79.3 |
| Chen-2024 (7B) | 65.4 | 91.1 | 65.3 | 87.9 | 79.6 | 73.3 | 83.3 | 79.2 | 82.3 | 75.7 | 78.6 |
| ProgramFC (7B) | 60.5 | 92.9 | 65.9 | 88.2 | 85.4 | 74.6 | 77.4 | 74.3 | 76.9 | 75.2 | 77.3 |
| ClaimDecomp (7B) | 65.2 | 78.9 | 63.5 | 85.5 | 79.2 | 71.6 | 76.0 | 77.6 | 79.4 | 73.3 | 75.2 |
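
The difference between the *Micro* and *Macro* columns above can be illustrated with a short sketch. The per-dataset scores in `row` are copied from the DecomposeRL-7B row of the table; the two toy datasets are made up purely to show how pooling and averaging diverge, and the function names are hypothetical.

```python
from statistics import mean


def balanced_accuracy(gold, pred, labels=("Supported", "Refuted")):
    """Mean of per-class recall over the classes present in `gold`."""
    recalls = []
    for label in labels:
        idx = [i for i, g in enumerate(gold) if g == label]
        if idx:
            recalls.append(sum(pred[i] == label for i in idx) / len(idx))
    return mean(recalls)


# Macro: uniform mean of the nine per-dataset scores in the DecomposeRL-7B row.
row = [74.1, 98.6, 76.4, 93.1, 86.5, 87.6, 87.5, 85.5, 87.7]
macro_from_table = round(mean(row), 1)

# Toy data (made up): micro pools all samples first, macro averages per-dataset scores.
gold_a, pred_a = ["Supported", "Supported", "Refuted", "Refuted"], ["Supported", "Supported", "Refuted", "Supported"]
gold_b, pred_b = ["Supported", "Refuted"], ["Supported", "Refuted"]
toy_macro = mean([balanced_accuracy(gold_a, pred_a), balanced_accuracy(gold_b, pred_b)])
toy_micro = balanced_accuracy(gold_a + gold_b, pred_a + pred_b)
print(macro_from_table, toy_macro, toy_micro)
```

Because larger datasets dominate the pooled score, the two aggregates need not agree, which is why the table reports both.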

### Out-of-domain

| Dataset | # Examples | Balanced Acc | Accuracy | F1 |
|---|---:|---:|---:|---:|
| Coverbench | 728 | **0.6021** | 0.5989 | 0.6086 |
| LLM-AggreFact | 29,320 | **0.7695** | 0.8510 | 0.9054 |

## Intended Use

- **In-scope**: verifying factual claims against a *provided* evidence document (open-book fact verification, retrieval-augmented fact-checking pipelines).
- **Out-of-scope**: closed-book fact-checking, claim verification against the model's parametric knowledge, real-time news verification without supplied evidence.

The model is trained to say *"I don't know"* when the evidence document is silent; please respect that signal in downstream systems instead of forcing a label.
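
One way a downstream system might surface that signal is to scan the `<answer>` blocks for the abstention phrase before trusting the final label. This is a minimal sketch, not part of the released code: the helper name `abstained_answers` is hypothetical, and what to do with flagged traces (for example, routing them to review) is left to the caller.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# Matches "I don't know" / "I do not know", with a straight or curly apostrophe.
ABSTAIN_RE = re.compile(r"\bi (?:don['’]t|do not) know\b", re.IGNORECASE)


def abstained_answers(trace: str):
    """Return the 1-based indices of answers in which the model says it does not know."""
    answers = ANSWER_RE.findall(trace)
    return [i for i, answer in enumerate(answers, start=1) if ABSTAIN_RE.search(answer)]


trace = (
    "<question>When was the tower completed?</question>"
    "<answer>It was built from 1887 to 1889.</answer>"
    "<question>Who funded the tower?</question>"
    "<answer>I don't know. The evidence does not mention funding.</answer>"
    "<verification>Refuted</verification>"
)
print(abstained_answers(trace))
```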

## Citation

```bibtex

```

## License

Released under the Apache 2.0 License.