---
library_name: transformers
license: apache-2.0
base_model: unsloth/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
language:
- en
tags:
- fact-verification
- claim-verification
- reasoning
- grpo
- lora
- decomposition
- qwen2
---
# DecomposeRL-7B
[Paper](https://arxiv.org/abs/0000.00000) | [Project page](https://dipta007.github.io/DecomposeRL/) | [Dataset](https://huggingface.co/datasets/dipta007/decomposeRL) | [Collection](https://huggingface.co/collections/dipta007/decomposerl) | [Code](https://github.com/dipta007/DecomposeRL)
**DecomposeRL-7B** is a fact-verification model that learns to *decompose* a claim into atomic sub-questions, iteratively answer them from an evidence document, and produce a final `Supported` / `Refuted` judgment. It is trained from `Qwen2.5-7B-Instruct` with **GRPO + LoRA** under a stack of **seven complementary rewards** that shape the reward landscape around three axes: structural correctness, per-question quality, and set-level sufficiency.
## Highlights
- **84.4% micro-average balanced accuracy** across 9 in-domain claim-verification benchmarks (sample-weighted)
- **86.3% macro-average balanced accuracy** across the same 9 benchmarks
- Out-of-domain: **60.2% balanced accuracy on Coverbench**, **77.0% on LLM-AggreFact**
- Strong on long-form evidence: 87.6% on Ex-FEVER, 93.1% on FEVEROUS, 76.4% on HoVer
- Reasoning is **fully transparent**: the model emits its sub-claim checklist, every question it asked, every quote from evidence, and a final label
## Model Overview
| Property | Value |
|----------|-------|
| **Model Type** | Causal Language Model |
| **Base Model** | unsloth/Qwen2.5-7B-Instruct |
| **Parameters** | 7B |
| **Training** | GRPO + LoRA (r=64, α=128) |
| **LoRA Targets** | q, k, v, o, gate, up, down projections |
| **Max Sequence Length** | 16,016 tokens (training-time) |
| **Language** | English |
## Method
DecomposeRL trains the policy to follow a **decompose-question-answer-verify** loop:
1. **Initial analysis** (`<think>`): identify atomic sub-claims, classify them (entity / relational / quantitative / causal / temporal / comparative), and flag independently falsifiable sub-claims.
2. **Iterative QA cycle** (`<question>` → `<answer>`): for each sub-claim or ambiguity, ask a single targeted question and answer it **only** from the evidence document, quoting passages directly (or saying *"I don't know"* if the evidence is silent).
3. **Sufficiency check** (`<think>`): track which sub-claims are resolved; loop until every sub-claim is addressed.
4. **Final verdict** (`<verification>`): `Supported` or `Refuted`.
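The loop above implies a simple grammar for a well-formed trace. The following is a minimal sketch of a structural check, assuming the tag names `think`, `question`, `answer`, and `verification`. The function name `is_well_formed` and the exact rules (open with an analysis block, pair every question with an answer, close with a single verdict) are illustrative assumptions, not the format reward used in training.

```python
import re

TAG_RE = re.compile(r"<(think|question|answer|verification)>(.*?)</\1>", re.DOTALL)
VALID_LABELS = {"Supported", "Refuted"}


def is_well_formed(trace: str) -> bool:
    """Hypothetical structural check for a decompose-question-answer-verify trace."""
    blocks = [(tag, body.strip()) for tag, body in TAG_RE.findall(trace)]
    if not blocks:
        return False
    tags = [tag for tag, _ in blocks]
    # The trace opens with an analysis block and closes with exactly one verdict.
    if tags[0] != "think" or tags[-1] != "verification" or tags.count("verification") != 1:
        return False
    if blocks[-1][1] not in VALID_LABELS:
        return False
    # Every question must be immediately followed by its answer, and vice versa.
    n_pairs = 0
    for i, tag in enumerate(tags):
        if tag == "question":
            if i + 1 >= len(tags) or tags[i + 1] != "answer":
                return False
            n_pairs += 1
        elif tag == "answer" and (i == 0 or tags[i - 1] != "question"):
            return False
    return n_pairs >= 1


sample = (
    "<think>Two sub-claims.</think>"
    "<question>When was it completed?</question><answer>1889.</answer>"
    "<think>Date mismatch.</think>"
    "<verification>Refuted</verification>"
)
print(is_well_formed(sample))  # True
```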
### Reward Stack: seven complementary signals
GRPO is supervised with a sum of seven rewards, grouped into three families:
**Programmatic anchors** (no judge call)
1. **Format**: ensures the trace is parseable; a gating prerequisite without which no other reward can be computed.
2. **Question count**: discourages collapsing the decomposition into one mega-question or padding it with filler.
3. **Diversity**: penalizes redundant questions so the policy covers distinct sub-claims instead of rewording the same one.
**Set-level signals**
4. **Coverage**: checks whether the verdict can be recovered from the answers alone; tests if the decomposition is *collectively sufficient*.
5. **Verification**: direct outcome anchor; did the final label match the gold label?
**Leave-one-out and per-question composites**
6. **Necessity (leave-one-out)**: the only signal that can push the policy to *remove* misleading questions; a question is necessary iff its removal would change the verdict.
7. **Joint multiplicative quality**: composes three per-question sub-signals so a question must clear *all* of them simultaneously rather than scoring partial credit:
   - **(7a) Answerability**: is the question answerable from the evidence?
   - **(7b) Atomicity**: is it a single-focus, verifiable question grounded in the claim?
   - **(7c) Answer correctness**: is the answer faithful to the document (no contradictions, no extrinsic info)?
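To make rewards 6 and 7 concrete, here is a minimal sketch of the two ideas: leave-one-out necessity and multiplicative composition. The names `necessity_scores`, `joint_quality`, and `toy_judge` are hypothetical, the scores are simplified to values in [0, 1], and the keyword-based judge is only a stand-in for whatever judge the training pipeline actually calls.

```python
from typing import Callable, Sequence

# A QA pair is (question, answer); a judge maps the kept pairs to a label.
QAPair = tuple[str, str]
Judge = Callable[[Sequence[QAPair]], str]


def necessity_scores(pairs: Sequence[QAPair], judge: Judge) -> list[float]:
    """Leave-one-out: 1.0 if dropping a pair changes the verdict, else 0.0."""
    full_verdict = judge(pairs)
    scores = []
    for i in range(len(pairs)):
        reduced = [p for j, p in enumerate(pairs) if j != i]
        scores.append(1.0 if judge(reduced) != full_verdict else 0.0)
    return scores


def joint_quality(answerability: float, atomicity: float, correctness: float) -> float:
    """Multiplicative composition: any sub-signal at 0 zeroes the whole score."""
    return answerability * atomicity * correctness


def toy_judge(pairs: Sequence[QAPair]) -> str:
    # Hypothetical stand-in for a real judge: refute if any answer reports a contradiction.
    return "Refuted" if any("contradicts" in a for _, a in pairs) else "Supported"


pairs = [
    ("When was the tower completed?", "1889, which contradicts the claimed 1887."),
    ("How tall is the tower?", "330 metres, as claimed."),
]
print(necessity_scores(pairs, toy_judge))  # [1.0, 0.0]
print(joint_quality(1.0, 1.0, 0.0))        # 0.0
```

An average of the same three sub-signals would give the last question about 0.67, which is the partial credit the multiplicative form is meant to rule out.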
## Quickstart
A complete runnable script is included in the repo as [`example.py`](./example.py) (download it [here](https://huggingface.co/dipta007/decomposeRL-7b/resolve/main/example.py)):
```bash
python example.py
```
DecomposeRL expects a specific verification prompt around your `claim` + `evidence_doc`. The `build_prompt` helper below wraps them for you so you don't have to construct the full instruction block every time.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dipta007/decomposeRL-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

PROMPT_TEMPLATE = """You are tasked with systematically verifying the accuracy of a claim. You will be provided with a claim to verify and an evidence document to consult.
Here is the evidence document you should consult:
{evidence_doc}
Here is the claim you need to verify:
{claim}
Your task is to verify whether this claim is Supported or Refuted through an iterative process of asking questions and gathering information.
# Verification Process
Begin by analyzing the claim in <think> tags, then enter an iterative cycle of <question> / <answer> pairs answered ONLY from the evidence document. When every sub-claim is addressed, output your final label inside <verification> tags. The label must be exactly one of: Supported, Refuted.
Stop immediately after the closing </verification> tag.
Begin your verification process now."""


def build_prompt(claim: str, evidence_doc: str) -> str:
    """Wrap a claim + evidence document in the DecomposeRL verification prompt."""
    return PROMPT_TEMPLATE.format(claim=claim, evidence_doc=evidence_doc)


def verify(claim: str, evidence_doc: str, max_new_tokens: int = 4500, temperature: float = 0.7) -> str:
    """Run the model end-to-end on a (claim, evidence_doc) pair and return the raw trace."""
    messages = [{"role": "user", "content": build_prompt(claim, evidence_doc)}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # matches training-time max_completion_length
        temperature=temperature,
        do_sample=True,
    )
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)


# Usage
evidence_doc = (
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, "
    "France. It is named after the engineer Gustave Eiffel, whose company designed and "
    "built the tower from 1887 to 1889. Locally nicknamed 'La dame de fer', it was "
    "constructed as the centerpiece of the 1889 World's Fair. The tower is 330 metres "
    "(1,083 ft) tall."
)
claim = "The Eiffel Tower was completed in 1887 and stands 330 metres tall."
response = verify(claim, evidence_doc)
print(response)
```
### Pretty-print the trace
The model produces an iterative `<think>` / `<question>` / `<answer>` / `<verification>` trace. The helper below parses it into a structured form and prints it as a readable conversation:
```python
import re

# Match each tagged block; \1 forces the closing tag to match the opening one.
TAG_RE = re.compile(r"<(think|question|answer|verification)>(.*?)</\1>", re.DOTALL)


def parse_trace(text: str):
    """Return a list of (tag, content) tuples in the order they appear."""
    return [(tag, body.strip()) for tag, body in TAG_RE.findall(text)]


def pretty_print(text: str) -> None:
    parsed = parse_trace(text)
    tags = {tag for tag, _ in parsed}
    if not parsed or "verification" not in tags:
        print("⚠️ Could not parse output into the expected "
              "think/question/answer/verification structure.")
        print("Raw output:")
        print("─" * 78)
        print(text)
        print("─" * 78)
        return
    cycle_idx = 0
    pending_q = None
    for tag, body in parsed:
        if tag == "think":
            print("─" * 78)
            print("🧠 THINK")
            print("─" * 78)
            print(body)
            print()
        elif tag == "question":
            cycle_idx += 1
            pending_q = body
        elif tag == "answer":
            print(f"🔸 Q{cycle_idx}: {pending_q}")
            print(f"💬 A{cycle_idx}: {body}")
            print()
            pending_q = None
        elif tag == "verification":
            print("=" * 78)
            print(f"✅ VERIFICATION: {body}")
            print("=" * 78)


pretty_print(response)

# extract just the final label:
label_match = re.search(r"<verification>\s*(Supported|Refuted)\s*</verification>", response)
label = label_match.group(1) if label_match else None
print("Label:", label)
```
**Example output on a real trace** (claim: *"New reports show Kevin Nicholson made over \$1 million advising companies that shut down Wisconsin plants and laid off 1,900 people."*):
```
──────────────────────────────────────────────────────────────────────────────
🧠 THINK
──────────────────────────────────────────────────────────────────────────────
The claim can be broken down into several sub-claims:
1. Nicholson made over $1 million
2. The companies were in Wisconsin
3. The companies shut down plants and laid off 1,900 people
4. Nicholson's work is causally tied to those layoffs and shutdowns
I will verify each in turn against the evidence document.
🔸 Q1: What does the evidence document say about Nicholson's earnings?
💬 A1: The evidence document states that Nicholson received more than $1 million
for his consulting work between January 2016 and November 2017. However,
it does not say how much he was paid for consulting for the companies
that did layoffs and shutdowns.
──────────────────────────────────────────────────────────────────────────────
🧠 THINK
──────────────────────────────────────────────────────────────────────────────
Sub-claim 1 ($1 million) is partially supported, but the amount is not tied to
the specific companies. Next, verify the companies and the layoffs.
🔸 Q2: What does the evidence document say about the companies and layoffs?
💬 A2: The evidence document states that Nicholson worked as a consultant for
companies that laid off nearly 1,900 people since 2015, shutting down
plants in Wisconsin and other states. But it also says Baldwin cites no
evidence that Nicholson's work caused the layoffs and shutdowns, only
some element of truth, our definition of Mostly False.
──────────────────────────────────────────────────────────────────────────────
🧠 THINK
──────────────────────────────────────────────────────────────────────────────
The causal link between Nicholson's consulting and the layoffs is unsupported.
The document explicitly rates the claim Mostly False, so the overall claim is
refuted.
==============================================================================
✅ VERIFICATION: Refuted
==============================================================================
```
### Using vLLM
```bash
vllm serve dipta007/decomposeRL-7b --max-model-len 16016
```
The `--max-model-len` matches the training-time `max_seq_length=16016` (with `max_prompt_length=11500` + `max_completion_length=4500`).
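Once the server is up, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the standard library and assumes the server is listening on the default `http://localhost:8000`; the helper names `build_chat_request` and `query_vllm` are hypothetical, and the sampling defaults simply mirror the `verify` helper from the Quickstart.

```python
import json
import urllib.request

# Assumption: `vllm serve` is listening on its default host and port.
VLLM_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(prompt: str, max_tokens: int = 4500, temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat payload for the served model."""
    return {
        "model": "dipta007/decomposeRL-7b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def query_vllm(prompt: str, url: str = VLLM_URL) -> str:
    """POST the payload and return the generated trace (requires a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Here `prompt` would be the string returned by `build_prompt(claim, evidence_doc)`, so the served model sees the same instruction block as in the Transformers example.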
## Performance
### In-domain: balanced accuracy (%) on 9 claim-verification benchmarks
Compared against every same-size (Qwen-7B) baseline plus MiniCheck. *Micro* is pooled balanced accuracy across all in-domain samples; *Macro* is the uniform mean across the 9 datasets. **Bold** marks the column winner; *italic* marks the second-best.
| Method | FEVER | ClaimDecomp | HoVer | FEVEROUS | WiCE | Ex-FEVER | PubHealth | PubMedClaim | FoolMeTwice | Micro | Macro |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| **DecomposeRL-7B (ours)** | **74.1** | **98.6** | **76.4** | *93.1* | *86.5* | **87.6** | *87.5* | **85.5** | **87.7** | **84.4** | **86.3** |
| Simple (7B) | *72.7* | 94.9 | 71.0 | **93.5** | 83.2 | 82.7 | 84.2 | *84.1* | *86.6* | *82.0* | *83.7* |
| CoT (7B) | 70.0 | 95.5 | 70.9 | 92.2 | 85.6 | *83.8* | 83.8 | 83.2 | 85.0 | 81.8 | 83.3 |
| DecomP (7B) | 65.5 | 95.3 | 69.0 | 91.9 | 85.0 | 78.0 | 85.7 | 82.5 | 84.1 | 79.3 | 81.9 |
| HiSS (7B) | 67.7 | 92.8 | 70.2 | 92.7 | 83.6 | 82.4 | 79.2 | 77.0 | 84.5 | 80.7 | 81.1 |
| MiniCheck | 69.9 | 77.5 | *73.8* | 89.2 | **87.2** | 82.9 | 76.3 | 83.0 | 84.5 | 81.9 | 80.5 |
| Self-Ask (7B) | 66.5 | 92.7 | 66.9 | 91.9 | 82.5 | 71.7 | 84.2 | 82.6 | 82.8 | 76.7 | 80.2 |
| FOLK (7B) | 65.0 | 90.8 | 68.2 | 91.0 | 83.6 | 80.2 | 80.5 | 77.8 | 83.1 | 79.0 | 80.0 |
| QACheck (7B) | 65.4 | *97.3* | 59.1 | 92.7 | 83.0 | 65.4 | **91.0** | 78.0 | 81.6 | 73.1 | 79.3 |
| Chen-2024 (7B) | 65.4 | 91.1 | 65.3 | 87.9 | 79.6 | 73.3 | 83.3 | 79.2 | 82.3 | 75.7 | 78.6 |
| ProgramFC (7B) | 60.5 | 92.9 | 65.9 | 88.2 | 85.4 | 74.6 | 77.4 | 74.3 | 76.9 | 75.2 | 77.3 |
| ClaimDecomp (7B) | 65.2 | 78.9 | 63.5 | 85.5 | 79.2 | 71.6 | 76.0 | 77.6 | 79.4 | 73.3 | 75.2 |
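The gap between the *Micro* and *Macro* columns comes from how the datasets are aggregated. As a rough illustration, the sketch below computes balanced accuracy as the mean of per-class recall and then aggregates both ways; the two toy datasets and their labels are invented for the example, and only the last line uses numbers from the table (the DecomposeRL-7B row).

```python
def balanced_accuracy(gold: list[str], pred: list[str]) -> float:
    """Mean of per-class recall over the classes present in `gold`."""
    recalls = []
    for label in sorted(set(gold)):
        idx = [i for i, g in enumerate(gold) if g == label]
        recalls.append(sum(pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy (gold, pred) pairs for two hypothetical datasets of different sizes.
datasets = {
    "small": (["Supported", "Refuted"], ["Supported", "Supported"]),
    "large": (["Supported"] * 4 + ["Refuted"] * 4, ["Supported"] * 4 + ["Refuted"] * 4),
}

# Macro: score each dataset separately, then take the unweighted mean.
macro = sum(balanced_accuracy(g, p) for g, p in datasets.values()) / len(datasets)

# Micro: pool every sample first, then score once.
pooled_gold = [y for g, _ in datasets.values() for y in g]
pooled_pred = [y for _, p in datasets.values() for y in p]
micro = balanced_accuracy(pooled_gold, pooled_pred)

print(f"macro={macro:.3f} micro={micro:.3f}")  # macro=0.750 micro=0.900

# The Macro column is the plain mean of the nine per-dataset scores in a row.
row = [74.1, 98.6, 76.4, 93.1, 86.5, 87.6, 87.5, 85.5, 87.7]
print(round(sum(row) / len(row), 1))  # 86.3
```

Small datasets weigh as much as large ones under the macro mean, while the micro score is dominated by the largest benchmarks.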
### Out-of-domain
| Dataset | # Examples | Balanced Acc | Accuracy | F1 |
|---|---:|---:|---:|---:|
| Coverbench | 728 | **0.6021** | 0.5989 | 0.6086 |
| LLM-AggreFact | 29,320 | **0.7695** | 0.8510 | 0.9054 |
## Intended Use
- **In-scope**: verifying factual claims against a *provided* evidence document (open-book fact verification, retrieval-augmented fact-checking pipelines).
- **Out-of-scope**: closed-book fact-checking, claim verification against the model's parametric knowledge, real-time news verification without supplied evidence.
The model is trained to say *"I don't know"* when the evidence document is silent; please respect that signal in downstream systems instead of forcing a label.
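One way a downstream system might respect that signal is to surface abstentions next to the label instead of discarding them. The sketch below is only an illustration: the function name `summarize_trace`, the substring match on "I don't know", and the `needs_review` policy are all assumptions rather than part of the released code.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
LABEL_RE = re.compile(r"<verification>\s*(Supported|Refuted)\s*</verification>")


def summarize_trace(trace: str) -> dict:
    """Hypothetical post-processing: keep the label, but surface abstentions."""
    answers = [a.strip() for a in ANSWER_RE.findall(trace)]
    unknown = sum("i don't know" in a.lower() for a in answers)
    match = LABEL_RE.search(trace)
    return {
        "label": match.group(1) if match else None,
        "num_answers": len(answers),
        "num_unknown": unknown,
        # Assumed policy: route to a human when no verdict parses or any answer abstains.
        "needs_review": match is None or unknown > 0,
    }
```

A pipeline could then treat `needs_review=True` as "insufficient evidence" and retrieve more documents rather than acting on the label.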
## Citation
```bibtex
```
## License
Released under the Apache 2.0 License.