---
library_name: transformers
license: apache-2.0
base_model: unsloth/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
language:
- en
tags:
- fact-verification
- claim-verification
- reasoning
- grpo
- lora
- decomposition
- qwen2
---

# DecomposeRL-7B

<p align="center">
  <a href="https://arxiv.org/abs/0000.00000">
    <img src="https://img.shields.io/badge/%F0%9F%93%84_Paper-Coming_Soon-b12a00?style=for-the-badge&labelColor=ffb300" alt="Paper Coming Soon">
  </a>
</p>

[Paper](https://arxiv.org/abs/0000.00000) · [Project page](https://dipta007.github.io/DecomposeRL/) · [Dataset](https://huggingface.co/datasets/dipta007/decomposeRL) · [Collection](https://huggingface.co/collections/dipta007/decomposerl) · [Code](https://github.com/dipta007/DecomposeRL)

**DecomposeRL-7B** is a fact-verification model that learns to *decompose* a claim into atomic sub-questions, iteratively answer them from an evidence document, and produce a final `Supported` / `Refuted` judgment. It is trained from `Qwen2.5-7B-Instruct` with **GRPO + LoRA** under a stack of **seven complementary rewards** that shape the reward landscape around three axes: structural correctness, per-question quality, and set-level sufficiency.

## Highlights

- **84.4% micro-average balanced accuracy** across 9 in-domain claim-verification benchmarks (sample-weighted)
- **86.3% macro-average balanced accuracy** across the same 9 benchmarks
- Out-of-domain: **60.2% balanced accuracy on Coverbench**, **77.0% on LLM-AggreFact**
- Strong on long-form evidence: 87.6% on Ex-FEVER, 93.1% on FEVEROUS, 76.4% on HoVer
- Reasoning is **fully transparent**: the model emits its sub-claim checklist, every question it asked, every quote from evidence, and a final label

## Model Overview

| Property | Value |
|----------|-------|
| **Model Type** | Causal Language Model |
| **Base Model** | unsloth/Qwen2.5-7B-Instruct |
| **Parameters** | 7B |
| **Training** | GRPO + LoRA (r=64, α=128) |
| **LoRA Targets** | q, k, v, o, gate, up, down projections |
| **Max Sequence Length** | 16,016 tokens (training-time) |
| **Language** | English |

## Method

DecomposeRL trains the policy to follow a **decompose-question-answer-verify** loop:

1. **Initial analysis** (`<think>`): identify atomic sub-claims, classify them (entity / relational / quantitative / causal / temporal / comparative), and flag independently falsifiable sub-claims.
2. **Iterative QA cycle** (`<question>` → `<answer>`): for each sub-claim or ambiguity, ask a single targeted question and answer it **only** from the evidence document, quoting passages directly (or saying *"I don't know"* if the evidence is silent).
3. **Sufficiency check** (`<think>`): track which sub-claims are resolved; loop until every sub-claim is addressed.
4. **Final verdict** (`<verification>`): `Supported` or `Refuted`.

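
The four steps above imply an ordering constraint on the emitted tags. The following is a minimal sketch of a structural check under that reading; `check_trace_structure` is a hypothetical helper, not part of the released code, and the rules actually enforced during training may differ.

```python
import re

# Same four tag names as the loop described above.
TAG_RE = re.compile(r"<(think|question|answer|verification)>(.*?)</\1>", re.DOTALL)


def check_trace_structure(trace: str) -> bool:
    """Return True if the tags follow the decompose-question-answer-verify order."""
    tags = [tag for tag, _ in TAG_RE.findall(trace)]
    # The trace must open with <think> and close with a single <verification>.
    if len(tags) < 2 or tags[0] != "think" or tags[-1] != "verification":
        return False
    if tags.count("verification") != 1:
        return False
    # Assumption: a valid trace asks at least one question.
    if "question" not in tags:
        return False
    # Every <question> is immediately followed by its <answer>,
    # and no <answer> appears without a preceding <question>.
    for i, tag in enumerate(tags):
        if tag == "question" and (i + 1 >= len(tags) or tags[i + 1] != "answer"):
            return False
        if tag == "answer" and (i == 0 or tags[i - 1] != "question"):
            return False
    return True
```

A check of this kind can serve as a cheap filter before any further parsing of a trace.
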
### Reward Stack: seven complementary signals

GRPO is supervised with a sum of seven rewards, grouped into three families:

**Programmatic anchors** (no judge call)

1. **Format**: ensures the trace is parseable; a gating prerequisite without which no other reward can be computed.
2. **Question count**: discourages collapsing the decomposition into one mega-question or padding it with filler.
3. **Diversity**: penalizes redundant questions so the policy covers distinct sub-claims instead of rewording the same one.

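
To illustrate how rewards 2 and 3 could be computed without a judge call, here is a minimal sketch. The count bounds (`min_q`, `max_q`), the token-level Jaccard similarity, and all function names are assumptions made for illustration; the model card does not specify the actual formulas.

```python
from itertools import combinations


def question_count_reward(questions: list[str], min_q: int = 2, max_q: int = 8) -> float:
    """1.0 when the number of questions lies in [min_q, max_q], else 0.0 (hypothetical bounds)."""
    return 1.0 if min_q <= len(questions) <= max_q else 0.0


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two questions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def diversity_reward(questions: list[str]) -> float:
    """1 minus the mean pairwise similarity; fewer than two questions scores 0.0 in this sketch."""
    pairs = list(combinations(questions, 2))
    if not pairs:
        return 0.0
    mean_sim = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_sim
```

In this sketch a single mega-question fails the count bound, while reworded duplicates pass the count bound but score low on diversity, so the two signals cover different failure modes.
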
**Set-level signals**

4. **Coverage**: checks whether the verdict can be recovered from the answers alone; tests if the decomposition is *collectively sufficient*.
5. **Verification**: direct outcome anchor; did the final label match the gold label?

**Leave-one-out and per-question composites**

6. **Necessity (leave-one-out)**: the only signal that can push the policy to *remove* misleading questions; a question is necessary iff its removal would change the verdict.
7. **Joint multiplicative quality**: composes three per-question sub-signals so a question must clear *all* of them simultaneously rather than scoring partial credit:
   - **(7a) Answerability**: is the question answerable from the evidence?
   - **(7b) Atomicity**: is it a single-focus, verifiable question grounded in the claim?
   - **(7c) Answer correctness**: is the answer faithful to the document (no contradictions, no extrinsic info)?

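
The leave-one-out test and the multiplicative composition can be sketched as follows. This is an illustration only: `predict_verdict` stands in for whatever judge or model maps a set of question-answer pairs to a label, and the assumption that the three sub-signals are scores in [0, 1] is ours, not stated in the model card.

```python
from typing import Callable, Sequence

QAPair = tuple[str, str]  # (question, answer)


def necessity_flags(
    qa_pairs: Sequence[QAPair],
    predict_verdict: Callable[[Sequence[QAPair]], str],
) -> list[bool]:
    """Leave-one-out: pair i is necessary iff dropping it changes the verdict."""
    full_verdict = predict_verdict(qa_pairs)
    flags = []
    for i in range(len(qa_pairs)):
        reduced = [pair for j, pair in enumerate(qa_pairs) if j != i]
        flags.append(predict_verdict(reduced) != full_verdict)
    return flags


def joint_quality(answerability: float, atomicity: float, correctness: float) -> float:
    """Multiplicative composition: a zero on any sub-signal zeroes the whole score."""
    return answerability * atomicity * correctness
```

With an additive composition, a question scoring (1.0, 1.0, 0.0) would still earn two thirds of the credit; the product gives it none, which is the "must clear all of them" behavior described in item 7.
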
## Quickstart

A complete runnable script is included in the repo as [`example.py`](./example.py) (download it [here](https://huggingface.co/dipta007/decomposeRL-7b/resolve/main/example.py)):

```bash
python example.py
```

DecomposeRL expects a specific verification prompt around your `claim` + `evidence_doc`. The `build_prompt` helper below wraps them for you so you don't have to construct the full instruction block every time.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dipta007/decomposeRL-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

PROMPT_TEMPLATE = """You are tasked with systematically verifying the accuracy of a claim. You will be provided with a claim to verify and an evidence document to consult.
Here is the evidence document you should consult:
<evidence_document>
{evidence_doc}
</evidence_document>
Here is the claim you need to verify:
<claim>
{claim}
</claim>
Your task is to verify whether this claim is Supported or Refuted through an iterative process of asking questions and gathering information.
# Verification Process
Begin by analyzing the claim in <think> tags, then enter an iterative cycle of <question>/<answer> pairs answered ONLY from the evidence document. When every sub-claim is addressed, output your final label inside <verification> tags. The label must be exactly one of: Supported, Refuted.
Stop immediately after the closing </verification> tag.
Begin your verification process now."""


def build_prompt(claim: str, evidence_doc: str) -> str:
    """Wrap a claim + evidence document in the DecomposeRL verification prompt."""
    return PROMPT_TEMPLATE.format(claim=claim, evidence_doc=evidence_doc)


def verify(claim: str, evidence_doc: str, max_new_tokens: int = 4500, temperature: float = 0.7) -> str:
    """Run the model end-to-end on a (claim, evidence_doc) pair and return the raw trace."""
    messages = [{"role": "user", "content": build_prompt(claim, evidence_doc)}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # matches training-time max_completion_length
        temperature=temperature,
        do_sample=True,
    )
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)


# Usage
evidence_doc = (
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, "
    "France. It is named after the engineer Gustave Eiffel, whose company designed and "
    "built the tower from 1887 to 1889. Locally nicknamed 'La dame de fer', it was "
    "constructed as the centerpiece of the 1889 World's Fair. The tower is 330 metres "
    "(1,083 ft) tall."
)
claim = "The Eiffel Tower was completed in 1887 and stands 330 metres tall."
response = verify(claim, evidence_doc)
print(response)
```

### Pretty-print the trace

The model produces an iterative `<think>` / `<question>` / `<answer>` / `<verification>` trace. The helper below parses it into a structured form and prints it as a readable conversation:

```python
import re

TAG_RE = re.compile(r"<(think|question|answer|verification)>(.*?)</\1>", re.DOTALL)


def parse_trace(text: str):
    """Return a list of (tag, content) tuples in the order they appear."""
    return [(tag, body.strip()) for tag, body in TAG_RE.findall(text)]


def pretty_print(text: str) -> None:
    parsed = parse_trace(text)
    tags = {tag for tag, _ in parsed}
    if not parsed or "verification" not in tags:
        print("⚠️ Could not parse output into the expected "
              "think/question/answer/verification structure.")
        print("Raw output:")
        print("─" * 78)
        print(text)
        print("─" * 78)
        return
    cycle_idx = 0
    pending_q = None
    for tag, body in parsed:
        if tag == "think":
            print("─" * 78)
            print("🧠 THINK")
            print("─" * 78)
            print(body)
            print()
        elif tag == "question":
            cycle_idx += 1
            pending_q = body
        elif tag == "answer":
            print(f"🔸 Q{cycle_idx}: {pending_q}")
            print(f"💬 A{cycle_idx}: {body}")
            print()
            pending_q = None
        elif tag == "verification":
            print("=" * 78)
            print(f"✅ VERIFICATION: {body}")
            print("=" * 78)


pretty_print(response)

# extract just the final label:
label_match = re.search(r"<verification>\s*(Supported|Refuted)\s*</verification>", response)
label = label_match.group(1) if label_match else None
print("Label:", label)
```

**Example output on a real trace** (claim: *"New reports show Kevin Nicholson made over \$1 million advising companies that shut down Wisconsin plants and laid off 1,900 people."*):

```
──────────────────────────────────────────────────────────────────────────────
🧠 THINK
──────────────────────────────────────────────────────────────────────────────
The claim can be broken down into several sub-claims:
1. Nicholson made over $1 million
2. The companies were in Wisconsin
3. The companies shut down plants and laid off 1,900 people
4. Nicholson's work is causally tied to those layoffs and shutdowns
I will verify each in turn against the evidence document.

🔸 Q1: What does the evidence document say about Nicholson's earnings?
💬 A1: The evidence document states that Nicholson received more than $1 million
       for his consulting work between January 2016 and November 2017. However,
       it does not say how much he was paid for consulting for the companies
       that did layoffs and shutdowns.

──────────────────────────────────────────────────────────────────────────────
🧠 THINK
──────────────────────────────────────────────────────────────────────────────
Sub-claim 1 ($1 million) is partially supported, but the amount is not tied to
the specific companies. Next, verify the companies and the layoffs.

🔸 Q2: What does the evidence document say about the companies and layoffs?
💬 A2: The evidence document states that Nicholson worked as a consultant for
       companies that laid off nearly 1,900 people since 2015, shutting down
       plants in Wisconsin and other states. But it also says Baldwin cites no
       evidence that Nicholson's work caused the layoffs and shutdowns, only
       some element of truth, our definition of Mostly False.

──────────────────────────────────────────────────────────────────────────────
🧠 THINK
──────────────────────────────────────────────────────────────────────────────
The causal link between Nicholson's consulting and the layoffs is unsupported.
The document explicitly rates the claim Mostly False, so the overall claim is
refuted.

==============================================================================
✅ VERIFICATION: Refuted
==============================================================================
```

### Using vLLM

```bash
vllm serve dipta007/decomposeRL-7b --max-model-len 16016
```

The `--max-model-len` value matches the training-time `max_seq_length=16016`, which accommodates `max_prompt_length=11500` plus `max_completion_length=4500` (16,000 tokens in total).

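
Once the server is running, it can be queried over vLLM's OpenAI-compatible chat-completions endpoint. The sketch below uses only the Python standard library; the URL assumes vLLM's default host and port, and `build_payload` and `query_server` are hypothetical helper names. The sampling values mirror the Quickstart defaults.

```python
import json
import urllib.request

# Assumption: the server runs locally on vLLM's default port.
VLLM_URL = "http://localhost:8000/v1/chat/completions"


def build_payload(prompt: str, max_tokens: int = 4500, temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": "dipta007/decomposeRL-7b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def query_server(prompt: str, url: str = VLLM_URL) -> str:
    """POST the prompt to the server and return the generated trace."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Pass the output of `build_prompt(claim, evidence_doc)` from the Quickstart as `prompt`, so the served model sees the same instruction block as in the local example.
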
## Performance

### In-domain: balanced accuracy (%) on 9 claim-verification benchmarks

Compared against every same-size (Qwen-7B) baseline plus MiniCheck. *Micro* is pooled balanced accuracy across all in-domain samples; *Macro* is the uniform mean across the 9 datasets. **Bold** marks the column winner; *italic* marks the second-best.

| Method | FEVER | ClaimDecomp | HoVer | FEVEROUS | WiCE | Ex-FEVER | PubHealth | PubMedClaim | FoolMeTwice | Micro | Macro |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| **DecomposeRL-7B (ours)** | **74.1** | **98.6** | **76.4** | *93.1* | *86.5* | **87.6** | *87.5* | **85.5** | **87.7** | **84.4** | **86.3** |
| Simple (7B) | *72.7* | 94.9 | 71.0 | **93.5** | 83.2 | 82.7 | 84.2 | *84.1* | *86.6* | *82.0* | *83.7* |
| CoT (7B) | 70.0 | 95.5 | 70.9 | 92.2 | 85.6 | *83.8* | 83.8 | 83.2 | 85.0 | 81.8 | 83.3 |
| DecomP (7B) | 65.5 | 95.3 | 69.0 | 91.9 | 85.0 | 78.0 | 85.7 | 82.5 | 84.1 | 79.3 | 81.9 |
| HiSS (7B) | 67.7 | 92.8 | 70.2 | 92.7 | 83.6 | 82.4 | 79.2 | 77.0 | 84.5 | 80.7 | 81.1 |
| MiniCheck | 69.9 | 77.5 | *73.8* | 89.2 | **87.2** | 82.9 | 76.3 | 83.0 | 84.5 | 81.9 | 80.5 |
| Self-Ask (7B) | 66.5 | 92.7 | 66.9 | 91.9 | 82.5 | 71.7 | 84.2 | 82.6 | 82.8 | 76.7 | 80.2 |
| FOLK (7B) | 65.0 | 90.8 | 68.2 | 91.0 | 83.6 | 80.2 | 80.5 | 77.8 | 83.1 | 79.0 | 80.0 |
| QACheck (7B) | 65.4 | *97.3* | 59.1 | 92.7 | 83.0 | 65.4 | **91.0** | 78.0 | 81.6 | 73.1 | 79.3 |
| Chen-2024 (7B) | 65.4 | 91.1 | 65.3 | 87.9 | 79.6 | 73.3 | 83.3 | 79.2 | 82.3 | 75.7 | 78.6 |
| ProgramFC (7B) | 60.5 | 92.9 | 65.9 | 88.2 | 85.4 | 74.6 | 77.4 | 74.3 | 76.9 | 75.2 | 77.3 |
| ClaimDecomp (7B) | 65.2 | 78.9 | 63.5 | 85.5 | 79.2 | 71.6 | 76.0 | 77.6 | 79.4 | 73.3 | 75.2 |

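
The difference between the *Micro* and *Macro* columns can be made concrete with a short sketch. The functions below are an illustrative implementation of the definitions given above (pooled samples versus a uniform mean over datasets), not the evaluation code used for the table; the final lines recompute the Macro entry of the DecomposeRL-7B row from its nine per-dataset scores.

```python
def balanced_accuracy(gold: list[str], pred: list[str]) -> float:
    """Mean of per-class recall over the classes present in `gold`."""
    recalls = []
    for label in sorted(set(gold)):
        idx = [i for i, g in enumerate(gold) if g == label]
        recalls.append(sum(pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def micro_macro(datasets: dict[str, tuple[list[str], list[str]]]) -> tuple[float, float]:
    """Micro pools all samples before scoring; macro averages the per-dataset scores."""
    all_gold = [g for gold, _ in datasets.values() for g in gold]
    all_pred = [p for _, pred in datasets.values() for p in pred]
    micro = balanced_accuracy(all_gold, all_pred)
    macro = sum(balanced_accuracy(g, p) for g, p in datasets.values()) / len(datasets)
    return micro, macro


# Macro for the DecomposeRL-7B row, recomputed from the table above.
ours = [74.1, 98.6, 76.4, 93.1, 86.5, 87.6, 87.5, 85.5, 87.7]
print(round(sum(ours) / len(ours), 1))  # 86.3
```

Because micro pooling weights each dataset by its sample count, large benchmarks dominate the Micro column, while every benchmark counts equally in the Macro column.
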
### Out-of-domain

| Dataset | # Examples | Balanced Acc | Accuracy | F1 |
|---|---:|---:|---:|---:|
| Coverbench | 728 | **0.6021** | 0.5989 | 0.6086 |
| LLM-AggreFact | 29,320 | **0.7695** | 0.8510 | 0.9054 |

## Intended Use

- **In-scope**: verifying factual claims against a *provided* evidence document (open-book fact verification, retrieval-augmented fact-checking pipelines).
- **Out-of-scope**: closed-book fact-checking, claim verification against the model's parametric knowledge, real-time news verification without supplied evidence.

The model is trained to say *"I don't know"* when the evidence document is silent; please respect that signal in downstream systems instead of forcing a label.

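
One way a downstream system could respect that signal is to surface it alongside the label rather than discarding it. The sketch below is an assumption about how such handling might look: `summarize_trace` and the `needs_review` policy are hypothetical and not part of the released code.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
LABEL_RE = re.compile(r"<verification>\s*(Supported|Refuted)\s*</verification>")


def summarize_trace(trace: str) -> dict:
    """Extract the final label and flag traces that contain "I don't know" answers."""
    answers = [a.strip() for a in ANSWER_RE.findall(trace)]
    # Normalize curly apostrophes so both spellings of "don't" are matched.
    unknown = sum("i don't know" in a.lower().replace("\u2019", "'") for a in answers)
    match = LABEL_RE.search(trace)
    return {
        "label": match.group(1) if match else None,
        "unknown_answers": unknown,
        # Hypothetical policy: route to review if no label or any abstention.
        "needs_review": match is None or unknown > 0,
    }
```

A pipeline could then send flagged traces to a human reviewer or retrieve more evidence instead of acting on the label directly.
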
## Citation

```bibtex
```

## License

Released under the Apache 2.0 License.