dipta007 committed on
Commit 257b03c · verified · 1 Parent(s): 910375e

Update README

Files changed (1)
  1. README.md +15 -15
README.md CHANGED
@@ -37,7 +37,7 @@ tags:
  - **84.6% macro-average balanced accuracy** across the same 9 benchmarks
  - Out-of-domain: **60.2% balanced accuracy on Coverbench**, **77.0% on LLM-AggreFact**
  - Strong on long-form evidence: 87% on Ex-FEVER, 92% on FEVEROUS, 76% on HoVer
- - Reasoning is **fully transparent** — the model emits its sub-claim checklist, every question it asked, every quote from evidence, and a final label
 
  ## Model Overview
 
@@ -60,28 +60,28 @@ DecomposeRL trains the policy to follow a **decompose-question-answer-verify** l
  3. **Sufficiency check** (`<think>`): track which sub-claims are resolved; loop until every sub-claim is addressed.
  4. **Final verdict** (`<verification>`): `Supported` or `Refuted`.
 
- ### Reward Stack — seven complementary signals
 
  GRPO is supervised with a sum of seven rewards, grouped into three families:
 
  **Programmatic anchors** (no judge call)
 
- 1. **Format** — ensures the trace is parseable; a gating prerequisite without which no other reward can be computed.
- 2. **Question count** — discourages collapsing the decomposition into one mega-question or padding it with filler.
- 3. **Diversity** — penalizes redundant questions so the policy covers distinct sub-claims instead of rewording the same one.
 
  **Set-level signals**
 
- 4. **Coverage** — checks whether the verdict can be recovered from the answers alone; tests if the decomposition is *collectively sufficient*.
- 5. **Verification** — direct outcome anchor: did the final label match the gold label?
 
  **Leave-one-out and per-question composites**
 
- 6. **Necessity (leave-one-out)** — the only signal that can push the policy to *remove* misleading questions; a question is necessary iff its removal would change the verdict.
- 7. **Joint multiplicative quality** — composes three per-question sub-signals so a question must clear *all* of them simultaneously rather than scoring partial credit:
- - **(7a) Answerability** — is the question answerable from the evidence?
- - **(7b) Atomicity** — is it a single-focus, verifiable question grounded in the claim?
- - **(7c) Answer correctness** — is the answer faithful to the document (no contradictions, no extrinsic info)?
 
  ## Quickstart
 
@@ -213,7 +213,7 @@ the specific companies. Next, verify the companies and the layoffs.
  💬 A2: The evidence document states that Nicholson worked as a consultant for
  companies that laid off nearly 1,900 people since 2015, shutting down
  plants in Wisconsin and other states. But it also says Baldwin cites no
- evidence that Nicholson's work caused the layoffs and shutdowns — only
  some element of truth, our definition of Mostly False.
 
  ──────────────────────────────────────────────────────────────────────────────
@@ -238,7 +238,7 @@ The `--max-model-len` matches the training-time `max_seq_length=16016` (with `ma
 
  ## Performance
 
- ### In-domain — balanced accuracy (%) on 9 claim-verification benchmarks
 
  Compared against every same-size (Qwen-7B) baseline plus MiniCheck. *Micro* is pooled balanced accuracy across all in-domain samples; *Macro* is the uniform mean across the 9 datasets. **Bold** marks the column winner; *italic* marks the second-best.
 
@@ -269,7 +269,7 @@ Compared against every same-size (Qwen-7B) baseline plus MiniCheck. *Micro* is p
  - **In-scope**: verifying factual claims against a *provided* evidence document (open-book fact verification, retrieval-augmented fact-checking pipelines).
  - **Out-of-scope**: closed-book fact-checking, claim verification against the model's parametric knowledge, real-time news verification without supplied evidence.
 
- The model is trained to say *"I don't know"* when the evidence document is silent — please respect that signal in downstream systems instead of forcing a label.
 
  ## Citation
 
 
  - **84.6% macro-average balanced accuracy** across the same 9 benchmarks
  - Out-of-domain: **60.2% balanced accuracy on Coverbench**, **77.0% on LLM-AggreFact**
  - Strong on long-form evidence: 87% on Ex-FEVER, 92% on FEVEROUS, 76% on HoVer
+ - Reasoning is **fully transparent**: the model emits its sub-claim checklist, every question it asked, every quote from evidence, and a final label
 
  ## Model Overview
 
 
  3. **Sufficiency check** (`<think>`): track which sub-claims are resolved; loop until every sub-claim is addressed.
  4. **Final verdict** (`<verification>`): `Supported` or `Refuted`.
 
+ ### Reward Stack: seven complementary signals
 
  GRPO is supervised with a sum of seven rewards, grouped into three families:
 
  **Programmatic anchors** (no judge call)
 
+ 1. **Format**: ensures the trace is parseable; a gating prerequisite without which no other reward can be computed.
+ 2. **Question count**: discourages collapsing the decomposition into one mega-question or padding it with filler.
+ 3. **Diversity**: penalizes redundant questions so the policy covers distinct sub-claims instead of rewording the same one.
 
  **Set-level signals**
 
+ 4. **Coverage**: checks whether the verdict can be recovered from the answers alone; tests if the decomposition is *collectively sufficient*.
+ 5. **Verification**: direct outcome anchor; did the final label match the gold label?
 
  **Leave-one-out and per-question composites**
 
+ 6. **Necessity (leave-one-out)**: the only signal that can push the policy to *remove* misleading questions; a question is necessary iff its removal would change the verdict.
+ 7. **Joint multiplicative quality**: composes three per-question sub-signals so a question must clear *all* of them simultaneously rather than scoring partial credit:
+ - **(7a) Answerability**: is the question answerable from the evidence?
+ - **(7b) Atomicity**: is it a single-focus, verifiable question grounded in the claim?
+ - **(7c) Answer correctness**: is the answer faithful to the document (no contradictions, no extrinsic info)?
 
  ## Quickstart
 
 
  💬 A2: The evidence document states that Nicholson worked as a consultant for
  companies that laid off nearly 1,900 people since 2015, shutting down
  plants in Wisconsin and other states. But it also says Baldwin cites no
+ evidence that Nicholson's work caused the layoffs and shutdowns, only
  some element of truth, our definition of Mostly False.
 
  ──────────────────────────────────────────────────────────────────────────────
 
 
  ## Performance
 
+ ### In-domain: balanced accuracy (%) on 9 claim-verification benchmarks
 
  Compared against every same-size (Qwen-7B) baseline plus MiniCheck. *Micro* is pooled balanced accuracy across all in-domain samples; *Macro* is the uniform mean across the 9 datasets. **Bold** marks the column winner; *italic* marks the second-best.
 
 
  - **In-scope**: verifying factual claims against a *provided* evidence document (open-book fact verification, retrieval-augmented fact-checking pipelines).
  - **Out-of-scope**: closed-book fact-checking, claim verification against the model's parametric knowledge, real-time news verification without supplied evidence.
 
+ The model is trained to say *"I don't know"* when the evidence document is silent; please respect that signal in downstream systems instead of forcing a label.
 
  ## Citation
 