---
license: other
license_name: morphmind-cfm-research-license
license_link: LICENSE
base_model: Qwen/Qwen2.5-3B-Instruct
pipeline_tag: text-generation
library_name: transformers
inference: false
tags:
- control-foundation-model
- scientific-ai
- methodology-review
- peer-review
- rlvr
- morphmind
---
# CFM-Methods-3B · MorphMind
**A tiny control model that reads a methods section and tells you exactly where the methodology is
unsound.** Give it a methods or experimental-design block from any empirical-science paper
(**statistics, machine learning, quantitative biology, econometrics, materials science, or chemical
physics**) and it returns a structured verdict, **support** or **refute**, pinpoints the offending
statement, and explains why. It is a **high-recall screen**: it surfaces methodological red flags such as
data leakage, p-hacking, uncorrected multiple comparisons, train/test contamination, optional stopping,
correlation-as-causation, post-hoc outlier removal, and unblinded scoring, so that a human reviewer
misses almost nothing.

At just **3B parameters**, CFM-Methods-3B delivers **frontier-level methodology screening** that runs
on a single GPU, on-premise, at a tiny fraction of the cost of a frontier API. It is the compact member
of MorphMind's **Control Foundation Model (CFM)** line: models whose job is not to *generate*
science but to **check** it.

*By [MorphMind](https://morphmind.ai). Research preview.*
## Benchmark: methodology-flaw detection vs. frontier models

![methodology benchmark](benchmark.png)

The model is evaluated on **flaw types it never trained on** (24 flaw families used for training, **12
held out for evaluation**) and benchmarked head-to-head against frontier commercial models on the *same*
held-out set:

| Model | Recall | Precision | Localization | False-positive rate (clean) |
|---|---|---|---|---|
| base Qwen2.5-3B | 0.30 | --- | 0.42 | 0.07 |
| GPT-4o | 0.86 | 0.64 | 0.94 | 0.47 |
| Claude Opus 4 | 0.96 | 0.78 | 0.97 | 0.28 |
| **CFM-Methods-3B (ours)** | **0.98** | **1.00** | **0.97** | **0.005** |

**CFM-Methods-3B matches or exceeds frontier recall and localization, with the lowest false-alarm rate
in the comparison.** It catches **98% of flaws from families it never saw in training** (Claude Opus 4:
96%, GPT-4o: 86%) and pinpoints the exact flawed statement **97% of the time** (Claude Opus 4: 97%,
GPT-4o: 94%). The frontier models over-flag clean methods heavily (false-positive rate 28% for Opus,
47% for GPT-4o), against 0.5% for CFM-Methods-3B. The result is **frontier-grade methodology screening
with the precision of a careful expert: on-prem, in a 3B model, at a tiny fraction of the cost.**
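
The card does not spell out how the four table columns are computed. As a rough sketch, assuming the usual confusion-matrix definitions, they could be derived from per-example results as below. The `Example` record, the `screen_metrics` helper, and in particular the reading of localization as "share of correctly flagged blocks whose reported span matches the true flawed statement" are assumptions for illustration, not the benchmark's actual scoring code.

```python
from dataclasses import dataclass
from math import nan


@dataclass
class Example:
    has_flaw: bool          # ground truth: the block contains an injected flaw
    flagged: bool           # model verdict was "refute"
    span_hit: bool = False  # reported span matches the true flawed statement


def screen_metrics(examples):
    """Compute screening metrics under the assumed definitions above."""
    flawed = [e for e in examples if e.has_flaw]
    clean = [e for e in examples if not e.has_flaw]
    tp = sum(e.flagged for e in flawed)  # flawed blocks correctly flagged
    fp = sum(e.flagged for e in clean)   # clean blocks wrongly flagged
    hits = sum(e.flagged and e.span_hit for e in flawed)
    return {
        "recall": tp / len(flawed) if flawed else nan,
        "precision": tp / (tp + fp) if tp + fp else nan,
        "localization": hits / tp if tp else nan,
        "false_positive_rate": fp / len(clean) if clean else nan,
    }


# Toy data (made up): 4 flawed blocks, 4 clean blocks.
toy = [
    Example(True, True, True),
    Example(True, True, True),
    Example(True, True, False),
    Example(True, False),
    Example(False, True),
    Example(False, False),
    Example(False, False),
    Example(False, False),
]
metrics = screen_metrics(toy)
print(metrics)
```

Under these definitions a high-recall screen is only useful if the false-positive rate on clean blocks stays low, which is why the table reports both.
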
## Worked example
**1 — it catches an uncorrected multiple-comparisons flaw.** Given this methods block:
> *"We screened 60 candidate protein markers for association with disease status. Each marker was
> tested individually with a univariate logistic regression at the 0.05 level. The 14 markers reaching
> p < 0.05 in univariate tests are reported as significant and carried forward as the disease signature."*

CFM-Methods-3B returns (verbatim model output):

```json
{
  "analysis": "There is a methodological flaw: Many tests without correction inflate the false-positive rate.",
  "verdict": "refute",
  "error_spans": [
    {
      "text": "The 14 markers reaching p < 0.05 in univariate tests are reported as significant",
      "why": "Many tests without correction inflate the false-positive rate."
    }
  ],
  "action": "suggest_edit"
}
```
It pinpoints the exact offending sentence and names the failure mode — 60 simultaneous tests at α = 0.05
with no correction.
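
The size of the problem in this example can be checked with a few lines of arithmetic. The sketch below assumes, for illustration only, that the 60 tests are independent and that no marker is truly associated with disease. The Bonferroni threshold is shown as one standard correction, not as something the model itself recommends.

```python
n_tests = 60
alpha = 0.05

# Markers expected to reach p < alpha purely by chance under the null.
expected_false_positives = n_tests * alpha

# Family-wise error rate: probability of at least one false positive,
# assuming independent tests.
fwer = 1 - (1 - alpha) ** n_tests

# Bonferroni-corrected per-test significance threshold.
bonferroni_alpha = alpha / n_tests

print(f"expected false positives: {expected_false_positives:.1f}")  # 3.0
print(f"family-wise error rate:   {fwer:.3f}")                      # 0.954
print(f"Bonferroni threshold:     {bonferroni_alpha:.5f}")          # 0.00083
```

So roughly 3 of the 14 reported markers would be expected from chance alone, and at least one false positive is close to certain.
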
**2 — and it passes clean methods without crying wolf.** Given a sound instrumental-variables design:
> *"We estimate the causal effect of schooling on wages with two-stage least squares, instrumenting
> years of education with quarter-of-birth. We report first-stage F-statistics to confirm instrument
> strength and cluster standard errors at the state level."*

```json
{
  "analysis": "The methodology is sound: instrument strength is verified and standard errors are clustered appropriately.",
  "verdict": "support",
  "error_spans": [],
  "action": "accept"
}
```

No false flag: this is what the near-zero false-positive rate in the benchmark above looks like in practice.
## When & how to use it
Use it as a **fast, private, first-pass methodology screen** --- a pre-submission self-check for
researchers, triage for journals / reviewers / grant panels, QA over a stack of submissions, or a check
on AI-generated experimental designs. **Review one methods block at a time** (split a paper into its
method / experiment / analysis sections and run each). Because it is tuned for recall, treat its flags
as *"worth a human's 30 seconds."*
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "MorphMind-AI/CFM-Methods-3B"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# System prompt: requests JSON only, in the schema shown in the worked example.
SYS = ("You are a scientific methodology reviewer. Review the methods and respond ONLY with JSON: "
       "{\"analysis\":...,\"verdict\":\"support|refute\","
       "\"error_spans\":[{\"text\":...,\"why\":...}],\"action\":\"accept|suggest_edit\"}")

def review(methods):
    """Review one methods block and return the model's raw JSON string."""
    msgs = [{"role": "system", "content": SYS},
            {"role": "user", "content": "METHODS:\n" + methods}]
    ids = tok.apply_chat_template(
        msgs, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # Greedy decoding, capped at 320 new tokens.
    out = model.generate(ids, max_new_tokens=320, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
```
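
Since `review` returns a raw string, downstream code usually needs to parse it and decide what to show a human. The following is a minimal sketch of that step, not part of the released model: `parse_review`, `ungrounded_spans`, and `screen_sections` are hypothetical helper names, and the checks simply mirror the JSON schema from the system prompt. `fake_review` is a stub standing in for the real `review` function so the example runs without loading the model.

```python
import json

VERDICTS = {"support", "refute"}
ACTIONS = {"accept", "suggest_edit"}


def parse_review(raw):
    """Parse the first JSON object in the model's reply and check the schema."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object in model output")
    # raw_decode parses one JSON value and tolerates trailing text after it.
    result, _ = json.JSONDecoder().raw_decode(raw[start:])
    if result.get("verdict") not in VERDICTS:
        raise ValueError(f"unexpected verdict: {result.get('verdict')!r}")
    if result.get("action") not in ACTIONS:
        raise ValueError(f"unexpected action: {result.get('action')!r}")
    return result


def _norm(text):
    """Collapse whitespace and lowercase, so line wraps do not break matching."""
    return " ".join(text.split()).lower()


def ungrounded_spans(result, methods):
    """Return error spans whose quoted text does not occur in the input block."""
    source = _norm(methods)
    bad = []
    for span in result.get("error_spans", []):
        quoted = _norm(span.get("text", ""))
        if not quoted or quoted not in source:
            bad.append(span)
    return bad


def screen_sections(sections, review_fn):
    """Review each named section separately; keep only the flagged ones."""
    flagged = {}
    for name, text in sections.items():
        result = parse_review(review_fn(text))
        if result["verdict"] == "refute":
            flagged[name] = result
    return flagged


# Stub standing in for review(); replace with the real function in practice.
def fake_review(methods):
    if "p < 0.05" in methods:
        span = "The 14 markers reaching p < 0.05 are reported as significant"
        return json.dumps({"analysis": "flaw", "verdict": "refute",
                           "error_spans": [{"text": span, "why": "no correction"}],
                           "action": "suggest_edit"})
    return json.dumps({"analysis": "sound", "verdict": "support",
                       "error_spans": [], "action": "accept"})


sections = {
    "analysis": "The 14 markers reaching p < 0.05 are reported as significant.",
    "design": "We estimate the effect with two-stage least squares.",
}
flagged = screen_sections(sections, fake_review)
print(sorted(flagged))  # ['analysis']
print(ungrounded_spans(flagged["analysis"], sections["analysis"]))  # []
```

Checking that each quoted span actually occurs in the input is a cheap guard: a flag whose span cannot be found in the source text deserves extra scrutiny before it reaches a reviewer.
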
## How it was built
CFM-Methods-3B is a full-parameter fine-tune of Qwen2.5-3B-Instruct, trained with **RLVR**
(Reinforcement Learning from Verifiable Rewards) under a **localization-gated reward**: a verdict is
reinforced only if the model also points to the actual flawed statement, which teaches genuine reasoning
rather than blanket flagging. The training data are public **arXiv** methods sections across statistics,
machine learning, quantitative biology, econometrics, materials science, and chemical physics, with
injected, paraphrased methodological flaws; evaluation uses held-out flaw families.
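
To make the idea of a localization-gated reward concrete, here is a minimal sketch. The actual reward used in training is not published in this card, so everything below is an assumption for illustration: the function names, the token-overlap measure, the 0.5 threshold, and the binary 0/1 reward are all hypothetical.

```python
def _tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def span_overlap(pred_span, true_span):
    """Fraction of the true flawed statement's tokens covered by the prediction."""
    true_tokens = _tokens(true_span)
    if not true_tokens:
        return 0.0
    return len(_tokens(pred_span) & true_tokens) / len(true_tokens)


def localization_gated_reward(pred_verdict, pred_spans, true_verdict,
                              true_span="", min_overlap=0.5):
    """Reward 1.0 only for a correct verdict that is also correctly localized."""
    if pred_verdict != true_verdict:
        return 0.0
    if true_verdict == "support":
        return 1.0  # clean block: nothing to localize
    # Gate: a correct "refute" earns reward only if some span hits the flaw.
    located = any(span_overlap(s, true_span) >= min_overlap for s in pred_spans)
    return 1.0 if located else 0.0


true_span = "The 14 markers reaching p < 0.05 are reported as significant"
r_hit = localization_gated_reward("refute", [true_span], "refute", true_span)
r_miss = localization_gated_reward(
    "refute", ["Each marker was tested individually"], "refute", true_span)
r_blank = localization_gated_reward("refute", [], "refute", true_span)
r_clean = localization_gated_reward("support", [], "support")
print(r_hit, r_miss, r_blank, r_clean)  # 1.0 0.0 0.0 1.0
```

The gate is what removes the incentive for blanket flagging: always answering "refute" without finding the flawed statement earns nothing, even on blocks that really are flawed.
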
## Notes
- A **high-recall screen** for first-pass review: ~98% of flaws surfaced with a near-zero false-alarm
  rate (0.5% on clean methods), designed to keep an expert in the loop for the final call.
- **Generalizes** to methodological flaws it has never seen, across six empirical-science domains.
- Part of MorphMind's growing **Control Foundation Model** family.
## License
Released under the **MorphMind CFM Research License** (see `LICENSE`), incorporating the **Qwen Research
License** of the Qwen2.5-3B base. Research / non-commercial use, with attribution to MorphMind and Qwen.
**For commercial licensing, contact MorphMind (morphmind.ai).**