---
license: other
license_name: morphmind-cfm-research-license
license_link: LICENSE
base_model: Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
library_name: transformers
inference: false
tags:
- control-foundation-model
- scientific-ai
- methodology-review
- peer-review
- rlvr
- morphmind
---

# CFM-Methods-7B · MorphMind

**A control model that reads a methods section and flags where the methodology is unsound.** Give it a
methods or experimental-design block from any empirical-science paper — **statistics, machine learning,
quantitative biology, econometrics, materials science, or chemical physics** — and it returns a
structured verdict, **support** or **refute**, pinpoints the offending statement, and explains why. It is
a **high-recall screen**: it surfaces methodological red flags — data leakage, p-hacking, uncorrected
multiple comparisons, train/test contamination, optional stopping, correlation-as-causation, post-hoc
outlier removal, unblinded scoring, and more — so a human misses almost nothing.

CFM-Methods-7B is the **conformance pillar** of MorphMind's **Control Foundation Model (CFM)** line —
models whose job is not to *generate* science but to **check** it.

*By [MorphMind](https://morphmind.ai). Research preview.*

## Benchmark — methodology-flaw detection (honest, held-out)

![methodology benchmark](benchmark.png)

Evaluated on **flaw types the model never trained on** (24 flaw families used for training, **12 held
out for evaluation**) — so this measures *generalization*, not memorization — and benchmarked head-to-head
against frontier models on the **same held-out set**:

| Model | Recall | Precision | Localization | False-positive rate (clean) |
|---|---|---|---|---|
| base Qwen2.5-7B | 0.30 | — | 0.42 | 0.07 |
| GPT-4o | 0.86 | 0.64 | 0.94 | 0.47 |
| Claude Opus 4 | 0.96 | 0.78 | 0.97 | 0.28 |
| **CFM-Methods-7B (ours)** | **0.98** | **1.00** | **0.98** | **0.00** |

**CFM-Methods-7B leads on recall, precision, and localization, and is the only model with zero false
alarms.** It catches 98% of methodological flaws from families it has never seen and pinpoints the exact
flawed statement 98% of the time, ahead of Claude Opus 4 (96% recall, 97% localization). The frontier
models over-flag clean methods heavily (Opus 28%, GPT-4o 47% false-positive rate). It therefore delivers
**frontier-leading methodology screening with the precision of a careful expert, on-prem, at ~1/100 the
cost of a frontier API**, and can run across every methods section in your pipeline. Recall stays high
across all 12 held-out flaw families; a human makes the final call.
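
For readers who want to reproduce the table columns on their own data, here is a minimal sketch of how recall, precision, and the clean false-positive rate relate to per-example verdicts. It assumes the standard confusion-matrix definitions; the card does not publish its evaluation script, so the function name `screening_metrics` and the toy labels are illustrative only, and localization is left out because its exact scoring rule is not specified here.

```python
def screening_metrics(gold_flawed, pred_refute):
    """Compute recall, precision, and the false-positive rate on clean inputs.

    gold_flawed[i] is True if example i really contains a flaw;
    pred_refute[i] is True if the model's verdict for it was "refute".
    """
    pairs = list(zip(gold_flawed, pred_refute))
    tp = sum(g and p for g, p in pairs)                # flawed, flagged
    fn = sum(g and not p for g, p in pairs)            # flawed, missed
    fp = sum((not g) and p for g, p in pairs)          # clean, flagged
    tn = sum((not g) and (not p) for g, p in pairs)    # clean, passed
    return {
        "recall": tp / (tp + fn) if tp + fn else None,
        "precision": tp / (tp + fp) if tp + fp else None,
        "fpr_clean": fp / (fp + tn) if fp + tn else None,
    }


# Toy labels, made up for illustration (not the benchmark set).
gold = [True, True, True, True, False, False, False, False]
pred = [True, True, True, False, True, False, False, False]
print(screening_metrics(gold, pred))
```

Under these definitions, a model that never flags a clean example has no false positives, so its precision is 1.00 whenever it flags at least one flawed example. That is why the 0.00 false-positive rate and the 1.00 precision in the last table row go together.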

## Worked examples

**1 — it catches an uncorrected multiple-comparisons flaw.** Given this methods block:

> *"We screened 60 candidate protein markers for association with disease status. Each marker was
> tested individually with a univariate logistic regression at the 0.05 level. The 14 markers reaching
> p < 0.05 in univariate tests are reported as significant and carried forward as the disease signature."*

CFM-Methods-7B returns (verbatim model output):

```json
{
  "analysis": "There is a multiple testing problem: 14 out of 60 tests reaching p < 0.05 by chance alone.",
  "verdict": "refute",
  "error_spans": [
    {
      "text": "Each marker was tested individually with a univariate logistic regression at the 0.05 level.",
      "why": "There is a multiple testing problem: 14 out of 60 tests reaching p < 0.05 by chance alone."
    }
  ],
  "action": "suggest_edit"
}
```

It pinpoints the offending sentence and quantifies the failure — 60 simultaneous tests at α = 0.05 with
no correction.
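
To make the size of the problem concrete, the arithmetic behind this flag can be sketched as follows. The calculation assumes, purely for illustration, that all 60 null hypotheses are true and that the tests are independent; neither assumption comes from the methods block itself, and Bonferroni is shown only as one common correction.

```python
n_tests = 60   # candidate markers screened
alpha = 0.05   # per-test significance level

# Expected number of markers passing by chance if no marker is truly associated.
expected_false_positives = n_tests * alpha

# Probability of at least one false positive across all tests (independence assumed).
family_wise_error_rate = 1 - (1 - alpha) ** n_tests

# Per-test threshold that holds the family-wise error rate at alpha (Bonferroni).
bonferroni_alpha = alpha / n_tests

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"family-wise error rate:   {family_wise_error_rate:.3f}")
print(f"Bonferroni threshold:     {bonferroni_alpha:.5f}")
```

Under these assumptions roughly 3 of the 60 markers would clear p < 0.05 by chance, and at least one false positive is nearly certain, so the reported 14-marker signature cannot be taken at face value without a correction.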

**2 — and it passes clean methods without crying wolf.** Given a sound instrumental-variables design:

> *"We estimate the causal effect of schooling on wages with two-stage least squares, instrumenting
> years of education with quarter-of-birth. We report first-stage F-statistics to confirm instrument
> strength and cluster standard errors at the state level."*

```json
{
  "analysis": "The methods are sound — the analysis choices are appropriate for the data and claims.",
  "verdict": "support",
  "error_spans": [],
  "action": "accept"
}
```

No false flag — the zero false-positive rate in the benchmark above is what this looks like in practice.
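
Because both replies above are plain JSON, downstream code can parse and sanity-check them before acting on a verdict. The following is a minimal sketch, not part of the released tooling: the function name `parse_review` and the `spans_found` field are hypothetical, and the allowed values are taken from the two example outputs on this card.

```python
import json

ALLOWED_VERDICTS = {"support", "refute"}
ALLOWED_ACTIONS = {"accept", "suggest_edit"}


def parse_review(raw, methods):
    """Parse a model reply and check it against the schema shown in the examples.

    Returns the parsed dict plus a "spans_found" list telling whether each
    quoted error span literally occurs in the input. Raises ValueError otherwise.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if data.get("verdict") not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {data.get('verdict')!r}")
    if data.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unexpected action: {data.get('action')!r}")
    spans = data.get("error_spans", [])
    if not isinstance(spans, list):
        raise ValueError("error_spans must be a list")

    # Compare on whitespace-normalized text so line wrapping does not matter.
    normalized = " ".join(methods.split())
    found = []
    for span in spans:
        text = " ".join(span.get("text", "").split()) if isinstance(span, dict) else ""
        found.append(bool(text) and text in normalized)
    data["spans_found"] = found
    return data
```

A span that is not found verbatim in the input is a useful signal that the quoted text was paraphrased or hallucinated, and is worth routing to a human rather than applying automatically.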

## When & how to use it
Use it as a **fast first-pass methodology screen** — to flag questionable analysis choices before a
human deep-read, to triage submissions, or to vet AI-generated methods. **Review one methods block at a
time** (split a paper into its method/experiment/analysis sections and run each). Because it is tuned
for recall, treat its flags as *"worth a human's 30 seconds."* Keep a human in the loop.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "MorphMind-AI/CFM-Methods-7B"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16,
                                             device_map="auto")

# System prompt that fixes the JSON schema of the reply.
SYS = ("You are a scientific methodology reviewer. Review the methods and respond ONLY with JSON: "
       "{\"analysis\":...,\"verdict\":\"support|refute\","
       "\"error_spans\":[{\"text\":...,\"why\":...}],\"action\":\"accept|suggest_edit\"}")

def review(methods):
    """Review one methods block and return the model's raw JSON string."""
    msgs = [{"role": "system", "content": SYS},
            {"role": "user", "content": "METHODS:\n" + methods}]
    # return_dict=True yields input_ids and attention_mask together.
    inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True,
                                     return_tensors="pt").to(model.device)
    # Greedy decoding keeps the verdict deterministic.
    out = model.generate(**inputs, max_new_tokens=320, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    return tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```
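
The advice to review one methods block at a time can be sketched as a small splitter that feeds each relevant section to a review function. This is an illustrative helper, not part of the model's tooling: it assumes the paper is available as Markdown with `#`-style headings, and the keyword filter `SECTION_KEYWORDS` and the names `split_methods_sections` and `review_paper` are hypothetical.

```python
import re

# Hypothetical filter: heading words that mark a section worth screening.
SECTION_KEYWORDS = ("method", "experiment", "analysis")


def split_methods_sections(paper_md):
    """Split a Markdown paper on headings and keep the methods-like sections.

    Returns a list of (heading, body) pairs in document order.
    """
    sections = []
    title, lines = None, []
    for line in paper_md.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            if title is not None:
                sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(1).strip(), []
        elif title is not None:
            lines.append(line)
    if title is not None:
        sections.append((title, "\n".join(lines).strip()))
    return [(t, b) for t, b in sections
            if b and any(k in t.lower() for k in SECTION_KEYWORDS)]


def review_paper(paper_md, review_fn):
    """Run review_fn (e.g. the review function above) on each selected section."""
    return [(title, review_fn(body)) for title, body in split_methods_sections(paper_md)]
```

Passing the reviewer in as `review_fn` keeps the splitter testable without loading the model; in practice one would call `review_paper(text, review)` with the `review` function defined above.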

## How it was built
A full-parameter fine-tune of Qwen2.5-7B-Instruct, trained with **RLVR** (Reinforcement Learning from
Verifiable Rewards) under a **localization-gated reward** — a verdict is reinforced only if the model
also points to the actual flawed statement, which forces real reasoning rather than blanket "refute."
Trained on public **arXiv** methods sections (statistics, ML, quantitative biology, econometrics,
materials science, chemical physics) with injected, paraphrased methodological flaws.
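
The idea of a localization-gated reward can be illustrated with a small sketch. This is not the training code: the binary reward values, the token-level Jaccard overlap, the `min_overlap` threshold, and the function name are all assumptions chosen to show the gating logic, in which a correct "refute" earns reward only when a predicted span matches the statement where the flaw was injected.

```python
def localization_gated_reward(pred_verdict, pred_spans, gold_verdict,
                              gold_span=None, min_overlap=0.5):
    """Return 1.0 only if the verdict is right AND the flaw is localized.

    pred_spans is a list of quoted strings from the model's error_spans;
    gold_span is the injected flawed statement (None for clean examples).
    """
    if pred_verdict != gold_verdict:
        return 0.0
    if gold_verdict == "support":
        # Clean example: a correct pass should not flag any span.
        return 1.0 if not pred_spans else 0.0
    if not gold_span:
        return 0.0
    gold_tokens = set(gold_span.lower().split())
    for span in pred_spans:
        tokens = set(span.lower().split())
        if not tokens:
            continue
        # Jaccard overlap between predicted and gold token sets (assumed metric).
        overlap = len(tokens & gold_tokens) / len(tokens | gold_tokens)
        if overlap >= min_overlap:
            return 1.0
    # Right verdict, wrong location: the gate withholds the reward.
    return 0.0


gold = "Each marker was tested individually at the 0.05 level."
print(localization_gated_reward("refute", [gold], "refute", gold))
print(localization_gated_reward("refute", ["We screened 60 markers."], "refute", gold))
```

The second call shows the gate at work: the verdict is correct, but the model pointed at the wrong sentence, so a blanket "refute" strategy would not be rewarded.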

## Notes
- A **high-recall screen** built for first-pass review: on the held-out benchmark it surfaces 98% of
  methodological flaws with a 0.00 false-positive rate on clean methods, and it is designed to keep an
  expert in the loop for the final call.
- **Generalizes** strongly to methodological flaws it has never seen, across statistics, ML, quantitative
  biology, econometrics, materials science, and chemical physics.
- Part of MorphMind's growing **Control Foundation Model** family — research preview, improving with
  every release.

## License
Released under the **MorphMind CFM Research License** (see `LICENSE`). The Qwen2.5-7B-Instruct base is
Apache-2.0; this fine-tune is for **research / non-commercial** use, with attribution to MorphMind and Qwen.
**Commercial licensing: contact MorphMind (morphmind.ai).**