# NPC Fin-PRM 7B: Financial Process Reward Model
A 7B process reward model trained to verify the correctness of step-level reasoning in financial analysis. Scores individual reasoning steps for factual accuracy, logical validity, completeness, and risk awareness.
Built for use in Best-of-N inference pipelines and reasoning alignment workflows.
## Model Details
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Method | QLoRA (4-bit NF4 quantization) |
| LoRA Rank | 32 |
| LoRA Alpha | 64 |
| LoRA Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training Framework | Unsloth + trl SFTTrainer |
| Max Sequence Length | 2,048 tokens |
| Training Iterations | 3 (bootstrapped improvement) |
## How It Works
The PRM evaluates individual reasoning steps within a multi-step financial analysis chain. For each step, it produces:

```json
{
  "overall_score": 0.85,
  "dimensions": {
    "factual_accuracy": 0.90,
    "logical_validity": 0.85,
    "completeness": 0.80,
    "risk_awareness": 0.85
  },
  "explanation": "Step correctly identifies the correlation but underweights liquidation cascade risk.",
  "error_identified": null
}
```
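Because the model emits this JSON as free text, downstream code typically parses and range-checks it before use. A minimal sketch (the field names follow the schema above; `parse_prm_output` is an illustrative helper, not part of any released package):

```python
import json

# Dimensions the PRM scores, per the schema above.
DIMENSIONS = ("factual_accuracy", "logical_validity", "completeness", "risk_awareness")

def parse_prm_output(text):
    """Parse a PRM JSON response and sanity-check score ranges."""
    data = json.loads(text)
    if not 0.0 <= data["overall_score"] <= 1.0:
        raise ValueError("overall_score out of range")
    for dim in DIMENSIONS:
        if not 0.0 <= data["dimensions"][dim] <= 1.0:
            raise ValueError(f"{dim} out of range")
    return data

raw = (
    '{"overall_score": 0.85, "dimensions": {"factual_accuracy": 0.90, '
    '"logical_validity": 0.85, "completeness": 0.80, "risk_awareness": 0.85}, '
    '"explanation": "...", "error_identified": null}'
)
result = parse_prm_output(raw)
print(result["overall_score"])  # 0.85
```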
## Training Data
- ~80,000 step-level evaluations across ~20,000 reasoning paths
- Source: Logic tree data with correct and incorrect reasoning branches
- Labels generated via model-as-judge pipeline (Qwen2.5-72B-Instruct)
- 3 bootstrapped iterations: train, evaluate, re-label worst 500 examples, retrain
- Each iteration improved precision on error detection
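The bootstrapped loop above can be sketched as follows. The `train`, `evaluate`, and `relabel` stages are placeholders for the actual SFT run, judge comparison, and judge re-labeling; the toy stubs at the bottom exist only to make the skeleton runnable:

```python
def bootstrap(dataset, train, evaluate, relabel, n_iterations=3, n_worst=500):
    """Iteratively retrain: after each round, re-label the worst-scoring
    examples with the judge model and fold them back into the dataset."""
    model = None
    for _ in range(n_iterations):
        model = train(dataset)                     # fine-tune PRM on current labels
        errors = evaluate(model, dataset)          # per-example disagreement vs judge
        worst = sorted(errors, key=errors.get, reverse=True)[:n_worst]
        for example_id in worst:
            dataset[example_id] = relabel(example_id)  # fresh judge label
    return model

# Toy run with stub stages (the real pipeline uses the 72B judge and SFTTrainer).
data = {i: 0 for i in range(10)}
model = bootstrap(
    data,
    train=lambda d: dict(d),                  # "model" is just a label snapshot here
    evaluate=lambda m, d: {i: i for i in d},  # pretend higher id = worse example
    relabel=lambda i: 1,                      # judge assigns a corrected label
    n_worst=2,
)
```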
## Training Configuration
| Parameter | Value |
|---|---|
| Optimizer | AdamW 8-bit |
| Learning Rate | 2e-4 |
| LR Schedule | Cosine decay |
| Warmup Ratio | 0.05 |
| Batch Size | 4 |
| Gradient Accumulation | 4 (effective batch size 16) |
| Epochs | 3 per iteration |
| Hardware | NVIDIA H100 80GB |
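With the hyperparameters above, the quantization and adapter setup might look like the following. This is a sketch using `transformers` and `peft` directly; the actual training run went through Unsloth, whose invocation may differ:

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base model (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

# LoRA adapter matching the table above: rank 32, alpha 64, dropout 0.05.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```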
## Evaluation Results
| Metric | Score |
|---|---|
| Spearman Correlation (vs judge labels) | 0.94 |
| Error Detection F1 | 0.87 |
| Step-Level Accuracy | 89.2% |
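For reference, the Spearman metric above is just the Pearson correlation of rank-transformed scores. A pure-Python version, with made-up example scores (in practice one would use `scipy.stats.spearmanr`):

```python
def ranks(xs):
    """Rank values 1..n, averaging ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average rank for a tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

prm_scores   = [0.2, 0.9, 0.5, 0.7, 0.1]   # hypothetical PRM outputs
judge_scores = [0.3, 0.8, 0.6, 0.9, 0.2]   # hypothetical judge labels
print(round(spearman(prm_scores, judge_scores), 2))  # 0.9
```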
### Per-Dimension Accuracy
| Dimension | Accuracy |
|---|---|
| Factual Accuracy | 91.3% |
| Logical Validity | 88.7% |
| Completeness | 86.4% |
| Risk Awareness | 90.1% |
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Load PRM adapter
model = PeftModel.from_pretrained(base_model, "ramankrishna10/npc-fin-prm-7b")

# Score a reasoning step
messages = [
    {"role": "system", "content": "You are a financial reasoning verifier. Evaluate the correctness of each reasoning step. Respond with JSON containing overall_score, dimensions, explanation, and error_identified."},
    {"role": "user", "content": """Scenario: ETH price dropped 15% in 24 hours while BTC remained flat.
Previous steps:
1. Identified ETH-specific selling pressure (correct -- on-chain data shows large wallet transfers to exchanges)
Current step to evaluate:
2. Concluded that ETH will continue dropping because BTC is flat, suggesting a broader market downturn."""},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.1, do_sample=True)

# Decode only the newly generated tokens (skip the echoed prompt)
result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(result)
```
## Best-of-N Inference
```python
def best_of_n(generator_model, prm_model, prompt, n=8):
    """Generate n candidate responses and return the one with the highest
    average step-level PRM score.

    `generate`, `extract_reasoning_steps`, and `score_step` are placeholders
    for your own generation, step-splitting, and PRM-scoring helpers.
    """
    candidates = []
    for _ in range(n):
        response = generate(generator_model, prompt)
        steps = extract_reasoning_steps(response)
        # Score each step given the prompt and all steps up to that point
        scores = [score_step(prm_model, prompt, steps[:i + 1]) for i in range(len(steps))]
        avg_score = sum(s["overall_score"] for s in scores) / len(scores)
        candidates.append((response, avg_score))
    return max(candidates, key=lambda x: x[1])
```
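One design choice worth noting: averaging step scores lets a single badly wrong step hide behind several good ones. A min-aggregation variant is stricter, ranking each chain by its weakest step. A sketch (the `overall_score` field follows the output schema earlier in this card):

```python
def chain_score(step_scores, mode="mean"):
    """Aggregate per-step PRM scores into one chain-level score.

    mode="mean": average quality; mode="min": weakest-link quality.
    """
    values = [s["overall_score"] for s in step_scores]
    if mode == "min":
        return min(values)
    return sum(values) / len(values)

steps = [{"overall_score": s} for s in (0.9, 0.85, 0.2)]
print(chain_score(steps))          # mean partially masks the bad step
print(chain_score(steps, "min"))   # min exposes it: 0.2
```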
## Intended Use
- Step-level reasoning verification in financial analysis pipelines
- Best-of-N inference for improved reasoning quality
- Alignment scoring for financial reasoning models
- Research into process reward models for domain-specific applications
## Limitations
- Trained on the financial/crypto domain – may not generalize to other reasoning domains
- Judge labels sourced from Qwen2.5-72B, not human annotators – ceiling limited by judge quality
- Scores are relative, not absolute – calibrate thresholds for your use case
- Not a replacement for human review in production financial decisions
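For the calibration point above, a simple approach is to sweep decision thresholds on a labeled validation set and keep the one that maximizes F1. A sketch with made-up validation scores and labels:

```python
def best_threshold(scores, is_error, candidates=None):
    """Pick the score threshold (flag a step as erroneous when score < t)
    that maximizes F1 against binary error labels."""
    if candidates is None:
        candidates = sorted(set(scores))
    best_t, best_f1 = None, -1.0
    for t in candidates:
        pred = [s < t for s in scores]
        tp = sum(p and e for p, e in zip(pred, is_error))
        fp = sum(p and not e for p, e in zip(pred, is_error))
        fn = sum((not p) and e for p, e in zip(pred, is_error))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

scores   = [0.95, 0.80, 0.40, 0.30, 0.70, 0.20]   # hypothetical PRM scores
is_error = [False, False, True, True, False, True]  # hypothetical labels
t, f1 = best_threshold(scores, is_error)
```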
## Alignment Relevance
Process reward models are a key component in AI alignment research. This model demonstrates:
- Step-level verification – catching errors in intermediate reasoning, not just final outputs
- Bootstrapped improvement – iterative training where the model's own failures inform the next round
- Decomposed scoring – multi-dimensional evaluation (factual, logical, completeness, risk) rather than a single scalar reward
- Domain-specific alignment – aligning reasoning quality within a specialized vertical
## Related Models
- npc-fin-32b-sft – the financial reasoning model this PRM was built to verify
## Citation

```bibtex
@misc{bachu2026npcfinprm,
  title={NPC Fin-PRM 7B: Process Reward Model for Financial Reasoning Verification},
  author={Ramakrishna Bachu},
  year={2026},
  url={https://huggingface.co/ramankrishna10/npc-fin-prm-7b}
}
```
## Author

Ramakrishna Bachu – GitHub | LinkedIn
Part of the NPC Model Family by Bottensor.