NPC Fin-PRM 7B: Financial Process Reward Model
A 7B process reward model trained to verify the correctness of step-level reasoning in financial analysis. It scores individual reasoning steps for factual accuracy, logical validity, completeness, and risk awareness.
Built for use in Best-of-N inference pipelines and reasoning alignment workflows.
Paper: Cheap PRMs: Multi-Dimensional Process Reward Modeling for Domain-Specialized Reasoning (Zenodo, 2026)
Model Details
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Method | QLoRA (4-bit NF4 quantization) |
| LoRA Rank | 32 |
| LoRA Alpha | 64 |
| LoRA Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training Framework | Unsloth + trl SFTTrainer |
| Max Sequence Length | 2,048 tokens |
How It Works
The PRM evaluates individual reasoning steps within a multi-step financial analysis chain. For each step, it produces:
{
  "overall_score": 0.85,
  "dimensions": {
    "factual_accuracy": 0.90,
    "logical_validity": 0.85,
    "completeness": 0.80,
    "risk_awareness": 0.85
  },
  "explanation": "Step correctly identifies the correlation but underweights liquidation cascade risk.",
  "error_identified": null
}
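A minimal sketch of consuming this output, assuming the model emits one JSON object somewhere in its generated text and that all scores lie in [0, 1] (the card does not state the range explicitly). The helper name `parse_prm_output` and its validation rules are illustrative and not part of the released code.

```python
import json

# Dimension names as they appear in the example output above.
DIMENSIONS = ("factual_accuracy", "logical_validity", "completeness", "risk_awareness")


def parse_prm_output(text):
    """Extract and validate the first JSON object found in generated text."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses a single JSON value and ignores any trailing text.
    obj, _ = json.JSONDecoder().raw_decode(text[start:])
    dims = obj.get("dimensions", {})
    missing = [d for d in DIMENSIONS if d not in dims]
    if "overall_score" not in obj or missing:
        raise ValueError(f"incomplete PRM output, missing dimensions: {missing}")
    # Assumption: every score is a number in [0, 1].
    for name, value in [("overall_score", obj["overall_score"]), *dims.items()]:
        if not 0.0 <= float(value) <= 1.0:
            raise ValueError(f"{name}={value} is outside [0, 1]")
    return obj


sample = (
    'Result: {"overall_score": 0.85, "dimensions": {"factual_accuracy": 0.9, '
    '"logical_validity": 0.85, "completeness": 0.8, "risk_awareness": 0.85}, '
    '"explanation": "ok", "error_identified": null} trailing text'
)
parsed = parse_prm_output(sample)
print(parsed["overall_score"], parsed["error_identified"])
```

Rejecting malformed output up front keeps a single bad generation from silently skewing an averaged step score downstream.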
Training Data
- ~80,000 step-level evaluations across ~20,000 reasoning paths
- Source: Logic tree data with correct and incorrect reasoning branches
- Labels generated via model-as-judge pipeline (Qwen2.5-72B-Instruct)
- 3 epochs of QLoRA SFT on a fixed label set; iterative bootstrap-with-relabeling is future work
Training Configuration
| Parameter | Value |
|---|---|
| Optimizer | AdamW 8-bit |
| Learning Rate | 2e-4 |
| LR Schedule | Cosine decay |
| Warmup Ratio | 0.05 |
| Per-device Batch Size | 2 |
| Gradient Accumulation | 8 (effective batch size 16) |
| Epochs | 3 |
| Hardware | NVIDIA H100 80GB |
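As a rough check on what these settings imply, the sketch below works out the optimizer step count and the learning-rate curve. It assumes a single GPU, that all ~80,000 step-level evaluations were used for training, and a linear warmup followed by cosine decay to zero; none of these three points is stated in the card.

```python
import math

NUM_EXAMPLES = 80_000   # approximate figure from the Training Data section
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
NUM_DEVICES = 1         # assumption: one H100
EPOCHS = 3
WARMUP_RATIO = 0.05
BASE_LR = 2e-4

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(total_steps * WARMUP_RATIO)


def lr_at(step):
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
print(lr_at(0), lr_at(warmup_steps), lr_at(total_steps))
```

Under these assumptions the run is about 15,000 optimizer steps, with the peak learning rate reached after roughly 750 of them.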
Evaluation Results
Numbers below are from an independent re-evaluation on a stratified 200-example held-out split (MLX 4-bit, 2026-04-26).
| Metric | Score |
|---|---|
| Spearman Correlation (vs judge labels) | 0.9234 |
| Error Detection F1 (FLAWED) | 0.8421 |
| Step-Level Rating Accuracy | 88.5% |
| MAE on overall_score | 0.0404 |
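For readers who want to reproduce metrics of this kind on their own held-out data, here is a dependency-free sketch of Spearman correlation (with average ranks for ties) and MAE. The toy score lists are made up for illustration and are not the evaluation data.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_absolute_error(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Made-up scores: judge labels vs PRM overall_score for four steps.
judge_scores = [0.9, 0.2, 0.6, 0.75]
prm_scores = [0.7, 0.3, 0.55, 0.8]
rho = spearman(judge_scores, prm_scores)
mae = mean_absolute_error(prm_scores, judge_scores)
print(f"Spearman = {rho:.3f}, MAE = {mae:.3f}")
```

Spearman measures only ordering, so a high value alongside a low MAE says the PRM both ranks steps like the judge and lands close to the judge's absolute scores.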
Per-Dimension Spearman
| Dimension | Spearman ρ |
|---|---|
| factual_accuracy | 0.842 |
| logical_validity | 0.931 |
| completeness | 0.865 |
| risk_awareness | 0.908 |
Calibration
Despite a Spearman correlation of 0.92, the model is poorly calibrated as a probability of error: ECE = 0.21 on the held-out 200-step set. The score over-flags in the 0.1–0.4 score band and under-flags sharply around 0.5–0.6. For pipelines that only need ranking (best-of-N reranking, search-guided decoding), the raw scores are fine. For pipelines that need a calibrated P(flawed), apply Platt scaling or isotonic regression on a held-out calibration split. A calibrated checkpoint is not yet shipped.
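To check calibration on your own split, the sketch below computes a standard equal-width-bin ECE for binary outcomes. It assumes P(flawed) is taken as `1 - overall_score` and uses 10 bins; the card does not state either choice, so treat both as assumptions. The toy data is made up.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean of |observed rate - mean predicted probability| per bin."""
    buckets = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        buckets[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        observed = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(observed - mean_prob)
    return ece


# Made-up example: PRM scores and whether each step was actually flawed (1) or not (0).
overall_scores = [0.85, 0.75, 0.25, 0.15]
is_flawed = [0, 0, 1, 1]
p_flawed = [1.0 - s for s in overall_scores]  # assumption: P(flawed) = 1 - score
ece = expected_calibration_error(p_flawed, is_flawed)
print(f"ECE = {ece:.3f}")
```

Running the same function before and after fitting a calibrator on a separate split is a simple way to confirm that the calibration step helped.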
Out-of-Distribution Behavior
Evaluated on 307 gold-correct reasoning steps from GSM8K (134) and MATH-500 (173), the PRM mis-flags 5.2% as FLAWED with mean overall_score 0.856, which is better cross-domain transfer than expected for a DeFi-only training corpus. The model also extrapolates beyond its trained label set, emitting EXCELLENT (3.6%) and PERFECT (0.3%) ratings on OOD math reasoning despite never being trained on those labels. Downstream consumers should fold these into STRONG or impose a closed-set rating parser.
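One way to apply that advice is sketched below: fold the two out-of-set labels into STRONG and reject anything else that is not in the allowed set. The card names only STRONG and FLAWED, so `ALLOWED` here is a placeholder that you would extend with the remaining trained labels; `normalize_rating` is a hypothetical helper.

```python
# Out-of-set labels reported in the card, folded into STRONG as suggested.
FOLD_MAP = {"EXCELLENT": "STRONG", "PERFECT": "STRONG"}

# Placeholder: only the labels named in the card. Extend with the full trained set.
ALLOWED = {"STRONG", "FLAWED"}


def normalize_rating(raw, allowed=ALLOWED):
    """Map a raw rating string onto the closed label set, or raise."""
    label = raw.strip().upper()
    label = FOLD_MAP.get(label, label)
    if label not in allowed:
        raise ValueError(f"unexpected rating {raw!r}")
    return label


print(normalize_rating(" excellent "))
print(normalize_rating("FLAWED"))
```

Raising on unknown labels, instead of passing them through, makes any further label drift visible as soon as it appears.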
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import json

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Load PRM adapter
model = PeftModel.from_pretrained(base_model, "ramankrishna10/npc-fin-prm-7b")

# Score a reasoning step
messages = [
    {"role": "system", "content": "You are a financial reasoning verifier. Evaluate the correctness of each reasoning step. Respond with JSON containing overall_score, dimensions, explanation, and error_identified."},
    {"role": "user", "content": """Scenario: ETH price dropped 15% in 24 hours while BTC remained flat.
Previous steps:
1. Identified ETH-specific selling pressure (correct -- on-chain data shows large wallet transfers to exchanges)
Current step to evaluate:
2. Concluded that ETH will continue dropping because BTC is flat, suggesting a broader market downturn."""}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.1)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
result = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(result)
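When scoring many steps, the user message can be assembled programmatically. The sketch below reproduces the layout of the example above. The layout is inferred from that single example, so the exact template used in training (for instance, blank lines between sections) is an assumption, and `build_user_prompt` is a hypothetical helper.

```python
def build_user_prompt(scenario, previous_steps, current_step):
    """Assemble the user message: scenario, numbered prior steps, step to evaluate."""
    lines = [f"Scenario: {scenario}"]
    if previous_steps:
        lines.append("Previous steps:")
        lines += [f"{i}. {step}" for i, step in enumerate(previous_steps, start=1)]
    lines.append("Current step to evaluate:")
    # The current step continues the numbering of the previous steps.
    lines.append(f"{len(previous_steps) + 1}. {current_step}")
    return "\n".join(lines)


prompt = build_user_prompt(
    scenario="ETH price dropped 15% in 24 hours while BTC remained flat.",
    previous_steps=["Identified ETH-specific selling pressure"],
    current_step="Concluded that ETH will continue dropping because BTC is flat.",
)
print(prompt)
```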
Best-of-N Inference
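def best_of_n(generator_model, prm_model, prompt, n=8):
    # generate(), extract_reasoning_steps(), and score_step() are placeholders
    # for your own sampling, step-splitting, and PRM-scoring helpers.
    candidates = []
    for _ in range(n):
        response = generate(generator_model, prompt)
        steps = extract_reasoning_steps(response)
        if not steps:
            continue  # nothing to score; avoids division by zero
        # Score each step given the steps that precede it.
        scores = [score_step(prm_model, prompt, steps[:i+1]) for i in range(len(steps))]
        avg_score = sum(s['overall_score'] for s in scores) / len(scores)
        candidates.append((response, avg_score))
    # Return the (response, avg_score) pair with the highest mean step score.
    return max(candidates, key=lambda x: x[1])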
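The pseudocode above leaves `generate`, `extract_reasoning_steps`, and `score_step` undefined. The sketch below is a self-contained variant that takes them as callables and runs with toy stand-ins, so the selection logic can be tested without loading any model. The stand-ins are invented for illustration. The toy scorer also returns a plain float, where a real scorer would return the PRM's `overall_score`.

```python
from itertools import cycle


def best_of_n(generate, extract_steps, score_step, prompt, n=8):
    """Return (response, mean step score) for the best of n sampled responses."""
    best = None
    for _ in range(n):
        response = generate(prompt)
        steps = extract_steps(response)
        if not steps:
            continue  # skip responses with no scoreable steps
        # Score each step in the context of the steps before it.
        step_scores = [score_step(prompt, steps[: i + 1]) for i in range(len(steps))]
        mean_score = sum(step_scores) / len(step_scores)
        if best is None or mean_score > best[1]:
            best = (response, mean_score)
    return best


# Toy stand-ins (hypothetical): a canned generator and a keyword-based scorer.
_canned = cycle([
    "Check exchange inflows.\nGuess that price keeps falling.",
    "Check exchange inflows.\nCompare funding rates across venues.",
])


def toy_generate(prompt):
    return next(_canned)


def toy_extract_steps(response):
    return [line.strip() for line in response.splitlines() if line.strip()]


def toy_score_step(prompt, steps_so_far):
    return 0.2 if "guess" in steps_so_far[-1].lower() else 0.9


best_response, best_score = best_of_n(
    toy_generate, toy_extract_steps, toy_score_step, "Why did ETH drop?", n=4
)
print(best_response)
print(best_score)
```

Averaging step scores follows the pseudocode above. Taking the minimum step score instead is a common alternative when a single bad step should disqualify a candidate.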
Intended Use
- Step-level reasoning verification in financial analysis pipelines
- Best-of-N inference for improved reasoning quality
- Alignment scoring for financial reasoning models
- Research into process reward models for domain-specific applications
Limitations
- Trained on financial/crypto domain data; may not generalize to other reasoning domains
- Judge labels sourced from Qwen2.5-72B, not human annotators; ceiling limited by judge quality
- Scores are relative, not absolute; calibrate thresholds for your use case
- Score scale is uncalibrated as a probability (ECE 0.21); use Platt scaling for downstream pipelines that need P(flawed)
- Four labeled dimensions are not independent: three cluster at Spearman 0.85–0.92. logical_validity alone explains ~95% of overall_score. Future PRM designs should pick more orthogonal axes.
- Not a replacement for human review in production financial decisions
Alignment Relevance
Process reward models are a key component in AI alignment research. This model demonstrates:
- Step-level verification: catching errors in intermediate reasoning, not just final outputs
- Decomposed scoring: multi-dimensional evaluation (factual, logical, completeness, risk) rather than single scalar reward
- Domain-specific alignment: aligning reasoning quality within a specialized vertical
Paper
Cheap PRMs: Multi-Dimensional Process Reward Modeling for Domain-Specialized Reasoning (Zenodo, 2026). DOI: 10.5281/zenodo.19800784
The paper documents the full pipeline, presents independent re-evaluation numbers (Spearman 0.9234, F1 0.84 on the n=200 stratified validation split), and reports two findings that this model card also summarizes: the four labeled dimensions are not independent (three cluster at $\rho \in [0.85, 0.92]$), and the score scale is miscalibrated as a probability (ECE 0.21) despite the strong rank correlation. See the linked PDF for the full evaluation, including the OOD probe on GSM8K + MATH-500 (5.2% mis-flag rate on gold-correct math reasoning).
Related Models
- npc-fin-32b-sft: the financial reasoning model this PRM was built to verify
Citation
Bachu, R. K. (2026). Cheap PRMs: Multi-Dimensional Process Reward Modeling for Domain-Specialized Reasoning. Zenodo. https://doi.org/10.5281/zenodo.19800784
@misc{bachu2026cheapprms,
  title     = {Cheap PRMs: Multi-Dimensional Process Reward Modeling for
               Domain-Specialized Reasoning},
  author    = {Bachu, Rama Krishna},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19800784},
  url       = {https://doi.org/10.5281/zenodo.19800784},
  note      = {Preprint},
}
Author
Ramakrishna Bachu (GitHub | LinkedIn)
Part of the NPC Model Family by Bottensor.