# NPC Fin-PRM 7B: Financial Process Reward Model
A 7B process reward model trained to verify the correctness of step-level reasoning in financial analysis. Scores individual reasoning steps for factual accuracy, logical validity, completeness, and risk awareness.
Built for use in Best-of-N inference pipelines and reasoning alignment workflows.
## Model Details
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Method | QLoRA (4-bit NF4 quantization) |
| LoRA Rank | 32 |
| LoRA Alpha | 64 |
| LoRA Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training Framework | Unsloth + trl SFTTrainer |
| Max Sequence Length | 2,048 tokens |
| Training Iterations | 3 (bootstrapped improvement) |
## How It Works
The PRM evaluates individual reasoning steps within a multi-step financial analysis chain. For each step, it produces:

```json
{
  "overall_score": 0.85,
  "dimensions": {
    "factual_accuracy": 0.90,
    "logical_validity": 0.85,
    "completeness": 0.80,
    "risk_awareness": 0.85
  },
  "explanation": "Step correctly identifies the correlation but underweights liquidation cascade risk.",
  "error_identified": null
}
```
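Because the model emits this JSON as free text, downstream code typically parses and range-checks it before use. A minimal sketch (the field names follow the schema above; `parse_prm_output` is an illustrative helper, not part of any released package):

```python
import json

# Dimensions the PRM scores, per the schema above.
DIMENSIONS = ("factual_accuracy", "logical_validity", "completeness", "risk_awareness")

def parse_prm_output(text):
    """Parse a PRM JSON response and sanity-check score ranges."""
    data = json.loads(text)
    if not 0.0 <= data["overall_score"] <= 1.0:
        raise ValueError("overall_score out of range")
    for dim in DIMENSIONS:
        if not 0.0 <= data["dimensions"][dim] <= 1.0:
            raise ValueError(f"{dim} out of range")
    return data

raw = (
    '{"overall_score": 0.85, "dimensions": {"factual_accuracy": 0.90, '
    '"logical_validity": 0.85, "completeness": 0.80, "risk_awareness": 0.85}, '
    '"explanation": "...", "error_identified": null}'
)
result = parse_prm_output(raw)
print(result["overall_score"])  # 0.85
```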
## Training Data
- ~80,000 step-level evaluations across ~20,000 reasoning paths
- Source: Logic tree data with correct and incorrect reasoning branches
- Labels generated via model-as-judge pipeline (Qwen2.5-72B-Instruct)
- 3 bootstrapped iterations: train, evaluate, re-label worst 500 examples, retrain
- Each iteration improved precision on error detection
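The bootstrapped loop above can be sketched as follows. The `train`, `evaluate`, and `relabel` stages are placeholders for the actual SFT run, judge comparison, and judge re-labeling; the toy stubs at the bottom exist only to make the skeleton runnable:

```python
def bootstrap(dataset, train, evaluate, relabel, n_iterations=3, n_worst=500):
    """Iteratively retrain: after each round, re-label the worst-scoring
    examples with the judge model and fold them back into the dataset."""
    model = None
    for _ in range(n_iterations):
        model = train(dataset)                     # fine-tune PRM on current labels
        errors = evaluate(model, dataset)          # per-example disagreement vs judge
        worst = sorted(errors, key=errors.get, reverse=True)[:n_worst]
        for example_id in worst:
            dataset[example_id] = relabel(example_id)  # fresh judge label
    return model

# Toy run with stub stages (the real pipeline uses the 72B judge and SFTTrainer).
data = {i: 0 for i in range(10)}
model = bootstrap(
    data,
    train=lambda d: dict(d),                  # "model" is just a label snapshot here
    evaluate=lambda m, d: {i: i for i in d},  # pretend higher id = worse example
    relabel=lambda i: 1,                      # judge assigns a corrected label
    n_worst=2,
)
```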
## Training Configuration
| Parameter | Value |
|---|---|
| Optimizer | AdamW 8-bit |
| Learning Rate | 2e-4 |
| LR Schedule | Cosine decay |
| Warmup Ratio | 0.05 |
| Batch Size | 4 |
| Gradient Accumulation | 4 (effective batch size 16) |
| Epochs | 3 per iteration |
| Hardware | NVIDIA H100 80GB |
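With the hyperparameters above, the quantization and adapter setup might look like the following. This is a sketch using `transformers` and `peft` directly; the actual training run went through Unsloth, whose invocation may differ:

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base model (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

# LoRA adapter matching the table above: rank 32, alpha 64, dropout 0.05.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```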
## Evaluation Results
| Metric | Score |
|---|---|
| Spearman Correlation (vs judge labels) | 0.94 |
| Error Detection F1 | 0.87 |
| Step-Level Accuracy | 89.2% |
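For reference, the Spearman metric above is just the Pearson correlation of rank-transformed scores. A pure-Python version, with made-up example scores (in practice one would use `scipy.stats.spearmanr`):

```python
def ranks(xs):
    """Rank values 1..n, averaging ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average rank for a tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

prm_scores   = [0.2, 0.9, 0.5, 0.7, 0.1]   # hypothetical PRM outputs
judge_scores = [0.3, 0.8, 0.6, 0.9, 0.2]   # hypothetical judge labels
print(round(spearman(prm_scores, judge_scores), 2))  # 0.9
```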
### Per-Dimension Accuracy
| Dimension | Accuracy |
|---|---|
| Factual Accuracy | 91.3% |
| Logical Validity | 88.7% |
| Completeness | 86.4% |
| Risk Awareness | 90.1% |
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Load PRM adapter
model = PeftModel.from_pretrained(base_model, "ramankrishna10/npc-fin-prm-7b")

# Score a reasoning step
messages = [
    {"role": "system", "content": "You are a financial reasoning verifier. Evaluate the correctness of each reasoning step. Respond with JSON containing overall_score, dimensions, explanation, and error_identified."},
    {"role": "user", "content": """Scenario: ETH price dropped 15% in 24 hours while BTC remained flat.
Previous steps:
1. Identified ETH-specific selling pressure (correct -- on-chain data shows large wallet transfers to exchanges)
Current step to evaluate:
2. Concluded that ETH will continue dropping because BTC is flat, suggesting a broader market downturn."""},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.1, do_sample=True)

# Decode only the newly generated tokens (skip the echoed prompt)
result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(result)
```
## Best-of-N Inference
```python
def best_of_n(generator_model, prm_model, prompt, n=8):
    """Generate n candidate responses and return the one with the highest
    average step-level PRM score.

    `generate`, `extract_reasoning_steps`, and `score_step` are placeholders
    for your own generation, step-splitting, and PRM-scoring helpers.
    """
    candidates = []
    for _ in range(n):
        response = generate(generator_model, prompt)
        steps = extract_reasoning_steps(response)
        # Score each step given the prompt and all steps up to that point
        scores = [score_step(prm_model, prompt, steps[:i + 1]) for i in range(len(steps))]
        avg_score = sum(s["overall_score"] for s in scores) / len(scores)
        candidates.append((response, avg_score))
    return max(candidates, key=lambda x: x[1])
```
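One design choice worth noting: averaging step scores lets a single badly wrong step hide behind several good ones. A min-aggregation variant is stricter, ranking each chain by its weakest step. A sketch (the `overall_score` field follows the output schema earlier in this card):

```python
def chain_score(step_scores, mode="mean"):
    """Aggregate per-step PRM scores into one chain-level score.

    mode="mean": average quality; mode="min": weakest-link quality.
    """
    values = [s["overall_score"] for s in step_scores]
    if mode == "min":
        return min(values)
    return sum(values) / len(values)

steps = [{"overall_score": s} for s in (0.9, 0.85, 0.2)]
print(chain_score(steps))          # mean partially masks the bad step
print(chain_score(steps, "min"))   # min exposes it: 0.2
```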
## Intended Use
- Step-level reasoning verification in financial analysis pipelines
- Best-of-N inference for improved reasoning quality
- Alignment scoring for financial reasoning models
- Research into process reward models for domain-specific applications
## Limitations
- Trained on the financial/crypto domain – may not generalize to other reasoning domains
- Judge labels sourced from Qwen2.5-72B, not human annotators – ceiling limited by judge quality
- Scores are relative, not absolute – calibrate thresholds for your use case
- Not a replacement for human review in production financial decisions
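For the calibration point above, a simple approach is to sweep decision thresholds on a labeled validation set and keep the one that maximizes F1. A sketch with made-up validation scores and labels:

```python
def best_threshold(scores, is_error, candidates=None):
    """Pick the score threshold (flag a step as erroneous when score < t)
    that maximizes F1 against binary error labels."""
    if candidates is None:
        candidates = sorted(set(scores))
    best_t, best_f1 = None, -1.0
    for t in candidates:
        pred = [s < t for s in scores]
        tp = sum(p and e for p, e in zip(pred, is_error))
        fp = sum(p and not e for p, e in zip(pred, is_error))
        fn = sum((not p) and e for p, e in zip(pred, is_error))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

scores   = [0.95, 0.80, 0.40, 0.30, 0.70, 0.20]   # hypothetical PRM scores
is_error = [False, False, True, True, False, True]  # hypothetical labels
t, f1 = best_threshold(scores, is_error)
```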
## Alignment Relevance
Process reward models are a key component in AI alignment research. This model demonstrates:
- Step-level verification – catching errors in intermediate reasoning, not just final outputs
- Bootstrapped improvement – iterative training where the model's own failures inform the next round
- Decomposed scoring – multi-dimensional evaluation (factual, logical, completeness, risk) rather than a single scalar reward
- Domain-specific alignment – aligning reasoning quality within a specialized vertical
## Related Models
- npc-fin-32b-sft – the financial reasoning model this PRM was built to verify
## Citation

```bibtex
@misc{bachu2026npcfinprm,
  title={NPC Fin-PRM 7B: Process Reward Model for Financial Reasoning Verification},
  author={Ramakrishna Bachu},
  year={2026},
  url={https://huggingface.co/ramankrishna10/npc-fin-prm-7b}
}
```
## Author

Ramakrishna Bachu – GitHub | LinkedIn
Part of the NPC Model Family by Bottensor.