NPC Fin-PRM 7B — Financial Process Reward Model

A 7B process reward model trained to verify the correctness of step-level reasoning in financial analysis. Scores individual reasoning steps for factual accuracy, logical validity, completeness, and risk awareness.

Built for use in Best-of-N inference pipelines and reasoning alignment workflows.

Model Details

Parameter            Value
-------------------  ------------------------------------------------------------
Base Model           Qwen/Qwen2.5-7B-Instruct
Method               QLoRA (4-bit NF4 quantization)
LoRA Rank            32
LoRA Alpha           64
LoRA Dropout         0.05
Target Modules       q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Training Framework   Unsloth + trl SFTTrainer
Max Sequence Length  2,048 tokens
Training Iterations  3 (bootstrapped improvement)

How It Works

The PRM evaluates individual reasoning steps within a multi-step financial analysis chain. For each step, it produces:

{
  "overall_score": 0.85,
  "dimensions": {
    "factual_accuracy": 0.90,
    "logical_validity": 0.85,
    "completeness": 0.80,
    "risk_awareness": 0.85
  },
  "explanation": "Step correctly identifies the correlation but underweights liquidation cascade risk.",
  "error_identified": null
}
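The JSON reply above can be consumed programmatically. A minimal sketch, assuming the reply parses cleanly; the `step_passes` helper and the 0.7 threshold are illustrative, not part of the model (see the calibration note in Limitations):

```python
import json

# Illustrative helper (not part of this repo): parse a PRM reply and decide
# whether the step passes. The 0.7 threshold is a placeholder; calibrate it
# on your own data.
def step_passes(prm_reply: str, threshold: float = 0.7) -> bool:
    result = json.loads(prm_reply)
    # An explicitly identified error fails the step regardless of score.
    if result.get("error_identified"):
        return False
    return result["overall_score"] >= threshold

reply = '{"overall_score": 0.85, "error_identified": null}'
print(step_passes(reply))       # True
print(step_passes(reply, 0.9))  # False
```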

Training Data

  • ~80,000 step-level evaluations across ~20,000 reasoning paths
  • Source: Logic tree data with correct and incorrect reasoning branches
  • Labels generated via model-as-judge pipeline (Qwen2.5-72B-Instruct)
  • 3 bootstrapped iterations: train, evaluate, re-label worst 500 examples, retrain
  • Each iteration improved precision on error detection
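The iteration scheme above can be sketched as a loop. The pipeline stages `train`, `evaluate`, and `relabel` are hypothetical callables standing in for the actual (unpublished) training code:

```python
# Sketch of the bootstrapped improvement loop described above. train, evaluate,
# and relabel are hypothetical stand-ins passed in as callables.
def bootstrap(dataset, train, evaluate, relabel, iterations=3, relabel_count=500):
    model = None
    for _ in range(iterations):
        model = train(dataset)             # fine-tune the PRM on current labels
        ranked = evaluate(model, dataset)  # examples ordered worst-agreement first
        worst = ranked[:relabel_count]     # the weakest examples (500 in the card)
        dataset = relabel(dataset, worst)  # re-judge them with the 72B judge
    return model
```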

Training Configuration

Parameter              Value
---------------------  ----------------------------
Optimizer              AdamW 8-bit
Learning Rate          2e-4
LR Schedule            Cosine decay
Warmup Ratio           0.05
Batch Size             4
Gradient Accumulation  4 (effective batch size 16)
Epochs                 3 per iteration
Hardware               NVIDIA H100 80GB
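As a rough illustration, the table maps onto a transformers TrainingArguments configuration like the following. This is a sketch under assumptions: `output_dir` and the exact `optim` string are guesses, not published values:

```python
from transformers import TrainingArguments

# Hypothetical configuration mirroring the table above; only the values the
# table states are grounded, everything else is a placeholder.
args = TrainingArguments(
    output_dir="npc-fin-prm-7b",       # placeholder path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # 4 x 4 = effective batch size 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    num_train_epochs=3,                # per bootstrap iteration
    optim="adamw_bnb_8bit",            # assumed spelling of "AdamW 8-bit"
)
```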

Evaluation Results

Metric                                  Score
--------------------------------------  -----
Spearman Correlation (vs judge labels)  0.94
Error Detection F1                      0.87
Step-Level Accuracy                     89.2%

Per-Dimension Accuracy

Dimension         Accuracy
----------------  --------
Factual Accuracy  91.3%
Logical Validity  88.7%
Completeness      86.4%
Risk Awareness    90.1%

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import json

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Load PRM adapter
model = PeftModel.from_pretrained(base_model, "ramankrishna10/npc-fin-prm-7b")

# Score a reasoning step
messages = [
    {"role": "system", "content": "You are a financial reasoning verifier. Evaluate the correctness of each reasoning step. Respond with JSON containing overall_score, dimensions, explanation, and error_identified."},
    {"role": "user", "content": """Scenario: ETH price dropped 15% in 24 hours while BTC remained flat.

Previous steps:
1. Identified ETH-specific selling pressure (correct -- on-chain data shows large wallet transfers to exchanges)

Current step to evaluate:
2. Concluded that ETH will continue dropping because BTC is flat, suggesting a broader market downturn."""}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# Sampling parameters such as temperature require do_sample=True
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.1)
# Decode only the newly generated tokens, not the echoed prompt
result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(result)

Best-of-N Inference

# Note: generate(), extract_reasoning_steps(), and score_step() are
# pipeline-specific helpers; they are not provided by this repository.
def best_of_n(generator_model, prm_model, prompt, n=8):
    candidates = []
    for _ in range(n):
        response = generate(generator_model, prompt)
        steps = extract_reasoning_steps(response)
        if not steps:
            continue  # skip candidates with no parseable reasoning steps
        # Score each step with all preceding steps as context
        scores = [score_step(prm_model, prompt, steps[:i + 1]) for i in range(len(steps))]
        avg_score = sum(s["overall_score"] for s in scores) / len(scores)
        candidates.append((response, avg_score))
    return max(candidates, key=lambda x: x[1]) if candidates else (None, 0.0)
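The `score_step` call in the loop above is pipeline-specific. One way to implement it, assuming a `generate_fn` that wraps the tokenize/generate/decode sequence from the Usage section; all names here are illustrative:

```python
import json

# Hypothetical helpers (not part of this repo): build the PRM prompt for one
# step and parse the model's JSON reply. The system prompt mirrors the Usage
# example above.
SYSTEM = ("You are a financial reasoning verifier. Evaluate the correctness of "
          "each reasoning step. Respond with JSON containing overall_score, "
          "dimensions, explanation, and error_identified.")

def build_messages(scenario, previous_steps, current_step):
    prev = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(previous_steps))
    user = (f"Scenario: {scenario}\n\nPrevious steps:\n{prev}\n\n"
            f"Current step to evaluate:\n{len(previous_steps) + 1}. {current_step}")
    return [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": user}]

def score_step(generate_fn, scenario, steps):
    # The last step is the one under evaluation; earlier steps are context.
    *previous, current = steps
    reply = generate_fn(build_messages(scenario, previous, current))
    return json.loads(reply)
```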

Intended Use

  • Step-level reasoning verification in financial analysis pipelines
  • Best-of-N inference for improved reasoning quality
  • Alignment scoring for financial reasoning models
  • Research into process reward models for domain-specific applications

Limitations

  • Trained on financial/crypto domain — may not generalize to other reasoning domains
  • Judge labels sourced from Qwen2.5-72B, not human annotators — ceiling limited by judge quality
  • Scores are relative, not absolute — calibrate thresholds for your use case
  • Not a replacement for human review in production financial decisions
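The threshold-calibration point above can be made concrete: sweep candidate cut-offs on a small labeled dev set and keep the one with the best error-detection F1. A minimal sketch; the data and candidate grid are illustrative:

```python
# Illustrative calibration helper (not part of this repo). Low PRM scores
# should predict errors, so a step is flagged as erroneous when score < t.
def calibrate_threshold(scores, is_error, candidates=None):
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05 .. 0.95
    best_t, best_f1 = None, -1.0
    for t in candidates:
        tp = sum(1 for s, e in zip(scores, is_error) if s < t and e)
        fp = sum(1 for s, e in zip(scores, is_error) if s < t and not e)
        fn = sum(1 for s, e in zip(scores, is_error) if s >= t and e)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```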

Alignment Relevance

Process reward models are a key component in AI alignment research. This model demonstrates:

  • Step-level verification — catching errors in intermediate reasoning, not just final outputs
  • Bootstrapped improvement — iterative training where the model's own failures inform the next round
  • Decomposed scoring — multi-dimensional evaluation (factual, logical, completeness, risk) rather than a single scalar reward
  • Domain-specific alignment — aligning reasoning quality within a specialized vertical

Related Models

  • npc-fin-32b-sft — the financial reasoning model this PRM was built to verify

Citation

@misc{bachu2026npcfinprm,
  title={NPC Fin-PRM 7B: Process Reward Model for Financial Reasoning Verification},
  author={Ramakrishna Bachu},
  year={2026},
  url={https://huggingface.co/ramankrishna10/npc-fin-prm-7b}
}

Author

Ramakrishna Bachu — GitHub | LinkedIn

Part of the NPC Model Family by Bottensor.
