RLHF (Reinforcement Learning from Human Feedback) Features

Overview

FinRyver now includes RLHF capabilities that allow the system to learn from human feedback and improve the quality of generated financial statements over time.

Key Components

1. Enhanced Workflows

  • RLHF-enhanced versions of all financial statement generation workflows
  • Multiple candidate generation and selection using reward models
  • Quality prediction and confidence scoring

2. Feedback Collection System

  • Web-based review interface for human feedback
  • Structured feedback forms with technical and quality metrics
  • Storage and management of feedback data

3. Reward Model

  • Machine learning model that predicts statement quality
  • Trained on human feedback data
  • Automatic retraining when sufficient new feedback is available

Usage

Basic Financial Statement Generation

Standard workflow (existing functionality):

curl -X POST "http://localhost:8000/notes" \
  -F "file=@trial_balance.xlsx"

RLHF-enhanced workflow:

curl -X POST "http://localhost:8000/notes?use_rlhf=true" \
  -F "file=@trial_balance.xlsx"

The RLHF-enhanced workflow will:

  1. Generate multiple candidates (if the reward model is trained)
  2. Use the reward model to select the best candidate
  3. Provide quality predictions and confidence scores
  4. Store the result for potential human feedback

Response Headers

When using RLHF workflows, additional metadata is included in response headers:

  • X-RLHF-Statement-ID: Unique ID for the generated statement
  • X-RLHF-Quality-Score: Predicted quality score (1-5)
  • X-RLHF-Confidence: Model confidence in the prediction
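
To inspect these headers from the command line, curl's -D - flag dumps response headers to stdout while -o writes the statement body to a file (the output filename here is just a placeholder):

curl -s -D - -o statement_output \
  -X POST "http://localhost:8000/notes?use_rlhf=true" \
  -F "file=@trial_balance.xlsx" | grep -i "x-rlhf"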

Feedback Collection

1. Get Statements Needing Review

curl "http://localhost:8000/rlhf/pending-reviews"

2. Review Interface

Visit: http://localhost:8000/rlhf/review/{statement_id}

This provides an HTML form for structured feedback collection.
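
For convenience, the form can also be opened straight from a terminal (xdg-open on Linux; use open on macOS), reusing the example statement ID from the next step:

xdg-open "http://localhost:8000/rlhf/review/123e4567-e89b-12d3-a456-426614174000"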

3. Submit Feedback Programmatically

curl -X POST "http://localhost:8000/rlhf/feedback" \
  -F "statement_id=123e4567-e89b-12d3-a456-426614174000" \
  -F "calculation_accuracy=4" \
  -F "account_classification=5" \
  -F "statement_balance=4" \
  -F "accounting_standards=4" \
  -F "regulatory_compliance=5" \
  -F "completeness=3" \
  -F "professional_presentation=4" \
  -F "would_accept_for_audit=true" \
  -F "specific_errors=Minor formatting issues" \
  -F "improvement_suggestions=Add more detailed notes"

Monitoring and Statistics

Get Feedback Statistics

curl "http://localhost:8000/rlhf/stats"

Returns:

  • Total feedback collected
  • Average quality scores
  • Audit approval rates
  • Model training status
  • Feature importance
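
To pull a single value out of the stats response, jq works well; the total_feedback key below is an assumption, so inspect the raw JSON for the actual keys first:

# "total_feedback" is a hypothetical key; run without the jq filter to see the real ones
curl -s "http://localhost:8000/rlhf/stats" | jq '.total_feedback'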

Get Model Information

curl "http://localhost:8000/rlhf/model-info"

Manual Model Retraining

curl -X POST "http://localhost:8000/rlhf/retrain"

Feedback Metrics

Technical Accuracy (1-5 scale)

  • Calculation Accuracy: Mathematical correctness
  • Account Classification: Proper categorization of accounts
  • Statement Balance: Internal consistency and reconciliation

Compliance (1-5 scale)

  • Accounting Standards: GAAP/IFRS compliance
  • Regulatory Compliance: Meeting regulatory requirements

Quality (1-5 scale)

  • Completeness: All necessary items included
  • Professional Presentation: Formatting and language quality

Qualitative Feedback

  • Specific Errors: Detailed error descriptions
  • Missing Items: Items that should be included
  • Improvement Suggestions: Recommendations for enhancement
  • Audit Acceptance: Binary approval for professional use

Training Process

  1. Initial Phase: System operates with default models
  2. Feedback Collection: Human experts review generated statements
  3. Model Training: When 20+ feedback samples are available, the reward model is trained
  4. Enhanced Generation: RLHF workflows use the trained model for better results
  5. Continuous Learning: The model retrains automatically as new feedback arrives (see the sketch below)
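
A minimal sketch of automating step 5 from a cron job, assuming the /rlhf/stats response exposes a total_feedback count (an assumed field name):

#!/usr/bin/env bash
# Retrain only once the 20-sample threshold is crossed
# ("total_feedback" is an assumed field in the /rlhf/stats response)
COUNT=$(curl -s "http://localhost:8000/rlhf/stats" | jq -r '.total_feedback // 0')
if [ "$COUNT" -ge 20 ]; then
  curl -X POST "http://localhost:8000/rlhf/retrain"
fi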

Benefits

  • Quality Improvement: Statements become more accurate over time
  • Domain Adaptation: System learns specific requirements and preferences
  • Consistency: Reduces variability in output quality
  • Professional Standards: Aligns with human expert expectations

Implementation Notes

  • RLHF features are optional and backward-compatible
  • Existing workflows continue to work unchanged
  • Feedback data is stored locally and can be exported for analysis
  • Models can be backed up and restored
  • Multiple reward models can be maintained for different statement types

File Structure

data/
├── feedback/
│   ├── human_feedback.json         # Collected feedback data
│   └── generated_statements.json   # Statement metadata
└── models/
    ├── reward_model.pkl            # Trained reward model
    ├── feature_names.json          # Model feature definitions
    └── model_stats.json            # Training statistics
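
Because models can be backed up and restored (see Implementation Notes), a plain archive of the models directory is enough; the archive name is arbitrary:

# Back up trained model artifacts
tar -czf rlhf_models_backup.tar.gz data/models/

# Restore them later
tar -xzf rlhf_models_backup.tar.gz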

Security and Privacy

  • Feedback data is stored locally
  • No external transmission of financial data
  • Anonymous feedback collection is supported
  • Data can be cleaned/anonymized before training
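
As an illustration of the last point, identifying fields could be stripped from the stored feedback before training. The reviewer_id field below is hypothetical; adjust it to whatever the feedback records actually contain:

# Assumes human_feedback.json is a JSON array of records;
# "reviewer_id" is a hypothetical identifying field
jq 'map(del(.reviewer_id))' data/feedback/human_feedback.json > anonymized_feedback.json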