MVM²: Multi-Modal Multi-Model Mathematical Reasoning Verification System

VNR VJIET Major Project 2025
Team: Brahma Teja, Vinith Kulkarni, Varshith Dharmaj V, Bhavitha Yaragorla

📄 Problem Statement

Validating mathematical reasoning generated by Large Language Models (LLMs) is critical but challenging, especially when inputs are multimodal (images of handwritten or printed text).

Key Challenges:

Hallucinations: LLMs often generate plausible-sounding but logically flawed steps.
OCR Noise: Extracting math from images introduces errors (e.g., confusing '5' with 'S' or dropping integral signs) that downstream verifiers accept without question.
Lack of Formal Uncertainty: Existing systems do not account for OCR confidence when making final validity judgments.

MVM² Solution: A unified pipeline that combines OCR with formal uncertainty propagation, symbolic verification (SymPy), and multi-agent LLM consensus to robustly verify mathematical solutions.

🏗️ System Architecture

The system follows a modular service-oriented architecture located in the backend/ directory:

| Service | Responsibility |
| --- | --- |
| 1. Input Receiver | (`backend/input_receiver.py`) Validates text/image inputs via Pydantic models. |
| 2. Preprocessing | (`backend/preprocessing_service.py`) Cleans images using OpenCV (denoising, binarization). |
| 3. OCR Service | (`backend/ocr_service.py`) Hybrid engine combining Tesseract and specialized handwriting models. Calculates OCR Confidence ($C_{ocr}$). |
| 4. Representation | (`backend/representation_service.py`) Normalizes inputs into a canonical LaTeX-like Intermediate Representation (IR). |
| 5. Verification | (`backend/verification_service.py`) Orchestrates SymPy for arithmetic checks and Multi-Agent LLMs (Solver, Critic, Verifier) for logic. |
| 6. Classification | (`backend/classifier_service.py`) Aggregates scores using the MVM² Hybrid Formula. |
| 7. Reporting | (`backend/reporting_service.py`) Generates detailed JSON/HTML reports for the user. |

⭐ Key Innovations

1. OCR-Aware Confidence Propagation

Unlike standard pipelines that treat OCR text as ground truth, MVM² formally propagates visual uncertainty into the final confidence score ($C_{final}$).

$C_{final} = S_{weighted} \times (0.9 + 0.1 \times C_{ocr})$

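The multiplier ranges from 0.9 (when $C_{ocr} = 0$) to 1.0 (when $C_{ocr} = 1$), so a verification result is discounted by up to 10% when the input image was ambiguous. This reduces overconfident verdicts on noisy data without letting OCR quality override the verification signals themselves.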
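As a minimal sketch of the formula above, the function below applies the OCR multiplier to a weighted score. The function name `final_confidence` and the assumption that both inputs lie in [0, 1] are illustrative and not taken from the project code.

```python
def final_confidence(s_weighted: float, c_ocr: float) -> float:
    """Compute C_final = S_weighted * (0.9 + 0.1 * C_ocr).

    Assumption: both inputs are normalized to the range [0, 1].
    """
    if not (0.0 <= s_weighted <= 1.0 and 0.0 <= c_ocr <= 1.0):
        raise ValueError("s_weighted and c_ocr must lie in [0, 1]")
    return s_weighted * (0.9 + 0.1 * c_ocr)


# Same verification score, two different image qualities.
clean = final_confidence(0.80, 1.0)  # multiplier 1.0, about 0.80
noisy = final_confidence(0.80, 0.0)  # multiplier 0.9, about 0.72
print(f"clean image: {clean:.2f}, unreadable image: {noisy:.2f}")
```

Under this formula, a completely unreadable image lowers the score by 10%, and a perfectly read image leaves it unchanged.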

2. Step-Level Multi-Agent Consensus

We deploy a Multi-Agent System (Solver, Critic, Verifier) to analyze solution steps. We compute a Hallucination Rate by checking consensus across agents for each step.

Agreement: raises confidence in the step.
Disagreement: flags the step as a potential hallucination.
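One way to picture the step-level consensus check is sketched below. It assumes each agent returns one boolean verdict per step and that the Hallucination Rate is the fraction of steps on which the agents are not unanimous. Both the data layout and that definition are assumptions made for illustration, and the function names are hypothetical.

```python
from typing import Dict, List


def flag_disagreements(verdicts: Dict[str, List[bool]]) -> List[bool]:
    """Return one flag per step: True when the agents are not unanimous."""
    per_agent = list(verdicts.values())
    if not per_agent:
        return []
    n_steps = len(per_agent[0])
    if any(len(v) != n_steps for v in per_agent):
        raise ValueError("every agent must judge the same number of steps")
    # A step is flagged when its verdicts contain both True and False.
    return [len({v[i] for v in per_agent}) > 1 for i in range(n_steps)]


def hallucination_rate(verdicts: Dict[str, List[bool]]) -> float:
    """Fraction of steps flagged as disagreements (assumed definition)."""
    flags = flag_disagreements(verdicts)
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical verdicts for a four-step solution.
example = {
    "solver":   [True, True, True, True],
    "critic":   [True, False, True, True],
    "verifier": [True, True, True, True],
}
flags = flag_disagreements(example)  # [False, True, False, False]
rate = hallucination_rate(example)   # 0.25
print(flags, rate)
```

Note that this sketch treats a unanimous rejection as agreement, so it measures disagreement only. Whether a step is actually valid is a separate question.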

3. Hybrid Scoring Mechanism

The final validity score ($S_{weighted}$) is a weighted ensemble of three distinct signals:

Symbolic Score ($\alpha=0.40$): SymPy's formal verification of arithmetic.
Logical Score ($\beta=0.35$): LLM consensus on reasoning flow.
Classifier Score ($\gamma=0.25$): Rule-based patterns (e.g., detecting uncertainty keywords).

$S_{weighted} = 0.40 \cdot S_{sym} + 0.35 \cdot S_{log} + 0.25 \cdot S_{clf}$
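The weighted ensemble above can be sketched as a short function. The weights come from the formula. The names `WEIGHTS` and `weighted_score`, and the assumption that each component score lies in [0, 1], are illustrative.

```python
# Weights from the formula: alpha (symbolic), beta (logical), gamma (classifier).
WEIGHTS = {"symbolic": 0.40, "logical": 0.35, "classifier": 0.25}


def weighted_score(s_sym: float, s_log: float, s_clf: float) -> float:
    """Compute S_weighted = 0.40*S_sym + 0.35*S_log + 0.25*S_clf.

    Assumption: each component score is normalized to [0, 1].
    """
    return (
        WEIGHTS["symbolic"] * s_sym
        + WEIGHTS["logical"] * s_log
        + WEIGHTS["classifier"] * s_clf
    )


# Worked example: 0.40*1.0 + 0.35*0.8 + 0.25*0.6 = 0.40 + 0.28 + 0.15
score = weighted_score(1.0, 0.8, 0.6)
print(f"S_weighted = {score:.2f}")  # about 0.83
```

Because the three weights sum to 1, the result stays in [0, 1] whenever the component scores do.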

🚀 Getting Started

Prerequisites

Python 3.10+
Tesseract OCR installed (Instructions)
Google Gemini API Key

Installation

Clone the repository:

```
git clone https://github.com/yourusername/mvm2.git
cd mvm2
```

Install dependencies:
```
pip install -r requirements.txt
```

Set API Key:

```
# Windows PowerShell
$env:GEMINI_API_KEY="your_api_key_here"
```

Running the System

1. Backend API (FastAPI)

```
python backend/main.py
# Server runs at http://localhost:8000
```

2. Frontend Interface

Open frontend/index.html in your web browser. No build step is required for this lightweight UI.

3. Docker Deployment

MVM² is container-ready. We provide a full docker-compose setup.

```
docker-compose up --build -d
```

Backend API will be available at http://localhost:8000
Frontend UI will be available at http://localhost:8080

🧪 Experiments & Evaluation

We provide a custom evaluation suite to reproduce our ablation studies.

1. Dataset

The evaluation uses datasets/sample_data.json. You can add your own samples here.

2. Running Ablation Modes

The run_evaluation.py script automatically compares four system configurations:

| Mode | Description | Hypothesis |
| --- | --- | --- |
| `single_llm_only` | Baseline (1 Agent) | High hallucination rate, low accuracy. |
| `llm_plus_sympy` | Hybrid (1 Agent + SymPy) | Better arithmetic, still hallucinates logic. |
| `multi_agent_no_ocr_conf` | Multi-Agent Consensus | Low hallucination, but overconfident on noisy images. |
| `full_mvm2` | Complete System | Highest reliability and calibrated confidence. |
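The four modes can be read as combinations of three switches: number of agents, SymPy checking, and OCR confidence propagation. The sketch below encodes that reading. The `AblationMode` class, its field names, and the specific flag values (in particular, that `multi_agent_no_ocr_conf` keeps SymPy enabled) are assumptions for illustration, not the actual configuration used by run_evaluation.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationMode:
    """Hypothetical switches describing one evaluation configuration."""
    n_agents: int        # 1 = single LLM, 3 = Solver + Critic + Verifier
    use_sympy: bool      # symbolic arithmetic checks
    use_ocr_conf: bool   # propagate C_ocr into the final confidence


# Assumed mapping from mode name to switches.
MODES = {
    "single_llm_only": AblationMode(1, False, False),
    "llm_plus_sympy": AblationMode(1, True, False),
    "multi_agent_no_ocr_conf": AblationMode(3, True, False),
    "full_mvm2": AblationMode(3, True, True),
}


def describe(name: str) -> str:
    """Return a one-line summary of the components a mode enables."""
    mode = MODES[name]
    parts = [f"{mode.n_agents} agent(s)"]
    if mode.use_sympy:
        parts.append("SymPy")
    if mode.use_ocr_conf:
        parts.append("OCR confidence")
    return f"{name}: " + " + ".join(parts)


for mode_name in MODES:
    print(describe(mode_name))
```

Laid out this way, each mode differs from the previous one by a single switch, which is what lets an ablation attribute a change in accuracy or hallucination rate to one component.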

Command:

```
python run_evaluation.py
```

3. Results

Outputs are saved to evaluation_results.csv containing:

Accuracy (Exact Match)
Hallucination Rate
Latency (ms)
Verdicts
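A results file of this kind can be summarized per mode with the standard library. The sketch below assumes column names `mode`, `correct` (0 or 1), `hallucination_rate`, and `latency_ms`. The actual header of evaluation_results.csv may differ, and the sample rows are made up purely to exercise the function.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def summarize(csv_file):
    """Average accuracy, hallucination rate, and latency per mode.

    Assumption: the CSV has the columns mode, correct,
    hallucination_rate, and latency_ms.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["mode"]].append(row)
    return {
        mode: {
            "accuracy": mean(int(r["correct"]) for r in rows),
            "hallucination_rate": mean(float(r["hallucination_rate"]) for r in rows),
            "latency_ms": mean(float(r["latency_ms"]) for r in rows),
        }
        for mode, rows in groups.items()
    }


# Placeholder rows for illustration only; these are not project results.
sample = io.StringIO(
    "mode,correct,hallucination_rate,latency_ms\n"
    "mode_a,1,0.0,100\n"
    "mode_a,0,0.5,300\n"
)
result = summarize(sample)
print(result)
```

To run it on real output, pass an open file instead of the in-memory sample, for example `summarize(open("evaluation_results.csv", newline=""))`, after adjusting the column names to match the file.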

📁 Project Structure

```
math_verification_mvp/
├── backend/
│   ├── config.py             # Central Configuration
│   ├── core/                 # Core Logic Services (MVM² Modules)
│   │   ├── input_receiver.py
│   │   ├── ocr_service.py
│   │   ├── verification_service.py
│   │   ├── classifier_service.py
│   │   └── ...
│   ├── tests/                # Unit Tests
│   └── main.py               # FastAPI Entry Point
├── frontend/                 # Lightweight UI
├── datasets/                 # Evaluation Data & Results
├── scripts/                  # Evaluation & Benchmark Scripts
│   ├── run_evaluation.py
│   ├── run_benchmarks.py
│   └── quick_test.py
├── docs/                     # Documentation
└── requirements.txt          # Dependencies
```