MVM²: Multi-Modal Multi-Model Mathematical Reasoning Verification System
VNR VJIET Major Project 2025
Team: Brahma Teja, Vinith Kulkarni, Varshith Dharmaj V, Bhavitha Yaragorla
Problem Statement
Validating mathematical reasoning generated by Large Language Models (LLMs) is critical but challenging, especially when inputs are multimodal (images of handwritten or printed text).
Key Challenges:
- Hallucinations: LLMs often generate plausible-sounding but logically flawed steps.
- OCR Noise: Extracting math from images introduces errors (e.g., confusing '5' with 'S' or missing integrals) that downstream verifiers blindly accept.
- Lack of Formal Uncertainty: Existing systems do not account for OCR confidence when making final validity judgments.
MVM² Solution: A unified pipeline that combines OCR with formal uncertainty propagation, symbolic verification (SymPy), and multi-agent LLM consensus to robustly verify mathematical solutions.
System Architecture
The system follows a modular service-oriented architecture located in the backend/ directory:
| Service | Responsibility |
|---|---|
| 1. Input Receiver | (backend/input_receiver.py) Validates text/image inputs via Pydantic models. |
| 2. Preprocessing | (backend/preprocessing_service.py) Cleans images using OpenCV (denoising, binarization). |
| 3. OCR Service | (backend/ocr_service.py) Hybrid engine combining Tesseract with specialized handwriting-recognition models. Calculates OCR Confidence ($C_{ocr}$). |
| 4. Representation | (backend/representation_service.py) Normalizes inputs into a canonical LaTeX-like Intermediate Representation (IR). |
| 5. Verification | (backend/verification_service.py) Orchestrates SymPy for arithmetic checks and Multi-Agent LLMs (Solver, Critic, Verifier) for logic. |
| 6. Classification | (backend/classifier_service.py) Aggregates scores using the MVM² Hybrid Formula. |
| 7. Reporting | (backend/reporting_service.py) Generates detailed JSON/HTML reports for the user. |
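The arithmetic half of the Verification stage can be pictured as checking each `lhs = rhs` step on its own. The sketch below is a stdlib-only stand-in: it uses `ast` and `fractions` rather than SymPy, the names `evaluate` and `check_step` are hypothetical, and it only understands numeric expressions built from `+`, `-`, `*`, `/` and parentheses.

```python
import ast
import operator
from fractions import Fraction

# Binary operators this sketch accepts; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr: str) -> Fraction:
    """Evaluate a numeric expression exactly, without calling eval()."""

    def walk(node: ast.AST) -> Fraction:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            # Going through str() keeps decimals such as 0.5 exact (1/2).
            return Fraction(str(node.value))
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr.strip(), mode="eval"))


def check_step(step: str) -> bool:
    """Return True when both sides of 'lhs = rhs' have the same exact value."""
    lhs, rhs = step.split("=")
    return evaluate(lhs) == evaluate(rhs)
```

Exact rational arithmetic is used so that a rounded right-hand side (for example `1/3 + 1/3 = 0.66`) is reported as a mismatch instead of passing through a floating-point tolerance.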
Key Innovations
1. OCR-Aware Confidence Propagation
Unlike standard pipelines that treat OCR text as ground truth, MVM² formally propagates visual uncertainty into the final confidence score ($C_{final}$).
This ensures that a verification result is heavily penalized if the input image was ambiguous, preventing false positives on noisy data.
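The exact propagation rule is not spelled out here. One simple way to get the penalizing behavior described above, shown purely as an illustrative assumption and not as the project's actual formula, is to scale the verification score by $C_{ocr}$. The function name `propagate_confidence` is hypothetical.

```python
def propagate_confidence(verification_score: float, ocr_confidence: float) -> float:
    """Scale a verification score by OCR confidence (assumed multiplicative rule)."""
    for name, value in (
        ("verification_score", verification_score),
        ("ocr_confidence", ocr_confidence),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    # A clean input (confidence 1.0) leaves the score untouched;
    # an ambiguous image pulls the final confidence toward zero.
    return verification_score * ocr_confidence
```

Under this assumption, a step verified with score 0.9 on an image read with confidence 0.5 ends up with a final confidence of 0.45, while the same score on typed text (confidence 1.0) stays at 0.9.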
2. Step-Level Multi-Agent Consensus
We deploy a Multi-Agent System (Solver, Critic, Verifier) to analyze solution steps. We compute a Hallucination Rate by checking consensus across agents for each step.
- Agreement: increases confidence in the step.
- Disagreement: flags the step as a potential hallucination.
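As a rough sketch of how a step-level Hallucination Rate could be computed, the snippet below assumes each agent returns a boolean verdict per step and that any non-unanimous step counts as flagged. Both the verdict format and the name `hallucination_rate` are assumptions made for illustration.

```python
from typing import Sequence


def hallucination_rate(step_verdicts: Sequence[Sequence[bool]]) -> float:
    """Fraction of steps on which the agents do not all agree.

    step_verdicts[i] holds one boolean verdict per agent for step i.
    """
    if not step_verdicts:
        return 0.0
    # A step is flagged when its verdicts contain both True and False.
    flagged = sum(1 for verdicts in step_verdicts if len(set(verdicts)) > 1)
    return flagged / len(step_verdicts)
```

With four steps where the Solver, Critic, and Verifier disagree on two of them, the rate under this definition is 0.5.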
3. Hybrid Scoring Mechanism
The final validity score ($S_{weighted}$) is a weighted ensemble of three distinct signals:
- Symbolic Score ($\alpha=0.40$): SymPy's formal verification of arithmetic.
- Logical Score ($\beta=0.35$): LLM consensus on reasoning flow.
- Classifier Score ($\gamma=0.25$): Rule-based patterns (e.g., detecting uncertainty keywords).
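Reading "weighted ensemble" as a linear combination, which is an assumption since the formula itself is not written out here, the score would be $S_{weighted} = \alpha S_{sym} + \beta S_{logic} + \gamma S_{clf}$. The symbol names $S_{sym}$, $S_{logic}$, $S_{clf}$ and the function name below are hypothetical; the weights are the ones listed above.

```python
ALPHA = 0.40  # weight of the symbolic (SymPy) score
BETA = 0.35   # weight of the logical (LLM consensus) score
GAMMA = 0.25  # weight of the rule-based classifier score


def weighted_score(symbolic: float, logical: float, classifier: float) -> float:
    """Combine the three signals, each expected in [0, 1], into one score."""
    return ALPHA * symbolic + BETA * logical + GAMMA * classifier
```

Because the three weights sum to 1.0, the combined score stays in [0, 1] whenever each input does. For instance, inputs of 1.0, 0.8, and 0.6 give 0.40 + 0.28 + 0.15 = 0.83.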
Getting Started
Prerequisites
- Python 3.10+
- Tesseract OCR installed (Instructions)
- Google Gemini API Key
Installation
- Clone the repository:
git clone https://github.com/yourusername/mvm2.git
cd mvm2
- Install dependencies:
pip install -r requirements.txt
- Set API Key:
# Windows PowerShell
$env:GEMINI_API_KEY="your_api_key_here"
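Since the key is supplied through an environment variable, a quick check before starting the backend can catch a missing key early. This is a minimal sketch that assumes the key is read from `GEMINI_API_KEY` in the process environment; the helper name `require_api_key` is hypothetical and not part of the project.

```python
import os


def require_api_key(name: str = "GEMINI_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; set it before starting the backend")
    return key
```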
Running the System
1. Backend API (FastAPI)
python backend/main.py
# Server runs at http://localhost:8000
2. Frontend Interface
Open frontend/index.html in your web browser.
(No build step required for this lightweight UI)
3. Docker Deployment
MVM² is container-ready. We provide a full docker-compose setup.
docker-compose up --build -d
- Backend API will be available at http://localhost:8000
- Frontend UI will be available at http://localhost:8080
Experiments & Evaluation
We provide a custom evaluation suite to reproduce our ablation studies.
1. Dataset
The evaluation uses datasets/sample_data.json. You can add your own samples here.
2. Running Ablation Modes
The run_evaluation.py script automatically compares 4 system configurations:
| Mode | Description | Hypothesis |
|---|---|---|
| single_llm_only | Baseline (1 Agent) | High hallucination rate, low accuracy. |
| llm_plus_sympy | Hybrid (1 Agent + SymPy) | Better arithmetic, still hallucinates logic. |
| multi_agent_no_ocr_conf | Multi-Agent Consensus | Low hallucination, but overconfident on noisy images. |
| full_mvm2 | Complete System | Highest reliability and calibrated confidence. |
Command:
python run_evaluation.py
3. Results
Outputs are saved to evaluation_results.csv containing:
- Accuracy (Exact Match)
- Hallucination Rate
- Latency (ms)
- Verdicts
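To compare the ablation modes after a run, the results file can be aggregated per mode. The sketch below assumes the CSV has a `mode` column and a `correct` column holding `true`/`false`. These column names are hypothetical, so adjust them to the actual header of evaluation_results.csv.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable


def accuracy_by_mode(lines: Iterable[str]) -> Dict[str, float]:
    """Compute exact-match accuracy per mode from CSV lines (header included)."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for row in csv.DictReader(lines):
        mode = row["mode"]          # assumed column name
        totals[mode] += 1
        # Assumed column name; any value other than "true" counts as a miss.
        hits[mode] += row["correct"].strip().lower() == "true"
    return {mode: hits[mode] / totals[mode] for mode in totals}
```

An open file object can be passed directly, for example `accuracy_by_mode(open("evaluation_results.csv", newline=""))`, since `csv.DictReader` accepts any iterable of lines.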
Project Structure
math_verification_mvp/
├── backend/
│   ├── config.py                # Central Configuration
│   ├── core/                    # Core Logic Services (MVM² Modules)
│   │   ├── input_receiver.py
│   │   ├── ocr_service.py
│   │   ├── verification_service.py
│   │   ├── classifier_service.py
│   │   └── ...
│   ├── tests/                   # Unit Tests
│   └── main.py                  # FastAPI Entry Point
├── frontend/                    # Lightweight UI
├── datasets/                    # Evaluation Data & Results
├── scripts/                     # Evaluation & Benchmark Scripts
│   ├── run_evaluation.py
│   ├── run_benchmarks.py
│   └── quick_test.py
├── docs/                        # Documentation
└── requirements.txt             # Dependencies
Getting Started
Prerequisites
- Python 3.10+
- Tesseract OCR installed (Instructions)
- Google Gemini API Key
Installation
- Clone the repository:
git clone https://github.com/yourusername/mvm2.git
cd math_verification_mvp
- Install dependencies:
pip install -r requirements.txt
- Set API Key:
# Windows PowerShell
$env:GEMINI_API_KEY="your_api_key_here"
Running the System
1. Backend API (FastAPI)
python backend/main.py
# Server runs at http://localhost:8000
2. Frontend Interface
Open frontend/index.html in your web browser.
3. Running Experiments
# Run full evaluation suite
python scripts/run_evaluation.py