	# MVM²: Multi-Modal Multi-Model Mathematical Reasoning Verification System

	VNR VJIET Major Project 2025
	Team: Brahma Teja, Vinith Kulkarni, Varshith Dharmaj V, Bhavitha Yaragorla

	![Status](https://img.shields.io/badge/status-production--ready-green)
	![Version](https://img.shields.io/badge/version-2.0.0-blue)
	![Python](https://img.shields.io/badge/python-3.10+-blue)
	![Docker](https://img.shields.io/badge/docker-enabled-blue)

	---

	## 📄 Problem Statement

	Validating mathematical reasoning generated by Large Language Models (LLMs) is critical but challenging, especially when inputs are multimodal (images of handwritten or printed text).

	Key Challenges:
	1. Hallucinations: LLMs often generate plausible-sounding but logically flawed steps.
	2. OCR Noise: Extracting math from images introduces errors (e.g., confusing '5' with 'S' or missing integrals) that downstream verifiers blindly accept.
	3. Lack of Formal Uncertainty: Existing systems do not account for OCR confidence when making final validity judgments.

	MVM² Solution: A unified pipeline that combines OCR with formal uncertainty propagation, symbolic verification (SymPy), and multi-agent LLM consensus to robustly verify mathematical solutions.

	---

	## 🏗️ System Architecture

	The system follows a modular service-oriented architecture located in the `backend/` directory:

	| Service | Module | Responsibility |
	|---|---|---|
	| 1. Input Receiver | `backend/input_receiver.py` | Validates text and image inputs via Pydantic models. |
	| 2. Preprocessing | `backend/preprocessing_service.py` | Cleans images using OpenCV (denoising, binarization). |
	| 3. OCR Service | `backend/ocr_service.py` | Hybrid engine combining Tesseract and specialized handwriting models. Calculates OCR confidence ($C_{ocr}$). |
	| 4. Representation | `backend/representation_service.py` | Normalizes inputs into a canonical LaTeX-like Intermediate Representation (IR). |
	| 5. Verification | `backend/verification_service.py` | Orchestrates SymPy for arithmetic checks and multi-agent LLMs (Solver, Critic, Verifier) for logic. |
	| 6. Classification | `backend/classifier_service.py` | Aggregates scores using the MVM² Hybrid Formula. |
	| 7. Reporting | `backend/reporting_service.py` | Generates detailed JSON/HTML reports for the user. |

	---

	## ⭐ Key Innovations

	### 1. OCR-Aware Confidence Propagation
	Unlike standard pipelines that treat OCR text as ground truth, MVM² formally propagates visual uncertainty into the final confidence score ($C_{final}$).

	$$
	C_{final} = S_{weighted} \times (0.9 + 0.1 \times C_{ocr})
	$$

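	The multiplier ranges from 0.9 (when $C_{ocr} = 0$) to 1.0 (when $C_{ocr} = 1$), so an ambiguous input image lowers the final confidence by up to 10%. OCR uncertainty therefore shows up in the reported score instead of being silently ignored, which reduces the risk of overconfident verdicts on noisy data.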
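As a worked illustration of the formula above, the following minimal sketch computes $C_{final}$ from the two inputs. The function name `propagate_ocr_confidence` is hypothetical and not taken from the codebase; both inputs are assumed to lie in [0, 1].

```python
def propagate_ocr_confidence(s_weighted: float, c_ocr: float) -> float:
    """Return C_final = S_weighted * (0.9 + 0.1 * C_ocr).

    Hypothetical helper: both arguments are assumed to be in [0, 1].
    """
    for name, value in (("s_weighted", s_weighted), ("c_ocr", c_ocr)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return s_weighted * (0.9 + 0.1 * c_ocr)


# Same verification score, once with a clean image and once with a noisy one.
clean = propagate_ocr_confidence(0.80, 1.0)
noisy = propagate_ocr_confidence(0.80, 0.2)
print(f"clean={clean:.3f} noisy={noisy:.3f}")
```

With $S_{weighted} = 0.80$, a perfectly read image ($C_{ocr} = 1.0$) keeps the score at 0.80, while $C_{ocr} = 0.2$ gives $0.80 \times 0.92 = 0.736$.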

	### 2. Step-Level Multi-Agent Consensus
	We deploy a Multi-Agent System (Solver, Critic, Verifier) to analyze solution steps. We compute a Hallucination Rate by checking consensus across agents for each step.
	- Agreement: raises confidence in the step.
	- Disagreement: flags the step as a potential hallucination.
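This section does not spell out how the Hallucination Rate is computed. One plausible reading, sketched below, flags a step whenever the three agents do not agree unanimously and reports the fraction of flagged steps. The function name, the boolean verdict encoding, and the unanimity rule are all assumptions made for illustration.

```python
from typing import Dict, List

AGENTS = ("solver", "critic", "verifier")


def hallucination_rate(step_verdicts: List[Dict[str, bool]]) -> float:
    """Fraction of steps on which the agents disagree.

    Each element maps an agent name to whether that agent judged the
    step valid. Assumption: any disagreement flags the step.
    """
    if not step_verdicts:
        return 0.0
    flagged = 0
    for verdicts in step_verdicts:
        votes = {verdicts[agent] for agent in AGENTS}
        if len(votes) > 1:  # the agents did not reach consensus
            flagged += 1
    return flagged / len(step_verdicts)


# Four steps; the critic rejects step 3 while the other agents accept it.
steps = [
    {"solver": True, "critic": True, "verifier": True},
    {"solver": True, "critic": True, "verifier": True},
    {"solver": True, "critic": False, "verifier": True},
    {"solver": True, "critic": True, "verifier": True},
]
print(hallucination_rate(steps))
```

One of the four steps lacks consensus, so the rate in this example is 0.25.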

	### 3. Hybrid Scoring Mechanism
	The final validity score ($S_{weighted}$) is a weighted ensemble of three distinct signals:
	- Symbolic Score ($\alpha=0.40$): SymPy's formal verification of arithmetic.
	- Logical Score ($\beta=0.35$): LLM consensus on reasoning flow.
	- Classifier Score ($\gamma=0.25$): Rule-based patterns (e.g., detecting uncertainty keywords).

	$$
	S_{weighted} = 0.40 \cdot S_{sym} + 0.35 \cdot S_{log} + 0.25 \cdot S_{clf}
	$$
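The weighted ensemble above can be sketched in a few lines. The names `WEIGHTS` and `weighted_score` are hypothetical; only the three weights come from this document, and each component score is assumed to be in [0, 1].

```python
# Weights taken from the formula above; they sum to 1.
WEIGHTS = {"symbolic": 0.40, "logical": 0.35, "classifier": 0.25}


def weighted_score(s_sym: float, s_log: float, s_clf: float) -> float:
    """Return S_weighted = 0.40*S_sym + 0.35*S_log + 0.25*S_clf."""
    return (
        WEIGHTS["symbolic"] * s_sym
        + WEIGHTS["logical"] * s_log
        + WEIGHTS["classifier"] * s_clf
    )


# SymPy confirms every arithmetic step, the agents mostly agree,
# and the rule-based classifier is lukewarm.
print(f"{weighted_score(1.0, 0.8, 0.6):.2f}")
```

For these inputs the score is $0.40 + 0.28 + 0.15 = 0.83$. Because the weights sum to 1, $S_{weighted}$ stays in [0, 1] whenever the three component scores do.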

	---

	## 🚀 Getting Started

	### Prerequisites
	- Python 3.10+
	- Tesseract OCR installed ([Instructions](https://github.com/tesseract-ocr/tesseract))
	- Google Gemini API Key

	### Installation
	1. Clone the repository:
	```bash
	git clone https://github.com/yourusername/mvm2.git
	cd mvm2
	```
	2. Install dependencies:
	```bash
	pip install -r requirements.txt
	```
	3. Set API Key:
	```powershell
	# Windows PowerShell
	$env:GEMINI_API_KEY="your_api_key_here"
	```
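How the backend reads the key is not shown here. A minimal sketch, assuming the key is read from the process environment under the same variable name, fails fast with a clear message when the variable is missing. The helper `load_gemini_api_key` is hypothetical.

```python
import os


def load_gemini_api_key() -> str:
    """Return the Gemini API key from the environment or raise.

    Hypothetical helper: assumes the backend looks up GEMINI_API_KEY.
    """
    key = os.environ.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; set it before starting the backend."
        )
    return key
```

Calling a check like this once at startup surfaces a missing key immediately, instead of on the first LLM request.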

	### Running the System

	1. Backend API (FastAPI)
	```bash
	python backend/main.py
	# Server runs at http://localhost:8000
	```

	2. Frontend Interface
	Open `frontend/index.html` in your web browser.
	(No build step required for this lightweight UI)

	3. Docker Deployment
	MVM² is container-ready. We provide a full docker-compose setup.
	```bash
	docker-compose up --build -d
	```
	- Backend API will be available at `http://localhost:8000`
	- Frontend UI will be available at `http://localhost:8080`

	---

	## 🧪 Experiments & Evaluation

	We provide a custom evaluation suite to reproduce our ablation studies.

	### 1. Dataset
	The evaluation uses `datasets/sample_data.json`. You can add your own samples here.

	### 2. Running Ablation Modes
	The `scripts/run_evaluation.py` script automatically compares four system configurations:

	| Mode | Description | Hypothesis |
	|---|---|---|
	| `single_llm_only` | Baseline (1 Agent) | High hallucination rate, low accuracy. |
	| `llm_plus_sympy` | Hybrid (1 Agent + SymPy) | Better arithmetic, still hallucinates logic. |
	| `multi_agent_no_ocr_conf` | Multi-Agent Consensus | Low hallucination, but overconfident on noisy images. |
	| `full_mvm2` | Complete System | Highest reliability and calibrated confidence. |

	Command (run from the repository root):
	```bash
	python scripts/run_evaluation.py
	```
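One way to picture the four modes is as feature toggles on the same pipeline. The sketch below is an assumption about how they differ, not the script's actual configuration: the class and field names are hypothetical, the agent count of 3 follows the Solver/Critic/Verifier trio, and SymPy is assumed to stay enabled in `multi_agent_no_ocr_conf`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationConfig:
    """Hypothetical feature toggles for one ablation mode."""
    num_agents: int
    use_sympy: bool
    use_ocr_confidence: bool


ABLATION_MODES = {
    "single_llm_only": AblationConfig(1, use_sympy=False, use_ocr_confidence=False),
    "llm_plus_sympy": AblationConfig(1, use_sympy=True, use_ocr_confidence=False),
    # Assumption: SymPy remains on here; only OCR confidence is disabled.
    "multi_agent_no_ocr_conf": AblationConfig(3, use_sympy=True, use_ocr_confidence=False),
    "full_mvm2": AblationConfig(3, use_sympy=True, use_ocr_confidence=True),
}

for mode, config in ABLATION_MODES.items():
    print(mode, config)
```

Read this way, each mode adds one capability to the previous one, so a change in the metrics between adjacent rows can be attributed to a single component.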

	### 3. Results
	Results are written to `evaluation_results.csv`, which contains:
	- Accuracy (Exact Match)
	- Hallucination Rate
	- Latency (ms)
	- Verdicts
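To compare the modes side by side, the CSV can be averaged per mode with the standard library. This is a sketch only: the column names `mode`, `accuracy`, `hallucination_rate`, and `latency_ms` are assumptions, so adjust them to match the header row the script actually writes.

```python
import csv
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List

# Assumed numeric column names; check them against the real CSV header.
METRICS = ("accuracy", "hallucination_rate", "latency_ms")


def load_results(path: str = "evaluation_results.csv") -> List[Dict[str, str]]:
    """Read the results file into a list of row dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def summarize_by_mode(rows: Iterable[Dict[str, str]]) -> Dict[str, Dict[str, float]]:
    """Average each metric over all rows that share the same mode."""
    grouped: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for row in rows:
        grouped[row["mode"]].append(row)
    return {
        mode: {metric: mean(float(r[metric]) for r in items) for metric in METRICS}
        for mode, items in grouped.items()
    }
```

After an evaluation run, `summarize_by_mode(load_results())` returns one dictionary of averaged metrics per ablation mode.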

	---

	## 📁 Project Structure

	```
	math_verification_mvp/
	├── backend/
	│ ├── config.py # Central Configuration
	│ ├── core/ # Core Logic Services (MVM² Modules)
	│ │ ├── input_receiver.py
	│ │ ├── ocr_service.py
	│ │ ├── verification_service.py
	│ │ ├── classifier_service.py
	│ │ └── ...
	│ ├── tests/ # Unit Tests
	│ └── main.py # FastAPI Entry Point
	├── frontend/ # Lightweight UI
	├── datasets/ # Evaluation Data & Results
	├── scripts/ # Evaluation & Benchmark Scripts
	│ ├── run_evaluation.py
	│ ├── run_benchmarks.py
	│ └── quick_test.py
	├── docs/ # Documentation
	└── requirements.txt # Dependencies
	```
