# MVM²: Multi-Modal Multi-Model Mathematical Reasoning Verification System
**VNR VJIET Major Project 2025**
**Team:** Brahma Teja, Vinith Kulkarni, Varshith Dharmaj V, Bhavitha Yaragorla




---
## Problem Statement
Validating mathematical reasoning generated by Large Language Models (LLMs) is critical but challenging, especially when inputs are multimodal (images of handwritten or printed text).
**Key Challenges:**
1. **Hallucinations:** LLMs often generate plausible-sounding but logically flawed steps.
2. **OCR Noise:** Extracting math from images introduces errors (e.g., confusing '5' with 'S' or missing integrals) that downstream verifiers blindly accept.
3. **Lack of Formal Uncertainty:** Existing systems do not account for OCR confidence when making final validity judgments.
**MVM² Solution:** A unified pipeline that combines **OCR with formal uncertainty propagation**, **symbolic verification (SymPy)**, and **multi-agent LLM consensus** to robustly verify mathematical solutions.
---
## System Architecture
The system follows a modular service-oriented architecture located in the `backend/` directory:
| Service | Responsibility |
|---|---|
| **1. Input Receiver** | (`backend/input_receiver.py`) Validates text/image inputs via Pydantic models. |
| **2. Preprocessing** | (`backend/preprocessing_service.py`) Cleans images using OpenCV (denoising, binarization). |
| **3. OCR Service** | (`backend/ocr_service.py`) Hybrid engine combining Tesseract with specialized handwriting models. **Calculates OCR Confidence ($C_{ocr}$).** |
| **4. Representation** | (`backend/representation_service.py`) Normalizes inputs into a canonical LaTeX-like Intermediate Representation (IR). |
| **5. Verification** | (`backend/verification_service.py`) Orchestrates **SymPy** for arithmetic checks and **Multi-Agent LLMs** (Solver, Critic, Verifier) for logic. |
| **6. Classification** | (`backend/classifier_service.py`) Aggregates scores using the **MVM² Hybrid Formula**. |
| **7. Reporting** | (`backend/reporting_service.py`) Generates detailed JSON/HTML reports for the user. |
---
## Key Innovations
### 1. OCR-Aware Confidence Propagation
Unlike standard pipelines that treat OCR text as ground truth, MVM² formally propagates visual uncertainty into the final confidence score ($C_{final}$).
$$
C_{final} = S_{weighted} \times (0.9 + 0.1 \times C_{ocr})
$$
The scaling factor ranges from 0.9 (when $C_{ocr} = 0$) to 1.0 (when $C_{ocr} = 1$), so an ambiguous input image lowers the final confidence by up to 10%, which reduces the risk of confident false positives on noisy data.
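
As a minimal sketch of the formula above, the scaling can be written as a small Python function. The function name `final_confidence` and the range checks are assumptions for illustration, not the project's actual API.

```python
def final_confidence(s_weighted: float, c_ocr: float) -> float:
    """Scale the weighted validity score by OCR confidence.

    Implements C_final = S_weighted * (0.9 + 0.1 * C_ocr).
    Assumption: both inputs are expected to lie in [0, 1].
    """
    for name, value in (("s_weighted", s_weighted), ("c_ocr", c_ocr)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return s_weighted * (0.9 + 0.1 * c_ocr)


# Same verification score, perfectly read vs. unreadable input.
clean = final_confidence(0.80, 1.0)  # scaling factor 1.0
noisy = final_confidence(0.80, 0.0)  # scaling factor 0.9
```

With $S_{weighted} = 0.80$, a perfectly read image leaves the score at 0.80, while $C_{ocr} = 0$ lowers it to roughly 0.72.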
### 2. Step-Level Multi-Agent Consensus
We deploy a **Multi-Agent System** (Solver, Critic, Verifier) to analyze solution steps. We compute a **Hallucination Rate** by checking consensus across agents for each step.
- **Agreement:** Raises confidence in the step.
- **Disagreement:** Flags the step as a potential hallucination.
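
One way to turn step-level consensus into a hallucination rate is sketched below. The verdict encoding (one boolean per agent per step) and the rule that any non-unanimous step is flagged are assumptions for illustration; the exact consensus rule used by the project is not specified here.

```python
from typing import Dict, List

AGENTS = ("solver", "critic", "verifier")


def hallucination_rate(step_verdicts: List[Dict[str, bool]]) -> float:
    """Return the fraction of steps on which the agents disagree.

    Each item maps an agent name to True (step accepted) or False (rejected).
    Assumption: any non-unanimous step counts as a potential hallucination.
    """
    if not step_verdicts:
        return 0.0
    flagged = 0
    for verdicts in step_verdicts:
        votes = {verdicts[agent] for agent in AGENTS}
        if len(votes) > 1:  # agents disagree on this step
            flagged += 1
    return flagged / len(step_verdicts)


steps = [
    {"solver": True, "critic": True, "verifier": True},     # agreement
    {"solver": True, "critic": False, "verifier": True},    # disagreement
    {"solver": True, "critic": True, "verifier": True},     # agreement
    {"solver": False, "critic": False, "verifier": False},  # agreement (all reject)
]
rate = hallucination_rate(steps)
```

In this example one of four steps is contested, so the rate is 0.25. A stricter or looser rule (for example, majority vote) would only change the `len(votes) > 1` check.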
### 3. Hybrid Scoring Mechanism
The final validity score ($S_{weighted}$) is a weighted ensemble of three distinct signals:
- **Symbolic Score ($\alpha=0.40$):** SymPy's formal verification of arithmetic.
- **Logical Score ($\beta=0.35$):** LLM consensus on reasoning flow.
- **Classifier Score ($\gamma=0.25$):** Rule-based patterns (e.g., detecting uncertainty keywords).
$$
S_{weighted} = 0.40 \cdot S_{sym} + 0.35 \cdot S_{log} + 0.25 \cdot S_{clf}
$$
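
The weighted ensemble can be sketched in a few lines of Python. The weights come from the formula above; the function name `weighted_score` and the example component scores are hypothetical.

```python
ALPHA, BETA, GAMMA = 0.40, 0.35, 0.25  # weights from the formula above


def weighted_score(s_sym: float, s_log: float, s_clf: float) -> float:
    """Compute S_weighted = 0.40*S_sym + 0.35*S_log + 0.25*S_clf."""
    return ALPHA * s_sym + BETA * s_log + GAMMA * s_clf


# Hypothetical component scores: the arithmetic checks out, the agents
# mostly agree on the logic, and the rule-based classifier is lukewarm.
score = weighted_score(s_sym=1.0, s_log=0.8, s_clf=0.6)
```

Here the result is approximately 0.83 (0.40 + 0.28 + 0.15). Because the weights sum to 1, the score stays in $[0, 1]$ whenever the three inputs do.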
---
## Getting Started
### Prerequisites
- Python 3.10+
- Tesseract OCR installed ([Instructions](https://github.com/tesseract-ocr/tesseract))
- Google Gemini API Key
### Installation
1. Clone the repository:
```bash
git clone https://github.com/yourusername/mvm2.git
cd mvm2
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Set API Key:
```powershell
# Windows PowerShell
$env:GEMINI_API_KEY="your_api_key_here"
```
### Running the System
**1. Backend API (FastAPI)**
```bash
python backend/main.py
# Server runs at http://localhost:8000
```
**2. Frontend Interface**
Open `frontend/index.html` in your web browser.
*(No build step required for this lightweight UI)*
**3. Docker Deployment**
MVM² is container-ready. We provide a full docker-compose setup.
```bash
docker-compose up --build -d
```
- Backend API will be available at `http://localhost:8000`
- Frontend UI will be available at `http://localhost:8080`
---
## Experiments & Evaluation
We provide a custom evaluation suite to reproduce our ablation studies.
### 1. Dataset
The evaluation uses `datasets/sample_data.json`. You can add your own samples here.
### 2. Running Ablation Modes
The `scripts/run_evaluation.py` script automatically compares four system configurations:
| Mode | Description | Hypothesis |
|---|---|---|
| `single_llm_only` | Baseline (1 Agent) | High hallucination rate, low accuracy. |
| `llm_plus_sympy` | Hybrid (1 Agent + SymPy) | Better arithmetic, still hallucinates logic. |
| `multi_agent_no_ocr_conf` | Multi-Agent Consensus | Low hallucination, but overconfident on noisy images. |
| **`full_mvm2`** | **Complete System** | **Highest reliability and calibrated confidence.** |
**Command:**
```bash
python scripts/run_evaluation.py
```
### 3. Results
Outputs are saved to `evaluation_results.csv`, which contains:
- Accuracy (Exact Match)
- Hallucination Rate
- Latency (ms)
- Verdicts
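
A results file like this can be summarized per mode with the standard library. The sketch below is only illustrative: the column names (`mode`, `correct`, `latency_ms`) and the sample rows are hypothetical placeholders, not the actual schema or measured results.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows standing in for evaluation_results.csv.
# Column names and values are hypothetical, not real results.
SAMPLE_CSV = """mode,correct,latency_ms
single_llm_only,1,100
single_llm_only,0,140
full_mvm2,1,300
full_mvm2,1,340
"""


def summarize(csv_file) -> dict:
    """Return per-mode exact-match accuracy and mean latency in ms."""
    rows_by_mode = defaultdict(list)
    for row in csv.DictReader(csv_file):
        rows_by_mode[row["mode"]].append(row)
    summary = {}
    for mode, rows in rows_by_mode.items():
        n = len(rows)
        summary[mode] = {
            "accuracy": sum(int(r["correct"]) for r in rows) / n,
            "mean_latency_ms": sum(float(r["latency_ms"]) for r in rows) / n,
        }
    return summary


summary = summarize(io.StringIO(SAMPLE_CSV))
```

To run it on real output, pass `open("evaluation_results.csv", newline="")` instead of the in-memory sample and adjust the column names to match the file.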
---
## Project Structure
```
math_verification_mvp/
├── backend/
│   ├── config.py                  # Central Configuration
│   ├── core/                      # Core Logic Services (MVM² Modules)
│   │   ├── input_receiver.py
│   │   ├── ocr_service.py
│   │   ├── verification_service.py
│   │   ├── classifier_service.py
│   │   └── ...
│   ├── tests/                     # Unit Tests
│   └── main.py                    # FastAPI Entry Point
├── frontend/                      # Lightweight UI
├── datasets/                      # Evaluation Data & Results
├── scripts/                       # Evaluation & Benchmark Scripts
│   ├── run_evaluation.py
│   ├── run_benchmarks.py
│   └── quick_test.py
├── docs/                          # Documentation
└── requirements.txt               # Dependencies
```