# MVM²: Multi-Modal Multi-Model Mathematical Reasoning Verification System

**VNR VJIET Major Project 2025**  
**Team:** Brahma Teja, Vinith Kulkarni, Varshith Dharmaj V, Bhavitha Yaragorla

![Status](https://img.shields.io/badge/status-production--ready-green)
![Version](https://img.shields.io/badge/version-2.0.0-blue)
![Python](https://img.shields.io/badge/python-3.10+-blue)
![Docker](https://img.shields.io/badge/docker-enabled-blue)

---

## 📄 Problem Statement

Validating mathematical reasoning generated by Large Language Models (LLMs) is critical but challenging, especially when inputs are multimodal (images of handwritten or printed text). 

**Key Challenges:**
1.  **Hallucinations:** LLMs often generate plausible-sounding but logically flawed steps.
2.  **OCR Noise:** Extracting math from images introduces errors (e.g., confusing '5' with 'S' or dropping an integral sign) that downstream verifiers accept without question.
3.  **Lack of Formal Uncertainty:** Existing systems do not account for OCR confidence when making final validity judgments.

**MVM² Solution:** A unified pipeline that combines **OCR with formal uncertainty propagation**, **symbolic verification (SymPy)**, and **multi-agent LLM consensus** to robustly verify mathematical solutions.

---

## 🏗️ System Architecture

The system follows a modular service-oriented architecture located in the `backend/` directory:

| Service | Responsibility |
|---|---|
| **1. Input Receiver** | (`backend/input_receiver.py`) Validates text/image inputs via Pydantic models. |
| **2. Preprocessing** | (`backend/preprocessing_service.py`) Cleans images using OpenCV (denoising, binarization). |
| **3. OCR Service** | (`backend/ocr_service.py`) Hybrid engine combining Tesseract and specialized handwritten models. **Calculates OCR Confidence ($C_{ocr}$).** |
| **4. Representation** | (`backend/representation_service.py`) Normalizes inputs into a canonical LaTeX-like Intermediate Representation (IR). |
| **5. Verification** | (`backend/verification_service.py`) Orchestrates **SymPy** for arithmetic checks and **Multi-Agent LLMs** (Solver, Critic, Verifier) for logic. |
| **6. Classification** | (`backend/classifier_service.py`) Aggregates scores using the **MVM² Hybrid Formula**. |
| **7. Reporting** | (`backend/reporting_service.py`) Generates detailed JSON/HTML reports for the user. |
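
To make the data flow between the services concrete, here is a minimal sketch of how stages like these could be chained. Everything in it is hypothetical: the function names, the dictionary keys, and the placeholder stage bodies are assumptions for illustration and do not reflect the actual code in `backend/`.

```python
from typing import Callable, Dict, List

# A shared context dictionary is threaded through the stages (an assumption).
Context = Dict[str, object]
Stage = Callable[[Context], Context]


def receive_input(ctx: Context) -> Context:
    # Stage 1 stand-in: reject requests that carry neither text nor an image.
    if not ctx.get("text") and not ctx.get("image_path"):
        raise ValueError("expected 'text' or 'image_path'")
    return ctx


def run_ocr(ctx: Context) -> Context:
    # Stage 3 placeholder: a real OCR engine would produce both fields.
    # Treating typed text as fully confident is an assumption.
    if ctx.get("text"):
        return {**ctx, "ocr_confidence": 1.0}
    return {**ctx, "text": "<ocr output>", "ocr_confidence": 0.5}


def normalize(ctx: Context) -> Context:
    # Stage 4 placeholder: collapsing whitespace stands in for IR conversion.
    return {**ctx, "ir": " ".join(str(ctx["text"]).split())}


def run_pipeline(ctx: Context, stages: List[Stage]) -> Context:
    """Apply each stage in order, passing the context along."""
    for stage in stages:
        ctx = stage(ctx)
    return ctx


if __name__ == "__main__":
    result = run_pipeline({"text": "2 +  2 = 4"}, [receive_input, run_ocr, normalize])
    print(result["ir"], result["ocr_confidence"])
```

Keeping each stage a plain function of the context makes it easy to drop or swap a stage, which is the kind of flexibility a modular layout like the one above is meant to provide.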

---

## ⭐ Key Innovations

### 1. OCR-Aware Confidence Propagation
Unlike standard pipelines that treat OCR text as ground truth, MVM² formally propagates visual uncertainty into the final confidence score ($C_{final}$).



$$
C_{final} = S_{weighted} \times (0.9 + 0.1 \times C_{ocr})
$$

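This lowers the final confidence when the input image was ambiguous: as $C_{ocr}$ falls from 1 to 0, $C_{final}$ drops from $S_{weighted}$ to $0.9 \times S_{weighted}$, a penalty of up to 10%. The adjustment tempers overconfident verdicts on noisy data.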
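
As a rough illustration of the formula above, the following sketch evaluates $C_{final}$ for a few OCR confidence values. The function name `final_confidence` and the clamping of inputs to $[0, 1]$ are assumptions made for this example, not the project's actual implementation.

```python
def final_confidence(s_weighted: float, c_ocr: float) -> float:
    """Compute C_final = S_weighted * (0.9 + 0.1 * C_ocr).

    Both inputs are assumed to lie in [0, 1]; clamping them is an
    illustrative choice that the formula itself does not prescribe.
    """
    s = min(max(s_weighted, 0.0), 1.0)
    c = min(max(c_ocr, 0.0), 1.0)
    return s * (0.9 + 0.1 * c)


if __name__ == "__main__":
    # Same validity score, decreasing OCR confidence.
    for c_ocr in (1.0, 0.5, 0.0):
        print(f"C_ocr={c_ocr:.1f} -> C_final={final_confidence(0.8, c_ocr):.3f}")
```

With $S_{weighted} = 0.8$, the result ranges from 0.800 for a perfectly read image down to 0.720 when OCR confidence is zero.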

### 2. Step-Level Multi-Agent Consensus
We deploy a **Multi-Agent System** (Solver, Critic, Verifier) to analyze each solution step. We compute a **Hallucination Rate** by checking, for each step, whether the agents reach consensus.
- **Agreement:** increases confidence in the step.
- **Disagreement:** flags the step as a potential hallucination.
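
One way to picture the consensus check is the sketch below. It assumes each agent returns a boolean verdict per step, that a step is flagged whenever the three verdicts are not unanimous, and that the hallucination rate is the fraction of flagged steps. These rules and all names are assumptions for illustration; the exact definition used by the system is not specified here.

```python
from typing import Dict, List

AGENTS = ("solver", "critic", "verifier")


def flagged_steps(verdicts: List[Dict[str, bool]]) -> List[int]:
    """Return indices of steps where the agents do not all agree."""
    flagged = []
    for i, step in enumerate(verdicts):
        votes = {step[agent] for agent in AGENTS}
        if len(votes) > 1:  # mixed True/False votes mean disagreement
            flagged.append(i)
    return flagged


def hallucination_rate(verdicts: List[Dict[str, bool]]) -> float:
    """Fraction of steps flagged by disagreement (0.0 when there are no steps)."""
    if not verdicts:
        return 0.0
    return len(flagged_steps(verdicts)) / len(verdicts)


if __name__ == "__main__":
    steps = [
        {"solver": True, "critic": True, "verifier": True},
        {"solver": True, "critic": False, "verifier": True},
        {"solver": True, "critic": True, "verifier": True},
        {"solver": False, "critic": False, "verifier": True},
    ]
    print(flagged_steps(steps))       # [1, 3]
    print(hallucination_rate(steps))  # 0.5
```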

### 3. Hybrid Scoring Mechanism
The final validity score ($S_{weighted}$) is a weighted ensemble of three distinct signals:

- **Symbolic Score ($\alpha=0.40$):** SymPy's formal verification of arithmetic.
- **Logical Score ($\beta=0.35$):** LLM consensus on reasoning flow.
- **Classifier Score ($\gamma=0.25$):** Rule-based patterns (e.g., detecting uncertainty keywords).

$$
S_{weighted} = 0.40 \cdot S_{sym} + 0.35 \cdot S_{log} + 0.25 \cdot S_{clf}
$$
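
The weighted ensemble can be sketched in a few lines of Python. The `Weights` dataclass and the `weighted_score` function are hypothetical names for this example; only the three weights come from the formula above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    alpha: float = 0.40  # symbolic score weight
    beta: float = 0.35   # logical score weight
    gamma: float = 0.25  # classifier score weight


def weighted_score(s_sym: float, s_log: float, s_clf: float,
                   w: Weights = Weights()) -> float:
    """Compute S_weighted = alpha*S_sym + beta*S_log + gamma*S_clf."""
    total = w.alpha + w.beta + w.gamma
    if abs(total - 1.0) > 1e-9:
        # Requiring the weights to sum to 1 is an assumption of this sketch.
        raise ValueError(f"weights must sum to 1, got {total}")
    return w.alpha * s_sym + w.beta * s_log + w.gamma * s_clf


if __name__ == "__main__":
    # Arithmetic fully verified, good LLM consensus, weaker classifier signal.
    print(f"{weighted_score(1.0, 0.8, 0.6):.2f}")  # 0.83
```

In this example the score is $0.40 \cdot 1.0 + 0.35 \cdot 0.8 + 0.25 \cdot 0.6 = 0.83$.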



---



## 🚀 Getting Started

### Prerequisites
- Python 3.10+
- Tesseract OCR installed ([Instructions](https://github.com/tesseract-ocr/tesseract))
- Google Gemini API Key

### Installation
1.  Clone the repository:
    ```bash
    git clone https://github.com/yourusername/mvm2.git
    cd mvm2
    ```
2.  Install dependencies:
    ```bash
    pip install -r requirements.txt
    ```
3.  Set API Key:
    ```powershell
    # Windows PowerShell
    $env:GEMINI_API_KEY="your_api_key_here"
    ```
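
To check that the key is visible to Python before starting the backend, a small helper along these lines may be useful. The function name `load_gemini_api_key` is hypothetical, and how the backend actually reads the key is an assumption; only the `GEMINI_API_KEY` variable name comes from the step above.

```python
import os
from typing import Mapping


def load_gemini_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the Gemini API key from the environment, or fail loudly."""
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it in your shell before starting the backend"
        )
    return key


if __name__ == "__main__":
    try:
        load_gemini_api_key()
        print("GEMINI_API_KEY found")
    except RuntimeError as exc:
        print(f"Configuration error: {exc}")
```

Failing at startup with a clear message is usually easier to debug than an authentication error raised later during a verification request.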


### Running the System

**1. Backend API (FastAPI)**
```bash
python backend/main.py
# Server runs at http://localhost:8000
```

**2. Frontend Interface**
Open `frontend/index.html` in your web browser. 
*(No build step required for this lightweight UI)*

**3. Docker Deployment**
MVM² is container-ready. We provide a full docker-compose setup.
```bash
docker-compose up --build -d
```
- Backend API will be available at `http://localhost:8000`
- Frontend UI will be available at `http://localhost:8080`

---

## 🧪 Experiments & Evaluation

We provide a custom evaluation suite to reproduce our ablation studies.

### 1. Dataset
The evaluation uses `datasets/sample_data.json`. You can add your own samples here.

### 2. Running Ablation Modes
The `scripts/run_evaluation.py` script automatically compares four system configurations:

| Mode | Description | Hypothesis |
|---|---|---|
| `single_llm_only` | Baseline (1 Agent) | High hallucination rate, low accuracy. |
| `llm_plus_sympy` | Hybrid (1 Agent + SymPy) | Better arithmetic, still hallucinates logic. |
| `multi_agent_no_ocr_conf` | Multi-Agent Consensus | Low hallucination, but overconfident on noisy images. |
| **`full_mvm2`** | **Complete System** | **Highest reliability and calibrated confidence.** |



**Command:**
```bash
python scripts/run_evaluation.py
```



### 3. Results

Results are saved to `evaluation_results.csv`, which contains:
- Accuracy (Exact Match)
- Hallucination Rate
- Latency (ms)
- Verdicts
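
For a quick per-mode summary of the results file, something like the following sketch could be used. The column names `mode` and `latency_ms` are assumptions, since the exact CSV header is not documented here, and the sample rows are placeholders rather than real results.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable


def mean_by_mode(rows: Iterable[Dict[str, str]], metric: str,
                 mode_col: str = "mode") -> Dict[str, float]:
    """Average a numeric column for each ablation mode."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for row in rows:
        sums[row[mode_col]] += float(row[metric])
        counts[row[mode_col]] += 1
    return {mode: sums[mode] / counts[mode] for mode in sums}


if __name__ == "__main__":
    # Placeholder data with assumed column names, not actual measurements.
    sample = io.StringIO(
        "mode,latency_ms\n"
        "single_llm_only,100\n"
        "single_llm_only,300\n"
        "full_mvm2,500\n"
    )
    print(mean_by_mode(csv.DictReader(sample), metric="latency_ms"))
```

To run it against real output, replace the in-memory sample with `open("evaluation_results.csv", newline="")` and adjust the column names to match the file's header.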



---



## 📁 Project Structure

```
math_verification_mvp/
├── backend/
│   ├── config.py             # Central Configuration
│   ├── core/                 # Core Logic Services (MVM² Modules)
│   │   ├── input_receiver.py
│   │   ├── ocr_service.py
│   │   ├── verification_service.py
│   │   ├── classifier_service.py
│   │   └── ...
│   ├── tests/                # Unit Tests
│   └── main.py               # FastAPI Entry Point
├── frontend/                 # Lightweight UI
├── datasets/                 # Evaluation Data & Results
├── scripts/                  # Evaluation & Benchmark Scripts
│   ├── run_evaluation.py
│   ├── run_benchmarks.py
│   └── quick_test.py
├── docs/                     # Documentation
└── requirements.txt          # Dependencies
```


