# MVM²: Multi-Modal Multi-Model Mathematical Reasoning Verification System

**VNR VJIET Major Project 2025**  
**Team:** Brahma Teja, Vinith Kulkarni, Varshith Dharmaj V, Bhavitha Yaragorla

![Status](https://img.shields.io/badge/status-production--ready-green)
![Version](https://img.shields.io/badge/version-2.0.0-blue)
![Python](https://img.shields.io/badge/python-3.10+-blue)
![Docker](https://img.shields.io/badge/docker-enabled-blue)

---

## 📄 Problem Statement

Validating mathematical reasoning generated by Large Language Models (LLMs) is critical but challenging, especially when inputs are multimodal (images of handwritten or printed text). 

**Key Challenges:**
1.  **Hallucinations:** LLMs often generate plausible-sounding but logically flawed steps.
2.  **OCR Noise:** Extracting math from images introduces errors (e.g., confusing '5' with 'S' or missing integrals) that downstream verifiers blindly accept.
3.  **Lack of Formal Uncertainty:** Existing systems do not account for OCR confidence when making final validity judgments.

**MVM² Solution:** A unified pipeline that combines **OCR with formal uncertainty propagation**, **symbolic verification (SymPy)**, and **multi-agent LLM consensus** to robustly verify mathematical solutions.

---

## 🏗️ System Architecture

The system follows a modular service-oriented architecture located in the `backend/` directory:

| Service | Responsibility |
|---|---|
| **1. Input Receiver** | (`backend/input_receiver.py`) Validates text/image inputs via Pydantic models. |
| **2. Preprocessing** | (`backend/preprocessing_service.py`) Cleans images using OpenCV (denoising, binarization). |
| **3. OCR Service** | (`backend/ocr_service.py`) Hybrid engine combining Tesseract and specialized handwriting-recognition models. **Calculates OCR Confidence ($C_{ocr}$).** |
| **4. Representation** | (`backend/representation_service.py`) Normalizes inputs into a canonical LaTeX-like Intermediate Representation (IR). |
| **5. Verification** | (`backend/verification_service.py`) Orchestrates **SymPy** for arithmetic checks and **Multi-Agent LLMs** (Solver, Critic, Verifier) for logic. |
| **6. Classification** | (`backend/classifier_service.py`) Aggregates scores using the **MVM² Hybrid Formula**. |
| **7. Reporting** | (`backend/reporting_service.py`) Generates detailed JSON/HTML reports for the user. |

---

## ⭐ Key Innovations

### 1. OCR-Aware Confidence Propagation
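Unlike standard pipelines that treat OCR text as ground truth, MVM² propagates visual uncertainty into the final confidence score ($C_{final}$) by scaling the weighted validity score $S_{weighted}$ (defined under Hybrid Scoring Mechanism below) with the OCR confidence $C_{ocr}$.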



$$
C_{final} = S_{weighted} \times (0.9 + 0.1 \times C_{ocr})
$$

This scales the verification result down when the input image was ambiguous: with the coefficients above, $C_{ocr} = 0$ lowers $C_{final}$ by 10% relative to $C_{ocr} = 1$, which reduces overconfident verdicts on noisy data.
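As a minimal sketch of the formula above, the scaling can be written as a single function. The function name and the assumption that both inputs lie in $[0, 1]$ are illustrative and are not taken from the project code.

```python
def propagate_ocr_confidence(s_weighted: float, c_ocr: float) -> float:
    """Return C_final = S_weighted * (0.9 + 0.1 * C_ocr).

    Assumption (not stated in the README): both inputs lie in [0, 1].
    """
    if not 0.0 <= s_weighted <= 1.0:
        raise ValueError("s_weighted must be in [0, 1]")
    if not 0.0 <= c_ocr <= 1.0:
        raise ValueError("c_ocr must be in [0, 1]")
    return s_weighted * (0.9 + 0.1 * c_ocr)


# Same weighted score, two different OCR confidences.
clean = propagate_ocr_confidence(0.80, 1.0)  # factor 1.00
noisy = propagate_ocr_confidence(0.80, 0.5)  # factor 0.95
print(round(clean, 4), round(noisy, 4))
```

With these coefficients the OCR factor ranges from 0.9 to 1.0, so a fully unreliable image lowers the final score by at most 10%.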

### 2. Step-Level Multi-Agent Consensus
We deploy a **Multi-Agent System** (Solver, Critic, Verifier) to analyze solution steps. We compute a **Hallucination Rate** by checking consensus across agents for each step.
- **Agreement:** Raises confidence in the step.
- **Disagreement:** Flags the step as a potential hallucination.
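The README does not give the exact definition of the Hallucination Rate, so the following is only one plausible reading: the fraction of steps on which the three agents do not return the same verdict. The data layout (one dict of boolean verdicts per step) and the function name are assumptions made for illustration.

```python
from typing import Dict, List

# Agent roles named in the README; the dict keys themselves are hypothetical.
AGENTS = ("solver", "critic", "verifier")


def hallucination_rate(step_verdicts: List[Dict[str, bool]]) -> float:
    """Fraction of steps where the agents do not all give the same verdict."""
    if not step_verdicts:
        return 0.0
    flagged = sum(
        1
        for verdicts in step_verdicts
        if len({verdicts[agent] for agent in AGENTS}) > 1  # any disagreement
    )
    return flagged / len(step_verdicts)


steps = [
    {"solver": True, "critic": True, "verifier": True},     # consensus: valid
    {"solver": True, "critic": False, "verifier": True},    # disagreement
    {"solver": True, "critic": True, "verifier": True},     # consensus: valid
    {"solver": False, "critic": False, "verifier": False},  # consensus: invalid
]
rate = hallucination_rate(steps)
print(rate)
```

Under this reading, a step that all agents reject still counts as consensus; only split verdicts are flagged.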

### 3. Hybrid Scoring Mechanism
The final validity score ($S_{weighted}$) is a weighted ensemble of three distinct signals:

- **Symbolic Score ($\alpha=0.40$):** SymPy's formal verification of arithmetic.
- **Logical Score ($\beta=0.35$):** LLM consensus on reasoning flow.
- **Classifier Score ($\gamma=0.25$):** Rule-based patterns (e.g., detecting uncertainty keywords).

$$
S_{weighted} = 0.40 \cdot S_{sym} + 0.35 \cdot S_{log} + 0.25 \cdot S_{clf}
$$
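The weighted ensemble can be sketched as follows. The weights come from the formula above; the function name, the dictionary keys, and the assumption that each component score lies in $[0, 1]$ are illustrative.

```python
# Weights alpha, beta, gamma from the README; they sum to 1.
WEIGHTS = {"symbolic": 0.40, "logical": 0.35, "classifier": 0.25}


def weighted_score(s_sym: float, s_log: float, s_clf: float) -> float:
    """Return S_weighted = 0.40*S_sym + 0.35*S_log + 0.25*S_clf."""
    return (
        WEIGHTS["symbolic"] * s_sym
        + WEIGHTS["logical"] * s_log
        + WEIGHTS["classifier"] * s_clf
    )


# Arithmetic fully verified, partial agent consensus, weaker rule-based signal:
# 0.40*1.0 + 0.35*0.8 + 0.25*0.6 = 0.83
score = weighted_score(1.0, 0.8, 0.6)
print(round(score, 4))
```

Because the weights sum to 1, $S_{weighted}$ stays in $[0, 1]$ whenever the three inputs do.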



---



## 🚀 Getting Started



### Prerequisites

- Python 3.10+
- Tesseract OCR installed ([Instructions](https://github.com/tesseract-ocr/tesseract))
- Google Gemini API Key



### Installation

1.  Clone the repository:
    ```bash
    git clone https://github.com/yourusername/mvm2.git
    cd mvm2
    ```
2.  Install dependencies:
    ```bash
    pip install -r requirements.txt
    ```
3.  Set API Key:
    ```powershell
    # Windows PowerShell
    $env:GEMINI_API_KEY="your_api_key_here"
    ```


### Running the System

**1. Backend API (FastAPI)**
```bash
python backend/main.py
# Server runs at http://localhost:8000
```

**2. Frontend Interface**
Open `frontend/index.html` in your web browser. 
*(No build step required for this lightweight UI)*

**3. Docker Deployment**
MVM² is container-ready. We provide a full docker-compose setup.
```bash
docker-compose up --build -d
```
- Backend API will be available at `http://localhost:8000`
- Frontend UI will be available at `http://localhost:8080`

---

## 🧪 Experiments & Evaluation

We provide a custom evaluation suite to reproduce our ablation studies.

### 1. Dataset
The evaluation uses `datasets/sample_data.json`. You can add your own samples here.

### 2. Running Ablation Modes
The `run_evaluation.py` script automatically compares 4 system configurations:

| Mode | Description | Hypothesis |
|---|---|---|
| `single_llm_only` | Baseline (1 Agent) | High hallucination rate, low accuracy. |
| `llm_plus_sympy` | Hybrid (1 Agent + SymPy) | Better arithmetic, still hallucinates logic. |
| `multi_agent_no_ocr_conf` | Multi-Agent Consensus | Low hallucination, but overconfident on noisy images. |
| **`full_mvm2`** | **Complete System** | **Highest reliability and calibrated confidence.** |



**Command:**

```bash
# Run from the repository root; the script lives in scripts/ (see Project Structure)
python scripts/run_evaluation.py
```



### 3. Results

Outputs are saved to `evaluation_results.csv`, which contains:

- Accuracy (Exact Match)
- Hallucination Rate
- Latency (ms)
- Verdicts
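To compare configurations, the results file can be aggregated per mode. The README does not document the CSV layout, so the column names below (`mode`, `correct`, `latency_ms`) are hypothetical, and the in-memory rows are placeholders for illustration only, not measured results. Adapt the names to the actual header of `evaluation_results.csv`.

```python
import csv
import io
from collections import defaultdict


def summarize_by_mode(rows):
    """Aggregate accuracy and mean latency per mode.

    Assumes hypothetical columns: mode, correct (0/1), latency_ms.
    """
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "latency_ms": 0.0})
    for row in rows:
        entry = totals[row["mode"]]
        entry["n"] += 1
        entry["correct"] += int(row["correct"])
        entry["latency_ms"] += float(row["latency_ms"])
    return {
        mode: {
            "accuracy": entry["correct"] / entry["n"],
            "mean_latency_ms": entry["latency_ms"] / entry["n"],
        }
        for mode, entry in totals.items()
    }


# Placeholder rows standing in for the real file (not actual results).
sample = io.StringIO(
    "mode,correct,latency_ms\n"
    "mode_a,1,100\n"
    "mode_a,0,200\n"
    "mode_b,1,300\n"
    "mode_b,1,500\n"
)
summary = summarize_by_mode(csv.DictReader(sample))
print(summary)
```

To run this on real output, replace the `io.StringIO` object with an open file handle, for example `open("evaluation_results.csv", newline="")`.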



---



## 📁 Project Structure



```
math_verification_mvp/
├── backend/
│   ├── config.py             # Central Configuration
│   ├── core/                 # Core Logic Services (MVM² Modules)
│   │   ├── input_receiver.py
│   │   ├── ocr_service.py
│   │   ├── verification_service.py
│   │   ├── classifier_service.py
│   │   └── ...
│   ├── tests/                # Unit Tests
│   └── main.py               # FastAPI Entry Point
├── frontend/                 # Lightweight UI
├── datasets/                 # Evaluation Data & Results
├── scripts/                  # Evaluation & Benchmark Scripts
│   ├── run_evaluation.py
│   ├── run_benchmarks.py
│   └── quick_test.py
├── docs/                     # Documentation
└── requirements.txt          # Dependencies
```


