# MVM²: Multi-Modal Multi-Model Mathematical Reasoning Verification System

**VNR VJIET Major Project 2025**

**Team:** Brahma Teja, Vinith Kulkarni, Varshith Dharmaj V, Bhavitha Yaragorla

---

## Problem Statement

Validating the mathematical reasoning generated by Large Language Models (LLMs) is critical but challenging, especially when the input is multimodal (images of handwritten or printed text).

**Key Challenges:**

1. **Hallucinations:** LLMs often generate plausible-sounding but logically flawed steps.
2. **OCR Noise:** Extracting math from images introduces errors (e.g., confusing `5` with `S` or dropping integral signs) that downstream verifiers accept blindly.
3. **Lack of Formal Uncertainty:** Existing systems do not account for OCR confidence when making final validity judgments.

**MVM² Solution:** A unified pipeline that combines **OCR with formal uncertainty propagation**, **symbolic verification (SymPy)**, and **multi-agent LLM consensus** to verify mathematical solutions robustly.

---

## System Architecture

The system follows a modular, service-oriented architecture whose services live in the `backend/` directory:

| Service | Responsibility |
|---|---|
| **1. Input Receiver** | (`backend/input_receiver.py`) Validates text/image inputs via Pydantic models. |
| **2. Preprocessing** | (`backend/preprocessing_service.py`) Cleans images using OpenCV (denoising, binarization). |
| **3. OCR Service** | (`backend/ocr_service.py`) Hybrid engine combining Tesseract and specialized handwriting models. **Calculates the OCR confidence ($C_{ocr}$).** |
| **4. Representation** | (`backend/representation_service.py`) Normalizes inputs into a canonical, LaTeX-like Intermediate Representation (IR). |
| **5. Verification** | (`backend/verification_service.py`) Orchestrates **SymPy** for arithmetic checks and **multi-agent LLMs** (Solver, Critic, Verifier) for logic. |
| **6. Classification** | (`backend/classifier_service.py`) Aggregates scores using the **MVM² Hybrid Formula**. |
| **7. Reporting** | (`backend/reporting_service.py`) Generates detailed JSON/HTML reports for the user. |
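
To illustrate what a step-level arithmetic check in the Verification service does, here is a dependency-free sketch. It is a stand-in for SymPy, not the project's code: it evaluates both sides of a claimed step such as `3*(4+2) = 18` with exact rational arithmetic from Python's standard `ast` and `fractions` modules. Unlike SymPy it handles numeric expressions only (no symbols). The function name `check_arithmetic_step` and the plain `lhs = rhs` step format are assumptions for illustration, not the interface of `backend/verification_service.py`.

```python
import ast
import operator
from fractions import Fraction

# Operators this sketch accepts; any other syntax is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node: ast.AST) -> Fraction:
    """Recursively evaluate a parsed arithmetic expression with exact fractions."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)) \
            and not isinstance(node.value, bool):
        # Going through str() keeps 0.1 as exactly 1/10 instead of its binary float value.
        return Fraction(str(node.value))
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        left, right = _evaluate(node.left), _evaluate(node.right)
        if isinstance(node.op, ast.Pow) and right.denominator != 1:
            raise ValueError("only integer exponents are supported")
        return _BIN_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError(f"unsupported expression element: {ast.dump(node)}")


def check_arithmetic_step(step: str) -> bool:
    """Return True if both sides of an 'lhs = rhs' step are exactly equal.

    Raises ValueError for malformed steps and ZeroDivisionError for division by zero.
    """
    lhs, sep, rhs = step.partition("=")
    if not sep or "=" in rhs:
        raise ValueError(f"expected exactly one '=' in step: {step!r}")
    left = _evaluate(ast.parse(lhs.strip(), mode="eval"))
    right = _evaluate(ast.parse(rhs.strip(), mode="eval"))
    return left == right


print(check_arithmetic_step("3*(4+2) = 18"))     # True
print(check_arithmetic_step("0.1 + 0.2 = 0.3"))  # True (exact, no float rounding)
print(check_arithmetic_step("7*8 = 54"))         # False
```

Evaluating through a whitelist of AST nodes (rather than `eval`) keeps untrusted OCR output from executing arbitrary code, and exact fractions avoid false "mismatch" verdicts caused by floating-point rounding.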

---

## Key Innovations

### 1. OCR-Aware Confidence Propagation

Unlike standard pipelines that treat OCR text as ground truth, MVM² formally propagates visual uncertainty into the final confidence score ($C_{final}$):

$$
C_{final} = S_{weighted} \times (0.9 + 0.1 \times C_{ocr})
$$

If $C_{ocr}$ is normalized to $[0, 1]$, the OCR factor ranges from 0.9 for a completely ambiguous image to 1.0 for fully confident OCR. A verification result derived from a noisy image is therefore discounted by up to 10%, so it cannot receive the same confidence as an otherwise identical result from a clean image.
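
A minimal sketch of this adjustment, assuming both $S_{weighted}$ and $C_{ocr}$ are normalized to $[0, 1]$. The function name `final_confidence` and the range check are illustrative additions, not taken from the project's source.

```python
def final_confidence(s_weighted: float, c_ocr: float) -> float:
    """C_final = S_weighted * (0.9 + 0.1 * C_ocr), assuming both inputs lie in [0, 1]."""
    for name, value in (("s_weighted", s_weighted), ("c_ocr", c_ocr)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # The OCR factor ranges from 0.9 (unreadable input) to 1.0 (perfect OCR).
    return s_weighted * (0.9 + 0.1 * c_ocr)


# Same verification score, clean scan vs. noisy photo.
print(final_confidence(0.85, 1.0))
print(final_confidence(0.85, 0.2))
```

With $S_{weighted} = 0.85$, a clean scan keeps the full 0.85, while a noisy image with $C_{ocr} = 0.2$ gives $0.85 \times 0.92 = 0.782$.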

### 2. Step-Level Multi-Agent Consensus

We deploy a **Multi-Agent System** (Solver, Critic, Verifier) to analyze each solution step, and we compute a **Hallucination Rate** from how often the agents agree on each step:

- **Agreement:** increases confidence in the step.
- **Disagreement:** flags the step as a potential hallucination.
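
The exact consensus rule is not specified here, so the following is only a sketch under one plausible assumption. Each agent returns a per-step verdict (`True` means the step is valid). A step counts as agreed only if every agent gives the same verdict, and the hallucination rate is the fraction of steps on which the agents disagree. The names `step_consensus` and `hallucination_rate` are hypothetical.

```python
def step_consensus(verdicts: dict[str, list[bool]]) -> list[bool]:
    """For each step, return True if all agents gave the same verdict."""
    if not verdicts:
        raise ValueError("at least one agent verdict list is required")
    per_agent = list(verdicts.values())
    num_steps = len(per_agent[0])
    if any(len(v) != num_steps for v in per_agent):
        raise ValueError("every agent must judge the same number of steps")
    # A set of size 1 means every agent returned the same verdict for step i.
    return [len({agent[i] for agent in per_agent}) == 1 for i in range(num_steps)]


def hallucination_rate(verdicts: dict[str, list[bool]]) -> float:
    """Fraction of steps on which the agents disagree (potential hallucinations)."""
    agreed = step_consensus(verdicts)
    if not agreed:
        return 0.0
    return sum(not a for a in agreed) / len(agreed)


verdicts = {
    "solver":   [True, True, True, True],
    "critic":   [True, False, True, True],
    "verifier": [True, True, True, False],
}
print(step_consensus(verdicts))      # [True, False, True, False]
print(hallucination_rate(verdicts))  # 0.5
```

Requiring unanimity is the strictest choice; a majority-vote rule would flag fewer steps but could hide a single dissenting agent that spotted a real error.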

### 3. Hybrid Scoring Mechanism

The final validity score ($S_{weighted}$) is a weighted ensemble of three distinct signals:

- **Symbolic Score ($\alpha=0.40$):** SymPy's formal verification of the arithmetic.
- **Logical Score ($\beta=0.35$):** LLM consensus on the reasoning flow.
- **Classifier Score ($\gamma=0.25$):** Rule-based patterns (e.g., detecting uncertainty keywords).

$$
S_{weighted} = \alpha \cdot S_{sym} + \beta \cdot S_{log} + \gamma \cdot S_{clf} = 0.40 \cdot S_{sym} + 0.35 \cdot S_{log} + 0.25 \cdot S_{clf}
$$

Because the weights sum to 1, $S_{weighted}$ always lies within the range of its component scores.
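
The formula maps directly to code. This sketch assumes each component score is already normalized to $[0, 1]$, which is not stated explicitly above; the constant and function names are illustrative rather than taken from `backend/classifier_service.py`.

```python
# Weights from the hybrid formula: alpha (symbolic), beta (logical), gamma (classifier).
ALPHA, BETA, GAMMA = 0.40, 0.35, 0.25


def weighted_score(s_sym: float, s_log: float, s_clf: float) -> float:
    """S_weighted = 0.40 * S_sym + 0.35 * S_log + 0.25 * S_clf."""
    return ALPHA * s_sym + BETA * s_log + GAMMA * s_clf


# SymPy fully confirms the arithmetic, but the other signals are less certain.
print(weighted_score(1.0, 0.8, 0.6))
```

For example, a solution whose arithmetic SymPy fully confirms ($S_{sym} = 1.0$) but whose reasoning the agents only partly endorse ($S_{log} = 0.8$, $S_{clf} = 0.6$) scores $0.40 + 0.28 + 0.15 = 0.83$.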

---

## Getting Started

### Prerequisites

- Python 3.10+
- Tesseract OCR installed ([installation instructions](https://github.com/tesseract-ocr/tesseract))
- A Google Gemini API key

### Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/mvm2.git
   cd mvm2
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set the API key:

   ```powershell
   # Windows PowerShell
   $env:GEMINI_API_KEY="your_api_key_here"
   ```
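
On macOS or Linux, the equivalent of step 3 in a bash or zsh session is shown below. This is standard shell syntax rather than anything specific to MVM², and it sets the variable only for the current shell session and the processes started from it.

```bash
# macOS / Linux (bash, zsh)
export GEMINI_API_KEY="your_api_key_here"
```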

### Running the System

**1. Backend API (FastAPI)**

```bash
python backend/main.py
# Server runs at http://localhost:8000
```

**2. Frontend Interface**

Open `frontend/index.html` in your web browser. No build step is required for this lightweight UI.

**3. Docker Deployment**

MVM² is container-ready; a full `docker-compose` setup is provided:

```bash
docker-compose up --build -d
```

- The backend API is available at `http://localhost:8000`.
- The frontend UI is available at `http://localhost:8080`.

---

## Experiments & Evaluation

We provide a custom evaluation suite to reproduce our ablation studies.

### 1. Dataset

The evaluation uses `datasets/sample_data.json`; you can add your own samples to this file.

### 2. Running Ablation Modes

The `scripts/run_evaluation.py` script automatically compares four system configurations:

| Mode | Description | Hypothesis |
|---|---|---|
| `single_llm_only` | Baseline (1 agent) | High hallucination rate, low accuracy. |
| `llm_plus_sympy` | Hybrid (1 agent + SymPy) | Better arithmetic, but still hallucinates logic. |
| `multi_agent_no_ocr_conf` | Multi-agent consensus | Low hallucination, but overconfident on noisy images. |
| **`full_mvm2`** | **Complete system** | **Highest reliability and calibrated confidence.** |

**Command:**

```bash
python scripts/run_evaluation.py
```

### 3. Results

Outputs are saved to `evaluation_results.csv`, which contains:

- Accuracy (exact match)
- Hallucination rate
- Latency (ms)
- Verdicts
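
The column layout of `evaluation_results.csv` is not documented here, so the sketch below assumes hypothetical per-sample columns `mode`, `predicted`, `expected`, `hallucination_rate`, and `latency_ms`. It shows how exact-match accuracy, hallucination rate, and mean latency could be aggregated per ablation mode with the standard library. The rows in `sample_csv` are made-up toy values with placeholder mode names, not results from this project.

```python
import csv
import io
from collections import defaultdict
from collections.abc import Iterable


def summarize(rows: Iterable[dict[str, str]]) -> dict[str, dict[str, float]]:
    """Aggregate per-sample rows into per-mode metrics (column names are hypothetical)."""
    by_mode: dict[str, list[dict[str, str]]] = defaultdict(list)
    for row in rows:
        by_mode[row["mode"]].append(row)
    summary = {}
    for mode, samples in by_mode.items():
        n = len(samples)
        summary[mode] = {
            # Exact match: predicted answer equals the reference after trimming whitespace.
            "accuracy": sum(s["predicted"].strip() == s["expected"].strip() for s in samples) / n,
            "hallucination_rate": sum(float(s["hallucination_rate"]) for s in samples) / n,
            "mean_latency_ms": sum(float(s["latency_ms"]) for s in samples) / n,
        }
    return summary


# Toy rows for illustration only.
sample_csv = """mode,predicted,expected,hallucination_rate,latency_ms
mode_a,12,12,0.5,800
mode_a,7,9,0.25,900
mode_b,12,12,0.0,1500
mode_b,9,9,0.0,1700
"""
for mode, metrics in summarize(csv.DictReader(io.StringIO(sample_csv))).items():
    print(mode, metrics)
```

Against the real file, the same function would be called as `summarize(csv.DictReader(open("evaluation_results.csv", newline="")))`, once the column names are adjusted to match the actual CSV header.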

---

## Project Structure

```
math_verification_mvp/
├── backend/
│   ├── config.py           # Central Configuration
│   ├── core/               # Core Logic Services (MVM² Modules)
│   │   ├── input_receiver.py
│   │   ├── ocr_service.py
│   │   ├── verification_service.py
│   │   ├── classifier_service.py
│   │   └── ...
│   ├── tests/              # Unit Tests
│   └── main.py             # FastAPI Entry Point
├── frontend/               # Lightweight UI
├── datasets/               # Evaluation Data & Results
├── scripts/                # Evaluation & Benchmark Scripts
│   ├── run_evaluation.py
│   ├── run_benchmarks.py
│   └── quick_test.py
├── docs/                   # Documentation
└── requirements.txt        # Dependencies
```