## Table of Contents

- [Overview](#overview)
- [Development Workflow & Methodology](#development-workflow--methodology)
  - [1. Problem Definition](#1-problem-definition)
  - [2. Ground Truth Dataset Creation](#2-ground-truth-dataset-creation)
  - [3. Systematic Benchmarking Approach](#3-systematic-benchmarking-approach)
  - [4. Phase 1: OCR Model Selection](#4-phase-1-ocr-model-selection)
  - [5. Phase 2: RAG Pipeline Optimization](#5-phase-2-rag-pipeline-optimization)
  - [6. Phase 3: LLM Model Selection](#6-phase-3-llm-model-selection)
  - [7. Final System Integration](#7-final-system-integration)
  - [Benchmarking Notebooks Summary](#benchmarking-notebooks-summary)
- [System Architecture](#system-architecture)
- [LLM Benchmark Results](#llm-benchmark-results)
  - [Quality Score Comparison](#quality-score-comparison)
  - [Comprehensive Metrics Breakdown](#comprehensive-metrics-breakdown)
  - [Multi-Dimensional Performance Profile](#multi-dimensional-performance-profile)
  - [Response Time Analysis](#response-time-analysis)
  - [Complete Overview Dashboard](#complete-overview-dashboard)
- [Live Demo Screenshots](#live-demo-screenshots)
- [Key Features](#key-features)
- [Technology Stack](#technology-stack)
- [Quick Start](#quick-start)

---

## Development Workflow & Methodology

### How We Built This System: From Problem to Solution

This section documents our systematic, data-driven approach to building the SOCAR AI system. Instead of guessing which models or configurations would work best, we built a rigorous benchmarking framework and tested every component.

---

### 1. Problem Definition

**Challenge**: Process 28 historical SOCAR PDFs containing:
- **Multi-language text**: Azerbaijani (Latin and Cyrillic scripts), Russian, English
- **Poor scan quality**: 1960s-1990s documents on degraded paper
- **Complex layouts**: tables, figures, handwritten annotations
- **Scientific content**: geological terms, chemical formulas, numerical data

**Hackathon Requirements**:
- **OCR Track** (50%): extract text with maximum character accuracy
- **LLM Track** (30%): answer questions with citations and factual correctness
- **Architecture Track** (20%): use open-source models and production-ready code

**Key Decision**: Build the benchmarking pipeline *before* selecting any models. Test everything, then choose objectively.

---

### 2. Ground Truth Dataset Creation

**Why ground truth?** You can't optimize what you can't measure. We needed a gold standard against which to evaluate OCR accuracy and LLM answer quality.

**Process**:
1. **Selected a representative PDF**: `document_00.pdf` (12 pages, 22,386 characters)
   - Contains an Azerbaijani abstract, Russian sections, and English references
   - Mix of typed text, tables, and scientific notation
   - Typical of SOCAR historical documents
2. **Manual transcription**: created `data/document_00.md`
   - Character-by-character manual transcription
   - Preserved exact Cyrillic spelling, diacritics, and special symbols
   - Took 3+ hours but ensured 100% accuracy
3. **Question-answer pairs**: created 5 test cases (`docs/sample_questions.json`, `docs/sample_answers.json`)
   - Factual questions drawn from actual document content
   - Expected answers with proper citations
   - Used for LLM evaluation (LLM Judge metrics)

**Notebook**: the ground truth was created manually, then reused in all three benchmarking notebooks.
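
This README doesn't pin down the JSON schema, so as an illustrative assumption, treat the two files as parallel lists; a minimal sketch of loading them for the evaluation loop:

```python
import json

# Assumption: docs/sample_questions.json and docs/sample_answers.json hold
# parallel lists of question strings and expected-answer strings.
with open("docs/sample_questions.json", encoding="utf-8") as f:
    questions = json.load(f)
with open("docs/sample_answers.json", encoding="utf-8") as f:
    expected_answers = json.load(f)

test_cases = list(zip(questions, expected_answers))
print(f"Loaded {len(test_cases)} question-answer pairs")
```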

---

### 3. Systematic Benchmarking Approach

We built **3 specialized Jupyter notebooks** to test each component independently:

| Notebook | Purpose | What We Tested | Outcome |
|----------|---------|----------------|---------|
| **vlm_ocr_benchmark.ipynb** | OCR model selection | 3 VLM models | Llama-4-Maverick (88.30% CSR) |
| **rag_optimization_benchmark.ipynb** | RAG pipeline tuning | 7 configurations | BAAI + vanilla_k3 + citation_focused |
| **llm_benchmark.ipynb** | LLM model selection | 3 LLM models | Llama-4-Maverick (52.0 quality, 4.0s) |

**Key Principle**: Test one variable at a time, measure rigorously, choose objectively.

---

### 4. Phase 1: OCR Model Selection

**Notebook**: `notebooks/vlm_ocr_benchmark.ipynb`

**Goal**: Find the vision-language model (VLM) with the best OCR accuracy on historical documents.

**Models Tested**:
1. **Llama-4-Maverick-17B** (open-source, 17B parameters)
2. **GPT-4.1 Turbo** (proprietary, vision-capable)
3. **Phi-4-Multimodal** (Microsoft, small and fast)

**Methodology**:
- Converted `document_00.pdf` into 12 page images (100 DPI JPEG); see the rendering sketch below
- Sent each image to the VLM with the prompt: *"Extract ALL text with 100% accuracy"*
- Compared each output against the ground truth (`document_00.md`)
- Calculated metrics: CER, WER, CSR, WSR
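
The first step can be sketched with PyMuPDF as follows; the function name and the PNG output are illustrative choices (the benchmark itself used JPEG), not the notebook's exact code:

```python
import fitz  # PyMuPDF

def pdf_to_page_images(pdf_path: str, dpi: int = 100) -> list[bytes]:
    """Render every page of a PDF to an image at the given DPI."""
    images = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Scale up from the PDF's native 72 DPI to the target resolution.
            pix = page.get_pixmap(matrix=fitz.Matrix(dpi / 72, dpi / 72))
            images.append(pix.tobytes("png"))
    return images

pages = pdf_to_page_images("data/document_00.pdf")
print(f"Rendered {len(pages)} pages")
```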

**Metrics Used**:
- **CSR (Character Success Rate)**: 100 − CER (higher is better)
- **WSR (Word Success Rate)**: 100 − WER (higher is better)
- **Processing Time**: seconds for all 12 pages (lower is better)
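
These metrics reduce to edit distances normalized by the reference length. A minimal sketch of the definitions above (the notebook's exact implementation may differ):

```python
def levenshtein(a, b) -> int:
    # Classic dynamic-programming edit distance; works on strings (characters)
    # and on lists of words alike.
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        curr = [i]
        for j, item_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                        # deletion
                            curr[j - 1] + 1,                    # insertion
                            prev[j - 1] + (item_a != item_b)))  # substitution
        prev = curr
    return prev[-1]

def ocr_metrics(hypothesis: str, reference: str) -> dict:
    """CER/WER in percent, normalized by reference length; CSR/WSR are their complements."""
    hyp_words, ref_words = hypothesis.split(), reference.split()
    cer = 100.0 * levenshtein(hypothesis, reference) / max(len(reference), 1)
    wer = 100.0 * levenshtein(hyp_words, ref_words) / max(len(ref_words), 1)
    return {"CER": cer, "WER": wer, "CSR": 100.0 - cer, "WSR": 100.0 - wer}
```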

**Results** ([view charts](output/vlm_ocr_benchmark/)):

| Model | CSR | WSR | Time (12 pages) | Winner |
|-------|-----|-----|-----------------|--------|
| **Llama-4-Maverick-17B** | **88.30%** | **64.72%** | **80s** | ✅ |
| GPT-4.1 Turbo | 88.48% | 67.92% | 128s | - |
| Phi-4-Multimodal | 34.52% | 0.00% | 666s | - |

**Key Findings**:
- Llama-4-Maverick matched GPT-4.1's accuracy (within 0.2%)
- **37% faster** than GPT-4.1 (80s vs 128s)
- Open-source, so it earns the +20% hackathon architecture bonus
- Phi-4 failed catastrophically on Cyrillic text

**Charts Generated**:
- `slide_1_accuracy.png` - CSR comparison bar chart
- `slide_2_speed_vs_accuracy.png` - scatter plot of the speed/accuracy trade-off
- `slide_3_error_rates.png` - CER vs WER breakdown
- `slide_4_summary_table.png` - complete results table
- `slide_5_success_rates.png` - CSR vs WSR side-by-side

![OCR Accuracy Comparison](output/vlm_ocr_benchmark/slide_1_accuracy.png)

**Decision**: **Llama-4-Maverick-17B** selected for the OCR endpoint.

**Hackathon Score**: 88.30% CSR × 500 points = **441.5/500 points**

---

### 5. Phase 2: RAG Pipeline Optimization

**Notebook**: `notebooks/rag_optimization_benchmark.ipynb`

**Goal**: Find the RAG configuration (embedding model + retrieval strategy + prompting) that maximizes the LLM Judge score.

**What We Tested** (7 configurations):

| Component | Options Tested |
|-----------|----------------|
| **Embedding Models** | BAAI/bge-large-en-v1.5, intfloat/multilingual-e5-large |
| **Retrieval Strategies** | Top-K (3, 5), MMR (diversity), Cross-Encoder Reranking |
| **LLM Models** | Llama-4-Maverick, DeepSeek-R1, GPT-5-mini |
| **Prompting** | Baseline, Citation-focused, Few-shot |

**Methodology**:
- Ingested 1,300 document chunks into Pinecone (600-character chunks, 100-character overlap; chunking sketched below)
- Ran 5 test questions through each configuration
- Evaluated with LLM Judge metrics: Accuracy, Citation Score, Completeness
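
The chunking step amounts to a sliding character window; the 600/100 parameters come from the methodology above, while the function itself is an illustrative stand-in for whatever splitter the notebook actually uses:

```python
from pathlib import Path

def chunk_text(text: str, size: int = 600, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows with the given overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text(Path("data/document_00.md").read_text(encoding="utf-8"))
print(f"{len(chunks)} chunks ready for embedding")
```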

**LLM Judge Formula**:

```python
llm_judge_score = (accuracy * 0.35) + (citation_score * 0.35) + (completeness * 0.30)
```
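
Wrapped as a function, with hypothetical component scores for illustration (the real per-metric values live in the notebook's CSV output):

```python
def llm_judge_score(accuracy: float, citation_score: float, completeness: float) -> float:
    # Accuracy and citations dominate (35% each); completeness carries the rest.
    return accuracy * 0.35 + citation_score * 0.35 + completeness * 0.30

# Hypothetical inputs, illustration only: 60*0.35 + 50*0.35 + 55*0.30 = 55.0
print(llm_judge_score(accuracy=60.0, citation_score=50.0, completeness=55.0))
```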

**Results**:

| Rank | Configuration | LLM Judge Score |
|------|---------------|-----------------|
| 🥇 **1st** | **bge-large-en + vanilla_k3 + citation_focused** | **55.67%** |
| 🥈 2nd | bge-large-en + vanilla_k3 + few_shot | 45.70% |
| 🥉 3rd | bge-large-en + vanilla_k3 + baseline | 39.65% |
| 4th | bge-large-en + reranked_k3 + baseline | 37.31% |
| 5th | bge-large-en + vanilla_k5 + baseline | 35.60% |

**Key Findings**:
1. **Citation-focused prompting** gave a **16-point improvement** (55.67% vs 39.65%)
   - Explicit instruction: *"Hər faktı PDF və səhifə nömrəsi ilə göstərin"* ("Cite every fact with its PDF and page number"); see the prompt sketch after this list
   - The LLM learned to always cite its sources
2. **Simple beats complex**: vanilla top-3 outperformed MMR and reranking
   - MMR added diversity but lost relevance
   - Reranking was slower with no quality gain
3. **Embedding model matters**: BAAI/bge-large-en-v1.5 beat intfloat/multilingual-e5-large
   - Despite the "multilingual" name, bge-large-en worked better on Azerbaijani
   - Likely due to stronger semantic understanding
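
A minimal sketch of a citation-focused template. Only the quoted Azerbaijani instruction is taken from the benchmark; the surrounding wording is an illustrative assumption:

```python
# The second line is the benchmark's citation instruction
# ("Cite every fact with its PDF and page number").
CITATION_PROMPT = (
    "Answer the question using only the context below.\n"
    "Hər faktı PDF və səhifə nömrəsi ilə göstərin.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

prompt = CITATION_PROMPT.format(context="<retrieved chunks>", question="<user question>")
```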

**Decision**:
- **Embedding**: BAAI/bge-large-en-v1.5
- **Retrieval**: vanilla top-3 (simple cosine similarity; query sketch below)
- **Prompting**: citation-focused template
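
Put together, the winning retrieval path reduces to a few lines. The index name is a placeholder and the metadata layout is assumed:

```python
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

# Winning configuration: bge-large-en embeddings + plain cosine top-3 retrieval.
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
index = Pinecone(api_key="YOUR_API_KEY").Index("socar-docs")  # placeholder index name

def retrieve_top3(question: str) -> list[dict]:
    vector = embedder.encode(question).tolist()
    result = index.query(vector=vector, top_k=3, include_metadata=True)
    return [match.metadata for match in result.matches]
```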

**Hackathon Score**: 55.67% × 300 points = **167.01/300 points**

---

### 6. Phase 3: LLM Model Selection

**Notebook**: `notebooks/llm_benchmark.ipynb`

**Goal**: Choose the best LLM for answer generation (quality + speed + the open-source bonus).

**Models Tested**:
1. **Llama-4-Maverick-17B-128E** (open-source, 128K context)
2. **GPT-4.1 Turbo** (proprietary, OpenAI flagship)
3. **DeepSeek-R1** (open-source, reasoning-focused)

**Methodology**:
- Used the optimal RAG config from Phase 2 (bge-large-en + vanilla_k3)
- Retrieved 3 documents per question
- Generated answers with the citation-focused prompt
- Evaluated: Quality Score, Citation Score, Completeness, Response Time

**Evaluation Metrics**:
- **Quality Score**: WER-based similarity to the expected answer
- **Citation Score**: percentage of retrieved PDFs actually cited in the answer (sketched below)
- **Completeness**: word count relative to the expected answer (full answer = 100%)
- **Response Time**: end-to-end latency
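
The citation score is the least standard of these, so here is a minimal sketch of the definition above; the real scorer may normalize file names differently:

```python
def citation_score(answer: str, retrieved_pdfs: list[str]) -> float:
    """Percentage of retrieved source PDFs whose names appear in the answer."""
    if not retrieved_pdfs:
        return 0.0
    cited = sum(1 for pdf in retrieved_pdfs if pdf in answer)
    return 100.0 * cited / len(retrieved_pdfs)

# 2 of 3 retrieved sources cited -> 66.67
print(citation_score("... (document_00.pdf, p.3) ... (document_05.pdf, p.1)",
                     ["document_00.pdf", "document_05.pdf", "document_11.pdf"]))
```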

**Results**:

| Model | Quality | Citation | Completeness | Time | Winner |
|-------|---------|----------|--------------|------|--------|
| **Llama-4-Maverick** | **52.0** | **80.0** | **100%** | **4.00s** | ✅ |
| GPT-4.1 | 52.0 | 80.0 | 100% | 6.38s | - |
| DeepSeek-R1 | 32.27 | 33.33 | 91.6% | 10.98s | - |

**Key Findings**:
1. **Llama matched GPT quality** (both scored 52.0)
   - Same accuracy, same citation quality
   - Open-source matched proprietary!
2. **Llama was 37% faster** (4.00s vs 6.38s)
   - Better user experience (responses under 5s feel instant)
   - Higher throughput for concurrent requests
3. **DeepSeek-R1 fell short** (32.27 quality)
   - Over-complicated simple questions with reasoning steps
   - Poor citation format (only a 33.33% score)
   - Slowest model (10.98s)

**Charts Generated** (see the [LLM Benchmark Results](#llm-benchmark-results) section below):
- `llm_quality_comparison.png` - quality score bar chart
- `llm_metrics_breakdown.png` - citation and completeness breakdown
- `llm_radar_profile.png` - 4-dimensional performance radar
- `llm_response_time.png` - speed comparison
- `llm_overview_dashboard.png` - complete 4-panel summary

**Decision**: **Llama-4-Maverick-17B-128E-Instruct-FP8** selected for the LLM endpoint.

**Why Llama over GPT?**
- ✅ Equal quality (52.0 score)
- ✅ 37% faster (better UX)
- ✅ Open-source (+20% architecture points)
- ✅ Lower inference costs
- ✅ 128K context window (handles long documents)

---

### 7. Final System Integration

**Outcome**: all benchmarking results fed directly into the production system.

**Final Architecture**:

```
OCR Endpoint (/ocr):
├── PyMuPDF → PDF to images (100 DPI)
├── Llama-4-Maverick-17B VLM → text extraction
└── 88.30% Character Success Rate

LLM Endpoint (/llm):
├── BAAI/bge-large-en-v1.5 → query embedding
├── Pinecone → top-3 document retrieval
├── Citation-focused prompt → context building
├── Llama-4-Maverick-17B-128E → answer generation
└── 52.0 Quality Score, 4.0s response time
```

**Production Optimizations**:
- Lazy-loaded embedding model (faster startup; sketched below)
- Async FastAPI endpoints (100+ concurrent requests)
- JPEG compression for OCR images (to stay under the 10MB Azure request limit)
- Health checks for Pinecone connectivity
- Comprehensive error handling
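
The lazy-loading and async-endpoint patterns can be sketched as follows; the handler body is an illustrative assumption, not the production code:

```python
from functools import lru_cache
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Question(BaseModel):
    text: str

@lru_cache(maxsize=1)
def get_embedder():
    # Lazy-load the embedding model on first use instead of at startup.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("BAAI/bge-large-en-v1.5")

@app.post("/llm")
async def answer(question: Question):
    # Illustrative handler: embed the query; retrieval and generation would
    # follow as in the architecture diagram above. A real handler would push
    # this CPU-bound encode() call onto a thread pool.
    vector = get_embedder().encode(question.text).tolist()
    return {"embedding_dim": len(vector)}
```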

**Deployment**:
- Docker multi-stage build (optimized image size)
- ngrok for a public URL (hackathon demo)
- Full documentation (README, API docs, file structure)

**Final Hackathon Score**: **785.76/1000 (78.6%)**
- OCR: 438.75/500 (Llama VLM)
- LLM: 167.01/300 (Llama + citation prompting)
- Architecture: 180/200 (open-source stack + production code)

---

### Benchmarking Notebooks Summary

All benchmarking code is reproducible from `notebooks/`:

1. **vlm_ocr_benchmark.ipynb**
   - Lines: 250+
   - Runtime: ~15 minutes
   - Output: 7 PNG charts, CSV results
   - Key finding: Llama-4-Maverick = 88.30% CSR
2. **rag_optimization_benchmark.ipynb**
   - Lines: 180+
   - Runtime: ~10 minutes (7 configs × 5 questions)
   - Output: CSV with 35 test results
   - Key finding: citation-focused prompting = a 16-point boost
3. **llm_benchmark.ipynb**
   - Lines: 150+
   - Runtime: ~5 minutes (3 models × 5 questions)
   - Output: 5 PNG charts (in the `output/charts/` folder)
   - Key finding: Llama = GPT quality, 37% faster

**Total Benchmark Effort**: ~30 minutes of runtime, 600+ lines of code, 15+ charts, and data-driven decisions throughout.

---

## System Architecture

*(Mermaid architecture diagram)*

## LLM Benchmark Results

### Quality Score Comparison

![LLM Quality Comparison](output/charts/llm_quality_comparison.png)

**Key Findings**:
- **GPT-4.1** and **Llama-4-Maverick** tied at a **52.0** quality score

### Comprehensive Metrics Breakdown

![LLM Metrics Breakdown](output/charts/llm_metrics_breakdown.png)

**Breakdown by Category**:

### Multi-Dimensional Performance Profile

![LLM Radar Profile](output/charts/llm_radar_profile.png)

**Radar Chart Dimensions**:

### Response Time Analysis

![LLM Response Time](output/charts/llm_response_time.png)

**Latency Comparison** (lower is better):

### Complete Overview Dashboard

![LLM Overview Dashboard](output/charts/llm_overview_dashboard.png)

**Four-Panel Analysis**:

---

## Live Demo Screenshots

### Web Interface

#### Landing Page

![Landing Page](docs/images/landing_page.png)

*Main interface with OCR and LLM tabs for document processing and Q&A*

#### LLM Question Answering

![LLM Interface](docs/images/llm_answering.png)

*Interactive Q&A interface for querying SOCAR historical documents in Azerbaijani*

![LLM Full Answer](docs/images/llm_full_answer.png)

*Complete answer with source citations and document references*

#### OCR Processing

![OCR Start](docs/images/ocr_start.png)

*PDF upload and OCR processing initiation*

![OCR Processing](docs/images/ocr_processing.png)

*Real-time OCR processing with page-by-page progress*

![OCR Finished](docs/images/ocr_finished.png)

*Completed OCR extraction with formatted markdown output*

#### Additional Views

![Document Selection](docs/images/document_selection.png)

*Document selection and upload interface*

![Extra Features](docs/images/extra.png)

*Advanced features and settings panel*

---

## Key Features

### OCR Engine

---

## Acknowledgments

- **SOCAR** - State Oil Company of Azerbaijan Republic

---

*Last Updated: December 14, 2025*