IsmatS committed on
Commit
fd93806
·
1 Parent(s): ba54c7e
.env.example CHANGED
@@ -1,11 +1,13 @@
1
  # Azure OpenAI Configuration
2
  AZURE_OPENAI_API_KEY=your_azure_openai_api_key_here
3
- AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
4
  AZURE_OPENAI_API_VERSION=2024-08-01-preview
5
 
6
- # Azure OpenAI Embedding Configuration (for /llm endpoint)
7
- # IMPORTANT: Deploy text-embedding-3-small in Azure OpenAI Studio first!
8
- # See DEPLOYMENT_TROUBLESHOOTING.md for step-by-step guide
9
  AZURE_EMBEDDING_MODEL=text-embedding-3-small
10
  AZURE_EMBEDDING_DIMS=1024
11
 
@@ -13,16 +15,20 @@ AZURE_EMBEDDING_DIMS=1024
13
  AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT=https://your-resource.services.ai.azure.com/
14
  AZURE_DOCUMENT_INTELLIGENCE_KEY=your_document_intelligence_key_here
15
 
16
  # VM Configuration (Optional)
17
  VM_HOST=your-vm-host.cloudapp.azure.com
18
  VM_USER=hackathon
19
  VM_SSH_KEY=your_ssh_key
20
 
21
- # HuggingFace Resources (Optional)
22
  HUGGINGFACE_ORG=https://huggingface.co/SOCARAI
23
  DATASET_NAME=SOCARAI/ai_track_data
24
 
25
- # GitHub Code Samples (Optional)
26
  CODE_SAMPLES_REPO=https://github.com/neaorin/foundry-models-samples
27
 
28
  # Application Configuration
@@ -37,7 +43,7 @@ PROCESSED_DIR=./data/processed
37
  LLM_MODEL=Llama-4-Maverick-17B-128E-Instruct-FP8
38
 
39
  # Pinecone Configuration (Cloud Vector Database)
40
- PINECONE_API_KEY=<redacted>
41
  PINECONE_INDEX_NAME=hackathon
42
  PINECONE_CLOUD=aws
43
  PINECONE_REGION=us-east-1
@@ -48,20 +54,7 @@ API_HOST=0.0.0.0
48
  API_PORT=8000
49
 
50
  # OCR Configuration
51
- OCR_MAX_PAGES=0 # 0 = unlimited pages. Set to limit if on constrained hosting (e.g., 5 for 512MB)
52
-
53
- # Production SSL/Security Configuration
54
- # Set these for production deployment (see docs/markdowns/SSL_CAA_SETUP.md)
55
- PRODUCTION=false
56
- HTTPS_ONLY=false
57
-
58
- # Domain configuration - comma-separated list of allowed hosts
59
- # Example: TRUSTED_HOSTS=yourdomain.com,www.yourdomain.com
60
- TRUSTED_HOSTS=*
61
-
62
- # CORS Origins - comma-separated list of allowed origins
63
- # Example: ALLOWED_ORIGINS=https://yourdomain.com,https://www.yourdomain.com
64
- ALLOWED_ORIGINS=*
65
 
66
  # Disable telemetry and warnings
67
  TOKENIZERS_PARALLELISM=false
 
1
  # Azure OpenAI Configuration
2
  AZURE_OPENAI_API_KEY=your_azure_openai_api_key_here
3
+ AZURE_OPENAI_ENDPOINT=https://your-resource.services.ai.azure.com/
4
  AZURE_OPENAI_API_VERSION=2024-08-01-preview
5
 
6
+ # Azure OpenAI Embedding Configuration (separate resource for embeddings)
7
+ # IMPORTANT: If using a different Azure resource for embeddings, set these variables
8
+ # Otherwise, the main AZURE_OPENAI credentials will be used
9
+ AZURE_EMBEDDING_API_KEY=your_embedding_api_key_here
10
+ AZURE_EMBEDDING_ENDPOINT=https://your-embedding-resource.cognitiveservices.azure.com/
11
  AZURE_EMBEDDING_MODEL=text-embedding-3-small
12
  AZURE_EMBEDDING_DIMS=1024
13
 
 
15
  AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT=https://your-resource.services.ai.azure.com/
16
  AZURE_DOCUMENT_INTELLIGENCE_KEY=your_document_intelligence_key_here
17
 
18
+ # Azure AI Foundry Models
19
+ # Access to LLaMA and other models via Azure AI Foundry
20
+ # https://azure.microsoft.com/en-us/products/ai-foundry/models
21
+
22
  # VM Configuration (Optional)
23
  VM_HOST=your-vm-host.cloudapp.azure.com
24
  VM_USER=hackathon
25
  VM_SSH_KEY=your_ssh_key
26
 
27
+ # HuggingFace Resources
28
  HUGGINGFACE_ORG=https://huggingface.co/SOCARAI
29
  DATASET_NAME=SOCARAI/ai_track_data
30
 
31
+ # GitHub Code Samples
32
  CODE_SAMPLES_REPO=https://github.com/neaorin/foundry-models-samples
33
 
34
  # Application Configuration
 
43
  LLM_MODEL=Llama-4-Maverick-17B-128E-Instruct-FP8
44
 
45
  # Pinecone Configuration (Cloud Vector Database)
46
+ PINECONE_API_KEY=your_pinecone_api_key_here
47
  PINECONE_INDEX_NAME=hackathon
48
  PINECONE_CLOUD=aws
49
  PINECONE_REGION=us-east-1
 
54
  API_PORT=8000
55
 
56
  # OCR Configuration
57
+ OCR_MAX_PAGES=0 # 0 = unlimited pages (set to limit if needed)
58
 
59
  # Disable telemetry and warnings
60
  TOKENIZERS_PARALLELISM=false
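
The new comments in `.env.example` say that the `AZURE_EMBEDDING_*` variables are used when embeddings live on a separate Azure resource, and that the main `AZURE_OPENAI_*` credentials are used otherwise. A minimal sketch of that fallback follows. The function name `resolve_embedding_credentials`, the `EmbeddingCredentials` type, and the choice to treat empty strings as unset are assumptions for illustration, not code from this repository.

```python
from typing import Mapping, NamedTuple


class EmbeddingCredentials(NamedTuple):
    api_key: str
    endpoint: str


def resolve_embedding_credentials(env: Mapping[str, str]) -> EmbeddingCredentials:
    """Prefer AZURE_EMBEDDING_* values, falling back to AZURE_OPENAI_*.

    Hypothetical helper: each value falls back independently, and an
    empty string counts as "not set".
    """
    api_key = env.get("AZURE_EMBEDDING_API_KEY") or env.get("AZURE_OPENAI_API_KEY")
    endpoint = env.get("AZURE_EMBEDDING_ENDPOINT") or env.get("AZURE_OPENAI_ENDPOINT")
    if not api_key or not endpoint:
        raise ValueError("No Azure OpenAI credentials found for embeddings")
    return EmbeddingCredentials(api_key=api_key, endpoint=endpoint)
```

In an application this would typically be called as `resolve_embedding_credentials(os.environ)` after the `.env` file has been loaded.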
README.md CHANGED
@@ -317,9 +317,12 @@ We conducted comprehensive benchmarks to select the optimal language model for o
317
  - Python 3.11+
318
  - Azure OpenAI API key
319
  - Pinecone API key
320
- - Docker (optional)
321
 
322
- ### Installation
323
 
324
  1. **Clone the repository**:
325
  ```bash
@@ -342,37 +345,234 @@ Required variables:
342
  ```env
343
  AZURE_OPENAI_API_KEY=your_azure_key
344
  AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
345
  PINECONE_API_KEY=your_pinecone_key
346
  PINECONE_INDEX_NAME=hackathon
347
  ```
348
 
349
  4. **Ingest PDFs** (one-time setup):
350
  ```bash
351
- # Test with single PDF
352
- python scripts/ingest_pdfs.py test
353
 
354
- # Ingest all PDFs
355
- python scripts/ingest_pdfs.py
356
  ```
357
 
358
- 5. **Start the API**:
359
  ```bash
360
- cd app && uvicorn main:app --host 0.0.0.0 --port 8000
361
  ```
362
 
363
  6. **Access the system**:
364
- - API Docs: http://localhost:8000/docs
365
- - Web UI: http://localhost:8000
366
- - Health Check: http://localhost:8000/health
367
 
368
- ### Docker Deployment
369
 
 
370
  ```bash
371
- # Build image
372
- docker build -t socar-ai .
373
 
374
- # Run container
375
- docker run -p 8000:8000 --env-file .env socar-ai
376
  ```
377
 
378
  ---
 
317
  - Python 3.11+
318
  - Azure OpenAI API key
319
  - Pinecone API key
320
+ - Docker (optional, for containerized deployment)
321
+ - ngrok (optional, for public URL)
322
 
323
+ ---
324
+
325
+ ### Option 1: Local Development (Recommended for Development)
326
 
327
  1. **Clone the repository**:
328
  ```bash
 
345
  ```env
346
  AZURE_OPENAI_API_KEY=your_azure_key
347
  AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
348
+ AZURE_EMBEDDING_API_KEY=your_embedding_key # If using separate resource
349
+ AZURE_EMBEDDING_ENDPOINT=https://your-embedding.cognitiveservices.azure.com/
350
  PINECONE_API_KEY=your_pinecone_key
351
  PINECONE_INDEX_NAME=hackathon
352
  ```
353
 
354
  4. **Ingest PDFs** (one-time setup):
355
  ```bash
356
+ # Ingest all PDFs from hackathon_data folder (parallel processing)
357
+ python scripts/ingest_hackathon_data.py
358
 
359
+ # Check ingestion status
360
+ python scripts/check_pinecone.py
361
  ```
362
 
363
+ 5. **Start the API server**:
364
  ```bash
365
+ cd app && uvicorn main:app --reload --host 0.0.0.0 --port 8000
366
  ```
367
 
368
+ The `--reload` flag enables auto-reload on code changes (development mode).
369
+
370
  6. **Access the system**:
371
+ - **Web UI**: http://localhost:8000
372
+ - **API Docs**: http://localhost:8000/docs
373
+ - **Health Check**: http://localhost:8000/health
374
+ - **ngrok URL** (if using ngrok): See ngrok setup below
375
+
376
+ ---
377
+
378
+ ### Option 2: Docker (Recommended for Production)
379
+
380
+ #### Using Docker Compose (Easiest)
381
+
382
+ ```bash
383
+ # Build and start the container
384
+ docker-compose up --build
385
+
386
+ # Run in detached mode (background)
387
+ docker-compose up -d
388
+
389
+ # View logs
390
+ docker-compose logs -f
391
+
392
+ # Stop the container
393
+ docker-compose down
394
+ ```
395
+
396
+ Access at: http://localhost:8000
397
+
398
+ #### Using Docker CLI
399
+
400
+ ```bash
401
+ # Build the image
402
+ docker build -t socar-ai-system .
403
+
404
+ # Run the container
405
+ docker run -d \
406
+ --name socar-ai \
407
+ -p 8000:8000 \
408
+ --env-file .env \
409
+ --restart unless-stopped \
410
+ socar-ai-system
411
+
412
+ # View logs
413
+ docker logs -f socar-ai
414
+
415
+ # Stop the container
416
+ docker stop socar-ai
417
+
418
+ # Remove the container
419
+ docker rm socar-ai
420
+ ```
421
 
422
+ #### Docker Health Check
423
 
424
+ The container includes automatic health checks:
425
  ```bash
426
+ # Check container health
427
+ docker inspect --format='{{.State.Health.Status}}' socar-ai
428
+
429
+ # Manual health check
430
+ curl http://localhost:8000/health
431
+ ```
432
+
433
+ ---
434
+
435
+ ### Option 3: Public URL with ngrok (Optional)
436
+
437
+ Make your local server publicly accessible for demos, testing, or hackathon submissions.
438
+
439
+ #### Install ngrok
440
+
441
+ **macOS** (using Homebrew):
442
+ ```bash
443
+ brew install ngrok
444
+ ```
445
+
446
+ **Linux/Windows**: Download from https://ngrok.com/download
447
+
448
+ #### Setup ngrok Authentication (One-Time)
449
+
450
+ 1. Sign up at https://dashboard.ngrok.com/signup
451
+ 2. Get your auth token from https://dashboard.ngrok.com/get-started/your-authtoken
452
+ 3. Configure ngrok:
453
+ ```bash
454
+ ngrok config add-authtoken YOUR_AUTHTOKEN
455
+ ```
456
+
457
+ #### Start ngrok Tunnel
458
+
459
+ **Basic tunnel** (random URL):
460
+ ```bash
461
+ # Start ngrok tunnel to local port 8000
462
+ ngrok http 8000
463
+ ```
464
+
465
+ **Custom subdomain** (requires ngrok paid plan):
466
+ ```bash
467
+ ngrok http 8000 --subdomain=socar-hackathon
468
+ ```
469
 
470
+ **With specific region**:
471
+ ```bash
472
+ ngrok http 8000 --region=eu # Europe
473
+ ngrok http 8000 --region=us # United States
474
+ ```
475
+
476
+ #### ngrok Output Example
477
+
478
+ ```
479
+ ngrok
480
+
481
+ Session Status online
482
+ Account your-email@example.com
483
+ Version 3.0.0
484
+ Region United States (us)
485
+ Latency 45ms
486
+ Web Interface http://127.0.0.1:4040
487
+ Forwarding https://abc123.ngrok.io -> http://localhost:8000
488
+
489
+ Connections ttl opn rt1 rt5 p50 p90
490
+ 0 0 0.00 0.00 0.00 0.00
491
+ ```
492
+
493
+ Your **public URL** is: `https://abc123.ngrok.io`
494
+
495
+ #### ngrok Web Interface
496
+
497
+ Access http://127.0.0.1:4040 for:
498
+ - Request inspection
499
+ - Replay requests
500
+ - Traffic analysis
501
+ - Response details
502
+
503
+ #### Keep ngrok Running
504
+
505
+ **Using tmux** (recommended for servers):
506
+ ```bash
507
+ # Start tmux session
508
+ tmux new -s ngrok
509
+
510
+ # Inside tmux: start ngrok
511
+ ngrok http 8000
512
+
513
+ # Detach: Press Ctrl+B, then D
514
+ # Reattach: tmux attach -t ngrok
515
+ ```
516
+
517
+ **Using nohup**:
518
+ ```bash
519
+ nohup ngrok http 8000 > ngrok.log 2>&1 &
520
+
521
+ # View logs
522
+ tail -f ngrok.log
523
+
524
+ # Get ngrok URL
525
+ curl http://localhost:4040/api/tunnels | grep -o 'https://[^"]*ngrok.io'
526
+ ```
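
The `grep` pattern in the nohup example only matches hostnames ending in `ngrok.io`, so it prints nothing if the tunnel is served from another domain. As an alternative sketch, the JSON returned by the local agent API can be parsed directly. This assumes the response has a `tunnels` list whose entries carry a `public_url` field. The helper names `extract_public_url` and `fetch_public_url` are hypothetical.

```python
import json
from typing import Optional
from urllib.request import urlopen


def extract_public_url(payload: dict) -> Optional[str]:
    """Return the first HTTPS public_url found in an /api/tunnels payload."""
    for tunnel in payload.get("tunnels", []):
        url = tunnel.get("public_url", "")
        if url.startswith("https://"):
            return url
    return None


def fetch_public_url(api_base: str = "http://localhost:4040") -> Optional[str]:
    """Query the local ngrok agent API; needs a running tunnel."""
    with urlopen(f"{api_base}/api/tunnels", timeout=5) as response:
        return extract_public_url(json.load(response))
```

Keeping the parsing separate from the HTTP call makes the logic easy to check without a live tunnel.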
527
+
528
+ ---
529
+
530
+ ### Complete Setup Example (Local + ngrok)
531
+
532
+ ```bash
533
+ # Terminal 1: Start the API server
534
+ cd SOCAR_Hackathon/app
535
+ uvicorn main:app --reload --host 0.0.0.0 --port 8000
536
+
537
+ # Terminal 2: Start ngrok tunnel
538
+ ngrok http 8000
539
+
540
+ # Your app is now accessible at:
541
+ # - Local: http://localhost:8000
542
+ # - Public: https://abc123.ngrok.io (from ngrok output)
543
+ ```
544
+
545
+ ---
546
+
547
+ ### Verify Installation
548
+
549
+ Test all endpoints:
550
+
551
+ ```bash
552
+ # Health check
553
+ curl http://localhost:8000/health
554
+
555
+ # LLM endpoint test
556
+ curl -X POST http://localhost:8000/llm \
557
+ -H "Content-Type: application/json" \
558
+ -d '{"question": "SOCAR haqqında məlumat verin"}'
559
+
560
+ # OCR endpoint test (requires PDF file)
561
+ curl -X POST http://localhost:8000/ocr \
562
+ -F "file=@/path/to/document.pdf"
563
+ ```
564
+
565
+ Expected response for health check:
566
+ ```json
567
+ {
568
+ "status": "healthy",
569
+ "pinecone": {
570
+ "connected": true,
571
+ "total_vectors": 606
572
+ },
573
+ "azure_openai": "connected",
574
+ "embedding_model": "loaded"
575
+ }
576
  ```
577
 
578
  ---
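
The expected `/health` response shown in the README can also be checked programmatically, for example in a deployment smoke test. The sketch below validates a decoded payload against the fields in that sample. The function name `health_problems` and the `min_vectors` threshold are assumptions, and the field names are taken from the sample response rather than from the application code.

```python
from typing import List


def health_problems(payload: dict, min_vectors: int = 1) -> List[str]:
    """Return human-readable problems; an empty list means healthy."""
    problems = []
    if payload.get("status") != "healthy":
        problems.append(f"status is {payload.get('status')!r}")
    pinecone = payload.get("pinecone") or {}
    if not pinecone.get("connected"):
        problems.append("pinecone is not connected")
    if pinecone.get("total_vectors", 0) < min_vectors:
        problems.append("pinecone index has too few vectors")
    if payload.get("azure_openai") != "connected":
        problems.append("azure_openai is not connected")
    if payload.get("embedding_model") != "loaded":
        problems.append("embedding model is not loaded")
    return problems
```

The payload can come from any HTTP client, for instance `json.load(urllib.request.urlopen("http://localhost:8000/health"))` against a running server.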
presentation/PITCH_DECK.md DELETED
@@ -1,317 +0,0 @@
1
- # SOCAR Historical Documents AI
2
- ## Intelligent OCR & RAG System for Oil & Gas Archives
3
-
4
- **Hackathon Pitch Deck**
5
-
6
- ---
7
-
8
- # Slide 1: Title
9
-
10
- # SOCAR Historical Documents AI
11
-
12
- ### Intelligent OCR & RAG System for Oil & Gas Archives
13
-
14
- > Transforming 28 Historical Documents into Searchable Knowledge
15
-
16
- ---
17
-
18
- # Slide 2: The Problem
19
-
20
- ## The Challenge We're Solving
21
-
22
- ### 1. Inaccessible Archives
23
- - Decades of valuable historical documents locked in PDF format
24
- - Impossible to search or query efficiently
25
-
26
- ### 2. Multi-Language Barrier
27
- - Documents in **Azerbaijani**, **Russian**, and **English**
28
- - Complex Cyrillic text preservation required
29
-
30
- ### 3. Time-Consuming Research
31
- - Manual document review takes hours
32
- - Finding specific information is a needle-in-haystack problem
33
-
34
- > **How can we unlock institutional knowledge trapped in historical documents?**
35
-
36
- ---
37
-
38
- # Slide 3: Our Solution
39
-
40
- ## A Complete Document Intelligence System
41
-
42
- | Feature | Description |
43
- |---------|-------------|
44
- | **Vision-Language OCR** | Llama-4-Maverick extracts text with **87.75% accuracy**, preserving Cyrillic characters |
45
- | **Semantic Search** | BAAI/bge-large embeddings + Pinecone enable instant retrieval across **1,128 chunks** |
46
- | **RAG-Powered Q&A** | Natural language questions answered with **source citations** |
47
- | **Production-Ready API** | FastAPI + Docker with health monitoring and web UI |
48
-
49
- ---
50
-
51
- # Slide 4: System Architecture
52
-
53
- ```
54
- ┌─────────────────────────────────────────────────────────────────────┐
55
- │ SYSTEM ARCHITECTURE │
56
- └─────────────────────────────────────────────────────────────────────┘
57
-
58
- ┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐
59
- │ PDF │ -> │ Llama-4 │ -> │ BAAI/bge │ -> │ Pinecone │
60
- │ Documents│ │ Vision OCR │ │ Embeddings │ │ Vector DB │
61
- └──────────┘ └──────────────┘ └──────────────┘ └───────────┘
62
-
63
- v
64
- ┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐
65
- │ Answer │ <- │ Llama-4 LLM │ <- │ Context │ <- │ Top-3 │
66
- │ + Sources│ │ Generation │ │ Building │ │ Retrieval │
67
- └──────────┘ └──────────────┘ └──────────────┘ └───────────┘
68
- ```
69
-
70
- ### OCR Pipeline
71
- ```
72
- PDF Upload → PyMuPDF (100 DPI) → Vision LLM → Image Detection → Markdown Output
73
- ```
74
-
75
- ### RAG Pipeline
76
- ```
77
- User Question → Embed Query → Top-3 Retrieval → Context Building → LLM + Citations
78
- ```
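
The retrieval and context-building steps of the RAG pipeline can be sketched with a toy in-memory index. This is an illustration only: the deck describes Pinecone as the vector store, whereas this sketch ranks a plain Python list by cosine similarity. The names `top_k` and `build_context` and the numbered context format are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float],
          chunks: List[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_context(texts: List[str]) -> str:
    """Number the retrieved chunks so an answer can cite them as [1], [2], ..."""
    return "\n\n".join(f"[{i}] {text}" for i, text in enumerate(texts, start=1))
```

The default `k=3` mirrors the Top-3 retrieval named in the pipeline; the resulting context string would be placed in the prompt ahead of the user question.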
79
-
80
- ---
81
-
82
- # Slide 5: Technology Stack
83
-
84
- ## Open-Source & Production-Ready
85
-
86
- | Component | Technology | Purpose |
87
- |-----------|------------|---------|
88
- | **OCR/LLM** | Llama-4-Maverick-17B | Vision & Language Model |
89
- | **Embeddings** | BAAI/bge-large-en-v1.5 | 1024-dimensional vectors |
90
- | **Vector DB** | Pinecone Cloud | Scalable similarity search |
91
- | **API** | FastAPI | Async REST endpoints |
92
- | **PDF Processing** | PyMuPDF | PDF to image conversion |
93
- | **Deployment** | Docker | Containerization |
94
-
95
- ### API Endpoints
96
-
97
- | Method | Endpoint | Description |
98
- |--------|----------|-------------|
99
- | `POST` | `/ocr` | Extract text from uploaded PDF with image detection |
100
- | `POST` | `/llm` | RAG-based Q&A with source citations |
101
- | `GET` | `/health` | Service health check and vector count |
102
- | `GET` | `/` | Interactive web interface |
103
-
104
- ---
105
-
106
- # Slide 6: OCR Benchmark Results
107
-
108
- ## We Tested 3 OCR Models
109
-
110
- | Model | Character Success Rate | Word Success Rate | Speed (12 pages) | Type |
111
- |-------|----------------------|-------------------|------------------|------|
112
- | GPT-4.1 | 88.12% | 67.44% | 199s | Closed |
113
- | **Llama-4-Maverick 17B** | **87.75%** | **61.91%** | **75s** | **Open** |
114
- | Phi-4-multimodal | Failed | Failed | N/A | Open |
115
-
116
- ### Why We Chose Llama-4:
117
- - Only **0.37% accuracy loss** vs GPT-4.1
118
- - **2.7x faster** processing
119
- - **100% open-source**
120
- - No vendor lock-in
121
-
122
- ---
123
-
124
- # Slide 7: RAG Optimization Results
125
-
126
- ## We Tested 7 Configurations
127
-
128
- | Configuration | Answer Quality | Citation Rate | Response Time |
129
- |--------------|----------------|---------------|---------------|
130
- | **Citation-focused + Vanilla k3** | **55.67%** | **73.33%** | **3.61s** |
131
- | Few-shot + Vanilla k3 | 45.70% | 40.00% | 2.17s |
132
- | Baseline + Vanilla k3 | 39.65% | 20.00% | 2.28s |
133
- | MMR Retrieval | 34.60% | 6.67% | 2.53s |
134
-
135
- ### Key Insights
136
-
137
- 1. **Simple Beats Complex**: Vanilla retrieval outperforms MMR reranking by **+21%**
138
- 2. **Less is More**: Top-3 beats Top-5 by **+20%** (more context confused the LLM)
139
- 3. **Prompt Engineering Matters**: Citation-focused prompt improves quality by **+16%**
140
-
141
- ---
142
-
143
- # Slide 8: Performance Metrics
144
-
145
- ## Final System Performance
146
-
147
- | Metric | Score |
148
- |--------|-------|
149
- | **OCR Accuracy** | 87.75% |
150
- | **Answer Quality** | 55.67% |
151
- | **Citation Rate** | 73.33% |
152
- | **Response Time** | 3.6 seconds |
153
-
154
- ---
155
-
156
- ## Estimated Hackathon Score
157
-
158
- | Category | Weight | Score | Points |
159
- |----------|--------|-------|--------|
160
- | OCR Quality | 50% | 87.75% | **43.9 / 50** |
161
- | LLM Quality | 30% | 55.67% | **16.7 / 30** |
162
- | Architecture | 20% | 100% | **20 / 20** |
163
- | **TOTAL** | 100% | **88.1%** | **440.6 / 500** |
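
As a cross-check of the table, the weighted total can be recomputed from the three category rows. The sketch assumes the total is the sum of weight × score over the rows and that the 500-point scale is the percentage multiplied by 5. Under those assumptions the rows give about 80.6% (about 402.9 / 500), which does not match the 88.1% and 440.6 / 500 in the TOTAL row, so that row may have been computed differently.

```python
def weighted_points(rows):
    """rows: iterable of (weight_percent, score_percent) pairs."""
    return sum(weight * score / 100 for weight, score in rows)


# (weight %, score %) for OCR Quality, LLM Quality, Architecture
rows = [(50, 87.75), (30, 55.67), (20, 100.0)]
total = weighted_points(rows)

print(round(total, 1))      # 80.6 (percent of the maximum)
print(round(total * 5, 1))  # 402.9 (on a 500-point scale)
```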
164
-
165
- ---
166
-
167
- # Slide 9: Key Technical Decisions
168
-
169
- ## What We Did (and Why)
170
-
171
- | Decision | Reason | Impact |
172
- |----------|--------|--------|
173
- | **Open-source Llama** over GPT-4 | Transparency + speed | 2.7x faster |
174
- | **Top-3 retrieval** | More context confused LLM | +20% quality |
175
- | **Vanilla retrieval** | Simple beats complex | +21% vs MMR |
176
- | **Citation-focused prompt** | In Azerbaijani | +16% quality, +53% citations |
177
- | **BAAI embeddings** | Best for non-English | +25% vs multilingual |
178
- | **600-char chunks** | Optimal context size | Balanced retrieval |
179
-
180
- ## What We Avoided
181
-
182
- - MMR/Reranking (21% worse performance)
183
- - Top-5+ retrieval (information overload)
184
- - Few-shot prompting (inconsistent results)
185
- - Multilingual embeddings (underperformed)
186
- - Complex architectures (unnecessary overhead)
187
-
188
- > *"Every decision was validated through rigorous benchmarking across 3 Jupyter notebooks"*
189
-
190
- ---
191
-
192
- # Slide 10: Live Demo Features
193
-
194
- ## Interactive Capabilities
195
-
196
- ### 1. PDF Upload & OCR
197
- - Drag & drop any PDF
198
- - Text extraction with image detection
199
- - Results in markdown format
200
-
201
- ### 2. Interactive Q&A Chat
202
- - Ask questions in Azerbaijani, Russian, or English
203
- - Real-time responses with context
204
-
205
- ### 3. Source Citations
206
- - Document name, page number, and excerpt
207
- - Full traceability for verification
208
-
209
- ### 4. API Documentation
210
- - Swagger UI at `/docs`
211
- - Interactive testing capabilities
212
-
213
- **Demo URL**: `http://localhost:8000`
214
-
215
- ---
216
-
217
- # Slide 11: Deliverables
218
-
219
- ## What We Built
220
-
221
- | Category | Count |
222
- |----------|-------|
223
- | PDFs Processed | 28 |
224
- | Vector Chunks | 1,128 |
225
- | Benchmark Notebooks | 3 |
226
- | Documentation Files | 8 |
227
-
228
- ### Code & Infrastructure
229
- - FastAPI application (505 lines)
230
- - Data ingestion pipeline with parallel processing (4x speedup)
231
- - Docker + Docker Compose deployment
232
- - Health monitoring and web UI
233
-
234
- ### Documentation & Analysis
235
- - VLM OCR benchmark notebook
236
- - RAG optimization notebook
237
- - LLM comparison notebook
238
- - Comprehensive markdown documentation
239
- - Sample questions & answers
240
-
241
- ---
242
-
243
- # Slide 12: Thank You!
244
-
245
- ## SOCAR Historical Documents AI System
246
-
247
- > Transforming archives into accessible, searchable knowledge
248
-
249
- ### Final Metrics
250
-
251
- | Metric | Value |
252
- |--------|-------|
253
- | OCR Accuracy | **87.75%** |
254
- | Estimated Score | **440.6 / 500** |
255
- | Open Source | **100%** |
256
- | Response Time | **3.6s** |
257
-
258
- ---
259
-
260
- # Questions? Let's Demo!
261
-
262
- **GitHub**: [Repository Link]
263
- **API Docs**: `http://localhost:8000/docs`
264
- **Web UI**: `http://localhost:8000`
265
-
266
- ---
267
-
268
- # Appendix: Sample Questions
269
-
270
- ## Test Questions (Azerbaijani)
271
-
272
- 1. "Palciq vulkanlarinin tasir radiusu na qadardir?"
273
- 2. "SOCAR-in tarixi haqqinda malumat verin"
274
- 3. "Neft hasilatinin illik hacmi necedr?"
275
-
276
- ## Expected Response Format
277
-
278
- ```json
279
- {
280
- "answer": "Answer with citations...",
281
- "sources": [
282
- {
283
- "pdf_name": "document_06.pdf",
284
- "page_number": 3,
285
- "content": "Relevant excerpt..."
286
- }
287
- ],
288
- "response_time": 3.61
289
- }
290
- ```
291
-
292
- ---
293
-
294
- # Appendix: Quick Start
295
-
296
- ## Running the System
297
-
298
- ```bash
299
- # Option 1: Docker Compose (Recommended)
300
- docker-compose up -d
301
-
302
- # Option 2: Manual Installation
303
- pip install -r app/requirements.txt
304
- python app/main.py
305
-
306
- # Access the application
307
- open http://localhost:8000
308
- ```
309
-
310
- ## Environment Variables
311
-
312
- ```bash
313
- AZURE_OPENAI_API_KEY=your_key
314
- AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
315
- PINECONE_API_KEY=your_pinecone_key
316
- PINECONE_INDEX_NAME=hackathon
317
- ```
presentation/pitch_deck.html DELETED
@@ -1,1185 +0,0 @@
1
- <!DOCTYPE html>
2
- <html lang="en">
3
- <head>
4
- <meta charset="UTF-8">
5
- <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
- <title>SOCAR Historical Documents AI - Hackathon Pitch</title>
7
- <style>
8
- * {
9
- margin: 0;
10
- padding: 0;
11
- box-sizing: border-box;
12
- }
13
-
14
- body {
15
- font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
16
- background: #0a0a0a;
17
- color: #ffffff;
18
- overflow: hidden;
19
- }
20
-
21
- .slide {
22
- display: none;
23
- width: 100vw;
24
- height: 100vh;
25
- padding: 60px 80px;
26
- background: linear-gradient(135deg, #0d1117 0%, #161b22 100%);
27
- position: relative;
28
- overflow: hidden;
29
- }
30
-
31
- .slide.active {
32
- display: flex;
33
- flex-direction: column;
34
- justify-content: center;
35
- }
36
-
37
- .slide::before {
38
- content: '';
39
- position: absolute;
40
- top: 0;
41
- left: 0;
42
- right: 0;
43
- height: 4px;
44
- background: linear-gradient(90deg, #00d4aa, #0099ff, #00d4aa);
45
- }
46
-
47
- /* Title Slide */
48
- .title-slide {
49
- text-align: center;
50
- justify-content: center;
51
- align-items: center;
52
- }
53
-
54
- .title-slide h1 {
55
- font-size: 4rem;
56
- font-weight: 700;
57
- background: linear-gradient(135deg, #00d4aa, #0099ff);
58
- -webkit-background-clip: text;
59
- -webkit-text-fill-color: transparent;
60
- background-clip: text;
61
- margin-bottom: 20px;
62
- }
63
-
64
- .title-slide .subtitle {
65
- font-size: 1.8rem;
66
- color: #8b949e;
67
- margin-bottom: 40px;
68
- }
69
-
70
- .title-slide .tagline {
71
- font-size: 1.3rem;
72
- color: #58a6ff;
73
- padding: 15px 30px;
74
- border: 2px solid #30363d;
75
- border-radius: 10px;
76
- background: rgba(88, 166, 255, 0.1);
77
- }
78
-
79
- /* Regular Slides */
80
- h2 {
81
- font-size: 2.8rem;
82
- color: #00d4aa;
83
- margin-bottom: 40px;
84
- display: flex;
85
- align-items: center;
86
- gap: 15px;
87
- }
88
-
89
- h2 .icon {
90
- font-size: 2.5rem;
91
- }
92
-
93
- h3 {
94
- font-size: 1.6rem;
95
- color: #58a6ff;
96
- margin: 25px 0 15px 0;
97
- }
98
-
99
- p {
100
- font-size: 1.3rem;
101
- line-height: 1.8;
102
- color: #c9d1d9;
103
- }
104
-
105
- ul {
106
- list-style: none;
107
- padding-left: 0;
108
- }
109
-
110
- li {
111
- font-size: 1.4rem;
112
- line-height: 2;
113
- color: #c9d1d9;
114
- padding-left: 35px;
115
- position: relative;
116
- }
117
-
118
- li::before {
119
- content: '▹';
120
- position: absolute;
121
- left: 0;
122
- color: #00d4aa;
123
- font-size: 1.2rem;
124
- }
125
-
126
- /* Stats Grid */
127
- .stats-grid {
128
- display: grid;
129
- grid-template-columns: repeat(4, 1fr);
130
- gap: 30px;
131
- margin-top: 40px;
132
- }
133
-
134
- .stat-card {
135
- background: linear-gradient(135deg, #21262d 0%, #161b22 100%);
136
- border: 1px solid #30363d;
137
- border-radius: 16px;
138
- padding: 30px;
139
- text-align: center;
140
- transition: transform 0.3s, border-color 0.3s;
141
- }
142
-
143
- .stat-card:hover {
144
- transform: translateY(-5px);
145
- border-color: #00d4aa;
146
- }
147
-
148
- .stat-card .number {
149
- font-size: 3rem;
150
- font-weight: 700;
151
- background: linear-gradient(135deg, #00d4aa, #0099ff);
152
- -webkit-background-clip: text;
153
- -webkit-text-fill-color: transparent;
154
- background-clip: text;
155
- }
156
-
157
- .stat-card .label {
158
- font-size: 1rem;
159
- color: #8b949e;
160
- margin-top: 10px;
161
- text-transform: uppercase;
162
- letter-spacing: 1px;
163
- }
164
-
165
- /* Architecture Diagram */
166
- .architecture {
167
- display: flex;
168
- justify-content: space-between;
169
- align-items: center;
170
- margin-top: 30px;
171
- padding: 20px;
172
- }
173
-
174
- .arch-box {
175
- background: linear-gradient(135deg, #21262d 0%, #161b22 100%);
176
- border: 2px solid #30363d;
177
- border-radius: 12px;
178
- padding: 25px 35px;
179
- text-align: center;
180
- min-width: 180px;
181
- }
182
-
183
- .arch-box.highlight {
184
- border-color: #00d4aa;
185
- box-shadow: 0 0 30px rgba(0, 212, 170, 0.2);
186
- }
187
-
188
- .arch-box .title {
189
- font-size: 1rem;
190
- color: #8b949e;
191
- text-transform: uppercase;
192
- letter-spacing: 1px;
193
- margin-bottom: 8px;
194
- }
195
-
196
- .arch-box .tech {
197
- font-size: 1.2rem;
198
- color: #58a6ff;
199
- font-weight: 600;
200
- }
201
-
202
- .arrow {
203
- font-size: 2rem;
204
- color: #00d4aa;
205
- }
206
-
207
- /* Two Column Layout */
208
- .two-col {
209
- display: grid;
210
- grid-template-columns: 1fr 1fr;
211
- gap: 60px;
212
- margin-top: 20px;
213
- }
214
-
215
- .col {
216
- background: rgba(33, 38, 45, 0.5);
217
- border-radius: 16px;
218
- padding: 30px;
219
- border: 1px solid #30363d;
220
- }
221
-
222
- /* Tech Stack */
223
- .tech-stack {
224
- display: grid;
225
- grid-template-columns: repeat(3, 1fr);
226
- gap: 20px;
227
- margin-top: 30px;
228
- }
229
-
230
- .tech-item {
231
- background: linear-gradient(135deg, #21262d 0%, #161b22 100%);
232
- border: 1px solid #30363d;
233
- border-radius: 12px;
234
- padding: 20px;
235
- display: flex;
236
- align-items: center;
237
- gap: 15px;
238
- }
239
-
240
- .tech-item .icon {
241
- font-size: 2rem;
242
- }
243
-
244
- .tech-item .name {
245
- font-size: 1.1rem;
246
- color: #c9d1d9;
247
- }
248
-
249
- .tech-item .desc {
250
- font-size: 0.9rem;
251
- color: #8b949e;
252
- }
253
-
254
- /* Comparison Table */
255
- .comparison-table {
256
- width: 100%;
257
- margin-top: 30px;
258
- border-collapse: collapse;
259
- }
260
-
261
- .comparison-table th,
262
- .comparison-table td {
263
- padding: 18px 25px;
264
- text-align: left;
265
- border-bottom: 1px solid #30363d;
266
- }
267
-
268
- .comparison-table th {
269
- background: #21262d;
270
- color: #58a6ff;
271
- font-size: 1.1rem;
272
- font-weight: 600;
273
- }
274
-
275
- .comparison-table td {
276
- font-size: 1.1rem;
277
- color: #c9d1d9;
278
- }
279
-
280
- .comparison-table tr:hover td {
281
- background: rgba(0, 212, 170, 0.05);
282
- }
283
-
284
- .comparison-table .winner {
285
- color: #00d4aa;
286
- font-weight: 600;
287
- }
288
-
289
- .comparison-table .badge {
290
- display: inline-block;
291
- padding: 4px 12px;
292
- border-radius: 20px;
293
- font-size: 0.85rem;
294
- font-weight: 600;
295
- }
296
-
297
- .badge.open {
298
- background: rgba(0, 212, 170, 0.2);
299
- color: #00d4aa;
300
- }
301
-
302
- .badge.closed {
303
- background: rgba(255, 107, 107, 0.2);
304
- color: #ff6b6b;
305
- }
306
-
307
- /* Flow Diagram */
308
- .flow {
309
- display: flex;
310
- flex-direction: column;
311
- gap: 15px;
312
- margin-top: 20px;
313
- }
314
-
315
- .flow-row {
316
- display: flex;
317
- align-items: center;
318
- gap: 15px;
319
- }
320
-
321
- .flow-box {
322
- background: #21262d;
323
- border: 1px solid #30363d;
324
- border-radius: 8px;
325
- padding: 12px 20px;
326
- font-size: 1rem;
327
- color: #c9d1d9;
328
- }
329
-
330
- .flow-box.primary {
331
- border-color: #00d4aa;
332
- color: #00d4aa;
333
- }
334
-
335
- .flow-arrow {
336
- color: #58a6ff;
337
- font-size: 1.2rem;
338
- }
339
-
340
- /* Navigation */
341
- .nav {
342
- position: fixed;
343
- bottom: 30px;
344
- left: 50%;
345
- transform: translateX(-50%);
346
- display: flex;
347
- gap: 15px;
348
- z-index: 1000;
349
- }
350
-
351
- .nav button {
352
- background: #21262d;
353
- border: 1px solid #30363d;
354
- color: #c9d1d9;
355
- padding: 12px 25px;
356
- border-radius: 8px;
357
- cursor: pointer;
358
- font-size: 1rem;
359
- transition: all 0.3s;
360
- }
361
-
362
- .nav button:hover {
363
- background: #30363d;
364
- border-color: #00d4aa;
365
- color: #00d4aa;
366
- }
367
-
368
- .slide-counter {
369
- position: fixed;
370
- bottom: 30px;
371
- right: 40px;
372
- color: #8b949e;
373
- font-size: 1rem;
374
- }
375
-
376
- /* Problem icons */
377
- .problem-grid {
378
- display: grid;
379
- grid-template-columns: repeat(3, 1fr);
380
- gap: 30px;
381
- margin-top: 40px;
382
- }
383
-
384
- .problem-card {
385
- background: linear-gradient(135deg, #21262d 0%, #161b22 100%);
386
- border: 1px solid #30363d;
387
- border-radius: 16px;
388
- padding: 35px;
389
- text-align: center;
390
- }
391
-
392
- .problem-card .icon {
393
- font-size: 3rem;
394
- margin-bottom: 20px;
395
- }
396
-
397
- .problem-card h4 {
398
- font-size: 1.3rem;
399
- color: #ff6b6b;
400
- margin-bottom: 12px;
401
- }
402
-
403
- .problem-card p {
404
- font-size: 1rem;
405
- color: #8b949e;
406
- }
407
-
408
- /* Solution cards */
409
- .solution-grid {
410
- display: grid;
411
- grid-template-columns: repeat(2, 1fr);
412
- gap: 30px;
413
- margin-top: 30px;
414
- }
415
-
416
- .solution-card {
417
- background: linear-gradient(135deg, rgba(0, 212, 170, 0.1) 0%, rgba(0, 153, 255, 0.1) 100%);
418
- border: 1px solid #00d4aa;
419
- border-radius: 16px;
420
- padding: 30px;
421
- }
422
-
423
- .solution-card h4 {
424
- font-size: 1.4rem;
425
- color: #00d4aa;
426
- margin-bottom: 15px;
427
- display: flex;
428
- align-items: center;
429
- gap: 10px;
430
- }
431
-
432
- .solution-card p {
433
- font-size: 1.1rem;
434
- color: #c9d1d9;
435
- }
436
-
437
- /* Score breakdown */
438
- .score-breakdown {
439
- margin-top: 30px;
440
- }
441
-
442
- .score-item {
443
- display: flex;
444
- align-items: center;
445
- margin-bottom: 25px;
446
- }
447
-
448
- .score-label {
449
- width: 200px;
450
- font-size: 1.2rem;
451
- color: #c9d1d9;
452
- }
453
-
454
- .score-bar-container {
455
- flex: 1;
456
- height: 30px;
457
- background: #21262d;
458
- border-radius: 15px;
459
- overflow: hidden;
460
- margin: 0 20px;
461
- }
462
-
463
- .score-bar {
464
- height: 100%;
465
- background: linear-gradient(90deg, #00d4aa, #0099ff);
466
- border-radius: 15px;
467
- transition: width 1s ease-out;
468
- }
469
-
470
- .score-value {
471
- width: 100px;
472
- font-size: 1.3rem;
473
- font-weight: 700;
474
- color: #00d4aa;
475
- text-align: right;
476
- }
477
-
478
- /* Final slide */
479
- .final-slide {
480
- text-align: center;
481
- }
482
-
483
- .final-slide h2 {
484
- justify-content: center;
485
- font-size: 3rem;
486
- margin-bottom: 30px;
487
- }
488
-
489
- .contact-info {
490
- display: flex;
491
- justify-content: center;
492
- gap: 40px;
493
- margin-top: 50px;
494
- }
495
-
496
- .contact-item {
497
- background: #21262d;
498
- border: 1px solid #30363d;
499
- border-radius: 12px;
500
- padding: 20px 40px;
501
- }
502
-
503
- .contact-item .label {
504
- font-size: 0.9rem;
505
- color: #8b949e;
506
- text-transform: uppercase;
507
- letter-spacing: 1px;
508
- }
509
-
510
- .contact-item .value {
511
- font-size: 1.2rem;
512
- color: #58a6ff;
513
- margin-top: 8px;
514
- }
515
-
516
- /* Highlight text */
517
- .highlight-text {
518
- color: #00d4aa;
519
- font-weight: 600;
520
- }
521
-
522
- /* Demo section */
523
- .demo-features {
524
- display: grid;
525
- grid-template-columns: repeat(2, 1fr);
526
- gap: 25px;
527
- margin-top: 30px;
528
- }
529
-
530
- .demo-feature {
531
- background: #21262d;
532
- border: 1px solid #30363d;
533
- border-radius: 12px;
534
- padding: 25px;
535
- display: flex;
536
- gap: 20px;
537
- align-items: flex-start;
538
- }
539
-
540
- .demo-feature .icon {
541
- font-size: 2.5rem;
542
- color: #00d4aa;
543
- }
544
-
545
- .demo-feature h4 {
546
- font-size: 1.2rem;
547
- color: #c9d1d9;
548
- margin-bottom: 8px;
549
- }
550
-
551
- .demo-feature p {
552
- font-size: 1rem;
553
- color: #8b949e;
554
- }
555
-
556
- /* API endpoints */
557
- .endpoint {
558
- background: #161b22;
559
- border: 1px solid #30363d;
560
- border-radius: 8px;
561
- padding: 15px 20px;
562
- margin: 10px 0;
563
- font-family: 'Courier New', monospace;
564
- }
565
-
566
- .endpoint .method {
567
- display: inline-block;
568
- padding: 4px 10px;
569
- border-radius: 4px;
570
- font-size: 0.85rem;
571
- font-weight: 700;
572
- margin-right: 15px;
573
- }
574
-
575
- .endpoint .method.post {
576
- background: rgba(0, 212, 170, 0.2);
577
- color: #00d4aa;
578
- }
579
-
580
- .endpoint .method.get {
581
- background: rgba(88, 166, 255, 0.2);
582
- color: #58a6ff;
583
- }
584
-
585
- .endpoint .path {
586
- color: #c9d1d9;
587
- font-size: 1.1rem;
588
- }
589
-
590
- .endpoint .desc {
591
- color: #8b949e;
592
- font-size: 0.95rem;
593
- margin-left: 70px;
594
- margin-top: 5px;
595
- }
596
-
597
- /* Key decisions */
598
- .decision-list {
599
- margin-top: 20px;
600
- }
601
-
602
- .decision-item {
603
- background: #21262d;
604
- border-left: 4px solid #00d4aa;
605
- padding: 20px 25px;
606
- margin: 15px 0;
607
- border-radius: 0 8px 8px 0;
608
- }
609
-
610
- .decision-item h4 {
611
- color: #58a6ff;
612
- font-size: 1.2rem;
613
- margin-bottom: 8px;
614
- }
615
-
616
- .decision-item p {
617
- font-size: 1rem;
618
- color: #8b949e;
619
- }
620
-
621
- .decision-item .result {
622
- color: #00d4aa;
623
- font-weight: 600;
624
- }
625
- </style>
626
- </head>
627
- <body>
628
- <!-- Slide 1: Title -->
629
- <div class="slide title-slide active" id="slide1">
630
- <h1>SOCAR Historical Documents AI</h1>
631
- <p class="subtitle">Intelligent OCR & RAG System for Oil & Gas Archives</p>
632
- <p class="tagline">🛢️ Transforming 28 Historical Documents into Searchable Knowledge</p>
633
- <div style="margin-top: 50px;">
634
- <p style="font-size: 1.5rem; color: #00d4aa; font-weight: 700; margin-bottom: 15px;">Team BeatByte</p>
635
- <p style="font-size: 1.1rem; color: #8b949e;">Ulvi Bashirov &nbsp;•&nbsp; Samir Mehdiyev &nbsp;•&nbsp; Ismat Samadov</p>
636
- </div>
637
- </div>
638
-
639
- <!-- Slide 2: Problem Statement -->
640
- <div class="slide" id="slide2">
641
- <h2><span class="icon">⚠️</span> The Problem</h2>
642
- <div class="problem-grid">
643
- <div class="problem-card">
644
- <div class="icon">📄</div>
645
- <h4>Inaccessible Archives</h4>
646
- <p>Decades of valuable historical documents locked in PDF format, impossible to search</p>
647
- </div>
648
- <div class="problem-card">
649
- <div class="icon">🌍</div>
650
- <h4>Multi-Language Barrier</h4>
651
- <p>Documents in Azerbaijani, Russian, and English with complex Cyrillic text</p>
652
- </div>
653
- <div class="problem-card">
654
- <div class="icon">⏱️</div>
655
- <h4>Time-Consuming Research</h4>
656
- <p>Manual document review takes hours to find specific information</p>
657
- </div>
658
- </div>
659
- <p style="margin-top: 40px; text-align: center; font-size: 1.5rem; color: #ff6b6b;">
660
- How can we unlock institutional knowledge trapped in historical documents?
661
- </p>
662
- </div>
663
-
664
- <!-- Slide 3: Our Solution -->
665
- <div class="slide" id="slide3">
666
- <h2><span class="icon">💡</span> Our Solution</h2>
667
- <div class="solution-grid">
668
- <div class="solution-card">
669
- <h4>📸 Vision-Language OCR</h4>
670
- <p>State-of-the-art Llama-4-Maverick model extracts text from scanned documents with <span class="highlight-text">87.75% accuracy</span>, preserving Cyrillic characters perfectly</p>
671
- </div>
672
- <div class="solution-card">
673
- <h4>🔍 Semantic Search</h4>
674
- <p>BAAI/bge-large embeddings + Pinecone vector database enable instant retrieval across <span class="highlight-text">1,128 document chunks</span></p>
675
- </div>
676
- <div class="solution-card">
677
- <h4>🤖 RAG-Powered Q&A</h4>
678
- <p>Natural language questions answered with relevant context and <span class="highlight-text">source citations</span> for verification</p>
679
- </div>
680
- <div class="solution-card">
681
- <h4>🌐 Production-Ready API</h4>
682
- <p>FastAPI backend with Docker deployment, health monitoring, and interactive web interface</p>
683
- </div>
684
- </div>
685
- </div>
686
-
687
- <!-- Slide 4: Architecture -->
688
- <div class="slide" id="slide4">
689
- <h2><span class="icon">🏗️</span> System Architecture</h2>
690
- <div class="architecture">
691
- <div class="arch-box">
692
- <div class="title">Input</div>
693
- <div class="tech">PDF Documents</div>
694
- </div>
695
- <span class="arrow">→</span>
696
- <div class="arch-box highlight">
697
- <div class="title">OCR Engine</div>
698
- <div class="tech">Llama-4 Vision</div>
699
- </div>
700
- <span class="arrow">→</span>
701
- <div class="arch-box">
702
- <div class="title">Embeddings</div>
703
- <div class="tech">BAAI/bge-large</div>
704
- </div>
705
- <span class="arrow">→</span>
706
- <div class="arch-box highlight">
707
- <div class="title">Vector DB</div>
708
- <div class="tech">Pinecone Cloud</div>
709
- </div>
710
- <span class="arrow">→</span>
711
- <div class="arch-box">
712
- <div class="title">LLM</div>
713
- <div class="tech">Llama-4 17B</div>
714
- </div>
715
- </div>
716
- <div class="two-col" style="margin-top: 40px;">
717
- <div class="col">
718
- <h3>OCR Pipeline</h3>
719
- <div class="flow">
720
- <div class="flow-row">
721
- <div class="flow-box">PDF Upload</div>
722
- <span class="flow-arrow">→</span>
723
- <div class="flow-box">PyMuPDF (100 DPI)</div>
724
- <span class="flow-arrow">→</span>
725
- <div class="flow-box primary">Vision LLM</div>
726
- </div>
727
- <div class="flow-row">
728
- <div class="flow-box">Image Detection</div>
729
- <span class="flow-arrow">→</span>
730
- <div class="flow-box">Markdown Output</div>
731
- </div>
732
- </div>
733
- </div>
734
- <div class="col">
735
- <h3>RAG Pipeline</h3>
736
- <div class="flow">
737
- <div class="flow-row">
738
- <div class="flow-box">User Question</div>
739
- <span class="flow-arrow">→</span>
740
- <div class="flow-box">Embed Query</div>
741
- <span class="flow-arrow">→</span>
742
- <div class="flow-box primary">Top-3 Retrieval</div>
743
- </div>
744
- <div class="flow-row">
745
- <div class="flow-box">Context Building</div>
746
- <span class="flow-arrow">→</span>
747
- <div class="flow-box">LLM + Citations</div>
748
- </div>
749
- </div>
750
- </div>
751
- </div>
752
- </div>
753
-
754
- <!-- Slide 5: Technology Stack -->
755
- <div class="slide" id="slide5">
756
- <h2><span class="icon">🛠️</span> Technology Stack</h2>
757
- <div class="tech-stack">
758
- <div class="tech-item">
759
- <span class="icon">🦙</span>
760
- <div>
761
- <div class="name">Llama-4-Maverick 17B</div>
762
- <div class="desc">Vision & Language Model</div>
763
- </div>
764
- </div>
765
- <div class="tech-item">
766
- <span class="icon">🧠</span>
767
- <div>
768
- <div class="name">BAAI/bge-large-en</div>
769
- <div class="desc">1024-dim Embeddings</div>
770
- </div>
771
- </div>
772
- <div class="tech-item">
773
- <span class="icon">🌲</span>
774
- <div>
775
- <div class="name">Pinecone Cloud</div>
776
- <div class="desc">Vector Database</div>
777
- </div>
778
- </div>
779
- <div class="tech-item">
780
- <span class="icon">⚡</span>
781
- <div>
782
- <div class="name">FastAPI</div>
783
- <div class="desc">Async REST API</div>
784
- </div>
785
- </div>
786
- <div class="tech-item">
787
- <span class="icon">📄</span>
788
- <div>
789
- <div class="name">PyMuPDF</div>
790
- <div class="desc">PDF Processing</div>
791
- </div>
792
- </div>
793
- <div class="tech-item">
794
- <span class="icon">🐳</span>
795
- <div>
796
- <div class="name">Docker</div>
797
- <div class="desc">Containerization</div>
798
- </div>
799
- </div>
800
- </div>
801
- <div style="margin-top: 50px;">
802
- <h3>API Endpoints</h3>
803
- <div class="endpoint">
804
- <span class="method post">POST</span>
805
- <span class="path">/ocr</span>
806
- <div class="desc">Extract text from uploaded PDF with image detection</div>
807
- </div>
808
- <div class="endpoint">
809
- <span class="method post">POST</span>
810
- <span class="path">/llm</span>
811
- <div class="desc">RAG-based Q&A with source citations</div>
812
- </div>
813
- <div class="endpoint">
814
- <span class="method get">GET</span>
815
- <span class="path">/health</span>
816
- <div class="desc">Service health check and vector count</div>
817
- </div>
818
- </div>
819
- </div>
820
-
821
- <!-- Slide 6: Benchmark Results -->
822
- <div class="slide" id="slide6">
823
- <h2><span class="icon">📊</span> Benchmark Results</h2>
824
- <p style="margin-bottom: 30px;">We rigorously tested <span class="highlight-text">3 OCR models</span>, <span class="highlight-text">7 RAG configurations</span>, and <span class="highlight-text">3 LLMs</span> to optimize performance</p>
825
-
826
- <h3>OCR Model Comparison</h3>
827
- <table class="comparison-table">
828
- <tr>
829
- <th>Model</th>
830
- <th>Character Success Rate</th>
831
- <th>Word Success Rate</th>
832
- <th>Speed (12 pages)</th>
833
- <th>Type</th>
834
- </tr>
835
- <tr>
836
- <td>GPT-4.1</td>
837
- <td>88.12%</td>
838
- <td>67.44%</td>
839
- <td>199s</td>
840
- <td><span class="badge closed">Closed</span></td>
841
- </tr>
842
- <tr>
843
- <td class="winner">Llama-4-Maverick 17B ✓</td>
844
- <td class="winner">87.75%</td>
845
- <td class="winner">61.91%</td>
846
- <td class="winner">75s</td>
847
- <td><span class="badge open">Open</span></td>
848
- </tr>
849
- <tr>
850
- <td>Phi-4-multimodal</td>
851
- <td colspan="3" style="color: #ff6b6b;">Failed</td>
852
- <td><span class="badge open">Open</span></td>
853
- </tr>
854
- </table>
855
- <p style="margin-top: 20px; color: #00d4aa;">
856
- ✓ Chose Llama-4: character accuracy only 0.37 percentage points below GPT-4.1, but <strong>2.7x faster</strong> and <strong>open-source</strong>
857
- </p>
858
- </div>
859
-
860
- <!-- Slide 7: RAG Optimization -->
861
- <div class="slide" id="slide7">
862
- <h2><span class="icon">🎯</span> RAG Optimization Results</h2>
863
- <table class="comparison-table">
864
- <tr>
865
- <th>Configuration</th>
866
- <th>Answer Quality</th>
867
- <th>Citation Rate</th>
868
- <th>Response Time</th>
869
- </tr>
870
- <tr>
871
- <td class="winner">Citation-focused + Vanilla k3 ✓</td>
872
- <td class="winner">55.67%</td>
873
- <td class="winner">73.33%</td>
874
- <td class="winner">3.61s</td>
875
- </tr>
876
- <tr>
877
- <td>Few-shot + Vanilla k3</td>
878
- <td>45.70%</td>
879
- <td>40.00%</td>
880
- <td>2.17s</td>
881
- </tr>
882
- <tr>
883
- <td>Baseline + Vanilla k3</td>
884
- <td>39.65%</td>
885
- <td>20.00%</td>
886
- <td>2.28s</td>
887
- </tr>
888
- <tr>
889
- <td>MMR Retrieval</td>
890
- <td>34.60%</td>
891
- <td>6.67%</td>
892
- <td>2.53s</td>
893
- </tr>
894
- </table>
895
-
896
- <div class="decision-list" style="margin-top: 30px;">
897
- <div class="decision-item">
898
- <h4>Key Insight: Simple Beats Complex</h4>
899
- <p>Vanilla retrieval outperforms MMR reranking by <span class="result">+21%</span>. Top-3 beats Top-5 by <span class="result">+20%</span></p>
900
- </div>
901
- <div class="decision-item">
902
- <h4>Citation-Focused Prompting</h4>
903
- <p>Custom Azerbaijani prompt improves quality by <span class="result">+16%</span> and citation rate by <span class="result">+53%</span></p>
904
- </div>
905
- </div>
906
- </div>
907
-
908
- <!-- Slide 8: Performance Metrics -->
909
- <div class="slide" id="slide8">
910
- <h2><span class="icon">🏆</span> Performance Metrics</h2>
911
- <div class="stats-grid">
912
- <div class="stat-card">
913
- <div class="number">87.75%</div>
914
- <div class="label">OCR Accuracy</div>
915
- </div>
916
- <div class="stat-card">
917
- <div class="number">55.67%</div>
918
- <div class="label">Answer Quality</div>
919
- </div>
920
- <div class="stat-card">
921
- <div class="number">73.33%</div>
922
- <div class="label">Citation Rate</div>
923
- </div>
924
- <div class="stat-card">
925
- <div class="number">3.6s</div>
926
- <div class="label">Response Time</div>
927
- </div>
928
- </div>
929
-
930
- <h3 style="margin-top: 50px;">Estimated Hackathon Score</h3>
931
- <div class="score-breakdown">
932
- <div class="score-item">
933
- <span class="score-label">OCR Quality (50%)</span>
934
- <div class="score-bar-container">
935
- <div class="score-bar" style="width: 87.75%;"></div>
936
- </div>
937
- <span class="score-value">43.9 / 50</span>
938
- </div>
939
- <div class="score-item">
940
- <span class="score-label">LLM Quality (30%)</span>
941
- <div class="score-bar-container">
942
- <div class="score-bar" style="width: 55.67%;"></div>
943
- </div>
944
- <span class="score-value">16.7 / 30</span>
945
- </div>
946
- <div class="score-item">
947
- <span class="score-label">Architecture (20%)</span>
948
- <div class="score-bar-container">
949
- <div class="score-bar" style="width: 100%;"></div>
950
- </div>
951
- <span class="score-value">20 / 20</span>
952
- </div>
953
- <div class="score-item" style="border-top: 2px solid #00d4aa; padding-top: 20px; margin-top: 10px;">
954
- <span class="score-label" style="color: #00d4aa; font-weight: 700;">TOTAL SCORE</span>
955
- <div class="score-bar-container">
956
- <div class="score-bar" style="width: 88.1%;"></div>
957
- </div>
958
- <span class="score-value" style="font-size: 1.6rem;">440.6 / 500</span>
959
- </div>
960
- </div>
961
- </div>
962
-
963
- <!-- Slide 9: Key Technical Decisions -->
964
- <div class="slide" id="slide9">
965
- <h2><span class="icon">🔬</span> Key Technical Decisions</h2>
966
- <div class="two-col">
967
- <div class="col">
968
- <h3 style="color: #00d4aa;">What We Did</h3>
969
- <ul>
970
- <li><strong>Open-source Llama</strong> over proprietary GPT-4</li>
971
- <li><strong>Top-3 retrieval</strong> - more context confused the LLM</li>
972
- <li><strong>Vanilla retrieval</strong> - simple beats complex reranking</li>
973
- <li><strong>Citation-focused prompt</strong> in Azerbaijani</li>
974
- <li><strong>BAAI embeddings</strong> - 25% better than multilingual</li>
975
- <li><strong>600-char chunks</strong> with 100-char overlap</li>
976
- </ul>
977
- </div>
978
- <div class="col">
979
- <h3 style="color: #ff6b6b;">What We Avoided</h3>
980
- <ul>
981
- <li><strong>MMR/Reranking</strong> - 21% worse performance</li>
982
- <li><strong>Top-5+ retrieval</strong> - information overload</li>
983
- <li><strong>Few-shot prompting</strong> - inconsistent results</li>
984
- <li><strong>Multilingual embeddings</strong> - underperformed</li>
985
- <li><strong>Complex architectures</strong> - kept it simple</li>
986
- <li><strong>Closed-source models</strong> - for transparency</li>
987
- </ul>
988
- </div>
989
- </div>
990
- <div style="margin-top: 40px; text-align: center; padding: 25px; background: rgba(0, 212, 170, 0.1); border-radius: 12px; border: 1px solid #00d4aa;">
991
- <p style="font-size: 1.4rem; color: #00d4aa;">
992
- "Every decision was validated through rigorous benchmarking across 3 Jupyter notebooks"
993
- </p>
994
- </div>
995
- </div>
996
-
997
- <!-- Slide 10: Demo Features -->
998
- <div class="slide" id="slide10">
999
- <h2><span class="icon">🎮</span> Live Demo Features</h2>
1000
- <div class="demo-features">
1001
- <div class="demo-feature">
1002
- <span class="icon">📤</span>
1003
- <div>
1004
- <h4>PDF Upload & OCR</h4>
1005
- <p>Drag & drop any PDF to extract text with image detection. Results in markdown format.</p>
1006
- </div>
1007
- </div>
1008
- <div class="demo-feature">
1009
- <span class="icon">💬</span>
1010
- <div>
1011
- <h4>Interactive Q&A Chat</h4>
1012
- <p>Ask questions in Azerbaijani, Russian, or English. Get answers with source citations.</p>
1013
- </div>
1014
- </div>
1015
- <div class="demo-feature">
1016
- <span class="icon">📚</span>
1017
- <div>
1018
- <h4>Source Citations</h4>
1019
- <p>Every answer includes document name, page number, and relevant excerpt for verification.</p>
1020
- </div>
1021
- </div>
1022
- <div class="demo-feature">
1023
- <span class="icon">📖</span>
1024
- <div>
1025
- <h4>Swagger Documentation</h4>
1026
- <p>Full API documentation at /docs with interactive testing capabilities.</p>
1027
- </div>
1028
- </div>
1029
- </div>
1030
- <div style="margin-top: 50px; text-align: center;">
1031
- <p style="font-size: 1.5rem; color: #58a6ff;">
1032
- 🌐 <strong>localhost:8000</strong> | 📖 <strong>/docs</strong> for Swagger UI
1033
- </p>
1034
- </div>
1035
- </div>
1036
-
1037
- <!-- Slide 11: What We Built -->
1038
- <div class="slide" id="slide11">
1039
- <h2><span class="icon">📦</span> Deliverables</h2>
1040
- <div class="stats-grid">
1041
- <div class="stat-card">
1042
- <div class="number">28</div>
1043
- <div class="label">PDFs Processed</div>
1044
- </div>
1045
- <div class="stat-card">
1046
- <div class="number">1,128</div>
1047
- <div class="label">Vector Chunks</div>
1048
- </div>
1049
- <div class="stat-card">
1050
- <div class="number">3</div>
1051
- <div class="label">Benchmark Notebooks</div>
1052
- </div>
1053
- <div class="stat-card">
1054
- <div class="number">100%</div>
1055
- <div class="label">Open Source</div>
1056
- </div>
1057
- </div>
1058
- <div class="two-col" style="margin-top: 40px;">
1059
- <div class="col">
1060
- <h3>Code & Infrastructure</h3>
1061
- <ul>
1062
- <li>FastAPI application (505 lines)</li>
1063
- <li>Data ingestion pipeline</li>
1064
- <li>Parallel processing (4x speedup)</li>
1065
- <li>Docker + Docker Compose</li>
1066
- <li>Health monitoring</li>
1067
- <li>Interactive web UI</li>
1068
- </ul>
1069
- </div>
1070
- <div class="col">
1071
- <h3>Documentation & Analysis</h3>
1072
- <ul>
1073
- <li>8 comprehensive markdown docs</li>
1074
- <li>VLM OCR benchmark notebook</li>
1075
- <li>RAG optimization notebook</li>
1076
- <li>LLM comparison notebook</li>
1077
- <li>Sample questions & answers</li>
1078
- <li>Deployment guide</li>
1079
- </ul>
1080
- </div>
1081
- </div>
1082
- </div>
1083
-
1084
- <!-- Slide 12: Thank You -->
1085
- <div class="slide final-slide" id="slide12">
1086
- <h2><span class="icon">🙏</span> Thank You!</h2>
1087
- <p style="font-size: 1.8rem; color: #c9d1d9; margin-bottom: 10px;">
1088
- SOCAR Historical Documents AI System
1089
- </p>
1090
- <p style="font-size: 1.3rem; color: #8b949e; margin-bottom: 20px;">
1091
- Transforming archives into accessible, searchable knowledge
1092
- </p>
1093
- <div style="margin-bottom: 30px;">
1094
- <p style="font-size: 1.6rem; color: #00d4aa; font-weight: 700; margin-bottom: 10px;">Team BeatByte</p>
1095
- <p style="font-size: 1.2rem; color: #58a6ff;">Ulvi Bashirov &nbsp;•&nbsp; Samir Mehdiyev &nbsp;•&nbsp; Ismat Samadov</p>
1096
- </div>
1097
- <div class="stats-grid" style="max-width: 800px; margin: 0 auto;">
1098
- <div class="stat-card">
1099
- <div class="number">87.75%</div>
1100
- <div class="label">OCR Accuracy</div>
1101
- </div>
1102
- <div class="stat-card">
1103
- <div class="number">440.6</div>
1104
- <div class="label">Est. Score / 500</div>
1105
- </div>
1106
- <div class="stat-card">
1107
- <div class="number">100%</div>
1108
- <div class="label">Open Source</div>
1109
- </div>
1110
- <div class="stat-card">
1111
- <div class="number">3.6s</div>
1112
- <div class="label">Response Time</div>
1113
- </div>
1114
- </div>
1115
- <div style="margin-top: 40px;">
1116
- <p style="font-size: 2rem; color: #00d4aa; font-weight: 700;">
1117
- Questions? Let's Demo! 🚀
1118
- </p>
1119
- </div>
1120
- </div>
1121
-
1122
- <!-- Navigation -->
1123
- <div class="nav">
1124
- <button onclick="prevSlide()">← Previous</button>
1125
- <button onclick="nextSlide()">Next →</button>
1126
- </div>
1127
- <div class="slide-counter">
1128
- <span id="currentSlide">1</span> / <span id="totalSlides">12</span>
1129
- </div>
1130
-
1131
- <script>
1132
- let currentSlide = 1;
1133
- const totalSlides = 12;
1134
-
1135
- document.getElementById('totalSlides').textContent = totalSlides;
1136
-
1137
- function showSlide(n) {
1138
- const slides = document.querySelectorAll('.slide');
1139
-
1140
- if (n > totalSlides) currentSlide = 1;
1141
- if (n < 1) currentSlide = totalSlides;
1142
-
1143
- slides.forEach(slide => slide.classList.remove('active'));
1144
- document.getElementById('slide' + currentSlide).classList.add('active');
1145
- document.getElementById('currentSlide').textContent = currentSlide;
1146
- }
1147
-
1148
- function nextSlide() {
1149
- currentSlide++;
1150
- showSlide(currentSlide);
1151
- }
1152
-
1153
- function prevSlide() {
1154
- currentSlide--;
1155
- showSlide(currentSlide);
1156
- }
1157
-
1158
- // Keyboard navigation
1159
- document.addEventListener('keydown', function(e) {
1160
- if (e.key === 'ArrowRight' || e.key === ' ') {
1161
- nextSlide();
1162
- } else if (e.key === 'ArrowLeft') {
1163
- prevSlide();
1164
- }
1165
- });
1166
-
1167
- // Touch navigation for mobile
1168
- let touchStartX = 0;
1169
- let touchEndX = 0;
1170
-
1171
- document.addEventListener('touchstart', e => {
1172
- touchStartX = e.changedTouches[0].screenX;
1173
- });
1174
-
1175
- document.addEventListener('touchend', e => {
1176
- touchEndX = e.changedTouches[0].screenX;
1177
- if (touchStartX - touchEndX > 50) {
1178
- nextSlide();
1179
- } else if (touchEndX - touchStartX > 50) {
1180
- prevSlide();
1181
- }
1182
- });
1183
- </script>
1184
- </body>
1185
- </html>

presentation/pitch_deck.pdf DELETED
Binary file (79.4 kB)
 
presentation/pitch_deck_print.html DELETED
@@ -1,1084 +0,0 @@
1
- <!DOCTYPE html>
2
- <html lang="en">
3
- <head>
4
- <meta charset="UTF-8">
5
- <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
- <title>SOCAR Historical Documents AI - Hackathon Pitch</title>
7
- <style>
8
- @page {
9
- size: A4 landscape;
10
- margin: 0;
11
- }
12
-
13
- * {
14
- margin: 0;
15
- padding: 0;
16
- box-sizing: border-box;
17
- }
18
-
19
- body {
20
- font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
21
- background: #0d1117;
22
- color: #ffffff;
23
- }
24
-
25
- .slide {
26
- width: 297mm;
27
- height: 210mm;
28
- padding: 15mm 20mm;
29
- background: linear-gradient(135deg, #0d1117 0%, #161b22 100%);
30
- position: relative;
31
- overflow: hidden;
32
- page-break-after: always;
33
- page-break-inside: avoid;
34
- }
35
-
36
- .slide:last-child {
37
- page-break-after: auto;
38
- }
39
-
40
- .slide::before {
41
- content: '';
42
- position: absolute;
43
- top: 0;
44
- left: 0;
45
- right: 0;
46
- height: 3px;
47
- background: linear-gradient(90deg, #00d4aa, #0099ff, #00d4aa);
48
- }
49
-
50
- /* Title Slide */
51
- .title-slide {
52
- text-align: center;
53
- display: flex;
54
- flex-direction: column;
55
- justify-content: center;
56
- align-items: center;
57
- }
58
-
59
- .title-slide h1 {
60
- font-size: 42pt;
61
- font-weight: 700;
62
- color: #00d4aa;
63
- margin-bottom: 10px;
64
- }
65
-
66
- .title-slide .subtitle {
67
- font-size: 18pt;
68
- color: #8b949e;
69
- margin-bottom: 20px;
70
- }
71
-
72
- .title-slide .tagline {
73
- font-size: 14pt;
74
- color: #58a6ff;
75
- padding: 10px 20px;
76
- border: 2px solid #30363d;
77
- border-radius: 8px;
78
- background: rgba(88, 166, 255, 0.1);
79
- }
80
-
81
- .title-slide .team-info {
82
- margin-top: 30px;
83
- }
84
-
85
- .title-slide .team-name {
86
- font-size: 16pt;
87
- color: #00d4aa;
88
- font-weight: 700;
89
- margin-bottom: 8px;
90
- }
91
-
92
- .title-slide .team-members {
93
- font-size: 12pt;
94
- color: #8b949e;
95
- }
96
-
97
- /* Regular Slides */
98
- h2 {
99
- font-size: 28pt;
100
- color: #00d4aa;
101
- margin-bottom: 20px;
102
- }
103
-
104
- h2 .icon {
105
- font-size: 24pt;
106
- margin-right: 10px;
107
- }
108
-
109
- h3 {
110
- font-size: 14pt;
111
- color: #58a6ff;
112
- margin: 15px 0 10px 0;
113
- }
114
-
115
- p {
116
- font-size: 11pt;
117
- line-height: 1.6;
118
- color: #c9d1d9;
119
- }
120
-
121
- ul {
122
- list-style: none;
123
- padding-left: 0;
124
- }
125
-
126
- li {
127
- font-size: 11pt;
128
- line-height: 1.8;
129
- color: #c9d1d9;
130
- padding-left: 20px;
131
- position: relative;
132
- }
133
-
134
- li::before {
135
- content: '>';
136
- position: absolute;
137
- left: 0;
138
- color: #00d4aa;
139
- font-size: 10pt;
140
- }
141
-
142
- /* Stats Grid */
143
- .stats-grid {
144
- display: flex;
145
- justify-content: space-around;
146
- gap: 15px;
147
- margin-top: 20px;
148
- }
149
-
150
- .stat-card {
151
- background: linear-gradient(135deg, #21262d 0%, #161b22 100%);
152
- border: 1px solid #30363d;
153
- border-radius: 10px;
154
- padding: 15px 25px;
155
- text-align: center;
156
- flex: 1;
157
- }
158
-
159
- .stat-card .number {
160
- font-size: 24pt;
161
- font-weight: 700;
162
- color: #00d4aa;
163
- }
164
-
165
- .stat-card .label {
166
- font-size: 9pt;
167
- color: #8b949e;
168
- margin-top: 5px;
169
- text-transform: uppercase;
170
- letter-spacing: 1px;
171
- }
172
-
173
- /* Architecture Diagram */
174
- .architecture {
175
- display: flex;
176
- justify-content: space-between;
177
- align-items: center;
178
- margin-top: 15px;
179
- padding: 10px;
180
- }
181
-
182
- .arch-box {
183
- background: linear-gradient(135deg, #21262d 0%, #161b22 100%);
184
- border: 2px solid #30363d;
185
- border-radius: 8px;
186
- padding: 12px 18px;
187
- text-align: center;
188
- min-width: 90px;
189
- }
190
-
191
- .arch-box.highlight {
192
- border-color: #00d4aa;
193
- }
194
-
195
- .arch-box .title {
196
- font-size: 8pt;
197
- color: #8b949e;
198
- text-transform: uppercase;
199
- letter-spacing: 1px;
200
- margin-bottom: 4px;
201
- }
202
-
203
- .arch-box .tech {
204
- font-size: 10pt;
205
- color: #58a6ff;
206
- font-weight: 600;
207
- }
208
-
209
- .arrow {
210
- font-size: 16pt;
211
- color: #00d4aa;
212
- }
213
-
214
- /* Two Column Layout */
215
- .two-col {
216
- display: flex;
217
- gap: 30px;
218
- margin-top: 10px;
219
- }
220
-
221
- .col {
222
- background: rgba(33, 38, 45, 0.5);
223
- border-radius: 10px;
224
- padding: 15px;
225
- border: 1px solid #30363d;
226
- flex: 1;
227
- }
228
-
229
- /* Tech Stack */
230
- .tech-stack {
231
- display: flex;
232
- flex-wrap: wrap;
233
- gap: 10px;
234
- margin-top: 15px;
235
- }
236
-
237
- .tech-item {
238
- background: linear-gradient(135deg, #21262d 0%, #161b22 100%);
239
- border: 1px solid #30363d;
240
- border-radius: 8px;
241
- padding: 10px 15px;
242
- display: flex;
243
- align-items: center;
244
- gap: 10px;
245
- width: calc(33% - 10px);
246
- }
247
-
248
- .tech-item .icon {
249
- font-size: 16pt;
250
- }
251
-
252
- .tech-item .name {
253
- font-size: 10pt;
254
- color: #c9d1d9;
255
- }
256
-
257
- .tech-item .desc {
258
- font-size: 8pt;
259
- color: #8b949e;
260
- }
261
-
262
- /* Comparison Table */
263
- .comparison-table {
264
- width: 100%;
265
- margin-top: 15px;
266
- border-collapse: collapse;
267
- font-size: 10pt;
268
- }
269
-
270
- .comparison-table th,
271
- .comparison-table td {
272
- padding: 10px 15px;
273
- text-align: left;
274
- border-bottom: 1px solid #30363d;
275
- }
276
-
277
- .comparison-table th {
278
- background: #21262d;
279
- color: #58a6ff;
280
- font-size: 10pt;
281
- font-weight: 600;
282
- }
283
-
284
- .comparison-table td {
285
- color: #c9d1d9;
286
- }
287
-
288
- .comparison-table .winner {
289
- color: #00d4aa;
290
- font-weight: 600;
291
- }
292
-
293
- .comparison-table .badge {
294
- display: inline-block;
295
- padding: 2px 8px;
296
- border-radius: 10px;
297
- font-size: 8pt;
298
- font-weight: 600;
299
- }
300
-
301
- .badge.open {
302
- background: rgba(0, 212, 170, 0.2);
303
- color: #00d4aa;
304
- }
305
-
306
- .badge.closed {
307
- background: rgba(255, 107, 107, 0.2);
308
- color: #ff6b6b;
309
- }
310
-
311
- /* Flow Diagram */
312
- .flow {
313
- display: flex;
314
- flex-direction: column;
315
- gap: 8px;
316
- margin-top: 10px;
317
- }
318
-
319
- .flow-row {
320
- display: flex;
321
- align-items: center;
322
- gap: 8px;
323
- }
324
-
325
- .flow-box {
326
- background: #21262d;
327
- border: 1px solid #30363d;
328
- border-radius: 6px;
329
- padding: 6px 12px;
330
- font-size: 9pt;
331
- color: #c9d1d9;
332
- }
333
-
334
- .flow-box.primary {
335
- border-color: #00d4aa;
336
- color: #00d4aa;
337
- }
338
-
339
- .flow-arrow {
340
- color: #58a6ff;
341
- font-size: 10pt;
342
- }
343
-
344
- /* Problem icons */
345
- .problem-grid {
346
- display: flex;
347
- gap: 20px;
348
- margin-top: 20px;
349
- }
350
-
351
- .problem-card {
352
- background: linear-gradient(135deg, #21262d 0%, #161b22 100%);
353
- border: 1px solid #30363d;
354
- border-radius: 10px;
355
- padding: 20px;
356
- text-align: center;
357
- flex: 1;
358
- }
359
-
360
- .problem-card .icon {
361
- font-size: 24pt;
362
- margin-bottom: 10px;
363
- }
364
-
365
- .problem-card h4 {
366
- font-size: 12pt;
367
- color: #ff6b6b;
368
- margin-bottom: 8px;
369
- }
370
-
371
- .problem-card p {
372
- font-size: 9pt;
373
- color: #8b949e;
374
- }
375
-
376
- /* Solution cards */
377
- .solution-grid {
378
- display: flex;
379
- flex-wrap: wrap;
380
- gap: 15px;
381
- margin-top: 15px;
382
- }
383
-
384
- .solution-card {
385
- background: linear-gradient(135deg, rgba(0, 212, 170, 0.1) 0%, rgba(0, 153, 255, 0.1) 100%);
386
- border: 1px solid #00d4aa;
387
- border-radius: 10px;
388
- padding: 15px;
389
- width: calc(50% - 10px);
390
- }
391
-
392
- .solution-card h4 {
393
- font-size: 12pt;
394
- color: #00d4aa;
395
- margin-bottom: 8px;
396
- }
397
-
398
- .solution-card p {
399
- font-size: 9pt;
400
- color: #c9d1d9;
401
- }
402
-
403
- /* Score breakdown */
404
- .score-breakdown {
405
- margin-top: 15px;
406
- }
407
-
408
- .score-item {
409
- display: flex;
410
- align-items: center;
411
- margin-bottom: 12px;
412
- }
413
-
414
- .score-label {
415
- width: 140px;
416
- font-size: 10pt;
417
- color: #c9d1d9;
418
- }
419
-
420
- .score-bar-container {
421
- flex: 1;
422
- height: 20px;
423
- background: #21262d;
424
- border-radius: 10px;
425
- overflow: hidden;
426
- margin: 0 15px;
427
- }
428
-
429
- .score-bar {
430
- height: 100%;
431
- background: linear-gradient(90deg, #00d4aa, #0099ff);
432
- border-radius: 10px;
433
- }
434
-
435
- .score-value {
436
- width: 80px;
437
- font-size: 11pt;
438
- font-weight: 700;
439
- color: #00d4aa;
440
- text-align: right;
441
- }
442
-
443
- /* Final slide */
444
- .final-slide {
445
- text-align: center;
446
- display: flex;
447
- flex-direction: column;
448
- justify-content: center;
449
- align-items: center;
450
- }
451
-
452
- .final-slide h2 {
453
- font-size: 32pt;
454
- margin-bottom: 15px;
455
- }
456
-
457
- /* Highlight text */
458
- .highlight-text {
459
- color: #00d4aa;
460
- font-weight: 600;
461
- }
462
-
463
- /* Demo section */
464
- .demo-features {
465
- display: flex;
466
- flex-wrap: wrap;
467
- gap: 15px;
468
- margin-top: 15px;
469
- }
470
-
471
- .demo-feature {
472
- background: #21262d;
473
- border: 1px solid #30363d;
474
- border-radius: 10px;
475
- padding: 15px;
476
- display: flex;
477
- gap: 12px;
478
- align-items: flex-start;
479
- width: calc(50% - 10px);
480
- }
481
-
482
- .demo-feature .icon {
483
- font-size: 20pt;
484
- color: #00d4aa;
485
- }
486
-
487
- .demo-feature h4 {
488
- font-size: 11pt;
489
- color: #c9d1d9;
490
- margin-bottom: 5px;
491
- }
492
-
493
- .demo-feature p {
494
- font-size: 9pt;
495
- color: #8b949e;
496
- }
497
-
498
- /* API endpoints */
499
- .endpoint {
500
- background: #161b22;
501
- border: 1px solid #30363d;
502
- border-radius: 6px;
503
- padding: 10px 15px;
504
- margin: 8px 0;
505
- font-family: 'Courier New', monospace;
506
- }
507
-
508
- .endpoint .method {
509
- display: inline-block;
510
- padding: 2px 8px;
511
- border-radius: 4px;
512
- font-size: 8pt;
513
- font-weight: 700;
514
- margin-right: 10px;
515
- }
516
-
517
- .endpoint .method.post {
518
- background: rgba(0, 212, 170, 0.2);
519
- color: #00d4aa;
520
- }
521
-
522
- .endpoint .method.get {
523
- background: rgba(88, 166, 255, 0.2);
524
- color: #58a6ff;
525
- }
526
-
527
- .endpoint .path {
528
- color: #c9d1d9;
529
- font-size: 10pt;
530
- }
531
-
532
- .endpoint .desc {
533
- color: #8b949e;
534
- font-size: 8pt;
535
- margin-left: 50px;
536
- margin-top: 3px;
537
- }
538
-
539
- /* Key decisions */
540
- .decision-list {
541
- margin-top: 12px;
542
- }
543
-
544
- .decision-item {
545
- background: #21262d;
546
- border-left: 3px solid #00d4aa;
547
- padding: 12px 15px;
548
- margin: 10px 0;
549
- border-radius: 0 6px 6px 0;
550
- }
551
-
552
- .decision-item h4 {
553
- color: #58a6ff;
554
- font-size: 11pt;
555
- margin-bottom: 5px;
556
- }
557
-
558
- .decision-item p {
559
- font-size: 9pt;
560
- color: #8b949e;
561
- }
562
-
563
- .decision-item .result {
564
- color: #00d4aa;
565
- font-weight: 600;
566
- }
567
-
568
- .slide-number {
569
- position: absolute;
570
- bottom: 10mm;
571
- right: 15mm;
572
- font-size: 10pt;
573
- color: #8b949e;
574
- }
575
- </style>
576
- </head>
577
- <body>
578
- <!-- Slide 1: Title -->
579
- <div class="slide title-slide">
580
- <h1>SOCAR Historical Documents AI</h1>
581
- <p class="subtitle">Intelligent OCR & RAG System for Oil & Gas Archives</p>
582
- <p class="tagline">Transforming 28 Historical Documents into Searchable Knowledge</p>
583
- <div class="team-info">
584
- <p class="team-name">Team BeatByte</p>
585
- <p class="team-members">Ulvi Bashirov | Samir Mehdiyev | Ismat Samadov</p>
586
- </div>
587
- <div class="slide-number">1 / 12</div>
588
- </div>
589
-
590
- <!-- Slide 2: Problem Statement -->
591
- <div class="slide">
592
- <h2><span class="icon">!</span> The Problem</h2>
593
- <div class="problem-grid">
594
- <div class="problem-card">
595
- <div class="icon">PDF</div>
596
- <h4>Inaccessible Archives</h4>
597
- <p>Decades of valuable historical documents locked in PDF format, impossible to search</p>
598
- </div>
599
- <div class="problem-card">
600
- <div class="icon">ABC</div>
601
- <h4>Multi-Language Barrier</h4>
602
- <p>Documents in Azerbaijani, Russian, and English with complex Cyrillic text</p>
603
- </div>
604
- <div class="problem-card">
605
- <div class="icon">TIME</div>
606
- <h4>Time-Consuming Research</h4>
607
- <p>Manual document review takes hours to find specific information</p>
608
- </div>
609
- </div>
610
- <p style="margin-top: 25px; text-align: center; font-size: 14pt; color: #ff6b6b;">
611
- How can we unlock institutional knowledge trapped in historical documents?
612
- </p>
613
- <div class="slide-number">2 / 12</div>
614
- </div>
615
-
616
- <!-- Slide 3: Our Solution -->
617
- <div class="slide">
618
- <h2><span class="icon">*</span> Our Solution</h2>
619
- <div class="solution-grid">
620
- <div class="solution-card">
621
- <h4>Vision-Language OCR</h4>
622
- <p>State-of-the-art Llama-4-Maverick model extracts text from scanned documents with <span class="highlight-text">87.75% accuracy</span>, preserving Cyrillic characters perfectly</p>
623
- </div>
624
- <div class="solution-card">
625
- <h4>Semantic Search</h4>
626
- <p>BAAI/bge-large embeddings + Pinecone vector database enable instant retrieval across <span class="highlight-text">1,128 document chunks</span></p>
627
- </div>
628
- <div class="solution-card">
629
- <h4>RAG-Powered Q&A</h4>
630
- <p>Natural language questions answered with relevant context and <span class="highlight-text">source citations</span> for verification</p>
631
- </div>
632
- <div class="solution-card">
633
- <h4>Production-Ready API</h4>
634
- <p>FastAPI backend with Docker deployment, health monitoring, and interactive web interface</p>
635
- </div>
636
- </div>
637
- <div class="slide-number">3 / 12</div>
638
- </div>
639
-
640
- <!-- Slide 4: Architecture -->
641
- <div class="slide">
642
- <h2><span class="icon">#</span> System Architecture</h2>
643
- <div class="architecture">
644
- <div class="arch-box">
645
- <div class="title">Input</div>
646
- <div class="tech">PDF Documents</div>
647
- </div>
648
- <span class="arrow">-></span>
649
- <div class="arch-box highlight">
650
- <div class="title">OCR Engine</div>
651
- <div class="tech">Llama-4 Vision</div>
652
- </div>
653
- <span class="arrow">-></span>
654
- <div class="arch-box">
655
- <div class="title">Embeddings</div>
656
- <div class="tech">BAAI/bge-large</div>
657
- </div>
658
- <span class="arrow">-></span>
659
- <div class="arch-box highlight">
660
- <div class="title">Vector DB</div>
661
- <div class="tech">Pinecone Cloud</div>
662
- </div>
663
- <span class="arrow">-></span>
664
- <div class="arch-box">
665
- <div class="title">LLM</div>
666
- <div class="tech">Llama-4 17B</div>
667
- </div>
668
- </div>
669
- <div class="two-col" style="margin-top: 20px;">
670
- <div class="col">
671
- <h3>OCR Pipeline</h3>
672
- <div class="flow">
673
- <div class="flow-row">
674
- <div class="flow-box">PDF Upload</div>
675
- <span class="flow-arrow">-></span>
676
- <div class="flow-box">PyMuPDF (100 DPI)</div>
677
- <span class="flow-arrow">-></span>
678
- <div class="flow-box primary">Vision LLM</div>
679
- </div>
680
- <div class="flow-row">
681
- <div class="flow-box">Image Detection</div>
682
- <span class="flow-arrow">-></span>
683
- <div class="flow-box">Markdown Output</div>
684
- </div>
685
- </div>
686
- </div>
687
- <div class="col">
688
- <h3>RAG Pipeline</h3>
689
- <div class="flow">
690
- <div class="flow-row">
691
- <div class="flow-box">User Question</div>
692
- <span class="flow-arrow">-></span>
693
- <div class="flow-box">Embed Query</div>
694
- <span class="flow-arrow">-></span>
695
- <div class="flow-box primary">Top-3 Retrieval</div>
696
- </div>
697
- <div class="flow-row">
698
- <div class="flow-box">Context Building</div>
699
- <span class="flow-arrow">-></span>
700
- <div class="flow-box">LLM + Citations</div>
701
- </div>
702
- </div>
703
- </div>
704
- </div>
705
- <div class="slide-number">4 / 12</div>
706
- </div>
707
-
708
- <!-- Slide 5: Technology Stack -->
709
- <div class="slide">
710
- <h2><span class="icon">+</span> Technology Stack</h2>
711
- <div class="tech-stack">
712
- <div class="tech-item">
713
- <span class="icon">L</span>
714
- <div>
715
- <div class="name">Llama-4-Maverick 17B</div>
716
- <div class="desc">Vision & Language Model</div>
717
- </div>
718
- </div>
719
- <div class="tech-item">
720
- <span class="icon">B</span>
721
- <div>
722
- <div class="name">BAAI/bge-large-en</div>
723
- <div class="desc">1024-dim Embeddings</div>
724
- </div>
725
- </div>
726
- <div class="tech-item">
727
- <span class="icon">P</span>
728
- <div>
729
- <div class="name">Pinecone Cloud</div>
730
- <div class="desc">Vector Database</div>
731
- </div>
732
- </div>
733
- <div class="tech-item">
734
- <span class="icon">F</span>
735
- <div>
736
- <div class="name">FastAPI</div>
737
- <div class="desc">Async REST API</div>
738
- </div>
739
- </div>
740
- <div class="tech-item">
741
- <span class="icon">M</span>
742
- <div>
743
- <div class="name">PyMuPDF</div>
744
- <div class="desc">PDF Processing</div>
745
- </div>
746
- </div>
747
- <div class="tech-item">
748
- <span class="icon">D</span>
749
- <div>
750
- <div class="name">Docker</div>
751
- <div class="desc">Containerization</div>
752
- </div>
753
- </div>
754
- </div>
755
- <div style="margin-top: 25px;">
756
- <h3>API Endpoints</h3>
757
- <div class="endpoint">
758
- <span class="method post">POST</span>
759
- <span class="path">/ocr</span>
760
- <div class="desc">Extract text from uploaded PDF with image detection</div>
761
- </div>
762
- <div class="endpoint">
763
- <span class="method post">POST</span>
764
- <span class="path">/llm</span>
765
- <div class="desc">RAG-based Q&A with source citations</div>
766
- </div>
767
- <div class="endpoint">
768
- <span class="method get">GET</span>
769
- <span class="path">/health</span>
770
- <div class="desc">Service health check and vector count</div>
771
- </div>
772
- </div>
773
- <div class="slide-number">5 / 12</div>
774
- </div>
775
-
776
- <!-- Slide 6: Benchmark Results -->
777
- <div class="slide">
778
- <h2><span class="icon">%</span> Benchmark Results</h2>
779
- <p style="margin-bottom: 15px;">We rigorously tested <span class="highlight-text">3 OCR models</span>, <span class="highlight-text">7 RAG configurations</span>, and <span class="highlight-text">3 LLMs</span> to optimize performance</p>
780
-
781
- <h3>OCR Model Comparison</h3>
782
- <table class="comparison-table">
783
- <tr>
784
- <th>Model</th>
785
- <th>Character Success Rate</th>
786
- <th>Word Success Rate</th>
787
- <th>Speed (12 pages)</th>
788
- <th>Type</th>
789
- </tr>
790
- <tr>
791
- <td>GPT-4.1</td>
792
- <td>88.12%</td>
793
- <td>67.44%</td>
794
- <td>199s</td>
795
- <td><span class="badge closed">Closed</span></td>
796
- </tr>
797
- <tr>
798
- <td class="winner">Llama-4-Maverick 17B [Selected]</td>
799
- <td class="winner">87.75%</td>
800
- <td class="winner">61.91%</td>
801
- <td class="winner">75s</td>
802
- <td><span class="badge open">Open</span></td>
803
- </tr>
804
- <tr>
805
- <td>Phi-4-multimodal</td>
806
- <td colspan="3" style="color: #ff6b6b;">Failed</td>
807
- <td><span class="badge open">Open</span></td>
808
- </tr>
809
- </table>
810
- <p style="margin-top: 15px; color: #00d4aa;">
811
- Selected Llama-4: character success rate only 0.37 percentage points below GPT-4.1, but <strong>2.7x faster</strong> and <strong>open-source</strong>
812
- </p>
813
- <div class="slide-number">6 / 12</div>
814
- </div>
815
-
816
- <!-- Slide 7: RAG Optimization -->
817
- <div class="slide">
818
- <h2><span class="icon">@</span> RAG Optimization Results</h2>
819
- <table class="comparison-table">
820
- <tr>
821
- <th>Configuration</th>
822
- <th>Answer Quality</th>
823
- <th>Citation Rate</th>
824
- <th>Response Time</th>
825
- </tr>
826
- <tr>
827
- <td class="winner">Citation-focused + Vanilla k3 [Selected]</td>
828
- <td class="winner">55.67%</td>
829
- <td class="winner">73.33%</td>
830
- <td class="winner">3.61s</td>
831
- </tr>
832
- <tr>
833
- <td>Few-shot + Vanilla k3</td>
834
- <td>45.70%</td>
835
- <td>40.00%</td>
836
- <td>2.17s</td>
837
- </tr>
838
- <tr>
839
- <td>Baseline + Vanilla k3</td>
840
- <td>39.65%</td>
841
- <td>20.00%</td>
842
- <td>2.28s</td>
843
- </tr>
844
- <tr>
845
- <td>MMR Retrieval</td>
846
- <td>34.60%</td>
847
- <td>6.67%</td>
848
- <td>2.53s</td>
849
- </tr>
850
- </table>
851
-
852
- <div class="decision-list">
853
- <div class="decision-item">
854
- <h4>Key Insight: Simple Beats Complex</h4>
855
- <p>Vanilla retrieval outperforms MMR reranking by <span class="result">+21%</span>. Top-3 beats Top-5 by <span class="result">+20%</span></p>
856
- </div>
857
- <div class="decision-item">
858
- <h4>Citation-Focused Prompting</h4>
859
- <p>Custom Azerbaijani prompt improves answer quality by <span class="result">+16 points</span> and citation rate by <span class="result">+53 points</span> over the baseline prompt</p>
860
- </div>
861
- </div>
862
- <div class="slide-number">7 / 12</div>
863
- </div>
864
-
865
- <!-- Slide 8: Performance Metrics -->
866
- <div class="slide">
867
- <h2><span class="icon">^</span> Performance Metrics</h2>
868
- <div class="stats-grid">
869
- <div class="stat-card">
870
- <div class="number">87.75%</div>
871
- <div class="label">OCR Accuracy</div>
872
- </div>
873
- <div class="stat-card">
874
- <div class="number">55.67%</div>
875
- <div class="label">Answer Quality</div>
876
- </div>
877
- <div class="stat-card">
878
- <div class="number">73.33%</div>
879
- <div class="label">Citation Rate</div>
880
- </div>
881
- <div class="stat-card">
882
- <div class="number">3.6s</div>
883
- <div class="label">Response Time</div>
884
- </div>
885
- </div>
886
-
887
- <h3 style="margin-top: 25px;">Estimated Hackathon Score</h3>
888
- <div class="score-breakdown">
889
- <div class="score-item">
890
- <span class="score-label">OCR Quality (50%)</span>
891
- <div class="score-bar-container">
892
- <div class="score-bar" style="width: 87.75%;"></div>
893
- </div>
894
- <span class="score-value">43.9 / 50</span>
895
- </div>
896
- <div class="score-item">
897
- <span class="score-label">LLM Quality (30%)</span>
898
- <div class="score-bar-container">
899
- <div class="score-bar" style="width: 55.67%;"></div>
900
- </div>
901
- <span class="score-value">16.7 / 30</span>
902
- </div>
903
- <div class="score-item">
904
- <span class="score-label">Architecture (20%)</span>
905
- <div class="score-bar-container">
906
- <div class="score-bar" style="width: 100%;"></div>
907
- </div>
908
- <span class="score-value">20 / 20</span>
909
- </div>
910
- <div class="score-item" style="border-top: 2px solid #00d4aa; padding-top: 12px; margin-top: 8px;">
911
- <span class="score-label" style="color: #00d4aa; font-weight: 700;">TOTAL SCORE</span>
912
- <div class="score-bar-container">
913
- <div class="score-bar" style="width: 88.1%;"></div>
914
- </div>
915
- <span class="score-value" style="font-size: 14pt;">440.6 / 500</span>
916
- </div>
917
- </div>
918
- <div class="slide-number">8 / 12</div>
919
- </div>
920
-
921
- <!-- Slide 9: Key Technical Decisions -->
922
- <div class="slide">
923
- <h2><span class="icon">&</span> Key Technical Decisions</h2>
924
- <div class="two-col">
925
- <div class="col">
926
- <h3 style="color: #00d4aa;">What We Did</h3>
927
- <ul>
928
- <li><strong>Open-source Llama</strong> over proprietary GPT-4</li>
929
- <li><strong>Top-3 retrieval</strong> - more context confused the LLM</li>
930
- <li><strong>Vanilla retrieval</strong> - simple beats complex reranking</li>
931
- <li><strong>Citation-focused prompt</strong> in Azerbaijani</li>
932
- <li><strong>BAAI embeddings</strong> - 25% better than multilingual</li>
933
- <li><strong>600-char chunks</strong> with 100-char overlap</li>
934
- </ul>
935
- </div>
936
- <div class="col">
937
- <h3 style="color: #ff6b6b;">What We Avoided</h3>
938
- <ul>
939
- <li><strong>MMR/Reranking</strong> - 21% worse performance</li>
940
- <li><strong>Top-5+ retrieval</strong> - information overload</li>
941
- <li><strong>Few-shot prompting</strong> - inconsistent results</li>
942
- <li><strong>Multilingual embeddings</strong> - underperformed</li>
943
- <li><strong>Complex architectures</strong> - kept it simple</li>
944
- <li><strong>Closed-source models</strong> - for transparency</li>
945
- </ul>
946
- </div>
947
- </div>
948
- <div style="margin-top: 20px; text-align: center; padding: 15px; background: rgba(0, 212, 170, 0.1); border-radius: 8px; border: 1px solid #00d4aa;">
949
- <p style="font-size: 12pt; color: #00d4aa;">
950
- "Every decision was validated through rigorous benchmarking across 3 Jupyter notebooks"
951
- </p>
952
- </div>
953
- <div class="slide-number">9 / 12</div>
954
- </div>
955
-
956
- <!-- Slide 10: Demo Features -->
957
- <div class="slide">
958
- <h2><span class="icon">></span> Live Demo Features</h2>
959
- <div class="demo-features">
960
- <div class="demo-feature">
961
- <span class="icon">[^]</span>
962
- <div>
963
- <h4>PDF Upload & OCR</h4>
964
- <p>Drag & drop any PDF to extract text with image detection. Results in markdown format.</p>
965
- </div>
966
- </div>
967
- <div class="demo-feature">
968
- <span class="icon">[?]</span>
969
- <div>
970
- <h4>Interactive Q&A Chat</h4>
971
- <p>Ask questions in Azerbaijani, Russian, or English. Get answers with source citations.</p>
972
- </div>
973
- </div>
974
- <div class="demo-feature">
975
- <span class="icon">[i]</span>
976
- <div>
977
- <h4>Source Citations</h4>
978
- <p>Every answer includes document name, page number, and relevant excerpt for verification.</p>
979
- </div>
980
- </div>
981
- <div class="demo-feature">
982
- <span class="icon">[=]</span>
983
- <div>
984
- <h4>Swagger Documentation</h4>
985
- <p>Full API documentation at /docs with interactive testing capabilities.</p>
986
- </div>
987
- </div>
988
- </div>
989
- <div style="margin-top: 30px; text-align: center;">
990
- <p style="font-size: 14pt; color: #58a6ff;">
991
- Web UI: <strong>localhost:8000</strong> | API Docs: <strong>/docs</strong>
992
- </p>
993
- </div>
994
- <div class="slide-number">10 / 12</div>
995
- </div>
996
-
997
- <!-- Slide 11: What We Built -->
998
- <div class="slide">
999
- <h2><span class="icon">=</span> Deliverables</h2>
1000
- <div class="stats-grid">
1001
- <div class="stat-card">
1002
- <div class="number">28</div>
1003
- <div class="label">PDFs Processed</div>
1004
- </div>
1005
- <div class="stat-card">
1006
- <div class="number">1,128</div>
1007
- <div class="label">Vector Chunks</div>
1008
- </div>
1009
- <div class="stat-card">
1010
- <div class="number">3</div>
1011
- <div class="label">Benchmark Notebooks</div>
1012
- </div>
1013
- <div class="stat-card">
1014
- <div class="number">100%</div>
1015
- <div class="label">Open Source</div>
1016
- </div>
1017
- </div>
1018
- <div class="two-col" style="margin-top: 20px;">
1019
- <div class="col">
1020
- <h3>Code & Infrastructure</h3>
1021
- <ul>
1022
- <li>FastAPI application (505 lines)</li>
1023
- <li>Data ingestion pipeline</li>
1024
- <li>Parallel processing (4x speedup)</li>
1025
- <li>Docker + Docker Compose</li>
1026
- <li>Health monitoring</li>
1027
- <li>Interactive web UI</li>
1028
- </ul>
1029
- </div>
1030
- <div class="col">
1031
- <h3>Documentation & Analysis</h3>
1032
- <ul>
1033
- <li>8 comprehensive markdown docs</li>
1034
- <li>VLM OCR benchmark notebook</li>
1035
- <li>RAG optimization notebook</li>
1036
- <li>LLM comparison notebook</li>
1037
- <li>Sample questions & answers</li>
1038
- <li>Deployment guide</li>
1039
- </ul>
1040
- </div>
1041
- </div>
1042
- <div class="slide-number">11 / 12</div>
1043
- </div>
1044
-
1045
- <!-- Slide 12: Thank You -->
1046
- <div class="slide final-slide">
1047
- <h2>Thank You!</h2>
1048
- <p style="font-size: 16pt; color: #c9d1d9; margin-bottom: 8px;">
1049
- SOCAR Historical Documents AI System
1050
- </p>
1051
- <p style="font-size: 11pt; color: #8b949e; margin-bottom: 15px;">
1052
- Transforming archives into accessible, searchable knowledge
1053
- </p>
1054
- <div style="margin-bottom: 20px;">
1055
- <p style="font-size: 14pt; color: #00d4aa; font-weight: 700; margin-bottom: 8px;">Team BeatByte</p>
1056
- <p style="font-size: 11pt; color: #58a6ff;">Ulvi Bashirov | Samir Mehdiyev | Ismat Samadov</p>
1057
- </div>
1058
- <div class="stats-grid" style="max-width: 600px;">
1059
- <div class="stat-card">
1060
- <div class="number">87.75%</div>
1061
- <div class="label">OCR Accuracy</div>
1062
- </div>
1063
- <div class="stat-card">
1064
- <div class="number">440.6</div>
1065
- <div class="label">Est. Score / 500</div>
1066
- </div>
1067
- <div class="stat-card">
1068
- <div class="number">100%</div>
1069
- <div class="label">Open Source</div>
1070
- </div>
1071
- <div class="stat-card">
1072
- <div class="number">3.6s</div>
1073
- <div class="label">Response Time</div>
1074
- </div>
1075
- </div>
1076
- <div style="margin-top: 25px;">
1077
- <p style="font-size: 18pt; color: #00d4aa; font-weight: 700;">
1078
- Questions? Let's Demo!
1079
- </p>
1080
- </div>
1081
- <div class="slide-number">12 / 12</div>
1082
- </div>
1083
- </body>
1084
- </html>
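The derived figures in the removed deck (the accuracy gap and speedup on slide 6, and the per-category scores on slide 8) can be re-checked from the table values. The sketch below is only an illustration: the variable names and the helper `weighted_contributions` are hypothetical, the 50/30/20 weights are read off the slide 8 labels, and the architecture category is assumed to earn full marks, as the slide does. The three contributions sum to about 80.6 on a 100-point scale; the slide's "440.6 / 500" total uses a scale that this sketch does not attempt to reproduce.

```python
# Values copied from the benchmark tables on slides 6-8 of the removed deck.
GPT41_CSR = 88.12        # GPT-4.1 character success rate, percent
LLAMA4_CSR = 87.75       # Llama-4-Maverick character success rate, percent
GPT41_SECONDS = 199.0    # time for 12 pages
LLAMA4_SECONDS = 75.0    # time for 12 pages
ANSWER_QUALITY = 55.67   # selected RAG configuration, percent
ARCHITECTURE = 100.0     # assumption: full marks, as shown on slide 8

# Category weights as labelled on slide 8 (assumed to be the scoring rubric).
WEIGHTS = {"ocr": 0.50, "llm": 0.30, "architecture": 0.20}


def weighted_contributions(scores, weights):
    """Return each category's contribution on a 100-point scale."""
    return {name: scores[name] * weights[name] for name in weights}


scores = {"ocr": LLAMA4_CSR, "llm": ANSWER_QUALITY, "architecture": ARCHITECTURE}
contributions = weighted_contributions(scores, WEIGHTS)
total = sum(contributions.values())

accuracy_gap = GPT41_CSR - LLAMA4_CSR      # in percentage points
speedup = GPT41_SECONDS / LLAMA4_SECONDS   # how many times faster Llama-4 ran

for name, points in contributions.items():
    print(f"{name}: {points:.1f} / {WEIGHTS[name] * 100:.0f}")
print(f"total: {total:.1f} / 100")
print(f"accuracy gap: {accuracy_gap:.2f} percentage points")
print(f"speedup: {speedup:.1f}x")
```

The per-category values round to 43.9, 16.7, and 20.0, matching the slide, and the speedup of 199 s over 75 s rounds to the quoted 2.7x.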