IsmatS committed · Commit 4c9673e · 1 Parent(s): a6ccae9
DEPLOYMENT.md DELETED
@@ -1,257 +0,0 @@
1
- # SOCAR Hackathon - LLM API Deployment Guide
2
-
3
- ## Overview
4
-
5
- A production-ready FastAPI service for the SOCAR historical-documents chatbot.
6
-
7
- **Configuration (Based on RAG Optimization Benchmark):**
8
- - **Model**: Llama-4-Maverick-17B-128E-Instruct-FP8 (Open-source)
9
- - **Embedding**: BAAI/bge-large-en-v1.5
10
- - **Retrieval**: Top-3 vanilla
11
- - **Prompt Strategy**: Citation-focused
12
- - **Performance**: 55.67% LLM Judge Score, 73.33% Citation Score, ~3.6s response time
13
-
14
- ## Quick Start
15
-
16
- ### Prerequisites
17
- - Docker and Docker Compose installed
18
- - `.env` file with API keys (see `.env.example`)
19
-
20
- ### 1. Configure Environment
21
-
22
- ```bash
23
- cp .env.example .env
24
- # Edit .env with your actual API keys:
25
- # - AZURE_OPENAI_API_KEY
26
- # - AZURE_OPENAI_ENDPOINT
27
- # - PINECONE_API_KEY
28
- # - PINECONE_INDEX_NAME
29
- ```
30
-
31
- ### 2. Build and Run with Docker
32
-
33
- ```bash
34
- # Build the image
35
- docker-compose build
36
-
37
- # Start the service
38
- docker-compose up -d
39
-
40
- # Check logs
41
- docker-compose logs -f llm-api
42
-
43
- # Check health
44
- curl http://localhost:8000/health
45
- ```
46
-
47
- ### 3. Test the API
48
-
49
- ```bash
50
- # Simple health check
51
- curl http://localhost:8000/
52
-
53
- # Test LLM endpoint
54
- curl -X POST http://localhost:8000/llm \
55
- -H "Content-Type: application/json" \
56
- -d '{
57
- "messages": [
58
- {"role": "user", "content": "PalΓ§Δ±q vulkanlarΔ±nΔ±n tΙ™sir radiusu nΙ™ qΙ™dΙ™rdir?"}
59
- ]
60
- }'
61
- ```
62
-
63
- ## API Endpoints
64
-
65
- ### GET `/`
66
- Root endpoint with service information.
67
-
68
- **Response:**
69
- ```json
70
- {
71
- "status": "healthy",
72
- "service": "SOCAR LLM Chatbot",
73
- "version": "1.0.0",
74
- "model": "Llama-4-Maverick-17B (open-source)",
75
- "configuration": {
76
- "embedding": "BAAI/bge-large-en-v1.5",
77
- "retrieval": "top-3 vanilla",
78
- "prompt": "citation_focused",
79
- "benchmark_score": "55.67%"
80
- }
81
- }
82
- ```
83
-
84
- ### GET `/health`
85
- Detailed health check with service status.
86
-
87
- **Response:**
88
- ```json
89
- {
90
- "status": "healthy",
91
- "pinecone": {
92
- "connected": true,
93
- "total_vectors": 1300
94
- },
95
- "azure_openai": "connected",
96
- "embedding_model": "loaded"
97
- }
98
- ```
99
-
100
- ### POST `/llm`
101
- Main chatbot endpoint.
102
-
103
- **Request:**
104
- ```json
105
- {
106
- "messages": [
107
- {"role": "user", "content": "Your question here"}
108
- ],
109
- "temperature": 0.2,
110
- "max_tokens": 1000
111
- }
112
- ```
113
-
114
- **Response:**
115
- ```json
116
- {
117
- "response": "Answer with citations...",
118
- "sources": [
119
- {
120
- "pdf_name": "document_00.pdf",
121
- "page_number": "5",
122
- "relevance_score": "0.892"
123
- }
124
- ],
125
- "response_time": 3.61,
126
- "model": "Llama-4-Maverick-17B-128E-Instruct-FP8"
127
- }
128
- ```
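The request/response contract above can be exercised with a short Python client. This is a sketch using only the standard library; the URL assumes the default local deployment from the Quick Start, and `build_payload`/`ask` are illustrative names, not part of the service:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/llm"  # default local deployment from Quick Start

def build_payload(question: str, temperature: float = 0.2, max_tokens: int = 1000) -> dict:
    """Build the request body in the shape the /llm endpoint expects."""
    return {
        "messages": [{"role": "user", "content": question}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def ask(question: str) -> dict:
    """POST a question and return the parsed JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())  # {"response": ..., "sources": [...], ...}
```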
129
-
130
- ## Development Mode
131
-
132
- ### Run locally without Docker
133
-
134
- ```bash
135
- # Install dependencies
136
- cd app
137
- pip install -r requirements.txt
138
-
139
- # Run with uvicorn
140
- uvicorn main:app --reload --host 0.0.0.0 --port 8000
141
- ```
142
-
143
- ### Access API documentation
144
-
145
- Once running, visit:
146
- - **Swagger UI**: http://localhost:8000/docs
147
- - **ReDoc**: http://localhost:8000/redoc
148
-
149
- ## Production Deployment
150
-
151
- ### Environment Variables
152
-
153
- Required in `.env`:
154
- ```bash
155
- # Azure OpenAI
156
- AZURE_OPENAI_API_KEY=your_key_here
157
- AZURE_OPENAI_ENDPOINT=your_endpoint_here
158
- AZURE_OPENAI_API_VERSION=2024-08-01-preview
159
-
160
- # Pinecone
161
- PINECONE_API_KEY=your_key_here
162
- PINECONE_INDEX_NAME=hackathon
163
- ```
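Since a missing key otherwise surfaces only as a 500 on the first request, a fail-fast check at startup is worth the few lines. A sketch (the helper name is illustrative, not the service's actual code):

```python
import os

# Variables the service requires, per the list above.
REQUIRED_VARS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "PINECONE_API_KEY",
    "PINECONE_INDEX_NAME",
]

def missing_env_vars(environ=os.environ) -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]

# At startup:
#   missing = missing_env_vars()
#   if missing: raise RuntimeError(f"Missing env vars: {missing}")
```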
164
-
165
- ### Docker Commands
166
-
167
- ```bash
168
- # Build
169
- docker-compose build --no-cache
170
-
171
- # Start in background
172
- docker-compose up -d
173
-
174
- # View logs
175
- docker-compose logs -f
176
-
177
- # Stop
178
- docker-compose down
179
-
180
- # Restart
181
- docker-compose restart
182
-
183
- # Remove everything
184
- docker-compose down -v
185
- ```
186
-
187
- ### Health Checks
188
-
189
- The Docker container includes automatic health checks:
190
- - **Interval**: 30 seconds
191
- - **Timeout**: 10 seconds
192
- - **Start period**: 40 seconds (for model loading)
193
- - **Retries**: 3
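These parameters map onto a `docker-compose.yml` healthcheck block along these lines (a sketch; the service name and the `curl` probe against `/health` are assumptions based on this guide):

```yaml
services:
  llm-api:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      start_period: 40s   # grace period for model loading
      retries: 3
```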
194
-
195
- ### Monitoring
196
-
197
- ```bash
198
- # Check container status
199
- docker-compose ps
200
-
201
- # View resource usage
202
- docker stats socar-llm-api
203
-
204
- # Check logs
205
- docker-compose logs --tail=100 llm-api
206
- ```
207
-
208
- ## Performance Optimization
209
-
210
- ### Lazy Loading
211
- - Azure client, Pinecone index, and embedding model are lazy-loaded
212
- - First request may take longer (~5-10s for model loading)
213
- - Subsequent requests: ~3.6s average
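The lazy-loading pattern amounts to module-level singletons constructed on first use. A minimal sketch (the class and function names are illustrative, not the service's actual code):

```python
from functools import lru_cache

class ExpensiveClient:
    """Illustrative stand-in for a costly resource (embedding model, DB client)."""
    def __init__(self):
        self.ready = True  # real code would load weights / open connections here

@lru_cache(maxsize=1)
def get_client() -> ExpensiveClient:
    # Constructed once, on the first request that needs it;
    # every later call returns the same cached instance.
    return ExpensiveClient()
```

Because `lru_cache(maxsize=1)` memoizes the single call, `get_client() is get_client()` holds and the load cost is paid exactly once.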
214
-
215
- ### Caching (Future)
216
- To improve performance, consider:
217
- - Redis for frequently asked questions
218
- - Embedding cache for common queries
219
- - Model quantization for faster inference
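Of these, an embedding cache is the simplest to retrofit: repeated questions skip the encoder entirely. A sketch with `functools.lru_cache` (`_encode_uncached` is a toy stand-in, not the service's real embedder):

```python
from functools import lru_cache

def _encode_uncached(query: str) -> tuple:
    # Stand-in for the real embedding call (e.g. SentenceTransformer.encode);
    # returns a tuple so the cached value is hashable and immutable.
    return tuple(float(ord(c)) for c in query[:8])

@lru_cache(maxsize=1024)
def embed_query(query: str) -> tuple:
    """Return the embedding for a query, caching up to 1024 distinct queries."""
    return _encode_uncached(query)
```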
220
-
221
- ## Troubleshooting
222
-
223
- ### Container won't start
224
- ```bash
225
- # Check logs
226
- docker-compose logs llm-api
227
-
228
- # Verify environment variables
229
- docker-compose config
230
-
231
- # Rebuild
232
- docker-compose build --no-cache
233
- ```
234
-
235
- ### API returns 500 errors
236
- - Check Azure OpenAI key and endpoint
237
- - Verify Pinecone connection
238
- - Check model deployment name matches
239
-
240
- ### Slow responses
241
- - First request loads models (5-10s)
242
- - Subsequent requests should be ~3-4s
243
- - Check network connectivity to Azure/Pinecone
244
-
245
- ## Architecture Score
246
-
247
- **Open-Source Stack (20% bonus):**
248
- ✅ Llama-4-Maverick-17B (Open-source LLM)
249
- ✅ BAAI/bge-large-en-v1.5 (Open-source embeddings)
250
- ✅ FastAPI (Open-source framework)
251
- ✅ Docker (Open-source deployment)
252
-
253
- **Total Architecture Score: Maximum 20% for hackathon!**
254
-
255
- ## License
256
-
257
- Built for SOCAR Hackathon 2025
app/main.py CHANGED
@@ -1,15 +1,21 @@
1
  """
2
- SOCAR Hackathon - LLM Chatbot Endpoint
3
- Optimized based on RAG benchmark results
4
- Best config: citation_focused + vanilla_k3 + Llama-4-Maverick
 
5
  """
6
 
7
  import os
 
8
  import time
 
9
  from typing import List, Dict
10
  from pathlib import Path
 
11
 
12
- from fastapi import FastAPI, HTTPException
 
 
13
  from fastapi.middleware.cors import CORSMiddleware
14
  from pydantic import BaseModel
15
  from dotenv import load_dotenv
@@ -275,6 +281,138 @@ async def llm_endpoint(request: ChatRequest):
275
  raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
276
 
277
 
278
  if __name__ == "__main__":
279
  import uvicorn
280
  uvicorn.run(app, host="0.0.0.0", port=8000)
 
1
  """
2
+ SOCAR Hackathon - Complete API with /ocr and /llm endpoints
3
+ Optimized based on comprehensive benchmarking:
4
+ - OCR: Llama-4-Maverick-17B (87.75% CSR)
5
+ - LLM: citation_focused + vanilla_k3 + Llama-4-Maverick (55.67% score)
6
  """
7
 
8
  import os
9
+ import re
10
  import time
11
+ import base64
12
  from typing import List, Dict
13
  from pathlib import Path
14
+ from io import BytesIO
15
 
16
+ import fitz # PyMuPDF
17
+ from PIL import Image
18
+ from fastapi import FastAPI, HTTPException, File, UploadFile
19
  from fastapi.middleware.cors import CORSMiddleware
20
  from pydantic import BaseModel
21
  from dotenv import load_dotenv
 
281
  raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
282
 
283
 
284
+ # ============================================================================
285
+ # OCR ENDPOINT
286
+ # ============================================================================
287
+
288
+ class OCRPageResponse(BaseModel):
289
+ page_number: int
290
+ MD_text: str
291
+
292
+
293
+ def pdf_to_images(pdf_bytes: bytes, dpi: int = 100) -> List[Image.Image]:
294
+ """Convert PDF bytes to PIL Images."""
295
+ doc = fitz.open(stream=pdf_bytes, filetype="pdf")
296
+ images = []
297
+
298
+ for page_num in range(len(doc)):
299
+ page = doc[page_num]
300
+ zoom = dpi / 72
301
+ mat = fitz.Matrix(zoom, zoom)
302
+ pix = page.get_pixmap(matrix=mat)
303
+ img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
304
+ images.append(img)
305
+
306
+ doc.close()
307
+ return images
308
+
309
+
310
+ def image_to_base64(image: Image.Image, format: str = "JPEG", quality: int = 85) -> str:
311
+ """Convert PIL Image to base64 with compression."""
312
+ buffered = BytesIO()
313
+ image.save(buffered, format=format, quality=quality, optimize=True)
314
+ return base64.b64encode(buffered.getvalue()).decode("utf-8")
315
+
316
+
317
+ def detect_images_in_pdf(pdf_bytes: bytes) -> Dict[int, int]:
318
+ """
319
+ Detect images in each page of PDF.
320
+ Returns dict: {page_number: image_count}
321
+ """
322
+ doc = fitz.open(stream=pdf_bytes, filetype="pdf")
323
+ image_counts = {}
324
+
325
+ for page_num in range(len(doc)):
326
+ page = doc[page_num]
327
+ image_list = page.get_images()
328
+ image_counts[page_num + 1] = len(image_list)
329
+
330
+ doc.close()
331
+ return image_counts
332
+
333
+
334
+ @app.post("/ocr", response_model=List[OCRPageResponse])
335
+ async def ocr_endpoint(file: UploadFile = File(...)):
336
+ """
337
+ OCR endpoint for PDF text extraction with image detection.
338
+
339
+ Uses VLM (Llama-4-Maverick-17B) for best accuracy:
340
+ - Character Success Rate: 87.75%
341
+ - Word Success Rate: 61.91%
342
+ - Processing: ~6s per page
343
+
344
+ Returns:
345
+ List of {page_number, MD_text} with inline image references
346
+ """
347
+ try:
348
+ # Read PDF
349
+ pdf_bytes = await file.read()
350
+ pdf_filename = file.filename or "document.pdf"
351
+
352
+ # Convert to images
353
+ images = pdf_to_images(pdf_bytes, dpi=100)
354
+
355
+ # Detect images per page
356
+ image_counts = detect_images_in_pdf(pdf_bytes)
357
+
358
+ # OCR system prompt
359
+ system_prompt = """You are an expert OCR system for historical oil & gas documents.
360
+
361
+ Extract ALL text from the image with 100% accuracy. Follow these rules:
362
+ 1. Preserve EXACT spelling - including Azerbaijani, Russian, and English text
363
+ 2. Maintain original Cyrillic characters - DO NOT transliterate
364
+ 3. Keep all numbers, symbols, and special characters exactly as shown
365
+ 4. Preserve layout structure (paragraphs, line breaks)
366
+ 5. Include ALL text - headers, body, footnotes, tables, captions
367
+
368
+ Output ONLY the extracted text. No explanations, no descriptions."""
369
+
370
+ # Process each page
371
+ results = []
372
+ client = get_azure_client()
373
+
374
+ for page_num, image in enumerate(images, 1):
375
+ # Convert image to base64
376
+ image_base64 = image_to_base64(image, format="JPEG", quality=85)
377
+
378
+ # VLM OCR
379
+ messages = [
380
+ {"role": "system", "content": system_prompt},
381
+ {
382
+ "role": "user",
383
+ "content": [
384
+ {"type": "text", "text": f"Extract all text from page {page_num}:"},
385
+ {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}}
386
+ ]
387
+ }
388
+ ]
389
+
390
+ response = client.chat.completions.create(
391
+ model="Llama-4-Maverick-17B-128E-Instruct-FP8",
392
+ messages=messages,
393
+ temperature=0.0, # Deterministic OCR
394
+ max_tokens=4000
395
+ )
396
+
397
+ page_text = response.choices[0].message.content
398
+
399
+ # Add image references if images exist on this page
400
+ num_images = image_counts.get(page_num, 0)
401
+ if num_images > 0:
402
+ for img_idx in range(1, num_images + 1):
403
+ page_text += f"\n\n![Image]({pdf_filename}/page_{page_num}/image_{img_idx})\n\n"
404
+
405
+ results.append({
406
+ "page_number": page_num,
407
+ "MD_text": page_text
408
+ })
409
+
410
+ return results
411
+
412
+ except Exception as e:
413
+ raise HTTPException(status_code=500, detail=f"OCR Error: {str(e)}")
414
+
415
+
416
  if __name__ == "__main__":
417
  import uvicorn
418
  uvicorn.run(app, host="0.0.0.0", port=8000)
app/requirements.txt CHANGED
@@ -1,5 +1,5 @@
1
- # SOCAR Hackathon LLM Endpoint Dependencies
2
- # Optimized for production deployment
3
 
4
  # FastAPI and server
5
  fastapi==0.109.0
@@ -17,9 +17,14 @@ sentence-transformers==3.3.1
17
  torch==2.5.1
18
  numpy<2.0.0
19
 
20
  # Utilities
21
  python-dotenv==1.0.0
22
  python-multipart==0.0.6
 
23
 
24
  # Optional: monitoring and logging
25
  prometheus-fastapi-instrumentator==7.0.0
 
1
+ # SOCAR Hackathon - Complete API Dependencies
2
+ # Optimized for production deployment with /ocr and /llm endpoints
3
 
4
  # FastAPI and server
5
  fastapi==0.109.0
 
17
  torch==2.5.1
18
  numpy<2.0.0
19
 
20
+ # PDF processing and OCR
21
+ PyMuPDF==1.23.8
22
+ Pillow==10.1.0
23
+
24
  # Utilities
25
  python-dotenv==1.0.0
26
  python-multipart==0.0.6
27
+ tqdm==4.66.1
28
 
29
  # Optional: monitoring and logging
30
  prometheus-fastapi-instrumentator==7.0.0
notebooks/vlm_ocr_benchmark.ipynb CHANGED
The diff for this file is too large to render. See raw diff
 
scripts/ingest_pdfs.py ADDED
@@ -0,0 +1,449 @@
1
+ """
2
+ PDF Ingestion Script for SOCAR Hackathon
3
+ Processes all PDFs with VLM OCR and uploads to Pinecone
4
+
5
+ Based on benchmark results:
6
+ - OCR: Llama-4-Maverick-17B (87.75% CSR)
7
+ - Embedding: BAAI/bge-large-en-v1.5 (1024 dims)
8
+ - Chunking: 600 chars with 100 overlap
9
+ - Vector DB: Pinecone (cosine similarity)
10
+ """
11
+
12
+ import os
13
+ import re
14
+ import time
15
+ import base64
16
+ from pathlib import Path
17
+ from typing import List, Dict
18
+ from io import BytesIO
19
+
20
+ import fitz # PyMuPDF
21
+ from PIL import Image
22
+ from dotenv import load_dotenv
23
+ from openai import AzureOpenAI
24
+ from pinecone import Pinecone
25
+ from sentence_transformers import SentenceTransformer
26
+ from tqdm import tqdm
27
+
28
+ # Load environment
29
+ load_dotenv()
30
+
31
+ # Project paths
32
+ PROJECT_ROOT = Path(__file__).parent.parent
33
+ PDFS_DIR = PROJECT_ROOT / "data" / "pdfs"
34
+ OUTPUT_DIR = PROJECT_ROOT / "output" / "ingestion"
35
+ OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
36
+
37
+ # Initialize clients
38
+ print("πŸ”„ Initializing clients...")
39
+
40
+ azure_client = AzureOpenAI(
41
+ api_key=os.getenv("AZURE_OPENAI_API_KEY"),
42
+ api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2024-08-01-preview"),
43
+ azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
44
+ )
45
+
46
+ pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
47
+ index = pc.Index(os.getenv("PINECONE_INDEX_NAME", "hackathon"))
48
+
49
+ # Best performing embedding model from benchmarks
50
+ embedding_model = SentenceTransformer("BAAI/bge-large-en-v1.5")
51
+
52
+ # Best performing VLM from benchmarks
53
+ VLM_MODEL = "Llama-4-Maverick-17B-128E-Instruct-FP8"
54
+
55
+ # Optimal chunking parameters from benchmarks
56
+ CHUNK_SIZE = 600
57
+ CHUNK_OVERLAP = 100
58
+
59
+ print("βœ… Clients initialized")
60
+
61
+
62
+ def pdf_to_images(pdf_path: str, dpi: int = 100) -> List[Image.Image]:
63
+ """Convert PDF pages to PIL Images."""
64
+ doc = fitz.open(pdf_path)
65
+ images = []
66
+
67
+ for page_num in range(len(doc)):
68
+ page = doc[page_num]
69
+ zoom = dpi / 72
70
+ mat = fitz.Matrix(zoom, zoom)
71
+ pix = page.get_pixmap(matrix=mat)
72
+ img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
73
+ images.append(img)
74
+
75
+ doc.close()
76
+ return images
77
+
78
+
79
+ def image_to_base64(image: Image.Image, format: str = "JPEG", quality: int = 85) -> str:
80
+ """Convert PIL Image to base64 with compression."""
81
+ buffered = BytesIO()
82
+ image.save(buffered, format=format, quality=quality, optimize=True)
83
+ return base64.b64encode(buffered.getvalue()).decode("utf-8")
84
+
85
+
86
+ def vlm_extract_text(pdf_path: str) -> str:
87
+ """
88
+ Extract text from PDF using VLM (Llama-4-Maverick).
89
+ Best performer: 87.75% CSR, 75s for 12 pages
90
+ """
91
+ images = pdf_to_images(pdf_path, dpi=100)
92
+
93
+ system_prompt = """You are an expert OCR system for historical oil & gas documents.
94
+
95
+ Extract ALL text from the image with 100% accuracy. Follow these rules:
96
+ 1. Preserve EXACT spelling - including Azerbaijani, Russian, and English text
97
+ 2. Maintain original Cyrillic characters - DO NOT transliterate
98
+ 3. Keep all numbers, symbols, and special characters exactly as shown
99
+ 4. Preserve layout structure (paragraphs, line breaks)
100
+ 5. Include ALL text - headers, body, footnotes, tables, captions
101
+
102
+ Output ONLY the extracted text. No explanations, no descriptions."""
103
+
104
+ all_text = []
105
+
106
+ print(f" Extracting text from {len(images)} pages...")
107
+ for page_num, image in enumerate(tqdm(images, desc=" OCR Progress"), 1):
108
+ # Convert to base64
109
+ image_base64 = image_to_base64(image, format="JPEG", quality=85)
110
+
111
+ messages = [
112
+ {"role": "system", "content": system_prompt},
113
+ {
114
+ "role": "user",
115
+ "content": [
116
+ {"type": "text", "text": f"Extract all text from page {page_num}:"},
117
+ {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}}
118
+ ]
119
+ }
120
+ ]
121
+
122
+ try:
123
+ response = azure_client.chat.completions.create(
124
+ model=VLM_MODEL,
125
+ messages=messages,
126
+ temperature=0.0, # Deterministic OCR
127
+ max_tokens=4000
128
+ )
129
+
130
+ page_text = response.choices[0].message.content
131
+ all_text.append(page_text)
132
+
133
+ except Exception as e:
134
+ print(f" ❌ Error on page {page_num}: {e}")
135
+ all_text.append("") # Add empty page on error
136
+
137
+ # Combine all pages
138
+ full_text = "\n\n".join(all_text)
139
+ return full_text
140
+
141
+
142
+ def clean_text_for_vectordb(text: str) -> str:
143
+ """
144
+ Clean text for vector database storage.
145
+ CRITICAL: Remove image markdown - images are ONLY for /ocr endpoint!
146
+ """
147
+ # Remove image markdown references
148
+ clean = re.sub(r'!\[Image\]\([^)]+\)', '', text)
149
+
150
+ # Normalize whitespace
151
+ clean = re.sub(r'\n\s*\n+', '\n\n', clean)
152
+ clean = clean.strip()
153
+
154
+ return clean
155
+
156
+
157
+ def chunk_text(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> List[str]:
158
+ """
159
+ Chunk text with overlap for better context preservation.
160
+ Optimal config from benchmarks: 600 chars, 100 overlap
161
+ """
162
+ if not text or len(text) == 0:
163
+ return []
164
+
165
+ chunks = []
166
+ start = 0
167
+
168
+ while start < len(text):
169
+ end = start + chunk_size
170
+ chunk = text[start:end]
171
+
172
+ # Try to break at word boundary
173
+ if end < len(text) and not text[end].isspace():
174
+ last_space = chunk.rfind(' ')
175
+ if last_space > chunk_size - 100: # Keep chunk reasonably sized
176
+ chunk = chunk[:last_space]
177
+ end = start + last_space
178
+
179
+ chunk = chunk.strip()
180
+ if chunk: # Only add non-empty chunks
181
+ chunks.append(chunk)
182
+
183
+ start = end - overlap if end < len(text) else end
184
+
185
+ return chunks
186
+
187
+
188
+ def ingest_pdf(pdf_path: str) -> Dict:
189
+ """
190
+ Full ingestion pipeline for one PDF:
191
+ 1. VLM OCR (Llama-4-Maverick)
192
+ 2. Clean text (remove images)
193
+ 3. Chunk (600/100)
194
+ 4. Embed (bge-large-en)
195
+ 5. Upsert to Pinecone
196
+ """
197
+ pdf_name = Path(pdf_path).name
198
+ start_time = time.time()
199
+
200
+ print(f"\n{'='*70}")
201
+ print(f"πŸ“„ Processing: {pdf_name}")
202
+ print(f"{'='*70}")
203
+
204
+ # Step 1: OCR with VLM
205
+ print(" Step 1/5: Running VLM OCR...")
206
+ ocr_start = time.time()
207
+ raw_text = vlm_extract_text(pdf_path)
208
+ ocr_time = time.time() - ocr_start
209
+ print(f" βœ… OCR complete: {len(raw_text)} characters ({ocr_time:.1f}s)")
210
+
211
+ # Step 2: Clean text (remove image markdown)
212
+ print(" Step 2/5: Cleaning text...")
213
+ clean = clean_text_for_vectordb(raw_text)
214
+ print(f" βœ… Cleaned: {len(clean)} characters")
215
+
216
+ # Step 3: Chunk text
217
+ print(" Step 3/5: Chunking text...")
218
+ chunks = chunk_text(clean, chunk_size=CHUNK_SIZE, overlap=CHUNK_OVERLAP)
219
+ print(f" βœ… Created {len(chunks)} chunks")
220
+
221
+ if len(chunks) == 0:
222
+ print(" ⚠️ No chunks created - skipping document")
223
+ return {
224
+ "pdf_name": pdf_name,
225
+ "status": "skipped",
226
+ "reason": "no_text_extracted",
227
+ "time": time.time() - start_time
228
+ }
229
+
230
+ # Step 4: Generate embeddings
231
+ print(f" Step 4/5: Generating embeddings...")
232
+ embed_start = time.time()
233
+ embeddings = embedding_model.encode(chunks, show_progress_bar=True)
234
+ embed_time = time.time() - embed_start
235
+ print(f" βœ… Embeddings generated ({embed_time:.1f}s)")
236
+
237
+ # Step 5: Prepare vectors for Pinecone
238
+ print(" Step 5/5: Upserting to Pinecone...")
239
+ vectors = []
240
+
241
+ # Calculate approximate page numbers
242
+ # (simple heuristic: distribute chunks evenly across document)
243
+ doc = fitz.open(pdf_path)
244
+ num_pages = len(doc)
245
+ doc.close()
246
+
247
+ for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
248
+ # Estimate page number (chunks distributed across pages)
249
+ estimated_page = int((i / len(chunks)) * num_pages) + 1
250
+
251
+ vectors.append({
252
+ "id": f"{pdf_name}_chunk_{i}",
253
+ "values": embedding.tolist(),
254
+ "metadata": {
255
+ "pdf_name": pdf_name,
256
+ "page_number": estimated_page,
257
+ "text": chunk
258
+ }
259
+ })
260
+
261
+ # Upsert in batches
262
+ batch_size = 100
263
+ upsert_start = time.time()
264
+
265
+ for i in range(0, len(vectors), batch_size):
266
+ batch = vectors[i:i + batch_size]
267
+ index.upsert(vectors=batch)
268
+
269
+ upsert_time = time.time() - upsert_start
270
+ total_time = time.time() - start_time
271
+
272
+ print(f" βœ… Upserted {len(vectors)} vectors ({upsert_time:.1f}s)")
273
+ print(f"\n πŸŽ‰ Complete: {pdf_name}")
274
+ print(f" πŸ“Š Total time: {total_time:.1f}s")
275
+ print(f" πŸ“Š Breakdown: OCR={ocr_time:.1f}s, Embed={embed_time:.1f}s, Upload={upsert_time:.1f}s")
276
+
277
+ return {
278
+ "pdf_name": pdf_name,
279
+ "status": "success",
280
+ "num_chunks": len(chunks),
281
+ "num_vectors": len(vectors),
282
+ "text_length": len(clean),
283
+ "time_total": round(total_time, 2),
284
+ "time_ocr": round(ocr_time, 2),
285
+ "time_embedding": round(embed_time, 2),
286
+ "time_upsert": round(upsert_time, 2)
287
+ }
288
+
289
+
290
+ def ingest_all_pdfs(clear_existing: bool = False):
291
+ """
292
+ Ingest all PDFs from data/pdfs directory.
293
+
294
+ Args:
295
+ clear_existing: If True, clear existing index before ingestion
296
+ """
297
+ print("\n" + "="*70)
298
+ print("πŸš€ SOCAR PDF INGESTION PIPELINE")
299
+ print("="*70)
300
+ print(f"πŸ“‚ PDF Directory: {PDFS_DIR}")
301
+ print(f"🎯 Vector Database: Pinecone ({os.getenv('PINECONE_INDEX_NAME')})")
302
+ print(f"πŸ€– OCR Model: {VLM_MODEL}")
303
+ print(f"πŸ“Š Embedding Model: BAAI/bge-large-en-v1.5")
304
+ print(f"βœ‚οΈ Chunking: {CHUNK_SIZE} chars, {CHUNK_OVERLAP} overlap")
305
+ print("="*70)
306
+
307
+ # Clear index if requested
308
+ if clear_existing:
309
+ print("\n⚠️ Clearing existing vectors from index...")
310
+ response = input("Are you sure? This will delete ALL vectors. (yes/no): ")
311
+ if response.lower() == "yes":
312
+ index.delete(delete_all=True)
313
+ print("οΏ½οΏ½οΏ½ Index cleared")
314
+ time.sleep(2) # Wait for index to stabilize
315
+ else:
316
+ print("❌ Clearing cancelled")
317
+ return
318
+
319
+ # Get all PDFs
320
+ pdf_files = sorted(PDFS_DIR.glob("*.pdf"))
321
+
322
+ if not pdf_files:
323
+ print(f"\n❌ No PDF files found in {PDFS_DIR}")
324
+ return
325
+
326
+ print(f"\nπŸ“š Found {len(pdf_files)} PDF files")
327
+
328
+ # Process each PDF
329
+ results = []
330
+ start_time = time.time()
331
+
332
+ for pdf_path in pdf_files:
333
+ try:
334
+ result = ingest_pdf(str(pdf_path))
335
+ results.append(result)
336
+ except Exception as e:
337
+ print(f"\n❌ Error processing {pdf_path.name}: {e}")
338
+ results.append({
339
+ "pdf_name": pdf_path.name,
340
+ "status": "error",
341
+ "error": str(e)
342
+ })
343
+
344
+ total_time = time.time() - start_time
345
+
346
+ # Summary
347
+ print("\n" + "="*70)
348
+ print("πŸ“Š INGESTION SUMMARY")
349
+ print("="*70)
350
+
351
+ successful = [r for r in results if r.get("status") == "success"]
352
+ failed = [r for r in results if r.get("status") == "error"]
353
+ skipped = [r for r in results if r.get("status") == "skipped"]
354
+
355
+ print(f"\nβœ… Successful: {len(successful)}/{len(pdf_files)}")
356
+ print(f"❌ Failed: {len(failed)}")
357
+ print(f"⏭️ Skipped: {len(skipped)}")
358
+ print(f"\n⏱️ Total Time: {total_time/60:.1f} minutes")
359
+
360
+ if successful:
361
+ total_chunks = sum(r["num_chunks"] for r in successful)
362
+ total_vectors = sum(r["num_vectors"] for r in successful)
363
+ avg_time = sum(r["time_total"] for r in successful) / len(successful)
364
+
365
+ print(f"\nπŸ“¦ Total Chunks: {total_chunks}")
366
+ print(f"πŸ”’ Total Vectors: {total_vectors}")
367
+ print(f"⏱️ Average Time per PDF: {avg_time:.1f}s")
368
+
369
+ # Check index stats
370
+ stats = index.describe_index_stats()
371
+ print(f"\nπŸ“Š Pinecone Index Stats:")
372
+ print(f" Total Vectors: {stats.get('total_vector_count', 0)}")
373
+ print(f" Dimensions: {stats.get('dimension', 0)}")
374
+
375
+ # Save detailed results
376
+ import json
377
+ results_file = OUTPUT_DIR / "ingestion_results.json"
378
+ with open(results_file, 'w', encoding='utf-8') as f:
379
+ json.dump({
380
+ "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
381
+ "total_pdfs": len(pdf_files),
382
+ "successful": len(successful),
383
+ "failed": len(failed),
384
+ "skipped": len(skipped),
385
+ "total_time_seconds": round(total_time, 2),
386
+ "results": results
387
+ }, f, indent=2, ensure_ascii=False)
388
+
389
+ print(f"\nπŸ“„ Detailed results saved to: {results_file}")
390
+ print("\n" + "="*70)
391
+ print("πŸŽ‰ INGESTION COMPLETE!")
392
+ print("="*70)
393
+
394
+
395
+ def test_single_pdf(pdf_name: str = "document_00.pdf"):
396
+ """Test ingestion with a single PDF."""
397
+ pdf_path = PDFS_DIR / pdf_name
398
+
399
+ if not pdf_path.exists():
400
+ print(f"❌ PDF not found: {pdf_path}")
401
+ return
402
+
403
+ print(f"\nπŸ§ͺ Testing with: {pdf_name}")
404
+ result = ingest_pdf(str(pdf_path))
405
+
406
+ print("\nπŸ“Š Test Result:")
407
+ print(json.dumps(result, indent=2))
408
+
409
+
410
+ if __name__ == "__main__":
411
+ import sys
412
+ import json
413
+
414
+ # Parse command line arguments
415
+ if len(sys.argv) > 1:
416
+ command = sys.argv[1]
417
+
418
+ if command == "test":
419
+ # Test with single PDF
420
+ pdf_name = sys.argv[2] if len(sys.argv) > 2 else "document_00.pdf"
421
+ test_single_pdf(pdf_name)
422
+
423
+ elif command == "clear":
424
+ # Clear index and ingest all
425
+ ingest_all_pdfs(clear_existing=True)
426
+
427
+ elif command == "stats":
428
+ # Show current index stats
429
+ stats = index.describe_index_stats()
430
+ print("\nπŸ“Š Pinecone Index Stats:")
431
+ if stats:
432
+ print(f" Total Vectors: {stats.get('total_vector_count', 0)}")
433
+ print(f" Dimensions: {stats.get('dimension', 0)}")
434
+ if 'namespaces' in stats:
435
+ print(f" Namespaces: {stats.get('namespaces', {})}")
436
+ else:
437
+ print(" No stats available")
438
+
439
+ else:
440
+ print("Usage:")
441
+ print(" python ingest_pdfs.py - Ingest all PDFs (append)")
442
+ print(" python ingest_pdfs.py clear - Clear index and ingest all")
443
+ print(" python ingest_pdfs.py test - Test with document_00.pdf")
444
+ print(" python ingest_pdfs.py test document_05.pdf - Test with specific PDF")
445
+ print(" python ingest_pdfs.py stats - Show index statistics")
446
+
447
+ else:
448
+ # Default: ingest all PDFs (append mode)
449
+ ingest_all_pdfs(clear_existing=False)