IsmatS committed · Commit 4c9673e · 1 Parent(s): a6ccae9
DEPLOYMENT.md DELETED
@@ -1,257 +0,0 @@
1
- # SOCAR Hackathon - LLM API Deployment Guide
2
-
3
- ## Overview
4
-
5
- A production-ready FastAPI service for the SOCAR historical-documents chatbot.
6
-
7
- **Configuration (Based on RAG Optimization Benchmark):**
8
- - **Model**: Llama-4-Maverick-17B-128E-Instruct-FP8 (Open-source)
9
- - **Embedding**: BAAI/bge-large-en-v1.5
10
- - **Retrieval**: Top-3 vanilla
11
- - **Prompt Strategy**: Citation-focused
12
- - **Performance**: 55.67% LLM Judge Score, 73.33% Citation Score, ~3.6s response time
13
-
14
- ## Quick Start
15
-
16
- ### Prerequisites
17
- - Docker and Docker Compose installed
18
- - `.env` file with API keys (see `.env.example`)
19
-
20
- ### 1. Configure Environment
21
-
22
- ```bash
23
- cp .env.example .env
24
- # Edit .env with your actual API keys:
25
- # - AZURE_OPENAI_API_KEY
26
- # - AZURE_OPENAI_ENDPOINT
27
- # - PINECONE_API_KEY
28
- # - PINECONE_INDEX_NAME
29
- ```
30
-
31
- ### 2. Build and Run with Docker
32
-
33
- ```bash
34
- # Build the image
35
- docker-compose build
36
-
37
- # Start the service
38
- docker-compose up -d
39
-
40
- # Check logs
41
- docker-compose logs -f llm-api
42
-
43
- # Check health
44
- curl http://localhost:8000/health
45
- ```
46
-
47
- ### 3. Test the API
48
-
49
- ```bash
50
- # Simple health check
51
- curl http://localhost:8000/
52
-
53
- # Test LLM endpoint
54
- curl -X POST http://localhost:8000/llm \
55
- -H "Content-Type: application/json" \
56
- -d '{
57
- "messages": [
58
- {"role": "user", "content": "PalΓ§Δ±q vulkanlarΔ±nΔ±n tΙ™sir radiusu nΙ™ qΙ™dΙ™rdir?"}
59
- ]
60
- }'
61
- ```
62
-
63
- ## API Endpoints
64
-
65
- ### GET `/`
66
- Root endpoint with service information.
67
-
68
- **Response:**
69
- ```json
70
- {
71
- "status": "healthy",
72
- "service": "SOCAR LLM Chatbot",
73
- "version": "1.0.0",
74
- "model": "Llama-4-Maverick-17B (open-source)",
75
- "configuration": {
76
- "embedding": "BAAI/bge-large-en-v1.5",
77
- "retrieval": "top-3 vanilla",
78
- "prompt": "citation_focused",
79
- "benchmark_score": "55.67%"
80
- }
81
- }
82
- ```
83
-
84
- ### GET `/health`
85
- Detailed health check with service status.
86
-
87
- **Response:**
88
- ```json
89
- {
90
- "status": "healthy",
91
- "pinecone": {
92
- "connected": true,
93
- "total_vectors": 1300
94
- },
95
- "azure_openai": "connected",
96
- "embedding_model": "loaded"
97
- }
98
- ```
99
-
100
- ### POST `/llm`
101
- Main chatbot endpoint.
102
-
103
- **Request:**
104
- ```json
105
- {
106
- "messages": [
107
- {"role": "user", "content": "Your question here"}
108
- ],
109
- "temperature": 0.2,
110
- "max_tokens": 1000
111
- }
112
- ```
113
-
114
- **Response:**
115
- ```json
116
- {
117
- "response": "Answer with citations...",
118
- "sources": [
119
- {
120
- "pdf_name": "document_00.pdf",
121
- "page_number": "5",
122
- "relevance_score": "0.892"
123
- }
124
- ],
125
- "response_time": 3.61,
126
- "model": "Llama-4-Maverick-17B-128E-Instruct-FP8"
127
- }
128
- ```
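The request/response contract above can be exercised with a short Python client. This is a sketch using only the standard library; the URL assumes the default local deployment from the Quick Start, and `build_payload`/`ask` are illustrative names, not part of the service:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/llm"  # default local deployment from Quick Start

def build_payload(question: str, temperature: float = 0.2, max_tokens: int = 1000) -> dict:
    """Build the request body in the shape the /llm endpoint expects."""
    return {
        "messages": [{"role": "user", "content": question}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def ask(question: str) -> dict:
    """POST a question and return the parsed JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())  # {"response": ..., "sources": [...], ...}
```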
129
-
130
- ## Development Mode
131
-
132
- ### Run locally without Docker
133
-
134
- ```bash
135
- # Install dependencies
136
- cd app
137
- pip install -r requirements.txt
138
-
139
- # Run with uvicorn
140
- uvicorn main:app --reload --host 0.0.0.0 --port 8000
141
- ```
142
-
143
- ### Access API documentation
144
-
145
- Once running, visit:
146
- - **Swagger UI**: http://localhost:8000/docs
147
- - **ReDoc**: http://localhost:8000/redoc
148
-
149
- ## Production Deployment
150
-
151
- ### Environment Variables
152
-
153
- Required in `.env`:
154
- ```bash
155
- # Azure OpenAI
156
- AZURE_OPENAI_API_KEY=your_key_here
157
- AZURE_OPENAI_ENDPOINT=your_endpoint_here
158
- AZURE_OPENAI_API_VERSION=2024-08-01-preview
159
-
160
- # Pinecone
161
- PINECONE_API_KEY=your_key_here
162
- PINECONE_INDEX_NAME=hackathon
163
- ```
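Since a missing key otherwise surfaces only as a 500 on the first request, a fail-fast check at startup is worth the few lines. A sketch (the helper name is illustrative, not the service's actual code):

```python
import os

# Variables the service requires, per the list above.
REQUIRED_VARS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "PINECONE_API_KEY",
    "PINECONE_INDEX_NAME",
]

def missing_env_vars(environ=os.environ) -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]

# At startup:
#   missing = missing_env_vars()
#   if missing: raise RuntimeError(f"Missing env vars: {missing}")
```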
164
-
165
- ### Docker Commands
166
-
167
- ```bash
168
- # Build
169
- docker-compose build --no-cache
170
-
171
- # Start in background
172
- docker-compose up -d
173
-
174
- # View logs
175
- docker-compose logs -f
176
-
177
- # Stop
178
- docker-compose down
179
-
180
- # Restart
181
- docker-compose restart
182
-
183
- # Remove everything
184
- docker-compose down -v
185
- ```
186
-
187
- ### Health Checks
188
-
189
- The Docker container includes automatic health checks:
190
- - **Interval**: 30 seconds
191
- - **Timeout**: 10 seconds
192
- - **Start period**: 40 seconds (for model loading)
193
- - **Retries**: 3
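These parameters map onto a `docker-compose.yml` healthcheck block along these lines (a sketch; the service name and the `curl` probe against `/health` are assumptions based on this guide):

```yaml
services:
  llm-api:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      start_period: 40s   # grace period for model loading
      retries: 3
```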
194
-
195
- ### Monitoring
196
-
197
- ```bash
198
- # Check container status
199
- docker-compose ps
200
-
201
- # View resource usage
202
- docker stats socar-llm-api
203
-
204
- # Check logs
205
- docker-compose logs --tail=100 llm-api
206
- ```
207
-
208
- ## Performance Optimization
209
-
210
- ### Lazy Loading
211
- - Azure client, Pinecone index, and embedding model are lazy-loaded
212
- - First request may take longer (~5-10s for model loading)
213
- - Subsequent requests: ~3.6s average
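The lazy-loading pattern amounts to module-level singletons constructed on first use. A minimal sketch (the class and function names are illustrative, not the service's actual code):

```python
from functools import lru_cache

class ExpensiveClient:
    """Illustrative stand-in for a costly resource (embedding model, DB client)."""
    def __init__(self):
        self.ready = True  # real code would load weights / open connections here

@lru_cache(maxsize=1)
def get_client() -> ExpensiveClient:
    # Constructed once, on the first request that needs it;
    # every later call returns the same cached instance.
    return ExpensiveClient()
```

Because `lru_cache(maxsize=1)` memoizes the single call, `get_client() is get_client()` holds and the load cost is paid exactly once.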
214
-
215
- ### Caching (Future)
216
- To improve performance, consider:
217
- - Redis for frequently asked questions
218
- - Embedding cache for common queries
219
- - Model quantization for faster inference
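Of these, an embedding cache is the simplest to retrofit: repeated questions skip the encoder entirely. A sketch with `functools.lru_cache` (`_encode_uncached` is a toy stand-in, not the service's real embedder):

```python
from functools import lru_cache

def _encode_uncached(query: str) -> tuple:
    # Stand-in for the real embedding call (e.g. SentenceTransformer.encode);
    # returns a tuple so the cached value is hashable and immutable.
    return tuple(float(ord(c)) for c in query[:8])

@lru_cache(maxsize=1024)
def embed_query(query: str) -> tuple:
    """Return the embedding for a query, caching up to 1024 distinct queries."""
    return _encode_uncached(query)
```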
220
-
221
- ## Troubleshooting
222
-
223
- ### Container won't start
224
- ```bash
225
- # Check logs
226
- docker-compose logs llm-api
227
-
228
- # Verify environment variables
229
- docker-compose config
230
-
231
- # Rebuild
232
- docker-compose build --no-cache
233
- ```
234
-
235
- ### API returns 500 errors
236
- - Check Azure OpenAI key and endpoint
237
- - Verify Pinecone connection
238
- - Check model deployment name matches
239
-
240
- ### Slow responses
241
- - First request loads models (5-10s)
242
- - Subsequent requests should be ~3-4s
243
- - Check network connectivity to Azure/Pinecone
244
-
245
- ## Architecture Score
246
-
247
- **Open-Source Stack (20% bonus):**
248
- ✅ Llama-4-Maverick-17B (Open-source LLM)
249
- ✅ BAAI/bge-large-en-v1.5 (Open-source embeddings)
250
- ✅ FastAPI (Open-source framework)
251
- ✅ Docker (Open-source deployment)
252
-
253
- **Total Architecture Score: Maximum 20% for hackathon!**
254
-
255
- ## License
256
-
257
- Built for SOCAR Hackathon 2025
app/main.py CHANGED
@@ -1,15 +1,21 @@
1
  """
2
- SOCAR Hackathon - LLM Chatbot Endpoint
3
- Optimized based on RAG benchmark results
4
- Best config: citation_focused + vanilla_k3 + Llama-4-Maverick
 
5
  """
6
 
7
  import os
 
8
  import time
 
9
  from typing import List, Dict
10
  from pathlib import Path
 
11
 
12
- from fastapi import FastAPI, HTTPException
 
 
13
  from fastapi.middleware.cors import CORSMiddleware
14
  from pydantic import BaseModel
15
  from dotenv import load_dotenv
@@ -275,6 +281,138 @@ async def llm_endpoint(request: ChatRequest):
275
  raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
276
 
277
 
278
  if __name__ == "__main__":
279
  import uvicorn
280
  uvicorn.run(app, host="0.0.0.0", port=8000)
 
1
  """
2
+ SOCAR Hackathon - Complete API with /ocr and /llm endpoints
3
+ Optimized based on comprehensive benchmarking:
4
+ - OCR: Llama-4-Maverick-17B (87.75% CSR)
5
+ - LLM: citation_focused + vanilla_k3 + Llama-4-Maverick (55.67% score)
6
  """
7
 
8
  import os
9
+ import re
10
  import time
11
+ import base64
12
  from typing import List, Dict
13
  from pathlib import Path
14
+ from io import BytesIO
15
 
16
+ import fitz # PyMuPDF
17
+ from PIL import Image
18
+ from fastapi import FastAPI, HTTPException, File, UploadFile
19
  from fastapi.middleware.cors import CORSMiddleware
20
  from pydantic import BaseModel
21
  from dotenv import load_dotenv
 
281
  raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
282
 
283
 
284
+ # ============================================================================
285
+ # OCR ENDPOINT
286
+ # ============================================================================
287
+
288
+ class OCRPageResponse(BaseModel):
289
+ page_number: int
290
+ MD_text: str
291
+
292
+
293
+ def pdf_to_images(pdf_bytes: bytes, dpi: int = 100) -> List[Image.Image]:
294
+ """Convert PDF bytes to PIL Images."""
295
+ doc = fitz.open(stream=pdf_bytes, filetype="pdf")
296
+ images = []
297
+
298
+ for page_num in range(len(doc)):
299
+ page = doc[page_num]
300
+ zoom = dpi / 72
301
+ mat = fitz.Matrix(zoom, zoom)
302
+ pix = page.get_pixmap(matrix=mat)
303
+ img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
304
+ images.append(img)
305
+
306
+ doc.close()
307
+ return images
308
+
309
+
310
+ def image_to_base64(image: Image.Image, format: str = "JPEG", quality: int = 85) -> str:
311
+ """Convert PIL Image to base64 with compression."""
312
+ buffered = BytesIO()
313
+ image.save(buffered, format=format, quality=quality, optimize=True)
314
+ return base64.b64encode(buffered.getvalue()).decode("utf-8")
315
+
316
+
317
+ def detect_images_in_pdf(pdf_bytes: bytes) -> Dict[int, int]:
318
+ """
319
+ Detect images in each page of PDF.
320
+ Returns dict: {page_number: image_count}
321
+ """
322
+ doc = fitz.open(stream=pdf_bytes, filetype="pdf")
323
+ image_counts = {}
324
+
325
+ for page_num in range(len(doc)):
326
+ page = doc[page_num]
327
+ image_list = page.get_images()
328
+ image_counts[page_num + 1] = len(image_list)
329
+
330
+ doc.close()
331
+ return image_counts
332
+
333
+
334
+ @app.post("/ocr", response_model=List[OCRPageResponse])
335
+ async def ocr_endpoint(file: UploadFile = File(...)):
336
+ """
337
+ OCR endpoint for PDF text extraction with image detection.
338
+
339
+ Uses VLM (Llama-4-Maverick-17B) for best accuracy:
340
+ - Character Success Rate: 87.75%
341
+ - Word Success Rate: 61.91%
342
+ - Processing: ~6s per page
343
+
344
+ Returns:
345
+ List of {page_number, MD_text} with inline image references
346
+ """
347
+ try:
348
+ # Read PDF
349
+ pdf_bytes = await file.read()
350
+ pdf_filename = file.filename or "document.pdf"
351
+
352
+ # Convert to images
353
+ images = pdf_to_images(pdf_bytes, dpi=100)
354
+
355
+ # Detect images per page
356
+ image_counts = detect_images_in_pdf(pdf_bytes)
357
+
358
+ # OCR system prompt
359
+ system_prompt = """You are an expert OCR system for historical oil & gas documents.
360
+
361
+ Extract ALL text from the image with 100% accuracy. Follow these rules:
362
+ 1. Preserve EXACT spelling - including Azerbaijani, Russian, and English text
363
+ 2. Maintain original Cyrillic characters - DO NOT transliterate
364
+ 3. Keep all numbers, symbols, and special characters exactly as shown
365
+ 4. Preserve layout structure (paragraphs, line breaks)
366
+ 5. Include ALL text - headers, body, footnotes, tables, captions
367
+
368
+ Output ONLY the extracted text. No explanations, no descriptions."""
369
+
370
+ # Process each page
371
+ results = []
372
+ client = get_azure_client()
373
+
374
+ for page_num, image in enumerate(images, 1):
375
+ # Convert image to base64
376
+ image_base64 = image_to_base64(image, format="JPEG", quality=85)
377
+
378
+ # VLM OCR
379
+ messages = [
380
+ {"role": "system", "content": system_prompt},
381
+ {
382
+ "role": "user",
383
+ "content": [
384
+ {"type": "text", "text": f"Extract all text from page {page_num}:"},
385
+ {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}}
386
+ ]
387
+ }
388
+ ]
389
+
390
+ response = client.chat.completions.create(
391
+ model="Llama-4-Maverick-17B-128E-Instruct-FP8",
392
+ messages=messages,
393
+ temperature=0.0, # Deterministic OCR
394
+ max_tokens=4000
395
+ )
396
+
397
+ page_text = response.choices[0].message.content
398
+
399
+ # Add image references if images exist on this page
400
+ num_images = image_counts.get(page_num, 0)
401
+ if num_images > 0:
402
+ for img_idx in range(1, num_images + 1):
403
+ page_text += f"\n\n![Image]({pdf_filename}/page_{page_num}/image_{img_idx})\n\n"
404
+
405
+ results.append({
406
+ "page_number": page_num,
407
+ "MD_text": page_text
408
+ })
409
+
410
+ return results
411
+
412
+ except Exception as e:
413
+ raise HTTPException(status_code=500, detail=f"OCR Error: {str(e)}")
414
+
415
+
416
  if __name__ == "__main__":
417
  import uvicorn
418
  uvicorn.run(app, host="0.0.0.0", port=8000)
app/requirements.txt CHANGED
@@ -1,5 +1,5 @@
1
- # SOCAR Hackathon LLM Endpoint Dependencies
2
- # Optimized for production deployment
3
 
4
  # FastAPI and server
5
  fastapi==0.109.0
@@ -17,9 +17,14 @@ sentence-transformers==3.3.1
17
  torch==2.5.1
18
  numpy<2.0.0
19
 
20
  # Utilities
21
  python-dotenv==1.0.0
22
  python-multipart==0.0.6
 
23
 
24
  # Optional: monitoring and logging
25
  prometheus-fastapi-instrumentator==7.0.0
 
1
+ # SOCAR Hackathon - Complete API Dependencies
2
+ # Optimized for production deployment with /ocr and /llm endpoints
3
 
4
  # FastAPI and server
5
  fastapi==0.109.0
 
17
  torch==2.5.1
18
  numpy<2.0.0
19
 
20
+ # PDF processing and OCR
21
+ PyMuPDF==1.23.8
22
+ Pillow==10.1.0
23
+
24
  # Utilities
25
  python-dotenv==1.0.0
26
  python-multipart==0.0.6
27
+ tqdm==4.66.1
28
 
29
  # Optional: monitoring and logging
30
  prometheus-fastapi-instrumentator==7.0.0
notebooks/vlm_ocr_benchmark.ipynb CHANGED
The diff for this file is too large to render. See raw diff
 
scripts/ingest_pdfs.py ADDED
@@ -0,0 +1,449 @@
1
+ """
2
+ PDF Ingestion Script for SOCAR Hackathon
3
+ Processes all PDFs with VLM OCR and uploads to Pinecone
4
+
5
+ Based on benchmark results:
6
+ - OCR: Llama-4-Maverick-17B (87.75% CSR)
7
+ - Embedding: BAAI/bge-large-en-v1.5 (1024 dims)
8
+ - Chunking: 600 chars with 100 overlap
9
+ - Vector DB: Pinecone (cosine similarity)
10
+ """
11
+
12
+ import os
13
+ import re
14
+ import time
15
+ import base64
16
+ from pathlib import Path
17
+ from typing import List, Dict
18
+ from io import BytesIO
19
+
20
+ import fitz # PyMuPDF
21
+ from PIL import Image
22
+ from dotenv import load_dotenv
23
+ from openai import AzureOpenAI
24
+ from pinecone import Pinecone
25
+ from sentence_transformers import SentenceTransformer
26
+ from tqdm import tqdm
27
+
28
+ # Load environment
29
+ load_dotenv()
30
+
31
+ # Project paths
32
+ PROJECT_ROOT = Path(__file__).parent.parent
33
+ PDFS_DIR = PROJECT_ROOT / "data" / "pdfs"
34
+ OUTPUT_DIR = PROJECT_ROOT / "output" / "ingestion"
35
+ OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
36
+
37
+ # Initialize clients
38
+ print("πŸ”„ Initializing clients...")
39
+
40
+ azure_client = AzureOpenAI(
41
+ api_key=os.getenv("AZURE_OPENAI_API_KEY"),
42
+ api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2024-08-01-preview"),
43
+ azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
44
+ )
45
+
46
+ pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
47
+ index = pc.Index(os.getenv("PINECONE_INDEX_NAME", "hackathon"))
48
+
49
+ # Best performing embedding model from benchmarks
50
+ embedding_model = SentenceTransformer("BAAI/bge-large-en-v1.5")
51
+
52
+ # Best performing VLM from benchmarks
53
+ VLM_MODEL = "Llama-4-Maverick-17B-128E-Instruct-FP8"
54
+
55
+ # Optimal chunking parameters from benchmarks
56
+ CHUNK_SIZE = 600
57
+ CHUNK_OVERLAP = 100
58
+
59
+ print("βœ… Clients initialized")
60
+
61
+
62
+ def pdf_to_images(pdf_path: str, dpi: int = 100) -> List[Image.Image]:
63
+ """Convert PDF pages to PIL Images."""
64
+ doc = fitz.open(pdf_path)
65
+ images = []
66
+
67
+ for page_num in range(len(doc)):
68
+ page = doc[page_num]
69
+ zoom = dpi / 72
70
+ mat = fitz.Matrix(zoom, zoom)
71
+ pix = page.get_pixmap(matrix=mat)
72
+ img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
73
+ images.append(img)
74
+
75
+ doc.close()
76
+ return images
77
+
78
+
79
+ def image_to_base64(image: Image.Image, format: str = "JPEG", quality: int = 85) -> str:
80
+ """Convert PIL Image to base64 with compression."""
81
+ buffered = BytesIO()
82
+ image.save(buffered, format=format, quality=quality, optimize=True)
83
+ return base64.b64encode(buffered.getvalue()).decode("utf-8")
84
+
85
+
86
+ def vlm_extract_text(pdf_path: str) -> str:
87
+ """
88
+ Extract text from PDF using VLM (Llama-4-Maverick).
89
+ Best performer: 87.75% CSR, 75s for 12 pages
90
+ """
91
+ images = pdf_to_images(pdf_path, dpi=100)
92
+
93
+ system_prompt = """You are an expert OCR system for historical oil & gas documents.
94
+
95
+ Extract ALL text from the image with 100% accuracy. Follow these rules:
96
+ 1. Preserve EXACT spelling - including Azerbaijani, Russian, and English text
97
+ 2. Maintain original Cyrillic characters - DO NOT transliterate
98
+ 3. Keep all numbers, symbols, and special characters exactly as shown
99
+ 4. Preserve layout structure (paragraphs, line breaks)
100
+ 5. Include ALL text - headers, body, footnotes, tables, captions
101
+
102
+ Output ONLY the extracted text. No explanations, no descriptions."""
103
+
104
+ all_text = []
105
+
106
+ print(f" Extracting text from {len(images)} pages...")
107
+ for page_num, image in enumerate(tqdm(images, desc=" OCR Progress"), 1):
108
+ # Convert to base64
109
+ image_base64 = image_to_base64(image, format="JPEG", quality=85)
110
+
111
+ messages = [
112
+ {"role": "system", "content": system_prompt},
113
+ {
114
+ "role": "user",
115
+ "content": [
116
+ {"type": "text", "text": f"Extract all text from page {page_num}:"},
117
+ {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}}
118
+ ]
119
+ }
120
+ ]
121
+
122
+ try:
123
+ response = azure_client.chat.completions.create(
124
+ model=VLM_MODEL,
125
+ messages=messages,
126
+ temperature=0.0, # Deterministic OCR
127
+ max_tokens=4000
128
+ )
129
+
130
+ page_text = response.choices[0].message.content
131
+ all_text.append(page_text)
132
+
133
+ except Exception as e:
134
+ print(f" ❌ Error on page {page_num}: {e}")
135
+ all_text.append("") # Add empty page on error
136
+
137
+ # Combine all pages
138
+ full_text = "\n\n".join(all_text)
139
+ return full_text
140
+
141
+
142
+ def clean_text_for_vectordb(text: str) -> str:
143
+ """
144
+ Clean text for vector database storage.
145
+ CRITICAL: Remove image markdown - images are ONLY for /ocr endpoint!
146
+ """
147
+ # Remove image markdown references
148
+ clean = re.sub(r'!\[Image\]\([^)]+\)', '', text)
149
+
150
+ # Normalize whitespace
151
+ clean = re.sub(r'\n\s*\n+', '\n\n', clean)
152
+ clean = clean.strip()
153
+
154
+ return clean
155
+
156
+
157
+ def chunk_text(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> List[str]:
158
+ """
159
+ Chunk text with overlap for better context preservation.
160
+ Optimal config from benchmarks: 600 chars, 100 overlap
161
+ """
162
+ if not text or len(text) == 0:
163
+ return []
164
+
165
+ chunks = []
166
+ start = 0
167
+
168
+ while start < len(text):
169
+ end = start + chunk_size
170
+ chunk = text[start:end]
171
+
172
+ # Try to break at word boundary
173
+ if end < len(text) and not text[end].isspace():
174
+ last_space = chunk.rfind(' ')
175
+ if last_space > chunk_size - 100: # Keep chunk reasonably sized
176
+ chunk = chunk[:last_space]
177
+ end = start + last_space
178
+
179
+ chunk = chunk.strip()
180
+ if chunk: # Only add non-empty chunks
181
+ chunks.append(chunk)
182
+
183
+ start = end - overlap if end < len(text) else end
184
+
185
+ return chunks
186
+
187
+
188
+ def ingest_pdf(pdf_path: str) -> Dict:
189
+ """
190
+ Full ingestion pipeline for one PDF:
191
+ 1. VLM OCR (Llama-4-Maverick)
192
+ 2. Clean text (remove images)
193
+ 3. Chunk (600/100)
194
+ 4. Embed (bge-large-en)
195
+ 5. Upsert to Pinecone
196
+ """
197
+ pdf_name = Path(pdf_path).name
198
+ start_time = time.time()
199
+
200
+ print(f"\n{'='*70}")
201
+ print(f"πŸ“„ Processing: {pdf_name}")
202
+ print(f"{'='*70}")
203
+
204
+ # Step 1: OCR with VLM
205
+ print(" Step 1/5: Running VLM OCR...")
206
+ ocr_start = time.time()
207
+ raw_text = vlm_extract_text(pdf_path)
208
+ ocr_time = time.time() - ocr_start
209
+ print(f" βœ… OCR complete: {len(raw_text)} characters ({ocr_time:.1f}s)")
210
+
211
+ # Step 2: Clean text (remove image markdown)
212
+ print(" Step 2/5: Cleaning text...")
213
+ clean = clean_text_for_vectordb(raw_text)
214
+ print(f" βœ… Cleaned: {len(clean)} characters")
215
+
216
+ # Step 3: Chunk text
217
+ print(" Step 3/5: Chunking text...")
218
+ chunks = chunk_text(clean, chunk_size=CHUNK_SIZE, overlap=CHUNK_OVERLAP)
219
+ print(f" βœ… Created {len(chunks)} chunks")
220
+
221
+ if len(chunks) == 0:
222
+ print(" ⚠️ No chunks created - skipping document")
223
+ return {
224
+ "pdf_name": pdf_name,
225
+ "status": "skipped",
226
+ "reason": "no_text_extracted",
227
+ "time": time.time() - start_time
228
+ }
229
+
230
+ # Step 4: Generate embeddings
231
+ print(f" Step 4/5: Generating embeddings...")
232
+ embed_start = time.time()
233
+ embeddings = embedding_model.encode(chunks, show_progress_bar=True)
234
+ embed_time = time.time() - embed_start
235
+ print(f" βœ… Embeddings generated ({embed_time:.1f}s)")
236
+
237
+ # Step 5: Prepare vectors for Pinecone
238
+ print(" Step 5/5: Upserting to Pinecone...")
239
+ vectors = []
240
+
241
+ # Calculate approximate page numbers
242
+ # (simple heuristic: distribute chunks evenly across document)
243
+ doc = fitz.open(pdf_path)
244
+ num_pages = len(doc)
245
+ doc.close()
246
+
247
+ for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
248
+ # Estimate page number (chunks distributed across pages)
249
+ estimated_page = int((i / len(chunks)) * num_pages) + 1
250
+
251
+ vectors.append({
252
+ "id": f"{pdf_name}_chunk_{i}",
253
+ "values": embedding.tolist(),
254
+ "metadata": {
255
+ "pdf_name": pdf_name,
256
+ "page_number": estimated_page,
257
+ "text": chunk
258
+ }
259
+ })
260
+
261
+ # Upsert in batches
262
+ batch_size = 100
263
+ upsert_start = time.time()
264
+
265
+ for i in range(0, len(vectors), batch_size):
266
+ batch = vectors[i:i + batch_size]
267
+ index.upsert(vectors=batch)
268
+
269
+ upsert_time = time.time() - upsert_start
270
+ total_time = time.time() - start_time
271
+
272
+ print(f" βœ… Upserted {len(vectors)} vectors ({upsert_time:.1f}s)")
273
+ print(f"\n πŸŽ‰ Complete: {pdf_name}")
274
+ print(f" πŸ“Š Total time: {total_time:.1f}s")
275
+ print(f" πŸ“Š Breakdown: OCR={ocr_time:.1f}s, Embed={embed_time:.1f}s, Upload={upsert_time:.1f}s")
276
+
277
+ return {
278
+ "pdf_name": pdf_name,
279
+ "status": "success",
280
+ "num_chunks": len(chunks),
281
+ "num_vectors": len(vectors),
282
+ "text_length": len(clean),
283
+ "time_total": round(total_time, 2),
284
+ "time_ocr": round(ocr_time, 2),
285
+ "time_embedding": round(embed_time, 2),
286
+ "time_upsert": round(upsert_time, 2)
287
+ }
288
+
289
+
290
+ def ingest_all_pdfs(clear_existing: bool = False):
291
+ """
292
+ Ingest all PDFs from data/pdfs directory.
293
+
294
+ Args:
295
+ clear_existing: If True, clear existing index before ingestion
296
+ """
297
+ print("\n" + "="*70)
298
+ print("πŸš€ SOCAR PDF INGESTION PIPELINE")
299
+ print("="*70)
300
+ print(f"πŸ“‚ PDF Directory: {PDFS_DIR}")
301
+ print(f"🎯 Vector Database: Pinecone ({os.getenv('PINECONE_INDEX_NAME')})")
302
+ print(f"πŸ€– OCR Model: {VLM_MODEL}")
303
+ print(f"πŸ“Š Embedding Model: BAAI/bge-large-en-v1.5")
304
+ print(f"βœ‚οΈ Chunking: {CHUNK_SIZE} chars, {CHUNK_OVERLAP} overlap")
305
+ print("="*70)
306
+
307
+ # Clear index if requested
308
+ if clear_existing:
309
+ print("\n⚠️ Clearing existing vectors from index...")
310
+ response = input("Are you sure? This will delete ALL vectors. (yes/no): ")
311
+ if response.lower() == "yes":
312
+ index.delete(delete_all=True)
313
+ print("οΏ½οΏ½οΏ½ Index cleared")
314
+ time.sleep(2) # Wait for index to stabilize
315
+ else:
316
+ print("❌ Clearing cancelled")
317
+ return
318
+
319
+ # Get all PDFs
320
+ pdf_files = sorted(PDFS_DIR.glob("*.pdf"))
321
+
322
+ if not pdf_files:
323
+ print(f"\n❌ No PDF files found in {PDFS_DIR}")
324
+ return
325
+
326
+ print(f"\nπŸ“š Found {len(pdf_files)} PDF files")
327
+
328
+ # Process each PDF
329
+ results = []
330
+ start_time = time.time()
331
+
332
+ for pdf_path in pdf_files:
333
+ try:
334
+ result = ingest_pdf(str(pdf_path))
335
+ results.append(result)
336
+ except Exception as e:
337
+ print(f"\n❌ Error processing {pdf_path.name}: {e}")
338
+ results.append({
339
+ "pdf_name": pdf_path.name,
340
+ "status": "error",
341
+ "error": str(e)
342
+ })
343
+
344
+ total_time = time.time() - start_time
345
+
346
+ # Summary
347
+ print("\n" + "="*70)
348
+ print("πŸ“Š INGESTION SUMMARY")
349
+ print("="*70)
350
+
351
+ successful = [r for r in results if r.get("status") == "success"]
352
+ failed = [r for r in results if r.get("status") == "error"]
353
+ skipped = [r for r in results if r.get("status") == "skipped"]
354
+
355
+ print(f"\nβœ… Successful: {len(successful)}/{len(pdf_files)}")
356
+ print(f"❌ Failed: {len(failed)}")
357
+ print(f"⏭️ Skipped: {len(skipped)}")
358
+ print(f"\n⏱️ Total Time: {total_time/60:.1f} minutes")
359
+
360
+ if successful:
361
+ total_chunks = sum(r["num_chunks"] for r in successful)
362
+ total_vectors = sum(r["num_vectors"] for r in successful)
363
+ avg_time = sum(r["time_total"] for r in successful) / len(successful)
364
+
365
+ print(f"\nπŸ“¦ Total Chunks: {total_chunks}")
366
+ print(f"πŸ”’ Total Vectors: {total_vectors}")
367
+ print(f"⏱️ Average Time per PDF: {avg_time:.1f}s")
368
+
369
+ # Check index stats
370
+ stats = index.describe_index_stats()
371
+ print(f"\nπŸ“Š Pinecone Index Stats:")
372
+ print(f" Total Vectors: {stats.get('total_vector_count', 0)}")
373
+ print(f" Dimensions: {stats.get('dimension', 0)}")
374
+
375
+ # Save detailed results
376
+ import json
377
+ results_file = OUTPUT_DIR / "ingestion_results.json"
378
+ with open(results_file, 'w', encoding='utf-8') as f:
379
+ json.dump({
380
+ "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
381
+ "total_pdfs": len(pdf_files),
382
+ "successful": len(successful),
383
+ "failed": len(failed),
384
+ "skipped": len(skipped),
385
+ "total_time_seconds": round(total_time, 2),
386
+ "results": results
387
+ }, f, indent=2, ensure_ascii=False)
388
+
389
+ print(f"\nπŸ“„ Detailed results saved to: {results_file}")
390
+ print("\n" + "="*70)
391
+ print("πŸŽ‰ INGESTION COMPLETE!")
392
+ print("="*70)
393
+
394
+
395
+ def test_single_pdf(pdf_name: str = "document_00.pdf"):
396
+ """Test ingestion with a single PDF."""
397
+ pdf_path = PDFS_DIR / pdf_name
398
+
399
+ if not pdf_path.exists():
400
+ print(f"❌ PDF not found: {pdf_path}")
401
+ return
402
+
403
+ print(f"\nπŸ§ͺ Testing with: {pdf_name}")
404
+ result = ingest_pdf(str(pdf_path))
405
+
406
+ print("\nπŸ“Š Test Result:")
407
+ print(json.dumps(result, indent=2))
408
+
409
+
410
+ if __name__ == "__main__":
411
+ import sys
412
+ import json
413
+
414
+ # Parse command line arguments
415
+ if len(sys.argv) > 1:
416
+ command = sys.argv[1]
417
+
418
+ if command == "test":
419
+ # Test with single PDF
420
+ pdf_name = sys.argv[2] if len(sys.argv) > 2 else "document_00.pdf"
421
+ test_single_pdf(pdf_name)
422
+
423
+ elif command == "clear":
424
+ # Clear index and ingest all
425
+ ingest_all_pdfs(clear_existing=True)
426
+
427
+ elif command == "stats":
428
+ # Show current index stats
429
+ stats = index.describe_index_stats()
430
+ print("\nπŸ“Š Pinecone Index Stats:")
431
+ if stats:
432
+ print(f" Total Vectors: {stats.get('total_vector_count', 0)}")
433
+ print(f" Dimensions: {stats.get('dimension', 0)}")
434
+ if 'namespaces' in stats:
435
+ print(f" Namespaces: {stats.get('namespaces', {})}")
436
+ else:
437
+ print(" No stats available")
438
+
439
+ else:
440
+ print("Usage:")
441
+ print(" python ingest_pdfs.py - Ingest all PDFs (append)")
442
+ print(" python ingest_pdfs.py clear - Clear index and ingest all")
443
+ print(" python ingest_pdfs.py test - Test with document_00.pdf")
444
+ print(" python ingest_pdfs.py test document_05.pdf - Test with specific PDF")
445
+ print(" python ingest_pdfs.py stats - Show index statistics")
446
+
447
+ else:
448
+ # Default: ingest all PDFs (append mode)
449
+ ingest_all_pdfs(clear_existing=False)