nexusbert committed on
Commit
4b1d477
·
verified ·
1 Parent(s): 19db73d

Upload 14 files

RAG_SYSTEM_PLAN.md ADDED
@@ -0,0 +1,243 @@
# Nigerian Tax Law RAG System

A lightweight, scalable Retrieval-Augmented Generation (RAG) system for querying Nigerian tax and legal documents.

## Overview

This system uses:
- **FastAPI** - Backend API server
- **Gemini API** - Embeddings + answer generation
- **ChromaDB** - Vector database for semantic search
- **pdfplumber** - PDF text extraction
- **tiktoken** - Text chunking with token counting

## Architecture

```
┌─────────────────────────────┐
│           Client            │
└──────────────┬──────────────┘
               │ /ask
       ┌───────▼──────────┐
       │   FastAPI API    │
       └───────┬──────────┘
               │ Query → Gemini Embedding
       ┌───────▼──────────┐
       │    Vector DB     │
       │     (Chroma)     │
       └───────┬──────────┘
               │ Retrieved Chunks
       ┌───────▼──────────┐
       │   Gemini Model   │
       │ (RAG Completion) │
       └───────┬──────────┘
       ┌───────▼──────────┐
       │   Final Answer   │
       └──────────────────┘
```

## File Structure

```
tax/
├── docs/                    # Your PDF documents
│   ├── Nigeria-Tax-Act-2025.pdf
│   └── ... (other tax/legal PDFs)
└── rag/
    ├── RAG_SYSTEM_PLAN.md   # This file
    ├── main.py              # FastAPI server
    ├── ingest.py            # PDF → ChromaDB pipeline
    ├── utils.py             # Chunking + embedding functions
    ├── requirements.txt     # Python dependencies
    └── db/                  # ChromaDB vector database (auto-created)
```

## Installation

1. **Create a virtual environment** (recommended):
   ```bash
   cd rag
   python -m venv venv
   source venv/bin/activate  # Linux/Mac
   # or: venv\Scripts\activate  # Windows
   ```

2. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

3. **Set your Gemini API key**:
   ```bash
   export GEMINI_API_KEY='your-api-key-here'
   ```

   Get a free API key at: https://aistudio.google.com/apikey

## Usage

### Step 1: Ingest Documents

Index your PDF documents into the vector database:

```bash
cd rag
python ingest.py
```

Options:
- `--force` or `-f`: Re-ingest all documents (update embeddings)
- `--clear`: Clear the database before ingesting
- `--stats`: Show database statistics only
- `--data-dir PATH`: Use a different PDF directory

### Step 2: Start the API Server

```bash
uvicorn main:app --reload
```

The API will be available at `http://localhost:8000`.

### Step 3: Query Documents

**Ask a question:**
```bash
curl -X POST "http://localhost:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the tax rates for personal income in Nigeria?"}'
```

**Check API health:**
```bash
curl http://localhost:8000/health
```

**Get statistics:**
```bash
curl http://localhost:8000/stats
```

## API Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| `GET` | `/` | API information |
| `GET` | `/health` | Health check |
| `POST` | `/ask` | Ask a question |
| `POST` | `/ingest` | Upload a new PDF |
| `GET` | `/stats` | Database statistics |
| `DELETE` | `/documents/{name}` | Remove a document |

### POST /ask

Request body:
```json
{
  "question": "What is the penalty for late tax filing?",
  "top_k": 5,
  "model": "gemini-2.0-flash"
}
```

Response:
```json
{
  "answer": "According to the Nigeria Tax Act 2025...",
  "sources": [
    {
      "document": "Nigeria-Tax-Act-2025.pdf",
      "chunk_index": 42,
      "relevance_score": 0.8532
    }
  ],
  "chunks_used": 5
}
```
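The same `/ask` request can be made from Python using only the standard library. This is a minimal client sketch built on the documented request/response shape; `build_ask_payload` and `ask` are illustrative helper names, not part of the codebase:

```python
import json
import urllib.request


def build_ask_payload(question: str, top_k: int = 5,
                      model: str = "gemini-2.0-flash") -> dict:
    """Build a request body matching the documented /ask schema."""
    return {"question": question, "top_k": top_k, "model": model}


def ask(question: str, base_url: str = "http://localhost:8000") -> dict:
    """POST a question to the running API and return the parsed JSON reply."""
    data = json.dumps(build_ask_payload(question)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/ask",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Requires the server from Step 2 to be running, e.g.:
# answer = ask("What is the penalty for late tax filing?")["answer"]
```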
### POST /ingest

Upload a PDF file to add to the index:
```bash
curl -X POST "http://localhost:8000/ingest" \
  -F "file=@new-document.pdf"
```

## Configuration

Key settings in `ingest.py`:
- `CHUNK_SIZE = 500` - Tokens per chunk
- `CHUNK_OVERLAP = 50` - Overlap between chunks
- `DATA_DIR` - PDF source directory (`../docs`)
- `DB_DIR` - ChromaDB storage (`./db`)

## Components

### Data Ingestion (`ingest.py`)

1. Extracts text from PDFs using pdfplumber
2. Chunks into ~500-token pieces using tiktoken
3. Generates embeddings with Gemini (`text-embedding-004`)
4. Stores them in ChromaDB with metadata
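The chunking step is a sliding window over the token sequence: each chunk holds up to `CHUNK_SIZE` tokens, and consecutive chunks share `CHUNK_OVERLAP` tokens so context is not cut mid-sentence. A minimal sketch of the idea, operating on a plain token list for illustration (the actual `chunk_text()` tokenizes with tiktoken):

```python
def chunk_tokens(tokens, chunk_size=500, chunk_overlap=50):
    """Split a token sequence into overlapping windows.

    Each window holds up to `chunk_size` tokens; consecutive windows
    share `chunk_overlap` tokens of context.
    """
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must be larger than chunk_overlap")
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window reached the end; avoid a pure-overlap chunk
        start += chunk_size - chunk_overlap
    return chunks

# With real text, encode first, e.g.:
# tokens = tiktoken.get_encoding("cl100k_base").encode(text)
```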
### Retrieval & Answer Generation (`main.py`)

1. Converts the query to an embedding
2. Searches ChromaDB for the top-K similar chunks
3. Sends context + question to Gemini
4. Returns a grounded answer with sources

### Utilities (`utils.py`)

- `chunk_text()` - Split text into token-based chunks
- `generate_embedding()` - Create document embeddings
- `generate_query_embedding()` - Create query embeddings
- `generate_answer()` - RAG completion with Gemini
- `clean_text()` - Clean extracted PDF text

## Models Used

- **Embeddings**: `text-embedding-004` (768 dimensions)
- **Generation**: `gemini-2.0-flash` (default, fast)
  - Can also use `gemini-2.0-pro` for complex reasoning

## Security Considerations

- API keys loaded via environment variables
- Input validation on all endpoints
- CORS middleware configured (restrict in production)
- Consider adding JWT authentication for production
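The companion `app.py` in this commit adds one more measure: a per-client sliding-window rate limit. The bookkeeping reduces to dropping timestamps older than the window, then rejecting once the remainder hits the limit. A minimal sketch with injected timestamps for clarity (the server passes `time.time()`):

```python
from collections import defaultdict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds, per client key."""

    def __init__(self, limit=30, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(list)  # client key -> recent request timestamps

    def allow(self, client, now):
        # Keep only timestamps still inside the window.
        self.hits[client] = [t for t in self.hits[client] if now - t < self.window]
        if len(self.hits[client]) >= self.limit:
            return False  # over the limit: the caller should return HTTP 429
        self.hits[client].append(now)
        return True
```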
## Troubleshooting

**"GEMINI_API_KEY not set"**
```bash
export GEMINI_API_KEY='your-key'
```

**"No documents indexed"**
```bash
python ingest.py
```

**"Error extracting text"**
- Verify that the PDF is not corrupted
- Some PDFs are image-based and need OCR

**Slow ingestion**
- Embedding generation is batched (100 texts at a time)
- Large PDFs with many pages take longer

## Future Improvements

- [ ] Admin dashboard for document management
- [ ] Streaming responses
- [ ] Multi-collection support
- [ ] Document summaries
- [ ] Caching layer for frequent queries
- [ ] OCR support for scanned PDFs
- [ ] JWT authentication
app.py ADDED
@@ -0,0 +1,370 @@
import os
import time
import uuid
from contextlib import asynccontextmanager
from collections import defaultdict
from typing import Optional

from fastapi import FastAPI, HTTPException, UploadFile, File, Request, Depends, Form
from fastapi.middleware.cors import CORSMiddleware
from fastapi.security import APIKeyHeader
from pydantic import BaseModel
from dotenv import load_dotenv

load_dotenv("rag/.env")

from rag.utils import (
    get_gemini_client,
    generate_query_embedding,
    generate_answer
)
from rag.ingest import (
    get_pinecone_index,
    ingest_single_pdf,
    PINECONE_INDEX,
    DATA_DIR
)

API_KEY = os.environ.get("API_KEY")
RATE_LIMIT_REQUESTS = int(os.environ.get("RATE_LIMIT_REQUESTS", "30"))
RATE_LIMIT_WINDOW = int(os.environ.get("RATE_LIMIT_WINDOW", "60"))
ALLOWED_ORIGINS = os.environ.get("ALLOWED_ORIGINS", "*").split(",")

gemini_client = None
pinecone_index = None
rate_limit_store = defaultdict(list)
conversation_sessions = defaultdict(list)


def get_client_ip(request: Request) -> str:
    forwarded = request.headers.get("X-Forwarded-For")
    if forwarded:
        return forwarded.split(",")[0].strip()
    return request.client.host if request.client else "unknown"


def check_rate_limit(request: Request):
    client_ip = get_client_ip(request)
    now = time.time()

    # Sliding window: keep only timestamps inside the window.
    rate_limit_store[client_ip] = [
        t for t in rate_limit_store[client_ip]
        if now - t < RATE_LIMIT_WINDOW
    ]

    if len(rate_limit_store[client_ip]) >= RATE_LIMIT_REQUESTS:
        raise HTTPException(
            status_code=429,
            detail=f"Rate limit exceeded. Max {RATE_LIMIT_REQUESTS} requests per {RATE_LIMIT_WINDOW} seconds."
        )

    rate_limit_store[client_ip].append(now)


api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)


async def verify_api_key(api_key: str = Depends(api_key_header)):
    if API_KEY and api_key != API_KEY:
        raise HTTPException(status_code=403, detail="Invalid API key")
    return api_key


@asynccontextmanager
async def lifespan(app: FastAPI):
    global gemini_client, pinecone_index

    print("Starting Nigerian Tax Law RAG API...")

    if API_KEY:
        print("API Key authentication enabled")
    else:
        print("Warning: No API_KEY set - API is unprotected")

    try:
        gemini_client = get_gemini_client()
        print("Gemini client initialized")
    except ValueError as e:
        print(f"Warning: {e}")

    try:
        pinecone_index = get_pinecone_index()
        stats = pinecone_index.describe_index_stats()
        print(f"Pinecone initialized ({stats.total_vector_count} vectors)")
    except Exception as e:
        print(f"Warning: Pinecone error: {e}")

    yield

    print("Shutting down RAG API...")


app = FastAPI(
    title="Nigerian Tax Law RAG API",
    description="Query Nigerian tax laws and legal documents using AI-powered retrieval",
    version="1.0.0",
    lifespan=lifespan
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=ALLOWED_ORIGINS,
    allow_credentials=True,
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)


class AskResponse(BaseModel):
    answer: str
    sources: list[dict]
    chunks_used: int
    session_id: str


class IngestResponse(BaseModel):
    message: str
    filename: str
    chunks_added: int


class StatsResponse(BaseModel):
    total_vectors: int
    dimension: int
    index_name: str


class HealthResponse(BaseModel):
    status: str
    gemini_connected: bool
    pinecone_connected: bool
    vectors_indexed: int


@app.get("/", response_model=dict)
async def root():
    return {
        "name": "Nigerian Tax Law RAG API",
        "version": "1.0.0",
        "endpoints": {
            "POST /ask": "Ask a question about Nigerian tax law",
            "POST /ingest": "Upload and index a new PDF document",
            "GET /stats": "Get database statistics",
            "GET /health": "Health check"
        }
    }


@app.get("/health", response_model=HealthResponse)
async def health_check():
    gemini_ok = gemini_client is not None
    pinecone_ok = pinecone_index is not None
    vectors = 0

    if pinecone_ok:
        try:
            stats = pinecone_index.describe_index_stats()
            vectors = stats.total_vector_count
        except Exception:
            pinecone_ok = False

    return HealthResponse(
        status="healthy" if (gemini_ok and pinecone_ok) else "degraded",
        gemini_connected=gemini_ok,
        pinecone_connected=pinecone_ok,
        vectors_indexed=vectors
    )


@app.post("/ask", response_model=AskResponse)
async def ask_question(
    req: Request,
    question: str = Form(..., min_length=3, max_length=2000),
    top_k: int = Form(default=5, ge=1, le=20),
    model: str = Form(default="gemini-2.5-flash"),
    session_id: Optional[str] = Form(default=None),
    image: Optional[UploadFile] = File(default=None),
    api_key: str = Depends(verify_api_key)
):
    check_rate_limit(req)

    if gemini_client is None:
        raise HTTPException(
            status_code=503,
            detail="Gemini API not configured. Set GEMINI_API_KEY environment variable."
        )

    if pinecone_index is None:
        raise HTTPException(status_code=503, detail="Pinecone not initialized.")

    if not session_id:
        session_id = str(uuid.uuid4())

    image_data = None
    image_mime_type = None

    if image and image.filename:
        allowed_types = ["image/jpeg", "image/png", "image/gif", "image/webp"]
        if image.content_type not in allowed_types:
            raise HTTPException(
                status_code=400,
                detail=f"Invalid image type. Allowed: {', '.join(allowed_types)}"
            )
        if image.size and image.size > 10 * 1024 * 1024:
            raise HTTPException(status_code=400, detail="Image too large. Max 10MB.")

        image_data = await image.read()
        image_mime_type = image.content_type

    try:
        query_embedding = generate_query_embedding(gemini_client, question)
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error generating query embedding: {str(e)}")

    try:
        results = pinecone_index.query(
            vector=query_embedding,
            top_k=top_k,
            include_metadata=True
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error querying Pinecone: {str(e)}")

    if not results.matches:
        conversation_sessions[session_id].append({"role": "user", "content": question})
        conversation_sessions[session_id].append({"role": "assistant", "content": "I couldn't find any relevant information in the indexed documents."})

        return AskResponse(
            answer="I couldn't find any relevant information in the indexed documents.",
            sources=[],
            chunks_used=0,
            session_id=session_id
        )

    context_parts = []
    sources = []

    for match in results.matches:
        meta = match.metadata
        source_name = meta.get("source", "Unknown")
        chunk_idx = meta.get("chunk_index", 0)
        text = meta.get("text", "")

        context_parts.append(f"[Source: {source_name}, Chunk {chunk_idx + 1}]\n{text}")
        sources.append({
            "document": source_name,
            "chunk_index": chunk_idx,
            "relevance_score": round(match.score, 4)
        })

    context = "\n\n---\n\n".join(context_parts)

    conversation_history = conversation_sessions.get(session_id, [])

    try:
        answer = generate_answer(
            gemini_client,
            question,
            context,
            model=model,
            image_data=image_data,
            image_mime_type=image_mime_type,
            conversation_history=conversation_history
        )
    except Exception as e:
        error_msg = str(e)
        if "overloaded" in error_msg.lower() or "503" in error_msg:
            raise HTTPException(status_code=503, detail=error_msg)
        raise HTTPException(status_code=500, detail=f"Error generating answer: {error_msg}")

    conversation_sessions[session_id].append({"role": "user", "content": question})
    conversation_sessions[session_id].append({"role": "assistant", "content": answer})

    # Cap the stored history at the last 20 turns.
    if len(conversation_sessions[session_id]) > 20:
        conversation_sessions[session_id] = conversation_sessions[session_id][-20:]

    return AskResponse(
        answer=answer,
        sources=sources,
        chunks_used=len(results.matches),
        session_id=session_id
    )


@app.post("/ingest", response_model=IngestResponse)
async def ingest_document(
    req: Request,
    file: UploadFile = File(...),
    force: bool = False,
    api_key: str = Depends(verify_api_key)
):
    check_rate_limit(req)

    if gemini_client is None:
        raise HTTPException(
            status_code=503,
            detail="Gemini API not configured. Set GEMINI_API_KEY environment variable."
        )

    if pinecone_index is None:
        raise HTTPException(status_code=503, detail="Pinecone not initialized.")

    if not file.filename.lower().endswith(".pdf"):
        raise HTTPException(status_code=400, detail="Only PDF files are supported.")

    if file.size and file.size > 50 * 1024 * 1024:
        raise HTTPException(status_code=400, detail="File too large. Max 50MB.")

    DATA_DIR.mkdir(parents=True, exist_ok=True)
    safe_filename = "".join(c for c in file.filename if c.isalnum() or c in "._- ")
    file_path = DATA_DIR / safe_filename

    try:
        contents = await file.read()
        with open(file_path, "wb") as f:
            f.write(contents)
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error saving file: {str(e)}")

    try:
        chunks_added, _ = ingest_single_pdf(
            file_path,
            pinecone_index,
            gemini_client,
            force=force
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error ingesting document: {str(e)}")

    return IngestResponse(
        message="Document ingested successfully" if chunks_added > 0 else "Document already exists",
        filename=safe_filename,
        chunks_added=chunks_added
    )


@app.get("/stats", response_model=StatsResponse)
async def get_stats(api_key: str = Depends(verify_api_key)):
    if pinecone_index is None:
        raise HTTPException(status_code=503, detail="Pinecone not initialized.")

    try:
        stats = pinecone_index.describe_index_stats()
        return StatsResponse(
            total_vectors=stats.total_vector_count,
            dimension=stats.dimension,
            index_name=PINECONE_INDEX
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error getting stats: {str(e)}")


if __name__ == "__main__":
    import uvicorn
    uvicorn.run("app:app", host="0.0.0.0", port=8000, reload=True)
dockerfile ADDED
@@ -0,0 +1,42 @@
FROM python:3.10-slim

ARG HF_TOKEN

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    HF_TOKEN=${HF_TOKEN}

WORKDIR /code

# System dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    git \
    curl \
    libopenblas-dev \
    libomp-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Hugging Face dependencies
RUN pip install --no-cache-dir huggingface-hub sentencepiece

# Hugging Face cache environment
ENV HF_HOME=/data/huggingface \
    HUGGINGFACE_HUB_CACHE=/data/huggingface \
    HF_HUB_CACHE=/data/huggingface \
    API_PORT=7860

# Create cache dir and set permissions
RUN mkdir -p /data/huggingface && chmod -R 777 /data

# Copy project files
COPY . .

EXPOSE 7860

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
rag/.env ADDED
@@ -0,0 +1,9 @@
# Note: never commit real credentials; values below are placeholders.
GEMINI_API_KEY=your-gemini-api-key
PINECONE_API_KEY=your-pinecone-api-key
PINECONE_INDEX=sabitax

# Security
API_KEY=replace-with-a-long-random-secret
RATE_LIMIT_REQUESTS=30
RATE_LIMIT_WINDOW=60
ALLOWED_ORIGINS=*
rag/__init__.py ADDED
File without changes
rag/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (135 Bytes).
 
rag/__pycache__/ingest.cpython-312.pyc ADDED
Binary file (8.75 kB).
 
rag/__pycache__/utils.cpython-312.pyc ADDED
Binary file (12.4 kB).
 
rag/ingest.py ADDED
@@ -0,0 +1,211 @@
import os
from pathlib import Path
from hashlib import md5

import pdfplumber
from pinecone import Pinecone

try:
    from .utils import (
        get_gemini_client,
        chunk_text,
        clean_text,
        generate_batch_embeddings,
        count_tokens
    )
except ImportError:
    from utils import (
        get_gemini_client,
        chunk_text,
        clean_text,
        generate_batch_embeddings,
        count_tokens
    )


DATA_DIR = Path(__file__).parent.parent / "docs"
PINECONE_INDEX = os.environ.get("PINECONE_INDEX", "sabitax")
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50


def get_pinecone_client():
    api_key = os.environ.get("PINECONE_API_KEY")
    if not api_key:
        raise ValueError("PINECONE_API_KEY environment variable is not set.")
    return Pinecone(api_key=api_key)


def get_pinecone_index(pc=None):
    if pc is None:
        pc = get_pinecone_client()
    return pc.Index(PINECONE_INDEX)


def extract_text_from_pdf(pdf_path: Path) -> str:
    text_parts = []

    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page_num, page in enumerate(pdf.pages, 1):
                page_text = page.extract_text()
                if page_text:
                    text_parts.append(f"[Page {page_num}]\n{page_text}")
    except Exception as e:
        print(f"  Error extracting text from {pdf_path.name}: {e}")
        return ""

    full_text = "\n\n".join(text_parts)
    return clean_text(full_text)


def generate_chunk_id(doc_name: str, chunk_index: int) -> str:
    # Deterministic ID so re-ingesting the same document upserts in place.
    content = f"{doc_name}_{chunk_index}"
    return md5(content.encode()).hexdigest()


def ingest_single_pdf(
    pdf_path: Path,
    index,
    gemini_client,
    force: bool = False
) -> tuple[int, int]:
    doc_name = pdf_path.name

    if not force:
        # Probe for the document's first chunk to detect prior ingestion.
        test_id = generate_chunk_id(doc_name, 0)
        result = index.fetch(ids=[test_id])
        if result.vectors:
            print(f"  Skipping {doc_name} (already ingested)")
            return 0, 1

    print(f"  Processing: {doc_name}")

    text = extract_text_from_pdf(pdf_path)
    if not text:
        print(f"  No text extracted from {doc_name}")
        return 0, 0

    total_tokens = count_tokens(text)
    print(f"  Extracted {total_tokens:,} tokens")

    chunks = chunk_text(text, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
    print(f"  Created {len(chunks)} chunks")

    if not chunks:
        return 0, 0

    print("  Generating embeddings...")
    embeddings = generate_batch_embeddings(gemini_client, chunks)

    vectors = []
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
        vectors.append({
            "id": generate_chunk_id(doc_name, i),
            "values": embedding,
            "metadata": {
                "source": doc_name,
                "chunk_index": i,
                "total_chunks": len(chunks),
                "text": chunk[:1000]
            }
        })

    # Upsert in batches to stay within Pinecone request limits.
    batch_size = 100
    for i in range(0, len(vectors), batch_size):
        batch = vectors[i:i + batch_size]
        index.upsert(vectors=batch)

    print(f"  Added {len(chunks)} chunks to Pinecone")
    return len(chunks), 0


def ingest_all_documents(data_dir: Path = DATA_DIR, force: bool = False):
    print("\nStarting document ingestion pipeline\n")
    print(f"Data directory: {data_dir}")
    print(f"Pinecone index: {PINECONE_INDEX}\n")

    pdf_files = list(data_dir.glob("*.pdf"))

    if not pdf_files:
        print(f"No PDF files found in {data_dir}")
        return

    print(f"Found {len(pdf_files)} PDF files\n")

    print("Connecting to Gemini API...")
    gemini_client = get_gemini_client()

    print("Connecting to Pinecone...")
    index = get_pinecone_index()
    stats = index.describe_index_stats()
    print(f"Current index size: {stats.total_vector_count} vectors\n")
    print("-" * 60)

    total_added = 0
    total_skipped = 0

    for pdf_path in sorted(pdf_files):
        added, skipped = ingest_single_pdf(
            pdf_path,
            index,
            gemini_client,
            force=force
        )
        total_added += added
        total_skipped += skipped

    print("-" * 60)
    stats = index.describe_index_stats()
    print("\nIngestion complete!")
    print(f"  Chunks added: {total_added}")
    print(f"  Documents skipped: {total_skipped}")
    print(f"  Total index size: {stats.total_vector_count} vectors\n")


def clear_index():
    print("Clearing Pinecone index...")
    try:
        index = get_pinecone_index()
        index.delete(delete_all=True)
        print("Index cleared successfully")
    except Exception as e:
        print(f"Error clearing index: {e}")


def show_stats():
    print("\nPinecone Index Statistics\n")

    try:
        index = get_pinecone_index()
        stats = index.describe_index_stats()
        print(f"  Index: {PINECONE_INDEX}")
        print(f"  Total vectors: {stats.total_vector_count}")
        print(f"  Dimensions: {stats.dimension}")
    except Exception as e:
        print(f"  Error: {e}")

    print()


if __name__ == "__main__":
    import argparse
    from dotenv import load_dotenv
    load_dotenv()

    parser = argparse.ArgumentParser(description="Ingest PDF documents into Pinecone for RAG")
    parser.add_argument("--force", "-f", action="store_true", help="Re-ingest all documents (update embeddings)")
    parser.add_argument("--clear", action="store_true", help="Clear the index before ingesting")
    parser.add_argument("--stats", action="store_true", help="Show index statistics only")
    parser.add_argument("--data-dir", type=Path, default=DATA_DIR, help="PDF source directory")

    args = parser.parse_args()

    if args.stats:
        show_stats()
    elif args.clear:
        clear_index()
        ingest_all_documents(data_dir=args.data_dir, force=True)
    else:
        ingest_all_documents(data_dir=args.data_dir, force=args.force)
rag/main.py ADDED
@@ -0,0 +1,293 @@
import os
import tempfile
from pathlib import Path
from contextlib import asynccontextmanager
from typing import Optional

from fastapi import FastAPI, HTTPException, UploadFile, File, BackgroundTasks
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
import chromadb

from utils import (
    get_gemini_client,
    generate_query_embedding,
    generate_answer
)
from ingest import (
    get_chroma_client,
    get_or_create_collection,
    ingest_single_pdf,
    COLLECTION_NAME,
    DATA_DIR
)


gemini_client = None
chroma_collection = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    global gemini_client, chroma_collection

    print("Starting Nigerian Tax Law RAG API...")

    try:
        gemini_client = get_gemini_client()
        print("Gemini client initialized")
    except ValueError as e:
        print(f"Warning: {e}")
        print("The API will not work until GEMINI_API_KEY is set.")

    chroma_client = get_chroma_client()
    chroma_collection = get_or_create_collection(chroma_client)
    print(f"ChromaDB initialized ({chroma_collection.count()} chunks indexed)")

    yield

    print("Shutting down RAG API...")


app = FastAPI(
    title="Nigerian Tax Law RAG API",
    description="Query Nigerian tax laws and legal documents using AI-powered retrieval",
    version="1.0.0",
    lifespan=lifespan
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


class AskRequest(BaseModel):
    question: str = Field(..., min_length=3, max_length=2000)
    top_k: int = Field(default=5, ge=1, le=20)
    model: str = Field(default="gemini-2.0-flash")


class AskResponse(BaseModel):
    answer: str
    sources: list[dict]
    chunks_used: int


class IngestResponse(BaseModel):
    message: str
    filename: str
    chunks_added: int


class StatsResponse(BaseModel):
    total_chunks: int
    total_documents: int
    documents: list[dict]


class HealthResponse(BaseModel):
    status: str
    gemini_connected: bool
    chroma_connected: bool
    chunks_indexed: int


@app.get("/", response_model=dict)
async def root():
    return {
        "name": "Nigerian Tax Law RAG API",
        "version": "1.0.0",
        "endpoints": {
            "POST /ask": "Ask a question about Nigerian tax law",
            "POST /ingest": "Upload and index a new PDF document",
            "GET /stats": "Get database statistics",
            "GET /health": "Health check"
        }
    }


@app.get("/health", response_model=HealthResponse)
async def health_check():
    gemini_ok = gemini_client is not None
    chroma_ok = chroma_collection is not None
    chunks = chroma_collection.count() if chroma_ok else 0

    return HealthResponse(
        status="healthy" if (gemini_ok and chroma_ok) else "degraded",
        gemini_connected=gemini_ok,
        chroma_connected=chroma_ok,
        chunks_indexed=chunks
    )


@app.post("/ask", response_model=AskResponse)
async def ask_question(request: AskRequest):
    if gemini_client is None:
        raise HTTPException(
            status_code=503,
            detail="Gemini API not configured. Set GEMINI_API_KEY environment variable."
        )

    if chroma_collection is None:
        raise HTTPException(status_code=503, detail="Vector database not initialized.")

    if chroma_collection.count() == 0:
        raise HTTPException(
            status_code=404,
            detail="No documents indexed. Please ingest documents first using: python ingest.py"
        )

    try:
        query_embedding = generate_query_embedding(gemini_client, request.question)
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error generating query embedding: {str(e)}")

    try:
        results = chroma_collection.query(
            query_embeddings=[query_embedding],
            n_results=request.top_k,
            include=["documents", "metadatas", "distances"]
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error querying vector database: {str(e)}")

    documents = results["documents"][0] if results["documents"] else []
    metadatas = results["metadatas"][0] if results["metadatas"] else []
    distances = results["distances"][0] if results["distances"] else []

    if not documents:
        return AskResponse(
            answer="I couldn't find any relevant information in the indexed documents.",
            sources=[],
            chunks_used=0
        )

    context_parts = []
    sources = []

    for doc, meta, dist in zip(documents, metadatas, distances):
        source_name = meta.get("source", "Unknown")
        chunk_idx = meta.get("chunk_index", 0)

        context_parts.append(f"[Source: {source_name}, Chunk {chunk_idx + 1}]\n{doc}")
        sources.append({
            "document": source_name,
            "chunk_index": chunk_idx,
            # Chroma returns distances; convert to a similarity-style score.
            "relevance_score": round(1 - dist, 4)
        })

    context = "\n\n---\n\n".join(context_parts)

    try:
        answer = generate_answer(
            gemini_client,
            request.question,
            context,
            model=request.model
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error generating answer: {str(e)}")

    return AskResponse(
        answer=answer,
        sources=sources,
        chunks_used=len(documents)
    )
200
+
201
+
202
+ @app.post("/ingest", response_model=IngestResponse)
203
+ async def ingest_document(file: UploadFile = File(...), force: bool = False):
204
+ if gemini_client is None:
205
+ raise HTTPException(
206
+ status_code=503,
207
+ detail="Gemini API not configured. Set GEMINI_API_KEY environment variable."
208
+ )
209
+
210
+ if not file.filename.lower().endswith(".pdf"):
211
+ raise HTTPException(status_code=400, detail="Only PDF files are supported.")
212
+
213
+ DATA_DIR.mkdir(parents=True, exist_ok=True)
214
+ file_path = DATA_DIR / file.filename
215
+
216
+ try:
217
+ contents = await file.read()
218
+ with open(file_path, "wb") as f:
219
+ f.write(contents)
220
+ except Exception as e:
221
+ raise HTTPException(status_code=500, detail=f"Error saving file: {str(e)}")
222
+
223
+ try:
224
+ chunks_added, _ = ingest_single_pdf(
225
+ file_path,
226
+ chroma_collection,
227
+ gemini_client,
228
+ force=force
229
+ )
230
+ except Exception as e:
231
+ raise HTTPException(status_code=500, detail=f"Error ingesting document: {str(e)}")
232
+
233
+ return IngestResponse(
234
+ message="Document ingested successfully" if chunks_added > 0 else "Document already exists",
235
+ filename=file.filename,
236
+ chunks_added=chunks_added
237
+ )
238
+
239
+
240
+ @app.get("/stats", response_model=StatsResponse)
241
+ async def get_stats():
242
+ if chroma_collection is None:
243
+ raise HTTPException(status_code=503, detail="Vector database not initialized.")
244
+
245
+ count = chroma_collection.count()
246
+
247
+ if count == 0:
248
+ return StatsResponse(total_chunks=0, total_documents=0, documents=[])
249
+
250
+ results = chroma_collection.get(limit=count, include=["metadatas"])
251
+
252
+ doc_chunks = {}
253
+ for meta in results["metadatas"]:
254
+ if meta:
255
+ source = meta.get("source", "Unknown")
256
+ doc_chunks[source] = doc_chunks.get(source, 0) + 1
257
+
258
+ documents = [
259
+ {"name": name, "chunks": chunks}
260
+ for name, chunks in sorted(doc_chunks.items())
261
+ ]
262
+
263
+ return StatsResponse(
264
+ total_chunks=count,
265
+ total_documents=len(doc_chunks),
266
+ documents=documents
267
+ )
268
+
269
+
270
+ @app.delete("/documents/{document_name}")
271
+ async def delete_document(document_name: str):
272
+ if chroma_collection is None:
273
+ raise HTTPException(status_code=503, detail="Vector database not initialized.")
274
+
275
+ results = chroma_collection.get(
276
+ where={"source": document_name},
277
+ include=["metadatas"]
278
+ )
279
+
280
+ if not results["ids"]:
281
+ raise HTTPException(status_code=404, detail=f"Document '{document_name}' not found in index.")
282
+
283
+ chroma_collection.delete(ids=results["ids"])
284
+
285
+ return {
286
+ "message": f"Document '{document_name}' deleted successfully",
287
+ "chunks_deleted": len(results["ids"])
288
+ }
289
+
290
+
291
+ if __name__ == "__main__":
292
+ import uvicorn
293
+ uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)
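The `/ask` handler turns Chroma's cosine distances (smaller = closer) into a `relevance_score` of `1 - distance` for each retrieved chunk. A minimal sketch of that step, using made-up sample values rather than real query output:

```python
# Scoring step from /ask: flip Chroma distances into relevance scores.
# The documents/metadatas/distances below are hypothetical sample data.

def to_sources(documents, metadatas, distances):
    sources = []
    for doc, meta, dist in zip(documents, metadatas, distances):
        sources.append({
            "document": meta.get("source", "Unknown"),
            "chunk_index": meta.get("chunk_index", 0),
            "relevance_score": round(1 - dist, 4),  # smaller distance -> higher score
        })
    return sources

documents = ["Chunk about PAYE...", "Chunk about VAT..."]
metadatas = [
    {"source": "Nigeria-Tax-Act-2025.pdf", "chunk_index": 3},
    {"source": "Nigeria-Tax-Act-2025.pdf", "chunk_index": 17},
]
distances = [0.1234, 0.4567]

sources = to_sources(documents, metadatas, distances)
print(sources[0]["relevance_score"])  # 0.8766
```

Note this linear flip is only meaningful when the collection uses cosine distance; other distance metrics would need a different normalization.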
rag/requirements.txt ADDED
@@ -0,0 +1,10 @@
+ fastapi
+ uvicorn[standard]
+ python-multipart
+ pdfplumber
+ chromadb
+ tiktoken
+ google-genai
+ pydantic
+ python-dotenv
+ Pillow
rag/utils.py ADDED
@@ -0,0 +1,236 @@
+ import os
+ import re
+ import io
+ import time
+ import tiktoken
+ from dotenv import load_dotenv
+ from google import genai
+ from google.genai import types
+ from PIL import Image
+
+ load_dotenv()
+
+
+ def get_gemini_client():
+     api_key = os.environ.get("GEMINI_API_KEY")
+     if not api_key:
+         raise ValueError(
+             "GEMINI_API_KEY environment variable is not set. "
+             "Please set it with: export GEMINI_API_KEY='your-api-key'"
+         )
+     return genai.Client(api_key=api_key)
+
+
+ def count_tokens(text: str, model: str = "cl100k_base") -> int:
+     encoding = tiktoken.get_encoding(model)
+     return len(encoding.encode(text))
+
+
+ def chunk_text(
+     text: str,
+     chunk_size: int = 500,
+     chunk_overlap: int = 50,
+     encoding_name: str = "cl100k_base"
+ ) -> list[str]:
+     encoding = tiktoken.get_encoding(encoding_name)
+     tokens = encoding.encode(text)
+
+     chunks = []
+     start = 0
+
+     while start < len(tokens):
+         end = start + chunk_size
+         chunk_tokens = tokens[start:end]
+         # Decoded into a local name so it doesn't shadow the function itself.
+         decoded = encoding.decode(chunk_tokens)
+         chunks.append(decoded)
+         start = end - chunk_overlap
+
+         # Guards against an infinite loop when chunk_overlap >= chunk_size.
+         if start <= 0 and len(chunks) > 0:
+             break
+
+     return chunks
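`chunk_text` slides a fixed-size token window across the document, stepping forward by `chunk_size - chunk_overlap` tokens each iteration. The same windowing, sketched over plain integer ids so it runs without tiktoken (the window arithmetic is identical):

```python
# Windowing logic of chunk_text, over a plain list of token ids.
# With chunk_size=5 and chunk_overlap=2, the window advances by 3 each step.

def window(tokens, chunk_size=5, chunk_overlap=2):
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        start += chunk_size - chunk_overlap
        if chunk_size <= chunk_overlap:  # window would never advance
            break
    return chunks

chunks = window(list(range(12)))
print(chunks)
# [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7], [6, 7, 8, 9, 10], [9, 10, 11]]
```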
+
+
+ def generate_embedding(client: genai.Client, text: str) -> list[float]:
+     result = client.models.embed_content(
+         model="models/text-embedding-004",
+         contents=text,
+         config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT")
+     )
+     return result.embeddings[0].values
+
+
+ def generate_query_embedding(client: genai.Client, query: str) -> list[float]:
+     result = client.models.embed_content(
+         model="models/text-embedding-004",
+         contents=query,
+         config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY")
+     )
+     return result.embeddings[0].values
+
+
+ def generate_batch_embeddings(
+     client: genai.Client,
+     texts: list[str],
+     batch_size: int = 100
+ ) -> list[list[float]]:
+     all_embeddings = []
+
+     for i in range(0, len(texts), batch_size):
+         batch = texts[i:i + batch_size]
+         result = client.models.embed_content(
+             model="models/text-embedding-004",
+             contents=batch,
+             config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT")
+         )
+         batch_embeddings = [emb.values for emb in result.embeddings]
+         all_embeddings.extend(batch_embeddings)
+
+     return all_embeddings
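`generate_batch_embeddings` groups the chunks into batches of at most `batch_size` per API call. The slicing on its own, with the Gemini call stubbed out so only the grouping is exercised:

```python
# Batch slicing as in generate_batch_embeddings, without the API call.
# Each sublist would become one embed_content request.

def batched(texts, batch_size=100):
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

batches = batched([f"chunk-{n}" for n in range(250)])
print([len(b) for b in batches])  # [100, 100, 50]
```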
+
+
+ def generate_answer(
+     client: genai.Client,
+     question: str,
+     context: str,
+     model: str = "gemini-2.5-flash",
+     image_data: bytes = None,
+     image_mime_type: str = None,
+     conversation_history: list = None
+ ) -> str:
+     question_lower = question.lower().strip()
+
+     greetings = ["hello", "hi", "hey", "good morning", "good afternoon", "good evening", "greetings"]
+     is_greeting = any(question_lower.startswith(g) or question_lower == g for g in greetings)
+
+     if is_greeting:
+         prompt = f"""You are SabiTax, a friendly and conversational legal and tax expert assistant specializing in Nigerian law.
+ The user has greeted you. Respond naturally and warmly, like you're chatting with a friend. Introduce yourself as SabiTax in a casual, friendly way, and let them know you're here to help with any questions about Nigerian tax laws.
+
+ User: {question}
+
+ Respond conversationally - be warm, natural, and brief (2-3 sentences). Use a friendly, approachable tone."""
+     else:
+         name_questions = ["what is your name", "who are you", "what are you called", "what's your name", "tell me your name", "introduce yourself"]
+         is_name_question = any(q in question_lower for q in name_questions)
+
+         if is_name_question:
+             prompt = f"""You are SabiTax, a friendly and conversational legal and tax expert assistant specializing in Nigerian law and taxation.
+
+ User: {question}
+
+ Respond naturally and conversationally. Introduce yourself as SabiTax in a friendly, casual way. Explain that you help people understand Nigerian tax laws in simple terms, like you're explaining to a friend. Keep it brief, warm, and conversational."""
+         else:
+             history_text = ""
+             if conversation_history and len(conversation_history) > 0:
+                 history_text = "\n\nPrevious conversation:\n"
+                 for msg in conversation_history[-6:]:
+                     role = "User" if msg["role"] == "user" else "You (SabiTax)"
+                     history_text += f"{role}: {msg['content']}\n"
+                 history_text += "\n"
+
+             prompt = f"""You are SabiTax, a friendly and conversational legal and tax expert assistant specializing in Nigerian law and taxation. You talk to users like you're having a natural conversation with a friend - warm, approachable, and easy to understand.
+
+ Your style:
+ - Talk naturally, like you're chatting over coffee
+ - Use "you" and "I" - make it personal and engaging
+ - Be warm and friendly, not robotic or formal
+ - Use everyday language and simple explanations
+ - Reference previous parts of the conversation when relevant: "As I mentioned earlier..." or "Building on what we discussed..."
+ - Ask follow-up questions if helpful: "Does that make sense?" or "Want me to explain that differently?"
+ - Show enthusiasm about helping: "Great question!" or "I'm happy to help with that!"
+
+ Your approach:
+ 1. **Reason through the information**: Think about what the user really needs to know
+ 2. **Break it down simply**: Translate complex legal stuff into everyday language
+ 3. **Make it practical**: Focus on "what this means for you" and "what you need to do"
+ 4. **Prioritize current info**: Always mention the most recent laws first (2025 over 2020, etc.) and note if something's been updated
+ 5. **Continue the conversation**: If this is part of an ongoing discussion, naturally reference what was said before
+
+ Important rules:
+ - Answer based ONLY on the provided context from the documents
+ - Always prioritize the most recent/current legislation (e.g., 2025 acts over 2020 acts)
+ - If there's old info, mention it's been updated: "The old 2020 law has been replaced by the 2025 act..."
+ - Explain everything in simple terms - no legal jargon without explanation
+ - Use examples and analogies to make things clearer
+ - If you don't have enough info, say so honestly: "I don't have enough details on that, but here's what I know..."
+ - Keep it conversational - use short paragraphs, bullet points when helpful, but write like you're talking
+ - If the user is continuing a topic from earlier, acknowledge it and build on the previous conversation
+
+ {history_text}Context from documents:
+ {context}
+
+ Question: {question}
+
+ Respond naturally and conversationally. Explain things like you're helping a friend understand their taxes. Be clear, friendly, and focus on what they actually need to know. If this continues a previous topic, reference it naturally."""
+
+     if image_data:
+         img = Image.open(io.BytesIO(image_data))
+         contents = [prompt, img]
+     else:
+         contents = prompt
+
+     max_retries = 3
+     retry_delay = 2
+
+     for attempt in range(max_retries):
+         try:
+             response = client.models.generate_content(
+                 model=model,
+                 contents=contents
+             )
+             return response.text
+         except Exception as e:
+             error_str = str(e)
+             if "503" in error_str or "UNAVAILABLE" in error_str or "overloaded" in error_str.lower():
+                 if attempt < max_retries - 1:
+                     wait_time = retry_delay * (2 ** attempt)
+                     time.sleep(wait_time)
+                     continue
+                 else:
+                     raise Exception("Gemini service is temporarily overloaded. Please try again in a few moments.")
+             else:
+                 raise e
+
+     raise Exception("Failed to generate answer after multiple attempts")
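The retry loop above backs off exponentially on 503/overloaded errors: attempt `n` sleeps `retry_delay * 2**n` seconds, and the final attempt raises instead of sleeping. The resulting wait schedule for the defaults (`max_retries=3`, `retry_delay=2`):

```python
# Backoff schedule of generate_answer's retry loop: only the first
# max_retries - 1 attempts sleep; the last one raises instead.

max_retries = 3
retry_delay = 2
waits = [retry_delay * (2 ** attempt) for attempt in range(max_retries - 1)]
print(waits)  # [2, 4]
```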
+
+
+ def clean_text(text: str) -> str:
+     text = text.encode('utf-8', errors='ignore').decode('utf-8')
+     text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]', '', text)
+
+     text = re.sub(r'Page \d+ of \d+', '', text, flags=re.IGNORECASE)
+     text = re.sub(r'^\d+\s*$', '', text, flags=re.MULTILINE)
+     text = re.sub(r'^[-_=]{3,}$', '', text, flags=re.MULTILINE)
+
+     text = re.sub(r'\.{3,}', '...', text)
+     text = re.sub(r'_{2,}', ' ', text)
+     text = re.sub(r'-{3,}', ' - ', text)
+
+     text = re.sub(r'\t+', ' ', text)
+     text = re.sub(r' +', ' ', text)
+     text = re.sub(r'\n{3,}', '\n\n', text)
+
+     text = re.sub(r'(\d+)\s*\.\s*(\d+)', r'\1.\2', text)
+     text = re.sub(r'([a-z])\s*-\s*([a-z])', r'\1\2', text)
+
+     lines = []
+     for line in text.split('\n'):
+         line = line.strip()
+         if len(line) > 2:
+             lines.append(line)
+         elif line == '':
+             lines.append(line)
+     text = '\n'.join(lines)
+
+     seen = set()
+     final_lines = []
+     for line in text.split('\n'):
+         line_lower = line.lower().strip()
+         if len(line_lower) < 50 and line_lower in seen:
+             continue
+         if len(line_lower) > 5:
+             seen.add(line_lower)
+         final_lines.append(line)
+
+     return '\n'.join(final_lines).strip()
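`clean_text` is a pipeline of regex passes over pdfplumber output. A before/after run of three of the same passes (page footers, standalone page numbers, dot leaders) on a made-up fragment:

```python
import re

# Three of clean_text's passes applied to typical PDF extraction noise:
# a "Page X of Y" footer, a bare page number, and table-of-contents dot leaders.

raw = "Section 12 ......... 45\nPage 3 of 120\n7\nTax is charged on income."
text = re.sub(r'Page \d+ of \d+', '', raw, flags=re.IGNORECASE)
text = re.sub(r'^\d+\s*$', '', text, flags=re.MULTILINE)
text = re.sub(r'\.{3,}', '...', text)
text = re.sub(r'\n{3,}', '\n\n', text)
print(text)
```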
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ fastapi
+ uvicorn[standard]
+ python-multipart
+ pdfplumber
+ chromadb
+ tiktoken
+ google-genai
+ pydantic
+ python-dotenv
+ Pillow