Nigerian Tax Law RAG System

A lightweight, scalable Retrieval-Augmented Generation (RAG) system for querying Nigerian tax and legal documents.

Overview

This system uses:

  • FastAPI - Backend API server
  • Gemini API - Embeddings + answer generation
  • ChromaDB - Vector database for semantic search
  • pdfplumber - PDF text extraction
  • tiktoken - Text chunking with token counting

Architecture

        ┌─────────────────────────────┐
        │           Client            │
        └───────────────┬─────────────┘
                        │ /ask
                ┌───────▼─────────┐
                │   FastAPI API   │
                └───────┬─────────┘
                        │
                        │ Query → Gemini Embedding
                ┌───────▼───────────┐
                │    Vector DB      │
                │     (Chroma)      │
                └───────┬───────────┘
                        │
                        │ Retrieved Chunks
                ┌───────▼───────────┐
                │   Gemini Model    │
                │ (RAG Completion)  │
                └───────┬───────────┘
                        │
                ┌───────▼───────────┐
                │   Final Answer    │
                └───────────────────┘

File Structure

tax/
├── docs/                    # Your PDF documents
│   ├── Nigeria-Tax-Act-2025.pdf
│   └── ... (other tax/legal PDFs)
└── rag/
    ├── RAG_SYSTEM_PLAN.md   # This file
    ├── main.py              # FastAPI server
    ├── ingest.py            # PDF → ChromaDB pipeline
    ├── utils.py             # Chunking + embedding functions
    ├── requirements.txt     # Python dependencies
    └── db/                  # ChromaDB vector database (auto-created)

Installation

  1. Create a virtual environment (recommended):

    cd rag
    python -m venv venv
    source venv/bin/activate  # Linux/Mac
    # or: venv\Scripts\activate  # Windows
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Set your Gemini API key:

    export GEMINI_API_KEY='your-api-key-here'  # Linux/Mac
    # or: set GEMINI_API_KEY=your-api-key-here  # Windows (cmd)
    

    Get a free API key at: https://aistudio.google.com/apikey

Usage

Step 1: Ingest Documents

Index your PDF documents into the vector database:

cd rag
python ingest.py

Options:

  • --force or -f: Re-ingest all documents (update embeddings)
  • --clear: Clear the database before ingesting
  • --stats: Show database statistics only
  • --data-dir PATH: Use a different PDF directory
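
The flags above could be declared with argparse roughly as follows. This is a minimal sketch, not the actual ingest.py: the build_parser name and the help strings are assumptions, and the ../docs default is taken from the Configuration section.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented ingest.py flags."""
    parser = argparse.ArgumentParser(description="Ingest PDFs into ChromaDB")
    parser.add_argument("--force", "-f", action="store_true",
                        help="Re-ingest all documents (update embeddings)")
    parser.add_argument("--clear", action="store_true",
                        help="Clear the database before ingesting")
    parser.add_argument("--stats", action="store_true",
                        help="Show database statistics only")
    parser.add_argument("--data-dir", default="../docs", metavar="PATH",
                        help="Use a different PDF directory")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is
# reproducible; this is equivalent to running
#   python ingest.py --force --data-dir ./my-pdfs
args = build_parser().parse_args(["--force", "--data-dir", "./my-pdfs"])
print(args.force, args.clear, args.stats, args.data_dir)
```

Note that argparse exposes --data-dir as args.data_dir, with the hyphen turned into an underscore.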

Step 2: Start the API Server

uvicorn main:app --reload

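The API will be available at http://localhost:8000. The --reload flag restarts the server whenever the code changes and is intended for development only.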

Step 3: Query Documents

Ask a question:

curl -X POST "http://localhost:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the tax rates for personal income in Nigeria?"}'

Check API health:

curl http://localhost:8000/health

Get statistics:

curl http://localhost:8000/stats
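
The same question can be sent from Python. The sketch below uses only the standard library; the helper names build_ask_request and ask are hypothetical, and the base URL assumes the default uvicorn host and port shown above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default uvicorn host and port


def build_ask_request(question: str, top_k: int = 5) -> urllib.request.Request:
    """Build (but do not send) a POST /ask request with a JSON body."""
    body = json.dumps({"question": question, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, top_k: int = 5, timeout: float = 60.0) -> dict:
    """Send the question to a running server and return the parsed JSON reply."""
    request = build_ask_request(question, top_k)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, ask("What are the tax rates for personal income in Nigeria?")["answer"] would return the generated answer text.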

API Endpoints

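| Method | Endpoint          | Description         |
|--------|-------------------|---------------------|
| GET    | /                 | API information     |
| GET    | /health           | Health check        |
| POST   | /ask              | Ask a question      |
| POST   | /ingest           | Upload a new PDF    |
| GET    | /stats            | Database statistics |
| DELETE | /documents/{name} | Remove a document   |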

POST /ask

Request body:

{
  "question": "What is the penalty for late tax filing?",
  "top_k": 5,
  "model": "gemini-2.5-flash"
}

Response:

{
  "answer": "According to the Nigeria Tax Act 2025...",
  "sources": [
    {
      "document": "Nigeria-Tax-Act-2025.pdf",
      "chunk_index": 42,
      "relevance_score": 0.8532
    }
  ],
  "chunks_used": 5
}
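
The request and response shapes above can be modeled as typed structures. In a FastAPI app these would normally be Pydantic models; the sketch below uses standard-library dataclasses to stay dependency-free. The class names and the validation rules are assumptions, not taken from main.py.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AskRequest:
    question: str
    top_k: int = 5
    model: str = "gemini-2.5-flash"

    def __post_init__(self) -> None:
        # Assumed validation rules; the real endpoint may differ.
        if not self.question.strip():
            raise ValueError("question must not be empty")
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")


@dataclass
class Source:
    document: str
    chunk_index: int
    relevance_score: float


@dataclass
class AskResponse:
    answer: str
    sources: List[Source] = field(default_factory=list)
    chunks_used: int = 0

    @classmethod
    def from_dict(cls, payload: dict) -> "AskResponse":
        """Parse the JSON object returned by POST /ask."""
        sources = [Source(**item) for item in payload.get("sources", [])]
        return cls(
            answer=payload["answer"],
            sources=sources,
            chunks_used=payload.get("chunks_used", len(sources)),
        )
```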

POST /ingest

Upload a PDF file to add to the index:

curl -X POST "http://localhost:8000/ingest" \
  -F "file=@new-document.pdf"

Configuration

Key settings in ingest.py:

  • CHUNK_SIZE = 500 - Tokens per chunk
  • CHUNK_OVERLAP = 50 - Tokens shared between consecutive chunks
  • DATA_DIR - PDF source directory (../docs)
  • DB_DIR - ChromaDB storage (./db)
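
To show how CHUNK_SIZE and CHUNK_OVERLAP interact, here is a minimal sketch of overlapping token windows. It works on a plain list of token IDs so it runs without tiktoken; chunk_tokens is a hypothetical name and not necessarily how chunk_text() in utils.py is written.

```python
from typing import List, Sequence

CHUNK_SIZE = 500     # tokens per chunk
CHUNK_OVERLAP = 50   # tokens shared between consecutive chunks


def chunk_tokens(tokens: Sequence[int],
                 size: int = CHUNK_SIZE,
                 overlap: int = CHUNK_OVERLAP) -> List[List[int]]:
    """Split a token sequence into windows of `size` tokens.

    Each window starts `size - overlap` tokens after the previous one,
    so neighbouring chunks share `overlap` tokens of context.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# 1200 tokens produce windows starting at 0, 450, and 900.
chunks = chunk_tokens(list(range(1200)))
print([len(c) for c in chunks])
```

With tiktoken, the token list would come from something like tiktoken.get_encoding("cl100k_base").encode(text), and each chunk would be turned back into text with .decode(chunk); the encoding name is an assumption. Since tiktoken implements OpenAI tokenizers, its counts only approximate how Gemini tokenizes the same text.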

Components

Data Ingestion (ingest.py)

  1. Extracts text from PDFs using pdfplumber
  2. Splits the text into chunks of about 500 tokens, counted with tiktoken
  3. Generates embeddings with Gemini (text-embedding-004)
  4. Stores in ChromaDB with metadata

Retrieval & Answer Generation (main.py)

  1. Converts query to embedding
  2. Searches ChromaDB for top-K similar chunks
  3. Sends context + question to Gemini
  4. Returns grounded answer with sources
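
The retrieval steps above can be illustrated without any external service. In the real system ChromaDB performs the similarity search and Gemini produces the embeddings; the pure-Python sketch below only shows the idea with toy 3-dimensional vectors. The function names and the prompt wording are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec: Sequence[float],
                 chunk_vecs: Sequence[Sequence[float]],
                 chunks: Sequence[str],
                 k: int = 5) -> List[Tuple[str, float]]:
    """Return the k chunks most similar to the query, best first."""
    scored = [(chunk, cosine_similarity(query_vec, vec))
              for chunk, vec in zip(chunks, chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_prompt(question: str, retrieved: Sequence[Tuple[str, float]]) -> str:
    """Combine the retrieved chunks and the question into one grounded prompt."""
    context = "\n\n".join(f"[{i}] {text}"
                          for i, (text, _) in enumerate(retrieved, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy data standing in for stored chunk embeddings.
chunks = ["chunk A", "chunk B", "chunk C"]
vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
hits = top_k_chunks([0.0, 1.0, 0.0], vectors, chunks, k=2)
prompt = build_prompt("What is the penalty for late tax filing?", hits)
print(prompt)
```

Numbering the context passages makes it straightforward to map parts of the answer back to the sources list returned by /ask.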

Utilities (utils.py)

  • chunk_text() - Split text into token-based chunks
  • generate_embedding() - Create document embeddings
  • generate_query_embedding() - Create query embeddings
  • generate_answer() - RAG completion with Gemini
  • clean_text() - Clean extracted PDF text
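
As an illustration of the kind of cleanup that extracted PDF text usually needs, here is a hypothetical sketch. It is not the clean_text() from utils.py, and the specific rules are assumptions.

```python
import re


def clean_text_sketch(text: str) -> str:
    """Normalize whitespace in text extracted from a PDF (illustrative rules)."""
    # Use Unix line endings throughout.
    text = text.replace("\r\n", "\n")
    # Rejoin words hyphenated across a line break: "taxa-\ntion" -> "taxation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that sit directly before or after a line break.
    text = re.sub(r" *\n *", "\n", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


print(clean_text_sketch("Taxa-\ntion   of  income\r\n\r\n\r\n\r\nPart  II "))
```

The hyphen rule is a trade-off: it also merges genuinely hyphenated words that happen to break at a line end (for example "self-" followed by "employed"), so legal text may call for a more conservative rule.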

Models Used

  • Embeddings: text-embedding-004 (768 dimensions)
  • Generation: gemini-2.5-flash (default, fast)
    • Can also use a Pro-tier model such as gemini-2.5-pro for complex reasoning

Security Considerations

  • API keys loaded via environment variables
  • Input validation on all endpoints
  • CORS middleware is configured; restrict the allowed origins in production
  • Consider adding JWT authentication for production
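
Loading the key from the environment and failing early with a clear message could look like the sketch below. The load_api_key name and the error wording are assumptions; only the GEMINI_API_KEY variable name comes from this document.

```python
import os


def load_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Read the API key from the environment; never hard-code it in source."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} not set; export it before starting the server"
        )
    return key
```

Calling this once at startup surfaces a missing key immediately instead of on the first /ask request.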

Troubleshooting

"GEMINI_API_KEY not set"

export GEMINI_API_KEY='your-key'

"No documents indexed"

python ingest.py

"Error extracting text"

  • Check that the PDF opens normally and is not corrupted
  • Scanned (image-based) PDFs contain no embedded text for pdfplumber to extract; run OCR on them first

Slow ingestion

  • Embedding generation is batched (100 texts at a time)
  • Large PDFs with many pages take longer
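
The batching mentioned above can be sketched as follows. Here embed_batch is a placeholder for whatever call sends one batch of texts to the embedding API; both helper names are hypothetical.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
BATCH_SIZE = 100  # matches the batch size noted above


def batched(items: Sequence[T], size: int = BATCH_SIZE) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_all(texts: Sequence[str],
              embed_batch: Callable[[List[str]], List[List[float]]],
              size: int = BATCH_SIZE) -> List[List[float]]:
    """Embed every text, making one embed_batch call per batch."""
    vectors: List[List[float]] = []
    for batch in batched(texts, size):
        vectors.extend(embed_batch(batch))
    return vectors


# 250 texts are sent as three requests of 100, 100, and 50 texts.
print([len(batch) for batch in batched(list(range(250)))])
```

Ingestion time therefore grows with the number of chunks divided by the batch size, which is why large PDFs take noticeably longer.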

Future Improvements

  • Admin dashboard for document management
  • Streaming responses
  • Multi-collection support
  • Document summaries
  • Caching layer for frequent queries
  • OCR support for scanned PDFs
  • JWT authentication