Nigerian Tax Law RAG System

A lightweight, scalable Retrieval-Augmented Generation (RAG) system for querying Nigerian tax and legal documents.

Overview

This system uses:

- FastAPI: backend API server
- Gemini API: embeddings and answer generation
- ChromaDB: vector database for semantic search
- pdfplumber: PDF text extraction
- tiktoken: text chunking with token counting

Architecture

```
        ┌─────────────────────────────┐
        │           Client            │
        └───────────────┬─────────────┘
                        │ /ask
                ┌───────▼──────────┐
                │   FastAPI API    │
                └───────┬──────────┘
                        │
                        │ Query → Gemini Embedding
                ┌───────▼──────────┐
                │    Vector DB     │
                │     (Chroma)     │
                └───────┬──────────┘
                        │
                        │ Retrieved Chunks
                ┌───────▼──────────┐
                │   Gemini Model   │
                │ (RAG Completion) │
                └───────┬──────────┘
                        │
                ┌───────▼──────────┐
                │   Final Answer   │
                └──────────────────┘
```
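
The flow in the diagram can be sketched as a single function. This is a minimal illustration, not the code in `main.py`: `answer_question` and its three callable parameters (`embed_query`, `search`, `generate`) are hypothetical names standing in for the Gemini embedding call, the Chroma query, and the Gemini completion.

```python
from typing import Callable, Sequence

# Hypothetical type aliases for the three stages in the diagram.
Embedder = Callable[[str], Sequence[float]]
Searcher = Callable[[Sequence[float], int], list]
Generator = Callable[[str, list], str]


def answer_question(
    question: str,
    embed_query: Embedder,
    search: Searcher,
    generate: Generator,
    top_k: int = 5,
) -> dict:
    """Run query -> embedding -> vector search -> grounded answer."""
    query_vector = embed_query(question)   # Query -> Gemini Embedding
    chunks = search(query_vector, top_k)   # Vector DB (Chroma)
    answer = generate(question, chunks)    # Gemini Model (RAG completion)
    return {"answer": answer, "sources": chunks, "chunks_used": len(chunks)}
```

Passing the stages in as callables keeps the orchestration testable without network access; each stage can be swapped for a stub.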

File Structure

```
tax/
├── docs/                    # Your PDF documents
│   ├── Nigeria-Tax-Act-2025.pdf
│   └── ... (other tax/legal PDFs)
└── rag/
    ├── RAG_SYSTEM_PLAN.md   # This file
    ├── main.py              # FastAPI server
    ├── ingest.py            # PDF → ChromaDB pipeline
    ├── utils.py             # Chunking + embedding functions
    ├── requirements.txt     # Python dependencies
    └── db/                  # ChromaDB vector database (auto-created)
```

Installation

Create a virtual environment (recommended):

```
cd rag
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows
```

Install dependencies:
```
pip install -r requirements.txt
```
Set your Gemini API key:
```
export GEMINI_API_KEY='your-api-key-here'
```
Get a free API key at: https://aistudio.google.com/apikey
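
On the Python side, the key can be read once at startup and rejected early if it is missing. The sketch below is an assumption about how this might look; `load_gemini_api_key` is a hypothetical helper, not a function confirmed to exist in `utils.py`.

```python
import os


def load_gemini_api_key(env=os.environ) -> str:
    """Return the Gemini API key, or raise a clear error if it is missing."""
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY not set. Run: export GEMINI_API_KEY='your-api-key-here'"
        )
    return key
```

Failing at startup gives a clearer message than an authentication error surfacing on the first request.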

Usage

Step 1: Ingest Documents

Index your PDF documents into the vector database:

```
cd rag
python ingest.py
```

Options:

- `--force` or `-f`: re-ingest all documents (update embeddings)
- `--clear`: clear the database before ingesting
- `--stats`: show database statistics only
- `--data-dir PATH`: use a different PDF directory
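
For illustration, these flags could be declared with `argparse` roughly as follows. This is a sketch, not the actual parser in `ingest.py`: `build_parser` is a hypothetical name, and the `../docs` default is taken from the Configuration section.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the ingest options listed above."""
    parser = argparse.ArgumentParser(
        description="Ingest PDFs into the vector database."
    )
    parser.add_argument("--force", "-f", action="store_true",
                        help="Re-ingest all documents (update embeddings)")
    parser.add_argument("--clear", action="store_true",
                        help="Clear the database before ingesting")
    parser.add_argument("--stats", action="store_true",
                        help="Show database statistics only")
    parser.add_argument("--data-dir", default="../docs", metavar="PATH",
                        help="Use a different PDF directory")
    return parser
```

For example, `build_parser().parse_args(["-f", "--data-dir", "/tmp/pdfs"])` yields a namespace with `force=True` and `data_dir="/tmp/pdfs"`.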

Step 2: Start the API Server

```
uvicorn main:app --reload
```

The API will be available at http://localhost:8000

Step 3: Query Documents

Ask a question:

```
curl -X POST "http://localhost:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the tax rates for personal income in Nigeria?"}'
```

Check API health:

```
curl http://localhost:8000/health
```

Get statistics:

```
curl http://localhost:8000/stats
```

API Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| `GET` | `/` | API information |
| `GET` | `/health` | Health check |
| `POST` | `/ask` | Ask a question |
| `POST` | `/ingest` | Upload a new PDF |
| `GET` | `/stats` | Database statistics |
| `DELETE` | `/documents/{name}` | Remove a document |

POST /ask

Request body:

```
{
  "question": "What is the penalty for late tax filing?",
  "top_k": 5,
  "model": "gemini-2.5-flash"
}
```

Response:

```
{
  "answer": "According to the Nigeria Tax Act 2025...",
  "sources": [
    {
      "document": "Nigeria-Tax-Act-2025.pdf",
      "chunk_index": 42,
      "relevance_score": 0.8532
    }
  ],
  "chunks_used": 5
}
```
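
A Python client for this endpoint might look like the sketch below, which uses only the standard library. `build_ask_request` and `summarize_sources` are hypothetical helper names; the field names and defaults mirror the request and response shown above.

```python
import json
import urllib.request


def build_ask_request(question, top_k=5, model="gemini-2.5-flash",
                      base_url="http://localhost:8000"):
    """Build (but do not send) a POST /ask request."""
    body = json.dumps({"question": question, "top_k": top_k, "model": model})
    return urllib.request.Request(
        f"{base_url}/ask",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_sources(response: dict) -> list:
    """Format each source as 'document#chunk_index (score)'."""
    return [
        f"{s['document']}#{s['chunk_index']} ({s['relevance_score']:.2f})"
        for s in response.get("sources", [])
    ]


# Sending the request requires the API server to be running:
# with urllib.request.urlopen(build_ask_request("...")) as resp:
#     data = json.load(resp)
#     print(data["answer"], summarize_sources(data))
```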

POST /ingest

Upload a PDF file to add to the index:

```
curl -X POST "http://localhost:8000/ingest" \
  -F "file=@new-document.pdf"
```

Configuration

Key settings in ingest.py:

- `CHUNK_SIZE = 500`: tokens per chunk
- `CHUNK_OVERLAP = 50`: tokens shared between consecutive chunks
- `DATA_DIR`: PDF source directory (`../docs`)
- `DB_DIR`: ChromaDB storage directory (`./db`)
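
To show how the chunk size and overlap interact, here is a minimal sliding-window sketch over a token sequence. `chunk_tokens` is a hypothetical function and may differ from `chunk_text()` in `utils.py`. With tiktoken, the tokens would typically come from `encoding.encode(text)` and each window would be turned back into text with `encoding.decode(window)`; which encoding the project uses is not stated here.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token sequence into windows of chunk_size sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this many tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the sequence
    return chunks
```

With the defaults, each window starts 450 tokens after the previous one, so a 1,200-token document yields three chunks (starting at tokens 0, 450, and 900), and neighbouring chunks share 50 tokens.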

Components

Data Ingestion (`ingest.py`)

- Extracts text from PDFs using pdfplumber
- Splits the text into chunks of ~500 tokens using tiktoken
- Generates embeddings with Gemini (`text-embedding-004`)
- Stores the chunks and embeddings in ChromaDB with metadata
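
The storage step needs an ID and metadata for every chunk. Chroma's `collection.add` accepts parallel `ids`, `documents`, `metadatas`, and `embeddings` lists, so the records for one PDF could be prepared as sketched below. The ID scheme and the `build_records` name are assumptions; the metadata keys mirror the `sources` fields returned by `/ask`.

```python
def build_records(document_name, chunks):
    """Return parallel lists of ids, documents, and metadatas for one PDF."""
    ids = [f"{document_name}::chunk-{i}" for i in range(len(chunks))]
    metadatas = [
        {"document": document_name, "chunk_index": i}
        for i in range(len(chunks))
    ]
    return ids, list(chunks), metadatas
```

Deriving IDs from the file name and chunk index makes them deterministic, so re-ingesting the same file produces the same IDs rather than new entries.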

Retrieval & Answer Generation (`main.py`)

- Converts the query to an embedding
- Searches ChromaDB for the top-K most similar chunks
- Sends the retrieved context and the question to Gemini
- Returns a grounded answer with its sources
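
The "context + question" step amounts to assembling a prompt from the retrieved chunks. The sketch below shows one possible layout; the prompt wording, the `build_prompt` name, and the chunk dictionary shape (`document`, `chunk_index`, `text`) are assumptions, not the prompt used in `main.py`.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks.

    Each chunk is a dict with 'document', 'chunk_index', and 'text' keys
    (an assumed shape for this sketch).
    """
    context_blocks = [
        f"[{i}] {c['document']} (chunk {c['chunk_index']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    ]
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the context blocks and naming their source document lets the model refer back to specific chunks, which helps when mapping the answer to the returned sources.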

Utilities (`utils.py`)

- `chunk_text()`: split text into token-based chunks
- `generate_embedding()`: create document embeddings
- `generate_query_embedding()`: create query embeddings
- `generate_answer()`: RAG completion with Gemini
- `clean_text()`: clean extracted PDF text
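
As an illustration of what cleaning extracted PDF text can involve, here is a minimal sketch. It is an assumption about typical cleanup steps, not the actual body of `clean_text()`. Note that rejoining words split across lines will also merge genuine hyphenated compounds that happen to break at a line end.

```python
import re


def clean_text(text: str) -> str:
    """Normalize text extracted from a PDF page."""
    text = text.replace("\x00", "")               # drop stray NUL bytes
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)          # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line
    return text.strip()
```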

Models Used

- Embeddings: `text-embedding-004` (768 dimensions)
- Generation: `gemini-2.5-flash` (default, fast)
  - Can also use `gemini-2.0-pro` for complex reasoning

Security Considerations

- API keys are loaded from environment variables
- Input is validated on all endpoints
- CORS middleware is configured; restrict the allowed origins in production
- Consider adding JWT authentication for production
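
To make the input-validation point concrete, the sketch below checks a `POST /ask` body in plain Python. The limits (`max_question_chars`, `max_top_k`) and the `validate_ask_payload` name are assumptions for illustration; in a FastAPI app the same rules would more commonly be expressed as constraints on a request model.

```python
def validate_ask_payload(payload: dict, max_question_chars: int = 2000,
                         max_top_k: int = 20) -> dict:
    """Validate a POST /ask body and return it with defaults applied."""
    question = payload.get("question")
    if not isinstance(question, str) or not question.strip():
        raise ValueError("question must be a non-empty string")
    if len(question) > max_question_chars:
        raise ValueError(f"question must be at most {max_question_chars} characters")

    top_k = payload.get("top_k", 5)
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(top_k, bool) or not isinstance(top_k, int):
        raise ValueError("top_k must be an integer")
    if not 1 <= top_k <= max_top_k:
        raise ValueError(f"top_k must be between 1 and {max_top_k}")

    return {"question": question.strip(), "top_k": top_k}
```

Capping `top_k` and the question length bounds the size of the prompt sent to the model, which limits both cost and abuse.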

Troubleshooting

"GEMINI_API_KEY not set"

```
export GEMINI_API_KEY='your-key'
```

"No documents indexed"

```
python ingest.py
```

"Error extracting text"

- Check that the PDF is not corrupted
- Some PDFs are image-based (scanned) and need OCR before text can be extracted
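
One way to spot image-based PDFs is to count how many pages yield no text. The heuristic below is a sketch under assumptions: it takes the per-page results of an extraction call such as pdfplumber's `page.extract_text()` (treated here as possibly `None` or an empty string for pages without a text layer), and the 0.8 threshold and function names are arbitrary illustrative choices.

```python
def image_based_ratio(page_texts):
    """Return the fraction of pages with no extractable text."""
    if not page_texts:
        return 0.0
    empty = sum(1 for text in page_texts if not (text or "").strip())
    return empty / len(page_texts)


def needs_ocr(page_texts, threshold=0.8):
    """Heuristic: flag a PDF when most of its pages yield no text."""
    return image_based_ratio(page_texts) >= threshold
```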

Slow ingestion

- Embedding generation is batched (100 texts at a time)
- Large PDFs with many pages take longer
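
The batching itself can be sketched as a small generator. `batched` is a hypothetical helper shown for illustration; the default of 100 matches the batch size mentioned above.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Iterating batch by batch also gives a natural place to log progress or retry a single failed batch instead of restarting the whole ingestion.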

Future Improvements

- Admin dashboard for document management
- Streaming responses
- Multi-collection support
- Document summaries
- Caching layer for frequent queries
- OCR support for scanned PDFs
- JWT authentication