Nigerian Tax Law RAG System
A lightweight, scalable Retrieval-Augmented Generation (RAG) system for querying Nigerian tax and legal documents.
Overview
This system uses:
- FastAPI - Backend API server
- Gemini API - Embeddings + answer generation
- ChromaDB - Vector database for semantic search
- pdfplumber - PDF text extraction
- tiktoken - Text chunking with token counting
Architecture
┌─────────────────────────────┐
│           Client            │
└──────────────┬──────────────┘
               │ /ask
       ┌───────▼────────┐
       │  FastAPI API   │
       └───────┬────────┘
               │
               │ Query → Gemini Embedding
       ┌───────▼──────────┐
       │    Vector DB     │
       │    (Chroma)      │
       └───────┬──────────┘
               │
               │ Retrieved Chunks
       ┌───────▼──────────┐
       │   Gemini Model   │
       │ (RAG Completion) │
       └───────┬──────────┘
               │
       ┌───────▼──────────┐
       │   Final Answer   │
       └──────────────────┘
File Structure
tax/
├── docs/                    # Your PDF documents
│   ├── Nigeria-Tax-Act-2025.pdf
│   └── ... (other tax/legal PDFs)
└── rag/
    ├── RAG_SYSTEM_PLAN.md   # This file
    ├── main.py              # FastAPI server
    ├── ingest.py            # PDF → ChromaDB pipeline
    ├── utils.py             # Chunking + embedding functions
    ├── requirements.txt     # Python dependencies
    └── db/                  # ChromaDB vector database (auto-created)
Installation
Create a virtual environment (recommended):
cd rag
python -m venv venv
source venv/bin/activate # Linux/Mac
# or: venv\Scripts\activate # Windows
Install dependencies:
pip install -r requirements.txt
Set your Gemini API key:
export GEMINI_API_KEY='your-api-key-here'
Get a free API key at: https://aistudio.google.com/apikey
Usage
Step 1: Ingest Documents
Index your PDF documents into the vector database:
cd rag
python ingest.py
Options:
- --force or -f: Re-ingest all documents (update embeddings)
- --clear: Clear the database before ingesting
- --stats: Show database statistics only
- --data-dir PATH: Use a different PDF directory
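The options above map naturally onto `argparse`. The sketch below is one way the flags could be declared; the real `ingest.py` is not shown here, so the function name `build_parser` and the `../docs` default (taken from the `DATA_DIR` setting under Configuration) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the four documented ingest options (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        description="Ingest PDF documents into the vector database"
    )
    parser.add_argument("-f", "--force", action="store_true",
                        help="Re-ingest all documents (update embeddings)")
    parser.add_argument("--clear", action="store_true",
                        help="Clear the database before ingesting")
    parser.add_argument("--stats", action="store_true",
                        help="Show database statistics only")
    # Assumed default: the DATA_DIR value listed under Configuration.
    parser.add_argument("--data-dir", default="../docs", metavar="PATH",
                        help="Use a different PDF directory")
    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args(["--force", "--data-dir", "/tmp/pdfs"])
print(args.force, args.clear, args.stats, args.data_dir)
```

Note that `argparse` exposes `--data-dir` as the attribute `data_dir`.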
Step 2: Start the API Server
uvicorn main:app --reload
The API will be available at http://localhost:8000
Step 3: Query Documents
Ask a question:
curl -X POST "http://localhost:8000/ask" \
-H "Content-Type: application/json" \
-d '{"question": "What are the tax rates for personal income in Nigeria?"}'
Check API health:
curl http://localhost:8000/health
Get statistics:
curl http://localhost:8000/stats
API Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | / | API information |
| GET | /health | Health check |
| POST | /ask | Ask a question |
| POST | /ingest | Upload a new PDF |
| GET | /stats | Database statistics |
| DELETE | /documents/{name} | Remove a document |
POST /ask
Request body:
{
"question": "What is the penalty for late tax filing?",
"top_k": 5,
"model": "gemini-2.5-flash"
}
Response:
{
"answer": "According to the Nigeria Tax Act 2025...",
"sources": [
{
"document": "Nigeria-Tax-Act-2025.pdf",
"chunk_index": 42,
"relevance_score": 0.8532
}
],
"chunks_used": 5
}
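For callers that prefer Python over curl, the request and response shapes above can be handled with the standard library alone. This is a minimal sketch: the helper names `build_ask_request`, `summarize_response`, and `ask` are hypothetical, and the URL assumes the default uvicorn address from Step 2.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/ask"  # default local address from Step 2


def build_ask_request(question: str, top_k: int = 5,
                      model: str = "gemini-2.5-flash") -> urllib.request.Request:
    """Build (but do not send) a POST /ask request with a JSON body."""
    payload = {"question": question, "top_k": top_k, "model": model}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_response(body: dict) -> str:
    """Format the answer followed by one line per cited source."""
    lines = [body["answer"]]
    for src in body.get("sources", []):
        lines.append(
            f"- {src['document']} (chunk {src['chunk_index']}, "
            f"score {src['relevance_score']:.2f})"
        )
    return "\n".join(lines)


def ask(question: str) -> str:
    """Send the question to a running server and return a readable summary."""
    with urllib.request.urlopen(build_ask_request(question), timeout=60) as resp:
        return summarize_response(json.load(resp))
```

With the server running, `print(ask("What is the penalty for late tax filing?"))` would print the answer and its sources.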
POST /ingest
Upload a PDF file to add to the index:
curl -X POST "http://localhost:8000/ingest" \
-F "file=@new-document.pdf"
Configuration
Key settings in ingest.py:
- CHUNK_SIZE = 500 - Tokens per chunk
- CHUNK_OVERLAP = 50 - Overlap between chunks
- DATA_DIR - PDF source directory (../docs)
- DB_DIR - ChromaDB storage (./db)
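To see how the chunk size and overlap interact, consider a sliding window over a token sequence: with 500-token chunks and a 50-token overlap, each new chunk starts 450 tokens after the previous one. The sketch below works on a plain list of token IDs and assumes they were already produced by a tokenizer such as tiktoken; `chunk_tokens` is a hypothetical name, not necessarily how `chunk_text()` is written.

```python
from typing import List, Sequence

CHUNK_SIZE = 500     # tokens per chunk (value from ingest.py)
CHUNK_OVERLAP = 50   # tokens shared by neighbouring chunks


def chunk_tokens(tokens: Sequence[int], size: int = CHUNK_SIZE,
                 overlap: int = CHUNK_OVERLAP) -> List[List[int]]:
    """Split a token sequence into windows of `size` overlapping by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap  # 450 with the defaults above
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# A 1,000-token document yields windows starting at 0, 450 and 900.
print([len(c) for c in chunk_tokens(list(range(1000)))])
```

A larger overlap preserves more context across chunk boundaries at the cost of more chunks, and therefore more embeddings, per document.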
Components
Data Ingestion (ingest.py)
- Extracts text from PDFs using pdfplumber
- Chunks into ~500 tokens using tiktoken
- Generates embeddings with Gemini (text-embedding-004)
- Stores in ChromaDB with metadata
Retrieval & Answer Generation (main.py)
- Converts query to embedding
- Searches ChromaDB for top-K similar chunks
- Sends context + question to Gemini
- Returns grounded answer with sources
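The retrieval steps above can be illustrated without any external service. In the real system ChromaDB performs the similarity search and Gemini produces the embeddings; the sketch below substitutes a brute-force cosine-similarity search over in-memory vectors, and the prompt wording in `build_prompt` is an assumption rather than the text used in `main.py`.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec: Sequence[float],
                 indexed: List[Tuple[str, Sequence[float]]],
                 k: int = 5) -> List[Tuple[str, float]]:
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def build_prompt(question: str, chunks: Sequence[str]) -> str:
    """Combine numbered context chunks and the question into one prompt."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy 2-dimensional "embeddings" stand in for the 768-dimensional real ones.
index = [("VAT rules", [1.0, 0.0]), ("PAYE rules", [0.0, 1.0])]
hits = top_k_chunks([0.9, 0.1], index, k=1)
print(build_prompt("How is VAT charged?", [text for text, _ in hits]))
```

Numbering the chunks in the prompt makes it easier to map the generated answer back to the entries returned in `sources`.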
Utilities (utils.py)
- chunk_text() - Split text into token-based chunks
- generate_embedding() - Create document embeddings
- generate_query_embedding() - Create query embeddings
- generate_answer() - RAG completion with Gemini
- clean_text() - Clean extracted PDF text
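The exact rules applied by `clean_text()` are not documented here. As an illustration only, the sketch below shows cleanup steps that are commonly useful for text extracted from PDFs: removing null bytes, rejoining words hyphenated across line breaks, and collapsing redundant whitespace. Every rule in it is an assumption, not a description of the actual function.

```python
import re


def clean_text(raw: str) -> str:
    """Normalize extracted PDF text (illustrative rules, not the real ones)."""
    text = raw.replace("\x00", "")
    # Rejoin words split across a line break: "taxa-\ntion" -> "taxation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


print(clean_text("Value  Added\tTax is a form of taxa-\ntion.\n\n\n\nNext section "))
```

The hyphen rule is a heuristic: it would also join genuinely hyphenated compounds that happen to break at a line end, so it may need tuning for legal text.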
Models Used
- Embeddings: text-embedding-004 (768 dimensions)
- Generation: gemini-2.5-flash (default, fast)
  - Can also use gemini-2.0-pro for complex reasoning
Security Considerations
- API keys loaded via environment variables
- Input validation on all endpoints
- CORS middleware configured (restrict in production)
- Consider adding JWT authentication for production
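Loading the key from the environment is easiest to get right when the lookup fails fast with an actionable message. A minimal sketch, assuming the variable name `GEMINI_API_KEY` from the Installation section; the helper name `load_api_key` is hypothetical.

```python
import os
from typing import Mapping


def load_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the Gemini API key, or raise if it is missing or blank."""
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY not set; run: export GEMINI_API_KEY='your-key'"
        )
    return key
```

Accepting the environment as a parameter keeps the function easy to test and avoids reading the key at import time.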
Troubleshooting
"GEMINI_API_KEY not set"
export GEMINI_API_KEY='your-key'
"No documents indexed"
python ingest.py
"Error extracting text"
- Check that the PDF is not corrupted
- Some PDFs may be image-based (need OCR)
Slow ingestion
- Embedding generation is batched (100 texts at a time)
- Large PDFs with many pages take longer
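Batching explains why ingestion time grows with document size: a PDF that yields 2,300 chunks needs 23 embedding requests at 100 texts per request. The sketch below shows the grouping step only; `batched` is a hypothetical helper, and the call that sends each batch to the embedding API is left out.

```python
from typing import Iterator, List, Sequence

BATCH_SIZE = 100  # texts per embedding request, as stated above


def batched(texts: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts` containing at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])


chunks = [f"chunk {i}" for i in range(250)]
print([len(batch) for batch in batched(chunks)])
```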
Future Improvements
- Admin dashboard for document management
- Streaming responses
- Multi-collection support
- Document summaries
- Caching layer for frequent queries
- OCR support for scanned PDFs
- JWT authentication