"""
JanSahayak Architecture Overview
================================
SYSTEM COMPONENTS
-----------------
1. AGENTS (agents/)
- profiling_agent.py → User Profile Extraction
- scheme_agent.py → Government Scheme Recommendations
- exam_agent.py → Competitive Exam Recommendations
- search_agent.py → Live Web Search (Tavily)
- rag_agent.py → Vector Database Retrieval
- document_agent.py → PDF/Image Text Extraction
- benefit_agent.py → Missed Benefits Calculator
2. PROMPTS (prompts/)
- profiling_prompt.py → User profiling instructions
- scheme_prompt.py → Scheme recommendation template
- exam_prompt.py → Exam recommendation template
- rag_prompt.py → RAG retrieval instructions
3. RAG SYSTEM (rag/)
- embeddings.py → HuggingFace embeddings (CPU)
- scheme_vectorstore.py → FAISS store for schemes
- exam_vectorstore.py → FAISS store for exams
4. TOOLS (tools/)
- tavily_tool.py → Live government website search
5. WORKFLOW (graph/)
- workflow.py → LangGraph orchestration
6. I/O HANDLERS (agent_io/)
- profiling_io.py → Profiling agent I/O
- scheme_io.py → Scheme agent I/O
- exam_io.py → Exam agent I/O
- benefit_io.py → Benefit agent I/O
7. DATA (data/)
- schemes_pdfs/ → Government scheme PDFs
- exams_pdfs/ → Competitive exam PDFs
8. OUTPUTS (outputs/)
- results_*.json → Generated analysis results
9. CONFIGURATION
- config.py → Configuration loader
- .env → API keys (user creates)
- requirements.txt → Python dependencies
10. ENTRY POINTS
- main.py → Main application
- setup.py → Setup wizard
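As an illustration of the configuration entries above, a loader might look roughly like this. It is a minimal sketch, not the project's actual config.py: the variable names GROQ_API_KEY and TAVILY_API_KEY and the helpers parse_env_text and load_config are assumptions made for the example.

```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class Config:
    groq_api_key: str
    tavily_api_key: str


def parse_env_text(text: str) -> Dict[str, str]:
    # Parse simple KEY=VALUE lines; blank lines and # comments are skipped.
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_config(env: Optional[Mapping[str, str]] = None) -> Config:
    # Defaults to the process environment; fails early if a key is missing.
    # The two key names are assumed, not taken from the project.
    env = os.environ if env is None else env
    missing = [k for k in ("GROQ_API_KEY", "TAVILY_API_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return Config(env["GROQ_API_KEY"], env["TAVILY_API_KEY"])
```

Failing at startup when a key is absent avoids a confusing error later, in the middle of an agent call.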
WORKFLOW EXECUTION
------------------
User Input
    ↓
[Profiling Agent]
    ↓
    ├─→ [Scheme Agent] ──→ [Benefit Agent] ──┐
    │        ↓                               │
    │   [RAG Search]                         │
    │        ↓                               │
    │   [Tavily Search]                      │
    │                                        │
    └─→ [Exam Agent] ────────────────────────┤
             ↓                               │
        [RAG Search]                         │
             ↓                               │
        [Tavily Search]                      │
                                             ↓
                                      [Final Output]
                                             ↓
                                    [JSON Results File]
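The execution order in the diagram can be sketched in plain Python as follows. This is an illustrative stand-in, not the LangGraph code in workflow.py: every node function and state key here is hypothetical, and the node bodies are placeholders. Each node returns a partial update that is merged into a shared state dict.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Node = Callable[[State], State]


def profiling_node(state: State) -> State:
    # Placeholder: the real agent extracts a structured profile with an LLM.
    return {"profile": {"raw_text": state["user_input"]}}


def scheme_node(state: State) -> State:
    # Placeholder for RAG search + Tavily search + LLM recommendation.
    return {"schemes": ["<scheme recommendations>"]}


def benefit_node(state: State) -> State:
    # Runs after the scheme node because it consumes its recommendations.
    return {"benefits": {"based_on": state["schemes"]}}


def exam_node(state: State) -> State:
    # Placeholder for RAG search + Tavily search + LLM recommendation.
    return {"exams": ["<exam recommendations>"]}


def final_node(state: State) -> State:
    # Collect the branch results into one output record.
    keys = ("profile", "schemes", "benefits", "exams")
    return {"final_output": {k: state.get(k) for k in keys}}


def run_workflow(user_input: str) -> State:
    state: State = {"user_input": user_input}
    # Synchronous order: profiling, scheme branch (scheme -> benefit),
    # exam branch, then the final merge.
    pipeline: List[Node] = [
        profiling_node, scheme_node, benefit_node, exam_node, final_node,
    ]
    for node in pipeline:
        state.update(node(state))
    return state
```

Because the two branches only share the profile, they could also run in parallel; the sequential loop simply mirrors the synchronous setup described under SCALABILITY NOTES.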
TECHNOLOGY STACK
----------------
LLM & AI:
- Groq API (llama-3.3-70b-versatile) → Fast inference
- LangChain → Agent framework
- LangGraph → Workflow orchestration
Embeddings & Search:
- HuggingFace Transformers → sentence-transformers/all-MiniLM-L6-v2
- FAISS (CPU) → Vector similarity search
Web Search:
- Tavily API → Government website search
Document Processing:
- PyPDF → PDF text extraction
- Pytesseract → OCR for images
- Pillow → Image processing
Infrastructure:
- Python 3.8+
- CPU-only deployment (no GPU needed)
- PyTorch CPU version
DATA FLOW
---------
1. User Input Processing:
Raw Text → Profiling Agent → Structured JSON Profile
2. Scheme Recommendation:
Profile → RAG Query → Vectorstore Search → Top-K Documents
Profile + Documents → Tavily Search (optional) → Web Results
Profile + Documents + Web Results → LLM → Recommendations
3. Exam Recommendation:
Profile → RAG Query → Vectorstore Search → Top-K Documents
Profile + Documents → Tavily Search (optional) → Web Results
Profile + Documents + Web Results → LLM → Recommendations
4. Benefit Calculation:
Profile + Scheme Recommendations → LLM → Missed Benefits Analysis
5. Final Output:
All Results → JSON Compilation → File Save → User Display
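Steps 2 and 3 above share one shape: retrieve, optionally search the web, then prompt the LLM. A rough sketch of that shape is below. The function recommend, its parameters, the prompt layout and the default top_k of 4 are all assumptions for illustration; the retriever, web search and LLM are passed in as plain callables so the example stays self-contained.

```python
import json
from typing import Callable, List, Optional


def recommend(
    profile: dict,
    retrieve: Callable[[str, int], List[str]],
    llm: Callable[[str], str],
    web_search: Optional[Callable[[str], List[str]]] = None,
    top_k: int = 4,
) -> str:
    # Profile -> RAG query -> top-k documents from the vectorstore.
    query = json.dumps(profile, sort_keys=True)
    documents = retrieve(query, top_k)
    # The web search step is optional, as in the data flow above.
    web_results = web_search(query) if web_search is not None else []
    # Profile + documents + web results -> LLM -> recommendations.
    prompt = "\n\n".join([
        "PROFILE:\n" + query,
        "DOCUMENTS:\n" + "\n".join(documents),
        "WEB RESULTS:\n" + "\n".join(web_results),
    ])
    return llm(prompt)
```

Passing the dependencies in as callables also makes the step easy to exercise with stubs, without calling Groq or Tavily.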
API INTERACTIONS
----------------
1. Groq API:
- Used by: All LLM-powered agents
- Model: llama-3.3-70b-versatile
- Purpose: Natural language understanding & generation
- Rate: Per-request basis
2. Tavily API:
- Used by: search_agent, scheme_agent, exam_agent
- Purpose: Live government website search
- Filter: .gov.in domains preferred
- Depth: Advanced search mode
3. HuggingFace:
- Used by: embeddings module
- Model: sentence-transformers/all-MiniLM-L6-v2
- Purpose: Document embeddings for RAG
- Local: Runs on CPU, cached after first download
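One way the ".gov.in domains preferred" filter could be applied to search results is sketched below. It assumes each result is a dict with a "url" key, which is a guess about the result shape rather than something this document specifies; is_gov_in and prefer_gov_in are hypothetical helpers.

```python
from typing import Dict, List
from urllib.parse import urlparse


def is_gov_in(url: str) -> bool:
    # Match gov.in itself or any subdomain, but not look-alikes
    # such as "fakegov.in".
    host = (urlparse(url).hostname or "").lower()
    return host == "gov.in" or host.endswith(".gov.in")


def prefer_gov_in(results: List[Dict[str, str]]) -> List[Dict[str, str]]:
    # sorted() is stable, so the original ranking is preserved
    # within the .gov.in group and within the remaining results.
    return sorted(results, key=lambda r: not is_gov_in(r["url"]))
```

Checking the parsed hostname, rather than searching the whole URL for "gov.in", keeps a path or query string from producing a false match.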
VECTORSTORE ARCHITECTURE
------------------------
Scheme Vectorstore (rag/scheme_index/):
├── index.faiss → FAISS index file
├── index.pkl → Metadata pickle
└── [Embedded chunks from schemes_pdfs/]
Exam Vectorstore (rag/exam_index/):
├── index.faiss → FAISS index file
├── index.pkl → Metadata pickle
└── [Embedded chunks from exams_pdfs/]
Embedding Dimension: 384
Similarity Metric: Cosine similarity
Chunk Size: Auto (from PyPDF)
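To make the similarity metric concrete, here is a brute-force cosine similarity search in plain Python. It only illustrates the metric: FAISS performs this search far more efficiently, and the toy vectors in the usage note have 3 dimensions rather than 384. The names cosine_similarity and top_k are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # cos(a, b) = (a . b) / (|a| * |b|); defined as 0.0 for a zero vector.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    # Score every (text, embedding) pair and keep the k most similar.
    scored = [(text, cosine_similarity(query, vec)) for text, vec in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

For example, with chunks a = [1, 0, 0], b = [0, 1, 0] and c = [1, 1, 0], the query [1, 0, 0] ranks a first (similarity 1) and c second (similarity 1/sqrt(2)).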
AGENT SPECIALIZATIONS
---------------------
1. Profiling Agent:
- Extraction-focused
- Low temperature (0.1)
- JSON output required
- No external tools
2. Scheme Agent:
- RAG + Web search
- Temperature: 0.3
- Tools: Vectorstore, Tavily
- Output: Detailed scheme info
3. Exam Agent:
- RAG + Web search
- Temperature: 0.3
- Tools: Vectorstore, Tavily
- Output: Detailed exam info
4. Benefit Agent:
- Calculation-focused
- Temperature: 0.2
- No external tools
- Output: Financial analysis
5. Search Agent:
- Web search only
- Tool: Tavily API
- Focus: .gov.in domains
- Output: Live search results
6. RAG Agent:
- Vectorstore query only
- Tool: FAISS
- Similarity search
- Output: Relevant documents
7. Document Agent:
- File processing
- Tools: PyPDF, Pytesseract
- Supports: PDF, Images
- Output: Extracted text
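The settings listed above could be kept in one table so they are easy to audit. The sketch below is a possible layout, not the project's code: AgentSpec, AGENT_SPECS and agents_using are hypothetical names, and temperature is None wherever this document gives no value.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AgentSpec:
    name: str
    temperature: Optional[float]  # None: no value stated in this document
    tools: Tuple[str, ...]


AGENT_SPECS = {
    "profiling": AgentSpec("Profiling Agent", 0.1, ()),
    "scheme": AgentSpec("Scheme Agent", 0.3, ("vectorstore", "tavily")),
    "exam": AgentSpec("Exam Agent", 0.3, ("vectorstore", "tavily")),
    "benefit": AgentSpec("Benefit Agent", 0.2, ()),
    "search": AgentSpec("Search Agent", None, ("tavily",)),
    "rag": AgentSpec("RAG Agent", None, ("faiss",)),
    "document": AgentSpec("Document Agent", None, ("pypdf", "pytesseract")),
}


def agents_using(tool: str) -> List[str]:
    # Return the keys of all agents that list the given tool.
    return sorted(key for key, spec in AGENT_SPECS.items() if tool in spec.tools)
```

Here agents_using("tavily") returns the exam, scheme and search agents, matching the "Used by" line for the Tavily API.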
SECURITY & PRIVACY
------------------
- API keys stored in .env (not committed to git)
- User data is processed locally, except for LLM calls (Groq) and any web search queries (Tavily)
- No data stored on external servers (except API providers)
- PDF data remains local
- Vectorstores are local
- Output files saved locally
SCALABILITY NOTES
-----------------
Current Setup (Single User):
- Synchronous workflow
- Local vectorstores
- CPU processing
Potential Scaling:
- Add Redis for caching
- Use cloud vectorstore (Pinecone, Weaviate)
- Parallel agent execution
- GPU acceleration for embeddings
- Database for user profiles
- API service deployment
ERROR HANDLING
--------------
Each agent includes:
- try/except blocks
- Error state tracking
- Graceful degradation
- Partial results on failure
- Error reporting in final output
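The pattern described above might be factored into a wrapper like the one below. This is a sketch under assumptions: safe_node and the "errors" state key are hypothetical, and the real agents may track errors differently. A failing node records its error and returns no other updates, so results from earlier nodes survive.

```python
from typing import Callable, Dict

State = Dict[str, object]


def safe_node(name: str, fn: Callable[[State], State]) -> Callable[[State], State]:
    # Wrap a node so an exception becomes an entry in state["errors"]
    # instead of aborting the whole workflow.
    def wrapper(state: State) -> State:
        try:
            return fn(state)
        except Exception as exc:
            errors = list(state.get("errors", []))
            errors.append({"agent": name, "error": f"{type(exc).__name__}: {exc}"})
            return {"errors": errors}
    return wrapper
```

A final output step can then include state["errors"] alongside whatever partial results were produced.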
MONITORING & LOGGING
--------------------
Current:
- Console print statements
- Agent start/completion messages
- Error messages
- Final output summary
Future Enhancement:
- Structured logging (logging module)
- Performance metrics
- API usage tracking
- User feedback collection
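As one possible shape for the structured logging and performance metrics items, the sketch below uses the standard logging module to emit one JSON record per agent run. The logger name "jansahayak", the helper log_agent_run and the record fields are assumptions made for illustration.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("jansahayak")


@contextmanager
def log_agent_run(agent: str):
    # Time the wrapped block and log a JSON record with its outcome.
    start = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise  # the error still propagates to the caller
    finally:
        elapsed_ms = round((time.perf_counter() - start) * 1000, 1)
        logger.info(json.dumps(
            {"agent": agent, "status": status, "elapsed_ms": elapsed_ms}
        ))
```

Wrapping a call as "with log_agent_run("profiling"): ..." would replace the current start/completion print statements with records that can be parsed later. Records only appear once logging is configured, for example with logging.basicConfig(level=logging.INFO).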
EXTENSIBILITY
-------------
Adding New Agent:
1. Create agent file in agents/
2. Add prompt template in prompts/
3. Create node function in workflow.py
4. Add node to graph
5. Define edges (connections)
6. Optional: Create I/O handler
Adding New Data Source:
1. Create vectorstore module in rag/
2. Add PDFs to data/ subdirectory
3. Build vectorstore
4. Create agent or modify existing
Adding New Tool:
1. Create tool in tools/
2. Import in agent
3. Use in agent logic
PERFORMANCE BENCHMARKS (Typical)
---------------------------------
Vectorstore Building:
- 10 PDFs: ~2-5 minutes
- 100 PDFs: ~20-30 minutes
Query Performance:
- Profiling: ~1-2 seconds
- RAG Search: ~0.5-1 second
- LLM Call: ~1-3 seconds
- Web Search: ~2-4 seconds
- Full Workflow: ~10-20 seconds
Memory Usage:
- Base: ~500 MB
- With models: ~2-3 GB
- With large PDFs: +500 MB per 100 PDFs
FUTURE ENHANCEMENTS
-------------------
1. Multilingual Support (Hindi, regional languages)
2. Voice input/output
3. Mobile app integration
4. Database for user history
5. Notification system for deadlines
6. Document upload interface
7. Real-time scheme updates
8. Community feedback integration
9. State-specific customization
10. Integration with government portals
END OF ARCHITECTURE DOCUMENT
"""