# HR Intervals AI Assistant - Architecture Documentation ## Project Overview An AI-powered bilingual chatbot for nonprofit organizations providing HR support, policy generation, and compliance checking. **Tech Stack:** - Backend: Python 3.12 + LangChain - Vector Database: Qdrant Cloud - AI Models: OpenAI (GPT-4o-mini, text-embedding-3-small) - UI Framework: Gradio - Web Scraping: Firecrawl - Monitoring: LangSmith (optional) - Deployment: Hugging Face Spaces --- ## System Architecture ### High-Level Architecture ``` ┌─────────────────────────────────────────────────────────────┐ │ USER LAYER │ ├──────────────────────────┬──────────────────────────────────┤ │ app.py │ admin.py │ │ (Chat Interface) │ (Admin Interface) │ │ - User Q&A │ - Upload documents │ │ - Policy generation │ - Scrape web pages │ │ - View sources │ - Manage content │ └──────────────────────────┴──────────────────────────────────┘ │ │ ▼ ▼ ┌─────────────────────────────────────────────────────────────┐ │ APPLICATION LAYER │ ├──────────────┬─────────────────┬────────────────────────────┤ │ chatbot.py │ ingestion.py │ scraper.py │ │ - RAG chain │ - PDF/DOCX │ - Web scraping │ │ - Retrieval │ - Text chunking │ - URL processing │ │ - QA logic │ - Metadata │ - Content storage │ └──────────────┴─────────────────┴────────────────────────────┘ │ │ ▼ ▼ ┌─────────────────────────────────────────────────────────────┐ │ EXTERNAL SERVICES │ ├─────────────┬─────────────┬───────────────┬─────────────────┤ │ Qdrant │ OpenAI │ Firecrawl │ LangSmith │ │ Cloud │ API │ API │ (optional) │ │ - Vectors │ - Embeddings│ - Scraping │ - Monitoring │ │ - Search │ - Chat │ - Markdown │ - Debugging │ └─────────────┴─────────────┴───────────────┴─────────────────┘ ``` --- ## Module Relationships ### Core Modules #### 1. `src/ingestion.py` - Document Processing Module **Purpose:** Load, process, and store PDF/DOCX documents into vector database **Key Functions:** ```python create_vectorstore() -> (vectorstore, embeddings, client) load_document(file_path: str) -> List[Document] chunk_documents(documents, chunk_size=1000, chunk_overlap=200) -> List[Document] add_metadata(chunks, source_name, doc_type="document") -> List[Document] ingest_document(file_path: str, doc_type="document") -> int ``` **Dependencies:** - `langchain_community.document_loaders` (PyPDFLoader, Docx2txtLoader) - `langchain.text_splitter` (RecursiveCharacterTextSplitter) - `langchain_openai` (OpenAIEmbeddings) - `langchain_qdrant` (QdrantVectorStore) - `qdrant_client` (QdrantClient) **Used By:** - `admin.py` (upload functionality) --- #### 2. `src/scraper.py` - Web Scraping Module **Purpose:** Scrape web pages and store content in vector database **Key Functions:** ```python scrape_url(url: str) -> str process_and_store_webpage(url: str) -> int ``` **Dependencies:** - `firecrawl` (FirecrawlApp) - `langchain.schema` (Document) - `langchain.text_splitter` (RecursiveCharacterTextSplitter) - `langchain_openai` (OpenAIEmbeddings) - `langchain_qdrant` (QdrantVectorStore) **Used By:** - `admin.py` (URL scraping functionality) --- #### 3. `src/chatbot.py` - RAG Question-Answering Module **Purpose:** Handle user questions using Retrieval-Augmented Generation **Key Functions:** ```python create_rag_chain() -> ConversationalRetrievalChain ask_question(qa_chain, question: str) -> (answer: str, sources: List[Document]) ``` **Components:** - Vector store retriever (k=5 similar documents) - LLM: GPT-4o-mini (temperature=0.3) - Conversation memory (ConversationBufferMemory) - System prompt with disclaimers **Dependencies:** - `langchain_openai` (ChatOpenAI, OpenAIEmbeddings) - `langchain_qdrant` (QdrantVectorStore) - `langchain.chains` (ConversationalRetrievalChain) - `langchain.memory` (ConversationBufferMemory) - `qdrant_client` (QdrantClient) **Used By:** - `app.py` (chat interface) --- ### User Interface Modules #### 4. `app.py` - Chat Interface (End Users) **Purpose:** Gradio-based chat interface for nonprofit users **Features:** - Real-time Q&A - PII detection and warnings - Source citations - Disclaimer display - Conversation history - Example questions **Calls:** - `src/chatbot.py` → `create_rag_chain()`, `ask_question()` **Port:** 7860 --- #### 5. `admin.py` - Admin Interface (Content Managers) **Purpose:** Gradio-based management interface for HR Intervals team **Features:** - View all documents - Upload PDF/DOCX files - Scrape single/multiple URLs - Delete documents by source - Update/replace documents **Calls:** - `src/ingestion.py` → `ingest_document()` - `src/scraper.py` → `process_and_store_webpage()` - `qdrant_client.QdrantClient` → direct CRUD operations **Port:** 7861 --- ## Data Flow Diagrams ### Flow 1: Document Upload ``` User (admin.py) ↓ [Select PDF/DOCX file] ↓ admin.py: upload_document() ↓ ingestion.py: ingest_document() ↓ [Load document] → PyPDFLoader / Docx2txtLoader ↓ [Split into chunks] → RecursiveCharacterTextSplitter │ - chunk_size: 1000 │ - chunk_overlap: 200 ↓ [Add metadata] │ - source: filename │ - type: document/policy/guide │ - upload_date: YYYY-MM-DD ↓ [Generate embeddings] → OpenAI text-embedding-3-small ↓ [Store vectors + metadata] → Qdrant Cloud ↓ ✅ Success: N chunks uploaded ``` --- ### Flow 2: Web Scraping ``` User (admin.py) ↓ [Enter URL(s)] ↓ admin.py: scrape_single_url() / scrape_multiple_urls() ↓ scraper.py: process_and_store_webpage() ↓ [Scrape webpage] → Firecrawl API │ - Returns: Markdown content ↓ [Create document with metadata] │ - source: URL │ - type: webpage │ - upload_date: YYYY-MM-DD ↓ [Split into chunks] → RecursiveCharacterTextSplitter ↓ [Generate embeddings] → OpenAI text-embedding-3-small ↓ [Store vectors + metadata] → Qdrant Cloud ↓ ✅ Success: N chunks uploaded ``` --- ### Flow 3: Question Answering (RAG) ``` User (app.py) ↓ [Type question] ↓ app.py: chat() ↓ [Check for PII] → Regex patterns │ - Capitalized names: [A-Z][a-z]+ [A-Z][a-z]+ │ - If detected: Show warning ↓ chatbot.py: ask_question() ↓ ConversationalRetrievalChain ↓ [Convert question to embedding] → OpenAI text-embedding-3-small ↓ [Similarity search] → Qdrant Cloud │ - Retrieve top 5 similar chunks │ - Return: chunks + metadata ↓ [Combine context + question + chat history] ↓ [Generate answer] → OpenAI GPT-4o-mini │ - Temperature: 0.3 │ - System prompt: HR assistant with disclaimers ↓ [Return answer + source documents] ↓ app.py: Display answer with sources ↓ User sees: - Answer - ⚠️ PII warning (if applicable) - 📚 Sources (top 3) ``` --- ### Flow 4: Document Deletion ``` User (admin.py) ↓ [Enter document name or URL] ↓ admin.py: delete_document() ↓ Qdrant Client: delete() ↓ [Filter by metadata] │ - Field: "source" │ - Match: exact document name ↓ [Delete all matching points] ↓ ✅ Success: All chunks from source deleted ``` --- ### Flow 5: Document Update ``` User (admin.py) ↓ [Specify old document name] [Select new file] ↓ admin.py: update_document() ↓ [Step 1: Delete old document] │ └─→ delete_document(old_source) ↓ [Step 2: Upload new document] │ └─→ upload_document(new_file) ↓ ✅ Success: Document replaced ``` --- ## Configuration ### Environment Variables (`.env`) ```bash # OpenAI API OPENAI_API_KEY=sk-proj-... OPEN_AI_EMBEDDING_MODEL=text-embedding-3-small OPEN_AI_CHAT_MODEL=gpt-4o-mini # Qdrant Cloud QDRANT_URL=https://xxx.cloud.qdrant.io:6333 QDRANT_API_KEY=xxx QDRANT_COLLECTION=hr-intervals # Firecrawl FIRECRAWL_API_KEY=fc-xxx # LangSmith (Optional) LANGSMITH_TRACING=false LANGSMITH_API_KEY=xxx LANGSMITH_PROJECT=hr-intervals-chatbot ``` --- ## Project Structure ``` hr-intervals-chatbot/ ├── src/ │ ├── __init__.py │ ├── ingestion.py # Document processing │ ├── chatbot.py # RAG Q&A logic │ └── scraper.py # Web scraping ├── data/ │ ├── documents/ # Uploaded files │ └── scraped/ # Scraped content (cache) ├── app.py # User chat interface ├── admin.py # Admin management interface ├── .env # API keys and config ├── requirements.txt # Python dependencies ├── ARCHITECTURE.md # This file └── README.md # Project overview ``` --- ## Key Technical Decisions ### 1. Vector Database: Qdrant Cloud - **Why:** Built-in web UI, easy document management, free tier - **Alternative considered:** Pinecone (limited free tier, no document-level UI) ### 2. Embedding Model: text-embedding-3-small - **Dimensions:** 1536 - **Why:** Excellent quality with best cost-performance ratio, multilingual support (English/French) ### 3. LLM: GPT-4o-mini - **Why:** Cost-effective, sufficient for HR Q&A, fast response - **Alternative:** GPT-4o (more expensive but higher quality) ### 4. Chunking Strategy - **Chunk size:** 1000 characters - **Overlap:** 200 characters - **Separators:** `["\n\n", "\n", ". ", " ", ""]` - **Why:** Balances context preservation and retrieval accuracy ### 5. Retrieval: Top-k similarity search - **k=5:** Retrieve 5 most similar chunks - **Distance metric:** Cosine similarity - **Why:** Good balance between context and noise --- ## Metadata Schema Every chunk stored in Qdrant has the following metadata: ```python { "source": str, # Filename or URL "type": str, # "document" | "webpage" | "policy" | "guide" "upload_date": str, # "YYYY-MM-DD" "page": int, # (optional) Page number for PDFs "valid_until": str, # (optional) Expiry date for policies "version": str, # (optional) Version number } ``` --- ## Document Management Operations ### View Documents ```python # List all unique documents client.scroll(collection_name, limit=1000, with_payload=True) # Group by 'source' field ``` ### Upload Document ```python # 1. Load: PyPDFLoader / Docx2txtLoader # 2. Chunk: RecursiveCharacterTextSplitter # 3. Add metadata: source, type, date # 4. Embed: OpenAI text-embedding-3-small # 5. Store: QdrantVectorStore.from_documents() ``` ### Delete Document ```python client.delete( collection_name=collection_name, points_selector=FilterSelector( filter=Filter( must=[ FieldCondition( key="source", match=MatchValue(value="filename.pdf") ) ] ) ) ) ``` ### Update Document ```python # 1. Delete old version (by source name) # 2. Upload new version ``` --- ## Security Features ### PII Detection - Regex pattern for names: `\b[A-Z][a-z]+ [A-Z][a-z]+\b` - Warning displayed to user if detected - Future: Integrate Microsoft Presidio for advanced PII detection ### Disclaimers - Shown on first interaction - Embedded in system prompt - Reminds users to consult professionals ### API Key Security - Stored in `.env` file (not in version control) - `.env` added to `.gitignore` --- ## Performance Considerations ### Embedding Cost - Model: text-embedding-3-small - Cost: ~$0.13 per 1M tokens - Typical document: 10 pages ≈ 5,000 tokens ≈ $0.0007 ### Chat Cost - Model: GPT-4o-mini - Cost: ~$0.15 per 1M input tokens, $0.60 per 1M output tokens - Typical query: 5 chunks (5,000 tokens) + question (100 tokens) ≈ $0.0008 ### Storage - Qdrant free tier: 1 GB - Each chunk: ~1 KB metadata + 12 KB vector (3072 dims × 4 bytes) - Capacity: ~75,000 chunks (approximately 1,500 documents of 50 chunks each) --- ## Future Enhancements ### Phase 1 (Week 9-12) - Policy Features - Policy template library - Policy generation from user input - Policy compliance checking - Risk identification ### Phase 2 (Week 13-18) - Advanced Features - Bilingual support (French) - Language detection and switching - Content recommendation system - Feedback collection mechanism ### Phase 3 (Week 19-20) - Production - Deployment to Hugging Face Spaces - User authentication (if needed) - Analytics dashboard - Automated expiry detection for policies --- ## Troubleshooting ### Common Issues **1. "Collection not found" error** ```bash # Solution: Collection is created automatically on first upload # Just upload a document and it will be created ``` **2. "No documents found" when asking questions** ```bash # Solution: Upload at least one document first via admin.py ``` **3. "Rate limit exceeded" from OpenAI** ```bash # Solution: Add delays between requests or upgrade OpenAI plan ``` **4. "Firecrawl scraping failed"** ```bash # Solution: Check if URL is accessible, verify Firecrawl API key ``` --- ## Development Timeline - **Week 1-2:** Infrastructure setup ✅ - **Week 3-4:** Basic RAG system ✅ - **Week 5-6:** Web scraping + chat interface - **Week 7-8:** Quality improvements - **Week 9-10:** Admin interface - **Week 11-12:** Demo delivery - **Week 13-16:** Policy features - **Week 17-18:** Bilingual support - **Week 19-20:** Final delivery --- ## References - LangChain Documentation: https://python.langchain.com/docs/ - Qdrant Documentation: https://qdrant.tech/documentation/ - OpenAI API Reference: https://platform.openai.com/docs/ - Gradio Documentation: https://www.gradio.app/docs/ ```