# HR Intervals AI Assistant - Architecture Documentation
## Project Overview
An AI-powered bilingual chatbot for nonprofit organizations providing HR support, policy generation, and compliance checking.
**Tech Stack:**
- Backend: Python 3.12 + LangChain
- Vector Database: Qdrant Cloud
- AI Models: OpenAI (GPT-4o-mini, text-embedding-3-small)
- UI Framework: Gradio
- Web Scraping: Firecrawl
- Monitoring: LangSmith (optional)
- Deployment: Hugging Face Spaces
---
## System Architecture
### High-Level Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                         USER LAYER                          │
├──────────────────────────┬──────────────────────────────────┤
│  app.py                  │  admin.py                        │
│  (Chat Interface)        │  (Admin Interface)               │
│  - User Q&A              │  - Upload documents              │
│  - Policy generation     │  - Scrape web pages              │
│  - View sources          │  - Manage content                │
└──────────────────────────┴──────────────────────────────────┘
             │                              │
             ▼                              ▼
┌─────────────────────────────────────────────────────────────┐
│                      APPLICATION LAYER                      │
├──────────────┬─────────────────┬────────────────────────────┤
│ chatbot.py   │ ingestion.py    │ scraper.py                 │
│ - RAG chain  │ - PDF/DOCX      │ - Web scraping             │
│ - Retrieval  │ - Text chunking │ - URL processing           │
│ - QA logic   │ - Metadata      │ - Content storage          │
└──────────────┴─────────────────┴────────────────────────────┘
             │                              │
             ▼                              ▼
┌─────────────────────────────────────────────────────────────┐
│                      EXTERNAL SERVICES                      │
├─────────────┬─────────────┬───────────────┬─────────────────┤
│ Qdrant      │ OpenAI      │ Firecrawl     │ LangSmith       │
│ Cloud       │ API         │ API           │ (optional)      │
│ - Vectors   │ - Embeddings│ - Scraping    │ - Monitoring    │
│ - Search    │ - Chat      │ - Markdown    │ - Debugging     │
└─────────────┴─────────────┴───────────────┴─────────────────┘
```
---
## Module Relationships
### Core Modules
#### 1. `src/ingestion.py` - Document Processing Module
**Purpose:** Load, process, and store PDF/DOCX documents into vector database
**Key Functions:**
```python
create_vectorstore() -> (vectorstore, embeddings, client)
load_document(file_path: str) -> List[Document]
chunk_documents(documents, chunk_size=1000, chunk_overlap=200) -> List[Document]
add_metadata(chunks, source_name, doc_type="document") -> List[Document]
ingest_document(file_path: str, doc_type="document") -> int
```
**Dependencies:**
- `langchain_community.document_loaders` (PyPDFLoader, Docx2txtLoader)
- `langchain.text_splitter` (RecursiveCharacterTextSplitter)
- `langchain_openai` (OpenAIEmbeddings)
- `langchain_qdrant` (QdrantVectorStore)
- `qdrant_client` (QdrantClient)
**Used By:**
- `admin.py` (upload functionality)
---
#### 2. `src/scraper.py` - Web Scraping Module
**Purpose:** Scrape web pages and store content in vector database
**Key Functions:**
```python
scrape_url(url: str) -> str
process_and_store_webpage(url: str) -> int
```
**Dependencies:**
- `firecrawl` (FirecrawlApp)
- `langchain.schema` (Document)
- `langchain.text_splitter` (RecursiveCharacterTextSplitter)
- `langchain_openai` (OpenAIEmbeddings)
- `langchain_qdrant` (QdrantVectorStore)
**Used By:**
- `admin.py` (URL scraping functionality)
---
#### 3. `src/chatbot.py` - RAG Question-Answering Module
**Purpose:** Handle user questions using Retrieval-Augmented Generation
**Key Functions:**
```python
create_rag_chain() -> ConversationalRetrievalChain
ask_question(qa_chain, question: str) -> (answer: str, sources: List[Document])
```
**Components:**
- Vector store retriever (k=5 similar documents)
- LLM: GPT-4o-mini (temperature=0.3)
- Conversation memory (ConversationBufferMemory)
- System prompt with disclaimers
**Dependencies:**
- `langchain_openai` (ChatOpenAI, OpenAIEmbeddings)
- `langchain_qdrant` (QdrantVectorStore)
- `langchain.chains` (ConversationalRetrievalChain)
- `langchain.memory` (ConversationBufferMemory)
- `qdrant_client` (QdrantClient)
**Used By:**
- `app.py` (chat interface)
---
### User Interface Modules
#### 4. `app.py` - Chat Interface (End Users)
**Purpose:** Gradio-based chat interface for nonprofit users
**Features:**
- Real-time Q&A
- PII detection and warnings
- Source citations
- Disclaimer display
- Conversation history
- Example questions
**Calls:**
- `src/chatbot.py` → `create_rag_chain()`, `ask_question()`
**Port:** 7860
---
#### 5. `admin.py` - Admin Interface (Content Managers)
**Purpose:** Gradio-based management interface for HR Intervals team
**Features:**
- View all documents
- Upload PDF/DOCX files
- Scrape single/multiple URLs
- Delete documents by source
- Update/replace documents
**Calls:**
- `src/ingestion.py` → `ingest_document()`
- `src/scraper.py` → `process_and_store_webpage()`
- `qdrant_client.QdrantClient` → direct CRUD operations
**Port:** 7861
---
## Data Flow Diagrams
### Flow 1: Document Upload
```
User (admin.py)
    ↓
[Select PDF/DOCX file]
    ↓
admin.py: upload_document()
    ↓
ingestion.py: ingest_document()
    ↓
[Load document] → PyPDFLoader / Docx2txtLoader
    ↓
[Split into chunks] → RecursiveCharacterTextSplitter
    │   - chunk_size: 1000
    │   - chunk_overlap: 200
    ↓
[Add metadata]
    │   - source: filename
    │   - type: document/policy/guide
    │   - upload_date: YYYY-MM-DD
    ↓
[Generate embeddings] → OpenAI text-embedding-3-small
    ↓
[Store vectors + metadata] → Qdrant Cloud
    ↓
✅ Success: N chunks uploaded
```
---
### Flow 2: Web Scraping
```
User (admin.py)
    ↓
[Enter URL(s)]
    ↓
admin.py: scrape_single_url() / scrape_multiple_urls()
    ↓
scraper.py: process_and_store_webpage()
    ↓
[Scrape webpage] → Firecrawl API
    │   - Returns: Markdown content
    ↓
[Create document with metadata]
    │   - source: URL
    │   - type: webpage
    │   - upload_date: YYYY-MM-DD
    ↓
[Split into chunks] → RecursiveCharacterTextSplitter
    ↓
[Generate embeddings] → OpenAI text-embedding-3-small
    ↓
[Store vectors + metadata] → Qdrant Cloud
    ↓
✅ Success: N chunks uploaded
```
---
### Flow 3: Question Answering (RAG)
```
User (app.py)
    ↓
[Type question]
    ↓
app.py: chat()
    ↓
[Check for PII] → Regex patterns
    │   - Capitalized names: [A-Z][a-z]+ [A-Z][a-z]+
    │   - If detected: Show warning
    ↓
chatbot.py: ask_question()
    ↓
ConversationalRetrievalChain
    ↓
[Convert question to embedding] → OpenAI text-embedding-3-small
    ↓
[Similarity search] → Qdrant Cloud
    │   - Retrieve top 5 similar chunks
    │   - Return: chunks + metadata
    ↓
[Combine context + question + chat history]
    ↓
[Generate answer] → OpenAI GPT-4o-mini
    │   - Temperature: 0.3
    │   - System prompt: HR assistant with disclaimers
    ↓
[Return answer + source documents]
    ↓
app.py: Display answer with sources
    ↓
User sees:
    - Answer
    - ⚠️ PII warning (if applicable)
    - 📚 Sources (top 3)
```
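The final step of this flow (showing at most three sources under the answer) could look roughly like the sketch below. `SourceDoc` is a stand-in for a LangChain `Document`, and `format_sources` is a hypothetical helper name; collapsing several chunks from the same file into one citation is also an assumption, not something the flow above specifies.

```python
from dataclasses import dataclass, field


@dataclass
class SourceDoc:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_sources(docs, max_sources=3):
    """Return a numbered list of up to max_sources distinct source names.

    Retrieval order is preserved, so the most similar source comes first.
    """
    seen = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        if source not in seen:  # several chunks may come from one file
            seen.append(source)
    lines = [f"{i}. {name}" for i, name in enumerate(seen[:max_sources], start=1)]
    return "\n".join(lines)
```

Calling `format_sources(sources)` on the documents returned by `ask_question()` would then give a short block that can be appended to the answer text.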
---
### Flow 4: Document Deletion
```
User (admin.py)
    ↓
[Enter document name or URL]
    ↓
admin.py: delete_document()
    ↓
Qdrant Client: delete()
    ↓
[Filter by metadata]
    │   - Field: "source"
    │   - Match: exact document name
    ↓
[Delete all matching points]
    ↓
✅ Success: All chunks from source deleted
```
---
### Flow 5: Document Update
```
User (admin.py)
    ↓
[Specify old document name]
[Select new file]
    ↓
admin.py: update_document()
    ↓
[Step 1: Delete old document]
    │   └─→ delete_document(old_source)
    ↓
[Step 2: Upload new document]
    │   └─→ upload_document(new_file)
    ↓
✅ Success: Document replaced
```
---
## Configuration
### Environment Variables (`.env`)
```bash
# OpenAI API
OPENAI_API_KEY=sk-proj-...
OPEN_AI_EMBEDDING_MODEL=text-embedding-3-small
OPEN_AI_CHAT_MODEL=gpt-4o-mini
# Qdrant Cloud
QDRANT_URL=https://xxx.cloud.qdrant.io:6333
QDRANT_API_KEY=xxx
QDRANT_COLLECTION=hr-intervals
# Firecrawl
FIRECRAWL_API_KEY=fc-xxx
# LangSmith (Optional)
LANGSMITH_TRACING=false
LANGSMITH_API_KEY=xxx
LANGSMITH_PROJECT=hr-intervals-chatbot
```
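A minimal sketch of how these variables might be read and validated at startup. The `load_config` name, the choice of which variables are required, and the fallback defaults are assumptions for illustration; the actual modules may read the environment differently (for example after loading `.env` with a helper such as python-dotenv).

```python
import os

# Assumption: these four keys are treated as mandatory.
REQUIRED_VARS = ("OPENAI_API_KEY", "QDRANT_URL", "QDRANT_API_KEY", "FIRECRAWL_API_KEY")

# Assumption: fall back to the values shown in the example .env above.
DEFAULTS = {
    "OPEN_AI_EMBEDDING_MODEL": "text-embedding-3-small",
    "OPEN_AI_CHAT_MODEL": "gpt-4o-mini",
    "QDRANT_COLLECTION": "hr-intervals",
    "LANGSMITH_TRACING": "false",
}


def load_config(env=None):
    """Collect settings from the environment and fail fast on missing keys."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED_VARS}
    for name, default in DEFAULTS.items():
        config[name] = env.get(name, default)
    return config
```

Failing at startup with the list of missing names tends to be easier to debug than an authentication error raised later by one of the external services.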
---
## Project Structure
```
hr-intervals-chatbot/
├── src/
│   ├── __init__.py
│   ├── ingestion.py       # Document processing
│   ├── chatbot.py         # RAG Q&A logic
│   └── scraper.py         # Web scraping
├── data/
│   ├── documents/         # Uploaded files
│   └── scraped/           # Scraped content (cache)
├── app.py                 # User chat interface
├── admin.py               # Admin management interface
├── .env                   # API keys and config
├── requirements.txt       # Python dependencies
├── ARCHITECTURE.md        # This file
└── README.md              # Project overview
```
---
## Key Technical Decisions
### 1. Vector Database: Qdrant Cloud
- **Why:** Built-in web UI, easy document management, free tier
- **Alternative considered:** Pinecone (limited free tier, no document-level UI)
### 2. Embedding Model: text-embedding-3-small
- **Dimensions:** 1536
- **Why:** Excellent quality with best cost-performance ratio, multilingual support (English/French)
### 3. LLM: GPT-4o-mini
- **Why:** Cost-effective, sufficient for HR Q&A, fast response
- **Alternative:** GPT-4o (more expensive but higher quality)
### 4. Chunking Strategy
- **Chunk size:** 1000 characters
- **Overlap:** 200 characters
- **Separators:** `["\n\n", "\n", ". ", " ", ""]`
- **Why:** Balances context preservation and retrieval accuracy
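To make the size and overlap parameters concrete, here is a deliberately simplified sliding-window splitter. It is not `RecursiveCharacterTextSplitter`: it ignores the separator list and cuts purely by character count, so it only illustrates how `chunk_size` and `chunk_overlap` interact. The `chunk_text` name is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts 800 chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

With the defaults, a 2,500-character text yields three chunks starting at offsets 0, 800, and 1600, and the last 200 characters of each chunk reappear at the start of the next one.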
### 5. Retrieval: Top-k similarity search
- **k=5:** Retrieve 5 most similar chunks
- **Distance metric:** Cosine similarity
- **Why:** Good balance between context and noise
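As an illustration of what top-k cosine retrieval computes, the sketch below scores every stored vector against the query by brute force. Qdrant itself uses an approximate index rather than a full scan, so this is only a model of the ranking, and the `cosine_similarity` and `top_k` helpers are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k=5):
    """Return the k (chunk_id, score) pairs most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vector, vector), chunk_id)
        for chunk_id, vector in chunk_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]
```

A larger `k` gives the LLM more context but also more loosely related chunks, which is the trade-off the k=5 setting is meant to balance.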
---
## Metadata Schema
Every chunk stored in Qdrant has the following metadata:
```python
{
"source": str, # Filename or URL
"type": str, # "document" | "webpage" | "policy" | "guide"
"upload_date": str, # "YYYY-MM-DD"
"page": int, # (optional) Page number for PDFs
"valid_until": str, # (optional) Expiry date for policies
"version": str, # (optional) Version number
}
```
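A small sketch of how a metadata dict matching this schema could be built. `build_metadata` is a hypothetical helper (the real `add_metadata()` in `src/ingestion.py` may differ), and rejecting unknown `type` values is an assumption.

```python
from datetime import date

ALLOWED_TYPES = {"document", "webpage", "policy", "guide"}


def build_metadata(source, doc_type="document", page=None,
                   valid_until=None, version=None, today=None):
    """Build one chunk's metadata; optional fields are omitted when not given."""
    if doc_type not in ALLOWED_TYPES:
        raise ValueError(f"Unknown type: {doc_type!r}")
    metadata = {
        "source": source,
        "type": doc_type,
        # ISO format gives the "YYYY-MM-DD" string used in the schema.
        "upload_date": (today or date.today()).isoformat(),
    }
    optional = {"page": page, "valid_until": valid_until, "version": version}
    metadata.update({key: value for key, value in optional.items() if value is not None})
    return metadata
```

Storing `upload_date` and `valid_until` as ISO strings keeps them sortable as plain text, which may help with the expiry checks listed under future enhancements.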
---
## Document Management Operations
### View Documents
```python
# List all unique documents
client.scroll(collection_name, limit=1000, with_payload=True)
# Group by 'source' field
```
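A single `scroll` call returns at most `limit` points plus an offset for the next page, so a collection with more than 1,000 chunks needs a loop. The sketch below shows that loop and the grouping step in isolation. `scroll_page` is a hypothetical callable that takes an offset and returns `(payloads, next_offset)`; with the real client it would wrap `client.scroll(...)` and extract each record's payload. The check for a nested `"metadata"` key is an assumption about how the chunks were stored.

```python
from collections import Counter


def count_chunks_by_source(scroll_page):
    """Page through all payloads and count chunks per 'source' value."""
    counts = Counter()
    offset = None
    while True:
        payloads, offset = scroll_page(offset)
        for payload in payloads:
            # Metadata may be nested under "metadata" or stored at the top level.
            metadata = payload.get("metadata", payload)
            counts[metadata.get("source", "unknown")] += 1
        if offset is None:  # no further pages
            break
    return dict(counts)
```

The resulting dict maps each filename or URL to its chunk count, which is enough to render a document list in the admin interface.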
### Upload Document
```python
# 1. Load: PyPDFLoader / Docx2txtLoader
# 2. Chunk: RecursiveCharacterTextSplitter
# 3. Add metadata: source, type, date
# 4. Embed: OpenAI text-embedding-3-small
# 5. Store: QdrantVectorStore.from_documents()
```
### Delete Document
```python
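from qdrant_client.models import FieldCondition, Filter, FilterSelector, MatchValue

# Assumes `client` (a QdrantClient) and `collection_name` are already defined.
# Note: if chunks were written through LangChain's QdrantVectorStore, metadata
# is nested in the payload by default, so the key may need to be "metadata.source".
client.delete(
    collection_name=collection_name,
    points_selector=FilterSelector(
        filter=Filter(
            must=[
                FieldCondition(
                    key="source",
                    match=MatchValue(value="filename.pdf"),
                )
            ]
        )
    ),
)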
```
### Update Document
```python
# 1. Delete old version (by source name)
# 2. Upload new version
```
---
## Security Features
### PII Detection
- Regex pattern for names: `\b[A-Z][a-z]+ [A-Z][a-z]+\b`
- Warning displayed to user if detected
- Future: Integrate Microsoft Presidio for advanced PII detection
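The name check can be sketched as below, using the regex listed above. The helper names and the warning wording are hypothetical. The pattern is a rough heuristic: it flags any two adjacent capitalized words, so it produces false positives on phrases such as "Human Resources" and misses names typed in lowercase.

```python
import re

# Two adjacent capitalized words, e.g. "John Smith".
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def find_possible_names(message):
    """Return every substring that looks like 'Firstname Lastname'."""
    return NAME_PATTERN.findall(message)


def pii_warning(message):
    """Return a warning string if the message may contain a name, else None."""
    matches = find_possible_names(message)
    if not matches:
        return None
    return (
        "Warning: your message may contain personal names ("
        + ", ".join(matches)
        + "). Please avoid sharing personal information."
    )
```

Because the check only warns and does not block the message, false positives cost little, which is presumably why a simple pattern is acceptable until a dedicated PII library is integrated.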
### Disclaimers
- Shown on first interaction
- Embedded in system prompt
- Reminds users to consult professionals
### API Key Security
- Stored in `.env` file (not in version control)
- `.env` added to `.gitignore`
---
## Performance Considerations
### Embedding Cost
- Model: text-embedding-3-small
- Cost: ~$0.02 per 1M tokens
- Typical document: 10 pages ≈ 5,000 tokens ≈ $0.0001
### Chat Cost
- Model: GPT-4o-mini
- Cost: ~$0.15 per 1M input tokens, $0.60 per 1M output tokens
- Typical query: 5 chunks (5,000 tokens) + question (100 tokens) ≈ $0.0008
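The per-query figure is simple arithmetic on the token counts and the two prices. A minimal sketch, with the prices as overridable parameters since they change over time; `chat_cost_usd` is a hypothetical helper, and the 300 output tokens in the usage note are an assumed answer length.

```python
def chat_cost_usd(input_tokens, output_tokens,
                  input_price_per_million=0.15,
                  output_price_per_million=0.60):
    """Estimate the cost of one chat completion in US dollars."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000
```

For the typical query above, `chat_cost_usd(5100, 0)` gives about $0.0008 for the input side alone; assuming a 300-token answer, `chat_cost_usd(5100, 300)` comes to roughly $0.0009, or under one dollar per thousand queries.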
### Storage
- Qdrant free tier: 1 GB
- Each chunk: ~1 KB metadata + 6 KB vector (1536 dims × 4 bytes)
- Capacity: ~140,000 chunks (approximately 2,800 documents of 50 chunks each)
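The capacity estimate follows from the per-chunk size. The sketch below keeps the vector dimension as a parameter, since capacity roughly halves when the dimension doubles. It assumes 4-byte floats and about 1 KB of metadata per chunk and ignores index overhead, so the results are upper bounds; both function names are hypothetical.

```python
def chunk_bytes(dimensions, metadata_bytes=1024, bytes_per_float=4):
    """Approximate storage for one chunk: raw vector plus metadata."""
    return dimensions * bytes_per_float + metadata_bytes


def max_chunks(storage_bytes, dimensions, metadata_bytes=1024):
    """How many chunks fit into storage_bytes, ignoring index overhead."""
    return storage_bytes // chunk_bytes(dimensions, metadata_bytes)
```

For 1 GB counted as 10**9 bytes, `max_chunks(10**9, 1536)` is 139,508 for the 1536-dimension vectors of text-embedding-3-small, while `max_chunks(10**9, 3072)` is 75,120.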
---
## Future Enhancements
### Phase 1 (Week 9-12) - Policy Features
- Policy template library
- Policy generation from user input
- Policy compliance checking
- Risk identification
### Phase 2 (Week 13-18) - Advanced Features
- Bilingual support (French)
- Language detection and switching
- Content recommendation system
- Feedback collection mechanism
### Phase 3 (Week 19-20) - Production
- Deployment to Hugging Face Spaces
- User authentication (if needed)
- Analytics dashboard
- Automated expiry detection for policies
---
## Troubleshooting
### Common Issues
**1. "Collection not found" error**
```bash
# Solution: Collection is created automatically on first upload
# Just upload a document and it will be created
```
**2. "No documents found" when asking questions**
```bash
# Solution: Upload at least one document first via admin.py
```
**3. "Rate limit exceeded" from OpenAI**
```bash
# Solution: Add delays between requests or upgrade OpenAI plan
```
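One way to "add delays between requests" is exponential backoff with jitter. The sketch below is generic: `call_with_backoff` is a hypothetical helper, and `retry_on` defaults to every exception only to keep the example self-contained. In practice it would be narrowed to the client library's rate-limit error.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying with exponentially growing delays on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of retries: surface the original error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            # Jitter spreads out retries from concurrent users.
            sleep(delay + random.uniform(0, delay / 2))
```

A call such as `call_with_backoff(lambda: ask_question(qa_chain, question))` would then wait about 1 s, 2 s, 4 s, and 8 s between attempts before giving up.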
**4. "Firecrawl scraping failed"**
```bash
# Solution: Check if URL is accessible, verify Firecrawl API key
```
---
## Development Timeline
- **Week 1-2:** Infrastructure setup ✅
- **Week 3-4:** Basic RAG system ✅
- **Week 5-6:** Web scraping + chat interface
- **Week 7-8:** Quality improvements
- **Week 9-10:** Admin interface
- **Week 11-12:** Demo delivery
- **Week 13-16:** Policy features
- **Week 17-18:** Bilingual support
- **Week 19-20:** Final delivery
---
## References
- LangChain Documentation: https://python.langchain.com/docs/
- Qdrant Documentation: https://qdrant.tech/documentation/
- OpenAI API Reference: https://platform.openai.com/docs/
- Gradio Documentation: https://www.gradio.app/docs/