HR Intervals AI Assistant - Architecture Documentation
Project Overview
An AI-powered bilingual chatbot for nonprofit organizations providing HR support, policy generation, and compliance checking.
Tech Stack:
- Backend: Python 3.12 + LangChain
- Vector Database: Qdrant Cloud
- AI Models: OpenAI (GPT-4o-mini, text-embedding-3-small)
- UI Framework: Gradio
- Web Scraping: Firecrawl
- Monitoring: LangSmith (optional)
- Deployment: Hugging Face Spaces
System Architecture
High-Level Architecture
┌──────────────────────────────────────────────────────────────┐
│                          USER LAYER                          │
├───────────────────────────┬──────────────────────────────────┤
│ app.py                    │ admin.py                         │
│ (Chat Interface)          │ (Admin Interface)                │
│ - User Q&A                │ - Upload documents               │
│ - Policy generation       │ - Scrape web pages               │
│ - View sources            │ - Manage content                 │
└───────────────────────────┴──────────────────────────────────┘
              │                             │
              ▼                             ▼
┌──────────────────────────────────────────────────────────────┐
│                      APPLICATION LAYER                       │
├──────────────┬─────────────────┬─────────────────────────────┤
│ chatbot.py   │ ingestion.py    │ scraper.py                  │
│ - RAG chain  │ - PDF/DOCX      │ - Web scraping              │
│ - Retrieval  │ - Text chunking │ - URL processing            │
│ - QA logic   │ - Metadata      │ - Content storage           │
└──────────────┴─────────────────┴─────────────────────────────┘
              │                             │
              ▼                             ▼
┌──────────────────────────────────────────────────────────────┐
│                      EXTERNAL SERVICES                       │
├─────────────┬─────────────┬───────────────┬──────────────────┤
│ Qdrant      │ OpenAI      │ Firecrawl     │ LangSmith        │
│ Cloud       │ API         │ API           │ (optional)       │
│ - Vectors   │ - Embeddings│ - Scraping    │ - Monitoring     │
│ - Search    │ - Chat      │ - Markdown    │ - Debugging      │
└─────────────┴─────────────┴───────────────┴──────────────────┘
Module Relationships
Core Modules
1. src/ingestion.py - Document Processing Module
Purpose: Load, process, and store PDF/DOCX documents into vector database
Key Functions:
create_vectorstore() -> (vectorstore, embeddings, client)
load_document(file_path: str) -> List[Document]
chunk_documents(documents, chunk_size=1000, chunk_overlap=200) -> List[Document]
add_metadata(chunks, source_name, doc_type="document") -> List[Document]
ingest_document(file_path: str, doc_type="document") -> int
Dependencies:
- langchain_community.document_loaders (PyPDFLoader, Docx2txtLoader)
- langchain.text_splitter (RecursiveCharacterTextSplitter)
- langchain_openai (OpenAIEmbeddings)
- langchain_qdrant (QdrantVectorStore)
- qdrant_client (QdrantClient)
Used By:
admin.py (upload functionality)
2. src/scraper.py - Web Scraping Module
Purpose: Scrape web pages and store content in vector database
Key Functions:
scrape_url(url: str) -> str
process_and_store_webpage(url: str) -> int
Dependencies:
- firecrawl (FirecrawlApp)
- langchain.schema (Document)
- langchain.text_splitter (RecursiveCharacterTextSplitter)
- langchain_openai (OpenAIEmbeddings)
- langchain_qdrant (QdrantVectorStore)
Used By:
admin.py (URL scraping functionality)
3. src/chatbot.py - RAG Question-Answering Module
Purpose: Handle user questions using Retrieval-Augmented Generation
Key Functions:
create_rag_chain() -> ConversationalRetrievalChain
ask_question(qa_chain, question: str) -> (answer: str, sources: List[Document])
Components:
- Vector store retriever (k=5 similar documents)
- LLM: GPT-4o-mini (temperature=0.3)
- Conversation memory (ConversationBufferMemory)
- System prompt with disclaimers
Dependencies:
- langchain_openai (ChatOpenAI, OpenAIEmbeddings)
- langchain_qdrant (QdrantVectorStore)
- langchain.chains (ConversationalRetrievalChain)
- langchain.memory (ConversationBufferMemory)
- qdrant_client (QdrantClient)
Used By:
app.py (chat interface)
User Interface Modules
4. app.py - Chat Interface (End Users)
Purpose: Gradio-based chat interface for nonprofit users
Features:
- Real-time Q&A
- PII detection and warnings
- Source citations
- Disclaimer display
- Conversation history
- Example questions
Calls:
src/chatbot.py → create_rag_chain(), ask_question()
Port: 7860
5. admin.py - Admin Interface (Content Managers)
Purpose: Gradio-based management interface for HR Intervals team
Features:
- View all documents
- Upload PDF/DOCX files
- Scrape single/multiple URLs
- Delete documents by source
- Update/replace documents
Calls:
- src/ingestion.py → ingest_document()
- src/scraper.py → process_and_store_webpage()
- qdrant_client.QdrantClient → direct CRUD operations
Port: 7861
Data Flow Diagrams
Flow 1: Document Upload
User (admin.py)
    │
[Select PDF/DOCX file]
    │
admin.py: upload_document()
    │
ingestion.py: ingest_document()
    │
[Load document] → PyPDFLoader / Docx2txtLoader
    │
[Split into chunks] → RecursiveCharacterTextSplitter
    │   - chunk_size: 1000
    │   - chunk_overlap: 200
    │
[Add metadata]
    │   - source: filename
    │   - type: document/policy/guide
    │   - upload_date: YYYY-MM-DD
    │
[Generate embeddings] → OpenAI text-embedding-3-small
    │
[Store vectors + metadata] → Qdrant Cloud
    │
    ▼
Success: N chunks uploaded
Flow 2: Web Scraping
User (admin.py)
    │
[Enter URL(s)]
    │
admin.py: scrape_single_url() / scrape_multiple_urls()
    │
scraper.py: process_and_store_webpage()
    │
[Scrape webpage] → Firecrawl API
    │   - Returns: Markdown content
    │
[Create document with metadata]
    │   - source: URL
    │   - type: webpage
    │   - upload_date: YYYY-MM-DD
    │
[Split into chunks] → RecursiveCharacterTextSplitter
    │
[Generate embeddings] → OpenAI text-embedding-3-small
    │
[Store vectors + metadata] → Qdrant Cloud
    │
    ▼
Success: N chunks uploaded
Flow 3: Question Answering (RAG)
User (app.py)
    │
[Type question]
    │
app.py: chat()
    │
[Check for PII] → Regex patterns
    │   - Capitalized names: [A-Z][a-z]+ [A-Z][a-z]+
    │   - If detected: Show warning
    │
chatbot.py: ask_question()
    │
ConversationalRetrievalChain
    │
[Convert question to embedding] → OpenAI text-embedding-3-small
    │
[Similarity search] → Qdrant Cloud
    │   - Retrieve top 5 similar chunks
    │   - Return: chunks + metadata
    │
[Combine context + question + chat history]
    │
[Generate answer] → OpenAI GPT-4o-mini
    │   - Temperature: 0.3
    │   - System prompt: HR assistant with disclaimers
    │
[Return answer + source documents]
    │
app.py: Display answer with sources
    │
    ▼
User sees:
- Answer
- ⚠️ PII warning (if applicable)
- 📚 Sources (top 3)
Flow 4: Document Deletion
User (admin.py)
    │
[Enter document name or URL]
    │
admin.py: delete_document()
    │
Qdrant Client: delete()
    │
[Filter by metadata]
    │   - Field: "source"
    │   - Match: exact document name
    │
[Delete all matching points]
    │
    ▼
Success: All chunks from source deleted
Flow 5: Document Update
User (admin.py)
    │
[Specify old document name]
[Select new file]
    │
admin.py: update_document()
    │
[Step 1: Delete old document]
    │   └── delete_document(old_source)
    │
[Step 2: Upload new document]
    │   └── upload_document(new_file)
    │
    ▼
Success: Document replaced
Configuration
Environment Variables (.env)
# OpenAI API
OPENAI_API_KEY=sk-proj-...
OPEN_AI_EMBEDDING_MODEL=text-embedding-3-small
OPEN_AI_CHAT_MODEL=gpt-4o-mini
# Qdrant Cloud
QDRANT_URL=https://xxx.cloud.qdrant.io:6333
QDRANT_API_KEY=xxx
QDRANT_COLLECTION=hr-intervals
# Firecrawl
FIRECRAWL_API_KEY=fc-xxx
# LangSmith (Optional)
LANGSMITH_TRACING=false
LANGSMITH_API_KEY=xxx
LANGSMITH_PROJECT=hr-intervals-chatbot
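A tool like python-dotenv typically loads the .env file into the process environment; after that, the values can be read with the standard library alone. A minimal sketch, using only the variable names and defaults documented above (the helper name `load_settings` is illustrative, not from the project):

```python
import os

def load_settings() -> dict:
    """Read the documented environment variables, falling back to the
    model/collection names this project documents as its defaults."""
    return {
        "openai_api_key": os.getenv("OPENAI_API_KEY", ""),
        "embedding_model": os.getenv("OPEN_AI_EMBEDDING_MODEL", "text-embedding-3-small"),
        "chat_model": os.getenv("OPEN_AI_CHAT_MODEL", "gpt-4o-mini"),
        "qdrant_url": os.getenv("QDRANT_URL", ""),
        "qdrant_collection": os.getenv("QDRANT_COLLECTION", "hr-intervals"),
    }

settings = load_settings()
```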
Project Structure
hr-intervals-chatbot/
├── src/
│   ├── __init__.py
│   ├── ingestion.py       # Document processing
│   ├── chatbot.py         # RAG Q&A logic
│   └── scraper.py         # Web scraping
├── data/
│   ├── documents/         # Uploaded files
│   └── scraped/           # Scraped content (cache)
├── app.py                 # User chat interface
├── admin.py               # Admin management interface
├── .env                   # API keys and config
├── requirements.txt       # Python dependencies
├── ARCHITECTURE.md        # This file
└── README.md              # Project overview
Key Technical Decisions
1. Vector Database: Qdrant Cloud
- Why: Built-in web UI, easy document management, free tier
- Alternative considered: Pinecone (limited free tier, no document-level UI)
2. Embedding Model: text-embedding-3-small
- Dimensions: 1536
- Why: Excellent quality with best cost-performance ratio, multilingual support (English/French)
3. LLM: GPT-4o-mini
- Why: Cost-effective, sufficient for HR Q&A, fast response
- Alternative: GPT-4o (more expensive but higher quality)
4. Chunking Strategy
- Chunk size: 1000 characters
- Overlap: 200 characters
- Separators: ["\n\n", "\n", ". ", " ", ""]
- Why: Balances context preservation and retrieval accuracy
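The project uses LangChain's RecursiveCharacterTextSplitter for this. As a rough illustration of what chunk_size/chunk_overlap mean, here is a simplified stdlib-only sliding-window chunker; the real splitter additionally recurses through the separator list so chunks end at natural boundaries:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Naive fixed-size chunker: windows of `chunk_size` characters that
    overlap by `chunk_overlap`. Overlap keeps a sentence split across a
    boundary retrievable from at least one chunk."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```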
5. Retrieval: Top-k similarity search
- k=5: Retrieve 5 most similar chunks
- Distance metric: Cosine similarity
- Why: Good balance between context and noise
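Conceptually, the retrieval step is cosine-similarity top-k over the stored chunk embeddings. A stdlib sketch of that ranking (Qdrant performs the same computation server-side, at scale, using approximate-nearest-neighbor indexes; the function names here are illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product of the vectors over the product
    of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Rank stored vectors by similarity to the query and keep the k best."""
    scored = [(name, cosine(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```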
Metadata Schema
Every chunk stored in Qdrant has the following metadata:
{
"source": str, # Filename or URL
"type": str, # "document" | "webpage" | "policy" | "guide"
"upload_date": str, # "YYYY-MM-DD"
"page": int, # (optional) Page number for PDFs
"valid_until": str, # (optional) Expiry date for policies
"version": str, # (optional) Version number
}
Document Management Operations
View Documents
# List all unique documents
client.scroll(collection_name, limit=1000, with_payload=True)
# Group by 'source' field
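The grouping step after client.scroll() reduces to counting chunks per unique source. A sketch with plain dicts standing in for Qdrant point payloads (function name illustrative):

```python
from collections import Counter

def summarize_sources(payloads: list[dict]) -> dict[str, int]:
    """Count how many chunks each unique 'source' contributed — the
    per-document view the admin interface displays."""
    return dict(Counter(p.get("source", "unknown") for p in payloads))
```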
Upload Document
# 1. Load: PyPDFLoader / Docx2txtLoader
# 2. Chunk: RecursiveCharacterTextSplitter
# 3. Add metadata: source, type, date
# 4. Embed: OpenAI text-embedding-3-small
# 5. Store: QdrantVectorStore.from_documents()
Delete Document
from qdrant_client.models import FieldCondition, Filter, FilterSelector, MatchValue

client.delete(
    collection_name=collection_name,
    points_selector=FilterSelector(
        filter=Filter(
            must=[
                FieldCondition(
                    key="source",
                    match=MatchValue(value="filename.pdf")
                )
            ]
        )
    )
)
Update Document
# 1. Delete old version (by source name)
# 2. Upload new version
Security Features
PII Detection
- Regex pattern for names: \b[A-Z][a-z]+ [A-Z][a-z]+\b
- Warning displayed to user if detected
- Future: Integrate Microsoft Presidio for advanced PII detection
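The name pattern above, applied with Python's re module (the wrapper function is illustrative):

```python
import re

# The documented heuristic: two consecutive capitalized words look like a
# personal name. Deliberately coarse — it also flags phrases like
# "New York" — which is why a match only triggers a warning, not a block.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

def contains_possible_pii(text: str) -> bool:
    return NAME_PATTERN.search(text) is not None
```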
Disclaimers
- Shown on first interaction
- Embedded in system prompt
- Reminds users to consult professionals
API Key Security
- Stored in .env file (not in version control)
- .env added to .gitignore
Performance Considerations
Embedding Cost
- Model: text-embedding-3-small
- Cost: ~$0.02 per 1M tokens
- Typical document: 10 pages ≈ 5,000 tokens ≈ $0.0001
Chat Cost
- Model: GPT-4o-mini
- Cost: ~$0.15 per 1M input tokens, $0.60 per 1M output tokens
- Typical query: 5 chunks (5,000 tokens) + question (100 tokens) ≈ $0.0008
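The arithmetic behind the per-query estimate, using the rates above (the 300-token answer length is an assumption, not from this document; the documented ≈$0.0008 counts input tokens only):

```python
# GPT-4o-mini rates, USD per token (from the per-1M prices above).
INPUT_RATE = 0.15 / 1_000_000
OUTPUT_RATE = 0.60 / 1_000_000

input_tokens = 5_000 + 100   # 5 retrieved chunks + the question
output_tokens = 300          # assumed answer length

cost = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
# input alone: 5,100 × $0.00000015 ≈ $0.0008, matching the estimate above
```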
Storage
- Qdrant free tier: 1 GB
- Each chunk: ~1 KB metadata + 6 KB vector (1536 dims × 4 bytes)
- Capacity: ~140,000 chunks (approximately 2,800 documents of 50 chunks each)
Future Enhancements
Phase 1 (Week 9-12) - Policy Features
- Policy template library
- Policy generation from user input
- Policy compliance checking
- Risk identification
Phase 2 (Week 13-18) - Advanced Features
- Bilingual support (French)
- Language detection and switching
- Content recommendation system
- Feedback collection mechanism
Phase 3 (Week 19-20) - Production
- Deployment to Hugging Face Spaces
- User authentication (if needed)
- Analytics dashboard
- Automated expiry detection for policies
Troubleshooting
Common Issues
1. "Collection not found" error
# Solution: Collection is created automatically on first upload
# Just upload a document and it will be created
2. "No documents found" when asking questions
# Solution: Upload at least one document first via admin.py
3. "Rate limit exceeded" from OpenAI
# Solution: Add delays between requests or upgrade OpenAI plan
4. "Firecrawl scraping failed"
# Solution: Check if URL is accessible, verify Firecrawl API key
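For the rate-limit case (issue 3), the "add delays" advice is usually implemented as exponential backoff. A generic stdlib sketch, not from the project code, that could wrap the OpenAI or Firecrawl calls:

```python
import time

def with_backoff(fn, retries: int = 3, base_delay: float = 1.0,
                 retry_on: type[Exception] = Exception):
    """Call fn(); on failure sleep base_delay, 2x, 4x, ... between
    attempts, re-raising after the final attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```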
Development Timeline
- Week 1-2: Infrastructure setup ✅
- Week 3-4: Basic RAG system ✅
- Week 5-6: Web scraping + chat interface
- Week 7-8: Quality improvements
- Week 9-10: Admin interface
- Week 11-12: Demo delivery
- Week 13-16: Policy features
- Week 17-18: Bilingual support
- Week 19-20: Final delivery
References
- LangChain Documentation: https://python.langchain.com/docs/
- Qdrant Documentation: https://qdrant.tech/documentation/
- OpenAI API Reference: https://platform.openai.com/docs/
- Gradio Documentation: https://www.gradio.app/docs/