
HR Intervals AI Assistant - Architecture Documentation

Project Overview

An AI-powered bilingual chatbot that gives nonprofit organizations HR support, policy generation, and compliance checking.

Tech Stack:

  • Backend: Python 3.12 + LangChain
  • Vector Database: Qdrant Cloud
  • AI Models: OpenAI (GPT-4o-mini, text-embedding-3-small)
  • UI Framework: Gradio
  • Web Scraping: Firecrawl
  • Monitoring: LangSmith (optional)
  • Deployment: Hugging Face Spaces

System Architecture

High-Level Architecture

┌──────────────────────────────────────────────────────────────┐
│                        USER LAYER                            │
├──────────────────────────┬───────────────────────────────────┤
│   app.py                 │   admin.py                        │
│   (Chat Interface)       │   (Admin Interface)               │
│   - User Q&A             │   - Upload documents              │
│   - Policy generation    │   - Scrape web pages              │
│   - View sources         │   - Manage content                │
└──────────────────────────┴───────────────────────────────────┘
                    │                    │
                    ▼                    ▼
┌──────────────────────────────────────────────────────────────┐
│                    APPLICATION LAYER                         │
├──────────────┬─────────────────┬─────────────────────────────┤
│ chatbot.py   │ ingestion.py    │ scraper.py                  │
│ - RAG chain  │ - PDF/DOCX      │ - Web scraping              │
│ - Retrieval  │ - Text chunking │ - URL processing            │
│ - QA logic   │ - Metadata      │ - Content storage           │
└──────────────┴─────────────────┴─────────────────────────────┘
                    │                    │
                    ▼                    ▼
┌──────────────────────────────────────────────────────────────┐
│                   EXTERNAL SERVICES                          │
├─────────────┬─────────────┬───────────────┬──────────────────┤
│ Qdrant      │ OpenAI      │ Firecrawl     │ LangSmith        │
│ Cloud       │ API         │ API           │ (optional)       │
│ - Vectors   │ - Embeddings│ - Scraping    │ - Monitoring     │
│ - Search    │ - Chat      │ - Markdown    │ - Debugging      │
└─────────────┴─────────────┴───────────────┴──────────────────┘

Module Relationships

Core Modules

1. src/ingestion.py - Document Processing Module

Purpose: Load, process, and store PDF/DOCX documents into the vector database

Key Functions:

create_vectorstore() -> (vectorstore, embeddings, client)
load_document(file_path: str) -> List[Document]
chunk_documents(documents, chunk_size=1000, chunk_overlap=200) -> List[Document]
add_metadata(chunks, source_name, doc_type="document") -> List[Document]
ingest_document(file_path: str, doc_type="document") -> int

Dependencies:

  • langchain_community.document_loaders (PyPDFLoader, Docx2txtLoader)
  • langchain.text_splitter (RecursiveCharacterTextSplitter)
  • langchain_openai (OpenAIEmbeddings)
  • langchain_qdrant (QdrantVectorStore)
  • qdrant_client (QdrantClient)

Used By:

  • admin.py (upload functionality)

2. src/scraper.py - Web Scraping Module

Purpose: Scrape web pages and store the content in the vector database

Key Functions:

scrape_url(url: str) -> str
process_and_store_webpage(url: str) -> int

Dependencies:

  • firecrawl (FirecrawlApp)
  • langchain.schema (Document)
  • langchain.text_splitter (RecursiveCharacterTextSplitter)
  • langchain_openai (OpenAIEmbeddings)
  • langchain_qdrant (QdrantVectorStore)

Used By:

  • admin.py (URL scraping functionality)

3. src/chatbot.py - RAG Question-Answering Module

Purpose: Handle user questions using Retrieval-Augmented Generation

Key Functions:

create_rag_chain() -> ConversationalRetrievalChain
ask_question(qa_chain, question: str) -> (answer: str, sources: List[Document])

Components:

  • Vector store retriever (k=5 similar documents)
  • LLM: GPT-4o-mini (temperature=0.3)
  • Conversation memory (ConversationBufferMemory)
  • System prompt with disclaimers

Dependencies:

  • langchain_openai (ChatOpenAI, OpenAIEmbeddings)
  • langchain_qdrant (QdrantVectorStore)
  • langchain.chains (ConversationalRetrievalChain)
  • langchain.memory (ConversationBufferMemory)
  • qdrant_client (QdrantClient)

Used By:

  • app.py (chat interface)

User Interface Modules

4. app.py - Chat Interface (End Users)

Purpose: Gradio-based chat interface for nonprofit users

Features:

  • Real-time Q&A
  • PII detection and warnings
  • Source citations
  • Disclaimer display
  • Conversation history
  • Example questions

Calls:

  • src/chatbot.py → create_rag_chain(), ask_question()

Port: 7860


5. admin.py - Admin Interface (Content Managers)

Purpose: Gradio-based management interface for HR Intervals team

Features:

  • View all documents
  • Upload PDF/DOCX files
  • Scrape single/multiple URLs
  • Delete documents by source
  • Update/replace documents

Calls:

  • src/ingestion.py → ingest_document()
  • src/scraper.py → process_and_store_webpage()
  • qdrant_client.QdrantClient → direct CRUD operations

Port: 7861


Data Flow Diagrams

Flow 1: Document Upload

User (admin.py)
    ↓
    [Select PDF/DOCX file]
    ↓
admin.py: upload_document()
    ↓
ingestion.py: ingest_document()
    ↓
    [Load document] → PyPDFLoader / Docx2txtLoader
    ↓
    [Split into chunks] → RecursiveCharacterTextSplitter
    │   - chunk_size: 1000
    │   - chunk_overlap: 200
    ↓
    [Add metadata]
    │   - source: filename
    │   - type: document/policy/guide
    │   - upload_date: YYYY-MM-DD
    ↓
    [Generate embeddings] → OpenAI text-embedding-3-small
    ↓
    [Store vectors + metadata] → Qdrant Cloud
    ↓
✅ Success: N chunks uploaded

Flow 2: Web Scraping

User (admin.py)
    ↓
    [Enter URL(s)]
    ↓
admin.py: scrape_single_url() / scrape_multiple_urls()
    ↓
scraper.py: process_and_store_webpage()
    ↓
    [Scrape webpage] → Firecrawl API
    │   - Returns: Markdown content
    ↓
    [Create document with metadata]
    │   - source: URL
    │   - type: webpage
    │   - upload_date: YYYY-MM-DD
    ↓
    [Split into chunks] → RecursiveCharacterTextSplitter
    ↓
    [Generate embeddings] → OpenAI text-embedding-3-small
    ↓
    [Store vectors + metadata] → Qdrant Cloud
    ↓
✅ Success: N chunks uploaded

Flow 3: Question Answering (RAG)

User (app.py)
    ↓
    [Type question]
    ↓
app.py: chat()
    ↓
    [Check for PII] → Regex patterns
    │   - Capitalized names: [A-Z][a-z]+ [A-Z][a-z]+
    │   - If detected: Show warning
    ↓
chatbot.py: ask_question()
    ↓
ConversationalRetrievalChain
    ↓
    [Convert question to embedding] → OpenAI text-embedding-3-small
    ↓
    [Similarity search] → Qdrant Cloud
    │   - Retrieve top 5 similar chunks
    │   - Return: chunks + metadata
    ↓
    [Combine context + question + chat history]
    ↓
    [Generate answer] → OpenAI GPT-4o-mini
    │   - Temperature: 0.3
    │   - System prompt: HR assistant with disclaimers
    ↓
    [Return answer + source documents]
    ↓
app.py: Display answer with sources
    ↓
User sees:
    - Answer
    - ⚠️ PII warning (if applicable)
    - 📚 Sources (top 3)
Flow 4: Document Deletion

User (admin.py)
    ↓
    [Enter document name or URL]
    ↓
admin.py: delete_document()
    ↓
Qdrant Client: delete()
    ↓
    [Filter by metadata]
    │   - Field: "source"
    │   - Match: exact document name
    ↓
    [Delete all matching points]
    ↓
✅ Success: All chunks from source deleted

Flow 5: Document Update

User (admin.py)
    ↓
    [Specify old document name]
    [Select new file]
    ↓
admin.py: update_document()
    ↓
    [Step 1: Delete old document]
    │   └─→ delete_document(old_source)
    ↓
    [Step 2: Upload new document]
    │   └─→ upload_document(new_file)
    ↓
✅ Success: Document replaced
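One property worth making explicit in Flow 5: the two steps are not transactional, so a failed upload leaves the old document already deleted. A sketch with stub helpers (standing in for admin.py's real delete/upload functions):

```python
def update_document(old_source: str, new_file: str, delete_fn, upload_fn) -> int:
    """Replace a document: delete all chunks for old_source, then ingest new_file.

    Not atomic: if upload_fn fails, the old document is already gone, so the
    admin interface should surface upload errors and prompt a retry.
    """
    delete_fn(old_source)
    return upload_fn(new_file)

# Stub helpers standing in for admin.py's delete_document / upload_document
log = []
result = update_document(
    "old.pdf", "new.pdf",
    delete_fn=lambda s: log.append(("delete", s)),
    upload_fn=lambda f: (log.append(("upload", f)), 7)[1],  # pretend 7 chunks stored
)
```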

Configuration

Environment Variables (.env)

# OpenAI API
OPENAI_API_KEY=sk-proj-...
OPEN_AI_EMBEDDING_MODEL=text-embedding-3-small
OPEN_AI_CHAT_MODEL=gpt-4o-mini

# Qdrant Cloud
QDRANT_URL=https://xxx.cloud.qdrant.io:6333
QDRANT_API_KEY=xxx
QDRANT_COLLECTION=hr-intervals

# Firecrawl
FIRECRAWL_API_KEY=fc-xxx

# LangSmith (Optional)
LANGSMITH_TRACING=false
LANGSMITH_API_KEY=xxx
LANGSMITH_PROJECT=hr-intervals-chatbot

Project Structure

hr-intervals-chatbot/
├── src/
│   ├── __init__.py
│   ├── ingestion.py           # Document processing
│   ├── chatbot.py             # RAG Q&A logic
│   └── scraper.py             # Web scraping
├── data/
│   ├── documents/             # Uploaded files
│   └── scraped/               # Scraped content (cache)
├── app.py                     # User chat interface
├── admin.py                   # Admin management interface
├── .env                       # API keys and config
├── requirements.txt           # Python dependencies
├── ARCHITECTURE.md            # This file
└── README.md                  # Project overview

Key Technical Decisions

1. Vector Database: Qdrant Cloud

  • Why: Built-in web UI, easy document management, free tier
  • Alternative considered: Pinecone (limited free tier, no document-level UI)

2. Embedding Model: text-embedding-3-small

  • Dimensions: 1536
  • Why: Excellent quality with best cost-performance ratio, multilingual support (English/French)

3. LLM: GPT-4o-mini

  • Why: Cost-effective, sufficient for HR Q&A, fast response
  • Alternative: GPT-4o (more expensive but higher quality)

4. Chunking Strategy

  • Chunk size: 1000 characters
  • Overlap: 200 characters
  • Separators: ["\n\n", "\n", ". ", " ", ""]
  • Why: Balances context preservation and retrieval accuracy
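The effect of chunk_size and chunk_overlap can be illustrated with a simplified character-window splitter (a sketch of the overlap behavior only; the real RecursiveCharacterTextSplitter additionally prefers splitting at the separator hierarchy above):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows with overlap.

    Simplified illustration of chunk_size/chunk_overlap; unlike
    RecursiveCharacterTextSplitter it ignores the separator hierarchy.
    """
    step = chunk_size - chunk_overlap  # each chunk starts 800 chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# A 2,500-character document yields chunks starting at offsets 0, 800, 1600;
# consecutive chunks share their last/first 200 characters.
chunks = chunk_text("x" * 2500)
```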

5. Retrieval: Top-k similarity search

  • k=5: Retrieve 5 most similar chunks
  • Distance metric: Cosine similarity
  • Why: Good balance between context and noise
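What "top-k cosine similarity" means can be shown with a tiny brute-force version of the search (Qdrant runs this over all stored vectors with an ANN index; ids and vectors below are illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 5) -> list[str]:
    """Rank stored chunk ids by cosine similarity to the query embedding."""
    ranked = sorted(vectors, key=lambda cid: cosine(query, vectors[cid]), reverse=True)
    return ranked[:k]

store = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.9, 0.1],
    "chunk-c": [0.0, 1.0],
}
top_k([1.0, 0.05], store, k=2)  # chunk-a and chunk-b point almost the same way
```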

Metadata Schema

Every chunk stored in Qdrant has the following metadata:

{
    "source": str,           # Filename or URL
    "type": str,             # "document" | "webpage" | "policy" | "guide"
    "upload_date": str,      # "YYYY-MM-DD"
    "page": int,             # (optional) Page number for PDFs
    "valid_until": str,      # (optional) Expiry date for policies
    "version": str,          # (optional) Version number
}

Document Management Operations

View Documents

# List all unique documents (scroll returns a (points, next_offset) tuple)
points, _ = client.scroll(collection_name=collection_name, limit=1000, with_payload=True)
# Group by the 'source' payload field
sources = {p.payload.get("source") for p in points}

Upload Document

# 1. Load: PyPDFLoader / Docx2txtLoader
# 2. Chunk: RecursiveCharacterTextSplitter
# 3. Add metadata: source, type, date
# 4. Embed: OpenAI text-embedding-3-small
# 5. Store: QdrantVectorStore.from_documents()

Delete Document

from qdrant_client.models import FieldCondition, Filter, FilterSelector, MatchValue

client.delete(
    collection_name=collection_name,
    points_selector=FilterSelector(
        filter=Filter(
            must=[
                FieldCondition(
                    key="source",
                    match=MatchValue(value="filename.pdf")
                )
            ]
        )
    )
)

Update Document

# 1. Delete old version (by source name)
# 2. Upload new version

Security Features

PII Detection

  • Regex pattern for names: \b[A-Z][a-z]+ [A-Z][a-z]+\b
  • Warning displayed to user if detected
  • Future: Integrate Microsoft Presidio for advanced PII detection
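The name heuristic above is a single regex check. A sketch of how app.py might apply it (illustrative only; the pattern also matches capitalized phrases like "Human Resources", hence the Presidio plan):

```python
import re

# Two consecutive capitalized words, e.g. "Jane Doe" (the pattern from above)
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

def detect_possible_names(text: str) -> list[str]:
    """Return capitalized two-word sequences that may be personal names."""
    return NAME_PATTERN.findall(text)

detect_possible_names("Can I put Jane Doe on unpaid leave?")
```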

Disclaimers

  • Shown on first interaction
  • Embedded in system prompt
  • Reminds users to consult professionals

API Key Security

  • Stored in .env file (not in version control)
  • .env added to .gitignore

Performance Considerations

Embedding Cost

  • Model: text-embedding-3-small
  • Cost: ~$0.02 per 1M tokens
  • Typical document: 10 pages ≈ 5,000 tokens ≈ $0.0001

Chat Cost

  • Model: GPT-4o-mini
  • Cost: ~$0.15 per 1M input tokens, $0.60 per 1M output tokens
  • Typical query: 5 retrieved chunks (~1,250 tokens at ~4 characters per token) + question and history ≈ $0.0004, including output

Storage

  • Qdrant free tier: 1 GB
  • Each chunk: ~1 KB metadata + ~6 KB vector (1536 dims × 4 bytes)
  • Capacity: ~150,000 chunks (approximately 3,000 documents of 50 chunks each)
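The capacity figure is back-of-envelope arithmetic from the 1536-dimension embedding size stated earlier (the ~1 KB metadata estimate is a rough assumption):

```python
DIMENSIONS = 1536       # text-embedding-3-small
BYTES_PER_FLOAT = 4     # float32
METADATA_BYTES = 1_000  # rough per-chunk payload size (assumption)

vector_bytes = DIMENSIONS * BYTES_PER_FLOAT  # 6,144 bytes ≈ 6 KB
chunk_bytes = vector_bytes + METADATA_BYTES  # ≈ 7 KB per chunk
capacity = (1 * 1024**3) // chunk_bytes      # chunks fitting in the 1 GB free tier
```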

Future Enhancements

Phase 1 (Week 9-12) - Policy Features

  • Policy template library
  • Policy generation from user input
  • Policy compliance checking
  • Risk identification

Phase 2 (Week 13-18) - Advanced Features

  • Bilingual support (French)
  • Language detection and switching
  • Content recommendation system
  • Feedback collection mechanism

Phase 3 (Week 19-20) - Production

  • Deployment to Hugging Face Spaces
  • User authentication (if needed)
  • Analytics dashboard
  • Automated expiry detection for policies

Troubleshooting

Common Issues

1. "Collection not found" error

# Solution: Collection is created automatically on first upload
# Just upload a document and it will be created

2. "No documents found" when asking questions

# Solution: Upload at least one document first via admin.py

3. "Rate limit exceeded" from OpenAI

# Solution: Add delays between requests or upgrade OpenAI plan

4. "Firecrawl scraping failed"

# Solution: Check if URL is accessible, verify Firecrawl API key

Development Timeline

  • Week 1-2: Infrastructure setup ✅
  • Week 3-4: Basic RAG system ✅
  • Week 5-6: Web scraping + chat interface
  • Week 7-8: Quality improvements
  • Week 9-10: Admin interface
  • Week 11-12: Demo delivery
  • Week 13-16: Policy features
  • Week 17-18: Bilingual support
  • Week 19-20: Final delivery
