Sonu Prasad committed
Commit: 8a1c0d1 · Parent(s): f95330e
updated

Files changed:
- DEPLOY.md +198 -0
- Dockerfile +24 -0
- README.md +29 -9
- api.py +297 -0
- endpoints.txt +14 -0
- requirements.txt +21 -0
- src/__init__.py +11 -0
- src/config.py +63 -0
- src/document_processor/__init__.py +8 -0
- src/document_processor/chunker.py +209 -0
- src/knowledge/__init__.py +10 -0
- src/knowledge/embeddings.py +101 -0
- src/knowledge/vector_store.py +205 -0
- src/rag/__init__.py +12 -0
- src/rag/prompts.py +155 -0
- src/rag/retriever.py +177 -0
- src/rag/synthesizer.py +197 -0
DEPLOY.md
ADDED
@@ -0,0 +1,198 @@
# HuggingFace Spaces Deployment Guide

## Overview
This guide walks you through deploying the HR Report Generator API on HuggingFace Spaces using Docker.

---

## Prerequisites

1. **HuggingFace Account**: Create a free account at [huggingface.co](https://huggingface.co)
2. **OpenRouter API Key**: Get your key from [openrouter.ai](https://openrouter.ai)

---

## Step-by-Step Deployment

### Step 1: Create a New Space

1. Go to [huggingface.co/new-space](https://huggingface.co/new-space)
2. Fill in the details:
   - **Space name**: `hr-report-api` (or your preferred name)
   - **License**: Apache 2.0 (or your preference)
   - **SDK**: Select **Docker**
   - **Visibility**: Private (recommended for HR data)
3. Click **Create Space**

### Step 2: Upload Files

Upload all files from this folder to your Space. The structure should be:

```
your-space/
├── api.py
├── Dockerfile
├── requirements.txt
├── endpoints.txt
├── README.md
└── src/
    ├── __init__.py
    ├── config.py
    ├── rag/
    │   ├── __init__.py
    │   ├── synthesizer.py
    │   ├── retriever.py
    │   └── prompts.py
    ├── knowledge/
    │   ├── __init__.py
    │   ├── vector_store.py
    │   └── embeddings.py
    └── document_processor/
        ├── __init__.py
        └── chunker.py
```

You can upload via:
- **Web UI**: Drag and drop files
- **Git**: Clone the repo and push

```bash
git clone https://huggingface.co/spaces/YOUR_USERNAME/hr-report-api
cd hr-report-api
# Copy all files from this folder
git add .
git commit -m "Initial deployment"
git push
```

### Step 3: Configure Secrets

Go to **Settings → Secrets** in your Space and add:

| Secret Name | Value | Description |
|-------------|-------|-------------|
| `OPENROUTER_API_KEY` | `sk-or-...` | Your OpenRouter API key |
| `ALLOWED_ORIGINS` | `https://checkin.hillsideprimarycare.com,https://hsmg.netlify.app` | Comma-separated allowed origins |
| `LLM_MODEL` | `google/gemma-2-9b-it:free` | (Optional) Override model from endpoints.txt |

### Step 4: Upload FAISS Index (Optional)

If you have a pre-built FAISS index with HR policies:

1. Create a `data/embeddings/` folder in your Space
2. Upload:
   - `faiss_index.faiss` - The FAISS index file
   - `faiss_index.chunks.json` - The chunks metadata

Without this, the API will still work but report "insufficient documentation."

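If you do not have these two files yet, one way to produce them locally is with the repo's own chunker and vector store. This is a minimal sketch, not a script shipped in this commit; the `build_index.py` name and the `data/markdown/` location are assumptions, so point it at wherever your policy Markdown actually lives and run it from the repo root:

```python
# build_index.py - hypothetical local helper (run from the repo root)
from pathlib import Path

from src.document_processor.chunker import SemanticChunker
from src.knowledge.vector_store import FAISSVectorStore

# Assumption: HR policies are Markdown files under data/markdown/
markdown_files = sorted(Path("data/markdown").glob("*.md"))

chunker = SemanticChunker()
chunks = chunker.chunk_documents(markdown_files)

store = FAISSVectorStore()
store.add_chunks(chunks)
store.save()  # writes data/embeddings/faiss_index.faiss and faiss_index.chunks.json

print(f"Indexed {store.size} chunks from {len(markdown_files)} documents")
```

Upload the two files it writes to `data/embeddings/` in the Space.
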
### Step 5: Verify Deployment

1. Wait for the build to complete (1-3 minutes)
2. Your API will be available at:
   ```
   https://YOUR_USERNAME-hr-report-api.hf.space
   ```
3. Check health: https://YOUR_USERNAME-hr-report-api.hf.space/api/health

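To script the same checks, here is a minimal sketch using the `requests` package (replace the URL with your own Space; the generous timeout allows for a free-tier cold start):

```python
import requests

BASE = "https://YOUR_USERNAME-hr-report-api.hf.space"  # assumption: your Space URL

# /api/health should return {"status": "healthy"} once the Space is awake
print(requests.get(f"{BASE}/api/health", timeout=120).json())

# /api/status reports whether the FAISS index loaded and which model is active
print(requests.get(f"{BASE}/api/status", timeout=120).json())
```
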
---

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | API info and status |
| `/api/health` | GET | Health check |
| `/api/generate` | POST | Generate HR document |
| `/api/status` | GET | Knowledge base status |
| `/api/config` | GET | Public configuration |

### Generate Document Example

```javascript
fetch('https://YOUR-SPACE.hf.space/api/generate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    doc_type: 'Memorandum',
    employee_name: 'John Smith',
    date_from: '2026-02-01',
    date_to: '2026-02-01',
    reason: 'Tardiness',
    additional_notes: 'Employee arrived 30 minutes late.'
  })
})
```

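The same call from Python, as a sketch (it uses the `requests` package already pinned in requirements.txt; the URL and output filename are placeholders). On success the response carries the generated document in the `markdown` field:

```python
from pathlib import Path

import requests

resp = requests.post(
    "https://YOUR-SPACE.hf.space/api/generate",  # assumption: your Space URL
    json={
        "doc_type": "Memorandum",
        "employee_name": "John Smith",
        "date_from": "2026-02-01",
        "date_to": "2026-02-01",
        "reason": "Tardiness",
        "additional_notes": "Employee arrived 30 minutes late.",
    },
    timeout=300,
)
data = resp.json()
if data.get("success"):
    # Save the generated Markdown document
    Path("memorandum.md").write_text(data["markdown"], encoding="utf-8")
else:
    print("Error:", data.get("error"))
```
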
---

## Updating the LLM Model

1. Edit `endpoints.txt` in your Space
2. Uncomment the model you want to use
3. The first uncommented line will be used

```txt
# Free Models:
google/gemma-2-9b-it:free
# meta-llama/llama-3.2-3b-instruct:free

# Paid Models:
# openai/gpt-4o
```

---

## Origin Validation

The API validates the `Origin` header against `ALLOWED_ORIGINS`. Only requests from these domains are allowed:

- `https://checkin.hillsideprimarycare.com`
- `https://hsmg.netlify.app`
- `http://localhost:3000` (for development)
- `http://localhost:5500`

To add more origins, update the `ALLOWED_ORIGINS` secret (comma-separated).

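To see the check in action, a hedged sketch: a request that presents an `Origin` outside the allowed list is rejected with 403, while requests that send no `Origin` header at all (curl, server-side scripts) are let through by the `validate_origin` decorator in api.py:

```python
import requests

resp = requests.post(
    "https://YOUR-SPACE.hf.space/api/generate",  # assumption: your Space URL
    headers={"Origin": "https://not-allowed.example.com"},  # hypothetical blocked origin
    json={
        "doc_type": "Memorandum",
        "employee_name": "Test",
        "date_from": "2026-02-01",
        "reason": "Tardiness",
    },
)
print(resp.status_code)          # 403
print(resp.json()["error"])      # Origin not allowed
```
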
---

## Troubleshooting

### Build Fails
- Check Dockerfile syntax
- Ensure all files are uploaded
- Check the build logs for errors

### CORS Errors
- Verify `ALLOWED_ORIGINS` includes your frontend domain
- Make sure the domain has the `https://` prefix

### API Returns 500
- Check if `OPENROUTER_API_KEY` is set correctly
- Verify the model in `endpoints.txt` is available
- Check Space logs for detailed errors

### Slow Response
- First request may be slow due to model loading (~30s)
- Subsequent requests should be faster

---

## Cost

| Component | Cost |
|-----------|------|
| HuggingFace Space | **Free** (with cold starts) |
| OpenRouter (free models) | **Free** |
| Total | **$0/month** |

> **Note**: Free tier has 30-60 second cold starts when the Space sleeps after inactivity.

---

## Next Steps

1. ✅ Deploy to HuggingFace Spaces
2. ✅ Configure secrets
3. ⏳ Deploy frontend to Netlify (see `netlify/DEPLOY.md`)
4. ⏳ Test end-to-end integration
Dockerfile
ADDED
@@ -0,0 +1,24 @@
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first for caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Create data directories
RUN mkdir -p data/embeddings data/outputs data/uploads

# Expose port 7860 (HuggingFace default)
EXPOSE 7860

# Run with gunicorn
CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "2", "--timeout", "120", "api:app"]
README.md
CHANGED
@@ -1,10 +1,30 @@
- ---
- title: Hrbot
- emoji: 🐠
- colorFrom: purple
- colorTo: pink
- sdk: docker
- pinned: false
- ---

# HR Report Generator - API Backend (HuggingFace Spaces Docker)

This folder contains the API backend for the HR Report Generator, designed to run on HuggingFace Spaces using Docker.

## Structure
```
huggingface/
├── api.py              # Flask API server
├── Dockerfile          # Docker configuration
├── requirements.txt    # Python dependencies
├── endpoints.txt       # LLM endpoints (editable)
├── src/                # Source modules
│   ├── config.py
│   ├── rag/
│   │   ├── synthesizer.py
│   │   ├── retriever.py
│   │   └── prompts.py
│   ├── knowledge/
│   │   ├── vector_store.py
│   │   └── embeddings.py
│   └── document_processor/
│       └── chunker.py
├── data/               # Data directory (create on HF)
│   └── embeddings/
└── DEPLOY.md           # Deployment guide
```

## Files
- **endpoints.txt**: Configure your LLM models here
- **api.py**: Main Flask API with CORS origin validation
api.py
ADDED
@@ -0,0 +1,297 @@
"""HR Report Generator API - HuggingFace Spaces Backend.

This Flask API serves as the backend for the HR Report Generator.
It validates origins against allowed domains stored in HF secrets.
"""

import os
from datetime import datetime
from pathlib import Path
from functools import wraps

from flask import Flask, request, jsonify
from flask_cors import CORS
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Import our modules
from src.config import settings
from src.knowledge.vector_store import FAISSVectorStore
from src.rag.synthesizer import ReportSynthesizer

app = Flask(__name__)

# ============================================
# CORS and Origin Validation
# ============================================

# Allowed origins from environment (set in HF Secrets)
# Format: comma-separated list of allowed origins
ALLOWED_ORIGINS_STR = os.getenv("ALLOWED_ORIGINS", "")
ALLOWED_ORIGINS = [
    origin.strip()
    for origin in ALLOWED_ORIGINS_STR.split(",")
    if origin.strip()
]

# Default allowed origins if none specified
if not ALLOWED_ORIGINS:
    ALLOWED_ORIGINS = [
        "https://checkin.hillsideprimarycare.com",
        "https://hsmg.netlify.app",
        "http://localhost:3000",
        "http://localhost:5500",
        "http://127.0.0.1:5500",
    ]

# Enable CORS with specific origins
CORS(app, resources={
    r"/api/*": {
        "origins": ALLOWED_ORIGINS,
        "methods": ["GET", "POST", "OPTIONS"],
        "allow_headers": ["Content-Type", "Authorization"],
    }
})


def validate_origin(f):
    """Decorator to validate request origin against allowed list."""
    @wraps(f)
    def decorated_function(*args, **kwargs):
        origin = request.headers.get("Origin", "")
        referer = request.headers.get("Referer", "")

        # Check if origin or referer matches allowed origins
        is_allowed = False
        for allowed in ALLOWED_ORIGINS:
            if origin.startswith(allowed) or referer.startswith(allowed):
                is_allowed = True
                break

        if not is_allowed and origin:  # Allow requests without origin (e.g., curl)
            return jsonify({
                "success": False,
                "error": "Origin not allowed"
            }), 403

        return f(*args, **kwargs)
    return decorated_function


# ============================================
# Configuration
# ============================================

def get_active_model():
    """Read the active model from endpoints.txt."""
    endpoints_file = Path("endpoints.txt")
    default_model = os.getenv("LLM_MODEL", "google/gemma-2-9b-it:free")

    if not endpoints_file.exists():
        return default_model

    try:
        content = endpoints_file.read_text()
        for line in content.splitlines():
            line = line.strip()
            # Skip comments and empty lines
            if line and not line.startswith("#"):
                return line
    except Exception:
        pass

    return default_model


# Document type templates
DOCUMENT_TEMPLATES = {
    'Memorandum': {
        'title': 'MEMORANDUM FOR PERSONNEL FILE',
        'sections': [
            'Purpose of Document',
            'Incident Background',
            'Policy References',
            'Employee Discussion',
            'Corrective Measures',
            'Conclusion'
        ]
    },
    'Termination Letter': {
        'title': 'EMPLOYEE TERMINATION LETTER',
        'sections': [
            'Purpose of Document',
            'Employment History',
            'Reason for Termination',
            'Policy Violations',
            'Previous Warnings',
            'Final Pay and Benefits',
            'Return of Company Property'
        ]
    },
    'Written Disciplinary Action': {
        'title': 'WRITTEN DISCIPLINARY ACTION',
        'sections': [
            'Purpose of Document',
            'Incident Details',
            'Policy References',
            'Previous Coaching / Warnings',
            'Corrective Action Required',
            'Employee Acknowledgment'
        ]
    },
    '90 Day Performance Evaluation': {
        'title': '90-DAY PERFORMANCE EVALUATION',
        'sections': [
            'Evaluation Period',
            'Performance Summary',
            'Areas of Strength',
            'Areas for Improvement',
            'Goals for Next Period',
            'Recommendation'
        ]
    }
}


# ============================================
# API Routes
# ============================================

@app.route('/')
def index():
    """Health check and API info."""
    return jsonify({
        "status": "ok",
        "service": "HR Report Generator API",
        "version": "1.0.0",
        "allowed_origins": ALLOWED_ORIGINS,
        "active_model": get_active_model(),
    })


@app.route('/api/health')
def health():
    """Health check endpoint."""
    return jsonify({"status": "healthy"})


@app.route('/api/generate', methods=['POST', 'OPTIONS'])
@validate_origin
def generate_report():
    """Generate an HR document based on type."""
    if request.method == 'OPTIONS':
        return '', 204

    try:
        data = request.json

        doc_type = data.get('doc_type', 'Memorandum')
        employee_name = data.get('employee_name', '')
        date_from = data.get('date_from', '')
        date_to = data.get('date_to', '')
        reason = data.get('reason', '')
        additional_notes = data.get('additional_notes', '')

        if not all([employee_name, date_from, reason]):
            return jsonify({'success': False, 'error': 'Missing required fields'}), 400

        # Format date range
        if date_to and date_to != date_from:
            incident_date_range = f"{date_from} to {date_to}"
        else:
            incident_date_range = date_from

        # Get document template info
        template_info = DOCUMENT_TEMPLATES.get(doc_type, DOCUMENT_TEMPLATES['Memorandum'])

        # Get active model from endpoints.txt
        model_name = get_active_model()

        # Build the synthesizer input
        synth_input = {
            'employee_name': employee_name,
            'incident_date_range': incident_date_range,
            'incident_reason': additional_notes or reason,
            'violation_type': reason,
            'report_date': datetime.now().strftime('%Y-%m-%d'),
            'doc_type': doc_type,
            'doc_title': template_info['title'],
            'required_sections': template_info['sections'],
        }

        # Generate the report
        synthesizer = ReportSynthesizer(model_name=model_name)
        result = synthesizer.synthesize_from_dict(synth_input)

        if not result.success:
            return jsonify({'success': False, 'error': result.error}), 500

        return jsonify({
            'success': True,
            'markdown': result.markdown_report,
            'validation': {
                'is_valid': True,
                'errors': [],
                'warnings': [],
            },
            'sources': result.retrieval_context.sources_used,
            'model_used': model_name,
        })

    except Exception as e:
        import traceback
        traceback.print_exc()
        return jsonify({'success': False, 'error': str(e)}), 500


@app.route('/api/status')
@validate_origin
def get_status():
    """Get system status."""
    try:
        vector_store = FAISSVectorStore()
        loaded = vector_store.load()

        return jsonify({
            'knowledge_base': {
                'loaded': loaded,
                'chunks': vector_store.size if loaded else 0,
                'sources': vector_store.get_sources() if loaded else [],
            },
            'active_model': get_active_model(),
            'allowed_origins': ALLOWED_ORIGINS,
        })
    except Exception as e:
        return jsonify({
            'knowledge_base': {'loaded': False, 'chunks': 0, 'sources': []},
            'error': str(e),
        })


@app.route('/api/config')
def get_config():
    """Get public configuration (no secrets)."""
    return jsonify({
        'document_types': list(DOCUMENT_TEMPLATES.keys()),
        'active_model': get_active_model(),
    })


# ============================================
# Main
# ============================================

if __name__ == '__main__':
    # Ensure directories exist
    settings.ensure_directories()

    print("=" * 50)
    print("HR Report Generator API")
    print("=" * 50)
    print(f"Active Model: {get_active_model()}")
    print(f"Allowed Origins: {ALLOWED_ORIGINS}")
    print("=" * 50)

    app.run(host='0.0.0.0', port=7860, debug=True)
endpoints.txt
ADDED
@@ -0,0 +1,14 @@
# LLM Endpoints Configuration
# Edit this file to change the model used by the API
# Format: model_name (one per line, first uncommented line is used)

# Free OpenRouter Models:
google/gemma-2-9b-it:free
# meta-llama/llama-3.2-3b-instruct:free
# mistralai/mistral-7b-instruct:free
# google/gemini-pro-1.5-exp

# Paid Models (if you have credits):
# anthropic/claude-3.5-sonnet
# openai/gpt-4o
# meta-llama/llama-3.1-70b-instruct
requirements.txt
ADDED
@@ -0,0 +1,21 @@
# Flask and API
flask>=3.0.0
gunicorn>=21.0.0
flask-cors>=4.0.0
python-dotenv>=1.0.0
requests>=2.31.0

# Pydantic for data validation
pydantic>=2.0.0
pydantic-settings>=2.0.0

# Vector store and embeddings
faiss-cpu>=1.7.4
sentence-transformers>=2.2.0
numpy>=1.24.0

# Document processing
PyYAML>=6.0.0

# File handling
werkzeug>=3.0.0
src/__init__.py
ADDED
@@ -0,0 +1,11 @@
"""HR Report Generator - HuggingFace Spaces Backend."""

from src.config import settings
from src.rag import ReportSynthesizer
from src.knowledge import FAISSVectorStore

__all__ = [
    "settings",
    "ReportSynthesizer",
    "FAISSVectorStore",
]
src/config.py
ADDED
@@ -0,0 +1,63 @@
"""HR Report Generator - Configuration Module (HuggingFace Version)."""

from pathlib import Path
from typing import Literal
import os

from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Application settings loaded from environment variables."""

    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        extra="ignore",
    )

    # LLM Configuration (loaded from environment or endpoints.txt)
    llm_model: str = Field(default="google/gemma-2-9b-it:free", description="LLM model name")
    llm_temperature: float = Field(default=0.0, ge=0.0, le=1.0, description="LLM temperature")

    # Embedding Configuration
    embedding_model: str = Field(
        default="BAAI/bge-small-en-v1.5",
        description="HuggingFace embedding model",
    )

    # Paths (relative for Docker)
    data_dir: Path = Field(default=Path("./data"))
    documents_dir: Path = Field(default=Path("./data/documents"))
    markdown_dir: Path = Field(default=Path("./data/markdown"))
    embeddings_dir: Path = Field(default=Path("./data/embeddings"))
    outputs_dir: Path = Field(default=Path("./data/outputs"))
    templates_dir: Path = Field(default=Path("./templates"))

    # Vector Store
    faiss_index_path: Path = Field(default=Path("./data/embeddings/faiss_index"))

    # Chunking Configuration
    chunk_size: int = Field(default=512, description="Target chunk size in tokens")
    chunk_overlap: int = Field(default=50, description="Overlap between chunks")

    # Retrieval Configuration
    retrieval_top_k: int = Field(default=5, description="Number of documents to retrieve")
    retrieval_min_score: float = Field(default=0.3, description="Minimum similarity score")

    def ensure_directories(self) -> None:
        """Create all required directories if they don't exist."""
        for path in [
            self.data_dir,
            self.documents_dir,
            self.markdown_dir,
            self.embeddings_dir,
            self.outputs_dir,
            self.templates_dir,
        ]:
            path.mkdir(parents=True, exist_ok=True)


# Global settings instance
settings = Settings()
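Because `Settings` extends pydantic-settings' `BaseSettings` with no env prefix, every field above can be overridden per deployment through environment variables (or the optional `.env` file) without touching this file. A minimal sketch, assuming pydantic-settings v2's default case-insensitive matching:

```python
import os

# Hypothetical overrides; any field name above works as an environment variable
os.environ["RETRIEVAL_TOP_K"] = "8"
os.environ["CHUNK_OVERLAP"] = "80"

from src.config import Settings

s = Settings()
print(s.retrieval_top_k, s.chunk_overlap)  # 8 80
```
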
src/document_processor/__init__.py
ADDED
@@ -0,0 +1,8 @@
"""Document processor module for HR Report Generator."""

from src.document_processor.chunker import DocumentChunk, SemanticChunker

__all__ = [
    "DocumentChunk",
    "SemanticChunker",
]
src/document_processor/chunker.py
ADDED
@@ -0,0 +1,209 @@
"""Semantic document chunker for RAG processing."""

import re
from pathlib import Path
from typing import Optional

from pydantic import BaseModel

from src.config import settings


class DocumentChunk(BaseModel):
    """A chunk of document content with metadata."""

    content: str
    source_file: str
    chunk_index: int
    start_char: int
    end_char: int
    section_title: Optional[str] = None
    page_hint: Optional[str] = None

    @property
    def chunk_id(self) -> str:
        """Generate unique chunk identifier."""
        return f"{Path(self.source_file).stem}_{self.chunk_index:04d}"


class SemanticChunker:
    """Chunks Markdown documents by semantic boundaries.

    Respects document structure (headers, paragraphs, lists) while
    maintaining target chunk sizes for optimal embedding performance.
    """

    def __init__(
        self,
        chunk_size: int = None,
        chunk_overlap: int = None,
    ):
        """Initialize the chunker.

        Args:
            chunk_size: Target chunk size in characters.
            chunk_overlap: Overlap between chunks in characters.
        """
        self.chunk_size = chunk_size or settings.chunk_size
        self.chunk_overlap = chunk_overlap or settings.chunk_overlap

        # Patterns for semantic splitting
        self._header_pattern = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)
        self._section_break_pattern = re.compile(r"\n{3,}")
        self._list_item_pattern = re.compile(r"^[\s]*[-*+]\s+", re.MULTILINE)

    def _extract_frontmatter(self, content: str) -> tuple[dict, str]:
        """Extract YAML frontmatter from markdown content."""
        frontmatter = {}
        body = content

        if content.startswith("---"):
            parts = content.split("---", 2)
            if len(parts) >= 3:
                import yaml

                try:
                    frontmatter = yaml.safe_load(parts[1]) or {}
                except Exception:
                    pass
                body = parts[2].strip()

        return frontmatter, body

    def _find_section_boundaries(self, content: str) -> list[tuple[int, int, str]]:
        """Find semantic section boundaries based on headers.

        Returns list of (start_pos, end_pos, section_title) tuples.
        """
        boundaries = []
        headers = list(self._header_pattern.finditer(content))

        if not headers:
            return [(0, len(content), "Document")]

        # Add content before first header if exists
        if headers[0].start() > 0:
            boundaries.append((0, headers[0].start(), "Preamble"))

        # Add each section
        for i, header in enumerate(headers):
            start = header.start()
            end = headers[i + 1].start() if i + 1 < len(headers) else len(content)
            title = header.group(2).strip()
            boundaries.append((start, end, title))

        return boundaries

    def _split_section(self, content: str, section_title: str) -> list[str]:
        """Split a section into smaller chunks respecting boundaries."""
        if len(content) <= self.chunk_size:
            return [content] if content.strip() else []

        chunks = []
        current_chunk = ""

        # Split by paragraphs first
        paragraphs = re.split(r"\n\n+", content)

        for para in paragraphs:
            para = para.strip()
            if not para:
                continue

            # If paragraph alone exceeds chunk size, split by sentences
            if len(para) > self.chunk_size:
                sentences = re.split(r"(?<=[.!?])\s+", para)
                for sentence in sentences:
                    if len(current_chunk) + len(sentence) + 1 <= self.chunk_size:
                        current_chunk += (" " if current_chunk else "") + sentence
                    else:
                        if current_chunk:
                            chunks.append(current_chunk)
                        current_chunk = sentence
            elif len(current_chunk) + len(para) + 2 <= self.chunk_size:
                current_chunk += ("\n\n" if current_chunk else "") + para
            else:
                if current_chunk:
                    chunks.append(current_chunk)
                current_chunk = para

        if current_chunk.strip():
            chunks.append(current_chunk)

        return chunks

    def _add_overlap(self, chunks: list[str]) -> list[str]:
        """Add overlap between chunks for context preservation."""
        if self.chunk_overlap <= 0 or len(chunks) <= 1:
            return chunks

        overlapped = []
        for i, chunk in enumerate(chunks):
            if i > 0:
                # Add end of previous chunk as prefix
                prev_chunk = chunks[i - 1]
                overlap_text = prev_chunk[-self.chunk_overlap :].strip()
                if overlap_text:
                    chunk = f"...{overlap_text}\n\n{chunk}"
            overlapped.append(chunk)

        return overlapped

    def chunk_document(self, markdown_path: Path) -> list[DocumentChunk]:
        """Chunk a Markdown document into semantic pieces.

        Args:
            markdown_path: Path to the Markdown file.

        Returns:
            List of DocumentChunks with metadata.
        """
        markdown_path = Path(markdown_path)
        content = markdown_path.read_text(encoding="utf-8")

        frontmatter, body = self._extract_frontmatter(content)
        source_file = frontmatter.get("source", markdown_path.name)

        sections = self._find_section_boundaries(body)
        all_chunks = []
        chunk_index = 0

        for start_pos, end_pos, section_title in sections:
            section_content = body[start_pos:end_pos].strip()
            if not section_content:
                continue

            section_chunks = self._split_section(section_content, section_title)
            section_chunks = self._add_overlap(section_chunks)

            for chunk_content in section_chunks:
                if not chunk_content.strip():
                    continue

                chunk = DocumentChunk(
                    content=chunk_content,
                    source_file=str(markdown_path),
                    chunk_index=chunk_index,
                    start_char=start_pos,
                    end_char=end_pos,
                    section_title=section_title,
                )
                all_chunks.append(chunk)
                chunk_index += 1

        return all_chunks

    def chunk_documents(self, markdown_paths: list[Path]) -> list[DocumentChunk]:
        """Chunk multiple Markdown documents.

        Args:
            markdown_paths: List of paths to Markdown files.

        Returns:
            List of all DocumentChunks from all documents.
        """
        all_chunks = []
        for path in markdown_paths:
            chunks = self.chunk_document(path)
            all_chunks.extend(chunks)
        return all_chunks
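A quick usage sketch for the chunker above (the handbook path is a placeholder; any Markdown file works):

```python
from pathlib import Path

from src.document_processor.chunker import SemanticChunker

chunker = SemanticChunker(chunk_size=512, chunk_overlap=50)
chunks = chunker.chunk_document(Path("data/markdown/employee_handbook.md"))  # hypothetical file

for chunk in chunks[:3]:
    # chunk_id combines the file stem and a zero-padded index
    print(chunk.chunk_id, "|", chunk.section_title, "|", len(chunk.content), "chars")
```
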
src/knowledge/__init__.py
ADDED
@@ -0,0 +1,10 @@
"""Knowledge module for HR Report Generator."""

from src.knowledge.vector_store import FAISSVectorStore, RetrievalResult
from src.knowledge.embeddings import EmbeddingModel

__all__ = [
    "FAISSVectorStore",
    "RetrievalResult",
    "EmbeddingModel",
]
src/knowledge/embeddings.py
ADDED
@@ -0,0 +1,101 @@
"""Embedding model wrapper for document vectorization."""

from pathlib import Path
from typing import Optional

import numpy as np
from sentence_transformers import SentenceTransformer

from src.config import settings
from src.document_processor.chunker import DocumentChunk


class EmbeddingModel:
    """Wrapper for sentence-transformers embedding models.

    Provides efficient batch embedding with caching support.
    """

    def __init__(self, model_name: Optional[str] = None):
        """Initialize the embedding model.

        Args:
            model_name: HuggingFace model name. Defaults to settings.embedding_model.
        """
        self.model_name = model_name or settings.embedding_model
        self._model: Optional[SentenceTransformer] = None

    @property
    def model(self) -> SentenceTransformer:
        """Lazy load the embedding model."""
        if self._model is None:
            self._model = SentenceTransformer(self.model_name)
        return self._model

    @property
    def embedding_dimension(self) -> int:
        """Get the dimension of embeddings produced by this model."""
        return self.model.get_sentence_embedding_dimension()

    def embed_text(self, text: str) -> np.ndarray:
        """Embed a single text string.

        Args:
            text: Text to embed.

        Returns:
            Embedding vector as numpy array.
        """
        return self.model.encode(text, convert_to_numpy=True, normalize_embeddings=True)

    def embed_texts(self, texts: list[str], batch_size: int = 32) -> np.ndarray:
        """Embed multiple texts efficiently.

        Args:
            texts: List of texts to embed.
            batch_size: Batch size for processing.

        Returns:
            Array of embedding vectors (num_texts x embedding_dim).
        """
        return self.model.encode(
            texts,
            batch_size=batch_size,
            convert_to_numpy=True,
            normalize_embeddings=True,
            show_progress_bar=len(texts) > 100,
        )

    def embed_chunks(
        self, chunks: list[DocumentChunk], batch_size: int = 32
    ) -> list[tuple[DocumentChunk, np.ndarray]]:
        """Embed document chunks with their metadata.

        Args:
            chunks: List of DocumentChunks to embed.
            batch_size: Batch size for processing.

        Returns:
            List of (chunk, embedding) tuples.
        """
        texts = [chunk.content for chunk in chunks]
        embeddings = self.embed_texts(texts, batch_size=batch_size)

        return list(zip(chunks, embeddings))

    def embed_query(self, query: str) -> np.ndarray:
        """Embed a query for retrieval.

        Some models use different prompting for queries vs documents.

        Args:
            query: Query text to embed.

        Returns:
            Query embedding vector.
        """
        # BGE models benefit from query prefixes
        if "bge" in self.model_name.lower():
            query = f"Represent this sentence for searching relevant passages: {query}"

        return self.embed_text(query)
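For reference, a small sketch of how this wrapper is typically used: documents go through `embed_texts`, queries through `embed_query` (which adds the BGE prefix automatically), and because vectors are normalized the inner product doubles as cosine similarity:

```python
from src.knowledge.embeddings import EmbeddingModel

embedder = EmbeddingModel()  # defaults to BAAI/bge-small-en-v1.5 from settings
doc_vecs = embedder.embed_texts(["Employees are considered tardy after 5 minutes."])
query_vec = embedder.embed_query("attendance and punctuality policy")

# (1, dim) document matrix; scalar similarity between the doc and the query
print(doc_vecs.shape, float(doc_vecs[0] @ query_vec))
```
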
src/knowledge/vector_store.py
ADDED
@@ -0,0 +1,205 @@
"""FAISS vector store for document retrieval."""

import json
import pickle
from pathlib import Path
from typing import Optional

import faiss
import numpy as np
from pydantic import BaseModel

from src.config import settings
from src.document_processor.chunker import DocumentChunk
from src.knowledge.embeddings import EmbeddingModel


class RetrievalResult(BaseModel):
    """Result from vector store retrieval."""

    chunk: DocumentChunk
    score: float
    rank: int

    class Config:
        arbitrary_types_allowed = True


class FAISSVectorStore:
    """FAISS-based vector store for efficient similarity search.

    Stores document chunks with their embeddings and provides
    fast retrieval with source tracking for citations.
    """

    def __init__(
        self,
        embedding_model: Optional[EmbeddingModel] = None,
        index_path: Optional[Path] = None,
    ):
        """Initialize the vector store.

        Args:
            embedding_model: Model for generating embeddings.
            index_path: Path to store/load the FAISS index.
        """
        self.embedding_model = embedding_model or EmbeddingModel()
        self.index_path = Path(index_path or settings.faiss_index_path)

        self._index: Optional[faiss.IndexFlatIP] = None
        self._chunks: list[DocumentChunk] = []
        self._is_loaded = False

    def _ensure_directory(self) -> None:
        """Ensure the index directory exists."""
        self.index_path.parent.mkdir(parents=True, exist_ok=True)

    def _create_index(self, dimension: int) -> faiss.IndexFlatIP:
        """Create a new FAISS index.

        Uses Inner Product (IP) since embeddings are normalized.
        """
        return faiss.IndexFlatIP(dimension)

    def add_chunks(self, chunks: list[DocumentChunk]) -> int:
        """Add document chunks to the vector store.

        Args:
            chunks: List of DocumentChunks to add.

        Returns:
            Number of chunks added.
        """
        if not chunks:
            return 0

        # Generate embeddings
        chunk_embeddings = self.embedding_model.embed_chunks(chunks)

        # Initialize index if needed
        if self._index is None:
            dimension = self.embedding_model.embedding_dimension
            self._index = self._create_index(dimension)

        # Add to index
        embeddings_array = np.vstack([emb for _, emb in chunk_embeddings])
        self._index.add(embeddings_array)

        # Store chunks for retrieval
        for chunk, _ in chunk_embeddings:
            self._chunks.append(chunk)

        return len(chunks)

    def search(
        self,
        query: str,
        top_k: int = None,
        min_score: float = None,
    ) -> list[RetrievalResult]:
        """Search for relevant chunks.

        Args:
            query: Search query.
            top_k: Number of results to return.
            min_score: Minimum similarity score threshold.

        Returns:
            List of RetrievalResults ordered by relevance.
        """
        if self._index is None or self._index.ntotal == 0:
            return []

        top_k = top_k or settings.retrieval_top_k
        min_score = min_score or settings.retrieval_min_score

        # Embed query
        query_embedding = self.embedding_model.embed_query(query)
        query_embedding = query_embedding.reshape(1, -1)

        # Search
        scores, indices = self._index.search(query_embedding, min(top_k, self._index.ntotal))

        # Build results
        results = []
        for rank, (score, idx) in enumerate(zip(scores[0], indices[0])):
            if idx < 0 or score < min_score:
                continue

            chunk = self._chunks[idx]
            results.append(
                RetrievalResult(
                    chunk=chunk,
                    score=float(score),
                    rank=rank + 1,
                )
            )

        return results

    def save(self) -> None:
        """Save the index and chunks to disk."""
        if self._index is None:
            return

        self._ensure_directory()

        # Save FAISS index
        index_file = self.index_path.with_suffix(".faiss")
        faiss.write_index(self._index, str(index_file))

        # Save chunks as JSON
        chunks_file = self.index_path.with_suffix(".chunks.json")
        chunks_data = [chunk.model_dump() for chunk in self._chunks]
        chunks_file.write_text(json.dumps(chunks_data, indent=2), encoding="utf-8")

    def load(self) -> bool:
        """Load the index and chunks from disk.

        Returns:
            True if loaded successfully, False otherwise.
        """
        index_file = self.index_path.with_suffix(".faiss")
        chunks_file = self.index_path.with_suffix(".chunks.json")

        if not index_file.exists() or not chunks_file.exists():
            return False

        try:
            # Load FAISS index
            self._index = faiss.read_index(str(index_file))

            # Load chunks
            chunks_data = json.loads(chunks_file.read_text(encoding="utf-8"))
            self._chunks = [DocumentChunk.model_validate(c) for c in chunks_data]

            self._is_loaded = True
            return True

        except Exception as e:
            print(f"Error loading index: {e}")
            return False

    def clear(self) -> None:
        """Clear the index and all stored chunks."""
        self._index = None
        self._chunks = []
        self._is_loaded = False

        # Remove files if they exist
        index_file = self.index_path.with_suffix(".faiss")
        chunks_file = self.index_path.with_suffix(".chunks.json")

        if index_file.exists():
            index_file.unlink()
        if chunks_file.exists():
            chunks_file.unlink()

    @property
    def size(self) -> int:
        """Get the number of chunks in the store."""
        return len(self._chunks)

    def get_sources(self) -> list[str]:
        """Get list of unique source files in the store."""
        return list(set(chunk.source_file for chunk in self._chunks))
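And a retrieval sketch against a previously saved index (the query text is illustrative):

```python
from src.knowledge.vector_store import FAISSVectorStore

store = FAISSVectorStore()
if store.load():  # expects faiss_index.faiss and faiss_index.chunks.json under data/embeddings/
    for result in store.search("tardiness and progressive discipline", top_k=3):
        print(f"[{result.rank}] {result.score:.2f} {result.chunk.source_file} :: {result.chunk.section_title}")
else:
    print("No index on disk - /api/status will report the knowledge base as not loaded.")
```
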
src/rag/__init__.py
ADDED
@@ -0,0 +1,12 @@
"""RAG module for HR Report Generator."""

from src.rag.synthesizer import ReportSynthesizer, ReportInput, SynthesisResult
from src.rag.retriever import DocumentRetriever, RetrievalContext

__all__ = [
    "ReportSynthesizer",
    "ReportInput",
    "SynthesisResult",
    "DocumentRetriever",
    "RetrievalContext",
]
src/rag/prompts.py
ADDED
@@ -0,0 +1,155 @@
| 1 |
+
"""Prompt templates for HR report generation.
|
| 2 |
+
|
| 3 |
+
All prompts enforce strict grounding to retrieved content with
|
| 4 |
+
zero tolerance for hallucination or invention of facts.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
# System prompt that enforces strict RAG constraints with document type awareness
|
| 8 |
+
SYSTEM_PROMPT = """You are an HR Documentation Assistant that generates professional HR documents.
|
| 9 |
+
|
| 10 |
+
## CRITICAL CONSTRAINTS - FOLLOW EXACTLY
|
| 11 |
+
|
| 12 |
+
1. **ONLY USE PROVIDED CONTEXT**: You may ONLY include information from the documents provided below. Do NOT invent, assume, or generalize any facts.
|
| 13 |
+
|
| 14 |
+
2. **CITE EVERYTHING**: Every factual claim MUST reference its source document and section.

3. **NO HALLUCINATION**: If information is not in the provided context, you MUST state "No documentation available" for that section.

4. **PROFESSIONAL TONE**: Use neutral, factual HR language. No opinions, no emotional language, no speculation.

5. **DETERMINISTIC**: Given the same inputs and context, produce identical outputs every time.

## POLICY REFERENCE FORMAT - VERY IMPORTANT

When citing policies, use this EXACT format for clarity:
- **Policy Title**: State the exact policy name (e.g., "Attendance and Punctuality Policy")
- **Source Document**: Reference the handbook or document name
- **Section**: Include section number if available
- **Relevant Text**: Quote the specific policy language in quotation marks

Example:
- Attendance and Punctuality, Hillside Medical Group Employee Handbook, Section 3.2: "Employees are considered tardy if they arrive at their work area 5 minutes after their scheduled starting time. Progressive discipline will begin with the second tardy offense within a 90-day period."

## OUTPUT STRUCTURE

Generate the document following this structure exactly:

```markdown
# [Document Title]

**Employee Name:** [employee_name]
**Document Date:** [report_date]
**Incident Date(s):** [incident_date_range]
**Prepared By:** Human Resources Department

## Purpose of Document
[Brief statement of why this document is being created]

## Incident Background
[Factual description of what occurred - synthesized ONLY from provided context]

## Policy References
[List specific policies with DETAILED citations. Include:
- Full policy name
- Source document
- Section number (if available)
- Exact quoted text from policy in quotation marks
If no policies found, state "No relevant policy documentation available."]

## Previous Coaching / Warnings
[Summary of prior warnings from context with dates and specifics. If none found, state "No prior warning documentation available."]

## Corrective Action Required
[Specific actions required - ONLY if stated in context. Otherwise: "Action pending HR review."]

## Employee Acknowledgment
Employee signature: ___________________ Date: ___________
Supervisor signature: _________________ Date: ___________
```

## VALIDATION CHECKLIST (Self-verify before outputting)

- [ ] Every fact traces to a provided document
- [ ] All policy citations include source, section, and quoted text
- [ ] No invented dates, names, or actions
- [ ] Professional, neutral language throughout
"""

# Template for the user message with context
USER_MESSAGE_TEMPLATE = """## DOCUMENT INFORMATION

- **Document Type:** {doc_type}
- **Employee Name:** {employee_name}
- **Document Date:** {report_date}
- **Incident Date(s):** {incident_date_range}
- **Incident Reason:** {incident_reason}
- **Violation Type:** {violation_type}

---

## RETRIEVED POLICY DOCUMENTS

The following policy documents contain relevant information. Use these for the Policy References section:

{policy_context}

---

## RETRIEVED WARNING/COACHING DOCUMENTS

The following documents contain prior warning/coaching records:

{warnings_context}

---

## INSTRUCTIONS

1. Generate the HR document using ONLY the information provided above.
2. For the Policy References section:
   - Extract and quote the EXACT policy language from the context
   - Include the source document name
   - Include section numbers where available
   - Format each reference clearly with the policy title, source, and quoted text
3. Cite specific sections for all factual claims.
4. State "No documentation available" for any section without supporting context.
5. Use professional, neutral HR language throughout.
"""

# Template for formatting retrieved chunks as context
CONTEXT_CHUNK_TEMPLATE = """### Source: {source_file}
**Section:** {section_title}
**Relevance Score:** {score:.2f}

{content}

---
"""

# Refusal response for insufficient evidence
INSUFFICIENT_EVIDENCE_RESPONSE = """# HR Incident Report

**Employee Name:** {employee_name}
**Document Date:** {report_date}
**Incident Date(s):** {incident_date_range}
**Prepared By:** Human Resources Department

## Purpose of Document
To document {incident_reason} incident(s).

## Incident Background
Insufficient documentation available to generate incident summary.

## Policy References
No relevant policy documentation available. Please ensure policy documents have been ingested into the system.

## Previous Coaching / Warnings
No prior warning documentation available.

## Corrective Action Required
Report generation incomplete. Please ingest relevant HR documentation before proceeding.

---

**⚠️ NOTICE:** This report could not be completed due to insufficient documentation in the knowledge base. Please add relevant HR policy documents and any prior coaching/warning records, then regenerate this report.
"""
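These constants are plain `str.format` templates. As a minimal sketch, a single retrieved chunk could be rendered with `CONTEXT_CHUNK_TEMPLATE` like this (all values below are illustrative placeholders, not real data):

```python
# Illustrative only: render one retrieved chunk with CONTEXT_CHUNK_TEMPLATE.
# Every value here is a hypothetical placeholder.
from src.rag.prompts import CONTEXT_CHUNK_TEMPLATE

rendered = CONTEXT_CHUNK_TEMPLATE.format(
    source_file="employee_handbook.pdf",          # hypothetical source file
    section_title="Attendance and Punctuality",   # hypothetical section title
    score=0.87,                                    # similarity score from the vector store
    content="Employees are considered tardy if they arrive 5 minutes late...",
)
print(rendered)
```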
src/rag/retriever.py
ADDED
@@ -0,0 +1,177 @@
"""Document retriever for RAG pipeline."""

from pathlib import Path
from typing import Optional

from pydantic import BaseModel

from src.config import settings
from src.knowledge.vector_store import FAISSVectorStore, RetrievalResult
from src.rag.prompts import CONTEXT_CHUNK_TEMPLATE


class RetrievalContext(BaseModel):
    """Context retrieved for report generation."""

    policy_results: list[RetrievalResult]
    warning_results: list[RetrievalResult]
    policy_context_text: str
    warnings_context_text: str
    has_sufficient_evidence: bool
    sources_used: list[str]

    class Config:
        arbitrary_types_allowed = True


class DocumentRetriever:
    """Retrieves relevant documents for HR report generation.

    Separates retrieval into policy documents and warning/coaching
    documents to ensure proper context for each report section.
    """

    def __init__(self, vector_store: Optional[FAISSVectorStore] = None):
        """Initialize the retriever.

        Args:
            vector_store: Vector store to search. Creates new one if not provided.
        """
        self.vector_store = vector_store or FAISSVectorStore()

        # Try to load existing index
        if not self.vector_store._is_loaded:
            self.vector_store.load()

    def _format_results_as_context(self, results: list[RetrievalResult]) -> str:
        """Format retrieval results as context string for the LLM."""
        if not results:
            return "No relevant documents found."

        context_parts = []
        for result in results:
            formatted = CONTEXT_CHUNK_TEMPLATE.format(
                source_file=Path(result.chunk.source_file).name,
                section_title=result.chunk.section_title or "General",
                score=result.score,
                content=result.chunk.content,
            )
            context_parts.append(formatted)

        return "\n".join(context_parts)

    def _classify_results(
        self, results: list[RetrievalResult]
    ) -> tuple[list[RetrievalResult], list[RetrievalResult]]:
        """Classify results into policy and warning categories.

        Uses simple heuristics based on source filename and content.
        """
        policy_results = []
        warning_results = []

        policy_keywords = ["policy", "handbook", "manual", "guideline", "procedure"]
        warning_keywords = ["warning", "coaching", "counseling", "disciplinary", "incident"]

        for result in results:
            source_lower = Path(result.chunk.source_file).stem.lower()
            content_lower = result.chunk.content.lower()

            # Check if it's a warning/coaching document
            is_warning = any(kw in source_lower for kw in warning_keywords) or any(
                kw in content_lower[:200] for kw in warning_keywords
            )

            # Check if it's a policy document
            is_policy = any(kw in source_lower for kw in policy_keywords) or any(
                kw in content_lower[:200] for kw in policy_keywords
            )

            if is_warning:
                warning_results.append(result)
            elif is_policy:
                policy_results.append(result)
            else:
                # Default to policy if unclear
                policy_results.append(result)

        return policy_results, warning_results

    def retrieve(
        self,
        employee_name: str,
        violation_type: str,
        incident_reason: str,
        top_k: Optional[int] = None,
        min_score: Optional[float] = None,
    ) -> RetrievalContext:
        """Retrieve relevant context for report generation.

        Args:
            employee_name: Name of the employee.
            violation_type: Type of violation (e.g., "Tardiness").
            incident_reason: Description of the incident.
            top_k: Number of results per query.
            min_score: Minimum similarity score.

        Returns:
            RetrievalContext with categorized results.
        """
        top_k = top_k or settings.retrieval_top_k
        min_score = min_score or settings.retrieval_min_score

        # Build search queries
        policy_query = f"{violation_type} policy procedure disciplinary action"
        warning_query = f"{employee_name} warning coaching disciplinary {violation_type}"
        incident_query = f"{incident_reason} {violation_type}"

        # Execute searches
        policy_results = self.vector_store.search(policy_query, top_k=top_k, min_score=min_score)
        warning_results = self.vector_store.search(warning_query, top_k=top_k, min_score=min_score)
        incident_results = self.vector_store.search(
            incident_query, top_k=top_k, min_score=min_score
        )

        # Combine and deduplicate
        all_results = {}
        for result in policy_results + warning_results + incident_results:
            chunk_id = result.chunk.chunk_id
            if chunk_id not in all_results or result.score > all_results[chunk_id].score:
                all_results[chunk_id] = result

        # Re-classify all results
        all_results_list = sorted(all_results.values(), key=lambda r: r.score, reverse=True)
        policy_classified, warning_classified = self._classify_results(all_results_list)

        # Format as context text
        policy_context = self._format_results_as_context(policy_classified)
        warnings_context = self._format_results_as_context(warning_classified)

        # Determine if we have sufficient evidence
        has_evidence = len(policy_classified) > 0 or len(warning_classified) > 0

        # Collect unique sources
        sources = list(
            set(
                Path(r.chunk.source_file).name
                for r in policy_classified + warning_classified
            )
        )

        return RetrievalContext(
            policy_results=policy_classified,
            warning_results=warning_classified,
            policy_context_text=policy_context,
            warnings_context_text=warnings_context,
            has_sufficient_evidence=has_evidence,
            sources_used=sources,
        )

    def retrieve_for_employee(
        self, employee_name: str, top_k: int = 10
    ) -> list[RetrievalResult]:
        """Retrieve all documents mentioning an employee.

        Useful for finding prior warnings and coaching records.
        """
        return self.vector_store.search(employee_name, top_k=top_k, min_score=0.2)
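A minimal usage sketch for the retriever above, assuming policy and warning documents have already been ingested and the FAISS index persisted (the employee and incident values are placeholders):

```python
# Usage sketch for DocumentRetriever; all inputs are hypothetical placeholders.
from src.rag.retriever import DocumentRetriever

retriever = DocumentRetriever()  # loads the persisted FAISS index if available
context = retriever.retrieve(
    employee_name="Jane Doe",
    violation_type="Tardiness",
    incident_reason="Arrived late on three consecutive shifts",
)
print(context.has_sufficient_evidence)
print(context.sources_used)
```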
src/rag/synthesizer.py
ADDED
@@ -0,0 +1,197 @@
"""Report synthesizer using LLM with strict grounding."""

from datetime import datetime
from typing import Optional
import os

import requests
from dotenv import load_dotenv
from pydantic import BaseModel

# Load environment variables
load_dotenv()

from src.config import settings
from src.rag.prompts import (
    INSUFFICIENT_EVIDENCE_RESPONSE,
    SYSTEM_PROMPT,
    USER_MESSAGE_TEMPLATE,
)
from src.rag.retriever import DocumentRetriever, RetrievalContext


class ReportInput(BaseModel):
    """Structured input for report generation."""

    employee_name: str
    incident_date_range: str
    incident_reason: str
    violation_type: str
    report_date: Optional[str] = None
    doc_type: Optional[str] = "Memorandum"

    def __init__(self, **data):
        if "report_date" not in data or data["report_date"] is None:
            data["report_date"] = datetime.now().strftime("%Y-%m-%d")
        super().__init__(**data)


class SynthesisResult(BaseModel):
    """Result of report synthesis."""

    markdown_report: str
    retrieval_context: RetrievalContext
    model_used: str
    generation_timestamp: str
    success: bool
    error: Optional[str] = None

    class Config:
        arbitrary_types_allowed = True


class ReportSynthesizer:
    """Synthesizes HR incident reports using RAG.

    Combines retrieved context with structured inputs to generate
    strictly grounded reports using an LLM with temperature=0.
    """

    def __init__(
        self,
        retriever: Optional[DocumentRetriever] = None,
        model_name: Optional[str] = None,
    ):
        """Initialize the synthesizer.

        Args:
            retriever: Document retriever for context.
            model_name: Model name for OpenRouter.
        """
        self.retriever = retriever or DocumentRetriever()
        self.model_name = model_name or settings.llm_model
        self.api_key = os.getenv("OPENROUTER_API_KEY", "")
        self.api_base = os.getenv("OPENROUTER_API_BASE", "https://openrouter.ai/api/v1")

    def _build_user_message(
        self, report_input: ReportInput, context: RetrievalContext
    ) -> str:
        """Build the user message with retrieved context."""
        return USER_MESSAGE_TEMPLATE.format(
            doc_type=report_input.doc_type,
            employee_name=report_input.employee_name,
            report_date=report_input.report_date,
            incident_date_range=report_input.incident_date_range,
            incident_reason=report_input.incident_reason,
            violation_type=report_input.violation_type,
            policy_context=context.policy_context_text,
            warnings_context=context.warnings_context_text,
        )

    def _generate_with_llm(self, user_message: str) -> str:
        """Generate report using OpenRouter API."""
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
            "HTTP-Referer": "https://hr-report-generator.hf.space",
            "X-Title": "HR Incident Report Generator",
        }

        payload = {
            "model": self.model_name,
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_message},
            ],
            "temperature": settings.llm_temperature,
            "max_tokens": 2048,
        }

        response = requests.post(
            f"{self.api_base}/chat/completions",
            headers=headers,
            json=payload,
            timeout=120,
        )

        if response.status_code != 200:
            raise Exception(f"OpenRouter API error: {response.status_code} - {response.text}")

        result = response.json()
        return result["choices"][0]["message"]["content"]

    def _generate_insufficient_evidence_report(self, report_input: ReportInput) -> str:
        """Generate a report indicating insufficient evidence."""
        return INSUFFICIENT_EVIDENCE_RESPONSE.format(
            employee_name=report_input.employee_name,
            report_date=report_input.report_date,
            incident_date_range=report_input.incident_date_range,
            incident_reason=report_input.incident_reason,
            violation_type=report_input.violation_type,
        )

    def synthesize(self, report_input: ReportInput) -> SynthesisResult:
        """Synthesize an HR incident report.

        Args:
            report_input: Structured input data.

        Returns:
            SynthesisResult with the generated report.
        """
        timestamp = datetime.now().isoformat()

        try:
            # Retrieve relevant context
            context = self.retriever.retrieve(
                employee_name=report_input.employee_name,
                violation_type=report_input.violation_type,
                incident_reason=report_input.incident_reason,
            )

            # Check for sufficient evidence
            if not context.has_sufficient_evidence:
                return SynthesisResult(
                    markdown_report=self._generate_insufficient_evidence_report(report_input),
                    retrieval_context=context,
                    model_used=self.model_name,
                    generation_timestamp=timestamp,
                    success=True,
                )

            # Build prompt and generate
            user_message = self._build_user_message(report_input, context)
            markdown_report = self._generate_with_llm(user_message)

            return SynthesisResult(
                markdown_report=markdown_report,
                retrieval_context=context,
                model_used=self.model_name,
                generation_timestamp=timestamp,
                success=True,
            )

        except Exception as e:
            return SynthesisResult(
                markdown_report="",
                retrieval_context=RetrievalContext(
                    policy_results=[],
                    warning_results=[],
                    policy_context_text="",
                    warnings_context_text="",
                    has_sufficient_evidence=False,
                    sources_used=[],
                ),
                model_used=self.model_name,
                generation_timestamp=timestamp,
                success=False,
                error=str(e),
            )

    def synthesize_from_dict(self, data: dict) -> SynthesisResult:
        """Synthesize a report from dictionary input.

        Convenience method for CLI and API usage.
        """
        report_input = ReportInput(**data)
        return self.synthesize(report_input)
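End-to-end generation can be sketched as follows, assuming `OPENROUTER_API_KEY` is set in the environment and the knowledge base has been ingested (input values are placeholders):

```python
# Usage sketch for ReportSynthesizer; all inputs are hypothetical placeholders.
from src.rag.synthesizer import ReportSynthesizer

synthesizer = ReportSynthesizer()
result = synthesizer.synthesize_from_dict({
    "employee_name": "Jane Doe",
    "incident_date_range": "2024-01-08 to 2024-01-12",
    "incident_reason": "repeated tardiness",
    "violation_type": "Tardiness",
})
if result.success:
    print(result.markdown_report)
else:
    print(f"Report generation failed: {result.error}")
```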