Sonu Prasad committed on
Commit
8a1c0d1
·
1 Parent(s): f95330e
DEPLOY.md ADDED
@@ -0,0 +1,198 @@
+ # HuggingFace Spaces Deployment Guide
+
+ ## Overview
+ This guide walks you through deploying the HR Report Generator API on HuggingFace Spaces using Docker.
+
+ ---
+
+ ## Prerequisites
+
+ 1. **HuggingFace Account**: Create a free account at [huggingface.co](https://huggingface.co)
+ 2. **OpenRouter API Key**: Get your key from [openrouter.ai](https://openrouter.ai)
+
+ ---
+
+ ## Step-by-Step Deployment
+
+ ### Step 1: Create a New Space
+
+ 1. Go to [huggingface.co/new-space](https://huggingface.co/new-space)
+ 2. Fill in the details:
+    - **Space name**: `hr-report-api` (or your preferred name)
+    - **License**: Apache 2.0 (or your preference)
+    - **SDK**: Select **Docker**
+    - **Visibility**: Private (recommended for HR data)
+ 3. Click **Create Space**
+
+ ### Step 2: Upload Files
+
+ Upload all files from this folder to your Space. The structure should be:
+
+ ```
+ your-space/
+ ├── api.py
+ ├── Dockerfile
+ ├── requirements.txt
+ ├── endpoints.txt
+ ├── README.md
+ └── src/
+     ├── __init__.py
+     ├── config.py
+     ├── rag/
+     │   ├── __init__.py
+     │   ├── synthesizer.py
+     │   ├── retriever.py
+     │   └── prompts.py
+     ├── knowledge/
+     │   ├── __init__.py
+     │   ├── vector_store.py
+     │   └── embeddings.py
+     └── document_processor/
+         ├── __init__.py
+         └── chunker.py
+ ```
+
+ You can upload via:
+ - **Web UI**: Drag and drop files
+ - **Git**: Clone the repo and push
+
+ ```bash
+ git clone https://huggingface.co/spaces/YOUR_USERNAME/hr-report-api
+ cd hr-report-api
+ # Copy all files from this folder, then:
+ git add .
+ git commit -m "Initial deployment"
+ git push
+ ```
+
+ ### Step 3: Configure Secrets
+
+ Go to **Settings → Secrets** in your Space and add:
+
+ | Secret Name | Value | Description |
+ |-------------|-------|-------------|
+ | `OPENROUTER_API_KEY` | `sk-or-...` | Your OpenRouter API key |
+ | `ALLOWED_ORIGINS` | `https://checkin.hillsideprimarycare.com,https://hsmg.netlify.app` | Comma-separated allowed origins |
+ | `LLM_MODEL` | `google/gemma-2-9b-it:free` | (Optional) Fallback model, used only if `endpoints.txt` is missing or fully commented out |
+
+ ### Step 4: Upload FAISS Index (Optional)
+
+ If you have a pre-built FAISS index with HR policies:
+
+ 1. Create a `data/embeddings/` folder in your Space
+ 2. Upload:
+    - `faiss_index.faiss` - the FAISS index file
+    - `faiss_index.chunks.json` - the chunk metadata
+
+ Without this, the API will still start, but every generated document will report "insufficient documentation."
+
+ ### Step 5: Verify Deployment
+
+ 1. Wait for the build to complete (1-3 minutes)
+ 2. Your API will be available at:
+    ```
+    https://YOUR_USERNAME-hr-report-api.hf.space
+    ```
+ 3. Check health: `https://YOUR_USERNAME-hr-report-api.hf.space/api/health`
+
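To confirm the deployment from a terminal, you can probe the two read-only endpoints; a minimal sketch, assuming your Space URL follows the pattern above:

```bash
# Expect {"status": "healthy"}
curl -s https://YOUR_USERNAME-hr-report-api.hf.space/api/health

# The root route also reports the active model and allowed origins (see api.py)
curl -s https://YOUR_USERNAME-hr-report-api.hf.space/
```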
+ ---
+
+ ## API Endpoints
+
+ | Endpoint | Method | Description |
+ |----------|--------|-------------|
+ | `/` | GET | API info and status |
+ | `/api/health` | GET | Health check |
+ | `/api/generate` | POST | Generate an HR document |
+ | `/api/status` | GET | Knowledge base status |
+ | `/api/config` | GET | Public configuration |
+
+ ### Generate Document Example
+
+ ```javascript
+ fetch('https://YOUR-SPACE.hf.space/api/generate', {
+   method: 'POST',
+   headers: { 'Content-Type': 'application/json' },
+   body: JSON.stringify({
+     doc_type: 'Memorandum',
+     employee_name: 'John Smith',
+     date_from: '2026-02-01',
+     date_to: '2026-02-01',
+     reason: 'Tardiness',
+     additional_notes: 'Employee arrived 30 minutes late.'
+   })
+ })
+   .then((res) => res.json())
+   .then((data) => console.log(data.markdown));
+ ```
+
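On success, `/api/generate` returns a JSON payload whose fields come from `api.py`; the values below are illustrative only:

```json
{
  "success": true,
  "markdown": "# MEMORANDUM FOR PERSONNEL FILE\n\n**Employee Name:** John Smith\n...",
  "validation": { "is_valid": true, "errors": [], "warnings": [] },
  "sources": ["employee_handbook.md"],
  "model_used": "google/gemma-2-9b-it:free"
}
```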
+ ---
+
+ ## Updating the LLM Model
+
+ 1. Edit `endpoints.txt` in your Space
+ 2. Uncomment the model you want to use
+ 3. The first uncommented line is used
+
+ ```txt
+ # Free Models:
+ google/gemma-2-9b-it:free
+ # meta-llama/llama-3.2-3b-instruct:free
+
+ # Paid Models:
+ # openai/gpt-4o
+ ```
+
+ ---
+
+ ## Origin Validation
+
+ The API validates the `Origin` header against `ALLOWED_ORIGINS`. Only requests from these domains are allowed:
+
+ - `https://checkin.hillsideprimarycare.com`
+ - `https://hsmg.netlify.app`
+ - `http://localhost:3000` (for development)
+ - `http://localhost:5500`
+
+ To add more origins, update the `ALLOWED_ORIGINS` secret (comma-separated). A rejected cross-origin request receives a `403`, as sketched below.
+
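To see the rejection path in action, send a request with a deliberately disallowed `Origin` header (the domain below is hypothetical):

```bash
# Expect HTTP 403 with {"success": false, "error": "Origin not allowed"}
curl -i -X POST https://YOUR-SPACE.hf.space/api/generate \
  -H 'Origin: https://not-on-the-list.example.com' \
  -H 'Content-Type: application/json' \
  -d '{"doc_type": "Memorandum", "employee_name": "John Smith", "date_from": "2026-02-01", "reason": "Tardiness"}'
```

Note that requests carrying no `Origin` header at all (plain `curl`, server-to-server calls) are deliberately let through; see `validate_origin` in `api.py`.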
+ ---
+
+ ## Troubleshooting
+
+ ### Build Fails
+ - Check the Dockerfile syntax
+ - Ensure all files are uploaded
+ - Check the build logs for errors
+
+ ### CORS Errors
+ - Verify `ALLOWED_ORIGINS` includes your frontend domain
+ - Make sure the domain has the `https://` prefix
+
+ ### API Returns 500
+ - Check that `OPENROUTER_API_KEY` is set correctly
+ - Verify the model in `endpoints.txt` is available
+ - Check the Space logs for detailed errors
+
+ ### Slow Response
+ - The first request may be slow (~30 s) while models load
+ - Subsequent requests should be faster
+
+ ---
+
+ ## Cost
+
+ | Component | Cost |
+ |-----------|------|
+ | HuggingFace Space | **Free** (with cold starts) |
+ | OpenRouter (free models) | **Free** |
+ | Total | **$0/month** |
+
+ > **Note**: The free tier has 30-60 second cold starts when the Space sleeps after inactivity.
+
+ ---
+
+ ## Next Steps
+
+ 1. ✅ Deploy to HuggingFace Spaces
+ 2. ✅ Configure secrets
+ 3. ⏳ Deploy frontend to Netlify (see `netlify/DEPLOY.md`)
+ 4. ⏳ Test end-to-end integration
Dockerfile ADDED
@@ -0,0 +1,24 @@
+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     build-essential \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements first for layer caching
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy application code
+ COPY . .
+
+ # Create data directories
+ RUN mkdir -p data/embeddings data/outputs data/uploads
+
+ # Expose port 7860 (HuggingFace Spaces default)
+ EXPOSE 7860
+
+ # Run with gunicorn
+ CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "2", "--timeout", "120", "api:app"]
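A local smoke test of the image before pushing; a sketch assuming Docker is installed and `OPENROUTER_API_KEY` is exported in your shell:

```bash
# Build and run on the same port the Space uses
docker build -t hr-report-api .
docker run --rm -p 7860:7860 -e OPENROUTER_API_KEY="$OPENROUTER_API_KEY" hr-report-api
# In another terminal: curl -s http://localhost:7860/api/health
```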
README.md CHANGED
@@ -1,10 +1,30 @@
- ---
- title: Hrbot
- emoji: 🐠
- colorFrom: purple
- colorTo: pink
- sdk: docker
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # HR Report Generator - API Backend (HuggingFace Spaces Docker)
+
+ This folder contains the API backend for the HR Report Generator, designed to run on HuggingFace Spaces using Docker.
+
+ ## Structure
+ ```
+ huggingface/
+ ├── api.py              # Flask API server
+ ├── Dockerfile          # Docker configuration
+ ├── requirements.txt    # Python dependencies
+ ├── endpoints.txt       # LLM endpoints (editable)
+ ├── src/                # Source modules
+ │   ├── config.py
+ │   ├── rag/
+ │   │   ├── synthesizer.py
+ │   │   ├── retriever.py
+ │   │   └── prompts.py
+ │   ├── knowledge/
+ │   │   ├── vector_store.py
+ │   │   └── embeddings.py
+ │   └── document_processor/
+ │       └── chunker.py
+ ├── data/               # Data directory (create on HF)
+ │   └── embeddings/
+ └── DEPLOY.md           # Deployment guide
+ ```
+
+ ## Files
+ - **endpoints.txt**: Configure your LLM models here
+ - **api.py**: Main Flask API with CORS origin validation
api.py ADDED
@@ -0,0 +1,297 @@
+ """HR Report Generator API - HuggingFace Spaces Backend.
+
+ This Flask API serves as the backend for the HR Report Generator.
+ It validates origins against allowed domains stored in HF secrets.
+ """
+
+ import os
+ from datetime import datetime
+ from pathlib import Path
+ from functools import wraps
+
+ from flask import Flask, request, jsonify
+ from flask_cors import CORS
+ from dotenv import load_dotenv
+
+ # Load environment variables
+ load_dotenv()
+
+ # Import our modules
+ from src.config import settings
+ from src.knowledge.vector_store import FAISSVectorStore
+ from src.rag.synthesizer import ReportSynthesizer
+
+ app = Flask(__name__)
+
+ # ============================================
+ # CORS and Origin Validation
+ # ============================================
+
+ # Allowed origins from environment (set in HF Secrets)
+ # Format: comma-separated list of allowed origins
+ ALLOWED_ORIGINS_STR = os.getenv("ALLOWED_ORIGINS", "")
+ ALLOWED_ORIGINS = [
+     origin.strip()
+     for origin in ALLOWED_ORIGINS_STR.split(",")
+     if origin.strip()
+ ]
+
+ # Default allowed origins if none specified
+ if not ALLOWED_ORIGINS:
+     ALLOWED_ORIGINS = [
+         "https://checkin.hillsideprimarycare.com",
+         "https://hsmg.netlify.app",
+         "http://localhost:3000",
+         "http://localhost:5500",
+         "http://127.0.0.1:5500",
+     ]
+
+ # Enable CORS with specific origins
+ CORS(app, resources={
+     r"/api/*": {
+         "origins": ALLOWED_ORIGINS,
+         "methods": ["GET", "POST", "OPTIONS"],
+         "allow_headers": ["Content-Type", "Authorization"],
+     }
+ })
+
+
+ def validate_origin(f):
+     """Decorator to validate the request origin against the allowed list."""
+     @wraps(f)
+     def decorated_function(*args, **kwargs):
+         origin = request.headers.get("Origin", "")
+         referer = request.headers.get("Referer", "")
+
+         # Check whether the origin or referer matches an allowed origin.
+         # Note: this is a prefix match, not an exact host comparison.
+         is_allowed = False
+         for allowed in ALLOWED_ORIGINS:
+             if origin.startswith(allowed) or referer.startswith(allowed):
+                 is_allowed = True
+                 break
+
+         # Reject only when an Origin header is present and not allowed;
+         # requests without an Origin (e.g., curl) are let through.
+         if not is_allowed and origin:
+             return jsonify({
+                 "success": False,
+                 "error": "Origin not allowed"
+             }), 403
+
+         return f(*args, **kwargs)
+     return decorated_function
+
+
+ # ============================================
+ # Configuration
+ # ============================================
+
+ def get_active_model():
+     """Read the active model from endpoints.txt (first uncommented line)."""
+     endpoints_file = Path("endpoints.txt")
+     default_model = os.getenv("LLM_MODEL", "google/gemma-2-9b-it:free")
+
+     if not endpoints_file.exists():
+         return default_model
+
+     try:
+         content = endpoints_file.read_text()
+         for line in content.splitlines():
+             line = line.strip()
+             # Skip comments and empty lines
+             if line and not line.startswith("#"):
+                 return line
+     except Exception:
+         pass
+
+     return default_model
+
+
+ # Document type templates
+ DOCUMENT_TEMPLATES = {
+     'Memorandum': {
+         'title': 'MEMORANDUM FOR PERSONNEL FILE',
+         'sections': [
+             'Purpose of Document',
+             'Incident Background',
+             'Policy References',
+             'Employee Discussion',
+             'Corrective Measures',
+             'Conclusion'
+         ]
+     },
+     'Termination Letter': {
+         'title': 'EMPLOYEE TERMINATION LETTER',
+         'sections': [
+             'Purpose of Document',
+             'Employment History',
+             'Reason for Termination',
+             'Policy Violations',
+             'Previous Warnings',
+             'Final Pay and Benefits',
+             'Return of Company Property'
+         ]
+     },
+     'Written Disciplinary Action': {
+         'title': 'WRITTEN DISCIPLINARY ACTION',
+         'sections': [
+             'Purpose of Document',
+             'Incident Details',
+             'Policy References',
+             'Previous Coaching / Warnings',
+             'Corrective Action Required',
+             'Employee Acknowledgment'
+         ]
+     },
+     '90 Day Performance Evaluation': {
+         'title': '90-DAY PERFORMANCE EVALUATION',
+         'sections': [
+             'Evaluation Period',
+             'Performance Summary',
+             'Areas of Strength',
+             'Areas for Improvement',
+             'Goals for Next Period',
+             'Recommendation'
+         ]
+     }
+ }
+
+
+ # ============================================
+ # API Routes
+ # ============================================
+
+ @app.route('/')
+ def index():
+     """Health check and API info."""
+     return jsonify({
+         "status": "ok",
+         "service": "HR Report Generator API",
+         "version": "1.0.0",
+         "allowed_origins": ALLOWED_ORIGINS,
+         "active_model": get_active_model(),
+     })
+
+
+ @app.route('/api/health')
+ def health():
+     """Health check endpoint."""
+     return jsonify({"status": "healthy"})
+
+
+ @app.route('/api/generate', methods=['POST', 'OPTIONS'])
+ @validate_origin
+ def generate_report():
+     """Generate an HR document based on type."""
+     if request.method == 'OPTIONS':
+         return '', 204
+
+     try:
+         # Tolerate missing/invalid JSON bodies; the field check below returns 400
+         data = request.get_json(silent=True) or {}
+
+         doc_type = data.get('doc_type', 'Memorandum')
+         employee_name = data.get('employee_name', '')
+         date_from = data.get('date_from', '')
+         date_to = data.get('date_to', '')
+         reason = data.get('reason', '')
+         additional_notes = data.get('additional_notes', '')
+
+         if not all([employee_name, date_from, reason]):
+             return jsonify({'success': False, 'error': 'Missing required fields'}), 400
+
+         # Format the date range
+         if date_to and date_to != date_from:
+             incident_date_range = f"{date_from} to {date_to}"
+         else:
+             incident_date_range = date_from
+
+         # Get document template info
+         template_info = DOCUMENT_TEMPLATES.get(doc_type, DOCUMENT_TEMPLATES['Memorandum'])
+
+         # Get the active model from endpoints.txt
+         model_name = get_active_model()
+
+         # Build the synthesizer input
+         synth_input = {
+             'employee_name': employee_name,
+             'incident_date_range': incident_date_range,
+             'incident_reason': additional_notes or reason,
+             'violation_type': reason,
+             'report_date': datetime.now().strftime('%Y-%m-%d'),
+             'doc_type': doc_type,
+             'doc_title': template_info['title'],
+             'required_sections': template_info['sections'],
+         }
+
+         # Generate the report
+         synthesizer = ReportSynthesizer(model_name=model_name)
+         result = synthesizer.synthesize_from_dict(synth_input)
+
+         if not result.success:
+             return jsonify({'success': False, 'error': result.error}), 500
+
+         return jsonify({
+             'success': True,
+             'markdown': result.markdown_report,
+             'validation': {
+                 'is_valid': True,
+                 'errors': [],
+                 'warnings': [],
+             },
+             'sources': result.retrieval_context.sources_used,
+             'model_used': model_name,
+         })
+
+     except Exception as e:
+         import traceback
+         traceback.print_exc()
+         return jsonify({'success': False, 'error': str(e)}), 500
+
+
+ @app.route('/api/status')
+ @validate_origin
+ def get_status():
+     """Get system status."""
+     try:
+         vector_store = FAISSVectorStore()
+         loaded = vector_store.load()
+
+         return jsonify({
+             'knowledge_base': {
+                 'loaded': loaded,
+                 'chunks': vector_store.size if loaded else 0,
+                 'sources': vector_store.get_sources() if loaded else [],
+             },
+             'active_model': get_active_model(),
+             'allowed_origins': ALLOWED_ORIGINS,
+         })
+     except Exception as e:
+         return jsonify({
+             'knowledge_base': {'loaded': False, 'chunks': 0, 'sources': []},
+             'error': str(e),
+         })
+
+
+ @app.route('/api/config')
+ def get_config():
+     """Get public configuration (no secrets)."""
+     return jsonify({
+         'document_types': list(DOCUMENT_TEMPLATES.keys()),
+         'active_model': get_active_model(),
+     })
+
+
+ # ============================================
+ # Main
+ # ============================================
+
+ if __name__ == '__main__':
+     # Ensure directories exist
+     settings.ensure_directories()
+
+     print("=" * 50)
+     print("HR Report Generator API")
+     print("=" * 50)
+     print(f"Active Model: {get_active_model()}")
+     print(f"Allowed Origins: {ALLOWED_ORIGINS}")
+     print("=" * 50)
+
+     # Local development server only; the Docker image runs gunicorn instead
+     app.run(host='0.0.0.0', port=7860, debug=True)
endpoints.txt ADDED
@@ -0,0 +1,14 @@
+ # LLM Endpoints Configuration
+ # Edit this file to change the model used by the API
+ # Format: model_name (one per line, first uncommented line is used)
+
+ # Free OpenRouter Models:
+ google/gemma-2-9b-it:free
+ # meta-llama/llama-3.2-3b-instruct:free
+ # mistralai/mistral-7b-instruct:free
+ # google/gemini-pro-1.5-exp
+
+ # Paid Models (if you have credits):
+ # anthropic/claude-3.5-sonnet
+ # openai/gpt-4o
+ # meta-llama/llama-3.1-70b-instruct
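After editing this file, the `/api/config` endpoint defined in `api.py` reflects which line won; a quick check, assuming your Space URL:

```bash
curl -s https://YOUR-SPACE.hf.space/api/config
# → {"active_model": "google/gemma-2-9b-it:free", "document_types": ["Memorandum", ...]}
```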
requirements.txt ADDED
@@ -0,0 +1,21 @@
+ # Flask and API
+ flask>=3.0.0
+ gunicorn>=21.0.0
+ flask-cors>=4.0.0
+ python-dotenv>=1.0.0
+ requests>=2.31.0
+
+ # Pydantic for data validation
+ pydantic>=2.0.0
+ pydantic-settings>=2.0.0
+
+ # Vector store and embeddings
+ faiss-cpu>=1.7.4
+ sentence-transformers>=2.2.0
+ numpy>=1.24.0
+
+ # Document processing
+ PyYAML>=6.0.0
+
+ # File handling
+ werkzeug>=3.0.0
src/__init__.py ADDED
@@ -0,0 +1,11 @@
+ """HR Report Generator - HuggingFace Spaces Backend."""
+
+ from src.config import settings
+ from src.rag import ReportSynthesizer
+ from src.knowledge import FAISSVectorStore
+
+ __all__ = [
+     "settings",
+     "ReportSynthesizer",
+     "FAISSVectorStore",
+ ]
src/config.py ADDED
@@ -0,0 +1,63 @@
+ """HR Report Generator - Configuration Module (HuggingFace Version)."""
+
+ from pathlib import Path
+
+ from pydantic import Field
+ from pydantic_settings import BaseSettings, SettingsConfigDict
+
+
+ class Settings(BaseSettings):
+     """Application settings loaded from environment variables."""
+
+     model_config = SettingsConfigDict(
+         env_file=".env",
+         env_file_encoding="utf-8",
+         extra="ignore",
+     )
+
+     # LLM Configuration (loaded from environment or endpoints.txt)
+     llm_model: str = Field(default="google/gemma-2-9b-it:free", description="LLM model name")
+     llm_temperature: float = Field(default=0.0, ge=0.0, le=1.0, description="LLM temperature")
+
+     # Embedding Configuration
+     embedding_model: str = Field(
+         default="BAAI/bge-small-en-v1.5",
+         description="HuggingFace embedding model",
+     )
+
+     # Paths (relative for Docker)
+     data_dir: Path = Field(default=Path("./data"))
+     documents_dir: Path = Field(default=Path("./data/documents"))
+     markdown_dir: Path = Field(default=Path("./data/markdown"))
+     embeddings_dir: Path = Field(default=Path("./data/embeddings"))
+     outputs_dir: Path = Field(default=Path("./data/outputs"))
+     templates_dir: Path = Field(default=Path("./templates"))
+
+     # Vector Store
+     faiss_index_path: Path = Field(default=Path("./data/embeddings/faiss_index"))
+
+     # Chunking Configuration
+     chunk_size: int = Field(default=512, description="Target chunk size in characters")
+     chunk_overlap: int = Field(default=50, description="Overlap between chunks in characters")
+
+     # Retrieval Configuration
+     retrieval_top_k: int = Field(default=5, description="Number of documents to retrieve")
+     retrieval_min_score: float = Field(default=0.3, description="Minimum similarity score")
+
+     def ensure_directories(self) -> None:
+         """Create all required directories if they don't exist."""
+         for path in [
+             self.data_dir,
+             self.documents_dir,
+             self.markdown_dir,
+             self.embeddings_dir,
+             self.outputs_dir,
+             self.templates_dir,
+         ]:
+             path.mkdir(parents=True, exist_ok=True)
+
+
+ # Global settings instance
+ settings = Settings()
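A minimal usage sketch; pydantic-settings resolves each field from the environment (or `.env`, case-insensitively) before falling back to the defaults above:

```python
from src.config import settings

settings.ensure_directories()     # creates data/, data/embeddings/, templates/, ...
print(settings.chunk_size)        # 512 unless CHUNK_SIZE is set in the environment
print(settings.faiss_index_path)  # data/embeddings/faiss_index
```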
src/document_processor/__init__.py ADDED
@@ -0,0 +1,8 @@
+ """Document processor module for HR Report Generator."""
+
+ from src.document_processor.chunker import DocumentChunk, SemanticChunker
+
+ __all__ = [
+     "DocumentChunk",
+     "SemanticChunker",
+ ]
src/document_processor/chunker.py ADDED
@@ -0,0 +1,209 @@
+ """Semantic document chunker for RAG processing."""
+
+ import re
+ from pathlib import Path
+ from typing import Optional
+
+ from pydantic import BaseModel
+
+ from src.config import settings
+
+
+ class DocumentChunk(BaseModel):
+     """A chunk of document content with metadata."""
+
+     content: str
+     source_file: str
+     chunk_index: int
+     start_char: int
+     end_char: int
+     section_title: Optional[str] = None
+     page_hint: Optional[str] = None
+
+     @property
+     def chunk_id(self) -> str:
+         """Generate a unique chunk identifier."""
+         return f"{Path(self.source_file).stem}_{self.chunk_index:04d}"
+
+
+ class SemanticChunker:
+     """Chunks Markdown documents by semantic boundaries.
+
+     Respects document structure (headers, paragraphs, lists) while
+     maintaining target chunk sizes for optimal embedding performance.
+     """
+
+     def __init__(
+         self,
+         chunk_size: Optional[int] = None,
+         chunk_overlap: Optional[int] = None,
+     ):
+         """Initialize the chunker.
+
+         Args:
+             chunk_size: Target chunk size in characters.
+             chunk_overlap: Overlap between chunks in characters.
+         """
+         self.chunk_size = chunk_size or settings.chunk_size
+         self.chunk_overlap = chunk_overlap or settings.chunk_overlap
+
+         # Patterns for semantic splitting
+         self._header_pattern = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)
+         self._section_break_pattern = re.compile(r"\n{3,}")
+         self._list_item_pattern = re.compile(r"^[\s]*[-*+]\s+", re.MULTILINE)
+
+     def _extract_frontmatter(self, content: str) -> tuple[dict, str]:
+         """Extract YAML frontmatter from markdown content."""
+         frontmatter = {}
+         body = content
+
+         if content.startswith("---"):
+             parts = content.split("---", 2)
+             if len(parts) >= 3:
+                 import yaml
+
+                 try:
+                     frontmatter = yaml.safe_load(parts[1]) or {}
+                 except Exception:
+                     pass
+                 body = parts[2].strip()
+
+         return frontmatter, body
+
+     def _find_section_boundaries(self, content: str) -> list[tuple[int, int, str]]:
+         """Find semantic section boundaries based on headers.
+
+         Returns a list of (start_pos, end_pos, section_title) tuples.
+         """
+         boundaries = []
+         headers = list(self._header_pattern.finditer(content))
+
+         if not headers:
+             return [(0, len(content), "Document")]
+
+         # Add content before the first header if it exists
+         if headers[0].start() > 0:
+             boundaries.append((0, headers[0].start(), "Preamble"))
+
+         # Add each section
+         for i, header in enumerate(headers):
+             start = header.start()
+             end = headers[i + 1].start() if i + 1 < len(headers) else len(content)
+             title = header.group(2).strip()
+             boundaries.append((start, end, title))
+
+         return boundaries
+
+     def _split_section(self, content: str, section_title: str) -> list[str]:
+         """Split a section into smaller chunks respecting boundaries."""
+         if len(content) <= self.chunk_size:
+             return [content] if content.strip() else []
+
+         chunks = []
+         current_chunk = ""
+
+         # Split by paragraphs first
+         paragraphs = re.split(r"\n\n+", content)
+
+         for para in paragraphs:
+             para = para.strip()
+             if not para:
+                 continue
+
+             # If a paragraph alone exceeds the chunk size, split by sentences
+             if len(para) > self.chunk_size:
+                 sentences = re.split(r"(?<=[.!?])\s+", para)
+                 for sentence in sentences:
+                     if len(current_chunk) + len(sentence) + 1 <= self.chunk_size:
+                         current_chunk += (" " if current_chunk else "") + sentence
+                     else:
+                         if current_chunk:
+                             chunks.append(current_chunk)
+                         current_chunk = sentence
+             elif len(current_chunk) + len(para) + 2 <= self.chunk_size:
+                 current_chunk += ("\n\n" if current_chunk else "") + para
+             else:
+                 if current_chunk:
+                     chunks.append(current_chunk)
+                 current_chunk = para
+
+         if current_chunk.strip():
+             chunks.append(current_chunk)
+
+         return chunks
+
+     def _add_overlap(self, chunks: list[str]) -> list[str]:
+         """Add overlap between chunks for context preservation."""
+         if self.chunk_overlap <= 0 or len(chunks) <= 1:
+             return chunks
+
+         overlapped = []
+         for i, chunk in enumerate(chunks):
+             if i > 0:
+                 # Prefix with the tail of the previous chunk
+                 prev_chunk = chunks[i - 1]
+                 overlap_text = prev_chunk[-self.chunk_overlap:].strip()
+                 if overlap_text:
+                     chunk = f"...{overlap_text}\n\n{chunk}"
+             overlapped.append(chunk)
+
+         return overlapped
+
+     def chunk_document(self, markdown_path: Path) -> list[DocumentChunk]:
+         """Chunk a Markdown document into semantic pieces.
+
+         Args:
+             markdown_path: Path to the Markdown file.
+
+         Returns:
+             List of DocumentChunks with metadata.
+         """
+         markdown_path = Path(markdown_path)
+         content = markdown_path.read_text(encoding="utf-8")
+
+         frontmatter, body = self._extract_frontmatter(content)
+         source_file = frontmatter.get("source", markdown_path.name)
+
+         sections = self._find_section_boundaries(body)
+         all_chunks = []
+         chunk_index = 0
+
+         for start_pos, end_pos, section_title in sections:
+             section_content = body[start_pos:end_pos].strip()
+             if not section_content:
+                 continue
+
+             section_chunks = self._split_section(section_content, section_title)
+             section_chunks = self._add_overlap(section_chunks)
+
+             for chunk_content in section_chunks:
+                 if not chunk_content.strip():
+                     continue
+
+                 chunk = DocumentChunk(
+                     content=chunk_content,
+                     source_file=str(markdown_path),
+                     chunk_index=chunk_index,
+                     start_char=start_pos,
+                     end_char=end_pos,
+                     section_title=section_title,
+                 )
+                 all_chunks.append(chunk)
+                 chunk_index += 1
+
+         return all_chunks
+
+     def chunk_documents(self, markdown_paths: list[Path]) -> list[DocumentChunk]:
+         """Chunk multiple Markdown documents.
+
+         Args:
+             markdown_paths: List of paths to Markdown files.
+
+         Returns:
+             List of all DocumentChunks from all documents.
+         """
+         all_chunks = []
+         for path in markdown_paths:
+             chunks = self.chunk_document(path)
+             all_chunks.extend(chunks)
+         return all_chunks
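A usage sketch for the chunker; the input path below is hypothetical:

```python
from pathlib import Path
from src.document_processor.chunker import SemanticChunker

chunker = SemanticChunker(chunk_size=512, chunk_overlap=50)
chunks = chunker.chunk_document(Path("data/markdown/employee_handbook.md"))  # hypothetical file
for c in chunks[:3]:
    print(c.chunk_id, c.section_title, len(c.content))
```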
src/knowledge/__init__.py ADDED
@@ -0,0 +1,10 @@
+ """Knowledge module for HR Report Generator."""
+
+ from src.knowledge.vector_store import FAISSVectorStore, RetrievalResult
+ from src.knowledge.embeddings import EmbeddingModel
+
+ __all__ = [
+     "FAISSVectorStore",
+     "RetrievalResult",
+     "EmbeddingModel",
+ ]
src/knowledge/embeddings.py ADDED
@@ -0,0 +1,101 @@
+ """Embedding model wrapper for document vectorization."""
+
+ from typing import Optional
+
+ import numpy as np
+ from sentence_transformers import SentenceTransformer
+
+ from src.config import settings
+ from src.document_processor.chunker import DocumentChunk
+
+
+ class EmbeddingModel:
+     """Wrapper for sentence-transformers embedding models.
+
+     Provides efficient batch embedding with lazy model loading.
+     """
+
+     def __init__(self, model_name: Optional[str] = None):
+         """Initialize the embedding model.
+
+         Args:
+             model_name: HuggingFace model name. Defaults to settings.embedding_model.
+         """
+         self.model_name = model_name or settings.embedding_model
+         self._model: Optional[SentenceTransformer] = None
+
+     @property
+     def model(self) -> SentenceTransformer:
+         """Lazily load the embedding model."""
+         if self._model is None:
+             self._model = SentenceTransformer(self.model_name)
+         return self._model
+
+     @property
+     def embedding_dimension(self) -> int:
+         """Get the dimension of embeddings produced by this model."""
+         return self.model.get_sentence_embedding_dimension()
+
+     def embed_text(self, text: str) -> np.ndarray:
+         """Embed a single text string.
+
+         Args:
+             text: Text to embed.
+
+         Returns:
+             Embedding vector as a numpy array.
+         """
+         return self.model.encode(text, convert_to_numpy=True, normalize_embeddings=True)
+
+     def embed_texts(self, texts: list[str], batch_size: int = 32) -> np.ndarray:
+         """Embed multiple texts efficiently.
+
+         Args:
+             texts: List of texts to embed.
+             batch_size: Batch size for processing.
+
+         Returns:
+             Array of embedding vectors (num_texts x embedding_dim).
+         """
+         return self.model.encode(
+             texts,
+             batch_size=batch_size,
+             convert_to_numpy=True,
+             normalize_embeddings=True,
+             show_progress_bar=len(texts) > 100,
+         )
+
+     def embed_chunks(
+         self, chunks: list[DocumentChunk], batch_size: int = 32
+     ) -> list[tuple[DocumentChunk, np.ndarray]]:
+         """Embed document chunks with their metadata.
+
+         Args:
+             chunks: List of DocumentChunks to embed.
+             batch_size: Batch size for processing.
+
+         Returns:
+             List of (chunk, embedding) tuples.
+         """
+         texts = [chunk.content for chunk in chunks]
+         embeddings = self.embed_texts(texts, batch_size=batch_size)
+
+         return list(zip(chunks, embeddings))
+
+     def embed_query(self, query: str) -> np.ndarray:
+         """Embed a query for retrieval.
+
+         Some models use different prompting for queries vs documents.
+
+         Args:
+             query: Query text to embed.
+
+         Returns:
+             Query embedding vector.
+         """
+         # BGE models benefit from query prefixes
+         if "bge" in self.model_name.lower():
+             query = f"Represent this sentence for searching relevant passages: {query}"
+
+         return self.embed_text(query)
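A sketch of how the wrapper is meant to be used; because `embed_texts` and `embed_query` both normalize their outputs, a plain inner product is a cosine similarity:

```python
from src.knowledge.embeddings import EmbeddingModel

em = EmbeddingModel()  # defaults to BAAI/bge-small-en-v1.5 from settings
docs = em.embed_texts(["Employees arriving late ...", "PTO accrues monthly ..."])  # illustrative passages
query = em.embed_query("discipline for repeated tardiness")
print(docs @ query)  # one cosine score per passage (vectors are normalized)
```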
src/knowledge/vector_store.py ADDED
@@ -0,0 +1,205 @@
+ """FAISS vector store for document retrieval."""
+
+ import json
+ from pathlib import Path
+ from typing import Optional
+
+ import faiss
+ import numpy as np
+ from pydantic import BaseModel
+
+ from src.config import settings
+ from src.document_processor.chunker import DocumentChunk
+ from src.knowledge.embeddings import EmbeddingModel
+
+
+ class RetrievalResult(BaseModel):
+     """Result from vector store retrieval."""
+
+     chunk: DocumentChunk
+     score: float
+     rank: int
+
+     class Config:
+         arbitrary_types_allowed = True
+
+
+ class FAISSVectorStore:
+     """FAISS-based vector store for efficient similarity search.
+
+     Stores document chunks with their embeddings and provides
+     fast retrieval with source tracking for citations.
+     """
+
+     def __init__(
+         self,
+         embedding_model: Optional[EmbeddingModel] = None,
+         index_path: Optional[Path] = None,
+     ):
+         """Initialize the vector store.
+
+         Args:
+             embedding_model: Model for generating embeddings.
+             index_path: Path to store/load the FAISS index.
+         """
+         self.embedding_model = embedding_model or EmbeddingModel()
+         self.index_path = Path(index_path or settings.faiss_index_path)
+
+         self._index: Optional[faiss.IndexFlatIP] = None
+         self._chunks: list[DocumentChunk] = []
+         self._is_loaded = False
+
+     def _ensure_directory(self) -> None:
+         """Ensure the index directory exists."""
+         self.index_path.parent.mkdir(parents=True, exist_ok=True)
+
+     def _create_index(self, dimension: int) -> faiss.IndexFlatIP:
+         """Create a new FAISS index.
+
+         Uses Inner Product (IP) since embeddings are normalized,
+         making inner product equivalent to cosine similarity.
+         """
+         return faiss.IndexFlatIP(dimension)
+
+     def add_chunks(self, chunks: list[DocumentChunk]) -> int:
+         """Add document chunks to the vector store.
+
+         Args:
+             chunks: List of DocumentChunks to add.
+
+         Returns:
+             Number of chunks added.
+         """
+         if not chunks:
+             return 0
+
+         # Generate embeddings
+         chunk_embeddings = self.embedding_model.embed_chunks(chunks)
+
+         # Initialize the index if needed
+         if self._index is None:
+             dimension = self.embedding_model.embedding_dimension
+             self._index = self._create_index(dimension)
+
+         # Add to the index
+         embeddings_array = np.vstack([emb for _, emb in chunk_embeddings])
+         self._index.add(embeddings_array)
+
+         # Store chunks for retrieval
+         for chunk, _ in chunk_embeddings:
+             self._chunks.append(chunk)
+
+         return len(chunks)
+
+     def search(
+         self,
+         query: str,
+         top_k: Optional[int] = None,
+         min_score: Optional[float] = None,
+     ) -> list[RetrievalResult]:
+         """Search for relevant chunks.
+
+         Args:
+             query: Search query.
+             top_k: Number of results to return.
+             min_score: Minimum similarity score threshold.
+
+         Returns:
+             List of RetrievalResults ordered by relevance.
+         """
+         if self._index is None or self._index.ntotal == 0:
+             return []
+
+         top_k = top_k or settings.retrieval_top_k
+         min_score = min_score or settings.retrieval_min_score
+
+         # Embed the query
+         query_embedding = self.embedding_model.embed_query(query)
+         query_embedding = query_embedding.reshape(1, -1)
+
+         # Search
+         scores, indices = self._index.search(query_embedding, min(top_k, self._index.ntotal))
+
+         # Build results
+         results = []
+         for rank, (score, idx) in enumerate(zip(scores[0], indices[0])):
+             if idx < 0 or score < min_score:
+                 continue
+
+             chunk = self._chunks[idx]
+             results.append(
+                 RetrievalResult(
+                     chunk=chunk,
+                     score=float(score),
+                     rank=rank + 1,
+                 )
+             )
+
+         return results
+
+     def save(self) -> None:
+         """Save the index and chunks to disk."""
+         if self._index is None:
+             return
+
+         self._ensure_directory()
+
+         # Save the FAISS index
+         index_file = self.index_path.with_suffix(".faiss")
+         faiss.write_index(self._index, str(index_file))
+
+         # Save chunks as JSON
+         chunks_file = self.index_path.with_suffix(".chunks.json")
+         chunks_data = [chunk.model_dump() for chunk in self._chunks]
+         chunks_file.write_text(json.dumps(chunks_data, indent=2), encoding="utf-8")
+
+     def load(self) -> bool:
+         """Load the index and chunks from disk.
+
+         Returns:
+             True if loaded successfully, False otherwise.
+         """
+         index_file = self.index_path.with_suffix(".faiss")
+         chunks_file = self.index_path.with_suffix(".chunks.json")
+
+         if not index_file.exists() or not chunks_file.exists():
+             return False
+
+         try:
+             # Load the FAISS index
+             self._index = faiss.read_index(str(index_file))
+
+             # Load chunks
+             chunks_data = json.loads(chunks_file.read_text(encoding="utf-8"))
+             self._chunks = [DocumentChunk.model_validate(c) for c in chunks_data]
+
+             self._is_loaded = True
+             return True
+
+         except Exception as e:
+             print(f"Error loading index: {e}")
+             return False
+
+     def clear(self) -> None:
+         """Clear the index and all stored chunks."""
+         self._index = None
+         self._chunks = []
+         self._is_loaded = False
+
+         # Remove files if they exist
+         index_file = self.index_path.with_suffix(".faiss")
+         chunks_file = self.index_path.with_suffix(".chunks.json")
+
+         if index_file.exists():
+             index_file.unlink()
+         if chunks_file.exists():
+             chunks_file.unlink()
+
+     @property
+     def size(self) -> int:
+         """Get the number of chunks in the store."""
+         return len(self._chunks)
+
+     def get_sources(self) -> list[str]:
+         """Get the list of unique source files in the store."""
+         return list(set(chunk.source_file for chunk in self._chunks))
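A sketch of the intended indexing flow, tying the chunker, embedder, and store together (the glob path is hypothetical):

```python
from pathlib import Path
from src.document_processor.chunker import SemanticChunker
from src.knowledge.vector_store import FAISSVectorStore

chunks = SemanticChunker().chunk_documents(sorted(Path("data/markdown").glob("*.md")))
store = FAISSVectorStore()
store.add_chunks(chunks)
store.save()  # writes faiss_index.faiss and faiss_index.chunks.json under data/embeddings/
print(store.size, store.get_sources())
```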
src/rag/__init__.py ADDED
@@ -0,0 +1,12 @@
+ """RAG module for HR Report Generator."""
+
+ from src.rag.synthesizer import ReportSynthesizer, ReportInput, SynthesisResult
+ from src.rag.retriever import DocumentRetriever, RetrievalContext
+
+ __all__ = [
+     "ReportSynthesizer",
+     "ReportInput",
+     "SynthesisResult",
+     "DocumentRetriever",
+     "RetrievalContext",
+ ]
src/rag/prompts.py ADDED
@@ -0,0 +1,155 @@
+ """Prompt templates for HR report generation.
+
+ All prompts enforce strict grounding to retrieved content with
+ zero tolerance for hallucination or invention of facts.
+ """
+
+ # System prompt that enforces strict RAG constraints with document type awareness
+ SYSTEM_PROMPT = """You are an HR Documentation Assistant that generates professional HR documents.
+
+ ## CRITICAL CONSTRAINTS - FOLLOW EXACTLY
+
+ 1. **ONLY USE PROVIDED CONTEXT**: You may ONLY include information from the documents provided below. Do NOT invent, assume, or generalize any facts.
+
+ 2. **CITE EVERYTHING**: Every factual claim MUST reference its source document and section.
+
+ 3. **NO HALLUCINATION**: If information is not in the provided context, you MUST state "No documentation available" for that section.
+
+ 4. **PROFESSIONAL TONE**: Use neutral, factual HR language. No opinions, no emotional language, no speculation.
+
+ 5. **DETERMINISTIC**: Given the same inputs and context, produce identical outputs every time.
+
+ ## POLICY REFERENCE FORMAT - VERY IMPORTANT
+
+ When citing policies, use this EXACT format for clarity:
+ - **Policy Title**: State the exact policy name (e.g., "Attendance and Punctuality Policy")
+ - **Source Document**: Reference the handbook or document name
+ - **Section**: Include the section number if available
+ - **Relevant Text**: Quote the specific policy language in quotation marks
+
+ Example:
+ - Attendance and Punctuality, Hillside Medical Group Employee Handbook, Section 3.2: "Employees are considered tardy if they arrive at their work area 5 minutes after their scheduled starting time. Progressive discipline will begin with the second tardy offense within a 90-day period."
+
+ ## OUTPUT STRUCTURE
+
+ Generate the document following this structure exactly:
+
+ ```markdown
+ # [Document Title]
+
+ **Employee Name:** [employee_name]
+ **Document Date:** [report_date]
+ **Incident Date(s):** [incident_date_range]
+ **Prepared By:** Human Resources Department
+
+ ## Purpose of Document
+ [Brief statement of why this document is being created]
+
+ ## Incident Background
+ [Factual description of what occurred - synthesized ONLY from provided context]
+
+ ## Policy References
+ [List specific policies with DETAILED citations. Include:
+ - Full policy name
+ - Source document
+ - Section number (if available)
+ - Exact quoted text from policy in quotation marks
+ If no policies found, state "No relevant policy documentation available."]
+
+ ## Previous Coaching / Warnings
+ [Summary of prior warnings from context with dates and specifics. If none found, state "No prior warning documentation available."]
+
+ ## Corrective Action Required
+ [Specific actions required - ONLY if stated in context. Otherwise: "Action pending HR review."]
+
+ ## Employee Acknowledgment
+ Employee signature: ___________________ Date: ___________
+ Supervisor signature: _________________ Date: ___________
+ ```
+
+ ## VALIDATION CHECKLIST (Self-verify before outputting)
+
+ - [ ] Every fact traces to a provided document
+ - [ ] All policy citations include source, section, and quoted text
+ - [ ] No invented dates, names, or actions
+ - [ ] Professional, neutral language throughout
+ """
+
+ # Template for the user message with context
+ USER_MESSAGE_TEMPLATE = """## DOCUMENT INFORMATION
+
+ - **Document Type:** {doc_type}
+ - **Employee Name:** {employee_name}
+ - **Document Date:** {report_date}
+ - **Incident Date(s):** {incident_date_range}
+ - **Incident Reason:** {incident_reason}
+ - **Violation Type:** {violation_type}
+
+ ---
+
+ ## RETRIEVED POLICY DOCUMENTS
+
+ The following policy documents contain relevant information. Use these for the Policy References section:
+
+ {policy_context}
+
+ ---
+
+ ## RETRIEVED WARNING/COACHING DOCUMENTS
+
+ The following documents contain prior warning/coaching records:
+
+ {warnings_context}
+
+ ---
+
+ ## INSTRUCTIONS
+
+ 1. Generate the HR document using ONLY the information provided above.
+ 2. For the Policy References section:
+    - Extract and quote the EXACT policy language from the context
+    - Include the source document name
+    - Include section numbers where available
+    - Format each reference clearly with the policy title, source, and quoted text
+ 3. Cite specific sections for all factual claims.
+ 4. State "No documentation available" for any section without supporting context.
+ 5. Use professional, neutral HR language throughout.
+ """
+
+ # Template for formatting retrieved chunks as context
+ CONTEXT_CHUNK_TEMPLATE = """### Source: {source_file}
+ **Section:** {section_title}
+ **Relevance Score:** {score:.2f}
+
+ {content}
+
+ ---
+ """
+
+ # Refusal response for insufficient evidence
+ INSUFFICIENT_EVIDENCE_RESPONSE = """# HR Incident Report
+
+ **Employee Name:** {employee_name}
+ **Document Date:** {report_date}
+ **Incident Date(s):** {incident_date_range}
+ **Prepared By:** Human Resources Department
+
+ ## Purpose of Document
+ To document {incident_reason} incident(s).
+
+ ## Incident Background
+ Insufficient documentation available to generate an incident summary.
+
+ ## Policy References
+ No relevant policy documentation available. Please ensure policy documents have been ingested into the system.
+
+ ## Previous Coaching / Warnings
+ No prior warning documentation available.
+
+ ## Corrective Action Required
+ Report generation incomplete. Please ingest relevant HR documentation before proceeding.
+
+ ---
+
+ **⚠️ NOTICE:** This report could not be completed due to insufficient documentation in the knowledge base. Please add relevant HR policy documents and any prior coaching/warning records, then regenerate this report.
+ """
src/rag/retriever.py ADDED
@@ -0,0 +1,177 @@
+ """Document retriever for RAG pipeline."""
+
+ from pathlib import Path
+ from typing import Optional
+
+ from pydantic import BaseModel
+
+ from src.config import settings
+ from src.knowledge.vector_store import FAISSVectorStore, RetrievalResult
+ from src.rag.prompts import CONTEXT_CHUNK_TEMPLATE
+
+
+ class RetrievalContext(BaseModel):
+     """Context retrieved for report generation."""
+
+     policy_results: list[RetrievalResult]
+     warning_results: list[RetrievalResult]
+     policy_context_text: str
+     warnings_context_text: str
+     has_sufficient_evidence: bool
+     sources_used: list[str]
+
+     class Config:
+         arbitrary_types_allowed = True
+
+
+ class DocumentRetriever:
+     """Retrieves relevant documents for HR report generation.
+
+     Separates retrieval into policy documents and warning/coaching
+     documents to ensure proper context for each report section.
+     """
+
+     def __init__(self, vector_store: Optional[FAISSVectorStore] = None):
+         """Initialize the retriever.
+
+         Args:
+             vector_store: Vector store to search. Creates a new one if not provided.
+         """
+         self.vector_store = vector_store or FAISSVectorStore()
+
+         # Try to load an existing index
+         if not self.vector_store._is_loaded:
+             self.vector_store.load()
+
+     def _format_results_as_context(self, results: list[RetrievalResult]) -> str:
+         """Format retrieval results as a context string for the LLM."""
+         if not results:
+             return "No relevant documents found."
+
+         context_parts = []
+         for result in results:
+             formatted = CONTEXT_CHUNK_TEMPLATE.format(
+                 source_file=Path(result.chunk.source_file).name,
+                 section_title=result.chunk.section_title or "General",
+                 score=result.score,
+                 content=result.chunk.content,
+             )
+             context_parts.append(formatted)
+
+         return "\n".join(context_parts)
+
+     def _classify_results(
+         self, results: list[RetrievalResult]
+     ) -> tuple[list[RetrievalResult], list[RetrievalResult]]:
+         """Classify results into policy and warning categories.
+
+         Uses simple heuristics based on the source filename and content.
+         """
+         policy_results = []
+         warning_results = []
+
+         policy_keywords = ["policy", "handbook", "manual", "guideline", "procedure"]
+         warning_keywords = ["warning", "coaching", "counseling", "disciplinary", "incident"]
+
+         for result in results:
+             source_lower = Path(result.chunk.source_file).stem.lower()
+             content_lower = result.chunk.content.lower()
+
+             # Check if it's a warning/coaching document
+             is_warning = any(kw in source_lower for kw in warning_keywords) or any(
+                 kw in content_lower[:200] for kw in warning_keywords
+             )
+
+             # Check if it's a policy document
+             is_policy = any(kw in source_lower for kw in policy_keywords) or any(
+                 kw in content_lower[:200] for kw in policy_keywords
+             )
+
+             if is_warning:
+                 warning_results.append(result)
+             elif is_policy:
+                 policy_results.append(result)
+             else:
+                 # Default to policy if unclear
+                 policy_results.append(result)
+
+         return policy_results, warning_results
+
+     def retrieve(
+         self,
+         employee_name: str,
+         violation_type: str,
+         incident_reason: str,
+         top_k: Optional[int] = None,
+         min_score: Optional[float] = None,
+     ) -> RetrievalContext:
+         """Retrieve relevant context for report generation.
+
+         Args:
+             employee_name: Name of the employee.
+             violation_type: Type of violation (e.g., "Tardiness").
+             incident_reason: Description of the incident.
+             top_k: Number of results per query.
+             min_score: Minimum similarity score.
+
+         Returns:
+             RetrievalContext with categorized results.
+         """
+         top_k = top_k or settings.retrieval_top_k
+         min_score = min_score or settings.retrieval_min_score
+
+         # Build search queries
+         policy_query = f"{violation_type} policy procedure disciplinary action"
+         warning_query = f"{employee_name} warning coaching disciplinary {violation_type}"
+         incident_query = f"{incident_reason} {violation_type}"
+
+         # Execute searches
+         policy_results = self.vector_store.search(policy_query, top_k=top_k, min_score=min_score)
+         warning_results = self.vector_store.search(warning_query, top_k=top_k, min_score=min_score)
+         incident_results = self.vector_store.search(
+             incident_query, top_k=top_k, min_score=min_score
+         )
+
+         # Combine and deduplicate, keeping the best score per chunk
+         all_results = {}
+         for result in policy_results + warning_results + incident_results:
+             chunk_id = result.chunk.chunk_id
+             if chunk_id not in all_results or result.score > all_results[chunk_id].score:
+                 all_results[chunk_id] = result
+
+         # Re-classify all results
+         all_results_list = sorted(all_results.values(), key=lambda r: r.score, reverse=True)
+         policy_classified, warning_classified = self._classify_results(all_results_list)
+
+         # Format as context text
+         policy_context = self._format_results_as_context(policy_classified)
+         warnings_context = self._format_results_as_context(warning_classified)
+
+         # Determine whether we have sufficient evidence
+         has_evidence = len(policy_classified) > 0 or len(warning_classified) > 0
+
+         # Collect unique sources
+         sources = list(
+             set(
+                 Path(r.chunk.source_file).name
+                 for r in policy_classified + warning_classified
+             )
+         )
+
+         return RetrievalContext(
+             policy_results=policy_classified,
+             warning_results=warning_classified,
+             policy_context_text=policy_context,
+             warnings_context_text=warnings_context,
+             has_sufficient_evidence=has_evidence,
+             sources_used=sources,
+         )
+
+     def retrieve_for_employee(
+         self, employee_name: str, top_k: int = 10
+     ) -> list[RetrievalResult]:
+         """Retrieve all documents mentioning an employee.
+
+         Useful for finding prior warnings and coaching records.
+         """
+         return self.vector_store.search(employee_name, top_k=top_k, min_score=0.2)
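A retrieval sketch, assuming an index has already been built and saved:

```python
from src.rag.retriever import DocumentRetriever

retriever = DocumentRetriever()  # loads the persisted FAISS index if one exists
ctx = retriever.retrieve(
    employee_name="John Smith",
    violation_type="Tardiness",
    incident_reason="Employee arrived 30 minutes late.",
)
print(ctx.has_sufficient_evidence, ctx.sources_used)
```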
src/rag/synthesizer.py ADDED
@@ -0,0 +1,197 @@
+ """Report synthesizer using an LLM with strict grounding."""
+
+ import os
+ from datetime import datetime
+ from typing import Optional
+
+ import requests
+ from dotenv import load_dotenv
+ from pydantic import BaseModel
+
+ # Load environment variables before importing settings
+ load_dotenv()
+
+ from src.config import settings
+ from src.rag.prompts import (
+     INSUFFICIENT_EVIDENCE_RESPONSE,
+     SYSTEM_PROMPT,
+     USER_MESSAGE_TEMPLATE,
+ )
+ from src.rag.retriever import DocumentRetriever, RetrievalContext
+
+
+ class ReportInput(BaseModel):
+     """Structured input for report generation."""
+
+     employee_name: str
+     incident_date_range: str
+     incident_reason: str
+     violation_type: str
+     report_date: Optional[str] = None
+     doc_type: Optional[str] = "Memorandum"
+
+     def __init__(self, **data):
+         # Default the report date to today when not supplied
+         if "report_date" not in data or data["report_date"] is None:
+             data["report_date"] = datetime.now().strftime("%Y-%m-%d")
+         super().__init__(**data)
+
+
+ class SynthesisResult(BaseModel):
+     """Result of report synthesis."""
+
+     markdown_report: str
+     retrieval_context: RetrievalContext
+     model_used: str
+     generation_timestamp: str
+     success: bool
+     error: Optional[str] = None
+
+     class Config:
+         arbitrary_types_allowed = True
+
+
+ class ReportSynthesizer:
+     """Synthesizes HR incident reports using RAG.
+
+     Combines retrieved context with structured inputs to generate
+     strictly grounded reports using an LLM with temperature=0.
+     """
+
+     def __init__(
+         self,
+         retriever: Optional[DocumentRetriever] = None,
+         model_name: Optional[str] = None,
+     ):
+         """Initialize the synthesizer.
+
+         Args:
+             retriever: Document retriever for context.
+             model_name: Model name for OpenRouter.
+         """
+         self.retriever = retriever or DocumentRetriever()
+         self.model_name = model_name or settings.llm_model
+         self.api_key = os.getenv("OPENROUTER_API_KEY", "")
+         self.api_base = os.getenv("OPENROUTER_API_BASE", "https://openrouter.ai/api/v1")
+
+     def _build_user_message(
+         self, report_input: ReportInput, context: RetrievalContext
+     ) -> str:
+         """Build the user message with retrieved context."""
+         return USER_MESSAGE_TEMPLATE.format(
+             doc_type=report_input.doc_type,
+             employee_name=report_input.employee_name,
+             report_date=report_input.report_date,
+             incident_date_range=report_input.incident_date_range,
+             incident_reason=report_input.incident_reason,
+             violation_type=report_input.violation_type,
+             policy_context=context.policy_context_text,
+             warnings_context=context.warnings_context_text,
+         )
+
+     def _generate_with_llm(self, user_message: str) -> str:
+         """Generate the report via the OpenRouter API."""
+         headers = {
+             "Authorization": f"Bearer {self.api_key}",
+             "Content-Type": "application/json",
+             "HTTP-Referer": "https://hr-report-generator.hf.space",
+             "X-Title": "HR Incident Report Generator",
+         }
+
+         payload = {
+             "model": self.model_name,
+             "messages": [
+                 {"role": "system", "content": SYSTEM_PROMPT},
+                 {"role": "user", "content": user_message},
+             ],
+             "temperature": settings.llm_temperature,
+             "max_tokens": 2048,
+         }
+
+         response = requests.post(
+             f"{self.api_base}/chat/completions",
+             headers=headers,
+             json=payload,
+             timeout=120,
+         )
+
+         if response.status_code != 200:
+             raise Exception(f"OpenRouter API error: {response.status_code} - {response.text}")
+
+         result = response.json()
+         return result["choices"][0]["message"]["content"]
+
+     def _generate_insufficient_evidence_report(self, report_input: ReportInput) -> str:
+         """Generate a report indicating insufficient evidence."""
+         return INSUFFICIENT_EVIDENCE_RESPONSE.format(
+             employee_name=report_input.employee_name,
+             report_date=report_input.report_date,
+             incident_date_range=report_input.incident_date_range,
+             incident_reason=report_input.incident_reason,
+             violation_type=report_input.violation_type,
+         )
+
+     def synthesize(self, report_input: ReportInput) -> SynthesisResult:
+         """Synthesize an HR incident report.
+
+         Args:
+             report_input: Structured input data.
+
+         Returns:
+             SynthesisResult with the generated report.
+         """
+         timestamp = datetime.now().isoformat()
+
+         try:
+             # Retrieve relevant context
+             context = self.retriever.retrieve(
+                 employee_name=report_input.employee_name,
+                 violation_type=report_input.violation_type,
+                 incident_reason=report_input.incident_reason,
+             )
+
+             # Fall back to the refusal template when nothing was retrieved
+             if not context.has_sufficient_evidence:
+                 return SynthesisResult(
+                     markdown_report=self._generate_insufficient_evidence_report(report_input),
+                     retrieval_context=context,
+                     model_used=self.model_name,
+                     generation_timestamp=timestamp,
+                     success=True,
+                 )
+
+             # Build the prompt and generate
+             user_message = self._build_user_message(report_input, context)
+             markdown_report = self._generate_with_llm(user_message)
+
+             return SynthesisResult(
+                 markdown_report=markdown_report,
+                 retrieval_context=context,
+                 model_used=self.model_name,
+                 generation_timestamp=timestamp,
+                 success=True,
+             )
+
+         except Exception as e:
+             return SynthesisResult(
+                 markdown_report="",
+                 retrieval_context=RetrievalContext(
+                     policy_results=[],
+                     warning_results=[],
+                     policy_context_text="",
+                     warnings_context_text="",
+                     has_sufficient_evidence=False,
+                     sources_used=[],
+                 ),
+                 model_used=self.model_name,
+                 generation_timestamp=timestamp,
+                 success=False,
+                 error=str(e),
+             )
+
+     def synthesize_from_dict(self, data: dict) -> SynthesisResult:
+         """Synthesize a report from dictionary input.
+
+         Convenience method for CLI and API usage.
+         """
+         report_input = ReportInput(**data)
+         return self.synthesize(report_input)
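An end-to-end sketch mirroring what `api.py` does per request; it assumes `OPENROUTER_API_KEY` is set and an index exists (otherwise the insufficient-evidence template is returned):

```python
from src.rag.synthesizer import ReportSynthesizer

synth = ReportSynthesizer(model_name="google/gemma-2-9b-it:free")
result = synth.synthesize_from_dict({
    "employee_name": "John Smith",
    "incident_date_range": "2026-02-01",
    "incident_reason": "Employee arrived 30 minutes late.",
    "violation_type": "Tardiness",
    "doc_type": "Memorandum",
})
print(result.markdown_report if result.success else result.error)
```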