title: shl-ai-agent
sdk: docker
app_port: 7860
license: mit
SHL Assessment Recommendation Agent
A conversational AI agent for recommending SHL psychometric assessments from the SHL Individual Test Solutions catalog.
Built for the SHL Research Intern, AI Assignment; deployed on Hugging Face Spaces using Docker.
What it does
- Accepts a full conversation history via `POST /chat` and returns a recommendation reply.
- Recommends 1–10 SHL assessments per response when enough context is available.
- Asks clarifying questions when the query is vague.
- Refuses off-topic requests (legal advice, compensation, prompt-injection).
- Tracks constraints across conversation turns (role, seniority, domain, language).
- Returns `end_of_conversation: true` when the user confirms the shortlist.
- Stateless: no server-side session storage.
API Schema
GET /health
{"status": "ok"}
POST /chat
Request:
{
"messages": [
{"role": "user", "content": "..."},
{"role": "assistant", "content": "..."}
]
}
Response:
{
"reply": "string",
"recommendations": [
{"name": "string", "url": "string", "test_type": "string"}
],
"end_of_conversation": false
}
`recommendations` is `[]` when clarifying or refusing. `recommendations` has 1–10 items when shortlisting.
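The response contract above can be checked mechanically. The following is a minimal sketch using only the standard library; `validate_chat_response` is a hypothetical helper written for illustration, not part of the project's `schemas.py`, and it checks only the field types and the 10-item cap stated in this section.

```python
from typing import Any

REQUIRED_REC_FIELDS = ("name", "url", "test_type")
MAX_RECOMMENDATIONS = 10  # upper bound stated in the schema notes


def validate_chat_response(payload: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the payload is valid."""
    problems: list[str] = []

    if not isinstance(payload.get("reply"), str):
        problems.append("reply must be a string")
    if not isinstance(payload.get("end_of_conversation"), bool):
        problems.append("end_of_conversation must be a boolean")

    recs = payload.get("recommendations")
    if not isinstance(recs, list):
        problems.append("recommendations must be a list")
        return problems
    if len(recs) > MAX_RECOMMENDATIONS:
        problems.append("recommendations must have at most 10 items")

    # Every recommendation needs the three string fields from the schema.
    for i, rec in enumerate(recs):
        for field in REQUIRED_REC_FIELDS:
            if not isinstance(rec, dict) or not isinstance(rec.get(field), str):
                problems.append(f"recommendations[{i}].{field} must be a string")
    return problems
```

An empty `recommendations` list passes, which matches the clarifying and refusing cases.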
Project Structure
shl-agent/
├── app/
│   ├── __init__.py          # Package marker
│   ├── main.py              # FastAPI app, routes, lifespan
│   ├── schemas.py           # Pydantic request/response models
│   ├── agent.py             # LLM orchestration, refusal logic, response parsing
│   ├── retrieval.py         # TF-IDF index build + query
│   └── catalog_loader.py    # Catalog I/O and validation
├── data/
│   └── shl_catalog.json     # SHL catalog (35 items extracted from sample conversations)
├── scripts/
│   └── build_index.py       # Precompute TF-IDF index artifacts
├── tests/
│   ├── sample_requests.json # 10 test scenarios
│   └── evaluate.py          # Automated evaluation script
├── Dockerfile
├── requirements.txt
├── .gitignore
└── README.md
Local Setup and Run
Prerequisites
- Python 3.11+
- An Anthropic API key (model: `claude-sonnet-4-20250514`)
Steps
# 1. Clone the repo
git clone <your-repo-url>
cd shl-agent
# 2. Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
# 3. Install dependencies
pip install -r requirements.txt
# 4. Set your API key
export ANTHROPIC_API_KEY="sk-ant-..."
# 5. (Optional but recommended) Pre-build the TF-IDF index
python scripts/build_index.py
# 6. Start the server
uvicorn app.main:app --host 0.0.0.0 --port 7860 --reload
The server is now running at http://localhost:7860.
Docker Local Run
# Build the Docker image
docker build -t shl-agent .
# Run the container with your API key
docker run -p 7860:7860 -e ANTHROPIC_API_KEY="sk-ant-..." shl-agent
Curl Commands
Health check
curl http://localhost:7860/health
# Expected: {"status":"ok"}
Vague query (should clarify)
curl -X POST http://localhost:7860/chat \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "We need a solution for senior leadership."}
]
}'
Clear query (should recommend)
curl -X POST http://localhost:7860/chat \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "I need a cognitive ability test and personality test for graduate management trainees."}
]
}'
Multi-turn conversation (add constraint)
curl -X POST http://localhost:7860/chat \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "I need a cognitive ability test and personality test for graduate management trainees."},
{"role": "assistant", "content": "For graduate trainees I recommend Verify G+ and OPQ32r."},
{"role": "user", "content": "Can you also add a situational judgement element?"}
]
}'
Comparison question
curl -X POST http://localhost:7860/chat \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "What is the difference between OPQ32r and OPQ MQ Sales Report?"}
]
}'
Off-topic refusal
curl -X POST http://localhost:7860/chat \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Are we legally required under HIPAA to test all staff?"}
]
}'
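The same requests can be issued from Python. This is a minimal sketch using only the standard library; `build_chat_request` and `send_chat` are hypothetical helper names, and the base URL assumes the local server from the setup steps.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local server from the setup steps


def build_chat_request(messages: list[dict[str, str]], base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /chat request carrying the full history."""
    body = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(messages: list[dict[str, str]], base_url: str = BASE_URL, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_chat_request(messages, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because the service is stateless, a multi-turn exchange means appending each assistant reply and the next user message to `messages` and calling `send_chat` again with the whole list.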
Hugging Face Spaces Deployment
Prerequisites
- A Hugging Face account and a Space (Docker SDK).
- `huggingface_hub` CLI installed: `pip install huggingface_hub`.
- Your Anthropic API key added as a Space Secret (not in code).
Add the API key as a Space Secret
In your Space settings → Secrets → add:
ANTHROPIC_API_KEY = sk-ant-...
Git commands to push to Hugging Face Spaces
# 1. Install git-lfs (required for HF)
git lfs install
# 2. Clone your HF Space repo
git clone https://huggingface.co/spaces/<your-username>/shl-ai-agent
cd shl-ai-agent
# 3. Copy project files into the cloned repo
cp -r /path/to/shl-agent/* .
# 4. Commit and push
git add .
git commit -m "Initial deployment: SHL AI Agent"
git push
# HF will build the Docker image automatically on push.
# Monitor the build in: https://huggingface.co/spaces/<username>/shl-ai-agent
Running the Evaluation Script
# Against local server
python tests/evaluate.py --base-url http://localhost:7860
# Against deployed HF Space
python tests/evaluate.py --base-url https://<username>-shl-ai-agent.hf.space
The script exits with code 0 on a full pass and code 1 on any failure, which makes it suitable for CI.
Common Deployment Mistakes on HF Spaces (and how to avoid them)
| Mistake | Fix |
|---|---|
| Binding to 127.0.0.1 instead of 0.0.0.0 | Always use --host 0.0.0.0 in uvicorn CMD |
| Wrong port | app_port: 7860 in README front matter must match Dockerfile EXPOSE and uvicorn --port |
| API key in code | Set as a Space Secret; read via os.environ.get("ANTHROPIC_API_KEY") |
| Running as root | Add useradd and USER in Dockerfile |
| Importing heavy ML libraries (torch) | We use scikit-learn only, which stays within HF free-tier RAM |
| Cold build takes too long | Pre-build TF-IDF index in Dockerfile (RUN python scripts/build_index.py) |
| Missing README YAML front matter | The --- block must be the first thing in README.md |
| git-lfs not installed | Run git lfs install before cloning HF repo |
| Forgetting to set Space to Public | Public required for evaluator to reach your endpoint |
Approach Document
1. Problem Framing
Given a multi-turn conversation between an HR professional and an AI agent, the system must recommend the most relevant SHL psychometric assessments from a fixed catalog. The agent must handle vague queries, accumulate constraints, support comparisons, and refuse off-topic requests.
2. Data Ingestion
The SHL catalog is stored as a structured JSON file (data/shl_catalog.json) with 35 items extracted from the 10 provided sample conversations. Each item has: name, url, test_type, description, duration, languages, keys, seniority, and domains. The catalog is the single source of truth; no external APIs are called for catalog data.
3. Retrieval Design
We use TF-IDF (bigrams) over rich document strings constructed from all catalog fields. Query = concatenation of all user messages (latest message doubled for recency bias). Similarity = cosine similarity, computed via linear_kernel (a plain dot product, which equals cosine similarity when the TF-IDF vectors are L2-normalized, scikit-learn's default). The top 10 results above a score threshold of 0.05 are injected into the system prompt as context.
Why TF-IDF over sentence-transformers? At 35 items, neural embeddings provide minimal recall benefit while adding ~2 GB of model weight. TF-IDF with bigrams and rich field concatenation is fast, transparent, and interview-defensible.
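The retrieval steps can be illustrated without any dependencies. The sketch below is a pure-Python stand-in for the scikit-learn pipeline, not the code in `retrieval.py`: the tokenizer, the smoothed IDF formula, and all function names are assumptions made for illustration. It keeps the parts stated above: unigram-plus-bigram terms, a query with the latest message doubled, a dot product over L2-normalized vectors, a 0.05 threshold, and a top-10 cut.

```python
import math
import re
from collections import Counter

TOP_K = 10
SCORE_THRESHOLD = 0.05


def terms(text: str) -> list[str]:
    """Lowercase word unigrams plus adjacent-word bigrams (assumed tokenizer)."""
    words = re.findall(r"[a-z0-9+]+", text.lower())
    return words + [f"{a} {b}" for a, b in zip(words, words[1:])]


def build_query(user_messages: list[str]) -> str:
    """Concatenate all user messages, repeating the latest one for recency bias."""
    if not user_messages:
        return ""
    return " ".join(user_messages + [user_messages[-1]])


def fit_idf(docs: list[str]) -> dict[str, float]:
    """Smoothed inverse document frequency for every term seen in the corpus."""
    doc_freq = Counter(term for doc in docs for term in set(terms(doc)))
    n = len(docs)
    return {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def vectorize(text: str, idf: dict[str, float]) -> dict[str, float]:
    """L2-normalized TF-IDF vector; terms outside the vocabulary are ignored."""
    counts = Counter(t for t in terms(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def search(docs: list[str], user_messages: list[str]) -> list[tuple[int, float]]:
    """Return (doc index, cosine similarity) pairs above the threshold, best first."""
    idf = fit_idf(docs)
    query_vec = vectorize(build_query(user_messages), idf)
    scored = []
    for i, doc in enumerate(docs):
        doc_vec = vectorize(doc, idf)
        # Dot product of unit vectors is the cosine similarity.
        score = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
        if score > SCORE_THRESHOLD:
            scored.append((i, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:TOP_K]
```

In practice each entry of `docs` would be the rich document string built from one catalog item, and the returned indices would select the items injected into the system prompt.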
4. Agent Policy and Decision Logic
The LLM (Claude Sonnet) receives:
- A fixed system prompt defining scope, refusal rules, and output format.
- Retrieved catalog items as grounding context.
- Full conversation history.
The system prompt uses XML output tags (<reply>, <recommendations>, <end_of_conversation>) for reliable parsing. This avoids JSON fragility from LLM outputs.
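A tag-based parser for this output format might look like the sketch below. The README does not specify what sits inside `<recommendations>`, so the one-item-per-line `name | url | test_type` layout is an assumption, as are the function names and the fallback of using the raw text when no `<reply>` tag is found.

```python
import re


def extract_tag(text: str, tag: str) -> str:
    """Return the stripped contents of the first <tag>...</tag> pair, or ""."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else ""


def parse_llm_output(raw: str) -> dict:
    """Convert tagged LLM output into the /chat response shape."""
    recommendations = []
    # Assumed layout: one "name | url | test_type" line per recommendation.
    for line in extract_tag(raw, "recommendations").splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) == 3 and all(parts):  # skip blank or malformed lines
            name, url, test_type = parts
            recommendations.append({"name": name, "url": url, "test_type": test_type})

    # Assumed fallback: if the model forgot the tags, treat everything as the reply.
    reply = extract_tag(raw, "reply") or raw.strip()
    return {
        "reply": reply,
        "recommendations": recommendations[:10],
        "end_of_conversation": extract_tag(raw, "end_of_conversation").lower() == "true",
    }
```

A malformed line costs only that one recommendation, whereas a single stray character in a JSON payload would invalidate the entire response.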
5. Scope Control and Refusal Strategy
Two layers:
- Pre-LLM regex guard: checks the latest user message against known refusal patterns (legal, compensation, prompt-injection). Fires before any LLM call, so a refusal costs zero tokens.
- System prompt instructions: tells the LLM to refuse anything outside the catalog scope. Belt-and-suspenders.
URL validation post-parse: every URL returned by the LLM is checked against the catalog URL set. Non-catalog URLs are silently dropped. This eliminates hallucinated URLs.
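Both mechanisms are small enough to sketch. The patterns below are hypothetical examples for the three refusal categories, since the project's actual pattern list is not shown here, and the function names are likewise assumptions.

```python
import re

# Hypothetical patterns; the real guard's list is not reproduced in this README.
REFUSAL_PATTERNS = [
    re.compile(r"\b(legal(ly)?|lawsuit|hipaa|gdpr)\b", re.IGNORECASE),
    re.compile(r"\b(salary|salaries|compensation|pay\s+band)\b", re.IGNORECASE),
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE),
]


def should_refuse(latest_user_message: str) -> bool:
    """Pre-LLM guard: True if the latest user message matches any refusal pattern."""
    return any(pattern.search(latest_user_message) for pattern in REFUSAL_PATTERNS)


def drop_unknown_urls(recommendations: list[dict], catalog_urls: set[str]) -> list[dict]:
    """Post-parse guard: keep only recommendations whose URL is in the catalog."""
    return [rec for rec in recommendations if rec.get("url") in catalog_urls]
```

A regex guard is cheap but coarse: a legitimate question that merely mentions a matched keyword is also refused.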
6. Evaluation Strategy
10 test scenarios covering: vague→clarify, clear→recommend, add constraint→refine, comparison, off-topic refusal, prompt injection, EOC detection, technical roles, high-volume screening, compensation refusal. Each test checks: recommendations empty/non-empty, end_of_conversation flag, reply non-empty, URL format.
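The four per-test checks can be expressed as one function. This is a sketch of the idea, not the contents of `tests/evaluate.py`; the function names, the failure messages, and the definition of a well-formed URL (an http or https scheme plus a host) are assumptions.

```python
from urllib.parse import urlparse


def check_scenario(response: dict, expect_recommendations: bool, expect_eoc: bool) -> list[str]:
    """Return the failed checks for one scenario; an empty list means it passed."""
    failures = []

    reply = response.get("reply")
    if not isinstance(reply, str) or not reply.strip():
        failures.append("reply is empty")

    recs = response.get("recommendations") or []
    if expect_recommendations and not recs:
        failures.append("expected recommendations, got none")
    if not expect_recommendations and recs:
        failures.append("expected no recommendations")

    if response.get("end_of_conversation") is not expect_eoc:
        failures.append(f"end_of_conversation should be {expect_eoc}")

    # Assumed URL format rule: http(s) scheme and a non-empty host.
    for rec in recs:
        parsed = urlparse(str(rec.get("url", "")))
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            failures.append(f"bad URL format: {rec.get('url')!r}")
    return failures


def exit_code(all_failures: list[list[str]]) -> int:
    """0 when every scenario passed, 1 otherwise."""
    return 0 if all(not failures for failures in all_failures) else 1
```

These checks verify response shape and behavior class (clarify, recommend, refuse, close), not whether the recommended assessments are the best match.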
7. Trade-offs and Future Improvements
| Current | Improvement |
|---|---|
| TF-IDF retrieval | Sentence-transformers + FAISS for semantic recall on large catalogs |
| Static JSON catalog | Live SHL API feed with caching |
| Regex refusal guards | Fine-tuned classifier for nuanced refusal detection |
| Single-worker uvicorn | Gunicorn + multiple uvicorn workers for throughput |
| Closing phrase heuristic | LLM-based intent classification for EOC detection |
8. Use of AI Assistance
Claude (Anthropic) was used for: scaffolding boilerplate (FastAPI app structure, Dockerfile patterns), suggesting retrieval approaches, and reviewing code for obvious bugs. All architecture decisions, retrieval design, refusal logic, schema choices, and system prompt engineering were made by the developer and are fully explainable and reviewable.