---
title: shl-ai-agent
sdk: docker
app_port: 7860
license: mit
---

SHL Assessment Recommendation Agent

A conversational AI agent for recommending SHL psychometric assessments from the SHL Individual Test Solutions catalog.

Built for the SHL Research Intern, AI Assignment and deployed on Hugging Face Spaces using Docker.


What it does

  • Accepts a full conversation history via POST /chat and returns a recommendation reply.
  • Recommends 1–10 SHL assessments per response when enough context is available.
  • Asks clarifying questions when the query is vague.
  • Refuses off-topic requests (legal advice, compensation, prompt-injection).
  • Tracks constraints across conversation turns (role, seniority, domain, language).
  • Returns end_of_conversation: true when the user confirms the shortlist.
  • Stateless: no server-side session storage.
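
Because the service keeps no session state, the client has to resend the whole transcript on every call. The following is a minimal client-side sketch using only the standard library; `Conversation` and `build_payload` are hypothetical names that are not part of this repo, and the HTTP call itself is left out.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Client-side transcript; the server stores nothing between calls."""

    messages: list = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        # The request schema only shows "user" and "assistant" roles.
        if role not in ("user", "assistant"):
            raise ValueError(f"unsupported role: {role!r}")
        self.messages.append({"role": role, "content": content})

    def build_payload(self) -> str:
        """Serialize the full history as the POST /chat request body."""
        return json.dumps({"messages": self.messages})


conv = Conversation()
conv.add("user", "I need a cognitive ability test for graduate trainees.")
conv.add("assistant", "Which seniority level are you hiring for?")
conv.add("user", "Entry level.")
payload = conv.build_payload()
```

After each response, the client would append the returned `reply` as an assistant turn before sending the next user message.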

API Schema

GET /health

{"status": "ok"}

POST /chat

Request:

{
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}

Response:

{
  "reply": "string",
  "recommendations": [
    {"name": "string", "url": "string", "test_type": "string"}
  ],
  "end_of_conversation": false
}
  • recommendations is [] when clarifying or refusing.
  • recommendations has 1–10 items when shortlisting.
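
The response contract above can be checked on the client with a small validator. This is a sketch under the assumption that the three item keys are required and that extra keys are tolerated; `validate_chat_response` is a hypothetical helper, not something shipped in `app/schemas.py`.

```python
from typing import Any

MAX_RECOMMENDATIONS = 10
ITEM_KEYS = {"name", "url", "test_type"}


def validate_chat_response(payload: dict) -> list:
    """Return a list of schema problems; an empty list means the payload is valid."""
    problems = []

    if not isinstance(payload.get("reply"), str):
        problems.append("reply must be a string")
    if not isinstance(payload.get("end_of_conversation"), bool):
        problems.append("end_of_conversation must be a boolean")

    recs: Any = payload.get("recommendations")
    if not isinstance(recs, list):
        problems.append("recommendations must be a list")
        return problems
    if len(recs) > MAX_RECOMMENDATIONS:
        problems.append("recommendations must have at most 10 items")
    for i, item in enumerate(recs):
        if not isinstance(item, dict) or not ITEM_KEYS <= set(item):
            problems.append(f"recommendations[{i}] is missing name, url, or test_type")
        elif not all(isinstance(item[k], str) for k in ITEM_KEYS):
            problems.append(f"recommendations[{i}] values must be strings")
    return problems
```

An empty `recommendations` list passes, which matches the clarifying and refusing cases.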

Project Structure

shl-agent/
├── app/
│   ├── __init__.py          # Package marker
│   ├── main.py              # FastAPI app, routes, lifespan
│   ├── schemas.py           # Pydantic request/response models
│   ├── agent.py             # LLM orchestration, refusal logic, response parsing
│   ├── retrieval.py         # TF-IDF index build + query
│   └── catalog_loader.py    # Catalog I/O and validation
├── data/
│   └── shl_catalog.json     # SHL catalog (35 items extracted from sample conversations)
├── scripts/
│   └── build_index.py       # Precompute TF-IDF index artifacts
├── tests/
│   ├── sample_requests.json # 10 test scenarios
│   └── evaluate.py          # Automated evaluation script
├── Dockerfile
├── requirements.txt
├── .gitignore
└── README.md

Local Setup and Run

Prerequisites

  • Python 3.11+
  • An Anthropic API key (the agent uses the claude-sonnet-4-20250514 model)

Steps

# 1. Clone the repo
git clone <your-repo-url>
cd shl-agent

# 2. Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate       # Linux/macOS
# .venv\Scripts\activate        # Windows

# 3. Install dependencies
pip install -r requirements.txt

# 4. Set your API key
export ANTHROPIC_API_KEY="sk-ant-..."

# 5. (Optional but recommended) Pre-build the TF-IDF index
python scripts/build_index.py

# 6. Start the server
uvicorn app.main:app --host 0.0.0.0 --port 7860 --reload

The server is now running at http://localhost:7860.

Docker Local Run

# Build the Docker image
docker build -t shl-agent .

# Run the container with your API key
docker run -p 7860:7860 -e ANTHROPIC_API_KEY="sk-ant-..." shl-agent

Curl Commands

Health check

curl http://localhost:7860/health
# Expected: {"status":"ok"}

Vague query (should clarify)

curl -X POST http://localhost:7860/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "We need a solution for senior leadership."}
    ]
  }'

Clear query (should recommend)

curl -X POST http://localhost:7860/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "I need a cognitive ability test and personality test for graduate management trainees."}
    ]
  }'

Multi-turn conversation (add constraint)

curl -X POST http://localhost:7860/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "I need a cognitive ability test and personality test for graduate management trainees."},
      {"role": "assistant", "content": "For graduate trainees I recommend Verify G+ and OPQ32r."},
      {"role": "user", "content": "Can you also add a situational judgement element?"}
    ]
  }'

Comparison question

curl -X POST http://localhost:7860/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is the difference between OPQ32r and OPQ MQ Sales Report?"}
    ]
  }'

Off-topic refusal

curl -X POST http://localhost:7860/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Are we legally required under HIPAA to test all staff?"}
    ]
  }'

Hugging Face Spaces Deployment

Prerequisites

  • A Hugging Face account and a Space (Docker SDK).
  • huggingface_hub CLI installed: pip install huggingface_hub.
  • Your Anthropic API key added as a Space Secret (not in code).

Add the API key as a Space Secret

In your Space settings → Secrets → add:

ANTHROPIC_API_KEY = sk-ant-...

Git commands to push to Hugging Face Spaces

# 1. Install git-lfs (required for HF)
git lfs install

# 2. Clone your HF Space repo
git clone https://huggingface.co/spaces/<your-username>/shl-ai-agent
cd shl-ai-agent

# 3. Copy project files into the cloned repo
cp -r /path/to/shl-agent/* .

# 4. Commit and push
git add .
git commit -m "Initial deployment: SHL AI Agent"
git push

# HF will build the Docker image automatically on push.
# Monitor the build in: https://huggingface.co/spaces/<username>/shl-ai-agent

Running the Evaluation Script

# Against local server
python tests/evaluate.py --base-url http://localhost:7860

# Against deployed HF Space
python tests/evaluate.py --base-url https://<username>-shl-ai-agent.hf.space

The script exits with code 0 on a full pass and code 1 on any failure, which makes it suitable for CI.


Common Deployment Mistakes on HF Spaces (and how to avoid them)

| Mistake | Fix |
|---|---|
| Binding to 127.0.0.1 instead of 0.0.0.0 | Always use --host 0.0.0.0 in uvicorn CMD |
| Wrong port | app_port: 7860 in README front matter must match Dockerfile EXPOSE and uvicorn --port |
| API key in code | Set as a Space Secret; read via os.environ.get("ANTHROPIC_API_KEY") |
| Running as root | Add useradd and USER in Dockerfile |
| Importing heavy ML libraries (torch) | We use scikit-learn only, which stays within HF free-tier RAM |
| Cold build takes too long | Pre-build TF-IDF index in Dockerfile (RUN python scripts/build_index.py) |
| Missing README YAML front matter | The --- block must be the first thing in README.md |
| git-lfs not installed | Run git lfs install before cloning HF repo |
| Forgetting to set Space to Public | Public required for evaluator to reach your endpoint |
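
For the "API key in code" row, a fail-fast read of the secret at startup makes a missing Space Secret obvious in the build logs. This is a minimal sketch; `load_api_key` is a hypothetical helper name, and only the `ANTHROPIC_API_KEY` variable name comes from this README.

```python
import os
from collections.abc import Mapping


def load_api_key(env: Mapping = os.environ) -> str:
    """Read the Anthropic key from the environment and fail loudly if it is absent."""
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set; add it as a Space Secret")
    return key
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.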

Approach Document

1. Problem Framing

Given a multi-turn conversation between an HR professional and an AI agent, the system must recommend the most relevant SHL psychometric assessments from a fixed catalog. The agent must handle vague queries, accumulate constraints, support comparisons, and refuse off-topic requests.

2. Data Ingestion

The SHL catalog is stored as a structured JSON file (data/shl_catalog.json) with 35 items extracted from the 10 provided sample conversations. Each item has: name, url, test_type, description, duration, languages, keys, seniority, and domains. The catalog is the single source of truth; no external APIs are called for catalog data.

3. Retrieval Design

We use TF-IDF (bigrams) over rich document strings constructed from all catalog fields. The query is the concatenation of all user messages, with the latest message included twice for recency bias. Similarity is cosine similarity, computed with linear_kernel: scikit-learn's TF-IDF vectors are L2-normalized by default, so the dot product equals the cosine. The top 10 results scoring above a threshold of 0.05 are injected into the system prompt as context.
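
The query construction and the top-10 cut-off described above can be sketched in plain Python. The function names `build_query` and `select_top_k` are hypothetical, and the plain list of scores stands in for the similarity row that the real pipeline would get from the TF-IDF index.

```python
def build_query(messages: list) -> str:
    """Join all user turns, repeating the latest one to weight it more heavily."""
    user_turns = [m["content"] for m in messages if m["role"] == "user"]
    if not user_turns:
        return ""
    return " ".join(user_turns + [user_turns[-1]])


def select_top_k(scores: list, k: int = 10, threshold: float = 0.05) -> list:
    """Return indices of the k highest scores that are strictly above the threshold."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if scores[i] > threshold][:k]


history = [
    {"role": "user", "content": "graduate trainees"},
    {"role": "assistant", "content": "Which skills matter most?"},
    {"role": "user", "content": "numerical reasoning"},
]
query = build_query(history)
top = select_top_k([0.02, 0.40, 0.05, 0.31])
```

Repeating the last user turn doubles its term counts in the query vector, which is a cheap way to favor the newest constraint without discarding earlier ones.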

Why TF-IDF over sentence-transformers? At 35 items, neural embeddings provide minimal recall benefit while adding ~2 GB of model weight. TF-IDF with bigrams and rich field concatenation is fast, transparent, and interview-defensible.
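
The "rich field concatenation" mentioned above might look like the following. This is a sketch only: the field names come from the Data Ingestion section, but the value types (strings versus lists) are assumed, and `url` is left out here on the assumption that it adds little lexical signal, even though the text says all fields are used.

```python
from typing import Any

# Field names are from the README; value types are an assumption.
TEXT_FIELDS = ("name", "test_type", "description", "duration",
               "languages", "keys", "seniority", "domains")


def build_document(item: dict) -> str:
    """Flatten one catalog item into a single lowercase string for TF-IDF."""
    parts = []
    for field_name in TEXT_FIELDS:
        value: Any = item.get(field_name)
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            parts.extend(str(v) for v in value)
        else:
            parts.append(str(value))
    return " ".join(parts).lower()


example = {  # hypothetical item, not taken from the real catalog
    "name": "Example Numerical Test",
    "test_type": "Ability",
    "languages": ["English", "French"],
    "seniority": ["Graduate"],
}
doc = build_document(example)
```

Folding seniority, domain, and language values into the document string lets a query such as "graduate numerical" match on metadata as well as on the description.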

4. Agent Policy and Decision Logic

The LLM (Claude Sonnet) receives:

  • A fixed system prompt defining scope, refusal rules, and output format.
  • Retrieved catalog items as grounding context.
  • Full conversation history.

The system prompt uses XML output tags (<reply>, <recommendations>, <end_of_conversation>) for reliable parsing. This avoids JSON fragility from LLM outputs.
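
Parsing the tagged output can be done with a few regular expressions. The tag names below come from this README, but the layout inside `<recommendations>` is not documented here, so the one-item-per-line `name | url | test_type` format is purely an assumption for illustration, as are the function names.

```python
import re
from typing import Optional


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_llm_output(text: str) -> dict:
    """Parse the three tagged sections, falling back to safe defaults."""
    reply = extract_tag(text, "reply") or ""
    raw_recs = extract_tag(text, "recommendations") or ""
    eoc = (extract_tag(text, "end_of_conversation") or "false").lower() == "true"

    recommendations = []
    # ASSUMPTION: one item per line, formatted as "name | url | test_type".
    for line in raw_recs.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) == 3 and all(fields):
            name, url, test_type = fields
            recommendations.append({"name": name, "url": url, "test_type": test_type})
    return {"reply": reply, "recommendations": recommendations, "end_of_conversation": eoc}


sample = """<reply>Here is a shortlist.</reply>
<recommendations>
Example Test | https://example.com/a | Ability
</recommendations>
<end_of_conversation>false</end_of_conversation>"""
parsed = parse_llm_output(sample)
```

A missing or malformed tag degrades to an empty reply, an empty list, or `false` instead of raising, which is the robustness argument for tags over a single JSON blob.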

5. Scope Control and Refusal Strategy

Two layers:

  1. Pre-LLM regex guard: checks the latest user message against known refusal patterns (legal, compensation, prompt-injection). It fires before any LLM call, so a refusal costs zero tokens.
  2. System prompt instructions: tells the LLM to refuse anything outside the catalog scope. Belt-and-suspenders.
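
The first layer can be sketched as a short list of compiled patterns. The patterns below are illustrative assumptions; the project's actual list is not shown in this README, and `should_refuse` is a hypothetical name.

```python
import re

# ASSUMPTION: illustrative patterns only, not the project's real list.
REFUSAL_PATTERNS = [
    re.compile(r"\b(legal(ly)?|lawsuit|hipaa|gdpr)\b", re.IGNORECASE),
    re.compile(r"\b(salary|salaries|compensation|pay\s+band)\b", re.IGNORECASE),
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE),
]


def should_refuse(messages: list) -> bool:
    """Check only the latest user message against the refusal patterns."""
    for message in reversed(messages):
        if message["role"] == "user":
            return any(p.search(message["content"]) for p in REFUSAL_PATTERNS)
    return False
```

Keyword guards like this can misfire on legitimate queries (for example, hiring for a legal counsel role), which is one reason the system prompt layer exists as a second check.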

URL validation post-parse: every URL returned by the LLM is checked against the catalog URL set. Non-catalog URLs are silently dropped. This eliminates hallucinated URLs.
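
The post-parse URL check amounts to a set-membership filter. A minimal sketch, with a hypothetical function name and placeholder URLs standing in for real catalog entries:

```python
def filter_to_catalog(recommendations: list, catalog_urls: set) -> list:
    """Keep only recommendations whose URL appears verbatim in the catalog."""
    return [rec for rec in recommendations if rec.get("url") in catalog_urls]


catalog_urls = {"https://example.com/catalog/test-a"}  # placeholder URL
llm_recs = [
    {"name": "Test A", "url": "https://example.com/catalog/test-a", "test_type": "Ability"},
    {"name": "Made Up", "url": "https://example.com/catalog/nope", "test_type": "Ability"},
]
kept = filter_to_catalog(llm_recs, catalog_urls)
```

Exact string matching is strict: a trailing slash or a different scheme would cause a valid item to be dropped, so any normalization would need to be applied to both sides.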

6. Evaluation Strategy

10 test scenarios covering: vague→clarify, clear→recommend, add constraint→refine, comparison, off-topic refusal, prompt injection, EOC detection, technical roles, high-volume screening, compensation refusal. Each test checks: recommendations empty/non-empty, end_of_conversation flag, reply non-empty, URL format.
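
The four per-test checks listed above could be expressed as one function. This is a sketch, not the contents of `tests/evaluate.py`; the function name, the `expect_*` parameters, and the failure labels are assumptions.

```python
from urllib.parse import urlparse


def check_response(response: dict, expect_recs: bool, expect_eoc: bool) -> list:
    """Return the names of failed checks for one scenario (empty list means pass)."""
    failures = []
    recs = response.get("recommendations", [])

    # 1. Recommendations empty or non-empty, as the scenario expects.
    if bool(recs) != expect_recs:
        failures.append("recommendations_presence")
    # 2. end_of_conversation flag matches the expectation.
    if response.get("end_of_conversation") is not expect_eoc:
        failures.append("end_of_conversation")
    # 3. Reply is a non-empty string.
    if not str(response.get("reply", "")).strip():
        failures.append("reply_non_empty")
    # 4. Every URL has an http(s) scheme and a host.
    for rec in recs:
        parsed = urlparse(rec.get("url", ""))
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            failures.append("url_format")
            break
    return failures
```

A runner would collect these lists across all scenarios and exit with a non-zero status if any of them is non-empty.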

7. Trade-offs and Future Improvements

| Current | Improvement |
|---|---|
| TF-IDF retrieval | Sentence-transformers + FAISS for semantic recall on large catalogs |
| Static JSON catalog | Live SHL API feed with caching |
| Regex refusal guards | Fine-tuned classifier for nuanced refusal detection |
| Single-worker uvicorn | Gunicorn + multiple uvicorn workers for throughput |
| Closing phrase heuristic | LLM-based intent classification for EOC detection |

8. Use of AI Assistance

Claude (Anthropic) was used for: scaffolding boilerplate (FastAPI app structure, Dockerfile patterns), suggesting retrieval approaches, and reviewing code for obvious bugs. All architecture decisions, retrieval design, refusal logic, schema choices, and system prompt engineering were made by the developer and are fully explainable and reviewable.