	---
	title: shl-ai-agent
	sdk: docker
	app_port: 7860
	license: mit
	---

	# SHL Assessment Recommendation Agent

	A conversational AI agent for recommending SHL psychometric assessments from the SHL Individual Test Solutions catalog.

	Built for the SHL Research Intern, AI Assignment — deployed on Hugging Face Spaces using Docker.

	---

	## What it does

	- Accepts a full conversation history via `POST /chat` and returns a recommendation reply.
	- Recommends 1–10 SHL assessments per response when enough context is available.
	- Asks clarifying questions when the query is vague.
	- Refuses off-topic requests (legal advice, compensation, prompt-injection).
	- Tracks constraints across conversation turns (role, seniority, domain, language).
	- Returns `end_of_conversation: true` when the user confirms the shortlist.
	- Stateless — no server-side session storage.
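
Because the service keeps no session state, the caller has to resend the whole conversation on every turn. The sketch below shows one way a Python client might do that using only the standard library. The helper names (`build_payload`, `send_chat`, `next_history`) are hypothetical and are not part of this repository.

```python
import json
import urllib.request


def build_payload(history):
    """Wrap the full conversation history in the /chat request body."""
    return {"messages": [dict(m) for m in history]}


def send_chat(base_url, history, timeout=30):
    """POST the whole history to /chat and return the decoded JSON response."""
    body = json.dumps(build_payload(history)).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def next_history(history, assistant_reply, user_message):
    """Return a new history with the assistant reply and the next user turn appended."""
    return history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": user_message},
    ]
```

A typical loop would call `send_chat`, read `reply` from the result, and pass it to `next_history` together with the user's follow-up. `send_chat` needs a running server, so only the two pure helpers can be exercised offline.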

	---

	## API Schema

	### `GET /health`

	```json
	{"status": "ok"}
	```

	### `POST /chat`

	Request:
	```json
	{
	  "messages": [
	    {"role": "user", "content": "..."},
	    {"role": "assistant", "content": "..."}
	  ]
	}
	```

	Response:
	```json
	{
	  "reply": "string",
	  "recommendations": [
	    {"name": "string", "url": "string", "test_type": "string"}
	  ],
	  "end_of_conversation": false
	}
	```

	- `recommendations` is `[]` when clarifying or refusing.
	- `recommendations` has 1–10 items when shortlisting.
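
As a rough, dependency-free illustration of the response contract above, the sketch below mirrors the schema with standard-library dataclasses. The project itself uses Pydantic models in `app/schemas.py`; the names used here (`Recommendation`, `ChatResponse`, `parse_chat_response`) are hypothetical and not taken from that file.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Recommendation:
    name: str
    url: str
    test_type: str


@dataclass
class ChatResponse:
    reply: str
    recommendations: List[Recommendation] = field(default_factory=list)
    end_of_conversation: bool = False


def parse_chat_response(payload: dict) -> ChatResponse:
    """Build a ChatResponse from decoded JSON, enforcing the 10-item cap."""
    recs = [Recommendation(**item) for item in payload.get("recommendations", [])]
    if len(recs) > 10:
        raise ValueError("at most 10 recommendations are allowed")
    return ChatResponse(
        reply=payload["reply"],
        recommendations=recs,
        end_of_conversation=bool(payload.get("end_of_conversation", False)),
    )
```

An empty `recommendations` list is valid here because clarifying and refusing replies carry no shortlist.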

	---

	## Project Structure

	```
	shl-agent/
	├── app/
	│   ├── __init__.py           # Package marker
	│   ├── main.py               # FastAPI app, routes, lifespan
	│   ├── schemas.py            # Pydantic request/response models
	│   ├── agent.py              # LLM orchestration, refusal logic, response parsing
	│   ├── retrieval.py          # TF-IDF index build + query
	│   └── catalog_loader.py     # Catalog I/O and validation
	├── data/
	│   └── shl_catalog.json      # SHL catalog (35 items extracted from sample conversations)
	├── scripts/
	│   └── build_index.py        # Precompute TF-IDF index artifacts
	├── tests/
	│   ├── sample_requests.json  # 10 test scenarios
	│   └── evaluate.py           # Automated evaluation script
	├── Dockerfile
	├── requirements.txt
	├── .gitignore
	└── README.md
	```

	---

	## Local Setup and Run

	### Prerequisites

	- Python 3.11+
	- An Anthropic API key (the agent calls the `claude-sonnet-4-20250514` model)

	### Steps

	```bash
	# 1. Clone the repo
	git clone <your-repo-url>
	cd shl-agent

	# 2. Create and activate virtual environment
	python -m venv .venv
	source .venv/bin/activate # Linux/macOS
	# .venv\Scripts\activate # Windows

	# 3. Install dependencies
	pip install -r requirements.txt

	# 4. Set your API key
	export ANTHROPIC_API_KEY="sk-ant-..."

	# 5. (Optional but recommended) Pre-build the TF-IDF index
	python scripts/build_index.py

	# 6. Start the server
	uvicorn app.main:app --host 0.0.0.0 --port 7860 --reload
	```

	The server is now running at `http://localhost:7860`.

	### Docker Local Run

	```bash
	# Build the Docker image
	docker build -t shl-agent .

	# Run the container with your API key
	docker run -p 7860:7860 -e ANTHROPIC_API_KEY="sk-ant-..." shl-agent
	```

	---

	## Curl Commands

	### Health check

	```bash
	curl http://localhost:7860/health
	# Expected: {"status":"ok"}
	```

	### Vague query (should clarify)

	```bash
	curl -X POST http://localhost:7860/chat \
	  -H "Content-Type: application/json" \
	  -d '{
	    "messages": [
	      {"role": "user", "content": "We need a solution for senior leadership."}
	    ]
	  }'
	```

	### Clear query (should recommend)

	```bash
	curl -X POST http://localhost:7860/chat \
	  -H "Content-Type: application/json" \
	  -d '{
	    "messages": [
	      {"role": "user", "content": "I need a cognitive ability test and personality test for graduate management trainees."}
	    ]
	  }'
	```

	### Multi-turn conversation (add constraint)

	```bash
	curl -X POST http://localhost:7860/chat \
	  -H "Content-Type: application/json" \
	  -d '{
	    "messages": [
	      {"role": "user", "content": "I need a cognitive ability test and personality test for graduate management trainees."},
	      {"role": "assistant", "content": "For graduate trainees I recommend Verify G+ and OPQ32r."},
	      {"role": "user", "content": "Can you also add a situational judgement element?"}
	    ]
	  }'
	```

	### Comparison question

	```bash
	curl -X POST http://localhost:7860/chat \
	  -H "Content-Type: application/json" \
	  -d '{
	    "messages": [
	      {"role": "user", "content": "What is the difference between OPQ32r and OPQ MQ Sales Report?"}
	    ]
	  }'
	```

	### Off-topic refusal

	```bash
	curl -X POST http://localhost:7860/chat \
	  -H "Content-Type: application/json" \
	  -d '{
	    "messages": [
	      {"role": "user", "content": "Are we legally required under HIPAA to test all staff?"}
	    ]
	  }'
	```

	---

	## Hugging Face Spaces Deployment

	### Prerequisites

	- A Hugging Face account and a Space (Docker SDK).
	- `huggingface_hub` CLI installed: `pip install huggingface_hub`.
	- Your Anthropic API key added as a Space Secret (not in code).

	### Add the API key as a Space Secret

	In your Space settings → Secrets → add:
	```
	ANTHROPIC_API_KEY = sk-ant-...
	```

	### Git commands to push to Hugging Face Spaces

	```bash
	# 1. Initialize git-lfs (required for HF; the git-lfs package itself must already be installed)
	git lfs install

	# 2. Clone your HF Space repo
	git clone https://huggingface.co/spaces/<your-username>/shl-ai-agent
	cd shl-ai-agent

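	# 3. Copy project files into the cloned repo
	#    (the * glob skips dotfiles, so .gitignore is copied explicitly)
	cp -r /path/to/shl-agent/* .
	cp /path/to/shl-agent/.gitignore .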

	# 4. Commit and push
	git add .
	git commit -m "Initial deployment: SHL AI Agent"
	git push

	# HF will build the Docker image automatically on push.
	# Monitor the build in: https://huggingface.co/spaces/<username>/shl-ai-agent
	```

	---

	## Running the Evaluation Script

	```bash
	# Against local server
	python tests/evaluate.py --base-url http://localhost:7860

	# Against deployed HF Space
	python tests/evaluate.py --base-url https://<username>-shl-ai-agent.hf.space
	```

	The script exits with code 0 on full pass, code 1 on any failure — suitable for CI.

	---

	## Common Deployment Mistakes on HF Spaces (and how to avoid them)

	| Mistake | Fix |
	|---------|-----|
	| Binding to `127.0.0.1` instead of `0.0.0.0` | Always use `--host 0.0.0.0` in the uvicorn `CMD` |
	| Wrong port | `app_port: 7860` in the README front matter must match the Dockerfile `EXPOSE` and uvicorn `--port` |
	| API key in code | Set it as a Space Secret; read it via `os.environ.get("ANTHROPIC_API_KEY")` |
	| Running as root | Add `useradd` and `USER` in the Dockerfile |
	| Importing heavy ML libraries (torch) | We use scikit-learn only, which stays within HF free-tier RAM |
	| Cold build takes too long | Pre-build the TF-IDF index in the Dockerfile (`RUN python scripts/build_index.py`) |
	| Missing README YAML front matter | The `---` block must be the first thing in README.md |
	| git-lfs not set up | Install git-lfs and run `git lfs install` before cloning the HF repo |
	| Forgetting to set the Space to Public | The Space must be public for the evaluator to reach your endpoint |
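
To make the "API key in code" row concrete, here is a minimal sketch of reading the secret from the environment and failing fast when it is missing. The function name `load_api_key` is hypothetical; the repository may load the key differently.

```python
import os


def load_api_key(env=None):
    """Return ANTHROPIC_API_KEY from the environment, or raise if it is unset."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; add it as a Space Secret "
            "or export it in your local shell."
        )
    return key
```

Calling this once at startup surfaces a missing secret in the build or container logs instead of on the first `/chat` request.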

	---

	## Approach Document

	### 1. Problem Framing

	Given a multi-turn conversation between an HR professional and an AI agent, the system must recommend the most relevant SHL psychometric assessments from a fixed catalog. The agent must handle vague queries, accumulate constraints, support comparisons, and refuse off-topic requests.

	### 2. Data Ingestion

	The SHL catalog is stored as a structured JSON file (`data/shl_catalog.json`) with 35 items extracted from the 10 provided sample conversations. Each item has: `name`, `url`, `test_type`, `description`, `duration`, `languages`, `keys`, `seniority`, and `domains`. The catalog is the single source of truth — no external APIs are called for catalog data.
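
A minimal sketch of loading and validating the catalog is shown below. It assumes the JSON file holds a top-level list of objects, and the duplicate-URL check is an added assumption rather than documented behavior. `validate_catalog` and `load_catalog` are hypothetical names, not the functions in `app/catalog_loader.py`.

```python
import json

# Field names as listed in the paragraph above.
REQUIRED_FIELDS = (
    "name", "url", "test_type", "description", "duration",
    "languages", "keys", "seniority", "domains",
)


def validate_catalog(items):
    """Return the items unchanged, or raise ValueError listing every problem found."""
    problems = []
    seen_urls = set()
    for i, item in enumerate(items):
        missing = [f for f in REQUIRED_FIELDS if f not in item]
        if missing:
            problems.append(f"item {i}: missing {', '.join(missing)}")
        url = item.get("url")
        if url in seen_urls:
            problems.append(f"item {i}: duplicate url {url}")
        elif url is not None:
            seen_urls.add(url)
    if problems:
        raise ValueError("; ".join(problems))
    return items


def load_catalog(path="data/shl_catalog.json"):
    """Read the catalog file and validate it before use."""
    with open(path, encoding="utf-8") as fh:
        return validate_catalog(json.load(fh))
```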

	### 3. Retrieval Design

	We use TF-IDF (bigrams) over rich document strings constructed from all catalog fields. The query is the concatenation of all user messages, with the latest message included twice for recency bias. Similarity is cosine similarity, computed with `linear_kernel`: TF-IDF vectors are L2-normalized by default, so the dot product equals the cosine. The top 10 results scoring above a threshold of 0.05 are injected into the system prompt as context.

	Why TF-IDF over sentence-transformers? At 35 items, neural embeddings provide minimal recall benefit while adding ~2 GB of model weights. TF-IDF with bigrams and rich field concatenation is fast, transparent, and easy to defend in an interview.
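
The retrieval step described above can be sketched in pure Python as follows. This is a stand-in for illustration only: the project uses scikit-learn, whereas this version reimplements smoothed TF-IDF over unigrams and bigrams by hand so that it runs without dependencies. All function names are hypothetical, and the tokenizer is a simplifying assumption.

```python
import math
from collections import Counter


def build_query(messages):
    """Join all user messages, repeating the latest one for recency bias."""
    user_texts = [m["content"] for m in messages if m["role"] == "user"]
    if not user_texts:
        return ""
    return " ".join(user_texts + [user_texts[-1]])


def tokens(text):
    """Lowercased unigrams plus adjacent-word bigrams."""
    words = "".join(c.lower() if c.isalnum() else " " for c in text).split()
    return words + [f"{a} {b}" for a, b in zip(words, words[1:])]


def normalise(vec):
    """Scale a sparse vector to unit L2 length (empty if it has no weight)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def tfidf_index(docs):
    """Return (idf, doc_vectors) with L2-normalised TF-IDF vectors."""
    counts = [Counter(tokens(d)) for d in docs]
    n = len(docs)
    df = Counter(t for c in counts for t in c)
    idf = {t: math.log((1 + n) / (1 + k)) + 1.0 for t, k in df.items()}
    vectors = [normalise({t: tf * idf[t] for t, tf in c.items()}) for c in counts]
    return idf, vectors


def search(query, docs, top_k=10, threshold=0.05):
    """Return up to top_k (doc_index, score) pairs scoring above the threshold."""
    idf, doc_vecs = tfidf_index(docs)
    q_counts = Counter(tokens(query))
    q = normalise({t: tf * idf[t] for t, tf in q_counts.items() if t in idf})
    scored = []
    for i, d in enumerate(doc_vecs):
        # Both vectors have unit length, so the dot product is the cosine similarity.
        score = sum(w * d.get(t, 0.0) for t, w in q.items())
        if score > threshold:
            scored.append((score, i))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(i, round(score, 4)) for score, i in scored[:top_k]]
```

Rebuilding the index on every call is fine for a sketch at this catalog size; a real service would build it once at startup or load precomputed artifacts.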

	### 4. Agent Policy and Decision Logic

	The LLM (Claude Sonnet) receives:
	- A fixed system prompt defining scope, refusal rules, and output format.
	- Retrieved catalog items as grounding context.
	- Full conversation history.

	The system prompt asks the model to wrap its output in XML tags (`<reply>`, `<recommendations>`, `<end_of_conversation>`) for reliable parsing. This avoids the fragility of parsing JSON emitted by an LLM, such as unescaped quotes or truncated output.
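
A tag-based reply can be parsed with a few regular expressions, as in the sketch below. The layout inside `<recommendations>` is not documented here, so the nested `<item>`, `<name>`, `<url>`, and `<test_type>` tags are an assumption made for illustration, as is the fallback that treats untagged output as the reply. `parse_llm_output` is a hypothetical name.

```python
import re


def _tag(text, name, default=""):
    """Return the stripped body of the first <name>...</name> pair, or default."""
    m = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
    return m.group(1).strip() if m else default


def parse_llm_output(raw):
    """Convert tagged LLM output into the /chat response dictionary."""
    recs_block = _tag(raw, "recommendations")
    recs = []
    for item in re.findall(r"<item>(.*?)</item>", recs_block, flags=re.DOTALL):
        rec = {k: _tag(item, k) for k in ("name", "url", "test_type")}
        if rec["name"] and rec["url"]:  # skip incomplete items
            recs.append(rec)
    return {
        # Assumed fallback: if the model forgot the tags, use the raw text as the reply.
        "reply": _tag(raw, "reply", default=raw.strip()),
        "recommendations": recs[:10],
        "end_of_conversation": _tag(raw, "end_of_conversation").lower() == "true",
    }
```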

	### 5. Scope Control and Refusal Strategy

	Two layers:
	1. Pre-LLM regex guard: checks the latest user message against known refusal patterns (legal, compensation, prompt-injection). Fires before any LLM call — zero token cost.
	2. System prompt instructions: tells the LLM to refuse anything outside the catalog scope. Belt-and-suspenders.
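
The first layer might look roughly like the sketch below. The patterns are illustrative guesses and not the ones shipped in `app/agent.py`; a real guard would need broader coverage and testing against false positives. `REFUSAL_PATTERNS` and `refusal_reason` are hypothetical names.

```python
import re

# Illustrative patterns only; the real list is an implementation detail of the agent.
REFUSAL_PATTERNS = {
    "legal": re.compile(r"\b(legal(ly)?|lawsuit|hipaa|gdpr)\b", re.I),
    "compensation": re.compile(
        r"\b(salary|salaries|compensation|pay\s+(band|range|scale)s?)\b", re.I
    ),
    "prompt_injection": re.compile(
        r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions|system\s+prompt",
        re.I,
    ),
}


def refusal_reason(messages):
    """Return the matched category for the latest user message, or None."""
    for m in reversed(messages):
        if m["role"] == "user":
            for category, pattern in REFUSAL_PATTERNS.items():
                if pattern.search(m["content"]):
                    return category
            return None  # only the latest user message is checked
    return None
```

When this returns a category, the handler can reply with a canned refusal and an empty `recommendations` list without calling the LLM.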

	URL validation post-parse: every URL returned by the LLM is checked against the catalog URL set. Non-catalog URLs are silently dropped. This eliminates hallucinated URLs.
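
The post-parse URL check amounts to a set-membership filter. A minimal sketch, assuming each recommendation and catalog item is a dictionary with a `url` key; the de-duplication step is an extra assumption, and `filter_to_catalog` is a hypothetical name:

```python
def filter_to_catalog(recommendations, catalog):
    """Keep only recommendations whose URL appears in the catalog (at most 10)."""
    allowed = {item["url"] for item in catalog}
    kept, seen = [], set()
    for rec in recommendations:
        url = rec.get("url", "").strip()
        if url in allowed and url not in seen:
            seen.add(url)
            kept.append({**rec, "url": url})
    return kept[:10]
```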

	### 6. Evaluation Strategy

	10 test scenarios covering: vague→clarify, clear→recommend, add constraint→refine, comparison, off-topic refusal, prompt injection, EOC detection, technical roles, high-volume screening, compensation refusal. Each test checks: recommendations empty/non-empty, end_of_conversation flag, reply non-empty, URL format.
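
The per-scenario checks listed above could be expressed as a small function like the one below. This is a sketch under assumptions, not the contents of `tests/evaluate.py`; the parameter names and the `exit_code` helper are hypothetical.

```python
def check_response(response, expect_recommendations, expect_eoc=False):
    """Return a list of failure messages; an empty list means the scenario passed."""
    failures = []
    if not str(response.get("reply", "")).strip():
        failures.append("reply is empty")
    recs = response.get("recommendations", [])
    if expect_recommendations and not 1 <= len(recs) <= 10:
        failures.append(f"expected 1-10 recommendations, got {len(recs)}")
    if not expect_recommendations and recs:
        failures.append(f"expected no recommendations, got {len(recs)}")
    if bool(response.get("end_of_conversation")) != expect_eoc:
        failures.append("end_of_conversation flag mismatch")
    for rec in recs:
        if not str(rec.get("url", "")).startswith(("http://", "https://")):
            failures.append(f"bad URL format: {rec.get('url')!r}")
    return failures


def exit_code(all_failures):
    """Map per-scenario failure lists to a process exit code (0 = all passed)."""
    return 1 if any(all_failures) else 0
```

Passing the result of `exit_code` to `sys.exit` gives the 0 or 1 status that a CI job can act on.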

	### 7. Trade-offs and Future Improvements

	| Current | Improvement |
	|---------|-------------|
	| TF-IDF retrieval | Sentence-transformers + FAISS for semantic recall on large catalogs |
	| Static JSON catalog | Live SHL API feed with caching |
	| Regex refusal guards | Fine-tuned classifier for nuanced refusal detection |
	| Single-worker uvicorn | Gunicorn + multiple uvicorn workers for throughput |
	| Closing phrase heuristic | LLM-based intent classification for EOC detection |

	### 8. Use of AI Assistance


	Claude (Anthropic) was used for: scaffolding boilerplate (FastAPI app structure, Dockerfile patterns), suggesting retrieval approaches, and reviewing code for obvious bugs. All architecture decisions, retrieval design, refusal logic, schema choices, and system prompt engineering were made by the developer and are fully explainable and reviewable.