# ⚑ AI Recruitment Agent

A production-grade hybrid candidate matching pipeline using **Groq LLM**, **Pinecone vector DB**, and a **Gradio 4.16.0** UI.

## Architecture

```
CSV Input → Stage 1: Normalize (Groq)
          → Stage 2: Embed + Match (Pinecone + SentenceTransformers) → Top 20
          → Stage 3: Deterministic Rerank (Groq) → Top 10
          → Stage 4: LLM Deep Review (Groq) → Top 5
          → Stage 5: Final Synthesis (Groq) → Shortlist
```

## Setup

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

### 2. Configure environment

```bash
cp .env.example .env
# Edit .env and fill in your API keys
```

### 3. Create Pinecone index

In your Pinecone console:
- Create an index named `recruitment-index` (or whatever you set in `PINECONE_INDEX`)
- Dimension: **1024** for `BAAI/bge-m3` or `bge-large-en`, **768** for `bge-base-en`, **384** for `all-MiniLM-L6-v2`
- Metric: **cosine**
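
If you prefer to create the index from code rather than the console, the following is a minimal sketch, not the project's own setup script. It assumes the Pinecone Python client (v3 or later) and `sentence-transformers` are installed; the helper names `index_settings`, `embedding_dimension`, and `create_index` are hypothetical, and the serverless cloud and region are placeholder assumptions you would replace with your own.

```python
import os


def index_settings(dimension: int) -> dict:
    """Build the index settings listed above (name, dimension, cosine metric)."""
    if dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    return {
        "name": os.getenv("PINECONE_INDEX", "recruitment-index"),
        "dimension": dimension,
        "metric": "cosine",
    }


def embedding_dimension(model_name: str) -> int:
    """Ask the embedding model for its output size (downloads it on first use)."""
    from sentence_transformers import SentenceTransformer  # imported lazily

    return SentenceTransformer(model_name).get_sentence_embedding_dimension()


def create_index(dimension: int) -> None:
    """Create the index through the Pinecone client instead of the console."""
    from pinecone import Pinecone, ServerlessSpec  # imported lazily

    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
    pc.create_index(
        **index_settings(dimension),
        # Assumption: a serverless index in this cloud/region; adjust as needed.
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )


# Example call (needs network access and a valid PINECONE_API_KEY):
# create_index(embedding_dimension(os.getenv("EMBEDDING_MODEL", "BAAI/bge-m3")))
```

Reading the dimension from the model itself avoids a mismatch between the index and the embeddings if `EMBEDDING_MODEL` is changed later.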

### 4. Run the Gradio UI

```bash
python gradio_app.py
```

Open http://localhost:7860 in your browser.

### 5. (Optional) Run the FastAPI backend

```bash
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

API docs at http://localhost:8000/docs

## CSV Format

Your CSV should have these columns (exact names or common variants):

| Column | Variants |
|--------|----------|
| `name` | `full_name`, `candidate_name` |
| `email` | `email_address` |
| `skills` | `parsed_skills`, `technical_skills` |
| `experience` | `parsed_work_experience`, `years_of_experience` |
| `education` | `parsed_metadata_education` |
| `resume_text` | `parsed_summary`, `summary` |
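
The variant mapping in the table can be sketched as a header-renaming step. This is an illustration only, not the project's loader: `COLUMN_VARIANTS`, `ALIASES`, and `normalize_rows` are hypothetical names, and matching headers case-insensitively and dropping unrecognized columns are assumptions made for the example.

```python
import csv
import io

# Canonical column -> accepted variants, copied from the table above.
COLUMN_VARIANTS = {
    "name": ["full_name", "candidate_name"],
    "email": ["email_address"],
    "skills": ["parsed_skills", "technical_skills"],
    "experience": ["parsed_work_experience", "years_of_experience"],
    "education": ["parsed_metadata_education"],
    "resume_text": ["parsed_summary", "summary"],
}

# Reverse lookup: every accepted header -> its canonical name.
ALIASES = {
    alias: canonical
    for canonical, variants in COLUMN_VARIANTS.items()
    for alias in [canonical, *variants]
}


def normalize_rows(csv_text: str) -> list[dict]:
    """Parse CSV text and rename recognized headers to canonical names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for raw in reader:
        row = {}
        for header, value in raw.items():
            # Assumption: headers are matched case-insensitively.
            key = ALIASES.get((header or "").strip().lower())
            # Keep the first matching column; skip unrecognized ones.
            if key is not None and key not in row:
                row[key] = (value or "").strip()
        rows.append(row)
    return rows


sample = (
    "full_name,email_address,technical_skills\n"
    'Ada Lovelace,ada@example.com,"Python, SQL"\n'
)
print(normalize_rows(sample))
```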

## Key Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `GROQ_API_KEYS` | Comma-separated keys for rotation | (none) |
| `GROQ_MODEL` | Model name | `llama3-70b-8192` |
| `PINECONE_API_KEY` | Pinecone API key | (none) |
| `PINECONE_INDEX` | Index name | `recruitment-index` |
| `EMBEDDING_MODEL` | SentenceTransformer model | `BAAI/bge-m3` |
| `STAGE2_TOP_K` | Candidates retrieved by embeddings | `20` |
| `GRADIO_PORT` | UI port | `7860` |
| `GRADIO_SHARE` | Enable public share link | `False` |
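
As a rough sketch of how these variables might be read, the snippet below loads them with the defaults from the table and cycles through the Groq keys. The `Settings` class, `load_settings`, and `key_rotation` are hypothetical names, and round-robin is only one plausible reading of "rotation"; the project may rotate keys differently (for example, only after a rate-limit error).

```python
import itertools
import os
from dataclasses import dataclass


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings such as 'True' or '1'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    groq_api_keys: tuple
    groq_model: str
    pinecone_api_key: str
    pinecone_index: str
    embedding_model: str
    stage2_top_k: int
    gradio_port: int
    gradio_share: bool


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (os.environ by default) with table defaults."""
    env = os.environ if env is None else env
    raw_keys = env.get("GROQ_API_KEYS", "")
    keys = tuple(k.strip() for k in raw_keys.split(",") if k.strip())
    return Settings(
        groq_api_keys=keys,
        groq_model=env.get("GROQ_MODEL", "llama3-70b-8192"),
        pinecone_api_key=env.get("PINECONE_API_KEY", ""),
        pinecone_index=env.get("PINECONE_INDEX", "recruitment-index"),
        embedding_model=env.get("EMBEDDING_MODEL", "BAAI/bge-m3"),
        stage2_top_k=int(env.get("STAGE2_TOP_K", "20")),
        gradio_port=int(env.get("GRADIO_PORT", "7860")),
        gradio_share=_as_bool(env.get("GRADIO_SHARE", "False")),
    )


def key_rotation(keys):
    """Return an endless round-robin iterator over the configured keys."""
    if not keys:
        raise ValueError("GROQ_API_KEYS is empty")
    return itertools.cycle(keys)


settings = load_settings({"GROQ_API_KEYS": "key-a, key-b"})
rotation = key_rotation(settings.groq_api_keys)
print([next(rotation) for _ in range(3)])
```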

## Pipeline Stages

| Stage | Method | Input | Output |
|-------|--------|-------|--------|
| 1. Normalize | Groq LLM | All candidates | Structured features |
| 2. Embed & Match | Pinecone + BAAI/bge-m3 | All candidates | Top 20 by similarity |
| 3. Rerank | Groq LLM (deterministic scoring) | Top 20 | Top 10 with scores |
| 4. Deep Review | Groq LLM | Top 10 | Top 5 with verdicts + signals |
| 5. Final Synthesis | Groq LLM | Top 5 reviews | Final ranked shortlist |
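
The narrowing behaviour of the stages can be illustrated with a small funnel sketch. It is not the project's implementation: `top_n` and `run_funnel` are hypothetical helpers, the scores are made-up placeholders standing in for what Pinecone and Groq would return, and the cut-offs (20, 10, 5) are taken from the architecture diagram.

```python
def top_n(score_key: str, n: int):
    """Return a stage that keeps the n highest-scoring candidates."""
    def stage(candidates):
        ranked = sorted(candidates, key=lambda c: c[score_key], reverse=True)
        return ranked[:n]
    return stage


def run_funnel(candidates, stages):
    """Apply each (name, stage) pair in order and record the pool size after it."""
    sizes = {}
    for name, stage in stages:
        candidates = stage(candidates)
        sizes[name] = len(candidates)
    return candidates, sizes


# Placeholder pool: in the real pipeline these scores would come from
# vector similarity, the rerank step, and the LLM review respectively.
pool = [
    {
        "name": f"candidate-{i}",
        "similarity": i / 100,
        "rerank": (i * 7) % 50,
        "review": (i * 13) % 50,
    }
    for i in range(50)
]

stages = [
    ("embed_match", top_n("similarity", 20)),
    ("rerank", top_n("rerank", 10)),
    ("deep_review", top_n("review", 5)),
]

shortlist, sizes = run_funnel(pool, stages)
print(sizes)
print([c["name"] for c in shortlist])
```

Each stage only sees what the previous one kept, so the expensive LLM calls run on progressively smaller pools.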