ketannnn committed on
Commit
6106435
·
1 Parent(s): acf867c

feat: add docker-compose and rewrite README with architecture rationale

Files changed (4)
  1. README.md +120 -0
  2. backend/Dockerfile +13 -0
  3. docker-compose.yml +29 -0
  4. frontend/Dockerfile +10 -0
README.md ADDED
@@ -0,0 +1,120 @@
+ # TalentPulse — AI Candidate Matching System
+
+ A production-grade two-stage AI pipeline for matching job descriptions against large candidate pools. Built in 3 days as a coding assignment demonstration.
+
+ ## Architecture
+
+ ### Why Two Stages?
+
+ Calling an LLM directly on 100K candidates costs hundreds of dollars and takes hours. The two-stage funnel gets it done in seconds for under $1:
+
+ ```
+ Candidates (100K) → Stage 1 (ANN + scoring) → Top 50 → Stage 2 (reranker + LLM) → Top 20 ranked
+ ```
+
+ **Stage 1 — Fast Retrieval (~50-100ms)**
+ Bi-encoder vector search (BAAI/bge-small-en-v1.5) in Qdrant retrieves top-200 semantically similar candidates. A weighted structured scorer then ranks them:
+
+ ```python
+ score = (
+     0.35 * skill_jaccard(jd_skills, candidate_skills)
+     + 0.20 * cosine_similarity           # from ANN
+     + 0.15 * yoe_match(min_yoe, candidate_yoe)
+     + 0.10 * company_quality_signal      # funded/product company bonus
+     + 0.10 * growth_velocity             # proprietary trajectory score
+     + 0.10 * education_match
+ )
+ ```
+
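Review note on the Stage 1 scorer: the README names `skill_jaccard` and `yoe_match` but the diff does not show their bodies. The following is a minimal sketch of how the weighted sum could be computed, assuming Jaccard overlap on lowercased skill names and a linear ramp for experience. `WEIGHTS`, `stage1_score`, and the body of `yoe_match` are illustrative assumptions, not code from this commit.

```python
from typing import Iterable

# Weights copied from the README formula; the dictionary keys are hypothetical names.
WEIGHTS = {
    "skill": 0.35,
    "cosine": 0.20,
    "yoe": 0.15,
    "company": 0.10,
    "growth": 0.10,
    "education": 0.10,
}


def skill_jaccard(jd_skills: Iterable[str], candidate_skills: Iterable[str]) -> float:
    """Jaccard overlap |A & B| / |A | B| on case-normalised skill names."""
    a = {s.strip().lower() for s in jd_skills if s.strip()}
    b = {s.strip().lower() for s in candidate_skills if s.strip()}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def yoe_match(min_yoe: float, candidate_yoe: float) -> float:
    """Assumed definition: full credit at or above the minimum, linear below it."""
    if min_yoe <= 0:
        return 1.0
    return max(0.0, min(1.0, candidate_yoe / min_yoe))


def stage1_score(components: dict) -> float:
    """Weighted sum of component scores; components missing from the dict count as 0."""
    return sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
```

Because every component is kept in the range 0 to 1 and the weights sum to 1, the combined score also stays in that range, which makes scores comparable across job descriptions.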
+ **Stage 2 — Deep Reranking (~2-5s)**
+ BGE cross-encoder (bge-reranker-v2-m3) jointly re-scores the top-50 shortlist. Scores are fused via Reciprocal Rank Fusion (Cormack 2009). Groq LLM generates explanations for the final top-20, grounded in pre-computed structured gaps.
+
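Review note on score fusion: the README does not say which rankings are fused. The sketch below assumes two inputs, the Stage 1 ordering and the cross-encoder ordering, and uses the constant k = 60 from Cormack et al. (2009). The function name and the tie-break rule are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of candidate ids.

    Each candidate receives 1 / (k + rank) from every list it appears in,
    with ranks starting at 1. Ties are broken by candidate id so the
    output order is deterministic.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, candidate_id in enumerate(ranking, start=1):
            fused[candidate_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical orderings for three candidates.
stage1_order = ["a", "b", "c"]
cross_encoder_order = ["b", "c", "a"]
fused_order = [cid for cid, _ in reciprocal_rank_fusion([stage1_order, cross_encoder_order])]
```

RRF uses only rank positions, so it needs no calibration between the structured score and the cross-encoder logits, which live on different scales.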
+ ### Unique Differentiators
+
+ 1. **Trajectory Scoring** — Computes career growth velocity from work history timelines and title seniority progression. Rewards fast-promoters at funded product companies.
+
+ 2. **JD Quality Feedback** — Every match response includes a `jd_quality` report: vagueness score, breadth score, contradictions. Instead of silently handling vague JDs, we surface them as a feature.
+
+ 3. **Structured Gap Analysis** — Pre-computes missing skills, YOE delta, location mismatch, engineer type mismatch for each candidate. LLM explanations are grounded in this structured data — no hallucination.
+
+ 4. **Live Weight Sliders** — Component scores are stored in Postgres. Changing slider weights triggers a pure in-memory `/rerank` call — no model inference, sub-100ms response.
+
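Review note on item 3: the gap fields listed there (missing skills, YOE delta, location mismatch, engineer type mismatch) could be pre-computed along these lines. This is a sketch under assumptions: the candidate keys reuse column names from the CSV schema, while the JD keys (`skills`, `min_yoe`, `location`, `engineer_type`) and the `Gaps` dataclass are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gaps:
    missing_skills: list
    yoe_delta: float              # candidate years minus required years
    location_mismatch: bool
    engineer_type_mismatch: bool


def compute_gaps(jd: dict, candidate: dict) -> Gaps:
    """Derive structured gaps that an LLM prompt can later quote verbatim."""
    have = {s.lower() for s in candidate.get("parsed_skills", [])}
    missing = sorted(s for s in jd.get("skills", []) if s.lower() not in have)

    yoe_delta = float(candidate.get("years_of_experience", 0)) - float(jd.get("min_yoe", 0))

    jd_location = jd.get("location")
    open_to = {loc.lower() for loc in candidate.get("open_to_working_at", [])}
    location_mismatch = bool(jd_location) and jd_location.lower() not in open_to

    jd_type = jd.get("engineer_type")
    cand_type = candidate.get("engineer_type")
    type_mismatch = bool(jd_type) and bool(cand_type) and jd_type != cand_type

    return Gaps(missing, yoe_delta, location_mismatch, type_mismatch)


# Hypothetical JD and candidate used to exercise the function.
example_jd = {"skills": ["Python", "Kubernetes"], "min_yoe": 5,
              "location": "Bangalore", "engineer_type": "backend"}
example_candidate = {"parsed_skills": ["python", "docker"], "years_of_experience": 3,
                     "open_to_working_at": ["Remote"], "engineer_type": "backend"}
example_gaps = compute_gaps(example_jd, example_candidate)
```

Passing a record like `example_gaps` into the prompt gives the explanation step concrete facts to cite instead of asking the model to infer them from raw profiles.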
+ ## Stack
+
+ | Component | Tech |
+ |-----------|------|
+ | Backend API | FastAPI + uvicorn |
+ | Database | Neon Postgres (async via asyncpg) |
+ | Vector DB | Qdrant Cloud |
+ | Task Queue | Celery + Redis Cloud |
+ | Embedding | BAAI/bge-small-en-v1.5 (local, 384-dim) |
+ | Reranker | BAAI/bge-reranker-v2-m3 (cross-encoder) |
+ | LLM | Groq — llama-3.3-70b-versatile |
+ | Frontend | Next.js 16 + React 19 + TypeScript |
+
+ ## API Reference
+
+ | Method | Endpoint | Description |
+ |--------|----------|-------------|
+ | POST | `/api/jds` | Create JD, triggers async parsing |
+ | GET | `/api/jds` | List all JDs |
+ | GET | `/api/jds/{id}` | JD detail + quality report |
+ | POST | `/api/candidates/upload` | Upload CSV/JSON, queues batch ingestion |
+ | GET | `/api/candidates/status/{task_id}` | Celery task status |
+ | POST | `/api/match/{jd_id}` | Run full match pipeline |
+ | GET | `/api/match/{jd_id}` | Get cached ranked results |
+ | GET | `/api/match/{jd_id}/{cid}` | Candidate detail + LLM explanation |
+ | POST | `/api/match/{jd_id}/rerank` | Re-sort with custom weights (< 100ms) |
+
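Review note on `/api/match/{jd_id}/rerank`: since the README says component scores are stored and the endpoint performs no model inference, the handler can reduce to a weighted re-sort. The sketch below assumes each cached result carries an `id` and a `components` dictionary; both names, and the normalisation of slider weights, are illustrative assumptions rather than the commit's actual schema.

```python
def rerank(candidates, weights):
    """Re-sort cached candidates under new slider weights.

    Weights are normalised to sum to 1 so that sliders can be moved
    independently. Returns (candidate_id, score) pairs, best first.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    normalised = {name: w / total for name, w in weights.items()}

    scored = [
        (c["id"], sum(normalised[name] * c["components"].get(name, 0.0) for name in normalised))
        for c in candidates
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical cached component scores for two candidates.
cached = [
    {"id": "c1", "components": {"skill": 0.9, "yoe": 0.2}},
    {"id": "c2", "components": {"skill": 0.4, "yoe": 1.0}},
]
skill_heavy = rerank(cached, {"skill": 0.8, "yoe": 0.2})
yoe_heavy = rerank(cached, {"skill": 0.2, "yoe": 0.8})
```

Here the skill-heavy weights rank `c1` first and the experience-heavy weights rank `c2` first, with no embedding or reranker call involved.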
+ ## Setup
+
+ ### Backend
+
+ ```bash
+ cd backend
+ python -m venv .venv
+ .venv\Scripts\activate # Windows
+ pip install -r requirements.txt
+
+ # Run migrations
+ alembic upgrade head
+
+ # Start API
+ uvicorn main:app --reload --port 8000
+
+ # Start Celery worker (separate terminal)
+ celery -A src.workers.celery_app.celery_app worker --loglevel=info
+ ```
+
+ ### Frontend
+
+ ```bash
+ cd frontend
+ npm install
+ npm run dev
+ ```
+
+ Open http://localhost:3000
+
+ ### Docker
+
+ ```bash
+ docker-compose up --build
+ ```
+
+ ## Research Foundation
+
+ - **Bi-encoder retrieval**: Karpukhin et al. (2020) — Dense Passage Retrieval
+ - **Cross-encoder reranking**: Lin et al. (2021) — Pretrained Transformers for Text Ranking
+ - **BGE models**: Chen et al. (2024) — BGE M3-Embedding
+ - **Score fusion**: Cormack et al. (2009) — Reciprocal Rank Fusion
+ - **LLM as ranker**: Sun et al. (2023) — Is ChatGPT Good at Search?
+ - **Job matching**: Zhu et al. (2018) — Person-Job Fit with Joint Representation Learning
+
+ ## Candidate CSV Schema
+
+ The upload accepts any CSV. Recognized columns:
+
+ `id, name, email, looking_for, currently_employed, notice_period, open_to_working_at, role_type, engineer_type, years_of_experience, programming_languages, backend_frameworks, frontend_technologies, gen_ai_experience, recent_experience_type, education_status, degree, parsed_summary, parsed_skills, parsed_work_experience, most_recent_company, most_recent_company_description, most_recent_company_is_funded, most_recent_company_is_product_company, most_recent_company_total_funding, most_recent_company_funding_status, time_in_current_company, is_actively_or_passively_looking`
+
+ Missing columns are handled gracefully — ingestion continues row by row.
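
Review note on the CSV schema: the commit does not include the ingestion code, so the following is only a sketch of what row-by-row handling with missing columns could look like using the standard library. `RECOGNIZED` lists a small subset of the columns above, and `iter_candidates` is a hypothetical helper.

```python
import csv
import io

# Subset of the recognized columns, chosen for illustration only.
RECOGNIZED = ("id", "name", "email", "years_of_experience", "parsed_skills")


def iter_candidates(csv_text):
    """Yield (line_number, record) pairs, one per data row.

    Columns absent from the file, and empty cells, become None. A value
    that cannot be parsed as a number is set to None instead of aborting
    the whole upload.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        record = {column: (row.get(column) or None) for column in RECOGNIZED}
        raw_yoe = record["years_of_experience"]
        if raw_yoe is not None:
            try:
                record["years_of_experience"] = float(raw_yoe)
            except ValueError:
                record["years_of_experience"] = None
        yield line_number, record


# Hypothetical upload with no email column and one malformed value.
sample = "id,name,years_of_experience\n1,Ada,7\n2,Lin,n/a\n"
rows = list(iter_candidates(sample))
```

Yielding the line number alongside each record makes it straightforward to report which rows were degraded without stopping the batch.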
backend/Dockerfile ADDED
@@ -0,0 +1,13 @@
+ FROM python:3.12-slim
+
+ WORKDIR /app
+
+ RUN apt-get update && apt-get install -y --no-install-recommends gcc && rm -rf /var/lib/apt/lists/*
+
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ COPY . .
+
+ EXPOSE 8000
+ CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
docker-compose.yml ADDED
@@ -0,0 +1,29 @@
+ version: "3.9"
+
+ services:
+   backend:
+     build:
+       context: ./backend
+       dockerfile: Dockerfile
+     ports:
+       - "8000:8000"
+     env_file: ./backend/.env
+     command: uvicorn main:app --host 0.0.0.0 --port 8000 --reload
+
+   worker:
+     build:
+       context: ./backend
+       dockerfile: Dockerfile
+     env_file: ./backend/.env
+     command: celery -A src.workers.celery_app.celery_app worker --loglevel=info --concurrency=2
+
+   frontend:
+     build:
+       context: ./frontend
+       dockerfile: Dockerfile
+     ports:
+       - "3000:3000"
+     environment:
+       - NEXT_PUBLIC_API_URL=http://backend:8000
+     depends_on:
+       - backend
frontend/Dockerfile ADDED
@@ -0,0 +1,10 @@
+ FROM node:20-alpine
+
+ WORKDIR /app
+ COPY package*.json ./
+ RUN npm ci
+ COPY . .
+ RUN npm run build
+
+ EXPOSE 3000
+ CMD ["npm", "start"]
+ CMD ["npm", "start"]