Spaces: ketannnn/coderound

ketannnn committed on Apr 12

Commit 6106435

1 Parent(s): acf867c

feat: add docker-compose and rewrite README with architecture rationale

Files changed (4)

README.md +120 -0
backend/Dockerfile +13 -0
docker-compose.yml +29 -0
frontend/Dockerfile +10 -0

README.md ADDED

	@@ -0,0 +1,120 @@

+# TalentPulse — AI Candidate Matching System
+A production-grade two-stage AI pipeline for matching job descriptions against large candidate pools. Built in 3 days as a coding assignment demonstration.
+## Architecture
+### Why Two Stages?
+Calling an LLM directly on 100K candidates costs hundreds of dollars and takes hours. The two-stage funnel gets it done in seconds for under $1:
+```
+Candidates (100K) → Stage 1 (ANN + scoring) → Top 50 → Stage 2 (reranker + LLM) → Top 20 ranked
+```
+**Stage 1 — Fast Retrieval (~50-100ms)**
+Bi-encoder vector search (BAAI/bge-small-en-v1.5) in Qdrant retrieves the top 200 semantically similar candidates. A weighted structured scorer then ranks them and passes the top 50 to Stage 2:
+```python
+score = (
+  0.35 * skill_jaccard(jd_skills, candidate_skills)
++ 0.20 * cosine_similarity          # from ANN
++ 0.15 * yoe_match(min_yoe, candidate_yoe)
++ 0.10 * company_quality_signal     # funded/product company bonus
++ 0.10 * growth_velocity            # proprietary trajectory score
++ 0.10 * education_match
+)
+```
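
The formula above names its helpers without defining them. The following is a minimal sketch of how the weighted sum could be wired up. The bodies of `skill_jaccard` and `yoe_match`, the `WEIGHTS` dictionary, and `stage1_score` are assumptions made for illustration, not the repository's implementation; only the weight values come from the README.

```python
from typing import Iterable, Mapping

# Weights copied from the README formula; they sum to 1.0.
WEIGHTS = {
    "skill": 0.35,
    "cosine": 0.20,
    "yoe": 0.15,
    "company": 0.10,
    "growth": 0.10,
    "education": 0.10,
}


def skill_jaccard(jd_skills: Iterable[str], candidate_skills: Iterable[str]) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B| on case-normalized skill names."""
    a = {s.strip().lower() for s in jd_skills}
    b = {s.strip().lower() for s in candidate_skills}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def yoe_match(min_yoe: float, candidate_yoe: float) -> float:
    """Assumed rule: 1.0 at or above the minimum, linear falloff below it."""
    if min_yoe <= 0:
        return 1.0
    return max(0.0, min(1.0, candidate_yoe / min_yoe))


def stage1_score(components: Mapping[str, float]) -> float:
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
```

Because every component is kept in [0, 1] and the weights sum to 1.0, the combined score also stays in [0, 1], which makes scores comparable across job descriptions.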
+**Stage 2 — Deep Reranking (~2-5s)**
+BGE cross-encoder (bge-reranker-v2-m3) jointly re-scores the top-50 shortlist. Rankings are then fused via Reciprocal Rank Fusion (Cormack et al., 2009). A Groq-hosted LLM generates explanations for the final top-20, grounded in pre-computed structured gaps.
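
Reciprocal Rank Fusion works on rank positions rather than raw scores: each candidate receives `1 / (k + rank)` from every ranking it appears in, and the contributions are summed. The sketch below uses `k = 60`, the constant from Cormack et al. (2009). Which rankings the project actually fuses is not stated in the README, so the two input lists here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank), rank from 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, candidate_id in enumerate(ranking, start=1):
            scores[candidate_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


stage1_order = ["c1", "c2", "c3"]     # hypothetical Stage 1 ranking
reranker_order = ["c2", "c3", "c1"]   # hypothetical cross-encoder ranking
fused = reciprocal_rank_fusion([stage1_order, reranker_order])
print(fused)  # ['c2', 'c1', 'c3']
```

Here `c2` wins because it sits near the top of both lists, while `c1` is first in one list but last in the other.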
+### Unique Differentiators
+1. **Trajectory Scoring** — Computes career growth velocity from work history timelines and title seniority progression. Rewards fast-promoters at funded product companies.
+2. **JD Quality Feedback** — Every match response includes a `jd_quality` report: vagueness score, breadth score, contradictions. Instead of silently handling vague JDs, we surface them as a feature.
+3. **Structured Gap Analysis** — Pre-computes missing skills, YOE delta, location mismatch, engineer type mismatch for each candidate. LLM explanations are grounded in this structured data — no hallucination.
+4. **Live Weight Sliders** — Component scores are stored in Postgres. Changing slider weights triggers a pure in-memory `/rerank` call — no model inference, sub-100ms response.
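
The structured gap analysis in item 3 can be pictured as a pure function that compares a parsed JD with a candidate record before any LLM call. The sketch below is an assumption about its shape: `GapReport`, `compute_gaps`, the JD keys (`skills`, `min_yoe`, `location`), and the sample values are hypothetical. The candidate fields `years_of_experience`, `open_to_working_at`, and `engineer_type` are borrowed from the CSV schema in this README.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GapReport:
    missing_skills: List[str]
    yoe_delta: float              # candidate years minus the JD minimum
    location_mismatch: bool
    engineer_type_mismatch: bool


def compute_gaps(jd: dict, candidate: dict) -> GapReport:
    """Compare a parsed JD with a candidate record (hypothetical dict shapes)."""
    have = {s.lower() for s in candidate.get("skills", [])}
    wanted = {s.lower() for s in jd.get("skills", [])}
    jd_location = jd.get("location")
    open_to = {loc.lower() for loc in candidate.get("open_to_working_at", [])}
    jd_type = jd.get("engineer_type")
    return GapReport(
        missing_skills=sorted(wanted - have),
        yoe_delta=candidate.get("years_of_experience", 0.0) - jd.get("min_yoe", 0.0),
        location_mismatch=bool(jd_location) and jd_location.lower() not in open_to,
        engineer_type_mismatch=bool(jd_type) and candidate.get("engineer_type") != jd_type,
    )


jd = {"skills": ["Python", "Kubernetes"], "min_yoe": 5,
      "location": "Bangalore", "engineer_type": "backend"}
candidate = {"skills": ["python", "Go"], "years_of_experience": 3,
             "open_to_working_at": ["Remote"], "engineer_type": "backend"}
gaps = compute_gaps(jd, candidate)
```

Passing a report like `gaps` into the prompt gives the LLM concrete facts to cite, so an explanation can be checked against the same fields afterwards.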
+## Stack
+| Component | Tech |
+|-----------|------|
+| Backend API | FastAPI + uvicorn |
+| Database | Neon Postgres (async via asyncpg) |
+| Vector DB | Qdrant Cloud |
+| Task Queue | Celery + Redis Cloud |
+| Embedding | BAAI/bge-small-en-v1.5 (local, 384-dim) |
+| Reranker | BAAI/bge-reranker-v2-m3 (cross-encoder) |
+| LLM | Groq — llama-3.3-70b-versatile |
+| Frontend | Next.js 16 + React 19 + TypeScript |
+## API Reference
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| POST | `/api/jds` | Create JD, triggers async parsing |
+| GET | `/api/jds` | List all JDs |
+| GET | `/api/jds/{id}` | JD detail + quality report |
+| POST | `/api/candidates/upload` | Upload CSV/JSON, queues batch ingestion |
+| GET | `/api/candidates/status/{task_id}` | Celery task status |
+| POST | `/api/match/{jd_id}` | Run full match pipeline |
+| GET | `/api/match/{jd_id}` | Get cached ranked results |
+| GET | `/api/match/{jd_id}/{cid}` | Candidate detail + LLM explanation |
+| POST | `/api/match/{jd_id}/rerank` | Re-sort with custom weights (< 100ms) |
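
The `/rerank` row describes re-sorting with custom weights and no model inference. A minimal sketch of that in-memory step follows, assuming component scores have already been loaded into a dictionary keyed by candidate ID. The `rerank` function, its argument shapes, and the weight normalization are illustrative assumptions, not the endpoint's actual handler.

```python
from typing import Dict, List, Tuple


def rerank(
    stored: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Re-sort candidates from cached component scores; no model calls involved."""
    total = sum(w for w in weights.values() if w > 0)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Normalize so slider positions act as relative importance.
    norm = {name: max(w, 0.0) / total for name, w in weights.items()}
    scored = [
        (cid, sum(norm[name] * comps.get(name, 0.0) for name in norm))
        for cid, comps in stored.items()
    ]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


stored = {
    "c1": {"skill": 0.9, "yoe": 0.2},
    "c2": {"skill": 0.4, "yoe": 1.0},
}
skill_heavy = rerank(stored, {"skill": 3, "yoe": 1})   # c1 ranks first
yoe_heavy = rerank(stored, {"skill": 1, "yoe": 3})     # c2 ranks first
```

The work is one multiply-add per component per candidate, which is why a response well under 100ms is plausible for a shortlist of this size.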
+## Setup
+### Backend
+```bash
+cd backend
+python -m venv .venv
+.venv\Scripts\activate   # Windows; on macOS/Linux: source .venv/bin/activate
+pip install -r requirements.txt
+# Run migrations
+alembic upgrade head
+# Start API
+uvicorn main:app --reload --port 8000
+# Start Celery worker (separate terminal)
+celery -A src.workers.celery_app.celery_app worker --loglevel=info
+```
+### Frontend
+```bash
+cd frontend
+npm install
+npm run dev
+```
+Open http://localhost:3000
+### Docker
+```bash
+docker-compose up --build
+```
+## Research Foundation
+- **Bi-encoder retrieval**: Karpukhin et al. (2020) — Dense Passage Retrieval
+- **Cross-encoder reranking**: Lin et al. (2021) — Pretrained Transformers for Text Ranking
+- **BGE models**: Chen et al. (2024) — BGE M3-Embedding
+- **Score fusion**: Cormack et al. (2009) — Reciprocal Rank Fusion
+- **LLM as ranker**: Sun et al. (2023) — Is ChatGPT Good at Search?
+- **Job matching**: Zhu et al. (2018) — Person-Job Fit with Joint Representation Learning
+## Candidate CSV Schema
+The upload accepts any CSV. Recognized columns:
+`id, name, email, looking_for, currently_employed, notice_period, open_to_working_at, role_type, engineer_type, years_of_experience, programming_languages, backend_frameworks, frontend_technologies, gen_ai_experience, recent_experience_type, education_status, degree, parsed_summary, parsed_skills, parsed_work_experience, most_recent_company, most_recent_company_description, most_recent_company_is_funded, most_recent_company_is_product_company, most_recent_company_total_funding, most_recent_company_funding_status, time_in_current_company, is_actively_or_passively_looking`
+Missing columns are handled gracefully — ingestion continues row by row.
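
One way to read "handled gracefully" is that absent columns fall back to defaults and a malformed row is logged and skipped rather than aborting the batch. The sketch below shows that pattern with the standard `csv` module. The `DEFAULTS` values, the `ingest_rows` function, and the error-list format are assumptions; only the column names come from the schema above.

```python
import csv
import io
from typing import Dict, List, Tuple

# Subset of the recognized columns, with the defaults this sketch assumes.
DEFAULTS: Dict[str, str] = {
    "id": "",
    "name": "",
    "email": "",
    "years_of_experience": "0",
    "parsed_skills": "",
}


def ingest_rows(csv_text: str) -> Tuple[List[dict], List[Tuple[int, str]]]:
    """Parse rows one at a time; a bad row is recorded and skipped, not fatal."""
    good: List[dict] = []
    errors: List[Tuple[int, str]] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        try:
            # Missing or empty cells fall back to the default for that column.
            record = {col: (row.get(col) or default) for col, default in DEFAULTS.items()}
            record["years_of_experience"] = float(record["years_of_experience"])
            good.append(record)
        except ValueError as exc:
            errors.append((row_number, str(exc)))
    return good, errors


sample = "name,years_of_experience\nAda,5\nBob,lots\nCy,\n"
good, errors = ingest_rows(sample)
```

In this sample the `email` column is absent entirely, `Bob` has an unparseable value and lands in `errors` with its row number, and `Cy` has an empty cell that falls back to `0`.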

backend/Dockerfile ADDED

	@@ -0,0 +1,13 @@

+FROM python:3.12-slim
+WORKDIR /app
+RUN apt-get update && apt-get install -y --no-install-recommends gcc && rm -rf /var/lib/apt/lists/*
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY . .
+EXPOSE 8000
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

docker-compose.yml ADDED

	@@ -0,0 +1,29 @@

+version: "3.9"
+services:
+  backend:
+    build:
+      context: ./backend
+      dockerfile: Dockerfile
+    ports:
+      - "8000:8000"
+    env_file: ./backend/.env
+    command: uvicorn main:app --host 0.0.0.0 --port 8000 --reload
+  worker:
+    build:
+      context: ./backend
+      dockerfile: Dockerfile
+    env_file: ./backend/.env
+    command: celery -A src.workers.celery_app.celery_app worker --loglevel=info --concurrency=2
+  frontend:
+    build:
+      context: ./frontend
+      dockerfile: Dockerfile
+    ports:
+      - "3000:3000"
+    environment:
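+      # Note: Next.js inlines NEXT_PUBLIC_* values at build time, so this runtime value
+      # does not reach browser code; `backend` also resolves only inside the Compose network.
+      - NEXT_PUBLIC_API_URL=http://backend:8000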
+    depends_on:
+      - backend

frontend/Dockerfile ADDED

	@@ -0,0 +1,10 @@

+FROM node:20-alpine
+WORKDIR /app
+COPY package*.json ./
+RUN npm ci
+COPY . .
+RUN npm run build
+EXPOSE 3000
+CMD ["npm", "start"]