Kaito117 committed on
Commit 3f7f9cc · 1 Parent(s): 8cf6b7b

update readme, add diagrams

Files changed (2):
  1. .env.example +1 -1
  2. README.md +177 -33
.env.example CHANGED
@@ -3,4 +3,4 @@ MONGO_DATABASE=
  SERPAPI_KEY=
  GROQ_API_KEY=
  RAPIDAPI_API_KEY=
- DEV=False
+ DEBUG=False
README.md CHANGED
@@ -1,62 +1,206 @@
- # score_profiles
-
- AI-powered LinkedIn candidate sourcing and scoring microservice.
-
- ## Features
-
- - FastAPI HTTP API with a `/jobs` endpoint
- - URL extraction, SerpAPI search, GitHub & LinkedIn profile clients
- - Profile data extraction & ranking via a `CandidateScorer`
- - CORS enabled for all origins
- - Comprehensive unit & integration tests using `pytest`, `respx` & FastAPI `TestClient`
-
- ## Getting Started
-
- ### Prerequisites
-
  - Python 3.12+
- - `uv` or `pip`
- - `.env` file copied from `.env.example`, populated with your API keys
-
  ### Installation

- ```sh
- # Using pip
  pip install -r requirements.txt
- # or using uv
  uv sync
- ```
-
- ### Configuration
-
- Copy `.env.example` to `.env` and populate it with your own values.
-
- Additionally, check if you want to change any settings in `config.py` (though the defaults are sensible).
-
- ### Running the Service
-
- ```sh
  python app/main.py
  ```
-
- A Docker-based setup is not ready yet.
-
- The API docs are available at `http://localhost:8000/docs`.
-
  ## Testing
-
- All HTTP calls are stubbed using [`respx`](https://github.com/lundberg/respx) and fixtures under `test/data/`.
-
- ```sh
  pytest
  ```
-
- ## Workflow
-
- 1. Client POSTs to `/jobs` with `search_query` (job description).
- 2. `LinkedInSourcingAgent` orchestrates:
-    - `SerpAPIClient.search` → get LinkedIn & GitHub URLs
-    - `LinkedInProfileClient.fetch_profile` & `GitHubClient.fetch_github_profile_html`
-    - `LinkedInProfileExtractor` & `GitHubProfileExtractor` → normalized profile dicts
-    - `CandidateScorer.batch_score_candidates` → rank & filter
- 3. Returns top-N scored candidates.
 
+ # LinkedIn Sourcing Agent
+
+ An autonomous AI agent that sources LinkedIn profiles at scale, scores candidates using advanced fit algorithms, and generates personalized outreach messages, built for the Synapse AI Challenge.
+
+ ## Challenge Overview
+
+ This project implements a complete LinkedIn sourcing pipeline that:
+ - **Discovers** relevant LinkedIn profiles from job descriptions
+ - **Scores** candidates using a comprehensive 6-factor rubric
+ - **Generates** personalized AI-powered outreach messages
+ - **Scales** to handle multiple jobs simultaneously
+
+ ## Key Features
+
+ ### Core Functionality
+ - **Smart Profile Discovery**: Multi-source candidate sourcing via SerpAPI + Google Search
+ - **Data Collection and Processing**: Collects LinkedIn data via RapidAPI and GitHub profiles via plain HTTP requests and HTML parsing
+ - **Advanced Scoring Algorithm**: 6-factor rubric (Education, Trajectory, Company, Skills, Location, Tenure)
+ - **AI-Powered Outreach**: Personalized LinkedIn messages using Llama (via Groq)
+ - **FastAPI Backend**: HTTP endpoint for instant results
+ - **Multi-Source Enhancement**: Combines LinkedIn + GitHub data for improved scoring
+ - **Smart Caching**: Intelligent caching to avoid re-fetching profiles
+ - **Batch Processing**: Handles multiple jobs in parallel (asyncio)
+ - **Confidence Scoring**: Shows confidence levels when data is incomplete
+
+ ## Architecture
+
+ ![Architecture](architecture.png)
+
+ ![Data flow](data_flow.png)
+
+ ## Tech Stack
+
+ - **Language**: Python 3.12+
+ - **Framework**: FastAPI
+ - **LLM**: Llama via Groq (planned: OpenAI GPT-4/o4-mini, Claude 4, Gemini 2.5 Pro)
+ - **Search**: SerpAPI (Google Search)
+ - **Storage**: In-memory with JSON persistence
+ - **Testing**: pytest + respx
+ - **Data Parsing**: BeautifulSoup for GitHub
+
+ ## Quick Start
+
+ ### Prerequisites
  - Python 3.12+
+ - API keys (SerpAPI, Groq, RapidAPI)

  ### Installation

+ ```bash
+ # Clone the repository
+ git clone <your-repo-url>
+ cd score_profiles
+
+ # Install dependencies
  pip install -r requirements.txt
+ # or using uv
  uv sync
+
+ # Setup environment
+ cp .env.example .env
+ # Add your API keys to .env
+ ```

+ ### Environment Variables (see `.env.example`)
+
+ ```env
+ MONGODB_URI=
+ MONGO_DATABASE=
+ SERPAPI_KEY=
+ GROQ_API_KEY=
+ RAPIDAPI_API_KEY=
+ DEBUG=False
+ ```
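As a rough illustration of how these variables might be consumed (not the project's actual `config.py`, and `load_settings` is a hypothetical name):

```python
import os

# Illustrative only - the project's real settings loading may differ.
def load_settings(env: dict) -> dict:
    """Pull the keys above out of an environment mapping, with safe defaults."""
    return {
        "mongodb_uri": env.get("MONGODB_URI", ""),
        "serpapi_key": env.get("SERPAPI_KEY", ""),
        "groq_api_key": env.get("GROQ_API_KEY", ""),
        "rapidapi_api_key": env.get("RAPIDAPI_API_KEY", ""),
        # DEBUG arrives as a string from .env; normalize it to a bool
        "debug": env.get("DEBUG", "False").lower() == "true",
    }


settings = load_settings(os.environ)
```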

+ ### Running the Agent
+
+ ```bash
+ # Start the FastAPI server
  python app/main.py
+
+ # API available at: http://localhost:8000
+ # Interactive docs: http://localhost:8000/docs
  ```

+ ## Fit Scoring
+
+ The scoring system evaluates candidates across 6 dimensions:
+
+ | Factor | Weight | Scoring Criteria |
+ |--------|--------|------------------|
+ | **Education** | 20% | Elite schools (9-10), Strong schools (7-8), Standard (5-6) |
+ | **Career Trajectory** | 20% | Clear progression (8-10), Steady growth (6-8), Limited (3-5) |
+ | **Company Relevance** | 15% | Top tech (9-10), Relevant industry (7-8), Any experience (5-6) |
+ | **Experience Match** | 25% | Perfect match (9-10), Strong overlap (7-8), Some relevance (5-6) |
+ | **Location Match** | 10% | Exact city (10), Same metro (8), Remote-friendly (6) |
+ | **Tenure** | 10% | 2-3 years avg (9-10), 1-2 years (6-8), Job hopping (3-5) |
+
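The rubric above implies a simple weighted sum. A sketch, with weights taken from the table; the real `CandidateScorer` may weight or combine factors differently, and `fit_score` is an illustrative name:

```python
# Weights from the rubric table; illustrative, not the actual CandidateScorer API.
WEIGHTS = {
    "education": 0.20,
    "trajectory": 0.20,
    "company": 0.15,
    "skills": 0.25,
    "location": 0.10,
    "tenure": 0.10,
}


def fit_score(breakdown: dict) -> float:
    """Combine per-factor scores (each 0-10) into one weighted 0-10 fit score."""
    return sum(WEIGHTS[factor] * breakdown.get(factor, 0.0) for factor in WEIGHTS)


score = fit_score({
    "education": 9.0, "trajectory": 8.0, "company": 8.5,
    "skills": 9.0, "location": 10.0, "tenure": 7.0,
})
```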
+ ## API
+
+ ### Single Job Processing
+
+ Use FastAPI's built-in docs endpoint for an interactive test.
+
+ Or, if you want to use a script:

+ ```python
+ import requests
+
+ response = requests.post("http://localhost:8000/jobs", json={
+     "search_query": "Software Engineer, ML Research\nWindsurf • Full Time • Mountain View, CA • On-site • $140,000 – $300,000 + Equity\nAbout the Company:\nWindsurf (formerly Codeium) is a Forbes AI 50 company building the future of developer productivity through AI. With over 200 employees and $243M raised across multiple rounds including a Series C, Windsurf provides \ncutting-edge in-editor autocomplete, chat assistants, and full IDEs powered by proprietary LLMs. Their user base spans hundreds of thousands of developers worldwide, reflecting strong\nproduct-market fit and commercial traction.\nRoles and Responsibilities:\nTrain and fine-tune LLMs focused on developer productivity\nDesign and prioritize experiments for product impact\nAnalyze results, conduct ablation studies, and document findings\nConvert ML discoveries into scalable product features\nParticipate in the ML reading group and contribute to knowledge sharing\nJob Requirements:\n2+ years in software engineering with fast promotions\nStrong software engineering and systems thinking skills\nProven experience training and iterating on large production neural networks\nStrong GPA from a top CS undergrad program (MIT, Stanford, CMU, UIUC, etc.)\nFamiliarity with tools like Copilot, ChatGPT, or Windsurf is preferred\nDeep curiosity for the code generation space\nExcellent documentation and experimentation discipline\nPrior experience with applied research (not purely academic publishing)\nMust be able to work in Mountain View, CA full-time onsite\nExcited to build product-facing features from ML research\nInterview Process\nRecruiter Chat (15 min)\nVirtual Algorithm Round (LeetCode-style, 45 min)\nVirtual ML Case Study (1 hour)\nOnsite (3 hours): Additional ML case, implementation project, and culture interview\nOffer Extended",
+     "max_candidates": 50,
+     "include_github": False,
+     "confidence_threshold": 0.3,
+ })
+
+ results = response.json()
+ ```
+
+ ### Sample Response
+
+ ```json
+ {
+     "job_id": "backend-fintech-sf-2024",
+     "candidates_found": 25,
+     "processing_time": "45.2s",
+     "top_candidates": [
+         {
+             "name": "Jane Smith",
+             "linkedin_url": "linkedin.com/in/janesmith",
+             "fit_score": 8.5,
+             "confidence": 0.92,
+             "score_breakdown": {
+                 "education": 9.0,
+                 "trajectory": 8.0,
+                 "company": 8.5,
+                 "skills": 9.0,
+                 "location": 10.0,
+                 "tenure": 7.0
+             },
+             "outreach_message": "Hi Jane, I noticed your impressive 6 years at Stripe building payment infrastructure. Your experience with distributed systems and fintech regulations makes you a perfect fit for our Senior Backend Engineer role...",
+             "key_highlights": [
+                 "6 years at Stripe in payments infrastructure",
+                 "Stanford CS degree",
+                 "Expert in distributed systems & microservices"
+             ]
+         }
+     ]
+ }
+ ```

  ## Testing

+ Comprehensive test suite with mocked HTTP calls:

+ ```bash
+ # Run all tests
  pytest
+
+ # Run with coverage
+ pytest --cov=app
+
+ # Run specific test categories
+ pytest tests/test_scoring.py
+ pytest tests/test_integration.py
  ```

+ ## Tradeoffs
+
+ - Not including Twitter or personal websites: high variance, low signal
+ - Not including GitHub by default: false positives (company profiles get recommended)
+
+ ### Sample Generated Outreach
+
+ *"Hi Alex, I came across your profile and was impressed by your work at OpenAI on transformer architectures. Your research background in neural code generation and experience with large-scale ML training makes you an ideal candidate for Windsurf's ML Research Engineer role. We're building the next generation of AI-powered developer tools - would love to discuss how your expertise could accelerate our LLM training initiatives..."*
176
+
177
+ ## Scaling Strategy
178
+
179
+ For production scale (100s of jobs):
180
+
181
+ 1. **Concurrency**: Asyncio is good, unless you have multiple cpu cores (use multiprocessing + asyncio - multiple docker containers)
182
+ 2. **Queue System**: Redis/Celery as async task queue (partial setup done)
183
+ 3. **Database**: MongoDB for intermediate and final results storage
184
+ 4. **Rate Limiting**: Intelligent backoff with multiple API key rotation
185
+ 5. **Monitoring**: Comprehensive logging and metrics (Prometheus and Grafana, Otel)
186
+
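Item 4 (backoff plus key rotation) can be sketched roughly as follows; this is an illustration under assumed names (`call_with_backoff`), not the project's implementation:

```python
import itertools
import time


def call_with_backoff(call, api_keys, max_retries=5, base_delay=0.5):
    """Try `call(key)` with keys in rotation, doubling the wait after each failure."""
    keys = itertools.cycle(api_keys)  # rotate across multiple API keys
    for attempt in range(max_retries):
        try:
            return call(next(keys))
        except Exception:  # e.g. a rate-limit (HTTP 429) error from the provider
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError("all retries exhausted")
```

Production code would catch a specific rate-limit exception rather than bare `Exception`, so that genuine bugs still surface.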
+ ## Future Enhancements
+
+ - [ ] **Database Integration**: MongoDB integration with motor (asynchronous)
+ - [ ] **Dockerization**: For ease of deployment
+ - [ ] **Advanced Deduplication**: Bloom filters for large-scale URL dedup
+ - [ ] **ML Enhancement**: Custom embedding models for better skill matching
+ - [ ] **Multi-platform**: Improve GitHub integration, add Twitter integration
+ - [ ] **A/B Testing**: Message and prompt effectiveness tracking
+
+ Built for the Synapse AI Challenge. Code structure designed for easy extension and modification.
+
+ ## License
+
+ MIT License - built for challenge purposes.
+
+ ---
+
+ **Demo Video**: [Link to 3-minute demo]
+ **Live API**: [HuggingFace Space URL]