PhdScout
AI-powered search agent for PhD positions, postdocs, research fellowships, and academic staff roles. Powered by the Groq free API — no subscriptions required.
Multi-source Search
5 job boards searched simultaneously — Europe, worldwide, and country-specific
AI Scoring
Each position scored 0–100 against your CV profile
Cover Letters
Personalised draft generated for every position
ZIP Export
Download all approved applications in one click
How it works
Upload your CV as PDF, DOCX, or TXT. The LLM extracts a structured profile: education, publications, skills, research interests.
PhdScout queries Euraxess, mlscientist.com, jobs.ac.uk, scholarshipdb.net, and nature.com/careers in parallel, then deduplicates and filters by recency (expired listings discarded).
Each position is scored 0–100 for fit. The LLM reasons semantically — "NLP" and "natural language processing" are treated as equivalent. Postdoc and fellowship positions are automatically penalised when the candidate's CV shows no completed or in-progress PhD.
Load any position to see CV tailoring hints and a draft cover letter. Edit freely before approving.
Download all approved applications as a ZIP containing cover letters and position summaries.
Installation
PhdScout runs locally with Python 3.10+ or on HuggingFace Spaces.
Clone & install
git clone https://github.com/Hipsterfil998/PhDScout.git
cd PhDScout
pip install -r requirements.txt
Get a Groq API key
Groq provides a generous free tier — no credit card required. Register at console.groq.com/keys.
Configure
Create a .env file in the project root:
# Required
LLM_BACKEND=groq
GROQ_API_KEY=gsk_your_key_here
# Optional overrides (see Configuration section)
OUTPUT_DIR=./output
Run
python app.py
Open http://localhost:7860 in your browser.
Dependencies
| Package | Purpose |
|---|---|
| openai | Groq and Ollama API client (OpenAI-compatible) |
| gradio | Web UI |
| pdfplumber | PDF text extraction |
| python-docx | DOCX text extraction |
| beautifulsoup4 + lxml | HTML scraping |
| requests | HTTP client for scrapers |
| python-dotenv | .env loading |
Quickstart
From zero to your first scored job list in under 5 minutes.
Click the upload area and select your PDF, DOCX, or TXT file.
Enter a research field ("machine learning", "computational neuroscience"…), choose a location, and pick a position type.
Wait ~2–3 minutes. The agent scrapes all sources, parses your CV, and scores every match.
Switch to the Results tab. Positions are sorted by posting date (newest first) and labelled with a freshness indicator.
In Review & Edit, select a position, read the CV hints, edit the draft, and click Approve & Save.
Go to the Export tab and download the ZIP.
Tip: Use comma-separated fields for broader searches: "machine learning, NLP, computer vision".
Web Interface
The Gradio UI is organised into three tabs.
Tab 1 — Setup & Search
| Field | Description |
|---|---|
| CV upload | PDF, DOCX, or TXT file |
| Research field | Free-text or comma-separated list |
| Location | 40+ countries or custom value |
| Position type | PhD, postdoc, predoctoral, fellowship, research staff |
| Min. match score | Threshold for the "above score" count (all positions still visible) |
Tab 2 — Results
Displays a scored table with columns: #, Score, Title, Institution, Type, Freshness, Rec., Why good fit.
Freshness labels
| Label | Meaning |
|---|---|
| 🟢 Recent | Posted within the last 30 days |
| 🟡 Older | Has a date, posted more than 30 days ago |
| 🔴 Closing soon | Deadline within 14 days |
| empty | No date information available |
Expired listings (deadline already passed, or posted in a previous year) are automatically excluded from results.
Tab 3 — Review & Edit
Select a position from the dropdown, click Load Position, then:
- Read the Position Details and match analysis
- Follow the CV Tailoring Hints panel
- Edit the Cover Letter draft freely
- Click Regenerate for a different version
- Download the letter as a .txt file
- Click Approve & Save to add it to the export queue
Command-Line Interface
For batch use or scripting, PhdScout exposes a CLI via main.py.
Basic usage
python main.py \
    --cv path/to/cv.pdf \
    --field "machine learning" \
    --location "Germany" \
    --type phd
Options
| Flag | Default | Description |
|---|---|---|
| --cv | required | Path to CV file (PDF, DOCX, TXT) |
| --field | required | Research field(s), comma-separated |
| --location | Europe | Location filter |
| --type | phd | Position type |
| --min-score | 60 | Minimum match score to show |
Python API
from agent import JobAgent

agent = JobAgent(
    model="llama-3.1-8b-instant",
    backend="groq",
    api_key="gsk_...",
)

profile, profile_text = agent.parse_cv("cv.pdf")
jobs = agent.search_jobs(field="NLP", location="Europe", position_type="phd")
scored = agent.score_jobs(jobs, profile_text)

for job in scored[:5]:
    m = job["match"]
    print(m["match_score"], job["title"], job.get("freshness"))
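The exact shape of the returned profile object is not documented here; as a rough, illustrative sketch (field names are assumptions based on the categories the parser extracts: education, publications, skills, research interests), it might resemble:

profile = {
    "education": [{"degree": "MSc", "field": "Computer Science", "year": 2024}],
    "publications": ["Title of paper one", "Title of paper two"],
    "skills": ["Python", "PyTorch", "statistics"],
    "research_interests": ["NLP", "machine learning"],
}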
Job Sources
Euraxess
EU/worldwide research portal. Country-filtered via API parameters.
mlscientist.com
ML & AI academic positions. 14 country categories supported.
jobs.ac.uk
UK academic jobs. Queried only when UK or Worldwide is selected.
scholarshipdb.net
Worldwide aggregator with 28k+ positions across all disciplines. Country-filtered via URL path.
nature.com/careers
Multidisciplinary global board. Keyword search + ISO country code filtering.
Freshness filtering
After scraping, PhdScout automatically removes:
- Postings dated in a previous year
- Postings whose deadline has already passed

Jobs with no date information are kept (benefit of the doubt).
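A minimal sketch of the rule, assuming dates have already been parsed into datetime.date objects (the field and function names here are illustrative, not the actual implementation):

import datetime

def is_stale(job, today=None):
    # Drop listings posted in a previous year or whose deadline has passed;
    # anything without date information passes through untouched.
    today = today or datetime.date.today()
    posted = job.get("posted_date")      # datetime.date or None
    deadline = job.get("deadline_date")  # datetime.date or None
    if posted is not None and posted.year < today.year:
        return True
    if deadline is not None and deadline < today:
        return True
    return False

# Applied to the scraped list before scoring:
fresh_jobs = [job for job in jobs if not is_stale(job)]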
PhD eligibility gate
Before scoring, PhdScout checks whether the candidate holds or is pursuing a PhD and enforces two caps on postdoc and fellowship positions:
| Candidate status | Postdoc / Fellowship score cap |
|---|---|
| No PhD detected in CV | ≤ 30 — set to skip |
| PhD in progress (candidate / student) | ≤ 65 |
| PhD completed | No cap |
This gate is enforced at two levels: in the LLM prompt (via JOB_MATCHER_PROMPT) and in code (agent/matching/matcher.py) as a safety net. PhD positions are always open to master's graduates — no cap applies.
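A minimal sketch of the code-level safety net, directly mirroring the table above (names are illustrative; the actual logic lives in agent/matching/matcher.py):

def apply_eligibility_cap(score, position_type, phd_status):
    # Cap postdoc/fellowship scores according to the candidate's PhD status.
    if position_type not in ("postdoc", "fellowship"):
        return score  # PhD and other positions: no cap
    caps = {"none": 30, "in_progress": 65}  # completed PhD: no entry, no cap
    cap = caps.get(phd_status)
    return min(score, cap) if cap is not None else score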
Adding a source
Create a new file in agent/search/scrapers/ that subclasses BaseScraper:
from agent.search.scrapers.base import BaseScraper

class MyScraper(BaseScraper):
    name = "mysource"

    def scrape(self, field, location, position_type):
        soup = self._fetch(f"https://example.com/jobs?q={field}")
        if soup is None:
            return []
        results = []
        for card in soup.select(".job-card"):
            results.append({
                "title": card.select_one("h2").text,
                "url": card.select_one("a")["href"],
                "posted": card.select_one(".date").text,
                "source": self.name,
                "type": self._detect_type(card.text, ""),
            })
        return results
Then register it in agent/search/searcher.py → _build_scrapers().
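A hedged sketch of what the registration might look like (the real _build_scrapers() already lists the built-in scrapers; the import path for your module is hypothetical):

# agent/search/searcher.py
from agent.search.scrapers.mysource import MyScraper  # hypothetical module path

def _build_scrapers():
    return [
        # ... existing scrapers ...
        MyScraper(),
    ]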
Configuration
All settings live in config.py. Edit the file directly; the CLI picks up changes on its next run, while a running Gradio app must be restarted.
LLM settings
| Parameter | Default | Description |
|---|---|---|
| default_model | llama-3.1-8b-instant | Groq model to use |
| max_tokens | 4096 | Max tokens per LLM response |
| llm_backend | ollama | Backend: groq, huggingface, or ollama |
Scraper settings
| Parameter | Default | Description |
|---|---|---|
| scraper_delay | 1.5 s | Polite delay between HTTP requests |
| max_results_per_source | 20 | Max listings fetched per source |
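For orientation, a minimal sketch of a polite fetch helper using the project's requests and beautifulsoup4 dependencies (illustrative only; it may not match the real _fetch implementation):

import time
import requests
from bs4 import BeautifulSoup

SCRAPER_DELAY = 1.5  # seconds; mirrors config.scraper_delay

def fetch_soup(url):
    # Sleep before each request so scrapers stay polite to the source sites.
    time.sleep(SCRAPER_DELAY)
    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        return None
    return BeautifulSoup(resp.text, "lxml")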
Freshness thresholds
| Parameter | Default | Description |
|---|---|---|
| recent_days | 30 | Days since posting → 🟢 Recent |
| deadline_warn_days | 14 | Days until deadline → 🔴 Closing soon |
UI defaults
| Parameter | Default | Description |
|---|---|---|
| min_score_default | 60 | Default minimum match score slider value |
Environment variables
| Variable | Description |
|---|---|
| GROQ_API_KEY | Groq API key (takes priority over HF_TOKEN) |
| HF_TOKEN | HuggingFace token (fallback backend) |
| LLM_BACKEND | Override backend: groq, huggingface, or ollama |
| OUTPUT_DIR | Output directory for ZIP exports (default: ./output) |
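Putting the three LLM-related variables together, the backend resolution order can be sketched like this (illustrative; the actual selection happens in config.py and the client setup):

import os

def resolve_backend():
    # An explicit LLM_BACKEND always wins; otherwise GROQ_API_KEY takes
    # priority over HF_TOKEN, falling back to the config.py default (ollama).
    explicit = os.getenv("LLM_BACKEND")
    if explicit:
        return explicit
    if os.getenv("GROQ_API_KEY"):
        return "groq"
    if os.getenv("HF_TOKEN"):
        return "huggingface"
    return "ollama"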
Prompts
All LLM prompts live in agent/prompts/. Each service has its own file — edit the relevant file to tune that part of the agent's behaviour.
Prompts use Python .format() placeholders like {profile}. Keep all placeholders intact when editing.
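For example, a made-up template in the same style (only {profile} is a real placeholder from the project; {job} and the wording are invented for illustration):

EXAMPLE_PROMPT = (
    "Candidate profile:\n{profile}\n\n"
    "Job posting:\n{job}\n\n"
    "Score the fit from 0-100 and justify the score briefly."
)

# The agent fills the placeholders at call time:
filled = EXAMPLE_PROMPT.format(profile="...", job="...")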
Available prompts
| Constant | Used by | Controls |
|---|---|---|
| File: agent/prompts/cv_parser.py | | |
| CV_PARSER_SYSTEM, CV_PARSER_PROMPT | CVParser | How the CV is structured into JSON. Tweak to extract custom fields. |
| File: agent/prompts/job_matcher.py | | |
| JOB_MATCHER_SYSTEM, JOB_MATCHER_PROMPT | JobMatcher | Scoring criteria, eligibility gate, and scoring guide. Edit thresholds here. |
| File: agent/prompts/cv_tailor.py | | |
| CV_TAILOR_SYSTEM, CV_TAILOR_PROMPT | CVTailor | What tailoring hints to produce and how specific to be. |
| File: agent/prompts/cover_letter.py | | |
| COVER_LETTER_SYSTEM, COVER_LETTER_PROMPT | CoverLetterWriter | Letter style, length, structure, and language detection. |
Example: changing the letter length
In agent/prompts/cover_letter.py, find COVER_LETTER_SYSTEM and change:
# Before
The letter should be 400-600 words (3-4 paragraphs).
# After
The letter should be 250-350 words (2-3 paragraphs).
Example: stricter scoring
In JOB_MATCHER_PROMPT, raise the thresholds in the scoring guide:
Scoring guide:
85-100: Excellent — perfect research keyword overlap, recent publications
70-84: Good — strong overlap on primary research area
50-69: Partial — some overlap, transferable skills
0-49: Skip — different area or missing key requirements
Architecture
Project structure
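A partial layout, reconstructed from the paths referenced throughout this documentation (other files omitted):

app.py                       # Gradio web UI
main.py                      # CLI entry point
config.py                    # all settings
push_to_hf.sh                # HuggingFace Spaces deploy helper
requirements.txt
docs/index.html              # this documentation
agent/
├── prompts/                 # one prompt file per service
│   ├── cv_parser.py
│   ├── job_matcher.py
│   ├── cv_tailor.py
│   └── cover_letter.py
├── search/
│   ├── searcher.py          # JobSearcher, _build_scrapers()
│   └── scrapers/            # one scraper per job board (BaseScraper subclasses)
└── matching/
    └── matcher.py           # JobMatcher + PhD eligibility safety net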
Pipeline flow
CV file
↓ CVParser.extract_raw_text()
Raw text
↓ CVParser.parse() → LLM → CVProfile JSON
↓ CVParser.summarize() → profile_text
profile_text
↓ (in parallel with search)
↓ JobSearcher.search() → scrapers → deduplicate → filter stale → label freshness
jobs[]
↓ JobMatcher.score_all() → LLM × N → sort by score
scored_jobs[]
↓ (per selected job)
↓ CVTailor.generate() → LLM → TailoringHints
↓ CoverLetterWriter.generate() → LLM → draft letter
approved_jobs[] → ZIP export
LLM backends
| Backend | env var | Notes |
|---|---|---|
| Groq (recommended) | GROQ_API_KEY | Free tier, fast, OpenAI-compatible |
| Ollama | — | Local inference, set LLM_BACKEND=ollama |
| HuggingFace | HF_TOKEN | Fallback, free tier has rate limits |
Deployment
HuggingFace Spaces (recommended)
Go to huggingface.co/spaces → New Space → SDK: Gradio.
Add the Space as a remote and push: git push space main
In Space Settings → Variables and Secrets, add GROQ_API_KEY.
Alternatively, run ./push_to_hf.sh — it injects the required YAML frontmatter automatically.
GitHub Pages (this documentation)
This documentation is a single HTML file at docs/index.html — no build step required.
To enable GitHub Pages:
- Go to your GitHub repo → Settings → Pages
- Source: Deploy from a branch
- Branch: main, folder: /docs
- Click Save
The docs will be live at https://<username>.github.io/PhDScout.
Editing the docs
To modify this documentation directly on GitHub:
- Go to your repo on GitHub
- Navigate to docs/index.html
- Click the pencil icon (Edit this file)
- Edit the HTML — each section is a <section class="section" id="..."> block
- Commit directly to main — GitHub Pages rebuilds automatically
The navigation links are wired by JavaScript at the bottom of the file. To add a new section: add a <button> in the sidebar and a matching <section> in the main area.