PhdScout
AI-powered search agent for PhD positions, postdocs, research fellowships, and academic staff roles. Powered by the Groq free API — no subscriptions required.
Multi-source Search
5 job boards searched simultaneously — Europe, worldwide, and country-specific
AI Scoring
Each position scored 0–100 against your CV profile
Cover Letters
Personalised draft generated for every position
ZIP Export
Download all approved applications in one click
How it works
Upload your CV as PDF, DOCX, or TXT. The LLM extracts a structured profile: education, publications, skills, research interests.
PhdScout queries Euraxess, mlscientist.com, jobs.ac.uk, scholarshipdb.net, and nature.com/careers in parallel, then deduplicates and filters by recency (expired listings discarded).
Each position is scored 0–100 for fit. The LLM reasons semantically — "NLP" and "natural language processing" are treated as equivalent. Postdoc and fellowship positions are automatically penalised when the candidate's CV shows no completed or in-progress PhD.
Load any position to see CV tailoring hints and a draft cover letter. Edit freely before approving.
Download all approved applications as a ZIP containing cover letters and position summaries.
Installation
PhdScout runs locally with Python 3.10+ or on HuggingFace Spaces.
Clone & install
git clone https://github.com/Hipsterfil998/PhDScout.git
cd PhDScout
pip install -r requirements.txt
Get a Groq API key
Groq provides a generous free tier — no credit card required. Register at console.groq.com/keys.
Configure
Create a .env file in the project root:
# Required
LLM_BACKEND=groq
GROQ_API_KEY=gsk_your_key_here
# Optional overrides (see Configuration section)
OUTPUT_DIR=./output
Run
python app.py
Open http://localhost:7860 in your browser.
Dependencies
| Package | Purpose |
|---|---|
| openai | Groq and Ollama API client (OpenAI-compatible) |
| gradio | Web UI |
| pdfplumber | PDF text extraction |
| python-docx | DOCX text extraction |
| beautifulsoup4 + lxml | HTML scraping |
| requests | HTTP client for scrapers |
| python-dotenv | .env loading |
Quickstart
From zero to your first scored job list in under 5 minutes.
Click the upload area and select your PDF, DOCX, or TXT file.
Enter a research field ("machine learning", "computational neuroscience"…), choose a location, and pick a position type.
Wait ~2–3 minutes. The agent scrapes all sources, parses your CV, and scores every match.
Switch to the Results tab. Positions are sorted by posting date (newest first) and labelled with a freshness indicator.
In Review & Edit, select a position, read the CV hints, edit the draft, and click Approve & Save.
Go to the Export tab and download the ZIP.
Tip: Use comma-separated fields for broader searches: "machine learning, NLP, computer vision".
Web Interface
The Gradio UI is organised into three tabs.
Tab 1 — Setup & Search
| Field | Description |
|---|---|
| CV upload | PDF, DOCX, or TXT file |
| Research field | Free-text or comma-separated list |
| Location | 40+ countries or custom value |
| Position type | PhD, postdoc, predoctoral, fellowship, research staff |
| Min. match score | Threshold for the "above score" count (all positions still visible) |
Tab 2 — Results
Displays a scored table with columns: #, Score, Title, Institution, Type, Freshness, Rec., Why good fit.
Freshness labels
| Label | Meaning |
|---|---|
| 🟢 Recent | Posted within the last 30 days |
| 🟡 Older | Has a date, posted more than 30 days ago |
| 🔴 Closing soon | Deadline within 14 days |
| empty | No date information available |
Expired listings (deadline already passed, or posted in a previous year) are automatically excluded from results.
Tab 3 — Review & Edit
Select a position from the dropdown, click Load Position, then:
- Read the Position Details and match analysis
- Follow the CV Tailoring Hints panel
- Edit the Cover Letter draft freely
- Click Regenerate for a different version
- Download the letter as a .txt file
- Click Approve & Save to add it to the export queue
Command-Line Interface
For batch use or scripting, PhdScout exposes a CLI via main.py.
Basic usage
python main.py \
    --cv path/to/cv.pdf \
    --field "machine learning" \
    --location "Germany" \
    --type phd
Options
| Flag | Default | Description |
|---|---|---|
| --cv | required | Path to CV file (PDF, DOCX, TXT) |
| --field | required | Research field(s), comma-separated |
| --location | Europe | Location filter |
| --type | phd | Position type |
| --min-score | 60 | Minimum match score to show |
Python API
from agent import JobAgent

agent = JobAgent(
    model="llama-3.1-8b-instant",
    backend="groq",
    api_key="gsk_...",
)

profile, profile_text = agent.parse_cv("cv.pdf")
jobs = agent.search_jobs(field="NLP", location="Europe", position_type="phd")
scored = agent.score_jobs(jobs, profile_text)

for job in scored[:5]:
    m = job["match"]
    print(m["match_score"], job["title"], job.get("freshness"))
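The exact shape of the returned profile object is not documented here; as a rough, illustrative sketch (field names are assumptions based on the categories the parser extracts: education, publications, skills, research interests), it might resemble:

profile = {
    "education": [{"degree": "MSc", "field": "Computer Science", "year": 2024}],
    "publications": ["Title of paper one", "Title of paper two"],
    "skills": ["Python", "PyTorch", "statistics"],
    "research_interests": ["NLP", "machine learning"],
}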
Job Sources
Euraxess
EU/worldwide research portal. Country-filtered via API parameters.
mlscientist.com
ML & AI academic positions. 14 country categories supported.
jobs.ac.uk
UK academic jobs. Queried only when UK or Worldwide is selected.
scholarshipdb.net
Worldwide aggregator with 28k+ positions across all disciplines. Country-filtered via URL path.
nature.com/careers
Multidisciplinary global board. Keyword search + ISO country code filtering.
Freshness filtering
After scraping, PhdScout automatically removes:
- Postings dated in a previous year
- Postings whose deadline has already passed

Jobs with no date information are kept (benefit of the doubt).
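A minimal sketch of the rule, assuming dates have already been parsed into datetime.date objects (the field and function names here are illustrative, not the actual implementation):

import datetime

def is_stale(job, today=None):
    # Drop listings posted in a previous year or whose deadline has passed;
    # anything without date information passes through untouched.
    today = today or datetime.date.today()
    posted = job.get("posted_date")      # datetime.date or None
    deadline = job.get("deadline_date")  # datetime.date or None
    if posted is not None and posted.year < today.year:
        return True
    if deadline is not None and deadline < today:
        return True
    return False

# Applied to the scraped list before scoring:
fresh_jobs = [job for job in jobs if not is_stale(job)]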
PhD eligibility gate
Before scoring, PhdScout checks whether the candidate holds or is pursuing a PhD and enforces two caps on postdoc and fellowship positions:
| Candidate status | Postdoc / Fellowship score cap |
|---|---|
| No PhD detected in CV | ≤ 30 — set to skip |
| PhD in progress (candidate / student) | ≤ 65 |
| PhD completed | No cap |
This gate is enforced at two levels: in the LLM prompt (via JOB_MATCHER_PROMPT) and in code (agent/matching/matcher.py) as a safety net. PhD positions are always open to master's graduates — no cap applies.
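A minimal sketch of the code-level safety net, directly mirroring the table above (names are illustrative; the actual logic lives in agent/matching/matcher.py):

def apply_eligibility_cap(score, position_type, phd_status):
    # Cap postdoc/fellowship scores according to the candidate's PhD status.
    if position_type not in ("postdoc", "fellowship"):
        return score  # PhD and other positions: no cap
    caps = {"none": 30, "in_progress": 65}  # completed PhD: no entry, no cap
    cap = caps.get(phd_status)
    return min(score, cap) if cap is not None else score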
Adding a source
Create a new file in agent/search/scrapers/ that subclasses BaseScraper:
from agent.search.scrapers.base import BaseScraper

class MyScraper(BaseScraper):
    name = "mysource"

    def scrape(self, field, location, position_type):
        soup = self._fetch(f"https://example.com/jobs?q={field}")
        if soup is None:
            return []
        results = []
        for card in soup.select(".job-card"):
            results.append({
                "title": card.select_one("h2").text,
                "url": card.select_one("a")["href"],
                "posted": card.select_one(".date").text,
                "source": self.name,
                "type": self._detect_type(card.text, ""),
            })
        return results
Then register it in agent/search/searcher.py → _build_scrapers().
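A hedged sketch of what the registration might look like (the real _build_scrapers() already lists the built-in scrapers; the import path for your module is hypothetical):

# agent/search/searcher.py
from agent.search.scrapers.mysource import MyScraper  # hypothetical module path

def _build_scrapers():
    return [
        # ... existing scrapers ...
        MyScraper(),
    ]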
Configuration
All settings live in config.py. Edit the file directly; the CLI picks up changes on its next run, while a running Gradio app must be restarted.
LLM settings
| Parameter | Default | Description |
|---|---|---|
| default_model | llama-3.1-8b-instant | Groq model to use |
| max_tokens | 4096 | Max tokens per LLM response |
| llm_backend | ollama | Backend: groq, huggingface, or ollama |
Scraper settings
| Parameter | Default | Description |
|---|---|---|
| scraper_delay | 1.5 s | Polite delay between HTTP requests |
| max_results_per_source | 20 | Max listings fetched per source |
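For orientation, a minimal sketch of a polite fetch helper using the project's requests and beautifulsoup4 dependencies (illustrative only; it may not match the real _fetch implementation):

import time
import requests
from bs4 import BeautifulSoup

SCRAPER_DELAY = 1.5  # seconds; mirrors config.scraper_delay

def fetch_soup(url):
    # Sleep before each request so scrapers stay polite to the source sites.
    time.sleep(SCRAPER_DELAY)
    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        return None
    return BeautifulSoup(resp.text, "lxml")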
Freshness thresholds
| Parameter | Default | Description |
|---|---|---|
| recent_days | 30 | Days since posting → 🟢 Recent |
| deadline_warn_days | 14 | Days until deadline → 🔴 Closing soon |
UI defaults
| Parameter | Default | Description |
|---|---|---|
| min_score_default | 60 | Default minimum match score slider value |
Environment variables
| Variable | Description |
|---|---|
| GROQ_API_KEY | Groq API key (takes priority over HF_TOKEN) |
| HF_TOKEN | HuggingFace token (fallback backend) |
| LLM_BACKEND | Override backend: groq, huggingface, or ollama |
| OUTPUT_DIR | Output directory for ZIP exports (default: ./output) |
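Putting the three LLM-related variables together, the backend resolution order can be sketched like this (illustrative; the actual selection happens in config.py and the client setup):

import os

def resolve_backend():
    # An explicit LLM_BACKEND always wins; otherwise GROQ_API_KEY takes
    # priority over HF_TOKEN, falling back to the config.py default (ollama).
    explicit = os.getenv("LLM_BACKEND")
    if explicit:
        return explicit
    if os.getenv("GROQ_API_KEY"):
        return "groq"
    if os.getenv("HF_TOKEN"):
        return "huggingface"
    return "ollama"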
Prompts
All LLM prompts live in agent/prompts/. Each service has its own file — edit the relevant file to tune that part of the agent's behaviour.
Prompts use Python .format() placeholders like {profile}. Keep all placeholders intact when editing.
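For example, a made-up template in the same style (only {profile} is a real placeholder from the project; {job} and the wording are invented for illustration):

EXAMPLE_PROMPT = (
    "Candidate profile:\n{profile}\n\n"
    "Job posting:\n{job}\n\n"
    "Score the fit from 0-100 and justify the score briefly."
)

# The agent fills the placeholders at call time:
filled = EXAMPLE_PROMPT.format(profile="...", job="...")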
Available prompts
| Constant | Used by | Controls |
|---|---|---|
| File: agent/prompts/cv_parser.py | | |
| CV_PARSER_SYSTEM, CV_PARSER_PROMPT | CVParser | How the CV is structured into JSON. Tweak to extract custom fields. |
| File: agent/prompts/job_matcher.py | | |
| JOB_MATCHER_SYSTEM, JOB_MATCHER_PROMPT | JobMatcher | Scoring criteria, eligibility gate, and scoring guide. Edit thresholds here. |
| File: agent/prompts/cv_tailor.py | | |
| CV_TAILOR_SYSTEM, CV_TAILOR_PROMPT | CVTailor | What tailoring hints to produce and how specific to be. |
| File: agent/prompts/cover_letter.py | | |
| COVER_LETTER_SYSTEM, COVER_LETTER_PROMPT | CoverLetterWriter | Letter style, length, structure, and language detection. |
Example: changing the letter length
In agent/prompts/cover_letter.py, find COVER_LETTER_SYSTEM and change:
# Before
The letter should be 400-600 words (3-4 paragraphs).
# After
The letter should be 250-350 words (2-3 paragraphs).
Example: stricter scoring
In JOB_MATCHER_PROMPT, raise the thresholds in the scoring guide:
Scoring guide:
85-100: Excellent — perfect research keyword overlap, recent publications
70-84: Good — strong overlap on primary research area
50-69: Partial — some overlap, transferable skills
0-49: Skip — different area or missing key requirements
Architecture
Project structure
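A partial layout, reconstructed from the paths referenced throughout this documentation (other files omitted):

app.py                       # Gradio web UI
main.py                      # CLI entry point
config.py                    # all settings
push_to_hf.sh                # HuggingFace Spaces deploy helper
requirements.txt
docs/index.html              # this documentation
agent/
├── prompts/                 # one prompt file per service
│   ├── cv_parser.py
│   ├── job_matcher.py
│   ├── cv_tailor.py
│   └── cover_letter.py
├── search/
│   ├── searcher.py          # JobSearcher, _build_scrapers()
│   └── scrapers/            # one scraper per job board (BaseScraper subclasses)
└── matching/
    └── matcher.py           # JobMatcher + PhD eligibility safety net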
Pipeline flow
CV file
↓ CVParser.extract_raw_text()
Raw text
↓ CVParser.parse() → LLM → CVProfile JSON
↓ CVParser.summarize() → profile_text
profile_text
↓ (in parallel with search)
↓ JobSearcher.search() → scrapers → deduplicate → filter stale → label freshness
jobs[]
↓ JobMatcher.score_all() → LLM × N → sort by score
scored_jobs[]
↓ (per selected job)
↓ CVTailor.generate() → LLM → TailoringHints
↓ CoverLetterWriter.generate() → LLM → draft letter
approved_jobs[] → ZIP export
LLM backends
| Backend | env var | Notes |
|---|---|---|
| Groq (recommended) | GROQ_API_KEY | Free tier, fast, OpenAI-compatible |
| Ollama | — | Local inference, set LLM_BACKEND=ollama |
| HuggingFace | HF_TOKEN | Fallback, free tier has rate limits |
Deployment
HuggingFace Spaces (recommended)
Go to huggingface.co/spaces → New Space → SDK: Gradio.
Add the Space as a remote and push: git push space main
In Space Settings → Variables and Secrets, add GROQ_API_KEY.
Alternatively, run ./push_to_hf.sh — it injects the required YAML frontmatter automatically.
GitHub Pages (this documentation)
This documentation is a single HTML file at docs/index.html — no build step required.
To enable GitHub Pages:
- Go to your GitHub repo → Settings → Pages
- Source: Deploy from a branch
- Branch: main, folder: /docs
- Click Save
The docs will be live at https://<username>.github.io/PhDScout.
Editing the docs
To modify this documentation directly on GitHub:
- Go to your repo on GitHub
- Navigate to docs/index.html
- Click the pencil icon (Edit this file)
- Edit the HTML — each section is a <section class="section" id="..."> block
- Commit directly to main — GitHub Pages rebuilds automatically
The navigation links are wired by JavaScript at the bottom of the file. To add a new section: add a <button> in the sidebar and a matching <section> in the main area.