Upload folder using huggingface_hub
Files changed:
- README.md +63 -42
- app.py +458 -414
- requirements.txt +5 -3
README.md
CHANGED
@@ -1,81 +1,102 @@
 ---
-title:
-emoji:
-colorFrom:
-colorTo:
 sdk: gradio
-sdk_version:
 python_version: "3.10"
 app_file: app.py
 pinned: false
 license: mit
-short_description:
 ---

-#

-An AI-powered tool that

 ## Features

-
-
-- **Structured Output** -- Every summary includes:
-  - Extracted paper title
-  - Concise abstract-length summary
-  - Key findings from the results/conclusion sections
-  - Methodology overview
-  - Word-count statistics with compression ratio
-- **Long Document Support** -- Papers of any length are automatically chunked and summarized in multiple passes, then combined into a coherent final summary.
-- **Clean PDF Processing** -- Handles hyphenated line breaks, control characters, and other common PDF artifacts.

-##

-
-
-
-
-5. **Section Extraction** -- Regex heuristics identify Results, Methodology, and Conclusion sections for targeted summarization of key findings and methods.

-##

-

-##

-
-
-- **Processing time** may be longer for very large papers due to CPU-only inference.

-##

-
 |---|---|
 | Web framework | Gradio 4.44 |
-
 | PDF parsing | PyMuPDF (fitz) |
-

-##

 ```bash
 # Clone the repository
-git clone https://huggingface.co/spaces/gr8monk3ys/
-cd

 # Install dependencies
 pip install -r requirements.txt

-#
 python app.py
 ```

 The app will be available at `http://localhost:7860`.

-##

-

-##

-
 ---
+title: Resume Analyzer
+emoji: 📋
+colorFrom: green
+colorTo: blue
 sdk: gradio
+sdk_version: 5.9.1
 python_version: "3.10"
 app_file: app.py
 pinned: false
 license: mit
+short_description: AI-powered resume analysis against job descriptions
 ---

+# Resume Analyzer

+An AI-powered tool that analyzes how well your resume matches a target job description. Built with Gradio, sentence-transformers, and scikit-learn.

 ## Features

+### Semantic Similarity Scoring
+Uses the `sentence-transformers/all-MiniLM-L6-v2` model to compute deep semantic similarity between your resume and the job description. This goes beyond simple keyword matching to understand the meaning and context of your experience relative to what the role demands.

+### TF-IDF Keyword Extraction
+Extracts the most important keywords and phrases from both documents using Term Frequency-Inverse Document Frequency (TF-IDF) with unigrams and bigrams. This surfaces the specific terms that carry the most weight in each document.

+### Keyword Gap Analysis
+Compares the job description's top keywords against your resume content to identify:
+- **Matched keywords** -- terms the job requires that your resume already contains.
+- **Missing keywords** -- high-value terms you should consider adding (where truthfully applicable).

+### Section-by-Section Analysis
+Automatically detects standard resume sections (Summary, Experience, Education, Skills, Projects) and scores each one independently against the job description. This pinpoints exactly which parts of your resume need the most attention.

+### Composite Match Score
+A weighted composite score (60% semantic similarity, 40% keyword overlap) that gives a single 0-100% indicator of overall fit.

+### Actionable Suggestions
+Generates specific, prioritized recommendations for improving your resume's alignment with the target role.

+### PDF Upload Support
+Upload your resume as a PDF file instead of pasting text. The app extracts text from the PDF automatically using PyMuPDF.

+## How to Use

+1. **Paste your resume** into the left text area, or upload a PDF using the file upload widget.
+2. **Paste the job description** into the right text area.
+3. Click **Analyze**.
+4. Browse the results across four tabs:
+   - **Overview** -- composite score, breakdown, and verdict.
+   - **Keywords** -- matched and missing keywords with TF-IDF rankings.
+   - **Sections** -- per-section scores and assessments.
+   - **Suggestions** -- numbered, actionable improvement recommendations.
+
+Example data is pre-loaded so you can click Analyze immediately to see the tool in action.
+
+## Technical Architecture
+
+```
+Resume Text / PDF ──┐
+                    ├──▶ Semantic Embedding (MiniLM-L6-v2) ──▶ Cosine Similarity
+Job Description ────┘
+                    ├──▶ TF-IDF Vectorization ──▶ Keyword Extraction & Matching
+                    └──▶ Section Detection (regex heuristics) ──▶ Per-section Scoring
+```
+
+| Component | Technology |
 |---|---|
 | Web framework | Gradio 4.44 |
+| Semantic model | sentence-transformers/all-MiniLM-L6-v2 |
+| Keyword extraction | scikit-learn TfidfVectorizer |
 | PDF parsing | PyMuPDF (fitz) |
+| Numerical compute | NumPy |

+## Running Locally

 ```bash
 # Clone the repository
+git clone https://huggingface.co/spaces/gr8monk3ys/resume-analyzer-space
+cd resume-analyzer-space

 # Install dependencies
 pip install -r requirements.txt

+# Launch the app
 python app.py
 ```

 The app will be available at `http://localhost:7860`.

+## Project Structure

+```
+resume-analyzer-space/
+├── app.py              # Application source (Gradio interface + analysis logic)
+├── requirements.txt    # Python dependencies
+└── README.md           # This file (includes HF Space metadata)
+```

+## License

+MIT
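The 60/40 composite match score described in the new README reduces to a single weighted sum of two 0-1 signals. A minimal sketch of that blend (the standalone function name `composite_score` is illustrative, not part of the app's code):

```python
def composite_score(semantic_similarity: float, keyword_overlap: float) -> float:
    """Blend the two 0-1 signals: 60% semantic similarity, 40% keyword overlap."""
    return 0.6 * semantic_similarity + 0.4 * keyword_overlap

# Example: strong semantic fit but only moderate keyword coverage
score = composite_score(0.82, 0.40)
print(round(score * 100, 1))  # 65.2 -> falls in the "Good match" verdict band
```
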
app.py
CHANGED
@@ -1,535 +1,579 @@
 """
-
-Version: 2.0.0 (Gradio 5.x compatible)

-This application uses
-
-
-key findings, methodology notes, and concise abstracts.

 Author: Lorenzo Scaturchio (gr8monk3ys)
 License: MIT
 """

-import os
 import re
 import logging
 from typing import Optional

 import fitz  # PyMuPDF
 import gradio as gr
-

 # ---------------------------------------------------------------------------
 # Logging
 # ---------------------------------------------------------------------------
-logging.basicConfig(
-    level=logging.INFO,
-    format="%(asctime)s [%(levelname)s] %(message)s",
-)
 logger = logging.getLogger(__name__)

 # ---------------------------------------------------------------------------
 # Constants
 # ---------------------------------------------------------------------------
-MODEL_NAME = "
-
-
-
-
-
-

 # ---------------------------------------------------------------------------
-#
 # ---------------------------------------------------------------------------
-
-
-
-

-
-
-

-
-

-    Args:
-        pdf_path: Path to the uploaded PDF file.

-
-

-
-
-    """
     try:
         doc = fitz.open(pdf_path)
     except Exception as exc:
-
-
-            f"Details: {exc}"
-        ) from exc
-
-    pages: list[str] = []
-    for page_num, page in enumerate(doc):
-        text = page.get_text("text")
-        if text.strip():
-            pages.append(text)
-            logger.debug("Page %d: extracted %d characters", page_num + 1, len(text))
-
-    doc.close()
-
-    if not pages:
-        raise ValueError(
-            "The PDF appears to contain no extractable text. "
-            "It may be a scanned document or consist only of images."
-        )

-    return "\n".join(pages)


-def clean_text(text: str) -> str:
-    """Normalize whitespace and remove common PDF artefacts.

-
-    characters that often appear in academic PDFs.
     """
-
-
-    # Re-join hyphenated line breaks (e.g. "summa-\nrization" -> "summarization")
-    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
-    # Collapse multiple blank lines into one
-    text = re.sub(r"\n{3,}", "\n\n", text)
-    # Collapse multiple spaces
-    text = re.sub(r"[ \t]{2,}", " ", text)
-    return text.strip()
-
-
-# ===========================================================================
-# Title extraction heuristic
-# ===========================================================================
-
-def extract_title(text: str) -> str:
-    """Attempt to extract the paper title from the first few lines.
-
-    Academic papers typically place the title in the first 1-5 lines before the
-    author block. We use a simple heuristic: the longest line among the first
-    few non-empty lines that is not all-caps (which would be a header like
-    "ABSTRACT") and does not look like an author list.
-    """
-    lines = [ln.strip() for ln in text.split("\n") if ln.strip()][:12]
-
-    candidates: list[str] = []
-    for line in lines:
-        # Skip very short lines (page numbers, dates, etc.)
-        if len(line) < 10:
-            continue
-        # Skip lines that are likely author names / affiliations (contain '@')
-        if "@" in line:
-            continue
-        # Skip lines that are section headers (all uppercase, short)
-        if line.isupper() and len(line) < 60:
-            continue
-        # Skip lines that look like emails or URLs
-        if re.search(r"https?://|www\.", line):
-            continue
-        candidates.append(line)
-
-    if not candidates:
-        return "Untitled Paper"
-
-    # Return the first substantial candidate (titles usually come first)
-    return candidates[0]
-
-
-# ===========================================================================
-# Chunking and summarization
-# ===========================================================================
-
-def chunk_text(text: str, max_words: int = CHUNK_WORD_LIMIT) -> list[str]:
-    """Split text into chunks of approximately *max_words* words.
-
-    Splitting is done on paragraph boundaries where possible so that chunks
-    remain coherent. If a single paragraph exceeds the limit it is split on
-    sentence boundaries instead.
     """
-
-
-
-
-
-
-
-
-        # If adding this paragraph would exceed the limit, finalize the chunk.
-        if current_word_count + para_words > max_words and current_chunk:
-            chunks.append("\n\n".join(current_chunk))
-            current_chunk = []
-            current_word_count = 0
-
-        # Handle paragraphs that are themselves larger than the limit.
-        if para_words > max_words:
-            sentences = re.split(r"(?<=[.!?])\s+", para)
-            for sentence in sentences:
-                s_words = len(sentence.split())
-                if current_word_count + s_words > max_words and current_chunk:
-                    chunks.append("\n\n".join(current_chunk))
-                    current_chunk = []
-                    current_word_count = 0
-                current_chunk.append(sentence)
-                current_word_count += s_words
-        else:
-            current_chunk.append(para)
-            current_word_count += para_words

-
-

-    return chunks


-def summarize_text(text: str) -> str:
-    """Summarize a single chunk of text using the BART model via Inference API.

-
-
     """
-
-
-
-

-    max_len = min(SUMMARY_MAX_LENGTH, max(50, word_count // 2))
-    min_len = min(SUMMARY_MIN_LENGTH, max_len - 10)

-
-
-
-
-
-
-
-
-        )
-        return result.summary_text
-    except Exception as e:
-        logger.warning("Summarization failed: %s", e)
-        # Fallback: return truncated text
-        return " ".join(text.split()[:100]) + "..."


-
-    ""

-    Strategy:
-    1. Split the document into manageable chunks.
-    2. Summarize each chunk individually.
-    3. If multiple chunk summaries exist, combine them and run a second-pass
-       summarization to produce a coherent final summary.
-    """
-    chunks = chunk_text(text)
-    logger.info("Document split into %d chunk(s) for summarization.", len(chunks))

-

-    if
-

-    #
-
-

-
-

-
-

-
-
-
-
-                "max_length": max_len,
-                "min_length": min_len,
-                "do_sample": False,
-            },
         )
-        return result.summary_text
-    except Exception as e:
-        logger.warning("Combined summarization failed: %s", e)
-        return combined


-# ===========================================================================
-# Section extraction helpers
-# ===========================================================================

-
-

-
-
     """
-
-        rf"(?:^|\n)\s*(?:\d+[\.\)]?\s*)?{heading_pattern}\s*\n(.*?)(?=\n\s*(?:\d+[\.\)]?\s*)?[A-Z][A-Za-z ]+\s*\n|\Z)",
-        re.DOTALL | re.IGNORECASE,
-    )
-    match = pattern.search(text)
-    if match:
-        content = match.group(1).strip()
-        if len(content) > 30:
-            return content
-    return fallback
-
-
-def extract_key_findings(text: str) -> str:
-    """Try to extract key findings from Results / Conclusion sections, or
-    fall back to summarizing the last portion of the paper."""
-    for heading in [
-        r"(?:key\s+)?findings",
-        r"results?\s*(?:and\s+discussion)?",
-        r"conclusions?\s*(?:and\s+future\s+work)?",
-        r"discussion",
-    ]:
-        content = extract_section(text, heading)
-        if content:
-            return summarize_text(content[:3000])
-    # Fallback: summarize the last quarter of the document.
-    words = text.split()
-    tail = " ".join(words[-(len(words) // 4):])
-    if len(tail.split()) > 50:
-        return summarize_text(tail[:3000])
-    return "Key findings could not be automatically extracted."
-
-
-def extract_methodology(text: str) -> str:
-    """Try to extract methodology information from the paper."""
-    for heading in [
-        r"method(?:ology|s)?",
-        r"approach",
-        r"experimental\s+setup",
-        r"materials?\s+and\s+methods",
-        r"(?:proposed\s+)?(?:framework|system|model|architecture)",
-    ]:
-        content = extract_section(text, heading)
-        if content:
-            return summarize_text(content[:3000])
-    return "Methodology section could not be automatically extracted."
-
-
-# ===========================================================================
-# Main processing function
-# ===========================================================================
-
-def process_paper(
-    pdf_file: Optional[str],
-    pasted_text: Optional[str],
-) -> str:
-    """Process a research paper and return a structured summary.

-
-
     """
     # ------------------------------------------------------------------
-    #
     # ------------------------------------------------------------------
     if pdf_file is not None:
-
-        try:
-            raw_text = extract_text_from_pdf(pdf_file)
-        except ValueError as exc:
-            return f"**Error:** {exc}"
-    elif pasted_text and pasted_text.strip():
-        raw_text = pasted_text.strip()
-    else:
-        return (
-            "**Error:** Please upload a PDF file or paste the paper text. "
-            "Both inputs are currently empty."
-        )
-
-    text = clean_text(raw_text)
-    original_word_count = len(text.split())

-    if
-
-
-    )

-

     # ------------------------------------------------------------------
-    # 2.
     # ------------------------------------------------------------------
-
-
-
-    methodology = extract_methodology(text)

-

     # ------------------------------------------------------------------
-    # 3.
     # ------------------------------------------------------------------
-

----

-#
-

----

-
-{key_findings}

----

-#
-

--

-### Statistics
-| Metric | Value |
-|---|---|
-| Original length | {original_word_count:,} words |
-| Summary length | {summary_word_count:,} words |
-| Compression ratio | {original_word_count / max(summary_word_count, 1):.1f}x |
-"""
-    return output


-# ===========================================================================
-# Example inputs for the Gradio demo
-# ===========================================================================

-

-Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

-
-

-
-

-
-The Transformer follows an encoder-decoder structure using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder. The encoder maps an input sequence of symbol representations to a sequence of continuous representations. Given z, the decoder then generates an output sequence of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next. The Transformer uses multi-head attention to allow the model to jointly attend to information from different representation subspaces at different positions.

-Results
-On the WMT 2014 English-to-German translation task, the big transformer model outperforms the best previously reported models including ensembles by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.

-
-


-# =========================================================================
 # Gradio interface
-# =========================================================================

 def build_interface() -> gr.Blocks:
     """Construct and return the Gradio Blocks interface."""

-    with gr.Blocks(
-        title="Paper Summarizer",
-        theme=gr.themes.Soft(
-            primary_hue="indigo",
-            secondary_hue="blue",
-        ),
-        css="""
-        .header-text { text-align: center; margin-bottom: 0.5em; }
-        .subheader { text-align: center; color: #6b7280; margin-top: 0; }
-        footer { display: none !important; }
-        """,
-    ) as demo:
-        # --- Header ---
         gr.Markdown(
-            ""
-
-
-
-
-            </p>
-            """,
         )

-        with gr.Row():
-            # --- Input column ---
             with gr.Column(scale=1):
-                gr.
-
-
                     file_types=[".pdf"],
                     type="filepath",
                 )
-                text_input = gr.Textbox(
-                    label="Or paste paper text",
-                    placeholder="Paste the full text of a research paper here...",
-                    lines=12,
-                    max_lines=30,
-                )
-                submit_btn = gr.Button("Summarize", variant="primary", size="lg")
-
-            # --- Output column ---
             with gr.Column(scale=1):
-                gr.
-
-
-
                 )

-
-
-        gr.
-
-
-
-
-
-
-
                 )

-
-
-
-
-
-            [`facebook/bart-large-cnn`](https://huggingface.co/facebook/bart-large-cnn)
-            to generate abstractive summaries of academic papers.
-
-            **Features**
-            - PDF upload with automatic text extraction (PyMuPDF)
-            - Intelligent chunking for papers of any length
-            - Structured output: title, key findings, methodology, and concise summary
-            - Word-count statistics and compression ratio
-
-            **Limitations**
-            - Scanned PDFs (image-only) are not supported; the PDF must contain selectable text.
-            - Summarization quality depends on the input text quality and structure.
-            - Running on free CPU tier; very long papers may take a minute to process.
-
-            Built by [Lorenzo Scaturchio](https://huggingface.co/gr8monk3ys).
-            """
-        )
-
-        # --- Event binding ---
-        submit_btn.click(
-            fn=process_paper,
-            inputs=[pdf_input, text_input],
-            outputs=output,
         )

     return demo


-#
 # Entry point
-#
-
 if __name__ == "__main__":
     app = build_interface()
     app.launch()
 """
+Resume Analyzer - AI-Powered Resume Analysis Against Job Descriptions

+This Gradio application uses NLP techniques to evaluate how well a resume
+matches a given job description. It provides semantic similarity scoring,
+keyword extraction, gap analysis, and actionable improvement suggestions.

 Author: Lorenzo Scaturchio (gr8monk3ys)
 License: MIT
 """

 import re
 import logging
 from typing import Optional

 import fitz  # PyMuPDF
 import gradio as gr
+import numpy as np
+from sentence_transformers import SentenceTransformer
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.metrics.pairwise import cosine_similarity

 # ---------------------------------------------------------------------------
 # Logging
 # ---------------------------------------------------------------------------
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
 logger = logging.getLogger(__name__)

 # ---------------------------------------------------------------------------
 # Constants
 # ---------------------------------------------------------------------------
+MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
+
+RESUME_SECTIONS = {
+    "experience": [
+        "experience", "work experience", "professional experience",
+        "employment history", "work history", "career history",
+    ],
+    "education": [
+        "education", "academic background", "academic history",
+        "qualifications", "certifications", "degrees",
+    ],
+    "skills": [
+        "skills", "technical skills", "core competencies",
+        "competencies", "proficiencies", "technologies", "tools",
+    ],
+    "projects": [
+        "projects", "personal projects", "portfolio",
+        "key projects", "selected projects",
+    ],
+    "summary": [
+        "summary", "professional summary", "profile",
+        "objective", "career objective", "about me", "overview",
+    ],
+}

 # ---------------------------------------------------------------------------
+# Example data shipped with the Space
 # ---------------------------------------------------------------------------
+EXAMPLE_RESUME = """LORENZO SCATURCHIO
+San Francisco, CA | lorenzo@email.com | linkedin.com/in/lorenzo | github.com/gr8monk3ys
+
+PROFESSIONAL SUMMARY
+Results-driven Machine Learning Engineer with 4+ years of experience designing,
+building, and deploying production ML systems. Skilled in deep learning, NLP,
+and scalable data pipelines. Passionate about applying AI to solve real-world
+problems and delivering measurable business impact.
+
+EXPERIENCE
+
+Senior Machine Learning Engineer | Acme AI Corp | Jan 2022 - Present
+- Designed and deployed a real-time recommendation engine serving 2M+ daily
+  active users, improving click-through rate by 18%.
+- Built an end-to-end NLP pipeline for document classification using
+  Transformers (BERT, RoBERTa) with 94% F1 score.
+- Led migration of model training infrastructure to Kubernetes, reducing
+  training time by 40% and cloud costs by 25%.
+- Mentored a team of 3 junior engineers on MLOps best practices.
+
+Machine Learning Engineer | DataWave Inc | Jun 2020 - Dec 2021
+- Developed a customer churn prediction model (XGBoost, LightGBM) that
+  reduced churn by 12%, saving $1.2M annually.
+- Implemented A/B testing framework for ML model evaluation in production.
+- Created automated data quality checks and feature engineering pipelines
+  using Apache Spark and Airflow.
+
+Data Science Intern | StartUp Labs | May 2019 - Aug 2019
+- Conducted exploratory data analysis on 500K+ records to identify key
+  business drivers.
+- Built sentiment analysis prototype using spaCy and scikit-learn.
+
+EDUCATION
+M.S. Computer Science (Machine Learning Specialization) | Stanford University | 2020
+B.S. Computer Science | UC Berkeley | 2018
+
+SKILLS
+Languages: Python, SQL, Java, Scala, R
+ML/DL Frameworks: PyTorch, TensorFlow, scikit-learn, Hugging Face Transformers
+MLOps: Docker, Kubernetes, MLflow, Weights & Biases, Airflow
+Cloud: AWS (SageMaker, EC2, S3), GCP (Vertex AI, BigQuery)
+Data: Spark, Pandas, NumPy, PostgreSQL, Redis
+Other: Git, CI/CD, REST APIs, Agile/Scrum
+
+PROJECTS
+- Open-source NLP toolkit for resume analysis (this project!)
+- Real-time object detection system using YOLOv5 on edge devices
+- Kaggle competition top-5% finish in Tabular Playground Series
+"""

+EXAMPLE_JOB_DESCRIPTION = """Senior Machine Learning Engineer
+
+About the Role
+We are looking for a Senior Machine Learning Engineer to join our AI Platform
+team. You will design, build, and maintain production ML systems that power
+our core product features. This is a high-impact role where you will work
+closely with product, engineering, and data science teams.
+
+Responsibilities
+- Design and implement scalable ML pipelines for training and inference.
+- Develop and deploy deep learning models for NLP and computer vision tasks.
+- Build robust monitoring and alerting for model performance in production.
+- Collaborate with cross-functional teams to translate business requirements
+  into ML solutions.
+- Mentor junior engineers and contribute to engineering best practices.
+- Drive adoption of MLOps tools and processes across the organization.
+
+Requirements
+- 3+ years of experience in machine learning engineering or a related role.
+- Strong proficiency in Python and SQL.
+- Hands-on experience with ML frameworks such as PyTorch or TensorFlow.
+- Experience deploying models to production using Docker and Kubernetes.
+- Familiarity with cloud platforms (AWS or GCP).
+- Solid understanding of NLP techniques (transformers, embeddings, etc.).
+- Experience with data processing frameworks like Spark or similar.
+- Strong communication skills and ability to work in an Agile environment.
+
+Nice to Have
+- Experience with recommendation systems or search ranking.
+- Contributions to open-source ML projects.
+- M.S. or Ph.D. in Computer Science, Machine Learning, or related field.
+- Experience with MLflow, Weights & Biases, or similar experiment tracking.
+- Familiarity with CI/CD pipelines for ML.
+"""

+# ---------------------------------------------------------------------------
+# Model loading (cached at module level for the HF Space)
+# ---------------------------------------------------------------------------
+logger.info("Loading sentence-transformer model: %s", MODEL_NAME)
+_model = SentenceTransformer(MODEL_NAME)
+logger.info("Model loaded successfully.")

+# =========================================================================
+# Core analysis utilities
+# =========================================================================

+def extract_text_from_pdf(pdf_path: str) -> str:
+    """Extract plain text from a PDF file using PyMuPDF."""
     try:
         doc = fitz.open(pdf_path)
+        pages = [page.get_text() for page in doc]
+        doc.close()
+        text = "\n".join(pages).strip()
+        if not text:
+            raise ValueError("The PDF appears to contain no extractable text.")
+        return text
     except Exception as exc:
+        logger.error("PDF extraction failed: %s", exc)
+        raise gr.Error(f"Could not read the PDF file: {exc}") from exc


+def compute_semantic_similarity(text_a: str, text_b: str) -> float:
+    """Return cosine similarity (0-1) between two texts using the sentence-transformer."""
+    embeddings = _model.encode([text_a, text_b], convert_to_numpy=True)
+    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
+    return float(np.clip(similarity, 0.0, 1.0))


+def extract_keywords(texts: list[str], top_n: int = 30) -> list[list[str]]:
     """
+    Use TF-IDF to extract the most important keywords from each document
+    in *texts*. Returns a list of keyword-lists, one per input document.
     """
+    vectorizer = TfidfVectorizer(
+        stop_words="english",
+        max_features=500,
+        ngram_range=(1, 2),
+        token_pattern=r"(?u)\b[a-zA-Z][a-zA-Z+#.\-]{1,}\b",
+    )
+    tfidf_matrix = vectorizer.fit_transform(texts)
+    feature_names = np.array(vectorizer.get_feature_names_out())

+    results: list[list[str]] = []
+    for row_idx in range(tfidf_matrix.shape[0]):
+        row = tfidf_matrix[row_idx].toarray().flatten()
+        top_indices = row.argsort()[-top_n:][::-1]
+        keywords = [feature_names[i] for i in top_indices if row[i] > 0]
+        results.append(keywords)
+    return results


+def _normalize(text: str) -> str:
+    """Lowercase and strip extra whitespace."""
+    return re.sub(r"\s+", " ", text.lower()).strip()


+def find_matching_and_missing_keywords(
+    resume_text: str, job_keywords: list[str]
+) -> tuple[list[str], list[str]]:
     """
+    Compare job-description keywords against the resume body.
+    Returns (matched, missing) keyword lists.
+    """
+    resume_lower = _normalize(resume_text)
+    matched, missing = [], []
+    for kw in job_keywords:
+        if _normalize(kw) in resume_lower:
+            matched.append(kw)
+        else:
+            missing.append(kw)
+    return matched, missing


+def detect_sections(resume_text: str) -> dict[str, str]:
+    """
+    Heuristically split a resume into named sections.
+    Returns a dict mapping canonical section names to their content.
+    """
+    lines = resume_text.split("\n")
+    current_section: Optional[str] = None
+    sections: dict[str, list[str]] = {}

+    for line in lines:
+        stripped = line.strip()
+        matched_section = _match_section_header(stripped)
+        if matched_section:
+            current_section = matched_section
+            sections.setdefault(current_section, [])
+        elif current_section:
+            sections[current_section].append(stripped)
+        else:
+            sections.setdefault("other", []).append(stripped)
+
+    return {name: "\n".join(content).strip() for name, content in sections.items()}
+
+
+def _match_section_header(line: str) -> Optional[str]:
+    """Return a canonical section name if *line* looks like a header, else None."""
+    cleaned = re.sub(r"[^a-z\s]", "", line.lower()).strip()
+    if not cleaned or len(cleaned) > 40:
+        return None
+    for canonical, variants in RESUME_SECTIONS.items():
+        if cleaned in variants:
+            return canonical
+    return None
+
+
+def analyze_section(
+    section_name: str, section_text: str, job_text: str
+) -> dict[str, object]:
+    """Compute a per-section relevance score and short commentary."""
+    if not section_text.strip():
+        return {
+            "score": 0.0,
+            "comment": f"No {section_name} section detected in the resume.",
+        }
+    score = compute_semantic_similarity(section_text, job_text)
+    pct = round(score * 100, 1)
+
+    if pct >= 70:
+        quality = "Strong"
+    elif pct >= 45:
+        quality = "Moderate"
+    else:
+        quality = "Weak"

+    comment = f"{quality} alignment ({pct}%) with the job description."
+    return {"score": score, "comment": comment}


+def generate_suggestions(
+    missing_keywords: list[str],
+    section_scores: dict[str, dict],
+    overall_score: float,
+) -> list[str]:
+    """Return a list of actionable improvement suggestions."""
+    suggestions: list[str] = []

+    if overall_score < 0.45:
+        suggestions.append(
+            "Your resume has low overall alignment with this job description. "
+            "Consider tailoring it more directly to the role's requirements."
+        )

+    # Missing keywords
+    if missing_keywords:
+        top_missing = missing_keywords[:10]
+        suggestions.append(
+            "Add these high-value keywords or phrases where truthfully applicable: "
+            + ", ".join(f'"{kw}"' for kw in top_missing)
+            + "."
+        )

+    # Section-specific advice
+    for name, data in section_scores.items():
+        score = data["score"]
+        if name == "experience" and score < 0.5:
+            suggestions.append(
+                "Strengthen your Experience section by using action verbs and "
+                "quantifiable achievements that mirror the job requirements."
+            )
+        if name == "skills" and score < 0.5:
+            suggestions.append(
+                "Your Skills section could be improved. List specific tools, "
+                "frameworks, and technologies mentioned in the job posting."
+            )
+        if name == "education" and score < 0.3:
+            suggestions.append(
+                "Consider highlighting relevant coursework, certifications, or "
+                "academic projects in your Education section."
+            )

+    if "summary" not in section_scores or not section_scores.get("summary", {}).get("score", 0):
+        suggestions.append(
+            "Add a Professional Summary at the top of your resume that directly "
+            "addresses the key requirements of this role."
+        )

+    if not suggestions:
+        suggestions.append(
+            "Your resume is well-aligned with this job description. "
+            "Keep refining with specific metrics and results."
         )

+    return suggestions

+# =========================================================================
+# Main analysis orchestrator
+# =========================================================================

+def run_analysis(
+    resume_text: str,
+    job_description: str,
+    pdf_file: Optional[str] = None,
+) -> tuple[str, str, str, str]:
     """
+    Run the full resume analysis pipeline.

+    Returns a 4-tuple of Markdown strings:
+    (overview, keywords_report, section_report, suggestions_report)
     """
     # ------------------------------------------------------------------
+    # Input resolution
     # ------------------------------------------------------------------
     if pdf_file is not None:
+        resume_text = extract_text_from_pdf(pdf_file)

+    if not resume_text or not resume_text.strip():
+        raise gr.Error("Please provide resume text or upload a PDF.")
+    if not job_description or not job_description.strip():
+        raise gr.Error("Please provide a job description.")

+    # ------------------------------------------------------------------
+    # 1. Semantic similarity
+    # ------------------------------------------------------------------
+    raw_similarity = compute_semantic_similarity(resume_text, job_description)

     # ------------------------------------------------------------------
+    # 2. Keyword analysis
     # ------------------------------------------------------------------
+    keyword_lists = extract_keywords([resume_text, job_description], top_n=30)
+    resume_keywords, job_keywords = keyword_lists[0], keyword_lists[1]
+    matched_kw, missing_kw = find_matching_and_missing_keywords(resume_text, job_keywords)

+    keyword_overlap = len(matched_kw) / max(len(job_keywords), 1)

     # ------------------------------------------------------------------
+    # 3. Composite score (60% semantic + 40% keyword overlap)
     # ------------------------------------------------------------------
+    composite = 0.6 * raw_similarity + 0.4 * keyword_overlap
+    overall_pct = round(composite * 100, 1)

+    # ------------------------------------------------------------------
+    # 4. Section-by-section analysis
+    # ------------------------------------------------------------------
+    sections = detect_sections(resume_text)
+    section_scores: dict[str, dict] = {}
+    for sec_name in ["summary", "experience", "education", "skills", "projects"]:
+        sec_text = sections.get(sec_name, "")
+        section_scores[sec_name] = analyze_section(sec_name, sec_text, job_description)

+    # ------------------------------------------------------------------
+    # 5. Suggestions
+    # ------------------------------------------------------------------
+    suggestions = generate_suggestions(missing_kw, section_scores, composite)

+    # ------------------------------------------------------------------
+    # Format outputs as Markdown
+    # ------------------------------------------------------------------
+    overview_md = _format_overview(overall_pct, raw_similarity, keyword_overlap, matched_kw, missing_kw)
+    keywords_md = _format_keywords(resume_keywords, job_keywords, matched_kw, missing_kw)
+    sections_md = _format_sections(section_scores)
+    suggest_md = _format_suggestions(suggestions)

+    return overview_md, keywords_md, sections_md, suggest_md


+# =========================================================================
+# Markdown formatters
+# =========================================================================

+def _score_bar(pct: float, width: int = 20) -> str:
+    """Return a text-based progress bar for Markdown."""
+    filled = round(pct / 100 * width)
+    empty = width - filled
+    return f"`[{'=' * filled}{' ' * empty}]` **{pct}%**"

+def _format_overview(
+    overall_pct: float,
+    semantic_sim: float,
+    keyword_overlap: float,
+    matched: list[str],
+    missing: list[str],
+) -> str:
+    sem_pct = round(semantic_sim * 100, 1)
+    kw_pct = round(keyword_overlap * 100, 1)
+
+    if overall_pct >= 70:
+        verdict = "Excellent match - your resume aligns strongly with this role."
+    elif overall_pct >= 50:
+        verdict = "Good match - some targeted improvements could strengthen your application."
+    elif overall_pct >= 30:
+        verdict = "Partial match - significant tailoring is recommended."
+    else:
+        verdict = "Low match - consider whether this role fits your background or rewrite substantially."
+
+    return (
+        f"## Overall Match Score\n\n"
+        f"# {_score_bar(overall_pct)}\n\n"
+        f"**Verdict:** {verdict}\n\n"
+        f"---\n\n"
+        f"### Score Breakdown\n\n"
+        f"| Component | Score |\n"
+        f"|---|---|\n"
+        f"| Semantic Similarity | {sem_pct}% |\n"
+        f"| Keyword Overlap | {kw_pct}% |\n"
+        f"| **Composite (60/40)** | **{overall_pct}%** |\n\n"
+        f"---\n\n"
+        f"**Matched Keywords:** {len(matched)} | "
+        f"**Missing Keywords:** {len(missing)}\n"
+    )

+def _format_keywords(
+    resume_kw: list[str],
+    job_kw: list[str],
+    matched: list[str],
+    missing: list[str],
+) -> str:
+    matched_str = ", ".join(f"**{kw}**" for kw in matched) if matched else "_None detected_"
+    missing_str = ", ".join(f"~~{kw}~~" for kw in missing) if missing else "_None - great coverage!_"
+    resume_str = ", ".join(resume_kw[:20]) if resume_kw else "_None detected_"
+    job_str = ", ".join(job_kw[:20]) if job_kw else "_None detected_"
+
+    return (
+        f"## Keyword Analysis\n\n"
+        f"### Matched Keywords (found in your resume)\n{matched_str}\n\n"
+        f"---\n\n"
+        f"### Missing Keywords (consider adding)\n{missing_str}\n\n"
+        f"---\n\n"
+        f"### Top Resume Keywords (TF-IDF)\n{resume_str}\n\n"
+        f"### Top Job Description Keywords (TF-IDF)\n{job_str}\n"
+    )

+def _format_sections(section_scores: dict[str, dict]) -> str:
+    header = "## Section-by-Section Analysis\n\n"
+    rows = "| Section | Score | Assessment |\n|---|---|---|\n"
+    details = ""

+    for name, data in section_scores.items():
+        pct = round(data["score"] * 100, 1)
+        comment = data["comment"]
+        display_name = name.replace("_", " ").title()
+        rows += f"| {display_name} | {pct}% | {comment} |\n"
+        details += f"### {display_name}\n{comment}\n\n"

+    return header + rows + "\n---\n\n" + details


+def _format_suggestions(suggestions: list[str]) -> str:
+    items = "\n".join(f"{i}. {s}" for i, s in enumerate(suggestions, 1))
+    return (
+        f"## Improvement Suggestions\n\n"
+        f"{items}\n\n"
+        f"---\n\n"
+        f"*Tip: Tailor your resume for every application. Mirror the language "
+        f"used in the job posting while remaining truthful about your experience.*\n"
+    )


+# =========================================================================
 # Gradio interface
+# =========================================================================

 def build_interface() -> gr.Blocks:
     """Construct and return the Gradio Blocks interface."""
+    theme = gr.themes.Soft(
+        primary_hue="teal",
+        secondary_hue="green",
+        font=[gr.themes.GoogleFont("Inter"), "system-ui", "sans-serif"],
+    )

+    with gr.Blocks(theme=theme, title="Resume Analyzer") as demo:
         gr.Markdown(
+            "# Resume Analyzer\n"
+            "Evaluate how well your resume matches a job description using "
+            "**semantic similarity** and **keyword analysis**.\n\n"
+            "Paste your resume text (or upload a PDF) and the target job "
+            "description, then click **Analyze**."
         )

+        with gr.Row(equal_height=True):
             with gr.Column(scale=1):
+                resume_text = gr.Textbox(
+                    label="Resume Text",
+                    placeholder="Paste your resume here...",
+                    lines=18,
+                    value=EXAMPLE_RESUME,
+                )
+                pdf_upload = gr.File(
+                    label="Or Upload Resume PDF",
                     file_types=[".pdf"],
                     type="filepath",
                 )
             with gr.Column(scale=1):
+                job_desc = gr.Textbox(
+                    label="Job Description",
+                    placeholder="Paste the job description here...",
+                    lines=22,
+                    value=EXAMPLE_JOB_DESCRIPTION,
+                )

+        analyze_btn = gr.Button("Analyze", variant="primary", size="lg")
+
+        with gr.Tabs():
+            with gr.Tab("Overview"):
+                overview_output = gr.Markdown()
+            with gr.Tab("Keywords"):
+                keywords_output = gr.Markdown()
+            with gr.Tab("Sections"):
+                sections_output = gr.Markdown()
+            with gr.Tab("Suggestions"):
+                suggestions_output = gr.Markdown()
+
+        analyze_btn.click(
+            fn=run_analysis,
+            inputs=[resume_text, job_desc, pdf_upload],
+            outputs=[overview_output, keywords_output, sections_output, suggestions_output],
         )

+        gr.Markdown(
+            "---\n"
+            "Built by [Lorenzo Scaturchio](https://huggingface.co/gr8monk3ys) | "
+            "Model: `sentence-transformers/all-MiniLM-L6-v2` | "
+            "[Source Code](https://huggingface.co/spaces/gr8monk3ys/resume-analyzer-space)"
+        )

     return demo


+# ---------------------------------------------------------------------------
 # Entry point
+# ---------------------------------------------------------------------------
 if __name__ == "__main__":
     app = build_interface()
     app.launch()
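Since the new `run_analysis` takes plain strings and returns four Markdown reports, the pipeline can also be exercised without the Gradio UI. A small sketch, assuming the new file is saved as `app.py` with its dependencies installed (note that the MiniLM model is downloaded and loaded at import time):

```python
from app import run_analysis

# Returns (overview, keywords, sections, suggestions), each a Markdown string.
overview, keywords, sections, suggestions = run_analysis(
    resume_text="Machine Learning Engineer. Python, PyTorch, Docker, Kubernetes.",
    job_description="Seeking an ML engineer with Python, PyTorch, and MLOps experience.",
)
print(overview)  # composite score, verdict, and score breakdown
```
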
requirements.txt
CHANGED
@@ -1,3 +1,5 @@
-gradio==
-
-
+gradio==5.9.1
+sentence-transformers>=2.2.0
+scikit-learn>=1.0.0
+pymupdf>=1.24.0
+numpy>=1.20.0
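The updated pins cover the whole analysis stack: `gradio` for the UI, `sentence-transformers` for embeddings, `scikit-learn` for TF-IDF and cosine similarity, `pymupdf` for PDF parsing, and `numpy` for array math. A quick smoke test of the embedding core after `pip install -r requirements.txt` (a standalone sketch, not a file in the repo; the first run downloads the model):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(
    ["deployed ML models with Docker", "experience shipping models to production"],
    convert_to_numpy=True,
)
# Related phrases should score clearly higher than unrelated ones.
print(cosine_similarity(embeddings[:1], embeddings[1:])[0][0])
```
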