gr8monk3ys committed (verified)
Commit 98a7e55 · Parent: 019a064

Upload folder using huggingface_hub

Files changed (3):
  1. README.md +63 -42
  2. app.py +458 -414
  3. requirements.txt +5 -3
README.md CHANGED
@@ -1,81 +1,102 @@
  ---
- title: Paper Summarizer
- emoji: 📄
- colorFrom: blue
- colorTo: indigo
  sdk: gradio
- sdk_version: 4.44.0
  python_version: "3.10"
  app_file: app.py
  pinned: false
  license: mit
- short_description: Summarize academic research papers with AI
  ---

- # Paper Summarizer

- An AI-powered tool that transforms lengthy academic research papers into structured, digestible summaries. Built with Facebook's BART-Large-CNN model and deployed as a Gradio web application on HuggingFace Spaces.

  ## Features

- - **PDF Upload** -- Drop a research paper PDF and get an instant structured summary.
- - **Text Input** -- Paste raw paper text directly if you prefer.
- - **Structured Output** -- Every summary includes:
-   - Extracted paper title
-   - Concise abstract-length summary
-   - Key findings from the results/conclusion sections
-   - Methodology overview
-   - Word-count statistics with compression ratio
- - **Long Document Support** -- Papers of any length are automatically chunked and summarized in multiple passes, then combined into a coherent final summary.
- - **Clean PDF Processing** -- Handles hyphenated line breaks, control characters, and other common PDF artifacts.

- ## How It Works

- 1. **Text Extraction** -- PDFs are parsed with PyMuPDF (fitz) to extract selectable text from every page.
- 2. **Cleaning** -- Raw text is normalized: stray control characters are removed, hyphenated line breaks are rejoined, and excessive whitespace is collapsed.
- 3. **Chunking** -- The cleaned text is split into chunks of approximately 700 words, respecting paragraph and sentence boundaries so context is preserved.
- 4. **Summarization** -- Each chunk is passed through `facebook/bart-large-cnn` for abstractive summarization. If there are multiple chunks, the individual summaries are combined and summarized again for coherence.
- 5. **Section Extraction** -- Regex heuristics identify Results, Methodology, and Conclusion sections for targeted summarization of key findings and methods.

- ## Model

- This Space uses [`facebook/bart-large-cnn`](https://huggingface.co/facebook/bart-large-cnn), a BART model fine-tuned on the CNN/DailyMail summarization dataset. It runs on the free CPU tier and can process most papers in under a minute.

- ## Limitations

- - **Scanned PDFs** are not supported -- the PDF must contain selectable text (not images of text).
- - **Summarization quality** depends on the structure and clarity of the input text.
- - **Processing time** may be longer for very large papers due to CPU-only inference.

- ## Tech Stack

- | Component | Library |
  |---|---|
  | Web framework | Gradio 4.44 |
- | Summarization model | HuggingFace Transformers (BART-Large-CNN) |
  | PDF parsing | PyMuPDF (fitz) |
- | Inference backend | PyTorch (CPU) |

- ## Local Development

  ```bash
  # Clone the repository
- git clone https://huggingface.co/spaces/gr8monk3ys/paper-summarizer
- cd paper-summarizer

  # Install dependencies
  pip install -r requirements.txt

- # Run the application
  python app.py
  ```

  The app will be available at `http://localhost:7860`.

- ## License

- MIT

- ## Author

- Built by [Lorenzo Scaturchio](https://huggingface.co/gr8monk3ys).

  ---
+ title: Resume Analyzer
+ emoji: 📋
+ colorFrom: green
+ colorTo: blue
  sdk: gradio
+ sdk_version: 5.9.1
  python_version: "3.10"
  app_file: app.py
  pinned: false
  license: mit
+ short_description: AI-powered resume analysis against job descriptions
  ---

+ # Resume Analyzer

+ An AI-powered tool that analyzes how well your resume matches a target job description. Built with Gradio, sentence-transformers, and scikit-learn.

  ## Features

+ ### Semantic Similarity Scoring
+ Uses the `sentence-transformers/all-MiniLM-L6-v2` model to compute deep semantic similarity between your resume and the job description. This goes beyond simple keyword matching to understand the meaning and context of your experience relative to what the role demands.
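For reference, a minimal sketch of this scoring step (it mirrors the `compute_semantic_similarity` helper in the committed `app.py`; the `semantic_score` wrapper and the sample strings are illustrative only):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Same lightweight embedding model the Space pins in its constants.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def semantic_score(resume: str, job: str) -> float:
    """Embed both documents and return their cosine similarity (roughly 0-1)."""
    emb = model.encode([resume, job], convert_to_numpy=True)
    return float(cosine_similarity([emb[0]], [emb[1]])[0][0])

print(semantic_score("ML engineer: PyTorch, NLP, Kubernetes",
                     "Hiring a machine learning engineer with NLP experience"))
```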

+ ### TF-IDF Keyword Extraction
+ Extracts the most important keywords and phrases from both documents using Term Frequency-Inverse Document Frequency (TF-IDF) with unigrams and bigrams. This surfaces the specific terms that carry the most weight in each document.
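A sketch of that extraction step with scikit-learn's `TfidfVectorizer` (the `top_keywords` helper and its inputs are illustrative; the committed `extract_keywords` in `app.py` follows the same pattern):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(texts: list[str], top_n: int = 10) -> list[list[str]]:
    """Return the top_n highest-weighted unigrams/bigrams for each document."""
    vec = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
    matrix = vec.fit_transform(texts)
    terms = np.array(vec.get_feature_names_out())
    ranked = []
    for row in matrix.toarray():
        order = row.argsort()[::-1][:top_n]
        ranked.append([terms[i] for i in order if row[i] > 0])
    return ranked

resume_kw, job_kw = top_keywords(
    ["Built NLP pipelines in Python with PyTorch and Airflow",
     "Seeking an engineer experienced with Python, NLP, and data pipelines"]
)
```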

+ ### Keyword Gap Analysis
+ Compares the job description's top keywords against your resume content to identify:
+ - **Matched keywords** -- terms the job requires that your resume already contains.
+ - **Missing keywords** -- high-value terms you should consider adding (where truthfully applicable).
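The gap check itself reduces to a normalized substring test, as in this sketch (mirroring `find_matching_and_missing_keywords` in the committed `app.py`; the sample inputs are made up):

```python
def keyword_gap(resume: str, job_keywords: list[str]) -> tuple[list[str], list[str]]:
    """Split the job's keywords into those found in the resume and those missing."""
    haystack = " ".join(resume.lower().split())
    matched = [kw for kw in job_keywords if kw.lower() in haystack]
    missing = [kw for kw in job_keywords if kw.lower() not in haystack]
    return matched, missing

matched, missing = keyword_gap(
    "Python and Kubernetes experience", ["python", "kubernetes", "terraform"]
)
# matched == ["python", "kubernetes"], missing == ["terraform"]
```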
 
+ ### Section-by-Section Analysis
+ Automatically detects standard resume sections (Summary, Experience, Education, Skills, Projects) and scores each one independently against the job description. This pinpoints exactly which parts of your resume need the most attention.
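Section detection is a regex heuristic: a short line counts as a header when, lowercased and stripped of punctuation, it matches a known variant. A trimmed-down sketch (the full variant lists live in `RESUME_SECTIONS` in the committed `app.py`):

```python
import re
from typing import Optional

HEADERS = {
    "experience": {"experience", "work experience", "professional experience"},
    "skills": {"skills", "technical skills"},
    "education": {"education"},
}

def match_header(line: str) -> Optional[str]:
    """Map a resume line to a canonical section name, or None if it is body text."""
    cleaned = re.sub(r"[^a-z\s]", "", line.lower()).strip()
    if not cleaned or len(cleaned) > 40:
        return None
    for canonical, variants in HEADERS.items():
        if cleaned in variants:
            return canonical
    return None

print(match_header("EXPERIENCE"))        # -> "experience"
print(match_header("Led a team of 3"))   # -> None
```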

+ ### Composite Match Score
+ A weighted composite score (60% semantic similarity, 40% keyword overlap) that gives a single 0-100% indicator of overall fit.
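Concretely, a semantic similarity of 0.72 with 18 of 30 job keywords matched (overlap 0.60) gives 0.6 × 0.72 + 0.4 × 0.60 = 0.672, reported as 67.2%. A one-function sketch of that weighting (same 60/40 split as the committed `run_analysis`):

```python
def composite_score(semantic_sim: float, keyword_overlap: float) -> float:
    """Blend the two signals: 60% semantic similarity, 40% keyword overlap (as a percentage)."""
    return round((0.6 * semantic_sim + 0.4 * keyword_overlap) * 100, 1)

print(composite_score(0.72, 18 / 30))  # 67.2
```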

+ ### Actionable Suggestions
+ Generates specific, prioritized recommendations for improving your resume's alignment with the target role.

+ ### PDF Upload Support
+ Upload your resume as a PDF file instead of pasting text. The app extracts text from the PDF automatically using PyMuPDF.
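Extraction is a thin wrapper over PyMuPDF, roughly as below (a sketch of the approach; only PDFs with selectable text work, so scanned images raise an error):

```python
import fitz  # PyMuPDF

def pdf_to_text(path: str) -> str:
    """Concatenate the selectable text of every page in the PDF."""
    with fitz.open(path) as doc:
        text = "\n".join(page.get_text() for page in doc)
    if not text.strip():
        raise ValueError("No extractable text found (the PDF may be a scanned image).")
    return text
```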
 

+ ## How to Use

+ 1. **Paste your resume** into the left text area, or upload a PDF using the file upload widget.
+ 2. **Paste the job description** into the right text area.
+ 3. Click **Analyze**.
+ 4. Browse the results across four tabs:
+    - **Overview** -- composite score, breakdown, and verdict.
+    - **Keywords** -- matched and missing keywords with TF-IDF rankings.
+    - **Sections** -- per-section scores and assessments.
+    - **Suggestions** -- numbered, actionable improvement recommendations.

+ Example data is pre-loaded so you can click Analyze immediately to see the tool in action.

+ ## Technical Architecture

+ ```
+ Resume Text / PDF ──┐
+                     ├──▶ Semantic Embedding (MiniLM-L6-v2) ──▶ Cosine Similarity
+ Job Description ────┘
+                     ├──▶ TF-IDF Vectorization ──▶ Keyword Extraction & Matching
+                     └──▶ Section Detection (regex heuristics) ──▶ Per-section Scoring
+ ```

+ | Component | Technology |
  |---|---|
  | Web framework | Gradio 4.44 |
+ | Semantic model | sentence-transformers/all-MiniLM-L6-v2 |
+ | Keyword extraction | scikit-learn TfidfVectorizer |
  | PDF parsing | PyMuPDF (fitz) |
+ | Numerical compute | NumPy |

+ ## Running Locally

  ```bash
  # Clone the repository
+ git clone https://huggingface.co/spaces/gr8monk3ys/resume-analyzer-space
+ cd resume-analyzer-space

  # Install dependencies
  pip install -r requirements.txt

+ # Launch the app
  python app.py
  ```

  The app will be available at `http://localhost:7860`.

+ ## Project Structure

+ ```
+ resume-analyzer-space/
+ ├── app.py            # Application source (Gradio interface + analysis logic)
+ ├── requirements.txt  # Python dependencies
+ └── README.md         # This file (includes HF Space metadata)
+ ```

+ ## License

+ MIT
app.py CHANGED
@@ -1,535 +1,579 @@
1
  """
2
- Paper Summarizer - A Gradio-based web application for summarizing academic research papers.
3
- Version: 2.0.0 (Gradio 5.x compatible)
4
 
5
- This application uses Facebook's BART-Large-CNN model to generate structured summaries
6
- of academic papers. It supports both PDF uploads and pasted text input, handles long
7
- documents through intelligent chunking, and produces summaries with extracted titles,
8
- key findings, methodology notes, and concise abstracts.
9
 
10
  Author: Lorenzo Scaturchio (gr8monk3ys)
11
  License: MIT
12
  """
13
 
14
- import os
15
  import re
16
  import logging
17
  from typing import Optional
18
 
19
  import fitz # PyMuPDF
20
  import gradio as gr
21
- from huggingface_hub import InferenceClient
 
 
 
22
 
23
  # ---------------------------------------------------------------------------
24
  # Logging
25
  # ---------------------------------------------------------------------------
26
- logging.basicConfig(
27
- level=logging.INFO,
28
- format="%(asctime)s [%(levelname)s] %(message)s",
29
- )
30
  logger = logging.getLogger(__name__)
31
 
32
  # ---------------------------------------------------------------------------
33
  # Constants
34
  # ---------------------------------------------------------------------------
35
- MODEL_NAME = "facebook/bart-large-cnn"
36
- # BART-Large-CNN accepts up to 1024 tokens (~750 words). We chunk by words to
37
- # stay safely within that window while leaving room for special tokens.
38
- CHUNK_WORD_LIMIT = 700
39
- SUMMARY_MIN_LENGTH = 40
40
- SUMMARY_MAX_LENGTH = 180
41
- COMBINE_SUMMARY_MAX_LENGTH = 300
42
 
43
  # ---------------------------------------------------------------------------
44
- # Use HuggingFace Inference API (no local model loading - saves memory)
45
  # ---------------------------------------------------------------------------
46
- logger.info("Initializing HuggingFace Inference Client for: %s", MODEL_NAME)
47
- client = InferenceClient(model=MODEL_NAME)
48
- logger.info("Inference client ready.")
49
-
 
50
 
51
- # ===========================================================================
52
- # Text extraction helpers
53
- # ===========================================================================
 
 
54
 
55
- def extract_text_from_pdf(pdf_path: str) -> str:
56
- """Extract all text content from a PDF file using PyMuPDF.
 
 
 
 
57
 
58
- Args:
59
- pdf_path: Path to the uploaded PDF file.
60
 
61
- Returns:
62
- The concatenated text of every page, separated by newlines.
 
63
 
64
- Raises:
65
- ValueError: If the PDF contains no extractable text.
66
- """
67
  try:
68
  doc = fitz.open(pdf_path)
 
 
 
 
 
 
69
  except Exception as exc:
70
- raise ValueError(
71
- f"Could not open the PDF file. It may be corrupted or password-protected. "
72
- f"Details: {exc}"
73
- ) from exc
74
-
75
- pages: list[str] = []
76
- for page_num, page in enumerate(doc):
77
- text = page.get_text("text")
78
- if text.strip():
79
- pages.append(text)
80
- logger.debug("Page %d: extracted %d characters", page_num + 1, len(text))
81
-
82
- doc.close()
83
-
84
- if not pages:
85
- raise ValueError(
86
- "The PDF appears to contain no extractable text. "
87
- "It may be a scanned document or consist only of images."
88
- )
89
 
90
- return "\n".join(pages)
91
 
 
 
 
 
 
92
 
93
- def clean_text(text: str) -> str:
94
- """Normalize whitespace and remove common PDF artefacts.
95
 
96
- Handles excessive newlines, hyphenated line-breaks, and stray control
97
- characters that often appear in academic PDFs.
98
  """
99
- # Remove form-feed and other control characters (keep newlines & tabs)
100
- text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
101
- # Re-join hyphenated line breaks (e.g. "summa-\nrization" -> "summarization")
102
- text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
103
- # Collapse multiple blank lines into one
104
- text = re.sub(r"\n{3,}", "\n\n", text)
105
- # Collapse multiple spaces
106
- text = re.sub(r"[ \t]{2,}", " ", text)
107
- return text.strip()
108
-
109
-
110
- # ===========================================================================
111
- # Title extraction heuristic
112
- # ===========================================================================
113
-
114
- def extract_title(text: str) -> str:
115
- """Attempt to extract the paper title from the first few lines.
116
-
117
- Academic papers typically place the title in the first 1-5 lines before the
118
- author block. We use a simple heuristic: the longest line among the first
119
- few non-empty lines that is not all-caps (which would be a header like
120
- "ABSTRACT") and does not look like an author list.
121
- """
122
- lines = [ln.strip() for ln in text.split("\n") if ln.strip()][:12]
123
-
124
- candidates: list[str] = []
125
- for line in lines:
126
- # Skip very short lines (page numbers, dates, etc.)
127
- if len(line) < 10:
128
- continue
129
- # Skip lines that are likely author names / affiliations (contain '@')
130
- if "@" in line:
131
- continue
132
- # Skip lines that are section headers (all uppercase, short)
133
- if line.isupper() and len(line) < 60:
134
- continue
135
- # Skip lines that look like emails or URLs
136
- if re.search(r"https?://|www\.", line):
137
- continue
138
- candidates.append(line)
139
-
140
- if not candidates:
141
- return "Untitled Paper"
142
-
143
- # Return the first substantial candidate (titles usually come first)
144
- return candidates[0]
145
-
146
-
147
- # ===========================================================================
148
- # Chunking and summarization
149
- # ===========================================================================
150
-
151
- def chunk_text(text: str, max_words: int = CHUNK_WORD_LIMIT) -> list[str]:
152
- """Split text into chunks of approximately *max_words* words.
153
-
154
- Splitting is done on paragraph boundaries where possible so that chunks
155
- remain coherent. If a single paragraph exceeds the limit it is split on
156
- sentence boundaries instead.
157
  """
158
- paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
159
- chunks: list[str] = []
160
- current_chunk: list[str] = []
161
- current_word_count = 0
162
-
163
- for para in paragraphs:
164
- para_words = len(para.split())
165
-
166
- # If adding this paragraph would exceed the limit, finalize the chunk.
167
- if current_word_count + para_words > max_words and current_chunk:
168
- chunks.append("\n\n".join(current_chunk))
169
- current_chunk = []
170
- current_word_count = 0
171
-
172
- # Handle paragraphs that are themselves larger than the limit.
173
- if para_words > max_words:
174
- sentences = re.split(r"(?<=[.!?])\s+", para)
175
- for sentence in sentences:
176
- s_words = len(sentence.split())
177
- if current_word_count + s_words > max_words and current_chunk:
178
- chunks.append("\n\n".join(current_chunk))
179
- current_chunk = []
180
- current_word_count = 0
181
- current_chunk.append(sentence)
182
- current_word_count += s_words
183
- else:
184
- current_chunk.append(para)
185
- current_word_count += para_words
186
 
187
- if current_chunk:
188
- chunks.append("\n\n".join(current_chunk))
 
 
 
 
 
189
 
190
- return chunks
191
 
 
 
 
192
 
193
- def summarize_text(text: str) -> str:
194
- """Summarize a single chunk of text using the BART model via Inference API.
195
 
196
- Dynamically adjusts min/max summary length based on input length to avoid
197
- the common transformers warning about min_length exceeding input length.
 
198
  """
199
- word_count = len(text.split())
200
- # For very short inputs, just return the text as-is.
201
- if word_count < 50:
202
- return text
 
 
 
 
 
 
 
203
 
204
- max_len = min(SUMMARY_MAX_LENGTH, max(50, word_count // 2))
205
- min_len = min(SUMMARY_MIN_LENGTH, max_len - 10)
206
 
207
- try:
208
- result = client.summarization(
209
- text,
210
- parameters={
211
- "max_length": max_len,
212
- "min_length": min_len,
213
- "do_sample": False,
214
- },
215
- )
216
- return result.summary_text
217
- except Exception as e:
218
- logger.warning("Summarization failed: %s", e)
219
- # Fallback: return truncated text
220
- return " ".join(text.split()[:100]) + "..."
221
 
 
 
222
 
223
- def generate_full_summary(text: str) -> str:
224
- """Produce a final summary for arbitrarily long documents.
225
 
226
- Strategy:
227
- 1. Split the document into manageable chunks.
228
- 2. Summarize each chunk individually.
229
- 3. If multiple chunk summaries exist, combine them and run a second-pass
230
- summarization to produce a coherent final summary.
231
- """
232
- chunks = chunk_text(text)
233
- logger.info("Document split into %d chunk(s) for summarization.", len(chunks))
234
 
235
- chunk_summaries = [summarize_text(chunk) for chunk in chunks]
 
 
 
 
 
 
236
 
237
- if len(chunk_summaries) == 1:
238
- return chunk_summaries[0]
 
 
 
239
 
240
- # Second pass: combine chunk summaries and re-summarize for coherence.
241
- combined = " ".join(chunk_summaries)
242
- combined_words = len(combined.split())
 
 
 
 
 
243
 
244
- if combined_words < 50:
245
- return combined
246
 
247
- max_len = min(COMBINE_SUMMARY_MAX_LENGTH, max(60, combined_words // 2))
248
- min_len = min(SUMMARY_MIN_LENGTH, max_len - 10)
 
 
 
249
 
250
- try:
251
- result = client.summarization(
252
- combined,
253
- parameters={
254
- "max_length": max_len,
255
- "min_length": min_len,
256
- "do_sample": False,
257
- },
258
  )
259
- return result.summary_text
260
- except Exception as e:
261
- logger.warning("Combined summarization failed: %s", e)
262
- return combined
263
 
 
264
 
265
- # ===========================================================================
266
- # Section extraction helpers
267
- # ===========================================================================
268
 
269
- def extract_section(text: str, heading_pattern: str, fallback: str = "") -> str:
270
- """Extract content under a section heading matched by *heading_pattern*.
 
271
 
272
- Uses a regex to find the heading and captures everything until the next
273
- heading of equal or higher level.
 
 
 
274
  """
275
- pattern = re.compile(
276
- rf"(?:^|\n)\s*(?:\d+[\.\)]?\s*)?{heading_pattern}\s*\n(.*?)(?=\n\s*(?:\d+[\.\)]?\s*)?[A-Z][A-Za-z ]+\s*\n|\Z)",
277
- re.DOTALL | re.IGNORECASE,
278
- )
279
- match = pattern.search(text)
280
- if match:
281
- content = match.group(1).strip()
282
- if len(content) > 30:
283
- return content
284
- return fallback
285
-
286
-
287
- def extract_key_findings(text: str) -> str:
288
- """Try to extract key findings from Results / Conclusion sections, or
289
- fall back to summarizing the last portion of the paper."""
290
- for heading in [
291
- r"(?:key\s+)?findings",
292
- r"results?\s*(?:and\s+discussion)?",
293
- r"conclusions?\s*(?:and\s+future\s+work)?",
294
- r"discussion",
295
- ]:
296
- content = extract_section(text, heading)
297
- if content:
298
- return summarize_text(content[:3000])
299
- # Fallback: summarize the last quarter of the document.
300
- words = text.split()
301
- tail = " ".join(words[-(len(words) // 4):])
302
- if len(tail.split()) > 50:
303
- return summarize_text(tail[:3000])
304
- return "Key findings could not be automatically extracted."
305
-
306
-
307
- def extract_methodology(text: str) -> str:
308
- """Try to extract methodology information from the paper."""
309
- for heading in [
310
- r"method(?:ology|s)?",
311
- r"approach",
312
- r"experimental\s+setup",
313
- r"materials?\s+and\s+methods",
314
- r"(?:proposed\s+)?(?:framework|system|model|architecture)",
315
- ]:
316
- content = extract_section(text, heading)
317
- if content:
318
- return summarize_text(content[:3000])
319
- return "Methodology section could not be automatically extracted."
320
-
321
-
322
- # ===========================================================================
323
- # Main processing function
324
- # ===========================================================================
325
-
326
- def process_paper(
327
- pdf_file: Optional[str],
328
- pasted_text: Optional[str],
329
- ) -> str:
330
- """Process a research paper and return a structured summary.
331
 
332
- Accepts either a PDF file path (from Gradio upload) or raw pasted text.
333
- Returns a Markdown-formatted structured summary.
334
  """
335
  # ------------------------------------------------------------------
336
- # 1. Obtain raw text
337
  # ------------------------------------------------------------------
338
  if pdf_file is not None:
339
- logger.info("Processing uploaded PDF: %s", pdf_file)
340
- try:
341
- raw_text = extract_text_from_pdf(pdf_file)
342
- except ValueError as exc:
343
- return f"**Error:** {exc}"
344
- elif pasted_text and pasted_text.strip():
345
- raw_text = pasted_text.strip()
346
- else:
347
- return (
348
- "**Error:** Please upload a PDF file or paste the paper text. "
349
- "Both inputs are currently empty."
350
- )
351
-
352
- text = clean_text(raw_text)
353
- original_word_count = len(text.split())
354
 
355
- if original_word_count < 30:
356
- return (
357
- "**Error:** The extracted text is too short to summarize. "
358
- "Please provide a longer document or check that the PDF contains selectable text."
359
- )
360
 
361
- logger.info("Cleaned text: %d words.", original_word_count)
 
 
 
362
 
363
  # ------------------------------------------------------------------
364
- # 2. Extract structured components
365
  # ------------------------------------------------------------------
366
- title = extract_title(text)
367
- concise_summary = generate_full_summary(text)
368
- key_findings = extract_key_findings(text)
369
- methodology = extract_methodology(text)
370
 
371
- summary_word_count = len(concise_summary.split())
372
 
373
  # ------------------------------------------------------------------
374
- # 3. Format the output
375
  # ------------------------------------------------------------------
376
- output = f"""## {title}
 
377
 
378
- ---
 
 
 
 
 
 
 
379
 
380
- ### Concise Summary
381
- {concise_summary}
 
 
382
 
383
- ---
 
 
 
 
 
 
384
 
385
- ### Key Findings
386
- {key_findings}
387
 
388
- ---
389
 
390
- ### Methodology
391
- {methodology}
 
392
 
393
- ---
 
 
 
 
394
 
395
- ### Statistics
396
- | Metric | Value |
397
- |---|---|
398
- | Original length | {original_word_count:,} words |
399
- | Summary length | {summary_word_count:,} words |
400
- | Compression ratio | {original_word_count / max(summary_word_count, 1):.1f}x |
401
- """
402
- return output
403
 
 
404
 
405
- # ===========================================================================
406
- # Example inputs for the Gradio demo
407
- # ===========================================================================
408
 
409
- EXAMPLE_TEXT = """Attention Is All You Need
410
 
411
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
412
 
413
- Abstract
414
- The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
 
 
415
 
416
- Introduction
417
- Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures. Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht-1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
 
 
 
 
418
 
419
- Methods
420
- The Transformer follows an encoder-decoder structure using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder. The encoder maps an input sequence of symbol representations to a sequence of continuous representations. Given z, the decoder then generates an output sequence of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next. The Transformer uses multi-head attention to allow the model to jointly attend to information from different representation subspaces at different positions.
421
 
422
- Results
423
- On the WMT 2014 English-to-German translation task, the big transformer model outperforms the best previously reported models including ensembles by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.
424
 
425
- Conclusions
426
- In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. The Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. We achieved new state of the art on both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video."""
 
 
 
 
 
 
 
427
 
428
 
429
- # ===========================================================================
430
  # Gradio interface
431
- # ===========================================================================
432
 
433
  def build_interface() -> gr.Blocks:
434
  """Construct and return the Gradio Blocks interface."""
 
 
 
 
 
435
 
436
- with gr.Blocks(
437
- title="Paper Summarizer",
438
- theme=gr.themes.Soft(
439
- primary_hue="indigo",
440
- secondary_hue="blue",
441
- ),
442
- css="""
443
- .header-text { text-align: center; margin-bottom: 0.5em; }
444
- .subheader { text-align: center; color: #6b7280; margin-top: 0; }
445
- footer { display: none !important; }
446
- """,
447
- ) as demo:
448
- # --- Header ---
449
  gr.Markdown(
450
- """
451
- <h1 class="header-text">Paper Summarizer</h1>
452
- <p class="subheader">
453
- Summarize academic research papers into structured, digestible insights.<br>
454
- Upload a PDF or paste the full text below.
455
- </p>
456
- """,
457
  )
458
 
459
- with gr.Row():
460
- # --- Input column ---
461
  with gr.Column(scale=1):
462
- gr.Markdown("### Input")
463
- pdf_input = gr.File(
464
- label="Upload PDF",
 
 
 
 
 
465
  file_types=[".pdf"],
466
  type="filepath",
467
  )
468
- text_input = gr.Textbox(
469
- label="Or paste paper text",
470
- placeholder="Paste the full text of a research paper here...",
471
- lines=12,
472
- max_lines=30,
473
- )
474
- submit_btn = gr.Button("Summarize", variant="primary", size="lg")
475
-
476
- # --- Output column ---
477
  with gr.Column(scale=1):
478
- gr.Markdown("### Structured Summary")
479
- output = gr.Markdown(
480
- value="*Your summary will appear here after processing.*",
481
- label="Summary",
 
482
  )
483
 
484
- # --- Examples ---
485
- gr.Markdown("---")
486
- gr.Markdown("### Try an Example")
487
- gr.Examples(
488
- examples=[[None, EXAMPLE_TEXT]],
489
- inputs=[pdf_input, text_input],
490
- outputs=output,
491
- fn=process_paper,
492
- cache_examples=False,
493
- label="Click to load example paper text",
 
 
 
 
 
 
494
  )
495
 
496
- # --- About ---
497
- with gr.Accordion("About this Space", open=False):
498
- gr.Markdown(
499
- """
500
- **Paper Summarizer** uses
501
- [`facebook/bart-large-cnn`](https://huggingface.co/facebook/bart-large-cnn)
502
- to generate abstractive summaries of academic papers.
503
-
504
- **Features**
505
- - PDF upload with automatic text extraction (PyMuPDF)
506
- - Intelligent chunking for papers of any length
507
- - Structured output: title, key findings, methodology, and concise summary
508
- - Word-count statistics and compression ratio
509
-
510
- **Limitations**
511
- - Scanned PDFs (image-only) are not supported; the PDF must contain selectable text.
512
- - Summarization quality depends on the input text quality and structure.
513
- - Running on free CPU tier; very long papers may take a minute to process.
514
-
515
- Built by [Lorenzo Scaturchio](https://huggingface.co/gr8monk3ys).
516
- """
517
- )
518
-
519
- # --- Event binding ---
520
- submit_btn.click(
521
- fn=process_paper,
522
- inputs=[pdf_input, text_input],
523
- outputs=output,
524
  )
525
 
526
  return demo
527
 
528
 
529
- # ===========================================================================
530
  # Entry point
531
- # ===========================================================================
532
-
533
  if __name__ == "__main__":
534
  app = build_interface()
535
  app.launch()
 
1
  """
2
+ Resume Analyzer - AI-Powered Resume Analysis Against Job Descriptions
 
3
 
4
+ This Gradio application uses NLP techniques to evaluate how well a resume
5
+ matches a given job description. It provides semantic similarity scoring,
6
+ keyword extraction, gap analysis, and actionable improvement suggestions.
 
7
 
8
  Author: Lorenzo Scaturchio (gr8monk3ys)
9
  License: MIT
10
  """
11
 
 
12
  import re
13
  import logging
14
  from typing import Optional
15
 
16
  import fitz # PyMuPDF
17
  import gradio as gr
18
+ import numpy as np
19
+ from sentence_transformers import SentenceTransformer
20
+ from sklearn.feature_extraction.text import TfidfVectorizer
21
+ from sklearn.metrics.pairwise import cosine_similarity
22
 
23
  # ---------------------------------------------------------------------------
24
  # Logging
25
  # ---------------------------------------------------------------------------
26
+ logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
 
 
 
27
  logger = logging.getLogger(__name__)
28
 
29
  # ---------------------------------------------------------------------------
30
  # Constants
31
  # ---------------------------------------------------------------------------
32
+ MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
33
+
34
+ RESUME_SECTIONS = {
35
+ "experience": [
36
+ "experience", "work experience", "professional experience",
37
+ "employment history", "work history", "career history",
38
+ ],
39
+ "education": [
40
+ "education", "academic background", "academic history",
41
+ "qualifications", "certifications", "degrees",
42
+ ],
43
+ "skills": [
44
+ "skills", "technical skills", "core competencies",
45
+ "competencies", "proficiencies", "technologies", "tools",
46
+ ],
47
+ "projects": [
48
+ "projects", "personal projects", "portfolio",
49
+ "key projects", "selected projects",
50
+ ],
51
+ "summary": [
52
+ "summary", "professional summary", "profile",
53
+ "objective", "career objective", "about me", "overview",
54
+ ],
55
+ }
56
 
57
  # ---------------------------------------------------------------------------
58
+ # Example data shipped with the Space
59
  # ---------------------------------------------------------------------------
60
+ EXAMPLE_RESUME = """LORENZO SCATURCHIO
61
+ San Francisco, CA | lorenzo@email.com | linkedin.com/in/lorenzo | github.com/gr8monk3ys
62
+
63
+ PROFESSIONAL SUMMARY
64
+ Results-driven Machine Learning Engineer with 4+ years of experience designing,
65
+ building, and deploying production ML systems. Skilled in deep learning, NLP,
66
+ and scalable data pipelines. Passionate about applying AI to solve real-world
67
+ problems and delivering measurable business impact.
68
+
69
+ EXPERIENCE
70
+
71
+ Senior Machine Learning Engineer | Acme AI Corp | Jan 2022 - Present
72
+ - Designed and deployed a real-time recommendation engine serving 2M+ daily
73
+ active users, improving click-through rate by 18%.
74
+ - Built an end-to-end NLP pipeline for document classification using
75
+ Transformers (BERT, RoBERTa) with 94% F1 score.
76
+ - Led migration of model training infrastructure to Kubernetes, reducing
77
+ training time by 40% and cloud costs by 25%.
78
+ - Mentored a team of 3 junior engineers on MLOps best practices.
79
+
80
+ Machine Learning Engineer | DataWave Inc | Jun 2020 - Dec 2021
81
+ - Developed a customer churn prediction model (XGBoost, LightGBM) that
82
+ reduced churn by 12%, saving $1.2M annually.
83
+ - Implemented A/B testing framework for ML model evaluation in production.
84
+ - Created automated data quality checks and feature engineering pipelines
85
+ using Apache Spark and Airflow.
86
+
87
+ Data Science Intern | StartUp Labs | May 2019 - Aug 2019
88
+ - Conducted exploratory data analysis on 500K+ records to identify key
89
+ business drivers.
90
+ - Built sentiment analysis prototype using spaCy and scikit-learn.
91
+
92
+ EDUCATION
93
+ M.S. Computer Science (Machine Learning Specialization) | Stanford University | 2020
94
+ B.S. Computer Science | UC Berkeley | 2018
95
+
96
+ SKILLS
97
+ Languages: Python, SQL, Java, Scala, R
98
+ ML/DL Frameworks: PyTorch, TensorFlow, scikit-learn, Hugging Face Transformers
99
+ MLOps: Docker, Kubernetes, MLflow, Weights & Biases, Airflow
100
+ Cloud: AWS (SageMaker, EC2, S3), GCP (Vertex AI, BigQuery)
101
+ Data: Spark, Pandas, NumPy, PostgreSQL, Redis
102
+ Other: Git, CI/CD, REST APIs, Agile/Scrum
103
+
104
+ PROJECTS
105
+ - Open-source NLP toolkit for resume analysis (this project!)
106
+ - Real-time object detection system using YOLOv5 on edge devices
107
+ - Kaggle competition top-5% finish in Tabular Playground Series
108
+ """
109
 
110
+ EXAMPLE_JOB_DESCRIPTION = """Senior Machine Learning Engineer
111
+
112
+ About the Role
113
+ We are looking for a Senior Machine Learning Engineer to join our AI Platform
114
+ team. You will design, build, and maintain production ML systems that power
115
+ our core product features. This is a high-impact role where you will work
116
+ closely with product, engineering, and data science teams.
117
+
118
+ Responsibilities
119
+ - Design and implement scalable ML pipelines for training and inference.
120
+ - Develop and deploy deep learning models for NLP and computer vision tasks.
121
+ - Build robust monitoring and alerting for model performance in production.
122
+ - Collaborate with cross-functional teams to translate business requirements
123
+ into ML solutions.
124
+ - Mentor junior engineers and contribute to engineering best practices.
125
+ - Drive adoption of MLOps tools and processes across the organization.
126
+
127
+ Requirements
128
+ - 3+ years of experience in machine learning engineering or a related role.
129
+ - Strong proficiency in Python and SQL.
130
+ - Hands-on experience with ML frameworks such as PyTorch or TensorFlow.
131
+ - Experience deploying models to production using Docker and Kubernetes.
132
+ - Familiarity with cloud platforms (AWS or GCP).
133
+ - Solid understanding of NLP techniques (transformers, embeddings, etc.).
134
+ - Experience with data processing frameworks like Spark or similar.
135
+ - Strong communication skills and ability to work in an Agile environment.
136
+
137
+ Nice to Have
138
+ - Experience with recommendation systems or search ranking.
139
+ - Contributions to open-source ML projects.
140
+ - M.S. or Ph.D. in Computer Science, Machine Learning, or related field.
141
+ - Experience with MLflow, Weights & Biases, or similar experiment tracking.
142
+ - Familiarity with CI/CD pipelines for ML.
143
+ """
144
 
145
+ # ---------------------------------------------------------------------------
146
+ # Model loading (cached at module level for the HF Space)
147
+ # ---------------------------------------------------------------------------
148
+ logger.info("Loading sentence-transformer model: %s", MODEL_NAME)
149
+ _model = SentenceTransformer(MODEL_NAME)
150
+ logger.info("Model loaded successfully.")
151
 
 
 
152
 
153
+ # =========================================================================
154
+ # Core analysis utilities
155
+ # =========================================================================
156
 
157
+ def extract_text_from_pdf(pdf_path: str) -> str:
158
+ """Extract plain text from a PDF file using PyMuPDF."""
 
159
  try:
160
  doc = fitz.open(pdf_path)
161
+ pages = [page.get_text() for page in doc]
162
+ doc.close()
163
+ text = "\n".join(pages).strip()
164
+ if not text:
165
+ raise ValueError("The PDF appears to contain no extractable text.")
166
+ return text
167
  except Exception as exc:
168
+ logger.error("PDF extraction failed: %s", exc)
169
+ raise gr.Error(f"Could not read the PDF file: {exc}") from exc
 
170
 
 
171
 
172
+ def compute_semantic_similarity(text_a: str, text_b: str) -> float:
173
+ """Return cosine similarity (0-1) between two texts using the sentence-transformer."""
174
+ embeddings = _model.encode([text_a, text_b], convert_to_numpy=True)
175
+ similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
176
+ return float(np.clip(similarity, 0.0, 1.0))
177
 
 
 
178
 
179
+ def extract_keywords(texts: list[str], top_n: int = 30) -> list[list[str]]:
 
180
  """
181
+ Use TF-IDF to extract the most important keywords from each document
182
+ in *texts*. Returns a list of keyword-lists, one per input document.
 
 
183
  """
184
+ vectorizer = TfidfVectorizer(
185
+ stop_words="english",
186
+ max_features=500,
187
+ ngram_range=(1, 2),
188
+ token_pattern=r"(?u)\b[a-zA-Z][a-zA-Z+#.\-]{1,}\b",
189
+ )
190
+ tfidf_matrix = vectorizer.fit_transform(texts)
191
+ feature_names = np.array(vectorizer.get_feature_names_out())
 
 
192
 
193
+ results: list[list[str]] = []
194
+ for row_idx in range(tfidf_matrix.shape[0]):
195
+ row = tfidf_matrix[row_idx].toarray().flatten()
196
+ top_indices = row.argsort()[-top_n:][::-1]
197
+ keywords = [feature_names[i] for i in top_indices if row[i] > 0]
198
+ results.append(keywords)
199
+ return results
200
 
 
201
 
202
+ def _normalize(text: str) -> str:
203
+ """Lowercase and strip extra whitespace."""
204
+ return re.sub(r"\s+", " ", text.lower()).strip()
205
 
 
 
206
 
207
+ def find_matching_and_missing_keywords(
208
+ resume_text: str, job_keywords: list[str]
209
+ ) -> tuple[list[str], list[str]]:
210
  """
211
+ Compare job-description keywords against the resume body.
212
+ Returns (matched, missing) keyword lists.
213
+ """
214
+ resume_lower = _normalize(resume_text)
215
+ matched, missing = [], []
216
+ for kw in job_keywords:
217
+ if _normalize(kw) in resume_lower:
218
+ matched.append(kw)
219
+ else:
220
+ missing.append(kw)
221
+ return matched, missing
222
 
 
 
223
 
224
+ def detect_sections(resume_text: str) -> dict[str, str]:
225
+ """
226
+ Heuristically split a resume into named sections.
227
+ Returns a dict mapping canonical section names to their content.
228
+ """
229
+ lines = resume_text.split("\n")
230
+ current_section: Optional[str] = None
231
+ sections: dict[str, list[str]] = {}
 
 
 
 
 
 
232
 
233
+ for line in lines:
234
+ stripped = line.strip()
235
+ matched_section = _match_section_header(stripped)
236
+ if matched_section:
237
+ current_section = matched_section
238
+ sections.setdefault(current_section, [])
239
+ elif current_section:
240
+ sections[current_section].append(stripped)
241
+ else:
242
+ sections.setdefault("other", []).append(stripped)
243
+
244
+ return {name: "\n".join(content).strip() for name, content in sections.items()}
245
+
246
+
247
+ def _match_section_header(line: str) -> Optional[str]:
248
+ """Return a canonical section name if *line* looks like a header, else None."""
249
+ cleaned = re.sub(r"[^a-z\s]", "", line.lower()).strip()
250
+ if not cleaned or len(cleaned) > 40:
251
+ return None
252
+ for canonical, variants in RESUME_SECTIONS.items():
253
+ if cleaned in variants:
254
+ return canonical
255
+ return None
256
+
257
+
258
+ def analyze_section(
259
+ section_name: str, section_text: str, job_text: str
260
+ ) -> dict[str, object]:
261
+ """Compute a per-section relevance score and short commentary."""
262
+ if not section_text.strip():
263
+ return {
264
+ "score": 0.0,
265
+ "comment": f"No {section_name} section detected in the resume.",
266
+ }
267
+ score = compute_semantic_similarity(section_text, job_text)
268
+ pct = round(score * 100, 1)
269
+
270
+ if pct >= 70:
271
+ quality = "Strong"
272
+ elif pct >= 45:
273
+ quality = "Moderate"
274
+ else:
275
+ quality = "Weak"
276
 
277
+ comment = f"{quality} alignment ({pct}%) with the job description."
278
+ return {"score": score, "comment": comment}
279
 
 
 
 
 
 
 
 
 
280
 
281
+ def generate_suggestions(
282
+ missing_keywords: list[str],
283
+ section_scores: dict[str, dict],
284
+ overall_score: float,
285
+ ) -> list[str]:
286
+ """Return a list of actionable improvement suggestions."""
287
+ suggestions: list[str] = []
288
 
289
+ if overall_score < 0.45:
290
+ suggestions.append(
291
+ "Your resume has low overall alignment with this job description. "
292
+ "Consider tailoring it more directly to the role's requirements."
293
+ )
294
 
295
+ # Missing keywords
296
+ if missing_keywords:
297
+ top_missing = missing_keywords[:10]
298
+ suggestions.append(
299
+ "Add these high-value keywords or phrases where truthfully applicable: "
300
+ + ", ".join(f'"{kw}"' for kw in top_missing)
301
+ + "."
302
+ )
303
 
304
+ # Section-specific advice
305
+ for name, data in section_scores.items():
306
+ score = data["score"]
307
+ if name == "experience" and score < 0.5:
308
+ suggestions.append(
309
+ "Strengthen your Experience section by using action verbs and "
310
+ "quantifiable achievements that mirror the job requirements."
311
+ )
312
+ if name == "skills" and score < 0.5:
313
+ suggestions.append(
314
+ "Your Skills section could be improved. List specific tools, "
315
+ "frameworks, and technologies mentioned in the job posting."
316
+ )
317
+ if name == "education" and score < 0.3:
318
+ suggestions.append(
319
+ "Consider highlighting relevant coursework, certifications, or "
320
+ "academic projects in your Education section."
321
+ )
322
 
323
+ if "summary" not in section_scores or not section_scores.get("summary", {}).get("score", 0):
324
+ suggestions.append(
325
+ "Add a Professional Summary at the top of your resume that directly "
326
+ "addresses the key requirements of this role."
327
+ )
328
 
329
+ if not suggestions:
330
+ suggestions.append(
331
+ "Your resume is well-aligned with this job description. "
332
+ "Keep refining with specific metrics and results."
 
 
 
 
333
  )
 
 
 
 
334
 
335
+ return suggestions
336
 
 
 
 
337
 
338
+ # =========================================================================
339
+ # Main analysis orchestrator
340
+ # =========================================================================
341
 
342
+ def run_analysis(
343
+ resume_text: str,
344
+ job_description: str,
345
+ pdf_file: Optional[str] = None,
346
+ ) -> tuple[str, str, str, str]:
347
  """
348
+ Run the full resume analysis pipeline.
 
 
349
 
350
+ Returns a 4-tuple of Markdown strings:
351
+ (overview, keywords_report, section_report, suggestions_report)
352
  """
353
  # ------------------------------------------------------------------
354
+ # Input resolution
355
  # ------------------------------------------------------------------
356
  if pdf_file is not None:
357
+ resume_text = extract_text_from_pdf(pdf_file)
 
 
358
 
359
+ if not resume_text or not resume_text.strip():
360
+ raise gr.Error("Please provide resume text or upload a PDF.")
361
+ if not job_description or not job_description.strip():
362
+ raise gr.Error("Please provide a job description.")
 
363
 
364
+ # ------------------------------------------------------------------
365
+ # 1. Semantic similarity
366
+ # ------------------------------------------------------------------
367
+ raw_similarity = compute_semantic_similarity(resume_text, job_description)
368
 
369
  # ------------------------------------------------------------------
370
+ # 2. Keyword analysis
371
  # ------------------------------------------------------------------
372
+ keyword_lists = extract_keywords([resume_text, job_description], top_n=30)
373
+ resume_keywords, job_keywords = keyword_lists[0], keyword_lists[1]
374
+ matched_kw, missing_kw = find_matching_and_missing_keywords(resume_text, job_keywords)
 
375
 
376
+ keyword_overlap = len(matched_kw) / max(len(job_keywords), 1)
377
 
378
  # ------------------------------------------------------------------
379
+ # 3. Composite score (60% semantic + 40% keyword overlap)
380
  # ------------------------------------------------------------------
381
+ composite = 0.6 * raw_similarity + 0.4 * keyword_overlap
382
+ overall_pct = round(composite * 100, 1)
383
 
384
+ # ------------------------------------------------------------------
385
+ # 4. Section-by-section analysis
386
+ # ------------------------------------------------------------------
387
+ sections = detect_sections(resume_text)
388
+ section_scores: dict[str, dict] = {}
389
+ for sec_name in ["summary", "experience", "education", "skills", "projects"]:
390
+ sec_text = sections.get(sec_name, "")
391
+ section_scores[sec_name] = analyze_section(sec_name, sec_text, job_description)
392
 
393
+ # ------------------------------------------------------------------
394
+ # 5. Suggestions
395
+ # ------------------------------------------------------------------
396
+ suggestions = generate_suggestions(missing_kw, section_scores, composite)
397
 
398
+ # ------------------------------------------------------------------
399
+ # Format outputs as Markdown
400
+ # ------------------------------------------------------------------
401
+ overview_md = _format_overview(overall_pct, raw_similarity, keyword_overlap, matched_kw, missing_kw)
402
+ keywords_md = _format_keywords(resume_keywords, job_keywords, matched_kw, missing_kw)
403
+ sections_md = _format_sections(section_scores)
404
+ suggest_md = _format_suggestions(suggestions)
405
 
406
+ return overview_md, keywords_md, sections_md, suggest_md
 
407
 
 
408
 
409
+ # =========================================================================
410
+ # Markdown formatters
411
+ # =========================================================================
412
 
413
+ def _score_bar(pct: float, width: int = 20) -> str:
414
+ """Return a text-based progress bar for Markdown."""
415
+ filled = round(pct / 100 * width)
416
+ empty = width - filled
417
+ return f"`[{'=' * filled}{' ' * empty}]` **{pct}%**"
418
 
 
 
 
 
 
 
 
 
419
 
420
+ def _format_overview(
421
+ overall_pct: float,
422
+ semantic_sim: float,
423
+ keyword_overlap: float,
424
+ matched: list[str],
425
+ missing: list[str],
426
+ ) -> str:
427
+ sem_pct = round(semantic_sim * 100, 1)
428
+ kw_pct = round(keyword_overlap * 100, 1)
429
+
430
+ if overall_pct >= 70:
431
+ verdict = "Excellent match - your resume aligns strongly with this role."
432
+ elif overall_pct >= 50:
433
+ verdict = "Good match - some targeted improvements could strengthen your application."
434
+ elif overall_pct >= 30:
435
+ verdict = "Partial match - significant tailoring is recommended."
436
+ else:
437
+ verdict = "Low match - consider whether this role fits your background or rewrite substantially."
438
+
439
+ return (
440
+ f"## Overall Match Score\n\n"
441
+ f"# {_score_bar(overall_pct)}\n\n"
442
+ f"**Verdict:** {verdict}\n\n"
443
+ f"---\n\n"
444
+ f"### Score Breakdown\n\n"
445
+ f"| Component | Score |\n"
446
+ f"|---|---|\n"
447
+ f"| Semantic Similarity | {sem_pct}% |\n"
448
+ f"| Keyword Overlap | {kw_pct}% |\n"
449
+ f"| **Composite (60/40)** | **{overall_pct}%** |\n\n"
450
+ f"---\n\n"
451
+ f"**Matched Keywords:** {len(matched)} &nbsp;|&nbsp; "
452
+ f"**Missing Keywords:** {len(missing)}\n"
453
+ )
454
 
 
 
 
455
 
456
+ def _format_keywords(
457
+ resume_kw: list[str],
458
+ job_kw: list[str],
459
+ matched: list[str],
460
+ missing: list[str],
461
+ ) -> str:
462
+ matched_str = ", ".join(f"**{kw}**" for kw in matched) if matched else "_None detected_"
463
+ missing_str = ", ".join(f"~~{kw}~~" for kw in missing) if missing else "_None - great coverage!_"
464
+ resume_str = ", ".join(resume_kw[:20]) if resume_kw else "_None detected_"
465
+ job_str = ", ".join(job_kw[:20]) if job_kw else "_None detected_"
466
+
467
+ return (
468
+ f"## Keyword Analysis\n\n"
469
+ f"### Matched Keywords (found in your resume)\n{matched_str}\n\n"
470
+ f"---\n\n"
471
+ f"### Missing Keywords (consider adding)\n{missing_str}\n\n"
472
+ f"---\n\n"
473
+ f"### Top Resume Keywords (TF-IDF)\n{resume_str}\n\n"
474
+ f"### Top Job Description Keywords (TF-IDF)\n{job_str}\n"
475
+ )
476
 
 
477
 
478
+ def _format_sections(section_scores: dict[str, dict]) -> str:
479
+ header = "## Section-by-Section Analysis\n\n"
480
+ rows = "| Section | Score | Assessment |\n|---|---|---|\n"
481
+ details = ""
482
 
483
+ for name, data in section_scores.items():
484
+ pct = round(data["score"] * 100, 1)
485
+ comment = data["comment"]
486
+ display_name = name.replace("_", " ").title()
487
+ rows += f"| {display_name} | {pct}% | {comment} |\n"
488
+ details += f"### {display_name}\n{comment}\n\n"
489
 
490
+ return header + rows + "\n---\n\n" + details
 
491
 
 
 
492
 
493
+ def _format_suggestions(suggestions: list[str]) -> str:
494
+ items = "\n".join(f"{i}. {s}" for i, s in enumerate(suggestions, 1))
495
+ return (
496
+ f"## Improvement Suggestions\n\n"
497
+ f"{items}\n\n"
498
+ f"---\n\n"
499
+ f"*Tip: Tailor your resume for every application. Mirror the language "
500
+ f"used in the job posting while remaining truthful about your experience.*\n"
501
+ )
502
 
503
 
504
+ # =========================================================================
505
  # Gradio interface
506
+ # =========================================================================
507
 
508
  def build_interface() -> gr.Blocks:
509
  """Construct and return the Gradio Blocks interface."""
510
+ theme = gr.themes.Soft(
511
+ primary_hue="teal",
512
+ secondary_hue="green",
513
+ font=[gr.themes.GoogleFont("Inter"), "system-ui", "sans-serif"],
514
+ )
515
 
516
+ with gr.Blocks(theme=theme, title="Resume Analyzer") as demo:
 
 
517
  gr.Markdown(
518
+ "# Resume Analyzer\n"
519
+ "Evaluate how well your resume matches a job description using "
520
+ "**semantic similarity** and **keyword analysis**.\n\n"
521
+ "Paste your resume text (or upload a PDF) and the target job "
522
+ "description, then click **Analyze**."
 
 
523
  )
524
 
525
+ with gr.Row(equal_height=True):
 
526
  with gr.Column(scale=1):
527
+ resume_text = gr.Textbox(
528
+ label="Resume Text",
529
+ placeholder="Paste your resume here...",
530
+ lines=18,
531
+ value=EXAMPLE_RESUME,
532
+ )
533
+ pdf_upload = gr.File(
534
+ label="Or Upload Resume PDF",
535
  file_types=[".pdf"],
536
  type="filepath",
537
  )
 
 
 
 
 
 
 
 
 
538
  with gr.Column(scale=1):
539
+ job_desc = gr.Textbox(
540
+ label="Job Description",
541
+ placeholder="Paste the job description here...",
542
+ lines=22,
543
+ value=EXAMPLE_JOB_DESCRIPTION,
544
  )
545
 
546
+ analyze_btn = gr.Button("Analyze", variant="primary", size="lg")
547
+
548
+ with gr.Tabs():
549
+ with gr.Tab("Overview"):
550
+ overview_output = gr.Markdown()
551
+ with gr.Tab("Keywords"):
552
+ keywords_output = gr.Markdown()
553
+ with gr.Tab("Sections"):
554
+ sections_output = gr.Markdown()
555
+ with gr.Tab("Suggestions"):
556
+ suggestions_output = gr.Markdown()
557
+
558
+ analyze_btn.click(
559
+ fn=run_analysis,
560
+ inputs=[resume_text, job_desc, pdf_upload],
561
+ outputs=[overview_output, keywords_output, sections_output, suggestions_output],
562
  )
563
 
564
+ gr.Markdown(
565
+ "---\n"
566
+ "Built by [Lorenzo Scaturchio](https://huggingface.co/gr8monk3ys) | "
567
+ "Model: `sentence-transformers/all-MiniLM-L6-v2` | "
568
+ "[Source Code](https://huggingface.co/spaces/gr8monk3ys/resume-analyzer-space)"
 
 
569
  )
570
 
571
  return demo
572
 
573
 
574
+ # ---------------------------------------------------------------------------
575
  # Entry point
576
+ # ---------------------------------------------------------------------------
 
577
  if __name__ == "__main__":
578
  app = build_interface()
579
  app.launch()
requirements.txt CHANGED
@@ -1,3 +1,5 @@
- gradio==4.44.0
- huggingface_hub==0.22.2
- PyMuPDF>=1.24.0
+ gradio==5.9.1
+ sentence-transformers>=2.2.0
+ scikit-learn>=1.0.0
+ pymupdf>=1.24.0
+ numpy>=1.20.0