gr8monk3ys committed (verified)
Commit 98a7e55 · Parent: 019a064

Upload folder using huggingface_hub

Files changed (3):
  1. README.md +63 -42
  2. app.py +458 -414
  3. requirements.txt +5 -3
README.md CHANGED
@@ -1,81 +1,102 @@
  ---
- title: Paper Summarizer
- emoji: 📄
- colorFrom: blue
- colorTo: indigo
  sdk: gradio
- sdk_version: 4.44.0
  python_version: "3.10"
  app_file: app.py
  pinned: false
  license: mit
- short_description: Summarize academic research papers with AI
  ---

- # Paper Summarizer

- An AI-powered tool that transforms lengthy academic research papers into structured, digestible summaries. Built with Facebook's BART-Large-CNN model and deployed as a Gradio web application on HuggingFace Spaces.

  ## Features

- - **PDF Upload** -- Drop a research paper PDF and get an instant structured summary.
- - **Text Input** -- Paste raw paper text directly if you prefer.
- - **Structured Output** -- Every summary includes:
-   - Extracted paper title
-   - Concise abstract-length summary
-   - Key findings from the results/conclusion sections
-   - Methodology overview
-   - Word-count statistics with compression ratio
- - **Long Document Support** -- Papers of any length are automatically chunked and summarized in multiple passes, then combined into a coherent final summary.
- - **Clean PDF Processing** -- Handles hyphenated line breaks, control characters, and other common PDF artifacts.

- ## How It Works

- 1. **Text Extraction** -- PDFs are parsed with PyMuPDF (fitz) to extract selectable text from every page.
- 2. **Cleaning** -- Raw text is normalized: stray control characters are removed, hyphenated line breaks are rejoined, and excessive whitespace is collapsed.
- 3. **Chunking** -- The cleaned text is split into chunks of approximately 700 words, respecting paragraph and sentence boundaries so context is preserved.
- 4. **Summarization** -- Each chunk is passed through `facebook/bart-large-cnn` for abstractive summarization. If there are multiple chunks, the individual summaries are combined and summarized again for coherence.
- 5. **Section Extraction** -- Regex heuristics identify Results, Methodology, and Conclusion sections for targeted summarization of key findings and methods.

- ## Model

- This Space uses [`facebook/bart-large-cnn`](https://huggingface.co/facebook/bart-large-cnn), a BART model fine-tuned on the CNN/DailyMail summarization dataset. It runs on the free CPU tier and can process most papers in under a minute.

- ## Limitations

- - **Scanned PDFs** are not supported -- the PDF must contain selectable text (not images of text).
- - **Summarization quality** depends on the structure and clarity of the input text.
- - **Processing time** may be longer for very large papers due to CPU-only inference.

- ## Tech Stack

- | Component | Library |
  |---|---|
  | Web framework | Gradio 4.44 |
- | Summarization model | HuggingFace Transformers (BART-Large-CNN) |
  | PDF parsing | PyMuPDF (fitz) |
- | Inference backend | PyTorch (CPU) |

- ## Local Development

  ```bash
  # Clone the repository
- git clone https://huggingface.co/spaces/gr8monk3ys/paper-summarizer
- cd paper-summarizer

  # Install dependencies
  pip install -r requirements.txt

- # Run the application
  python app.py
  ```

  The app will be available at `http://localhost:7860`.

- ## License

- MIT

- ## Author

- Built by [Lorenzo Scaturchio](https://huggingface.co/gr8monk3ys).

  ---
+ title: Resume Analyzer
+ emoji: 📋
+ colorFrom: green
+ colorTo: blue
  sdk: gradio
+ sdk_version: 5.9.1
  python_version: "3.10"
  app_file: app.py
  pinned: false
  license: mit
+ short_description: AI-powered resume analysis against job descriptions
  ---

+ # Resume Analyzer

+ An AI-powered tool that analyzes how well your resume matches a target job description. Built with Gradio, sentence-transformers, and scikit-learn.

  ## Features

+ ### Semantic Similarity Scoring
+ Uses the `sentence-transformers/all-MiniLM-L6-v2` model to compute deep semantic similarity between your resume and the job description. This goes beyond simple keyword matching to understand the meaning and context of your experience relative to what the role demands.
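For reference, a minimal sketch of this scoring step (it mirrors the `compute_semantic_similarity` helper in the committed `app.py`; the `semantic_score` wrapper and the sample strings are illustrative only):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Same lightweight embedding model the Space pins in its constants.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def semantic_score(resume: str, job: str) -> float:
    """Embed both documents and return their cosine similarity (roughly 0-1)."""
    emb = model.encode([resume, job], convert_to_numpy=True)
    return float(cosine_similarity([emb[0]], [emb[1]])[0][0])

print(semantic_score("ML engineer: PyTorch, NLP, Kubernetes",
                     "Hiring a machine learning engineer with NLP experience"))
```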

+ ### TF-IDF Keyword Extraction
+ Extracts the most important keywords and phrases from both documents using Term Frequency-Inverse Document Frequency (TF-IDF) with unigrams and bigrams. This surfaces the specific terms that carry the most weight in each document.
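A sketch of that extraction step with scikit-learn's `TfidfVectorizer` (the `top_keywords` helper and its inputs are illustrative; the committed `extract_keywords` in `app.py` follows the same pattern):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(texts: list[str], top_n: int = 10) -> list[list[str]]:
    """Return the top_n highest-weighted unigrams/bigrams for each document."""
    vec = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
    matrix = vec.fit_transform(texts)
    terms = np.array(vec.get_feature_names_out())
    ranked = []
    for row in matrix.toarray():
        order = row.argsort()[::-1][:top_n]
        ranked.append([terms[i] for i in order if row[i] > 0])
    return ranked

resume_kw, job_kw = top_keywords(
    ["Built NLP pipelines in Python with PyTorch and Airflow",
     "Seeking an engineer experienced with Python, NLP, and data pipelines"]
)
```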

+ ### Keyword Gap Analysis
+ Compares the job description's top keywords against your resume content to identify:
+ - **Matched keywords** -- terms the job requires that your resume already contains.
+ - **Missing keywords** -- high-value terms you should consider adding (where truthfully applicable).
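The gap check itself reduces to a normalized substring test, as in this sketch (mirroring `find_matching_and_missing_keywords` in the committed `app.py`; the sample inputs are made up):

```python
def keyword_gap(resume: str, job_keywords: list[str]) -> tuple[list[str], list[str]]:
    """Split the job's keywords into those found in the resume and those missing."""
    haystack = " ".join(resume.lower().split())
    matched = [kw for kw in job_keywords if kw.lower() in haystack]
    missing = [kw for kw in job_keywords if kw.lower() not in haystack]
    return matched, missing

matched, missing = keyword_gap(
    "Python and Kubernetes experience", ["python", "kubernetes", "terraform"]
)
# matched == ["python", "kubernetes"], missing == ["terraform"]
```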
 
+ ### Section-by-Section Analysis
+ Automatically detects standard resume sections (Summary, Experience, Education, Skills, Projects) and scores each one independently against the job description. This pinpoints exactly which parts of your resume need the most attention.
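Section detection is a regex heuristic: a short line counts as a header when, lowercased and stripped of punctuation, it matches a known variant. A trimmed-down sketch (the full variant lists live in `RESUME_SECTIONS` in the committed `app.py`):

```python
import re
from typing import Optional

HEADERS = {
    "experience": {"experience", "work experience", "professional experience"},
    "skills": {"skills", "technical skills"},
    "education": {"education"},
}

def match_header(line: str) -> Optional[str]:
    """Map a resume line to a canonical section name, or None if it is body text."""
    cleaned = re.sub(r"[^a-z\s]", "", line.lower()).strip()
    if not cleaned or len(cleaned) > 40:
        return None
    for canonical, variants in HEADERS.items():
        if cleaned in variants:
            return canonical
    return None

print(match_header("EXPERIENCE"))        # -> "experience"
print(match_header("Led a team of 3"))   # -> None
```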

+ ### Composite Match Score
+ A weighted composite score (60% semantic similarity, 40% keyword overlap) that gives a single 0-100% indicator of overall fit.
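Concretely, a semantic similarity of 0.72 with 18 of 30 job keywords matched (overlap 0.60) gives 0.6 × 0.72 + 0.4 × 0.60 = 0.672, reported as 67.2%. A one-function sketch of that weighting (same 60/40 split as the committed `run_analysis`):

```python
def composite_score(semantic_sim: float, keyword_overlap: float) -> float:
    """Blend the two signals: 60% semantic similarity, 40% keyword overlap (as a percentage)."""
    return round((0.6 * semantic_sim + 0.4 * keyword_overlap) * 100, 1)

print(composite_score(0.72, 18 / 30))  # 67.2
```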

+ ### Actionable Suggestions
+ Generates specific, prioritized recommendations for improving your resume's alignment with the target role.

+ ### PDF Upload Support
+ Upload your resume as a PDF file instead of pasting text. The app extracts text from the PDF automatically using PyMuPDF.
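Extraction is a thin wrapper over PyMuPDF, roughly as below (a sketch of the approach; only PDFs with selectable text work, so scanned images raise an error):

```python
import fitz  # PyMuPDF

def pdf_to_text(path: str) -> str:
    """Concatenate the selectable text of every page in the PDF."""
    with fitz.open(path) as doc:
        text = "\n".join(page.get_text() for page in doc)
    if not text.strip():
        raise ValueError("No extractable text found (the PDF may be a scanned image).")
    return text
```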
 

+ ## How to Use

+ 1. **Paste your resume** into the left text area, or upload a PDF using the file upload widget.
+ 2. **Paste the job description** into the right text area.
+ 3. Click **Analyze**.
+ 4. Browse the results across four tabs:
+    - **Overview** -- composite score, breakdown, and verdict.
+    - **Keywords** -- matched and missing keywords with TF-IDF rankings.
+    - **Sections** -- per-section scores and assessments.
+    - **Suggestions** -- numbered, actionable improvement recommendations.

+ Example data is pre-loaded so you can click Analyze immediately to see the tool in action.

+ ## Technical Architecture

+ ```
+ Resume Text / PDF ──┐
+                     ├──▶ Semantic Embedding (MiniLM-L6-v2) ──▶ Cosine Similarity
+ Job Description ────┘
+                     ├──▶ TF-IDF Vectorization ──▶ Keyword Extraction & Matching
+                     └──▶ Section Detection (regex heuristics) ──▶ Per-section Scoring
+ ```

+ | Component | Technology |
  |---|---|
  | Web framework | Gradio 4.44 |
+ | Semantic model | sentence-transformers/all-MiniLM-L6-v2 |
+ | Keyword extraction | scikit-learn TfidfVectorizer |
  | PDF parsing | PyMuPDF (fitz) |
+ | Numerical compute | NumPy |

+ ## Running Locally

  ```bash
  # Clone the repository
+ git clone https://huggingface.co/spaces/gr8monk3ys/resume-analyzer-space
+ cd resume-analyzer-space

  # Install dependencies
  pip install -r requirements.txt

+ # Launch the app
  python app.py
  ```

  The app will be available at `http://localhost:7860`.

+ ## Project Structure

+ ```
+ resume-analyzer-space/
+ ├── app.py            # Application source (Gradio interface + analysis logic)
+ ├── requirements.txt  # Python dependencies
+ └── README.md         # This file (includes HF Space metadata)
+ ```

+ ## License

+ MIT
app.py CHANGED
@@ -1,535 +1,579 @@
1
  """
2
- Paper Summarizer - A Gradio-based web application for summarizing academic research papers.
3
- Version: 2.0.0 (Gradio 5.x compatible)
4
 
5
- This application uses Facebook's BART-Large-CNN model to generate structured summaries
6
- of academic papers. It supports both PDF uploads and pasted text input, handles long
7
- documents through intelligent chunking, and produces summaries with extracted titles,
8
- key findings, methodology notes, and concise abstracts.
9
 
10
  Author: Lorenzo Scaturchio (gr8monk3ys)
11
  License: MIT
12
  """
13
 
14
- import os
15
  import re
16
  import logging
17
  from typing import Optional
18
 
19
  import fitz # PyMuPDF
20
  import gradio as gr
21
- from huggingface_hub import InferenceClient
 
 
 
22
 
23
  # ---------------------------------------------------------------------------
24
  # Logging
25
  # ---------------------------------------------------------------------------
26
- logging.basicConfig(
27
- level=logging.INFO,
28
- format="%(asctime)s [%(levelname)s] %(message)s",
29
- )
30
  logger = logging.getLogger(__name__)
31
 
32
  # ---------------------------------------------------------------------------
33
  # Constants
34
  # ---------------------------------------------------------------------------
35
- MODEL_NAME = "facebook/bart-large-cnn"
36
- # BART-Large-CNN accepts up to 1024 tokens (~750 words). We chunk by words to
37
- # stay safely within that window while leaving room for special tokens.
38
- CHUNK_WORD_LIMIT = 700
39
- SUMMARY_MIN_LENGTH = 40
40
- SUMMARY_MAX_LENGTH = 180
41
- COMBINE_SUMMARY_MAX_LENGTH = 300
42
 
43
  # ---------------------------------------------------------------------------
44
- # Use HuggingFace Inference API (no local model loading - saves memory)
45
  # ---------------------------------------------------------------------------
46
- logger.info("Initializing HuggingFace Inference Client for: %s", MODEL_NAME)
47
- client = InferenceClient(model=MODEL_NAME)
48
- logger.info("Inference client ready.")
49
-
 
50
 
51
- # ===========================================================================
52
- # Text extraction helpers
53
- # ===========================================================================
 
 
54
 
55
- def extract_text_from_pdf(pdf_path: str) -> str:
56
- """Extract all text content from a PDF file using PyMuPDF.
 
 
 
 
57
 
58
- Args:
59
- pdf_path: Path to the uploaded PDF file.
60
 
61
- Returns:
62
- The concatenated text of every page, separated by newlines.
 
63
 
64
- Raises:
65
- ValueError: If the PDF contains no extractable text.
66
- """
67
  try:
68
  doc = fitz.open(pdf_path)
 
 
 
 
 
 
69
  except Exception as exc:
70
- raise ValueError(
71
- f"Could not open the PDF file. It may be corrupted or password-protected. "
72
- f"Details: {exc}"
73
- ) from exc
74
-
75
- pages: list[str] = []
76
- for page_num, page in enumerate(doc):
77
- text = page.get_text("text")
78
- if text.strip():
79
- pages.append(text)
80
- logger.debug("Page %d: extracted %d characters", page_num + 1, len(text))
81
-
82
- doc.close()
83
-
84
- if not pages:
85
- raise ValueError(
86
- "The PDF appears to contain no extractable text. "
87
- "It may be a scanned document or consist only of images."
88
- )
89
 
90
- return "\n".join(pages)
91
 
 
 
 
 
 
92
 
93
- def clean_text(text: str) -> str:
94
- """Normalize whitespace and remove common PDF artefacts.
95
 
96
- Handles excessive newlines, hyphenated line-breaks, and stray control
97
- characters that often appear in academic PDFs.
98
  """
99
- # Remove form-feed and other control characters (keep newlines & tabs)
100
- text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
101
- # Re-join hyphenated line breaks (e.g. "summa-\nrization" -> "summarization")
102
- text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
103
- # Collapse multiple blank lines into one
104
- text = re.sub(r"\n{3,}", "\n\n", text)
105
- # Collapse multiple spaces
106
- text = re.sub(r"[ \t]{2,}", " ", text)
107
- return text.strip()
108
-
109
-
110
- # ===========================================================================
111
- # Title extraction heuristic
112
- # ===========================================================================
113
-
114
- def extract_title(text: str) -> str:
115
- """Attempt to extract the paper title from the first few lines.
116
-
117
- Academic papers typically place the title in the first 1-5 lines before the
118
- author block. We use a simple heuristic: the longest line among the first
119
- few non-empty lines that is not all-caps (which would be a header like
120
- "ABSTRACT") and does not look like an author list.
121
- """
122
- lines = [ln.strip() for ln in text.split("\n") if ln.strip()][:12]
123
-
124
- candidates: list[str] = []
125
- for line in lines:
126
- # Skip very short lines (page numbers, dates, etc.)
127
- if len(line) < 10:
128
- continue
129
- # Skip lines that are likely author names / affiliations (contain '@')
130
- if "@" in line:
131
- continue
132
- # Skip lines that are section headers (all uppercase, short)
133
- if line.isupper() and len(line) < 60:
134
- continue
135
- # Skip lines that look like emails or URLs
136
- if re.search(r"https?://|www\.", line):
137
- continue
138
- candidates.append(line)
139
-
140
- if not candidates:
141
- return "Untitled Paper"
142
-
143
- # Return the first substantial candidate (titles usually come first)
144
- return candidates[0]
145
-
146
-
147
- # ===========================================================================
148
- # Chunking and summarization
149
- # ===========================================================================
150
-
151
- def chunk_text(text: str, max_words: int = CHUNK_WORD_LIMIT) -> list[str]:
152
- """Split text into chunks of approximately *max_words* words.
153
-
154
- Splitting is done on paragraph boundaries where possible so that chunks
155
- remain coherent. If a single paragraph exceeds the limit it is split on
156
- sentence boundaries instead.
157
  """
158
- paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
159
- chunks: list[str] = []
160
- current_chunk: list[str] = []
161
- current_word_count = 0
162
-
163
- for para in paragraphs:
164
- para_words = len(para.split())
165
-
166
- # If adding this paragraph would exceed the limit, finalize the chunk.
167
- if current_word_count + para_words > max_words and current_chunk:
168
- chunks.append("\n\n".join(current_chunk))
169
- current_chunk = []
170
- current_word_count = 0
171
-
172
- # Handle paragraphs that are themselves larger than the limit.
173
- if para_words > max_words:
174
- sentences = re.split(r"(?<=[.!?])\s+", para)
175
- for sentence in sentences:
176
- s_words = len(sentence.split())
177
- if current_word_count + s_words > max_words and current_chunk:
178
- chunks.append("\n\n".join(current_chunk))
179
- current_chunk = []
180
- current_word_count = 0
181
- current_chunk.append(sentence)
182
- current_word_count += s_words
183
- else:
184
- current_chunk.append(para)
185
- current_word_count += para_words
186
 
187
- if current_chunk:
188
- chunks.append("\n\n".join(current_chunk))
 
 
 
 
 
189
 
190
- return chunks
191
 
 
 
 
192
 
193
- def summarize_text(text: str) -> str:
194
- """Summarize a single chunk of text using the BART model via Inference API.
195
 
196
- Dynamically adjusts min/max summary length based on input length to avoid
197
- the common transformers warning about min_length exceeding input length.
 
198
  """
199
- word_count = len(text.split())
200
- # For very short inputs, just return the text as-is.
201
- if word_count < 50:
202
- return text
 
 
 
 
 
 
 
203
 
204
- max_len = min(SUMMARY_MAX_LENGTH, max(50, word_count // 2))
205
- min_len = min(SUMMARY_MIN_LENGTH, max_len - 10)
206
 
207
- try:
208
- result = client.summarization(
209
- text,
210
- parameters={
211
- "max_length": max_len,
212
- "min_length": min_len,
213
- "do_sample": False,
214
- },
215
- )
216
- return result.summary_text
217
- except Exception as e:
218
- logger.warning("Summarization failed: %s", e)
219
- # Fallback: return truncated text
220
- return " ".join(text.split()[:100]) + "..."
221
 
 
 
222
 
223
- def generate_full_summary(text: str) -> str:
224
- """Produce a final summary for arbitrarily long documents.
225
 
226
- Strategy:
227
- 1. Split the document into manageable chunks.
228
- 2. Summarize each chunk individually.
229
- 3. If multiple chunk summaries exist, combine them and run a second-pass
230
- summarization to produce a coherent final summary.
231
- """
232
- chunks = chunk_text(text)
233
- logger.info("Document split into %d chunk(s) for summarization.", len(chunks))
234
 
235
- chunk_summaries = [summarize_text(chunk) for chunk in chunks]
 
 
 
 
 
 
236
 
237
- if len(chunk_summaries) == 1:
238
- return chunk_summaries[0]
 
 
 
239
 
240
- # Second pass: combine chunk summaries and re-summarize for coherence.
241
- combined = " ".join(chunk_summaries)
242
- combined_words = len(combined.split())
 
 
 
 
 
243
 
244
- if combined_words < 50:
245
- return combined
246
 
247
- max_len = min(COMBINE_SUMMARY_MAX_LENGTH, max(60, combined_words // 2))
248
- min_len = min(SUMMARY_MIN_LENGTH, max_len - 10)
 
 
 
249
 
250
- try:
251
- result = client.summarization(
252
- combined,
253
- parameters={
254
- "max_length": max_len,
255
- "min_length": min_len,
256
- "do_sample": False,
257
- },
258
  )
259
- return result.summary_text
260
- except Exception as e:
261
- logger.warning("Combined summarization failed: %s", e)
262
- return combined
263
 
 
264
 
265
- # ===========================================================================
266
- # Section extraction helpers
267
- # ===========================================================================
268
 
269
- def extract_section(text: str, heading_pattern: str, fallback: str = "") -> str:
270
- """Extract content under a section heading matched by *heading_pattern*.
 
271
 
272
- Uses a regex to find the heading and captures everything until the next
273
- heading of equal or higher level.
 
 
 
274
  """
275
- pattern = re.compile(
276
- rf"(?:^|\n)\s*(?:\d+[\.\)]?\s*)?{heading_pattern}\s*\n(.*?)(?=\n\s*(?:\d+[\.\)]?\s*)?[A-Z][A-Za-z ]+\s*\n|\Z)",
277
- re.DOTALL | re.IGNORECASE,
278
- )
279
- match = pattern.search(text)
280
- if match:
281
- content = match.group(1).strip()
282
- if len(content) > 30:
283
- return content
284
- return fallback
285
-
286
-
287
- def extract_key_findings(text: str) -> str:
288
- """Try to extract key findings from Results / Conclusion sections, or
289
- fall back to summarizing the last portion of the paper."""
290
- for heading in [
291
- r"(?:key\s+)?findings",
292
- r"results?\s*(?:and\s+discussion)?",
293
- r"conclusions?\s*(?:and\s+future\s+work)?",
294
- r"discussion",
295
- ]:
296
- content = extract_section(text, heading)
297
- if content:
298
- return summarize_text(content[:3000])
299
- # Fallback: summarize the last quarter of the document.
300
- words = text.split()
301
- tail = " ".join(words[-(len(words) // 4):])
302
- if len(tail.split()) > 50:
303
- return summarize_text(tail[:3000])
304
- return "Key findings could not be automatically extracted."
305
-
306
-
307
- def extract_methodology(text: str) -> str:
308
- """Try to extract methodology information from the paper."""
309
- for heading in [
310
- r"method(?:ology|s)?",
311
- r"approach",
312
- r"experimental\s+setup",
313
- r"materials?\s+and\s+methods",
314
- r"(?:proposed\s+)?(?:framework|system|model|architecture)",
315
- ]:
316
- content = extract_section(text, heading)
317
- if content:
318
- return summarize_text(content[:3000])
319
- return "Methodology section could not be automatically extracted."
320
-
321
-
322
- # ===========================================================================
323
- # Main processing function
324
- # ===========================================================================
325
-
326
- def process_paper(
327
- pdf_file: Optional[str],
328
- pasted_text: Optional[str],
329
- ) -> str:
330
- """Process a research paper and return a structured summary.
331
 
332
- Accepts either a PDF file path (from Gradio upload) or raw pasted text.
333
- Returns a Markdown-formatted structured summary.
334
  """
335
  # ------------------------------------------------------------------
336
- # 1. Obtain raw text
337
  # ------------------------------------------------------------------
338
  if pdf_file is not None:
339
- logger.info("Processing uploaded PDF: %s", pdf_file)
340
- try:
341
- raw_text = extract_text_from_pdf(pdf_file)
342
- except ValueError as exc:
343
- return f"**Error:** {exc}"
344
- elif pasted_text and pasted_text.strip():
345
- raw_text = pasted_text.strip()
346
- else:
347
- return (
348
- "**Error:** Please upload a PDF file or paste the paper text. "
349
- "Both inputs are currently empty."
350
- )
351
-
352
- text = clean_text(raw_text)
353
- original_word_count = len(text.split())
354
 
355
- if original_word_count < 30:
356
- return (
357
- "**Error:** The extracted text is too short to summarize. "
358
- "Please provide a longer document or check that the PDF contains selectable text."
359
- )
360
 
361
- logger.info("Cleaned text: %d words.", original_word_count)
 
 
 
362
 
363
  # ------------------------------------------------------------------
364
- # 2. Extract structured components
365
  # ------------------------------------------------------------------
366
- title = extract_title(text)
367
- concise_summary = generate_full_summary(text)
368
- key_findings = extract_key_findings(text)
369
- methodology = extract_methodology(text)
370
 
371
- summary_word_count = len(concise_summary.split())
372
 
373
  # ------------------------------------------------------------------
374
- # 3. Format the output
375
  # ------------------------------------------------------------------
376
- output = f"""## {title}
 
377
 
378
- ---
 
 
 
 
 
 
 
379
 
380
- ### Concise Summary
381
- {concise_summary}
 
 
382
 
383
- ---
 
 
 
 
 
 
384
 
385
- ### Key Findings
386
- {key_findings}
387
 
388
- ---
389
 
390
- ### Methodology
391
- {methodology}
 
392
 
393
- ---
 
 
 
 
394
 
395
- ### Statistics
396
- | Metric | Value |
397
- |---|---|
398
- | Original length | {original_word_count:,} words |
399
- | Summary length | {summary_word_count:,} words |
400
- | Compression ratio | {original_word_count / max(summary_word_count, 1):.1f}x |
401
- """
402
- return output
403
 
 
404
 
405
- # ===========================================================================
406
- # Example inputs for the Gradio demo
407
- # ===========================================================================
408
 
409
- EXAMPLE_TEXT = """Attention Is All You Need
410
 
411
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
412
 
413
- Abstract
414
- The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
 
 
415
 
416
- Introduction
417
- Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures. Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht-1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
 
 
 
 
418
 
419
- Methods
420
- The Transformer follows an encoder-decoder structure using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder. The encoder maps an input sequence of symbol representations to a sequence of continuous representations. Given z, the decoder then generates an output sequence of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next. The Transformer uses multi-head attention to allow the model to jointly attend to information from different representation subspaces at different positions.
421
 
422
- Results
423
- On the WMT 2014 English-to-German translation task, the big transformer model outperforms the best previously reported models including ensembles by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.
424
 
425
- Conclusions
426
- In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. The Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. We achieved new state of the art on both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video."""
 
 
 
 
 
 
 
427
 
428
 
429
- # ===========================================================================
430
  # Gradio interface
431
- # ===========================================================================
432
 
433
  def build_interface() -> gr.Blocks:
434
  """Construct and return the Gradio Blocks interface."""
 
 
 
 
 
435
 
436
- with gr.Blocks(
437
- title="Paper Summarizer",
438
- theme=gr.themes.Soft(
439
- primary_hue="indigo",
440
- secondary_hue="blue",
441
- ),
442
- css="""
443
- .header-text { text-align: center; margin-bottom: 0.5em; }
444
- .subheader { text-align: center; color: #6b7280; margin-top: 0; }
445
- footer { display: none !important; }
446
- """,
447
- ) as demo:
448
- # --- Header ---
449
  gr.Markdown(
450
- """
451
- <h1 class="header-text">Paper Summarizer</h1>
452
- <p class="subheader">
453
- Summarize academic research papers into structured, digestible insights.<br>
454
- Upload a PDF or paste the full text below.
455
- </p>
456
- """,
457
  )
458
 
459
- with gr.Row():
460
- # --- Input column ---
461
  with gr.Column(scale=1):
462
- gr.Markdown("### Input")
463
- pdf_input = gr.File(
464
- label="Upload PDF",
 
 
 
 
 
465
  file_types=[".pdf"],
466
  type="filepath",
467
  )
468
- text_input = gr.Textbox(
469
- label="Or paste paper text",
470
- placeholder="Paste the full text of a research paper here...",
471
- lines=12,
472
- max_lines=30,
473
- )
474
- submit_btn = gr.Button("Summarize", variant="primary", size="lg")
475
-
476
- # --- Output column ---
477
  with gr.Column(scale=1):
478
- gr.Markdown("### Structured Summary")
479
- output = gr.Markdown(
480
- value="*Your summary will appear here after processing.*",
481
- label="Summary",
 
482
  )
483
 
484
- # --- Examples ---
485
- gr.Markdown("---")
486
- gr.Markdown("### Try an Example")
487
- gr.Examples(
488
- examples=[[None, EXAMPLE_TEXT]],
489
- inputs=[pdf_input, text_input],
490
- outputs=output,
491
- fn=process_paper,
492
- cache_examples=False,
493
- label="Click to load example paper text",
 
 
 
 
 
 
494
  )
495
 
496
- # --- About ---
497
- with gr.Accordion("About this Space", open=False):
498
- gr.Markdown(
499
- """
500
- **Paper Summarizer** uses
501
- [`facebook/bart-large-cnn`](https://huggingface.co/facebook/bart-large-cnn)
502
- to generate abstractive summaries of academic papers.
503
-
504
- **Features**
505
- - PDF upload with automatic text extraction (PyMuPDF)
506
- - Intelligent chunking for papers of any length
507
- - Structured output: title, key findings, methodology, and concise summary
508
- - Word-count statistics and compression ratio
509
-
510
- **Limitations**
511
- - Scanned PDFs (image-only) are not supported; the PDF must contain selectable text.
512
- - Summarization quality depends on the input text quality and structure.
513
- - Running on free CPU tier; very long papers may take a minute to process.
514
-
515
- Built by [Lorenzo Scaturchio](https://huggingface.co/gr8monk3ys).
516
- """
517
- )
518
-
519
- # --- Event binding ---
520
- submit_btn.click(
521
- fn=process_paper,
522
- inputs=[pdf_input, text_input],
523
- outputs=output,
524
  )
525
 
526
  return demo
527
 
528
 
529
- # ===========================================================================
530
  # Entry point
531
- # ===========================================================================
532
-
533
  if __name__ == "__main__":
534
  app = build_interface()
535
  app.launch()
 
1
  """
2
+ Resume Analyzer - AI-Powered Resume Analysis Against Job Descriptions
 
3
 
4
+ This Gradio application uses NLP techniques to evaluate how well a resume
5
+ matches a given job description. It provides semantic similarity scoring,
6
+ keyword extraction, gap analysis, and actionable improvement suggestions.
 
7
 
8
  Author: Lorenzo Scaturchio (gr8monk3ys)
9
  License: MIT
10
  """
11
 
 
12
  import re
13
  import logging
14
  from typing import Optional
15
 
16
  import fitz # PyMuPDF
17
  import gradio as gr
18
+ import numpy as np
19
+ from sentence_transformers import SentenceTransformer
20
+ from sklearn.feature_extraction.text import TfidfVectorizer
21
+ from sklearn.metrics.pairwise import cosine_similarity
22
 
23
  # ---------------------------------------------------------------------------
24
  # Logging
25
  # ---------------------------------------------------------------------------
26
+ logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
 
 
 
27
  logger = logging.getLogger(__name__)
28
 
29
  # ---------------------------------------------------------------------------
30
  # Constants
31
  # ---------------------------------------------------------------------------
32
+ MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
33
+
34
+ RESUME_SECTIONS = {
35
+ "experience": [
36
+ "experience", "work experience", "professional experience",
37
+ "employment history", "work history", "career history",
38
+ ],
39
+ "education": [
40
+ "education", "academic background", "academic history",
41
+ "qualifications", "certifications", "degrees",
42
+ ],
43
+ "skills": [
44
+ "skills", "technical skills", "core competencies",
45
+ "competencies", "proficiencies", "technologies", "tools",
46
+ ],
47
+ "projects": [
48
+ "projects", "personal projects", "portfolio",
49
+ "key projects", "selected projects",
50
+ ],
51
+ "summary": [
52
+ "summary", "professional summary", "profile",
53
+ "objective", "career objective", "about me", "overview",
54
+ ],
55
+ }
56
 
57
  # ---------------------------------------------------------------------------
58
+ # Example data shipped with the Space
59
  # ---------------------------------------------------------------------------
60
+ EXAMPLE_RESUME = """LORENZO SCATURCHIO
61
+ San Francisco, CA | lorenzo@email.com | linkedin.com/in/lorenzo | github.com/gr8monk3ys
62
+
63
+ PROFESSIONAL SUMMARY
64
+ Results-driven Machine Learning Engineer with 4+ years of experience designing,
65
+ building, and deploying production ML systems. Skilled in deep learning, NLP,
66
+ and scalable data pipelines. Passionate about applying AI to solve real-world
67
+ problems and delivering measurable business impact.
68
+
69
+ EXPERIENCE
70
+
71
+ Senior Machine Learning Engineer | Acme AI Corp | Jan 2022 - Present
72
+ - Designed and deployed a real-time recommendation engine serving 2M+ daily
73
+ active users, improving click-through rate by 18%.
74
+ - Built an end-to-end NLP pipeline for document classification using
75
+ Transformers (BERT, RoBERTa) with 94% F1 score.
76
+ - Led migration of model training infrastructure to Kubernetes, reducing
77
+ training time by 40% and cloud costs by 25%.
78
+ - Mentored a team of 3 junior engineers on MLOps best practices.
79
+
80
+ Machine Learning Engineer | DataWave Inc | Jun 2020 - Dec 2021
81
+ - Developed a customer churn prediction model (XGBoost, LightGBM) that
82
+ reduced churn by 12%, saving $1.2M annually.
83
+ - Implemented A/B testing framework for ML model evaluation in production.
84
+ - Created automated data quality checks and feature engineering pipelines
85
+ using Apache Spark and Airflow.
86
+
87
+ Data Science Intern | StartUp Labs | May 2019 - Aug 2019
88
+ - Conducted exploratory data analysis on 500K+ records to identify key
89
+ business drivers.
90
+ - Built sentiment analysis prototype using spaCy and scikit-learn.
91
+
92
+ EDUCATION
93
+ M.S. Computer Science (Machine Learning Specialization) | Stanford University | 2020
94
+ B.S. Computer Science | UC Berkeley | 2018
95
+
96
+ SKILLS
97
+ Languages: Python, SQL, Java, Scala, R
98
+ ML/DL Frameworks: PyTorch, TensorFlow, scikit-learn, Hugging Face Transformers
99
+ MLOps: Docker, Kubernetes, MLflow, Weights & Biases, Airflow
100
+ Cloud: AWS (SageMaker, EC2, S3), GCP (Vertex AI, BigQuery)
101
+ Data: Spark, Pandas, NumPy, PostgreSQL, Redis
102
+ Other: Git, CI/CD, REST APIs, Agile/Scrum
103
+
104
+ PROJECTS
105
+ - Open-source NLP toolkit for resume analysis (this project!)
106
+ - Real-time object detection system using YOLOv5 on edge devices
107
+ - Kaggle competition top-5% finish in Tabular Playground Series
108
+ """
109
 
110
+ EXAMPLE_JOB_DESCRIPTION = """Senior Machine Learning Engineer
111
+
112
+ About the Role
113
+ We are looking for a Senior Machine Learning Engineer to join our AI Platform
114
+ team. You will design, build, and maintain production ML systems that power
115
+ our core product features. This is a high-impact role where you will work
116
+ closely with product, engineering, and data science teams.
117
+
118
+ Responsibilities
119
+ - Design and implement scalable ML pipelines for training and inference.
120
+ - Develop and deploy deep learning models for NLP and computer vision tasks.
121
+ - Build robust monitoring and alerting for model performance in production.
122
+ - Collaborate with cross-functional teams to translate business requirements
123
+ into ML solutions.
124
+ - Mentor junior engineers and contribute to engineering best practices.
125
+ - Drive adoption of MLOps tools and processes across the organization.
126
+
127
+ Requirements
128
+ - 3+ years of experience in machine learning engineering or a related role.
129
+ - Strong proficiency in Python and SQL.
130
+ - Hands-on experience with ML frameworks such as PyTorch or TensorFlow.
131
+ - Experience deploying models to production using Docker and Kubernetes.
132
+ - Familiarity with cloud platforms (AWS or GCP).
133
+ - Solid understanding of NLP techniques (transformers, embeddings, etc.).
134
+ - Experience with data processing frameworks like Spark or similar.
135
+ - Strong communication skills and ability to work in an Agile environment.
136
+
137
+ Nice to Have
138
+ - Experience with recommendation systems or search ranking.
139
+ - Contributions to open-source ML projects.
140
+ - M.S. or Ph.D. in Computer Science, Machine Learning, or related field.
141
+ - Experience with MLflow, Weights & Biases, or similar experiment tracking.
142
+ - Familiarity with CI/CD pipelines for ML.
143
+ """
144
 
145
+ # ---------------------------------------------------------------------------
146
+ # Model loading (cached at module level for the HF Space)
147
+ # ---------------------------------------------------------------------------
148
+ logger.info("Loading sentence-transformer model: %s", MODEL_NAME)
149
+ _model = SentenceTransformer(MODEL_NAME)
150
+ logger.info("Model loaded successfully.")
151
 
 
 
152
 
153
+ # =========================================================================
154
+ # Core analysis utilities
155
+ # =========================================================================
156
 
157
+ def extract_text_from_pdf(pdf_path: str) -> str:
158
+ """Extract plain text from a PDF file using PyMuPDF."""
 
159
  try:
160
  doc = fitz.open(pdf_path)
161
+ pages = [page.get_text() for page in doc]
162
+ doc.close()
163
+ text = "\n".join(pages).strip()
164
+ if not text:
165
+ raise ValueError("The PDF appears to contain no extractable text.")
166
+ return text
167
  except Exception as exc:
168
+ logger.error("PDF extraction failed: %s", exc)
169
+ raise gr.Error(f"Could not read the PDF file: {exc}") from exc
 
170
 
 
171
 
172
+ def compute_semantic_similarity(text_a: str, text_b: str) -> float:
173
+ """Return cosine similarity (0-1) between two texts using the sentence-transformer."""
174
+ embeddings = _model.encode([text_a, text_b], convert_to_numpy=True)
175
+ similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
176
+ return float(np.clip(similarity, 0.0, 1.0))
177
 
 
 
178
 
179
+ def extract_keywords(texts: list[str], top_n: int = 30) -> list[list[str]]:
 
180
  """
181
+ Use TF-IDF to extract the most important keywords from each document
182
+ in *texts*. Returns a list of keyword-lists, one per input document.
 
 
183
  """
184
+ vectorizer = TfidfVectorizer(
185
+ stop_words="english",
186
+ max_features=500,
187
+ ngram_range=(1, 2),
188
+ token_pattern=r"(?u)\b[a-zA-Z][a-zA-Z+#.\-]{1,}\b",
189
+ )
190
+ tfidf_matrix = vectorizer.fit_transform(texts)
191
+ feature_names = np.array(vectorizer.get_feature_names_out())
 
 
192
 
193
+ results: list[list[str]] = []
194
+ for row_idx in range(tfidf_matrix.shape[0]):
195
+ row = tfidf_matrix[row_idx].toarray().flatten()
196
+ top_indices = row.argsort()[-top_n:][::-1]
197
+ keywords = [feature_names[i] for i in top_indices if row[i] > 0]
198
+ results.append(keywords)
199
+ return results
200
 
 
201
 
202
+ def _normalize(text: str) -> str:
203
+ """Lowercase and strip extra whitespace."""
204
+ return re.sub(r"\s+", " ", text.lower()).strip()
205
 
 
 
206
 
207
+ def find_matching_and_missing_keywords(
208
+ resume_text: str, job_keywords: list[str]
209
+ ) -> tuple[list[str], list[str]]:
210
  """
211
+ Compare job-description keywords against the resume body.
212
+ Returns (matched, missing) keyword lists.
213
+ """
214
+ resume_lower = _normalize(resume_text)
215
+ matched, missing = [], []
216
+ for kw in job_keywords:
217
+ if _normalize(kw) in resume_lower:
218
+ matched.append(kw)
219
+ else:
220
+ missing.append(kw)
221
+ return matched, missing
222
 
 
 
223
 
224
+ def detect_sections(resume_text: str) -> dict[str, str]:
225
+ """
226
+ Heuristically split a resume into named sections.
227
+ Returns a dict mapping canonical section names to their content.
228
+ """
229
+ lines = resume_text.split("\n")
230
+ current_section: Optional[str] = None
231
+ sections: dict[str, list[str]] = {}
 
 
 
 
 
 
232
 
233
+ for line in lines:
234
+ stripped = line.strip()
235
+ matched_section = _match_section_header(stripped)
236
+ if matched_section:
237
+ current_section = matched_section
238
+ sections.setdefault(current_section, [])
239
+ elif current_section:
240
+ sections[current_section].append(stripped)
241
+ else:
242
+ sections.setdefault("other", []).append(stripped)
243
+
244
+ return {name: "\n".join(content).strip() for name, content in sections.items()}
245
+
246
+
247
+ def _match_section_header(line: str) -> Optional[str]:
248
+ """Return a canonical section name if *line* looks like a header, else None."""
249
+ cleaned = re.sub(r"[^a-z\s]", "", line.lower()).strip()
250
+ if not cleaned or len(cleaned) > 40:
251
+ return None
252
+ for canonical, variants in RESUME_SECTIONS.items():
253
+ if cleaned in variants:
254
+ return canonical
255
+ return None
256
+
257
+
258
+ def analyze_section(
259
+ section_name: str, section_text: str, job_text: str
260
+ ) -> dict[str, object]:
261
+ """Compute a per-section relevance score and short commentary."""
262
+ if not section_text.strip():
263
+ return {
264
+ "score": 0.0,
265
+ "comment": f"No {section_name} section detected in the resume.",
266
+ }
267
+ score = compute_semantic_similarity(section_text, job_text)
268
+ pct = round(score * 100, 1)
269
+
270
+ if pct >= 70:
271
+ quality = "Strong"
272
+ elif pct >= 45:
273
+ quality = "Moderate"
274
+ else:
275
+ quality = "Weak"
276
 
277
+ comment = f"{quality} alignment ({pct}%) with the job description."
278
+ return {"score": score, "comment": comment}
279
 
 
 
 
 
 
 
 
 
280
 
281
+ def generate_suggestions(
282
+ missing_keywords: list[str],
283
+ section_scores: dict[str, dict],
284
+ overall_score: float,
285
+ ) -> list[str]:
286
+ """Return a list of actionable improvement suggestions."""
287
+ suggestions: list[str] = []
288
 
289
+ if overall_score < 0.45:
290
+ suggestions.append(
291
+ "Your resume has low overall alignment with this job description. "
292
+ "Consider tailoring it more directly to the role's requirements."
293
+ )
294
 
295
+ # Missing keywords
296
+ if missing_keywords:
297
+ top_missing = missing_keywords[:10]
298
+ suggestions.append(
299
+ "Add these high-value keywords or phrases where truthfully applicable: "
300
+ + ", ".join(f'"{kw}"' for kw in top_missing)
301
+ + "."
302
+ )
303
 
304
+ # Section-specific advice
305
+ for name, data in section_scores.items():
306
+ score = data["score"]
307
+ if name == "experience" and score < 0.5:
308
+ suggestions.append(
309
+ "Strengthen your Experience section by using action verbs and "
310
+ "quantifiable achievements that mirror the job requirements."
311
+ )
312
+ if name == "skills" and score < 0.5:
313
+ suggestions.append(
314
+ "Your Skills section could be improved. List specific tools, "
315
+ "frameworks, and technologies mentioned in the job posting."
316
+ )
317
+ if name == "education" and score < 0.3:
318
+ suggestions.append(
319
+ "Consider highlighting relevant coursework, certifications, or "
320
+ "academic projects in your Education section."
321
+ )
322
 
323
+ if "summary" not in section_scores or not section_scores.get("summary", {}).get("score", 0):
324
+ suggestions.append(
325
+ "Add a Professional Summary at the top of your resume that directly "
326
+ "addresses the key requirements of this role."
327
+ )
328
 
329
+ if not suggestions:
330
+ suggestions.append(
331
+ "Your resume is well-aligned with this job description. "
332
+ "Keep refining with specific metrics and results."
 
 
 
 
333
  )
 
 
 
 
334
 
335
+ return suggestions
336
 
 
 
 
337
 
338
+ # =========================================================================
339
+ # Main analysis orchestrator
340
+ # =========================================================================
341
 
342
+ def run_analysis(
343
+ resume_text: str,
344
+ job_description: str,
345
+ pdf_file: Optional[str] = None,
346
+ ) -> tuple[str, str, str, str]:
347
  """
348
+ Run the full resume analysis pipeline.
 
 
349
 
350
+ Returns a 4-tuple of Markdown strings:
351
+ (overview, keywords_report, section_report, suggestions_report)
352
  """
353
  # ------------------------------------------------------------------
354
+ # Input resolution
355
  # ------------------------------------------------------------------
356
  if pdf_file is not None:
357
+ resume_text = extract_text_from_pdf(pdf_file)
 
 
358
 
359
+ if not resume_text or not resume_text.strip():
360
+ raise gr.Error("Please provide resume text or upload a PDF.")
361
+ if not job_description or not job_description.strip():
362
+ raise gr.Error("Please provide a job description.")
 
363
 
364
+ # ------------------------------------------------------------------
365
+ # 1. Semantic similarity
366
+ # ------------------------------------------------------------------
367
+ raw_similarity = compute_semantic_similarity(resume_text, job_description)
368
 
369
  # ------------------------------------------------------------------
370
+ # 2. Keyword analysis
371
  # ------------------------------------------------------------------
372
+ keyword_lists = extract_keywords([resume_text, job_description], top_n=30)
373
+ resume_keywords, job_keywords = keyword_lists[0], keyword_lists[1]
374
+ matched_kw, missing_kw = find_matching_and_missing_keywords(resume_text, job_keywords)
 
375
 
376
+ keyword_overlap = len(matched_kw) / max(len(job_keywords), 1)
377
 
378
  # ------------------------------------------------------------------
379
+ # 3. Composite score (60% semantic + 40% keyword overlap)
380
  # ------------------------------------------------------------------
381
+ composite = 0.6 * raw_similarity + 0.4 * keyword_overlap
382
+ overall_pct = round(composite * 100, 1)
383
 
384
+ # ------------------------------------------------------------------
385
+ # 4. Section-by-section analysis
386
+ # ------------------------------------------------------------------
387
+ sections = detect_sections(resume_text)
388
+ section_scores: dict[str, dict] = {}
389
+ for sec_name in ["summary", "experience", "education", "skills", "projects"]:
390
+ sec_text = sections.get(sec_name, "")
391
+ section_scores[sec_name] = analyze_section(sec_name, sec_text, job_description)
392
 
393
+ # ------------------------------------------------------------------
394
+ # 5. Suggestions
395
+ # ------------------------------------------------------------------
396
+ suggestions = generate_suggestions(missing_kw, section_scores, composite)
397
 
398
+ # ------------------------------------------------------------------
399
+ # Format outputs as Markdown
400
+ # ------------------------------------------------------------------
401
+ overview_md = _format_overview(overall_pct, raw_similarity, keyword_overlap, matched_kw, missing_kw)
402
+ keywords_md = _format_keywords(resume_keywords, job_keywords, matched_kw, missing_kw)
403
+ sections_md = _format_sections(section_scores)
404
+ suggest_md = _format_suggestions(suggestions)
405
 
406
+ return overview_md, keywords_md, sections_md, suggest_md
 
407
 
 
408
 
409
+ # =========================================================================
410
+ # Markdown formatters
411
+ # =========================================================================
412
 
413
+ def _score_bar(pct: float, width: int = 20) -> str:
414
+ """Return a text-based progress bar for Markdown."""
415
+ filled = round(pct / 100 * width)
416
+ empty = width - filled
417
+ return f"`[{'=' * filled}{' ' * empty}]` **{pct}%**"
418
 
 
 
 
 
 
 
 
 
419
 
420
+ def _format_overview(
421
+ overall_pct: float,
422
+ semantic_sim: float,
423
+ keyword_overlap: float,
424
+ matched: list[str],
425
+ missing: list[str],
426
+ ) -> str:
427
+ sem_pct = round(semantic_sim * 100, 1)
428
+ kw_pct = round(keyword_overlap * 100, 1)
429
+
430
+ if overall_pct >= 70:
431
+ verdict = "Excellent match - your resume aligns strongly with this role."
432
+ elif overall_pct >= 50:
433
+ verdict = "Good match - some targeted improvements could strengthen your application."
434
+ elif overall_pct >= 30:
435
+ verdict = "Partial match - significant tailoring is recommended."
436
+ else:
437
+ verdict = "Low match - consider whether this role fits your background or rewrite substantially."
438
+
439
+ return (
440
+ f"## Overall Match Score\n\n"
441
+ f"# {_score_bar(overall_pct)}\n\n"
442
+ f"**Verdict:** {verdict}\n\n"
443
+ f"---\n\n"
444
+ f"### Score Breakdown\n\n"
445
+ f"| Component | Score |\n"
446
+ f"|---|---|\n"
447
+ f"| Semantic Similarity | {sem_pct}% |\n"
448
+ f"| Keyword Overlap | {kw_pct}% |\n"
449
+ f"| **Composite (60/40)** | **{overall_pct}%** |\n\n"
450
+ f"---\n\n"
451
+ f"**Matched Keywords:** {len(matched)} &nbsp;|&nbsp; "
452
+ f"**Missing Keywords:** {len(missing)}\n"
453
+ )
454
 
 
 
 
455
 
456
+ def _format_keywords(
457
+ resume_kw: list[str],
458
+ job_kw: list[str],
459
+ matched: list[str],
460
+ missing: list[str],
461
+ ) -> str:
462
+ matched_str = ", ".join(f"**{kw}**" for kw in matched) if matched else "_None detected_"
463
+ missing_str = ", ".join(f"~~{kw}~~" for kw in missing) if missing else "_None - great coverage!_"
464
+ resume_str = ", ".join(resume_kw[:20]) if resume_kw else "_None detected_"
465
+ job_str = ", ".join(job_kw[:20]) if job_kw else "_None detected_"
466
+
467
+ return (
468
+ f"## Keyword Analysis\n\n"
469
+ f"### Matched Keywords (found in your resume)\n{matched_str}\n\n"
470
+ f"---\n\n"
471
+ f"### Missing Keywords (consider adding)\n{missing_str}\n\n"
472
+ f"---\n\n"
473
+ f"### Top Resume Keywords (TF-IDF)\n{resume_str}\n\n"
474
+ f"### Top Job Description Keywords (TF-IDF)\n{job_str}\n"
475
+ )
476
 
 
477
 
478
+ def _format_sections(section_scores: dict[str, dict]) -> str:
479
+ header = "## Section-by-Section Analysis\n\n"
480
+ rows = "| Section | Score | Assessment |\n|---|---|---|\n"
481
+ details = ""
482
 
483
+ for name, data in section_scores.items():
484
+ pct = round(data["score"] * 100, 1)
485
+ comment = data["comment"]
486
+ display_name = name.replace("_", " ").title()
487
+ rows += f"| {display_name} | {pct}% | {comment} |\n"
488
+ details += f"### {display_name}\n{comment}\n\n"
489
 
490
+ return header + rows + "\n---\n\n" + details
 
491
 
 
 
492
 
493
+ def _format_suggestions(suggestions: list[str]) -> str:
494
+ items = "\n".join(f"{i}. {s}" for i, s in enumerate(suggestions, 1))
495
+ return (
496
+ f"## Improvement Suggestions\n\n"
497
+ f"{items}\n\n"
498
+ f"---\n\n"
499
+ f"*Tip: Tailor your resume for every application. Mirror the language "
500
+ f"used in the job posting while remaining truthful about your experience.*\n"
501
+ )
502
 
503
 
504
+ # =========================================================================
505
  # Gradio interface
506
+ # =========================================================================
507
 
508
  def build_interface() -> gr.Blocks:
509
  """Construct and return the Gradio Blocks interface."""
510
+ theme = gr.themes.Soft(
511
+ primary_hue="teal",
512
+ secondary_hue="green",
513
+ font=[gr.themes.GoogleFont("Inter"), "system-ui", "sans-serif"],
514
+ )
515
 
516
+ with gr.Blocks(theme=theme, title="Resume Analyzer") as demo:
 
 
517
  gr.Markdown(
518
+ "# Resume Analyzer\n"
519
+ "Evaluate how well your resume matches a job description using "
520
+ "**semantic similarity** and **keyword analysis**.\n\n"
521
+ "Paste your resume text (or upload a PDF) and the target job "
522
+ "description, then click **Analyze**."
 
 
523
  )
524
 
525
+ with gr.Row(equal_height=True):
 
526
  with gr.Column(scale=1):
527
+ resume_text = gr.Textbox(
528
+ label="Resume Text",
529
+ placeholder="Paste your resume here...",
530
+ lines=18,
531
+ value=EXAMPLE_RESUME,
532
+ )
533
+ pdf_upload = gr.File(
534
+ label="Or Upload Resume PDF",
535
  file_types=[".pdf"],
536
  type="filepath",
537
  )
 
 
 
 
 
 
 
 
 
538
  with gr.Column(scale=1):
539
+ job_desc = gr.Textbox(
540
+ label="Job Description",
541
+ placeholder="Paste the job description here...",
542
+ lines=22,
543
+ value=EXAMPLE_JOB_DESCRIPTION,
544
  )
545
 
546
+ analyze_btn = gr.Button("Analyze", variant="primary", size="lg")
547
+
548
+ with gr.Tabs():
549
+ with gr.Tab("Overview"):
550
+ overview_output = gr.Markdown()
551
+ with gr.Tab("Keywords"):
552
+ keywords_output = gr.Markdown()
553
+ with gr.Tab("Sections"):
554
+ sections_output = gr.Markdown()
555
+ with gr.Tab("Suggestions"):
556
+ suggestions_output = gr.Markdown()
557
+
558
+ analyze_btn.click(
559
+ fn=run_analysis,
560
+ inputs=[resume_text, job_desc, pdf_upload],
561
+ outputs=[overview_output, keywords_output, sections_output, suggestions_output],
562
  )
563
 
564
+ gr.Markdown(
565
+ "---\n"
566
+ "Built by [Lorenzo Scaturchio](https://huggingface.co/gr8monk3ys) | "
567
+ "Model: `sentence-transformers/all-MiniLM-L6-v2` | "
568
+ "[Source Code](https://huggingface.co/spaces/gr8monk3ys/resume-analyzer-space)"
 
 
569
  )
570
 
571
  return demo
572
 
573
 
574
+ # ---------------------------------------------------------------------------
575
  # Entry point
576
+ # ---------------------------------------------------------------------------
 
577
  if __name__ == "__main__":
578
  app = build_interface()
579
  app.launch()
requirements.txt CHANGED
@@ -1,3 +1,5 @@
- gradio==4.44.0
- huggingface_hub==0.22.2
- PyMuPDF>=1.24.0
+ gradio==5.9.1
+ sentence-transformers>=2.2.0
+ scikit-learn>=1.0.0
+ pymupdf>=1.24.0
+ numpy>=1.20.0