Spaces: Sher1988/Resume-Intelligence-Chat
Sher1988 committed on Apr 21

Commit

626428e

1 Parent(s): 0ceb825

initial commit

Files changed (13)

.gitattributes +5 -0
.gitignore +34 -0
LICENSE +21 -0
README.md +157 -9
app.py +102 -0
core/chains.py +87 -0
core/ingestion/docling_loader.py +32 -0
core/llm_model.py +88 -0
core/parsing/extractor.py +30 -0
core/parsing/schema.py +65 -0
core/processing/database.py +46 -0
core/processing/dataframe.py +46 -0
requirements.txt +9 -0

.gitattributes CHANGED

@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+flickr8k/images/101654506_8eb26cfb60.jpg filter=lfs diff=lfs merge=lfs -text
+flickr8k/images/109202756_b97fcdc62c.jpg filter=lfs diff=lfs merge=lfs -text
+flickr8k/images/136644343_0e2b423829.jpg filter=lfs diff=lfs merge=lfs -text
+flickr8k/images/47870024_73a4481f7d.jpg filter=lfs diff=lfs merge=lfs -text
+flickr8k/images/47871819_db55ac4699.jpg filter=lfs diff=lfs merge=lfs -text

.gitignore ADDED

	@@ -0,0 +1,34 @@

+# Environment variables
+.env
+# Python cache
+__pycache__/
+*.pyc
+# Virtual environment
+.venv/
+# OS files
+.DS_Store
+Thumbs.db
+# Logs
+*.log
+# Data (optional)
+data/
+# Jupyter
+.ipynb_checkpoints/
+# Build
+dist/
+build/
+# IDE
+.vscode/
+.idea/
+# ignore dockerfile for github upload.
+Dockerfile

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2026 Sher Alam
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -1,12 +1,160 @@
 ---
-title: Resume Intelligence Chat
-emoji: 🏆
-colorFrom: blue
-colorTo: indigo
-sdk: docker
-pinned: false
-license: mit
-short_description: LLM Powered Resume Data Extraction and Intelligence Chat app
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: Resume-Intelligence-Chat
+emoji: 📸
+sdk: streamlit
+sdk_version: 1.37.1
+app_file: app.py
 ---
+## Resume Intelligence Chat
+A Streamlit app that ingests resumes, structures them into a schema, stores them in a SQLite database, and enables natural language querying using LLM-generated SQL.
+---
+## Features
+* **Resume Parsing**: Uses Docling to extract raw text from PDF resumes
+* **Structured Extraction**: Pydantic AI agent converts text into a typed `Resume` schema
+* **Database Storage**: Extracted data is stored in SQLite
+* **Natural Language Queries**:
+  * User asks a question
+  * LLM generates SQL query
+  * Query runs on database
+  * Result is fed back to LLM for final answer
+* **Chat Interface**: Streamlit-based conversational UI
+* **Database Control**: Option to delete/reset database
+---
+## Architecture
+```
+PDF Resume
+   ↓
+Docling Parser
+   ↓
+Pydantic AI Agent → Resume Schema
+   ↓
+SQLite Database
+   ↓
+User Query (NL)
+   ↓
+LLM → SQL Query
+   ↓
+Execute on DB
+   ↓
+LLM → Final Answer
+```
+---
+## Project Structure
+```
+.
+├── app.py
+├── core/
+│   ├── chains.py
+│   ├── llm_model.py
+│   ├── ingestion/
+│   │   └── docling_loader.py
+│   ├── parsing/
+│   │   ├── extractor.py
+│   │   └── schema.py
+│   └── processing/
+│       ├── database.py
+│       └── dataframe.py
+└── data/db/
+    └── resumes.db
+```
+---
+## Setup
+### 1. Install dependencies
+```bash
+pip install -r requirements.txt
+```
+### 2. Run app
+```bash
+streamlit run app.py
+```
+---
+## Usage
+### Upload & Index
+1. Upload one or more PDF resumes
+2. Click **"Process & Index Resumes"**
+3. Data is parsed, structured, and stored
+### Query
+* Ask questions like:
+  * `List all candidate names`
+  * `List complex projects with candidate name`
+  * `Show candidates with 5+ years experience`
+### Database Reset
+* Check **Confirm delete database**
+* Click **Delete Database**
+---
+## Core Components
+### 1. Docling Loader
+Extracts clean text from PDF resumes.
+### 2. Resume Extractor
+Uses Pydantic AI agent to map text → structured `Resume` object.
+### 3. SQLite Storage
+Stores structured resume data for querying.
+### 4. SQL Generator
+LLM converts user query → SQL statement.
+### 5. Answer Generator
+LLM converts DB results → natural language response.
+---
+## Safety Checks
+* Only allows `SELECT` / `WITH` queries
+* Rejects irrelevant or unsafe queries
+* Handles empty results (`NO_DATA`)
+---
+## Notes
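+* Uses two LLM calls per question (SQL generation, then answer generation); out-of-scope questions stop after the first call
+* Works with multiple resumes
+* Can target local/offline LLM setups (e.g., Ollama) by swapping the models in `core/llm_model.py`; this commit is configured for GitHub Models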
+---
+## Future Improvements
+* Human-in-the-loop SQL approval
+* Multi-table schema support
+* Better handling of missing/empty fields
+* Query caching for performance
+---
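Review note: the query flow described in the README (question, generated SQL, execution, final answer) can be sketched without any model dependency. This is a minimal illustration, not the app's implementation: `answer_question`, `make_sql`, and `make_answer` are hypothetical names, the two callables stand in for the LLM calls, and the standard-library `sqlite3` module replaces the LangChain `SQLDatabase` wrapper.

```python
import sqlite3
from typing import Callable

IRRELEVANT = "IRRELEVANT QUERY"  # sentinel used by the SQL prompt in this commit


def answer_question(
    conn: sqlite3.Connection,
    question: str,
    make_sql: Callable[[str], str],
    make_answer: Callable[[str, str], str],
) -> str:
    """Question -> SQL -> rows -> answer, with the model calls injected."""
    sql = make_sql(question).strip()
    if sql == IRRELEVANT:
        # Out-of-scope questions never reach the database or the second call.
        return "This question is outside the resume database scope."
    if not sql.upper().startswith(("SELECT", "WITH")):
        raise ValueError("Invalid SQL generated")
    rows = conn.execute(sql).fetchall()
    # An empty result is passed on as an explicit marker, not an empty string.
    result = str(rows) if rows else "NO_DATA"
    return make_answer(question, result)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE resume_base (resume_id TEXT, name TEXT, summary TEXT)")
    conn.execute("INSERT INTO resume_base VALUES ('a1b2c3d4', 'Jane Doe', 'ML engineer')")
    # Stand-ins for the two LLM calls (hypothetical, for illustration only).
    print(
        answer_question(
            conn,
            "List all candidate names",
            lambda q: "SELECT name FROM resume_base",
            lambda q, result: f"Result: {result}",
        )
    )
```

Injecting the two callables keeps the control flow testable with fakes, which is one way to exercise the `NO_DATA` and `IRRELEVANT QUERY` branches without calling a model.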

app.py ADDED

	@@ -0,0 +1,102 @@

+import ast
+import streamlit as st
+import tempfile
+import os
+from core.chains import generate_sql_query, generate_nl_answer
+from core.ingestion.docling_loader import load_and_convert_cv
+from core.parsing.extractor import extract_resume
+from core.parsing.schema import Resume
+from core.processing.database import resume_to_sqlite
+from langchain_community.utilities import SQLDatabase
+@st.cache_resource
+def get_db():
+    os.makedirs("data/db", exist_ok=True)  # SQLite cannot create missing parent directories
+    return SQLDatabase.from_uri("sqlite:///data/db/resumes.db")
+st.set_page_config(page_title="Resume AI Assistant", layout="wide")
+st.title("🤖 Resume Intelligence Chat")
+# Initialize chat history
+if "messages" not in st.session_state:
+    st.session_state.messages = []
+# Sidebar for File Uploads
+with st.sidebar:
+    st.header("Upload Center")
+    uploaded_files = st.file_uploader(
+        "Upload PDF Resumes",
+        type=["pdf"],
+        accept_multiple_files=True
+    )
+    if uploaded_files and st.button("Process & Index Resumes"):
+        os.makedirs("data/db", exist_ok=True)
+        with st.spinner(f"Indexing {len(uploaded_files)} resumes..."):
+            for uploaded_file in uploaded_files:
+                with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
+                    tmp.write(uploaded_file.getbuffer())
+                    pdf_path = tmp.name
+                try:
+                    text = load_and_convert_cv(pdf_path)
+                    data: Resume = extract_resume(text)
+                    resume_to_sqlite(data, "data/db/resumes.db")
+                finally:
+                    os.remove(pdf_path)
+        st.success("Indexing complete!")
+    st.divider()
+    confirm = st.checkbox("Confirm delete database")
+    # DELETE DB BUTTON
+    if st.button("🗑️ Delete Database"):
+        if confirm:
+            tables = get_db().run("select name from sqlite_master where type='table';")
+            # tables = "[('resume_base',), ('contact',), ('certifications',), ('education',), ('experience',), ('projects',)]"
+            tables = ast.literal_eval(tables) if tables else []  # run() returns "" when there are no rows
+            for table in tables:
+                get_db().run(f'drop table if exists "{table[0]}";')
+            st.success("All tables dropped successfully.")
+        else:
+            st.warning("Please confirm deletion first.")
+# Display chat history
+for message in st.session_state.messages:
+    with st.chat_message(message["role"]):
+        st.markdown(message["content"])
+# Chat Input
+if prompt := st.chat_input("Ask about your resumes (e.g., 'List all candidate names')"):
+    # Add user message to history
+    st.session_state.messages.append({"role": "user", "content": prompt})
+    with st.chat_message("user"):
+        st.markdown(prompt)
+    # Generate response
+    with st.chat_message("assistant"):
+        with st.spinner("Analyzing database..."):
+            try:
+                # create sql query for the user query
+                sql_query = generate_sql_query(prompt).strip()
+                # sql_query = "select name from resume_base;" # fake query
+                print('sql_query: ', sql_query)
+                if sql_query == "IRRELEVANT QUERY":
+                    response = "This question is outside the resume database scope."
+                elif not sql_query.upper().startswith(("SELECT", "WITH")):
+                    raise ValueError("Invalid SQL generated")
+                else:
+                    db_result = get_db().run(sql_query)
+                    if not db_result:
+                        db_result = "NO_DATA"
+                    # create a natural language response based on db results.
+                    response = generate_nl_answer(prompt, db_result)
+                    # response = db_result # fake response
+                    print('response generated')
+                st.markdown(response)
+                st.code(sql_query, language="sql", width="content")  # show query
+                st.session_state.messages.append({"role": "assistant", "content": response})
+            except Exception as e:
+                error_msg = f"Sorry, I ran into an error: {str(e)}"
+                st.error(error_msg)
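Review note: the delete handler reads table names by parsing the string that `SQLDatabase.run` returns. As a possible alternative, here is a small sketch that uses the standard-library `sqlite3` module directly, so rows come back as tuples and no string parsing is needed. `drop_all_tables` is a hypothetical helper, and opening a raw connection instead of going through the cached `SQLDatabase` object is an assumption of this sketch.

```python
import sqlite3


def drop_all_tables(conn: sqlite3.Connection) -> list[str]:
    """Drop every user table and return the names that were dropped."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    names = [row[0] for row in rows]
    for name in names:
        # Double any embedded quote so the name is a valid quoted identifier.
        quoted = '"' + name.replace('"', '""') + '"'
        conn.execute(f"DROP TABLE IF EXISTS {quoted}")
    conn.commit()
    return names
```

Calling it on an empty database returns an empty list, so the "nothing to delete" case needs no special handling.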

core/chains.py ADDED

	@@ -0,0 +1,87 @@

+from langchain_core.prompts import ChatPromptTemplate
+from langchain_core.runnables import RunnableLambda
+from core.llm_model import model2 as llm
+system_prompt_sql = '''
+You are a SQLite SQL generator for a resume database.
+SCHEMA:
+resume_base(resume_id, name, summary)
+contact(resume_id, email, phone, linkedin, github, hugging_face, kaggle)
+certifications(resume_id, certification_name)
+education(resume_id, institution, degree, start_date, end_date)
+experience(resume_id, title, company, start_date, end_date)
+projects(resume_id, name, description, technologies, url, difficulty_score)
+RELATIONS:
+resume_base.resume_id = contact.resume_id
+resume_base.resume_id = education.resume_id
+resume_base.resume_id = experience.resume_id
+resume_base.resume_id = projects.resume_id
+resume_base.resume_id = certifications.resume_id
+TASK:
+Convert user query → valid SQLite SQL.
+OUTPUT RULES:
+- One line only
+- Only SQL
+- No markdown, no backticks, no text
+RELEVANCE:
+- If not answerable from schema → IRRELEVANT QUERY
+QUERY RULES:
+- Use COLLATE NOCASE on searches
+- Use only listed tables/columns
+- Use explicit JOIN when needed
+- Always join on resume_id
+- No invented schema
+'''
+primary_template = ChatPromptTemplate.from_messages([
+    ("system", system_prompt_sql),
+    ("human", "Query: {user_query}")
+])
+primary_chain = primary_template | llm | RunnableLambda(lambda response: response.content)
+def generate_sql_query(user_query):
+    return primary_chain.invoke({
+        "user_query": user_query
+    })
+system_prompt_analyst = '''
+You are a data analyst.
+INPUT:
+- User question
+- SQL query result (from database)
+TASK:
+- Generate a clear natural language answer based ONLY on the SQL result
+- If result is empty, say no matching data found
+- Generate short and concise answer in markdown format
+- Do NOT generate SQL
+- Do NOT hallucinate missing data
+'''
+# Secondary Chain
+secondary_template = ChatPromptTemplate.from_messages([
+    ("system", system_prompt_analyst),
+    ("human", "Query: {user_query}\nResults: {db_results}")
+])
+secondary_chain = secondary_template | llm | RunnableLambda(lambda response: response.content)
+def generate_nl_answer(user_query, db_results):
+    return secondary_chain.invoke({
+        "user_query": user_query,
+        "db_results": db_results
+    })
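Review note: the SQL prompt asks for one line of bare SQL, but models sometimes wrap replies in markdown fences anyway. A defensive post-processing step could look like the sketch below. `normalize_sql` and `is_read_only` are hypothetical helpers, not part of this commit, and the semicolon rule is deliberately conservative (it also rejects a semicolon inside a string literal).

```python
import re

IRRELEVANT = "IRRELEVANT QUERY"  # sentinel defined by the SQL prompt
FENCE = "`" * 3  # markdown code fence, built indirectly to keep this block fence-safe


def normalize_sql(raw: str) -> str:
    """Strip markdown fences and whitespace noise from a model reply."""
    text = raw.strip()
    # Remove an opening fence (optionally tagged, e.g. "sql") and a closing fence.
    text = re.sub(r"^" + FENCE + r"[a-zA-Z]*\s*", "", text)
    text = re.sub(r"\s*" + FENCE + r"$", "", text)
    # Collapse newlines and repeated spaces so the result is a single line.
    text = " ".join(text.split())
    return text.rstrip(";").strip()


def is_read_only(sql: str) -> bool:
    """True for a single statement that starts with SELECT or WITH."""
    if ";" in sql:  # a semicolon left after normalization means extra statements
        return False
    return sql.upper().startswith(("SELECT", "WITH"))
```

A prefix check is only a first filter: SQLite also accepts `WITH ... DELETE`, so opening the database read-only (for example with a `file:...?mode=ro` URI) would be a stronger guarantee.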

core/ingestion/docling_loader.py ADDED

	@@ -0,0 +1,32 @@

+import streamlit as st
+from pathlib import Path
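+from docling.datamodel.base_models import InputFormat
+from docling.datamodel.pipeline_options import PdfPipelineOptions
+from docling.document_converter import DocumentConverter, PdfFormatOption
+@st.cache_resource
+def get_converter():
+    """
+    Initializes and caches the Docling DocumentConverter.
+    This ensures models are only loaded once across app reruns.
+    """
+    # OCR is switched off through the PDF pipeline options.
+    pipeline_options = PdfPipelineOptions(do_ocr=False)
+    return DocumentConverter(
+        format_options={
+            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+        }
+    )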
+def load_and_convert_cv(file_path: str) -> str:
+    """
+    Converts a PDF file to plain text using Docling.
+    Args:
+        file_path (str): The local path to the uploaded CV file.
+    Returns:
+        str: The extracted plain text.
+    """
+    converter = get_converter()
+    result = converter.convert(file_path)
+    text_content = result.document.export_to_text()
+    return text_content

core/llm_model.py ADDED

	@@ -0,0 +1,88 @@

+from langchain_openai import ChatOpenAI
+from pydantic_ai.models.openai import OpenAIChatModel
+from pydantic_ai.providers.openai import OpenAIProvider
+import os
+from dotenv import load_dotenv
+load_dotenv() # not needed on a Hugging Face Space, where secrets are injected as environment variables
+# api_key = os.environ["HF_TOKEN"]   # raises error if missing
+api_key = os.environ["GITHUB_TOKEN"]   # raises error if missing
+# 1. Define the GitHub-compatible provider
+github_provider = OpenAIProvider(
+    base_url="https://models.inference.ai.azure.com", # GitHub Models endpoint
+    api_key=api_key                   # Your GitHub PAT
+)
+# 2. Initialize the model using the GitHub provider
+# Use model IDs like 'gpt-4o', 'meta-llama-3.1-70b-instruct', or 'DeepSeek-R1'
+model1 = OpenAIChatModel(
+    "gpt-4o",
+    provider=github_provider
+)
+model2 = ChatOpenAI(
+    model="gpt-4o", # Or other models like "meta-llama-3.1-70b-instruct"
+    openai_api_key=api_key,
+    base_url="https://models.github.ai/inference"
+)
+# from pydantic_ai import Agent
+# from pydantic_ai.models.openai import OpenAIChatModel
+# from pydantic_ai.providers.openai import OpenAIProvider
+# model1 = OpenAIChatModel(
+#     # model="qwen2.5:7b-instruct",
+#     "qwen2.5-7b-instruct-q4_k_m",
+#     provider=OpenAIProvider(
+#         base_url="http://localhost:11434/v1",  # 👈 Ollama
+#         api_key="ollama"  # dummy
+#     )
+# )
+# from langchain_ollama import ChatOllama
+# model2 = ChatOllama(
+#     model="Dolphin_SQL"
+# )
+# from pydantic_ai.models.huggingface import HuggingFaceModel
+# from pydantic_ai.providers.openai import OpenAIProvider
+# from dotenv import load_dotenv
+# import os
+# from langchain_openai import ChatOpenAI
+# load_dotenv() # unnecessary if deployed on huggingface space.
+# api_key = os.environ["HF_TOKEN"]   # raises error if missing
+# model1 = HuggingFaceModel(
+#     'Qwen/Qwen2.5-7B-Instruct',
+#     provider=OpenAIProvider(
+#         base_url="https://router.huggingface.co/v1",
+#         api_key=api_key
+#         )
+#     )
+# # Initialize using the OpenAI-compatible router
+# model2 = ChatOpenAI(
+#     model='Qwen/Qwen2.5-7B-Instruct',
+#     openai_api_key=api_key,
+#     openai_api_base="https://router.huggingface.co/v1"
+# )
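Review note: `os.environ["GITHUB_TOKEN"]` raises a bare `KeyError` when the token is missing, which can be hard to read in Space logs. A small wrapper, sketched below, would give a clearer message. `require_env` is a hypothetical helper and the wording of the error is only a suggestion.

```python
import os


def require_env(name: str) -> str:
    """Return a non-empty environment variable or raise a descriptive error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to .env locally or to the Space secrets."
        )
    return value
```

Usage would be `api_key = require_env("GITHUB_TOKEN")` in place of the direct lookup.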

core/parsing/extractor.py ADDED

	@@ -0,0 +1,30 @@

+from pydantic_ai import Agent
+from core.parsing.schema import Resume
+from core.llm_model import model1
+agent = Agent(
+    model=model1,
+    system_prompt=(
+            'You are an expert resume extractor. '
+            'If the context is not a Resume return null and DO NOT infer or hallucinate. '
+            'Do NOT infer or hallucinate missing sections. '
+            'If a section is not explicitly present, return null or empty list.'
+        ),
+    output_type=Resume
+)
+def extract_resume(text: str) -> Resume:
+    '''
+    Extract data from text using pydantic ai agent.
+    Args:
+        text (str): Text extracted from a resume by a parser (e.g., Docling)
+    Returns:
+        Resume: Structured schema for resume
+    '''
+    result = agent.run_sync(text)
+    return result.output
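Review note: one detail worth keeping in mind for multi-part prompts like the one passed to `Agent`: Python joins adjacent string literals with nothing in between, so each fragment needs its own separator. The sketch below shows the effect and one possible way around it. `join_prompt` is a hypothetical helper.

```python
# Adjacent string literals are joined by the parser with nothing in between.
implicit = (
    "You are an expert resume extractor."
    "Do NOT infer or hallucinate missing sections."
)


def join_prompt(*parts: str) -> str:
    """Join prompt fragments with exactly one space between sentences."""
    return " ".join(part.strip() for part in parts)


explicit = join_prompt(
    "You are an expert resume extractor.",
    "Do NOT infer or hallucinate missing sections.",
)
```

Here `implicit` contains `extractor.Do` with no space, while `explicit` separates the two sentences.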

core/parsing/schema.py ADDED

	@@ -0,0 +1,65 @@

+from pydantic import BaseModel, Field
+from typing import List, Optional
+# Nested models for detailed resume sections
+class ContactInformation(BaseModel):
+    email: Optional[str] = Field(None, description="Email address.")
+    phone: Optional[str] = Field(None, description='Mobile number, e.g. +92 03011234567')
+    linkedin: Optional[str] = None
+    github: Optional[str] = None
+    hugging_face: Optional[str] = None
+    kaggle: Optional[str] = None
+class Education(BaseModel):
+    institution: str
+    degree: str
+    start_date: Optional[str] = None
+    end_date: Optional[str] = None
+class Experience(BaseModel):
+    title: str = Field(description="Job role/title.")
+    company: str = Field(description="Name of the company or organization.")
+    start_date: Optional[str] = None
+    end_date: Optional[str] = None
+class Project(BaseModel):
+    name: str = Field(description="Name of a project.")
+    description: str = Field(description="Project Description")
+    technologies: Optional[List[str]] = None
+    url: Optional[str] = None
+    difficulty_score: int = Field(
+      ...,
+      ge=1,
+      le=10,
+      description=(
+        "Strictly evaluate AI engineering complexity. "
+        "1-3: Simple 'wrapper' apps, basic prompting, or out-of-the-box RAG with a single data source. "
+        "4-6: Production-grade apps with persistent memory, multi-step tool use (agents), "
+        "complex data parsing (PDFs/Tables), or basic fine-tuning for style. "
+        "7-8: Advanced architectures featuring multi-agent orchestration, self-healing loops, "
+        "complex hybrid search (vector + keyword), or custom evaluation frameworks (LLM-as-a-judge). "
+        "9-10: Highly complex, mission-critical systems with real-time streaming, "
+        "multi-modal integration, or heavy optimization for cost and latency at scale. "
+        "If the project only uses a single API call without complex logic, it must not exceed 3."
+        )
+    )
+# Main AI Developer Resume Schema
+class Resume(BaseModel):
+    full_name: str = Field(..., description="Full name of the applicant.")
+    contact: ContactInformation
+    summary: str = Field(..., description="Professional summary focusing on AI/ML.")
+    education: Optional[List[Education]] = Field(
+            ..., description="List of educational degrees. Return null if not explicitly present."
+        )
+    experience: Optional[List[Experience]] = Field(
+            ..., description="List of experiences. Return null if not explicitly present."
+        )
+    ai_ml_skills: List[str] = Field(..., description="Specific AI/ML skills (e.g., LLMs, Computer Vision).")
+    technical_skills: List[str] = Field(..., description="Programming languages and tools.")
+    projects: Optional[List[Project]] = None
+    certifications: Optional[List[str]] = None
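Review note: the schema relies on Pydantic to enforce the 1 to 10 range on `difficulty_score` and to allow absent sections. The sketch below shows that behavior on a trimmed-down model. `ProjectSketch` and `try_build` are hypothetical names used only for illustration, and the model is not the commit's `Project` class.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class ProjectSketch(BaseModel):
    """Trimmed-down stand-in for the Project model (hypothetical name)."""

    name: str
    technologies: Optional[List[str]] = None  # may be absent from a resume
    difficulty_score: int = Field(..., ge=1, le=10)


def try_build(**fields) -> Optional[ProjectSketch]:
    """Return a validated model, or None when validation fails."""
    try:
        return ProjectSketch(**fields)
    except ValidationError:
        return None
```

With these definitions, a score of 11 or 0 is rejected, a missing `name` is rejected, and a project without `technologies` validates with that field set to `None`.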

core/processing/database.py ADDED

	@@ -0,0 +1,46 @@

+import pandas as pd
+import sqlite3
+import uuid
+from core.parsing.schema import Resume
+def resume_to_sqlite(resume: Resume, db_path: str = "resumes.db"):
+    r = resume.model_dump()
+    resume_id = str(uuid.uuid4())[:8]
+    # 1. Prepare Data
+    base_data = {
+        "resume_id": resume_id,
+        "name": r.get("full_name"),
+        # **{f"contact_{k}": v for k, v in (r.get("contact") or {}).items()},
+        "summary": r.get("summary"),
+    }
+    # Helper to create DF with resume_id
+    def create_df(key):
+        data = r.get(key) or []
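+        # Special handling for project technologies list
+        if key == "projects":
+            for p in data:
+                if isinstance(p.get("technologies"), list):
+                    p["technologies"] = ", ".join(p["technologies"])
+        # Certifications are plain strings; name the column to match the SQL prompt schema
+        if key == "certifications":
+            data = [{"certification_name": c} for c in data]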
+        if key == "contact":
+            df = pd.DataFrame([data])
+        else:
+            df = pd.DataFrame(data)
+        if not df.empty:
+            df.insert(0, 'resume_id', resume_id)
+        return df
+    # 2. Write to SQLite
+    with sqlite3.connect(db_path) as conn:
+        # Save base info
+        pd.DataFrame([base_data]).to_sql("resume_base", conn, if_exists="append", index=False)
+        # pd.DataFrame([r.get("contact" or [])]).to_sql('contact', conn, if_exists='append', index=False)
+        # Save nested sections ("skills" is not a Resume field, so there is no skills table)
+        tables = ["contact", "certifications", "education", "experience", "projects"]
+        for table in tables:
+            df = create_df(table)
+            if not df.empty:
+                df.to_sql(table, conn, if_exists="append", index=False)
+    return resume_id
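Review note: the SQL prompt in `core/chains.py` describes the table as `certifications(resume_id, certification_name)`, while the schema holds certifications as a flat list of strings. The sketch below shows one way to turn such a list into rows with a named column, using only the standard-library `sqlite3` module instead of pandas. `save_certifications` is a hypothetical helper, not part of this commit.

```python
import sqlite3
from typing import Iterable, Optional


def save_certifications(
    conn: sqlite3.Connection,
    resume_id: str,
    certifications: Optional[Iterable[str]],
) -> int:
    """Insert one row per certification string and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS certifications "
        "(resume_id TEXT, certification_name TEXT)"
    )
    # None (section absent from the resume) is treated as an empty list.
    rows = [(resume_id, name) for name in (certifications or [])]
    conn.executemany("INSERT INTO certifications VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```

Parameterized inserts also avoid any quoting issues with certification names that contain apostrophes.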

core/processing/dataframe.py ADDED

	@@ -0,0 +1,46 @@

+import pandas as pd
+from core.parsing.schema import Resume
+import uuid
+def resume_to_dfs(resume: Resume):
+    r = resume.model_dump()
+    # Flattens the top-level fields and the nested 'contact' dict
+    base_data = {
+        "name": r.get("full_name"),
+        **{f"contact_{k}": v for k, v in (r.get("contact") or {}).items()},
+        "summary": r.get("summary"),
+    }
+    df_base = pd.DataFrame([base_data])
+    df_skl = pd.DataFrame(r.get("skills") or [])
+    df_cert = pd.DataFrame(r.get("certifications") or [])
+    df_edu = pd.DataFrame(r.get("education") or [])
+    df_exp = pd.DataFrame(r.get("experience") or [])
+    # We handle the 'technologies' list by joining it into a string for the CSV/Table view
+    projects = r.get("projects") or []
+    for p in projects:
+        if isinstance(p.get("technologies"), list):
+            p["technologies"] = ", ".join(p["technologies"])
+    df_proj = pd.DataFrame(projects)
+    dfs = {
+        "base": df_base,
+        "skills": df_skl,
+        "certifications": df_cert,
+        "education": df_edu,
+        "experience": df_exp,
+        "projects": df_proj
+    }
+    resume_id = str(uuid.uuid4())[:8]  # Short unique ID
+    for df in dfs.values():
+        df.insert(0, 'resume_id', resume_id)
+    return dfs

requirements.txt ADDED

	@@ -0,0 +1,9 @@

+docling==2.82.0
+langchain_community==0.4.1
+langchain_core==1.2.31
+langchain_openai==1.1.14
+pandas==3.0.2
+pydantic==2.13.1
+pydantic_ai==1.83.0
+python-dotenv==1.2.2
+streamlit==1.55.0