{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 🧠 HRHUB v2.1 - Enhanced with LLM (FREE VERSION)\n", "\n", "## πŸ“˜ Project Overview\n", "\n", "**Bilateral HR Matching System with LLM-Powered Intelligence**\n", "\n", "### What's New in v2.1:\n", "- βœ… **FREE LLM**: Using Hugging Face Inference API (no cost)\n", "- βœ… **Job Level Classification**: Zero-shot & few-shot learning\n", "- βœ… **Structured Skills Extraction**: Pydantic schemas\n", "- βœ… **Match Explainability**: LLM-generated reasoning\n", "- βœ… **Flexible Data Loading**: Upload OR Google Drive\n", "\n", "### Tech Stack:\n", "```\n", "Embeddings: sentence-transformers (local, free)\n", "LLM: Hugging Face Inference API (free tier)\n", "Schemas: Pydantic\n", "Platform: Google Colab β†’ VS Code\n", "```\n", "\n", "---\n", "\n", "**Master's Thesis - Aalborg University** \n", "*Business Data Science Program* \n", "*December 2025*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 1: Install Dependencies" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "βœ… All packages installed!\n" ] } ], "source": [ "# Install required packages\n", "#!pip install -q sentence-transformers huggingface-hub pydantic plotly pyvis nbformat scikit-learn pandas numpy\n", "\n", "print(\"βœ… All packages installed!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 2: Import Libraries" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "βœ… Environment variables loaded from .env\n", "βœ… All libraries imported!\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import json\n", "import os\n", "from typing import List, Dict, Optional, Literal\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# ML & NLP\n", "from sentence_transformers import SentenceTransformer\n", "from sklearn.metrics.pairwise import cosine_similarity\n", "\n", "# LLM Integration (FREE)\n", "from huggingface_hub import InferenceClient\n", "from pydantic import BaseModel, Field\n", "\n", "# Visualization\n", "import plotly.graph_objects as go\n", "from IPython.display import HTML, display\n", "\n", "# Configuration Settings\n", "from dotenv import load_dotenv\n", "\n", "# Load environment variables from .env\n", "load_dotenv()\n", "print(\"βœ… Environment variables loaded from .env\")\n", "\n", "print(\"βœ… All libraries imported!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 3: Configuration" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "βœ… Configuration loaded!\n", "🧠 Embedding model: all-MiniLM-L6-v2\n", "πŸ€– LLM model: meta-llama/Llama-3.2-3B-Instruct\n", "πŸ”‘ HF Token configured: Yes βœ…\n", "πŸ“‚ Data path: ../csv_files/\n" ] } ], "source": [ "class Config:\n", " \"\"\"Centralized configuration for VS Code\"\"\"\n", " \n", " # Paths - VS Code structure\n", " CSV_PATH = '../csv_files/'\n", " PROCESSED_PATH = '../processed/'\n", " RESULTS_PATH = '../results/'\n", " \n", " # Embedding Model\n", " EMBEDDING_MODEL = 'all-MiniLM-L6-v2'\n", " \n", " # LLM Settings (FREE - Hugging Face)\n", " HF_TOKEN = os.getenv('HF_TOKEN', '') # βœ… Read from .env\n", " LLM_MODEL = 'meta-llama/Llama-3.2-3B-Instruct'\n", " \n", " LLM_MAX_TOKENS = 1000\n",
 " \n", " # Matching Parameters\n", " TOP_K_MATCHES = 10\n", " SIMILARITY_THRESHOLD = 0.5\n", " RANDOM_SEED = 42\n", "\n", "np.random.seed(Config.RANDOM_SEED)\n", "\n", "print(\"βœ… Configuration loaded!\")\n", "print(f\"🧠 Embedding model: {Config.EMBEDDING_MODEL}\")\n", "print(f\"πŸ€– LLM model: {Config.LLM_MODEL}\")\n", "print(f\"πŸ”‘ HF Token configured: {'Yes βœ…' if Config.HF_TOKEN else 'No ⚠️'}\")\n", "print(f\"πŸ“‚ Data path: {Config.CSV_PATH}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 4: Load All Datasets" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "πŸ“‚ Loading all datasets...\n", "\n", "======================================================================\n", "βœ… Candidates: 9,544 rows Γ— 35 columns\n", "βœ… Companies (base): 24,473 rows\n", "βœ… Company industries: 24,375 rows\n", "βœ… Company specialties: 169,387 rows\n", "βœ… Employee counts: 35,787 rows\n", "βœ… Postings: 123,849 rows Γ— 31 columns\n", "βœ… Job skills: 213,768 rows\n", "βœ… Job industries: 164,808 rows\n", "\n", "======================================================================\n", "βœ… All datasets loaded successfully!\n", "\n" ] } ], "source": [ "print(\"πŸ“‚ Loading all datasets...\\n\")\n", "print(\"=\" * 70)\n", "\n", "# Load main datasets\n", "candidates = pd.read_csv(f'{Config.CSV_PATH}resume_data.csv')\n", "print(f\"βœ… Candidates: {len(candidates):,} rows Γ— {len(candidates.columns)} columns\")\n", "\n", "companies_base = pd.read_csv(f'{Config.CSV_PATH}companies.csv')\n", "print(f\"βœ… Companies (base): {len(companies_base):,} rows\")\n", "\n", "company_industries = pd.read_csv(f'{Config.CSV_PATH}company_industries.csv')\n", "print(f\"βœ… Company industries: {len(company_industries):,} rows\")\n", "\n", "company_specialties = pd.read_csv(f'{Config.CSV_PATH}company_specialities.csv')\n", "print(f\"βœ… Company specialties: {len(company_specialties):,} rows\")\n", "\n", "employee_counts = pd.read_csv(f'{Config.CSV_PATH}employee_counts.csv')\n", "print(f\"βœ… Employee counts: {len(employee_counts):,} rows\")\n", "\n", "postings = pd.read_csv(f'{Config.CSV_PATH}postings.csv', on_bad_lines='skip', engine='python')\n", "print(f\"βœ… Postings: {len(postings):,} rows Γ— {len(postings.columns)} columns\")\n", "\n", "# Optional datasets\n", "try:\n", " job_skills = pd.read_csv(f'{Config.CSV_PATH}job_skills.csv')\n", " print(f\"βœ… Job skills: {len(job_skills):,} rows\")\n", "except FileNotFoundError:\n", " job_skills = None\n", " print(\"⚠️ Job skills not found (optional)\")\n", "\n", "try:\n", " job_industries = pd.read_csv(f'{Config.CSV_PATH}job_industries.csv')\n", " print(f\"βœ… Job industries: {len(job_industries):,} rows\")\n", "except FileNotFoundError:\n", " job_industries = None\n", " print(\"⚠️ Job industries not found (optional)\")\n", "\n", "print(\"\\n\" + \"=\" * 70)\n", "print(\"βœ… All datasets loaded successfully!\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 5: Merge & Enrich Company Data" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "πŸ”— Merging company data...\n", "\n", "βœ… Aggregated industries for 24,365 companies\n", "βœ… Aggregated specialties for 17,780 companies\n", "\n", "βœ… Base company merge complete: 35,787 companies\n", "\n" ] } ], "source": [ "print(\"πŸ”— Merging company data...\\n\")\n", "\n", "# Aggregate industries\n",
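 "# company_industries holds one row per (company_id, industry) pair; collapse them\n", "# into one comma-separated string per company so the merge below stays one-to-one.\n",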
"company_industries_agg = company_industries.groupby('company_id')['industry'].apply(\n", " lambda x: ', '.join(map(str, x.tolist()))\n", ").reset_index()\n", "company_industries_agg.columns = ['company_id', 'industries_list']\n", "print(f\"βœ… Aggregated industries for {len(company_industries_agg):,} companies\")\n", "\n", "# Aggregate specialties\n", "company_specialties_agg = company_specialties.groupby('company_id')['speciality'].apply(\n", " lambda x: ' | '.join(x.astype(str).tolist())\n", ").reset_index()\n", "company_specialties_agg.columns = ['company_id', 'specialties_list']\n", "print(f\"βœ… Aggregated specialties for {len(company_specialties_agg):,} companies\")\n", "\n", "# Merge all company data\n", "companies_merged = companies_base.copy()\n", "companies_merged = companies_merged.merge(company_industries_agg, on='company_id', how='left')\n", "companies_merged = companies_merged.merge(company_specialties_agg, on='company_id', how='left')\n", "companies_merged = companies_merged.merge(employee_counts, on='company_id', how='left')\n", "\n", "print(f\"\\nβœ… Base company merge complete: {len(companies_merged):,} companies\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 6: Enrich with Job Postings" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "πŸŒ‰ Enriching companies with job posting data...\n", "\n", "======================================================================\n", "KEY INSIGHT: Postings = 'Requirements Language Bridge'\n", "======================================================================\n", "\n", "βœ… Enriched 35,787 companies with posting data\n", "\n" ] } ], "source": [ "print(\"πŸŒ‰ Enriching companies with job posting data...\\n\")\n", "print(\"=\" * 70)\n", "print(\"KEY INSIGHT: Postings = 'Requirements Language Bridge'\")\n", "print(\"=\" * 70 + \"\\n\")\n", "\n", "postings = postings.fillna('')\n", "postings['company_id'] = postings['company_id'].astype(str)\n", "\n", "# Aggregate postings per company\n", "postings_agg = postings.groupby('company_id').agg({\n", " 'title': lambda x: ' | '.join(x.astype(str).tolist()[:10]),\n", " 'description': lambda x: ' '.join(x.astype(str).tolist()[:5]),\n", " 'skills_desc': lambda x: ' | '.join(x.dropna().astype(str).tolist()),\n", " 'formatted_experience_level': lambda x: ' | '.join(x.dropna().unique().astype(str)),\n", "}).reset_index()\n", "\n", "postings_agg.columns = ['company_id', 'posted_job_titles', 'posted_descriptions', 'required_skills', 'experience_levels']\n", "\n", "companies_merged['company_id'] = companies_merged['company_id'].astype(str)\n", "companies_full = companies_merged.merge(postings_agg, on='company_id', how='left').fillna('')\n", "\n", "print(f\"βœ… Enriched {len(companies_full):,} companies with posting data\\n\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " company_id name \\\n", "0 1009 IBM \n", "1 1009 IBM \n", "2 1009 IBM \n", "3 1009 IBM \n", "4 1016 GE HealthCare \n", "\n", " description company_size state \\\n", "0 At IBM, we do more than work. We create. We cr... 7.0 NY \n", "1 At IBM, we do more than work. We create. We cr... 7.0 NY \n", "2 At IBM, we do more than work. We create. We cr... 7.0 NY \n", "3 At IBM, we do more than work. We create. We cr... 7.0 NY \n", "4 Every day millions of people feel the impact o... 7.0 0 \n", "\n", " country city zip_code address \\\n", "0 US Armonk, New York 10504 International Business Machines Corp. \n", "1 US Armonk, New York 10504 International Business Machines Corp. \n", "2 US Armonk, New York 10504 International Business Machines Corp. \n", "3 US Armonk, New York 10504 International Business Machines Corp. \n", "4 US Chicago 0 - \n", "\n", " url \\\n", "0 https://www.linkedin.com/company/ibm \n", "1 https://www.linkedin.com/company/ibm \n", "2 https://www.linkedin.com/company/ibm \n", "3 https://www.linkedin.com/company/ibm \n", "4 https://www.linkedin.com/company/gehealthcare \n", "\n", " industries_list \\\n", "0 IT Services and IT Consulting \n", "1 IT Services and IT Consulting \n", "2 IT Services and IT Consulting \n", "3 IT Services and IT Consulting \n", "4 Hospitals and Health Care \n", "\n", " specialties_list employee_count \\\n", "0 Cloud | Mobile | Cognitive | Security | Resear... 314102 \n", "1 Cloud | Mobile | Cognitive | Security | Resear... 313142 \n", "2 Cloud | Mobile | Cognitive | Security | Resear... 313147 \n", "3 Cloud | Mobile | Cognitive | Security | Resear... 311223 \n", "4 Healthcare | Biotechnology 56873 \n", "\n", " follower_count time_recorded posted_job_titles posted_descriptions \\\n", "0 16253625 1712378162 \n", "1 16309464 1713392385 \n", "2 16309985 1713402495 \n", "3 16314846 1713501255 \n", "4 2185368 1712382540 \n", "\n", " required_skills experience_levels \n", "0 \n", "1 \n", "2 \n", "3 \n", "4 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "companies_full.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "πŸ” DUPLICATE DETECTION REPORT\n", "================================================================================\n", "\n", "β”Œβ”€ πŸ“Š resume_data.csv (Candidates)\n", "β”‚ Primary Key: Resume_ID\n", "β”‚ Total rows: 9,544\n", "β”‚ Unique rows: 9,544\n", "β”‚ Duplicates: 0\n", "β”‚ Status: βœ… CLEAN\n", "└─\n", "\n", "β”Œβ”€ πŸ“Š companies.csv (Companies Base)\n", "β”‚ Primary Key: company_id\n", "β”‚ Total rows: 24,473\n", "β”‚ Unique rows: 24,473\n", "β”‚ Duplicates: 0\n", "β”‚ Status: βœ… CLEAN\n", "└─\n", "\n", "β”Œβ”€ πŸ“Š company_industries.csv\n", "β”‚ Primary Key: company_id + industry\n", "β”‚ Total rows: 24,375\n", "β”‚ Unique rows: 24,375\n", "β”‚ Duplicates: 0\n", "β”‚ Status: βœ… CLEAN\n", "└─\n", "\n", "β”Œβ”€ πŸ“Š company_specialities.csv\n", "β”‚ Primary Key: company_id + speciality\n", "β”‚ Total rows: 169,387\n", "β”‚ Unique rows: 169,387\n", "β”‚ Duplicates: 0\n", "β”‚ Status: βœ… CLEAN\n", "└─\n", "\n", "β”Œβ”€ πŸ“Š employee_counts.csv\n", "β”‚ Primary Key: company_id\n", "β”‚ Total rows: 35,787\n", "β”‚ Unique rows: 24,473\n", "β”‚ Duplicates: 11,314\n", "β”‚ Status: πŸ”΄ HAS DUPLICATES\n", "└─\n", "\n", "β”Œβ”€ πŸ“Š postings.csv (Job Postings)\n", "β”‚ Primary Key: job_id\n", "β”‚ 
Total rows: 123,849\n", "β”‚ Unique rows: 123,849\n", "β”‚ Duplicates: 0\n", "β”‚ Status: βœ… CLEAN\n", "└─\n", "\n", "β”Œβ”€ πŸ“Š companies_full (After Enrichment)\n", "β”‚ Primary Key: company_id\n", "β”‚ Total rows: 35,787\n", "β”‚ Unique rows: 24,473\n", "β”‚ Duplicates: 11,314\n", "β”‚ Status: πŸ”΄ HAS DUPLICATES\n", "β”‚\n", "β”‚ Top duplicate company_ids:\n", "β”‚ - 33242739 (Confidential): 13 times\n", "β”‚ - 5235 (LHH): 13 times\n", "β”‚ - 79383535 (Akkodis): 12 times\n", "β”‚ - 1681 (Robert Half): 12 times\n", "β”‚ - 220336 (Hyatt Hotels Corporation): 11 times\n", "└─\n", "\n", "================================================================================\n", "πŸ“Š SUMMARY\n", "================================================================================\n", "\n", "βœ… Clean datasets: 5/7\n", "πŸ”΄ Datasets with duplicates: 2/7\n", "πŸ—‘οΈ Total duplicates found: 22,628 rows\n", "\n", "⚠️ DUPLICATES DETECTED!\n", "================================================================================\n" ] } ], "source": [ "## πŸ” Data Quality Check - Duplicate Detection\n", "\n", "\"\"\"\n", "Checking for duplicates in all datasets based on primary keys.\n", "This cell only REPORTS duplicates, does not modify data.\n", "\"\"\"\n", "\n", "print(\"=\" * 80)\n", "print(\"πŸ” DUPLICATE DETECTION REPORT\")\n", "print(\"=\" * 80)\n", "print()\n", "\n", "# Define primary keys for each dataset\n", "duplicate_report = []\n", "\n", "# 1. Candidates\n", "print(\"β”Œβ”€ πŸ“Š resume_data.csv (Candidates)\")\n", "print(f\"β”‚ Primary Key: Resume_ID\")\n", "cand_total = len(candidates)\n", "cand_unique = candidates['Resume_ID'].nunique() if 'Resume_ID' in candidates.columns else len(candidates)\n", "cand_dups = cand_total - cand_unique\n", "print(f\"β”‚ Total rows: {cand_total:,}\")\n", "print(f\"β”‚ Unique rows: {cand_unique:,}\")\n", "print(f\"β”‚ Duplicates: {cand_dups:,}\")\n", "print(f\"β”‚ Status: {'βœ… CLEAN' if cand_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n", "print(\"└─\\n\")\n", "duplicate_report.append(('Candidates', cand_total, cand_unique, cand_dups))\n", "\n", "# 2. Companies Base\n", "print(\"β”Œβ”€ πŸ“Š companies.csv (Companies Base)\")\n", "print(f\"β”‚ Primary Key: company_id\")\n", "comp_total = len(companies_base)\n", "comp_unique = companies_base['company_id'].nunique()\n", "comp_dups = comp_total - comp_unique\n", "print(f\"β”‚ Total rows: {comp_total:,}\")\n", "print(f\"β”‚ Unique rows: {comp_unique:,}\")\n", "print(f\"β”‚ Duplicates: {comp_dups:,}\")\n", "print(f\"β”‚ Status: {'βœ… CLEAN' if comp_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n", "if comp_dups > 0:\n", " dup_ids = companies_base[companies_base.duplicated('company_id', keep=False)]['company_id'].value_counts().head(3)\n", " print(f\"β”‚ Top duplicates:\")\n", " for cid, count in dup_ids.items():\n", " print(f\"β”‚ - company_id={cid}: {count} times\")\n", "print(\"└─\\n\")\n", "duplicate_report.append(('Companies Base', comp_total, comp_unique, comp_dups))\n", "\n", "# 3. 
Company Industries\n", "print(\"β”Œβ”€ πŸ“Š company_industries.csv\")\n", "print(f\"β”‚ Primary Key: company_id + industry\")\n", "ci_total = len(company_industries)\n", "ci_unique = len(company_industries.drop_duplicates(subset=['company_id', 'industry']))\n", "ci_dups = ci_total - ci_unique\n", "print(f\"β”‚ Total rows: {ci_total:,}\")\n", "print(f\"β”‚ Unique rows: {ci_unique:,}\")\n", "print(f\"β”‚ Duplicates: {ci_dups:,}\")\n", "print(f\"β”‚ Status: {'βœ… CLEAN' if ci_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n", "print(\"└─\\n\")\n", "duplicate_report.append(('Company Industries', ci_total, ci_unique, ci_dups))\n", "\n", "# 4. Company Specialties\n", "print(\"β”Œβ”€ πŸ“Š company_specialities.csv\")\n", "print(f\"β”‚ Primary Key: company_id + speciality\")\n", "cs_total = len(company_specialties)\n", "cs_unique = len(company_specialties.drop_duplicates(subset=['company_id', 'speciality']))\n", "cs_dups = cs_total - cs_unique\n", "print(f\"β”‚ Total rows: {cs_total:,}\")\n", "print(f\"β”‚ Unique rows: {cs_unique:,}\")\n", "print(f\"β”‚ Duplicates: {cs_dups:,}\")\n", "print(f\"β”‚ Status: {'βœ… CLEAN' if cs_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n", "print(\"└─\\n\")\n", "duplicate_report.append(('Company Specialties', cs_total, cs_unique, cs_dups))\n", "\n", "# 5. Employee Counts\n", "print(\"β”Œβ”€ πŸ“Š employee_counts.csv\")\n", "print(f\"β”‚ Primary Key: company_id\")\n", "ec_total = len(employee_counts)\n", "ec_unique = employee_counts['company_id'].nunique()\n", "ec_dups = ec_total - ec_unique\n", "print(f\"β”‚ Total rows: {ec_total:,}\")\n", "print(f\"β”‚ Unique rows: {ec_unique:,}\")\n", "print(f\"β”‚ Duplicates: {ec_dups:,}\")\n", "print(f\"β”‚ Status: {'βœ… CLEAN' if ec_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n", "print(\"└─\\n\")\n", "duplicate_report.append(('Employee Counts', ec_total, ec_unique, ec_dups))\n", "\n", "# 6. Postings\n", "print(\"β”Œβ”€ πŸ“Š postings.csv (Job Postings)\")\n", "print(f\"β”‚ Primary Key: job_id\")\n", "if 'job_id' in postings.columns:\n", " post_total = len(postings)\n", " post_unique = postings['job_id'].nunique()\n", " post_dups = post_total - post_unique\n", "else:\n", " post_total = len(postings)\n", " post_unique = len(postings.drop_duplicates())\n", " post_dups = post_total - post_unique\n", "print(f\"β”‚ Total rows: {post_total:,}\")\n", "print(f\"β”‚ Unique rows: {post_unique:,}\")\n", "print(f\"β”‚ Duplicates: {post_dups:,}\")\n", "print(f\"β”‚ Status: {'βœ… CLEAN' if post_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n", "print(\"└─\\n\")\n", "duplicate_report.append(('Postings', post_total, post_unique, post_dups))\n", "\n", "# 7. 
Companies Full (After Merge)\n", "print(\"β”Œβ”€ πŸ“Š companies_full (After Enrichment)\")\n", "print(f\"β”‚ Primary Key: company_id\")\n", "cf_total = len(companies_full)\n", "cf_unique = companies_full['company_id'].nunique()\n", "cf_dups = cf_total - cf_unique\n", "print(f\"β”‚ Total rows: {cf_total:,}\")\n", "print(f\"β”‚ Unique rows: {cf_unique:,}\")\n", "print(f\"β”‚ Duplicates: {cf_dups:,}\")\n", "print(f\"β”‚ Status: {'βœ… CLEAN' if cf_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n", "if cf_dups > 0:\n", " dup_ids = companies_full[companies_full.duplicated('company_id', keep=False)]['company_id'].value_counts().head(5)\n", " print(f\"β”‚\")\n", " print(f\"β”‚ Top duplicate company_ids:\")\n", " for cid, count in dup_ids.items():\n", " comp_name = companies_full[companies_full['company_id'] == cid]['name'].iloc[0]\n", " print(f\"β”‚ - {cid} ({comp_name}): {count} times\")\n", "print(\"└─\\n\")\n", "duplicate_report.append(('Companies Full', cf_total, cf_unique, cf_dups))\n", "\n", "# Summary\n", "print(\"=\" * 80)\n", "print(\"πŸ“Š SUMMARY\")\n", "print(\"=\" * 80)\n", "print()\n", "\n", "total_dups = sum(r[3] for r in duplicate_report)\n", "clean_datasets = sum(1 for r in duplicate_report if r[3] == 0)\n", "dirty_datasets = len(duplicate_report) - clean_datasets\n", "\n", "print(f\"βœ… Clean datasets: {clean_datasets}/{len(duplicate_report)}\")\n", "print(f\"πŸ”΄ Datasets with duplicates: {dirty_datasets}/{len(duplicate_report)}\")\n", "print(f\"πŸ—‘οΈ Total duplicates found: {total_dups:,} rows\")\n", "print()\n", "\n", "if dirty_datasets > 0:\n", " print(\"⚠️ DUPLICATES DETECTED!\")\n", "else:\n", " print(\"βœ… All datasets are clean! No duplicates found.\")\n", "\n", "print(\"=\" * 80)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "🧹 CLEANING DUPLICATES...\n", "\n", "================================================================================\n", "βœ… companies_base: Already clean\n", "\n", "βœ… company_industries: Already clean\n", "\n", "βœ… company_specialties: Already clean\n", "\n", "βœ… employee_counts:\n", " Removed 11,314 duplicates\n", " 35,787 β†’ 24,473 rows\n", "\n", "βœ… postings: Already clean\n", "\n", "βœ… companies_full:\n", " Removed 11,314 duplicates\n", " 35,787 β†’ 24,473 rows\n", "\n", "================================================================================\n", "βœ… DATA CLEANING COMPLETE!\n", "================================================================================\n", "\n", "πŸ“Š Total duplicates removed: 22,628 rows\n", "\n", "Cleaned datasets:\n", " - employee_counts: 35,787 β†’ 24,473\n", " - companies_full: 35,787 β†’ 24,473\n" ] } ], "source": [ "\"\"\"\n", "## 🧹 Data Cleaning - Remove Duplicates\n", "\n", "Based on the report above, removing duplicates from datasets.\n", "\"\"\"\n", "\n", "print(\"🧹 CLEANING DUPLICATES...\\n\")\n", "print(\"=\" * 80)\n", "\n", "# Store original counts\n", "original_counts = {}\n", "\n", "# 1. 
Clean Companies Base (if needed)\n", "if len(companies_base) != companies_base['company_id'].nunique():\n", " original_counts['companies_base'] = len(companies_base)\n", " companies_base = companies_base.drop_duplicates(subset=['company_id'], keep='first')\n", " removed = original_counts['companies_base'] - len(companies_base)\n", " print(f\"βœ… companies_base:\")\n", " print(f\" Removed {removed:,} duplicates\")\n", " print(f\" {original_counts['companies_base']:,} β†’ {len(companies_base):,} rows\\n\")\n", "else:\n", " print(f\"βœ… companies_base: Already clean\\n\")\n", "\n", "# 2. Clean Company Industries (if needed)\n", "if len(company_industries) != len(company_industries.drop_duplicates(subset=['company_id', 'industry'])):\n", " original_counts['company_industries'] = len(company_industries)\n", " company_industries = company_industries.drop_duplicates(subset=['company_id', 'industry'], keep='first')\n", " removed = original_counts['company_industries'] - len(company_industries)\n", " print(f\"βœ… company_industries:\")\n", " print(f\" Removed {removed:,} duplicates\")\n", " print(f\" {original_counts['company_industries']:,} β†’ {len(company_industries):,} rows\\n\")\n", "else:\n", " print(f\"βœ… company_industries: Already clean\\n\")\n", "\n", "# 3. Clean Company Specialties (if needed)\n", "if len(company_specialties) != len(company_specialties.drop_duplicates(subset=['company_id', 'speciality'])):\n", " original_counts['company_specialties'] = len(company_specialties)\n", " company_specialties = company_specialties.drop_duplicates(subset=['company_id', 'speciality'], keep='first')\n", " removed = original_counts['company_specialties'] - len(company_specialties)\n", " print(f\"βœ… company_specialties:\")\n", " print(f\" Removed {removed:,} duplicates\")\n", " print(f\" {original_counts['company_specialties']:,} β†’ {len(company_specialties):,} rows\\n\")\n", "else:\n", " print(f\"βœ… company_specialties: Already clean\\n\")\n", "\n", "# 4. Clean Employee Counts (if needed)\n", "if len(employee_counts) != employee_counts['company_id'].nunique():\n", " original_counts['employee_counts'] = len(employee_counts)\n", " employee_counts = employee_counts.drop_duplicates(subset=['company_id'], keep='first')\n", " removed = original_counts['employee_counts'] - len(employee_counts)\n", " print(f\"βœ… employee_counts:\")\n", " print(f\" Removed {removed:,} duplicates\")\n", " print(f\" {original_counts['employee_counts']:,} β†’ {len(employee_counts):,} rows\\n\")\n", "else:\n", " print(f\"βœ… employee_counts: Already clean\\n\")\n", "\n", "# 5. Clean Postings (if needed)\n", "if 'job_id' in postings.columns:\n", " if len(postings) != postings['job_id'].nunique():\n", " original_counts['postings'] = len(postings)\n", " postings = postings.drop_duplicates(subset=['job_id'], keep='first')\n", " removed = original_counts['postings'] - len(postings)\n", " print(f\"βœ… postings:\")\n", " print(f\" Removed {removed:,} duplicates\")\n", " print(f\" {original_counts['postings']:,} β†’ {len(postings):,} rows\\n\")\n", " else:\n", " print(f\"βœ… postings: Already clean\\n\")\n", "\n", "# 6. 
Clean Companies Full (if needed)\n", "if len(companies_full) != companies_full['company_id'].nunique():\n", " original_counts['companies_full'] = len(companies_full)\n", " companies_full = companies_full.drop_duplicates(subset=['company_id'], keep='first')\n", " removed = original_counts['companies_full'] - len(companies_full)\n", " print(f\"βœ… companies_full:\")\n", " print(f\" Removed {removed:,} duplicates\")\n", " print(f\" {original_counts['companies_full']:,} β†’ {len(companies_full):,} rows\\n\")\n", "else:\n", " print(f\"βœ… companies_full: Already clean\\n\")\n", "\n", "print(\"=\" * 80)\n", "print(\"βœ… DATA CLEANING COMPLETE!\")\n", "print(\"=\" * 80)\n", "print()\n", "\n", "# Summary\n", "if original_counts:\n", " total_removed = sum((original_counts[k] - globals()[k].shape[0]) if k in globals() else 0\n", " for k in original_counts.keys())\n", " print(f\"πŸ“Š Total duplicates removed: {total_removed:,} rows\")\n", " print()\n", " print(\"Cleaned datasets:\")\n", " for dataset, original in original_counts.items():\n", " current = len(globals()[dataset]) if dataset in globals() else 0\n", " print(f\" - {dataset}: {original:,} β†’ {current:,}\")\n", "else:\n", " print(\"βœ… No duplicates found - all datasets were already clean!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 7: Load Embedding Model & Pre-computed Vectors" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "🧠 Loading embedding model...\n", "\n", "βœ… Model loaded: all-MiniLM-L6-v2\n", "πŸ“ Embedding dimension: ℝ^384\n", "\n", "πŸ“‚ Loading pre-computed embeddings...\n", "βœ… Loaded from ../processed/\n", "πŸ“Š Candidate vectors: (9544, 384)\n", "πŸ“Š Company vectors: (35787, 384)\n", "\n" ] } ], "source": [ "print(\"🧠 Loading embedding model...\\n\")\n", "model = SentenceTransformer(Config.EMBEDDING_MODEL)\n", "embedding_dim = model.get_sentence_embedding_dimension()\n", "print(f\"βœ… Model loaded: {Config.EMBEDDING_MODEL}\")\n", "print(f\"πŸ“ Embedding dimension: ℝ^{embedding_dim}\\n\")\n", "\n", "print(\"πŸ“‚ Loading pre-computed embeddings...\")\n", "\n", "try:\n", " # Try to load from processed folder\n", " cand_vectors = np.load(f'{Config.PROCESSED_PATH}candidate_embeddings.npy')\n", " comp_vectors = np.load(f'{Config.PROCESSED_PATH}company_embeddings.npy')\n", " \n", " print(f\"βœ… Loaded from {Config.PROCESSED_PATH}\")\n", " print(f\"πŸ“Š Candidate vectors: {cand_vectors.shape}\")\n", " print(f\"πŸ“Š Company vectors: {comp_vectors.shape}\\n\")\n", " \n", "except FileNotFoundError:\n", " print(\"⚠️ Pre-computed embeddings not found!\")\n", " print(\" Embeddings will need to be generated (takes ~5-10 minutes)\")\n", " print(\" This is normal if running for the first time.\\n\")\n", " \n", " # Embeddings can be generated locally when missing\n", " # (see the generation sketch in the next cell)\n", " cand_vectors = None\n", " comp_vectors = None" ] },
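 { "cell_type": "markdown", "metadata": {}, "source": [ "*Optional fallback (a minimal sketch, not executed here): if the pre-computed `.npy` files are missing, the cell below generates and caches them. It assumes the candidate text lives in the `skills` column and the company text in `description`; adjust both to the actual profile-building columns before running.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hedged sketch: generate embeddings when the cached .npy files are absent.\n", "# ASSUMPTION: 'skills' / 'description' hold the text to encode; swap in the\n", "# real profile columns that were used when the cache was first built.\n", "if cand_vectors is None or comp_vectors is None:\n", "    cand_texts = candidates['skills'].fillna('').astype(str).tolist()\n", "    comp_texts = companies_full['description'].fillna('').astype(str).tolist()\n", "\n", "    cand_vectors = model.encode(cand_texts, show_progress_bar=True)\n", "    comp_vectors = model.encode(comp_texts, show_progress_bar=True)\n", "\n", "    os.makedirs(Config.PROCESSED_PATH, exist_ok=True)\n", "    np.save(f'{Config.PROCESSED_PATH}candidate_embeddings.npy', cand_vectors)\n", "    np.save(f'{Config.PROCESSED_PATH}company_embeddings.npy', comp_vectors)\n", "    print(f\"βœ… Generated and cached: {cand_vectors.shape}, {comp_vectors.shape}\")" ] },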
 { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 8: Core Matching Function" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "βœ… Matching function ready\n" ] } ], "source": [ "def find_top_matches(candidate_idx: int, top_k: int = 10) -> List[tuple]:\n", " \"\"\"\n", " Find top K company matches for a candidate using cosine similarity.\n", " \n", " Args:\n", " candidate_idx: Index of candidate\n", " top_k: Number of top matches to return\n", " \n", " Returns:\n", " List of (company_index, similarity_score) tuples\n", " \"\"\"\n", " if cand_vectors is None or comp_vectors is None:\n", " raise ValueError(\"Embeddings not loaded! Please run Step 7 first.\")\n", " \n", " cand_vec = cand_vectors[candidate_idx].reshape(1, -1)\n", " similarities = cosine_similarity(cand_vec, comp_vectors)[0]\n", " top_indices = np.argsort(similarities)[::-1][:top_k]\n", " \n", " return [(int(idx), float(similarities[idx])) for idx in top_indices]\n", "\n", "print(\"βœ… Matching function ready\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 9: Initialize FREE LLM (Hugging Face)\n", "\n", "### Get your FREE token: https://huggingface.co/settings/tokens" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "βœ… Hugging Face client initialized (FREE)\n", "πŸ€– Model: meta-llama/Llama-3.2-3B-Instruct\n", "πŸ’° Cost: $0.00 (completely free!)\n", "\n", "βœ… LLM helper functions ready\n" ] } ], "source": [ "# Initialize Hugging Face Inference Client (FREE)\n", "if Config.HF_TOKEN:\n", " try:\n", " hf_client = InferenceClient(token=Config.HF_TOKEN)\n", " print(\"βœ… Hugging Face client initialized (FREE)\")\n", " print(f\"πŸ€– Model: {Config.LLM_MODEL}\")\n", " print(\"πŸ’° Cost: $0.00 (completely free!)\\n\")\n", " LLM_AVAILABLE = True\n", " except Exception as e:\n", " print(f\"⚠️ Failed to initialize HF client: {e}\")\n", " LLM_AVAILABLE = False\n", "else:\n", " print(\"⚠️ No Hugging Face token configured\")\n", " print(\" LLM features will be disabled\")\n", " print(\"\\nπŸ“ To enable:\")\n", " print(\" 1. Go to: https://huggingface.co/settings/tokens\")\n", " print(\" 2. Create a token (free)\")\n", " print(\" 3. Add HF_TOKEN=your-token-here to your .env file\\n\")\n",
 " LLM_AVAILABLE = False\n", " hf_client = None\n", "\n", "def call_llm(prompt: str, max_tokens: int = 1000) -> str:\n", " \"\"\"\n", " Generic LLM call using Hugging Face Inference API (FREE).\n", " \"\"\"\n", " if not LLM_AVAILABLE:\n", " return \"[LLM not available - check .env file for HF_TOKEN]\"\n", " \n", " try:\n", " response = hf_client.chat_completion( # βœ… chat_completion\n", " messages=[{\"role\": \"user\", \"content\": prompt}],\n", " model=Config.LLM_MODEL,\n", " max_tokens=max_tokens,\n", " temperature=0.7\n", " )\n", " return response.choices[0].message.content # βœ… Extract the message content\n", " except Exception as e:\n", " return f\"[Error: {str(e)}]\"\n", "\n", "print(\"βœ… LLM helper functions ready\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 10: Pydantic Schemas for Structured Output" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "βœ… Pydantic schemas defined\n" ] } ], "source": [ "class JobLevelClassification(BaseModel):\n", " \"\"\"Job level classification result\"\"\"\n", " level: Literal['Entry', 'Mid', 'Senior', 'Executive']\n", " confidence: float = Field(ge=0.0, le=1.0)\n", " reasoning: str\n", "\n", "class SkillsTaxonomy(BaseModel):\n", " \"\"\"Structured skills extraction\"\"\"\n", " technical_skills: List[str] = Field(default_factory=list)\n", " soft_skills: List[str] = Field(default_factory=list)\n", " certifications: List[str] = Field(default_factory=list)\n", " languages: List[str] = Field(default_factory=list)\n", "\n", "class MatchExplanation(BaseModel):\n", " \"\"\"Match reasoning\"\"\"\n", " overall_score: float = Field(ge=0.0, le=1.0)\n", " match_strengths: List[str]\n", " skill_gaps: List[str]\n", " recommendation: str\n", " fit_summary: str = Field(max_length=200)\n", "\n", "print(\"βœ… Pydantic schemas defined\")" ] },
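 { "cell_type": "markdown", "metadata": {}, "source": [ "*Quick schema sanity check (an illustrative sketch, not part of the pipeline): Pydantic enforces the `Literal` level values and the 0.0-1.0 confidence range, so a malformed LLM payload fails loudly instead of slipping through.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative only: validate a hand-written payload against the schema,\n", "# then show that an out-of-range confidence is rejected.\n", "from pydantic import ValidationError\n", "\n", "ok = JobLevelClassification(level='Senior', confidence=0.9, reasoning='6+ years, leads a team')\n", "print(ok.model_dump())\n", "\n", "try:\n", "    JobLevelClassification(level='Senior', confidence=1.5, reasoning='out of range')\n", "except ValidationError as e:\n", "    print(f\"❌ Rejected as expected: {e.errors()[0]['msg']}\")" ] },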
 { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 11: Job Level Classification (Zero-Shot)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "πŸ§ͺ Testing zero-shot classification...\n", "\n", "πŸ“Š Classification Result:\n", "{\n", " \"level\": \"Entry\",\n", " \"confidence\": 0.75,\n", " \"reasoning\": \"The job posting mentions 'some experience in graphic design' and 'fun, kind, ambitious members of the sales team' indicating a junior role.\"\n", "}\n" ] } ], "source": [ "def classify_job_level_zero_shot(job_description: str) -> Dict:\n", " \"\"\"\n", " Zero-shot job level classification.\n", " \n", " Returns classification as: Entry, Mid, Senior, or Executive\n", " \"\"\"\n", " \n", " prompt = f\"\"\"Classify this job posting into ONE seniority level.\n", "\n", "Levels:\n", "- Entry: 0-2 years experience, junior roles\n", "- Mid: 3-5 years experience, independent work\n", "- Senior: 6-10 years experience, technical leadership\n", "- Executive: 10+ years, strategic leadership, C-level\n", "\n", "Job Posting:\n", "{job_description[:500]}\n", "\n", "Return ONLY valid JSON:\n", "{{\n", " \"level\": \"Entry|Mid|Senior|Executive\",\n", " \"confidence\": 0.85,\n", " \"reasoning\": \"Brief explanation\"\n", "}}\n", "\"\"\"\n", " \n", " response = call_llm(prompt)\n", " \n", " try:\n", " # Extract JSON\n", " json_str = response.strip()\n", " if '```json' in json_str:\n", " json_str = json_str.split('```json')[1].split('```')[0].strip()\n", " elif '```' in json_str:\n", " json_str = json_str.split('```')[1].split('```')[0].strip()\n", " \n", " # Find JSON in response\n", " if '{' in json_str and '}' in json_str:\n", " start = json_str.index('{')\n", " end = json_str.rindex('}') + 1\n", " json_str = json_str[start:end]\n", " \n", " result = json.loads(json_str)\n", " return result\n", " except Exception:\n", " return {\n", " \"level\": \"Unknown\",\n", " \"confidence\": 0.0,\n", " \"reasoning\": \"Failed to parse response\"\n", " }\n", "\n", "# Test if LLM available and data loaded\n", "if LLM_AVAILABLE and len(postings) > 0:\n", " print(\"πŸ§ͺ Testing zero-shot classification...\\n\")\n", " sample = postings.iloc[0]['description']\n", " result = classify_job_level_zero_shot(sample)\n", " \n", " print(\"πŸ“Š Classification Result:\")\n", " print(json.dumps(result, indent=2))\n", "else:\n", " print(\"⚠️ Skipped - LLM not available or no data\")" ] },
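 { "cell_type": "markdown", "metadata": {}, "source": [ "*The fence-stripping and brace-slicing JSON parsing above is repeated almost verbatim in the cells that follow. A reusable helper equivalent to those inline blocks could look like the sketch below (optional; the inline versions are kept as-is):*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional refactor sketch: one helper equivalent to the inline JSON extraction\n", "# used by the classification and extraction cells below.\n", "def parse_llm_json(response: str, fallback: Dict) -> Dict:\n", "    \"\"\"Strip code fences, slice the outermost {...}, and parse it.\"\"\"\n", "    try:\n", "        json_str = response.strip()\n", "        if '```json' in json_str:\n", "            json_str = json_str.split('```json')[1].split('```')[0].strip()\n", "        elif '```' in json_str:\n", "            json_str = json_str.split('```')[1].split('```')[0].strip()\n", "        if '{' in json_str and '}' in json_str:\n", "            json_str = json_str[json_str.index('{'):json_str.rindex('}') + 1]\n", "        return json.loads(json_str)\n", "    except ValueError:  # json.JSONDecodeError subclasses ValueError\n", "        return fallback\n", "\n", "# Usage: parse_llm_json(call_llm(prompt), fallback={'level': 'Unknown'})" ] },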
 { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 12: Few-Shot Learning" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "πŸ§ͺ Comparing Zero-Shot vs Few-Shot...\n", "\n", "πŸ“Š Comparison:\n", "Zero-shot: Unknown (confidence: 0.00)\n", "Few-shot: Entry (confidence: 0.90)\n" ] } ], "source": [ "def classify_job_level_few_shot(job_description: str) -> Dict:\n", " \"\"\"\n", " Few-shot classification with examples.\n", " \"\"\"\n", " \n", " prompt = f\"\"\"Classify this job posting using examples.\n", "\n", "EXAMPLES:\n", "\n", "Example 1 (Entry):\n", "\"Recent graduate wanted. Python basics. Mentorship provided.\"\n", "β†’ Entry level (learning focus, 0-2 years)\n", "\n", "Example 2 (Senior):\n", "\"5+ years backend. Lead team of 3. System architecture.\"\n", "β†’ Senior level (technical leadership, 6-10 years)\n", "\n", "Example 3 (Executive):\n", "\"CTO position. 15+ years. Define technical strategy.\"\n", "β†’ Executive level (C-level, strategic)\n", "\n", "NOW CLASSIFY:\n", "{job_description[:500]}\n", "\n", "Return JSON:\n", "{{\n", " \"level\": \"Entry|Mid|Senior|Executive\",\n", " \"confidence\": 0.0-1.0,\n", " \"reasoning\": \"Explain\"\n", "}}\n", "\"\"\"\n", " \n", " response = call_llm(prompt)\n", " \n", " try:\n", " json_str = response.strip()\n", " if '```json' in json_str:\n", " json_str = json_str.split('```json')[1].split('```')[0].strip()\n", " \n", " if '{' in json_str and '}' in json_str:\n", " start = json_str.index('{')\n", " end = json_str.rindex('}') + 1\n", " json_str = json_str[start:end]\n", " \n", " result = json.loads(json_str)\n", " return result\n", " except Exception:\n", " return {\"level\": \"Unknown\", \"confidence\": 0.0, \"reasoning\": \"Parse error\"}\n", "\n", "# Compare zero-shot vs few-shot\n", "if LLM_AVAILABLE and len(postings) > 0:\n", " print(\"πŸ§ͺ Comparing Zero-Shot vs Few-Shot...\\n\")\n", " sample = postings.iloc[0]['description']\n", " \n", " zero = classify_job_level_zero_shot(sample)\n", " few = classify_job_level_few_shot(sample)\n", " \n", " print(\"πŸ“Š Comparison:\")\n", " print(f\"Zero-shot: {zero['level']} (confidence: {zero['confidence']:.2f})\")\n", " print(f\"Few-shot: {few['level']} (confidence: {few['confidence']:.2f})\")\n", "else:\n", " print(\"⚠️ Skipped\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 13: Structured Skills Extraction" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "πŸ” Testing skills extraction...\n", "\n", "πŸ“Š Extracted Skills:\n", "{\n", " \"technical_skills\": [\n", " \"Adobe Creative Cloud\",\n", " \"Microsoft Office Suite\"\n", " ],\n", " \"soft_skills\": [\n", " \"Communication\",\n", " \"Leadership\",\n", " \"Organization\",\n", " \"Responsibility\",\n", " \"Respect\",\n", " \"Positive attitude\",\n", " \"Proactivity\",\n", " \"Creativity\",\n", " \"Time management\",\n", " \"Cool-under-pressure\"\n", " ],\n", " \"certifications\": [\n", " \"Adobe Creative Cloud skills\"\n", " ],\n", " \"languages\": [\n", " \"English\"\n", " ]\n", "}\n" ] } ], "source": [ "def extract_skills_taxonomy(job_description: str) -> Dict:\n", " \"\"\"\n", " Extract structured skills using LLM + Pydantic validation.\n", " \"\"\"\n", " \n", " prompt = f\"\"\"Extract skills from this job posting.\n", "\n", "Job Posting:\n", "{job_description[:800]}\n", "\n", "Return ONLY valid JSON:\n", "{{\n", " \"technical_skills\": [\"Python\", \"Docker\", \"AWS\"],\n", " \"soft_skills\": [\"Communication\", \"Leadership\"],\n", " \"certifications\": [\"AWS Certified\"],\n", " \"languages\": [\"English\", \"Danish\"]\n", "}}\n", "\"\"\"\n", " \n", " response = call_llm(prompt, max_tokens=800)\n", " \n", " try:\n", " json_str = response.strip()\n", " if '```json' in json_str:\n", " json_str = json_str.split('```json')[1].split('```')[0].strip()\n", " \n", " if '{' in json_str and '}' in json_str:\n", " start = json_str.index('{')\n", " end = json_str.rindex('}') + 1\n", " json_str = json_str[start:end]\n", " \n", " data = json.loads(json_str)\n", " # Validate with Pydantic\n", " validated = SkillsTaxonomy(**data)\n", " return validated.model_dump()\n", " except Exception:\n", " return {\n", " \"technical_skills\": [],\n", " \"soft_skills\": [],\n", " \"certifications\": [],\n", " \"languages\": []\n", " }\n", "\n", "# Test extraction\n", "if LLM_AVAILABLE and len(postings) > 0:\n", " print(\"πŸ” Testing skills extraction...\\n\")\n",
 " sample = postings.iloc[0]['description']\n", " skills = extract_skills_taxonomy(sample)\n", " \n", " print(\"πŸ“Š Extracted Skills:\")\n", " print(json.dumps(skills, indent=2))\n", "else:\n", " print(\"⚠️ Skipped\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 14: Match Explainability" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "πŸ’‘ Testing match explainability...\n", "\n", "πŸ“Š Match Explanation:\n", "{\n", " \"overall_score\": 0.7028058171272278,\n", " \"match_strengths\": [\n", " \"Unable to generate\"\n", " ],\n", " \"skill_gaps\": [],\n", " \"recommendation\": \"Review manually\",\n", " \"fit_summary\": \"Match score: 0.70\"\n", "}\n" ] } ], "source": [ "def explain_match(candidate_idx: int, company_idx: int, similarity_score: float) -> Dict:\n", " \"\"\"\n", " Generate LLM explanation for why candidate matches company.\n", " \"\"\"\n", " \n", " cand = candidates.iloc[candidate_idx]\n", " comp = companies_full.iloc[company_idx]\n", " \n", " cand_skills = str(cand.get('skills', 'N/A'))[:300]\n", " cand_exp = str(cand.get('positions', 'N/A'))[:300]\n", " comp_req = str(comp.get('required_skills', 'N/A'))[:300]\n", " comp_name = comp.get('name', 'Unknown')\n", " \n", " prompt = f\"\"\"Explain why this candidate matches this company.\n", "\n", "Candidate:\n", "Skills: {cand_skills}\n", "Experience: {cand_exp}\n", "\n", "Company: {comp_name}\n", "Requirements: {comp_req}\n", "\n", "Similarity Score: {similarity_score:.2f}\n", "\n", "Return JSON:\n", "{{\n", " \"overall_score\": {similarity_score},\n", " \"match_strengths\": [\"Top 3-5 matching factors\"],\n", " \"skill_gaps\": [\"Missing skills\"],\n", " \"recommendation\": \"What candidate should do\",\n", " \"fit_summary\": \"One sentence summary\"\n", "}}\n", "\"\"\"\n", " \n", " response = call_llm(prompt, max_tokens=1000)\n", " \n", " try:\n", " json_str = response.strip()\n", " if '```json' in json_str:\n", " json_str = json_str.split('```json')[1].split('```')[0].strip()\n", " \n", " if '{' in json_str and '}' in json_str:\n", " start = json_str.index('{')\n", " end = json_str.rindex('}') + 1\n", " json_str = json_str[start:end]\n", " \n", " data = json.loads(json_str)\n", " return data\n", " except Exception:\n", " return {\n", " \"overall_score\": similarity_score,\n", " \"match_strengths\": [\"Unable to generate\"],\n", " \"skill_gaps\": [],\n", " \"recommendation\": \"Review manually\",\n", " \"fit_summary\": f\"Match score: {similarity_score:.2f}\"\n", " }\n", "\n", "# Test explainability\n", "if LLM_AVAILABLE and cand_vectors is not None and len(candidates) > 0:\n", " print(\"πŸ’‘ Testing match explainability...\\n\")\n", " matches = find_top_matches(0, top_k=1)\n", " if matches:\n", " comp_idx, score = matches[0]\n", " explanation = explain_match(0, comp_idx, score)\n", " \n", " print(\"πŸ“Š Match Explanation:\")\n", " print(json.dumps(explanation, indent=2))\n", "else:\n", " print(\"⚠️ Skipped - requirements not met\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 15: Detailed Match Visualization" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "πŸ” DETAILED MATCH ANALYSIS\n", "====================================================================================================\n", "\n", "🎯 CANDIDATE #0\n", "Resume ID: N/A\n", "Category: N/A\n", "Skills: ['Big Data', 'Hadoop', 'Hive', 'Python', 'Mapreduce', 'Spark', 'Java', 'Machine Learning', 'Cloud', 'Hdfs', 'YARN', 'Core Java', 'Data Science', 'C++'...\n", "\n", "πŸ”— TOP 5 MATCHES:\n", "\n", "#1. TeachTown (Score: 0.7028)\n", " Industries: E-Learning Providers...\n", "#3. Wolverine Power Systems (Score: 0.7026)\n", " Industries: Renewable Energy Semiconductor Manufacturing...\n", "#5. Mariner (Score: 0.7010)\n", " Industries: Financial Services...\n", "\n", "====================================================================================================\n" ] }, { "data": { "text/plain": [ "[(9418, 0.7028058171272278),\n", " (30989, 0.7026211023330688),\n", " (9417, 0.7025721669197083),\n", " (30990, 0.7019376754760742),\n", " (9416, 0.7010321021080017)]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# ============================================================================\n", "# πŸ” DETAILED MATCH EXAMPLE\n", "# ============================================================================\n", "\n", "def show_detailed_match_example(candidate_idx=0, top_k=5):\n", " print(\"πŸ” DETAILED MATCH ANALYSIS\")\n", " print(\"=\" * 100)\n", " \n", " if candidate_idx >= len(candidates):\n", " print(f\"❌ ERROR: Candidate {candidate_idx} out of range\")\n", " return None\n", " \n", " cand = candidates.iloc[candidate_idx]\n", " \n", " print(f\"\\n🎯 CANDIDATE #{candidate_idx}\")\n", " print(f\"Resume ID: {cand.get('Resume_ID', 'N/A')}\")\n", " print(f\"Category: {cand.get('Category', 'N/A')}\")\n", " print(f\"Skills: {str(cand.get('skills', 'N/A'))[:150]}...\\n\")\n", " \n", " matches = find_top_matches(candidate_idx, top_k=top_k)\n", " \n", " print(f\"πŸ”— TOP {len(matches)} MATCHES:\\n\")\n", " \n", " for rank, (comp_idx, score) in enumerate(matches, 1):\n", " if comp_idx >= len(companies_full):\n", " continue\n", " \n", " company = companies_full.iloc[comp_idx]\n", " print(f\"#{rank}. {company.get('name', 'N/A')} (Score: {score:.4f})\")\n", " print(f\" Industries: {str(company.get('industries_list', 'N/A'))[:60]}...\")\n", " \n", " print(\"\\n\" + \"=\" * 100)\n", " return matches\n", "\n", "# Test\n", "show_detailed_match_example(candidate_idx=0, top_k=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 16: Bridging Concept Analysis" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "πŸŒ‰ THE BRIDGING CONCEPT\n", "==========================================================================================\n", "\n", "πŸ“Š DATA REALITY:\n", " Total companies: 24,473\n", " WITH postings: 0 (0.0%)\n", " WITHOUT postings: 24,473\n", "\n", "🎯 THE PROBLEM:\n", " Companies: 'We are in TECH INDUSTRY'\n", " Candidates: 'I know PYTHON, AWS'\n", " β†’ Different languages! 🚫\n", "\n", "πŸŒ‰ THE SOLUTION (BRIDGING):\n", " 1. Extract from postings: 'Need PYTHON developers'\n", " 2. Enrich company profile with skills\n", " 3. Now both speak SKILLS LANGUAGE! 
βœ…\n", "\n", "==========================================================================================\n" ] }, { "data": { "text/plain": [ "(Empty DataFrame\n", " Columns: [company_id, name, description, company_size, state, country, city, zip_code, address, url, industries_list, specialties_list, employee_count, follower_count, time_recorded, posted_job_titles, posted_descriptions, required_skills, experience_levels]\n", " Index: [],\n", " company_id name \\\n", " 0 1009 IBM \n", " 4 1016 GE HealthCare \n", " 14 1025 Hewlett Packard Enterprise \n", " 18 1028 Oracle \n", " 23 1033 Accenture \n", " ... ... ... \n", " 35782 103463217 JRC Services \n", " 35783 103466352 Centent Consulting LLC \n", " 35784 103467540 Kings and Queens Productions, LLC \n", " 35785 103468936 WebUnite \n", " 35786 103472979 BlackVe \n", " \n", " description company_size \\\n", " 0 At IBM, we do more than work. We create. We cr... 7.0 \n", " 4 Every day millions of people feel the impact o... 7.0 \n", " 14 Official LinkedIn of Hewlett Packard Enterpris... 7.0 \n", " 18 We’re a cloud technology company that provides... 7.0 \n", " 23 Accenture is a leading global professional ser... 7.0 \n", " ... ... ... \n", " 35782 2.0 \n", " 35783 Centent Consulting LLC is a reputable human re... \n", " 35784 We are a small but mighty collection of thinke... \n", " 35785 Our mission at WebUnite is to offer experience... \n", " 35786 1.0 \n", " \n", " state country city zip_code \\\n", " 0 NY US Armonk, New York 10504 \n", " 4 0 US Chicago 0 \n", " 14 Texas US Houston 77389 \n", " 18 Texas US Austin 78741 \n", " 23 0 IE Dublin 2 0 \n", " ... ... ... ... ... \n", " 35782 0 0 0 0 \n", " 35783 0 0 0 0 \n", " 35784 0 0 0 0 \n", " 35785 Pennsylvania US Southampton 18966 \n", " 35786 0 0 0 0 \n", " \n", " address \\\n", " 0 International Business Machines Corp. \n", " 4 - \n", " 14 1701 E Mossy Oaks Rd Spring \n", " 18 2300 Oracle Way \n", " 23 Grand Canal Harbour \n", " ... ... \n", " 35782 0 \n", " 35783 0 \n", " 35784 0 \n", " 35785 720 2nd Street Pike \n", " 35786 0 \n", " \n", " url \\\n", " 0 https://www.linkedin.com/company/ibm \n", " 4 https://www.linkedin.com/company/gehealthcare \n", " 14 https://www.linkedin.com/company/hewlett-packa... \n", " 18 https://www.linkedin.com/company/oracle \n", " 23 https://www.linkedin.com/company/accenture \n", " ... ... \n", " 35782 https://www.linkedin.com/company/jrcservices \n", " 35783 https://www.linkedin.com/company/centent-consu... \n", " 35784 https://www.linkedin.com/company/kings-and-que... \n", " 35785 https://www.linkedin.com/company/webunite \n", " 35786 https://www.linkedin.com/company/blackve \n", " \n", " industries_list \\\n", " 0 IT Services and IT Consulting \n", " 4 Hospitals and Health Care \n", " 14 IT Services and IT Consulting \n", " 18 IT Services and IT Consulting \n", " 23 Business Consulting and Services \n", " ... ... \n", " 35782 Facilities Services \n", " 35783 Business Consulting and Services \n", " 35784 Broadcast Media Production and Distribution \n", " 35785 Business Consulting and Services \n", " 35786 Defense and Space Manufacturing \n", " \n", " specialties_list employee_count \\\n", " 0 Cloud | Mobile | Cognitive | Security | Resear... 314102 \n", " 4 Healthcare | Biotechnology 56873 \n", " 14 79528 \n", " 18 enterprise | software | applications | databas... 192099 \n", " 23 Management Consulting | Systems Integration an... 574664 \n", " ... ... ... 
\n", " 35782 0 \n", " 35783 0 \n", " 35784 0 \n", " 35785 0 \n", " 35786 0 \n", " \n", " follower_count time_recorded posted_job_titles posted_descriptions \\\n", " 0 16253625 1712378162 \n", " 4 2185368 1712382540 \n", " 14 3586194 1712870106 \n", " 18 9465968 1712642952 \n", " 23 11864908 1712641699 \n", " ... ... ... ... ... \n", " 35782 21 1713552037 \n", " 35783 0 1713550651 \n", " 35784 12 1713554225 \n", " 35785 1 1713535939 \n", " 35786 0 1713539379 \n", " \n", " required_skills experience_levels \n", " 0 \n", " 4 \n", " 14 \n", " 18 \n", " 23 \n", " ... ... ... \n", " 35782 \n", " 35783 \n", " 35784 \n", " 35785 \n", " 35786 \n", " \n", " [24473 rows x 19 columns])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# ============================================================================\n", "# πŸŒ‰ BRIDGING CONCEPT ANALYSIS\n", "# ============================================================================\n", "\n", "def show_bridging_concept_analysis():\n", " print(\"πŸŒ‰ THE BRIDGING CONCEPT\")\n", " print(\"=\" * 90)\n", " \n", " companies_with = companies_full[companies_full['required_skills'] != '']\n", " companies_without = companies_full[companies_full['required_skills'] == '']\n", " \n", " print(f\"\\nπŸ“Š DATA REALITY:\")\n", " print(f\" Total companies: {len(companies_full):,}\")\n", " print(f\" WITH postings: {len(companies_with):,} ({len(companies_with)/len(companies_full)*100:.1f}%)\")\n", " print(f\" WITHOUT postings: {len(companies_without):,}\\n\")\n", " \n", " print(\"🎯 THE PROBLEM:\")\n", " print(\" Companies: 'We are in TECH INDUSTRY'\")\n", " print(\" Candidates: 'I know PYTHON, AWS'\")\n", " print(\" β†’ Different languages! 🚫\\n\")\n", " \n", " print(\"πŸŒ‰ THE SOLUTION (BRIDGING):\")\n", " print(\" 1. Extract from postings: 'Need PYTHON developers'\")\n", " print(\" 2. Enrich company profile with skills\")\n", " print(\" 3. Now both speak SKILLS LANGUAGE! 
βœ…\\n\")\n", " \n", " print(\"=\" * 90)\n", " return companies_with, companies_without\n", "\n", "# Test\n", "show_bridging_concept_analysis()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 17: Export Results to CSV" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "πŸ’Ύ Exporting 50 candidates (top 5 each)...\n", "\n", " Processing 1/50...\n", "\n", "βœ… Exported 129 matches\n", "πŸ“„ File: ../results/hrhub_matches.csv\n", "\n" ] } ], "source": [ "# ============================================================================\n", "# πŸ’Ύ EXPORT MATCHES TO CSV\n", "# ============================================================================\n", "\n", "def export_matches_to_csv(num_candidates=100, top_k=10):\n", " print(f\"πŸ’Ύ Exporting {num_candidates} candidates (top {top_k} each)...\\n\")\n", " \n", " results = []\n", " \n", " for i in range(min(num_candidates, len(candidates))):\n", " if i % 50 == 0:\n", " print(f\" Processing {i+1}/{num_candidates}...\")\n", " \n", " matches = find_top_matches(i, top_k=top_k)\n", " cand = candidates.iloc[i]\n", " \n", " for rank, (comp_idx, score) in enumerate(matches, 1):\n", " if comp_idx >= len(companies_full):\n", " continue\n", " \n", " company = companies_full.iloc[comp_idx]\n", " \n", " results.append({\n", " 'candidate_id': i,\n", " 'candidate_category': cand.get('Category', 'N/A'),\n", " 'company_id': company.get('company_id', 'N/A'),\n", " 'company_name': company.get('name', 'N/A'),\n", " 'match_rank': rank,\n", " 'similarity_score': round(float(score), 4)\n", " })\n", " \n", " results_df = pd.DataFrame(results)\n", " output_file = f'{Config.RESULTS_PATH}hrhub_matches.csv'\n", " results_df.to_csv(output_file, index=False)\n", " \n", " print(f\"\\nβœ… Exported {len(results_df):,} matches\")\n", " print(f\"πŸ“„ File: {output_file}\\n\")\n", " \n", " return results_df\n", "\n", "# Export sample\n", "matches_df = export_matches_to_csv(num_candidates=50, top_k=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## πŸ“Š Step 18: Summary\n", "\n", "### What We Built" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "======================================================================\n", "🎯 HRHUB v2.1 - SUMMARY\n", "======================================================================\n", "\n", "βœ… IMPLEMENTED:\n", " 1. Zero-Shot Job Classification (Entry/Mid/Senior/Executive)\n", " 2. Few-Shot Learning with Examples\n", " 3. Structured Skills Extraction (Pydantic schemas)\n", " 4. Match Explainability (LLM-generated reasoning)\n", " 5. FREE LLM Integration (Hugging Face)\n", " 6. Flexible Data Loading (Upload OR Google Drive)\n", "\n", "πŸ’° COST: $0.00 (completely free!)\n", "\n", "πŸ“ˆ COURSE ALIGNMENT:\n", " βœ… LLMs for structured output\n", " βœ… Pydantic schemas\n", " βœ… Classification pipelines\n", " βœ… Zero-shot & few-shot learning\n", " βœ… JSON extraction\n", " βœ… Transformer architecture (embeddings)\n", " βœ… API deployment strategies\n", "\n", "======================================================================\n", "πŸš€ READY TO MOVE TO VS CODE!\n", "======================================================================\n" ] } ], "source": [ "print(\"=\"*70)\n", "print(\"🎯 HRHUB v2.1 - SUMMARY\")\n", "print(\"=\"*70)\n", "print(\"\")\n", "print(\"βœ… IMPLEMENTED:\")\n", "print(\" 1. 
Zero-Shot Job Classification (Entry/Mid/Senior/Executive)\")\n", "print(\" 2. Few-Shot Learning with Examples\")\n", "print(\" 3. Structured Skills Extraction (Pydantic schemas)\")\n", "print(\" 4. Match Explainability (LLM-generated reasoning)\")\n", "print(\" 5. FREE LLM Integration (Hugging Face)\")\n", "print(\" 6. Flexible Data Loading (Upload OR Google Drive)\")\n", "print(\"\")\n", "print(\"πŸ’° COST: $0.00 (completely free!)\")\n", "print(\"\")\n", "print(\"πŸ“ˆ COURSE ALIGNMENT:\")\n", "print(\" βœ… LLMs for structured output\")\n", "print(\" βœ… Pydantic schemas\")\n", "print(\" βœ… Classification pipelines\")\n", "print(\" βœ… Zero-shot & few-shot learning\")\n", "print(\" βœ… JSON extraction\")\n", "print(\" βœ… Transformer architecture (embeddings)\")\n", "print(\" βœ… API deployment strategies\")\n", "print(\"\")\n", "print(\"=\"*70)\n", "print(\"πŸš€ READY TO MOVE TO VS CODE!\")\n", "print(\"=\"*70)" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 2 }