{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🧠 HRHUB v2.1 - Enhanced with LLM (FREE VERSION)\n",
    "\n",
    "## 📘 Project Overview\n",
    "\n",
    "**Bilateral HR Matching System with LLM-Powered Intelligence**\n",
    "\n",
    "### What's New in v2.1:\n",
    "- ✅ **FREE LLM**: Using Hugging Face Inference API (no cost)\n",
    "- ✅ **Job Level Classification**: Zero-shot & few-shot learning\n",
    "- ✅ **Structured Skills Extraction**: Pydantic schemas\n",
    "- ✅ **Match Explainability**: LLM-generated reasoning\n",
    "- ✅ **Flexible Data Loading**: Upload OR Google Drive\n",
    "\n",
    "### Tech Stack:\n",
    "```\n",
    "Embeddings: sentence-transformers (local, free)\n",
    "LLM: Hugging Face Inference API (free tier)\n",
    "Schemas: Pydantic\n",
    "Platform: Google Colab → VS Code\n",
    "```\n",
    "\n",
    "---\n",
    "\n",
    "**Master's Thesis - Aalborg University**  \n",
    "*Business Data Science Program*  \n",
    "*December 2025*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 1: Install Dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ All packages installed!\n"
     ]
    }
   ],
   "source": [
    "# Install required packages (uncomment the next line on first run)\n",
    "#!pip install -q sentence-transformers huggingface-hub pydantic plotly pyvis nbformat scikit-learn pandas numpy python-dotenv\n",
    "\n",
    "print(\"✅ All packages installed!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 2: Import Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ Environment variables loaded from .env\n",
      "✅ All libraries imported!\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import json\n",
    "import os\n",
    "from typing import List, Dict, Optional, Literal\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# ML & NLP\n",
    "from sentence_transformers import SentenceTransformer\n",
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "\n",
    "# LLM Integration (FREE)\n",
    "from huggingface_hub import InferenceClient\n",
    "from pydantic import BaseModel, Field\n",
    "\n",
    "# Visualization\n",
    "import plotly.graph_objects as go\n",
    "from IPython.display import HTML, display\n",
    "\n",
    "# Configuration Settings\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "# Load variables from .env\n",
    "load_dotenv()\n",
    "print(\"✅ Environment variables loaded from .env\")\n",
    "\n",
    "print(\"✅ All libraries imported!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 3: Configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ Configuration loaded!\n",
      "🧠 Embedding model: all-MiniLM-L6-v2\n",
      "🤖 LLM model: meta-llama/Llama-3.2-3B-Instruct\n",
      "🔑 HF Token configured: Yes ✅\n",
      "📂 Data path: ../csv_files/\n"
     ]
    }
   ],
   "source": [
    "class Config:\n",
    "    \"\"\"Centralized configuration for VS Code\"\"\"\n",
    "    \n",
    "    # Paths - VS Code structure\n",
    "    CSV_PATH = '../csv_files/'\n",
    "    PROCESSED_PATH = '../processed/'\n",
    "    RESULTS_PATH = '../results/'\n",
    "    \n",
    "    # Embedding Model\n",
    "    EMBEDDING_MODEL = 'all-MiniLM-L6-v2'\n",
    "    \n",
    "    # LLM Settings (FREE - Hugging Face)\n",
    "    HF_TOKEN = os.getenv('HF_TOKEN', '')  # Read from .env\n",
    "    LLM_MODEL = 'meta-llama/Llama-3.2-3B-Instruct'\n",
    "    \n",
    "    LLM_MAX_TOKENS = 1000\n",
    "    \n",
    "    # Matching Parameters\n",
    "    TOP_K_MATCHES = 10\n",
    "    SIMILARITY_THRESHOLD = 0.5\n",
    "    RANDOM_SEED = 42\n",
    "\n",
    "np.random.seed(Config.RANDOM_SEED)\n",
    "\n",
    "print(\"✅ Configuration loaded!\")\n",
    "print(f\"🧠 Embedding model: {Config.EMBEDDING_MODEL}\")\n",
    "print(f\"🤖 LLM model: {Config.LLM_MODEL}\")\n",
    "print(f\"🔑 HF Token configured: {'Yes ✅' if Config.HF_TOKEN else 'No ⚠️'}\")\n",
    "print(f\"📂 Data path: {Config.CSV_PATH}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 4: Load All Datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📂 Loading all datasets...\n",
      "\n",
      "======================================================================\n",
      "✅ Candidates: 9,544 rows × 35 columns\n",
      "✅ Companies (base): 24,473 rows\n",
      "✅ Company industries: 24,375 rows\n",
      "✅ Company specialties: 169,387 rows\n",
      "✅ Employee counts: 35,787 rows\n",
      "✅ Postings: 123,849 rows × 31 columns\n",
      "✅ Job skills: 213,768 rows\n",
      "✅ Job industries: 164,808 rows\n",
      "\n",
      "======================================================================\n",
      "✅ All datasets loaded successfully!\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(\"📂 Loading all datasets...\\n\")\n",
    "print(\"=\" * 70)\n",
    "\n",
    "# Load main datasets\n",
    "candidates = pd.read_csv(f'{Config.CSV_PATH}resume_data.csv')\n",
    "print(f\"✅ Candidates: {len(candidates):,} rows × {len(candidates.columns)} columns\")\n",
    "\n",
    "companies_base = pd.read_csv(f'{Config.CSV_PATH}companies.csv')\n",
    "print(f\"✅ Companies (base): {len(companies_base):,} rows\")\n",
    "\n",
    "company_industries = pd.read_csv(f'{Config.CSV_PATH}company_industries.csv')\n",
    "print(f\"✅ Company industries: {len(company_industries):,} rows\")\n",
    "\n",
    "company_specialties = pd.read_csv(f'{Config.CSV_PATH}company_specialities.csv')\n",
    "print(f\"✅ Company specialties: {len(company_specialties):,} rows\")\n",
    "\n",
    "employee_counts = pd.read_csv(f'{Config.CSV_PATH}employee_counts.csv')\n",
    "print(f\"✅ Employee counts: {len(employee_counts):,} rows\")\n",
    "\n",
    "postings = pd.read_csv(f'{Config.CSV_PATH}postings.csv', on_bad_lines='skip', engine='python')\n",
    "print(f\"✅ Postings: {len(postings):,} rows × {len(postings.columns)} columns\")\n",
    "\n",
    "# Optional datasets: only a missing file is tolerated; other errors should surface\n",
    "try:\n",
    "    job_skills = pd.read_csv(f'{Config.CSV_PATH}job_skills.csv')\n",
    "    print(f\"✅ Job skills: {len(job_skills):,} rows\")\n",
    "except FileNotFoundError:\n",
    "    job_skills = None\n",
    "    print(\"⚠️  Job skills not found (optional)\")\n",
    "\n",
    "try:\n",
    "    job_industries = pd.read_csv(f'{Config.CSV_PATH}job_industries.csv')\n",
    "    print(f\"✅ Job industries: {len(job_industries):,} rows\")\n",
    "except FileNotFoundError:\n",
    "    job_industries = None\n",
    "    print(\"⚠️  Job industries not found (optional)\")\n",
    "\n",
    "print(\"\\n\" + \"=\" * 70)\n",
    "print(\"✅ All datasets loaded successfully!\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 5: Merge & Enrich Company Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🔗 Merging company data...\n",
      "\n",
      "✅ Aggregated industries for 24,365 companies\n",
      "✅ Aggregated specialties for 17,780 companies\n",
      "\n",
      "✅ Base company merge complete: 35,787 companies\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(\"🔗 Merging company data...\\n\")\n",
    "\n",
    "# Aggregate industries\n",
    "company_industries_agg = company_industries.groupby('company_id')['industry'].apply(\n",
    "    lambda x: ', '.join(map(str, x.tolist()))\n",
    ").reset_index()\n",
    "company_industries_agg.columns = ['company_id', 'industries_list']\n",
    "print(f\"✅ Aggregated industries for {len(company_industries_agg):,} companies\")\n",
    "\n",
    "# Aggregate specialties\n",
    "company_specialties_agg = company_specialties.groupby('company_id')['speciality'].apply(\n",
    "    lambda x: ' | '.join(x.astype(str).tolist())\n",
    ").reset_index()\n",
    "company_specialties_agg.columns = ['company_id', 'specialties_list']\n",
    "print(f\"✅ Aggregated specialties for {len(company_specialties_agg):,} companies\")\n",
    "\n",
    "# Merge all company data\n",
    "companies_merged = companies_base.copy()\n",
    "companies_merged = companies_merged.merge(company_industries_agg, on='company_id', how='left')\n",
    "companies_merged = companies_merged.merge(company_specialties_agg, on='company_id', how='left')\n",
    "# Note: employee_counts holds several time_recorded snapshots per company, so this merge\n",
    "# yields one row per snapshot; the repeated company_id rows are removed in the cleaning step later\n",
    "companies_merged = companies_merged.merge(employee_counts, on='company_id', how='left')\n",
    "\n",
    "print(f\"\\n✅ Base company merge complete: {len(companies_merged):,} companies\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 6: Enrich with Job Postings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🌉 Enriching companies with job posting data...\n",
      "\n",
      "======================================================================\n",
      "KEY INSIGHT: Postings = 'Requirements Language Bridge'\n",
      "======================================================================\n",
      "\n",
      "✅ Enriched 35,787 companies with posting data\n",
      "\n"
     ]
    }
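   ],
   "source": [
    "print(\"🌉 Enriching companies with job posting data...\\n\")\n",
    "print(\"=\" * 70)\n",
    "print(\"KEY INSIGHT: Postings = 'Requirements Language Bridge'\")\n",
    "print(\"=\" * 70 + \"\\n\")\n",
    "\n",
    "postings = postings.fillna('')\n",
    "# company_id is parsed as float when postings.csv has missing ids, so astype(str) gives '1009.0';\n",
    "# strip the trailing '.0' so the keys match the integer-based ids in companies_merged\n",
    "postings['company_id'] = postings['company_id'].astype(str).str.replace(r'\\.0$', '', regex=True)\n",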
    "\n",
    "# Aggregate postings per company\n",
    "postings_agg = postings.groupby('company_id').agg({\n",
    "    'title': lambda x: ' | '.join(x.astype(str).tolist()[:10]),\n",
    "    'description': lambda x: ' '.join(x.astype(str).tolist()[:5]),\n",
    "    'skills_desc': lambda x: ' | '.join(x.dropna().astype(str).tolist()),\n",
    "    'formatted_experience_level': lambda x: ' | '.join(x.dropna().unique().astype(str)),\n",
    "}).reset_index()\n",
    "\n",
    "postings_agg.columns = ['company_id', 'posted_job_titles', 'posted_descriptions', 'required_skills', 'experience_levels']\n",
    "\n",
    "companies_merged['company_id'] = companies_merged['company_id'].astype(str)\n",
    "companies_full = companies_merged.merge(postings_agg, on='company_id', how='left').fillna('')\n",
    "\n",
    "print(f\"✅ Enriched {len(companies_full):,} companies with posting data\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>company_id</th>\n",
       "      <th>name</th>\n",
       "      <th>description</th>\n",
       "      <th>company_size</th>\n",
       "      <th>state</th>\n",
       "      <th>country</th>\n",
       "      <th>city</th>\n",
       "      <th>zip_code</th>\n",
       "      <th>address</th>\n",
       "      <th>url</th>\n",
       "      <th>industries_list</th>\n",
       "      <th>specialties_list</th>\n",
       "      <th>employee_count</th>\n",
       "      <th>follower_count</th>\n",
       "      <th>time_recorded</th>\n",
       "      <th>posted_job_titles</th>\n",
       "      <th>posted_descriptions</th>\n",
       "      <th>required_skills</th>\n",
       "      <th>experience_levels</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1009</td>\n",
       "      <td>IBM</td>\n",
       "      <td>At IBM, we do more than work. We create. We cr...</td>\n",
       "      <td>7.0</td>\n",
       "      <td>NY</td>\n",
       "      <td>US</td>\n",
       "      <td>Armonk, New York</td>\n",
       "      <td>10504</td>\n",
       "      <td>International Business Machines Corp.</td>\n",
       "      <td>https://www.linkedin.com/company/ibm</td>\n",
       "      <td>IT Services and IT Consulting</td>\n",
       "      <td>Cloud | Mobile | Cognitive | Security | Resear...</td>\n",
       "      <td>314102</td>\n",
       "      <td>16253625</td>\n",
       "      <td>1712378162</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1009</td>\n",
       "      <td>IBM</td>\n",
       "      <td>At IBM, we do more than work. We create. We cr...</td>\n",
       "      <td>7.0</td>\n",
       "      <td>NY</td>\n",
       "      <td>US</td>\n",
       "      <td>Armonk, New York</td>\n",
       "      <td>10504</td>\n",
       "      <td>International Business Machines Corp.</td>\n",
       "      <td>https://www.linkedin.com/company/ibm</td>\n",
       "      <td>IT Services and IT Consulting</td>\n",
       "      <td>Cloud | Mobile | Cognitive | Security | Resear...</td>\n",
       "      <td>313142</td>\n",
       "      <td>16309464</td>\n",
       "      <td>1713392385</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1009</td>\n",
       "      <td>IBM</td>\n",
       "      <td>At IBM, we do more than work. We create. We cr...</td>\n",
       "      <td>7.0</td>\n",
       "      <td>NY</td>\n",
       "      <td>US</td>\n",
       "      <td>Armonk, New York</td>\n",
       "      <td>10504</td>\n",
       "      <td>International Business Machines Corp.</td>\n",
       "      <td>https://www.linkedin.com/company/ibm</td>\n",
       "      <td>IT Services and IT Consulting</td>\n",
       "      <td>Cloud | Mobile | Cognitive | Security | Resear...</td>\n",
       "      <td>313147</td>\n",
       "      <td>16309985</td>\n",
       "      <td>1713402495</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1009</td>\n",
       "      <td>IBM</td>\n",
       "      <td>At IBM, we do more than work. We create. We cr...</td>\n",
       "      <td>7.0</td>\n",
       "      <td>NY</td>\n",
       "      <td>US</td>\n",
       "      <td>Armonk, New York</td>\n",
       "      <td>10504</td>\n",
       "      <td>International Business Machines Corp.</td>\n",
       "      <td>https://www.linkedin.com/company/ibm</td>\n",
       "      <td>IT Services and IT Consulting</td>\n",
       "      <td>Cloud | Mobile | Cognitive | Security | Resear...</td>\n",
       "      <td>311223</td>\n",
       "      <td>16314846</td>\n",
       "      <td>1713501255</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1016</td>\n",
       "      <td>GE HealthCare</td>\n",
       "      <td>Every day millions of people feel the impact o...</td>\n",
       "      <td>7.0</td>\n",
       "      <td>0</td>\n",
       "      <td>US</td>\n",
       "      <td>Chicago</td>\n",
       "      <td>0</td>\n",
       "      <td>-</td>\n",
       "      <td>https://www.linkedin.com/company/gehealthcare</td>\n",
       "      <td>Hospitals and Health Care</td>\n",
       "      <td>Healthcare | Biotechnology</td>\n",
       "      <td>56873</td>\n",
       "      <td>2185368</td>\n",
       "      <td>1712382540</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  company_id           name  \\\n",
       "0       1009            IBM   \n",
       "1       1009            IBM   \n",
       "2       1009            IBM   \n",
       "3       1009            IBM   \n",
       "4       1016  GE HealthCare   \n",
       "\n",
       "                                         description company_size state  \\\n",
       "0  At IBM, we do more than work. We create. We cr...          7.0    NY   \n",
       "1  At IBM, we do more than work. We create. We cr...          7.0    NY   \n",
       "2  At IBM, we do more than work. We create. We cr...          7.0    NY   \n",
       "3  At IBM, we do more than work. We create. We cr...          7.0    NY   \n",
       "4  Every day millions of people feel the impact o...          7.0     0   \n",
       "\n",
       "  country              city zip_code                                address  \\\n",
       "0      US  Armonk, New York    10504  International Business Machines Corp.   \n",
       "1      US  Armonk, New York    10504  International Business Machines Corp.   \n",
       "2      US  Armonk, New York    10504  International Business Machines Corp.   \n",
       "3      US  Armonk, New York    10504  International Business Machines Corp.   \n",
       "4      US           Chicago        0                                      -   \n",
       "\n",
       "                                             url  \\\n",
       "0           https://www.linkedin.com/company/ibm   \n",
       "1           https://www.linkedin.com/company/ibm   \n",
       "2           https://www.linkedin.com/company/ibm   \n",
       "3           https://www.linkedin.com/company/ibm   \n",
       "4  https://www.linkedin.com/company/gehealthcare   \n",
       "\n",
       "                 industries_list  \\\n",
       "0  IT Services and IT Consulting   \n",
       "1  IT Services and IT Consulting   \n",
       "2  IT Services and IT Consulting   \n",
       "3  IT Services and IT Consulting   \n",
       "4      Hospitals and Health Care   \n",
       "\n",
       "                                    specialties_list  employee_count  \\\n",
       "0  Cloud | Mobile | Cognitive | Security | Resear...          314102   \n",
       "1  Cloud | Mobile | Cognitive | Security | Resear...          313142   \n",
       "2  Cloud | Mobile | Cognitive | Security | Resear...          313147   \n",
       "3  Cloud | Mobile | Cognitive | Security | Resear...          311223   \n",
       "4                         Healthcare | Biotechnology           56873   \n",
       "\n",
       "   follower_count  time_recorded posted_job_titles posted_descriptions  \\\n",
       "0        16253625     1712378162                                         \n",
       "1        16309464     1713392385                                         \n",
       "2        16309985     1713402495                                         \n",
       "3        16314846     1713501255                                         \n",
       "4         2185368     1712382540                                         \n",
       "\n",
       "  required_skills experience_levels  \n",
       "0                                    \n",
       "1                                    \n",
       "2                                    \n",
       "3                                    \n",
       "4                                    "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "companies_full.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "🔍 DUPLICATE DETECTION REPORT\n",
      "================================================================================\n",
      "\n",
      "┌─ 📊 resume_data.csv (Candidates)\n",
      "│  Primary Key: Resume_ID\n",
      "│  Total rows:     9,544\n",
      "│  Unique rows:    9,544\n",
      "│  Duplicates:     0\n",
      "│  Status:         ✅ CLEAN\n",
      "└─\n",
      "\n",
      "┌─ 📊 companies.csv (Companies Base)\n",
      "│  Primary Key: company_id\n",
      "│  Total rows:     24,473\n",
      "│  Unique rows:    24,473\n",
      "│  Duplicates:     0\n",
      "│  Status:         ✅ CLEAN\n",
      "└─\n",
      "\n",
      "┌─ 📊 company_industries.csv\n",
      "│  Primary Key: company_id + industry\n",
      "│  Total rows:     24,375\n",
      "│  Unique rows:    24,375\n",
      "│  Duplicates:     0\n",
      "│  Status:         ✅ CLEAN\n",
      "└─\n",
      "\n",
      "┌─ 📊 company_specialities.csv\n",
      "│  Primary Key: company_id + speciality\n",
      "│  Total rows:     169,387\n",
      "│  Unique rows:    169,387\n",
      "│  Duplicates:     0\n",
      "│  Status:         ✅ CLEAN\n",
      "└─\n",
      "\n",
      "┌─ 📊 employee_counts.csv\n",
      "│  Primary Key: company_id\n",
      "│  Total rows:     35,787\n",
      "│  Unique rows:    24,473\n",
      "│  Duplicates:     11,314\n",
      "│  Status:         🔴 HAS DUPLICATES\n",
      "└─\n",
      "\n",
      "┌─ 📊 postings.csv (Job Postings)\n",
      "│  Primary Key: job_id\n",
      "│  Total rows:     123,849\n",
      "│  Unique rows:    123,849\n",
      "│  Duplicates:     0\n",
      "│  Status:         ✅ CLEAN\n",
      "└─\n",
      "\n",
      "┌─ 📊 companies_full (After Enrichment)\n",
      "│  Primary Key: company_id\n",
      "│  Total rows:     35,787\n",
      "│  Unique rows:    24,473\n",
      "│  Duplicates:     11,314\n",
      "│  Status:         🔴 HAS DUPLICATES\n",
      "│\n",
      "│  Top duplicate company_ids:\n",
      "│    - 33242739 (Confidential): 13 times\n",
      "│    - 5235 (LHH): 13 times\n",
      "│    - 79383535 (Akkodis): 12 times\n",
      "│    - 1681 (Robert Half): 12 times\n",
      "│    - 220336 (Hyatt Hotels Corporation): 11 times\n",
      "└─\n",
      "\n",
      "================================================================================\n",
      "📊 SUMMARY\n",
      "================================================================================\n",
      "\n",
      "✅ Clean datasets:          5/7\n",
      "🔴 Datasets with duplicates: 2/7\n",
      "🗑️  Total duplicates found:  22,628 rows\n",
      "\n",
      "⚠️  DUPLICATES DETECTED!\n",
      "================================================================================\n"
     ]
    }
   ],
   "source": [
    "## 🔍 Data Quality Check - Duplicate Detection\n",
    "\n",
    "\"\"\"\n",
    "Checking for duplicates in all datasets based on primary keys.\n",
    "This cell only REPORTS duplicates, does not modify data.\n",
    "\"\"\"\n",
    "\n",
    "print(\"=\" * 80)\n",
    "print(\"🔍 DUPLICATE DETECTION REPORT\")\n",
    "print(\"=\" * 80)\n",
    "print()\n",
    "\n",
    "# Define primary keys for each dataset\n",
    "duplicate_report = []\n",
    "\n",
    "# 1. Candidates\n",
    "print(\"┌─ 📊 resume_data.csv (Candidates)\")\n",
    "print(f\"│  Primary Key: Resume_ID\")\n",
    "cand_total = len(candidates)\n",
    "cand_unique = candidates['Resume_ID'].nunique() if 'Resume_ID' in candidates.columns else len(candidates)\n",
    "cand_dups = cand_total - cand_unique\n",
    "print(f\"│  Total rows:     {cand_total:,}\")\n",
    "print(f\"│  Unique rows:    {cand_unique:,}\")\n",
    "print(f\"│  Duplicates:     {cand_dups:,}\")\n",
    "print(f\"│  Status:         {'✅ CLEAN' if cand_dups == 0 else '🔴 HAS DUPLICATES'}\")\n",
    "print(\"└─\\n\")\n",
    "duplicate_report.append(('Candidates', cand_total, cand_unique, cand_dups))\n",
    "\n",
    "# 2. Companies Base\n",
    "print(\"┌─ 📊 companies.csv (Companies Base)\")\n",
    "print(f\"│  Primary Key: company_id\")\n",
    "comp_total = len(companies_base)\n",
    "comp_unique = companies_base['company_id'].nunique()\n",
    "comp_dups = comp_total - comp_unique\n",
    "print(f\"│  Total rows:     {comp_total:,}\")\n",
    "print(f\"│  Unique rows:    {comp_unique:,}\")\n",
    "print(f\"│  Duplicates:     {comp_dups:,}\")\n",
    "print(f\"│  Status:         {'✅ CLEAN' if comp_dups == 0 else '🔴 HAS DUPLICATES'}\")\n",
    "if comp_dups > 0:\n",
    "    dup_ids = companies_base[companies_base.duplicated('company_id', keep=False)]['company_id'].value_counts().head(3)\n",
    "    print(f\"│  Top duplicates:\")\n",
    "    for cid, count in dup_ids.items():\n",
    "        print(f\"│    - company_id={cid}: {count} times\")\n",
    "print(\"└─\\n\")\n",
    "duplicate_report.append(('Companies Base', comp_total, comp_unique, comp_dups))\n",
    "\n",
    "# 3. Company Industries\n",
    "print(\"┌─ 📊 company_industries.csv\")\n",
    "print(f\"│  Primary Key: company_id + industry\")\n",
    "ci_total = len(company_industries)\n",
    "ci_unique = len(company_industries.drop_duplicates(subset=['company_id', 'industry']))\n",
    "ci_dups = ci_total - ci_unique\n",
    "print(f\"│  Total rows:     {ci_total:,}\")\n",
    "print(f\"│  Unique rows:    {ci_unique:,}\")\n",
    "print(f\"│  Duplicates:     {ci_dups:,}\")\n",
    "print(f\"│  Status:         {'✅ CLEAN' if ci_dups == 0 else '🔴 HAS DUPLICATES'}\")\n",
    "print(\"└─\\n\")\n",
    "duplicate_report.append(('Company Industries', ci_total, ci_unique, ci_dups))\n",
    "\n",
    "# 4. Company Specialties\n",
    "print(\"┌─ 📊 company_specialities.csv\")\n",
    "print(f\"│  Primary Key: company_id + speciality\")\n",
    "cs_total = len(company_specialties)\n",
    "cs_unique = len(company_specialties.drop_duplicates(subset=['company_id', 'speciality']))\n",
    "cs_dups = cs_total - cs_unique\n",
    "print(f\"│  Total rows:     {cs_total:,}\")\n",
    "print(f\"│  Unique rows:    {cs_unique:,}\")\n",
    "print(f\"│  Duplicates:     {cs_dups:,}\")\n",
    "print(f\"│  Status:         {'✅ CLEAN' if cs_dups == 0 else '🔴 HAS DUPLICATES'}\")\n",
    "print(\"└─\\n\")\n",
    "duplicate_report.append(('Company Specialties', cs_total, cs_unique, cs_dups))\n",
    "\n",
    "# 5. Employee Counts\n",
    "print(\"┌─ 📊 employee_counts.csv\")\n",
    "print(f\"│  Primary Key: company_id\")\n",
    "ec_total = len(employee_counts)\n",
    "ec_unique = employee_counts['company_id'].nunique()\n",
    "ec_dups = ec_total - ec_unique\n",
    "print(f\"│  Total rows:     {ec_total:,}\")\n",
    "print(f\"│  Unique rows:    {ec_unique:,}\")\n",
    "print(f\"│  Duplicates:     {ec_dups:,}\")\n",
    "print(f\"│  Status:         {'✅ CLEAN' if ec_dups == 0 else '🔴 HAS DUPLICATES'}\")\n",
    "print(\"└─\\n\")\n",
    "duplicate_report.append(('Employee Counts', ec_total, ec_unique, ec_dups))\n",
    "\n",
    "# 6. Postings\n",
    "print(\"┌─ 📊 postings.csv (Job Postings)\")\n",
    "print(f\"│  Primary Key: job_id\")\n",
    "if 'job_id' in postings.columns:\n",
    "    post_total = len(postings)\n",
    "    post_unique = postings['job_id'].nunique()\n",
    "    post_dups = post_total - post_unique\n",
    "else:\n",
    "    post_total = len(postings)\n",
    "    post_unique = len(postings.drop_duplicates())\n",
    "    post_dups = post_total - post_unique\n",
    "print(f\"│  Total rows:     {post_total:,}\")\n",
    "print(f\"│  Unique rows:    {post_unique:,}\")\n",
    "print(f\"│  Duplicates:     {post_dups:,}\")\n",
    "print(f\"│  Status:         {'✅ CLEAN' if post_dups == 0 else '🔴 HAS DUPLICATES'}\")\n",
    "print(\"└─\\n\")\n",
    "duplicate_report.append(('Postings', post_total, post_unique, post_dups))\n",
    "\n",
    "# 7. Companies Full (After Merge)\n",
    "print(\"┌─ 📊 companies_full (After Enrichment)\")\n",
    "print(f\"│  Primary Key: company_id\")\n",
    "cf_total = len(companies_full)\n",
    "cf_unique = companies_full['company_id'].nunique()\n",
    "cf_dups = cf_total - cf_unique\n",
    "print(f\"│  Total rows:     {cf_total:,}\")\n",
    "print(f\"│  Unique rows:    {cf_unique:,}\")\n",
    "print(f\"│  Duplicates:     {cf_dups:,}\")\n",
    "print(f\"│  Status:         {'✅ CLEAN' if cf_dups == 0 else '🔴 HAS DUPLICATES'}\")\n",
    "if cf_dups > 0:\n",
    "    dup_ids = companies_full[companies_full.duplicated('company_id', keep=False)]['company_id'].value_counts().head(5)\n",
    "    print(f\"│\")\n",
    "    print(f\"│  Top duplicate company_ids:\")\n",
    "    for cid, count in dup_ids.items():\n",
    "        comp_name = companies_full[companies_full['company_id'] == cid]['name'].iloc[0]\n",
    "        print(f\"│    - {cid} ({comp_name}): {count} times\")\n",
    "print(\"└─\\n\")\n",
    "duplicate_report.append(('Companies Full', cf_total, cf_unique, cf_dups))\n",
    "\n",
    "# Summary\n",
    "print(\"=\" * 80)\n",
    "print(\"📊 SUMMARY\")\n",
    "print(\"=\" * 80)\n",
    "print()\n",
    "\n",
    "total_dups = sum(r[3] for r in duplicate_report)\n",
    "clean_datasets = sum(1 for r in duplicate_report if r[3] == 0)\n",
    "dirty_datasets = len(duplicate_report) - clean_datasets\n",
    "\n",
    "print(f\"✅ Clean datasets:          {clean_datasets}/{len(duplicate_report)}\")\n",
    "print(f\"🔴 Datasets with duplicates: {dirty_datasets}/{len(duplicate_report)}\")\n",
    "print(f\"🗑️  Total duplicates found:  {total_dups:,} rows\")\n",
    "print()\n",
    "\n",
    "if dirty_datasets > 0:\n",
    "    print(\"⚠️  DUPLICATES DETECTED!\")\n",
    "else:\n",
    "    print(\"✅ All datasets are clean! No duplicates found.\")\n",
    "\n",
    "print(\"=\" * 80)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🧹 CLEANING DUPLICATES...\n",
      "\n",
      "================================================================================\n",
      "✅ companies_base: Already clean\n",
      "\n",
      "✅ company_industries: Already clean\n",
      "\n",
      "✅ company_specialties: Already clean\n",
      "\n",
      "✅ employee_counts:\n",
      "   Removed 11,314 duplicates\n",
      "   35,787 → 24,473 rows\n",
      "\n",
      "✅ postings: Already clean\n",
      "\n",
      "✅ companies_full:\n",
      "   Removed 11,314 duplicates\n",
      "   35,787 → 24,473 rows\n",
      "\n",
      "================================================================================\n",
      "✅ DATA CLEANING COMPLETE!\n",
      "================================================================================\n",
      "\n",
      "📊 Total duplicates removed: 22,628 rows\n",
      "\n",
      "Cleaned datasets:\n",
      "  - employee_counts: 35,787 → 24,473\n",
      "  - companies_full: 35,787 → 24,473\n"
     ]
    }
   ],
   "source": [
    "\"\"\"\n",
    "## 🧹 Data Cleaning - Remove Duplicates\n",
    "\n",
    "Based on the report above, removing duplicates from datasets.\n",
    "\"\"\"\n",
    "\n",
    "print(\"🧹 CLEANING DUPLICATES...\\n\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "# Store original counts\n",
    "original_counts = {}\n",
    "\n",
    "# 1. Clean Companies Base (if needed)\n",
    "if len(companies_base) != companies_base['company_id'].nunique():\n",
    "    original_counts['companies_base'] = len(companies_base)\n",
    "    companies_base = companies_base.drop_duplicates(subset=['company_id'], keep='first')\n",
    "    removed = original_counts['companies_base'] - len(companies_base)\n",
    "    print(f\"✅ companies_base:\")\n",
    "    print(f\"   Removed {removed:,} duplicates\")\n",
    "    print(f\"   {original_counts['companies_base']:,} → {len(companies_base):,} rows\\n\")\n",
    "else:\n",
    "    print(f\"✅ companies_base: Already clean\\n\")\n",
    "\n",
    "# 2. Clean Company Industries (if needed)\n",
    "if len(company_industries) != len(company_industries.drop_duplicates(subset=['company_id', 'industry'])):\n",
    "    original_counts['company_industries'] = len(company_industries)\n",
    "    company_industries = company_industries.drop_duplicates(subset=['company_id', 'industry'], keep='first')\n",
    "    removed = original_counts['company_industries'] - len(company_industries)\n",
    "    print(f\"✅ company_industries:\")\n",
    "    print(f\"   Removed {removed:,} duplicates\")\n",
    "    print(f\"   {original_counts['company_industries']:,} → {len(company_industries):,} rows\\n\")\n",
    "else:\n",
    "    print(f\"✅ company_industries: Already clean\\n\")\n",
    "\n",
    "# 3. Clean Company Specialties (if needed)\n",
    "if len(company_specialties) != len(company_specialties.drop_duplicates(subset=['company_id', 'speciality'])):\n",
    "    original_counts['company_specialties'] = len(company_specialties)\n",
    "    company_specialties = company_specialties.drop_duplicates(subset=['company_id', 'speciality'], keep='first')\n",
    "    removed = original_counts['company_specialties'] - len(company_specialties)\n",
    "    print(f\"✅ company_specialties:\")\n",
    "    print(f\"   Removed {removed:,} duplicates\")\n",
    "    print(f\"   {original_counts['company_specialties']:,} → {len(company_specialties):,} rows\\n\")\n",
    "else:\n",
    "    print(f\"✅ company_specialties: Already clean\\n\")\n",
    "\n",
    "# 4. Clean Employee Counts (if needed)\n",
    "if len(employee_counts) != employee_counts['company_id'].nunique():\n",
    "    original_counts['employee_counts'] = len(employee_counts)\n",
    "    employee_counts = employee_counts.drop_duplicates(subset=['company_id'], keep='first')\n",
    "    removed = original_counts['employee_counts'] - len(employee_counts)\n",
    "    print(f\"✅ employee_counts:\")\n",
    "    print(f\"   Removed {removed:,} duplicates\")\n",
    "    print(f\"   {original_counts['employee_counts']:,} → {len(employee_counts):,} rows\\n\")\n",
    "else:\n",
    "    print(f\"✅ employee_counts: Already clean\\n\")\n",
    "\n",
    "# 5. Clean Postings (if needed)\n",
    "if 'job_id' in postings.columns:\n",
    "    if len(postings) != postings['job_id'].nunique():\n",
    "        original_counts['postings'] = len(postings)\n",
    "        postings = postings.drop_duplicates(subset=['job_id'], keep='first')\n",
    "        removed = original_counts['postings'] - len(postings)\n",
    "        print(f\"✅ postings:\")\n",
    "        print(f\"   Removed {removed:,} duplicates\")\n",
    "        print(f\"   {original_counts['postings']:,} → {len(postings):,} rows\\n\")\n",
    "    else:\n",
    "        print(f\"✅ postings: Already clean\\n\")\n",
    "\n",
    "# 6. Clean Companies Full (if needed)\n",
    "if len(companies_full) != companies_full['company_id'].nunique():\n",
    "    original_counts['companies_full'] = len(companies_full)\n",
    "    companies_full = companies_full.drop_duplicates(subset=['company_id'], keep='first')\n",
    "    removed = original_counts['companies_full'] - len(companies_full)\n",
    "    print(f\"✅ companies_full:\")\n",
    "    print(f\"   Removed {removed:,} duplicates\")\n",
    "    print(f\"   {original_counts['companies_full']:,} → {len(companies_full):,} rows\\n\")\n",
    "else:\n",
    "    print(f\"✅ companies_full: Already clean\\n\")\n",
    "\n",
    "print(\"=\" * 80)\n",
    "print(\"✅ DATA CLEANING COMPLETE!\")\n",
    "print(\"=\" * 80)\n",
    "print()\n",
    "\n",
    "# Summary\n",
    "if original_counts:\n",
    "    total_removed = sum(original_counts[k] - globals()[k].shape[0] if k in globals() else 0 \n",
    "                       for k in original_counts.keys())\n",
    "    print(f\"📊 Total duplicates removed: {total_removed:,} rows\")\n",
    "    print()\n",
    "    print(\"Cleaned datasets:\")\n",
    "    for dataset, original in original_counts.items():\n",
    "        current = len(globals()[dataset]) if dataset in globals() else 0\n",
    "        print(f\"  - {dataset}: {original:,} → {current:,}\")\n",
    "else:\n",
    "    print(\"✅ No duplicates found - all datasets were already clean!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 7: Load Embedding Model & Pre-computed Vectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🧠 Loading embedding model...\n",
      "\n",
      "✅ Model loaded: all-MiniLM-L6-v2\n",
      "📐 Embedding dimension: ℝ^384\n",
      "\n",
      "📂 Loading pre-computed embeddings...\n",
      "✅ Loaded from ../processed/\n",
      "📊 Candidate vectors: (9544, 384)\n",
      "📊 Company vectors: (35787, 384)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(\"🧠 Loading embedding model...\\n\")\n",
    "model = SentenceTransformer(Config.EMBEDDING_MODEL)\n",
    "embedding_dim = model.get_sentence_embedding_dimension()\n",
    "print(f\"✅ Model loaded: {Config.EMBEDDING_MODEL}\")\n",
    "print(f\"📐 Embedding dimension: ℝ^{embedding_dim}\\n\")\n",
    "\n",
    "print(\"📂 Loading pre-computed embeddings...\")\n",
    "\n",
    "try:\n",
    "    # Try to load from processed folder\n",
    "    cand_vectors = np.load(f'{Config.PROCESSED_PATH}candidate_embeddings.npy')\n",
    "    comp_vectors = np.load(f'{Config.PROCESSED_PATH}company_embeddings.npy')\n",
    "    \n",
    "    print(f\"✅ Loaded from {Config.PROCESSED_PATH}\")\n",
    "    print(f\"📊 Candidate vectors: {cand_vectors.shape}\")\n",
    "    print(f\"📊 Company vectors: {comp_vectors.shape}\\n\")\n",
    "    \n",
    "except FileNotFoundError:\n",
    "    print(\"⚠️  Pre-computed embeddings not found!\")\n",
    "    print(\"   Embeddings will need to be generated (takes ~5-10 minutes)\")\n",
    "    print(\"   This is normal if running for the first time.\\n\")\n",
    "    \n",
    "    # You can add embedding generation code here if needed\n",
    "    # For now, we'll skip to keep notebook clean\n",
    "    cand_vectors = None\n",
    "    comp_vectors = None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 8: Core Matching Function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ Matching function ready\n"
     ]
    }
   ],
   "source": [
    "def find_top_matches(candidate_idx: int, top_k: int = 10) -> List[tuple]:\n",
    "    \"\"\"\n",
    "    Find top K company matches for a candidate using cosine similarity.\n",
    "    \n",
    "    Args:\n",
    "        candidate_idx: Index of candidate\n",
    "        top_k: Number of top matches to return\n",
    "    \n",
    "    Returns:\n",
    "        List of (company_index, similarity_score) tuples\n",
    "    \"\"\"\n",
    "    if cand_vectors is None or comp_vectors is None:\n",
    "        raise ValueError(\"Embeddings not loaded! Please run Step 8 first.\")\n",
    "    \n",
    "    cand_vec = cand_vectors[candidate_idx].reshape(1, -1)\n",
    "    similarities = cosine_similarity(cand_vec, comp_vectors)[0]\n",
    "    top_indices = np.argsort(similarities)[::-1][:top_k]\n",
    "    \n",
    "    return [(int(idx), float(similarities[idx])) for idx in top_indices]\n",
    "\n",
    "print(\"✅ Matching function ready\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 9: Initialize FREE LLM (Hugging Face)\n",
    "\n",
    "### Get your FREE token: https://huggingface.co/settings/tokens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ Hugging Face client initialized (FREE)\n",
      "🤖 Model: meta-llama/Llama-3.2-3B-Instruct\n",
      "💰 Cost: $0.00 (completely free!)\n",
      "\n",
      "✅ LLM helper functions ready\n"
     ]
    }
   ],
   "source": [
    "# Initialize Hugging Face Inference Client (FREE)\n",
    "if Config.HF_TOKEN:\n",
    "    try:\n",
    "        hf_client = InferenceClient(token=Config.HF_TOKEN)\n",
    "        print(\"✅ Hugging Face client initialized (FREE)\")\n",
    "        print(f\"🤖 Model: {Config.LLM_MODEL}\")\n",
    "        print(\"💰 Cost: $0.00 (completely free!)\\n\")\n",
    "        LLM_AVAILABLE = True\n",
    "    except Exception as e:\n",
    "        print(f\"⚠️  Failed to initialize HF client: {e}\")\n",
    "        LLM_AVAILABLE = False\n",
    "else:\n",
    "    print(\"⚠️  No Hugging Face token configured\")\n",
    "    print(\"   LLM features will be disabled\")\n",
    "    print(\"\\n📝 To enable:\")\n",
    "    print(\"   1. Go to: https://huggingface.co/settings/tokens\")\n",
    "    print(\"   2. Create a token (free)\")\n",
    "    print(\"   3. Set: Config.HF_TOKEN = 'your-token-here'\\n\")\n",
    "    LLM_AVAILABLE = False\n",
    "    hf_client = None\n",
    "\n",
    "def call_llm(prompt: str, max_tokens: int = 1000) -> str:\n",
    "    \"\"\"\n",
    "    Generic LLM call using Hugging Face Inference API (FREE).\n",
    "    \"\"\"\n",
    "    if not LLM_AVAILABLE:\n",
    "        return \"[LLM not available - check .env file for HF_TOKEN]\"\n",
    "    \n",
    "    try:\n",
    "        response = hf_client.chat_completion(  # ✅ chat_completion\n",
    "            messages=[{\"role\": \"user\", \"content\": prompt}],\n",
    "            model=Config.LLM_MODEL,\n",
    "            max_tokens=max_tokens,\n",
    "            temperature=0.7\n",
    "        )\n",
    "        return response.choices[0].message.content  # ✅ Extrai conteúdo\n",
    "    except Exception as e:\n",
    "        return f\"[Error: {str(e)}]\"\n",
    "\n",
    "print(\"✅ LLM helper functions ready\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 10: Pydantic Schemas for Structured Output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ Pydantic schemas defined\n"
     ]
    }
   ],
   "source": [
    "class JobLevelClassification(BaseModel):\n",
    "    \"\"\"Job level classification result\"\"\"\n",
    "    level: Literal['Entry', 'Mid', 'Senior', 'Executive']\n",
    "    confidence: float = Field(ge=0.0, le=1.0)\n",
    "    reasoning: str\n",
    "\n",
    "class SkillsTaxonomy(BaseModel):\n",
    "    \"\"\"Structured skills extraction\"\"\"\n",
    "    technical_skills: List[str] = Field(default_factory=list)\n",
    "    soft_skills: List[str] = Field(default_factory=list)\n",
    "    certifications: List[str] = Field(default_factory=list)\n",
    "    languages: List[str] = Field(default_factory=list)\n",
    "\n",
    "class MatchExplanation(BaseModel):\n",
    "    \"\"\"Match reasoning\"\"\"\n",
    "    overall_score: float = Field(ge=0.0, le=1.0)\n",
    "    match_strengths: List[str]\n",
    "    skill_gaps: List[str]\n",
    "    recommendation: str\n",
    "    fit_summary: str = Field(max_length=200)\n",
    "\n",
    "print(\"✅ Pydantic schemas defined\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 11: Job Level Classification (Zero-Shot)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🧪 Testing zero-shot classification...\n",
      "\n",
      "📊 Classification Result:\n",
      "{\n",
      "  \"level\": \"Entry\",\n",
      "  \"confidence\": 0.75,\n",
      "  \"reasoning\": \"The job posting mentions 'some experience in graphic design' and 'fun, kind, ambitious members of the sales team' indicating a junior role.\"\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "def classify_job_level_zero_shot(job_description: str) -> Dict:\n",
    "    \"\"\"\n",
    "    Zero-shot job level classification.\n",
    "    \n",
    "    Returns classification as: Entry, Mid, Senior, or Executive\n",
    "    \"\"\"\n",
    "    \n",
    "    prompt = f\"\"\"Classify this job posting into ONE seniority level.\n",
    "\n",
    "Levels:\n",
    "- Entry: 0-2 years experience, junior roles\n",
    "- Mid: 3-5 years experience, independent work\n",
    "- Senior: 6-10 years experience, technical leadership\n",
    "- Executive: 10+ years, strategic leadership, C-level\n",
    "\n",
    "Job Posting:\n",
    "{job_description[:500]}\n",
    "\n",
    "Return ONLY valid JSON:\n",
    "{{\n",
    "    \"level\": \"Entry|Mid|Senior|Executive\",\n",
    "    \"confidence\": 0.85,\n",
    "    \"reasoning\": \"Brief explanation\"\n",
    "}}\n",
    "\"\"\"\n",
    "    \n",
    "    response = call_llm(prompt)\n",
    "    \n",
    "    try:\n",
    "        # Extract JSON\n",
    "        json_str = response.strip()\n",
    "        if '```json' in json_str:\n",
    "            json_str = json_str.split('```json')[1].split('```')[0].strip()\n",
    "        elif '```' in json_str:\n",
    "            json_str = json_str.split('```')[1].split('```')[0].strip()\n",
    "        \n",
    "        # Find JSON in response\n",
    "        if '{' in json_str and '}' in json_str:\n",
    "            start = json_str.index('{')\n",
    "            end = json_str.rindex('}') + 1\n",
    "            json_str = json_str[start:end]\n",
    "        \n",
    "        result = json.loads(json_str)\n",
    "        return result\n",
    "    except:\n",
    "        return {\n",
    "            \"level\": \"Unknown\",\n",
    "            \"confidence\": 0.0,\n",
    "            \"reasoning\": \"Failed to parse response\"\n",
    "        }\n",
    "\n",
    "# Test if LLM available and data loaded\n",
    "if LLM_AVAILABLE and len(postings) > 0:\n",
    "    print(\"🧪 Testing zero-shot classification...\\n\")\n",
    "    sample = postings.iloc[0]['description']\n",
    "    result = classify_job_level_zero_shot(sample)\n",
    "    \n",
    "    print(\"📊 Classification Result:\")\n",
    "    print(json.dumps(result, indent=2))\n",
    "else:\n",
    "    print(\"⚠️  Skipped - LLM not available or no data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 12: Few-Shot Learning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🧪 Comparing Zero-Shot vs Few-Shot...\n",
      "\n",
      "📊 Comparison:\n",
      "Zero-shot: Unknown (confidence: 0.00)\n",
      "Few-shot:  Entry (confidence: 0.90)\n"
     ]
    }
   ],
   "source": [
    "def classify_job_level_few_shot(job_description: str) -> Dict:\n",
    "    \"\"\"\n",
    "    Few-shot classification with examples.\n",
    "    \"\"\"\n",
    "    \n",
    "    prompt = f\"\"\"Classify this job posting using examples.\n",
    "\n",
    "EXAMPLES:\n",
    "\n",
    "Example 1 (Entry):\n",
    "\"Recent graduate wanted. Python basics. Mentorship provided.\"\n",
    "→ Entry level (learning focus, 0-2 years)\n",
    "\n",
    "Example 2 (Senior):\n",
    "\"5+ years backend. Lead team of 3. System architecture.\"\n",
    "→ Senior level (technical leadership, 6-10 years)\n",
    "\n",
    "Example 3 (Executive):\n",
    "\"CTO position. 15+ years. Define technical strategy.\"\n",
    "→ Executive level (C-level, strategic)\n",
    "\n",
    "NOW CLASSIFY:\n",
    "{job_description[:500]}\n",
    "\n",
    "Return JSON:\n",
    "{{\n",
    "    \"level\": \"Entry|Mid|Senior|Executive\",\n",
    "    \"confidence\": 0.0-1.0,\n",
    "    \"reasoning\": \"Explain\"\n",
    "}}\n",
    "\"\"\"\n",
    "    \n",
    "    response = call_llm(prompt)\n",
    "    \n",
    "    try:\n",
    "        json_str = response.strip()\n",
    "        if '```json' in json_str:\n",
    "            json_str = json_str.split('```json')[1].split('```')[0].strip()\n",
    "        \n",
    "        if '{' in json_str and '}' in json_str:\n",
    "            start = json_str.index('{')\n",
    "            end = json_str.rindex('}') + 1\n",
    "            json_str = json_str[start:end]\n",
    "        \n",
    "        result = json.loads(json_str)\n",
    "        return result\n",
    "    except:\n",
    "        return {\"level\": \"Unknown\", \"confidence\": 0.0, \"reasoning\": \"Parse error\"}\n",
    "\n",
    "# Compare zero-shot vs few-shot\n",
    "if LLM_AVAILABLE and len(postings) > 0:\n",
    "    print(\"🧪 Comparing Zero-Shot vs Few-Shot...\\n\")\n",
    "    sample = postings.iloc[0]['description']\n",
    "    \n",
    "    zero = classify_job_level_zero_shot(sample)\n",
    "    few = classify_job_level_few_shot(sample)\n",
    "    \n",
    "    print(\"📊 Comparison:\")\n",
    "    print(f\"Zero-shot: {zero['level']} (confidence: {zero['confidence']:.2f})\")\n",
    "    print(f\"Few-shot:  {few['level']} (confidence: {few['confidence']:.2f})\")\n",
    "else:\n",
    "    print(\"⚠️  Skipped\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 13: Structured Skills Extraction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🔍 Testing skills extraction...\n",
      "\n",
      "📊 Extracted Skills:\n",
      "{\n",
      "  \"technical_skills\": [\n",
      "    \"Adobe Creative Cloud\",\n",
      "    \"Microsoft Office Suite\"\n",
      "  ],\n",
      "  \"soft_skills\": [\n",
      "    \"Communication\",\n",
      "    \"Leadership\",\n",
      "    \"Organization\",\n",
      "    \"Responsibility\",\n",
      "    \"Respect\",\n",
      "    \"Positive attitude\",\n",
      "    \"Proactivity\",\n",
      "    \"Creativity\",\n",
      "    \"Time management\",\n",
      "    \"Cool-under-pressure\"\n",
      "  ],\n",
      "  \"certifications\": [\n",
      "    \"Adobe Creative Cloud skills\"\n",
      "  ],\n",
      "  \"languages\": [\n",
      "    \"English\"\n",
      "  ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "def extract_skills_taxonomy(job_description: str) -> Dict:\n",
    "    \"\"\"\n",
    "    Extract structured skills using LLM + Pydantic validation.\n",
    "    \"\"\"\n",
    "    \n",
    "    prompt = f\"\"\"Extract skills from this job posting.\n",
    "\n",
    "Job Posting:\n",
    "{job_description[:800]}\n",
    "\n",
    "Return ONLY valid JSON:\n",
    "{{\n",
    "    \"technical_skills\": [\"Python\", \"Docker\", \"AWS\"],\n",
    "    \"soft_skills\": [\"Communication\", \"Leadership\"],\n",
    "    \"certifications\": [\"AWS Certified\"],\n",
    "    \"languages\": [\"English\", \"Danish\"]\n",
    "}}\n",
    "\"\"\"\n",
    "    \n",
    "    response = call_llm(prompt, max_tokens=800)\n",
    "    \n",
    "    try:\n",
    "        json_str = response.strip()\n",
    "        if '```json' in json_str:\n",
    "            json_str = json_str.split('```json')[1].split('```')[0].strip()\n",
    "        \n",
    "        if '{' in json_str and '}' in json_str:\n",
    "            start = json_str.index('{')\n",
    "            end = json_str.rindex('}') + 1\n",
    "            json_str = json_str[start:end]\n",
    "        \n",
    "        data = json.loads(json_str)\n",
    "        # Validate with Pydantic\n",
    "        validated = SkillsTaxonomy(**data)\n",
    "        return validated.model_dump()\n",
    "    except:\n",
    "        return {\n",
    "            \"technical_skills\": [],\n",
    "            \"soft_skills\": [],\n",
    "            \"certifications\": [],\n",
    "            \"languages\": []\n",
    "        }\n",
    "\n",
    "# Test extraction\n",
    "if LLM_AVAILABLE and len(postings) > 0:\n",
    "    print(\"🔍 Testing skills extraction...\\n\")\n",
    "    sample = postings.iloc[0]['description']\n",
    "    skills = extract_skills_taxonomy(sample)\n",
    "    \n",
    "    print(\"📊 Extracted Skills:\")\n",
    "    print(json.dumps(skills, indent=2))\n",
    "else:\n",
    "    print(\"⚠️  Skipped\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 14: Match Explainability"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "💡 Testing match explainability...\n",
      "\n",
      "📊 Match Explanation:\n",
      "{\n",
      "  \"overall_score\": 0.7028058171272278,\n",
      "  \"match_strengths\": [\n",
      "    \"Unable to generate\"\n",
      "  ],\n",
      "  \"skill_gaps\": [],\n",
      "  \"recommendation\": \"Review manually\",\n",
      "  \"fit_summary\": \"Match score: 0.70\"\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "def explain_match(candidate_idx: int, company_idx: int, similarity_score: float) -> Dict:\n",
    "    \"\"\"\n",
    "    Generate LLM explanation for why candidate matches company.\n",
    "    \"\"\"\n",
    "    \n",
    "    cand = candidates.iloc[candidate_idx]\n",
    "    comp = companies_full.iloc[company_idx]\n",
    "    \n",
    "    cand_skills = str(cand.get('skills', 'N/A'))[:300]\n",
    "    cand_exp = str(cand.get('positions', 'N/A'))[:300]\n",
    "    comp_req = str(comp.get('required_skills', 'N/A'))[:300]\n",
    "    comp_name = comp.get('name', 'Unknown')\n",
    "    \n",
    "    prompt = f\"\"\"Explain why this candidate matches this company.\n",
    "\n",
    "Candidate:\n",
    "Skills: {cand_skills}\n",
    "Experience: {cand_exp}\n",
    "\n",
    "Company: {comp_name}\n",
    "Requirements: {comp_req}\n",
    "\n",
    "Similarity Score: {similarity_score:.2f}\n",
    "\n",
    "Return JSON:\n",
    "{{\n",
    "    \"overall_score\": {similarity_score},\n",
    "    \"match_strengths\": [\"Top 3-5 matching factors\"],\n",
    "    \"skill_gaps\": [\"Missing skills\"],\n",
    "    \"recommendation\": \"What candidate should do\",\n",
    "    \"fit_summary\": \"One sentence summary\"\n",
    "}}\n",
    "\"\"\"\n",
    "    \n",
    "    response = call_llm(prompt, max_tokens=1000)\n",
    "    \n",
    "    try:\n",
    "        json_str = response.strip()\n",
    "        if '```json' in json_str:\n",
    "            json_str = json_str.split('```json')[1].split('```')[0].strip()\n",
    "        \n",
    "        if '{' in json_str and '}' in json_str:\n",
    "            start = json_str.index('{')\n",
    "            end = json_str.rindex('}') + 1\n",
    "            json_str = json_str[start:end]\n",
    "        \n",
    "        data = json.loads(json_str)\n",
    "        return data\n",
    "    except:\n",
    "        return {\n",
    "            \"overall_score\": similarity_score,\n",
    "            \"match_strengths\": [\"Unable to generate\"],\n",
    "            \"skill_gaps\": [],\n",
    "            \"recommendation\": \"Review manually\",\n",
    "            \"fit_summary\": f\"Match score: {similarity_score:.2f}\"\n",
    "        }\n",
    "\n",
    "# Test explainability\n",
    "if LLM_AVAILABLE and cand_vectors is not None and len(candidates) > 0:\n",
    "    print(\"💡 Testing match explainability...\\n\")\n",
    "    matches = find_top_matches(0, top_k=1)\n",
    "    if matches:\n",
    "        comp_idx, score = matches[0]\n",
    "        explanation = explain_match(0, comp_idx, score)\n",
    "        \n",
    "        print(\"📊 Match Explanation:\")\n",
    "        print(json.dumps(explanation, indent=2))\n",
    "else:\n",
    "    print(\"⚠️  Skipped - requirements not met\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 16: Detailed Match Visualization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🔍 DETAILED MATCH ANALYSIS\n",
      "====================================================================================================\n",
      "\n",
      "🎯 CANDIDATE #0\n",
      "Resume ID: N/A\n",
      "Category: N/A\n",
      "Skills: ['Big Data', 'Hadoop', 'Hive', 'Python', 'Mapreduce', 'Spark', 'Java', 'Machine Learning', 'Cloud', 'Hdfs', 'YARN', 'Core Java', 'Data Science', 'C++'...\n",
      "\n",
      "🔗 TOP 5 MATCHES:\n",
      "\n",
      "#1. TeachTown (Score: 0.7028)\n",
      "    Industries: E-Learning Providers...\n",
      "#3. Wolverine Power Systems (Score: 0.7026)\n",
      "    Industries: Renewable Energy Semiconductor Manufacturing...\n",
      "#5. Mariner (Score: 0.7010)\n",
      "    Industries: Financial Services...\n",
      "\n",
      "====================================================================================================\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[(9418, 0.7028058171272278),\n",
       " (30989, 0.7026211023330688),\n",
       " (9417, 0.7025721669197083),\n",
       " (30990, 0.7019376754760742),\n",
       " (9416, 0.7010321021080017)]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# ============================================================================\n",
    "# 🔍 DETAILED MATCH EXAMPLE\n",
    "# ============================================================================\n",
    "\n",
    "def show_detailed_match_example(candidate_idx=0, top_k=5):\n",
    "    print(\"🔍 DETAILED MATCH ANALYSIS\")\n",
    "    print(\"=\" * 100)\n",
    "    \n",
    "    if candidate_idx >= len(candidates):\n",
    "        print(f\"❌ ERROR: Candidate {candidate_idx} out of range\")\n",
    "        return None\n",
    "    \n",
    "    cand = candidates.iloc[candidate_idx]\n",
    "    \n",
    "    print(f\"\\n🎯 CANDIDATE #{candidate_idx}\")\n",
    "    print(f\"Resume ID: {cand.get('Resume_ID', 'N/A')}\")\n",
    "    print(f\"Category: {cand.get('Category', 'N/A')}\")\n",
    "    print(f\"Skills: {str(cand.get('skills', 'N/A'))[:150]}...\\n\")\n",
    "    \n",
    "    matches = find_top_matches(candidate_idx, top_k=top_k)\n",
    "    \n",
    "    print(f\"🔗 TOP {len(matches)} MATCHES:\\n\")\n",
    "    \n",
    "    for rank, (comp_idx, score) in enumerate(matches, 1):\n",
    "        if comp_idx >= len(companies_full):\n",
    "            continue\n",
    "        \n",
    "        company = companies_full.iloc[comp_idx]\n",
    "        print(f\"#{rank}. {company.get('name', 'N/A')} (Score: {score:.4f})\")\n",
    "        print(f\"    Industries: {str(company.get('industries_list', 'N/A'))[:60]}...\")\n",
    "    \n",
    "    print(\"\\n\" + \"=\" * 100)\n",
    "    return matches\n",
    "\n",
    "# Test\n",
    "show_detailed_match_example(candidate_idx=0, top_k=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 17: Bridging Concept Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🌉 THE BRIDGING CONCEPT\n",
      "==========================================================================================\n",
      "\n",
      "📊 DATA REALITY:\n",
      "   Total companies: 24,473\n",
      "   WITH postings: 0 (0.0%)\n",
      "   WITHOUT postings: 24,473\n",
      "\n",
      "🎯 THE PROBLEM:\n",
      "   Companies: 'We are in TECH INDUSTRY'\n",
      "   Candidates: 'I know PYTHON, AWS'\n",
      "   → Different languages! 🚫\n",
      "\n",
      "🌉 THE SOLUTION (BRIDGING):\n",
      "   1. Extract from postings: 'Need PYTHON developers'\n",
      "   2. Enrich company profile with skills\n",
      "   3. Now both speak SKILLS LANGUAGE! ✅\n",
      "\n",
      "==========================================================================================\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(Empty DataFrame\n",
       " Columns: [company_id, name, description, company_size, state, country, city, zip_code, address, url, industries_list, specialties_list, employee_count, follower_count, time_recorded, posted_job_titles, posted_descriptions, required_skills, experience_levels]\n",
       " Index: [],\n",
       "       company_id                               name  \\\n",
       " 0           1009                                IBM   \n",
       " 4           1016                      GE HealthCare   \n",
       " 14          1025         Hewlett Packard Enterprise   \n",
       " 18          1028                             Oracle   \n",
       " 23          1033                          Accenture   \n",
       " ...          ...                                ...   \n",
       " 35782  103463217                       JRC Services   \n",
       " 35783  103466352             Centent Consulting LLC   \n",
       " 35784  103467540  Kings and Queens Productions, LLC   \n",
       " 35785  103468936                           WebUnite   \n",
       " 35786  103472979                            BlackVe   \n",
       " \n",
       "                                              description company_size  \\\n",
       " 0      At IBM, we do more than work. We create. We cr...          7.0   \n",
       " 4      Every day millions of people feel the impact o...          7.0   \n",
       " 14     Official LinkedIn of Hewlett Packard Enterpris...          7.0   \n",
       " 18     We’re a cloud technology company that provides...          7.0   \n",
       " 23     Accenture is a leading global professional ser...          7.0   \n",
       " ...                                                  ...          ...   \n",
       " 35782                                                             2.0   \n",
       " 35783  Centent Consulting LLC is a reputable human re...                \n",
       " 35784  We are a small but mighty collection of thinke...                \n",
       " 35785  Our mission at WebUnite is to offer experience...                \n",
       " 35786                                                             1.0   \n",
       " \n",
       "               state country              city zip_code  \\\n",
       " 0                NY      US  Armonk, New York    10504   \n",
       " 4                 0      US           Chicago        0   \n",
       " 14            Texas      US           Houston    77389   \n",
       " 18            Texas      US            Austin    78741   \n",
       " 23                0      IE          Dublin 2        0   \n",
       " ...             ...     ...               ...      ...   \n",
       " 35782             0       0                 0        0   \n",
       " 35783             0       0                 0        0   \n",
       " 35784             0       0                 0        0   \n",
       " 35785  Pennsylvania      US       Southampton    18966   \n",
       " 35786             0       0                 0        0   \n",
       " \n",
       "                                      address  \\\n",
       " 0      International Business Machines Corp.   \n",
       " 4                                          -   \n",
       " 14               1701 E Mossy Oaks Rd Spring   \n",
       " 18                           2300 Oracle Way   \n",
       " 23                       Grand Canal Harbour   \n",
       " ...                                      ...   \n",
       " 35782                                      0   \n",
       " 35783                                      0   \n",
       " 35784                                      0   \n",
       " 35785                    720 2nd Street Pike   \n",
       " 35786                                      0   \n",
       " \n",
       "                                                      url  \\\n",
       " 0                   https://www.linkedin.com/company/ibm   \n",
       " 4          https://www.linkedin.com/company/gehealthcare   \n",
       " 14     https://www.linkedin.com/company/hewlett-packa...   \n",
       " 18               https://www.linkedin.com/company/oracle   \n",
       " 23            https://www.linkedin.com/company/accenture   \n",
       " ...                                                  ...   \n",
       " 35782       https://www.linkedin.com/company/jrcservices   \n",
       " 35783  https://www.linkedin.com/company/centent-consu...   \n",
       " 35784  https://www.linkedin.com/company/kings-and-que...   \n",
       " 35785          https://www.linkedin.com/company/webunite   \n",
       " 35786           https://www.linkedin.com/company/blackve   \n",
       " \n",
       "                                    industries_list  \\\n",
       " 0                    IT Services and IT Consulting   \n",
       " 4                        Hospitals and Health Care   \n",
       " 14                   IT Services and IT Consulting   \n",
       " 18                   IT Services and IT Consulting   \n",
       " 23                Business Consulting and Services   \n",
       " ...                                            ...   \n",
       " 35782                          Facilities Services   \n",
       " 35783             Business Consulting and Services   \n",
       " 35784  Broadcast Media Production and Distribution   \n",
       " 35785             Business Consulting and Services   \n",
       " 35786              Defense and Space Manufacturing   \n",
       " \n",
       "                                         specialties_list  employee_count  \\\n",
       " 0      Cloud | Mobile | Cognitive | Security | Resear...          314102   \n",
       " 4                             Healthcare | Biotechnology           56873   \n",
       " 14                                                                 79528   \n",
       " 18     enterprise | software | applications | databas...          192099   \n",
       " 23     Management Consulting | Systems Integration an...          574664   \n",
       " ...                                                  ...             ...   \n",
       " 35782                                                                  0   \n",
       " 35783                                                                  0   \n",
       " 35784                                                                  0   \n",
       " 35785                                                                  0   \n",
       " 35786                                                                  0   \n",
       " \n",
       "        follower_count  time_recorded posted_job_titles posted_descriptions  \\\n",
       " 0            16253625     1712378162                                         \n",
       " 4             2185368     1712382540                                         \n",
       " 14            3586194     1712870106                                         \n",
       " 18            9465968     1712642952                                         \n",
       " 23           11864908     1712641699                                         \n",
       " ...               ...            ...               ...                 ...   \n",
       " 35782              21     1713552037                                         \n",
       " 35783               0     1713550651                                         \n",
       " 35784              12     1713554225                                         \n",
       " 35785               1     1713535939                                         \n",
       " 35786               0     1713539379                                         \n",
       " \n",
       "       required_skills experience_levels  \n",
       " 0                                        \n",
       " 4                                        \n",
       " 14                                       \n",
       " 18                                       \n",
       " 23                                       \n",
       " ...               ...               ...  \n",
       " 35782                                    \n",
       " 35783                                    \n",
       " 35784                                    \n",
       " 35785                                    \n",
       " 35786                                    \n",
       " \n",
       " [24473 rows x 19 columns])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# ============================================================================\n",
    "# 🌉 BRIDGING CONCEPT ANALYSIS\n",
    "# ============================================================================\n",
    "\n",
    "def show_bridging_concept_analysis():\n",
    "    print(\"🌉 THE BRIDGING CONCEPT\")\n",
    "    print(\"=\" * 90)\n",
    "    \n",
    "    # A company is 'bridged' when its job postings supplied any required skills\n",
    "    has_skills = companies_full['required_skills'] != ''\n",
    "    companies_with = companies_full[has_skills]\n",
    "    companies_without = companies_full[~has_skills]\n",
    "    \n",
    "    print(\"\\n📊 DATA REALITY:\")\n",
    "    print(f\"   Total companies: {len(companies_full):,}\")\n",
    "    print(f\"   WITH postings: {len(companies_with):,} ({len(companies_with)/len(companies_full)*100:.1f}%)\")\n",
    "    print(f\"   WITHOUT postings: {len(companies_without):,}\\n\")\n",
    "    \n",
    "    print(\"🎯 THE PROBLEM:\")\n",
    "    print(\"   Companies: 'We are in TECH INDUSTRY'\")\n",
    "    print(\"   Candidates: 'I know PYTHON, AWS'\")\n",
    "    print(\"   → Different languages! 🚫\\n\")\n",
    "    \n",
    "    print(\"🌉 THE SOLUTION (BRIDGING):\")\n",
    "    print(\"   1. Extract from postings: 'Need PYTHON developers'\")\n",
    "    print(\"   2. Enrich company profile with skills\")\n",
    "    print(\"   3. Now both speak SKILLS LANGUAGE! ✅\\n\")\n",
    "    \n",
    "    print(\"=\" * 90)\n",
    "    return companies_with, companies_without\n",
    "\n",
    "# Capture the returned frames so the notebook does not echo both DataFrames\n",
    "companies_with, companies_without = show_bridging_concept_analysis()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 18: Export Results to CSV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "💾 Exporting 50 candidates (top 5 each)...\n",
      "\n",
      "   Processing 1/50...\n",
      "\n",
      "✅ Exported 129 matches\n",
      "📄 File: ../results/hrhub_matches.csv\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# ============================================================================\n",
    "# 💾 EXPORT MATCHES TO CSV\n",
    "# ============================================================================\n",
    "\n",
    "def export_matches_to_csv(num_candidates=100, top_k=10):\n",
    "    # Never report more candidates than the dataset actually holds\n",
    "    total = min(num_candidates, len(candidates))\n",
    "    print(f\"💾 Exporting {total} candidates (top {top_k} each)...\\n\")\n",
    "    \n",
    "    results = []\n",
    "    skipped = 0  # matches whose index falls outside companies_full\n",
    "    \n",
    "    for i in range(total):\n",
    "        if i % 50 == 0:\n",
    "            print(f\"   Processing {i+1}/{total}...\")\n",
    "        \n",
    "        matches = find_top_matches(i, top_k=top_k)\n",
    "        cand = candidates.iloc[i]\n",
    "        \n",
    "        for rank, (comp_idx, score) in enumerate(matches, 1):\n",
    "            # Out-of-range indices cannot be looked up; count them instead of dropping silently\n",
    "            if comp_idx >= len(companies_full):\n",
    "                skipped += 1\n",
    "                continue\n",
    "            \n",
    "            company = companies_full.iloc[comp_idx]\n",
    "            \n",
    "            results.append({\n",
    "                'candidate_id': i,\n",
    "                'candidate_category': cand.get('Category', 'N/A'),\n",
    "                'company_id': company.get('company_id', 'N/A'),\n",
    "                'company_name': company.get('name', 'N/A'),\n",
    "                'match_rank': rank,\n",
    "                'similarity_score': round(float(score), 4)\n",
    "            })\n",
    "    \n",
    "    results_df = pd.DataFrame(results)\n",
    "    output_file = f'{Config.RESULTS_PATH}hrhub_matches.csv'\n",
    "    results_df.to_csv(output_file, index=False)\n",
    "    \n",
    "    print(f\"\\n✅ Exported {len(results_df):,} matches\")\n",
    "    if skipped:\n",
    "        print(f\"⚠️ Skipped {skipped:,} matches with out-of-range company indices\")\n",
    "    print(f\"📄 File: {output_file}\\n\")\n",
    "    \n",
    "    return results_df\n",
    "\n",
    "# Export sample\n",
    "matches_df = export_matches_to_csv(num_candidates=50, top_k=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 📊 Step 19: Summary\n",
    "\n",
    "### What We Built"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "======================================================================\n",
      "🎯 HRHUB v2.1 - SUMMARY\n",
      "======================================================================\n",
      "\n",
      "✅ IMPLEMENTED:\n",
      "  1. Zero-Shot Job Classification (Entry/Mid/Senior/Executive)\n",
      "  2. Few-Shot Learning with Examples\n",
      "  3. Structured Skills Extraction (Pydantic schemas)\n",
      "  4. Match Explainability (LLM-generated reasoning)\n",
      "  5. FREE LLM Integration (Hugging Face)\n",
      "  6. Flexible Data Loading (Upload OR Google Drive)\n",
      "\n",
      "💰 COST: $0.00 (completely free!)\n",
      "\n",
      "📈 COURSE ALIGNMENT:\n",
      "  ✅ LLMs for structured output\n",
      "  ✅ Pydantic schemas\n",
      "  ✅ Classification pipelines\n",
      "  ✅ Zero-shot & few-shot learning\n",
      "  ✅ JSON extraction\n",
      "  ✅ Transformer architecture (embeddings)\n",
      "  ✅ API deployment strategies\n",
      "\n",
      "======================================================================\n",
      "🚀 READY TO MOVE TO VS CODE!\n",
      "======================================================================\n"
     ]
    }
   ],
   "source": [
    "print(\"=\"*70)\n",
    "print(\"🎯 HRHUB v2.1 - SUMMARY\")\n",
    "print(\"=\"*70)\n",
    "print(\"\")\n",
    "print(\"✅ IMPLEMENTED:\")\n",
    "print(\"  1. Zero-Shot Job Classification (Entry/Mid/Senior/Executive)\")\n",
    "print(\"  2. Few-Shot Learning with Examples\")\n",
    "print(\"  3. Structured Skills Extraction (Pydantic schemas)\")\n",
    "print(\"  4. Match Explainability (LLM-generated reasoning)\")\n",
    "print(\"  5. FREE LLM Integration (Hugging Face)\")\n",
    "print(\"  6. Flexible Data Loading (Upload OR Google Drive)\")\n",
    "print(\"\")\n",
    "print(\"💰 COST: $0.00 (completely free!)\")\n",
    "print(\"\")\n",
    "print(\"📈 COURSE ALIGNMENT:\")\n",
    "print(\"  ✅ LLMs for structured output\")\n",
    "print(\"  ✅ Pydantic schemas\")\n",
    "print(\"  ✅ Classification pipelines\")\n",
    "print(\"  ✅ Zero-shot & few-shot learning\")\n",
    "print(\"  ✅ JSON extraction\")\n",
    "print(\"  ✅ Transformer architecture (embeddings)\")\n",
    "print(\"  ✅ API deployment strategies\")\n",
    "print(\"\")\n",
    "print(\"=\"*70)\n",
    "print(\"🚀 READY TO MOVE TO VS CODE!\")\n",
    "print(\"=\"*70)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}