Roger Surf committed on
Commit
def3477
·
1 Parent(s): 782c177

new notebook hrhub_v2.1_enhanced

.gitignore CHANGED
@@ -5,4 +5,5 @@ __pycache__/
  .DS_Store
  *.log
  .streamlit/
- *.csv
+ *.csv
+ .env
data/notebooks/HRHUB_v2.1_Enhanced_FREE.ipynb ADDED
@@ -0,0 +1,1694 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# 🧠 HRHUB v2.1 - Enhanced with LLM (FREE VERSION)\n",
8
+ "\n",
9
+ "## πŸ“˜ Project Overview\n",
10
+ "\n",
11
+ "**Bilateral HR Matching System with LLM-Powered Intelligence**\n",
12
+ "\n",
13
+ "### What's New in v2.1:\n",
14
+ "- βœ… **FREE LLM**: Using Hugging Face Inference API (no cost)\n",
15
+ "- βœ… **Job Level Classification**: Zero-shot & few-shot learning\n",
16
+ "- βœ… **Structured Skills Extraction**: Pydantic schemas\n",
17
+ "- βœ… **Match Explainability**: LLM-generated reasoning\n",
18
+ "- βœ… **Flexible Data Loading**: Upload OR Google Drive\n",
19
+ "\n",
20
+ "### Tech Stack:\n",
21
+ "```\n",
22
+ "Embeddings: sentence-transformers (local, free)\n",
23
+ "LLM: Hugging Face Inference API (free tier)\n",
24
+ "Schemas: Pydantic\n",
25
+ "Platform: Google Colab β†’ VS Code\n",
26
+ "```\n",
27
+ "\n",
28
+ "---\n",
29
+ "\n",
30
+ "**Master's Thesis - Aalborg University** \n",
31
+ "*Business Data Science Program* \n",
32
+ "*December 2025*"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "markdown",
37
+ "metadata": {},
38
+ "source": [
39
+ "---\n",
40
+ "## πŸ“¦ Step 1: Install Dependencies"
41
+ ]
42
+ },
43
+ {
44
+ "cell_type": "code",
45
+ "execution_count": 1,
46
+ "metadata": {},
47
+ "outputs": [
48
+ {
49
+ "name": "stdout",
50
+ "output_type": "stream",
51
+ "text": [
52
+ "βœ… All packages installed!\n"
53
+ ]
54
+ }
55
+ ],
56
+ "source": [
57
+ "# Install required packages\n",
58
+ "#!pip install -q sentence-transformers huggingface-hub pydantic plotly pyvis nbformat scikit-learn pandas numpy\n",
59
+ "\n",
60
+ "print(\"βœ… All packages installed!\")"
61
+ ]
62
+ },
63
+ {
64
+ "cell_type": "markdown",
65
+ "metadata": {},
66
+ "source": [
67
+ "---\n",
68
+ "## πŸ“š Step 2: Import Libraries"
69
+ ]
70
+ },
71
+ {
72
+ "cell_type": "code",
73
+ "execution_count": 2,
74
+ "metadata": {},
75
+ "outputs": [
76
+ {
77
+ "name": "stdout",
78
+ "output_type": "stream",
79
+ "text": [
80
+ "βœ… Environment variables loaded from .env\n",
81
+ "βœ… All libraries imported!\n"
82
+ ]
83
+ }
84
+ ],
85
+ "source": [
86
+ "import pandas as pd\n",
87
+ "import numpy as np\n",
88
+ "import json\n",
89
+ "import os\n",
90
+ "from typing import List, Dict, Optional, Literal\n",
91
+ "import warnings\n",
92
+ "warnings.filterwarnings('ignore')\n",
93
+ "\n",
94
+ "# ML & NLP\n",
95
+ "from sentence_transformers import SentenceTransformer\n",
96
+ "from sklearn.metrics.pairwise import cosine_similarity\n",
97
+ "\n",
98
+ "# LLM Integration (FREE)\n",
99
+ "from huggingface_hub import InferenceClient\n",
100
+ "from pydantic import BaseModel, Field\n",
101
+ "\n",
102
+ "# Visualization\n",
103
+ "import plotly.graph_objects as go\n",
104
+ "from IPython.display import HTML, display\n",
105
+ "\n",
106
+ "# Configuration Settings\n",
107
+ "from dotenv import load_dotenv\n",
108
+ "\n",
109
+ "# Carrega variΓ‘veis do .env\n",
110
+ "load_dotenv()\n",
111
+ "print(\"βœ… Environment variables loaded from .env\")\n",
113
+ "\n",
114
+ "print(\"βœ… All libraries imported!\")"
115
+ ]
116
+ },
117
+ {
118
+ "cell_type": "markdown",
119
+ "metadata": {},
120
+ "source": [
121
+ "---\n",
122
+ "## πŸ”§ Step 3: Configuration"
123
+ ]
124
+ },
125
+ {
126
+ "cell_type": "code",
127
+ "execution_count": 3,
128
+ "metadata": {},
129
+ "outputs": [
130
+ {
131
+ "name": "stdout",
132
+ "output_type": "stream",
133
+ "text": [
134
+ "βœ… Configuration loaded!\n",
135
+ "🧠 Embedding model: all-MiniLM-L6-v2\n",
136
+ "πŸ€– LLM model: meta-llama/Llama-3.2-3B-Instruct\n",
137
+ "πŸ”‘ HF Token configured: Yes βœ…\n",
138
+ "πŸ“‚ Data path: ../csv_files/\n"
139
+ ]
140
+ }
141
+ ],
142
+ "source": [
143
+ "class Config:\n",
144
+ " \"\"\"Centralized configuration for VS Code\"\"\"\n",
145
+ " \n",
146
+ " # Paths - VS Code structure\n",
147
+ " CSV_PATH = '../csv_files/'\n",
148
+ " PROCESSED_PATH = '../processed/'\n",
149
+ " RESULTS_PATH = '../results/'\n",
150
+ " \n",
151
+ " # Embedding Model\n",
152
+ " EMBEDDING_MODEL = 'all-MiniLM-L6-v2'\n",
153
+ " \n",
154
+ " # LLM Settings (FREE - Hugging Face)\n",
155
+ " HF_TOKEN = os.getenv('HF_TOKEN', '') # βœ… Pega do .env\n",
156
+ " LLM_MODEL = 'meta-llama/Llama-3.2-3B-Instruct'\n",
157
+ " \n",
158
+ " LLM_MAX_TOKENS = 1000\n",
159
+ " \n",
160
+ " # Matching Parameters\n",
161
+ " TOP_K_MATCHES = 10\n",
162
+ " SIMILARITY_THRESHOLD = 0.5\n",
163
+ " RANDOM_SEED = 42\n",
164
+ "\n",
165
+ "np.random.seed(Config.RANDOM_SEED)\n",
166
+ "\n",
167
+ "print(\"βœ… Configuration loaded!\")\n",
168
+ "print(f\"🧠 Embedding model: {Config.EMBEDDING_MODEL}\")\n",
169
+ "print(f\"πŸ€– LLM model: {Config.LLM_MODEL}\")\n",
170
+ "print(f\"πŸ”‘ HF Token configured: {'Yes βœ…' if Config.HF_TOKEN else 'No ⚠️'}\")\n",
171
+ "print(f\"πŸ“‚ Data path: {Config.CSV_PATH}\")"
172
+ ]
173
+ },
174
+ {
175
+ "cell_type": "markdown",
176
+ "metadata": {},
177
+ "source": [
178
+ "---\n",
179
+ "## πŸ“Š Step 5: Load All Datasets"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "code",
184
+ "execution_count": 4,
185
+ "metadata": {},
186
+ "outputs": [
187
+ {
188
+ "name": "stdout",
189
+ "output_type": "stream",
190
+ "text": [
191
+ "πŸ“‚ Loading all datasets...\n",
192
+ "\n",
193
+ "======================================================================\n",
194
+ "βœ… Candidates: 9,544 rows Γ— 35 columns\n",
195
+ "βœ… Companies (base): 24,473 rows\n",
196
+ "βœ… Company industries: 24,375 rows\n",
197
+ "βœ… Company specialties: 169,387 rows\n",
198
+ "βœ… Employee counts: 35,787 rows\n",
199
+ "βœ… Postings: 123,849 rows Γ— 31 columns\n",
200
+ "βœ… Job skills: 213,768 rows\n",
201
+ "βœ… Job industries: 164,808 rows\n",
202
+ "\n",
203
+ "======================================================================\n",
204
+ "βœ… All datasets loaded successfully!\n",
205
+ "\n"
206
+ ]
207
+ }
208
+ ],
209
+ "source": [
210
+ "print(\"πŸ“‚ Loading all datasets...\\n\")\n",
211
+ "print(\"=\" * 70)\n",
212
+ "\n",
213
+ "# Load main datasets\n",
214
+ "candidates = pd.read_csv(f'{Config.CSV_PATH}resume_data.csv')\n",
215
+ "print(f\"βœ… Candidates: {len(candidates):,} rows Γ— {len(candidates.columns)} columns\")\n",
216
+ "\n",
217
+ "companies_base = pd.read_csv(f'{Config.CSV_PATH}companies.csv')\n",
218
+ "print(f\"βœ… Companies (base): {len(companies_base):,} rows\")\n",
219
+ "\n",
220
+ "company_industries = pd.read_csv(f'{Config.CSV_PATH}company_industries.csv')\n",
221
+ "print(f\"βœ… Company industries: {len(company_industries):,} rows\")\n",
222
+ "\n",
223
+ "company_specialties = pd.read_csv(f'{Config.CSV_PATH}company_specialities.csv')\n",
224
+ "print(f\"βœ… Company specialties: {len(company_specialties):,} rows\")\n",
225
+ "\n",
226
+ "employee_counts = pd.read_csv(f'{Config.CSV_PATH}employee_counts.csv')\n",
227
+ "print(f\"βœ… Employee counts: {len(employee_counts):,} rows\")\n",
228
+ "\n",
229
+ "postings = pd.read_csv(f'{Config.CSV_PATH}postings.csv', on_bad_lines='skip', engine='python')\n",
230
+ "print(f\"βœ… Postings: {len(postings):,} rows Γ— {len(postings.columns)} columns\")\n",
231
+ "\n",
232
+ "# Optional datasets\n",
233
+ "try:\n",
234
+ " job_skills = pd.read_csv(f'{Config.CSV_PATH}job_skills.csv')\n",
235
+ " print(f\"βœ… Job skills: {len(job_skills):,} rows\")\n",
236
+ "except:\n",
237
+ " job_skills = None\n",
238
+ " print(\"⚠️ Job skills not found (optional)\")\n",
239
+ "\n",
240
+ "try:\n",
241
+ " job_industries = pd.read_csv(f'{Config.CSV_PATH}job_industries.csv')\n",
242
+ " print(f\"βœ… Job industries: {len(job_industries):,} rows\")\n",
243
+ "except:\n",
244
+ " job_industries = None\n",
245
+ " print(\"⚠️ Job industries not found (optional)\")\n",
246
+ "\n",
247
+ "print(\"\\n\" + \"=\" * 70)\n",
248
+ "print(\"βœ… All datasets loaded successfully!\\n\")"
249
+ ]
250
+ },
251
+ {
252
+ "cell_type": "markdown",
253
+ "metadata": {},
254
+ "source": [
255
+ "---\n",
256
+ "## πŸ”— Step 6: Merge & Enrich Company Data"
257
+ ]
258
+ },
259
+ {
260
+ "cell_type": "code",
261
+ "execution_count": 5,
262
+ "metadata": {},
263
+ "outputs": [
264
+ {
265
+ "name": "stdout",
266
+ "output_type": "stream",
267
+ "text": [
268
+ "πŸ”— Merging company data...\n",
269
+ "\n",
270
+ "βœ… Aggregated industries for 24,365 companies\n",
271
+ "βœ… Aggregated specialties for 17,780 companies\n",
272
+ "\n",
273
+ "βœ… Base company merge complete: 35,787 companies\n",
274
+ "\n"
275
+ ]
276
+ }
277
+ ],
278
+ "source": [
279
+ "print(\"πŸ”— Merging company data...\\n\")\n",
280
+ "\n",
281
+ "# Aggregate industries\n",
282
+ "company_industries_agg = company_industries.groupby('company_id')['industry'].apply(\n",
283
+ " lambda x: ', '.join(map(str, x.tolist()))\n",
284
+ ").reset_index()\n",
285
+ "company_industries_agg.columns = ['company_id', 'industries_list']\n",
286
+ "print(f\"βœ… Aggregated industries for {len(company_industries_agg):,} companies\")\n",
287
+ "\n",
288
+ "# Aggregate specialties\n",
289
+ "company_specialties_agg = company_specialties.groupby('company_id')['speciality'].apply(\n",
290
+ " lambda x: ' | '.join(x.astype(str).tolist())\n",
291
+ ").reset_index()\n",
292
+ "company_specialties_agg.columns = ['company_id', 'specialties_list']\n",
293
+ "print(f\"βœ… Aggregated specialties for {len(company_specialties_agg):,} companies\")\n",
294
+ "\n",
295
+ "# Merge all company data\n",
296
+ "companies_merged = companies_base.copy()\n",
297
+ "companies_merged = companies_merged.merge(company_industries_agg, on='company_id', how='left')\n",
298
+ "companies_merged = companies_merged.merge(company_specialties_agg, on='company_id', how='left')\n",
299
+ "companies_merged = companies_merged.merge(employee_counts, on='company_id', how='left')\n",
300
+ "\n",
301
+ "print(f\"\\nβœ… Base company merge complete: {len(companies_merged):,} companies\\n\")"
302
+ ]
303
+ },
304
+ {
305
+ "cell_type": "markdown",
306
+ "metadata": {},
307
+ "source": [
308
+ "---\n",
309
+ "## πŸŒ‰ Step 7: Enrich with Job Postings"
310
+ ]
311
+ },
312
+ {
313
+ "cell_type": "code",
314
+ "execution_count": 6,
315
+ "metadata": {},
316
+ "outputs": [
317
+ {
318
+ "name": "stdout",
319
+ "output_type": "stream",
320
+ "text": [
321
+ "πŸŒ‰ Enriching companies with job posting data...\n",
322
+ "\n",
323
+ "======================================================================\n",
324
+ "KEY INSIGHT: Postings = 'Requirements Language Bridge'\n",
325
+ "======================================================================\n",
326
+ "\n",
327
+ "βœ… Enriched 35,787 companies with posting data\n",
328
+ "\n"
329
+ ]
330
+ }
331
+ ],
332
+ "source": [
333
+ "print(\"πŸŒ‰ Enriching companies with job posting data...\\n\")\n",
334
+ "print(\"=\" * 70)\n",
335
+ "print(\"KEY INSIGHT: Postings = 'Requirements Language Bridge'\")\n",
336
+ "print(\"=\" * 70 + \"\\n\")\n",
337
+ "\n",
338
+ "postings = postings.fillna('')\n",
339
+ "postings['company_id'] = postings['company_id'].astype(str)\n",
340
+ "\n",
341
+ "# Aggregate postings per company\n",
342
+ "postings_agg = postings.groupby('company_id').agg({\n",
343
+ " 'title': lambda x: ' | '.join(x.astype(str).tolist()[:10]),\n",
344
+ " 'description': lambda x: ' '.join(x.astype(str).tolist()[:5]),\n",
345
+ " 'skills_desc': lambda x: ' | '.join(x.dropna().astype(str).tolist()),\n",
346
+ " 'formatted_experience_level': lambda x: ' | '.join(x.dropna().unique().astype(str)),\n",
347
+ "}).reset_index()\n",
348
+ "\n",
349
+ "postings_agg.columns = ['company_id', 'posted_job_titles', 'posted_descriptions', 'required_skills', 'experience_levels']\n",
350
+ "\n",
351
+ "companies_merged['company_id'] = companies_merged['company_id'].astype(str)\n",
352
+ "companies_full = companies_merged.merge(postings_agg, on='company_id', how='left').fillna('')\n",
353
+ "\n",
354
+ "print(f\"βœ… Enriched {len(companies_full):,} companies with posting data\\n\")"
355
+ ]
356
+ },
357
+ {
358
+ "cell_type": "code",
359
+ "execution_count": 7,
360
+ "metadata": {},
361
+ "outputs": [
362
+ {
363
+ "data": {
364
+ "text/html": [
365
+ "<div>\n",
366
+ "<style scoped>\n",
367
+ " .dataframe tbody tr th:only-of-type {\n",
368
+ " vertical-align: middle;\n",
369
+ " }\n",
370
+ "\n",
371
+ " .dataframe tbody tr th {\n",
372
+ " vertical-align: top;\n",
373
+ " }\n",
374
+ "\n",
375
+ " .dataframe thead th {\n",
376
+ " text-align: right;\n",
377
+ " }\n",
378
+ "</style>\n",
379
+ "<table border=\"1\" class=\"dataframe\">\n",
380
+ " <thead>\n",
381
+ " <tr style=\"text-align: right;\">\n",
382
+ " <th></th>\n",
383
+ " <th>company_id</th>\n",
384
+ " <th>name</th>\n",
385
+ " <th>description</th>\n",
386
+ " <th>company_size</th>\n",
387
+ " <th>state</th>\n",
388
+ " <th>country</th>\n",
389
+ " <th>city</th>\n",
390
+ " <th>zip_code</th>\n",
391
+ " <th>address</th>\n",
392
+ " <th>url</th>\n",
393
+ " <th>industries_list</th>\n",
394
+ " <th>specialties_list</th>\n",
395
+ " <th>employee_count</th>\n",
396
+ " <th>follower_count</th>\n",
397
+ " <th>time_recorded</th>\n",
398
+ " <th>posted_job_titles</th>\n",
399
+ " <th>posted_descriptions</th>\n",
400
+ " <th>required_skills</th>\n",
401
+ " <th>experience_levels</th>\n",
402
+ " </tr>\n",
403
+ " </thead>\n",
404
+ " <tbody>\n",
405
+ " <tr>\n",
406
+ " <th>0</th>\n",
407
+ " <td>1009</td>\n",
408
+ " <td>IBM</td>\n",
409
+ " <td>At IBM, we do more than work. We create. We cr...</td>\n",
410
+ " <td>7.0</td>\n",
411
+ " <td>NY</td>\n",
412
+ " <td>US</td>\n",
413
+ " <td>Armonk, New York</td>\n",
414
+ " <td>10504</td>\n",
415
+ " <td>International Business Machines Corp.</td>\n",
416
+ " <td>https://www.linkedin.com/company/ibm</td>\n",
417
+ " <td>IT Services and IT Consulting</td>\n",
418
+ " <td>Cloud | Mobile | Cognitive | Security | Resear...</td>\n",
419
+ " <td>314102</td>\n",
420
+ " <td>16253625</td>\n",
421
+ " <td>1712378162</td>\n",
422
+ " <td></td>\n",
423
+ " <td></td>\n",
424
+ " <td></td>\n",
425
+ " <td></td>\n",
426
+ " </tr>\n",
427
+ " <tr>\n",
428
+ " <th>1</th>\n",
429
+ " <td>1009</td>\n",
430
+ " <td>IBM</td>\n",
431
+ " <td>At IBM, we do more than work. We create. We cr...</td>\n",
432
+ " <td>7.0</td>\n",
433
+ " <td>NY</td>\n",
434
+ " <td>US</td>\n",
435
+ " <td>Armonk, New York</td>\n",
436
+ " <td>10504</td>\n",
437
+ " <td>International Business Machines Corp.</td>\n",
438
+ " <td>https://www.linkedin.com/company/ibm</td>\n",
439
+ " <td>IT Services and IT Consulting</td>\n",
440
+ " <td>Cloud | Mobile | Cognitive | Security | Resear...</td>\n",
441
+ " <td>313142</td>\n",
442
+ " <td>16309464</td>\n",
443
+ " <td>1713392385</td>\n",
444
+ " <td></td>\n",
445
+ " <td></td>\n",
446
+ " <td></td>\n",
447
+ " <td></td>\n",
448
+ " </tr>\n",
449
+ " <tr>\n",
450
+ " <th>2</th>\n",
451
+ " <td>1009</td>\n",
452
+ " <td>IBM</td>\n",
453
+ " <td>At IBM, we do more than work. We create. We cr...</td>\n",
454
+ " <td>7.0</td>\n",
455
+ " <td>NY</td>\n",
456
+ " <td>US</td>\n",
457
+ " <td>Armonk, New York</td>\n",
458
+ " <td>10504</td>\n",
459
+ " <td>International Business Machines Corp.</td>\n",
460
+ " <td>https://www.linkedin.com/company/ibm</td>\n",
461
+ " <td>IT Services and IT Consulting</td>\n",
462
+ " <td>Cloud | Mobile | Cognitive | Security | Resear...</td>\n",
463
+ " <td>313147</td>\n",
464
+ " <td>16309985</td>\n",
465
+ " <td>1713402495</td>\n",
466
+ " <td></td>\n",
467
+ " <td></td>\n",
468
+ " <td></td>\n",
469
+ " <td></td>\n",
470
+ " </tr>\n",
471
+ " <tr>\n",
472
+ " <th>3</th>\n",
473
+ " <td>1009</td>\n",
474
+ " <td>IBM</td>\n",
475
+ " <td>At IBM, we do more than work. We create. We cr...</td>\n",
476
+ " <td>7.0</td>\n",
477
+ " <td>NY</td>\n",
478
+ " <td>US</td>\n",
479
+ " <td>Armonk, New York</td>\n",
480
+ " <td>10504</td>\n",
481
+ " <td>International Business Machines Corp.</td>\n",
482
+ " <td>https://www.linkedin.com/company/ibm</td>\n",
483
+ " <td>IT Services and IT Consulting</td>\n",
484
+ " <td>Cloud | Mobile | Cognitive | Security | Resear...</td>\n",
485
+ " <td>311223</td>\n",
486
+ " <td>16314846</td>\n",
487
+ " <td>1713501255</td>\n",
488
+ " <td></td>\n",
489
+ " <td></td>\n",
490
+ " <td></td>\n",
491
+ " <td></td>\n",
492
+ " </tr>\n",
493
+ " <tr>\n",
494
+ " <th>4</th>\n",
495
+ " <td>1016</td>\n",
496
+ " <td>GE HealthCare</td>\n",
497
+ " <td>Every day millions of people feel the impact o...</td>\n",
498
+ " <td>7.0</td>\n",
499
+ " <td>0</td>\n",
500
+ " <td>US</td>\n",
501
+ " <td>Chicago</td>\n",
502
+ " <td>0</td>\n",
503
+ " <td>-</td>\n",
504
+ " <td>https://www.linkedin.com/company/gehealthcare</td>\n",
505
+ " <td>Hospitals and Health Care</td>\n",
506
+ " <td>Healthcare | Biotechnology</td>\n",
507
+ " <td>56873</td>\n",
508
+ " <td>2185368</td>\n",
509
+ " <td>1712382540</td>\n",
510
+ " <td></td>\n",
511
+ " <td></td>\n",
512
+ " <td></td>\n",
513
+ " <td></td>\n",
514
+ " </tr>\n",
515
+ " </tbody>\n",
516
+ "</table>\n",
517
+ "</div>"
518
+ ],
519
+ "text/plain": [
520
+ " company_id name \\\n",
521
+ "0 1009 IBM \n",
522
+ "1 1009 IBM \n",
523
+ "2 1009 IBM \n",
524
+ "3 1009 IBM \n",
525
+ "4 1016 GE HealthCare \n",
526
+ "\n",
527
+ " description company_size state \\\n",
528
+ "0 At IBM, we do more than work. We create. We cr... 7.0 NY \n",
529
+ "1 At IBM, we do more than work. We create. We cr... 7.0 NY \n",
530
+ "2 At IBM, we do more than work. We create. We cr... 7.0 NY \n",
531
+ "3 At IBM, we do more than work. We create. We cr... 7.0 NY \n",
532
+ "4 Every day millions of people feel the impact o... 7.0 0 \n",
533
+ "\n",
534
+ " country city zip_code address \\\n",
535
+ "0 US Armonk, New York 10504 International Business Machines Corp. \n",
536
+ "1 US Armonk, New York 10504 International Business Machines Corp. \n",
537
+ "2 US Armonk, New York 10504 International Business Machines Corp. \n",
538
+ "3 US Armonk, New York 10504 International Business Machines Corp. \n",
539
+ "4 US Chicago 0 - \n",
540
+ "\n",
541
+ " url \\\n",
542
+ "0 https://www.linkedin.com/company/ibm \n",
543
+ "1 https://www.linkedin.com/company/ibm \n",
544
+ "2 https://www.linkedin.com/company/ibm \n",
545
+ "3 https://www.linkedin.com/company/ibm \n",
546
+ "4 https://www.linkedin.com/company/gehealthcare \n",
547
+ "\n",
548
+ " industries_list \\\n",
549
+ "0 IT Services and IT Consulting \n",
550
+ "1 IT Services and IT Consulting \n",
551
+ "2 IT Services and IT Consulting \n",
552
+ "3 IT Services and IT Consulting \n",
553
+ "4 Hospitals and Health Care \n",
554
+ "\n",
555
+ " specialties_list employee_count \\\n",
556
+ "0 Cloud | Mobile | Cognitive | Security | Resear... 314102 \n",
557
+ "1 Cloud | Mobile | Cognitive | Security | Resear... 313142 \n",
558
+ "2 Cloud | Mobile | Cognitive | Security | Resear... 313147 \n",
559
+ "3 Cloud | Mobile | Cognitive | Security | Resear... 311223 \n",
560
+ "4 Healthcare | Biotechnology 56873 \n",
561
+ "\n",
562
+ " follower_count time_recorded posted_job_titles posted_descriptions \\\n",
563
+ "0 16253625 1712378162 \n",
564
+ "1 16309464 1713392385 \n",
565
+ "2 16309985 1713402495 \n",
566
+ "3 16314846 1713501255 \n",
567
+ "4 2185368 1712382540 \n",
568
+ "\n",
569
+ " required_skills experience_levels \n",
570
+ "0 \n",
571
+ "1 \n",
572
+ "2 \n",
573
+ "3 \n",
574
+ "4 "
575
+ ]
576
+ },
577
+ "execution_count": 7,
578
+ "metadata": {},
579
+ "output_type": "execute_result"
580
+ }
581
+ ],
582
+ "source": [
583
+ "companies_full.head()"
584
+ ]
585
+ },
586
+ {
587
+ "cell_type": "code",
588
+ "execution_count": 19,
589
+ "metadata": {},
590
+ "outputs": [
591
+ {
592
+ "name": "stdout",
593
+ "output_type": "stream",
594
+ "text": [
595
+ "================================================================================\n",
596
+ "πŸ” DUPLICATE DETECTION REPORT\n",
597
+ "================================================================================\n",
598
+ "\n",
599
+ "β”Œβ”€ πŸ“Š resume_data.csv (Candidates)\n",
600
+ "β”‚ Primary Key: Resume_ID\n",
601
+ "β”‚ Total rows: 9,544\n",
602
+ "β”‚ Unique rows: 9,544\n",
603
+ "β”‚ Duplicates: 0\n",
604
+ "β”‚ Status: βœ… CLEAN\n",
605
+ "└─\n",
606
+ "\n",
607
+ "β”Œβ”€ πŸ“Š companies.csv (Companies Base)\n",
608
+ "β”‚ Primary Key: company_id\n",
609
+ "β”‚ Total rows: 24,473\n",
610
+ "β”‚ Unique rows: 24,473\n",
611
+ "β”‚ Duplicates: 0\n",
612
+ "β”‚ Status: βœ… CLEAN\n",
613
+ "└─\n",
614
+ "\n",
615
+ "β”Œβ”€ πŸ“Š company_industries.csv\n",
616
+ "β”‚ Primary Key: company_id + industry\n",
617
+ "β”‚ Total rows: 24,375\n",
618
+ "β”‚ Unique rows: 24,375\n",
619
+ "β”‚ Duplicates: 0\n",
620
+ "β”‚ Status: βœ… CLEAN\n",
621
+ "└─\n",
622
+ "\n",
623
+ "β”Œβ”€ πŸ“Š company_specialities.csv\n",
624
+ "β”‚ Primary Key: company_id + speciality\n",
625
+ "β”‚ Total rows: 169,387\n",
626
+ "β”‚ Unique rows: 169,387\n",
627
+ "β”‚ Duplicates: 0\n",
628
+ "β”‚ Status: βœ… CLEAN\n",
629
+ "└─\n",
630
+ "\n",
631
+ "β”Œβ”€ πŸ“Š employee_counts.csv\n",
632
+ "β”‚ Primary Key: company_id\n",
633
+ "β”‚ Total rows: 35,787\n",
634
+ "β”‚ Unique rows: 24,473\n",
635
+ "β”‚ Duplicates: 11,314\n",
636
+ "β”‚ Status: πŸ”΄ HAS DUPLICATES\n",
637
+ "└─\n",
638
+ "\n",
639
+ "β”Œβ”€ πŸ“Š postings.csv (Job Postings)\n",
640
+ "β”‚ Primary Key: job_id\n",
641
+ "β”‚ Total rows: 123,849\n",
642
+ "β”‚ Unique rows: 123,849\n",
643
+ "β”‚ Duplicates: 0\n",
644
+ "β”‚ Status: βœ… CLEAN\n",
645
+ "└─\n",
646
+ "\n",
647
+ "β”Œβ”€ πŸ“Š companies_full (After Enrichment)\n",
648
+ "β”‚ Primary Key: company_id\n",
649
+ "β”‚ Total rows: 35,787\n",
650
+ "β”‚ Unique rows: 24,473\n",
651
+ "β”‚ Duplicates: 11,314\n",
652
+ "β”‚ Status: πŸ”΄ HAS DUPLICATES\n",
653
+ "β”‚\n",
654
+ "β”‚ Top duplicate company_ids:\n",
655
+ "β”‚ - 33242739 (Confidential): 13 times\n",
656
+ "β”‚ - 5235 (LHH): 13 times\n",
657
+ "β”‚ - 79383535 (Akkodis): 12 times\n",
658
+ "β”‚ - 1681 (Robert Half): 12 times\n",
659
+ "β”‚ - 220336 (Hyatt Hotels Corporation): 11 times\n",
660
+ "└─\n",
661
+ "\n",
662
+ "================================================================================\n",
663
+ "πŸ“Š SUMMARY\n",
664
+ "================================================================================\n",
665
+ "\n",
666
+ "βœ… Clean datasets: 5/7\n",
667
+ "πŸ”΄ Datasets with duplicates: 2/7\n",
668
+ "πŸ—‘οΈ Total duplicates found: 22,628 rows\n",
669
+ "\n",
670
+ "⚠️ DUPLICATES DETECTED!\n",
671
+ "================================================================================\n"
672
+ ]
673
+ }
674
+ ],
675
+ "source": [
676
+ "## πŸ” Data Quality Check - Duplicate Detection\n",
677
+ "\n",
678
+ "\"\"\"\n",
679
+ "Checking for duplicates in all datasets based on primary keys.\n",
680
+ "This cell only REPORTS duplicates, does not modify data.\n",
681
+ "\"\"\"\n",
682
+ "\n",
683
+ "print(\"=\" * 80)\n",
684
+ "print(\"πŸ” DUPLICATE DETECTION REPORT\")\n",
685
+ "print(\"=\" * 80)\n",
686
+ "print()\n",
687
+ "\n",
688
+ "# Define primary keys for each dataset\n",
689
+ "duplicate_report = []\n",
690
+ "\n",
691
+ "# 1. Candidates\n",
692
+ "print(\"β”Œβ”€ πŸ“Š resume_data.csv (Candidates)\")\n",
693
+ "print(f\"β”‚ Primary Key: Resume_ID\")\n",
694
+ "cand_total = len(candidates)\n",
695
+ "cand_unique = candidates['Resume_ID'].nunique() if 'Resume_ID' in candidates.columns else len(candidates)\n",
696
+ "cand_dups = cand_total - cand_unique\n",
697
+ "print(f\"β”‚ Total rows: {cand_total:,}\")\n",
698
+ "print(f\"β”‚ Unique rows: {cand_unique:,}\")\n",
699
+ "print(f\"β”‚ Duplicates: {cand_dups:,}\")\n",
700
+ "print(f\"β”‚ Status: {'βœ… CLEAN' if cand_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n",
701
+ "print(\"└─\\n\")\n",
702
+ "duplicate_report.append(('Candidates', cand_total, cand_unique, cand_dups))\n",
703
+ "\n",
704
+ "# 2. Companies Base\n",
705
+ "print(\"β”Œβ”€ πŸ“Š companies.csv (Companies Base)\")\n",
706
+ "print(f\"β”‚ Primary Key: company_id\")\n",
707
+ "comp_total = len(companies_base)\n",
708
+ "comp_unique = companies_base['company_id'].nunique()\n",
709
+ "comp_dups = comp_total - comp_unique\n",
710
+ "print(f\"β”‚ Total rows: {comp_total:,}\")\n",
711
+ "print(f\"β”‚ Unique rows: {comp_unique:,}\")\n",
712
+ "print(f\"β”‚ Duplicates: {comp_dups:,}\")\n",
713
+ "print(f\"β”‚ Status: {'βœ… CLEAN' if comp_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n",
714
+ "if comp_dups > 0:\n",
715
+ " dup_ids = companies_base[companies_base.duplicated('company_id', keep=False)]['company_id'].value_counts().head(3)\n",
716
+ " print(f\"β”‚ Top duplicates:\")\n",
717
+ " for cid, count in dup_ids.items():\n",
718
+ " print(f\"β”‚ - company_id={cid}: {count} times\")\n",
719
+ "print(\"└─\\n\")\n",
720
+ "duplicate_report.append(('Companies Base', comp_total, comp_unique, comp_dups))\n",
721
+ "\n",
722
+ "# 3. Company Industries\n",
723
+ "print(\"β”Œβ”€ πŸ“Š company_industries.csv\")\n",
724
+ "print(f\"β”‚ Primary Key: company_id + industry\")\n",
725
+ "ci_total = len(company_industries)\n",
726
+ "ci_unique = len(company_industries.drop_duplicates(subset=['company_id', 'industry']))\n",
727
+ "ci_dups = ci_total - ci_unique\n",
728
+ "print(f\"β”‚ Total rows: {ci_total:,}\")\n",
729
+ "print(f\"β”‚ Unique rows: {ci_unique:,}\")\n",
730
+ "print(f\"β”‚ Duplicates: {ci_dups:,}\")\n",
731
+ "print(f\"β”‚ Status: {'βœ… CLEAN' if ci_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n",
732
+ "print(\"└─\\n\")\n",
733
+ "duplicate_report.append(('Company Industries', ci_total, ci_unique, ci_dups))\n",
734
+ "\n",
735
+ "# 4. Company Specialties\n",
736
+ "print(\"β”Œβ”€ πŸ“Š company_specialities.csv\")\n",
737
+ "print(f\"β”‚ Primary Key: company_id + speciality\")\n",
738
+ "cs_total = len(company_specialties)\n",
739
+ "cs_unique = len(company_specialties.drop_duplicates(subset=['company_id', 'speciality']))\n",
740
+ "cs_dups = cs_total - cs_unique\n",
741
+ "print(f\"β”‚ Total rows: {cs_total:,}\")\n",
742
+ "print(f\"β”‚ Unique rows: {cs_unique:,}\")\n",
743
+ "print(f\"β”‚ Duplicates: {cs_dups:,}\")\n",
744
+ "print(f\"β”‚ Status: {'βœ… CLEAN' if cs_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n",
745
+ "print(\"└─\\n\")\n",
746
+ "duplicate_report.append(('Company Specialties', cs_total, cs_unique, cs_dups))\n",
747
+ "\n",
748
+ "# 5. Employee Counts\n",
749
+ "print(\"β”Œβ”€ πŸ“Š employee_counts.csv\")\n",
750
+ "print(f\"β”‚ Primary Key: company_id\")\n",
751
+ "ec_total = len(employee_counts)\n",
752
+ "ec_unique = employee_counts['company_id'].nunique()\n",
753
+ "ec_dups = ec_total - ec_unique\n",
754
+ "print(f\"β”‚ Total rows: {ec_total:,}\")\n",
755
+ "print(f\"β”‚ Unique rows: {ec_unique:,}\")\n",
756
+ "print(f\"β”‚ Duplicates: {ec_dups:,}\")\n",
757
+ "print(f\"β”‚ Status: {'βœ… CLEAN' if ec_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n",
758
+ "print(\"└─\\n\")\n",
759
+ "duplicate_report.append(('Employee Counts', ec_total, ec_unique, ec_dups))\n",
760
+ "\n",
761
+ "# 6. Postings\n",
762
+ "print(\"β”Œβ”€ πŸ“Š postings.csv (Job Postings)\")\n",
763
+ "print(f\"β”‚ Primary Key: job_id\")\n",
764
+ "if 'job_id' in postings.columns:\n",
765
+ " post_total = len(postings)\n",
766
+ " post_unique = postings['job_id'].nunique()\n",
767
+ " post_dups = post_total - post_unique\n",
768
+ "else:\n",
769
+ " post_total = len(postings)\n",
770
+ " post_unique = len(postings.drop_duplicates())\n",
771
+ " post_dups = post_total - post_unique\n",
772
+ "print(f\"β”‚ Total rows: {post_total:,}\")\n",
773
+ "print(f\"β”‚ Unique rows: {post_unique:,}\")\n",
774
+ "print(f\"β”‚ Duplicates: {post_dups:,}\")\n",
775
+ "print(f\"β”‚ Status: {'βœ… CLEAN' if post_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n",
776
+ "print(\"└─\\n\")\n",
777
+ "duplicate_report.append(('Postings', post_total, post_unique, post_dups))\n",
778
+ "\n",
779
+ "# 7. Companies Full (After Merge)\n",
780
+ "print(\"β”Œβ”€ πŸ“Š companies_full (After Enrichment)\")\n",
781
+ "print(f\"β”‚ Primary Key: company_id\")\n",
782
+ "cf_total = len(companies_full)\n",
783
+ "cf_unique = companies_full['company_id'].nunique()\n",
784
+ "cf_dups = cf_total - cf_unique\n",
785
+ "print(f\"β”‚ Total rows: {cf_total:,}\")\n",
786
+ "print(f\"β”‚ Unique rows: {cf_unique:,}\")\n",
787
+ "print(f\"β”‚ Duplicates: {cf_dups:,}\")\n",
788
+ "print(f\"β”‚ Status: {'βœ… CLEAN' if cf_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n",
789
+ "if cf_dups > 0:\n",
790
+ " dup_ids = companies_full[companies_full.duplicated('company_id', keep=False)]['company_id'].value_counts().head(5)\n",
791
+ " print(f\"β”‚\")\n",
792
+ " print(f\"β”‚ Top duplicate company_ids:\")\n",
793
+ " for cid, count in dup_ids.items():\n",
794
+ " comp_name = companies_full[companies_full['company_id'] == cid]['name'].iloc[0]\n",
795
+ " print(f\"β”‚ - {cid} ({comp_name}): {count} times\")\n",
796
+ "print(\"└─\\n\")\n",
797
+ "duplicate_report.append(('Companies Full', cf_total, cf_unique, cf_dups))\n",
798
+ "\n",
799
+ "# Summary\n",
800
+ "print(\"=\" * 80)\n",
801
+ "print(\"πŸ“Š SUMMARY\")\n",
802
+ "print(\"=\" * 80)\n",
803
+ "print()\n",
804
+ "\n",
805
+ "total_dups = sum(r[3] for r in duplicate_report)\n",
806
+ "clean_datasets = sum(1 for r in duplicate_report if r[3] == 0)\n",
807
+ "dirty_datasets = len(duplicate_report) - clean_datasets\n",
808
+ "\n",
809
+ "print(f\"βœ… Clean datasets: {clean_datasets}/{len(duplicate_report)}\")\n",
810
+ "print(f\"πŸ”΄ Datasets with duplicates: {dirty_datasets}/{len(duplicate_report)}\")\n",
811
+ "print(f\"πŸ—‘οΈ Total duplicates found: {total_dups:,} rows\")\n",
812
+ "print()\n",
813
+ "\n",
814
+ "if dirty_datasets > 0:\n",
815
+ " print(\"⚠️ DUPLICATES DETECTED!\")\n",
816
+ "else:\n",
817
+ " print(\"βœ… All datasets are clean! No duplicates found.\")\n",
818
+ "\n",
819
+ "print(\"=\" * 80)"
820
+ ]
821
+ },
822
+ {
823
+ "cell_type": "code",
824
+ "execution_count": 22,
825
+ "metadata": {},
826
+ "outputs": [
827
+ {
828
+ "name": "stdout",
829
+ "output_type": "stream",
830
+ "text": [
831
+ "🧹 CLEANING DUPLICATES...\n",
832
+ "\n",
833
+ "================================================================================\n",
834
+ "βœ… companies_base: Already clean\n",
835
+ "\n",
836
+ "βœ… company_industries: Already clean\n",
837
+ "\n",
838
+ "βœ… company_specialties: Already clean\n",
839
+ "\n",
840
+ "βœ… employee_counts:\n",
841
+ " Removed 11,314 duplicates\n",
842
+ " 35,787 β†’ 24,473 rows\n",
843
+ "\n",
844
+ "βœ… postings: Already clean\n",
845
+ "\n",
846
+ "βœ… companies_full:\n",
847
+ " Removed 11,314 duplicates\n",
848
+ " 35,787 β†’ 24,473 rows\n",
849
+ "\n",
850
+ "================================================================================\n",
851
+ "βœ… DATA CLEANING COMPLETE!\n",
852
+ "================================================================================\n",
853
+ "\n",
854
+ "πŸ“Š Total duplicates removed: 22,628 rows\n",
855
+ "\n",
856
+ "Cleaned datasets:\n",
857
+ " - employee_counts: 35,787 β†’ 24,473\n",
858
+ " - companies_full: 35,787 β†’ 24,473\n"
859
+ ]
860
+ }
861
+ ],
862
+ "source": [
863
+ "\"\"\"\n",
864
+ "## 🧹 Data Cleaning - Remove Duplicates\n",
865
+ "\n",
866
+ "Based on the report above, removing duplicates from datasets.\n",
867
+ "\"\"\"\n",
868
+ "\n",
869
+ "print(\"🧹 CLEANING DUPLICATES...\\n\")\n",
870
+ "print(\"=\" * 80)\n",
871
+ "\n",
872
+ "# Store original counts\n",
873
+ "original_counts = {}\n",
874
+ "\n",
875
+ "# 1. Clean Companies Base (if needed)\n",
876
+ "if len(companies_base) != companies_base['company_id'].nunique():\n",
877
+ " original_counts['companies_base'] = len(companies_base)\n",
878
+ " companies_base = companies_base.drop_duplicates(subset=['company_id'], keep='first')\n",
879
+ " removed = original_counts['companies_base'] - len(companies_base)\n",
880
+ " print(f\"βœ… companies_base:\")\n",
881
+ " print(f\" Removed {removed:,} duplicates\")\n",
882
+ " print(f\" {original_counts['companies_base']:,} β†’ {len(companies_base):,} rows\\n\")\n",
883
+ "else:\n",
884
+ " print(f\"βœ… companies_base: Already clean\\n\")\n",
885
+ "\n",
886
+ "# 2. Clean Company Industries (if needed)\n",
887
+ "if len(company_industries) != len(company_industries.drop_duplicates(subset=['company_id', 'industry'])):\n",
888
+ " original_counts['company_industries'] = len(company_industries)\n",
889
+ " company_industries = company_industries.drop_duplicates(subset=['company_id', 'industry'], keep='first')\n",
890
+ " removed = original_counts['company_industries'] - len(company_industries)\n",
891
+ " print(f\"βœ… company_industries:\")\n",
892
+ " print(f\" Removed {removed:,} duplicates\")\n",
893
+ " print(f\" {original_counts['company_industries']:,} β†’ {len(company_industries):,} rows\\n\")\n",
894
+ "else:\n",
895
+ " print(f\"βœ… company_industries: Already clean\\n\")\n",
896
+ "\n",
897
+ "# 3. Clean Company Specialties (if needed)\n",
898
+ "if len(company_specialties) != len(company_specialties.drop_duplicates(subset=['company_id', 'speciality'])):\n",
899
+ " original_counts['company_specialties'] = len(company_specialties)\n",
900
+ " company_specialties = company_specialties.drop_duplicates(subset=['company_id', 'speciality'], keep='first')\n",
901
+ " removed = original_counts['company_specialties'] - len(company_specialties)\n",
902
+ " print(f\"βœ… company_specialties:\")\n",
903
+ " print(f\" Removed {removed:,} duplicates\")\n",
904
+ " print(f\" {original_counts['company_specialties']:,} β†’ {len(company_specialties):,} rows\\n\")\n",
905
+ "else:\n",
906
+ " print(f\"βœ… company_specialties: Already clean\\n\")\n",
907
+ "\n",
908
+ "# 4. Clean Employee Counts (if needed)\n",
909
+ "if len(employee_counts) != employee_counts['company_id'].nunique():\n",
910
+ " original_counts['employee_counts'] = len(employee_counts)\n",
911
+ " employee_counts = employee_counts.drop_duplicates(subset=['company_id'], keep='first')\n",
912
+ " removed = original_counts['employee_counts'] - len(employee_counts)\n",
913
+ " print(f\"βœ… employee_counts:\")\n",
914
+ " print(f\" Removed {removed:,} duplicates\")\n",
915
+ " print(f\" {original_counts['employee_counts']:,} β†’ {len(employee_counts):,} rows\\n\")\n",
916
+ "else:\n",
917
+ " print(f\"βœ… employee_counts: Already clean\\n\")\n",
918
+ "\n",
919
+ "# 5. Clean Postings (if needed)\n",
920
+ "if 'job_id' in postings.columns:\n",
921
+ " if len(postings) != postings['job_id'].nunique():\n",
922
+ " original_counts['postings'] = len(postings)\n",
923
+ " postings = postings.drop_duplicates(subset=['job_id'], keep='first')\n",
924
+ " removed = original_counts['postings'] - len(postings)\n",
925
+ " print(f\"βœ… postings:\")\n",
926
+ " print(f\" Removed {removed:,} duplicates\")\n",
927
+ " print(f\" {original_counts['postings']:,} β†’ {len(postings):,} rows\\n\")\n",
928
+ " else:\n",
929
+ " print(f\"βœ… postings: Already clean\\n\")\n",
930
+ "\n",
931
+ "# 6. Clean Companies Full (if needed)\n",
932
+ "if len(companies_full) != companies_full['company_id'].nunique():\n",
933
+ " original_counts['companies_full'] = len(companies_full)\n",
934
+ " companies_full = companies_full.drop_duplicates(subset=['company_id'], keep='first')\n",
935
+ " removed = original_counts['companies_full'] - len(companies_full)\n",
936
+ " print(f\"βœ… companies_full:\")\n",
937
+ " print(f\" Removed {removed:,} duplicates\")\n",
938
+ " print(f\" {original_counts['companies_full']:,} β†’ {len(companies_full):,} rows\\n\")\n",
939
+ "else:\n",
940
+ " print(f\"βœ… companies_full: Already clean\\n\")\n",
941
+ "\n",
942
+ "print(\"=\" * 80)\n",
943
+ "print(\"βœ… DATA CLEANING COMPLETE!\")\n",
944
+ "print(\"=\" * 80)\n",
945
+ "print()\n",
946
+ "\n",
947
+ "# Summary\n",
948
+ "if original_counts:\n",
949
+ " total_removed = sum(original_counts[k] - globals()[k].shape[0] if k in globals() else 0 \n",
950
+ " for k in original_counts.keys())\n",
951
+ " print(f\"πŸ“Š Total duplicates removed: {total_removed:,} rows\")\n",
952
+ " print()\n",
953
+ " print(\"Cleaned datasets:\")\n",
954
+ " for dataset, original in original_counts.items():\n",
955
+ " current = len(globals()[dataset]) if dataset in globals() else 0\n",
956
+ " print(f\" - {dataset}: {original:,} β†’ {current:,}\")\n",
957
+ "else:\n",
958
+ " print(\"βœ… No duplicates found - all datasets were already clean!\")"
959
+ ]
960
+ },
961
+ {
962
+ "cell_type": "markdown",
963
+ "metadata": {},
964
+ "source": [
965
+ "---\n",
966
+ "## 🧠 Step 8: Load Embedding Model & Pre-computed Vectors"
967
+ ]
968
+ },
969
+ {
970
+ "cell_type": "code",
971
+ "execution_count": 23,
972
+ "metadata": {},
973
+ "outputs": [
974
+ {
975
+ "name": "stdout",
976
+ "output_type": "stream",
977
+ "text": [
978
+ "🧠 Loading embedding model...\n",
979
+ "\n",
980
+ "βœ… Model loaded: all-MiniLM-L6-v2\n",
981
+ "πŸ“ Embedding dimension: ℝ^384\n",
982
+ "\n",
983
+ "πŸ“‚ Loading pre-computed embeddings...\n",
984
+ "βœ… Loaded from ../processed/\n",
985
+ "πŸ“Š Candidate vectors: (9544, 384)\n",
986
+ "πŸ“Š Company vectors: (35787, 384)\n",
987
+ "\n"
988
+ ]
989
+ }
990
+ ],
991
+ "source": [
992
+ "print(\"🧠 Loading embedding model...\\n\")\n",
993
+ "model = SentenceTransformer(Config.EMBEDDING_MODEL)\n",
994
+ "embedding_dim = model.get_sentence_embedding_dimension()\n",
995
+ "print(f\"βœ… Model loaded: {Config.EMBEDDING_MODEL}\")\n",
996
+ "print(f\"πŸ“ Embedding dimension: ℝ^{embedding_dim}\\n\")\n",
997
+ "\n",
998
+ "print(\"πŸ“‚ Loading pre-computed embeddings...\")\n",
999
+ "\n",
1000
+ "try:\n",
1001
+ " # Try to load from processed folder\n",
1002
+ " cand_vectors = np.load(f'{Config.PROCESSED_PATH}candidate_embeddings.npy')\n",
1003
+ " comp_vectors = np.load(f'{Config.PROCESSED_PATH}company_embeddings.npy')\n",
1004
+ " \n",
1005
+ " print(f\"βœ… Loaded from {Config.PROCESSED_PATH}\")\n",
1006
+ " print(f\"πŸ“Š Candidate vectors: {cand_vectors.shape}\")\n",
1007
+ " print(f\"πŸ“Š Company vectors: {comp_vectors.shape}\\n\")\n",
1008
+ " \n",
1009
+ "except FileNotFoundError:\n",
1010
+ " print(\"⚠️ Pre-computed embeddings not found!\")\n",
1011
+ " print(\" Embeddings will need to be generated (takes ~5-10 minutes)\")\n",
1012
+ " print(\" This is normal if running for the first time.\\n\")\n",
1013
+ " \n",
1014
+ " # You can add embedding generation code here if needed\n",
1015
+ " # For now, we'll skip to keep notebook clean\n",
1016
+ " cand_vectors = None\n",
1017
+ " comp_vectors = None"
1018
+ ]
1019
+ },
1020
+ {
1021
+ "cell_type": "markdown",
1022
+ "metadata": {},
1023
+ "source": [
1024
+ "---\n",
1025
+ "## 🎯 Step 9: Core Matching Function"
1026
+ ]
1027
+ },
1028
+ {
1029
+ "cell_type": "code",
1030
+ "execution_count": 24,
1031
+ "metadata": {},
1032
+ "outputs": [
1033
+ {
1034
+ "name": "stdout",
1035
+ "output_type": "stream",
1036
+ "text": [
1037
+ "βœ… Matching function ready\n"
1038
+ ]
1039
+ }
1040
+ ],
1041
+ "source": [
1042
+ "def find_top_matches(candidate_idx: int, top_k: int = 10) -> List[tuple]:\n",
1043
+ " \"\"\"\n",
1044
+ " Find top K company matches for a candidate using cosine similarity.\n",
1045
+ " \n",
1046
+ " Args:\n",
1047
+ " candidate_idx: Index of candidate\n",
1048
+ " top_k: Number of top matches to return\n",
1049
+ " \n",
1050
+ " Returns:\n",
1051
+ " List of (company_index, similarity_score) tuples\n",
1052
+ " \"\"\"\n",
1053
+ " if cand_vectors is None or comp_vectors is None:\n",
1054
+ " raise ValueError(\"Embeddings not loaded! Please run Step 8 first.\")\n",
1055
+ " \n",
1056
+ " cand_vec = cand_vectors[candidate_idx].reshape(1, -1)\n",
1057
+ " similarities = cosine_similarity(cand_vec, comp_vectors)[0]\n",
1058
+ " top_indices = np.argsort(similarities)[::-1][:top_k]\n",
1059
+ " \n",
1060
+ " return [(int(idx), float(similarities[idx])) for idx in top_indices]\n",
1061
+ "\n",
1062
+ "print(\"βœ… Matching function ready\")"
1063
+ ]
1064
+ },
1065
+ {
1066
+ "cell_type": "markdown",
1067
+ "metadata": {},
1068
+ "source": [
1069
+ "---\n",
1070
+ "## πŸ€– Step 10: Initialize FREE LLM (Hugging Face)\n",
1071
+ "\n",
1072
+ "### Get your FREE token: https://huggingface.co/settings/tokens"
1073
+ ]
1074
+ },
1075
+ {
1076
+ "cell_type": "code",
1077
+ "execution_count": 25,
1078
+ "metadata": {},
1079
+ "outputs": [
1080
+ {
1081
+ "name": "stdout",
1082
+ "output_type": "stream",
1083
+ "text": [
1084
+ "βœ… Hugging Face client initialized (FREE)\n",
1085
+ "πŸ€– Model: meta-llama/Llama-3.2-3B-Instruct\n",
1086
+ "πŸ’° Cost: $0.00 (completely free!)\n",
1087
+ "\n",
1088
+ "βœ… LLM helper functions ready\n"
1089
+ ]
1090
+ }
1091
+ ],
1092
+ "source": [
1093
+ "# Initialize Hugging Face Inference Client (FREE)\n",
1094
+ "if Config.HF_TOKEN:\n",
1095
+ " try:\n",
1096
+ " hf_client = InferenceClient(token=Config.HF_TOKEN)\n",
1097
+ " print(\"βœ… Hugging Face client initialized (FREE)\")\n",
1098
+ " print(f\"πŸ€– Model: {Config.LLM_MODEL}\")\n",
1099
+ " print(\"πŸ’° Cost: $0.00 (completely free!)\\n\")\n",
1100
+ " LLM_AVAILABLE = True\n",
1101
+ " except Exception as e:\n",
1102
+ " print(f\"⚠️ Failed to initialize HF client: {e}\")\n",
1103
+ " LLM_AVAILABLE = False\n",
1104
+ "else:\n",
1105
+ " print(\"⚠️ No Hugging Face token configured\")\n",
1106
+ " print(\" LLM features will be disabled\")\n",
1107
+ " print(\"\\nπŸ“ To enable:\")\n",
1108
+ " print(\" 1. Go to: https://huggingface.co/settings/tokens\")\n",
1109
+ " print(\" 2. Create a token (free)\")\n",
1110
+ " print(\" 3. Set: Config.HF_TOKEN = 'your-token-here'\\n\")\n",
1111
+ " LLM_AVAILABLE = False\n",
1112
+ " hf_client = None\n",
1113
+ "\n",
1114
+ "def call_llm(prompt: str, max_tokens: int = 1000) -> str:\n",
1115
+ " \"\"\"\n",
1116
+ " Generic LLM call using Hugging Face Inference API (FREE).\n",
1117
+ " \"\"\"\n",
1118
+ " if not LLM_AVAILABLE:\n",
1119
+ " return \"[LLM not available - check .env file for HF_TOKEN]\"\n",
1120
+ " \n",
1121
+ " try:\n",
1122
+ " response = hf_client.chat_completion( # βœ… chat_completion\n",
1123
+ " messages=[{\"role\": \"user\", \"content\": prompt}],\n",
1124
+ " model=Config.LLM_MODEL,\n",
1125
+ " max_tokens=max_tokens,\n",
1126
+ " temperature=0.7\n",
1127
+ " )\n",
1128
+ " return response.choices[0].message.content # βœ… Extrai conteΓΊdo\n",
1129
+ " except Exception as e:\n",
1130
+ " return f\"[Error: {str(e)}]\"\n",
1131
+ "\n",
1132
+ "print(\"βœ… LLM helper functions ready\")"
1133
+ ]
1134
+ },
1135
+ {
1136
+ "cell_type": "markdown",
1137
+ "metadata": {},
1138
+ "source": [
1139
+ "---\n",
1140
+ "## πŸ€– Step 11: Pydantic Schemas for Structured Output"
1141
+ ]
1142
+ },
1143
+ {
1144
+ "cell_type": "code",
1145
+ "execution_count": 26,
1146
+ "metadata": {},
1147
+ "outputs": [
1148
+ {
1149
+ "name": "stdout",
1150
+ "output_type": "stream",
1151
+ "text": [
1152
+ "βœ… Pydantic schemas defined\n"
1153
+ ]
1154
+ }
1155
+ ],
1156
+ "source": [
1157
+ "class JobLevelClassification(BaseModel):\n",
1158
+ " \"\"\"Job level classification result\"\"\"\n",
1159
+ " level: Literal['Entry', 'Mid', 'Senior', 'Executive']\n",
1160
+ " confidence: float = Field(ge=0.0, le=1.0)\n",
1161
+ " reasoning: str\n",
1162
+ "\n",
1163
+ "class SkillsTaxonomy(BaseModel):\n",
1164
+ " \"\"\"Structured skills extraction\"\"\"\n",
1165
+ " technical_skills: List[str] = Field(default_factory=list)\n",
1166
+ " soft_skills: List[str] = Field(default_factory=list)\n",
1167
+ " certifications: List[str] = Field(default_factory=list)\n",
1168
+ " languages: List[str] = Field(default_factory=list)\n",
1169
+ "\n",
1170
+ "class MatchExplanation(BaseModel):\n",
1171
+ " \"\"\"Match reasoning\"\"\"\n",
1172
+ " overall_score: float = Field(ge=0.0, le=1.0)\n",
1173
+ " match_strengths: List[str]\n",
1174
+ " skill_gaps: List[str]\n",
1175
+ " recommendation: str\n",
1176
+ " fit_summary: str = Field(max_length=200)\n",
1177
+ "\n",
1178
+ "print(\"βœ… Pydantic schemas defined\")"
1179
+ ]
1180
+ },
1181
+ {
1182
+ "cell_type": "markdown",
1183
+ "metadata": {},
1184
+ "source": [
1185
+ "---\n",
1186
+ "## 🏷️ Step 12: Job Level Classification (Zero-Shot)"
1187
+ ]
1188
+ },
1189
+ {
1190
+ "cell_type": "code",
1191
+ "execution_count": 27,
1192
+ "metadata": {},
1193
+ "outputs": [
1194
+ {
1195
+ "name": "stdout",
1196
+ "output_type": "stream",
1197
+ "text": [
1198
+ "πŸ§ͺ Testing zero-shot classification...\n",
1199
+ "\n",
1200
+ "πŸ“Š Classification Result:\n",
1201
+ "{\n",
1202
+ " \"level\": \"Unknown\",\n",
1203
+ " \"confidence\": 0.0,\n",
1204
+ " \"reasoning\": \"Failed to parse response\"\n",
1205
+ "}\n"
1206
+ ]
1207
+ }
1208
+ ],
1209
+ "source": [
1210
+ "def classify_job_level_zero_shot(job_description: str) -> Dict:\n",
1211
+ " \"\"\"\n",
1212
+ " Zero-shot job level classification.\n",
1213
+ " \n",
1214
+ " Returns classification as: Entry, Mid, Senior, or Executive\n",
1215
+ " \"\"\"\n",
1216
+ " \n",
1217
+ " prompt = f\"\"\"Classify this job posting into ONE seniority level.\n",
1218
+ "\n",
1219
+ "Levels:\n",
1220
+ "- Entry: 0-2 years experience, junior roles\n",
1221
+ "- Mid: 3-5 years experience, independent work\n",
1222
+ "- Senior: 6-10 years experience, technical leadership\n",
1223
+ "- Executive: 10+ years, strategic leadership, C-level\n",
1224
+ "\n",
1225
+ "Job Posting:\n",
1226
+ "{job_description[:500]}\n",
1227
+ "\n",
1228
+ "Return ONLY valid JSON:\n",
1229
+ "{{\n",
1230
+ " \"level\": \"Entry|Mid|Senior|Executive\",\n",
1231
+ " \"confidence\": 0.85,\n",
1232
+ " \"reasoning\": \"Brief explanation\"\n",
1233
+ "}}\n",
1234
+ "\"\"\"\n",
1235
+ " \n",
1236
+ " response = call_llm(prompt)\n",
1237
+ " \n",
1238
+ " try:\n",
1239
+ " # Extract JSON\n",
1240
+ " json_str = response.strip()\n",
1241
+ " if '```json' in json_str:\n",
1242
+ " json_str = json_str.split('```json')[1].split('```')[0].strip()\n",
1243
+ " elif '```' in json_str:\n",
1244
+ " json_str = json_str.split('```')[1].split('```')[0].strip()\n",
1245
+ " \n",
1246
+ " # Find JSON in response\n",
1247
+ " if '{' in json_str and '}' in json_str:\n",
1248
+ " start = json_str.index('{')\n",
1249
+ " end = json_str.rindex('}') + 1\n",
1250
+ " json_str = json_str[start:end]\n",
1251
+ " \n",
1252
+ " result = json.loads(json_str)\n",
1253
+ " return result\n",
1254
+ " except:\n",
1255
+ " return {\n",
1256
+ " \"level\": \"Unknown\",\n",
1257
+ " \"confidence\": 0.0,\n",
1258
+ " \"reasoning\": \"Failed to parse response\"\n",
1259
+ " }\n",
1260
+ "\n",
1261
+ "# Test if LLM available and data loaded\n",
1262
+ "if LLM_AVAILABLE and len(postings) > 0:\n",
1263
+ " print(\"πŸ§ͺ Testing zero-shot classification...\\n\")\n",
1264
+ " sample = postings.iloc[0]['description']\n",
1265
+ " result = classify_job_level_zero_shot(sample)\n",
1266
+ " \n",
1267
+ " print(\"πŸ“Š Classification Result:\")\n",
1268
+ " print(json.dumps(result, indent=2))\n",
1269
+ "else:\n",
1270
+ " print(\"⚠️ Skipped - LLM not available or no data\")"
1271
+ ]
1272
+ },
1273
+ {
1274
+ "cell_type": "markdown",
1275
+ "metadata": {},
1276
+ "source": [
1277
+ "---\n",
1278
+ "## πŸŽ“ Step 13: Few-Shot Learning"
1279
+ ]
1280
+ },
1281
+ {
1282
+ "cell_type": "code",
1283
+ "execution_count": 28,
1284
+ "metadata": {},
1285
+ "outputs": [
1286
+ {
1287
+ "name": "stdout",
1288
+ "output_type": "stream",
1289
+ "text": [
1290
+ "πŸ§ͺ Comparing Zero-Shot vs Few-Shot...\n",
1291
+ "\n",
1292
+ "πŸ“Š Comparison:\n",
1293
+ "Zero-shot: Unknown (confidence: 0.00)\n",
1294
+ "Few-shot: Unknown (confidence: 0.00)\n"
1295
+ ]
1296
+ }
1297
+ ],
1298
+ "source": [
1299
+ "def classify_job_level_few_shot(job_description: str) -> Dict:\n",
1300
+ " \"\"\"\n",
1301
+ " Few-shot classification with examples.\n",
1302
+ " \"\"\"\n",
1303
+ " \n",
1304
+ " prompt = f\"\"\"Classify this job posting using examples.\n",
1305
+ "\n",
1306
+ "EXAMPLES:\n",
1307
+ "\n",
1308
+ "Example 1 (Entry):\n",
1309
+ "\"Recent graduate wanted. Python basics. Mentorship provided.\"\n",
1310
+ "β†’ Entry level (learning focus, 0-2 years)\n",
1311
+ "\n",
1312
+ "Example 2 (Senior):\n",
1313
+ "\"5+ years backend. Lead team of 3. System architecture.\"\n",
1314
+ "β†’ Senior level (technical leadership, 6-10 years)\n",
1315
+ "\n",
1316
+ "Example 3 (Executive):\n",
1317
+ "\"CTO position. 15+ years. Define technical strategy.\"\n",
1318
+ "β†’ Executive level (C-level, strategic)\n",
1319
+ "\n",
1320
+ "NOW CLASSIFY:\n",
1321
+ "{job_description[:500]}\n",
1322
+ "\n",
1323
+ "Return JSON:\n",
1324
+ "{{\n",
1325
+ " \"level\": \"Entry|Mid|Senior|Executive\",\n",
1326
+ " \"confidence\": 0.0-1.0,\n",
1327
+ " \"reasoning\": \"Explain\"\n",
1328
+ "}}\n",
1329
+ "\"\"\"\n",
1330
+ " \n",
1331
+ " response = call_llm(prompt)\n",
1332
+ " \n",
1333
+ " try:\n",
1334
+ " json_str = response.strip()\n",
1335
+ " if '```json' in json_str:\n",
1336
+ " json_str = json_str.split('```json')[1].split('```')[0].strip()\n",
1337
+ " \n",
1338
+ " if '{' in json_str and '}' in json_str:\n",
1339
+ " start = json_str.index('{')\n",
1340
+ " end = json_str.rindex('}') + 1\n",
1341
+ " json_str = json_str[start:end]\n",
1342
+ " \n",
1343
+ " result = json.loads(json_str)\n",
1344
+ " return result\n",
1345
+ " except:\n",
1346
+ " return {\"level\": \"Unknown\", \"confidence\": 0.0, \"reasoning\": \"Parse error\"}\n",
1347
+ "\n",
1348
+ "# Compare zero-shot vs few-shot\n",
1349
+ "if LLM_AVAILABLE and len(postings) > 0:\n",
1350
+ " print(\"πŸ§ͺ Comparing Zero-Shot vs Few-Shot...\\n\")\n",
1351
+ " sample = postings.iloc[0]['description']\n",
1352
+ " \n",
1353
+ " zero = classify_job_level_zero_shot(sample)\n",
1354
+ " few = classify_job_level_few_shot(sample)\n",
1355
+ " \n",
1356
+ " print(\"πŸ“Š Comparison:\")\n",
1357
+ " print(f\"Zero-shot: {zero['level']} (confidence: {zero['confidence']:.2f})\")\n",
1358
+ " print(f\"Few-shot: {few['level']} (confidence: {few['confidence']:.2f})\")\n",
1359
+ "else:\n",
1360
+ " print(\"⚠️ Skipped\")"
1361
+ ]
1362
+ },
1363
+ {
1364
+ "cell_type": "markdown",
1365
+ "metadata": {},
1366
+ "source": [
1367
+ "---\n",
1368
+ "## πŸ” Step 14: Structured Skills Extraction"
1369
+ ]
1370
+ },
1371
+ {
1372
+ "cell_type": "code",
1373
+ "execution_count": 29,
1374
+ "metadata": {},
1375
+ "outputs": [
1376
+ {
1377
+ "name": "stdout",
1378
+ "output_type": "stream",
1379
+ "text": [
1380
+ "πŸ” Testing skills extraction...\n",
1381
+ "\n",
1382
+ "πŸ“Š Extracted Skills:\n",
1383
+ "{\n",
1384
+ " \"technical_skills\": [\n",
1385
+ " \"Adobe Creative Cloud (Indesign, Illustrator, Photoshop)\",\n",
1386
+ " \"Microsoft Office Suite\"\n",
1387
+ " ],\n",
1388
+ " \"soft_skills\": [\n",
1389
+ " \"Communication\",\n",
1390
+ " \"Leadership\"\n",
1391
+ " ],\n",
1392
+ " \"certifications\": [],\n",
1393
+ " \"languages\": [\n",
1394
+ " \"English\",\n",
1395
+ " \"Danish\"\n",
1396
+ " ]\n",
1397
+ "}\n"
1398
+ ]
1399
+ }
1400
+ ],
1401
+ "source": [
1402
+ "def extract_skills_taxonomy(job_description: str) -> Dict:\n",
1403
+ " \"\"\"\n",
1404
+ " Extract structured skills using LLM + Pydantic validation.\n",
1405
+ " \"\"\"\n",
1406
+ " \n",
1407
+ " prompt = f\"\"\"Extract skills from this job posting.\n",
1408
+ "\n",
1409
+ "Job Posting:\n",
1410
+ "{job_description[:800]}\n",
1411
+ "\n",
1412
+ "Return ONLY valid JSON:\n",
1413
+ "{{\n",
1414
+ " \"technical_skills\": [\"Python\", \"Docker\", \"AWS\"],\n",
1415
+ " \"soft_skills\": [\"Communication\", \"Leadership\"],\n",
1416
+ " \"certifications\": [\"AWS Certified\"],\n",
1417
+ " \"languages\": [\"English\", \"Danish\"]\n",
1418
+ "}}\n",
1419
+ "\"\"\"\n",
1420
+ " \n",
1421
+ " response = call_llm(prompt, max_tokens=800)\n",
1422
+ " \n",
1423
+ " try:\n",
1424
+ " json_str = response.strip()\n",
1425
+ " if '```json' in json_str:\n",
1426
+ " json_str = json_str.split('```json')[1].split('```')[0].strip()\n",
1427
+ " \n",
1428
+ " if '{' in json_str and '}' in json_str:\n",
1429
+ " start = json_str.index('{')\n",
1430
+ " end = json_str.rindex('}') + 1\n",
1431
+ " json_str = json_str[start:end]\n",
1432
+ " \n",
1433
+ " data = json.loads(json_str)\n",
1434
+ " # Validate with Pydantic\n",
1435
+ " validated = SkillsTaxonomy(**data)\n",
1436
+ " return validated.model_dump()\n",
1437
+ " except:\n",
1438
+ " return {\n",
1439
+ " \"technical_skills\": [],\n",
1440
+ " \"soft_skills\": [],\n",
1441
+ " \"certifications\": [],\n",
1442
+ " \"languages\": []\n",
1443
+ " }\n",
1444
+ "\n",
1445
+ "# Test extraction\n",
1446
+ "if LLM_AVAILABLE and len(postings) > 0:\n",
1447
+ " print(\"πŸ” Testing skills extraction...\\n\")\n",
1448
+ " sample = postings.iloc[0]['description']\n",
1449
+ " skills = extract_skills_taxonomy(sample)\n",
1450
+ " \n",
1451
+ " print(\"πŸ“Š Extracted Skills:\")\n",
1452
+ " print(json.dumps(skills, indent=2))\n",
1453
+ "else:\n",
1454
+ " print(\"⚠️ Skipped\")"
1455
+ ]
1456
+ },
1457
+ {
1458
+ "cell_type": "markdown",
1459
+ "metadata": {},
1460
+ "source": [
1461
+ "---\n",
1462
+ "## πŸ’‘ Step 15: Match Explainability"
1463
+ ]
1464
+ },
1465
+ {
1466
+ "cell_type": "code",
1467
+ "execution_count": 30,
1468
+ "metadata": {},
1469
+ "outputs": [
1470
+ {
1471
+ "name": "stdout",
1472
+ "output_type": "stream",
1473
+ "text": [
1474
+ "πŸ’‘ Testing match explainability...\n",
1475
+ "\n",
1476
+ "πŸ“Š Match Explanation:\n",
1477
+ "{\n",
1478
+ " \"overall_score\": 0.7028058171272278,\n",
1479
+ " \"match_strengths\": [\n",
1480
+ " \"Big Data\",\n",
1481
+ " \"Machine Learning\",\n",
1482
+ " \"Cloud\",\n",
1483
+ " \"Data Science\",\n",
1484
+ " \"Data Structures\"\n",
1485
+ " ],\n",
1486
+ " \"skill_gaps\": [\n",
1487
+ " \"TeachTown-specific skills\"\n",
1488
+ " ],\n",
1489
+ " \"recommendation\": \"Encourage the candidate to learn TeachTown-specific skills\",\n",
1490
+ " \"fit_summary\": \"The candidate has a strong background in big data, machine learning, and cloud technologies, but may need to learn TeachTown-specific skills to fully align with the company's needs.\"\n",
1491
+ "}\n"
1492
+ ]
1493
+ }
1494
+ ],
1495
+ "source": [
1496
+ "def explain_match(candidate_idx: int, company_idx: int, similarity_score: float) -> Dict:\n",
1497
+ " \"\"\"\n",
1498
+ " Generate LLM explanation for why candidate matches company.\n",
1499
+ " \"\"\"\n",
1500
+ " \n",
1501
+ " cand = candidates.iloc[candidate_idx]\n",
1502
+ " comp = companies_full.iloc[company_idx]\n",
1503
+ " \n",
1504
+ " cand_skills = str(cand.get('skills', 'N/A'))[:300]\n",
1505
+ " cand_exp = str(cand.get('positions', 'N/A'))[:300]\n",
1506
+ " comp_req = str(comp.get('required_skills', 'N/A'))[:300]\n",
1507
+ " comp_name = comp.get('name', 'Unknown')\n",
1508
+ " \n",
1509
+ " prompt = f\"\"\"Explain why this candidate matches this company.\n",
1510
+ "\n",
1511
+ "Candidate:\n",
1512
+ "Skills: {cand_skills}\n",
1513
+ "Experience: {cand_exp}\n",
1514
+ "\n",
1515
+ "Company: {comp_name}\n",
1516
+ "Requirements: {comp_req}\n",
1517
+ "\n",
1518
+ "Similarity Score: {similarity_score:.2f}\n",
1519
+ "\n",
1520
+ "Return JSON:\n",
1521
+ "{{\n",
1522
+ " \"overall_score\": {similarity_score},\n",
1523
+ " \"match_strengths\": [\"Top 3-5 matching factors\"],\n",
1524
+ " \"skill_gaps\": [\"Missing skills\"],\n",
1525
+ " \"recommendation\": \"What candidate should do\",\n",
1526
+ " \"fit_summary\": \"One sentence summary\"\n",
1527
+ "}}\n",
1528
+ "\"\"\"\n",
1529
+ " \n",
1530
+ " response = call_llm(prompt, max_tokens=1000)\n",
1531
+ " \n",
1532
+ " try:\n",
1533
+ " json_str = response.strip()\n",
1534
+ " if '```json' in json_str:\n",
1535
+ " json_str = json_str.split('```json')[1].split('```')[0].strip()\n",
1536
+ " \n",
1537
+ " if '{' in json_str and '}' in json_str:\n",
1538
+ " start = json_str.index('{')\n",
1539
+ " end = json_str.rindex('}') + 1\n",
1540
+ " json_str = json_str[start:end]\n",
1541
+ " \n",
1542
+ " data = json.loads(json_str)\n",
1543
+ " return data\n",
1544
+ " except:\n",
1545
+ " return {\n",
1546
+ " \"overall_score\": similarity_score,\n",
1547
+ " \"match_strengths\": [\"Unable to generate\"],\n",
1548
+ " \"skill_gaps\": [],\n",
1549
+ " \"recommendation\": \"Review manually\",\n",
1550
+ " \"fit_summary\": f\"Match score: {similarity_score:.2f}\"\n",
1551
+ " }\n",
1552
+ "\n",
1553
+ "# Test explainability\n",
1554
+ "if LLM_AVAILABLE and cand_vectors is not None and len(candidates) > 0:\n",
1555
+ " print(\"πŸ’‘ Testing match explainability...\\n\")\n",
1556
+ " matches = find_top_matches(0, top_k=1)\n",
1557
+ " if matches:\n",
1558
+ " comp_idx, score = matches[0]\n",
1559
+ " explanation = explain_match(0, comp_idx, score)\n",
1560
+ " \n",
1561
+ " print(\"πŸ“Š Match Explanation:\")\n",
1562
+ " print(json.dumps(explanation, indent=2))\n",
1563
+ "else:\n",
1564
+ " print(\"⚠️ Skipped - requirements not met\")"
1565
+ ]
1566
+ },
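+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Optional: shared JSON-recovery helper\n",
+ "\n",
+ "Steps 14 and 15 duplicate the same fence-stripping and brace-slicing logic. One possible refactor, sketched here rather than wired in, is a small helper both parsers could call:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def extract_json_object(response: str) -> dict | None:\n",
+ "    \"\"\"Recover the first JSON object in an LLM response (refactor sketch).\n",
+ "\n",
+ "    Returns None on failure so each caller can keep its own fallback dict.\n",
+ "    \"\"\"\n",
+ "    json_str = response.strip()\n",
+ "    # Strip a Markdown code fence if the model wrapped its JSON\n",
+ "    if '```json' in json_str:\n",
+ "        json_str = json_str.split('```json')[1].split('```')[0].strip()\n",
+ "    # Slice down to the outermost {...} in case of surrounding prose\n",
+ "    if '{' in json_str and '}' in json_str:\n",
+ "        json_str = json_str[json_str.index('{'):json_str.rindex('}') + 1]\n",
+ "    try:\n",
+ "        return json.loads(json_str)\n",
+ "    except json.JSONDecodeError:\n",
+ "        return None"
+ ]
+ },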
1567
+ {
1568
+ "cell_type": "markdown",
1569
+ "metadata": {},
1570
+ "source": [
1571
+ "---\n",
1572
+ "## πŸ“Š Step 16: Summary\n",
1573
+ "\n",
1574
+ "### What We Built"
1575
+ ]
1576
+ },
1577
+ {
1578
+ "cell_type": "code",
1579
+ "execution_count": 31,
1580
+ "metadata": {},
1581
+ "outputs": [
1582
+ {
1583
+ "name": "stdout",
1584
+ "output_type": "stream",
1585
+ "text": [
1586
+ "======================================================================\n",
1587
+ "🎯 HRHUB v2.1 - SUMMARY\n",
1588
+ "======================================================================\n",
1589
+ "\n",
1590
+ "βœ… IMPLEMENTED:\n",
1591
+ " 1. Zero-Shot Job Classification (Entry/Mid/Senior/Executive)\n",
1592
+ " 2. Few-Shot Learning with Examples\n",
1593
+ " 3. Structured Skills Extraction (Pydantic schemas)\n",
1594
+ " 4. Match Explainability (LLM-generated reasoning)\n",
1595
+ " 5. FREE LLM Integration (Hugging Face)\n",
1596
+ " 6. Flexible Data Loading (Upload OR Google Drive)\n",
1597
+ "\n",
1598
+ "πŸ’° COST: $0.00 (completely free!)\n",
1599
+ "\n",
1600
+ "πŸ“ˆ COURSE ALIGNMENT:\n",
1601
+ " βœ… LLMs for structured output\n",
1602
+ " βœ… Pydantic schemas\n",
1603
+ " βœ… Classification pipelines\n",
1604
+ " βœ… Zero-shot & few-shot learning\n",
1605
+ " βœ… JSON extraction\n",
1606
+ " βœ… Transformer architecture (embeddings)\n",
1607
+ " βœ… API deployment strategies\n",
1608
+ "\n",
1609
+ "======================================================================\n",
1610
+ "πŸš€ READY TO MOVE TO VS CODE!\n",
1611
+ "======================================================================\n"
1612
+ ]
1613
+ }
1614
+ ],
1615
+ "source": [
1616
+ "print(\"=\"*70)\n",
1617
+ "print(\"🎯 HRHUB v2.1 - SUMMARY\")\n",
1618
+ "print(\"=\"*70)\n",
1619
+ "print(\"\")\n",
1620
+ "print(\"βœ… IMPLEMENTED:\")\n",
1621
+ "print(\" 1. Zero-Shot Job Classification (Entry/Mid/Senior/Executive)\")\n",
1622
+ "print(\" 2. Few-Shot Learning with Examples\")\n",
1623
+ "print(\" 3. Structured Skills Extraction (Pydantic schemas)\")\n",
1624
+ "print(\" 4. Match Explainability (LLM-generated reasoning)\")\n",
1625
+ "print(\" 5. FREE LLM Integration (Hugging Face)\")\n",
1626
+ "print(\" 6. Flexible Data Loading (Upload OR Google Drive)\")\n",
1627
+ "print(\"\")\n",
1628
+ "print(\"πŸ’° COST: $0.00 (completely free!)\")\n",
1629
+ "print(\"\")\n",
1630
+ "print(\"πŸ“ˆ COURSE ALIGNMENT:\")\n",
1631
+ "print(\" βœ… LLMs for structured output\")\n",
1632
+ "print(\" βœ… Pydantic schemas\")\n",
1633
+ "print(\" βœ… Classification pipelines\")\n",
1634
+ "print(\" βœ… Zero-shot & few-shot learning\")\n",
1635
+ "print(\" βœ… JSON extraction\")\n",
1636
+ "print(\" βœ… Transformer architecture (embeddings)\")\n",
1637
+ "print(\" βœ… API deployment strategies\")\n",
1638
+ "print(\"\")\n",
1639
+ "print(\"=\"*70)\n",
1640
+ "print(\"πŸš€ READY TO MOVE TO VS CODE!\")\n",
1641
+ "print(\"=\"*70)"
1642
+ ]
1643
+ }
1672
+ ],
1673
+ "metadata": {
1674
+ "kernelspec": {
1675
+ "display_name": "venv",
1676
+ "language": "python",
1677
+ "name": "python3"
1678
+ },
1679
+ "language_info": {
1680
+ "codemirror_mode": {
1681
+ "name": "ipython",
1682
+ "version": 3
1683
+ },
1684
+ "file_extension": ".py",
1685
+ "mimetype": "text/x-python",
1686
+ "name": "python",
1687
+ "nbconvert_exporter": "python",
1688
+ "pygments_lexer": "ipython3",
1689
+ "version": "3.12.3"
1690
+ }
1691
+ },
1692
+ "nbformat": 4,
1693
+ "nbformat_minor": 2
1694
+ }