Roger Surf committed on
Commit
def3477
·
1 Parent(s): 782c177

new notebook hrhub_v2.1_enhanced

.gitignore CHANGED
@@ -5,4 +5,5 @@ __pycache__/
  .DS_Store
  *.log
  .streamlit/
- *.csv
+ *.csv
+ .env
data/notebooks/HRHUB_v2.1_Enhanced_FREE.ipynb ADDED
@@ -0,0 +1,1694 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# 🧠 HRHUB v2.1 - Enhanced with LLM (FREE VERSION)\n",
8
+ "\n",
9
+ "## πŸ“˜ Project Overview\n",
10
+ "\n",
11
+ "**Bilateral HR Matching System with LLM-Powered Intelligence**\n",
12
+ "\n",
13
+ "### What's New in v2.1:\n",
14
+ "- βœ… **FREE LLM**: Using Hugging Face Inference API (no cost)\n",
15
+ "- βœ… **Job Level Classification**: Zero-shot & few-shot learning\n",
16
+ "- βœ… **Structured Skills Extraction**: Pydantic schemas\n",
17
+ "- βœ… **Match Explainability**: LLM-generated reasoning\n",
18
+ "- βœ… **Flexible Data Loading**: Upload OR Google Drive\n",
19
+ "\n",
20
+ "### Tech Stack:\n",
21
+ "```\n",
22
+ "Embeddings: sentence-transformers (local, free)\n",
23
+ "LLM: Hugging Face Inference API (free tier)\n",
24
+ "Schemas: Pydantic\n",
25
+ "Platform: Google Colab β†’ VS Code\n",
26
+ "```\n",
27
+ "\n",
28
+ "---\n",
29
+ "\n",
30
+ "**Master's Thesis - Aalborg University** \n",
31
+ "*Business Data Science Program* \n",
32
+ "*December 2025*"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "markdown",
37
+ "metadata": {},
38
+ "source": [
39
+ "---\n",
40
+ "## πŸ“¦ Step 1: Install Dependencies"
41
+ ]
42
+ },
43
+ {
44
+ "cell_type": "code",
45
+ "execution_count": 1,
46
+ "metadata": {},
47
+ "outputs": [
48
+ {
49
+ "name": "stdout",
50
+ "output_type": "stream",
51
+ "text": [
52
+ "βœ… All packages installed!\n"
53
+ ]
54
+ }
55
+ ],
56
+ "source": [
57
+ "# Install required packages\n",
58
+ "#!pip install -q sentence-transformers huggingface-hub pydantic plotly pyvis nbformat scikit-learn pandas numpy\n",
59
+ "\n",
60
+ "print(\"βœ… All packages installed!\")"
61
+ ]
62
+ },
63
+ {
64
+ "cell_type": "markdown",
65
+ "metadata": {},
66
+ "source": [
67
+ "---\n",
68
+ "## πŸ“š Step 2: Import Libraries"
69
+ ]
70
+ },
71
+ {
72
+ "cell_type": "code",
73
+ "execution_count": 2,
74
+ "metadata": {},
75
+ "outputs": [
76
+ {
77
+ "name": "stdout",
78
+ "output_type": "stream",
79
+ "text": [
80
+ "βœ… Environment variables loaded from .env\n",
81
+ "βœ… All libraries imported!\n"
82
+ ]
83
+ }
84
+ ],
85
+ "source": [
86
+ "import pandas as pd\n",
87
+ "import numpy as np\n",
88
+ "import json\n",
89
+ "import os\n",
90
+ "from typing import List, Dict, Optional, Literal\n",
91
+ "import warnings\n",
92
+ "warnings.filterwarnings('ignore')\n",
93
+ "\n",
94
+ "# ML & NLP\n",
95
+ "from sentence_transformers import SentenceTransformer\n",
96
+ "from sklearn.metrics.pairwise import cosine_similarity\n",
97
+ "\n",
98
+ "# LLM Integration (FREE)\n",
99
+ "from huggingface_hub import InferenceClient\n",
100
+ "from pydantic import BaseModel, Field\n",
101
+ "\n",
102
+ "# Visualization\n",
103
+ "import plotly.graph_objects as go\n",
104
+ "from IPython.display import HTML, display\n",
105
+ "\n",
106
+ "# Configuration Settings\n",
107
+ "from dotenv import load_dotenv\n",
108
+ "\n",
109
+ "# Carrega variΓ‘veis do .env\n",
110
+ "load_dotenv()\n",
111
+ "print(\"βœ… Environment variables loaded from .env\")\n",
113
+ "\n",
114
+ "print(\"βœ… All libraries imported!\")"
115
+ ]
116
+ },
117
+ {
118
+ "cell_type": "markdown",
119
+ "metadata": {},
120
+ "source": [
121
+ "---\n",
122
+ "## πŸ”§ Step 3: Configuration"
123
+ ]
124
+ },
125
+ {
126
+ "cell_type": "code",
127
+ "execution_count": 3,
128
+ "metadata": {},
129
+ "outputs": [
130
+ {
131
+ "name": "stdout",
132
+ "output_type": "stream",
133
+ "text": [
134
+ "βœ… Configuration loaded!\n",
135
+ "🧠 Embedding model: all-MiniLM-L6-v2\n",
136
+ "πŸ€– LLM model: meta-llama/Llama-3.2-3B-Instruct\n",
137
+ "πŸ”‘ HF Token configured: Yes βœ…\n",
138
+ "πŸ“‚ Data path: ../csv_files/\n"
139
+ ]
140
+ }
141
+ ],
142
+ "source": [
143
+ "class Config:\n",
144
+ " \"\"\"Centralized configuration for VS Code\"\"\"\n",
145
+ " \n",
146
+ " # Paths - VS Code structure\n",
147
+ " CSV_PATH = '../csv_files/'\n",
148
+ " PROCESSED_PATH = '../processed/'\n",
149
+ " RESULTS_PATH = '../results/'\n",
150
+ " \n",
151
+ " # Embedding Model\n",
152
+ " EMBEDDING_MODEL = 'all-MiniLM-L6-v2'\n",
153
+ " \n",
154
+ " # LLM Settings (FREE - Hugging Face)\n",
155
+ " HF_TOKEN = os.getenv('HF_TOKEN', '') # βœ… Pega do .env\n",
156
+ " LLM_MODEL = 'meta-llama/Llama-3.2-3B-Instruct'\n",
157
+ " \n",
158
+ " LLM_MAX_TOKENS = 1000\n",
159
+ " \n",
160
+ " # Matching Parameters\n",
161
+ " TOP_K_MATCHES = 10\n",
162
+ " SIMILARITY_THRESHOLD = 0.5\n",
163
+ " RANDOM_SEED = 42\n",
164
+ "\n",
165
+ "np.random.seed(Config.RANDOM_SEED)\n",
166
+ "\n",
167
+ "print(\"βœ… Configuration loaded!\")\n",
168
+ "print(f\"🧠 Embedding model: {Config.EMBEDDING_MODEL}\")\n",
169
+ "print(f\"πŸ€– LLM model: {Config.LLM_MODEL}\")\n",
170
+ "print(f\"πŸ”‘ HF Token configured: {'Yes βœ…' if Config.HF_TOKEN else 'No ⚠️'}\")\n",
171
+ "print(f\"πŸ“‚ Data path: {Config.CSV_PATH}\")"
172
+ ]
173
+ },
174
+ {
175
+ "cell_type": "markdown",
176
+ "metadata": {},
177
+ "source": [
178
+ "---\n",
179
+ "## πŸ“Š Step 5: Load All Datasets"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "code",
184
+ "execution_count": 4,
185
+ "metadata": {},
186
+ "outputs": [
187
+ {
188
+ "name": "stdout",
189
+ "output_type": "stream",
190
+ "text": [
191
+ "πŸ“‚ Loading all datasets...\n",
192
+ "\n",
193
+ "======================================================================\n",
194
+ "βœ… Candidates: 9,544 rows Γ— 35 columns\n",
195
+ "βœ… Companies (base): 24,473 rows\n",
196
+ "βœ… Company industries: 24,375 rows\n",
197
+ "βœ… Company specialties: 169,387 rows\n",
198
+ "βœ… Employee counts: 35,787 rows\n",
199
+ "βœ… Postings: 123,849 rows Γ— 31 columns\n",
200
+ "βœ… Job skills: 213,768 rows\n",
201
+ "βœ… Job industries: 164,808 rows\n",
202
+ "\n",
203
+ "======================================================================\n",
204
+ "βœ… All datasets loaded successfully!\n",
205
+ "\n"
206
+ ]
207
+ }
208
+ ],
209
+ "source": [
210
+ "print(\"πŸ“‚ Loading all datasets...\\n\")\n",
211
+ "print(\"=\" * 70)\n",
212
+ "\n",
213
+ "# Load main datasets\n",
214
+ "candidates = pd.read_csv(f'{Config.CSV_PATH}resume_data.csv')\n",
215
+ "print(f\"βœ… Candidates: {len(candidates):,} rows Γ— {len(candidates.columns)} columns\")\n",
216
+ "\n",
217
+ "companies_base = pd.read_csv(f'{Config.CSV_PATH}companies.csv')\n",
218
+ "print(f\"βœ… Companies (base): {len(companies_base):,} rows\")\n",
219
+ "\n",
220
+ "company_industries = pd.read_csv(f'{Config.CSV_PATH}company_industries.csv')\n",
221
+ "print(f\"βœ… Company industries: {len(company_industries):,} rows\")\n",
222
+ "\n",
223
+ "company_specialties = pd.read_csv(f'{Config.CSV_PATH}company_specialities.csv')\n",
224
+ "print(f\"βœ… Company specialties: {len(company_specialties):,} rows\")\n",
225
+ "\n",
226
+ "employee_counts = pd.read_csv(f'{Config.CSV_PATH}employee_counts.csv')\n",
227
+ "print(f\"βœ… Employee counts: {len(employee_counts):,} rows\")\n",
228
+ "\n",
229
+ "postings = pd.read_csv(f'{Config.CSV_PATH}postings.csv', on_bad_lines='skip', engine='python')\n",
230
+ "print(f\"βœ… Postings: {len(postings):,} rows Γ— {len(postings.columns)} columns\")\n",
231
+ "\n",
232
+ "# Optional datasets\n",
233
+ "try:\n",
234
+ " job_skills = pd.read_csv(f'{Config.CSV_PATH}job_skills.csv')\n",
235
+ " print(f\"βœ… Job skills: {len(job_skills):,} rows\")\n",
236
+ "except:\n",
237
+ " job_skills = None\n",
238
+ " print(\"⚠️ Job skills not found (optional)\")\n",
239
+ "\n",
240
+ "try:\n",
241
+ " job_industries = pd.read_csv(f'{Config.CSV_PATH}job_industries.csv')\n",
242
+ " print(f\"βœ… Job industries: {len(job_industries):,} rows\")\n",
243
+ "except:\n",
244
+ " job_industries = None\n",
245
+ " print(\"⚠️ Job industries not found (optional)\")\n",
246
+ "\n",
247
+ "print(\"\\n\" + \"=\" * 70)\n",
248
+ "print(\"βœ… All datasets loaded successfully!\\n\")"
249
+ ]
250
+ },
251
+ {
252
+ "cell_type": "markdown",
253
+ "metadata": {},
254
+ "source": [
255
+ "---\n",
256
+ "## πŸ”— Step 6: Merge & Enrich Company Data"
257
+ ]
258
+ },
259
+ {
260
+ "cell_type": "code",
261
+ "execution_count": 5,
262
+ "metadata": {},
263
+ "outputs": [
264
+ {
265
+ "name": "stdout",
266
+ "output_type": "stream",
267
+ "text": [
268
+ "πŸ”— Merging company data...\n",
269
+ "\n",
270
+ "βœ… Aggregated industries for 24,365 companies\n",
271
+ "βœ… Aggregated specialties for 17,780 companies\n",
272
+ "\n",
273
+ "βœ… Base company merge complete: 35,787 companies\n",
274
+ "\n"
275
+ ]
276
+ }
277
+ ],
278
+ "source": [
279
+ "print(\"πŸ”— Merging company data...\\n\")\n",
280
+ "\n",
281
+ "# Aggregate industries\n",
282
+ "company_industries_agg = company_industries.groupby('company_id')['industry'].apply(\n",
283
+ " lambda x: ', '.join(map(str, x.tolist()))\n",
284
+ ").reset_index()\n",
285
+ "company_industries_agg.columns = ['company_id', 'industries_list']\n",
286
+ "print(f\"βœ… Aggregated industries for {len(company_industries_agg):,} companies\")\n",
287
+ "\n",
288
+ "# Aggregate specialties\n",
289
+ "company_specialties_agg = company_specialties.groupby('company_id')['speciality'].apply(\n",
290
+ " lambda x: ' | '.join(x.astype(str).tolist())\n",
291
+ ").reset_index()\n",
292
+ "company_specialties_agg.columns = ['company_id', 'specialties_list']\n",
293
+ "print(f\"βœ… Aggregated specialties for {len(company_specialties_agg):,} companies\")\n",
294
+ "\n",
295
+ "# Merge all company data\n",
296
+ "companies_merged = companies_base.copy()\n",
297
+ "companies_merged = companies_merged.merge(company_industries_agg, on='company_id', how='left')\n",
298
+ "companies_merged = companies_merged.merge(company_specialties_agg, on='company_id', how='left')\n",
299
+ "companies_merged = companies_merged.merge(employee_counts, on='company_id', how='left')\n",
300
+ "\n",
301
+ "print(f\"\\nβœ… Base company merge complete: {len(companies_merged):,} companies\\n\")"
302
+ ]
303
+ },
304
+ {
305
+ "cell_type": "markdown",
306
+ "metadata": {},
307
+ "source": [
308
+ "---\n",
309
+ "## πŸŒ‰ Step 7: Enrich with Job Postings"
310
+ ]
311
+ },
312
+ {
313
+ "cell_type": "code",
314
+ "execution_count": 6,
315
+ "metadata": {},
316
+ "outputs": [
317
+ {
318
+ "name": "stdout",
319
+ "output_type": "stream",
320
+ "text": [
321
+ "πŸŒ‰ Enriching companies with job posting data...\n",
322
+ "\n",
323
+ "======================================================================\n",
324
+ "KEY INSIGHT: Postings = 'Requirements Language Bridge'\n",
325
+ "======================================================================\n",
326
+ "\n",
327
+ "βœ… Enriched 35,787 companies with posting data\n",
328
+ "\n"
329
+ ]
330
+ }
331
+ ],
332
+ "source": [
333
+ "print(\"πŸŒ‰ Enriching companies with job posting data...\\n\")\n",
334
+ "print(\"=\" * 70)\n",
335
+ "print(\"KEY INSIGHT: Postings = 'Requirements Language Bridge'\")\n",
336
+ "print(\"=\" * 70 + \"\\n\")\n",
337
+ "\n",
338
+ "postings = postings.fillna('')\n",
339
+ "postings['company_id'] = postings['company_id'].astype(str)\n",
340
+ "\n",
341
+ "# Aggregate postings per company\n",
342
+ "postings_agg = postings.groupby('company_id').agg({\n",
343
+ " 'title': lambda x: ' | '.join(x.astype(str).tolist()[:10]),\n",
344
+ " 'description': lambda x: ' '.join(x.astype(str).tolist()[:5]),\n",
345
+ " 'skills_desc': lambda x: ' | '.join(x.dropna().astype(str).tolist()),\n",
346
+ " 'formatted_experience_level': lambda x: ' | '.join(x.dropna().unique().astype(str)),\n",
347
+ "}).reset_index()\n",
348
+ "\n",
349
+ "postings_agg.columns = ['company_id', 'posted_job_titles', 'posted_descriptions', 'required_skills', 'experience_levels']\n",
350
+ "\n",
351
+ "companies_merged['company_id'] = companies_merged['company_id'].astype(str)\n",
352
+ "companies_full = companies_merged.merge(postings_agg, on='company_id', how='left').fillna('')\n",
353
+ "\n",
354
+ "print(f\"βœ… Enriched {len(companies_full):,} companies with posting data\\n\")"
355
+ ]
356
+ },
357
+ {
358
+ "cell_type": "code",
359
+ "execution_count": 7,
360
+ "metadata": {},
361
+ "outputs": [
362
+ {
363
+ "data": {
364
+ "text/html": [
365
+ "<div>\n",
366
+ "<style scoped>\n",
367
+ " .dataframe tbody tr th:only-of-type {\n",
368
+ " vertical-align: middle;\n",
369
+ " }\n",
370
+ "\n",
371
+ " .dataframe tbody tr th {\n",
372
+ " vertical-align: top;\n",
373
+ " }\n",
374
+ "\n",
375
+ " .dataframe thead th {\n",
376
+ " text-align: right;\n",
377
+ " }\n",
378
+ "</style>\n",
379
+ "<table border=\"1\" class=\"dataframe\">\n",
380
+ " <thead>\n",
381
+ " <tr style=\"text-align: right;\">\n",
382
+ " <th></th>\n",
383
+ " <th>company_id</th>\n",
384
+ " <th>name</th>\n",
385
+ " <th>description</th>\n",
386
+ " <th>company_size</th>\n",
387
+ " <th>state</th>\n",
388
+ " <th>country</th>\n",
389
+ " <th>city</th>\n",
390
+ " <th>zip_code</th>\n",
391
+ " <th>address</th>\n",
392
+ " <th>url</th>\n",
393
+ " <th>industries_list</th>\n",
394
+ " <th>specialties_list</th>\n",
395
+ " <th>employee_count</th>\n",
396
+ " <th>follower_count</th>\n",
397
+ " <th>time_recorded</th>\n",
398
+ " <th>posted_job_titles</th>\n",
399
+ " <th>posted_descriptions</th>\n",
400
+ " <th>required_skills</th>\n",
401
+ " <th>experience_levels</th>\n",
402
+ " </tr>\n",
403
+ " </thead>\n",
404
+ " <tbody>\n",
405
+ " <tr>\n",
406
+ " <th>0</th>\n",
407
+ " <td>1009</td>\n",
408
+ " <td>IBM</td>\n",
409
+ " <td>At IBM, we do more than work. We create. We cr...</td>\n",
410
+ " <td>7.0</td>\n",
411
+ " <td>NY</td>\n",
412
+ " <td>US</td>\n",
413
+ " <td>Armonk, New York</td>\n",
414
+ " <td>10504</td>\n",
415
+ " <td>International Business Machines Corp.</td>\n",
416
+ " <td>https://www.linkedin.com/company/ibm</td>\n",
417
+ " <td>IT Services and IT Consulting</td>\n",
418
+ " <td>Cloud | Mobile | Cognitive | Security | Resear...</td>\n",
419
+ " <td>314102</td>\n",
420
+ " <td>16253625</td>\n",
421
+ " <td>1712378162</td>\n",
422
+ " <td></td>\n",
423
+ " <td></td>\n",
424
+ " <td></td>\n",
425
+ " <td></td>\n",
426
+ " </tr>\n",
427
+ " <tr>\n",
428
+ " <th>1</th>\n",
429
+ " <td>1009</td>\n",
430
+ " <td>IBM</td>\n",
431
+ " <td>At IBM, we do more than work. We create. We cr...</td>\n",
432
+ " <td>7.0</td>\n",
433
+ " <td>NY</td>\n",
434
+ " <td>US</td>\n",
435
+ " <td>Armonk, New York</td>\n",
436
+ " <td>10504</td>\n",
437
+ " <td>International Business Machines Corp.</td>\n",
438
+ " <td>https://www.linkedin.com/company/ibm</td>\n",
439
+ " <td>IT Services and IT Consulting</td>\n",
440
+ " <td>Cloud | Mobile | Cognitive | Security | Resear...</td>\n",
441
+ " <td>313142</td>\n",
442
+ " <td>16309464</td>\n",
443
+ " <td>1713392385</td>\n",
444
+ " <td></td>\n",
445
+ " <td></td>\n",
446
+ " <td></td>\n",
447
+ " <td></td>\n",
448
+ " </tr>\n",
449
+ " <tr>\n",
450
+ " <th>2</th>\n",
451
+ " <td>1009</td>\n",
452
+ " <td>IBM</td>\n",
453
+ " <td>At IBM, we do more than work. We create. We cr...</td>\n",
454
+ " <td>7.0</td>\n",
455
+ " <td>NY</td>\n",
456
+ " <td>US</td>\n",
457
+ " <td>Armonk, New York</td>\n",
458
+ " <td>10504</td>\n",
459
+ " <td>International Business Machines Corp.</td>\n",
460
+ " <td>https://www.linkedin.com/company/ibm</td>\n",
461
+ " <td>IT Services and IT Consulting</td>\n",
462
+ " <td>Cloud | Mobile | Cognitive | Security | Resear...</td>\n",
463
+ " <td>313147</td>\n",
464
+ " <td>16309985</td>\n",
465
+ " <td>1713402495</td>\n",
466
+ " <td></td>\n",
467
+ " <td></td>\n",
468
+ " <td></td>\n",
469
+ " <td></td>\n",
470
+ " </tr>\n",
471
+ " <tr>\n",
472
+ " <th>3</th>\n",
473
+ " <td>1009</td>\n",
474
+ " <td>IBM</td>\n",
475
+ " <td>At IBM, we do more than work. We create. We cr...</td>\n",
476
+ " <td>7.0</td>\n",
477
+ " <td>NY</td>\n",
478
+ " <td>US</td>\n",
479
+ " <td>Armonk, New York</td>\n",
480
+ " <td>10504</td>\n",
481
+ " <td>International Business Machines Corp.</td>\n",
482
+ " <td>https://www.linkedin.com/company/ibm</td>\n",
483
+ " <td>IT Services and IT Consulting</td>\n",
484
+ " <td>Cloud | Mobile | Cognitive | Security | Resear...</td>\n",
485
+ " <td>311223</td>\n",
486
+ " <td>16314846</td>\n",
487
+ " <td>1713501255</td>\n",
488
+ " <td></td>\n",
489
+ " <td></td>\n",
490
+ " <td></td>\n",
491
+ " <td></td>\n",
492
+ " </tr>\n",
493
+ " <tr>\n",
494
+ " <th>4</th>\n",
495
+ " <td>1016</td>\n",
496
+ " <td>GE HealthCare</td>\n",
497
+ " <td>Every day millions of people feel the impact o...</td>\n",
498
+ " <td>7.0</td>\n",
499
+ " <td>0</td>\n",
500
+ " <td>US</td>\n",
501
+ " <td>Chicago</td>\n",
502
+ " <td>0</td>\n",
503
+ " <td>-</td>\n",
504
+ " <td>https://www.linkedin.com/company/gehealthcare</td>\n",
505
+ " <td>Hospitals and Health Care</td>\n",
506
+ " <td>Healthcare | Biotechnology</td>\n",
507
+ " <td>56873</td>\n",
508
+ " <td>2185368</td>\n",
509
+ " <td>1712382540</td>\n",
510
+ " <td></td>\n",
511
+ " <td></td>\n",
512
+ " <td></td>\n",
513
+ " <td></td>\n",
514
+ " </tr>\n",
515
+ " </tbody>\n",
516
+ "</table>\n",
517
+ "</div>"
518
+ ],
519
+ "text/plain": [
520
+ " company_id name \\\n",
521
+ "0 1009 IBM \n",
522
+ "1 1009 IBM \n",
523
+ "2 1009 IBM \n",
524
+ "3 1009 IBM \n",
525
+ "4 1016 GE HealthCare \n",
526
+ "\n",
527
+ " description company_size state \\\n",
528
+ "0 At IBM, we do more than work. We create. We cr... 7.0 NY \n",
529
+ "1 At IBM, we do more than work. We create. We cr... 7.0 NY \n",
530
+ "2 At IBM, we do more than work. We create. We cr... 7.0 NY \n",
531
+ "3 At IBM, we do more than work. We create. We cr... 7.0 NY \n",
532
+ "4 Every day millions of people feel the impact o... 7.0 0 \n",
533
+ "\n",
534
+ " country city zip_code address \\\n",
535
+ "0 US Armonk, New York 10504 International Business Machines Corp. \n",
536
+ "1 US Armonk, New York 10504 International Business Machines Corp. \n",
537
+ "2 US Armonk, New York 10504 International Business Machines Corp. \n",
538
+ "3 US Armonk, New York 10504 International Business Machines Corp. \n",
539
+ "4 US Chicago 0 - \n",
540
+ "\n",
541
+ " url \\\n",
542
+ "0 https://www.linkedin.com/company/ibm \n",
543
+ "1 https://www.linkedin.com/company/ibm \n",
544
+ "2 https://www.linkedin.com/company/ibm \n",
545
+ "3 https://www.linkedin.com/company/ibm \n",
546
+ "4 https://www.linkedin.com/company/gehealthcare \n",
547
+ "\n",
548
+ " industries_list \\\n",
549
+ "0 IT Services and IT Consulting \n",
550
+ "1 IT Services and IT Consulting \n",
551
+ "2 IT Services and IT Consulting \n",
552
+ "3 IT Services and IT Consulting \n",
553
+ "4 Hospitals and Health Care \n",
554
+ "\n",
555
+ " specialties_list employee_count \\\n",
556
+ "0 Cloud | Mobile | Cognitive | Security | Resear... 314102 \n",
557
+ "1 Cloud | Mobile | Cognitive | Security | Resear... 313142 \n",
558
+ "2 Cloud | Mobile | Cognitive | Security | Resear... 313147 \n",
559
+ "3 Cloud | Mobile | Cognitive | Security | Resear... 311223 \n",
560
+ "4 Healthcare | Biotechnology 56873 \n",
561
+ "\n",
562
+ " follower_count time_recorded posted_job_titles posted_descriptions \\\n",
563
+ "0 16253625 1712378162 \n",
564
+ "1 16309464 1713392385 \n",
565
+ "2 16309985 1713402495 \n",
566
+ "3 16314846 1713501255 \n",
567
+ "4 2185368 1712382540 \n",
568
+ "\n",
569
+ " required_skills experience_levels \n",
570
+ "0 \n",
571
+ "1 \n",
572
+ "2 \n",
573
+ "3 \n",
574
+ "4 "
575
+ ]
576
+ },
577
+ "execution_count": 7,
578
+ "metadata": {},
579
+ "output_type": "execute_result"
580
+ }
581
+ ],
582
+ "source": [
583
+ "companies_full.head()"
584
+ ]
585
+ },
586
+ {
587
+ "cell_type": "code",
588
+ "execution_count": 19,
589
+ "metadata": {},
590
+ "outputs": [
591
+ {
592
+ "name": "stdout",
593
+ "output_type": "stream",
594
+ "text": [
595
+ "================================================================================\n",
596
+ "πŸ” DUPLICATE DETECTION REPORT\n",
597
+ "================================================================================\n",
598
+ "\n",
599
+ "β”Œβ”€ πŸ“Š resume_data.csv (Candidates)\n",
600
+ "β”‚ Primary Key: Resume_ID\n",
601
+ "β”‚ Total rows: 9,544\n",
602
+ "β”‚ Unique rows: 9,544\n",
603
+ "β”‚ Duplicates: 0\n",
604
+ "β”‚ Status: βœ… CLEAN\n",
605
+ "└─\n",
606
+ "\n",
607
+ "β”Œβ”€ πŸ“Š companies.csv (Companies Base)\n",
608
+ "β”‚ Primary Key: company_id\n",
609
+ "β”‚ Total rows: 24,473\n",
610
+ "β”‚ Unique rows: 24,473\n",
611
+ "β”‚ Duplicates: 0\n",
612
+ "β”‚ Status: βœ… CLEAN\n",
613
+ "└─\n",
614
+ "\n",
615
+ "β”Œβ”€ πŸ“Š company_industries.csv\n",
616
+ "β”‚ Primary Key: company_id + industry\n",
617
+ "β”‚ Total rows: 24,375\n",
618
+ "β”‚ Unique rows: 24,375\n",
619
+ "β”‚ Duplicates: 0\n",
620
+ "β”‚ Status: βœ… CLEAN\n",
621
+ "└─\n",
622
+ "\n",
623
+ "β”Œβ”€ πŸ“Š company_specialities.csv\n",
624
+ "β”‚ Primary Key: company_id + speciality\n",
625
+ "β”‚ Total rows: 169,387\n",
626
+ "β”‚ Unique rows: 169,387\n",
627
+ "β”‚ Duplicates: 0\n",
628
+ "β”‚ Status: βœ… CLEAN\n",
629
+ "└─\n",
630
+ "\n",
631
+ "β”Œβ”€ πŸ“Š employee_counts.csv\n",
632
+ "β”‚ Primary Key: company_id\n",
633
+ "β”‚ Total rows: 35,787\n",
634
+ "β”‚ Unique rows: 24,473\n",
635
+ "β”‚ Duplicates: 11,314\n",
636
+ "β”‚ Status: πŸ”΄ HAS DUPLICATES\n",
637
+ "└─\n",
638
+ "\n",
639
+ "β”Œβ”€ πŸ“Š postings.csv (Job Postings)\n",
640
+ "β”‚ Primary Key: job_id\n",
641
+ "β”‚ Total rows: 123,849\n",
642
+ "β”‚ Unique rows: 123,849\n",
643
+ "β”‚ Duplicates: 0\n",
644
+ "β”‚ Status: βœ… CLEAN\n",
645
+ "└─\n",
646
+ "\n",
647
+ "β”Œβ”€ πŸ“Š companies_full (After Enrichment)\n",
648
+ "β”‚ Primary Key: company_id\n",
649
+ "β”‚ Total rows: 35,787\n",
650
+ "β”‚ Unique rows: 24,473\n",
651
+ "β”‚ Duplicates: 11,314\n",
652
+ "β”‚ Status: πŸ”΄ HAS DUPLICATES\n",
653
+ "β”‚\n",
654
+ "β”‚ Top duplicate company_ids:\n",
655
+ "β”‚ - 33242739 (Confidential): 13 times\n",
656
+ "β”‚ - 5235 (LHH): 13 times\n",
657
+ "β”‚ - 79383535 (Akkodis): 12 times\n",
658
+ "β”‚ - 1681 (Robert Half): 12 times\n",
659
+ "β”‚ - 220336 (Hyatt Hotels Corporation): 11 times\n",
660
+ "└─\n",
661
+ "\n",
662
+ "================================================================================\n",
663
+ "πŸ“Š SUMMARY\n",
664
+ "================================================================================\n",
665
+ "\n",
666
+ "βœ… Clean datasets: 5/7\n",
667
+ "πŸ”΄ Datasets with duplicates: 2/7\n",
668
+ "πŸ—‘οΈ Total duplicates found: 22,628 rows\n",
669
+ "\n",
670
+ "⚠️ DUPLICATES DETECTED!\n",
671
+ "================================================================================\n"
672
+ ]
673
+ }
674
+ ],
675
+ "source": [
676
+ "## πŸ” Data Quality Check - Duplicate Detection\n",
677
+ "\n",
678
+ "\"\"\"\n",
679
+ "Checking for duplicates in all datasets based on primary keys.\n",
680
+ "This cell only REPORTS duplicates, does not modify data.\n",
681
+ "\"\"\"\n",
682
+ "\n",
683
+ "print(\"=\" * 80)\n",
684
+ "print(\"πŸ” DUPLICATE DETECTION REPORT\")\n",
685
+ "print(\"=\" * 80)\n",
686
+ "print()\n",
687
+ "\n",
688
+ "# Define primary keys for each dataset\n",
689
+ "duplicate_report = []\n",
690
+ "\n",
691
+ "# 1. Candidates\n",
692
+ "print(\"β”Œβ”€ πŸ“Š resume_data.csv (Candidates)\")\n",
693
+ "print(f\"β”‚ Primary Key: Resume_ID\")\n",
694
+ "cand_total = len(candidates)\n",
695
+ "cand_unique = candidates['Resume_ID'].nunique() if 'Resume_ID' in candidates.columns else len(candidates)\n",
696
+ "cand_dups = cand_total - cand_unique\n",
697
+ "print(f\"β”‚ Total rows: {cand_total:,}\")\n",
698
+ "print(f\"β”‚ Unique rows: {cand_unique:,}\")\n",
699
+ "print(f\"β”‚ Duplicates: {cand_dups:,}\")\n",
700
+ "print(f\"β”‚ Status: {'βœ… CLEAN' if cand_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n",
701
+ "print(\"└─\\n\")\n",
702
+ "duplicate_report.append(('Candidates', cand_total, cand_unique, cand_dups))\n",
703
+ "\n",
704
+ "# 2. Companies Base\n",
705
+ "print(\"β”Œβ”€ πŸ“Š companies.csv (Companies Base)\")\n",
706
+ "print(f\"β”‚ Primary Key: company_id\")\n",
707
+ "comp_total = len(companies_base)\n",
708
+ "comp_unique = companies_base['company_id'].nunique()\n",
709
+ "comp_dups = comp_total - comp_unique\n",
710
+ "print(f\"β”‚ Total rows: {comp_total:,}\")\n",
711
+ "print(f\"β”‚ Unique rows: {comp_unique:,}\")\n",
712
+ "print(f\"β”‚ Duplicates: {comp_dups:,}\")\n",
713
+ "print(f\"β”‚ Status: {'βœ… CLEAN' if comp_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n",
714
+ "if comp_dups > 0:\n",
715
+ " dup_ids = companies_base[companies_base.duplicated('company_id', keep=False)]['company_id'].value_counts().head(3)\n",
716
+ " print(f\"β”‚ Top duplicates:\")\n",
717
+ " for cid, count in dup_ids.items():\n",
718
+ " print(f\"β”‚ - company_id={cid}: {count} times\")\n",
719
+ "print(\"└─\\n\")\n",
720
+ "duplicate_report.append(('Companies Base', comp_total, comp_unique, comp_dups))\n",
721
+ "\n",
722
+ "# 3. Company Industries\n",
723
+ "print(\"β”Œβ”€ πŸ“Š company_industries.csv\")\n",
724
+ "print(f\"β”‚ Primary Key: company_id + industry\")\n",
725
+ "ci_total = len(company_industries)\n",
726
+ "ci_unique = len(company_industries.drop_duplicates(subset=['company_id', 'industry']))\n",
727
+ "ci_dups = ci_total - ci_unique\n",
728
+ "print(f\"β”‚ Total rows: {ci_total:,}\")\n",
729
+ "print(f\"β”‚ Unique rows: {ci_unique:,}\")\n",
730
+ "print(f\"β”‚ Duplicates: {ci_dups:,}\")\n",
731
+ "print(f\"β”‚ Status: {'βœ… CLEAN' if ci_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n",
732
+ "print(\"└─\\n\")\n",
733
+ "duplicate_report.append(('Company Industries', ci_total, ci_unique, ci_dups))\n",
734
+ "\n",
735
+ "# 4. Company Specialties\n",
736
+ "print(\"β”Œβ”€ πŸ“Š company_specialities.csv\")\n",
737
+ "print(f\"β”‚ Primary Key: company_id + speciality\")\n",
738
+ "cs_total = len(company_specialties)\n",
739
+ "cs_unique = len(company_specialties.drop_duplicates(subset=['company_id', 'speciality']))\n",
740
+ "cs_dups = cs_total - cs_unique\n",
741
+ "print(f\"β”‚ Total rows: {cs_total:,}\")\n",
742
+ "print(f\"β”‚ Unique rows: {cs_unique:,}\")\n",
743
+ "print(f\"β”‚ Duplicates: {cs_dups:,}\")\n",
744
+ "print(f\"β”‚ Status: {'βœ… CLEAN' if cs_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n",
745
+ "print(\"└─\\n\")\n",
746
+ "duplicate_report.append(('Company Specialties', cs_total, cs_unique, cs_dups))\n",
747
+ "\n",
748
+ "# 5. Employee Counts\n",
749
+ "print(\"β”Œβ”€ πŸ“Š employee_counts.csv\")\n",
750
+ "print(f\"β”‚ Primary Key: company_id\")\n",
751
+ "ec_total = len(employee_counts)\n",
752
+ "ec_unique = employee_counts['company_id'].nunique()\n",
753
+ "ec_dups = ec_total - ec_unique\n",
754
+ "print(f\"β”‚ Total rows: {ec_total:,}\")\n",
755
+ "print(f\"β”‚ Unique rows: {ec_unique:,}\")\n",
756
+ "print(f\"β”‚ Duplicates: {ec_dups:,}\")\n",
757
+ "print(f\"β”‚ Status: {'βœ… CLEAN' if ec_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n",
758
+ "print(\"└─\\n\")\n",
759
+ "duplicate_report.append(('Employee Counts', ec_total, ec_unique, ec_dups))\n",
760
+ "\n",
761
+ "# 6. Postings\n",
762
+ "print(\"β”Œβ”€ πŸ“Š postings.csv (Job Postings)\")\n",
763
+ "print(f\"β”‚ Primary Key: job_id\")\n",
764
+ "if 'job_id' in postings.columns:\n",
765
+ " post_total = len(postings)\n",
766
+ " post_unique = postings['job_id'].nunique()\n",
767
+ " post_dups = post_total - post_unique\n",
768
+ "else:\n",
769
+ " post_total = len(postings)\n",
770
+ " post_unique = len(postings.drop_duplicates())\n",
771
+ " post_dups = post_total - post_unique\n",
772
+ "print(f\"β”‚ Total rows: {post_total:,}\")\n",
773
+ "print(f\"β”‚ Unique rows: {post_unique:,}\")\n",
774
+ "print(f\"β”‚ Duplicates: {post_dups:,}\")\n",
775
+ "print(f\"β”‚ Status: {'βœ… CLEAN' if post_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n",
776
+ "print(\"└─\\n\")\n",
777
+ "duplicate_report.append(('Postings', post_total, post_unique, post_dups))\n",
778
+ "\n",
779
+ "# 7. Companies Full (After Merge)\n",
780
+ "print(\"β”Œβ”€ πŸ“Š companies_full (After Enrichment)\")\n",
781
+ "print(f\"β”‚ Primary Key: company_id\")\n",
782
+ "cf_total = len(companies_full)\n",
783
+ "cf_unique = companies_full['company_id'].nunique()\n",
784
+ "cf_dups = cf_total - cf_unique\n",
785
+ "print(f\"β”‚ Total rows: {cf_total:,}\")\n",
786
+ "print(f\"β”‚ Unique rows: {cf_unique:,}\")\n",
787
+ "print(f\"β”‚ Duplicates: {cf_dups:,}\")\n",
788
+ "print(f\"β”‚ Status: {'βœ… CLEAN' if cf_dups == 0 else 'πŸ”΄ HAS DUPLICATES'}\")\n",
789
+ "if cf_dups > 0:\n",
790
+ " dup_ids = companies_full[companies_full.duplicated('company_id', keep=False)]['company_id'].value_counts().head(5)\n",
791
+ " print(f\"β”‚\")\n",
792
+ " print(f\"β”‚ Top duplicate company_ids:\")\n",
793
+ " for cid, count in dup_ids.items():\n",
794
+ " comp_name = companies_full[companies_full['company_id'] == cid]['name'].iloc[0]\n",
795
+ " print(f\"β”‚ - {cid} ({comp_name}): {count} times\")\n",
796
+ "print(\"└─\\n\")\n",
797
+ "duplicate_report.append(('Companies Full', cf_total, cf_unique, cf_dups))\n",
798
+ "\n",
799
+ "# Summary\n",
800
+ "print(\"=\" * 80)\n",
801
+ "print(\"πŸ“Š SUMMARY\")\n",
802
+ "print(\"=\" * 80)\n",
803
+ "print()\n",
804
+ "\n",
805
+ "total_dups = sum(r[3] for r in duplicate_report)\n",
806
+ "clean_datasets = sum(1 for r in duplicate_report if r[3] == 0)\n",
807
+ "dirty_datasets = len(duplicate_report) - clean_datasets\n",
808
+ "\n",
809
+ "print(f\"βœ… Clean datasets: {clean_datasets}/{len(duplicate_report)}\")\n",
810
+ "print(f\"πŸ”΄ Datasets with duplicates: {dirty_datasets}/{len(duplicate_report)}\")\n",
811
+ "print(f\"πŸ—‘οΈ Total duplicates found: {total_dups:,} rows\")\n",
812
+ "print()\n",
813
+ "\n",
814
+ "if dirty_datasets > 0:\n",
815
+ " print(\"⚠️ DUPLICATES DETECTED!\")\n",
816
+ "else:\n",
817
+ " print(\"βœ… All datasets are clean! No duplicates found.\")\n",
818
+ "\n",
819
+ "print(\"=\" * 80)"
820
+ ]
821
+ },
822
+ {
823
+ "cell_type": "code",
824
+ "execution_count": 22,
825
+ "metadata": {},
826
+ "outputs": [
827
+ {
828
+ "name": "stdout",
829
+ "output_type": "stream",
830
+ "text": [
831
+ "🧹 CLEANING DUPLICATES...\n",
832
+ "\n",
833
+ "================================================================================\n",
834
+ "βœ… companies_base: Already clean\n",
835
+ "\n",
836
+ "βœ… company_industries: Already clean\n",
837
+ "\n",
838
+ "βœ… company_specialties: Already clean\n",
839
+ "\n",
840
+ "βœ… employee_counts:\n",
841
+ " Removed 11,314 duplicates\n",
842
+ " 35,787 β†’ 24,473 rows\n",
843
+ "\n",
844
+ "βœ… postings: Already clean\n",
845
+ "\n",
846
+ "βœ… companies_full:\n",
847
+ " Removed 11,314 duplicates\n",
848
+ " 35,787 β†’ 24,473 rows\n",
849
+ "\n",
850
+ "================================================================================\n",
851
+ "βœ… DATA CLEANING COMPLETE!\n",
852
+ "================================================================================\n",
853
+ "\n",
854
+ "πŸ“Š Total duplicates removed: 22,628 rows\n",
855
+ "\n",
856
+ "Cleaned datasets:\n",
857
+ " - employee_counts: 35,787 β†’ 24,473\n",
858
+ " - companies_full: 35,787 β†’ 24,473\n"
859
+ ]
860
+ }
861
+ ],
862
+ "source": [
863
+ "\"\"\"\n",
864
+ "## 🧹 Data Cleaning - Remove Duplicates\n",
865
+ "\n",
866
+ "Based on the report above, removing duplicates from datasets.\n",
867
+ "\"\"\"\n",
868
+ "\n",
869
+ "print(\"🧹 CLEANING DUPLICATES...\\n\")\n",
870
+ "print(\"=\" * 80)\n",
871
+ "\n",
872
+ "# Store original counts\n",
873
+ "original_counts = {}\n",
874
+ "\n",
875
+ "# 1. Clean Companies Base (if needed)\n",
876
+ "if len(companies_base) != companies_base['company_id'].nunique():\n",
877
+ " original_counts['companies_base'] = len(companies_base)\n",
878
+ " companies_base = companies_base.drop_duplicates(subset=['company_id'], keep='first')\n",
879
+ " removed = original_counts['companies_base'] - len(companies_base)\n",
880
+ " print(f\"βœ… companies_base:\")\n",
881
+ " print(f\" Removed {removed:,} duplicates\")\n",
882
+ " print(f\" {original_counts['companies_base']:,} β†’ {len(companies_base):,} rows\\n\")\n",
883
+ "else:\n",
884
+ " print(f\"βœ… companies_base: Already clean\\n\")\n",
885
+ "\n",
886
+ "# 2. Clean Company Industries (if needed)\n",
887
+ "if len(company_industries) != len(company_industries.drop_duplicates(subset=['company_id', 'industry'])):\n",
888
+ " original_counts['company_industries'] = len(company_industries)\n",
889
+ " company_industries = company_industries.drop_duplicates(subset=['company_id', 'industry'], keep='first')\n",
890
+ " removed = original_counts['company_industries'] - len(company_industries)\n",
891
+ " print(f\"βœ… company_industries:\")\n",
892
+ " print(f\" Removed {removed:,} duplicates\")\n",
893
+ " print(f\" {original_counts['company_industries']:,} β†’ {len(company_industries):,} rows\\n\")\n",
894
+ "else:\n",
895
+ " print(f\"βœ… company_industries: Already clean\\n\")\n",
896
+ "\n",
897
+ "# 3. Clean Company Specialties (if needed)\n",
898
+ "if len(company_specialties) != len(company_specialties.drop_duplicates(subset=['company_id', 'speciality'])):\n",
899
+ " original_counts['company_specialties'] = len(company_specialties)\n",
900
+ " company_specialties = company_specialties.drop_duplicates(subset=['company_id', 'speciality'], keep='first')\n",
901
+ " removed = original_counts['company_specialties'] - len(company_specialties)\n",
902
+ " print(f\"βœ… company_specialties:\")\n",
903
+ " print(f\" Removed {removed:,} duplicates\")\n",
904
+ " print(f\" {original_counts['company_specialties']:,} β†’ {len(company_specialties):,} rows\\n\")\n",
905
+ "else:\n",
906
+ " print(f\"βœ… company_specialties: Already clean\\n\")\n",
907
+ "\n",
908
+ "# 4. Clean Employee Counts (if needed)\n",
909
+ "if len(employee_counts) != employee_counts['company_id'].nunique():\n",
910
+ " original_counts['employee_counts'] = len(employee_counts)\n",
911
+ " employee_counts = employee_counts.drop_duplicates(subset=['company_id'], keep='first')\n",
912
+ " removed = original_counts['employee_counts'] - len(employee_counts)\n",
913
+ " print(f\"βœ… employee_counts:\")\n",
914
+ " print(f\" Removed {removed:,} duplicates\")\n",
915
+ " print(f\" {original_counts['employee_counts']:,} β†’ {len(employee_counts):,} rows\\n\")\n",
916
+ "else:\n",
917
+ " print(f\"βœ… employee_counts: Already clean\\n\")\n",
918
+ "\n",
919
+ "# 5. Clean Postings (if needed)\n",
920
+ "if 'job_id' in postings.columns:\n",
921
+ " if len(postings) != postings['job_id'].nunique():\n",
922
+ " original_counts['postings'] = len(postings)\n",
923
+ " postings = postings.drop_duplicates(subset=['job_id'], keep='first')\n",
924
+ " removed = original_counts['postings'] - len(postings)\n",
925
+ " print(f\"βœ… postings:\")\n",
926
+ " print(f\" Removed {removed:,} duplicates\")\n",
927
+ " print(f\" {original_counts['postings']:,} β†’ {len(postings):,} rows\\n\")\n",
928
+ " else:\n",
929
+ " print(f\"βœ… postings: Already clean\\n\")\n",
930
+ "\n",
931
+ "# 6. Clean Companies Full (if needed)\n",
932
+ "if len(companies_full) != companies_full['company_id'].nunique():\n",
933
+ " original_counts['companies_full'] = len(companies_full)\n",
934
+ " companies_full = companies_full.drop_duplicates(subset=['company_id'], keep='first')\n",
935
+ " removed = original_counts['companies_full'] - len(companies_full)\n",
936
+ " print(f\"βœ… companies_full:\")\n",
937
+ " print(f\" Removed {removed:,} duplicates\")\n",
938
+ " print(f\" {original_counts['companies_full']:,} β†’ {len(companies_full):,} rows\\n\")\n",
939
+ "else:\n",
940
+ " print(f\"βœ… companies_full: Already clean\\n\")\n",
941
+ "\n",
942
+ "print(\"=\" * 80)\n",
943
+ "print(\"βœ… DATA CLEANING COMPLETE!\")\n",
944
+ "print(\"=\" * 80)\n",
945
+ "print()\n",
946
+ "\n",
947
+ "# Summary\n",
948
+ "if original_counts:\n",
949
+ " total_removed = sum(original_counts[k] - globals()[k].shape[0] if k in globals() else 0 \n",
950
+ " for k in original_counts.keys())\n",
951
+ " print(f\"πŸ“Š Total duplicates removed: {total_removed:,} rows\")\n",
952
+ " print()\n",
953
+ " print(\"Cleaned datasets:\")\n",
954
+ " for dataset, original in original_counts.items():\n",
955
+ " current = len(globals()[dataset]) if dataset in globals() else 0\n",
956
+ " print(f\" - {dataset}: {original:,} β†’ {current:,}\")\n",
957
+ "else:\n",
958
+ " print(\"βœ… No duplicates found - all datasets were already clean!\")"
959
+ ]
960
+ },
961
+ {
962
+ "cell_type": "markdown",
963
+ "metadata": {},
964
+ "source": [
965
+ "---\n",
966
+ "## 🧠 Step 8: Load Embedding Model & Pre-computed Vectors"
967
+ ]
968
+ },
969
+ {
970
+ "cell_type": "code",
971
+ "execution_count": 23,
972
+ "metadata": {},
973
+ "outputs": [
974
+ {
975
+ "name": "stdout",
976
+ "output_type": "stream",
977
+ "text": [
978
+ "🧠 Loading embedding model...\n",
979
+ "\n",
980
+ "βœ… Model loaded: all-MiniLM-L6-v2\n",
981
+ "πŸ“ Embedding dimension: ℝ^384\n",
982
+ "\n",
983
+ "πŸ“‚ Loading pre-computed embeddings...\n",
984
+ "βœ… Loaded from ../processed/\n",
985
+ "πŸ“Š Candidate vectors: (9544, 384)\n",
986
+ "πŸ“Š Company vectors: (35787, 384)\n",
987
+ "\n"
988
+ ]
989
+ }
990
+ ],
991
+ "source": [
992
+ "print(\"🧠 Loading embedding model...\\n\")\n",
993
+ "model = SentenceTransformer(Config.EMBEDDING_MODEL)\n",
994
+ "embedding_dim = model.get_sentence_embedding_dimension()\n",
995
+ "print(f\"βœ… Model loaded: {Config.EMBEDDING_MODEL}\")\n",
996
+ "print(f\"πŸ“ Embedding dimension: ℝ^{embedding_dim}\\n\")\n",
997
+ "\n",
998
+ "print(\"πŸ“‚ Loading pre-computed embeddings...\")\n",
999
+ "\n",
1000
+ "try:\n",
1001
+ " # Try to load from processed folder\n",
1002
+ " cand_vectors = np.load(f'{Config.PROCESSED_PATH}candidate_embeddings.npy')\n",
1003
+ " comp_vectors = np.load(f'{Config.PROCESSED_PATH}company_embeddings.npy')\n",
1004
+ " \n",
1005
+ " print(f\"βœ… Loaded from {Config.PROCESSED_PATH}\")\n",
1006
+ " print(f\"πŸ“Š Candidate vectors: {cand_vectors.shape}\")\n",
1007
+ " print(f\"πŸ“Š Company vectors: {comp_vectors.shape}\\n\")\n",
1008
+ " \n",
1009
+ "except FileNotFoundError:\n",
1010
+ " print(\"⚠️ Pre-computed embeddings not found!\")\n",
1011
+ " print(\" Embeddings will need to be generated (takes ~5-10 minutes)\")\n",
1012
+ " print(\" This is normal if running for the first time.\\n\")\n",
1013
+ " \n",
1014
+ " # You can add embedding generation code here if needed\n",
1015
+ " # For now, we'll skip to keep notebook clean\n",
1016
+ " cand_vectors = None\n",
1017
+ " comp_vectors = None"
1018
+ ]
1019
+ },
1020
+ {
1021
+ "cell_type": "markdown",
1022
+ "metadata": {},
1023
+ "source": [
1024
+ "---\n",
1025
+ "## 🎯 Step 9: Core Matching Function"
1026
+ ]
1027
+ },
1028
+ {
1029
+ "cell_type": "code",
1030
+ "execution_count": 24,
1031
+ "metadata": {},
1032
+ "outputs": [
1033
+ {
1034
+ "name": "stdout",
1035
+ "output_type": "stream",
1036
+ "text": [
1037
+ "βœ… Matching function ready\n"
1038
+ ]
1039
+ }
1040
+ ],
1041
+ "source": [
1042
+ "def find_top_matches(candidate_idx: int, top_k: int = 10) -> List[tuple]:\n",
1043
+ " \"\"\"\n",
1044
+ " Find top K company matches for a candidate using cosine similarity.\n",
1045
+ " \n",
1046
+ " Args:\n",
1047
+ " candidate_idx: Index of candidate\n",
1048
+ " top_k: Number of top matches to return\n",
1049
+ " \n",
1050
+ " Returns:\n",
1051
+ " List of (company_index, similarity_score) tuples\n",
1052
+ " \"\"\"\n",
1053
+ " if cand_vectors is None or comp_vectors is None:\n",
1054
+ " raise ValueError(\"Embeddings not loaded! Please run Step 8 first.\")\n",
1055
+ " \n",
1056
+ " cand_vec = cand_vectors[candidate_idx].reshape(1, -1)\n",
1057
+ " similarities = cosine_similarity(cand_vec, comp_vectors)[0]\n",
1058
+ " top_indices = np.argsort(similarities)[::-1][:top_k]\n",
1059
+ " \n",
1060
+ " return [(int(idx), float(similarities[idx])) for idx in top_indices]\n",
1061
+ "\n",
1062
+ "print(\"βœ… Matching function ready\")"
1063
+ ]
1064
+ },
1065
+ {
1066
+ "cell_type": "markdown",
1067
+ "metadata": {},
1068
+ "source": [
1069
+ "---\n",
1070
+ "## πŸ€– Step 10: Initialize FREE LLM (Hugging Face)\n",
1071
+ "\n",
1072
+ "### Get your FREE token: https://huggingface.co/settings/tokens"
1073
+ ]
1074
+ },
1075
+ {
1076
+ "cell_type": "code",
1077
+ "execution_count": 25,
1078
+ "metadata": {},
1079
+ "outputs": [
1080
+ {
1081
+ "name": "stdout",
1082
+ "output_type": "stream",
1083
+ "text": [
1084
+ "βœ… Hugging Face client initialized (FREE)\n",
1085
+ "πŸ€– Model: meta-llama/Llama-3.2-3B-Instruct\n",
1086
+ "πŸ’° Cost: $0.00 (completely free!)\n",
1087
+ "\n",
1088
+ "βœ… LLM helper functions ready\n"
1089
+ ]
1090
+ }
1091
+ ],
1092
+ "source": [
1093
+ "# Initialize Hugging Face Inference Client (FREE)\n",
1094
+ "if Config.HF_TOKEN:\n",
1095
+ " try:\n",
1096
+ " hf_client = InferenceClient(token=Config.HF_TOKEN)\n",
1097
+ " print(\"βœ… Hugging Face client initialized (FREE)\")\n",
1098
+ " print(f\"πŸ€– Model: {Config.LLM_MODEL}\")\n",
1099
+ " print(\"πŸ’° Cost: $0.00 (completely free!)\\n\")\n",
1100
+ " LLM_AVAILABLE = True\n",
1101
+ " except Exception as e:\n",
1102
+ " print(f\"⚠️ Failed to initialize HF client: {e}\")\n",
1103
+ " LLM_AVAILABLE = False\n",
1104
+ "else:\n",
1105
+ " print(\"⚠️ No Hugging Face token configured\")\n",
1106
+ " print(\" LLM features will be disabled\")\n",
1107
+ " print(\"\\nπŸ“ To enable:\")\n",
1108
+ " print(\" 1. Go to: https://huggingface.co/settings/tokens\")\n",
1109
+ " print(\" 2. Create a token (free)\")\n",
1110
+ " print(\" 3. Set: Config.HF_TOKEN = 'your-token-here'\\n\")\n",
1111
+ " LLM_AVAILABLE = False\n",
1112
+ " hf_client = None\n",
1113
+ "\n",
1114
+ "def call_llm(prompt: str, max_tokens: int = 1000) -> str:\n",
1115
+ " \"\"\"\n",
1116
+ " Generic LLM call using Hugging Face Inference API (FREE).\n",
1117
+ " \"\"\"\n",
1118
+ " if not LLM_AVAILABLE:\n",
1119
+ " return \"[LLM not available - check .env file for HF_TOKEN]\"\n",
1120
+ " \n",
1121
+ " try:\n",
1122
+ " response = hf_client.chat_completion( # βœ… chat_completion\n",
1123
+ " messages=[{\"role\": \"user\", \"content\": prompt}],\n",
1124
+ " model=Config.LLM_MODEL,\n",
1125
+ " max_tokens=max_tokens,\n",
1126
+ " temperature=0.7\n",
1127
+ " )\n",
1128
+ " return response.choices[0].message.content # βœ… Extrai conteΓΊdo\n",
1129
+ " except Exception as e:\n",
1130
+ " return f\"[Error: {str(e)}]\"\n",
1131
+ "\n",
1132
+ "print(\"βœ… LLM helper functions ready\")"
1133
+ ]
1134
+ },
1135
+ {
1136
+ "cell_type": "markdown",
1137
+ "metadata": {},
1138
+ "source": [
1139
+ "---\n",
1140
+ "## πŸ€– Step 11: Pydantic Schemas for Structured Output"
1141
+ ]
1142
+ },
1143
+ {
1144
+ "cell_type": "code",
1145
+ "execution_count": 26,
1146
+ "metadata": {},
1147
+ "outputs": [
1148
+ {
1149
+ "name": "stdout",
1150
+ "output_type": "stream",
1151
+ "text": [
1152
+ "βœ… Pydantic schemas defined\n"
1153
+ ]
1154
+ }
1155
+ ],
1156
+ "source": [
1157
+ "class JobLevelClassification(BaseModel):\n",
1158
+ " \"\"\"Job level classification result\"\"\"\n",
1159
+ " level: Literal['Entry', 'Mid', 'Senior', 'Executive']\n",
1160
+ " confidence: float = Field(ge=0.0, le=1.0)\n",
1161
+ " reasoning: str\n",
1162
+ "\n",
1163
+ "class SkillsTaxonomy(BaseModel):\n",
1164
+ " \"\"\"Structured skills extraction\"\"\"\n",
1165
+ " technical_skills: List[str] = Field(default_factory=list)\n",
1166
+ " soft_skills: List[str] = Field(default_factory=list)\n",
1167
+ " certifications: List[str] = Field(default_factory=list)\n",
1168
+ " languages: List[str] = Field(default_factory=list)\n",
1169
+ "\n",
1170
+ "class MatchExplanation(BaseModel):\n",
1171
+ " \"\"\"Match reasoning\"\"\"\n",
1172
+ " overall_score: float = Field(ge=0.0, le=1.0)\n",
1173
+ " match_strengths: List[str]\n",
1174
+ " skill_gaps: List[str]\n",
1175
+ " recommendation: str\n",
1176
+ " fit_summary: str = Field(max_length=200)\n",
1177
+ "\n",
1178
+ "print(\"βœ… Pydantic schemas defined\")"
1179
+ ]
1180
+ },
1181
+ {
1182
+ "cell_type": "markdown",
1183
+ "metadata": {},
1184
+ "source": [
1185
+ "---\n",
1186
+ "## 🏷️ Step 12: Job Level Classification (Zero-Shot)"
1187
+ ]
1188
+ },
1189
+ {
1190
+ "cell_type": "code",
1191
+ "execution_count": 27,
1192
+ "metadata": {},
1193
+ "outputs": [
1194
+ {
1195
+ "name": "stdout",
1196
+ "output_type": "stream",
1197
+ "text": [
1198
+ "πŸ§ͺ Testing zero-shot classification...\n",
1199
+ "\n",
1200
+ "πŸ“Š Classification Result:\n",
1201
+ "{\n",
1202
+ " \"level\": \"Unknown\",\n",
1203
+ " \"confidence\": 0.0,\n",
1204
+ " \"reasoning\": \"Failed to parse response\"\n",
1205
+ "}\n"
1206
+ ]
1207
+ }
1208
+ ],
1209
+ "source": [
1210
+ "def classify_job_level_zero_shot(job_description: str) -> Dict:\n",
1211
+ " \"\"\"\n",
1212
+ " Zero-shot job level classification.\n",
1213
+ " \n",
1214
+ " Returns classification as: Entry, Mid, Senior, or Executive\n",
1215
+ " \"\"\"\n",
1216
+ " \n",
1217
+ " prompt = f\"\"\"Classify this job posting into ONE seniority level.\n",
1218
+ "\n",
1219
+ "Levels:\n",
1220
+ "- Entry: 0-2 years experience, junior roles\n",
1221
+ "- Mid: 3-5 years experience, independent work\n",
1222
+ "- Senior: 6-10 years experience, technical leadership\n",
1223
+ "- Executive: 10+ years, strategic leadership, C-level\n",
1224
+ "\n",
1225
+ "Job Posting:\n",
1226
+ "{job_description[:500]}\n",
1227
+ "\n",
1228
+ "Return ONLY valid JSON:\n",
1229
+ "{{\n",
1230
+ " \"level\": \"Entry|Mid|Senior|Executive\",\n",
1231
+ " \"confidence\": 0.85,\n",
1232
+ " \"reasoning\": \"Brief explanation\"\n",
1233
+ "}}\n",
1234
+ "\"\"\"\n",
1235
+ " \n",
1236
+ " response = call_llm(prompt)\n",
1237
+ " \n",
1238
+ " try:\n",
1239
+ " # Extract JSON\n",
1240
+ " json_str = response.strip()\n",
1241
+ " if '```json' in json_str:\n",
1242
+ " json_str = json_str.split('```json')[1].split('```')[0].strip()\n",
1243
+ " elif '```' in json_str:\n",
1244
+ " json_str = json_str.split('```')[1].split('```')[0].strip()\n",
1245
+ " \n",
1246
+ " # Find JSON in response\n",
1247
+ " if '{' in json_str and '}' in json_str:\n",
1248
+ " start = json_str.index('{')\n",
1249
+ " end = json_str.rindex('}') + 1\n",
1250
+ " json_str = json_str[start:end]\n",
1251
+ " \n",
1252
+ " result = json.loads(json_str)\n",
1253
+ " return result\n",
1254
+ " except:\n",
1255
+ " return {\n",
1256
+ " \"level\": \"Unknown\",\n",
1257
+ " \"confidence\": 0.0,\n",
1258
+ " \"reasoning\": \"Failed to parse response\"\n",
1259
+ " }\n",
1260
+ "\n",
1261
+ "# Test if LLM available and data loaded\n",
1262
+ "if LLM_AVAILABLE and len(postings) > 0:\n",
1263
+ " print(\"πŸ§ͺ Testing zero-shot classification...\\n\")\n",
1264
+ " sample = postings.iloc[0]['description']\n",
1265
+ " result = classify_job_level_zero_shot(sample)\n",
1266
+ " \n",
1267
+ " print(\"πŸ“Š Classification Result:\")\n",
1268
+ " print(json.dumps(result, indent=2))\n",
1269
+ "else:\n",
1270
+ " print(\"⚠️ Skipped - LLM not available or no data\")"
1271
+ ]
1272
+ },
1273
+ {
1274
+ "cell_type": "markdown",
1275
+ "metadata": {},
1276
+ "source": [
1277
+ "---\n",
1278
+ "## πŸŽ“ Step 13: Few-Shot Learning"
1279
+ ]
1280
+ },
1281
+ {
1282
+ "cell_type": "code",
1283
+ "execution_count": 28,
1284
+ "metadata": {},
1285
+ "outputs": [
1286
+ {
1287
+ "name": "stdout",
1288
+ "output_type": "stream",
1289
+ "text": [
1290
+ "πŸ§ͺ Comparing Zero-Shot vs Few-Shot...\n",
1291
+ "\n",
1292
+ "πŸ“Š Comparison:\n",
1293
+ "Zero-shot: Unknown (confidence: 0.00)\n",
1294
+ "Few-shot: Unknown (confidence: 0.00)\n"
1295
+ ]
1296
+ }
1297
+ ],
1298
+ "source": [
1299
+ "def classify_job_level_few_shot(job_description: str) -> Dict:\n",
1300
+ " \"\"\"\n",
1301
+ " Few-shot classification with examples.\n",
1302
+ " \"\"\"\n",
1303
+ " \n",
1304
+ " prompt = f\"\"\"Classify this job posting using examples.\n",
1305
+ "\n",
1306
+ "EXAMPLES:\n",
1307
+ "\n",
1308
+ "Example 1 (Entry):\n",
1309
+ "\"Recent graduate wanted. Python basics. Mentorship provided.\"\n",
1310
+ "β†’ Entry level (learning focus, 0-2 years)\n",
1311
+ "\n",
1312
+ "Example 2 (Senior):\n",
1313
+ "\"5+ years backend. Lead team of 3. System architecture.\"\n",
1314
+ "β†’ Senior level (technical leadership, 6-10 years)\n",
1315
+ "\n",
1316
+ "Example 3 (Executive):\n",
1317
+ "\"CTO position. 15+ years. Define technical strategy.\"\n",
1318
+ "β†’ Executive level (C-level, strategic)\n",
1319
+ "\n",
1320
+ "NOW CLASSIFY:\n",
1321
+ "{job_description[:500]}\n",
1322
+ "\n",
1323
+ "Return JSON:\n",
1324
+ "{{\n",
1325
+ " \"level\": \"Entry|Mid|Senior|Executive\",\n",
1326
+ " \"confidence\": 0.0-1.0,\n",
1327
+ " \"reasoning\": \"Explain\"\n",
1328
+ "}}\n",
1329
+ "\"\"\"\n",
1330
+ " \n",
1331
+ " response = call_llm(prompt)\n",
1332
+ " \n",
1333
+ " try:\n",
1334
+ " json_str = response.strip()\n",
1335
+ " if '```json' in json_str:\n",
1336
+ " json_str = json_str.split('```json')[1].split('```')[0].strip()\n",
1337
+ " \n",
1338
+ " if '{' in json_str and '}' in json_str:\n",
1339
+ " start = json_str.index('{')\n",
1340
+ " end = json_str.rindex('}') + 1\n",
1341
+ " json_str = json_str[start:end]\n",
1342
+ " \n",
1343
+ " result = json.loads(json_str)\n",
1344
+ " return result\n",
1345
+ " except:\n",
1346
+ " return {\"level\": \"Unknown\", \"confidence\": 0.0, \"reasoning\": \"Parse error\"}\n",
1347
+ "\n",
1348
+ "# Compare zero-shot vs few-shot\n",
1349
+ "if LLM_AVAILABLE and len(postings) > 0:\n",
1350
+ " print(\"πŸ§ͺ Comparing Zero-Shot vs Few-Shot...\\n\")\n",
1351
+ " sample = postings.iloc[0]['description']\n",
1352
+ " \n",
1353
+ " zero = classify_job_level_zero_shot(sample)\n",
1354
+ " few = classify_job_level_few_shot(sample)\n",
1355
+ " \n",
1356
+ " print(\"πŸ“Š Comparison:\")\n",
1357
+ " print(f\"Zero-shot: {zero['level']} (confidence: {zero['confidence']:.2f})\")\n",
1358
+ " print(f\"Few-shot: {few['level']} (confidence: {few['confidence']:.2f})\")\n",
1359
+ "else:\n",
1360
+ " print(\"⚠️ Skipped\")"
1361
+ ]
1362
+ },
1363
+ {
1364
+ "cell_type": "markdown",
1365
+ "metadata": {},
1366
+ "source": [
1367
+ "---\n",
1368
+ "## πŸ” Step 14: Structured Skills Extraction"
1369
+ ]
1370
+ },
1371
+ {
1372
+ "cell_type": "code",
1373
+ "execution_count": 29,
1374
+ "metadata": {},
1375
+ "outputs": [
1376
+ {
1377
+ "name": "stdout",
1378
+ "output_type": "stream",
1379
+ "text": [
1380
+ "πŸ” Testing skills extraction...\n",
1381
+ "\n",
1382
+ "πŸ“Š Extracted Skills:\n",
1383
+ "{\n",
1384
+ " \"technical_skills\": [\n",
1385
+ " \"Adobe Creative Cloud (Indesign, Illustrator, Photoshop)\",\n",
1386
+ " \"Microsoft Office Suite\"\n",
1387
+ " ],\n",
1388
+ " \"soft_skills\": [\n",
1389
+ " \"Communication\",\n",
1390
+ " \"Leadership\"\n",
1391
+ " ],\n",
1392
+ " \"certifications\": [],\n",
1393
+ " \"languages\": [\n",
1394
+ " \"English\",\n",
1395
+ " \"Danish\"\n",
1396
+ " ]\n",
1397
+ "}\n"
1398
+ ]
1399
+ }
1400
+ ],
1401
+ "source": [
1402
+ "def extract_skills_taxonomy(job_description: str) -> Dict:\n",
1403
+ " \"\"\"\n",
1404
+ " Extract structured skills using LLM + Pydantic validation.\n",
1405
+ " \"\"\"\n",
1406
+ " \n",
1407
+ " prompt = f\"\"\"Extract skills from this job posting.\n",
1408
+ "\n",
1409
+ "Job Posting:\n",
1410
+ "{job_description[:800]}\n",
1411
+ "\n",
1412
+ "Return ONLY valid JSON:\n",
1413
+ "{{\n",
1414
+ " \"technical_skills\": [\"Python\", \"Docker\", \"AWS\"],\n",
1415
+ " \"soft_skills\": [\"Communication\", \"Leadership\"],\n",
1416
+ " \"certifications\": [\"AWS Certified\"],\n",
1417
+ " \"languages\": [\"English\", \"Danish\"]\n",
1418
+ "}}\n",
1419
+ "\"\"\"\n",
1420
+ " \n",
1421
+ " response = call_llm(prompt, max_tokens=800)\n",
1422
+ " \n",
1423
+ " try:\n",
1424
+ " json_str = response.strip()\n",
1425
+ " if '```json' in json_str:\n",
1426
+ " json_str = json_str.split('```json')[1].split('```')[0].strip()\n",
1427
+ " \n",
1428
+ " if '{' in json_str and '}' in json_str:\n",
1429
+ " start = json_str.index('{')\n",
1430
+ " end = json_str.rindex('}') + 1\n",
1431
+ " json_str = json_str[start:end]\n",
1432
+ " \n",
1433
+ " data = json.loads(json_str)\n",
1434
+ " # Validate with Pydantic\n",
1435
+ " validated = SkillsTaxonomy(**data)\n",
1436
+ " return validated.model_dump()\n",
1437
+ " except:\n",
1438
+ " return {\n",
1439
+ " \"technical_skills\": [],\n",
1440
+ " \"soft_skills\": [],\n",
1441
+ " \"certifications\": [],\n",
1442
+ " \"languages\": []\n",
1443
+ " }\n",
1444
+ "\n",
1445
+ "# Test extraction\n",
1446
+ "if LLM_AVAILABLE and len(postings) > 0:\n",
1447
+ " print(\"πŸ” Testing skills extraction...\\n\")\n",
1448
+ " sample = postings.iloc[0]['description']\n",
1449
+ " skills = extract_skills_taxonomy(sample)\n",
1450
+ " \n",
1451
+ " print(\"πŸ“Š Extracted Skills:\")\n",
1452
+ " print(json.dumps(skills, indent=2))\n",
1453
+ "else:\n",
1454
+ " print(\"⚠️ Skipped\")"
1455
+ ]
1456
+ },
1457
+ {
1458
+ "cell_type": "markdown",
1459
+ "metadata": {},
1460
+ "source": [
1461
+ "---\n",
1462
+ "## πŸ’‘ Step 15: Match Explainability"
1463
+ ]
1464
+ },
1465
+ {
1466
+ "cell_type": "code",
1467
+ "execution_count": 30,
1468
+ "metadata": {},
1469
+ "outputs": [
1470
+ {
1471
+ "name": "stdout",
1472
+ "output_type": "stream",
1473
+ "text": [
1474
+ "πŸ’‘ Testing match explainability...\n",
1475
+ "\n",
1476
+ "πŸ“Š Match Explanation:\n",
1477
+ "{\n",
1478
+ " \"overall_score\": 0.7028058171272278,\n",
1479
+ " \"match_strengths\": [\n",
1480
+ " \"Big Data\",\n",
1481
+ " \"Machine Learning\",\n",
1482
+ " \"Cloud\",\n",
1483
+ " \"Data Science\",\n",
1484
+ " \"Data Structures\"\n",
1485
+ " ],\n",
1486
+ " \"skill_gaps\": [\n",
1487
+ " \"TeachTown-specific skills\"\n",
1488
+ " ],\n",
1489
+ " \"recommendation\": \"Encourage the candidate to learn TeachTown-specific skills\",\n",
1490
+ " \"fit_summary\": \"The candidate has a strong background in big data, machine learning, and cloud technologies, but may need to learn TeachTown-specific skills to fully align with the company's needs.\"\n",
1491
+ "}\n"
1492
+ ]
1493
+ }
1494
+ ],
1495
+ "source": [
1496
+ "def explain_match(candidate_idx: int, company_idx: int, similarity_score: float) -> Dict:\n",
1497
+ " \"\"\"\n",
1498
+ " Generate LLM explanation for why candidate matches company.\n",
1499
+ " \"\"\"\n",
1500
+ " \n",
1501
+ " cand = candidates.iloc[candidate_idx]\n",
1502
+ " comp = companies_full.iloc[company_idx]\n",
1503
+ " \n",
1504
+ " cand_skills = str(cand.get('skills', 'N/A'))[:300]\n",
1505
+ " cand_exp = str(cand.get('positions', 'N/A'))[:300]\n",
1506
+ " comp_req = str(comp.get('required_skills', 'N/A'))[:300]\n",
1507
+ " comp_name = comp.get('name', 'Unknown')\n",
1508
+ " \n",
1509
+ " prompt = f\"\"\"Explain why this candidate matches this company.\n",
1510
+ "\n",
1511
+ "Candidate:\n",
1512
+ "Skills: {cand_skills}\n",
1513
+ "Experience: {cand_exp}\n",
1514
+ "\n",
1515
+ "Company: {comp_name}\n",
1516
+ "Requirements: {comp_req}\n",
1517
+ "\n",
1518
+ "Similarity Score: {similarity_score:.2f}\n",
1519
+ "\n",
1520
+ "Return JSON:\n",
1521
+ "{{\n",
1522
+ " \"overall_score\": {similarity_score},\n",
1523
+ " \"match_strengths\": [\"Top 3-5 matching factors\"],\n",
1524
+ " \"skill_gaps\": [\"Missing skills\"],\n",
1525
+ " \"recommendation\": \"What candidate should do\",\n",
1526
+ " \"fit_summary\": \"One sentence summary\"\n",
1527
+ "}}\n",
1528
+ "\"\"\"\n",
1529
+ " \n",
1530
+ " response = call_llm(prompt, max_tokens=1000)\n",
1531
+ " \n",
1532
+ " try:\n",
1533
+ " json_str = response.strip()\n",
1534
+ " if '```json' in json_str:\n",
1535
+ " json_str = json_str.split('```json')[1].split('```')[0].strip()\n",
1536
+ " \n",
1537
+ " if '{' in json_str and '}' in json_str:\n",
1538
+ " start = json_str.index('{')\n",
1539
+ " end = json_str.rindex('}') + 1\n",
1540
+ " json_str = json_str[start:end]\n",
1541
+ " \n",
1542
+ " data = json.loads(json_str)\n",
1543
+ " return data\n",
1544
+ " except:\n",
1545
+ " return {\n",
1546
+ " \"overall_score\": similarity_score,\n",
1547
+ " \"match_strengths\": [\"Unable to generate\"],\n",
1548
+ " \"skill_gaps\": [],\n",
1549
+ " \"recommendation\": \"Review manually\",\n",
1550
+ " \"fit_summary\": f\"Match score: {similarity_score:.2f}\"\n",
1551
+ " }\n",
1552
+ "\n",
1553
+ "# Test explainability\n",
1554
+ "if LLM_AVAILABLE and cand_vectors is not None and len(candidates) > 0:\n",
1555
+ " print(\"πŸ’‘ Testing match explainability...\\n\")\n",
1556
+ " matches = find_top_matches(0, top_k=1)\n",
1557
+ " if matches:\n",
1558
+ " comp_idx, score = matches[0]\n",
1559
+ " explanation = explain_match(0, comp_idx, score)\n",
1560
+ " \n",
1561
+ " print(\"πŸ“Š Match Explanation:\")\n",
1562
+ " print(json.dumps(explanation, indent=2))\n",
1563
+ "else:\n",
1564
+ " print(\"⚠️ Skipped - requirements not met\")"
1565
+ ]
1566
+ },
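+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Optional: shared JSON-recovery helper\n",
+ "\n",
+ "Steps 14 and 15 duplicate the same fence-stripping and brace-slicing logic. One possible refactor, sketched here rather than wired in, is a small helper both parsers could call:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def extract_json_object(response: str) -> dict | None:\n",
+ "    \"\"\"Recover the first JSON object in an LLM response (refactor sketch).\n",
+ "\n",
+ "    Returns None on failure so each caller can keep its own fallback dict.\n",
+ "    \"\"\"\n",
+ "    json_str = response.strip()\n",
+ "    # Strip a Markdown code fence if the model wrapped its JSON\n",
+ "    if '```json' in json_str:\n",
+ "        json_str = json_str.split('```json')[1].split('```')[0].strip()\n",
+ "    # Slice down to the outermost {...} in case of surrounding prose\n",
+ "    if '{' in json_str and '}' in json_str:\n",
+ "        json_str = json_str[json_str.index('{'):json_str.rindex('}') + 1]\n",
+ "    try:\n",
+ "        return json.loads(json_str)\n",
+ "    except json.JSONDecodeError:\n",
+ "        return None"
+ ]
+ },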
1567
+ {
1568
+ "cell_type": "markdown",
1569
+ "metadata": {},
1570
+ "source": [
1571
+ "---\n",
1572
+ "## πŸ“Š Step 16: Summary\n",
1573
+ "\n",
1574
+ "### What We Built"
1575
+ ]
1576
+ },
1577
+ {
1578
+ "cell_type": "code",
1579
+ "execution_count": 31,
1580
+ "metadata": {},
1581
+ "outputs": [
1582
+ {
1583
+ "name": "stdout",
1584
+ "output_type": "stream",
1585
+ "text": [
1586
+ "======================================================================\n",
1587
+ "🎯 HRHUB v2.1 - SUMMARY\n",
1588
+ "======================================================================\n",
1589
+ "\n",
1590
+ "βœ… IMPLEMENTED:\n",
1591
+ " 1. Zero-Shot Job Classification (Entry/Mid/Senior/Executive)\n",
1592
+ " 2. Few-Shot Learning with Examples\n",
1593
+ " 3. Structured Skills Extraction (Pydantic schemas)\n",
1594
+ " 4. Match Explainability (LLM-generated reasoning)\n",
1595
+ " 5. FREE LLM Integration (Hugging Face)\n",
1596
+ " 6. Flexible Data Loading (Upload OR Google Drive)\n",
1597
+ "\n",
1598
+ "πŸ’° COST: $0.00 (completely free!)\n",
1599
+ "\n",
1600
+ "πŸ“ˆ COURSE ALIGNMENT:\n",
1601
+ " βœ… LLMs for structured output\n",
1602
+ " βœ… Pydantic schemas\n",
1603
+ " βœ… Classification pipelines\n",
1604
+ " βœ… Zero-shot & few-shot learning\n",
1605
+ " βœ… JSON extraction\n",
1606
+ " βœ… Transformer architecture (embeddings)\n",
1607
+ " βœ… API deployment strategies\n",
1608
+ "\n",
1609
+ "======================================================================\n",
1610
+ "πŸš€ READY TO MOVE TO VS CODE!\n",
1611
+ "======================================================================\n"
1612
+ ]
1613
+ }
1614
+ ],
1615
+ "source": [
1616
+ "print(\"=\"*70)\n",
1617
+ "print(\"🎯 HRHUB v2.1 - SUMMARY\")\n",
1618
+ "print(\"=\"*70)\n",
1619
+ "print(\"\")\n",
1620
+ "print(\"βœ… IMPLEMENTED:\")\n",
1621
+ "print(\" 1. Zero-Shot Job Classification (Entry/Mid/Senior/Executive)\")\n",
1622
+ "print(\" 2. Few-Shot Learning with Examples\")\n",
1623
+ "print(\" 3. Structured Skills Extraction (Pydantic schemas)\")\n",
1624
+ "print(\" 4. Match Explainability (LLM-generated reasoning)\")\n",
1625
+ "print(\" 5. FREE LLM Integration (Hugging Face)\")\n",
1626
+ "print(\" 6. Flexible Data Loading (Upload OR Google Drive)\")\n",
1627
+ "print(\"\")\n",
1628
+ "print(\"πŸ’° COST: $0.00 (completely free!)\")\n",
1629
+ "print(\"\")\n",
1630
+ "print(\"πŸ“ˆ COURSE ALIGNMENT:\")\n",
1631
+ "print(\" βœ… LLMs for structured output\")\n",
1632
+ "print(\" βœ… Pydantic schemas\")\n",
1633
+ "print(\" βœ… Classification pipelines\")\n",
1634
+ "print(\" βœ… Zero-shot & few-shot learning\")\n",
1635
+ "print(\" βœ… JSON extraction\")\n",
1636
+ "print(\" βœ… Transformer architecture (embeddings)\")\n",
1637
+ "print(\" βœ… API deployment strategies\")\n",
1638
+ "print(\"\")\n",
1639
+ "print(\"=\"*70)\n",
1640
+ "print(\"πŸš€ READY TO MOVE TO VS CODE!\")\n",
1641
+ "print(\"=\"*70)"
1642
+ ]
1643
+ }
1672
+ ],
1673
+ "metadata": {
1674
+ "kernelspec": {
1675
+ "display_name": "venv",
1676
+ "language": "python",
1677
+ "name": "python3"
1678
+ },
1679
+ "language_info": {
1680
+ "codemirror_mode": {
1681
+ "name": "ipython",
1682
+ "version": 3
1683
+ },
1684
+ "file_extension": ".py",
1685
+ "mimetype": "text/x-python",
1686
+ "name": "python",
1687
+ "nbconvert_exporter": "python",
1688
+ "pygments_lexer": "ipython3",
1689
+ "version": "3.12.3"
1690
+ }
1691
+ },
1692
+ "nbformat": 4,
1693
+ "nbformat_minor": 2
1694
+ }