corrected tagging pipeline + updated README

- .github/workflows/training.yml +4 -4
- .github/workflows/visualization.yml +13 -0
- README.md +23 -29
- tag-posting.py +26 -10
- tags/03-01-2025/1.txt +26 -11
- tags/03-01-2025/2.txt +10 -2
- tags/03-01-2025/3.txt +14 -2
.github/workflows/training.yml

@@ -33,14 +33,14 @@ jobs:
       python llm-tagging.py
       python filter-faults.py
       python train.py
-    - name: List
-      run: ls -R
+    - name: List data folder
+      run: ls -R data || echo "data folder not found"
    - name: Commit and Push Changes
      run: |
        git config --global user.name "github-actions[bot]"
        git config --global user.email "github-actions[bot]@users.noreply.github.com"
-        git add
-        git commit -m "
+        git add data
+        git commit -m "LLM-generated tags uploaded"
        git push
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
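One caveat with the commit step in this workflow: `git commit` exits non-zero when nothing is staged, which fails the run whenever the tagging scripts produce no new output. A hedged sketch of the same step with a no-op guard (an assumption, not part of this commit):

```yaml
- name: Commit and Push Changes
  run: |
    git config --global user.name "github-actions[bot]"
    git config --global user.email "github-actions[bot]@users.noreply.github.com"
    git add data
    # Commit only when something is actually staged, so a no-op run still passes.
    git diff --cached --quiet || git commit -m "LLM-generated tags uploaded"
    git push
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```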
.github/workflows/visualization.yml

@@ -27,6 +27,7 @@ jobs:
 
    - name: Run Visualization Script
      run: |
+        python tag-posting.py
        python embedding_gen.py
    - name: List plots folder
      run: ls -R plots || echo "plots not found"

@@ -34,9 +35,21 @@
      run: |
        git config --global user.name "github-actions[bot]"
        git config --global user.email "github-actions[bot]@users.noreply.github.com"
+        git add tags
        git add plots
        git commit -m "Add plots generated by script"
        git push
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
 
+  sync-to-hub:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          fetch-depth: 0
+          lfs: true
+      - name: Push to hub
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        run: git push https://Robzy:$HF_TOKEN@huggingface.co/spaces/Robzy/jobbert_knowledge_extraction main
README.md

@@ -17,37 +17,39 @@ This projects aims to monitor in-demand skills for machine learning roles. Skill
 
 
 
-### [Monitoring Platform Link](https://huggingface.co/spaces/
+### [Monitoring Platform Link](https://huggingface.co/spaces/Robzy/jobbert_knowledge_extraction)
 
 ## Architecture & Frameworks
 
-
-- **
-- **
-- **
-- **
-- ** Weight & Biases **
-- ** Rapid API **
-- ** OpenAI API **
+- **Hugging Face Spaces**: Used as a UI to host interactive visualisations of skill embeddings and their clusters.
+- **GitHub Actions**: Used to schedule the training, inference and visualisation-updating scripts.
+- **Rapid API**: The API used to scrape job descriptions from LinkedIn.
+- **Weights & Biases**: Used for model-training monitoring as well as model storage.
+- **OpenAI API**: Used to extract ground truth from job descriptions by leveraging multi-shot learning and prompt engineering.
 
 
 # High-Level Overview
 
-##
+## Models
+* **BERT** - finetuned skill-extraction model, lightweight.
+* **LLM** - gpt-4o for skill extraction with multi-shot learning. Computationally expensive.
+* **Embedding model** - [SentenceTransformers](https://sbert.net/) used to embed skills into vectors.
+* [**spaCy**](https://spacy.io/models/en#en_core_web_sm) - sentence tokenization model.
 
-##
-Continual training, extract ground truth via LLM with multi-shot learning with examples.
+## Pipeline
 
-Save all skills. Make a comprehensive overview by:
+The following scripts are scheduled to automate the skill-monitoring and model-training processes continually.
 
-1.
-
-2.
+### 1. Job-posting scraping
+Fetching job descriptions for machine learning roles from LinkedIn via Rapid API.
+### 2. Skills tagging with LLM
+We extract the ground truth of skills from the job descriptions by leveraging multi-shot learning and prompt engineering.
+### 3. Model training
+The skill-extraction model is finetuned with respect to the extracted ground truth.
+### 4. Skills tagging with JobBERT
+Skills are extracted from job postings with the finetuned model.
+### 5. Embedding & visualization
+Extracted skills are embedded, reduced and clustered with an embedding model, UMAP and K-means respectively.
 
 
 # Job Scraping

@@ -94,11 +96,3 @@ We generate embeddings for technical skills listed in .txt files and visualizes
 - **3D Projection**: Saved as interactive HTML files in the `./plots` folder.
 - **3D Clustering Visualization**: Saved as HTML files, showing clusters with different colors.
 
-# Scheduling
-
-To monitor the in-demand skills and update our model continously, scheduling is employed. The following scripts are scheduled every Sunday:
-
-1. Job-posting scraping: fetching job descriptions for machine learning from LinkedIn
-2. Skills tagging with LLM: we decide to extract the ground truth of skills from the job descriptions by leveraging multi-shot learning and prompt engeneering.
-3. Training
-4. Embedding and visualizatio - skills are embedded and visualized with KMeans clustering
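Step 5 of the new README pipeline embeds skills and clusters them with K-means. The project itself uses SentenceTransformers, UMAP and (presumably) scikit-learn; as a self-contained illustration of just the clustering idea, here is a minimal hand-rolled k-means on toy 2-D "embeddings" (the points and the deterministic init are assumptions for the sketch, not the repo's actual data or code):

```python
def kmeans(points, k, iters=20):
    """Tiny k-means on 2-D points; a stand-in for a library KMeans."""
    centers = list(points[:k])  # deterministic init: first k points
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centers[c] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    labels = [min(range(k),
                  key=lambda c: (p[0] - centers[c][0]) ** 2
                              + (p[1] - centers[c][1]) ** 2)
              for p in points]
    return centers, labels

# Two well-separated groups of toy "skill embeddings".
points = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, labels = kmeans(points, k=2)
print(labels)  # each tight group ends up in its own cluster
```

In the real pipeline the 2-D coordinates would come from UMAP-reduced sentence embeddings rather than hand-picked tuples.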
tag-posting.py

@@ -4,6 +4,7 @@ from transformers import AutoTokenizer, BertForTokenClassification, TrainingArgu
 import torch
 from typing import List
 import os
+from datetime import datetime
 
 
 ### Parsing job posting

@@ -215,18 +216,33 @@ def backfill():
 
    print(f"Saved skills to: {tag_path}")
 
-def tag_date():
-
-
-
-
-#
-
-
-# skills_save('./tags/03-01-2024/2.txt',skills)RAPID_API_KEY : 60a10b11e6msh821d32f6e1e955ep15b5b1jsnf61a46680409
-1
+def tag_date(date):
+
+    tag_dir = os.path.join(os.getcwd(), 'tags', date)
+    job_dir = os.path.join(os.getcwd(), 'job-postings', date)
+
+    for job in os.listdir(job_dir):
+
+        job_path = os.path.join(job_dir, job)
+        tag_path = os.path.join(tag_dir, job)
+
+        print(f"Processing job file: {job_path}")
+
+        if not os.path.exists(tag_dir):
+            os.makedirs(tag_dir)
+            print(f"Created directory: {tag_dir}")
+
+        sents = parse_post(job_path)
+        skills = extract_skills(sents)
+        skills_save(tag_path, skills)
+
+        print(f"Saved skills to: {tag_path}")
+
+if __name__ == '__main__':
+
+    # Backfill all job postings
+    # backfill()
+
+    # Tag today's job postings
+    date = datetime.today().strftime('%m-%d-%Y')
+    tag_date(date)
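The new `tag_date()` above depends on the repo's `parse_post`/`extract_skills`/`skills_save` helpers and its working-directory layout. A self-contained sketch of the same date-stamped walk, with a hypothetical whitespace-splitting extractor standing in for JobBERT (the `extract` default and the temp-directory setup are assumptions for illustration):

```python
import os
import tempfile
from datetime import datetime

def tag_date(base, date, extract=lambda text: sorted(set(text.split()))):
    """Read each posting under job-postings/<date>/ and write one tag
    file per posting under tags/<date>/, mirroring the script's layout."""
    job_dir = os.path.join(base, 'job-postings', date)
    tag_dir = os.path.join(base, 'tags', date)
    os.makedirs(tag_dir, exist_ok=True)  # create once, up front
    for job in os.listdir(job_dir):
        with open(os.path.join(job_dir, job)) as f:
            skills = extract(f.read())
        with open(os.path.join(tag_dir, job), 'w') as f:
            f.write('\n'.join(skills))
    return tag_dir

# Exercise it on a throwaway directory with one fake posting.
base = tempfile.mkdtemp()
date = datetime(2025, 3, 1).strftime('%m-%d-%Y')  # '03-01-2025', same as tags/03-01-2025/
os.makedirs(os.path.join(base, 'job-postings', date))
with open(os.path.join(base, 'job-postings', date, '1.txt'), 'w') as f:
    f.write('Python Kubernetes Python')
tag_dir = tag_date(base, date)
print(os.listdir(tag_dir))  # ['1.txt']
```

Creating `tag_dir` once before the loop avoids the per-file existence check in the original.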
tags/03-01-2025/1.txt

@@ -1,33 +1,48 @@
+Ericsson
+mobile
+Ericsson
+Ericsson
+Networked Society
+Eric
+6
+6G
+regulation
+standardization
+6th
+cloud
+cloud
+MLOps
 ML
-
-
+AI
+cloud
+6G
+standard
 MSc in Data Science
 Python
 Go
 MLOps
 MLFlow
-Kubeflow
+Kubeflow
+Python
 Hydra
 numpy
 TensorFlow
-
+Dev
 CI
-/
 CD
-
-pipeline
-testing
+deployment
+pipeline
 ML
 ML
 PyTorch
 TensorFlow
-
-
+Jax
+Con
 Docker
 Kaniko
 Kubernetes
 Helm
-Cloud
+Cloud
 AWS
 Infrastructure management
 Ansible
tags/03-01-2025/2.txt

@@ -3,11 +3,19 @@ Automation
 data analysis
 image recognition
 automation
+Transformers
 Artificial Intelligence
 feasibility studies
+AI
+industry
+.
+operational
 data analysis
 Data Science
-degree
+degree
+software engineering
 Artificial Intelligence
 Vision Systems
-
+project
+English
+Con
tags/03-01-2025/3.txt

@@ -1,10 +1,20 @@
+data
+web
 SQL
 cloud infrastructure
 APIs
+data
+Market
 Python
 infra
 database
-
+scraping
+Python
+cloud
+APIs
+Typescript
+node
+anals
 SaaS
 agile development
 sprint planning

@@ -19,4 +29,6 @@ cloud environments
 Azure
 data processing
 Databricks
-English
+English
+T
+contract