corrected tagging pipeline + updated README

- .github/workflows/training.yml +4 -4
- .github/workflows/visualization.yml +13 -0
- README.md +23 -29
- tag-posting.py +26 -10
- tags/03-01-2025/1.txt +26 -11
- tags/03-01-2025/2.txt +10 -2
- tags/03-01-2025/3.txt +14 -2
.github/workflows/training.yml

@@ -33,14 +33,14 @@ jobs:
       python llm-tagging.py
       python filter-faults.py
       python train.py
-    - name: List
-      run: ls -R
+    - name: List data folder
+      run: ls -R data || echo "data folder not found"
    - name: Commit and Push Changes
      run: |
        git config --global user.name "github-actions[bot]"
        git config --global user.email "github-actions[bot]@users.noreply.github.com"
-        git add
-        git commit -m "
+        git add data
+        git commit -m "LLM-generated tags uploaded"
        git push
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
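One caveat with the commit step in this workflow: `git commit` exits non-zero when nothing is staged, which fails the run whenever the tagging scripts produce no new output. A hedged sketch of the same step with a no-op guard (an assumption, not part of this commit):

```yaml
- name: Commit and Push Changes
  run: |
    git config --global user.name "github-actions[bot]"
    git config --global user.email "github-actions[bot]@users.noreply.github.com"
    git add data
    # Commit only when something is actually staged, so a no-op run still passes.
    git diff --cached --quiet || git commit -m "LLM-generated tags uploaded"
    git push
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```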
.github/workflows/visualization.yml

@@ -27,6 +27,7 @@ jobs:
 
    - name: Run Visualization Script
      run: |
+        python tag-posting.py
        python embedding_gen.py
    - name: List plots folder
      run: ls -R plots || echo "plots not found"

@@ -34,9 +35,21 @@
      run: |
        git config --global user.name "github-actions[bot]"
        git config --global user.email "github-actions[bot]@users.noreply.github.com"
+        git add tags
        git add plots
        git commit -m "Add plots generated by script"
        git push
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
 
+  sync-to-hub:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          fetch-depth: 0
+          lfs: true
+      - name: Push to hub
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        run: git push https://Robzy:$HF_TOKEN@huggingface.co/spaces/Robzy/jobbert_knowledge_extraction main
README.md

@@ -17,37 +17,39 @@ This projects aims to monitor in-demand skills for machine learning roles. Skill
 
 
 
-### [Monitoring Platform Link](https://huggingface.co/spaces/
+### [Monitoring Platform Link](https://huggingface.co/spaces/Robzy/jobbert_knowledge_extraction)
 
 ## Architecture & Frameworks
 
-
-- **
-- **
-- **
-- **
-- ** Weight & Biases **
-- ** Rapid API **
-- ** OpenAI API **
+- **Hugging Face Spaces**: Used as a UI to host interactive visualisations of skill embeddings and their clusters.
+- **GitHub Actions**: Used to schedule the training, inference and visualisation-updating scripts.
+- **Rapid API**: The API used to scrape job descriptions from LinkedIn.
+- **Weights & Biases**: Used for model-training monitoring as well as model storage.
+- **OpenAI API**: Used to extract ground truth from job descriptions by leveraging multi-shot learning and prompt engineering.
 
 
 # High-Level Overview
 
-##
+## Models
+* **BERT** - finetuned skill-extraction model, lightweight.
+* **LLM** - gpt-4o for skill extraction with multi-shot learning. Computationally expensive.
+* **Embedding model** - [SentenceTransformers](https://sbert.net/) used to embed skills into vectors.
+* [**spaCy**](https://spacy.io/models/en#en_core_web_sm) - sentence tokenization model.
 
-##
-Continual training, extract ground truth via LLM with multi-shot learning with examples.
+## Pipeline
 
-Save all skills. Make a comprehensive overview by:
+The following scripts are scheduled to automate the skill-monitoring and model-training processes continually.
 
-1.
-
-2.
+### 1. Job-posting scraping
+Fetching job descriptions for machine learning roles from LinkedIn via Rapid API.
+### 2. Skills tagging with LLM
+We extract the ground truth of skills from the job descriptions by leveraging multi-shot learning and prompt engineering.
+### 3. Model training
+The skill-extraction model is finetuned with respect to the extracted ground truth.
+### 4. Skills tagging with JobBERT
+Skills are extracted from job postings with the finetuned model.
+### 5. Embedding & visualization
+Extracted skills are embedded, reduced and clustered with an embedding model, UMAP and K-means respectively.
 
 
 # Job Scraping

@@ -94,11 +96,3 @@ We generate embeddings for technical skills listed in .txt files and visualizes
 - **3D Projection**: Saved as interactive HTML files in the `./plots` folder.
 - **3D Clustering Visualization**: Saved as HTML files, showing clusters with different colors.
 
-# Scheduling
-
-To monitor the in-demand skills and update our model continously, scheduling is employed. The following scripts are scheduled every Sunday:
-
-1. Job-posting scraping: fetching job descriptions for machine learning from LinkedIn
-2. Skills tagging with LLM: we decide to extract the ground truth of skills from the job descriptions by leveraging multi-shot learning and prompt engeneering.
-3. Training
-4. Embedding and visualizatio - skills are embedded and visualized with KMeans clustering
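Step 5 of the new README pipeline embeds skills and clusters them with K-means. The project itself uses SentenceTransformers, UMAP and (presumably) scikit-learn; as a self-contained illustration of just the clustering idea, here is a minimal hand-rolled k-means on toy 2-D "embeddings" (the points and the deterministic init are assumptions for the sketch, not the repo's actual data or code):

```python
def kmeans(points, k, iters=20):
    """Tiny k-means on 2-D points; a stand-in for a library KMeans."""
    centers = list(points[:k])  # deterministic init: first k points
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centers[c] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    labels = [min(range(k),
                  key=lambda c: (p[0] - centers[c][0]) ** 2
                              + (p[1] - centers[c][1]) ** 2)
              for p in points]
    return centers, labels

# Two well-separated groups of toy "skill embeddings".
points = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, labels = kmeans(points, k=2)
print(labels)  # each tight group ends up in its own cluster
```

In the real pipeline the 2-D coordinates would come from UMAP-reduced sentence embeddings rather than hand-picked tuples.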
tag-posting.py

@@ -4,6 +4,7 @@ from transformers import AutoTokenizer, BertForTokenClassification, TrainingArgu
 import torch
 from typing import List
 import os
+from datetime import datetime
 
 
 ### Parsing job posting

@@ -215,18 +216,33 @@ def backfill():
 
    print(f"Saved skills to: {tag_path}")
 
-def tag_date():
-
-
-
-
-#
-
-
-# skills_save('./tags/03-01-2024/2.txt',skills)RAPID_API_KEY : 60a10b11e6msh821d32f6e1e955ep15b5b1jsnf61a46680409
-1
+def tag_date(date):
+
+    tag_dir = os.path.join(os.getcwd(), 'tags', date)
+    job_dir = os.path.join(os.getcwd(), 'job-postings', date)
+
+    for job in os.listdir(job_dir):
+
+        job_path = os.path.join(job_dir, job)
+        tag_path = os.path.join(tag_dir, job)
+
+        print(f"Processing job file: {job_path}")
+
+        if not os.path.exists(tag_dir):
+            os.makedirs(tag_dir)
+            print(f"Created directory: {tag_dir}")
+
+        sents = parse_post(job_path)
+        skills = extract_skills(sents)
+        skills_save(tag_path, skills)
+
+        print(f"Saved skills to: {tag_path}")
+
+if __name__ == '__main__':
+
+    # Backfill all job postings
+    # backfill()
+
+    # Tag today's job postings
+    date = datetime.today().strftime('%m-%d-%Y')
+    tag_date(date)
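The new `tag_date()` above depends on the repo's `parse_post`/`extract_skills`/`skills_save` helpers and its working-directory layout. A self-contained sketch of the same date-stamped walk, with a hypothetical whitespace-splitting extractor standing in for JobBERT (the `extract` default and the temp-directory setup are assumptions for illustration):

```python
import os
import tempfile
from datetime import datetime

def tag_date(base, date, extract=lambda text: sorted(set(text.split()))):
    """Read each posting under job-postings/<date>/ and write one tag
    file per posting under tags/<date>/, mirroring the script's layout."""
    job_dir = os.path.join(base, 'job-postings', date)
    tag_dir = os.path.join(base, 'tags', date)
    os.makedirs(tag_dir, exist_ok=True)  # create once, up front
    for job in os.listdir(job_dir):
        with open(os.path.join(job_dir, job)) as f:
            skills = extract(f.read())
        with open(os.path.join(tag_dir, job), 'w') as f:
            f.write('\n'.join(skills))
    return tag_dir

# Exercise it on a throwaway directory with one fake posting.
base = tempfile.mkdtemp()
date = datetime(2025, 3, 1).strftime('%m-%d-%Y')  # '03-01-2025', same as tags/03-01-2025/
os.makedirs(os.path.join(base, 'job-postings', date))
with open(os.path.join(base, 'job-postings', date, '1.txt'), 'w') as f:
    f.write('Python Kubernetes Python')
tag_dir = tag_date(base, date)
print(os.listdir(tag_dir))  # ['1.txt']
```

Creating `tag_dir` once before the loop avoids the per-file existence check in the original.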
tags/03-01-2025/1.txt

@@ -1,33 +1,48 @@
+Ericsson
+mobile
+Ericsson
+Ericsson
+Networked Society
+Eric
+6
+6G
+regulation
+standardization
+6th
+cloud
+cloud
+MLOps
 ML
-
-
+AI
+cloud
+6G
+standard
 MSc in Data Science
 Python
 Go
 MLOps
 MLFlow
-Kubeflow
+Kubeflow
+Python
 Hydra
 numpy
 TensorFlow
-
+Dev
 CI
-/
 CD
-
-pipeline
-testing
+deployment
+pipeline
 ML
 ML
 PyTorch
 TensorFlow
-
-
+Jax
+Con
 Docker
 Kaniko
 Kubernetes
 Helm
-Cloud
+Cloud
 AWS
 Infrastructure management
 Ansible
tags/03-01-2025/2.txt

@@ -3,11 +3,19 @@ Automation
 data analysis
 image recognition
 automation
+Transformers
 Artificial Intelligence
 feasibility studies
+AI
+industry
+.
+operational
 data analysis
 Data Science
-degree
+degree
+software engineering
 Artificial Intelligence
 Vision Systems
-
+project
+English
+Con
tags/03-01-2025/3.txt

@@ -1,10 +1,20 @@
+data
+web
 SQL
 cloud infrastructure
 APIs
+data
+Market
 Python
 infra
 database
-
+scraping
+Python
+cloud
+APIs
+Typescript
+node
+anals
 SaaS
 agile development
 sprint planning

@@ -19,4 +29,6 @@ cloud environments
 Azure
 data processing
 Databricks
-English
+English
+T
+contract