# Skill Filtering BERT - Fine-tuned for Online Job Advertisements (OJAs)

## Model Overview

**Skill Filtering BERT** is a fine-tuned BERT-based model designed for the **information filtering task** of identifying sentences related to **skills** in **Online Job Advertisements (OJAs)**. The model automates the extraction of relevant information, reducing noise and processing complexity in scraped job advertisements by classifying each sentence as skill-relevant or not.

---

## Background

Information filtering systems automate the extraction of relevant information to handle large information flows and mitigate overload, as described in *Hanani et al. (2001)*. Online Job Advertisements (OJAs) often include extraneous elements, such as web page descriptions, layout strings, or menu options, introduced during the scraping process. This noise necessitates a **cleaning step**, which we treat as an **information filtering task**.

Given an OJA represented as a set of \(n\) sentences:

\[
\text{OJA} = \{f_1, f_2, \ldots, f_n\}
\]

the filtering step produces a **filtered set of \(m\) sentences** (\(m \leq n\)) that are skill-relevant:

\[
\text{FilteredOJA} = \{c_1, c_2, \ldots, c_m\}
\]

This model uses a fine-tuned BERT to accomplish this filtering, improving efficiency in downstream skill extraction tasks.
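
In code, the filtering step amounts to keeping only the sentences a classifier accepts. A minimal sketch, where `is_skill_sentence` is a hypothetical stand-in for the fine-tuned classifier:

```python
def filter_oja(sentences, is_skill_sentence):
    """Return the skill-relevant subset of an OJA's sentences."""
    return [s for s in sentences if is_skill_sentence(s)]

# A toy OJA mixing real content with scraping noise
oja = [
    "Join our fast-growing company!",       # boilerplate
    "Proficiency in Python is required.",   # skill-relevant
    "Click here to see similar openings.",  # scraping noise
]

# Toy predicate for illustration only; the real model replaces this
filtered = filter_oja(oja, lambda s: "Python" in s)
print(filtered)  # ['Proficiency in Python is required.']
```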

---

## Training Process

The model was fine-tuned in two stages.

### Stage 1: Initial Fine-Tuning

1. **Dataset:**
   The ESCO taxonomy was used to construct a dataset of ~25,000 sentences, comprising a balanced distribution of:
   - **Skill-related sentences** (class 1)
   - **Occupation-related sentences** (class 0)

   ESCO was chosen because its skill descriptions closely resemble the contexts in which skills appear in OJAs. By training BERT on these descriptions, the model learns to differentiate between skills and occupations based on contextual clues.

2. **Training Details:**
   - **Training Dataset:** 80% of rows
   - **Validation Dataset:** 20% of rows
   - **Loss Function:** Cross-entropy
   - **Batch Size:** 16
   - **Epochs:** 4

3. **Results:**
   - **Training Loss:** 0.0211
   - **Precision:** 89%
   - **Recall:** 94%

4. **Evaluation:**
   On a manually labeled dataset of 400 OJAs (split into sentences):
   - **Precision:** 40%
   - **Recall:** 81%
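
The precision and recall figures above are computed over sentence-level predictions. As a reminder of how they follow from a confusion matrix (the counts below are made up for illustration, not the model's actual confusion matrix):

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Illustrative counts only
p, r = precision_recall(tp=81, fp=121, fn=19)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.40, recall=0.81
```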

---

### Stage 2: Second Fine-Tuning

1. **Dataset:**
   To improve recall and precision, we manually labeled **300 OJAs** (split into sentences). Sentences were annotated as:
   - **Skill-relevant** (class 1)
   - **Non-skill-relevant** (class 0)

   To emphasize skill-related sentences, a **cost matrix** was introduced, doubling the loss weight for class 1.

2. **Training Details:**
   - **Training Dataset:** 75% of manually labeled OJAs
   - **Validation Dataset:** 25% of manually labeled OJAs
   - **Batch Size:** 16
   - **Epochs:** 4

3. **Results:**
   - **Precision:** 71%
   - **Recall:** 93%

4. **Final Evaluation:**
   Evaluated on the remaining 100 manually labeled OJAs, the model demonstrated significant improvements in identifying skill-relevant sentences.
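
The Stage 2 class weighting can be realized as a weighted cross-entropy. A minimal PyTorch sketch, assuming the cost matrix reduces to the per-class weights below (logits and labels are toy values):

```python
import torch
import torch.nn as nn

# Cost matrix from Stage 2: skill-relevant sentences (class 1)
# get double weight in the cross-entropy loss
weights = torch.tensor([1.0, 2.0])
weighted_loss = nn.CrossEntropyLoss(weight=weights)
plain_loss = nn.CrossEntropyLoss()

# Toy logits for a batch of two sentences (class 0 vs class 1)
logits = torch.tensor([[2.0, 0.0],   # confident non-skill prediction
                       [0.5, 1.0]])  # less confident skill prediction
labels = torch.tensor([0, 1])

# Errors on class-1 sentences now contribute more to the loss
print(weighted_loss(logits, labels).item(), plain_loss(logits, labels).item())
```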

---

## Model Usage

This model is ideal for organizations and researchers working on **labour market analysis**, **skill extraction**, or similar NLP tasks requiring fine-grained sentence filtering. By processing OJAs to identify skill-relevant sentences, downstream tasks like taxonomy mapping or skill prediction can be performed with higher precision and reduced noise.

### How to Use the Model

You can load the model using the Hugging Face Transformers library as follows:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Load the model and tokenizer
model_name = "username/skill-filtering-bert"
model = BertForSequenceClassification.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)
model.eval()

# Example input: a single sentence
sentence = "This job requires proficiency in Python programming."
inputs = tokenizer(sentence, return_tensors="pt")

# Get predictions (no gradients needed at inference time)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class = logits.argmax(dim=-1).item()

# Class 1 = skill-relevant, class 0 = non-skill-relevant
print(f"Predicted Class: {predicted_class}")
```
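
To filter a full advertisement rather than a single sentence, each sentence's logits are converted to probabilities and the class-1 predictions are kept. A sketch of that thresholding step, using dummy logits in place of real model outputs:

```python
import torch

# Dummy logits standing in for model(**inputs).logits over 3 sentences;
# column 0 = non-skill-relevant, column 1 = skill-relevant
logits = torch.tensor([
    [2.1, -0.5],   # boilerplate sentence
    [-1.0, 1.7],   # skill sentence
    [0.2, 0.1],    # borderline, leans non-skill
])
sentences = [
    "Join our fast-growing company!",
    "Proficiency in Python is required.",
    "We are located in Berlin.",
]

probs = torch.softmax(logits, dim=-1)
keep = probs[:, 1] > 0.5  # keep sentences predicted skill-relevant

filtered = [s for s, k in zip(sentences, keep) if k]
print(filtered)  # ['Proficiency in Python is required.']
```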