# Skill Filtering BERT - Fine-tuned for Online Job Advertisements (OJAs)

## Model Overview
**Skill Filtering BERT** is a fine-tuned BERT-based model designed for the **information filtering task** of identifying sentences related to **skills** in **Online Job Advertisements (OJAs)**. The model automates the extraction of relevant information, reducing noise and processing complexity in scraped job advertisements by classifying each sentence as skill-relevant or not.

---

## Background
Information filtering systems automate the extraction of relevant information to handle large information flows and mitigate overload, as described in *Hanani et al. (2001)*. Online Job Advertisements (OJAs) often include extraneous elements, such as web page descriptions, layout strings, or menu options, introduced during the scraping process. This noise necessitates a **cleaning step**, which we treat as an **information filtering task**.

Given an OJA represented as a set of \(n\) sentences:

\[ \text{OJA} = \{f_1, f_2, \ldots, f_n\} \]

the filtering step produces a **filtered set of \(m\) sentences** (\(m \leq n\)) that are skill-relevant:

\[ \text{FilteredOJA} = \{c_1, c_2, \ldots, c_m\} \]

This model uses a fine-tuned BERT classifier to perform this filtering, improving efficiency in downstream skill extraction tasks.
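
As a minimal sketch of this filtering step (the model id is the same placeholder used later in this card, and the default `LABEL_0`/`LABEL_1` label names produced by `transformers` are assumed), each sentence of an OJA can be classified and only the skill-relevant ones kept:

```python
# Minimal filtering sketch: keep only the sentences the classifier marks as
# skill-relevant (class 1). Assumes the default LABEL_0/LABEL_1 id2label names.
from transformers import pipeline

classifier = pipeline("text-classification", model="username/skill-filtering-bert")

oja = [
    "About us: we are a fast-growing e-commerce company.",  # scraping noise
    "Proficiency in Python and SQL is required.",           # skill-relevant
    "Click here to share this job on social media.",        # scraping noise
]

# FilteredOJA: the subset of sentences predicted as class 1
filtered_oja = [s for s in oja if classifier(s)[0]["label"] == "LABEL_1"]
print(filtered_oja)
```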
---

## Training Process

The model was fine-tuned in two stages:

### Stage 1: Initial Fine-Tuning
1. **Dataset:**
   The ESCO taxonomy was used to construct a dataset of ~25,000 sentences, comprising a balanced distribution of:
   - **Skill-related sentences** (class 1)
   - **Occupation-related sentences** (class 0)

   ESCO was chosen because its skill descriptions closely resemble the contexts in which skills appear in OJAs. By training BERT on these descriptions, the model learns to differentiate between skills and occupations based on contextual clues.

2. **Training Details** (see the schematic script after this list):
   - **Training Dataset:** 80% of rows
   - **Validation Dataset:** 20% of rows
   - **Loss Function:** Cross-entropy
   - **Batch Size:** 16
   - **Epochs:** 4

3. **Results:**
   - **Training Loss:** 0.0211
   - **Precision:** 89%
   - **Recall:** 94%

4. **Evaluation:**
   On a manually labeled dataset of 400 OJAs (split into sentences):
   - **Precision:** 40%
   - **Recall:** 81%
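
A schematic version of this stage is sketched below. It is not the authors' training script: the base checkpoint (`bert-base-uncased`), the use of the `Trainer` API, and the mock dataset rows are assumptions; only the split, batch size, epoch count, and loss follow the card.

```python
# Schematic Stage 1 fine-tuning sketch. Assumptions: bert-base-uncased as the
# base checkpoint and the Trainer API; the rows below are mock stand-ins for
# the real ~25,000 ESCO-derived sentences.
from datasets import Dataset
from transformers import (
    BertForSequenceClassification,
    BertTokenizer,
    Trainer,
    TrainingArguments,
)

rows = [
    {"text": "Use spreadsheet software to organise numerical data.", "label": 1},  # skill description
    {"text": "Communicate technical results to non-expert audiences.", "label": 1},
    {"text": "Bakers make a wide range of breads and pastries.", "label": 0},      # occupation description
    {"text": "Data analysts import, clean and interpret data.", "label": 0},
    {"text": "Apply statistical analysis techniques to datasets.", "label": 1},
]
splits = Dataset.from_list(rows).train_test_split(test_size=0.2)  # 80% train / 20% validation

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenized = splits.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="skill-filtering-bert",
    per_device_train_batch_size=16,  # batch size 16, as reported
    num_train_epochs=4,              # 4 epochs, as reported
)

# For sequence classification the Trainer minimises cross-entropy by default.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
```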
---

### Stage 2: Second Fine-Tuning
1. **Dataset:**
   To improve recall and precision, we manually labeled **300 OJAs** (split into sentences). Sentences were annotated as:
   - **Skill-relevant (class 1)**
   - **Non-skill-relevant (class 0)**

   To emphasize skill-related sentences, a **cost matrix** was introduced, doubling the weight for class 1 (see the weighted-loss sketch after this list).

2. **Training Details:**
   - **Training Dataset:** 75% of manually labeled OJAs
   - **Validation Dataset:** 25% of manually labeled OJAs
   - **Batch Size:** 16
   - **Epochs:** 4

3. **Results:**
   - **Precision:** 71%
   - **Recall:** 93%

4. **Final Evaluation:**
   Evaluated on the remaining **100 manually labeled OJAs** (from the 400-OJA evaluation set of Stage 1), the model showed clear improvements over Stage 1 in identifying skill-relevant sentences.
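
The card does not show how the cost matrix was implemented; a minimal sketch, assuming it corresponds to a class-weighted cross-entropy loss in PyTorch:

```python
# Minimal sketch of the class weighting described above, assuming the "cost
# matrix" amounts to per-class weights in the cross-entropy loss.
import torch
import torch.nn as nn

# Errors on class 1 (skill-relevant) count twice as much as errors on class 0.
class_weights = torch.tensor([1.0, 2.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.tensor([[1.2, -0.8], [0.1, 0.4]])  # toy classifier outputs
labels = torch.tensor([0, 1])                      # gold labels
print(loss_fn(logits, labels).item())
```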
---

## Model Usage

This model is ideal for organizations and researchers working on **labour market analysis**, **skill extraction**, or similar NLP tasks requiring fine-grained sentence filtering. By processing OJAs to identify skill-relevant sentences, downstream tasks like taxonomy mapping or skill prediction can be performed with higher precision and reduced noise.

### How to Use the Model

You can load the model with the Hugging Face Transformers library as follows:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Load the model and tokenizer
model_name = "username/skill-filtering-bert"
model = BertForSequenceClassification.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)
model.eval()  # inference mode

# Example input: a single sentence
sentence = "This job requires proficiency in Python programming."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)

# Get predictions without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()

# Class 1 = skill-relevant, class 0 = non-skill-relevant
print(f"Predicted Class: {predicted_class}")
```