feliponi committed on
Commit 5fe4ae9 · verified · 1 Parent(s): b356bc1

Upload folder using huggingface_hub

Files changed (7)
  1. .gitattributes +1 -0
  2. README.md +42 -16
  3. config.json +24 -11
  4. model.safetensors +2 -2
  5. test_metrics.json +22 -13
  6. tokenizer.json +0 -0
  7. tokenizer_config.json +6 -8
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,6 +1,3 @@
- ---
- license: mit
- ---
  language: en
  license: apache-2.0
  library_name: transformers
@@ -37,8 +34,7 @@ You can use this model directly with the `token-classification` (or `ner`) pipeline
  from transformers import pipeline

  # Load the model from the Hub
- # (Replace with your actual model ID, e.g., "your-username/hirly-ner-multi")
- model_id = "your-username/hirly-ner-multi"
+ model_id = "feliponi/hirly-ner-multi"

  # Initialize the pipeline
  # aggregation_strategy="simple" groups B- and I- tags (e.g., B-SKILL, I-SKILL -> SKILL)
@@ -49,7 +45,20 @@ extractor = pipeline(
  )

  # Example text
- text = "Data Scientist with 5+ years of experience in Python and machine learning. Also 6 months in Java."
+ text = """
+ Data Scientist with 5+ years of experience in Python and machine learning.
+ Also 6 months in Java.
+
+ Soft skills:
+ inclusive leadership
+ paradigm thinking
+ performance optimization
+ personal initiative
+
+ english language proficiency
+ portuguese language proficiency
+
+ AWS Certified Solutions Architect - Associate"""

  # Get entities
  entities = extractor(text)
@@ -66,11 +75,20 @@ for entity in confident_entities:
  **Expected Output:**

  ````
- [EXPERIENCE_DURATION] 5+ years (Confidence: 1.00)
- [SKILL] Python (Confidence: 0.99)
- [SKILL] machine learning (Confidence: 1.00)
- [EXPERIENCE_DURATION] 6 months (Confidence: 1.00)
- [SKILL] Java (Confidence: 0.99)
+ [{'entity_group': 'SKILL', 'score': np.float32(0.9340167), 'word': 'Data Scientist', 'start': 1, 'end': 15},
+ {'entity_group': 'EXPERIENCE_DURATION', 'score': np.float32(0.9998663), 'word': ' 5+ years', 'start': 21, 'end': 29},
+ {'entity_group': 'SKILL', 'score': np.float32(0.99859816), 'word': ' Python', 'start': 47, 'end': 53},
+ {'entity_group': 'SKILL', 'score': np.float32(0.9998181), 'word': ' machine learning', 'start': 58, 'end': 74},
+ {'entity_group': 'EXPERIENCE_DURATION', 'score': np.float32(0.9998392), 'word': ' 6 months', 'start': 81, 'end': 89},
+ {'entity_group': 'SKILL', 'score': np.float32(0.9982002), 'word': ' Java', 'start': 93, 'end': 97},
+ {'entity_group': 'SOFT_SKILL', 'score': np.float32(0.995745), 'word': ' leadership', 'start': 124, 'end': 134},
+ {'entity_group': 'SOFT_SKILL', 'score': np.float32(0.9859735), 'word': 'performance optimization', 'start': 153, 'end': 177},
+ {'entity_group': 'SOFT_SKILL', 'score': np.float32(0.98516375), 'word': 'personal initiative', 'start': 178, 'end': 197},
+ {'entity_group': 'LANG', 'score': np.float32(0.96456385), 'word': 'english language proficiency', 'start': 199, 'end': 227},
+ {'entity_group': 'LANG', 'score': np.float32(0.9288162), 'word': 'portuguese language proficiency', 'start': 228, 'end': 259},
+ {'entity_group': 'SKILL', 'score': np.float32(0.926032), 'word': 'AWS', 'start': 261, 'end': 264},
+ {'entity_group': 'SOFT_SKILL', 'score': np.float32(0.9559879), 'word': ' Solutions', 'start': 275, 'end': 284},
+ {'entity_group': 'SKILL', 'score': np.float32(0.84499276), 'word': ' Architect', 'start': 285, 'end': 294}]
  ````

  ## Training, Performance, and Limitations
@@ -83,15 +101,23 @@ The model was validated on a test set of ~2,000 examples, achieving the following

  | Entity | F1-Score |
  | :--- | :--- |
- | **`EXPERIENCE_DURATION`** | **99.9%** |
- | **`SKILL`** | **97.6%** |
- | **Overall** | **98.8%** |
+ | **`SKILLS`** | **98.9%** |
+ | **`LANG`** | **99.0%** |
+ | **`CERT`** | **84.9%** |
+ | **`SOFT_SKILL`** | **98.6%** |
+ | **`EXPERIENCE_DURATION`** | **99.8%** |
+ | **Overall** | **96.3%** |

  ### Training Methodology

- 1. **`EXPERIENCE_DURATION` (High Quality):** This entity was labeled using a robust set of regular expressions designed to find time patterns (e.g., "5+ years", "six months"). Its near-perfect F1 score reflects this.
-
- 2. **`SKILL` (High Recall, Lower Precision):** This entity was labeled by performing *exact matching* against a large, proprietary vocabulary of ~8,700 terms.
+ This model's performance is a direct result of its **Weak Labeling** training methodology. The labels were generated automatically, not manually annotated.
+
+ 1. **`EXPERIENCE_DURATION` (Pattern-Based):** This entity was labeled using a robust set of regular expressions designed to find time-based patterns (e.g., "5+ years", "six months", "3-5 anos"). Its near-perfect F1 score reflects the high precision of this regex approach.
+
+ 2. **`SKILL`, `SOFT_SKILL`, `LANG`, `CERT` (Vocabulary-Based):** These four entities were labeled by performing high-speed, *exact matching* against four separate vocabulary files (`skills.txt`, `softskills.txt`, `langskills.txt`, `certifications.txt`).

+ * **High Performance (`SKILL`, `SOFT_SKILL`, `LANG`):** The excellent F1 scores (98-99%) indicate that the vocabularies for these labels were comprehensive and matched the training texts frequently.
+ * **Good Performance (`CERT`):** The 84.9% F1 score is strong but shows room for improvement. This score suggests the `certifications.txt` vocabulary was less comprehensive. The model's performance for this label would be directly improved by adding more certification names (e.g., "AWS CSAA", "PMP", etc.) to the vocabulary file and retraining.
+
  ### Limitations (Important)
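The weak-labeling recipe in the README above (a regex pass for durations, then exact vocabulary matching for skills) can be sketched roughly as follows. The pattern and the tiny vocabulary are illustrative assumptions for a toy tokenizer, not the project's actual labeling scripts or the contents of `skills.txt`:

```python
import re

# Illustrative stand-ins: a simplified duration pattern and a 3-term vocabulary.
DURATION_RE = re.compile(r"\b\d+\+?\s*(?:years?|months?|anos)\b", re.IGNORECASE)
SKILL_VOCAB = {"python", "machine learning", "java"}  # stand-in for skills.txt

def weak_label(tokens):
    """Assign BIO tags: regex pass for EXPERIENCE_DURATION, then exact
    vocabulary matching (longest phrase first) for SKILL."""
    tags = ["O"] * len(tokens)
    text = " ".join(tokens)
    # Regex pass: locate each duration match back in the token list.
    for m in DURATION_RE.finditer(text):
        span = [t.lower() for t in m.group().split()]
        for i in range(len(tokens) - len(span) + 1):
            if [t.lower() for t in tokens[i:i + len(span)]] == span:
                tags[i:i + len(span)] = (["B-EXPERIENCE_DURATION"]
                                         + ["I-EXPERIENCE_DURATION"] * (len(span) - 1))
    # Vocabulary pass: try 2-grams before unigrams, skip already-tagged tokens.
    for n in (2, 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in SKILL_VOCAB and all(t == "O" for t in tags[i:i + n]):
                tags[i:i + n] = ["B-SKILL"] + ["I-SKILL"] * (n - 1)
    return tags

print(weak_label("5+ years of Python and machine learning".split()))
# → ['B-EXPERIENCE_DURATION', 'I-EXPERIENCE_DURATION', 'O', 'B-SKILL', 'O', 'B-SKILL', 'I-SKILL']
```

Exact matching like this gives high recall on vocabulary terms but no disambiguation, which is consistent with the precision/recall trade-offs described in the methodology.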
 
config.json CHANGED
@@ -1,6 +1,6 @@
  {
    "architectures": [
-     "RobertaForTokenClassification"
+     "XLMRobertaForTokenClassification"
    ],
    "attention_probs_dropout_prob": 0.1,
    "bos_token_id": 0,
@@ -12,29 +12,42 @@
    "hidden_size": 768,
    "id2label": {
      "0": "O",
-     "1": "B-EXPERIENCE_DURATION",
-     "2": "I-EXPERIENCE_DURATION",
-     "3": "B-SKILL",
-     "4": "I-SKILL"
+     "1": "B-CERT",
+     "2": "I-CERT",
+     "3": "B-EXPERIENCE_DURATION",
+     "4": "I-EXPERIENCE_DURATION",
+     "5": "B-LANG",
+     "6": "I-LANG",
+     "7": "B-SKILL",
+     "8": "I-SKILL",
+     "9": "B-SOFT_SKILL",
+     "10": "I-SOFT_SKILL"
    },
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "label2id": {
-     "B-EXPERIENCE_DURATION": 1,
-     "B-SKILL": 3,
-     "I-EXPERIENCE_DURATION": 2,
-     "I-SKILL": 4,
+     "B-CERT": 1,
+     "B-EXPERIENCE_DURATION": 3,
+     "B-LANG": 5,
+     "B-SKILL": 7,
+     "B-SOFT_SKILL": 9,
+     "I-CERT": 2,
+     "I-EXPERIENCE_DURATION": 4,
+     "I-LANG": 6,
+     "I-SKILL": 8,
+     "I-SOFT_SKILL": 10,
      "O": 0
    },
    "layer_norm_eps": 1e-05,
    "max_position_embeddings": 514,
-   "model_type": "roberta",
+   "model_type": "xlm-roberta",
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
+   "output_past": true,
    "pad_token_id": 1,
    "position_embedding_type": "absolute",
    "transformers_version": "4.57.1",
    "type_vocab_size": 1,
    "use_cache": true,
-   "vocab_size": 50265
+   "vocab_size": 250002
  }
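The expanded `id2label` map above is what the pipeline uses when grouping B-/I- predictions into entity spans. A minimal sketch of that grouping step (the effect of `aggregation_strategy="simple"`, ignoring subword merging and score averaging; keys written as ints for brevity):

```python
# id2label from the new config.json (keys as ints for brevity).
id2label = {
    0: "O",
    1: "B-CERT", 2: "I-CERT",
    3: "B-EXPERIENCE_DURATION", 4: "I-EXPERIENCE_DURATION",
    5: "B-LANG", 6: "I-LANG",
    7: "B-SKILL", 8: "I-SKILL",
    9: "B-SOFT_SKILL", 10: "I-SOFT_SKILL",
}

def group_bio(tokens, label_ids):
    """Merge consecutive B-X / I-X tokens into (entity_group, text) spans."""
    spans = []
    for token, lid in zip(tokens, label_ids):
        tag = id2label[lid]
        if tag.startswith("B-"):
            spans.append([tag[2:], [token]])
        elif tag.startswith("I-") and spans and spans[-1][0] == tag[2:]:
            spans[-1][1].append(token)
        # "O" (and orphan I- tags) simply end the current span.
    return [(group, " ".join(toks)) for group, toks in spans]

print(group_bio(["Python", "and", "machine", "learning"], [7, 0, 7, 8]))
# → [('SKILL', 'Python'), ('SKILL', 'machine learning')]
```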
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:509d0ec667aaea68af890e0f03fe55b16f5a278671d5c074c130c8f9558cad02
- size 496259468
+ oid sha256:c560f1474d8f841d665671ca9a61561b565a542a4047d3213f57c6b77cf4ed36
+ size 1109870108
test_metrics.json CHANGED
@@ -1,15 +1,24 @@
  {
-   "test_loss": 0.02206496149301529,
-   "test_precision": 0.9855069157852383,
-   "test_recall": 0.9891982048789573,
-   "test_f1": 0.9873462648969002,
-   "test_SKILL_precision": 0.974378359649466,
-   "test_SKILL_recall": 0.9787130658693776,
-   "test_SKILL_f1": 0.97653591590187,
-   "test_EXPERIENCE_DURATION_precision": 0.9966354719210107,
-   "test_EXPERIENCE_DURATION_recall": 0.9996833438885371,
-   "test_EXPERIENCE_DURATION_f1": 0.9981566138919303,
-   "test_runtime": 35.2236,
-   "test_samples_per_second": 56.723,
-   "test_steps_per_second": 7.098
+   "test_loss": 0.01223407406359911,
+   "test_precision": 0.9551368142202801,
+   "test_recall": 0.9799216081310224,
+   "test_f1": 0.9630685025439435,
+   "test_LANG_precision": 0.9819929629539341,
+   "test_LANG_recall": 1.0,
+   "test_LANG_f1": 0.9908397325566001,
+   "test_SKILL_precision": 0.9889449241719945,
+   "test_SKILL_recall": 0.9904749295912962,
+   "test_SKILL_f1": 0.9897093160099927,
+   "test_CERT_precision": 0.8260869565217391,
+   "test_CERT_recall": 0.9166666666666667,
+   "test_CERT_f1": 0.8492822966507176,
+   "test_SOFT_SKILL_precision": 0.9810121524462506,
+   "test_SOFT_SKILL_recall": 0.9926311347792305,
+   "test_SOFT_SKILL_f1": 0.9867712480908251,
+   "test_EXPERIENCE_DURATION_precision": 0.9976470750074813,
+   "test_EXPERIENCE_DURATION_recall": 0.9998353096179183,
+   "test_EXPERIENCE_DURATION_f1": 0.9987399194115818,
+   "test_runtime": 91.6905,
+   "test_samples_per_second": 43.603,
+   "test_steps_per_second": 5.453
  }
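Per-entity scores like those in `test_metrics.json` are typically computed by exact span matching between predicted and gold entities (seqeval-style). A toy sketch of that computation, not the project's actual evaluation code:

```python
from collections import defaultdict

def per_entity_f1(gold, pred):
    """gold/pred: sets of (label, start, end) spans; returns F1 per label."""
    stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for span in pred:
        stats[span[0]]["tp" if span in gold else "fp"] += 1
    for span in gold:
        if span not in pred:
            stats[span[0]]["fn"] += 1
    result = {}
    for label, s in stats.items():
        p = s["tp"] / (s["tp"] + s["fp"]) if s["tp"] + s["fp"] else 0.0
        r = s["tp"] / (s["tp"] + s["fn"]) if s["tp"] + s["fn"] else 0.0
        result[label] = 2 * p * r / (p + r) if p + r else 0.0
    return result

# Toy data: one missed SKILL (false negative), one spurious CERT (false positive).
gold = {("SKILL", 0, 6), ("SKILL", 10, 14), ("CERT", 20, 40)}
pred = {("SKILL", 0, 6), ("CERT", 20, 40), ("CERT", 50, 60)}
print(per_entity_f1(gold, pred))
```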
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -4,7 +4,7 @@
    "0": {
      "content": "<s>",
      "lstrip": false,
-     "normalized": true,
+     "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
@@ -12,7 +12,7 @@
    "1": {
      "content": "<pad>",
      "lstrip": false,
-     "normalized": true,
+     "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
@@ -20,7 +20,7 @@
    "2": {
      "content": "</s>",
      "lstrip": false,
-     "normalized": true,
+     "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
@@ -28,12 +28,12 @@
    "3": {
      "content": "<unk>",
      "lstrip": false,
-     "normalized": true,
+     "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
-   "50264": {
+   "250001": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
@@ -46,13 +46,11 @@
    "clean_up_tokenization_spaces": false,
    "cls_token": "<s>",
    "eos_token": "</s>",
-   "errors": "replace",
    "extra_special_tokens": {},
    "mask_token": "<mask>",
    "model_max_length": 512,
    "pad_token": "<pad>",
    "sep_token": "</s>",
-   "tokenizer_class": "RobertaTokenizer",
-   "trim_offsets": true,
+   "tokenizer_class": "XLMRobertaTokenizer",
    "unk_token": "<unk>"
  }