---
license: apache-2.0
language:
- en
- pt
tags:
- ner
- human
- hr
- recruit
---
# Entity Extraction NER Model for CVs and JDs (Skills & Experience)
This is a `roberta-base` model fine-tuned for **Named Entity Recognition (NER)** on Human Resources documents, specifically Résumés (CVs) and Job Descriptions (JDs).
The model was trained on a private dataset of approximately **20,000 examples** generated using a **Weak Labeling** strategy. Its primary goal is to extract skills and quantifiable years of experience from free-form text.
## Recognized Entities
The model is trained to extract five entity types (11 BIO labels):
* **`SKILL`**: Technical skills, software, or tools.
    * *Examples: "Python", "machine learning", "React", "AWS"*
* **`SOFT_SKILL`**: Interpersonal and behavioral skills.
    * *Examples: "leadership", "personal initiative"*
* **`LANG`**: Language proficiencies.
    * *Examples: "english language proficiency", "portuguese language proficiency"*
* **`CERT`**: Professional certifications.
    * *Example: "AWS Certified Solutions Architect - Associate"*
* **`EXPERIENCE_DURATION`**: Text spans that describe a duration of time.
    * *Examples: "5+ years", "6 months", "3-5 anos", "two years of experience"*
## How to Use (Python)
You can use this model directly with the `token-classification` (or `ner`) pipeline from the `transformers` library.
```python
from transformers import pipeline

# Model ID on the Hub
model_id = "feliponi/hirly-ner-multi"

# Initialize the pipeline
# aggregation_strategy="simple" groups B- and I- tags (e.g., B-SKILL, I-SKILL -> SKILL)
extractor = pipeline(
    "ner",
    model=model_id,
    aggregation_strategy="simple",
)

# Example text
text = """
Data Scientist with 5+ years of experience in Python and machine learning.
Also 6 months in Java.
Soft skills:
inclusive leadership
paradigm thinking
performance optimization
personal initiative
english language proficiency
portuguese language proficiency
AWS Certified Solutions Architect - Associate"""

# Get entities
entities = extractor(text)

# Filter for high confidence
min_confidence = 0.7
confident_entities = [e for e in entities if e["score"] >= min_confidence]

# Print the results
for entity in confident_entities:
    print(f"[{entity['entity_group']}] {entity['word']} (Confidence: {entity['score']:.2f})")
```
**Example raw output of `extractor(text)` (before the confidence filter and formatted printing):**
````
[{'entity_group': 'SKILL', 'score': np.float32(0.9340167), 'word': 'Data Scientist', 'start': 1, 'end': 15},
{'entity_group': 'EXPERIENCE_DURATION', 'score': np.float32(0.9998663), 'word': ' 5+ years', 'start': 21, 'end': 29},
{'entity_group': 'SKILL', 'score': np.float32(0.99859816), 'word': ' Python', 'start': 47, 'end': 53},
{'entity_group': 'SKILL', 'score': np.float32(0.9998181), 'word': ' machine learning', 'start': 58, 'end': 74},
{'entity_group': 'EXPERIENCE_DURATION', 'score': np.float32(0.9998392), 'word': ' 6 months', 'start': 81, 'end': 89},
{'entity_group': 'SKILL', 'score': np.float32(0.9982002), 'word': ' Java', 'start': 93, 'end': 97},
{'entity_group': 'SOFT_SKILL', 'score': np.float32(0.995745), 'word': ' leadership', 'start': 124, 'end': 134},
{'entity_group': 'SOFT_SKILL', 'score': np.float32(0.9859735), 'word': 'performance optimization', 'start': 153, 'end': 177},
{'entity_group': 'SOFT_SKILL', 'score': np.float32(0.98516375), 'word': 'personal initiative', 'start': 178, 'end': 197},
{'entity_group': 'LANG', 'score': np.float32(0.96456385), 'word': 'english language proficiency', 'start': 199, 'end': 227},
{'entity_group': 'LANG', 'score': np.float32(0.9288162), 'word': 'portuguese language proficiency', 'start': 228, 'end': 259},
{'entity_group': 'SKILL', 'score': np.float32(0.926032), 'word': 'AWS', 'start': 261, 'end': 264},
{'entity_group': 'SOFT_SKILL', 'score': np.float32(0.9559879), 'word': ' Solutions', 'start': 275, 'end': 284},
{'entity_group': 'SKILL', 'score': np.float32(0.84499276), 'word': ' Architect', 'start': 285, 'end': 294}]
````
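For downstream use, the aggregated spans can be collapsed into a simple mapping from entity type to extracted phrases. A minimal sketch (the helper name `group_entities` is illustrative, not part of the library API):

```python
from collections import defaultdict

def group_entities(entities, min_confidence=0.7):
    """Group pipeline output by entity type, keeping only confident spans."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_confidence:
            grouped[ent["entity_group"]].append(ent["word"].strip())
    return dict(grouped)

# Entries shaped like the pipeline output above (scores shortened for readability)
sample = [
    {"entity_group": "SKILL", "score": 0.9986, "word": " Python"},
    {"entity_group": "EXPERIENCE_DURATION", "score": 0.9999, "word": " 5+ years"},
    {"entity_group": "SKILL", "score": 0.45, "word": " ."},  # low-confidence noise, dropped
]
print(group_entities(sample))
# {'SKILL': ['Python'], 'EXPERIENCE_DURATION': ['5+ years']}
```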
## Training, Performance, and Limitations
This model's performance is a direct result of its training data and weak labeling methodology.
### Performance
The model was validated on a test set of ~2,000 examples, achieving the following F1-scores:
| Entity | F1-Score |
| :--- | :--- |
| **`SKILL`** | **98.9%** |
| **`LANG`** | **99.0%** |
| **`CERT`** | **84.9%** |
| **`SOFT_SKILL`** | **98.6%** |
| **`EXPERIENCE_DURATION`** | **99.8%** |
| **Overall** | **96.3%** |
### Training Methodology
The labels in the training data were generated automatically via **Weak Labeling**, not manually annotated.
1. **`EXPERIENCE_DURATION` (Pattern-Based):** This entity was labeled using a robust set of regular expressions designed to find time-based patterns (e.g., "5+ years", "six months", "3-5 anos"). Its near-perfect F1 score reflects the high precision of this regex approach.
2. **`SKILL`, `SOFT_SKILL`, `LANG`, `CERT` (Vocabulary-Based):** These four entities were labeled by performing high-speed, *exact matching* against four separate vocabulary files (`skills.txt`, `softskills.txt`, `langskills.txt`, `certifications.txt`).
* **High Performance (`SKILL`, `SOFT_SKILL`, `LANG`):** The excellent F1 scores (98-99%) indicate that the vocabularies for these labels were comprehensive and matched the training texts frequently.
* **Good Performance (`CERT`):** The 84.9% F1 score is strong but shows room for improvement. This score suggests the `certifications.txt` vocabulary was less comprehensive. The model's performance for this label would be directly improved by adding more certification names (e.g., "AWS CSAA", "PMP", etc.) to the vocabulary file and retraining.
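As a rough illustration of this two-pronged approach (not the author's actual pipeline; the pattern and vocabularies below are heavily simplified stand-ins for the real regex family and `*.txt` files), weak BIO labels can be produced like this:

```python
import re

# Simplified duration check; the real pipeline used a more robust set of regexes
TIME_UNITS = {"year", "years", "month", "months", "anos"}

# Stand-ins for skills.txt, softskills.txt, langskills.txt, certifications.txt
VOCAB = {"SKILL": {"python", "java"}, "SOFT_SKILL": {"leadership"}}

def weak_label(tokens):
    """Assign BIO tags: time patterns -> EXPERIENCE_DURATION, vocab hits -> their label."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Pattern-based: a number (optionally "5+") followed by a time unit
        if (re.fullmatch(r"\d+\+?", tokens[i])
                and i + 1 < len(tokens)
                and tokens[i + 1].lower() in TIME_UNITS):
            tags[i] = "B-EXPERIENCE_DURATION"
            tags[i + 1] = "I-EXPERIENCE_DURATION"
            i += 2
            continue
        # Vocabulary-based: exact match against the label vocabularies
        for label, vocab in VOCAB.items():
            if tokens[i].lower() in vocab:
                tags[i] = f"B-{label}"
                break
        i += 1
    return tags

print(weak_label(["5+", "years", "of", "Python", "and", "leadership"]))
# ['B-EXPERIENCE_DURATION', 'I-EXPERIENCE_DURATION', 'O', 'B-SKILL', 'O', 'B-SOFT_SKILL']
```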
### Limitations (Important)
* **Vocabulary Dependency:** The model is excellent at finding the **8,700 skills** it was trained on. It will *not* reliably find new skills or tools that were absent from the training vocabulary. It functions more as a "high-speed vocabulary extractor" than a "skill concept detector."
* **False Positives:** Because the source vocabulary contained generic words, the model learned to tag them as `SKILL` with high confidence. **Users of this model should filter the output** to remove known false positives.
* *Examples of common false positives: "communication", "leadership", "teamwork", "project", "skills"*.
* **Noise:** The model may occasionally output low-confidence punctuation or noise (e.g., a stray `.` with a score around 0.33). It is highly recommended to **filter results by confidence score (e.g., `score > 0.7`)** for clean output.
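Putting the recommendations above together, a minimal post-processing sketch (the stoplist contents and helper name are illustrative):

```python
# Generic vocabulary words known to surface as SKILL false positives
FALSE_POSITIVES = {"communication", "leadership", "teamwork", "project", "skills"}

def clean_entities(entities, min_confidence=0.7):
    """Drop low-confidence spans, punctuation noise, and known false positives."""
    cleaned = []
    for ent in entities:
        word = ent["word"].strip()
        if ent["score"] < min_confidence:
            continue  # low-confidence noise (e.g., stray punctuation)
        if not any(ch.isalnum() for ch in word):
            continue  # span contains no letters or digits at all
        if ent["entity_group"] == "SKILL" and word.lower() in FALSE_POSITIVES:
            continue  # generic word tagged as a skill
        cleaned.append(ent)
    return cleaned

sample = [
    {"entity_group": "SKILL", "score": 0.99, "word": " Python"},
    {"entity_group": "SKILL", "score": 0.95, "word": " teamwork"},  # false positive
    {"entity_group": "SKILL", "score": 0.33, "word": " ."},         # noise
]
print([e["word"].strip() for e in clean_entities(sample)])
# ['Python']
```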