---
license: apache-2.0
language:
- en
- pt
tags:
- ner
- human
- hr
- recruit
---

# Entity Extraction NER Model for CVs and JDs (Skills & Experience)

This is a `roberta-base` model fine-tuned for **Named Entity Recognition (NER)** on Human Resources documents, specifically résumés (CVs) and job descriptions (JDs).

The model was trained on a private dataset of approximately **20,000 examples** generated using a **Weak Labeling** strategy. Its primary goal is to extract skills and quantifiable years of experience from free-form text.
## Recognized Entities

The model is trained to extract five entity types (11 BIO labels):

* **`SKILL`**: Technical skills, software, or tools.
    * *Examples: "Python", "machine learning", "React", "AWS"*
* **`SOFT_SKILL`**: Interpersonal and behavioral skills.
    * *Examples: "leadership", "personal initiative"*
* **`LANG`**: Language proficiencies.
    * *Examples: "english language proficiency", "portuguese language proficiency"*
* **`CERT`**: Professional certifications.
    * *Examples: "AWS Certified Solutions Architect", "PMP"*
* **`EXPERIENCE_DURATION`**: Text spans that describe a duration of time.
    * *Examples: "5+ years", "6 months", "3-5 anos" (Portuguese: "3-5 years"), "two years of experience"*

## How to Use (Python)

You can use this model directly with the `token-classification` (or `ner`) pipeline from the `transformers` library.

```python
from transformers import pipeline

# Load the model from the Hub
model_id = "feliponi/hirly-ner-multi"

# Initialize the pipeline
# aggregation_strategy="simple" groups B- and I- tags (e.g., B-SKILL, I-SKILL -> SKILL)
extractor = pipeline(
    "ner",
    model=model_id,
    aggregation_strategy="simple"
)

# Example text
text = """
Data Scientist with 5+ years of experience in Python and machine learning.
Also 6 months in Java.

Soft skills:
inclusive leadership
paradigm thinking
performance optimization
personal initiative

english language proficiency
portuguese language proficiency

AWS Certified Solutions Architect - Associate"""

# Get entities
entities = extractor(text)

# Filter for high confidence
min_confidence = 0.7
confident_entities = [e for e in entities if e['score'] >= min_confidence]

# Print the results
for entity in confident_entities:
    print(f"[{entity['entity_group']}] {entity['word']} (Confidence: {entity['score']:.2f})")
```

**Raw pipeline output (`entities`), before the confidence filter and formatted printing:**

```
[{'entity_group': 'SKILL', 'score': np.float32(0.9340167), 'word': 'Data Scientist', 'start': 1, 'end': 15},
{'entity_group': 'EXPERIENCE_DURATION', 'score': np.float32(0.9998663), 'word': ' 5+ years', 'start': 21, 'end': 29},
{'entity_group': 'SKILL', 'score': np.float32(0.99859816), 'word': ' Python', 'start': 47, 'end': 53},
{'entity_group': 'SKILL', 'score': np.float32(0.9998181), 'word': ' machine learning', 'start': 58, 'end': 74},
{'entity_group': 'EXPERIENCE_DURATION', 'score': np.float32(0.9998392), 'word': ' 6 months', 'start': 81, 'end': 89},
{'entity_group': 'SKILL', 'score': np.float32(0.9982002), 'word': ' Java', 'start': 93, 'end': 97},
{'entity_group': 'SOFT_SKILL', 'score': np.float32(0.995745), 'word': ' leadership', 'start': 124, 'end': 134},
{'entity_group': 'SOFT_SKILL', 'score': np.float32(0.9859735), 'word': 'performance optimization', 'start': 153, 'end': 177},
{'entity_group': 'SOFT_SKILL', 'score': np.float32(0.98516375), 'word': 'personal initiative', 'start': 178, 'end': 197},
{'entity_group': 'LANG', 'score': np.float32(0.96456385), 'word': 'english language proficiency', 'start': 199, 'end': 227},
{'entity_group': 'LANG', 'score': np.float32(0.9288162), 'word': 'portuguese language proficiency', 'start': 228, 'end': 259},
{'entity_group': 'SKILL', 'score': np.float32(0.926032), 'word': 'AWS', 'start': 261, 'end': 264},
{'entity_group': 'SOFT_SKILL', 'score': np.float32(0.9559879), 'word': ' Solutions', 'start': 275, 'end': 284},
{'entity_group': 'SKILL', 'score': np.float32(0.84499276), 'word': ' Architect', 'start': 285, 'end': 294}]
```
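
The raw list above can be grouped by entity type and deduplicated for downstream use. A minimal sketch, assuming the pipeline output format shown above (the `group_entities` helper is illustrative, not part of the model):

```python
from collections import defaultdict

def group_entities(entities, min_confidence=0.7):
    """Group NER pipeline output by entity type, keeping confident, unique spans."""
    grouped = defaultdict(list)
    for ent in entities:
        word = ent["word"].strip()
        if ent["score"] >= min_confidence and word and word not in grouped[ent["entity_group"]]:
            grouped[ent["entity_group"]].append(word)
    return dict(grouped)

# Hand-written sample in the same shape as the pipeline output above
sample = [
    {"entity_group": "SKILL", "score": 0.9986, "word": " Python"},
    {"entity_group": "EXPERIENCE_DURATION", "score": 0.9999, "word": " 5+ years"},
    {"entity_group": "SKILL", "score": 0.33, "word": "."},  # dropped: low confidence
]
print(group_entities(sample))
# {'SKILL': ['Python'], 'EXPERIENCE_DURATION': ['5+ years']}
```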

## Training, Performance, and Limitations

This model's performance is a direct result of its training data and weak labeling methodology.

### Performance

The model was validated on a test set of ~2,000 examples, achieving the following F1-scores:

| Entity | F1-Score |
| :--- | :--- |
| **`SKILL`** | **98.9%** |
| **`LANG`** | **99.0%** |
| **`CERT`** | **84.9%** |
| **`SOFT_SKILL`** | **98.6%** |
| **`EXPERIENCE_DURATION`** | **99.8%** |
| **Overall** | **96.3%** |

### Training Methodology

The labels were generated automatically via **Weak Labeling**, not manually annotated.

1. **`EXPERIENCE_DURATION` (pattern-based):** This entity was labeled using a robust set of regular expressions designed to find time-based patterns (e.g., "5+ years", "six months", "3-5 anos"). Its near-perfect F1 score reflects the high precision of this regex approach.

2. **`SKILL`, `SOFT_SKILL`, `LANG`, `CERT` (vocabulary-based):** These four entities were labeled by performing high-speed *exact matching* against four separate vocabulary files (`skills.txt`, `softskills.txt`, `langskills.txt`, `certifications.txt`).

    * **High performance (`SKILL`, `SOFT_SKILL`, `LANG`):** The excellent F1 scores (98-99%) indicate that the vocabularies for these labels were comprehensive and matched the training texts frequently.
    * **Good performance (`CERT`):** The 84.9% F1 score is strong but leaves room for improvement, suggesting the `certifications.txt` vocabulary was less comprehensive. Adding more certification names (e.g., "AWS CSAA", "PMP") to the vocabulary file and retraining would directly improve this label.
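
A minimal sketch of this weak-labeling approach, with an illustrative duration pattern and a tiny illustrative vocabulary (the actual regexes and vocabulary files are private and much larger):

```python
import re

# Illustrative duration pattern; the real labeler used a larger regex family
DURATION_RE = re.compile(
    r"\b\d+(?:-\d+)?\+?\s*(?:years?|months?|anos?|meses)\b", re.IGNORECASE
)

# Illustrative vocabulary; stands in for files like skills.txt
SKILL_VOCAB = {"python", "machine learning", "react", "aws"}

def weak_label(text):
    """Return (start, end, label) spans via regex + exact vocabulary matching."""
    spans = [(m.start(), m.end(), "EXPERIENCE_DURATION")
             for m in DURATION_RE.finditer(text)]
    lowered = text.lower()
    for term in SKILL_VOCAB:
        # Word boundaries avoid matching inside longer words (e.g. "aws" in "flaws")
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            spans.append((m.start(), m.end(), "SKILL"))
    return sorted(spans)

print(weak_label("5+ years of Python and machine learning"))
# [(0, 8, 'EXPERIENCE_DURATION'), (12, 18, 'SKILL'), (23, 39, 'SKILL')]
```

The resulting character spans can then be converted to BIO tags over the tokenized text to build the training examples.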

### Limitations (Important)

* **Vocabulary dependency:** The model is excellent at finding the **8,700 skills** it was trained on. It will *not* reliably find new skills or tools that were absent from the training vocabulary. It functions more as a "high-speed vocabulary extractor" than a "skill concept detector."
* **False positives:** Because the source vocabulary contained generic words, the model learned to tag them as `SKILL` with high confidence. **Users of this model should filter the output** to remove known false positives.
    * *Examples of common false positives: "communication", "leadership", "teamwork", "project", "skills"*.
* **Noise:** The model may occasionally output low-confidence punctuation or noise (e.g., a stray `.` with a score around 0.33). It is highly recommended to **filter results by confidence score (e.g., `score > 0.7`)** for clean output.
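
In practice both filters can be combined in one post-processing pass; a minimal sketch (the stoplist and threshold below are illustrative examples, not values shipped with the model):

```python
# Illustrative stoplist of generic words the model tends to over-tag as SKILL
FALSE_POSITIVES = {"communication", "leadership", "teamwork", "project", "skills"}

def filter_entities(entities, min_confidence=0.7):
    """Drop low-confidence spans and known generic-word false positives."""
    kept = []
    for ent in entities:
        word = ent["word"].strip()
        if ent["score"] < min_confidence:
            continue  # stray punctuation and noise tend to score low
        if ent["entity_group"] == "SKILL" and word.lower() in FALSE_POSITIVES:
            continue  # generic vocabulary words learned as SKILL
        kept.append(ent)
    return kept

sample = [
    {"entity_group": "SKILL", "score": 0.98, "word": " Python"},
    {"entity_group": "SKILL", "score": 0.95, "word": " communication"},  # stoplisted
    {"entity_group": "SKILL", "score": 0.33, "word": "."},               # low confidence
]
print([e["word"].strip() for e in filter_entities(sample)])
# ['Python']
```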