---
license: apache-2.0
language:
- en
- pt
tags:
- ner
- human
- hr
- recruit
---

# Entity Extraction NER Model for CVs and JDs (Skills & Experience)

This is a `roberta-base` model fine-tuned for **Named Entity Recognition (NER)** on Human Resources documents, specifically résumés (CVs) and job descriptions (JDs). The model was trained on a private dataset of approximately **20,000 examples** generated using a **Weak Labeling** strategy. Its primary goal is to extract skills and quantifiable years of experience from free-form text.

## Recognized Entities

The model is trained to extract five entity types (11 BIO labels):

* **`SKILL`**: Technical skills, software, or tools.
    * *Examples: "Python", "machine learning", "React", "AWS"*
* **`SOFT_SKILL`**: Interpersonal and behavioral skills.
    * *Examples: "leadership", "personal initiative"*
* **`LANG`**: Language proficiencies.
    * *Examples: "english language proficiency", "portuguese language proficiency"*
* **`CERT`**: Professional certifications.
    * *Examples: "AWS Certified Solutions Architect - Associate"*
* **`EXPERIENCE_DURATION`**: Text spans that describe a duration of time.
    * *Examples: "5+ years", "6 months", "3-5 anos", "two years of experience"*

## How to Use (Python)

You can use this model directly with the `token-classification` (or `ner`) pipeline from the `transformers` library.

```python
from transformers import pipeline

# Load the model from the Hub
model_id = "feliponi/hirly-ner-multi"

# Initialize the pipeline
# aggregation_strategy="simple" groups B- and I- tags (e.g., B-SKILL, I-SKILL -> SKILL)
extractor = pipeline(
    "ner",
    model=model_id,
    aggregation_strategy="simple"
)

# Example text
text = """
Data Scientist with 5+ years of experience in Python and machine learning. Also 6 months in Java.
Soft skills: inclusive leadership paradigm thinking performance optimization personal initiative english language proficiency portuguese language proficiency AWS Certified Solutions Architect - Associate"""

# Get entities
entities = extractor(text)

# Filter for high confidence
min_confidence = 0.7
confident_entities = [e for e in entities if e['score'] >= min_confidence]

# Print the results
for entity in confident_entities:
    print(f"[{entity['entity_group']}] {entity['word']} (Confidence: {entity['score']:.2f})")
```

**Expected Output** (the raw entity list returned by `extractor(text)`; the loop above prints a formatted summary of each entry):

```python
[{'entity_group': 'SKILL', 'score': np.float32(0.9340167), 'word': 'Data Scientist', 'start': 1, 'end': 15},
 {'entity_group': 'EXPERIENCE_DURATION', 'score': np.float32(0.9998663), 'word': ' 5+ years', 'start': 21, 'end': 29},
 {'entity_group': 'SKILL', 'score': np.float32(0.99859816), 'word': ' Python', 'start': 47, 'end': 53},
 {'entity_group': 'SKILL', 'score': np.float32(0.9998181), 'word': ' machine learning', 'start': 58, 'end': 74},
 {'entity_group': 'EXPERIENCE_DURATION', 'score': np.float32(0.9998392), 'word': ' 6 months', 'start': 81, 'end': 89},
 {'entity_group': 'SKILL', 'score': np.float32(0.9982002), 'word': ' Java', 'start': 93, 'end': 97},
 {'entity_group': 'SOFT_SKILL', 'score': np.float32(0.995745), 'word': ' leadership', 'start': 124, 'end': 134},
 {'entity_group': 'SOFT_SKILL', 'score': np.float32(0.9859735), 'word': 'performance optimization', 'start': 153, 'end': 177},
 {'entity_group': 'SOFT_SKILL', 'score': np.float32(0.98516375), 'word': 'personal initiative', 'start': 178, 'end': 197},
 {'entity_group': 'LANG', 'score': np.float32(0.96456385), 'word': 'english language proficiency', 'start': 199, 'end': 227},
 {'entity_group': 'LANG', 'score': np.float32(0.9288162), 'word': 'portuguese language proficiency', 'start': 228, 'end': 259},
 {'entity_group': 'SKILL', 'score': np.float32(0.926032), 'word': 'AWS', 'start': 261, 'end': 264},
 {'entity_group': 'SOFT_SKILL', 'score': np.float32(0.9559879), 'word': ' 
Solutions', 'start': 275, 'end': 284},
 {'entity_group': 'SKILL', 'score': np.float32(0.84499276), 'word': ' Architect', 'start': 285, 'end': 294}]
```

## Training, Performance, and Limitations

This model's performance is a direct result of its training data and weak labeling methodology.

### Performance

The model was validated on a test set of ~2,000 examples, achieving the following F1-scores:

| Entity | F1-Score |
| :--- | :--- |
| **`SKILL`** | **98.9%** |
| **`LANG`** | **99.0%** |
| **`CERT`** | **84.9%** |
| **`SOFT_SKILL`** | **98.6%** |
| **`EXPERIENCE_DURATION`** | **99.8%** |
| **Overall** | **96.3%** |

### Training Methodology

The labels were generated automatically by a **Weak Labeling** pipeline, not manually annotated.

1. **`EXPERIENCE_DURATION` (pattern-based):** This entity was labeled using a robust set of regular expressions designed to find time-based patterns (e.g., "5+ years", "six months", "3-5 anos"). Its near-perfect F1 score reflects the high precision of this regex approach.
2. **`SKILL`, `SOFT_SKILL`, `LANG`, `CERT` (vocabulary-based):** These four entities were labeled by high-speed *exact matching* against four separate vocabulary files (`skills.txt`, `softskills.txt`, `langskills.txt`, `certifications.txt`).
    * **High performance (`SKILL`, `SOFT_SKILL`, `LANG`):** The excellent F1 scores (98-99%) indicate that the vocabularies for these labels were comprehensive and matched the training texts frequently.
    * **Good performance (`CERT`):** The 84.9% F1 score is strong but leaves room for improvement, suggesting the `certifications.txt` vocabulary was less comprehensive. Performance for this label would improve directly by adding more certification names (e.g., "AWS CSAA", "PMP") to the vocabulary file and retraining.
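The weak-labeling pipeline above can be sketched roughly as follows. The actual regex set and vocabulary files are private, so the simplified `DURATION_RE` pattern and the tiny `SKILL_VOCAB` stand-in below are illustrative assumptions, not the real training artifacts:

```python
import re

# Simplified stand-in for the private duration regexes (real set also covers
# spelled-out numbers like "six months").
DURATION_RE = re.compile(
    r"\b\d+\+?\s*(?:-\s*\d+\s*)?(?:years?|months?|anos|meses)\b",
    re.IGNORECASE,
)

# Stand-in for skills.txt; the real vocabulary has ~8,700 entries.
SKILL_VOCAB = {"python", "machine learning", "java"}

def weak_label(text):
    """Return (start, end, label) spans via regexes plus exact vocabulary matches."""
    spans = []
    # Pattern-based labeling for EXPERIENCE_DURATION
    for m in DURATION_RE.finditer(text):
        spans.append((m.start(), m.end(), "EXPERIENCE_DURATION"))
    # Vocabulary-based exact matching for SKILL
    lowered = text.lower()
    for term in SKILL_VOCAB:
        i = lowered.find(term)
        while i != -1:
            spans.append((i, i + len(term), "SKILL"))
            i = lowered.find(term, i + 1)
    return sorted(spans)

print(weak_label("5+ years of Python and 6 months in Java"))
# → [(0, 8, 'EXPERIENCE_DURATION'), (12, 18, 'SKILL'),
#    (23, 31, 'EXPERIENCE_DURATION'), (35, 39, 'SKILL')]
```

Character spans like these are then converted to BIO tags over the tokenized text to produce the training labels.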
### Limitations (Important)

* **Vocabulary dependency:** The model is excellent at finding the **8,700 skills** it was trained on, but it will *not* reliably find new skills or tools that were absent from the training vocabulary. It functions more as a high-speed vocabulary extractor than as a skill-concept detector.
* **False positives:** Because the source vocabulary contained generic words, the model learned to tag them as `SKILL` with high confidence. **Users of this model should filter the output** to remove known false positives.
    * *Examples of common false positives: "communication", "leadership", "teamwork", "project", "skills"*.
* **Noise:** The model may occasionally output low-confidence punctuation or noise (e.g., `.` with a 0.33 score). It is highly recommended to **filter results by confidence score (e.g., `score > 0.7`)** for clean output.
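The filtering recommended above can be implemented as a thin post-processing layer over the pipeline output. This is a minimal sketch; the `FALSE_POSITIVES` stoplist below only contains the examples from this card and should be extended for your own data:

```python
# Stoplist of known generic false positives (examples from this card, not exhaustive).
FALSE_POSITIVES = {"communication", "leadership", "teamwork", "project", "skills"}
MIN_CONFIDENCE = 0.7

def clean_entities(entities):
    """Drop low-confidence spans and generic vocabulary artifacts."""
    kept = []
    for e in entities:
        word = e["word"].strip()
        if e["score"] < MIN_CONFIDENCE:
            continue  # drops low-confidence noise such as stray punctuation
        if e["entity_group"] in {"SKILL", "SOFT_SKILL"} and word.lower() in FALSE_POSITIVES:
            continue  # drops generic terms tagged with high confidence
        kept.append({**e, "word": word})
    return kept

# Example with pipeline-style dicts (scores shortened for readability)
raw = [
    {"entity_group": "SKILL", "score": 0.998, "word": " Python", "start": 47, "end": 53},
    {"entity_group": "SOFT_SKILL", "score": 0.995, "word": " leadership", "start": 124, "end": 134},
    {"entity_group": "SKILL", "score": 0.33, "word": ".", "start": 97, "end": 98},
]
print(clean_entities(raw))
# → keeps only the "Python" entity
```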