---
license: apache-2.0
language:
- en
- pt
tags:
- ner
- human
- hr
- recruit
---
# Entity Extraction NER Model for CVs and JDs (Skills & Experience)
This is a `roberta-base` model fine-tuned for **Named Entity Recognition (NER)** on Human Resources documents, specifically Résumés (CVs) and Job Descriptions (JDs).
The model was trained on a private dataset of approximately **20,000 examples** generated using a **Weak Labeling** strategy. Its primary goal is to extract skills and quantifiable years of experience from free-form text.
## Recognized Entities
The model is trained to extract five entity types (11 BIO labels):
* **`SKILL`**: Technical skills, software, or tools.
    * *Examples: "Python", "machine learning", "React", "AWS"*
* **`SOFT_SKILL`**: Interpersonal and behavioral skills.
    * *Examples: "leadership", "personal initiative"*
* **`LANG`**: Language proficiencies.
    * *Examples: "english language proficiency", "portuguese language proficiency"*
* **`CERT`**: Professional certifications.
    * *Example: "AWS Certified Solutions Architect - Associate"*
* **`EXPERIENCE_DURATION`**: Text spans that describe a duration of time.
    * *Examples: "5+ years", "6 months", "3-5 anos", "two years of experience"*
## How to Use (Python)
You can use this model directly with the `token-classification` (or `ner`) pipeline from the `transformers` library.
```python
from transformers import pipeline

# Model ID on the Hub
model_id = "feliponi/hirly-ner-multi"

# Initialize the pipeline
# aggregation_strategy="simple" groups B- and I- tags (e.g., B-SKILL, I-SKILL -> SKILL)
extractor = pipeline(
    "ner",
    model=model_id,
    aggregation_strategy="simple",
)

# Example text
text = """
Data Scientist with 5+ years of experience in Python and machine learning.
Also 6 months in Java.
Soft skills:
inclusive leadership
paradigm thinking
performance optimization
personal initiative
english language proficiency
portuguese language proficiency
AWS Certified Solutions Architect - Associate"""

# Get entities
entities = extractor(text)

# Filter for high confidence
min_confidence = 0.7
confident_entities = [e for e in entities if e["score"] >= min_confidence]

# Print the results
for entity in confident_entities:
    print(f"[{entity['entity_group']}] {entity['word']} (Confidence: {entity['score']:.2f})")
```
**Example raw output of `extractor(text)` (before the confidence filter and formatted printing):**
````
[{'entity_group': 'SKILL', 'score': np.float32(0.9340167), 'word': 'Data Scientist', 'start': 1, 'end': 15},
{'entity_group': 'EXPERIENCE_DURATION', 'score': np.float32(0.9998663), 'word': ' 5+ years', 'start': 21, 'end': 29},
{'entity_group': 'SKILL', 'score': np.float32(0.99859816), 'word': ' Python', 'start': 47, 'end': 53},
{'entity_group': 'SKILL', 'score': np.float32(0.9998181), 'word': ' machine learning', 'start': 58, 'end': 74},
{'entity_group': 'EXPERIENCE_DURATION', 'score': np.float32(0.9998392), 'word': ' 6 months', 'start': 81, 'end': 89},
{'entity_group': 'SKILL', 'score': np.float32(0.9982002), 'word': ' Java', 'start': 93, 'end': 97},
{'entity_group': 'SOFT_SKILL', 'score': np.float32(0.995745), 'word': ' leadership', 'start': 124, 'end': 134},
{'entity_group': 'SOFT_SKILL', 'score': np.float32(0.9859735), 'word': 'performance optimization', 'start': 153, 'end': 177},
{'entity_group': 'SOFT_SKILL', 'score': np.float32(0.98516375), 'word': 'personal initiative', 'start': 178, 'end': 197},
{'entity_group': 'LANG', 'score': np.float32(0.96456385), 'word': 'english language proficiency', 'start': 199, 'end': 227},
{'entity_group': 'LANG', 'score': np.float32(0.9288162), 'word': 'portuguese language proficiency', 'start': 228, 'end': 259},
{'entity_group': 'SKILL', 'score': np.float32(0.926032), 'word': 'AWS', 'start': 261, 'end': 264},
{'entity_group': 'SOFT_SKILL', 'score': np.float32(0.9559879), 'word': ' Solutions', 'start': 275, 'end': 284},
{'entity_group': 'SKILL', 'score': np.float32(0.84499276), 'word': ' Architect', 'start': 285, 'end': 294}]
````
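For downstream use, the aggregated spans can be collapsed into a simple mapping from entity type to extracted phrases. A minimal sketch (the helper name `group_entities` is illustrative, not part of the library API):

```python
from collections import defaultdict

def group_entities(entities, min_confidence=0.7):
    """Group pipeline output by entity type, keeping only confident spans."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_confidence:
            grouped[ent["entity_group"]].append(ent["word"].strip())
    return dict(grouped)

# Entries shaped like the pipeline output above (scores shortened for readability)
sample = [
    {"entity_group": "SKILL", "score": 0.9986, "word": " Python"},
    {"entity_group": "EXPERIENCE_DURATION", "score": 0.9999, "word": " 5+ years"},
    {"entity_group": "SKILL", "score": 0.45, "word": " ."},  # low-confidence noise, dropped
]
print(group_entities(sample))
# {'SKILL': ['Python'], 'EXPERIENCE_DURATION': ['5+ years']}
```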
## Training, Performance, and Limitations
This model's performance is a direct result of its training data and weak labeling methodology.
### Performance
The model was validated on a test set of ~2,000 examples, achieving the following F1-scores:
| Entity | F1-Score |
| :--- | :--- |
| **`SKILL`** | **98.9%** |
| **`LANG`** | **99.0%** |
| **`CERT`** | **84.9%** |
| **`SOFT_SKILL`** | **98.6%** |
| **`EXPERIENCE_DURATION`** | **99.8%** |
| **Overall** | **96.3%** |
### Training Methodology
The labels in the training data were generated automatically via **Weak Labeling**, not manually annotated.
1. **`EXPERIENCE_DURATION` (Pattern-Based):** This entity was labeled using a robust set of regular expressions designed to find time-based patterns (e.g., "5+ years", "six months", "3-5 anos"). Its near-perfect F1 score reflects the high precision of this regex approach.
2. **`SKILL`, `SOFT_SKILL`, `LANG`, `CERT` (Vocabulary-Based):** These four entities were labeled by performing high-speed, *exact matching* against four separate vocabulary files (`skills.txt`, `softskills.txt`, `langskills.txt`, `certifications.txt`).
* **High Performance (`SKILL`, `SOFT_SKILL`, `LANG`):** The excellent F1 scores (98-99%) indicate that the vocabularies for these labels were comprehensive and matched the training texts frequently.
* **Good Performance (`CERT`):** The 84.9% F1 score is strong but shows room for improvement. This score suggests the `certifications.txt` vocabulary was less comprehensive. The model's performance for this label would be directly improved by adding more certification names (e.g., "AWS CSAA", "PMP", etc.) to the vocabulary file and retraining.
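As a rough illustration of this two-pronged approach (not the author's actual pipeline; the pattern and vocabularies below are heavily simplified stand-ins for the real regex family and `*.txt` files), weak BIO labels can be produced like this:

```python
import re

# Simplified duration check; the real pipeline used a more robust set of regexes
TIME_UNITS = {"year", "years", "month", "months", "anos"}

# Stand-ins for skills.txt, softskills.txt, langskills.txt, certifications.txt
VOCAB = {"SKILL": {"python", "java"}, "SOFT_SKILL": {"leadership"}}

def weak_label(tokens):
    """Assign BIO tags: time patterns -> EXPERIENCE_DURATION, vocab hits -> their label."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Pattern-based: a number (optionally "5+") followed by a time unit
        if (re.fullmatch(r"\d+\+?", tokens[i])
                and i + 1 < len(tokens)
                and tokens[i + 1].lower() in TIME_UNITS):
            tags[i] = "B-EXPERIENCE_DURATION"
            tags[i + 1] = "I-EXPERIENCE_DURATION"
            i += 2
            continue
        # Vocabulary-based: exact match against the label vocabularies
        for label, vocab in VOCAB.items():
            if tokens[i].lower() in vocab:
                tags[i] = f"B-{label}"
                break
        i += 1
    return tags

print(weak_label(["5+", "years", "of", "Python", "and", "leadership"]))
# ['B-EXPERIENCE_DURATION', 'I-EXPERIENCE_DURATION', 'O', 'B-SKILL', 'O', 'B-SOFT_SKILL']
```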
### Limitations (Important)
* **Vocabulary Dependency:** The model is excellent at finding the **8,700 skills** it was trained on. It will *not* reliably find new skills or tools that were absent from the training vocabulary. It functions more as a "high-speed vocabulary extractor" than a "skill concept detector."
* **False Positives:** Because the source vocabulary contained generic words, the model learned to tag them as `SKILL` with high confidence. **Users of this model should filter the output** to remove known false positives.
* *Examples of common false positives: "communication", "leadership", "teamwork", "project", "skills"*.
* **Noise:** The model may occasionally output low-confidence punctuation or noise (e.g., a stray `.` with a score around 0.33). It is highly recommended to **filter results by confidence score (e.g., `score > 0.7`)** for clean output.
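Putting the recommendations above together, a minimal post-processing sketch (the stoplist contents and helper name are illustrative):

```python
# Generic vocabulary words known to surface as SKILL false positives
FALSE_POSITIVES = {"communication", "leadership", "teamwork", "project", "skills"}

def clean_entities(entities, min_confidence=0.7):
    """Drop low-confidence spans, punctuation noise, and known false positives."""
    cleaned = []
    for ent in entities:
        word = ent["word"].strip()
        if ent["score"] < min_confidence:
            continue  # low-confidence noise (e.g., stray punctuation)
        if not any(ch.isalnum() for ch in word):
            continue  # span contains no letters or digits at all
        if ent["entity_group"] == "SKILL" and word.lower() in FALSE_POSITIVES:
            continue  # generic word tagged as a skill
        cleaned.append(ent)
    return cleaned

sample = [
    {"entity_group": "SKILL", "score": 0.99, "word": " Python"},
    {"entity_group": "SKILL", "score": 0.95, "word": " teamwork"},  # false positive
    {"entity_group": "SKILL", "score": 0.33, "word": " ."},         # noise
]
print([e["word"].strip() for e in clean_entities(sample)])
# ['Python']
```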