katsufu/cv_classifier


	---
	tags:
	- sentence-transformers
	- text-classification
	- feature-extraction
	- pytorch
	---

	# CV Embedding & Recruitment Classification Models

	This directory contains the models used for the fictional IT recruitment classification application.

	## Embedding Model (`all-MiniLM-L6-v2`)

	We use the `sentence-transformers/all-MiniLM-L6-v2` model to encode each candidate CV into a 384-dimensional dense vector.
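
As a rough sketch of the encoding step, the helper below wraps `SentenceTransformer.encode`. The function name `encode_cvs`, the injectable `encoder` parameter, and the choice to L2-normalize the vectors are assumptions made for illustration, not details taken from the application.

```python
from typing import Any, Optional, Sequence

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"


def encode_cvs(texts: Sequence[str], encoder: Optional[Any] = None):
    """Encode CV texts into dense vectors, one row per text.

    `encoder` is any object exposing a SentenceTransformer-style `encode`
    method (hypothetical hook, handy for tests). When omitted, the real
    model is loaded, which downloads the weights on first use.
    """
    if encoder is None:
        # Imported lazily so this module can be imported without the dependency.
        from sentence_transformers import SentenceTransformer

        encoder = SentenceTransformer(MODEL_NAME)
    # Assumption: normalize to unit length so that cosine similarity
    # reduces to a plain dot product later on.
    return encoder.encode(list(texts), normalize_embeddings=True)
```

With the real model, `encode_cvs(["Senior Python developer", "Junior QA engineer"])` should return an array of shape `(2, 384)`.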

	### Why `all-MiniLM-L6-v2`?
	1. Efficiency and Speed: With only six transformer layers, it is fast and lightweight, which makes it suitable for low-latency inference in web applications without a GPU.
	2. Quality of Sentence Embeddings: Despite its small size, it performs well on Semantic Textual Similarity (STS) tasks. It maps sentences and paragraphs to a 384-dimensional dense vector space, capturing enough of a CV's semantic content to compare it with job descriptions.
	3. Versatility: It can serve both as a feature extractor for downstream classification (as we did here) and as the basis for finding the most similar historical candidates via cosine similarity.
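
The cosine-similarity lookup mentioned in point 3 can be sketched in plain Python. The function names and the toy 3-dimensional vectors are illustrative assumptions; in practice the inputs would be the 384-dimensional embeddings, and a vectorized library would be faster.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k candidates closest to the query."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy example: candidate 1 points in the same direction as the query.
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
top = most_similar(query, candidates, k=2)
```

Here `top` lists candidate 1 first with a score of 1.0, followed by candidate 2 with a score of roughly 0.707. Vector length does not affect the ranking, only direction does.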

	## Classifiers

	Based on the embeddings generated by the model above, three distinct classifiers were trained:
	- Random Forest (`rf_model.pkl`): An ensemble method that provides robust predictions and can be interpreted via feature importance.
	- Support Vector Machine (`svm_model.pkl`): A linear-kernel SVM, a model family that tends to work well in high-dimensional spaces such as our 384-dimensional text embeddings.
	- PyTorch Neural Network (`nn_model.pt`): A Multi-Layer Perceptron (MLP) with a hidden layer and dropout. This model typically achieves the highest accuracy and F1 score for this specific task and is used as the primary prediction engine in the application.
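
For orientation, an MLP with one hidden layer and dropout could look like the sketch below. The class name, the hidden size of 128, the dropout rate of 0.3, and the two output classes are placeholder assumptions; the actual architecture stored in `nn_model.pt` is not specified here.

```python
import torch
from torch import nn


class CVClassifierMLP(nn.Module):
    """Hypothetical MLP over 384-dimensional CV embeddings."""

    def __init__(
        self,
        input_dim: int = 384,    # matches the embedding size
        hidden_dim: int = 128,   # assumed value
        num_classes: int = 2,    # assumed value
        dropout: float = 0.3,    # assumed value
    ) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # Returns raw logits; apply softmax only if probabilities are needed.
        return self.net(embeddings)


model = CVClassifierMLP()
model.eval()  # switches dropout off for inference
with torch.no_grad():
    logits = model(torch.randn(4, 384))  # a batch of 4 random stand-in embeddings
    predicted = logits.argmax(dim=1)     # one class index per CV
```

Calling `model.eval()` before inference matters because dropout would otherwise randomly zero hidden activations and make predictions non-deterministic.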

	## Files
	- `cv_embeddings.pt`: The pre-computed embeddings for all 10,000 entries in the dataset, used for fast semantic search (k-nearest neighbors) during inference.
	- `cv_metadata.json`: The raw text data and labels corresponding to the pre-computed embeddings.
	- `best_model_info.json`: Specifies which classifier performed best during training.
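
One way an application might use `best_model_info.json` to pick a classifier file is sketched below. The JSON schema is not documented here, so the `best_model` key and the short names `rf`, `svm`, and `nn` are purely hypothetical; only the file names come from this README. The fallback to the neural network follows the statement above that it is the primary prediction engine.

```python
import json
from pathlib import Path

# Hypothetical short names mapped to the model files listed in this README.
MODEL_FILES = {
    "rf": "rf_model.pkl",
    "svm": "svm_model.pkl",
    "nn": "nn_model.pt",
}


def resolve_best_model(
    model_dir: str,
    info_name: str = "best_model_info.json",
    key: str = "best_model",   # assumed key name, not confirmed by the source
    default: str = "nn",
) -> Path:
    """Return the path of the classifier file named in the info file."""
    info_path = Path(model_dir) / info_name
    choice = default
    if info_path.exists():
        with info_path.open(encoding="utf-8") as fh:
            info = json.load(fh)
        choice = info.get(key, default)
    if choice not in MODEL_FILES:
        raise ValueError(f"Unknown model name in {info_name}: {choice!r}")
    return Path(model_dir) / MODEL_FILES[choice]
```

The returned path can then be handed to the matching loader for that file type.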