---
tags:
- sentence-transformers
- text-classification
- feature-extraction
- pytorch
---
# CV Embedding & Recruitment Classification Models
This directory contains the models used for the fictional IT recruitment classification application.
## Embedding Model (`all-MiniLM-L6-v2`)
We use the `sentence-transformers/all-MiniLM-L6-v2` model to encode candidate CVs into 384-dimensional dense vectors.
### Why `all-MiniLM-L6-v2`?
1. **Efficiency and Speed**: It is a small 6-layer model, fast and lightweight enough for low-latency CPU inference in web applications, so no GPU is required.
2. **Quality of Sentence Embeddings**: Despite its small size, it performs well on Semantic Textual Similarity (STS) tasks. It maps sentences and paragraphs to a 384-dimensional dense vector space that captures the semantic content of candidate CVs and how closely they align with job descriptions.
3. **Versatility**: It can be used both as a feature extractor for downstream classification tasks (as we did here) and for finding the most similar historical candidates through cosine similarity.
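
The encoding step and the cosine-similarity comparison described above can be sketched as follows. This is a minimal illustration, not the application's actual code: the helper names `embed_texts` and `cosine_similarity` are hypothetical, and running on CPU with normalized embeddings is an assumption.

```python
import numpy as np

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"


def embed_texts(texts, model_name=MODEL_NAME):
    """Encode a list of strings into L2-normalized 384-dimensional vectors.

    Hypothetical helper: the application may load and cache the model differently.
    """
    # Imported lazily so cosine_similarity() works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name, device="cpu")
    # Returns a NumPy array of shape (len(texts), 384).
    return model.encode(texts, normalize_embeddings=True, convert_to_numpy=True)


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Example usage (downloads the model on first call):
# cv_vec, job_vec = embed_texts(["<candidate CV text>", "<job description text>"])
# print(cosine_similarity(cv_vec, job_vec))
```

Note that this model truncates inputs longer than 256 word pieces by default, so only the beginning of a long CV contributes to its embedding unless the text is chunked first.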
## Classifiers
Based on the embeddings generated by the model above, three distinct classifiers were trained:
- **Random Forest (`rf_model.pkl`)**: An ensemble method that provides robust predictions and can be interpreted via feature importance.
- **Support Vector Machine (`svm_model.pkl`)**: A linear-kernel SVM, which is well suited to high-dimensional inputs such as our 384-dimensional text embeddings.
- **PyTorch Neural Network (`nn_model.pt`)**: A Multi-Layer Perceptron (MLP) with a hidden layer and dropout. This model typically achieves the highest accuracy and F1 score for this specific task and is used as the primary prediction engine in the application.
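
For orientation, an MLP of the kind described (one hidden layer with dropout on top of 384-dimensional embeddings) might look like the sketch below. The class name `CVClassifier`, the hidden size, the dropout rate, and the number of classes are all assumptions chosen for illustration; the architecture actually stored in `nn_model.pt` may differ.

```python
import torch
from torch import nn


class CVClassifier(nn.Module):
    """Hypothetical MLP: 384-d embedding -> hidden layer -> class logits."""

    def __init__(self, input_dim=384, hidden_dim=128, num_classes=2, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),  # active only in training mode
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        # Returns raw logits; apply softmax to obtain probabilities.
        return self.net(x)


@torch.no_grad()
def predict(model, embeddings):
    """Return (predicted class indices, class probabilities) for a batch."""
    model.eval()  # disables dropout for deterministic inference
    probs = torch.softmax(model(embeddings), dim=1)
    return probs.argmax(dim=1), probs
```

Calling `model.eval()` before inference matters here because dropout would otherwise randomly zero hidden units and make predictions non-deterministic.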
## Files
- `cv_embeddings.pt`: The pre-computed embeddings for all 10,000 CVs in the dataset, used for fast semantic search (k-nearest neighbors) during inference.
- `cv_metadata.json`: The raw text data and labels corresponding to the pre-computed embeddings.
- `best_model_info.json`: Specifies which classifier performed best during training.
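
The semantic search over the pre-computed embeddings can be sketched as a cosine-similarity top-k lookup. This is a minimal sketch under stated assumptions: the function name `top_k_similar` is hypothetical, and it assumes the embeddings are available as a single 2-D float tensor of shape `(num_cvs, 384)`; the actual layout of `cv_embeddings.pt` is not documented here.

```python
import torch
import torch.nn.functional as F


def top_k_similar(query, corpus, k=5):
    """Return (scores, indices) of the k corpus rows most similar to the query.

    query:  1-D tensor holding one embedding.
    corpus: 2-D tensor with one embedding per row (assumed layout).
    """
    # Normalize so that the dot product equals cosine similarity.
    query = F.normalize(query.reshape(1, -1), dim=1)
    corpus = F.normalize(corpus, dim=1)
    scores = (corpus @ query.T).squeeze(1)
    k = min(k, corpus.shape[0])  # guard against k larger than the corpus
    top = torch.topk(scores, k)
    return top.values, top.indices
```

The returned indices can then be used to look up the matching rows in `cv_metadata.json`, assuming its entries are stored in the same order as the embeddings.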