---
tags:
- sentence-transformers
- text-classification
- feature-extraction
- pytorch
---
# CV Embedding & Recruitment Classification Models
This directory contains the models used for the fictional IT recruitment classification application.
## Embedding Model (`all-MiniLM-L6-v2`)
We utilize the `sentence-transformers/all-MiniLM-L6-v2` model to encode candidate CVs into 384-dimensional dense vectors.
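The encoding step can be sketched as follows. This is a minimal illustration, not the application's actual code: the helper names `load_embedding_model` and `embed_cvs` are hypothetical, and the example assumes the `sentence-transformers` package is installed when the model is loaded.

```python
import numpy as np

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
EMBEDDING_DIM = 384  # output dimension stated in this README


def load_embedding_model():
    """Load the embedding model (hypothetical helper; downloads weights on first use)."""
    # Imported lazily so embed_cvs can be used without the dependency installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(MODEL_NAME)


def embed_cvs(model, cv_texts):
    """Encode CV texts with any object exposing .encode(); returns an (n, 384) array."""
    cv_texts = list(cv_texts)
    embeddings = np.asarray(model.encode(cv_texts))
    if embeddings.shape != (len(cv_texts), EMBEDDING_DIM):
        raise ValueError(f"unexpected embedding shape {embeddings.shape}")
    return embeddings


# Example usage (not executed here because it downloads the model):
# model = load_embedding_model()
# vectors = embed_cvs(model, ["Backend developer, 5 years of Python and Django."])
```

Passing the model in as an argument keeps the helper easy to test with a stand-in object and avoids reloading the weights on every call.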
### Why `all-MiniLM-L6-v2`?
1. **Efficiency and Speed**: It is a compact 6-layer model (roughly 22 million parameters), which keeps it fast and lightweight enough for rapid inference in web applications without requiring a GPU.
2. **Quality of Sentence Embeddings**: Despite its small size, it performs well on Semantic Textual Similarity (STS) tasks. It maps sentences and paragraphs to a 384-dimensional dense vector space, which lets it capture the semantic content of candidate CVs and their alignment with job descriptions.
3. **Versatility**: It can be used both as a feature extractor for downstream classification tasks (as we did here) and for finding the most similar historical candidates through cosine similarity.
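To illustrate the similarity-search use mentioned in the last point, here is a small sketch of cosine-similarity retrieval with NumPy. The function name `top_k_similar` is hypothetical, and toy 3-dimensional vectors stand in for the real 384-dimensional embeddings.

```python
import numpy as np


def top_k_similar(query, corpus, k=3):
    """Return (indices, scores) of the k corpus rows most cosine-similar to query."""
    # L2-normalise both sides so that the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query
    # argsort is ascending, so reverse it and keep the first k entries.
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]


# Toy vectors; in the application these would be CV embeddings.
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
query = np.array([1.0, 0.1, 0.0])

indices, scores = top_k_similar(query, corpus, k=2)
print(indices)  # [0 2]
```

Normalising once and using a matrix-vector product scores the whole corpus in a single operation, which is usually fast enough for a few thousand stored vectors.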
## Classifiers
Based on the embeddings generated by the model above, three distinct classifiers were trained:
- **Random Forest (`rf_model.pkl`)**: An ensemble method that provides robust predictions and can be interpreted via feature importance.
- **Support Vector Machine (`svm_model.pkl`)**: A linear-kernel SVM, a family of models that tends to work well in high-dimensional spaces such as our 384-dimensional text embeddings.
- **PyTorch Neural Network (`nn_model.pt`)**: A Multi-Layer Perceptron (MLP) with a hidden layer and dropout. This model typically achieves the highest accuracy and F1 score for this specific task and is used as the primary prediction engine in the application.
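As a rough sketch of what an MLP with one hidden layer and dropout looks like in PyTorch, consider the following. The class name `CVClassifier`, the hidden size, the dropout rate, and the number of classes are all assumptions chosen for illustration; the architecture actually stored in `nn_model.pt` may differ.

```python
import torch
from torch import nn


class CVClassifier(nn.Module):
    """One-hidden-layer MLP over 384-dimensional embeddings (illustrative sizes)."""

    def __init__(self, input_dim=384, hidden_dim=128, num_classes=2, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),  # active only in training mode
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        # Returns raw logits; apply softmax to obtain class probabilities.
        return self.net(x)


model = CVClassifier()
model.eval()  # switch dropout off for inference

with torch.no_grad():
    batch = torch.randn(4, 384)  # stand-in for four CV embeddings
    logits = model(batch)
    probs = torch.softmax(logits, dim=1)

print(probs.shape)  # torch.Size([4, 2])
```

Calling `model.eval()` before inference matters here: with dropout left active, repeated predictions for the same CV would vary from call to call.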
## Files
- `cv_embeddings.pt`: The pre-computed embeddings for the entire 10,000-sample dataset, used for fast semantic search (k-nearest neighbors) during inference.
- `cv_metadata.json`: The raw text data and labels corresponding to the pre-computed embeddings.
- `best_model_info.json`: Specifies which classifier performed best during training.