---
tags:
- sentence-transformers
- text-classification
- feature-extraction
- pytorch
---
# CV Embedding & Recruitment Classification Models

This directory contains the models used for the fictional IT recruitment classification application.
## Embedding Model (`all-MiniLM-L6-v2`)

We use the `sentence-transformers/all-MiniLM-L6-v2` model to encode candidate CVs into 384-dimensional dense vectors.

### Why `all-MiniLM-L6-v2`?

1. **Efficiency and Speed**: It is small and fast enough for CPU-only inference, which makes it a good fit for web applications that need quick responses without a GPU.
2. **Quality of Sentence Embeddings**: Despite its small size, it performs well on Semantic Textual Similarity (STS) tasks. It maps sentences and paragraphs to a 384-dimensional dense vector space, so the embeddings capture the semantic content of candidate CVs and how closely they align with job descriptions.
3. **Versatility**: It can be used both as a feature extractor for downstream classification tasks (as we did here) and for finding the most similar historical candidates through cosine similarity.
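The two uses described above can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the function names `encode_cvs`, `cosine_similarity`, and `most_similar`, as well as the `top_k` parameter, are hypothetical. The similarity helpers are written in plain Python so the ranking logic is easy to follow.

```python
import math
from typing import List, Sequence, Tuple


def encode_cvs(texts: Sequence[str]):
    """Encode CV texts into 384-dimensional vectors (downloads the model on first use)."""
    # Imported lazily so the similarity helpers below work without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    return model.encode(list(texts))  # array of shape (len(texts), 384)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the top_k candidates closest to the query."""
    scored = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The same vectors returned by `encode_cvs` can be passed to a classifier as features or to `most_similar` to rank historical candidates. For larger collections, a vectorized routine such as `sentence_transformers.util.cos_sim` is likely a better choice than the pure-Python loop.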
## Classifiers

Three classifiers were trained on the embeddings produced by the model above:

- **Random Forest (`rf_model.pkl`)**: An ensemble method that provides robust predictions and can be interpreted via feature importance.
- **Support Vector Machine (`svm_model.pkl`)**: A linear-kernel SVM, which is well suited to high-dimensional inputs such as our 384-dimensional text embeddings.
- **PyTorch Neural Network (`nn_model.pt`)**: A Multi-Layer Perceptron (MLP) with a hidden layer and dropout. This model typically achieves the highest accuracy and F1 score on this task and is used as the primary prediction engine in the application.
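As a rough illustration of the MLP described above, the sketch below builds a network with a single hidden layer and dropout on top of 384-dimensional inputs. The hidden size, dropout rate, number of classes, and the name `build_mlp` are assumptions made for the example; the actual architecture stored in `nn_model.pt` may differ.

```python
def build_mlp(
    input_dim: int = 384,    # matches the embedding size stated above
    hidden_dim: int = 128,   # assumed value, not taken from the source
    num_classes: int = 2,    # assumed value, not taken from the source
    dropout: float = 0.2,    # assumed value, not taken from the source
):
    """Build a one-hidden-layer MLP that maps embeddings to class logits."""
    # Imported inside the function so PyTorch is only required when a model is built.
    from torch import nn

    return nn.Sequential(
        nn.Linear(input_dim, hidden_dim),    # embedding -> hidden layer
        nn.ReLU(),
        nn.Dropout(p=dropout),               # active in train mode, disabled by .eval()
        nn.Linear(hidden_dim, num_classes),  # raw logits, one per class
    )
```

A model built this way outputs raw logits, which would typically be paired with `nn.CrossEntropyLoss` during training and passed through a softmax at inference to obtain class probabilities.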
## Files

- `cv_embeddings.pt`: The pre-computed embeddings for the entire 10,000-CV dataset, used for fast semantic search (k-nearest neighbors) during inference.
- `cv_metadata.json`: The raw text data and labels corresponding to the pre-computed embeddings.
- `best_model_info.json`: Specifies which classifier performed best during training.
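One way an application could use `best_model_info.json` to pick a classifier file is sketched below. The JSON schema is not documented here, so the `best_model` key and the classifier names in `MODEL_FILES` are hypothetical; only the file names come from the lists above.

```python
import json
from pathlib import Path

# Hypothetical mapping from a classifier name to the model files listed above.
MODEL_FILES = {
    "random_forest": "rf_model.pkl",
    "svm": "svm_model.pkl",
    "neural_network": "nn_model.pt",
}


def resolve_best_model(model_dir, info_key: str = "best_model") -> Path:
    """Return the path of the classifier named in best_model_info.json."""
    model_dir = Path(model_dir)
    info_path = model_dir / "best_model_info.json"
    info = json.loads(info_path.read_text(encoding="utf-8"))
    name = info[info_key]  # assumed key; check the real file before relying on it
    if name not in MODEL_FILES:
        raise ValueError(f"Unknown classifier name: {name!r}")
    return model_dir / MODEL_FILES[name]
```

The returned path could then be handed to the matching loader, for example `torch.load` for the `.pt` file or a pickle-compatible loader for the `.pkl` files.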