---
tags:
- sentence-transformers
- text-classification
- feature-extraction
- pytorch
---

# CV Embedding & Recruitment Classification Models

This directory contains the models used for the fictional IT recruitment classification application.

## Embedding Model (`all-MiniLM-L6-v2`)

We use the `sentence-transformers/all-MiniLM-L6-v2` model to encode each candidate CV into a 384-dimensional dense vector.
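
The encoding step can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the helper name `embed_cvs` and the optional `model` argument are assumptions introduced here so the function can be exercised without downloading weights. Note that, per the model's own card, input longer than 256 word pieces is truncated by default, so very long CVs may need to be shortened or chunked first.

```python
from typing import Any, Optional, Sequence

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"


def embed_cvs(texts: Sequence[str], model: Optional[Any] = None):
    """Encode CV texts into one 384-dimensional vector per text.

    `embed_cvs` and the `model` parameter are hypothetical names used for
    illustration. Any object exposing a compatible `encode` method works.
    """
    if model is None:
        # Imported lazily so the module loads even when the library is absent.
        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer(MODEL_NAME)
    # normalize_embeddings=True returns unit-length vectors, so a plain
    # dot product between two embeddings equals their cosine similarity.
    return model.encode(list(texts), normalize_embeddings=True)
```

Calling `embed_cvs(["Senior Python developer, 5 years of Django"])` with the real model should return an array of shape `(1, 384)`.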

### Why `all-MiniLM-L6-v2`?
1. **Efficiency and Speed**: It is a compact 6-layer model that is fast enough for CPU inference, which makes it suitable for web applications without a GPU.
2. **Quality of Sentence Embeddings**: Despite its small size, it performs well on Semantic Textual Similarity (STS) tasks. It maps sentences and paragraphs to a 384-dimensional dense vector space that captures the semantic content of candidate CVs and how closely they align with job descriptions.
3. **Versatility**: It can serve both as a feature extractor for downstream classification (as done here) and as the basis for finding the most similar historical candidates via cosine similarity.
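
To make the cosine-similarity search concrete, here is a small sketch using NumPy. The function name `top_k_similar` and the toy vectors are illustrative assumptions, and the sketch assumes no embedding is the zero vector; the application may implement the search differently.

```python
import numpy as np


def top_k_similar(query, corpus, k=5):
    """Return (index, cosine similarity) for the k corpus rows closest to query."""
    query = np.asarray(query, dtype=np.float64)
    corpus = np.asarray(corpus, dtype=np.float64)
    # Scale everything to unit length so the dot product is the cosine.
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    k = min(k, len(scores))
    # Sort by descending similarity and keep the first k indices.
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-dimensional example; real embeddings have 384 dimensions.
neighbours = top_k_similar([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], k=2)
```

In the toy example the query matches row 0 exactly, and row 2 (at 45 degrees) ranks second.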

## Classifiers

Based on the embeddings generated by the model above, three distinct classifiers were trained:
- **Random Forest (`rf_model.pkl`)**: An ensemble method that provides robust predictions and can be interpreted via feature importance.
- **Support Vector Machine (`svm_model.pkl`)**: A linear-kernel SVM, which is well suited to high-dimensional inputs such as our 384-dimensional text embeddings.
- **PyTorch Neural Network (`nn_model.pt`)**: A Multi-Layer Perceptron (MLP) with a hidden layer and dropout. This model typically achieves the highest accuracy and F1 score for this specific task and is used as the primary prediction engine in the application.
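
For orientation, an MLP of the kind described above (one hidden layer plus dropout on 384-dimensional inputs) might look like the sketch below. The class name `CVClassifier`, the hidden size of 128, the dropout rate of 0.3, and the two output classes are all assumptions made for illustration; the architecture actually stored in `nn_model.pt` may differ.

```python
import torch
from torch import nn


class CVClassifier(nn.Module):
    """Hypothetical MLP: 384-d embedding -> hidden layer -> class logits."""

    def __init__(self, input_dim=384, hidden_dim=128, num_classes=2, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),  # only active in training mode
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.net(x)


model = CVClassifier()
model.eval()  # disable dropout for inference
with torch.no_grad():
    # Random stand-in for a batch of four CV embeddings.
    logits = model(torch.randn(4, 384))
    probs = torch.softmax(logits, dim=1)
```

Each row of `probs` sums to 1 and can be read as per-class probabilities for one CV.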

## Files
- `cv_embeddings.pt`: The pre-computed embeddings for the entire 10,000-record dataset, used for fast semantic search (k-nearest neighbors) during inference.
- `cv_metadata.json`: The raw text data and labels corresponding to the pre-computed embeddings.
- `best_model_info.json`: Specifies which classifier performed best during training.
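
One way an application could use `best_model_info.json` to decide which classifier file to load is sketched below. The schema of that file is not documented here, so the `"best_model"` key and the classifier names in `MODEL_FILES` are hypothetical; only the three file names come from this README.

```python
import json
from pathlib import Path

# Hypothetical mapping from a classifier name to its file in this directory.
MODEL_FILES = {
    "random_forest": "rf_model.pkl",
    "svm": "svm_model.pkl",
    "neural_network": "nn_model.pt",
}


def read_model_info(path):
    """Load the JSON file describing the best classifier."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def pick_model_file(info, key="best_model"):
    """Map the recorded classifier name to a file name.

    The `key` default is an assumption about the JSON layout.
    """
    name = info[key]
    if name not in MODEL_FILES:
        raise ValueError(f"Unknown classifier name: {name!r}")
    return MODEL_FILES[name]


# Example with an in-memory dict standing in for the parsed JSON file.
chosen = pick_model_file({"best_model": "neural_network"})
```

With the real file, `pick_model_file(read_model_info("best_model_info.json"))` would return the file name to load, provided the assumed key and names match.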