---
tags:
- sentence-transformers
- text-classification
- feature-extraction
- pytorch
---

# CV Embedding & Recruitment Classification Models

This directory contains the models used for the fictional IT recruitment classification application.

## Embedding Model (`all-MiniLM-L6-v2`)

We use the `sentence-transformers/all-MiniLM-L6-v2` model to encode candidate CVs into 384-dimensional dense vectors.

### Why `all-MiniLM-L6-v2`?

1. **Efficiency and Speed**: It is fast and lightweight, making it well suited to rapid inference in web applications without requiring a GPU.
2. **Quality of Sentence Embeddings**: Despite its small size, it performs well on Semantic Textual Similarity (STS) tasks. It maps sentences and paragraphs to a 384-dimensional dense vector space that captures the semantic content of candidate CVs and their alignment with job descriptions.
3. **Versatility**: It can be used both as a feature extractor for downstream classification tasks (as we did here) and for finding the most similar historical candidates through cosine similarity.

## Classifiers

Three classifiers were trained on the embeddings generated by the model above:

- **Random Forest (`rf_model.pkl`)**: An ensemble method that provides robust predictions and can be interpreted via feature importance.
- **Support Vector Machine (`svm_model.pkl`)**: A linear-kernel SVM, which is well suited to high-dimensional spaces like our 384-dimensional text embeddings.
- **PyTorch Neural Network (`nn_model.pt`)**: A Multi-Layer Perceptron (MLP) with a hidden layer and dropout. This model typically achieves the highest accuracy and F1 score for this specific task and is used as the primary prediction engine in the application.

## Files

- `cv_embeddings.pt`: The pre-computed embeddings for the entire dataset of 10,000 CVs, used for fast semantic search (k-nearest neighbors) during inference.
- `cv_metadata.json`: The raw text data and labels corresponding to the pre-computed embeddings.
- `best_model_info.json`: Specifies which classifier performed best during training.
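
The semantic search mentioned above (k-nearest neighbors by cosine similarity over the pre-computed embeddings) can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the application's actual code: the function names `cosine_similarity` and `top_k_similar` are hypothetical, and small 3-dimensional toy vectors stand in for the 384-dimensional embeddings that would be loaded from `cv_embeddings.pt`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, embeddings, k=3):
    """Return (index, similarity) pairs for the k rows most similar to query."""
    scored = [(i, cosine_similarity(query, row)) for i, row in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins (assumption): real rows would be 384-dimensional CV embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.05, 0.0]

neighbours = top_k_similar(toy_query, toy_embeddings, k=2)
for index, score in neighbours:
    print(f"row {index}: similarity {score:.4f}")
```

The returned indices would then be used to look up the matching text and labels, presumably in `cv_metadata.json`. In the real application the same ranking is more likely computed with vectorized tensor operations than with Python loops; the loop form is used here only to keep the idea explicit.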