---
language:
- en
license: mit
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- recruitment
- job-description
- applai
base_model: sentence-transformers/all-mpnet-base-v2
pipeline_tag: sentence-similarity
---
# AppAI SBERT 4-Way Pairing

Fine-tuned `sentence-transformers/all-mpnet-base-v2` for job description (JD) embedding generation in the AppAI recruitment matching pipeline.

The model encodes job descriptions into four 768-dim, L2-normalised span embeddings (full, education, experience, and leadership), enabling granular, section-level candidate matching when paired with the LayoutLMv3 resume encoder and the ConFiT alignment model.
## Model Details
| Property | Value |
|---|---|
| Base model | sentence-transformers/all-mpnet-base-v2 |
| Max sequence length | 384 tokens |
| Embedding dimension | 768 |
| Normalisation | L2 (unit norm) |
| Training objective | 4-way paired cosine similarity |
| Batch size | 64 |
### Architecture

Input text → MPNet tokenizer → all-mpnet-base-v2 → Mean Pooling → L2 Normalize → 768-dim embedding
Training pairs each JD span against its corresponding resume span across four named features. Each feature is encoded independently, giving fine-grained control over education, experience, and leadership matching in addition to the full-text representation.
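The mean-pooling and L2-normalisation step in the pipeline above can be sketched as follows. This is an illustrative re-implementation in NumPy, not the model's actual internals; the function name and shapes are assumptions.

```python
import numpy as np

def pool_and_normalize(token_embeddings, attention_mask):
    """Mean-pool token embeddings over real (non-padding) tokens,
    then L2-normalise to unit length.

    token_embeddings: (seq_len, 768) array of per-token vectors
    attention_mask:   (seq_len,) array of 0/1 flags
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=0)
    counts = np.clip(mask.sum(), 1e-9, None)   # avoid division by zero
    pooled = summed / counts                   # mean pooling
    return pooled / np.linalg.norm(pooled)     # L2 normalise to unit norm
```

With unit-norm outputs, cosine similarity between two embeddings reduces to a plain dot product, which is why the document's later similarity comparisons can skip explicit normalisation.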
## Intended Use
This model is part of the AppAI recruitment intelligence pipeline:
- **This model** → encodes JD text spans (SBERT)
- `Smutypi3/applai-layoutlmv3` → encodes resume PDF spans (LayoutLMv3)
- `Smutypi3/applai-confit` → aligns both embedding spaces (ConFiT)
It is intended for cosine similarity-based candidate ranking within the AppAI system. It is not designed for general-purpose semantic search.
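Cosine similarity-based ranking over the four span embeddings can be sketched as below. This is a minimal illustration under assumptions (the averaging scheme and function names are hypothetical, not the production AppAI scoring code); it relies on the embeddings being unit-norm, so cosine similarity is a dot product.

```python
import numpy as np

SECTIONS = ["full", "education", "experience", "leadership"]

def rank_candidates(jd_embeddings, candidates):
    """Rank candidates by mean per-section cosine similarity.

    jd_embeddings: {section: unit-norm 768-dim vector}
    candidates:    {candidate_name: {section: unit-norm 768-dim vector}}
    Returns a list of (candidate_name, score) sorted best-first.
    """
    scores = {}
    for name, resume in candidates.items():
        sims = [float(np.dot(jd_embeddings[s], resume[s])) for s in SECTIONS]
        scores[name] = sum(sims) / len(sims)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Averaging the four section scores equally is just one possible choice; a deployment could weight experience or education differently per role.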
## Usage

### Installation

```bash
pip install sentence-transformers
```
### Encoding a Job Description

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Smutypi3/applai-sbert")
model.max_seq_length = 384

jd_text = "We are looking for a senior software engineer with 5+ years of Python experience..."
embedding = model.encode(jd_text, convert_to_tensor=True, normalize_embeddings=True)
print(embedding.shape)  # torch.Size([768])
```
### Full Pipeline (via AppAI inference service)

```python
from ai_models.preprocessing.jd_preprocessor import preprocess_jd
from ai_models.services.sbert_service import encode_jd_spans

spans = preprocess_jd(raw_jd_html)
# {"full": "...", "education": "...", "experience": "...", "leadership": "..."}

embeddings = encode_jd_spans(spans)
# {"full": [...], "education": [...], "experience": [...], "leadership": [...]}
```
## Training Details

### Preprocessing
Job descriptions are preprocessed before encoding to exactly mirror the training pipeline:
- HTML is stripped and entities are unescaped
- Full span: `clean_text(preserve_tech=True)` keeps `.`, `+`, and `#` so tokens like C++, C#, and .NET survive
- Section spans: `clean_text(preserve_tech=False)` plus keyword-based sentence filtering
- Fallback: `"no specific information available"` when the extracted span is ≤ 10 characters
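The fallback rule above can be restated as a few lines of Python. This is a hypothetical re-implementation for illustration (the function name is invented; the threshold and placeholder string come from the list above):

```python
# Placeholder emitted when a section span is too short to be meaningful,
# so every section still produces an embedding downstream.
FALLBACK = "no specific information available"

def finalize_span(extracted: str) -> str:
    """Return the cleaned span, or the fallback if it is <= 10 characters."""
    cleaned = extracted.strip()
    return cleaned if len(cleaned) > 10 else FALLBACK
```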
### Training Objective

A 4-way pairing loss is applied over (JD full, JD education, JD experience, JD leadership) vs. (resume full, resume education, resume experience, resume leadership) pairs using cosine similarity.
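A hedged sketch of this objective: for each of the four named spans, the cosine similarity between the JD embedding and its paired resume embedding is pushed toward 1. The exact loss form used in training is not stated in this card; the mean of `1 - cos` below is one common formulation, shown only to make the pairing structure concrete.

```python
import numpy as np

SECTIONS = ["full", "education", "experience", "leadership"]

def four_way_pairing_loss(jd, resume):
    """Mean (1 - cosine similarity) over the four paired spans.

    jd, resume: {section: unit-norm 768-dim vector}
    Returns 0.0 for identical pairs, up to 2.0 for opposite vectors.
    """
    losses = [1.0 - float(np.dot(jd[s], resume[s])) for s in SECTIONS]
    return sum(losses) / len(losses)
```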
## Limitations

- Designed for English-language job descriptions
- Section extraction relies on keyword matching; niche or unconventional JD formats may not extract cleanly
- Best used together with `Smutypi3/applai-layoutlmv3` and `Smutypi3/applai-confit`; standalone cosine similarity without alignment will not reflect trained performance
## Citation

```bibtex
@software{lucero2025applai_sbert,
  author    = {Lucero, Jaime Emmanuel},
  title     = {{AppAI SBERT 4-Way Pairing}: Fine-tuned Sentence Transformer for Job Description Embedding},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Smutypi3/applai-sbert},
  note      = {Part of the AppAI recruitment intelligence pipeline}
}
```
## License
MIT