---
language:
- en
license: mit
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- recruitment
- job-description
- applai
base_model: sentence-transformers/all-mpnet-base-v2
pipeline_tag: sentence-similarity
---

# AppAI SBERT 4-Way Pairing

Fine-tuned `sentence-transformers/all-mpnet-base-v2` for job description (JD) embedding generation in the [AppAI](https://github.com/jaimeemanuellucero/applai) recruitment matching pipeline.

The model encodes job descriptions into four **768-dim L2-normalised** span embeddings (**full**, **education**, **experience**, and **leadership**), enabling granular, section-level candidate matching when paired with the LayoutLMv3 resume encoder and the ConFiT alignment model.

---

## Model Details

| Property | Value |
|---|---|
| Base model | `sentence-transformers/all-mpnet-base-v2` |
| Max sequence length | 384 tokens |
| Embedding dimension | 768 |
| Normalisation | L2 (unit norm) |
| Training objective | 4-way paired cosine similarity |
| Batch size | 64 |

### Architecture

```
Input text → MPNet tokenizer → all-mpnet-base-v2 → Mean Pooling → L2 Normalize → 768-dim embedding
```

Training pairs each JD span against its corresponding resume span across four named features. Each feature is encoded independently, giving fine-grained control over education, experience, and leadership matching in addition to the full-text representation.
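
The mean-pooling and L2-normalisation steps above can be sketched in plain NumPy; `token_embeddings` and `attention_mask` here are random stand-ins for the tokenizer/encoder output:

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Masked mean pooling over tokens, then L2 normalisation to unit norm."""
    mask = attention_mask[:, :, None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # sum only real tokens
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid divide-by-zero
    pooled = summed / counts                          # (batch, 768)
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Toy encoder output: batch of 2, sequence length 5, hidden size 768
emb = np.random.randn(2, 5, 768)
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
vecs = mean_pool_and_normalize(emb, mask)
print(vecs.shape)  # (2, 768), each row unit-norm
```

Because every output row is unit-norm, downstream cosine similarity reduces to a plain dot product.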

---

## Intended Use

This model is part of the **AppAI** recruitment intelligence pipeline:

1. **This model** encodes JD text spans (SBERT)
2. [`Smutypi3/applai-layoutlmv3`](https://huggingface.co/Smutypi3/applai-layoutlmv3) encodes resume PDF spans (LayoutLMv3)
3. [`Smutypi3/applai-confit`](https://huggingface.co/Smutypi3/applai-confit) aligns both embedding spaces (ConFiT)

It is intended for **cosine similarity-based candidate ranking** within the AppAI system. It is not designed for general-purpose semantic search.
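
Since the embeddings are L2-normalised, cosine similarity is just a dot product. A minimal ranking sketch, with tiny made-up vectors standing in for real 768-dim JD and resume embeddings:

```python
import numpy as np

def rank_candidates(jd_vec, candidate_vecs, names):
    """Rank candidates by cosine similarity to a JD embedding.
    All vectors are assumed L2-normalised, so cosine == dot product."""
    scores = candidate_vecs @ jd_vec               # (n_candidates,)
    order = np.argsort(-scores)                    # descending similarity
    return [(names[i], float(scores[i])) for i in order]

def unit(v):
    return v / np.linalg.norm(v)

jd = unit(np.array([1.0, 0.0, 0.0]))               # toy 3-dim stand-in
cands = np.stack([unit(np.array([0.9, 0.1, 0.0])),
                  unit(np.array([0.0, 1.0, 0.0]))])
ranking = rank_candidates(jd, cands, ["alice", "bob"])
print(ranking[0][0])  # alice (closest to the JD vector)
```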

---

## Usage

### Installation

```bash
pip install sentence-transformers
```

### Encoding a Job Description

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Smutypi3/applai-sbert")
model.max_seq_length = 384

jd_text = "We are looking for a senior software engineer with 5+ years of Python experience..."
embedding = model.encode(jd_text, convert_to_tensor=True, normalize_embeddings=True)
print(embedding.shape)  # torch.Size([768])
```

### Full Pipeline (via AppAI inference service)

```python
from ai_models.preprocessing.jd_preprocessor import preprocess_jd
from ai_models.services.sbert_service import encode_jd_spans

spans = preprocess_jd(raw_jd_html)
# {"full": "...", "education": "...", "experience": "...", "leadership": "..."}

embeddings = encode_jd_spans(spans)
# {"full": [...], "education": [...], "experience": [...], "leadership": [...]}
```
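
A plausible shape for a span-encoding helper like `encode_jd_spans` (a sketch, not the actual AppAI implementation; the encoder is passed in explicitly, and a tiny fake encoder is used so the example runs without downloading the model):

```python
SPAN_KEYS = ("full", "education", "experience", "leadership")

def encode_spans(model, spans):
    """Encode each named JD span independently with one SBERT-style model.
    `model` is anything exposing `encode(texts, normalize_embeddings=...)`,
    e.g. a loaded SentenceTransformer. Missing or empty spans fall back to
    the training-time placeholder string."""
    texts = [spans.get(k) or "no specific information available" for k in SPAN_KEYS]
    vectors = model.encode(texts, normalize_embeddings=True)
    return dict(zip(SPAN_KEYS, vectors))

# Tiny stand-in encoder so the sketch runs offline
class FakeEncoder:
    def encode(self, texts, normalize_embeddings=True):
        return [[float(len(t)), 0.0] for t in texts]

out = encode_spans(FakeEncoder(), {"full": "senior python engineer"})
print(sorted(out))  # ['education', 'experience', 'full', 'leadership']
```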

---

## Training Details

### Preprocessing

Job descriptions are preprocessed before encoding to exactly mirror the training pipeline:

- HTML is stripped and entities are unescaped
- **Full span**: `clean_text(preserve_tech=True)` keeps `.`, `+`, `#` for C++, C#, .NET
- **Section spans**: `clean_text(preserve_tech=False)` + keyword-based sentence filtering
- **Fallback**: `"no specific information available"` when the extracted span is ≤ 10 characters
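
A minimal sketch of what this cleaning might look like; the regexes and keyword list are illustrative, not the actual AppAI `clean_text`:

```python
import html
import re

def clean_text(text, preserve_tech=False):
    """Strip HTML tags, unescape entities, normalise punctuation and whitespace.
    With preserve_tech=True, '.', '+', '#' survive so 'C++', 'C#', '.NET' stay intact."""
    text = html.unescape(re.sub(r"<[^>]+>", " ", text))
    keep = r"\w\s.+#" if preserve_tech else r"\w\s"
    text = re.sub(rf"[^{keep}]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def filter_sentences(text, keywords):
    """Keep only sentences that mention at least one section keyword."""
    sents = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(s for s in sents if any(k in s.lower() for k in keywords))

print(clean_text("<b>C++ &amp; C# devs!</b>", preserve_tech=True))  # C++ C# devs
```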

### Training Objective

A 4-way pairing loss is applied over (JD full, JD education, JD experience, JD leadership) vs (resume full, resume education, resume experience, resume leadership) pairs, using cosine similarity on each pair.
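
One way to realise such an objective, as a sketch rather than the exact training code: average a cosine-distance term over the four aligned span pairs.

```python
import numpy as np

SPANS = ("full", "education", "experience", "leadership")

def four_way_pair_loss(jd_vecs, resume_vecs):
    """Mean cosine distance over the four aligned span pairs.
    Vectors are assumed L2-normalised, so cosine similarity is a dot product."""
    losses = [1.0 - float(np.dot(jd_vecs[k], resume_vecs[k])) for k in SPANS]
    return sum(losses) / len(losses)

v = np.array([1.0, 0.0])                 # toy 2-dim stand-in for 768-dim vectors
jd = {k: v for k in SPANS}
print(four_way_pair_loss(jd, jd))        # 0.0 for identical embeddings
```

Minimising this drives each JD span embedding toward its matching resume span while leaving the four spans independently trained.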

---

## Limitations

- Designed for English-language job descriptions
- Section extraction relies on keyword matching; niche or unconventional JD formats may not extract cleanly
- Best used together with `Smutypi3/applai-layoutlmv3` and `Smutypi3/applai-confit`; standalone cosine similarity without alignment will not reflect trained performance
|
| | --- |
| |
|
| | ## Citation |
| |
|
| | ```bibtex |
| | @software{lucero2025applai_sbert, |
| | author = {Lucero, Jaime Emmanuel}, |
| | title = {{AppAI SBERT 4-Way Pairing}: Fine-tuned Sentence Transformer for Job Description Embedding}, |
| | year = {2025}, |
| | publisher = {HuggingFace}, |
| | url = {https://huggingface.co/Smutypi3/applai-sbert}, |
| | note = {Part of the AppAI recruitment intelligence pipeline} |
| | } |
| | ``` |

---

## License

MIT