---
language:
- en
license: mit
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- recruitment
- job-description
- applai
base_model: sentence-transformers/all-mpnet-base-v2
pipeline_tag: sentence-similarity
---
# AppAI: SBERT 4-Way Pairing
Fine-tuned `sentence-transformers/all-mpnet-base-v2` for job description (JD) embedding generation in the [AppAI](https://github.com/jaimeemanuellucero/applai) recruitment matching pipeline.
The model encodes job descriptions into four **768-dim L2-normalised** span embeddings (**full**, **education**, **experience**, and **leadership**), enabling granular, section-level candidate matching when paired with the LayoutLMv3 resume encoder and the ConFiT alignment model.
---
## Model Details
| Property | Value |
|---|---|
| Base model | `sentence-transformers/all-mpnet-base-v2` |
| Max sequence length | 384 tokens |
| Embedding dimension | 768 |
| Normalisation | L2 (unit norm) |
| Training objective | 4-way paired cosine similarity |
| Batch size | 64 |
### Architecture
```
Input text → MPNet tokenizer → all-mpnet-base-v2 → Mean Pooling → L2 Normalize → 768-dim embedding
```
Training pairs each JD span against its corresponding resume span across four named features. Each feature is encoded independently, giving fine-grained control over education, experience, and leadership matching in addition to the full-text representation.
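The last two stages of the diagram above (mean pooling and L2 normalisation) can be illustrated with a minimal, dependency-free sketch. The helper names `mean_pool` and `l2_normalize` are hypothetical and the toy vectors are 2-dimensional rather than 768-dimensional; in practice `sentence-transformers` performs these steps internally.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # avoid dividing by zero on an all-padding input
    return [value / count for value in summed]

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / max(norm, eps) for value in vector]

# Toy input: three token vectors of dimension 2; the last one is padding.
tokens = [[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)    # padding token is excluded from the mean
embedding = l2_normalize(pooled)    # unit-norm vector
```

Because the output has unit norm, the dot product of two such embeddings equals their cosine similarity.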
---
## Intended Use
This model is part of the **AppAI** recruitment intelligence pipeline:
1. **This model**: encodes JD text spans (SBERT)
2. [`Smutypi3/applai-layoutlmv3`](https://huggingface.co/Smutypi3/applai-layoutlmv3): encodes resume PDF spans (LayoutLMv3)
3. [`Smutypi3/applai-confit`](https://huggingface.co/Smutypi3/applai-confit): aligns both embedding spaces (ConFiT)
It is intended for **cosine similarity-based candidate ranking** within the AppAI system. It is not designed for general-purpose semantic search.
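As a rough illustration of cosine-similarity ranking, the sketch below sorts candidates by their similarity to a JD embedding. It assumes every vector is already unit-norm and that the candidate (resume) embeddings have already been aligned into the same space by the other pipeline components; `rank_candidates`, the candidate IDs, and the 2-dimensional toy vectors are all hypothetical.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit-norm."""
    return sum(x * y for x, y in zip(a, b))

def rank_candidates(jd_embedding, candidate_embeddings):
    """Return (candidate_id, score) pairs sorted by descending similarity."""
    scored = [
        (candidate_id, dot(jd_embedding, embedding))
        for candidate_id, embedding in candidate_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy unit-norm vectors standing in for 768-dim embeddings.
jd = [0.6, 0.8]
candidates = {
    "cand_a": [1.0, 0.0],
    "cand_b": [0.8, 0.6],
    "cand_c": [0.0, 1.0],
}

ranking = rank_candidates(jd, candidates)
best_id = ranking[0][0]
```

The same function could be applied per span (full, education, experience, leadership) to obtain section-level rankings.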
---
## Usage
### Installation
```bash
pip install sentence-transformers
```
### Encoding a Job Description
```python
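from sentence_transformers import SentenceTransformer

# Load the fine-tuned model from the Hugging Face Hub
model = SentenceTransformer("Smutypi3/applai-sbert")
# Truncate inputs at the 384-token limit listed under Model Details
model.max_seq_length = 384

jd_text = "We are looking for a senior software engineer with 5+ years of Python experience..."

# normalize_embeddings=True returns a unit-norm vector, so dot product equals cosine similarity
embedding = model.encode(jd_text, convert_to_tensor=True, normalize_embeddings=True)
print(embedding.shape) # torch.Size([768])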
```
### Full Pipeline (via AppAI inference service)
```python
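# These modules are part of the AppAI codebase, not of sentence-transformers
from ai_models.preprocessing.jd_preprocessor import preprocess_jd
from ai_models.services.sbert_service import encode_jd_spans

# Placeholder input: raw JD markup as received by the service
raw_jd_html = "<p>We are looking for a senior software engineer with 5+ years of Python experience...</p>"

spans = preprocess_jd(raw_jd_html)
# {"full": "...", "education": "...", "experience": "...", "leadership": "..."}
embeddings = encode_jd_spans(spans)
# {"full": [...], "education": [...], "experience": [...], "leadership": [...]}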
```
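The internals of `encode_jd_spans` are not shown in this README. For readers without access to the AppAI service, a rough stand-in might look like the sketch below, which encodes the four named spans in one batch through any encoder callable. The names `encode_spans`, `SPAN_NAMES`, and `fake_encode` are hypothetical, and the fake encoder exists only so the sketch runs without downloading the model.

```python
SPAN_NAMES = ("full", "education", "experience", "leadership")

def encode_spans(spans, encode_fn):
    """Encode the four named spans in one batch and return {name: vector}."""
    missing = [name for name in SPAN_NAMES if name not in spans]
    if missing:
        raise KeyError(f"missing spans: {missing}")
    texts = [spans[name] for name in SPAN_NAMES]
    vectors = encode_fn(texts)  # one vector per input text, same order
    return {name: list(vector) for name, vector in zip(SPAN_NAMES, vectors)}

def fake_encode(texts):
    """Stand-in encoder (NOT the real model): maps each text to a 2-dim vector."""
    return [[float(len(text)), 1.0] for text in texts]

spans = {
    "full": "senior python engineer with a degree",
    "education": "bachelor degree",
    "experience": "5 years of python",
    "leadership": "no specific information available",
}
embeddings = encode_spans(spans, fake_encode)
```

With the real model, the callable could be `lambda texts: model.encode(texts, normalize_embeddings=True)`, where `model` is the `SentenceTransformer` instance loaded in the previous section.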
---
## Training Details
### Preprocessing
Job descriptions are preprocessed before encoding to exactly mirror the training pipeline:
- HTML is stripped and entities are unescaped
- **Full span**: `clean_text(preserve_tech=True)`, which keeps `.`, `+`, `#` for C++, C#, .NET
- **Section spans**: `clean_text(preserve_tech=False)` + keyword-based sentence filtering
- **Fallback**: `"no specific information available"` when the extracted span is ≤ 10 characters
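The steps listed above can be sketched roughly as follows. This is not the AppAI implementation: the body of `clean_text`, the keyword lists, the sentence splitter, and the choice to apply the fallback only to section spans are all assumptions made for illustration. Only the `preserve_tech` flag, the kept characters, and the fallback string and threshold come from this README.

```python
import html
import re

FALLBACK = "no specific information available"

# Hypothetical keyword lists; the real ones are not given in this README.
SECTION_KEYWORDS = {
    "education": ("degree", "bachelor", "master", "phd", "university"),
    "experience": ("experience", "years"),
    "leadership": ("lead", "manage", "mentor"),
}

def strip_html(raw):
    """Remove tags, then unescape entities such as &amp;."""
    return html.unescape(re.sub(r"<[^>]+>", " ", raw))

def clean_text(text, preserve_tech=True):
    """Drop punctuation; optionally keep '.', '+', '#' for C++, C#, .NET."""
    allowed = r"A-Za-z0-9\s" + (r".+#" if preserve_tech else "")
    text = re.sub(f"[^{allowed}]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def extract_section(text, keywords):
    """Keep sentences mentioning any keyword; fall back if too little remains."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = [s for s in sentences if any(k in s.lower() for k in keywords)]
    span = clean_text(" ".join(kept), preserve_tech=False)
    return span if len(span) > 10 else FALLBACK

def preprocess(raw_html):
    """Build the four named spans from raw JD markup."""
    text = strip_html(raw_html)
    spans = {"full": clean_text(text, preserve_tech=True)}
    for name, keywords in SECTION_KEYWORDS.items():
        spans[name] = extract_section(text, keywords)
    return spans

raw = ("<p>Senior C++ &amp; C# engineer. Bachelor degree in CS required. "
       "5+ years of experience with .NET.</p>")
spans = preprocess(raw)
```

In this example the JD contains no leadership-related sentence, so the `leadership` span falls back to the placeholder string.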
### Training Objective
The model is trained with a 4-way pairing loss: each JD span (full, education, experience, leadership) is paired with the corresponding resume span (full, education, experience, leadership), and each pair is scored with cosine similarity.
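The exact form of the loss is not specified here. One plausible reading, shown purely as an assumption, is a mean squared error between the per-feature cosine similarity and a match label, averaged over the four features. The names `four_way_loss` and `FEATURES` and the 2-dimensional toy vectors are hypothetical.

```python
import math

FEATURES = ("full", "education", "experience", "leadership")

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def four_way_loss(jd, resume, label):
    """Assumed loss: mean over the four features of (cosine - label)^2."""
    errors = [(cosine(jd[f], resume[f]) - label) ** 2 for f in FEATURES]
    return sum(errors) / len(FEATURES)

# Toy pair labelled as a match (label = 1.0); only the education spans disagree.
jd = {f: [1.0, 0.0] for f in FEATURES}
resume = {
    "full": [1.0, 0.0],
    "education": [0.0, 1.0],  # orthogonal to the JD education span
    "experience": [1.0, 0.0],
    "leadership": [1.0, 0.0],
}
loss = four_way_loss(jd, resume, label=1.0)
```

Under this reading, a mismatch in a single section raises the loss even when the full-text embeddings agree.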
---
## Limitations
- Designed for English-language job descriptions
- Section extraction relies on keyword matching, so niche or unconventional JD formats may not extract cleanly
- Best used together with `Smutypi3/applai-layoutlmv3` and `Smutypi3/applai-confit`; standalone cosine similarity without alignment will not reflect trained performance
---
## Citation
```bibtex
@software{lucero2025applai_sbert,
author = {Lucero, Jaime Emmanuel},
title = {{AppAI SBERT 4-Way Pairing}: Fine-tuned Sentence Transformer for Job Description Embedding},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/Smutypi3/applai-sbert},
note = {Part of the AppAI recruitment intelligence pipeline}
}
```
---
## License
MIT