---
language:
- en
license: mit
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- recruitment
- job-description
- applai
base_model: sentence-transformers/all-mpnet-base-v2
pipeline_tag: sentence-similarity
---

# AppAI: SBERT 4-Way Pairing

Fine-tuned `sentence-transformers/all-mpnet-base-v2` for job description (JD) embedding generation in the [AppAI](https://github.com/jaimeemanuellucero/applai) recruitment matching pipeline.

The model encodes job descriptions into four **768-dim L2-normalised** span embeddings (**full**, **education**, **experience**, and **leadership**), enabling granular, section-level candidate matching when paired with the LayoutLMv3 resume encoder and the ConFiT alignment model.

---

## Model Details

| Property | Value |
|---|---|
| Base model | `sentence-transformers/all-mpnet-base-v2` |
| Max sequence length | 384 tokens |
| Embedding dimension | 768 |
| Normalisation | L2 (unit norm) |
| Training objective | 4-way paired cosine similarity |
| Batch size | 64 |

### Architecture

```
Input text → MPNet tokenizer → all-mpnet-base-v2 → Mean Pooling → L2 Normalize → 768-dim embedding
```

Training pairs each JD span against its corresponding resume span across four named features. Each feature is encoded independently, giving fine-grained control over education, experience, and leadership matching in addition to the full-text representation.

---

## Intended Use

This model is part of the **AppAI** recruitment intelligence pipeline:

1. **This model**: encodes JD text spans (SBERT)
2. [`Smutypi3/applai-layoutlmv3`](https://huggingface.co/Smutypi3/applai-layoutlmv3): encodes resume PDF spans (LayoutLMv3)
3. [`Smutypi3/applai-confit`](https://huggingface.co/Smutypi3/applai-confit): aligns both embedding spaces (ConFiT)

It is intended for **cosine similarity-based candidate ranking** within the AppAI system. It is not designed for general-purpose semantic search.
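
As an illustration of cosine similarity-based ranking, the sketch below scores a set of candidate vectors against one JD vector and sorts them from best to worst. It is a minimal, dependency-free example and not the AppAI implementation: the function names `cosine_similarity` and `rank_candidates` are hypothetical, the toy 3-dim vectors stand in for real 768-dim embeddings, and it assumes the candidate embeddings have already been aligned into the same space as the JD embedding.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(
    jd_embedding: Sequence[float],
    candidate_embeddings: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (candidate_index, score) pairs, highest score first."""
    scored = [
        (index, cosine_similarity(jd_embedding, candidate))
        for index, candidate in enumerate(candidate_embeddings)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumption: real inputs would be 768-dim, aligned embeddings).
jd = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the JD
    [2.0, 0.0, 0.0],  # same direction as the JD, different length
    [1.0, 1.0, 0.0],  # 45 degrees from the JD
]
ranking = rank_candidates(jd, candidates)
```

For vectors that are already unit-norm, as this model's outputs are, the cosine similarity reduces to a plain dot product, so the normalisation step in `cosine_similarity` is only a safeguard.
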

---

## Usage

### Installation

```bash
pip install sentence-transformers
```

### Encoding a Job Description

```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned model and match the training-time sequence length
model = SentenceTransformer("Smutypi3/applai-sbert")
model.max_seq_length = 384

jd_text = "We are looking for a senior software engineer with 5+ years of Python experience..."

# normalize_embeddings=True returns a unit-norm vector
embedding = model.encode(jd_text, convert_to_tensor=True, normalize_embeddings=True)
print(embedding.shape)  # torch.Size([768])
```

### Full Pipeline (via AppAI inference service)

```python
# These modules come from the AppAI repository, not from PyPI
from ai_models.preprocessing.jd_preprocessor import preprocess_jd
from ai_models.services.sbert_service import encode_jd_spans

raw_jd_html = "<p>We are looking for a senior software engineer...</p>"  # example input

spans = preprocess_jd(raw_jd_html)
# {"full": "...", "education": "...", "experience": "...", "leadership": "..."}

embeddings = encode_jd_spans(spans)
# {"full": [...], "education": [...], "experience": [...], "leadership": [...]}
```

---

## Training Details

### Preprocessing

Before encoding, job descriptions are preprocessed exactly as they were during training:

- HTML is stripped and entities are unescaped
- **Full span**: `clean_text(preserve_tech=True)`, which keeps `.`, `+`, and `#` so that terms such as C++, C#, and .NET survive cleaning
- **Section spans**: `clean_text(preserve_tech=False)` + keyword-based sentence filtering
- **Fallback**: `"no specific information available"` when the extracted span is ≤ 10 characters

### Training Objective

A 4-way pairing loss using cosine similarity: each JD span (full, education, experience, leadership) is paired with the corresponding resume span (full, education, experience, leadership).
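
The preprocessing rules listed above (HTML stripping, keyword-based sentence filtering, and the short-span fallback) can be sketched roughly as follows. This is a simplified illustration, not the AppAI `clean_text` or `preprocess_jd` code: the helper names `strip_html` and `extract_span`, the regex-based sentence splitter, and especially the keyword lists are all assumptions made for the example, since the real keyword lists are not given in this card.

```python
import html
import re

FALLBACK = "no specific information available"

# Hypothetical keyword lists, for illustration only.
SECTION_KEYWORDS = {
    "education": ("degree", "bachelor", "master", "university"),
    "experience": ("experience", "years", "worked"),
    "leadership": ("lead", "manage", "mentor"),
}


def strip_html(raw: str) -> str:
    """Remove tags, unescape entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def extract_span(text: str, keywords) -> str:
    """Keep sentences that mention any keyword; fall back if too little is left."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = [s for s in sentences if any(k in s.lower() for k in keywords)]
    span = " ".join(kept).strip()
    # Spans of 10 characters or fewer are replaced by the fallback string.
    return span if len(span) > 10 else FALLBACK


raw = "<p>Bachelor&#39;s degree in CS required.</p><p>5+ years of Python experience.</p>"
text = strip_html(raw)
spans = {name: extract_span(text, kws) for name, kws in SECTION_KEYWORDS.items()}
```

In this toy input no sentence mentions leadership, so the `leadership` span becomes the fallback string, which means the encoder always receives non-empty text for every feature.
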

---

## Limitations

- Designed for English-language job descriptions
- Section extraction relies on keyword matching, so niche or unconventional JD formats may not extract cleanly
- Best used together with `Smutypi3/applai-layoutlmv3` and `Smutypi3/applai-confit`; standalone cosine similarity without alignment will not reflect trained performance

---

## Citation

```bibtex
@software{lucero2025applai_sbert,
  author    = {Lucero, Jaime Emmanuel},
  title     = {{AppAI SBERT 4-Way Pairing}: Fine-tuned Sentence Transformer for Job Description Embedding},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Smutypi3/applai-sbert},
  note      = {Part of the AppAI recruitment intelligence pipeline}
}
```

---

## License

MIT