AppAI — SBERT 4-Way Pairing

Fine-tuned sentence-transformers/all-mpnet-base-v2 for job description (JD) embedding generation in the AppAI recruitment matching pipeline.

The model encodes job descriptions into four 768-dim L2-normalised span embeddings — full, education, experience, and leadership — enabling granular, section-level candidate matching when paired with the LayoutLMv3 resume encoder and the ConFiT alignment model.


Model Details

Property              Value
Base model            sentence-transformers/all-mpnet-base-v2
Max sequence length   384 tokens
Embedding dimension   768
Normalisation         L2 (unit norm)
Training objective    4-way paired cosine similarity
Batch size            64

Architecture

Input text → MPNet tokenizer → all-mpnet-base-v2 → Mean Pooling → L2 Normalize → 768-dim embedding

Training pairs each JD span against its corresponding resume span across four named features. Each feature is encoded independently, giving fine-grained control over education, experience, and leadership matching in addition to the full-text representation.
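
The per-span comparison can be sketched as follows. This is an illustrative helper, not part of the released code: `span_similarities` is a hypothetical name, and toy 4-dim vectors stand in for the real 768-dim embeddings.

```python
import numpy as np

SPANS = ("full", "education", "experience", "leadership")

def l2_normalize(v):
    """Scale a vector to unit L2 norm, as the model does at its output."""
    return v / np.linalg.norm(v)

def span_similarities(jd_embeddings, resume_embeddings):
    """Cosine similarity per named span.

    For unit-norm vectors, cosine similarity reduces to a dot product.
    """
    return {s: float(jd_embeddings[s] @ resume_embeddings[s]) for s in SPANS}

# Toy 4-dim vectors stand in for the real 768-dim embeddings.
rng = np.random.default_rng(0)
jd = {s: l2_normalize(rng.normal(size=4)) for s in SPANS}
cv = {s: l2_normalize(rng.normal(size=4)) for s in SPANS}
scores = span_similarities(jd, cv)
```

Because each span is scored separately, downstream logic can weight, threshold, or inspect education, experience, and leadership matches independently of the full-text score.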


Intended Use

This model is part of the AppAI recruitment intelligence pipeline:

  1. This model — encodes JD text spans (SBERT)
  2. Smutypi3/applai-layoutlmv3 — encodes resume PDF spans (LayoutLMv3)
  3. Smutypi3/applai-confit — aligns both embedding spaces (ConFiT)

It is intended for cosine similarity-based candidate ranking within the AppAI system. It is not designed for general-purpose semantic search.
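
A cosine-similarity ranking over the four spans might look like the sketch below. `rank_candidates` and the uniform span weights are illustrative assumptions, not part of the AppAI codebase, and the 2-dim toy vectors stand in for real embeddings.

```python
import numpy as np

def rank_candidates(jd_vecs, candidate_vecs, weights=None):
    """Rank candidate ids best-first by a weighted mean of per-span
    cosine similarities (dot products, since vectors are unit-norm)."""
    spans = list(jd_vecs)
    if weights is None:
        # Assumption: equal weight per span; the real system may differ.
        weights = {s: 1.0 / len(spans) for s in spans}
    def score(vecs):
        return sum(weights[s] * float(np.dot(jd_vecs[s], vecs[s])) for s in spans)
    return sorted(candidate_vecs, key=lambda cid: score(candidate_vecs[cid]), reverse=True)

# Toy 2-dim unit vectors; the span set mirrors the model's four named spans.
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
jd = {s: e1 for s in ("full", "education", "experience", "leadership")}
candidates = {
    "aligned": {s: e1 for s in jd},     # matches the JD on every span
    "orthogonal": {s: e2 for s in jd},  # unrelated on every span
}
ranking = rank_candidates(jd, candidates)
```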


Usage

Installation

pip install sentence-transformers

Encoding a Job Description

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Smutypi3/applai-sbert")
model.max_seq_length = 384

jd_text = "We are looking for a senior software engineer with 5+ years of Python experience..."
embedding = model.encode(jd_text, convert_to_tensor=True, normalize_embeddings=True)
print(embedding.shape)  # torch.Size([768])

Full Pipeline (via AppAI inference service)

from ai_models.preprocessing.jd_preprocessor import preprocess_jd
from ai_models.services.sbert_service import encode_jd_spans

spans = preprocess_jd(raw_jd_html)
# {"full": "...", "education": "...", "experience": "...", "leadership": "..."}

embeddings = encode_jd_spans(spans)
# {"full": [...], "education": [...], "experience": [...], "leadership": [...]}

Training Details

Preprocessing

Job descriptions are preprocessed before encoding to exactly mirror the training pipeline:

  • HTML is stripped and entities are unescaped
  • Full span: clean_text(preserve_tech=True) — keeps ., +, # for C++, C#, .NET
  • Section spans: clean_text(preserve_tech=False) + keyword-based sentence filtering
  • Fallback: "no specific information available" when the extracted span is ≤ 10 characters
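
The rules above can be approximated as follows. This is a hedged re-creation of the described behaviour, not the actual AppAI preprocessing code; the real clean_text lives in ai_models.preprocessing and may differ in detail.

```python
import html
import re

FALLBACK = "no specific information available"

def clean_text(text, preserve_tech=True):
    """Strip HTML tags, unescape entities, and drop punctuation.

    With preserve_tech=True, '.', '+', and '#' survive so tokens like
    C++, C#, and .NET are kept intact.
    """
    text = html.unescape(re.sub(r"<[^>]+>", " ", text))
    allowed = r"a-zA-Z0-9\s" + (r".+#" if preserve_tech else "")
    text = re.sub(rf"[^{allowed}]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def span_or_fallback(span_text):
    """Apply the short-span fallback: spans of 10 characters or fewer
    are replaced with the fixed placeholder string."""
    return span_text if len(span_text) > 10 else FALLBACK
```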

Training Objective

4-way pairing loss over (JD full, JD education, JD experience, JD leadership) vs (resume full, resume education, resume experience, resume leadership) pairs using cosine similarity.
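
Assuming a CosineSimilarityLoss-style regression objective (the exact loss function is not published; this is an assumption), the 4-way pairing can be sketched as:

```python
import numpy as np

SPANS = ("full", "education", "experience", "leadership")

def four_way_cosine_loss(jd, resume, labels):
    """Squared error between each span's cosine similarity and its match
    label (1.0 for a matching JD/resume pair, 0.0 otherwise), averaged
    over the four spans. Assumption: the real training loss may differ.
    """
    losses = []
    for s in SPANS:
        a = jd[s] / np.linalg.norm(jd[s])
        b = resume[s] / np.linalg.norm(resume[s])
        losses.append((float(a @ b) - labels[s]) ** 2)
    return sum(losses) / len(losses)
```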


Limitations

  • Designed for English-language job descriptions
  • Section extraction relies on keyword matching — niche or unconventional JD formats may not extract cleanly
  • Best used together with Smutypi3/applai-layoutlmv3 and Smutypi3/applai-confit; standalone cosine similarity without alignment will not reflect trained performance

Citation

@software{lucero2025applai_sbert,
  author    = {Lucero, Jaime Emmanuel},
  title     = {{AppAI SBERT 4-Way Pairing}: Fine-tuned Sentence Transformer for Job Description Embedding},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Smutypi3/applai-sbert},
  note      = {Part of the AppAI recruitment intelligence pipeline}
}

License

MIT
