---
language:
- en
license: mit
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- recruitment
- job-description
- applai
base_model: sentence-transformers/all-mpnet-base-v2
pipeline_tag: sentence-similarity
---
# AppAI SBERT 4-Way Pairing

Fine-tuned `sentence-transformers/all-mpnet-base-v2` for job description (JD) embedding generation in the AppAI recruitment matching pipeline.

The model encodes job descriptions into four 768-dim, L2-normalised span embeddings (full, education, experience, and leadership), enabling granular, section-level candidate matching when paired with the LayoutLMv3 resume encoder and the ConFiT alignment model.
## Model Details
| Property | Value |
|---|---|
| Base model | sentence-transformers/all-mpnet-base-v2 |
| Max sequence length | 384 tokens |
| Embedding dimension | 768 |
| Normalisation | L2 (unit norm) |
| Training objective | 4-way paired cosine similarity |
| Batch size | 64 |
### Architecture

Input text → MPNet tokenizer → all-mpnet-base-v2 → Mean Pooling → L2 Normalize → 768-dim embedding
Training pairs each JD span against its corresponding resume span across four named features. Each feature is encoded independently, giving fine-grained control over education, experience, and leadership matching in addition to the full-text representation.
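The mean-pooling and L2-normalisation step in the pipeline above can be sketched as follows. This is an illustrative re-implementation in NumPy, not the model's actual internals; the function name and shapes are assumptions.

```python
import numpy as np

def pool_and_normalize(token_embeddings, attention_mask):
    """Mean-pool token embeddings over real (non-padding) tokens,
    then L2-normalise to unit length.

    token_embeddings: (seq_len, 768) array of per-token vectors
    attention_mask:   (seq_len,) array of 0/1 flags
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=0)
    counts = np.clip(mask.sum(), 1e-9, None)   # avoid division by zero
    pooled = summed / counts                   # mean pooling
    return pooled / np.linalg.norm(pooled)     # L2 normalise to unit norm
```

With unit-norm outputs, cosine similarity between two embeddings reduces to a plain dot product, which is why the document's later similarity comparisons can skip explicit normalisation.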
## Intended Use
This model is part of the AppAI recruitment intelligence pipeline:
- **This model** → encodes JD text spans (SBERT)
- `Smutypi3/applai-layoutlmv3` → encodes resume PDF spans (LayoutLMv3)
- `Smutypi3/applai-confit` → aligns both embedding spaces (ConFiT)
It is intended for cosine similarity-based candidate ranking within the AppAI system. It is not designed for general-purpose semantic search.
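Cosine similarity-based ranking over the four span embeddings can be sketched as below. This is a minimal illustration under assumptions (the averaging scheme and function names are hypothetical, not the production AppAI scoring code); it relies on the embeddings being unit-norm, so cosine similarity is a dot product.

```python
import numpy as np

SECTIONS = ["full", "education", "experience", "leadership"]

def rank_candidates(jd_embeddings, candidates):
    """Rank candidates by mean per-section cosine similarity.

    jd_embeddings: {section: unit-norm 768-dim vector}
    candidates:    {candidate_name: {section: unit-norm 768-dim vector}}
    Returns a list of (candidate_name, score) sorted best-first.
    """
    scores = {}
    for name, resume in candidates.items():
        sims = [float(np.dot(jd_embeddings[s], resume[s])) for s in SECTIONS]
        scores[name] = sum(sims) / len(sims)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Averaging the four section scores equally is just one possible choice; a deployment could weight experience or education differently per role.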
## Usage

### Installation

```bash
pip install sentence-transformers
```
### Encoding a Job Description

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Smutypi3/applai-sbert")
model.max_seq_length = 384

jd_text = "We are looking for a senior software engineer with 5+ years of Python experience..."
embedding = model.encode(jd_text, convert_to_tensor=True, normalize_embeddings=True)
print(embedding.shape)  # torch.Size([768])
```
### Full Pipeline (via AppAI inference service)

```python
from ai_models.preprocessing.jd_preprocessor import preprocess_jd
from ai_models.services.sbert_service import encode_jd_spans

spans = preprocess_jd(raw_jd_html)
# {"full": "...", "education": "...", "experience": "...", "leadership": "..."}

embeddings = encode_jd_spans(spans)
# {"full": [...], "education": [...], "experience": [...], "leadership": [...]}
```
## Training Details

### Preprocessing
Job descriptions are preprocessed before encoding to exactly mirror the training pipeline:
- HTML is stripped and entities are unescaped
- Full span: `clean_text(preserve_tech=True)` keeps `.`, `+`, and `#` so tokens like C++, C#, and .NET survive
- Section spans: `clean_text(preserve_tech=False)` plus keyword-based sentence filtering
- Fallback: `"no specific information available"` when the extracted span is ≤ 10 characters
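The fallback rule above can be restated as a few lines of Python. This is a hypothetical re-implementation for illustration (the function name is invented; the threshold and placeholder string come from the list above):

```python
# Placeholder emitted when a section span is too short to be meaningful,
# so every section still produces an embedding downstream.
FALLBACK = "no specific information available"

def finalize_span(extracted: str) -> str:
    """Return the cleaned span, or the fallback if it is <= 10 characters."""
    cleaned = extracted.strip()
    return cleaned if len(cleaned) > 10 else FALLBACK
```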
### Training Objective

A 4-way pairing loss is applied over (JD full, JD education, JD experience, JD leadership) vs. (resume full, resume education, resume experience, resume leadership) pairs using cosine similarity.
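A hedged sketch of this objective: for each of the four named spans, the cosine similarity between the JD embedding and its paired resume embedding is pushed toward 1. The exact loss form used in training is not stated in this card; the mean of `1 - cos` below is one common formulation, shown only to make the pairing structure concrete.

```python
import numpy as np

SECTIONS = ["full", "education", "experience", "leadership"]

def four_way_pairing_loss(jd, resume):
    """Mean (1 - cosine similarity) over the four paired spans.

    jd, resume: {section: unit-norm 768-dim vector}
    Returns 0.0 for identical pairs, up to 2.0 for opposite vectors.
    """
    losses = [1.0 - float(np.dot(jd[s], resume[s])) for s in SECTIONS]
    return sum(losses) / len(losses)
```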
## Limitations

- Designed for English-language job descriptions
- Section extraction relies on keyword matching; niche or unconventional JD formats may not extract cleanly
- Best used together with `Smutypi3/applai-layoutlmv3` and `Smutypi3/applai-confit`; standalone cosine similarity without alignment will not reflect trained performance
## Citation

```bibtex
@software{lucero2025applai_sbert,
  author    = {Lucero, Jaime Emmanuel},
  title     = {{AppAI SBERT 4-Way Pairing}: Fine-tuned Sentence Transformer for Job Description Embedding},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Smutypi3/applai-sbert},
  note      = {Part of the AppAI recruitment intelligence pipeline}
}
```
## License
MIT