	---
	language:
	- en
	license: mit
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- dense
	- recruitment
	- job-description
	- applai
	base_model: sentence-transformers/all-mpnet-base-v2
	pipeline_tag: sentence-similarity
	---

	# AppAI — SBERT 4-Way Pairing

	Fine-tuned `sentence-transformers/all-mpnet-base-v2` for job description (JD) embedding generation in the [AppAI](https://github.com/jaimeemanuellucero/applai) recruitment matching pipeline.

	The model encodes each job description into four 768-dimensional, L2-normalised span embeddings: full, education, experience, and leadership. These embeddings enable section-level candidate matching when paired with the LayoutLMv3 resume encoder and the ConFiT alignment model.

	---

	## Model Details

	| Property | Value |
	|---|---|
	| Base model | `sentence-transformers/all-mpnet-base-v2` |
	| Max sequence length | 384 tokens |
	| Embedding dimension | 768 |
	| Normalisation | L2 (unit norm) |
	| Training objective | 4-way paired cosine similarity |
	| Batch size | 64 |

	### Architecture

	```
	Input text → MPNet tokenizer → all-mpnet-base-v2 → Mean Pooling → L2 Normalize → 768-dim embedding
	```

	During training, each JD span is paired with its corresponding resume span for each of the four named features (full, education, experience, leadership). Each feature is encoded independently, so education, experience, and leadership can be matched separately in addition to the full-text representation.
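
The pooling and normalisation steps in the diagram above can be illustrated with a minimal sketch. It uses plain Python lists and toy 2-dimensional token vectors instead of real MPNet outputs; the helper names `mean_pool` and `l2_normalize` are illustrative and are not taken from the AppAI code base.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                summed[i] += value
    count = max(count, 1)  # avoid division by zero on an all-padding input
    return [s / count for s in summed]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

# Toy input: three tokens of dimension 2; the last token is padding.
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)   # padding token is excluded from the mean
embedding = l2_normalize(pooled)   # unit-norm vector
```

In practice `SentenceTransformer.encode(..., normalize_embeddings=True)` performs both steps internally, so this sketch is only meant to show what the two stages compute.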

	---

	## Intended Use

	This model is part of the AppAI recruitment intelligence pipeline:

	1. This model — encodes JD text spans (SBERT)
	2. [`Smutypi3/applai-layoutlmv3`](https://huggingface.co/Smutypi3/applai-layoutlmv3) — encodes resume PDF spans (LayoutLMv3)
	3. [`Smutypi3/applai-confit`](https://huggingface.co/Smutypi3/applai-confit) — aligns both embedding spaces (ConFiT)

	It is intended for cosine similarity-based candidate ranking within the AppAI system. It is not designed for general-purpose semantic search.
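
As a rough illustration of cosine similarity-based ranking, the sketch below sorts candidates by their similarity to a JD embedding. It assumes every vector is already L2-normalised, in which case the dot product equals the cosine similarity. The candidate IDs, the 2-dimensional vectors, and the `rank_candidates` helper are hypothetical; in the real pipeline the resume embeddings would come from the LayoutLMv3 encoder and the ConFiT alignment model.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_candidates(jd_embedding, candidate_embeddings):
    """Return (candidate_id, score) pairs, highest similarity first.

    Assumes unit-norm inputs, so the dot product is the cosine similarity.
    """
    scores = [(cid, dot(jd_embedding, emb))
              for cid, emb in candidate_embeddings.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy unit-norm vectors standing in for 768-dim embeddings.
jd = [0.6, 0.8]
candidates = {
    "cand_a": [1.0, 0.0],
    "cand_b": [0.8, 0.6],
    "cand_c": [0.6, 0.8],
}

ranking = rank_candidates(jd, candidates)
```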

	---

	## Usage

	### Installation

	```bash
	pip install sentence-transformers
	```

	### Encoding a Job Description

	```python
	from sentence_transformers import SentenceTransformer

	model = SentenceTransformer("Smutypi3/applai-sbert")
	model.max_seq_length = 384

	jd_text = "We are looking for a senior software engineer with 5+ years of Python experience..."
	embedding = model.encode(jd_text, convert_to_tensor=True, normalize_embeddings=True)
	print(embedding.shape) # torch.Size([768])
	```

	### Full Pipeline (via AppAI inference service)

	```python
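	# These modules are assumed to come from the AppAI repository;
	# they are not part of sentence-transformers.
	from ai_models.preprocessing.jd_preprocessor import preprocess_jd
	from ai_models.services.sbert_service import encode_jd_spans

	raw_jd_html = "<p>We are looking for a senior software engineer...</p>"  # placeholder input
	spans = preprocess_jd(raw_jd_html)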
	# {"full": "...", "education": "...", "experience": "...", "leadership": "..."}

	embeddings = encode_jd_spans(spans)
	# {"full": [...], "education": [...], "experience": [...], "leadership": [...]}
	```
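
Given a dictionary of span embeddings like the one above, section-level matching can be sketched as one cosine score per named span. This is an illustration under assumptions: the resume-side dictionary, the toy 2-dimensional vectors, and the `span_similarities` helper are hypothetical, and all vectors are assumed to be unit-norm.

```python
SPANS = ("full", "education", "experience", "leadership")

def span_similarities(jd_embeddings, resume_embeddings):
    """Cosine similarity per named span for dicts of unit-norm vectors."""
    return {
        name: sum(a * b for a, b in zip(jd_embeddings[name],
                                        resume_embeddings[name]))
        for name in SPANS
    }

# Toy unit-norm vectors standing in for 768-dim span embeddings.
jd = {
    "full": [1.0, 0.0],
    "education": [0.0, 1.0],
    "experience": [0.6, 0.8],
    "leadership": [1.0, 0.0],
}
resume = {
    "full": [1.0, 0.0],
    "education": [1.0, 0.0],
    "experience": [0.8, 0.6],
    "leadership": [0.0, 1.0],
}

scores = span_similarities(jd, resume)
```

Keeping the four scores separate, rather than collapsing them into one number, lets a downstream ranker decide how much weight each section deserves.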

	---

	## Training Details

	### Preprocessing

	Before encoding, job descriptions are preprocessed exactly as they were during training:

	- HTML is stripped and entities are unescaped
	- Full span: `clean_text(preserve_tech=True)` — keeps `.`, `+`, `#` for C++, C#, .NET
	- Section spans: `clean_text(preserve_tech=False)` + keyword-based sentence filtering
	- Fallback: `"no specific information available"` when extracted span is ≤ 10 characters
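
The steps above can be approximated with the standard library. This is a hedged sketch, not the AppAI implementation: `clean_text` here is a hypothetical stand-in for the pipeline function of the same name, the sentence splitter is deliberately naive, and the keyword tuples are invented for illustration.

```python
import html
import re

FALLBACK = "no specific information available"

def strip_html(raw):
    """Remove tags, then unescape entities such as &amp;."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return html.unescape(no_tags)

def clean_text(text, preserve_tech=True):
    """Hypothetical stand-in: drop punctuation, optionally keeping . + #."""
    pattern = r"[^A-Za-z0-9\s.+#]" if preserve_tech else r"[^A-Za-z0-9\s]"
    text = re.sub(pattern, " ", text)
    return re.sub(r"\s+", " ", text).strip()

def extract_section(text, keywords):
    """Keep sentences mentioning any keyword; fall back if too little remains."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = [s for s in sentences if any(k in s.lower() for k in keywords)]
    span = clean_text(" ".join(kept), preserve_tech=False)
    return span if len(span) > 10 else FALLBACK

raw = "<p>Requires a Bachelor&#39;s degree in CS. Experience with C++ &amp; C# preferred.</p>"
text = strip_html(raw)

full = clean_text(text, preserve_tech=True)
education = extract_section(text, ("degree", "bachelor"))   # illustrative keywords
leadership = extract_section(text, ("lead", "manage"))      # illustrative keywords
```

Here `full` keeps `C++` and `C#` intact, while `leadership` falls back to the placeholder string because no sentence mentions a leadership keyword.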

	### Training Objective

	A 4-way pairing loss based on cosine similarity. Each JD span (full, education, experience, leadership) is paired with the corresponding resume span (full, education, experience, leadership).
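
This card does not spell out the exact loss formula. One plausible reading, shown here purely as an assumption, is a mean squared error between each span's cosine similarity and a match label, averaged over the four spans. The `four_way_loss` function, the label convention (1.0 for a matching JD and resume pair), and the toy vectors are all hypothetical.

```python
import math

SPANS = ("full", "education", "experience", "leadership")

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def four_way_loss(jd, resume, label):
    """Assumed form: squared error between cosine and label, averaged over spans."""
    errors = [(cosine(jd[s], resume[s]) - label) ** 2 for s in SPANS]
    return sum(errors) / len(SPANS)

# Toy 2-dim vectors standing in for 768-dim span embeddings.
jd = {s: [1.0, 0.0] for s in SPANS}
resume_match = {s: [1.0, 0.0] for s in SPANS}
resume_partial = dict(resume_match, leadership=[0.0, 1.0])

loss_match = four_way_loss(jd, resume_match, label=1.0)
loss_partial = four_way_loss(jd, resume_partial, label=1.0)
```

Under this assumed form, a mismatch in a single span raises the loss by a quarter of its squared error, so each of the four features contributes equally to the training signal.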

	---

	## Limitations

	- Designed for English-language job descriptions
	- Section extraction relies on keyword matching, so sections in niche or unconventional JD formats may not be extracted cleanly
	- Best used together with `Smutypi3/applai-layoutlmv3` and `Smutypi3/applai-confit`; standalone cosine similarity without alignment will not reflect trained performance

	---

	## Citation

	```bibtex
	@software{lucero2025applai_sbert,
	author = {Lucero, Jaime Emmanuel},
	title = {{AppAI SBERT 4-Way Pairing}: Fine-tuned Sentence Transformer for Job Description Embedding},
	year = {2025},
	publisher = {HuggingFace},
	url = {https://huggingface.co/Smutypi3/applai-sbert},
	note = {Part of the AppAI recruitment intelligence pipeline}
	}
	```

	---

	## License

	MIT