# AppAI LayoutLMv3 Contrastive Learning
Fine-tuned microsoft/layoutlmv3-base for resume PDF embedding generation in the AppAI recruitment matching pipeline.
The model encodes resume PDFs into four 768-dim L2-normalised span embeddings (full, education, experience, and leadership) using real bounding boxes extracted directly from PDF documents via pdfplumber.
## Model Details
| Property | Value |
|---|---|
| Base model | microsoft/layoutlmv3-base |
| Max sequence length | 512 tokens |
| Embedding dimension | 768 |
| Normalisation | L2 (unit norm) |
| Pooling | Mean pooling over last hidden state with attention mask |
| Training objective | Contrastive learning (resume ↔ JD span matching) |
| Bounding box range | [0, 1000] normalised per page |
## Architecture

```
Resume PDF
  ↓ pdfplumber word extraction + bbox normalisation [0, 1000]
  ↓ Section label assignment (O / EDUCATION / EXPERIENCE / LEADERSHIP)
  ↓ LayoutLMv3TokenizerFast (full resume, single tokenisation pass)
  ↓ word_ids() label propagation to subword tokens
  ↓ Per-label token subsequence extraction
  ↓ LayoutLMv3Model backbone (pixel_values=None)
  ↓ Mean pooling over last hidden state
  ↓ L2 normalisation
  ↓ 768-dim embedding per span
```
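The final two stages, mean pooling under the attention mask followed by L2 normalisation, can be sketched as below. The function name is illustrative, not the pipeline's actual helper:

```python
import torch
import torch.nn.functional as F

def mean_pool_and_normalise(last_hidden_state, attention_mask):
    """Mean-pool token embeddings under the attention mask, then L2-normalise."""
    mask = attention_mask.unsqueeze(-1).float()        # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)     # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)           # avoid division by zero
    pooled = summed / counts                           # masked mean
    return F.normalize(pooled, p=2, dim=1)             # unit-norm rows
```

Padding tokens contribute nothing to the mean because they are zeroed by the mask before summation.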
Inference mirrors extract_feature_indices_by_label from the training notebook exactly: the full resume is tokenised once, word-level section labels are propagated to subword tokens via word_ids(), then per-label token subsequences are extracted and encoded independently.
## Intended Use

This model is part of the AppAI recruitment intelligence pipeline:

- `Smutypi3/applai-sbert` encodes JD text spans (SBERT)
- This model encodes resume PDF spans (LayoutLMv3)
- `Smutypi3/applai-confit` aligns both embedding spaces (ConFiT)
## Usage

### Installation

```bash
pip install torch transformers pdfplumber
```
### Encoding a Resume PDF

```python
from ai_models.preprocessing.resume_preprocessor import preprocess_resume_pdf
from ai_models.services.layoutlm_service import encode_resume_spans

with open("resume.pdf", "rb") as f:
    pdf_bytes = f.read()

parsed = preprocess_resume_pdf(pdf_bytes)
# {"all_words": [{"text": ..., "bbox": [x1,y1,x2,y2]}, ...], "word_labels": [...]}

embeddings = encode_resume_spans(parsed)
# {"full": [...], "education": [...], "experience": [...], "leadership": [...]}
```
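Because every span embedding is unit-norm, downstream matching against an aligned JD embedding reduces to a dot product. A minimal sketch, assuming the embeddings are plain float lists as returned above:

```python
def span_similarity(resume_vec, jd_vec):
    """Cosine similarity of two L2-normalised vectors is just their dot product."""
    return sum(r * j for r, j in zip(resume_vec, jd_vec))
```

Scores fall in [-1, 1], with higher values indicating a closer resume-to-JD match.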
### Section Label IDs
| Label ID | Section |
|---|---|
| 0 | O (Other / header / unclassified) |
| 1 | EDUCATION |
| 2 | EXPERIENCE |
| 3 | LEADERSHIP |
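These label IDs could be assigned by a keyword scan over short lines, as the Preprocessing section describes. The following is only a sketch: the keyword map, the helper name, and whether a header line itself receives the new label are all assumptions, not the pipeline's published code:

```python
# Hypothetical keyword map; the real pipeline's keyword list is not published
SECTION_KEYWORDS = {"education": 1, "experience": 2, "leadership": 3}

def assign_word_labels(lines):
    """lines: words grouped into lines by y-proximity. Returns one label per word.

    Short lines (<= 5 words) containing a section keyword switch the current
    label for all subsequent words; everything before the first header is O (0).
    """
    labels, current = [], 0
    for line in lines:
        if len(line) <= 5:
            joined = " ".join(line).lower()
            for keyword, label_id in SECTION_KEYWORDS.items():
                if keyword in joined:
                    current = label_id
                    break
        labels.extend([current] * len(line))
    return labels
```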
## Training Details

### Preprocessing

Resume PDFs are parsed using pdfplumber. Per-word bounding boxes are normalised to [0, 1000] per page. Words are grouped into lines by y-coordinate proximity, and short lines (≤ 5 words) containing section header keywords trigger a label change for all subsequent body words.
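The bbox normalisation step can be sketched against pdfplumber's word dictionaries (which use `x0`, `top`, `x1`, `bottom` in points). The helper name and the clamping to the grid are illustrative assumptions:

```python
def normalise_word_bbox(word, page_width, page_height):
    """Scale a pdfplumber word box (x0, top, x1, bottom, in points)
    to LayoutLMv3's [0, 1000] integer grid."""
    def scale(value, extent):
        # Clamp to [0, 1000] in case a box slightly overflows the page
        return max(0, min(1000, int(1000 * value / extent)))
    return [
        scale(word["x0"], page_width),
        scale(word["top"], page_height),
        scale(word["x1"], page_width),
        scale(word["bottom"], page_height),
    ]
```

On a US Letter page (612 × 792 pt), a word spanning the second quarter of the page horizontally maps to x-coordinates 250–500 on the grid.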
### Token-Level Label Extraction

The full resume is tokenised in a single pass. The tokenizer's word_ids() propagates word-level labels to subword tokens. Special tokens ([CLS], [SEP]) receive label -100 and are excluded from span extraction, exactly mirroring extract_feature_indices_by_label from the training notebook.
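The propagation-and-extraction logic can be sketched as below; the real `extract_feature_indices_by_label` lives in the training notebook and may differ in signature:

```python
def extract_token_indices_by_label(word_ids, word_labels, target_label):
    """Map word-level section labels onto subword tokens via word_ids(),
    then return the token positions belonging to one section.

    word_ids: tokenizer word_ids() output; None marks special tokens
    ([CLS]/[SEP]), which behave as label -100 and are never extracted.
    """
    indices = []
    for token_idx, word_id in enumerate(word_ids):
        if word_id is None:
            continue  # special token: excluded from every span
        if word_labels[word_id] == target_label:
            indices.append(token_idx)
    return indices
```

Each returned index list then selects the token subsequence that is encoded independently for that section.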
**Fallback** (no tokens for a label): the span is encoded from `zeros(1)` input IDs with a `ones(1)` attention mask, matching training exactly.
## Limitations

- Designed for English-language resumes in standard PDF format
- Section detection relies on keyword-based line scanning; unconventional section names may be missed
- `pixel_values` are not used at inference (text + layout only), consistent with training
- Best used together with `Smutypi3/applai-sbert` and `Smutypi3/applai-confit`
## Citation

```bibtex
@software{lucero2025applai_layoutlmv3,
  author    = {Lucero, Jaime Emmanuel},
  title     = {{AppAI LayoutLMv3 Contrastive Learning}: Fine-tuned LayoutLMv3 for Resume PDF Embedding},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Smutypi3/applai-layoutlmv3},
  note      = {Part of the AppAI recruitment intelligence pipeline}
}
```
## License

MIT