# AppAI LayoutLMv3 Contrastive Learning
Fine-tuned microsoft/layoutlmv3-base for resume PDF embedding generation in the AppAI recruitment matching pipeline.
The model encodes resume PDFs into four 768-dim L2-normalised span embeddings (full, education, experience, and leadership) using real bounding boxes extracted directly from PDF documents via pdfplumber.
## Model Details
| Property | Value |
|---|---|
| Base model | microsoft/layoutlmv3-base |
| Max sequence length | 512 tokens |
| Embedding dimension | 768 |
| Normalisation | L2 (unit norm) |
| Pooling | Mean pooling over last hidden state with attention mask |
| Training objective | Contrastive learning (resume ↔ JD span matching) |
| Bounding box range | [0, 1000] normalised per page |
## Architecture

```
Resume PDF
  ↓ pdfplumber word extraction + bbox normalisation [0, 1000]
  ↓ Section label assignment (O / EDUCATION / EXPERIENCE / LEADERSHIP)
  ↓ LayoutLMv3TokenizerFast (full resume, single tokenisation pass)
  ↓ word_ids() label propagation to subword tokens
  ↓ Per-label token subsequence extraction
  ↓ LayoutLMv3Model backbone (pixel_values=None)
  ↓ Mean pooling over last hidden state
  ↓ L2 normalisation
  ↓ 768-dim embedding per span
```
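The final two stages, mean pooling under the attention mask followed by L2 normalisation, can be sketched as below. The function name is illustrative, not the pipeline's actual helper:

```python
import torch
import torch.nn.functional as F

def mean_pool_and_normalise(last_hidden_state, attention_mask):
    """Mean-pool token embeddings under the attention mask, then L2-normalise."""
    mask = attention_mask.unsqueeze(-1).float()        # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)     # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)           # avoid division by zero
    pooled = summed / counts                           # masked mean
    return F.normalize(pooled, p=2, dim=1)             # unit-norm rows
```

Padding tokens contribute nothing to the mean because they are zeroed by the mask before summation.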
Inference mirrors extract_feature_indices_by_label from the training notebook exactly: the full resume is tokenised once, word-level section labels are propagated to subword tokens via word_ids(), then per-label token subsequences are extracted and encoded independently.
## Intended Use

This model is part of the AppAI recruitment intelligence pipeline:

- `Smutypi3/applai-sbert` encodes JD text spans (SBERT)
- This model encodes resume PDF spans (LayoutLMv3)
- `Smutypi3/applai-confit` aligns both embedding spaces (ConFiT)
## Usage

### Installation

```bash
pip install torch transformers pdfplumber
```
### Encoding a Resume PDF

```python
from ai_models.preprocessing.resume_preprocessor import preprocess_resume_pdf
from ai_models.services.layoutlm_service import encode_resume_spans

with open("resume.pdf", "rb") as f:
    pdf_bytes = f.read()

parsed = preprocess_resume_pdf(pdf_bytes)
# {"all_words": [{"text": ..., "bbox": [x1,y1,x2,y2]}, ...], "word_labels": [...]}

embeddings = encode_resume_spans(parsed)
# {"full": [...], "education": [...], "experience": [...], "leadership": [...]}
```
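Because every span embedding is unit-norm, downstream matching against an aligned JD embedding reduces to a dot product. A minimal sketch, assuming the embeddings are plain float lists as returned above:

```python
def span_similarity(resume_vec, jd_vec):
    """Cosine similarity of two L2-normalised vectors is just their dot product."""
    return sum(r * j for r, j in zip(resume_vec, jd_vec))
```

Scores fall in [-1, 1], with higher values indicating a closer resume-to-JD match.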
### Section Label IDs
| Label ID | Section |
|---|---|
| 0 | O (Other / header / unclassified) |
| 1 | EDUCATION |
| 2 | EXPERIENCE |
| 3 | LEADERSHIP |
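These label IDs could be assigned by a keyword scan over short lines, as the Preprocessing section describes. The following is only a sketch: the keyword map, the helper name, and whether a header line itself receives the new label are all assumptions, not the pipeline's published code:

```python
# Hypothetical keyword map; the real pipeline's keyword list is not published
SECTION_KEYWORDS = {"education": 1, "experience": 2, "leadership": 3}

def assign_word_labels(lines):
    """lines: words grouped into lines by y-proximity. Returns one label per word.

    Short lines (<= 5 words) containing a section keyword switch the current
    label for all subsequent words; everything before the first header is O (0).
    """
    labels, current = [], 0
    for line in lines:
        if len(line) <= 5:
            joined = " ".join(line).lower()
            for keyword, label_id in SECTION_KEYWORDS.items():
                if keyword in joined:
                    current = label_id
                    break
        labels.extend([current] * len(line))
    return labels
```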
## Training Details

### Preprocessing

Resume PDFs are parsed using pdfplumber. Per-word bounding boxes are normalised to [0, 1000] per page. Words are grouped into lines by y-coordinate proximity, and short lines (≤ 5 words) containing section header keywords trigger a label change for all subsequent body words.
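The bbox normalisation step can be sketched against pdfplumber's word dictionaries (which use `x0`, `top`, `x1`, `bottom` in points). The helper name and the clamping to the grid are illustrative assumptions:

```python
def normalise_word_bbox(word, page_width, page_height):
    """Scale a pdfplumber word box (x0, top, x1, bottom, in points)
    to LayoutLMv3's [0, 1000] integer grid."""
    def scale(value, extent):
        # Clamp to [0, 1000] in case a box slightly overflows the page
        return max(0, min(1000, int(1000 * value / extent)))
    return [
        scale(word["x0"], page_width),
        scale(word["top"], page_height),
        scale(word["x1"], page_width),
        scale(word["bottom"], page_height),
    ]
```

On a US Letter page (612 × 792 pt), a word spanning the second quarter of the page horizontally maps to x-coordinates 250–500 on the grid.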
### Token-Level Label Extraction

The full resume is tokenised in a single pass. The tokenizer's word_ids() propagates word-level labels to subword tokens. Special tokens ([CLS], [SEP]) receive label -100 and are excluded from span extraction, exactly mirroring extract_feature_indices_by_label from the training notebook.
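The propagation-and-extraction logic can be sketched as below; the real `extract_feature_indices_by_label` lives in the training notebook and may differ in signature:

```python
def extract_token_indices_by_label(word_ids, word_labels, target_label):
    """Map word-level section labels onto subword tokens via word_ids(),
    then return the token positions belonging to one section.

    word_ids: tokenizer word_ids() output; None marks special tokens
    ([CLS]/[SEP]), which behave as label -100 and are never extracted.
    """
    indices = []
    for token_idx, word_id in enumerate(word_ids):
        if word_id is None:
            continue  # special token: excluded from every span
        if word_labels[word_id] == target_label:
            indices.append(token_idx)
    return indices
```

Each returned index list then selects the token subsequence that is encoded independently for that section.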
**Fallback** (no tokens for a label): the span is encoded from `zeros(1)` input IDs with a `ones(1)` attention mask, matching training exactly.
## Limitations

- Designed for English-language resumes in standard PDF format
- Section detection relies on keyword-based line scanning; unconventional section names may be missed
- `pixel_values` are not used at inference (text + layout only), consistent with training
- Best used together with `Smutypi3/applai-sbert` and `Smutypi3/applai-confit`
## Citation

```bibtex
@software{lucero2025applai_layoutlmv3,
  author    = {Lucero, Jaime Emmanuel},
  title     = {{AppAI LayoutLMv3 Contrastive Learning}: Fine-tuned LayoutLMv3 for Resume PDF Embedding},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Smutypi3/applai-layoutlmv3},
  note      = {Part of the AppAI recruitment intelligence pipeline}
}
```
## License

MIT