AppAI — LayoutLMv3 Contrastive Learning

Fine-tuned microsoft/layoutlmv3-base for resume PDF embedding generation in the AppAI recruitment matching pipeline.

The model encodes resume PDFs into four 768-dim, L2-normalised span embeddings (full, education, experience, and leadership) using real bounding boxes extracted directly from PDF documents via pdfplumber.


Model Details

Property             Value
-------------------  -------------------------------------------------------
Base model           microsoft/layoutlmv3-base
Max sequence length  512 tokens
Embedding dimension  768
Normalisation        L2 (unit norm)
Pooling              Mean pooling over last hidden state with attention mask
Training objective   Contrastive learning (resume ↔ JD span matching)
Bounding box range   [0, 1000], normalised per page

Architecture

Resume PDF
  → pdfplumber word extraction + bbox normalisation [0, 1000]
  → Section label assignment (O / EDUCATION / EXPERIENCE / LEADERSHIP)
  → LayoutLMv3TokenizerFast (full resume, single tokenisation pass)
  → word_ids() label propagation to subword tokens
  → Per-label token subsequence extraction
  → LayoutLMv3Model backbone (pixel_values=None)
  → Mean pooling over last hidden state
  → L2 normalisation
  → 768-dim embedding per span

Inference mirrors extract_feature_indices_by_label from the training notebook exactly: the full resume is tokenised once, word-level section labels are propagated to subword tokens via word_ids(), then per-label token subsequences are extracted and encoded independently.
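The extraction step can be sketched as follows. `extract_indices_by_label` is a hypothetical stand-in for the notebook's `extract_feature_indices_by_label`, shown here on a toy label sequence rather than real tokenizer output:

```python
def extract_indices_by_label(token_labels, target_label):
    """Collect positions of subword tokens carrying a given section label.

    token_labels is the per-token list produced by propagating word-level
    labels through the tokenizer's word_ids(); special tokens such as
    [CLS] and [SEP] carry -100 and are never selected.
    """
    return [i for i, lab in enumerate(token_labels) if lab == target_label]

# Toy sequence: [CLS] edu edu exp [SEP]  ->  labels [-100, 1, 1, 2, -100]
token_labels = [-100, 1, 1, 2, -100]
edu_idx = extract_indices_by_label(token_labels, 1)  # [1, 2]
exp_idx = extract_indices_by_label(token_labels, 2)  # [3]
```

Each index list then selects the input IDs, bboxes, and attention mask for that span, which are encoded independently by the backbone.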


Intended Use

This model is part of the AppAI recruitment intelligence pipeline:

  1. Smutypi3/applai-sbert — encodes JD text spans (SBERT)
  2. This model — encodes resume PDF spans (LayoutLMv3)
  3. Smutypi3/applai-confit — aligns both embedding spaces (ConFiT)

Usage

Installation

pip install torch transformers pdfplumber

Encoding a Resume PDF

from ai_models.preprocessing.resume_preprocessor import preprocess_resume_pdf
from ai_models.services.layoutlm_service import encode_resume_spans

with open("resume.pdf", "rb") as f:
    pdf_bytes = f.read()

parsed = preprocess_resume_pdf(pdf_bytes)
# {"all_words": [{"text": ..., "bbox": [x1,y1,x2,y2]}, ...], "word_labels": [...]}

embeddings = encode_resume_spans(parsed)
# {"full": [...], "education": [...], "experience": [...], "leadership": [...]}
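Because every span embedding is L2-normalised, downstream similarity scoring (after ConFiT alignment against applai-sbert JD embeddings) reduces to a plain dot product. A minimal illustration with toy 2-dim unit vectors standing in for the 768-dim embeddings:

```python
import numpy as np

def cosine(u, v):
    # Both embeddings are unit-norm, so the dot product already
    # equals cosine similarity; no extra normalisation needed.
    return float(np.dot(u, v))

# Toy unit vectors standing in for 768-dim span embeddings
resume_exp = np.array([1.0, 0.0])
jd_span = np.array([0.6, 0.8])  # unit norm: 0.36 + 0.64 = 1
score = cosine(resume_exp, jd_span)  # 0.6
```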

Section Label IDs

Label ID  Section
--------  ---------------------------------
0         O (Other / header / unclassified)
1         EDUCATION
2         EXPERIENCE
3         LEADERSHIP
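In code, the mapping above is a pair of lookup dictionaries (the names here are illustrative, not taken from the model's config):

```python
# Section label IDs as listed in the table above
LABEL2ID = {"O": 0, "EDUCATION": 1, "EXPERIENCE": 2, "LEADERSHIP": 3}
ID2LABEL = {v: k for k, v in LABEL2ID.items()}
```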

Training Details

Preprocessing

Resume PDFs are parsed using pdfplumber. Per-word bounding boxes are normalised to [0, 1000] per page. Words are grouped into lines by y-coordinate proximity, and short lines (≤ 5 words) containing section header keywords trigger a label change for all subsequent body words.
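Both preprocessing steps (bbox scaling and keyword-based header detection) can be sketched as below. `HEADER_KEYWORDS` and the function names are illustrative, not the actual preprocessing API; the real keyword list lives in the preprocessing module:

```python
def normalise_bbox(word, page_width, page_height):
    """Scale a pdfplumber word box to LayoutLM's [0, 1000] range.

    pdfplumber words expose x0/x1 (horizontal) and top/bottom (vertical)
    in PDF points; page_width/page_height come from the page object.
    """
    clamp = lambda v: max(0, min(1000, int(v)))
    return [
        clamp(1000 * word["x0"] / page_width),
        clamp(1000 * word["top"] / page_height),
        clamp(1000 * word["x1"] / page_width),
        clamp(1000 * word["bottom"] / page_height),
    ]

# Hypothetical keyword-to-label map; the real set lives in the preprocessor
HEADER_KEYWORDS = {"education": 1, "experience": 2, "leadership": 3}

def line_label(line_words, current_label):
    """Short lines (<= 5 words) containing a header keyword flip the label."""
    if len(line_words) <= 5:
        text = " ".join(w["text"].lower() for w in line_words)
        for keyword, label in HEADER_KEYWORDS.items():
            if keyword in text:
                return label
    return current_label

word = {"text": "Education", "x0": 72, "x1": 144, "top": 100, "bottom": 112}
bbox = normalise_bbox(word, page_width=612, page_height=792)  # [117, 126, 235, 141]
label = line_label([word], current_label=0)  # 1 (EDUCATION)
```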

Token-Level Label Extraction

The full resume is tokenised in a single pass. The tokenizer's word_ids() propagates word-level labels to subword tokens. Special tokens ([CLS], [SEP]) receive label -100 and are excluded from span extraction, exactly mirroring extract_feature_indices_by_label from the training notebook.

Fallback (no tokens for a label): the span is encoded from a single zero input ID plus a single one attention mask (zeros(1) / ones(1)), matching training exactly.
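Pooling, normalisation, and the empty-label fallback can be sketched in PyTorch. `pool_and_normalise` is an illustrative name, not the service's actual function:

```python
import torch
import torch.nn.functional as F

def pool_and_normalise(last_hidden, attention_mask):
    """Masked mean pooling over the last hidden state, then L2 normalisation."""
    mask = attention_mask.unsqueeze(-1).float()        # (B, T, 1)
    summed = (last_hidden * mask).sum(dim=1)           # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)           # (B, 1)
    return F.normalize(summed / counts, p=2, dim=-1)   # unit-norm (B, H)

# Fallback inputs for a section that yielded no tokens, matching training
empty_ids = torch.zeros(1, 1, dtype=torch.long)   # zeros(1) input IDs
empty_mask = torch.ones(1, 1, dtype=torch.long)   # ones(1) attention mask

hidden = torch.randn(1, 4, 768)                   # stand-in backbone output
mask = torch.tensor([[1, 1, 1, 0]])               # last position is padding
emb = pool_and_normalise(hidden, mask)            # shape (1, 768), unit L2 norm
```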


Limitations

  • Designed for English-language resumes in standard PDF format
  • Section detection relies on keyword-based line scanning — unconventional section names may be missed
  • pixel_values are not used at inference (text + layout only), consistent with training
  • Best used together with Smutypi3/applai-sbert and Smutypi3/applai-confit

Citation

@software{lucero2025applai_layoutlmv3,
  author    = {Lucero, Jaime Emmanuel},
  title     = {{AppAI LayoutLMv3 Contrastive Learning}: Fine-tuned LayoutLMv3 for Resume PDF Embedding},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Smutypi3/applai-layoutlmv3},
  note      = {Part of the AppAI recruitment intelligence pipeline}
}

License

MIT
