ArcDoc-UnB — Sub-Center CosFace + Professor Network (Sprint3b, Split 0)
ArcDoc-UnB is a visual document embedding model trained with the Sprint3b pipeline,
a two-phase curriculum that uses a Professor Network (RL-based hard-negative mining) on top of
a frozen InternVL3-2B backbone. It was trained on the UnB gpds2 server.
Architecture
| Component | Value |
|---|---|
| Backbone | InternVL3-2B (OpenGVLab/InternVL3-2B), frozen |
| Cut layer | 27 |
| Pooler | Attention |
| Head | MLP |
| Embedding dim | 1536 |
| Loss | Sub-Center CosFace (m=0.35, s=32, k=3) |
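For readers unfamiliar with the loss in the table, the following is a minimal NumPy sketch of the generic Sub-Center CosFace formulation with the listed hyperparameters (margin m=0.35, scale s=32, k=3 sub-centers per class). It is not the training code of this repository; the function name `subcenter_cosface_loss`, the array layout, and the toy data are assumptions made for illustration.

```python
import numpy as np

def subcenter_cosface_loss(embeddings, labels, centers, m=0.35, s=32.0):
    """Mean Sub-Center CosFace loss (illustrative sketch).

    embeddings: (batch, dim) array
    labels:     (batch,) integer class ids
    centers:    (num_classes, k, dim) array holding k sub-centers per class
    """
    # L2-normalise so that dot products are cosine similarities.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = centers / np.linalg.norm(centers, axis=2, keepdims=True)

    # Cosine to every sub-center, then keep the closest sub-center per class.
    cos = np.einsum("bd,ckd->bck", x, w).max(axis=2)  # (batch, num_classes)

    # CosFace: subtract the additive margin m from the target-class cosine only,
    # then scale all logits by s.
    rows = np.arange(len(labels))
    logits = cos.copy()
    logits[rows, labels] -= m
    logits *= s

    # Numerically stable softmax cross-entropy.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[rows, labels].mean())

# Toy usage with random stand-in data (5 hypothetical classes, k=3, dim=1536).
rng = np.random.default_rng(0)
toy_centers = rng.normal(size=(5, 3, 1536))
toy_labels = np.array([0, 2, 4, 1])
toy_embeddings = toy_centers[toy_labels, 0] + 0.1 * rng.normal(size=(4, 1536))
toy_loss = subcenter_cosface_loss(toy_embeddings, toy_labels, toy_centers)
```

Taking the maximum over the k sub-centers lets each class be represented by several prototypes, while the margin is applied only to the target class so that it has to win by at least m in cosine space.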
Performance
| Dataset | Metric | Value |
|---|---|---|
| LA-CDIP split0 (validation) | EER | 1.80% |
| RVL-CDIP (zero-shot, Top-1) | Accuracy | not reported |
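The card does not describe how the EER was computed. As a rough illustration of the metric itself, the sketch below computes an equal error rate from similarity scores of genuine (same-class) and impostor (different-class) pairs. The function name `equal_error_rate`, the "accept if score >= threshold" convention, and the toy scores are assumptions, not the evaluation code behind the number above.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Return (eer, threshold) for similarity scores (higher = more similar)."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)

    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([genuine, impostor]))

    # A pair is accepted when its score is >= the threshold.
    # FAR: fraction of impostor pairs accepted.
    # FRR: fraction of genuine pairs rejected.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])

    # The EER is read off where the two error rates are closest.
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2), float(thresholds[i])

# Toy scores, invented purely for illustration.
eer, threshold = equal_error_rate(
    genuine=[0.9, 0.8, 0.7, 0.3],
    impostor=[0.1, 0.2, 0.4, 0.6],
)
```

On the toy scores, one of four genuine pairs is rejected and one of four impostor pairs is accepted at the selected threshold, so both error rates are 25%.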
Training Details
| Parameter | Value |
|---|---|
| Training data | LA-CDIP split0 (ZSL protocol, val=split 0) |
| Phase 1 | 10 epochs, no professor |
| Phase 2 | 5 epochs, professor active (warmup=140 steps) |
| Steps / epoch | 140 |
| Batch size | 4 (grad accum 3 → effective 12) |
| Candidate pool | 8 |
| Student LR | 5e-5 (AdamW, plateau scheduler) |
| Seed | 42 |
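The schedule implied by the table can be worked out with a few lines of arithmetic. The sketch below assumes that steps are counted continuously across both phases and that the professor warmup starts at the first step of Phase 2; the card states neither explicitly, nor whether a "step" is a micro-batch or an optimizer update. The helper `describe_step` is hypothetical.

```python
PHASE1_EPOCHS = 10
PHASE2_EPOCHS = 5
STEPS_PER_EPOCH = 140
BATCH_SIZE = 4
GRAD_ACCUM = 3
PROFESSOR_WARMUP_STEPS = 140

effective_batch = BATCH_SIZE * GRAD_ACCUM                 # 4 * 3 = 12
phase1_steps = PHASE1_EPOCHS * STEPS_PER_EPOCH            # 10 * 140 = 1400
phase2_steps = PHASE2_EPOCHS * STEPS_PER_EPOCH            # 5 * 140 = 700
total_steps = phase1_steps + phase2_steps                 # 2100
warmup_epochs = PROFESSOR_WARMUP_STEPS / STEPS_PER_EPOCH  # 1.0 epoch

def describe_step(step):
    """Map a 0-based global step to (phase, professor_in_warmup).

    Assumes continuous step counting and a warmup that begins with Phase 2.
    """
    if not 0 <= step < total_steps:
        raise ValueError(f"step must be in [0, {total_steps})")
    if step < phase1_steps:
        return 1, False  # Phase 1: no professor at all
    return 2, (step - phase1_steps) < PROFESSOR_WARMUP_STEPS
```

Under these assumptions the warmup covers exactly the first of the five Phase 2 epochs.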
Usage
```python
import torch
from PIL import Image
from huggingface_hub import hf_hub_download

from cavl_doc.models.backbone_loader import load_model
from cavl_doc.models.modeling_cavl import build_cavl_model
# Preprocessing and embedding helpers (not used in the loading steps shown here).
from cavl_doc.data.transforms import build_transform, dynamic_preprocess
from cavl_doc.utils.embedding_utils import prepare_inputs_for_multimodal_embedding

REPO_ID = "Jpcosta90/arcdoc-unb"
PROMPT = "<image> Analyze this document"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load_model returns a 5-tuple; only the backbone and tokenizer are needed here.
backbone, _, tokenizer, _, _ = load_model("InternVL3-2B", load_in_4bit=False)
backbone = backbone.to(device)
backbone.img_context_token_id = tokenizer.convert_tokens_to_ids("<IMG_CONTEXT>")

# Download the checkpoint from the Hub.
# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
ckpt_path = hf_hub_download(REPO_ID, "best_model.pt")
ckpt = torch.load(ckpt_path, map_location=device, weights_only=False)
cfg = ckpt["config"]

# Rebuild the pooler and head around the frozen backbone using the saved config.
model = build_cavl_model(
    backbone=backbone,
    cut_layer=cfg["cut_layer"],
    encode_fn=None,
    pool_dim=cfg["hidden_dim"],
    proj_hidden=4096,  # hard-coded here rather than read from the checkpoint config
    proj_out=cfg["projection_output_dim"],
    set_trainable=False,
    tokenizer=tokenizer,
    pooler_type=cfg["pooler_type"],
    head_type=cfg["head_type"],
    num_queries=cfg["num_queries"],
)

# Restore the trained pooler and head weights, then switch to inference mode.
model.pool.load_state_dict(ckpt["siam_pool"])
model.head.load_state_dict(ckpt["siam_head"])
model.to(device).eval()
```
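Once two documents have been embedded, a common way to compare them is cosine similarity against a decision threshold. The sketch below assumes the embeddings are already available as 1536-dimensional vectors and uses random stand-ins. The card does not state which similarity measure or threshold was used, so `cosine_similarity`, `same_class`, and the 0.5 threshold are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_class(a, b, threshold):
    """Decide whether two embeddings belong to the same document class."""
    return cosine_similarity(a, b) >= threshold

# Random stand-ins for two 1536-dim embeddings (not real model outputs).
rng = np.random.default_rng(0)
emb_a = rng.normal(size=1536)
emb_b = rng.normal(size=1536)
decision = same_class(emb_a, emb_b, threshold=0.5)  # hypothetical threshold
```

In practice the threshold would be tuned on held-out pairs rather than fixed in advance.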
Citation
```bibtex
@misc{cavldoc2026,
  title  = {CaVL-Doc: Curriculum and Active-learning Vision-Language for Document Retrieval},
  author = {Costa, João Paulo},
  year   = {2026},
  url    = {https://huggingface.co/Jpcosta90/arcdoc-unb}
}
```