---
library_name: pytorch
license: apache-2.0
language:
- en
tags:
- authorship-attribution
- contrastive-learning
- late-interaction
- stylometry
- modernbert
datasets:
- almanach/halvest-contrastive
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: sentence-similarity
---

# deep-stylometry-modernbert-layerwise

ModernBERT-base with layerwise attention pooling for contrastive authorship attribution, fine-tuned on [HALvest-Contrastive](https://huggingface.co/datasets/almanach/halvest-contrastive).

## Model description

This model is a [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) encoder with a two-layer projection head, fine-tuned with InfoNCE contrastive loss on the HALvest-Contrastive authorship attribution benchmark.

**Interaction method**: Learned layerwise attention over all ModernBERT hidden states, followed by mean pooling and cosine similarity. The model learns a scalar weight per layer to combine representations before pooling.

## Intended use

Authorship attribution and authorship verification on English text. Given a query text and a set of candidate texts, the model scores how likely each candidate was written by the same author as the query.

## How to use

This checkpoint is a PyTorch Lightning `.ckpt` file. You need the [`deep_stylometry`](https://github.com/Madjakul/deep_stylometry) codebase to load it.

### Installation

```bash
git clone https://github.com/Madjakul/deep_stylometry.git
cd deep_stylometry
pip install -r requirements.txt
```

### Minimal inference

```python
import torch
from transformers import AutoTokenizer

from deep_stylometry.modules.modeling_deep_stylometry import DeepStylometry
from deep_stylometry.utils.configs import BaseConfig

# 1. Load model from checkpoint
cfg = BaseConfig(mode="test").from_yaml("configs/test_layerwise.yml")
model = DeepStylometry.load_from_checkpoint("last.ckpt", cfg=cfg)
model.eval()

# 2. Tokenize
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
texts = [
    "Query text whose authorship you want to identify.",
    "Candidate A: a text by the same author.",
    "Candidate B: a text by a different author.",
]
enc = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

# 3. Encode all texts through the model
with torch.no_grad():
    embs = model(enc["input_ids"], enc["attention_mask"])  # (3, seq_len, 768)

# 4. Score the query against each candidate
pool = model.contrastive_loss.pool  # the interaction module
with torch.no_grad():
    scores = pool(
        query_embs=embs[:1],
        key_embs=embs[1:],
        q_mask=enc["attention_mask"][:1],
        k_mask=enc["attention_mask"][1:],
    )

# scores shape: (1, 2) -- similarity of the query to [candidate A, candidate B]
# Higher score = more likely same author.
print(scores)
```

## Training details

The model was trained on HALvest-Contrastive (English-only contrastive triplets derived from the HAL open-access scholarly repository) using InfoNCE loss with in-batch negatives. Training used 4x H100 GPUs with DDP, mixed precision (fp16), and a cosine learning rate schedule with linear warmup.

See the [training configuration](configs/train_layerwise.yml) for full hyperparameters.

## Citation

```bibtex
@misc{kulumba_halvest_2026,
  title={HALvest-Contrastive: Retrieval-Like Authorship Attribution with Patch-Level Late Interaction},
  author={Francis Kulumba and Wissam Antoun and Guillaume Vimont and Laurent Romary and Florian Cafiero},
  year={2026},
  eprint={2407.20595},
  archivePrefix={arXiv},
  primaryClass={cs.DL},
  url={https://arxiv.org/abs/2407.20595},
}
```

```bibtex
@misc{kulumba_does_2026,
  title={Where Does Authorship Signal Emerge in Encoder-Based Language Models?},
  author={Francis Kulumba and Guillaume Vimont and Laurent Romary and Florian Cafiero},
  year={2026},
  eprint={2605.19908},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.19908},
}
```
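
## Illustrative sketch of the interaction method

The interaction method described under "Model description" (a learned scalar weight per layer, then mean pooling and cosine similarity) can be sketched in a few lines of PyTorch. This is a minimal illustration, not the `deep_stylometry` implementation: the names `LayerwisePool` and `cosine_scores` are hypothetical, the softmax normalization of the layer weights is an assumption, and random tensors stand in for real ModernBERT hidden states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerwisePool(nn.Module):
    """Hypothetical sketch: one learned scalar per layer, then masked mean pooling."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable logit per hidden-state layer; zeros give uniform weights at init.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, seq_len, dim); mask: (batch, seq_len)
        # Assumption: the per-layer scalars are normalized with a softmax.
        weights = torch.softmax(self.layer_logits, dim=0)
        # Weighted sum over the layer axis -> (batch, seq_len, dim)
        mixed = torch.einsum("l,lbsd->bsd", weights, hidden_states)
        # Mean over real (non-padding) tokens only -> (batch, dim)
        m = mask.unsqueeze(-1).to(mixed.dtype)
        return (mixed * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)


def cosine_scores(query: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    # (n_q, dim) and (n_k, dim) -> (n_q, n_k) matrix of cosine similarities
    return F.normalize(query, dim=-1) @ F.normalize(keys, dim=-1).T


# Toy stand-in for encoder outputs: 4 layers, 3 texts, 5 tokens, 8 dimensions.
torch.manual_seed(0)
num_layers, batch, seq_len, dim = 4, 3, 5, 8
hidden_states = torch.randn(num_layers, batch, seq_len, dim)
mask = torch.ones(batch, seq_len, dtype=torch.long)
mask[2, 3:] = 0  # pretend the last text ends with two padding tokens

pool = LayerwisePool(num_layers)
with torch.no_grad():
    pooled = pool(hidden_states, mask)              # (3, 8)
    scores = cosine_scores(pooled[:1], pooled[1:])  # query vs. two candidates

print(scores.shape)  # torch.Size([1, 2])
```

As in the inference example, the first row plays the role of the query and the remaining rows are candidates, so the result has one score per candidate. Masking before averaging keeps padding tokens from affecting the pooled vector.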