---
library_name: pytorch
license: apache-2.0
language:
- en
tags:
- authorship-attribution
- contrastive-learning
- late-interaction
- stylometry
- modernbert
datasets:
- almanach/halvest-contrastive
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: sentence-similarity
---

# deep-stylometry-modernbert-layerwise

ModernBERT-base with layerwise attention pooling for contrastive authorship attribution, fine-tuned on [HALvest-Contrastive](https://huggingface.co/datasets/almanach/halvest-contrastive).

## Model description

This model is a [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) encoder with a two-layer projection head, fine-tuned with InfoNCE contrastive loss on the HALvest-Contrastive authorship attribution benchmark.

**Interaction method**: Learned layerwise attention over all ModernBERT hidden states, followed by mean pooling and cosine similarity. The model learns a scalar weight per layer to combine representations before pooling.

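
To make the interaction method concrete, here is a minimal plain-Python sketch of the three steps: mix the per-layer hidden states with one scalar weight per layer, mean-pool over non-padding tokens, and compare the pooled vectors with cosine similarity. It is not the `deep_stylometry` implementation. The function names and toy inputs are hypothetical, and normalizing the layer scalars with a softmax is an assumption, since the card only states that one scalar weight is learned per layer.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layerwise_mean_pool(hidden_states, layer_logits, mask):
    """Mix layers with scalar weights, then mean-pool over unmasked tokens.

    hidden_states: [num_layers][seq_len][dim] nested lists
    layer_logits:  [num_layers] learned scalars (softmax-normalized here: an assumption)
    mask:          [seq_len] with 1 for real tokens and 0 for padding
    """
    weights = softmax(layer_logits)
    dim = len(hidden_states[0][0])
    n_tokens = sum(mask)
    pooled = [0.0] * dim
    for t, keep in enumerate(mask):
        if not keep:
            continue  # padding tokens do not contribute to the mean
        for d in range(dim):
            mixed = sum(w * layer[t][d] for w, layer in zip(weights, hidden_states))
            pooled[d] += mixed / n_tokens
    return pooled

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy inputs: 2 layers, 3 token positions (the last one is padding), dimension 2.
hidden_q = [
    [[1.0, 0.0], [1.0, 0.0], [9.0, 9.0]],  # layer 0
    [[0.0, 1.0], [0.0, 1.0], [9.0, 9.0]],  # layer 1
]
hidden_k = [
    [[0.0, 1.0], [0.0, 1.0], [0.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0], [0.0, 0.0]],
]
mask = [1, 1, 0]
layer_logits = [0.0, 0.0]  # equal logits give equal layer weights

pooled_q = layer_logits and layerwise_mean_pool(hidden_q, layer_logits, mask)
pooled_k = layerwise_mean_pool(hidden_k, layer_logits, mask)
score = cosine(pooled_q, pooled_k)
print(pooled_q, pooled_k, score)
```

In this toy case the padded position (filled with 9.0) is ignored by the mask, so only the two real tokens enter the mean.
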
## Intended use

Authorship attribution and authorship verification on English text. Given a query text and a set of candidate texts, the model scores how likely each candidate was written by the same author as the query.

## How to use

This checkpoint is a PyTorch Lightning `.ckpt` file. You need the [`deep_stylometry`](https://github.com/Madjakul/deep_stylometry) codebase to load it.

### Installation

```bash
git clone https://github.com/Madjakul/deep_stylometry.git
cd deep_stylometry
pip install -r requirements.txt
```

### Minimal inference

```python
import torch
from transformers import AutoTokenizer
from deep_stylometry.modules.modeling_deep_stylometry import DeepStylometry
from deep_stylometry.utils.configs import BaseConfig

# 1. Load model from checkpoint
cfg = BaseConfig(mode="test").from_yaml("configs/test_layerwise.yml")
model = DeepStylometry.load_from_checkpoint("last.ckpt", cfg=cfg)
model.eval()

# 2. Tokenize
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
texts = [
    "Query text whose authorship you want to identify.",
    "Candidate A: a text by the same author.",
    "Candidate B: a text by a different author.",
]
enc = tokenizer(
    texts, padding=True, truncation=True, max_length=512, return_tensors="pt",
)

# 3. Encode all texts through the model
with torch.no_grad():
    embs = model(enc["input_ids"], enc["attention_mask"])  # (3, seq_len, 768)

# 4. Score the query against each candidate
pool = model.contrastive_loss.pool  # the interaction module
with torch.no_grad():
    scores = pool(
        query_embs=embs[:1],
        key_embs=embs[1:],
        q_mask=enc["attention_mask"][:1],
        k_mask=enc["attention_mask"][1:],
    )
# scores shape: (1, 2) -- similarity of the query to [candidate A, candidate B]
# Higher score = more likely same author.
print(scores)
```

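
The `scores` tensor from the inference snippet can serve both intended uses: ranking candidates (attribution) and thresholding a single pair (verification). The sketch below works on a plain list, such as one obtained with `scores[0].tolist()`. The helper names, the example values, and the threshold are hypothetical placeholders, not outputs of this model or recommended settings.

```python
def rank_candidates(scores):
    """Return candidate indices ordered from most to least similar to the query."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def is_same_author(score, threshold):
    """Verification decision: accept the pair when the score reaches the threshold."""
    return score >= threshold

# Placeholder similarities for [candidate A, candidate B]; not real model outputs.
query_scores = [0.82, 0.31]

# Attribution: pick the highest-scoring candidate.
ranking = rank_candidates(query_scores)
best_candidate = ranking[0]

# Verification: the threshold is hypothetical and would need tuning on
# held-out same-author / different-author pairs.
THRESHOLD = 0.5
decisions = [is_same_author(s, THRESHOLD) for s in query_scores]

print(ranking, best_candidate, decisions)
```

Because the scores are similarities rather than calibrated probabilities, any threshold should be chosen on validation data for the target domain.
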
## Training details

The model was trained on HALvest-Contrastive (English-only contrastive triplets derived from the HAL open-access scholarly repository) using InfoNCE loss with in-batch negatives. Training ran on four H100 GPUs with distributed data parallel (DDP), fp16 mixed precision, and a cosine learning-rate schedule with linear warmup.

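
As a rough illustration of InfoNCE with in-batch negatives, the plain-Python sketch below takes a square similarity matrix in which entry `sim[i][j]` is the score between query `i` and key `j`. The diagonal holds the positive pairs, and every other key in the row acts as a negative. The function name and the temperature value of 0.07 are assumptions for illustration. The actual training code and hyperparameters live in the `deep_stylometry` repository.

```python
import math

def info_nce(sim, temperature=0.07):
    """Mean InfoNCE loss over a batch.

    sim: square matrix (list of lists); sim[i][i] is the positive pair for
         query i, and sim[i][j] with j != i are in-batch negatives.
    temperature: softmax temperature (0.07 is an assumed, commonly used value).
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # equals -log softmax(logits)[i]
    return total / n

# Positives clearly separated from negatives: low loss.
loss_good = info_nce([[1.0, 0.0], [0.0, 1.0]])
# Negatives score higher than positives: high loss.
loss_bad = info_nce([[0.0, 1.0], [1.0, 0.0]])
# No separation at all: loss equals log(batch size).
loss_uniform = info_nce([[0.5, 0.5], [0.5, 0.5]])

print(loss_good, loss_bad, loss_uniform)
```

With in-batch negatives, a larger batch supplies more negatives per query at no extra encoding cost, which is one reason this loss pairs well with multi-GPU training.
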
See the [training configuration](configs/train_layerwise.yml) for full hyperparameters.

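
For readers unfamiliar with the schedule named in the training details, here is a small plain-Python sketch of a cosine learning-rate schedule with linear warmup. The function name, step counts, and peak learning rate are hypothetical. The values actually used for this checkpoint are the ones in the training configuration.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate at a given optimizer step.

    Linear warmup from 0 to peak_lr over warmup_steps, then cosine decay
    from peak_lr down to min_lr at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical settings for illustration only.
TOTAL_STEPS = 1000
WARMUP_STEPS = 100
PEAK_LR = 1e-4

lr_start = lr_at(0, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR)
lr_mid_warmup = lr_at(50, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR)
lr_peak = lr_at(100, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR)
lr_halfway = lr_at(550, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR)
lr_end = lr_at(1000, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR)

print(lr_start, lr_mid_warmup, lr_peak, lr_halfway, lr_end)
```

The warmup phase avoids large early updates to the pretrained encoder, and the cosine phase lowers the rate smoothly toward the end of training.
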
## Citation

```bibtex
@misc{kulumba_halvest_2026,
  title={HALvest-Contrastive: Retrieval-Like Authorship Attribution with Patch-Level Late Interaction},
  author={Francis Kulumba and Wissam Antoun and Guillaume Vimont and Laurent Romary and Florian Cafiero},
  year={2026},
  eprint={2407.20595},
  archivePrefix={arXiv},
  primaryClass={cs.DL},
  url={https://arxiv.org/abs/2407.20595},
}
```

```bibtex
@misc{kulumba_does_2026,
  title={Where Does Authorship Signal Emerge in Encoder-Based Language Models?},
  author={Francis Kulumba and Guillaume Vimont and Laurent Romary and Florian Cafiero},
  year={2026},
  eprint={2605.19908},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.19908},
}
```