	---
	library_name: pytorch
	license: apache-2.0
	language:
	- en
	tags:
	- authorship-attribution
	- contrastive-learning
	- late-interaction
	- stylometry
	- modernbert
	datasets:
	- almanach/halvest-contrastive
	base_model:
	- answerdotai/ModernBERT-base
	pipeline_tag: sentence-similarity
	---

	# deep-stylometry-modernbert-layerwise

	ModernBERT-base with layerwise attention pooling for contrastive authorship attribution, fine-tuned on [HALvest-Contrastive](https://huggingface.co/datasets/almanach/halvest-contrastive).

	## Model description

	This model is a [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) encoder with a two-layer projection head, fine-tuned with InfoNCE contrastive loss on the HALvest-Contrastive authorship attribution benchmark.

	Interaction method: learned layerwise attention over all ModernBERT hidden states, followed by mean pooling and cosine similarity. The model learns one scalar weight per layer and uses these weights to combine the per-layer representations before pooling.
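
As a rough illustration of the interaction method described above, the sketch below mixes per-layer hidden states with one learnable scalar per layer, mean-pools over non-padding tokens, and compares texts with cosine similarity. It is not the `deep_stylometry` implementation: the class name `LayerwisePool`, the softmax normalization of the layer weights, and the toy tensor sizes are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerwisePool(nn.Module):
    """Hypothetical sketch: weighted sum over layers, then masked mean pooling."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable scalar per layer; zeros give uniform weights at init.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, seq_len, dim); mask: (batch, seq_len)
        # Assumption: the scalar weights are normalized with a softmax.
        weights = F.softmax(self.layer_logits, dim=0)
        mixed = torch.einsum("l,lbsd->bsd", weights, hidden_states)
        m = mask.unsqueeze(-1).to(mixed.dtype)
        # Average over real tokens only; clamp guards against empty sequences.
        return (mixed * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)


def cosine_scores(query: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    # (n_q, dim) and (n_k, dim) -> (n_q, n_k) cosine similarity matrix
    return F.normalize(query, dim=-1) @ F.normalize(keys, dim=-1).T


# Toy sizes for illustration only, not the real model dimensions.
torch.manual_seed(0)
num_layers, batch, seq_len, dim = 4, 3, 5, 8
hidden = torch.randn(num_layers, batch, seq_len, dim)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1], [1, 1, 0, 0, 0]])

pooled = LayerwisePool(num_layers)(hidden, mask)  # (batch, dim)
scores = cosine_scores(pooled[:1], pooled[1:])    # query vs. two candidates
print(scores.shape)
```

Because the mask zeroes out padding before averaging, padded positions do not influence the pooled vector in this sketch.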

	## Intended use

	Authorship attribution and authorship verification on English text. Given a query text and a set of candidate texts, the model assigns each candidate a similarity score; a higher score means the candidate is more likely to have been written by the same author as the query.

	## How to use

	This checkpoint is a PyTorch Lightning `.ckpt` file. You need the [`deep_stylometry`](https://github.com/Madjakul/deep_stylometry) codebase to load it.

	### Installation

	```bash
	git clone https://github.com/Madjakul/deep_stylometry.git
	cd deep_stylometry
	pip install -r requirements.txt
	```

	### Minimal inference

	```python
	import torch
	from transformers import AutoTokenizer
	from deep_stylometry.modules.modeling_deep_stylometry import DeepStylometry
	from deep_stylometry.utils.configs import BaseConfig

	# 1. Load model from checkpoint
	cfg = BaseConfig(mode="test").from_yaml("configs/test_layerwise.yml")
	model = DeepStylometry.load_from_checkpoint("last.ckpt", cfg=cfg)
	model.eval()

	# 2. Tokenize
	tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
	texts = [
	    "Query text whose authorship you want to identify.",
	    "Candidate A: a text by the same author.",
	    "Candidate B: a text by a different author.",
	]
	enc = tokenizer(
	    texts, padding=True, truncation=True, max_length=512, return_tensors="pt",
	)

	# 3. Encode all texts through the model
	with torch.no_grad():
	    embs = model(enc["input_ids"], enc["attention_mask"])  # (3, seq_len, 768)

	# 4. Score the query against each candidate
	pool = model.contrastive_loss.pool  # the interaction module
	with torch.no_grad():
	    scores = pool(
	        query_embs=embs[:1],
	        key_embs=embs[1:],
	        q_mask=enc["attention_mask"][:1],
	        k_mask=enc["attention_mask"][1:],
	    )
	# scores shape: (1, 2) -- similarity of the query to [candidate A, candidate B]
	# Higher score = more likely same author.
	print(scores)
	```
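
Once the scores are available, attribution amounts to ranking the candidates and verification to comparing a single score with a threshold. The helper names `rank_candidates` and `same_author`, the example numbers, and the threshold value below are all hypothetical; a usable threshold would have to be tuned on held-out pairs.

```python
from typing import List, Tuple


def rank_candidates(scores: List[float], labels: List[str]) -> List[Tuple[str, float]]:
    """Sort candidates from most to least similar to the query."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


def same_author(score: float, threshold: float) -> bool:
    """Verification decision; the threshold is a hypothetical, tunable value."""
    return score >= threshold


# Made-up numbers standing in for `scores.squeeze(0).tolist()`.
example_scores = [0.74, 0.31]
ranking = rank_candidates(example_scores, ["Candidate A", "Candidate B"])
print(ranking)
print(same_author(ranking[0][1], threshold=0.5))
```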

	## Training details

	The model was trained on HALvest-Contrastive (English-only contrastive triplets derived from the HAL open-access scholarly repository) using InfoNCE loss with in-batch negatives. Training ran on four H100 GPUs with distributed data parallel (DDP), fp16 mixed precision, and a cosine learning rate schedule with linear warmup.
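
For readers unfamiliar with the objective, here is a minimal sketch of InfoNCE with in-batch negatives: each query's positive sits on the diagonal of the batch similarity matrix, and every other key in the batch serves as a negative. The function name `info_nce`, the temperature of 0.07, and the random toy embeddings are assumptions for illustration, not values taken from the training configuration.

```python
import torch
import torch.nn.functional as F


def info_nce(query: torch.Tensor, key: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch: row i of `key` is the positive for row i of `query`."""
    q = F.normalize(query, dim=-1)
    k = F.normalize(key, dim=-1)
    # (batch, batch) cosine similarities, sharpened by the temperature.
    logits = q @ k.T / temperature
    # The positive for each query is on the diagonal.
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)


# Toy embeddings: matched pairs should give a lower loss than random pairs.
torch.manual_seed(0)
anchors = torch.randn(8, 16)
loss_matched = info_nce(anchors, anchors + 0.01 * torch.randn(8, 16))
loss_random = info_nce(anchors, torch.randn(8, 16))
print(loss_matched.item(), loss_random.item())
```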

	See the [training configuration](configs/train_layerwise.yml) for full hyperparameters.
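
The learning rate schedule named in the training details can be sketched in a few lines. This is a generic formulation of linear warmup followed by cosine decay to zero; the function name `lr_at_step` and the step counts and peak rate used in the loop are hypothetical, and the actual values are in the training configuration.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup from 0 to peak_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Hypothetical values for illustration only.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000, warmup_steps=100, peak_lr=1e-4))
```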

	## Citation

	```bibtex
	@misc{kulumba_halvest_2026,
	title={HALvest-Contrastive: Retrieval-Like Authorship Attribution with Patch-Level Late Interaction},
	author={Francis Kulumba and Wissam Antoun and Guillaume Vimont and Laurent Romary and Florian Cafiero},
	year={2026},
	eprint={2407.20595},
	archivePrefix={arXiv},
	primaryClass={cs.DL},
	url={https://arxiv.org/abs/2407.20595},
	}
	```

	```bibtex
	@misc{kulumba_does_2026,
	title={Where Does Authorship Signal Emerge in Encoder-Based Language Models?},
	author={Francis Kulumba and Guillaume Vimont and Laurent Romary and Florian Cafiero},
	year={2026},
	eprint={2605.19908},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2605.19908},
	}
	```