Wikipedia Document Segmentation with CoSeNet Transformer
This repository contains the official implementation and pretrained weights of CoSeNet Transformer, a neural architecture for document segmentation and sentence-level representation learning, developed as part of a Master's Thesis at the University of Alcalá. A research paper describing the model and experiments is currently in preparation.
Hugging Face model repository: Alverciito/wikipedia_segmentation
Overview
CoSeNet Transformer addresses document segmentation by combining:
- Transformer-based contextual representations
- A structural segmentation module (CoSeNet) for boundary modeling
- Sentence-level embedding extraction utilities for downstream segmentation and clustering pipelines
The model is evaluated on Spanish Wikipedia and compared against multiple baselines, including multilingual Transformer models and heuristic segmentation approaches.
Repository Structure
src/
- src/model/: core architecture (SegmentationNetwork, encoder blocks, CoSeNet components)
- src/dataset/: dataset utilities and loaders
- src/dlutils/: deep learning utilities
research_files/
- research_files/train/: training scripts and configurations
- research_files/inference/: standalone inference utilities (pre-Hugging Face integration)
- research_files/benchmark/: benchmark suite, datasets, and evaluation scripts
Root files (Hugging Face integration)
- model.py: Hugging Face wrapper (SentenceCoseNet, SentenceCoseNetConfig)
- config.json: Hugging Face model configuration
- model.safetensors: pretrained model weights
- tokenizer.json, tokenizer_config.json, special_tokens_map.json: tokenizer artifacts
- requirements.txt: dependencies
Installation
Install the dependencies listed in requirements.txt. The model is implemented in PyTorch and integrates with Hugging Face Transformers via custom code.
Loading the Model
This repository uses custom model code, so loading it requires passing `trust_remote_code=True` to `from_pretrained`.
The tokenizer and model should both be loaded directly from the Hugging Face repository. A minimal end-to-end example:
```python
from transformers import AutoTokenizer, AutoModel

# Load tokenizer and model (custom code requires trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Alverciito/wikipedia_segmentation")
model = AutoModel.from_pretrained(
    "Alverciito/wikipedia_segmentation", trust_remote_code=True
)

# Tokenize
text = "Hola mundo!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Token-level inference
output = model.encode(
    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
)
print(output.shape)
# torch.Size([1, 4, 256])  (batch, tokens, embedding_dim)

# Sentence embedding inference
output = model.get_sentence_embedding(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    normalize=True,  # optional L2 normalization
)
print(output.shape)
# torch.Size([1, 256])  (batch, embedding_dim)
```
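The `normalize=True` option matters for similarity-based use: once embeddings are scaled to unit L2 norm, cosine similarity between two of them reduces to a plain dot product. A minimal pure-Python sketch of that equivalence (independent of the model; the helper names here are illustrative, not part of the repository):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm, as normalize=True does for embeddings."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Standard cosine similarity between two raw (unnormalized) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = l2_normalize(a), l2_normalize(b)
dot_of_units = sum(x * y for x, y in zip(ua, ub))
print(abs(dot_of_units - cosine_similarity(a, b)) < 1e-12)  # True
```

This is why normalized embeddings can be compared with a fast matrix multiplication instead of per-pair cosine computations.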
Inference API
The Hugging Face wrapper exposes a task-oriented API:
encode(input_ids, attention_mask=None)
Returns token-level contextual embeddings, intended for encoder-style usage and intermediate representations. Typical output shape: (batch, tokens, embedding_dim).

get_sentence_embedding(input_ids, attention_mask=None, normalize=True/False)
Returns one embedding per sentence, obtained via masked pooling and a lightweight regularization head. Typical output shape: (batch, embedding_dim). Optional L2 normalization is supported for similarity-based applications.

forward(...)
Reserved exclusively for document segmentation. Expects 3D inputs of shape (batch, sentences, tokens) and returns structured segmentation outputs. This method is not intended for sentence embedding extraction.
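Because forward(...) expects 3D inputs of shape (batch, sentences, tokens), a document must be tokenized sentence by sentence and padded to a common shape before stacking. A hypothetical helper sketching that assembly (build_segmentation_batch is not part of the repository; pad_id would come from the tokenizer):

```python
def build_segmentation_batch(docs, pad_id=0):
    """Pad per-sentence token ids into a (batch, sentences, tokens) layout.

    docs: list of documents; each document is a list of sentences,
    each sentence a list of token ids.
    Returns nested lists, ready to wrap in tensors.
    """
    max_sents = max(len(d) for d in docs)
    max_toks = max(len(s) for d in docs for s in d)
    input_ids, attention_mask = [], []
    for d in docs:
        sent_ids, sent_mask = [], []
        for s in d:
            pad = max_toks - len(s)
            sent_ids.append(s + [pad_id] * pad)
            sent_mask.append([1] * len(s) + [0] * pad)
        # Pad the document with empty (fully masked) sentences.
        for _ in range(max_sents - len(d)):
            sent_ids.append([pad_id] * max_toks)
            sent_mask.append([0] * max_toks)
        input_ids.append(sent_ids)
        attention_mask.append(sent_mask)
    return input_ids, attention_mask

ids, mask = build_segmentation_batch([[[5, 6], [7]], [[8, 9, 10]]])
print(len(ids), len(ids[0]), len(ids[0][0]))  # 2 2 3  (batch, sentences, tokens)
```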
Benchmarks
Benchmark experiments are located in research_files/benchmark/.
The evaluation framework includes:
- The proposed CoSeNet Transformer model
- Heuristic segmentation baselines (TextTiling)
- Multilingual Transformer-based baselines, including:
- sBERT
- mBERT
- LaBSE
- XLM-R
Multiple segmentation strategies are evaluated, such as PELT, binary segmentation, and cosine-similarity-based methods. Metrics include Precision, Recall, F1-score, and WindowDiff.
| Category | Model / Method | Spanish Support | Training |
|---|---|---|---|
| Classical | TextTiling | Yes | No |
| Neural (sentence-level) | Multilingual Sentence-Transformer + cosine similarity (csim) | Yes | No |
| Neural (clustering) | Multilingual Sentence-Transformer + PELT | Yes | No |
| Neural (clustering) | Multilingual Sentence-Transformer + Binary Segmentation (BinSeg) | Yes | No |
| Neural (frozen LM) | mBERT (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Neural (frozen LM) | LaBSE (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Neural (frozen LM) | XLM-R (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Proposed | CoseNet Transformer (sentence encoder + CoSeNet layering + candidate masking + pooling) | Yes | Yes |
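The cosine-similarity (csim) baselines in the table above score adjacent sentence pairs and place a boundary where the similarity between consecutive sentence embeddings drops. A minimal sketch of that idea (the threshold value and toy 2-D embeddings are illustrative, not the benchmark's actual settings):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def csim_boundaries(embeddings, threshold=0.5):
    """Return a 0/1 indicator per gap: 1 means a segment boundary between
    sentence i and i+1, placed where adjacent cosine similarity falls below
    the threshold."""
    return [1 if cosine(embeddings[i], embeddings[i + 1]) < threshold else 0
            for i in range(len(embeddings) - 1)]

# Three sentences on one topic, then two on another (toy 2-D embeddings).
embs = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(csim_boundaries(embs))  # [0, 0, 1, 0] -- boundary after the third sentence
```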
WindowDiff (WD) is used as the primary segmentation error metric.
Lower values indicate better segmentation quality. In this work, WindowDiff values ≤ 0.30 are considered acceptable, values ≤ 0.20 indicate good performance, and values ≤ 0.10 indicate strong segmentation accuracy under standard evaluation settings.
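For reference, WindowDiff (Pevzner and Hearst, 2002) slides a window of size k over the reference and hypothesis boundary sequences and counts windows in which the two disagree on the number of boundaries. A minimal implementation sketch (the repository's own evaluation code may differ in window-size convention and edge handling):

```python
def window_diff(ref, hyp, k=None):
    """WindowDiff between two equal-length 0/1 boundary sequences.

    ref[i] / hyp[i] indicate a segment boundary after unit i.
    k defaults to half the mean reference segment length (a common convention).
    Returns a value in [0, 1]; lower is better.
    """
    assert len(ref) == len(hyp)
    n = len(ref)
    if k is None:
        k = max(2, round(n / (sum(ref) + 1) / 2))
    disagreements = sum(
        1 for i in range(n - k + 1)
        if sum(ref[i:i + k]) != sum(hyp[i:i + k])
    )
    return disagreements / (n - k + 1)

ref = [0, 0, 1, 0, 0, 1, 0, 0]
print(window_diff(ref, ref, k=3))      # 0.0 (perfect segmentation)
print(window_diff(ref, [0] * 8, k=3))  # 1.0 (all boundaries missed)
```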
The benchmark entry point is bench.py, and results are stored as JSON files for reproducibility and further analysis.
Research Context
This work is part of an academic research project and a PhD thesis, with an accompanying paper currently being written. The repository prioritizes:
- Architectural clarity
- Reproducibility of experiments
- Explicit separation between segmentation and embedding extraction
The model is not a SentenceTransformer, although it internally uses Transformer encoders.
License
This project is released under the Apache License 2.0.
Author and Affiliation
Alberto Palomo Alonso
University of Alcalá
Escuela Politécnica Superior