Wikipedia Document Segmentation with CoSeNet Transformer
This repository contains the official implementation and pretrained weights of CoSeNet Transformer, a neural architecture for document segmentation and sentence-level representation learning, developed as part of a Master's Thesis at the University of Alcalá. A research paper describing the model and experiments is currently in preparation.
Hugging Face model repository: Alverciito/wikipedia_segmentation
Overview
CoSeNet Transformer addresses document segmentation by combining:
- Transformer-based contextual representations
- A structural segmentation module (CoSeNet) for boundary modeling
- Sentence-level embedding extraction utilities for downstream segmentation and clustering pipelines
The model is evaluated on Spanish Wikipedia and compared against multiple baselines, including multilingual Transformer models and heuristic segmentation approaches.
Repository Structure
src/
- src/model/: core architecture (SegmentationNetwork, encoder blocks, CoSeNet components)
- src/dataset/: dataset utilities and loaders
- src/dlutils/: deep learning utilities
research_files/
- research_files/train/: training scripts and configurations
- research_files/inference/: standalone inference utilities (pre-Hugging Face integration)
- research_files/benchmark/: benchmark suite, datasets, and evaluation scripts
Root files (Hugging Face integration)
- model.py: Hugging Face wrapper (SentenceCoseNet, SentenceCoseNetConfig)
- config.json: Hugging Face model configuration
- model.safetensors: pretrained model weights
- tokenizer.json, tokenizer_config.json, special_tokens_map.json: tokenizer artifacts
- requirements.txt: dependencies
Installation
Install the dependencies listed in requirements.txt. The model is implemented in PyTorch and integrates with Hugging Face Transformers via custom code.
Loading the Model
This repository uses custom model code, so loading it requires passing `trust_remote_code=True` to `from_pretrained`.
The tokenizer and model should both be loaded directly from the Hugging Face repository. A minimal end-to-end example:
```python
from transformers import AutoTokenizer, AutoModel

# Load tokenizer and model (custom code requires trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Alverciito/wikipedia_segmentation")
model = AutoModel.from_pretrained(
    "Alverciito/wikipedia_segmentation", trust_remote_code=True
)

# Tokenize
text = "Hola mundo!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Token-level inference
output = model.encode(
    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
)
print(output.shape)
# torch.Size([1, 4, 256])  (batch, tokens, embedding_dim)

# Sentence embedding inference
output = model.get_sentence_embedding(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    normalize=True,  # optional L2 normalization
)
print(output.shape)
# torch.Size([1, 256])  (batch, embedding_dim)
```
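The `normalize=True` option matters for similarity-based use: once embeddings are scaled to unit L2 norm, cosine similarity between two of them reduces to a plain dot product. A minimal pure-Python sketch of that equivalence (independent of the model; the helper names here are illustrative, not part of the repository):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm, as normalize=True does for embeddings."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Standard cosine similarity between two raw (unnormalized) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = l2_normalize(a), l2_normalize(b)
dot_of_units = sum(x * y for x, y in zip(ua, ub))
print(abs(dot_of_units - cosine_similarity(a, b)) < 1e-12)  # True
```

This is why normalized embeddings can be compared with a fast matrix multiplication instead of per-pair cosine computations.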
Inference API
The Hugging Face wrapper exposes a task-oriented API:
encode(input_ids, attention_mask=None)
Returns token-level contextual embeddings, intended for encoder-style usage and intermediate representations. Typical output shape: (batch, tokens, embedding_dim).

get_sentence_embedding(input_ids, attention_mask=None, normalize=True/False)
Returns one embedding per sentence, obtained via masked pooling and a lightweight regularization head. Typical output shape: (batch, embedding_dim). Optional L2 normalization is supported for similarity-based applications.

forward(...)
Reserved exclusively for document segmentation. Expects 3D inputs of shape (batch, sentences, tokens) and returns structured segmentation outputs. This method is not intended for sentence embedding extraction.
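Because forward(...) expects 3D inputs of shape (batch, sentences, tokens), a document must be tokenized sentence by sentence and padded to a common shape before stacking. A hypothetical helper sketching that assembly (build_segmentation_batch is not part of the repository; pad_id would come from the tokenizer):

```python
def build_segmentation_batch(docs, pad_id=0):
    """Pad per-sentence token ids into a (batch, sentences, tokens) layout.

    docs: list of documents; each document is a list of sentences,
    each sentence a list of token ids.
    Returns nested lists, ready to wrap in tensors.
    """
    max_sents = max(len(d) for d in docs)
    max_toks = max(len(s) for d in docs for s in d)
    input_ids, attention_mask = [], []
    for d in docs:
        sent_ids, sent_mask = [], []
        for s in d:
            pad = max_toks - len(s)
            sent_ids.append(s + [pad_id] * pad)
            sent_mask.append([1] * len(s) + [0] * pad)
        # Pad the document with empty (fully masked) sentences.
        for _ in range(max_sents - len(d)):
            sent_ids.append([pad_id] * max_toks)
            sent_mask.append([0] * max_toks)
        input_ids.append(sent_ids)
        attention_mask.append(sent_mask)
    return input_ids, attention_mask

ids, mask = build_segmentation_batch([[[5, 6], [7]], [[8, 9, 10]]])
print(len(ids), len(ids[0]), len(ids[0][0]))  # 2 2 3  (batch, sentences, tokens)
```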
Benchmarks
Benchmark experiments are located in research_files/benchmark/.
The evaluation framework includes:
- The proposed CoSeNet Transformer model
- Heuristic segmentation baselines (TextTiling)
- Multilingual Transformer-based baselines, including:
- sBERT
- mBERT
- LaBSE
- XLM-R
Multiple segmentation strategies are evaluated, such as PELT, binary segmentation, and cosine-similarity-based methods. Metrics include Precision, Recall, F1-score, and WindowDiff.
| Category | Model / Method | Spanish Support | Training |
|---|---|---|---|
| Classical | TextTiling | Yes | No |
| Neural (sentence-level) | Multilingual Sentence-Transformer + cosine similarity (csim) | Yes | No |
| Neural (clustering) | Multilingual Sentence-Transformer + PELT | Yes | No |
| Neural (clustering) | Multilingual Sentence-Transformer + Binary Segmentation (BinSeg) | Yes | No |
| Neural (frozen LM) | mBERT (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Neural (frozen LM) | LaBSE (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Neural (frozen LM) | XLM-R (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Proposed | CoseNet Transformer (sentence encoder + CoSeNet layering + candidate masking + pooling) | Yes | Yes |
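The cosine-similarity (csim) baselines in the table above score adjacent sentence pairs and place a boundary where the similarity between consecutive sentence embeddings drops. A minimal sketch of that idea (the threshold value and toy 2-D embeddings are illustrative, not the benchmark's actual settings):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def csim_boundaries(embeddings, threshold=0.5):
    """Return a 0/1 indicator per gap: 1 means a segment boundary between
    sentence i and i+1, placed where adjacent cosine similarity falls below
    the threshold."""
    return [1 if cosine(embeddings[i], embeddings[i + 1]) < threshold else 0
            for i in range(len(embeddings) - 1)]

# Three sentences on one topic, then two on another (toy 2-D embeddings).
embs = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(csim_boundaries(embs))  # [0, 0, 1, 0] -- boundary after the third sentence
```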
WindowDiff (WD) is used as the primary segmentation error metric.
Lower values indicate better segmentation quality. In this work, WindowDiff values ≤ 0.30 are considered acceptable, values ≤ 0.20 indicate good performance, and values ≤ 0.10 indicate strong segmentation accuracy under standard evaluation settings.
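For reference, WindowDiff (Pevzner and Hearst, 2002) slides a window of size k over the reference and hypothesis boundary sequences and counts windows in which the two disagree on the number of boundaries. A minimal implementation sketch (the repository's own evaluation code may differ in window-size convention and edge handling):

```python
def window_diff(ref, hyp, k=None):
    """WindowDiff between two equal-length 0/1 boundary sequences.

    ref[i] / hyp[i] indicate a segment boundary after unit i.
    k defaults to half the mean reference segment length (a common convention).
    Returns a value in [0, 1]; lower is better.
    """
    assert len(ref) == len(hyp)
    n = len(ref)
    if k is None:
        k = max(2, round(n / (sum(ref) + 1) / 2))
    disagreements = sum(
        1 for i in range(n - k + 1)
        if sum(ref[i:i + k]) != sum(hyp[i:i + k])
    )
    return disagreements / (n - k + 1)

ref = [0, 0, 1, 0, 0, 1, 0, 0]
print(window_diff(ref, ref, k=3))      # 0.0 (perfect segmentation)
print(window_diff(ref, [0] * 8, k=3))  # 1.0 (all boundaries missed)
```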
The benchmark entry point is bench.py, and results are stored as JSON files for reproducibility and further analysis.
Research Context
This work is part of an academic research project and a PhD thesis, with an accompanying paper currently being written. The repository prioritizes:
- Architectural clarity
- Reproducibility of experiments
- Explicit separation between segmentation and embedding extraction
The model is not a SentenceTransformer, although it internally uses Transformer encoders.
License
This project is released under the Apache License 2.0.
Author and Affiliation
Alberto Palomo Alonso
University of Alcalá
Escuela Politécnica Superior