---
library_name: transformers
pipeline_tag: sentence-similarity
tags:
- sentence-embeddings
- information-retrieval
- semantic-search
license: apache-2.0
---

# Wikipedia Document Segmentation with CoSeNet Transformer

This repository contains the official implementation and pretrained weights of the **CoSeNet Transformer**, a neural architecture for **document segmentation** and **sentence-level representation learning**, developed as part of a **Master's Thesis** at the University of Alcalá. A research paper describing the model and experiments is currently **in preparation**.

Hugging Face model repository: **Alverciito/wikipedia_segmentation**

---

## Overview

The CoSeNet Transformer addresses document segmentation by combining:

- Transformer-based contextual representations
- A structural segmentation module (CoSeNet) for boundary modeling
- Sentence-level embedding extraction utilities for downstream segmentation and clustering pipelines

The model is evaluated on **Spanish Wikipedia** and compared against multiple baselines, including multilingual Transformer models and heuristic segmentation approaches.
---

## Repository Structure

- **src/**
  - **src/model/**: core architecture (SegmentationNetwork, encoder blocks, CoSeNet components)
  - **src/dataset/**: dataset utilities and loaders
  - **src/dlutils/**: deep learning utilities
- **research_files/**
  - **research_files/train/**: training scripts and configurations
  - **research_files/inference/**: standalone inference utilities (pre-Hugging Face integration)
  - **research_files/benchmark/**: benchmark suite, datasets, and evaluation scripts
- **Root files (Hugging Face integration)**
  - **model.py**: Hugging Face wrapper (`SentenceCoseNet`, `SentenceCoseNetConfig`)
  - **config.json**: Hugging Face model configuration
  - **model.safetensors**: pretrained model weights
  - **tokenizer.json**, **tokenizer_config.json**, **special_tokens_map.json**: tokenizer artifacts
  - **requirements.txt**: dependencies

---

## Installation

Install the dependencies listed in **requirements.txt**. The model is implemented in PyTorch and integrates with Hugging Face Transformers via custom code.

---

## Loading the Model

This repository uses **custom model code**, so loading requires passing `trust_remote_code=True` to the Hugging Face `from_pretrained` API. The tokenizer and model should both be loaded directly from the Hugging Face repository.

A minimal end-to-end example:

````python
from transformers import AutoTokenizer, AutoModel

# Loading (custom model code requires trust_remote_code=True):
tokenizer = AutoTokenizer.from_pretrained("Alverciito/wikipedia_segmentation", trust_remote_code=True)
model = AutoModel.from_pretrained("Alverciito/wikipedia_segmentation", trust_remote_code=True)

# Tokenize:
text = "Hola mundo!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Token-level inference:
output = model.encode(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
print(output.shape)
# torch.Size([1, 4, 256])  # (batch, tokens, embedding_dim)

# Sentence embedding inference:
output = model.get_sentence_embedding(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    normalize=True,  # optional L2 normalization
)
print(output.shape)
# torch.Size([1, 256])  # (batch, embedding_dim)
````

---

## Inference API

The Hugging Face wrapper exposes a **task-oriented API**:

- **encode(input_ids, attention_mask=None)**
  Returns **token-level contextual embeddings**, intended for encoder-style usage and intermediate representations. Typical output shape: `(batch, tokens, embedding_dim)`.

- **get_sentence_embedding(input_ids, attention_mask=None, normalize=True/False)**
  Returns **one embedding per sentence**, obtained via masked pooling and a lightweight regularization head. Typical output shape: `(batch, embedding_dim)`. Optional L2 normalization is supported for similarity-based applications.

- **forward(...)**
  Reserved **exclusively for document segmentation**. Expects **3D inputs** of shape `(batch, sentences, tokens)` and returns structured segmentation outputs. This method is **not intended** for sentence embedding extraction.

---

## Benchmarks

Benchmark experiments are located in **research_files/benchmark/**. The evaluation framework includes:

- The proposed **CoSeNet Transformer** model
- A heuristic segmentation baseline (TextTiling)
- Multilingual Transformer-based baselines, including:
  - SBERT
  - mBERT
  - LaBSE
  - XLM-R

Multiple segmentation strategies are evaluated, such as PELT, binary segmentation, and cosine-similarity-based methods. Metrics include Precision, Recall, F1-score, and WindowDiff.
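As an illustration of the cosine-similarity segmentation strategy, the sketch below places a boundary wherever the similarity between consecutive sentence embeddings drops below a threshold. It is a minimal NumPy sketch, not the benchmark code: the function name, the threshold value, and the toy embeddings are illustrative assumptions.

```python
import numpy as np

def cosine_boundaries(embeddings: np.ndarray, threshold: float = 0.5) -> list:
    """Return indices i such that a segment boundary is placed after
    sentence i, based on cosine similarity between sentences i and i+1."""
    # Normalize rows so a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = np.sum(unit[:-1] * unit[1:], axis=1)  # similarity of pairs (i, i+1)
    return [i for i, s in enumerate(sims) if s < threshold]

# Toy example: two clusters of nearly identical vectors -> one boundary.
rng = np.random.default_rng(0)
topic_a = rng.normal(size=256)
topic_b = rng.normal(size=256)
sentences = np.stack(
    [topic_a + 0.01 * rng.normal(size=256) for _ in range(3)]
    + [topic_b + 0.01 * rng.normal(size=256) for _ in range(3)]
)
print(cosine_boundaries(sentences))  # [2] -> boundary after the third sentence
```

In practice the embeddings would come from `get_sentence_embedding` (or a baseline encoder), and the threshold would be tuned on held-out data rather than fixed.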
| Category | Model / Method | Spanish Support | Training |
|----------|----------------|-----------------|----------|
| Classical | TextTiling | Yes | No |
| Neural (sentence-level) | Multilingual Sentence-Transformer + cosine similarity (csim) | Yes | No |
| Neural (clustering) | Multilingual Sentence-Transformer + PELT | Yes | No |
| Neural (clustering) | Multilingual Sentence-Transformer + Binary Segmentation (BinSeg) | Yes | No |
| Neural (frozen LM) | mBERT (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Neural (frozen LM) | LaBSE (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Neural (frozen LM) | XLM-R (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Proposed | CoSeNet Transformer (sentence encoder + CoSeNet layering + candidate masking + pooling) | Yes | Yes |

**WindowDiff (WD)** is used as the primary segmentation error metric; lower values indicate better segmentation quality. In this work, WindowDiff values **≤ 0.30** are considered acceptable, values **≤ 0.20** indicate good performance, and values **≤ 0.10** indicate strong segmentation accuracy under standard evaluation settings.

The benchmark entry point is **bench.py**, and results are stored as JSON files for reproducibility and further analysis.

---

## Research Context

This work is part of an academic research project and a **PhD thesis**, with an accompanying paper currently being written. The repository prioritizes:

- Architectural clarity
- Reproducibility of experiments
- Explicit separation between segmentation and embedding extraction

The model is **not** a SentenceTransformer, although it internally uses Transformer encoders.

---

## License

This project is released under the **Apache License 2.0**.

---

## Author and Affiliation

**Alberto Palomo Alonso**
University of Alcalá
Escuela Politécnica Superior

---