---
library_name: transformers
pipeline_tag: sentence-similarity
tags:
- sentence-embeddings
- information-retrieval
- semantic-search
license: apache-2.0
---

# Wikipedia Document Segmentation with CoSeNet Transformer

This repository contains the official implementation and pretrained weights of the **CoSeNet Transformer**, a neural architecture for **document segmentation** and **sentence-level representation learning**, developed as part of a **Master's Thesis** at the University of Alcalá. A research paper describing the model and experiments is currently **in preparation**.

Hugging Face model repository: **[Alverciito/wikipedia_segmentation](https://huggingface.co/Alverciito/wikipedia_segmentation)**

---

## Overview

CoSeNet Transformer addresses document segmentation by combining:

- Transformer-based contextual representations
- A structural segmentation module (CoSeNet) for boundary modeling
- Sentence-level embedding extraction utilities for downstream segmentation and clustering pipelines

The model is evaluated on **Spanish Wikipedia** and compared against multiple baselines, including multilingual Transformer models and heuristic segmentation approaches.

---

## Repository Structure

- **src/**
  - **src/model/**: core architecture (SegmentationNetwork, encoder blocks, CoSeNet components)
  - **src/dataset/**: dataset utilities and loaders
  - **src/dlutils/**: deep learning utilities

- **research_files/**
  - **research_files/train/**: training scripts and configurations
  - **research_files/inference/**: standalone inference utilities (pre-Hugging Face integration)
  - **research_files/benchmark/**: benchmark suite, datasets, and evaluation scripts

- **Root files (Hugging Face integration)**
  - **model.py**: Hugging Face wrapper (`SentenceCoseNet`, `SentenceCoseNetConfig`)
  - **config.json**: Hugging Face model configuration
  - **model.safetensors**: pretrained model weights
  - **tokenizer.json**, **tokenizer_config.json**, **special_tokens_map.json**: tokenizer artifacts
  - **requirements.txt**: dependencies

---

## Installation

Install the dependencies listed in **requirements.txt**. The model is implemented in PyTorch and integrates with Hugging Face Transformers via custom code.

---

## Loading the Model

This repository uses **custom model code**, so loading requires enabling remote code execution by passing `trust_remote_code=True` to `from_pretrained`.

The tokenizer and model should both be loaded directly from the Hugging Face repository. A minimal end-to-end example:

````python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model (custom code requires trust_remote_code=True):
tokenizer = AutoTokenizer.from_pretrained("Alverciito/wikipedia_segmentation")
model = AutoModel.from_pretrained(
    "Alverciito/wikipedia_segmentation", trust_remote_code=True
)

# Tokenize:
text = "Hola mundo!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Token-level inference:
output = model.encode(
    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
)
print(output.shape)
# torch.Size([1, 4, 256])  # (batch, tokens, embedding_dim)

# Sentence embedding inference:
output = model.get_sentence_embedding(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    normalize=True,  # optional L2 normalization
)
print(output.shape)
# torch.Size([1, 256])  # (batch, embedding_dim)
````

---

## Inference API

The Hugging Face wrapper exposes a **task-oriented API**:

- **encode(input_ids, attention_mask=None)**
  Returns **token-level contextual embeddings**. Intended for encoder-style usage and intermediate representations.
  Typical output shape: `(batch, tokens, embedding_dim)`.

- **get_sentence_embedding(input_ids, attention_mask=None, normalize=...)**
  Returns **one embedding per sentence**, obtained via masked pooling and a lightweight regularization head.
  Typical output shape: `(batch, embedding_dim)`.
  Optional L2 normalization (`normalize=True`) is supported for similarity-based applications.

- **forward(...)**
  Reserved **exclusively for document segmentation**.
  Expects **3D inputs** `(batch, sentences, tokens)` and returns structured segmentation outputs.
This method is **not intended** for sentence embedding extraction.
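
The point of L2 normalization is that cosine similarity between unit-length embeddings reduces to a plain dot product. A minimal self-contained sketch of that relationship, using short illustrative vectors rather than real model outputs (which are 256-dimensional):

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length, analogous to the normalize=True option.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    # For unit-length vectors, cosine similarity is just the dot product.
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

# Illustrative 3-d "embeddings" (not real model outputs):
emb_a = [0.2, 0.9, 0.1]
emb_b = [0.25, 0.85, 0.05]
print(round(cosine_similarity(emb_a, emb_b), 3))  # prints 0.996
```

Pre-normalizing once is cheaper when each embedding is compared against many others, which is why the option is exposed at extraction time.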

---

## Benchmarks

Benchmark experiments are located in **research_files/benchmark/**.

The evaluation framework includes:

- The proposed **CoSeNet Transformer** model
- Heuristic segmentation baselines (TextTiling)
- Multilingual Transformer-based baselines, including:
  - sBERT
  - mBERT
  - LaBSE
  - XLM-R

Multiple segmentation strategies are evaluated, such as PELT, binary segmentation, and cosine-similarity-based methods. Metrics include Precision, Recall, F1-score, and WindowDiff.
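
As a hedged illustration of the cosine-similarity ("csim") strategy (the function name and threshold below are illustrative choices, not the benchmark's actual code), a boundary can be placed wherever the similarity between consecutive sentence embeddings drops:

```python
def csim_boundaries(embeddings, threshold=0.45):
    """Place a topic boundary after sentence i when the cosine similarity
    between embeddings i and i+1 falls below a threshold. Assumes the
    embeddings are already L2-normalized, so cosine similarity reduces
    to a dot product. Threshold value is illustrative."""
    boundaries = []
    for i in range(len(embeddings) - 1):
        sim = sum(a * b for a, b in zip(embeddings[i], embeddings[i + 1]))
        if sim < threshold:
            boundaries.append(i)  # boundary between sentence i and i+1
    return boundaries

# Four unit vectors with a sharp topic shift between sentences 1 and 2:
embs = [[1.0, 0.0], [0.98, 0.199], [0.0, 1.0], [0.1, 0.995]]
print(csim_boundaries(embs))  # prints [1]
```

PELT and BinSeg replace this fixed threshold with change-point detection over the same similarity signal.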

| Category | Model / Method | Spanish Support | Training |
|----------|----------------|-----------------|----------|
| Classical | TextTiling | Yes | No |
| Neural (sentence-level) | Multilingual Sentence-Transformer + cosine similarity (csim) | Yes | No |
| Neural (clustering) | Multilingual Sentence-Transformer + PELT | Yes | No |
| Neural (clustering) | Multilingual Sentence-Transformer + Binary Segmentation (BinSeg) | Yes | No |
| Neural (frozen LM) | mBERT (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Neural (frozen LM) | LaBSE (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Neural (frozen LM) | XLM-R (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Proposed | CoSeNet Transformer (sentence encoder + CoSeNet layering + candidate masking + pooling) | Yes | Yes |

**WindowDiff (WD)** is used as the primary segmentation error metric.
Lower values indicate better segmentation quality. In this work, **WindowDiff values ≤ 0.30 are considered acceptable**, values **≤ 0.20 indicate good performance**, and values **≤ 0.10 indicate strong segmentation accuracy** under standard evaluation settings.
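
For reference, a minimal WindowDiff sketch (window conventions vary slightly between papers and toolkits; this version slides over `n - k` windows of 0/1 boundary indicators):

```python
def window_diff(reference, hypothesis, k):
    """WindowDiff (Pevzner & Hearst, 2002): the fraction of length-k
    windows in which reference and hypothesis disagree on the number
    of boundaries. Inputs are 0/1 boundary indicator sequences of
    equal length; lower values mean better segmentation."""
    n = len(reference)
    assert len(hypothesis) == n and 0 < k < n
    errors = 0
    for i in range(n - k):
        if sum(reference[i:i + k]) != sum(hypothesis[i:i + k]):
            errors += 1
    return errors / (n - k)

ref = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]  # true boundaries after sentences 2 and 6
hyp = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]  # one boundary off by a single sentence
print(round(window_diff(ref, hyp, k=3), 3))  # prints 0.286
```

Unlike exact-match F1, near-miss boundaries are only penalized in the few windows that straddle them, which is why WD is preferred for segmentation.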

The benchmark entry point is **bench.py**, and results are stored as JSON files for reproducibility and further analysis.

---

## Research Context

This work is part of an academic research project and thesis, with an accompanying paper currently in preparation. The repository prioritizes:

- Architectural clarity
- Reproducibility of experiments
- Explicit separation between segmentation and embedding extraction

The model is **not** a SentenceTransformer, although it internally uses Transformer encoders.

---

## License

This project is released under the **Apache License 2.0**.

---

## Author and Affiliation

**Alberto Palomo Alonso**  
University of Alcalá  
Escuela Politécnica Superior

---