---
library_name: transformers
pipeline_tag: sentence-similarity
tags:
- sentence-embeddings
- information-retrieval
- semantic-search
license: apache-2.0
---
# Wikipedia Document Segmentation with CoSeNet Transformer
This repository contains the official implementation and pretrained weights of the **CoSeNet Transformer**, a neural architecture for **document segmentation** and **sentence-level representation learning**, developed as part of a **Master’s Thesis** at the University of Alcalá. A research paper describing the model and experiments is currently **in preparation**.
Hugging Face model repository: **Alverciito/wikipedia_segmentation**
---
## Overview
The CoSeNet Transformer addresses document segmentation by combining:
- Transformer-based contextual representations
- A structural segmentation module (CoSeNet) for boundary modeling
- Sentence-level embedding extraction utilities for downstream segmentation and clustering pipelines
The model is evaluated on **Spanish Wikipedia** and compared against multiple baselines, including multilingual Transformer models and heuristic segmentation approaches.
---
## Repository Structure
- **src/**
- **src/model/**: core architecture (SegmentationNetwork, encoder blocks, CoSeNet components)
- **src/dataset/**: dataset utilities and loaders
- **src/dlutils/**: deep learning utilities
- **research_files/**
- **research_files/train/**: training scripts and configurations
- **research_files/inference/**: standalone inference utilities (pre-Hugging Face integration)
- **research_files/benchmark/**: benchmark suite, datasets, and evaluation scripts
- **Root files (Hugging Face integration)**
- **model.py**: Hugging Face wrapper (`SentenceCoseNet`, `SentenceCoseNetConfig`)
- **config.json**: Hugging Face model configuration
- **model.safetensors**: pretrained model weights
- **tokenizer.json**, **tokenizer_config.json**, **special_tokens_map.json**: tokenizer artifacts
- **requirements.txt**: dependencies
---
## Installation
Install the dependencies listed in **requirements.txt**. The model is implemented in PyTorch and integrates with Hugging Face Transformers via custom code.
---
## Loading the Model
This repository uses **custom model code**, so loading requires passing `trust_remote_code=True` to the Hugging Face `from_pretrained` calls.
The tokenizer and model should both be loaded directly from the Hugging Face repository. A minimal end-to-end example:
````python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model (custom code requires trust_remote_code=True):
tokenizer = AutoTokenizer.from_pretrained(
    "Alverciito/wikipedia_segmentation", trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "Alverciito/wikipedia_segmentation", trust_remote_code=True
)

# Tokenize:
text = "Hola mundo!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Token-level inference:
output = model.encode(
    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
)
print(output.shape)
# torch.Size([1, 4, 256])  # (batch, tokens, embedding_dim)

# Sentence embedding inference:
output = model.get_sentence_embedding(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    normalize=True,  # optional L2 normalization
)
print(output.shape)
# torch.Size([1, 256])  # (batch, embedding_dim)
````
---
## Inference API
The Hugging Face wrapper exposes a **task-oriented API**:
- **encode(input_ids, attention_mask=None)**
Returns **token-level contextual embeddings**. Intended for encoder-style usage and intermediate representations.
Typical output shape: `(batch, tokens, embedding_dim)`.
- **get_sentence_embedding(input_ids, attention_mask=None, normalize: bool)**
Returns **one embedding per sentence**, obtained via masked pooling and a lightweight regularization head.
Typical output shape: `(batch, embedding_dim)`.
Optional L2 normalization is supported for similarity-based applications.
- **forward(...)**
Reserved **exclusively for document segmentation**.
Expects **3D inputs** `(batch, sentences, tokens)` and returns structured segmentation outputs.
This method is **not intended** for sentence embedding extraction.
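As a rough illustration of the masked pooling step behind `get_sentence_embedding`, here is a minimal sketch. The helper name `masked_mean_pool` is hypothetical (not part of the repository's API), and the lightweight regularization head mentioned above is omitted:

```python
import torch

def masked_mean_pool(token_embeddings, attention_mask):
    # Hypothetical helper, not the repository's actual implementation.
    # token_embeddings: (batch, tokens, dim); attention_mask: (batch, tokens)
    mask = attention_mask.unsqueeze(-1).float()    # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(dim=1)  # sum over non-padding tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)       # avoid division by zero
    return summed / counts                         # (batch, dim)
```

Averaging only over unmasked positions keeps padding tokens from diluting the sentence embedding.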
---
## Benchmarks
Benchmark experiments are located in **research_files/benchmark/**.
The evaluation framework includes:
- The proposed **CoSeNet Transformer** model
- Heuristic segmentation baselines (TextTiling)
- Multilingual Transformer-based baselines, including:
- sBERT
- mBERT
- LaBSE
- XLM-R
Multiple segmentation strategies are evaluated, such as PELT, binary segmentation, and cosine-similarity-based methods. Metrics include Precision, Recall, F1-score, and WindowDiff.
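The cosine-similarity-based strategy can be sketched as follows: hypothesize a boundary wherever the similarity between consecutive sentence embeddings drops below a threshold. The function name and threshold value are illustrative, not the benchmark suite's actual code:

```python
import torch

def cosine_boundaries(embeddings, threshold=0.5):
    # Illustrative sketch; embeddings: (num_sentences, dim) sentence embeddings.
    # A boundary is hypothesized wherever consecutive-sentence similarity
    # falls below the threshold.
    sims = torch.nn.functional.cosine_similarity(
        embeddings[:-1], embeddings[1:], dim=-1
    )
    return (sims < threshold).long()  # 1 = boundary after sentence i
```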
| Category | Model / Method | Spanish Support | Training |
|--------|----------------|-----------------|----------|
| Classical | TextTiling | Yes | No |
| Neural (sentence-level) | Multilingual Sentence-Transformer + cosine similarity (csim) | Yes | No |
| Neural (clustering) | Multilingual Sentence-Transformer + PELT | Yes | No |
| Neural (clustering) | Multilingual Sentence-Transformer + Binary Segmentation (BinSeg) | Yes | No |
| Neural (frozen LM) | mBERT (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Neural (frozen LM) | LaBSE (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Neural (frozen LM) | XLM-R (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Proposed | CoSeNet Transformer (sentence encoder + CoSeNet layering + candidate masking + pooling) | Yes | Yes |
**WindowDiff (WD)** is used as the primary segmentation error metric.
Lower values indicate better segmentation quality. In this work, **WindowDiff values ≤ 0.30 are considered acceptable**, values **≤ 0.20 indicate good performance**, and values **≤ 0.10 indicate strong segmentation accuracy** under standard evaluation settings.
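For reference, a minimal sketch of the WindowDiff metric (Pevzner and Hearst's formulation, with a common heuristic for the window size `k`); this is not the repository's own evaluation code:

```python
def window_diff(ref, hyp, k=None):
    """WindowDiff: fraction of sliding windows in which the number of
    boundaries in the reference and hypothesis disagree.
    ref, hyp: binary sequences, 1 = boundary after that sentence."""
    n = len(ref)
    if k is None:
        # Common convention: roughly half the mean reference segment length.
        k = max(2, n // (2 * (sum(ref) + 1)))
    errors = sum(
        sum(ref[i:i + k]) != sum(hyp[i:i + k])
        for i in range(n - k)
    )
    return errors / (n - k)
```

A perfect hypothesis yields 0.0; predicting no boundaries at all is penalized in every window that contains a reference boundary.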
The benchmark entry point is **bench.py**, and results are stored as JSON files for reproducibility and further analysis.
---
## Research Context
This work is part of an academic research project and thesis, with an accompanying paper currently being written. The repository prioritizes:
- Architectural clarity
- Reproducibility of experiments
- Explicit separation between segmentation and embedding extraction
The model is **not** a SentenceTransformer, although it internally uses Transformer encoders.
---
## License
This project is released under the **Apache License 2.0**.
---
## Author and Affiliation
**Alberto Palomo Alonso**
University of Alcalá
Escuela Politécnica Superior
---