---
library_name: transformers
pipeline_tag: sentence-similarity
tags:
- sentence-embeddings
- information-retrieval
- semantic-search
license: apache-2.0
---

# Wikipedia Document Segmentation with CoSeNet Transformer

This repository contains the official implementation and pretrained weights of the **CoSeNet Transformer**, a neural architecture for **document segmentation** and **sentence-level representation learning**, developed as part of a **Master's Thesis** at the University of Alcalá. A research paper describing the model and experiments is currently **in preparation**.

Hugging Face model repository: **[Alverciito/wikipedia_segmentation](https://huggingface.co/Alverciito/wikipedia_segmentation)**

---

## Overview

CoSeNet Transformer addresses document segmentation by combining:

- Transformer-based contextual representations
- A structural segmentation module (CoSeNet) for boundary modeling
- Sentence-level embedding extraction utilities for downstream segmentation and clustering pipelines

The model is evaluated on **Spanish Wikipedia** and compared against multiple baselines, including multilingual Transformer models and heuristic segmentation approaches.

---

## Repository Structure

- **src/**
  - **src/model/**: core architecture (SegmentationNetwork, encoder blocks, CoSeNet components)
  - **src/dataset/**: dataset utilities and loaders
  - **src/dlutils/**: deep learning utilities

- **research_files/**
  - **research_files/train/**: training scripts and configurations
  - **research_files/inference/**: standalone inference utilities (pre-Hugging Face integration)
  - **research_files/benchmark/**: benchmark suite, datasets, and evaluation scripts

- **Root files (Hugging Face integration)**
  - **model.py**: Hugging Face wrapper (`SentenceCoseNet`, `SentenceCoseNetConfig`)
  - **config.json**: Hugging Face model configuration
  - **model.safetensors**: pretrained model weights
  - **tokenizer.json**, **tokenizer_config.json**, **special_tokens_map.json**: tokenizer artifacts
  - **requirements.txt**: dependencies

---

## Installation

Install the dependencies listed in **requirements.txt**. The model is implemented in PyTorch and integrates with Hugging Face Transformers via custom code.

---

## Loading the Model

This repository uses **custom model code**, so loading requires enabling remote code execution by passing `trust_remote_code=True` to `from_pretrained`.

The tokenizer and model should both be loaded directly from the Hugging Face repository. A minimal end-to-end example:

````python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model (custom code requires trust_remote_code=True):
tokenizer = AutoTokenizer.from_pretrained("Alverciito/wikipedia_segmentation")
model = AutoModel.from_pretrained(
    "Alverciito/wikipedia_segmentation", trust_remote_code=True
)

# Tokenize:
text = "Hola mundo!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Token-level inference:
output = model.encode(
    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
)
print(output.shape)
# torch.Size([1, 4, 256])  # (batch, tokens, embedding_dim)

# Sentence embedding inference:
output = model.get_sentence_embedding(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    normalize=True,  # optional L2 normalization
)
print(output.shape)
# torch.Size([1, 256])  # (batch, embedding_dim)
````

---

## Inference API

The Hugging Face wrapper exposes a **task-oriented API**:

- **encode(input_ids, attention_mask=None)**
  Returns **token-level contextual embeddings**. Intended for encoder-style usage and intermediate representations.
  Typical output shape: `(batch, tokens, embedding_dim)`.

- **get_sentence_embedding(input_ids, attention_mask=None, normalize=...)**
  Returns **one embedding per sentence**, obtained via masked pooling and a lightweight regularization head.
  Typical output shape: `(batch, embedding_dim)`.
  Optional L2 normalization (`normalize=True`) is supported for similarity-based applications.

- **forward(...)**
  Reserved **exclusively for document segmentation**.
  Expects **3D inputs** `(batch, sentences, tokens)` and returns structured segmentation outputs.
This method is **not intended** for sentence embedding extraction.
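
The point of L2 normalization is that cosine similarity between unit-length embeddings reduces to a plain dot product. A minimal self-contained sketch of that relationship, using short illustrative vectors rather than real model outputs (which are 256-dimensional):

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length, analogous to the normalize=True option.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    # For unit-length vectors, cosine similarity is just the dot product.
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

# Illustrative 3-d "embeddings" (not real model outputs):
emb_a = [0.2, 0.9, 0.1]
emb_b = [0.25, 0.85, 0.05]
print(round(cosine_similarity(emb_a, emb_b), 3))  # prints 0.996
```

Pre-normalizing once is cheaper when each embedding is compared against many others, which is why the option is exposed at extraction time.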

---

## Benchmarks

Benchmark experiments are located in **research_files/benchmark/**.

The evaluation framework includes:

- The proposed **CoSeNet Transformer** model
- Heuristic segmentation baselines (TextTiling)
- Multilingual Transformer-based baselines, including:
  - sBERT
  - mBERT
  - LaBSE
  - XLM-R

Multiple segmentation strategies are evaluated, such as PELT, binary segmentation, and cosine-similarity-based methods. Metrics include Precision, Recall, F1-score, and WindowDiff.
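
As a hedged illustration of the cosine-similarity ("csim") strategy (the function name and threshold below are illustrative choices, not the benchmark's actual code), a boundary can be placed wherever the similarity between consecutive sentence embeddings drops:

```python
def csim_boundaries(embeddings, threshold=0.45):
    """Place a topic boundary after sentence i when the cosine similarity
    between embeddings i and i+1 falls below a threshold. Assumes the
    embeddings are already L2-normalized, so cosine similarity reduces
    to a dot product. Threshold value is illustrative."""
    boundaries = []
    for i in range(len(embeddings) - 1):
        sim = sum(a * b for a, b in zip(embeddings[i], embeddings[i + 1]))
        if sim < threshold:
            boundaries.append(i)  # boundary between sentence i and i+1
    return boundaries

# Four unit vectors with a sharp topic shift between sentences 1 and 2:
embs = [[1.0, 0.0], [0.98, 0.199], [0.0, 1.0], [0.1, 0.995]]
print(csim_boundaries(embs))  # prints [1]
```

PELT and BinSeg replace this fixed threshold with change-point detection over the same similarity signal.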

| Category | Model / Method | Spanish Support | Training |
|----------|----------------|-----------------|----------|
| Classical | TextTiling | Yes | No |
| Neural (sentence-level) | Multilingual Sentence-Transformer + cosine similarity (csim) | Yes | No |
| Neural (clustering) | Multilingual Sentence-Transformer + PELT | Yes | No |
| Neural (clustering) | Multilingual Sentence-Transformer + Binary Segmentation (BinSeg) | Yes | No |
| Neural (frozen LM) | mBERT (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Neural (frozen LM) | LaBSE (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Neural (frozen LM) | XLM-R (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Proposed | CoSeNet Transformer (sentence encoder + CoSeNet layering + candidate masking + pooling) | Yes | Yes |

**WindowDiff (WD)** is used as the primary segmentation error metric.
Lower values indicate better segmentation quality. In this work, **WindowDiff values ≤ 0.30 are considered acceptable**, values **≤ 0.20 indicate good performance**, and values **≤ 0.10 indicate strong segmentation accuracy** under standard evaluation settings.
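
For reference, a minimal WindowDiff sketch (window conventions vary slightly between papers and toolkits; this version slides over `n - k` windows of 0/1 boundary indicators):

```python
def window_diff(reference, hypothesis, k):
    """WindowDiff (Pevzner & Hearst, 2002): the fraction of length-k
    windows in which reference and hypothesis disagree on the number
    of boundaries. Inputs are 0/1 boundary indicator sequences of
    equal length; lower values mean better segmentation."""
    n = len(reference)
    assert len(hypothesis) == n and 0 < k < n
    errors = 0
    for i in range(n - k):
        if sum(reference[i:i + k]) != sum(hypothesis[i:i + k]):
            errors += 1
    return errors / (n - k)

ref = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]  # true boundaries after sentences 2 and 6
hyp = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]  # one boundary off by a single sentence
print(round(window_diff(ref, hyp, k=3), 3))  # prints 0.286
```

Unlike exact-match F1, near-miss boundaries are only penalized in the few windows that straddle them, which is why WD is preferred for segmentation.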

The benchmark entry point is **bench.py**, and results are stored as JSON files for reproducibility and further analysis.

---

## Research Context

This work is part of an academic research project and thesis, with an accompanying paper currently in preparation. The repository prioritizes:

- Architectural clarity
- Reproducibility of experiments
- Explicit separation between segmentation and embedding extraction

The model is **not** a SentenceTransformer, although it internally uses Transformer encoders.

---

## License

This project is released under the **Apache License 2.0**.

---

## Author and Affiliation

**Alberto Palomo Alonso**  
University of Alcalá  
Escuela Politécnica Superior

---