---
library_name: transformers
pipeline_tag: sentence-similarity
tags:
- sentence-embeddings
- information-retrieval
- semantic-search
license: apache-2.0
---

# Wikipedia Document Segmentation with CoSeNet Transformer

This repository contains the official implementation and pretrained weights of the **CoSeNet Transformer**, a neural architecture for **document segmentation** and **sentence-level representation learning**, developed as part of a **Master's Thesis** at the University of Alcalá. A research paper describing the model and experiments is currently **in preparation**.

Hugging Face model repository: **Alverciito/wikipedia_segmentation**

---

## Overview

The CoSeNet Transformer addresses document segmentation by combining:

- Transformer-based contextual representations
- A structural segmentation module (CoSeNet) for boundary modeling
- Sentence-level embedding extraction utilities for downstream segmentation and clustering pipelines

The model is evaluated on **Spanish Wikipedia** and compared against multiple baselines, including multilingual Transformer models and heuristic segmentation approaches.
---

## Repository Structure

- **src/**
  - **src/model/**: core architecture (SegmentationNetwork, encoder blocks, CoSeNet components)
  - **src/dataset/**: dataset utilities and loaders
  - **src/dlutils/**: deep learning utilities
- **research_files/**
  - **research_files/train/**: training scripts and configurations
  - **research_files/inference/**: standalone inference utilities (pre-Hugging Face integration)
  - **research_files/benchmark/**: benchmark suite, datasets, and evaluation scripts
- **Root files (Hugging Face integration)**
  - **model.py**: Hugging Face wrapper (`SentenceCoseNet`, `SentenceCoseNetConfig`)
  - **config.json**: Hugging Face model configuration
  - **model.safetensors**: pretrained model weights
  - **tokenizer.json**, **tokenizer_config.json**, **special_tokens_map.json**: tokenizer artifacts
  - **requirements.txt**: dependencies

---

## Installation

Install the dependencies listed in **requirements.txt**. The model is implemented in PyTorch and integrates with Hugging Face Transformers via custom code.

---

## Loading the Model

This repository uses **custom model code**, so loading requires passing `trust_remote_code=True` to the Hugging Face `from_pretrained` API. The tokenizer and model should both be loaded directly from the Hugging Face repository.

A minimal end-to-end example:

````python
from transformers import AutoTokenizer, AutoModel

# Loading (custom model code requires trust_remote_code=True):
tokenizer = AutoTokenizer.from_pretrained("Alverciito/wikipedia_segmentation", trust_remote_code=True)
model = AutoModel.from_pretrained("Alverciito/wikipedia_segmentation", trust_remote_code=True)

# Tokenize:
text = "Hola mundo!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Token-level inference:
output = model.encode(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
print(output.shape)
# torch.Size([1, 4, 256])  # (batch, tokens, embedding_dim)

# Sentence embedding inference:
output = model.get_sentence_embedding(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    normalize=True,  # optional L2 normalization
)
print(output.shape)
# torch.Size([1, 256])  # (batch, embedding_dim)
````

---

## Inference API

The Hugging Face wrapper exposes a **task-oriented API**:

- **encode(input_ids, attention_mask=None)**
  Returns **token-level contextual embeddings**, intended for encoder-style usage and intermediate representations. Typical output shape: `(batch, tokens, embedding_dim)`.

- **get_sentence_embedding(input_ids, attention_mask=None, normalize=True/False)**
  Returns **one embedding per sentence**, obtained via masked pooling and a lightweight regularization head. Typical output shape: `(batch, embedding_dim)`. Optional L2 normalization is supported for similarity-based applications.

- **forward(...)**
  Reserved **exclusively for document segmentation**. Expects **3D inputs** of shape `(batch, sentences, tokens)` and returns structured segmentation outputs. This method is **not intended** for sentence embedding extraction.

---

## Benchmarks

Benchmark experiments are located in **research_files/benchmark/**. The evaluation framework includes:

- The proposed **CoSeNet Transformer** model
- A heuristic segmentation baseline (TextTiling)
- Multilingual Transformer-based baselines, including:
  - SBERT
  - mBERT
  - LaBSE
  - XLM-R

Multiple segmentation strategies are evaluated, such as PELT, binary segmentation, and cosine-similarity-based methods. Metrics include Precision, Recall, F1-score, and WindowDiff.
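As an illustration of the cosine-similarity segmentation strategy, the sketch below places a boundary wherever the similarity between consecutive sentence embeddings drops below a threshold. It is a minimal NumPy sketch, not the benchmark code: the function name, the threshold value, and the toy embeddings are illustrative assumptions.

```python
import numpy as np

def cosine_boundaries(embeddings: np.ndarray, threshold: float = 0.5) -> list:
    """Return indices i such that a segment boundary is placed after
    sentence i, based on cosine similarity between sentences i and i+1."""
    # Normalize rows so a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = np.sum(unit[:-1] * unit[1:], axis=1)  # similarity of pairs (i, i+1)
    return [i for i, s in enumerate(sims) if s < threshold]

# Toy example: two clusters of nearly identical vectors -> one boundary.
rng = np.random.default_rng(0)
topic_a = rng.normal(size=256)
topic_b = rng.normal(size=256)
sentences = np.stack(
    [topic_a + 0.01 * rng.normal(size=256) for _ in range(3)]
    + [topic_b + 0.01 * rng.normal(size=256) for _ in range(3)]
)
print(cosine_boundaries(sentences))  # [2] -> boundary after the third sentence
```

In practice the embeddings would come from `get_sentence_embedding` (or a baseline encoder), and the threshold would be tuned on held-out data rather than fixed.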
| Category | Model / Method | Spanish Support | Training |
|----------|----------------|-----------------|----------|
| Classical | TextTiling | Yes | No |
| Neural (sentence-level) | Multilingual Sentence-Transformer + cosine similarity (csim) | Yes | No |
| Neural (clustering) | Multilingual Sentence-Transformer + PELT | Yes | No |
| Neural (clustering) | Multilingual Sentence-Transformer + Binary Segmentation (BinSeg) | Yes | No |
| Neural (frozen LM) | mBERT (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Neural (frozen LM) | LaBSE (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Neural (frozen LM) | XLM-R (frozen) + PELT / BinSeg / cosine similarity | Yes | No |
| Proposed | CoSeNet Transformer (sentence encoder + CoSeNet layering + candidate masking + pooling) | Yes | Yes |

**WindowDiff (WD)** is used as the primary segmentation error metric; lower values indicate better segmentation quality. In this work, WindowDiff values **≤ 0.30** are considered acceptable, values **≤ 0.20** indicate good performance, and values **≤ 0.10** indicate strong segmentation accuracy under standard evaluation settings.

The benchmark entry point is **bench.py**, and results are stored as JSON files for reproducibility and further analysis.

---

## Research Context

This work is part of an academic research project and a **PhD thesis**, with an accompanying paper currently being written. The repository prioritizes:

- Architectural clarity
- Reproducibility of experiments
- Explicit separation between segmentation and embedding extraction

The model is **not** a SentenceTransformer, although it internally uses Transformer encoders.

---

## License

This project is released under the **Apache License 2.0**.

---

## Author and Affiliation

**Alberto Palomo Alonso**
University of Alcalá
Escuela Politécnica Superior

---