NSP-CouncilSeg-EN: Linear Text Segmentation for Municipal Meeting Minutes

Model Description

NSP-CouncilSeg-EN is a fine-tuned BERT model specialized in linear text segmentation of municipal council meeting minutes. It uses Next Sentence Prediction (NSP) to identify topic boundaries in long-form documents, making it particularly effective for administrative and governmental meeting minutes.

Try out the model: Hugging Face Space Demo

Key Features

  • 🎯 Specialized for Meeting Minutes: Fine-tuned on English translations of Portuguese municipal council meeting minutes
  • ⚡ Fast Inference: Efficient BERT-base architecture for real-time segmentation
  • 📊 Good Accuracy: Achieves a BED F-measure of 0.61 on the CouncilSeg-EN dataset
  • 🔄 Sentence-Level Segmentation: Identifies topic boundaries at sentence granularity

Model Details

  • Base Model: google-bert/bert-base-uncased
  • Architecture: BERT with Next Sentence Prediction head
  • Parameters: 110M
  • Max Sequence Length: 512 tokens
  • Fine-tuning Dataset: CouncilSeg-EN (English translations of Portuguese municipal meeting minutes)
  • Fine-tuning Method: Focal Loss with boundary-aware weighting
  • Training Framework: PyTorch + Transformers
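The card does not spell out the boundary-aware weighting, but the standard focal loss it builds on has the form FL(p) = -α(1-p)^γ log(p), where p is the predicted probability of the true class. A minimal sketch (the `alpha` and `gamma` values here are illustrative defaults, not the model's actual hyperparameters):

```python
import math

def focal_loss(p_correct: float, alpha: float = 0.75, gamma: float = 2.0) -> float:
    """Focal loss for a single example: -alpha * (1 - p)^gamma * log(p).

    Easy examples (p_correct close to 1) are down-weighted by the
    (1 - p)^gamma factor, so rare boundary pairs are not drowned out
    by the many same-topic pairs in boundary-sparse documents.
    """
    return -alpha * (1.0 - p_correct) ** gamma * math.log(p_correct)

# A confidently correct pair contributes far less than an ambiguous one:
easy = focal_loss(0.95)  # well-classified same-topic pair
hard = focal_loss(0.55)  # uncertain pair near a boundary
print(f"easy: {easy:.5f}, hard: {hard:.5f}")
```

The `(1 - p)^gamma` factor is what makes the loss "focal": it focuses gradient updates on the hard, boundary-adjacent pairs.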

How It Works

The model predicts whether two consecutive sentences belong to the same topic (label 0: "is_next") or straddle a topic transition (label 1: "not_next"). Applying this classifier sequentially to every consecutive sentence pair in a document yields the document's topic boundaries.

Sentence A: "By the President, minutes no. 28 of 20.12.2023 were present at the meeting."
Sentence B: "After considering and analyzing the matter, the Municipal Executive unanimously decided to approve minute no. 28 of 12.20.2023."
→ Prediction: Same Topic (confidence: 76%)

Sentence A: "After considering and analyzing the matter, the Municipal Executive unanimously decided to approve minute no. 28 of 12.20.2023."
Sentence B: "There were no various processes and requests to submit."
→ Prediction: Topic Boundary (confidence: 82%)

Usage

Quick Start with Transformers

from transformers import AutoTokenizer, AutoModelForNextSentencePrediction
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("anonymous15135/nsp-councilseg-en")
model = AutoModelForNextSentencePrediction.from_pretrained("anonymous15135/nsp-councilseg-en")

# Prepare input
sentence_a = "By the President, minutes no. 28 of 20.12.2023 were present at the meeting."
sentence_b = "After considering and analyzing the matter, the Municipal Executive unanimously decided to approve minute no. 28 of 12.20.2023."

# Tokenize
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = torch.softmax(logits, dim=1)

# Interpret results
is_next_prob = probs[0][0].item()
not_next_prob = probs[0][1].item()

print(f"Is Next (same topic): {is_next_prob:.3f}")
print(f"Not Next (topic boundary): {not_next_prob:.3f}")

if not_next_prob > 0.5:
    print("🔴 Topic boundary detected!")
else:
    print("🟢 Same topic continues")
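To segment a whole document, the pair classifier above is applied to every consecutive sentence pair and a new segment is started whenever the "not_next" probability crosses a threshold. A minimal sketch: `boundary_prob` stands in for the model call from the snippet above (stubbed here for illustration), and the 0.5 threshold is the same one used in the Quick Start.

```python
from typing import Callable, List

def segment_document(
    sentences: List[str],
    boundary_prob: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Split `sentences` into topic segments.

    `boundary_prob(a, b)` returns the probability that sentence b starts
    a new topic (the model's "not_next" probability). A new segment
    begins whenever that probability exceeds `threshold`.
    """
    if not sentences:
        return []
    segments = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if boundary_prob(prev, curr) > threshold:
            segments.append([curr])    # topic boundary: start a new segment
        else:
            segments[-1].append(curr)  # same topic: extend current segment
    return segments

# Stub scorer for illustration only: pretend a boundary follows any
# sentence ending in "submit." (replace with the NSP model call).
def stub_prob(a: str, b: str) -> float:
    return 0.9 if a.endswith("submit.") else 0.1

doc = [
    "Minutes were presented.",
    "The minutes were approved.",
    "There were no requests to submit.",
    "The meeting was closed.",
]
print(segment_document(doc, stub_prob))
```

In practice `boundary_prob` would wrap the tokenizer and model from the Quick Start, returning `probs[0][1].item()` for each pair.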

Evaluation Results

On the CouncilSeg-EN test set, the model achieves a BED F-measure of 0.61.

Limitations

  • Domain Specificity: Best performance on administrative/governmental meeting minutes
  • Language: Optimized for English; Portuguese performance may vary
  • Document Length: Designed for documents with 10-50 segments
  • Context Window: Limited to 512 tokens per sentence pair
  • Ambiguous Boundaries: May struggle with subtle topic transitions

Model Card Contact

For questions or feedback, please open an issue in the model repository.

License

This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.
