Citilink-XLMR-large-Structural-Segmentation-pt: Boundary Detection in Municipal Meeting Minutes
Model Description
This model performs extractive Question Answering (QA) to detect structural segments in Portuguese municipal meeting minutes, namely the Opening, Body, and Closing sections.
Given the full text of a meeting minute and a predefined question targeting a specific segment, the model predicts the most relevant text span using character-level start and end offsets.
It follows the SQuAD v2 paradigm, allowing the model to explicitly return no answer when a segment is not present in the document.
The model is designed to operate on long, unstructured administrative texts and is typically used as a preprocessing step for downstream tasks such as metadata extraction.
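Concretely, because predictions are character offsets into the original document, downstream code can recover each segment by plain string slicing. The minute text and offsets below are invented for illustration:

```python
# Toy illustration: a predicted segment is a (start_char, end_char) pair
# into the original document, so plain slicing recovers the exact text.
minute = "Aos dez dias do mes de maio reuniu a camara. Ponto um: aprovacao."
prediction = {"start_char": 0, "end_char": 44}  # hypothetical model output

segment = minute[prediction["start_char"]:prediction["end_char"]]
print(segment)  # Aos dez dias do mes de maio reuniu a camara.

# A SQuAD v2 "no answer" is conventionally returned as an empty span:
no_answer = {"start_char": 0, "end_char": 0}
print(repr(minute[no_answer["start_char"]:no_answer["end_char"]]))  # ''
```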
Key Features
- **Specialized for Municipal Minutes**: Fine-tuned on Portuguese municipal council meeting minutes, capturing the structural patterns of administrative documents.
- **Extractive Question Answering**: Predicts precise start and end offsets for Opening, Body, and Closing segments using a span-based QA formulation.
- **Transformer-based Architecture**: Built on a pre-trained transformer model and adapted to handle long, unstructured texts through window-based inference.
- **Robust QA Performance**: Achieves strong F1 scores on a held-out Portuguese test set, demonstrating reliable segment detection across municipalities.
Model Details
- Base Model: deepset/xlm-roberta-large-squad2
- Architecture: Transformer encoder with a span prediction head for extractive Question Answering
- Parameters: ~550M
- Maximum Sequence Length: 512 tokens
- Fine-tuning Dataset: 120 Portuguese municipal meeting minutes from 6 different municipalities
- Answer Types: `opening`, `body`, and `closing`, with no-answer cases following the SQuAD v2 formulation
- Training Framework: PyTorch with Hugging Face Transformers
- Evaluation Metrics: Exact Match (EM) and F1 score, following the SQuAD v2 evaluation protocol
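For intuition on the two metrics, the following simplified sketch computes token-level F1 and Exact Match on toy strings. It omits the answer normalization (lowercasing, punctuation and article stripping) that the official SQuAD v2 evaluation script applies:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 as in SQuAD-style evaluation (simplified:
    no lowercasing or punctuation normalization)."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, gold: str) -> float:
    """EM is all-or-nothing: the strings must be identical."""
    return float(prediction == gold)

print(exact_match("a sessao foi encerrada", "a sessao foi encerrada"))  # 1.0
print(round(token_f1("a sessao foi encerrada", "sessao encerrada"), 3))  # 0.667
```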
How It Works
The model follows a standard extractive Question Answering pipeline.
Given a question targeting a specific structural segment (e.g., Opening or Body) and the full text of a meeting minute as context, both inputs are jointly tokenized and passed to the transformer model. The model predicts start and end logits for each token in the sequence, corresponding to the most likely answer span.
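The span selection step can be sketched with toy logits. The numbers below are invented; token 0 stands in for the special token whose start+end score serves as the SQuAD v2 null score:

```python
import numpy as np

# Toy start/end logits over a 6-token window; token 0 plays the role of
# the <s>/[CLS] token used for the SQuAD v2 null (no-answer) score.
start_logits = np.array([1.0, 0.2, 4.0, 0.1, 0.3, 0.0])
end_logits   = np.array([1.2, 0.1, 0.5, 0.4, 3.5, 0.2])

best_score, best_span = float("-inf"), None
for s in range(len(start_logits)):
    for e in range(s, len(end_logits)):        # end must not precede start
        score = start_logits[s] + end_logits[e]
        if score > best_score:
            best_score, best_span = score, (s, e)

null_score = start_logits[0] + end_logits[0]   # CLS start + CLS end
print(best_span, best_score)  # (2, 4) 7.5
print("no answer" if null_score > best_score else "answer")  # answer
```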
For long documents exceeding the maximum sequence length, the context is split into overlapping windows. Each window is processed independently, and the final answer is selected based on the highest-scoring span across all windows, while also considering the model's no-answer (null) score in accordance with the SQuAD v2 protocol.
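The windowing idea can be illustrated at the character level; real inference windows over tokens, via the tokenizer's `stride` and `return_overflowing_tokens` options shown in the script below:

```python
def make_windows(text: str, max_len: int = 20, stride: int = 5):
    """Character-level analogue of tokenizer windowing: each window
    overlaps the previous one by `stride` characters, so an answer that
    falls on a window boundary still appears whole in some window."""
    windows, start = [], 0
    step = max_len - stride
    while start < len(text):
        windows.append((start, text[start:start + max_len]))
        if start + max_len >= len(text):
            break
        start += step
    return windows

doc = "abcdefghijklmnopqrstuvwxyz0123456789"
for offset, window in make_windows(doc):
    print(offset, window)
```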
The following example illustrates how to perform inference using this model:
import argparse
import json
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
def postprocess_qa_prediction(
context,
question,
start_logits,
end_logits,
offset_mapping,
input_ids,
tokenizer,
n_best_size=20,
max_answer_length=3000,
null_score_diff_threshold=0.0,
return_offsets=False
):
"""
Converts logits (start/end) to text answer for a single question.
If return_offsets=True, returns dict with 'text', 'start_char', 'end_char'.
Otherwise, returns just the text string.
"""
# Get CLS token index
cls_index = input_ids.index(tokenizer.cls_token_id)
# Null score (for impossible answers)
null_score = float(start_logits[cls_index] + end_logits[cls_index])
# Get top-n start/end positions
start_indexes = np.argsort(start_logits)[-1: -n_best_size - 1: -1].tolist()
end_indexes = np.argsort(end_logits)[-1: -n_best_size - 1: -1].tolist()
valid_answers = []
for s in start_indexes:
for e in end_indexes:
if e < s:
continue
length = e - s + 1
if length > max_answer_length:
continue
if s >= len(offset_mapping) or e >= len(offset_mapping):
continue
if offset_mapping[s] is None or offset_mapping[e] is None:
continue
start_char = offset_mapping[s][0]
end_char = offset_mapping[e][1]
if start_char is None or end_char is None:
continue
text = context[start_char:end_char]
score = float(start_logits[s] + end_logits[e])
valid_answers.append({
"text": text,
"score": score,
"start_char": start_char,
"end_char": end_char
})
if not valid_answers:
if return_offsets:
return {"text": "", "start_char": 0, "end_char": 0}
return ""
best_answer = max(valid_answers, key=lambda x: x["score"])
# Apply SQuAD v2 rule for impossible answers
score_diff = null_score - best_answer["score"]
if score_diff > null_score_diff_threshold:
if return_offsets:
return {"text": "", "start_char": 0, "end_char": 0}
return ""
if return_offsets:
return {
"text": best_answer["text"],
"start_char": best_answer["start_char"],
"end_char": best_answer["end_char"]
}
return best_answer["text"]
def extract_segment(context, question, tokenizer, model, max_length=512, doc_stride=128):
"""
Extracts a single segment by asking a question to the QA model.
Handles long documents using sliding windows with doc_stride.
Returns both the text and the character offsets (start, end).
"""
# Tokenize with stride to handle long documents
encoding = tokenizer(
question,
context,
truncation="only_second",
max_length=max_length,
stride=doc_stride,
return_overflowing_tokens=True,
return_offsets_mapping=True,
padding="max_length",
return_tensors="pt"
)
# Process all windows
all_start_logits = []
all_end_logits = []
with torch.no_grad():
for i in range(len(encoding["input_ids"])):
outputs = model(
input_ids=encoding["input_ids"][i:i+1],
attention_mask=encoding["attention_mask"][i:i+1]
)
all_start_logits.append(outputs.start_logits[0].cpu().numpy())
all_end_logits.append(outputs.end_logits[0].cpu().numpy())
# Find best answer across all windows
best_answer = None
best_score = float('-inf')
    for i in range(len(encoding["input_ids"])):
        # Mask offsets of question and special tokens so that candidate
        # spans can only fall inside the document context
        sequence_ids = encoding.sequence_ids(i)
        offset_mapping = [
            om if sequence_ids[k] == 1 else None
            for k, om in enumerate(encoding["offset_mapping"][i].tolist())
        ]
        input_ids = encoding["input_ids"][i].tolist()
        start_logits = all_start_logits[i]
        end_logits = all_end_logits[i]
# Post-process this window
answer_data = postprocess_qa_prediction(
context=context,
question=question,
start_logits=start_logits,
end_logits=end_logits,
offset_mapping=offset_mapping,
input_ids=input_ids,
tokenizer=tokenizer,
return_offsets=True
)
        # Calculate score for this answer by locating its token span
        if answer_data["text"]:
            answer_score = float('-inf')
            for s_idx, om_s in enumerate(offset_mapping):
                if om_s is not None and om_s[0] == answer_data["start_char"]:
                    for e_idx, om_e in enumerate(offset_mapping):
                        if om_e is not None and om_e[1] == answer_data["end_char"]:
                            answer_score = float(start_logits[s_idx] + end_logits[e_idx])
                            break
                    break
if answer_score > best_score:
best_score = answer_score
best_answer = answer_data
if best_answer is None:
return {"text": "", "start_char": 0, "end_char": 0}
return best_answer
def segment_document(text, tokenizer, model):
"""
Segments the document into three parts using two questions and offset-based segmentation.
Logic:
1. Ask OPENING_Q to find the last sentence of the opening segment
2. Ask CLOSING_Q to find the first sentence of the closing segment
3. Segment based on offsets:
- intro_segment: from start to end_offset of OPENING_Q answer
- closing_segment: from start_offset of CLOSING_Q answer to end
- body_segment: everything in between
    Returns a tuple (segments, debug_info); segments contains intro_segment, body_segment, and closing_segment.
"""
    if not text or text.strip() == "":
        empty = {
            "intro_segment": "",
            "body_segment": "",
            "closing_segment": ""
        }
        # Match the (segments, debug_info) return shape of the normal path,
        # so callers can always unpack two values
        return empty, {}
# Define the two questions
    # OPENING_Q: "At the start of the minute there is an opening segment.
    #             What is the last sentence of that opening segment?"
    # CLOSING_Q: "At the end of the minute there is a closing segment.
    #             What is the first sentence of that closing segment?"
    OPENING_Q = "No início da ata há um segmento de abertura. Qual é a última frase desse segmento de abertura?"
    CLOSING_Q = "No final da ata há um segmento de encerramento. Qual é a primeira frase desse segmento de encerramento?"
# Extract opening segment boundary
opening_answer = extract_segment(text, OPENING_Q, tokenizer, model)
# Extract closing segment boundary
closing_answer = extract_segment(text, CLOSING_Q, tokenizer, model)
# Get offsets
opening_end = opening_answer["end_char"] if opening_answer["text"] else 0
closing_start = closing_answer["start_char"] if closing_answer["text"] else len(text)
# Segment the document based on offsets
intro_segment = text[:opening_end]
body_segment = text[opening_end:closing_start]
closing_segment = text[closing_start:]
segments = {
"intro_segment": intro_segment,
"body_segment": body_segment,
"closing_segment": closing_segment
}
# Debug info
debug_info = {
"opening_answer": {
"text": opening_answer["text"],
"end_offset": opening_end
},
"closing_answer": {
"text": closing_answer["text"],
"start_offset": closing_start
},
"segment_lengths": {
"intro": len(intro_segment),
"body": len(body_segment),
"closing": len(closing_segment)
}
}
return segments, debug_info
def main():
parser = argparse.ArgumentParser(description="Segment document into intro, body, and closing sections")
parser.add_argument("--input_file", type=str, help="Path to input text file")
parser.add_argument("--text", type=str, help="Direct text input")
parser.add_argument("--output_file", type=str, help="Path to save JSON output (optional)")
parser.add_argument("--model", type=str, default="liaad/Citilink-XLMR-large-Structural-Segmentation-pt",
help="Model name or path")
parser.add_argument("--verbose", action="store_true", help="Show debug information")
args = parser.parse_args()
# Get input text
if args.input_file:
with open(args.input_file, "r", encoding="utf-8") as f:
text = f.read()
print(f"Loaded text from: {args.input_file}")
elif args.text:
text = args.text
else:
print("Error: Please provide either --input_file or --text")
return
# Load model
print(f"Loading model: {args.model}")
tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=True)
model = AutoModelForQuestionAnswering.from_pretrained(args.model)
model.eval()
print("Model loaded successfully")
# Segment document
print("\nProcessing document...")
segments, debug_info = segment_document(text, tokenizer, model)
# Print results
print("\n" + "="*80)
print("SEGMENTATION RESULTS")
print("="*80)
if args.verbose:
print("\nDEBUG INFO:")
print(json.dumps(debug_info, ensure_ascii=False, indent=2))
print("\n" + "-"*80)
print("\nINTRO SEGMENT:")
print("-"*80)
print(segments["intro_segment"])
print("\nBODY SEGMENT:")
print("-"*80)
print(segments["body_segment"])
print("\nCLOSING SEGMENT:")
print("-"*80)
print(segments["closing_segment"])
print("\n" + "="*80)
print(f"Statistics:")
print(f" Intro length: {len(segments['intro_segment'])} chars")
print(f" Body length: {len(segments['body_segment'])} chars")
print(f" Closing length: {len(segments['closing_segment'])} chars")
print(f" Total: {len(text)} chars")
print("="*80)
# Save to file if requested
if args.output_file:
output_data = {
"segments": segments,
"debug_info": debug_info if args.verbose else None
}
with open(args.output_file, "w", encoding="utf-8") as f:
json.dump(output_data, f, ensure_ascii=False, indent=2)
print(f"\nResults saved to: {args.output_file}")
if __name__ == "__main__":
main()
Evaluation Results
Municipal Meeting Minutes Test Set
| Metric | Score |
|---|---|
| F1 score | 0.907 |
| Exact Match | 0.875 |
Limitations
Domain Specificity
The model is fine-tuned on Portuguese municipal meeting minutes and performs best on administrative and governmental texts. Performance may degrade on documents with substantially different structure or writing style.
Context Window Length
The model has a maximum input length of 512 tokens. Longer documents require window-based processing, which may lead to partial or fragmented segment predictions in edge cases.
Structural Variability
Municipal minutes can vary significantly across municipalities and time periods. Unseen formatting patterns or atypical section ordering may reduce prediction accuracy.
License
This model is released under the cc-by-nc-nd-4.0 license.