Citilink-XLMR-large-Structural-Segmentation-pt: Boundary Detection in Municipal Meeting Minutes

Model Description

This model performs extractive Question Answering (QA) to detect structural segments in Portuguese municipal meeting minutes, namely the Opening, Body, and Closing sections.

Given the full text of a meeting minute and a predefined question targeting a specific segment, the model predicts the most relevant answer span, returned as character-level start and end offsets.
It follows the SQuAD v2 paradigm, which allows the model to explicitly return no answer when a segment is not present in the document.

The model is designed to operate on long, unstructured administrative texts and is typically used as a preprocessing step for downstream tasks such as metadata extraction.
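
Concretely, the downstream segmentation reduces to slicing the document at the predicted offsets. A minimal sketch with a hypothetical minute and hand-picked boundary offsets (the real offsets come from the model's answers):

```python
# Hypothetical minute text; the two boundaries stand in for model predictions
text = "ABERTURA. Corpo da ata com as deliberações. ENCERRAMENTO."

# Hypothetical answers: end offset of the opening's last sentence,
# start offset of the closing's first sentence
opening_end = text.index(".") + 1
closing_start = text.index("ENCERRAMENTO")

segments = {
    "intro_segment": text[:opening_end],
    "body_segment": text[opening_end:closing_start],
    "closing_segment": text[closing_start:],
}
print(segments["intro_segment"])    # → ABERTURA.
```

Because the three slices are contiguous, concatenating them always reconstructs the original document.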

Key Features

  • ๐Ÿ›๏ธ Specialized for Municipal Minutes
    Fine-tuned on Portuguese municipal council meeting minutes, capturing the structural patterns of administrative documents.

  • 🧩 Extractive Question Answering
    Predicts precise start and end offsets for Opening, Body, and Closing segments using a span-based QA formulation.

  • ⚙️ Transformer-based Architecture
    Built on a pre-trained transformer model and adapted to handle long, unstructured texts through window-based inference.

  • 📈 Robust QA Performance
    Achieves strong F1 scores on a held-out Portuguese test set, demonstrating reliable segment detection across municipalities.

Model Details

  • Base Model: deepset/xlm-roberta-large-squad2
  • Architecture: Transformer encoder with a span prediction head for extractive Question Answering
  • Parameters: ~550M
  • Maximum Sequence Length: 512 tokens
  • Fine-tuning Dataset: 120 Portuguese municipal meeting minutes from 6 different municipalities
  • Answer Types: opening, body, and closing spans, with no-answer cases handled via the SQuAD v2 formulation
  • Training Framework: PyTorch with Hugging Face Transformers
  • Evaluation Metrics: Exact Match (EM) and F1 score, following the SQuAD v2 evaluation protocol
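
The token-overlap F1 used by the SQuAD v2 protocol can be sketched as follows (a minimal re-implementation for illustration; the official evaluation script additionally normalizes punctuation and articles before comparing):

```python
from collections import Counter

def squad_f1(prediction, gold):
    """Token-level F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Both empty (a correct no-answer) counts as a perfect match
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(squad_f1("a sessão foi aberta", "a sessão foi aberta pelo presidente"))  # → 0.8
```

Exact Match is simply whether the normalized prediction and gold strings are identical.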

How It Works

The model follows a standard extractive Question Answering pipeline.

Given a question targeting a specific structural segment (e.g., Opening or Body) and the full text of a meeting minute as context, both inputs are jointly tokenized and passed to the transformer model. The model predicts start and end logits for each token in the sequence, corresponding to the most likely answer span.
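
The decoding step can be illustrated with hypothetical logits and offsets (all values below are made up for the sketch; in a real run they come from the fast tokenizer's offset mapping and the model's forward pass):

```python
import numpy as np

context = "Ata número um. A sessão foi aberta pelo presidente."
# Hypothetical per-token (start_char, end_char) offsets into the context;
# None marks question/special tokens, as in a masked fast-tokenizer mapping
offset_mapping = [None, (0, 3), (4, 10), (11, 14), (15, 16),
                  (17, 23), (24, 27), (28, 34), (35, 39), (40, 51)]

# Hypothetical start/end logits, one per token
start_logits = np.array([0.1, 0.2, 0.1, 0.1, 2.5, 0.3, 0.1, 0.2, 0.1, 0.4])
end_logits   = np.array([0.1, 0.1, 0.2, 0.1, 0.2, 0.1, 0.1, 0.3, 0.2, 2.8])

# Greedy decoding: best start token, then best end token at or after it
s = int(np.argmax(start_logits))
e = s + int(np.argmax(end_logits[s:]))
start_char, end_char = offset_mapping[s][0], offset_mapping[e][1]
print(context[start_char:end_char])    # → A sessão foi aberta pelo presidente.
```

The full script below uses the more robust n-best search over start/end pairs rather than this greedy shortcut.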

For long documents exceeding the maximum sequence length, the context is split into overlapping windows. Each window is processed independently, and the final answer is selected based on the highest scoring span across all windows, while also considering the model's no-answer (null) score in accordance with the SQuAD v2 protocol.
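
The windowing itself can be sketched without the model (a hypothetical 1000-token document; `tokenizer(..., return_overflowing_tokens=True, stride=...)` produces windows with the same layout, minus the question and special tokens):

```python
def make_windows(token_ids, max_length, stride):
    """Split a token sequence into overlapping windows with `stride`
    tokens of overlap, i.e. advancing (max_length - stride) tokens per step."""
    windows = []
    step = max_length - stride
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
    return windows

# Hypothetical 1000-token document, 512-token windows, 128-token overlap
tokens = list(range(1000))
windows = make_windows(tokens, max_length=512, stride=128)
print(len(windows))      # → 3
print(windows[1][0])     # → 384 (second window starts 384 tokens in)
```

The overlap ensures that an answer span falling on a window boundary is fully contained in at least one window.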

The following example illustrates how to perform inference using this model:

import argparse
import json
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

def postprocess_qa_prediction(
    context,
    question,
    start_logits,
    end_logits,
    offset_mapping,
    input_ids,
    tokenizer,
    n_best_size=20,
    max_answer_length=3000,
    null_score_diff_threshold=0.0,
    return_offsets=False
):
    """
    Converts logits (start/end) to text answer for a single question.
    If return_offsets=True, returns dict with 'text', 'start_char', 'end_char'.
    Otherwise, returns just the text string.
    """
    # Get CLS token index
    cls_index = input_ids.index(tokenizer.cls_token_id)
    
    # Null score (for impossible answers)
    null_score = float(start_logits[cls_index] + end_logits[cls_index])
    
    # Get top-n start/end positions
    start_indexes = np.argsort(start_logits)[-1: -n_best_size - 1: -1].tolist()
    end_indexes = np.argsort(end_logits)[-1: -n_best_size - 1: -1].tolist()
    
    valid_answers = []
    
    for s in start_indexes:
        for e in end_indexes:
            if e < s:
                continue
            length = e - s + 1
            if length > max_answer_length:
                continue
            if s >= len(offset_mapping) or e >= len(offset_mapping):
                continue
            if offset_mapping[s] is None or offset_mapping[e] is None:
                continue
            
            start_char = offset_mapping[s][0]
            end_char = offset_mapping[e][1]
            if start_char is None or end_char is None:
                continue
            
            text = context[start_char:end_char]
            score = float(start_logits[s] + end_logits[e])
            valid_answers.append({
                "text": text, 
                "score": score,
                "start_char": start_char,
                "end_char": end_char
            })
    
    if not valid_answers:
        if return_offsets:
            return {"text": "", "start_char": 0, "end_char": 0}
        return ""
    
    best_answer = max(valid_answers, key=lambda x: x["score"])
    
    # Apply SQuAD v2 rule for impossible answers
    score_diff = null_score - best_answer["score"]
    if score_diff > null_score_diff_threshold:
        if return_offsets:
            return {"text": "", "start_char": 0, "end_char": 0}
        return ""
    
    if return_offsets:
        return {
            "text": best_answer["text"],
            "start_char": best_answer["start_char"],
            "end_char": best_answer["end_char"]
        }
    
    return best_answer["text"]


def extract_segment(context, question, tokenizer, model, max_length=512, doc_stride=128):
    """
    Extracts a single segment by asking a question to the QA model.
    Handles long documents using sliding windows with doc_stride.
    Returns both the text and the character offsets (start, end).
    """
    # Tokenize with stride to handle long documents
    encoding = tokenizer(
        question,
        context,
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
        return_tensors="pt"
    )
    
    # Process all windows
    all_start_logits = []
    all_end_logits = []
    
    with torch.no_grad():
        for i in range(len(encoding["input_ids"])):
            outputs = model(
                input_ids=encoding["input_ids"][i:i+1],
                attention_mask=encoding["attention_mask"][i:i+1]
            )
            all_start_logits.append(outputs.start_logits[0].cpu().numpy())
            all_end_logits.append(outputs.end_logits[0].cpu().numpy())
    
    # Find best answer across all windows
    best_answer = None
    best_score = float('-inf')
    
    for i in range(len(encoding["input_ids"])):
        # Mask question and special tokens so only context spans are candidates
        sequence_ids = encoding.sequence_ids(i)
        offset_mapping = [
            offsets if sequence_ids[k] == 1 else None
            for k, offsets in enumerate(encoding["offset_mapping"][i].tolist())
        ]
        input_ids = encoding["input_ids"][i].tolist()
        start_logits = all_start_logits[i]
        end_logits = all_end_logits[i]
        
        # Post-process this window
        answer_data = postprocess_qa_prediction(
            context=context,
            question=question,
            start_logits=start_logits,
            end_logits=end_logits,
            offset_mapping=offset_mapping,
            input_ids=input_ids,
            tokenizer=tokenizer,
            return_offsets=True
        )
        
        # Recover this window's span score so windows can be compared
        if answer_data["text"]:
            answer_score = float('-inf')
            for s_idx, start_offsets in enumerate(offset_mapping):
                if start_offsets is not None and start_offsets[0] == answer_data["start_char"]:
                    for e_idx, end_offsets in enumerate(offset_mapping):
                        if end_offsets is not None and end_offsets[1] == answer_data["end_char"]:
                            answer_score = float(start_logits[s_idx] + end_logits[e_idx])
                            break
                    break
            
            if answer_score > best_score:
                best_score = answer_score
                best_answer = answer_data
    
    if best_answer is None:
        return {"text": "", "start_char": 0, "end_char": 0}
    
    return best_answer


def segment_document(text, tokenizer, model):
    """
    Segments the document into three parts using two questions and offset-based segmentation.
    
    Logic:
    1. Ask OPENING_Q to find the last sentence of the opening segment
    2. Ask CLOSING_Q to find the first sentence of the closing segment
    3. Segment based on offsets:
       - intro_segment: from start to end_offset of OPENING_Q answer
       - closing_segment: from start_offset of CLOSING_Q answer to end
       - body_segment: everything in between
    
    Returns a tuple (segments, debug_info), where segments contains
    intro_segment, body_segment, and closing_segment.
    """
    if not text or text.strip() == "":
        empty_segments = {
            "intro_segment": "",
            "body_segment": "",
            "closing_segment": ""
        }
        return empty_segments, {}
    
    # Define the two questions
    OPENING_Q = "No início da ata há um segmento de abertura. Qual é a última frase desse segmento de abertura?"
    CLOSING_Q = "No final da ata há um segmento de encerramento. Qual é a primeira frase desse segmento de encerramento?"
    
    # Extract opening segment boundary
    opening_answer = extract_segment(text, OPENING_Q, tokenizer, model)
    
    # Extract closing segment boundary
    closing_answer = extract_segment(text, CLOSING_Q, tokenizer, model)
    
    # Get offsets
    opening_end = opening_answer["end_char"] if opening_answer["text"] else 0
    closing_start = closing_answer["start_char"] if closing_answer["text"] else len(text)
    
    # Segment the document based on offsets
    intro_segment = text[:opening_end]
    body_segment = text[opening_end:closing_start]
    closing_segment = text[closing_start:]
    
    segments = {
        "intro_segment": intro_segment,
        "body_segment": body_segment,
        "closing_segment": closing_segment
    }
    
    # Debug info
    debug_info = {
        "opening_answer": {
            "text": opening_answer["text"],
            "end_offset": opening_end
        },
        "closing_answer": {
            "text": closing_answer["text"],
            "start_offset": closing_start
        },
        "segment_lengths": {
            "intro": len(intro_segment),
            "body": len(body_segment),
            "closing": len(closing_segment)
        }
    }
    
    return segments, debug_info


def main():
    parser = argparse.ArgumentParser(description="Segment document into intro, body, and closing sections")
    parser.add_argument("--input_file", type=str, help="Path to input text file")
    parser.add_argument("--text", type=str, help="Direct text input")
    parser.add_argument("--output_file", type=str, help="Path to save JSON output (optional)")
    parser.add_argument("--model", type=str, default="liaad/Citilink-XLMR-large-Structural-Segmentation-pt", 
                        help="Model name or path")
    parser.add_argument("--verbose", action="store_true", help="Show debug information")
    
    args = parser.parse_args()
    
    # Get input text
    if args.input_file:
        with open(args.input_file, "r", encoding="utf-8") as f:
            text = f.read()
        print(f"Loaded text from: {args.input_file}")
    elif args.text:
        text = args.text
    else:
        print("Error: Please provide either --input_file or --text")
        return
    
    # Load model
    print(f"Loading model: {args.model}")
    tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=True)
    model = AutoModelForQuestionAnswering.from_pretrained(args.model)
    model.eval()
    print("Model loaded successfully")
    
    # Segment document
    print("\nProcessing document...")
    segments, debug_info = segment_document(text, tokenizer, model)
    
    # Print results
    print("\n" + "="*80)
    print("SEGMENTATION RESULTS")
    print("="*80)
    
    if args.verbose:
        print("\nDEBUG INFO:")
        print(json.dumps(debug_info, ensure_ascii=False, indent=2))
        print("\n" + "-"*80)
    
    print("\nINTRO SEGMENT:")
    print("-"*80)
    print(segments["intro_segment"])
    
    print("\nBODY SEGMENT:")
    print("-"*80)
    print(segments["body_segment"])
    
    print("\nCLOSING SEGMENT:")
    print("-"*80)
    print(segments["closing_segment"])
    
    print("\n" + "="*80)
    print("Statistics:")
    print(f"   Intro length: {len(segments['intro_segment'])} chars")
    print(f"   Body length: {len(segments['body_segment'])} chars")
    print(f"   Closing length: {len(segments['closing_segment'])} chars")
    print(f"   Total: {len(text)} chars")
    print("="*80)
    
    # Save to file if requested
    if args.output_file:
        output_data = {
            "segments": segments,
            "debug_info": debug_info if args.verbose else None
        }
        with open(args.output_file, "w", encoding="utf-8") as f:
            json.dump(output_data, f, ensure_ascii=False, indent=2)
        print(f"\nResults saved to: {args.output_file}")


if __name__ == "__main__":
    main()

Evaluation Results

Municipal Meeting Minutes Test Set

Metric        Score
F1 score      0.907
Exact Match   0.875

Limitations

  • Domain Specificity
    The model is fine-tuned on Portuguese municipal meeting minutes and performs best on administrative and governmental texts. Performance may degrade on documents with substantially different structure or writing style.

  • Context Window Length
    The model has a maximum input length of 512 tokens. Longer documents require window-based processing, which may lead to partial or fragmented segment predictions in edge cases.

  • Structural Variability
    Municipal minutes can vary significantly across municipalities and time periods. Unseen formatting patterns or atypical section ordering may reduce prediction accuracy.

License

This model is released under the cc-by-nc-nd-4.0 license.
