mBART50-Summarization-Council-PT: Abstractive Summarization of Portuguese Municipal Meeting Minutes
Model Description
mBART50-Summarization-Council-PT is an abstractive text summarization model based on mBART-50 Large, fine-tuned to generate concise summaries of discussion subjects from Portuguese municipal meeting minutes.
Unlike general-purpose summarization models, this version is specifically optimized for the administrative and legalistic register of European Portuguese. It uses a chunking-and-stride strategy to process the long discussion segments common in municipal documentation, ensuring that relevant context is preserved even in documents exceeding standard token limits.
Try out the model: Hugging Face Space Demo
Key Features
- 🧾 Abstractive Summarization – Generates natural, human-like summaries focusing on key deliberations and decisions.
- 🇵🇹 European Portuguese – Optimized for the official language used in Portuguese public administration.
- 🏛️ Domain-Specific – Trained on a curated corpus of municipal meeting minutes (urban planning, finance, social support, etc.).
- ⚙️ Chunk-Aware Processing – Developed using a window of 1024 tokens with a 512-token stride to handle long-form text.
- 🧠 mBART-50 Architecture – Leverages the large-scale multilingual pre-training of `facebook/mbart-large-50`.
Model Details
- Architecture: `facebook/mbart-large-50`
- Task: Abstractive summarization (seq2seq generation)
- Language Code: `pt_XX` (source and target)
- Max Input Length: 1024 tokens
- Max Summary Length: 128 tokens
- Training Objective: Conditional generation with cross-entropy loss
- Dataset: Official Portuguese municipal meeting minutes from 6 different municipalities
How It Works
To handle long meeting minutes, the model uses a sliding window (chunking with stride): the tokenized text is split into overlapping 1024-token windows, each starting 512 tokens after the previous one, so every token falls inside at least one window. For example, a 2048-token segment produces three chunks beginning at offsets 0, 512, and 1024. The example below processes long text with the same chunking logic used during training.
Example Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "your-username/mBART50-Summarization-Council-PT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Define source/target language codes for mBART-50
LANGUAGE_CODE = "pt_XX"
tokenizer.src_lang = LANGUAGE_CODE
tokenizer.tgt_lang = LANGUAGE_CODE

def chunk_text(text, tokenizer, max_length=1024, stride=512):
    """Split text into overlapping windows of at most max_length tokens."""
    tokens = tokenizer.encode(text, truncation=False)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_length, len(tokens))
        chunks.append(tokenizer.decode(tokens[start:end], skip_special_tokens=True))
        if end == len(tokens):
            break
        start += stride
    return chunks

text = "17. PROCESSO DE OBRAS N.º... [Insert long meeting text here]"

# Summarize each chunk independently, then join the partial summaries,
# so that text beyond the 1024-token window still contributes to the output.
summaries = []
for chunk in chunk_text(text, tokenizer):
    inputs = tokenizer(chunk, return_tensors="pt", max_length=1024, truncation=True)
    summary_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=4,
        early_stopping=True,
        forced_bos_token_id=tokenizer.lang_code_to_id[LANGUAGE_CODE],
    )
    summaries.append(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

print(" ".join(summaries))
```
Each chunk is summarized independently and the partial summaries are joined, so content beyond the first 1024-token window is not discarded; for very long minutes, the joined output can optionally be passed through the model a second time for a more compact summary.
⚙️ Training Details
- Pretrained Model: `facebook/mbart-large-50`
- Language: `pt_XX`
- Optimizer: AdamW
- Training Strategy: `Seq2SeqTrainer` with `predict_with_generate=True` (see the sketch below)
- Chunking Method: 1024 max length / 512 stride
- Beam Search: `num_beams=4`
- Precision: Mixed precision (FP16)
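For reference, here is a minimal training sketch using `Seq2SeqTrainer`, reflecting the settings listed above (FP16, `predict_with_generate=True`, 4-beam generation, and AdamW, which is the Trainer's default optimizer). The output path, learning rate, batch size, epoch count, and the `train_dataset`/`eval_dataset` placeholders are illustrative assumptions, not values from the original training run.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/mbart-large-50", src_lang="pt_XX", tgt_lang="pt_XX"
)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50")

# Placeholders: replace with your own tokenized Dataset objects built from
# the 1024-token / 512-stride chunks described above.
train_dataset = ...
eval_dataset = ...

training_args = Seq2SeqTrainingArguments(
    output_dir="mbart50-council-pt",  # illustrative path
    fp16=True,                        # mixed precision, per the card
    predict_with_generate=True,       # decode with generate() during evaluation
    generation_max_length=128,        # matches the 128-token summary limit
    generation_num_beams=4,           # beam search, per the card
    learning_rate=3e-5,               # assumption: not stated in the card
    per_device_train_batch_size=4,    # assumption: not stated in the card
    num_train_epochs=3,               # assumption: not stated in the card
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```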
📄 License
This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).
"""
- Downloads last month
- 17