mBART50-Summarization-Council-PT: Abstractive Summarization of Portuguese Municipal Meeting Minutes

Model Description

mBART50-Summarization-Council-PT is an abstractive text summarization model based on mBART-50 Large, fine-tuned to generate concise summaries of discussion subjects from Portuguese municipal meeting minutes.

Unlike general-purpose summarization models, this version is specifically optimized for the administrative and legalistic register of European Portuguese. It uses a chunking-with-stride strategy to process the long discussion segments common in municipal documentation, preserving relevant context even in documents that exceed the model's 1024-token input limit.

Try out the model: Hugging Face Space Demo

Key Features

  • 🧾 Abstractive Summarization – Generates natural, human-like summaries focusing on key deliberations and decisions.
  • 🇵🇹 European Portuguese – Optimized for the official register used in Portuguese public administration.
  • 🏛️ Domain-Specific – Trained on a curated corpus of municipal meeting minutes (urban planning, finance, social support, etc.).
  • ⚙️ Chunk-Aware Processing – Handles long-form text with a 1024-token window and a 512-token stride.
  • 🧠 mBART-50 Architecture – Leverages the large-scale multilingual pre-training of facebook/mbart-large-50.

Model Details

  • Architecture: facebook/mbart-large-50
  • Task: Abstractive summarization (sequence-to-sequence generation)
  • Language Codes: pt_XX (source and target)
  • Max Input Length: 1024 tokens
  • Max Summary Length: 128 tokens
  • Training Objective: Conditional generation with Cross-Entropy loss
  • Dataset: Official Portuguese municipal meeting minutes from six municipalities.
  • Model Size: ~0.6B parameters (FP32 weights)

How It Works

To handle long meeting minutes, the model relies on a sliding window (chunking with stride). The example below splits a long text into overlapping chunks using the same parameters as during training, summarizes each chunk independently, and joins the partial summaries.

Example Usage

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "liaad/Citilink_mBART-50_Summarization"  # this model's Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Define language codes for mBART
LANGUAGE_CODE = "pt_XX"
tokenizer.src_lang = LANGUAGE_CODE
tokenizer.tgt_lang = LANGUAGE_CODE

def chunk_text(text, tokenizer, max_length=1024, stride=512):
    # Split the token sequence into overlapping windows: each chunk starts
    # `stride` tokens after the previous one, so consecutive chunks share
    # max_length - stride (= 512) tokens of context.
    tokens = tokenizer.encode(text, truncation=False)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_length, len(tokens))
        chunks.append(tokenizer.decode(tokens[start:end], skip_special_tokens=True))
        if end == len(tokens):
            break
        start += stride
    return chunks

text = "17. PROCESSO DE OBRAS N.ΒΊ... [Insert long meeting text here]"

# Summarize each chunk independently, then join the partial summaries.
# (Re-joining the raw chunks would duplicate the overlapping text and the
# result would be truncated back to 1024 tokens, defeating the chunking.)
summaries = []
for chunk in chunk_text(text, tokenizer):
    inputs = tokenizer(chunk, return_tensors="pt", max_length=1024, truncation=True)
    with torch.no_grad():
        summary_ids = model.generate(
            **inputs,
            max_length=128,
            num_beams=4,
            early_stopping=True,
            forced_bos_token_id=tokenizer.lang_code_to_id[LANGUAGE_CODE],
        )
    summaries.append(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

print(" ".join(summaries))
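
For very long minutes, the joined partial summaries can themselves grow past a comfortable length. An optional second pass (a sketch, not part of the original card) re-summarizes the concatenated partial summaries with the same generation settings:

# Optional second pass (hypothetical): compress the joined partial summaries.
joined = " ".join(summaries)
inputs = tokenizer(joined, return_tensors="pt", max_length=1024, truncation=True)
with torch.no_grad():
    final_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=4,
        early_stopping=True,
        forced_bos_token_id=tokenizer.lang_code_to_id[LANGUAGE_CODE],
    )
print(tokenizer.decode(final_ids[0], skip_special_tokens=True))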

βš™οΈ Training Details

  • Pretrained Model: facebook/mbart-large-50
  • Language: pt_XX
  • Optimizer: AdamW
  • Training Strategy: Seq2SeqTrainer with predict_with_generate=True (see the configuration sketch after this list)
  • Chunking Method: 1024 max length / 512 stride
  • Beam Search: num_beams=4
  • Precision: Mixed Precision (FP16)
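
The card does not include the training script. The following is a minimal sketch of a Seq2SeqTrainer setup consistent with the settings above; the dataset variables and the hyperparameters marked as assumed are illustrative, not taken from the original card. AdamW is the Trainer's default optimizer, so it needs no explicit setting.

from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Hypothetical configuration mirroring the card: FP16 mixed precision,
# beam search of 4, 128-token summaries, generation during evaluation.
training_args = Seq2SeqTrainingArguments(
    output_dir="mbart50-council-pt",  # assumed output path
    predict_with_generate=True,
    generation_max_length=128,
    generation_num_beams=4,
    fp16=True,
    learning_rate=3e-5,             # assumed; not stated in the card
    per_device_train_batch_size=4,  # assumed
    num_train_epochs=3,             # assumed
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # pre-chunked minutes (1024 tokens / 512 stride)
    eval_dataset=eval_dataset,    # assumed held-out split
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()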

📄 License

This model is released under the
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).
