mBART50-Summarization-Council-PT: Abstractive Summarization of Portuguese Municipal Meeting Minutes
Model Description
mBART50-Summarization-Council-PT is an abstractive text summarization model based on mBART-50 Large, fine-tuned to generate concise summaries of discussion subjects from Portuguese municipal meeting minutes.
Unlike general-purpose summarization models, this version is specifically optimized for the administrative and legalistic register of European Portuguese. It uses a chunking-and-stride strategy to process the long discussion segments common in municipal documentation, ensuring that relevant context is preserved even in documents exceeding standard token limits.
Try out the model: Hugging Face Space Demo
Key Features
- 🧾 Abstractive Summarization – Generates natural, human-like summaries focusing on key deliberations and decisions.
- 🇵🇹 European Portuguese – Optimized for the official language used in Portuguese public administration.
- 🏛️ Domain-Specific – Trained on a curated corpus of municipal meeting minutes (urban planning, finance, social support, etc.).
- ⚙️ Chunk-Aware Processing – Developed using a window of 1024 tokens with a 512-token stride to handle long-form text.
- 🧠 mBART-50 Architecture – Leverages the large-scale multilingual pre-training of `facebook/mbart-large-50`.
Model Details
- Architecture: `facebook/mbart-large-50`
- Task: Abstractive summarization (seq2seq generation)
- Language Code: `pt_XX` (source and target)
- Max Input Length: 1024 tokens
- Max Summary Length: 128 tokens
- Training Objective: Conditional generation with cross-entropy loss
- Dataset: Official Portuguese municipal meeting minutes from 6 different municipalities
How It Works
To handle long meeting minutes, the model uses a sliding window (chunking with stride): the tokenized text is split into overlapping 1024-token windows, each starting 512 tokens after the previous one, so every token falls inside at least one window. For example, a 2048-token segment produces three chunks beginning at offsets 0, 512, and 1024. The example below processes long text with the same chunking logic used during training.
Example Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "your-username/mBART50-Summarization-Council-PT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Define source/target language codes for mBART-50
LANGUAGE_CODE = "pt_XX"
tokenizer.src_lang = LANGUAGE_CODE
tokenizer.tgt_lang = LANGUAGE_CODE

def chunk_text(text, tokenizer, max_length=1024, stride=512):
    """Split text into overlapping windows of at most max_length tokens."""
    tokens = tokenizer.encode(text, truncation=False)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_length, len(tokens))
        chunks.append(tokenizer.decode(tokens[start:end], skip_special_tokens=True))
        if end == len(tokens):
            break
        start += stride
    return chunks

text = "17. PROCESSO DE OBRAS N.º... [Insert long meeting text here]"

# Summarize each chunk independently, then join the partial summaries,
# so that text beyond the 1024-token window still contributes to the output.
summaries = []
for chunk in chunk_text(text, tokenizer):
    inputs = tokenizer(chunk, return_tensors="pt", max_length=1024, truncation=True)
    summary_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=4,
        early_stopping=True,
        forced_bos_token_id=tokenizer.lang_code_to_id[LANGUAGE_CODE],
    )
    summaries.append(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

print(" ".join(summaries))
```
Each chunk is summarized independently and the partial summaries are joined, so content beyond the first 1024-token window is not discarded; for very long minutes, the joined output can optionally be passed through the model a second time for a more compact summary.
⚙️ Training Details
- Pretrained Model: `facebook/mbart-large-50`
- Language: `pt_XX`
- Optimizer: AdamW
- Training Strategy: `Seq2SeqTrainer` with `predict_with_generate=True` (see the sketch below)
- Chunking Method: 1024 max length / 512 stride
- Beam Search: `num_beams=4`
- Precision: Mixed precision (FP16)
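For reference, here is a minimal training sketch using `Seq2SeqTrainer`, reflecting the settings listed above (FP16, `predict_with_generate=True`, 4-beam generation, and AdamW, which is the Trainer's default optimizer). The output path, learning rate, batch size, epoch count, and the `train_dataset`/`eval_dataset` placeholders are illustrative assumptions, not values from the original training run.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/mbart-large-50", src_lang="pt_XX", tgt_lang="pt_XX"
)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50")

# Placeholders: replace with your own tokenized Dataset objects built from
# the 1024-token / 512-stride chunks described above.
train_dataset = ...
eval_dataset = ...

training_args = Seq2SeqTrainingArguments(
    output_dir="mbart50-council-pt",  # illustrative path
    fp16=True,                        # mixed precision, per the card
    predict_with_generate=True,       # decode with generate() during evaluation
    generation_max_length=128,        # matches the 128-token summary limit
    generation_num_beams=4,           # beam search, per the card
    learning_rate=3e-5,               # assumption: not stated in the card
    per_device_train_batch_size=4,    # assumption: not stated in the card
    num_train_epochs=3,               # assumption: not stated in the card
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```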
📄 License
This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).
"""
- Downloads last month
- 17