mBART50-Topic-Generator-Council-PT: Topic & Title Generation for Municipal Minutes

Model Description

mBART50-Topic-Generator-Council-PT is a fine-tuned mBART-50 Large (Many-to-Many) model specialized in generating concise, formal titles (topics) from segments of Portuguese municipal meeting minutes.

The model is trained to transform dense administrative discourse into short titles (max. 15 words) using nominalization (e.g., "Approval of...", "Creation of...") without trailing punctuation, adhering to the standard formal style of Portuguese local government.

Key Features

  • 🏷️ Title Generation – Focused on extreme conciseness and high information density.
  • 🏗️ Nominalization Style – Specifically trained to start titles with action-based nouns.
  • 🇵🇹 European Portuguese (pt_XX) – Optimized for the specific institutional vocabulary of Portugal.
  • 🧹 Entity Sanitization – Trained on data where specific names (e.g., "Sr. Presidente") are mapped to generic terms ("um cidadão") for better generalization and privacy.
  • ⚙️ Instruction Tuned – Responds to a specific prefix to guide the formatting of the output.

Instruction Prompt Structure

For optimal performance, the input text should always be prefixed with the instruction used during training:

"Sumariza o segmento de ata num tema conciso (máx. 15 palavras), começando com nominalização (ex.: aprovação da, criação de) e sem pontuação final. Segmento: [YOUR_TEXT_HERE]"

(English gloss: "Summarize the minutes segment into a concise topic (max. 15 words), starting with a nominalization (e.g., approval of the, creation of) and with no final punctuation. Segment: [YOUR_TEXT_HERE]")
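Since the instruction prefix must match the training-time wording exactly, it may be convenient to keep it in a constant and assemble the prompt with a small helper. A minimal sketch (the `build_prompt` helper name is illustrative, not part of the model's API):

```python
# Training-time instruction prefix, reproduced verbatim from this card.
INSTRUCTION = (
    "Sumariza o segmento de ata num tema conciso (máx. 15 palavras), "
    "começando com nominalização (ex.: aprovação da, criação de) "
    "e sem pontuação final. Segmento: "
)

def build_prompt(segment: str) -> str:
    """Prepend the fixed instruction to a minutes segment."""
    return INSTRUCTION + segment.strip()
```

The prompt returned by this helper is what should be fed to the tokenizer, as shown in the Example Usage section below.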


Model Details

  • Architecture: facebook/mbart-large-50-many-to-many-mmt
  • Task: Sequence-to-Sequence (Topic/Title Generation)
  • Language Codes: pt_XX (source and target)
  • Max Input Length: 1024 tokens
  • Max Target Length: 150 tokens
  • Dataset: Annotated segments from 6 Portuguese municipalities.

Example Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "inesctec/Citilink-mBART-50-Theme-Generation-pt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Required configuration for mBART-50
LANGUAGE_CODE = "pt_XX"
tokenizer.src_lang = LANGUAGE_CODE
tokenizer.tgt_lang = LANGUAGE_CODE

instruction = "Sumariza o segmento de ata num tema conciso (máx. 15 palavras), começando com nominalização (ex.: aprovação da, criação de) e sem pontuação final. Segmento: "
text = "Pelo Senhor Presidente foi presente a esta reunião a informação da Secção de Urbanismo relativa ao processo de obras..."

full_input = instruction + text

inputs = tokenizer(full_input, return_tensors="pt", max_length=1024, truncation=True)

output_ids = model.generate(
    **inputs,  # pass input_ids and attention_mask together
    max_length=150,
    num_beams=5,
    early_stopping=True,
    # Force the decoder to start with the Portuguese language token
    forced_bos_token_id=tokenizer.lang_code_to_id[LANGUAGE_CODE]
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Expected output: "Aprovação de processo de obras da Secção de Urbanismo"
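Because generation is not guaranteed to respect the 15-word limit or the no-trailing-punctuation convention in every case, a light post-processing step can enforce the format described above. A minimal sketch (the `normalize_title` helper is hypothetical, not part of the model):

```python
import re

def normalize_title(title: str, max_words: int = 15) -> str:
    """Trim a generated title to the card's stated format:
    at most `max_words` words and no trailing punctuation."""
    title = title.strip()
    title = re.sub(r"[.!?;:,]+$", "", title)  # drop any trailing punctuation
    return " ".join(title.split()[:max_words])
```

This can be applied to the decoded string before display or storage.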

⚙️ Training Details

  • Optimizer: AdamW
  • Epochs: 3
  • Batch Size: 8
  • Schedule & Regularization: 500 warmup steps; weight decay of 0.01.
  • Precision: FP16 (Mixed Precision) enabled.
  • Evaluation: Sequence-to-sequence generation during evaluation (predict_with_generate=True).
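For reference, the hyperparameters above correspond to a `Seq2SeqTrainingArguments` configuration along these lines (a sketch only: `output_dir` and any arguments not listed in this card are assumptions):

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative configuration mirroring the values listed above.
training_args = Seq2SeqTrainingArguments(
    output_dir="mbart50-topic-generator-council-pt",  # assumed name
    num_train_epochs=3,
    per_device_train_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    fp16=True,
    predict_with_generate=True,
)
```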

⚠️ Limitations

  • Prompt Dependency: The model is sensitive to the instruction prefix. Omitting it may lead to less consistent title formatting.
  • Domain Specificity: Best suited for municipal/administrative contexts; performance on creative or casual Portuguese is not guaranteed.
  • Anonymization Artifacts: Due to training data sanitization, the model might automatically refer to specific individuals as "um cidadão" (a citizen).

📄 License

This model is released under the
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).
