---
language:
- pt
license: cc-by-nc-nd-4.0
colorTo: blue
tags:
- text-summarization
- abstractive-summarization
- portuguese
- administrative-documents
- municipal-meetings
- primera
library_name: transformers
base_model:
- allenai/PRIMERA
---
# Primera-Summarization-Council-PT: Abstractive Summarization of Discussion Subjects in Portuguese Municipal Meeting Minutes
## Model Description
**Primera-Summarization-Council-PT** is an **abstractive text summarization model** based on **PRIMERA**, fine-tuned to produce concise, informative summaries of discussion subjects from **Portuguese municipal meeting minutes**.
The model was trained on a curated and annotated corpus of official municipal meeting minutes covering a variety of administrative and political topics at the municipal level.
**Try out the model**: [Hugging Face Space Demo](https://huggingface.co/spaces/anonymous12321/Citilink-Summ-PT)
### Key Features
- 🧾 **Abstractive Summarization** – Generates natural, human-like summaries rather than extracts.
- 🇵🇹 **European Portuguese** – Optimized for official and administrative Portuguese.
- 🏛️ **Domain-Specific** – Trained on municipal meeting minutes and administrative discussions.
- ⚙️ **Fine-tuned PRIMERA** – Built upon `allenai/PRIMERA` using supervised fine-tuning.
- 🧠 **Fact-Aware Generation** – Produces short summaries that preserve factual content.
---
## Model Details
- Architecture: Longformer Encoder-Decoder (LED), an extension of BART
- Base Model: `allenai/PRIMERA`
- Task: Abstractive summarization (text → summary)
- Framework: Hugging Face Transformers (PyTorch)
- Tokenizer: Longformer/BART tokenizer (English vocabulary reused for Portuguese text)
- Max Input Length: 4096 tokens
- Max Summary Length: 128 tokens
- Training Objective: Conditional generation (cross-entropy loss)
- Dataset: Portuguese municipal meeting minutes annotated with summaries
---
## How It Works
The model receives a discussion subject of a municipal meeting and outputs a short, coherent summary highlighting:
- The **main subject or topic** of discussion
- Any **decisions, motions, or proposals** made
- The **entities or departments** involved
### Example Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = "anonymous12321/Primera-Summarization-Council-PT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
text = """
17. PROCESSO DE OBRAS N.º ***** -- EDIFIC\nPelo Senhor Presidente foi presente a esta reunião a informação n.º ****** da Secção de Urbanismo e Fiscalização -- Serviço de Obras Particulares que se anexa à presente ata. \nPonderado e analisado o assunto o Executivo Municipal deliberou por unanimidade aprovar as especialidades relativas ao processo de obras n.º ***** -- EDIFIC.
"""
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)  # model accepts up to 4,096 tokens; 1,024 used here
summary_ids = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```
### 🧾 Model Output
> "O Executivo Municipal aprovou, por unanimidade, as especialidades relativas a um processo de obras particulares."
---
## 📊 Evaluation Results
### Quantitative Metrics (on held-out test set)
| Metric | Score | Description |
|:-------|:------:|:------------|
| **ROUGE-1** | 0.632 | Unigram overlap between generated and reference summaries |
| **ROUGE-2** | 0.500 | Bigram overlap |
| **ROUGE-L** | 0.577 | Longest common subsequence overlap |
| **BERTScore (F1)** | 0.846 | Semantic similarity between summary and reference |
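ROUGE-1 is simply the unigram-overlap F1 between a generated summary and its reference. A minimal pure-Python illustration of the idea (toy code, not the official `rouge_score` implementation used for the numbers above):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Toy ROUGE-1 F1: unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1(
    "o executivo aprovou a proposta",
    "o executivo municipal aprovou a proposta",
), 3))  # 0.909
```

In practice the `evaluate` library's `rouge` metric additionally handles stemming, multiple references, and ROUGE-2/ROUGE-L.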
---
## ⚙️ Training Details
- **Pretrained Model:** `allenai/PRIMERA`
- **Optimizer:** AdamW (default in Hugging Face Trainer)
- **Learning Rate:** 2e-5
- **Batch Size:** 4
- **Epochs:** 3
- **Scheduler:** Linear warmup
- **Loss Function:** Cross-entropy
- **Evaluation Metrics:** ROUGE (computed on validation set every 100 steps)
- **Evaluation Strategy:** Step-based evaluation (`eval_steps=100`)
- **Weight Decay:** 0.01
- **Mixed Precision (fp16):** Enabled when CUDA is available
- **Chunking:** Implemented with `max_length=512` and `stride=256` for hierarchical input segmentation
- **Target (summary) Max Length:** 128 tokens
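The chunking scheme above amounts to an overlapping sliding window over the token-id sequence. A minimal sketch with the stated `max_length=512` and `stride=256` (hypothetical helper; the actual training pipeline is not published here):

```python
def chunk_token_ids(ids, max_length=512, stride=256):
    """Split a token-id sequence into overlapping windows.

    Consecutive windows share `stride` tokens, mirroring the
    max_length=512 / stride=256 segmentation described above.
    """
    step = max_length - stride  # advance 256 ids per window
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + max_length])
        if start + max_length >= len(ids):
            break
    return chunks

# A 1,000-token document yields three overlapping windows
print([len(c) for c in chunk_token_ids(list(range(1000)))])  # [512, 512, 488]
```

The Hugging Face tokenizers can produce the same segmentation directly via `return_overflowing_tokens=True` with a `stride` argument.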
---
## 📚 Dataset Description
The model was trained on a specialized dataset of **Portuguese municipal meeting minutes**, consisting of:
- Discussion Subjects from official municipal meeting minutes.
- Decisions and deliberations across departments (urban planning, finance, education, etc.)
- Expert-annotated summaries per discussion segment
**Dataset sources include:**
- Meeting minutes from six Portuguese municipalities
---
## ⚠️ Limitations
- **Language Restriction:** The model is optimized for Portuguese; performance may degrade in other languages.
- **Domain Dependence:** Best suited for administrative and institutional texts; less effective on informal or creative writing.
- **Length Sensitivity:** Inputs beyond the maximum input length are truncated (the model accepts up to 4,096 tokens; the usage example truncates at 1,024), so chunking may be needed for full documents.
- **Generalization:** While robust within-domain, it may underperform on unseen domains or vocabulary.
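For documents that exceed the input limit, one simple workaround is a map-reduce pass: split the text into chunks, summarize each chunk, and join (or re-summarize) the partial summaries. A hedged sketch, where `summarize` stands in for any `str -> str` callable wrapping the model's generate step (hypothetical helper, not part of this release):

```python
def summarize_long(text: str, summarize, max_words: int = 700) -> str:
    """Naive map-reduce summarization for over-long minutes.

    `summarize` is any str -> str callable (e.g. a wrapper around
    model.generate); `max_words` approximates the token budget.
    """
    words = text.split()
    parts = [" ".join(words[i:i + max_words])
             for i in range(0, len(words), max_words)]
    return " ".join(summarize(part) for part in parts)

# With an identity "summarizer", chunking round-trips the text
print(summarize_long("a b c d e", lambda s: s, max_words=2))  # a b c d e
```

Token-based chunking (via the tokenizer's `stride` option) is more faithful to the model's true input limit than this word-count approximation.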
---
## 📄 License
This model is released under the
**Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).**
---