---
language:
- pt
license: cc-by-nc-nd-4.0
colorTo: blue
tags:
- text-summarization
- abstractive-summarization
- portuguese
- administrative-documents
- municipal-meetings
- primera
library_name: transformers
base_model:
- allenai/PRIMERA
---

# Primera-Summarization-Council-PT: Abstractive Summarization of Discussion Subjects in Portuguese Municipal Meeting Minutes

## Model Description

**Primera-Summarization-Council-PT** is an **abstractive text summarization model** based on **PRIMERA**, fine-tuned to produce concise and informative summaries of discussion subjects from **Portuguese municipal meeting minutes**.  
The model was trained on a curated and annotated corpus of official municipal meeting minutes covering a variety of administrative and political topics at the municipal level.

**Try out the model**: [Hugging Face Space Demo](https://huggingface.co/spaces/anonymous12321/Citilink-Summ-PT)

### Key Features

- 🧾 **Abstractive Summarization** – Generates natural, human-like summaries rather than extracts.  
- 🇵🇹 **European Portuguese** – Optimized for official and administrative Portuguese.  
- 🏛️ **Domain-Specific** – Trained on municipal meeting minutes and administrative discussions.  
- ⚙️ **Fine-tuned PRIMERA** – Built upon `allenai/PRIMERA` using supervised fine-tuning.  
- 🧠 **Fact-Aware Generation** – Produces short summaries that preserve factual content.  

---

## Model Details

- Base Model: allenai/PRIMERA
- Architecture: Longformer Encoder-Decoder (LED, an extension of BART)
- Task: Abstractive summarization (text → summary)
- Framework: Hugging Face Transformers (PyTorch)
- Tokenizer: Longformer/BART tokenizer (English vocabulary reused for Portuguese text)
- Max Input Length: 4096 tokens
- Max Summary Length: 128 tokens
- Training Objective: Conditional generation (cross-entropy loss)
- Dataset: Portuguese municipal meeting minutes annotated with summaries

---

## How It Works

The model takes the text of a discussion subject from a municipal meeting's minutes and outputs a short, coherent summary highlighting:
- The **main subject or topic** of discussion  
- Any **decisions, motions, or proposals** made  
- The **entities or departments** involved  

### Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "anonymous12321/Primera-Summarization-Council-PT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = """
17. PROCESSO DE OBRAS N.º ***** -- EDIFIC\nPelo Senhor Presidente foi presente a esta reunião a informação n.º ****** da Secção de Urbanismo e Fiscalização -- Serviço de Obras Particulares que se anexa à presente ata. \nPonderado e analisado o assunto o Executivo Municipal deliberou por unanimidade aprovar as especialidades relativas ao processo de obras n.º ***** -- EDIFIC.
"""

# Tokenize the discussion text (truncated here at 1024 tokens) and generate a beam-search summary
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
summary_ids = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

```

### 🧾 Model Output

**Output:**  
> "O Executivo Municipal aprovou, por unanimidade, as especialidades relativas a um processo de obras particulares."

---

## 📊 Evaluation Results

### Quantitative Metrics (on a held-out test set)

| Metric | Score | Description |
|:-------|:------:|:------------|
| **ROUGE-1** | 0.632 | Unigram overlap between generated and reference summaries |
| **ROUGE-2** | 0.500 | Bigram overlap |
| **ROUGE-L** | 0.577 | Longest common subsequence overlap |
| **BERTScore (F1)** | 0.846 | Semantic similarity between summary and reference |
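
ROUGE and BERTScore values of this kind are typically computed with the `evaluate` library. The snippet below is a generic, minimal sketch with placeholder strings; it is not the exact evaluation script behind the table above.

```python
# Generic metric computation with the `evaluate` library; the prediction and
# reference strings are placeholders, not the actual test-set outputs.
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["O Executivo Municipal aprovou, por unanimidade, as especialidades do processo de obras."]
references = ["O Executivo Municipal deliberou aprovar, por unanimidade, as especialidades relativas ao processo de obras."]

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="pt")

print({k: round(v, 3) for k, v in rouge_scores.items()})
print("BERTScore F1:", round(sum(bert_scores["f1"]) / len(bert_scores["f1"]), 3))
```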

---

## ⚙️ Training Details

- **Pretrained Model:** `allenai/PRIMERA`  
- **Optimizer:** AdamW (default in Hugging Face Trainer)  
- **Learning Rate:** 2e-5  
- **Batch Size:** 4  
- **Epochs:** 3  
- **Scheduler:** Linear warmup  
- **Loss Function:** Cross-entropy  
- **Evaluation Metrics:** ROUGE (computed on validation set every 100 steps)  
- **Evaluation Strategy:** Step-based evaluation (`eval_steps=100`)  
- **Weight Decay:** 0.01  
- **Mixed Precision (fp16):** Enabled when CUDA is available  
- **Chunking:** Implemented with `max_length=512` and `stride=256` for hierarchical input segmentation  
- **Target (summary) Max Length:** 128 tokens  
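
The hyperparameters listed above map fairly directly onto the Hugging Face `Seq2SeqTrainer` API. The sketch below shows one plausible configuration under those settings; it is illustrative only, and the output directory, warmup step count, and dataset variables are assumptions rather than details taken from the actual training script.

```python
# Illustrative configuration matching the hyperparameters above; output_dir,
# warmup_steps, and the dataset variables are placeholders/assumptions.
import torch
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base_model = "allenai/PRIMERA"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSeq2SeqLM.from_pretrained(base_model)

training_args = Seq2SeqTrainingArguments(
    output_dir="primera-council-pt",      # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    lr_scheduler_type="linear",           # linear schedule with warmup
    warmup_steps=500,                     # assumption: the card only states "linear warmup"
    eval_strategy="steps",                # `evaluation_strategy` on older transformers releases
    eval_steps=100,                       # evaluate every 100 steps (ROUGE via a compute_metrics hook, not shown)
    predict_with_generate=True,
    generation_max_length=128,            # target summary length
    fp16=torch.cuda.is_available(),       # mixed precision when CUDA is available
)

# train_ds / val_ds: tokenized (document, summary) datasets prepared with the
# chunking described above (max_length=512, stride=256); preparation not shown.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```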

---

## 📚 Dataset Description

The model was trained on a specialized dataset of **Portuguese municipal meeting minutes**, consisting of:

- Discussion subjects from official municipal meeting minutes  
- Decisions and deliberations across departments (urban planning, finance, education, etc.)  
- Expert-annotated summaries for each discussion segment  

**Dataset sources include:**

- Meeting minutes from six Portuguese municipalities

---

## ⚠️ Limitations

- **Language Restriction:** The model is optimized for Portuguese; performance may degrade in other languages.  
- **Domain Dependence:** Best suited for administrative and institutional texts; less effective on informal or creative writing.  
- **Length Sensitivity:** Very long transcripts that exceed the maximum input length are truncated; chunking may be needed for full documents (see the sketch after this list).  
- **Generalization:** While robust within-domain, it may underperform on unseen domains or vocabulary.  
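
As one possible workaround for long documents, the sketch below splits the input into overlapping token chunks (mirroring the training-time chunking of `max_length=512` with `stride=256`), summarizes each chunk, and joins the partial summaries. This is an illustrative strategy, not an official preprocessing pipeline for this model.

```python
# Illustrative chunked inference for documents longer than the input window;
# the chunk length and stride mirror the training setup and are not prescriptive.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "anonymous12321/Primera-Summarization-Council-PT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def summarize_long(text, chunk_len=512, stride=256, summary_len=128):
    # Split the document into overlapping token chunks (requires a fast tokenizer).
    enc = tokenizer(
        text,
        max_length=chunk_len,
        stride=stride,
        truncation=True,
        padding=True,
        return_overflowing_tokens=True,
        return_tensors="pt",
    )
    partial_summaries = []
    for input_ids, attention_mask in zip(enc["input_ids"], enc["attention_mask"]):
        ids = model.generate(
            input_ids=input_ids.unsqueeze(0),
            attention_mask=attention_mask.unsqueeze(0),
            max_length=summary_len,
            num_beams=4,
            early_stopping=True,
        )
        partial_summaries.append(tokenizer.decode(ids[0], skip_special_tokens=True))
    # Join the partial summaries; a second pass could condense them further.
    return " ".join(partial_summaries)
```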

---

## 📄 License

This model is released under the  
**Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).**

---