anonymous12321 committed on
Commit 2ddf7f2 · verified · 1 Parent(s): a3e137a

Create README.md

  1. README.md +149 -0
README.md ADDED
@@ -0,0 +1,149 @@
---
language:
- pt
license: cc-by-nc-nd-4.0
colorTo: blue
tags:
- text-summarization
- abstractive-summarization
- portuguese
- administrative-documents
- municipal-meetings
- primera
library_name: transformers
base_model:
- allenai/primera
---

# Primera-Summarization-Council-PT: Abstractive Summarization of Discussion Subjects in Portuguese Municipal Meeting Minutes

## Model Description

**Primera-Summarization-Council-PT** is an **abstractive text summarization model** based on **PRIMERA** (`allenai/primera`), fine-tuned to produce concise and informative summaries of discussion subjects from **Portuguese municipal meeting minutes**.
The model was trained on a curated and annotated corpus of official municipal meeting minutes covering a variety of administrative and political topics at the municipal level.

**Try out the model**: [Hugging Face Space Demo](https://huggingface.co/spaces/anonymous12321/CitilinkSumm-PT)

### Key Features

- 🧾 **Abstractive Summarization** – Generates natural, human-like summaries instead of copying sentences verbatim from the source.
- 🇵🇹 **European Portuguese** – Optimized for official and administrative Portuguese.
- 🏛️ **Domain-Specific** – Trained on municipal meeting minutes and administrative discussions.
- ⚙️ **Fine-tuned PRIMERA** – Built upon `allenai/primera` using supervised fine-tuning.
- 🧠 **Fact-Aware Generation** – Produces short summaries intended to preserve the factual content of the source.

---

## Model Details

- **Architecture:** `allenai/primera`
- **Task:** Abstractive summarization (`text → summary`)
- **Framework:** 🤗 Transformers (PyTorch)
- **Tokenizer:** BART-base tokenizer (English vocabulary adapted for Portuguese text)
- **Max Input Length:** 1024 tokens
- **Max Summary Length:** 128 tokens
- **Training Objective:** Conditional generation (cross-entropy loss)
- **Dataset:** Portuguese municipal meeting minutes annotated with summaries

---

## How It Works

The model receives the text of one discussion subject from a municipal meeting and outputs a short, coherent summary highlighting:
- The **main subject or topic** of discussion
- Any **decisions, motions, or proposals** made
- The **entities or departments** involved

### Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
model_name = "anonymous12321/CitilinkSumm-PT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# One discussion subject taken from the minutes
text = """
17. PROCESSO DE OBRAS N.º ***** -- EDIFIC\nPelo Senhor Presidente foi presente a esta reunião a informação n.º ****** da Secção de Urbanismo e Fiscalização -- Serviço de Obras Particulares que se anexa à presente ata. \nPonderado e analisado o assunto o Executivo Municipal deliberou por unanimidade aprovar as especialidades relativas ao processo de obras n.º ***** -- EDIFIC.
"""

# Tokenize, truncating the input to the 1024-token limit
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
# Generate with beam search (4 beams); the summary is capped at 128 tokens
summary_ids = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

### 🧾 Model Output

**Output:**
> "O Executivo Municipal aprovou, por unanimidade, as especialidades relativas a um processo de obras particulares."

---

## 📊 Evaluation Results

### Quantitative Metrics (on held-out test set)

| Metric | Score | Description |
|:-------|:------:|:------------|
| **ROUGE-1** | ... | Unigram overlap between generated and reference summaries |
| **ROUGE-2** | ... | Bigram overlap |
| **ROUGE-L** | ... | Longest common subsequence overlap |
| **BERTScore (F1)** | ... | Semantic similarity between summary and reference |
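
To make the overlap metrics in the table concrete, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L F1 scores. It is an illustration only, not the evaluation code used for this model: it assumes lowercased whitespace tokenization, and the function names (`rouge_n`, `rouge_l`) are hypothetical. Dedicated ROUGE libraries apply their own tokenization and may give different numbers.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1_score(overlap, candidate_total, reference_total):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """ROUGE-N F1 based on clipped n-gram overlap (n=1 unigrams, n=2 bigrams)."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter intersection clips counts
    return f1_score(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f1_score(lcs_length(cand, ref), len(cand), len(ref))
```

For example, `rouge_n(generated, reference, n=2)` compares bigrams, while `rouge_l(generated, reference)` rewards words that appear in the same order even when they are not adjacent.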

---

## ⚙️ Training Details

- **Pretrained Model:** `allenai/primera`
- **Optimizer:** AdamW (default in Hugging Face Trainer)
- **Learning Rate:** 2e-5
- **Batch Size:** 4
- **Epochs:** 3
- **Scheduler:** Linear warmup
- **Loss Function:** Cross-entropy
- **Evaluation Metrics:** ROUGE (computed on validation set every 100 steps)
- **Evaluation Strategy:** Step-based evaluation (`eval_steps=100`)
- **Weight Decay:** 0.01
- **Mixed Precision (fp16):** Enabled when CUDA is available
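
The hyperparameters above could be collected into keyword arguments along the following lines. This is a sketch under assumptions, not the original training script: `build_training_kwargs` is a hypothetical helper, the keys are chosen to mirror parameter names of `Seq2SeqTrainingArguments` in 🤗 Transformers, the batch size is assumed to be per device, and the scheduler entry is assumed to correspond to the `"linear"` schedule. The argument that selects step-based evaluation is left out because its name differs between library versions.

```python
def build_training_kwargs(cuda_available: bool) -> dict:
    """Hypothetical helper that gathers the hyperparameters listed in this card."""
    return {
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 4,  # assumption: "Batch Size: 4" is per device
        "num_train_epochs": 3,
        "weight_decay": 0.01,
        "lr_scheduler_type": "linear",     # assumption: "Linear warmup" means the linear schedule
        "eval_steps": 100,                 # evaluate on the validation set every 100 steps
        "fp16": cuda_available,            # mixed precision only when CUDA is available
    }


# Example: settings for a machine without a GPU
kwargs = build_training_kwargs(cuda_available=False)
print(kwargs)
```

In practice the flag would typically come from `torch.cuda.is_available()`, and the dictionary would be unpacked into the training-arguments constructor together with an output directory.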

---

## 📚 Dataset Description

The model was trained on a specialized dataset of **Portuguese municipal meeting minutes**, consisting of:

- Discussion subjects from official municipal meeting minutes
- Decisions and deliberations across departments (urban planning, finance, education, etc.)
- Expert-annotated summaries for each discussion segment

**Dataset sources include:**

- Meeting minutes from six Portuguese municipalities

---

## ⚠️ Limitations

- **Language Restriction:** The model is optimized for Portuguese; performance may degrade in other languages.
- **Domain Dependence:** Best suited for administrative and institutional texts; less effective on informal or creative writing.
- **Length Sensitivity:** Very long transcripts (>1024 tokens) are truncated; chunking may be needed for full documents.
- **Generalization:** While robust within-domain, it may underperform on unseen domains or vocabulary.
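
One way to approach the chunking mentioned under Length Sensitivity is to split the token IDs of a long document into overlapping windows and summarize each window separately. The sketch below is an assumption about how this could be done, not part of the released model: `chunk_token_ids` is a hypothetical helper, and the overlap of 64 tokens is an arbitrary choice. A real pipeline would also need to leave room for the tokenizer's special tokens and decide how to combine the per-chunk summaries.

```python
def chunk_token_ids(token_ids, max_length=1024, overlap=64):
    """Split a list of token IDs into windows of at most `max_length` tokens.

    Consecutive windows share `overlap` tokens so that a sentence cut at a
    boundary still appears with some context in the next window.
    """
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be non-negative and smaller than max_length")
    step = max_length - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the document
    return chunks


# Example with dummy IDs standing in for a tokenized document of 2500 tokens
dummy_ids = list(range(2500))
windows = chunk_token_ids(dummy_ids)
print([len(w) for w in windows])
```

Each window can then be passed to the model in the same way as a single discussion subject.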

---

## ⚖️ Ethical Considerations

The model is intended for **research and administrative document processing**.

- Outputs should **not** be used for legal decision-making without human verification.
- Potential bias may exist due to limited geographic and institutional diversity in training data.

---

## 📄 License

This model is released under the **Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).**

---