Update README.md

README.md CHANGED
@@ -40,7 +40,6 @@ The model was trained on a curated and annotated corpus of official municipal me
 - **Task:** Abstractive summarization (`text → summary`)
 - **Framework:** 🤗 Transformers (PyTorch)
 - **Max Input Length:** 512 tokens
-- **Max Summary Length:** 128 tokens
 - **Training Objective:** Conditional generation (cross-entropy loss)
 - **Dataset:** Portuguese municipal meeting minutes annotated with summaries
 
@@ -58,7 +57,7 @@ The model receives a discussion subject of a municipal meeting and outputs a sho
 ```python
 from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
 
-model_name = "anonymous12321/
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
 
@@ -85,26 +84,36 @@ print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
 
 | Metric | Score | Description |
 |:-------|:------:|:------------|
-| **ROUGE-1** | .
-| **ROUGE-2** | .
-| **ROUGE-L** | .
-| **BERTScore (F1)** | .
 
 ---
 
 ## ⚙️ Training Details
 
 - **Pretrained Model:** `google/pegasus-xsum`
 - **Optimizer:** AdamW (default in Hugging Face Trainer)
 - **Learning Rate:** 2e-5
 - **Batch Size:** 4
 - **Epochs:** 3
-- **Scheduler:** Linear warmup
-- **Loss Function:** Cross-entropy
 - **Evaluation Metrics:** ROUGE (computed on validation set every 100 steps)
-- **Evaluation Strategy:** Step-based evaluation (`eval_steps=100`)
 - **Weight Decay:** 0.01
 - **Mixed Precision (fp16):** Enabled when CUDA is available
 
 ---
 
@@ -131,15 +140,6 @@ The model was trained on a specialized dataset of **Portuguese municipal meeting
 
 ---
 
-## ⚖️ Ethical Considerations
-
-The model is intended for **research and administrative document processing**.
-
-- Outputs should **not** be used for legal decision-making without human verification.
-- Potential bias may exist due to limited geographic and institutional diversity in training data.
-
----
-
 ## 📄 License
 
 This model is released under the
 ```python
 from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
 
+model_name = "anonymous12321/Pegasus-Summarization-Council-PT"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
 
 
 | Metric | Score | Description |
 |:-------|:------:|:------------|
+| **ROUGE-1** | 0.367 | Unigram overlap between generated and reference summaries |
+| **ROUGE-2** | 0.179 | Bigram overlap |
+| **ROUGE-L** | 0.309 | Longest common subsequence overlap |
+| **BERTScore (F1)** | 0.746 | Semantic similarity between summary and reference |
 
 ---
 
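ROUGE-1 is unigram overlap between a generated summary and its reference. As a rough illustration of what that score measures, here is a toy re-implementation of ROUGE-1 F1; this is not the evaluation code used for this model (real evaluations typically use the `rouge_score` or `evaluate` packages), and the example sentences are made up:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Toy ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Each unigram counts at most min(candidate count, reference count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the council approved the budget", "council approved budget"))
```

ROUGE-2 is the same computation over bigrams, and ROUGE-L replaces the overlap count with the longest common subsequence length.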
 ## ⚙️ Training Details
 
 - **Pretrained Model:** `google/pegasus-xsum`
+- **Tokenizer:** `AutoTokenizer` (matching checkpoint)
 - **Optimizer:** AdamW (default in Hugging Face Trainer)
 - **Learning Rate:** 2e-5
 - **Batch Size:** 4
 - **Epochs:** 3
+- **Scheduler:** Linear warmup (default)
+- **Loss Function:** Cross-entropy (seq2seq objective)
 - **Evaluation Metrics:** ROUGE (computed on validation set every 100 steps)
+- **Evaluation Strategy:** Step-based evaluation (`eval_strategy="steps"`, `eval_steps=100`)
 - **Weight Decay:** 0.01
+- **Logging Steps:** 10
 - **Mixed Precision (fp16):** Enabled when CUDA is available
+- **Save Strategy:** Keep only latest checkpoint (`save_total_limit=1`)
+- **Chunking:** Token-based with `max_length=512` and `stride=256`
+- **Target Max Length:** 128
+- **Validation Split:** 10% of data
+- **Data Collator:** `DataCollatorForSeq2Seq` (dynamic padding)
+- **Output Directory:** `./results_hierarchical_pegasus_segments`
+- **Saved Model Path:** `./trained_pegasus_segments`
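The chunking entry implies a sliding window over the tokenized minutes: consecutive 512-token windows overlapping by 256 tokens. A toy sketch of that windowing logic, assuming the preprocessing behaves like `tokenizer(..., truncation=True, max_length=512, stride=256, return_overflowing_tokens=True)` does at the token-id level:

```python
def chunk_token_ids(ids, max_length=512, stride=256):
    """Split a token-id sequence into windows of up to `max_length`,
    where consecutive windows overlap by `stride` tokens."""
    step = max_length - stride  # advance by the non-overlapping part
    chunks = []
    start = 0
    while True:
        chunks.append(ids[start:start + max_length])
        if start + max_length >= len(ids):
            break  # the last window reached the end of the sequence
        start += step
    return chunks

chunks = chunk_token_ids(list(range(1000)))
print([len(c) for c in chunks])  # [512, 512, 488]
print(chunks[1][0])              # 256: second window starts 256 tokens in
```

The overlap keeps context that straddles a window boundary visible to at least one chunk, at the cost of summarizing some tokens twice.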
 
 ---
 
 
 ---
 
 ## 📄 License
 
 This model is released under the