anonymous12321 committed on
Commit f904b75 · verified · 1 Parent(s): 5749ee6

Update README.md

Files changed (1):
  1. README.md +18 -18

README.md CHANGED
@@ -40,7 +40,6 @@ The model was trained on a curated and annotated corpus of official municipal me
 - **Task:** Abstractive summarization (`text → summary`)
 - **Framework:** 🤗 Transformers (PyTorch)
 - **Max Input Length:** 512 tokens
-- **Max Summary Length:** 128 tokens
 - **Training Objective:** Conditional generation (cross-entropy loss)
 - **Dataset:** Portuguese municipal meeting minutes annotated with summaries
 
@@ -58,7 +57,7 @@ The model receives a discussion subject of a municipal meeting and outputs a sho
 ```python
 from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
 
-model_name = "anonymous12321/CitilinkSumm-PT"
+model_name = "anonymous12321/Pegasus-Summarization-Council-PT"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
 
@@ -85,26 +84,36 @@ print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
 
 | Metric | Score | Description |
 |:-------|:------:|:------------|
-| **ROUGE-1** | ... | Unigram overlap between generated and reference summaries |
-| **ROUGE-2** | ... | Bigram overlap |
-| **ROUGE-L** | ... | Longest common subsequence overlap |
-| **BERTScore (F1)** | ... | Semantic similarity between summary and reference |
+| **ROUGE-1** | 0.367 | Unigram overlap between generated and reference summaries |
+| **ROUGE-2** | 0.179 | Bigram overlap |
+| **ROUGE-L** | 0.309 | Longest common subsequence overlap |
+| **BERTScore (F1)** | 0.746 | Semantic similarity between summary and reference |
 
 ---
 
 ## ⚙️ Training Details
 
 - **Pretrained Model:** `google/pegasus-xsum`
+- **Tokenizer:** `AutoTokenizer` (matching checkpoint)
 - **Optimizer:** AdamW (default in Hugging Face Trainer)
 - **Learning Rate:** 2e-5
 - **Batch Size:** 4
 - **Epochs:** 3
-- **Scheduler:** Linear warmup
-- **Loss Function:** Cross-entropy
+- **Scheduler:** Linear warmup (default)
+- **Loss Function:** Cross-entropy (seq2seq objective)
 - **Evaluation Metrics:** ROUGE (computed on validation set every 100 steps)
-- **Evaluation Strategy:** Step-based evaluation (`eval_steps=100`)
+- **Evaluation Strategy:** Step-based evaluation (`eval_strategy="steps"`, `eval_steps=100`)
 - **Weight Decay:** 0.01
+- **Logging Steps:** 10
 - **Mixed Precision (fp16):** Enabled when CUDA is available
+- **Save Strategy:** Keep only latest checkpoint (`save_total_limit=1`)
+- **Chunking:** Token-based with `max_length=512` and `stride=256`
+- **Target Max Length:** 128
+- **Validation Split:** 10% of data
+- **Data Collator:** `DataCollatorForSeq2Seq` (dynamic padding)
+- **Output Directory:** `./results_hierarchical_pegasus_segments`
+- **Saved Model Path:** `./trained_pegasus_segments`
+
 
 ---
 
@@ -131,15 +140,6 @@ The model was trained on a specialized dataset of **Portuguese municipal meeting
 
 ---
 
-## ⚖️ Ethical Considerations
-
-The model is intended for **research and administrative document processing**.
-
-- Outputs should **not** be used for legal decision-making without human verification.
-- Potential bias may exist due to limited geographic and institutional diversity in training data.
-
----
-
 ## 📄 License
 
 This model is released under the
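The token-based chunking the diff adds under Training Details (`max_length=512`, `stride=256`) can be pictured as an overlapping sliding window over the token ids. The sketch below is only an illustration of that windowing: the helper name `chunk_tokens`, and the reading of `stride` as the window step, are our assumptions (note that Hugging Face tokenizers instead treat `stride` as the *overlap* when used with `return_overflowing_tokens=True`).

```python
def chunk_tokens(token_ids, max_length=512, stride=256):
    """Split a token-id sequence into overlapping windows.

    Each chunk starts `stride` tokens after the previous one, so with the
    defaults consecutive chunks share max_length - stride = 256 tokens of
    context. Assumed reading of the README's chunking settings.
    """
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the last window already covers the end of the sequence
    return chunks
```

Each chunk would then be summarized independently (the README's output directory name suggests a hierarchical, segment-wise pipeline), with the per-segment summaries combined afterwards.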
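The hyperparameters listed in the diff map naturally onto the Hugging Face `Seq2SeqTrainingArguments`. This is a hedged reconstruction, not the author's actual training script; argument names assume a recent `transformers` release (where `eval_strategy` replaced the older `evaluation_strategy` keyword, matching the diff's own `eval_strategy="steps"`).

```python
import torch
from transformers import Seq2SeqTrainingArguments

# Config sketch mirroring the values listed under "Training Details";
# any option not stated in the README is left at its Trainer default.
training_args = Seq2SeqTrainingArguments(
    output_dir="./results_hierarchical_pegasus_segments",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="steps",
    eval_steps=100,
    logging_steps=10,
    save_total_limit=1,           # keep only the latest checkpoint
    fp16=torch.cuda.is_available(),  # mixed precision only when CUDA is present
)
```

With the Trainer defaults this gives AdamW, a linear schedule, and cross-entropy loss, consistent with the optimizer, scheduler, and loss entries in the list above.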