---
model_name: Led_sgd_summarizer_250_sp
base_model: allenai/led-large-16384
language:
- es
license: cc-by-3.0
tags:
- summarization
- abstractive-summarization
- long-text
- spanish
- transformers
- led
datasets:
- custom_parquet_dataset
metrics:
- loss
- gen_len
model-index:
- name: Led_sgd_summarizer_250_sp
  results:
  - task:
      type: summarization
      name: Abstractive Summarization
    dataset:
      name: Custom Spanish Summarization Dataset
      type: custom_parquet_dataset
      split: validation
    metrics:
    - name: Eval Loss
      type: loss
      value: 1.0227
    - name: Generated Length
      type: gen_len
      value: 60.66
pipeline_tag: summarization
---

# Model Card for Led_sgd_summarizer_250_sp

## Model Description

**Overview**
`Led_sgd_summarizer_250_sp` is a fine-tuned version of `allenai/led-large-16384` designed for abstractive summarization of long Spanish texts (up to 16,384 tokens). It generates concise summaries (~350 characters) for documents such as governmental petitions and legal correspondence. Built with the Hugging Face Transformers library, it leverages the Longformer Encoder-Decoder (LED) architecture for efficient processing of extended sequences.

**Intended Use**
- **Primary Use**: Summarizing long Spanish texts, particularly in governmental or legal domains.
- **Users**: Researchers, developers, and organizations working with Spanish text summarization.
- **Out-of-Scope**: Real-time applications, non-Spanish languages, or very short texts.

**Ethical Considerations**
- **Bias**: May reflect biases in the training dataset. Summaries should be reviewed for fairness.
- **Misinformation**: Summaries may omit key details. Verify outputs for critical applications.
- **Environmental Impact**: Training required significant computational resources, mitigated by FP16 and gradient checkpointing.

## Model Details

- **Model Name**: `Led_sgd_summarizer_250_sp`
- **Base Model**: `allenai/led-large-16384`
- **Architecture**: Sequence-to-Sequence (Seq2Seq) Transformer (Longformer Encoder-Decoder, LED)
- **Parameters**: ~460M
- **Tokenizer**: `AutoTokenizer` from `allenai/led-large-16384` (BART-based, fast)
- **Framework**: PyTorch
- **Hardware**: NVIDIA A100-SXM4-80GB GPU
- **Developed By**: excribe.co
- **Author**: excribe.co
- **Date**: 2025-04-26
- **License**: [CC-BY-3.0](https://creativecommons.org/licenses/by/3.0/)
- **Training Details**:
  - **Script**: Custom Python script using Hugging Face `transformers`, `datasets`, `evaluate`, and `accelerate`.
  - **Hyperparameters** (see the sketch after this list):
    - Learning Rate: 5e-6
    - Batch Size: 1 (effective batch size 32 with gradient accumulation)
    - Gradient Accumulation Steps: 32
    - Epochs: 5
    - Optimizer: AdamW (weight decay: 0.01)
    - Precision: FP16
    - Gradient Clipping: max_norm=1.0
    - Early Stopping: patience of 2, based on `eval_rougeL`
    - Generation Parameters:
      - Num Beams: 4
      - Max Length: 250
      - Min Length: 50
      - Length Penalty: 1.0
      - No Repeat N-gram Size: 3
  - **Optimization**:
    - Gradient checkpointing to reduce memory usage.
    - Model cache disabled during training to save VRAM.
  - **Training Metrics**:
    - Train Loss: 0.8424
    - Train Runtime: 14,992.38 seconds (~4.2 hours)
    - Train Samples per Second: 2.477
    - Train Steps per Second: 0.077
    - Epochs Completed: 3.2197 (training likely ended before the configured 5 epochs via early stopping)
    - Total FLOPs: 6.849736835009741e+16
    - Train Samples: 7,428
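
The original training script is not published; as a point of reference, here is a minimal sketch of a `Seq2SeqTrainingArguments` configuration matching the hyperparameters above. The output directory and the epoch-level evaluation/save strategies are assumptions, not confirmed settings.

```python
from transformers import EarlyStoppingCallback, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="led_sgd_summarizer_250_sp",  # placeholder path
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,      # effective batch size of 32
    num_train_epochs=5,
    weight_decay=0.01,                   # AdamW weight decay
    fp16=True,
    max_grad_norm=1.0,                   # gradient clipping
    gradient_checkpointing=True,         # pairs with model cache disabled (use_cache=False)
    predict_with_generate=True,
    generation_max_length=250,
    generation_num_beams=4,
    eval_strategy="epoch",               # assumption; `evaluation_strategy` on older transformers
    save_strategy="epoch",               # assumption
    load_best_model_at_end=True,
    metric_for_best_model="eval_rougeL",
)

# Early stopping with patience 2, passed to Seq2SeqTrainer via `callbacks`:
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)
```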

## Training Data

- **Name**: Custom Spanish Summarization Dataset
- **Source**: Proprietary `.parquet` file
- **Size**: 8,254 records
- **Columns**:
  - `texto_entrada`: Input text (long Spanish texts, e.g., petitions, official correspondence)
  - `asunto`: Target summary (~350 characters)
- **Preprocessing** (see the sketch after this list):
  - Removed HTML tags using `BeautifulSoup`.
  - Normalized text (extra whitespace removed) using regex.
  - Filtered invalid records (empty texts, summaries < 5 chars, texts < 10 chars).
- **Split**:
  - Train: 7,428 records (90%)
  - Validation: 826 records (10%)
- **Language**: Spanish
- **Access**: Contact [excribe.co](mailto:info@excribe.co) for inquiries.
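
A minimal sketch of this preprocessing, assuming the column names above; the helper names are ours, not from the original script.

```python
import re

from bs4 import BeautifulSoup

def clean_text(raw: str) -> str:
    """Strip HTML tags, then collapse runs of whitespace."""
    text = BeautifulSoup(raw, "html.parser").get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()

def is_valid(record: dict) -> bool:
    """Keep records with non-empty text >= 10 chars and a summary >= 5 chars."""
    text = clean_text(record.get("texto_entrada") or "")
    summary = clean_text(record.get("asunto") or "")
    return len(text) >= 10 and len(summary) >= 5

# Example record shape: {"texto_entrada": "<p>Texto largo...</p>", "asunto": "Resumen"}
```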

## Model Usage

### Installation
Install the required dependencies:
```bash
pip install transformers datasets evaluate rouge_score nltk torch accelerate sentencepiece beautifulsoup4
```

### Using the Pipeline
For easy inference:
```python
import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="excribe-co/Led_sgd_summarizer_250_sp",
    tokenizer="excribe-co/Led_sgd_summarizer_250_sp",
    device=0 if torch.cuda.is_available() else -1,  # GPU if available, else CPU
)

text = "Radicador de correo electronico Orfeo *20242200099852* Rad. No. 20242200099852 ..."
summary = summarizer(
    text,
    max_length=250,
    min_length=50,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    truncation=True,  # inputs beyond the 16,384-token limit are truncated
)
print("Generated Summary:", summary[0]["summary_text"])
```

### Manual Inference
For more control:
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
tokenizer = AutoTokenizer.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
model.eval()

text = "Your long text here..."
inputs = tokenizer(
    "summarize: " + text,
    max_length=16384,
    truncation=True,
    return_tensors="pt",
    padding=True,
)

# LED uses local windowed self-attention; marking the first token with global
# attention lets it attend to, and be attended by, every position, as
# recommended for LED summarization.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
global_attention_mask = global_attention_mask.to(device)

outputs = model.generate(
    **inputs,
    global_attention_mask=global_attention_mask,
    max_length=250,
    min_length=50,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Summary:", summary)
```

### Evaluation Results
- **Metrics**:
  - Eval Loss: 1.0227
  - Generated Length (gen_len): 60.66
  - Eval Runtime: 1,807.29 seconds
  - Eval Samples per Second: 0.457
  - Eval Steps per Second: 0.229
  - Eval Samples: 826
- **Note**: ROUGE metrics were computed but reported as zero, most likely due to an issue in the metric computation or data preprocessing. Generated summaries are coherent, but ROUGE scoring requires further investigation; a sketch for recomputing ROUGE offline follows this list.
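
A minimal sketch for recomputing ROUGE with the `evaluate` library (already in the dependency list above). This is a debugging suggestion, not the original evaluation code; `generated_summaries` and `reference_summaries` are placeholder lists of decoded strings.

```python
import evaluate
import nltk

nltk.download("punkt", quiet=True)  # sentence models; newer NLTK may need "punkt_tab"
rouge = evaluate.load("rouge")

def split_sentences(texts):
    # rougeLsum expects one sentence per line
    return ["\n".join(nltk.sent_tokenize(t.strip(), language="spanish")) for t in texts]

# Placeholders: decoded model outputs and gold `asunto` summaries.
generated_summaries = ["..."]
reference_summaries = ["..."]

scores = rouge.compute(
    predictions=split_sentences(generated_summaries),
    references=split_sentences(reference_summaries),
)
print(scores)  # expect non-zero rouge1/rouge2/rougeL/rougeLsum values
```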

## Additional Information

### Limitations
- **Computational Requirements**: Requires ~48GB VRAM for training, ~16GB for inference with FP16.
- **Language**: Optimized for Spanish; untested on other languages.
- **Domain Specificity**: Best for governmental and legal documents; may underperform on out-of-domain texts.
- **Inference Speed**: Slow for very long texts due to model size and sequence length.

### Citation
```bibtex
@misc{excribe2025ledsgdsummarizer,
  author    = {excribe.co},
  title     = {Led_sgd_summarizer_250_sp: Fine-Tuned LED for Long-Text Summarization in Spanish},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp}
}
```

### Contact
For questions, contact [excribe.co](mailto:info@excribe.co) or open an issue at [https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp](https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp).

### Acknowledgments
- Built upon `allenai/led-large-16384` from Hugging Face.
- Thanks to Hugging Face for the `transformers`, `datasets`, and `evaluate` libraries.
- Training supported by excribe.co.