---
model_name: Led_sgd_summarizer_250_sp
base_model: allenai/led-large-16384
language:
- es
license: cc-by-3.0
tags:
- summarization
- abstractive-summarization
- long-text
- spanish
- transformers
- led
datasets:
- custom_parquet_dataset
metrics:
- loss
- gen_len
model-index:
- name: Led_sgd_summarizer_250_sp
  results:
  - task:
      type: summarization
      name: Abstractive Summarization
    dataset:
      name: Custom Spanish Summarization Dataset
      type: custom_parquet_dataset
      split: validation
    metrics:
    - name: Eval Loss
      type: loss
      value: 1.0227
    - name: Generated Length
      type: gen_len
      value: 60.66
pipeline_tag: summarization
---

# Model Card for Led_sgd_summarizer_250_sp

## Model Description

**Overview**
`Led_sgd_summarizer_250_sp` is a fine-tuned version of `allenai/led-large-16384` for abstractive summarization of long Spanish texts (up to 16,384 tokens). It generates concise summaries (~350 characters) for documents such as governmental petitions and legal correspondence. Built with the Hugging Face Transformers library, it uses the Longformer Encoder-Decoder (LED) architecture to process long sequences efficiently.

**Intended Use**
- **Primary Use**: Summarizing long Spanish texts, particularly in governmental or legal domains.
- **Users**: Researchers, developers, and organizations working with Spanish text summarization.
- **Out-of-Scope**: Real-time applications, non-Spanish languages, or very short texts.

**Ethical Considerations**
- **Bias**: May reflect biases in the training dataset. Review summaries for fairness.
- **Misinformation**: Summaries may omit key details. Verify outputs for critical applications.
- **Environmental Impact**: Training required significant computational resources, mitigated by FP16 and gradient checkpointing.

## Model Details

- **Model Name**: `Led_sgd_summarizer_250_sp`
- **Base Model**: `allenai/led-large-16384`
- **Architecture**: Sequence-to-sequence (Seq2Seq) Longformer Encoder-Decoder (LED) Transformer
- **Parameters**: ~460M
- **Tokenizer**: `AutoTokenizer` from `allenai/led-large-16384` (BART-based, fast)
- **Framework**: PyTorch
- **Hardware**: NVIDIA A100-SXM4-80GB GPU
- **Developed By**: excribe.co
- **Author**: excribe.co
- **Date**: 2025-04-26
- **License**: [CC-BY-3.0](https://creativecommons.org/licenses/by/3.0/)
- **Training Details** (the training arguments are sketched after this list):
  - **Script**: Custom Python script using Hugging Face `transformers`, `datasets`, `evaluate`, and `accelerate`.
  - **Hyperparameters**:
    - Learning Rate: 5e-6
    - Batch Size: 1 (effective batch size 32 with gradient accumulation)
    - Gradient Accumulation Steps: 32
    - Epochs: 5 (maximum; early stopping ended training at epoch 3.22)
    - Optimizer: AdamW (weight decay: 0.01)
    - Precision: FP16
    - Gradient Clipping: max_norm=1.0
    - Early Stopping: Patience of 2, based on `eval_rougeL`
    - Generation Parameters:
      - Num Beams: 4
      - Max Length: 250
      - Min Length: 50
      - Length Penalty: 1.0
      - No Repeat N-gram Size: 3
  - **Optimization**:
    - Gradient checkpointing to reduce memory usage.
    - Model cache disabled during training to save VRAM.
  - **Training Metrics**:
    - Train Loss: 0.8424
    - Train Runtime: 14,992.38 seconds (~4 hours 10 minutes)
    - Train Samples per Second: 2.477
    - Train Steps per Second: 0.077
    - Epochs Completed: 3.2197
    - Total FLOPs: ~6.85e16
    - Train Samples: 7,428
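
For reference, here is a minimal sketch of how the hyperparameters above map onto `Seq2SeqTrainingArguments` plus an early-stopping callback. The original training script is not published, so anything not listed above (the output directory, the save/eval strategy) is an illustrative assumption.

```python
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

# Hypothetical reconstruction of the reported configuration; only values
# listed in the card are taken as given, the rest are assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="led_sgd_summarizer_250_sp",  # illustrative path
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,          # effective batch size 32
    num_train_epochs=5,
    weight_decay=0.01,
    fp16=True,
    max_grad_norm=1.0,
    gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=250,
    generation_num_beams=4,
    evaluation_strategy="epoch",             # assumption
    save_strategy="epoch",                   # assumption
    load_best_model_at_end=True,             # required for early stopping
    metric_for_best_model="rougeL",          # Trainer prefixes "eval_"
)

# Early stopping with patience 2, as reported in the card
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)
```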

## Training Data

- **Name**: Custom Spanish Summarization Dataset
- **Source**: Proprietary `.parquet` file
- **Size**: 8,254 records
- **Columns**:
  - `texto_entrada`: Input text (long Spanish texts, e.g., petitions, official correspondence)
  - `asunto`: Target summary (~350 characters)
- **Preprocessing** (sketched after this list):
  - Removed HTML tags using `BeautifulSoup`.
  - Normalized whitespace using regex.
  - Filtered invalid records (empty texts, summaries shorter than 5 characters, texts shorter than 10 characters).
- **Split**:
  - Train: 7,428 records (90%)
  - Validation: 826 records (10%)
- **Language**: Spanish
- **Access**: Contact [excribe.co](mailto:info@excribe.co) for inquiries.
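
A minimal sketch of the preprocessing steps described above, assuming pandas for reading the `.parquet` file. The column names follow the card; the file path and function names are illustrative.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

def clean_text(text: str) -> str:
    """Strip HTML tags, then collapse runs of whitespace."""
    text = BeautifulSoup(text, "html.parser").get_text()
    return re.sub(r"\s+", " ", text).strip()

# Load the proprietary parquet file (path is illustrative)
df = pd.read_parquet("dataset.parquet")
df["texto_entrada"] = df["texto_entrada"].astype(str).map(clean_text)
df["asunto"] = df["asunto"].astype(str).map(clean_text)

# Drop records with empty or too-short texts/summaries
df = df[(df["texto_entrada"].str.len() >= 10) & (df["asunto"].str.len() >= 5)]
```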

## Model Usage

### Installation

Install the required dependencies:

```bash
pip install transformers datasets evaluate rouge_score nltk torch accelerate sentencepiece beautifulsoup4
```

### Using the Pipeline

For easy inference:

```python
import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="excribe-co/Led_sgd_summarizer_250_sp",
    tokenizer="excribe-co/Led_sgd_summarizer_250_sp",
    device=0 if torch.cuda.is_available() else -1,
)

text = "Radicador de correo electronico Orfeo *20242200099852* Rad. No. 20242200099852 ..."
summary = summarizer(
    text,
    max_length=250,
    min_length=50,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    truncation=True,
)
print("Generated Summary:", summary[0]["summary_text"])
```

### Manual Inference

For more control:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
tokenizer = AutoTokenizer.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
model.eval()

text = "Your long text here..."
inputs = tokenizer(
    "summarize: " + text,
    max_length=16384,  # LED accepts inputs up to 16,384 tokens
    truncation=True,
    return_tensors="pt",
    padding=True,
)

# LED needs global attention on at least the first token (<s>);
# all other positions use the sparse local attention pattern.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

# Move model and tensors to GPU when available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
global_attention_mask = global_attention_mask.to(device)

outputs = model.generate(
    **inputs,
    global_attention_mask=global_attention_mask,
    max_length=250,
    min_length=50,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Summary:", summary)
```

### Evaluation Results

- **Metrics**:
  - Eval Loss: 1.0227
  - Generated Length (gen_len): 60.66 tokens (average)
  - Eval Runtime: 1,807.29 seconds
  - Eval Samples per Second: 0.457
  - Eval Steps per Second: 0.229
  - Eval Samples: 826
- **Note**: ROUGE metrics were computed during training but reported as zero, most likely an artifact of the metric computation or evaluation preprocessing rather than of the model itself: generated summaries are coherent. ROUGE scoring requires further investigation; a typical computation is sketched below.
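
For reference, a minimal sketch of a typical ROUGE computation with the `evaluate` library. Forgetting to replace the `-100` padding labels before decoding is a common cause of broken or all-zero scores; whether that is what happened here is not confirmed, and this function is illustrative rather than the card's original script.

```python
import evaluate
import numpy as np
from transformers import AutoTokenizer

rouge = evaluate.load("rouge")
tokenizer = AutoTokenizer.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")

def compute_metrics(eval_pred):
    """Decode predictions and references, then score them with ROUGE."""
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # The Trainer pads labels with -100 so they are ignored by the loss;
    # they must be swapped for a real token id before decoding, otherwise
    # decoding fails or produces garbage references.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)
```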

## Additional Information

### Limitations

- **Computational Requirements**: Requires ~48GB VRAM for training and ~16GB for FP16 inference (a half-precision loading sketch follows this list).
- **Language**: Optimized for Spanish; untested on other languages.
- **Domain Specificity**: Best suited to governmental and legal documents; may underperform on out-of-domain texts.
- **Inference Speed**: Slow for very long texts due to the model size and sequence length.
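
To stay within the ~16GB inference budget, the checkpoint can be loaded directly in half precision. A minimal sketch; the `torch_dtype` argument is standard `transformers` usage, not something specific to this model:

```python
import torch
from transformers import AutoModelForSeq2SeqLM

# Load weights in FP16 to roughly halve GPU memory usage
model = AutoModelForSeq2SeqLM.from_pretrained(
    "excribe-co/Led_sgd_summarizer_250_sp",
    torch_dtype=torch.float16,
).to("cuda")
```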

### Citation

```bibtex
@misc{excribe2025ledsgdsummarizer,
  author    = {excribe.co},
  title     = {Led_sgd_summarizer_250_sp: Fine-Tuned LED for Long-Text Summarization in Spanish},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp}
}
```

### Contact

For questions, contact [excribe.co](mailto:info@excribe.co) or open a discussion on the model page at [https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp](https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp).

### Acknowledgments

- Built upon `allenai/led-large-16384` from Hugging Face.
- Thanks to Hugging Face for the `transformers`, `datasets`, and `evaluate` libraries.
- Training supported by excribe.co.