---
model_name: Led_sgd_summarizer_250_sp
base_model: allenai/led-large-16384
language:
- es
license: cc-by-3.0
tags:
- summarization
- abstractive-summarization
- long-text
- spanish
- transformers
- led
datasets:
- custom_parquet_dataset
metrics:
- loss
- gen_len
model-index:
- name: Led_sgd_summarizer_250_sp
  results:
  - task:
      type: summarization
      name: Abstractive Summarization
    dataset:
      name: Custom Spanish Summarization Dataset
      type: custom_parquet_dataset
      split: validation
    metrics:
    - name: Eval Loss
      type: loss
      value: 1.0227
    - name: Generated Length
      type: gen_len
      value: 60.66
pipeline_tag: summarization
---

# Model Card for Led_sgd_summarizer_250_sp

## Model Description

**Overview**
`Led_sgd_summarizer_250_sp` is a fine-tuned version of `allenai/led-large-16384` designed for abstractive summarization of long Spanish texts (up to 16,384 tokens). It generates concise summaries (~350 characters) for documents such as governmental petitions and legal correspondence. Built with the Hugging Face Transformers library, it leverages the Longformer Encoder-Decoder (LED) architecture for efficient processing of extended sequences.

**Intended Use**
- **Primary Use**: Summarizing long Spanish texts, particularly in governmental or legal domains.
- **Users**: Researchers, developers, and organizations working with Spanish text summarization.
- **Out-of-Scope**: Real-time applications, non-Spanish languages, or very short texts.

**Ethical Considerations**
- **Bias**: May reflect biases in the training dataset. Summaries should be reviewed for fairness.
- **Misinformation**: Summaries may omit key details. Verify outputs for critical applications.
- **Environmental Impact**: Training required significant computational resources, mitigated by FP16 and gradient checkpointing.

## Model Details

- **Model Name**: `Led_sgd_summarizer_250_sp`
- **Base Model**: `allenai/led-large-16384`
- **Architecture**: Sequence-to-Sequence (Seq2Seq) Transformer (Longformer Encoder-Decoder, LED)
- **Parameters**: ~460M
- **Tokenizer**: `AutoTokenizer` from `allenai/led-large-16384` (BART-based, fast)
- **Framework**: PyTorch
- **Hardware**: NVIDIA A100-SXM4-80GB GPU
- **Developed By**: excribe.co
- **Author**: excribe.co
- **Date**: 2025-04-26
- **License**: [CC-BY-3.0](https://creativecommons.org/licenses/by/3.0/)
- **Training Details**:
  - **Script**: Custom Python script using Hugging Face `transformers`, `datasets`, `evaluate`, and `accelerate`.
  - **Hyperparameters** (see the sketch after this list):
    - Learning Rate: 5e-6
    - Batch Size: 1 (effective batch size 32 with gradient accumulation)
    - Gradient Accumulation Steps: 32
    - Epochs: 5
    - Optimizer: AdamW (weight decay: 0.01)
    - Precision: FP16
    - Gradient Clipping: max_norm=1.0
    - Early Stopping: patience of 2, based on `eval_rougeL`
    - Generation Parameters:
      - Num Beams: 4
      - Max Length: 250
      - Min Length: 50
      - Length Penalty: 1.0
      - No Repeat N-gram Size: 3
  - **Optimization**:
    - Gradient checkpointing to reduce memory usage.
    - Model cache disabled during training to save VRAM.
  - **Training Metrics**:
    - Train Loss: 0.8424
    - Train Runtime: 14,992.38 seconds (~4.2 hours)
    - Train Samples per Second: 2.477
    - Train Steps per Second: 0.077
    - Epochs Completed: 3.2197 (training likely ended before the configured 5 epochs via early stopping)
    - Total FLOPs: 6.849736835009741e+16
    - Train Samples: 7,428
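
The original training script is not published; as a point of reference, here is a minimal sketch of a `Seq2SeqTrainingArguments` configuration matching the hyperparameters above. The output directory and the epoch-level evaluation/save strategies are assumptions, not confirmed settings.

```python
from transformers import EarlyStoppingCallback, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="led_sgd_summarizer_250_sp",  # placeholder path
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,      # effective batch size of 32
    num_train_epochs=5,
    weight_decay=0.01,                   # AdamW weight decay
    fp16=True,
    max_grad_norm=1.0,                   # gradient clipping
    gradient_checkpointing=True,         # pairs with model cache disabled (use_cache=False)
    predict_with_generate=True,
    generation_max_length=250,
    generation_num_beams=4,
    eval_strategy="epoch",               # assumption; `evaluation_strategy` on older transformers
    save_strategy="epoch",               # assumption
    load_best_model_at_end=True,
    metric_for_best_model="eval_rougeL",
)

# Early stopping with patience 2, passed to Seq2SeqTrainer via `callbacks`:
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)
```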

## Training Data

- **Name**: Custom Spanish Summarization Dataset
- **Source**: Proprietary `.parquet` file
- **Size**: 8,254 records
- **Columns**:
  - `texto_entrada`: Input text (long Spanish texts, e.g., petitions, official correspondence)
  - `asunto`: Target summary (~350 characters)
- **Preprocessing** (see the sketch after this list):
  - Removed HTML tags using `BeautifulSoup`.
  - Normalized text (extra whitespace removed) using regex.
  - Filtered invalid records (empty texts, summaries < 5 chars, texts < 10 chars).
- **Split**:
  - Train: 7,428 records (90%)
  - Validation: 826 records (10%)
- **Language**: Spanish
- **Access**: Contact [excribe.co](mailto:info@excribe.co) for inquiries.
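
A minimal sketch of this preprocessing, assuming the column names above; the helper names are ours, not from the original script.

```python
import re

from bs4 import BeautifulSoup

def clean_text(raw: str) -> str:
    """Strip HTML tags, then collapse runs of whitespace."""
    text = BeautifulSoup(raw, "html.parser").get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()

def is_valid(record: dict) -> bool:
    """Keep records with non-empty text >= 10 chars and a summary >= 5 chars."""
    text = clean_text(record.get("texto_entrada") or "")
    summary = clean_text(record.get("asunto") or "")
    return len(text) >= 10 and len(summary) >= 5

# Example record shape: {"texto_entrada": "<p>Texto largo...</p>", "asunto": "Resumen"}
```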

## Model Usage

### Installation
Install the required dependencies:
```bash
pip install transformers datasets evaluate rouge_score nltk torch accelerate sentencepiece beautifulsoup4
```

### Using the Pipeline
For easy inference:
```python
import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="excribe-co/Led_sgd_summarizer_250_sp",
    tokenizer="excribe-co/Led_sgd_summarizer_250_sp",
    device=0 if torch.cuda.is_available() else -1,  # GPU if available, else CPU
)

text = "Radicador de correo electronico Orfeo *20242200099852* Rad. No. 20242200099852 ..."
summary = summarizer(
    text,
    max_length=250,
    min_length=50,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    truncation=True,  # inputs beyond the 16,384-token limit are truncated
)
print("Generated Summary:", summary[0]["summary_text"])
```

### Manual Inference
For more control:
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
tokenizer = AutoTokenizer.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
model.eval()

text = "Your long text here..."
inputs = tokenizer(
    "summarize: " + text,
    max_length=16384,
    truncation=True,
    return_tensors="pt",
    padding=True,
)

# LED uses local windowed self-attention; marking the first token with global
# attention lets it attend to, and be attended by, every position, as
# recommended for LED summarization.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
global_attention_mask = global_attention_mask.to(device)

outputs = model.generate(
    **inputs,
    global_attention_mask=global_attention_mask,
    max_length=250,
    min_length=50,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Summary:", summary)
```

### Evaluation Results
- **Metrics**:
  - Eval Loss: 1.0227
  - Generated Length (gen_len): 60.66
  - Eval Runtime: 1,807.29 seconds
  - Eval Samples per Second: 0.457
  - Eval Steps per Second: 0.229
  - Eval Samples: 826
- **Note**: ROUGE metrics were computed but reported as zero, most likely due to an issue in the metric computation or data preprocessing. Generated summaries are coherent, but ROUGE scoring requires further investigation; a sketch for recomputing ROUGE offline follows this list.
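
A minimal sketch for recomputing ROUGE with the `evaluate` library (already in the dependency list above). This is a debugging suggestion, not the original evaluation code; `generated_summaries` and `reference_summaries` are placeholder lists of decoded strings.

```python
import evaluate
import nltk

nltk.download("punkt", quiet=True)  # sentence models; newer NLTK may need "punkt_tab"
rouge = evaluate.load("rouge")

def split_sentences(texts):
    # rougeLsum expects one sentence per line
    return ["\n".join(nltk.sent_tokenize(t.strip(), language="spanish")) for t in texts]

# Placeholders: decoded model outputs and gold `asunto` summaries.
generated_summaries = ["..."]
reference_summaries = ["..."]

scores = rouge.compute(
    predictions=split_sentences(generated_summaries),
    references=split_sentences(reference_summaries),
)
print(scores)  # expect non-zero rouge1/rouge2/rougeL/rougeLsum values
```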

## Additional Information

### Limitations
- **Computational Requirements**: Requires ~48GB VRAM for training, ~16GB for inference with FP16.
- **Language**: Optimized for Spanish; untested on other languages.
- **Domain Specificity**: Best for governmental and legal documents; may underperform on out-of-domain texts.
- **Inference Speed**: Slow for very long texts due to model size and sequence length.

### Citation
```bibtex
@misc{excribe2025ledsgdsummarizer,
  author    = {excribe.co},
  title     = {Led_sgd_summarizer_250_sp: Fine-Tuned LED for Long-Text Summarization in Spanish},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp}
}
```

### Contact
For questions, contact [excribe.co](mailto:info@excribe.co) or open an issue at [https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp](https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp).

### Acknowledgments
- Built upon `allenai/led-large-16384` from Hugging Face.
- Thanks to Hugging Face for the `transformers`, `datasets`, and `evaluate` libraries.
- Training supported by excribe.co.