eungyu kim
committed on
Commit
·
bfde142
1
Parent(s):
8bf942b
feat: Add T5 Text Summarizer code and README documentation.
Browse files- README.md +22 -0
- model.py +59 -0
- requirements.txt +26 -0
README.md
CHANGED
|
@@ -1,3 +1,25 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
---
|
|
|
|
| 1 |
+
# T5 Text Summarizer
|
| 2 |
+
|
| 3 |
+
This repository contains a simple text summarization script using a pre-trained T5 model from the Hugging Face Transformers library. The script demonstrates how to use prompt-based summarization to generate a concise summary of an input text.
|
| 4 |
+
|
| 5 |
+
## Overview
|
| 6 |
+
|
| 7 |
+
The main script (`model.py`) defines a function `summarize_text` that:
|
| 8 |
+
- Loads the T5 tokenizer and T5 model.
|
| 9 |
+
- Adds a summarization prompt (`"summarize: "`) to the input text.
|
| 10 |
+
- Tokenizes the input text and truncates it to a maximum length.
|
| 11 |
+
- Generates a summary using beam search.
|
| 12 |
+
- Decodes the generated token sequence back into human-readable text while skipping special tokens.
|
| 13 |
+
|
| 14 |
+
## Code Explanation
|
| 15 |
+
|
| 16 |
+
### Tokenization and Decoding
|
| 17 |
+
|
| 18 |
+
- **Tokenization:**
|
| 19 |
+
The input text is first prefixed with the summarization prompt and then tokenized using:
|
| 20 |
+
```python
|
| 21 |
+
input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
|
| 22 |
+
|
| 23 |
---
|
| 24 |
license: apache-2.0
|
| 25 |
---
|
model.py
ADDED
|
@@ -0,0 +1,59 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from transformers import T5Tokenizer, T5ForConditionalGeneration
|
| 2 |
+
|
| 3 |
+
# Cache of loaded (tokenizer, model) pairs keyed by model name, so repeated
# calls to summarize_text() do not reload the checkpoint from disk each time.
_MODEL_CACHE = {}


def _load_summarizer(model_name: str):
    """Return a cached (tokenizer, model) pair for *model_name*, loading on first use."""
    if model_name not in _MODEL_CACHE:
        # legacy=False opts into the new T5 tokenizer behavior.
        tokenizer = T5Tokenizer.from_pretrained(model_name, legacy=False)
        model = T5ForConditionalGeneration.from_pretrained(model_name)
        _MODEL_CACHE[model_name] = (tokenizer, model)
    return _MODEL_CACHE[model_name]


def summarize_text(text: str,
                   model_name: str = "t5-base",
                   max_length: int = 150,
                   min_length: int = 40,
                   num_beams: int = 4) -> str:
    """
    Summarizes the given text using a T5 model.

    Parameters:
    - text: The long input text to be summarized. Must be non-empty.
    - model_name: The pre-trained T5 model to use (e.g., "t5-base", "t5-small", etc.)
    - max_length: The maximum length (in tokens) of the generated summary.
    - min_length: The minimum length (in tokens) of the generated summary.
    - num_beams: The number of beams for beam search (affects summary quality).

    Returns:
    - The summarized text (str)

    Raises:
    - ValueError: If *text* is empty or contains only whitespace.
    """
    cleaned = text.strip()
    if not cleaned:
        # An empty prompt would still make the model emit a spurious "summary".
        raise ValueError("text must be a non-empty string")

    # Load (or fetch from cache) the tokenizer and model.
    tokenizer, model = _load_summarizer(model_name)

    # T5 is prompt-based: prepend the task prefix it expects for summarization.
    input_text = "summarize: " + cleaned

    # Tokenize the input text, truncating to the 512-token encoder limit.
    input_ids = tokenizer.encode(input_text,
                                 return_tensors="pt",
                                 max_length=512,
                                 truncation=True)

    # Generate summary token ids with beam search; early_stopping ends a beam
    # as soon as it produces the end-of-sequence token.
    summary_ids = model.generate(input_ids,
                                 max_length=max_length,
                                 min_length=min_length,
                                 num_beams=num_beams,
                                 early_stopping=True)

    # Decode the best beam back into text, skipping special tokens like </s>.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
|
| 44 |
+
|
| 45 |
+
def _demo() -> None:
    """Summarize a sample passage and print the result to stdout."""
    # Example English text to summarize.
    sample = (
        "In recent years, the global economy has faced various challenges. Trade tensions, "
        "inflationary pressures, and rapid technological advancements have contributed to "
        "significant changes in market dynamics. Experts believe that these factors will continue "
        "to influence economic trends, while governments around the world are exploring policies "
        "to stabilize the economy. Meanwhile, the rise of the digital economy and the transition "
        "to green energy are emerging as key drivers of future economic growth."
    )

    # Run the summarizer first, then print, preserving the output ordering.
    result = summarize_text(sample)
    print("Summary:")
    print(result)


if __name__ == "__main__":
    _demo()
|
requirements.txt
ADDED
|
@@ -0,0 +1,26 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
certifi==2025.1.31
|
| 2 |
+
charset-normalizer==3.4.1
|
| 3 |
+
filelock==3.17.0
|
| 4 |
+
fsspec==2025.2.0
|
| 5 |
+
huggingface-hub==0.28.1
|
| 6 |
+
idna==3.10
|
| 7 |
+
Jinja2==3.1.5
|
| 8 |
+
MarkupSafe==3.0.2
|
| 9 |
+
mpmath==1.3.0
|
| 10 |
+
networkx==3.4.2
|
| 11 |
+
numpy==2.2.2
|
| 12 |
+
packaging==24.2
|
| 13 |
+
protobuf==5.29.3
|
| 14 |
+
PyYAML==6.0.2
|
| 15 |
+
regex==2024.11.6
|
| 16 |
+
requests==2.32.3
|
| 17 |
+
safetensors==0.5.2
|
| 18 |
+
sentencepiece==0.2.0
|
| 19 |
+
setuptools==75.8.0
|
| 20 |
+
sympy==1.13.1
|
| 21 |
+
tokenizers==0.21.0
|
| 22 |
+
torch==2.6.0
|
| 23 |
+
tqdm==4.67.1
|
| 24 |
+
transformers==4.48.2
|
| 25 |
+
typing_extensions==4.12.2
|
| 26 |
+
urllib3==2.3.0
|