---
language:
- en
license: mit
library_name: transformers
tags:
- question-answering
- extractive-qa
- snippet-extraction
- text-extraction
pipeline_tag: question-answering
widget:
- text: "What is the most compelling or interesting snippet from this text?"
context: "The Crash at Crush was a one-day publicity stunt in the U.S. state of Texas that took place on September 15, 1896, in which two uncrewed locomotives were crashed into each other head-on at high speed. William George Crush conceived the idea to demonstrate a staged train wreck as a public spectacle. An estimated 40,000 people attended the event. Unexpectedly, the impact caused both engine boilers to explode, resulting in a shower of flying debris that killed two people and caused numerous injuries among the spectators."
example_title: "Train Crash Example"
- text: "What is the most compelling or interesting snippet from this text?"
context: "TempleOS is a biblical-themed lightweight operating system designed to be the Third Temple from the Hebrew Bible. It was created by American computer programmer Terry A. Davis, who developed it alone over the course of a decade after a series of manic episodes that he later described as a revelation from God. The system was characterized as a modern x86-64 Commodore 64, using an interface similar to a mixture of DOS and Turbo C."
example_title: "TempleOS Example"
- text: "What is the most compelling or interesting snippet from this text?"
context: "Lina Marcela Medina de Jurado is a Peruvian woman who became the youngest confirmed mother in history when she gave birth to her son Gerardo on 14 May 1939 when she was five years, seven months, and 21 days of age. Based on the medical assessments of her pregnancy, she was four years old when she became pregnant, which was biologically possible due to precocious puberty."
example_title: "Medical Record Example"
datasets:
- custom
base_model: answerdotai/ModernBERT-large
---
# Snippet Extractor Model
This model extracts the most compelling or interesting snippets from text passages. It's fine-tuned for extractive question answering where the "question" is always:
> **"What is the most compelling or interesting snippet from this text?"**
## Model Description
- **Task**: Extractive Question Answering / Snippet Extraction
- **Base Model**: `answerdotai/ModernBERT-large`
- **Training Data**: Wikipedia article snippets curated for interesting/compelling content
- **Language**: English
## Usage
### Quick Start with Pipeline (Recommended)
```python
from transformers import pipeline
# Load the model
qa_pipeline = pipeline("question-answering", model="derenrich/snippet-extractor")
# Extract a compelling snippet
context = """
The Crash at Crush was a one-day publicity stunt in the U.S. state of Texas
that took place on September 15, 1896, in which two uncrewed locomotives were
crashed into each other head-on at high speed. An estimated 40,000 people
attended the event. Unexpectedly, the impact caused both engine boilers to
explode, resulting in a shower of flying debris that killed two people.
"""
result = qa_pipeline(
    question="What is the most compelling or interesting snippet from this text?",
    context=context,
)
print(f"Snippet: {result['answer']}")
print(f"Confidence: {result['score']:.4f}")
```
### Manual Loading
```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("derenrich/snippet-extractor")
model = AutoModelForQuestionAnswering.from_pretrained("derenrich/snippet-extractor")
# Prepare inputs
question = "What is the most compelling or interesting snippet from this text?"
context = "Your text here..."
inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384)
# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
# Decode answer
start_idx = outputs.start_logits.argmax()
end_idx = outputs.end_logits.argmax()
answer_tokens = inputs.input_ids[0][start_idx:end_idx+1]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
print(f"Extracted snippet: {answer}")
```
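Note that taking the start and end argmaxes independently, as above, can occasionally produce an invalid span (an end position before the start). A more robust decoder scores every valid (start, end) pair and keeps the best one. A minimal sketch of that logic, operating on plain logit lists for illustration:

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) for the highest-scoring valid span.

    Scores each pair with start <= end <= start + max_answer_len by
    summing the two logits, so an invalid (end < start) pair can never win.
    """
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Toy logits where the independent argmaxes disagree:
# argmax(start) = 3 but argmax(end) = 1, an invalid pair.
start = [0.1, 0.2, 0.5, 2.0, 0.3]
end = [0.1, 3.0, 0.4, 0.2, 1.5]
print(best_span(start, end))  # → (3, 4, 3.5)
```

In practice you would pass `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()` to this helper; the `question-answering` pipeline in the Quick Start performs an equivalent valid-span search internally.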
## Training Details
- **Epochs**: 3
- **Learning Rate**: 2e-5
- **Batch Size**: 8
- **Max Sequence Length**: 384
- **Optimizer**: AdamW with weight decay 0.01
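The hyperparameters above map onto a standard `transformers` Trainer setup. A sketch of the corresponding configuration, assuming the usual QA preprocessing (the custom training dataset and its tokenization are omitted, so this is illustrative rather than the exact training script):

```python
from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")
model = AutoModelForQuestionAnswering.from_pretrained("answerdotai/ModernBERT-large")

args = TrainingArguments(
    output_dir="snippet-extractor",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    weight_decay=0.01,  # Trainer's default optimizer is AdamW
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=...,  # tokenized QA examples, max_length=384
#                   tokenizer=tokenizer)
# trainer.train()
```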
## Intended Use
This model is designed to:
- Extract interesting/compelling snippets from text for summaries
- Highlight the most notable information in articles
- Generate "hook" text for content previews
## Limitations
- Works best on English text
- Trained primarily on Wikipedia-style content
- May not perform as well on highly technical or domain-specific text
- The concept of "compelling" is subjective; results may vary
## Citation
If you use this model, please cite:
```bibtex
@misc{snippet-extractor,
  title={Snippet Extractor: Extracting Compelling Text Snippets},
  author={Daniel Erenrich},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/derenrich/snippet-extractor}
}
```