---
language:
  - en
license: mit
library_name: transformers
tags:
  - question-answering
  - extractive-qa
  - snippet-extraction
  - text-extraction
pipeline_tag: question-answering
widget:
  - text: What is the most compelling or interesting snippet from this text?
    context: >-
      The Crash at Crush was a one-day publicity stunt in the U.S. state of
      Texas that took place on September 15, 1896, in which two uncrewed
      locomotives were crashed into each other head-on at high speed. William
      George Crush conceived the idea to demonstrate a staged train wreck as a
      public spectacle. An estimated 40,000 people attended the event.
      Unexpectedly, the impact caused both engine boilers to explode, resulting
      in a shower of flying debris that killed two people and caused numerous
      injuries among the spectators.
    example_title: Train Crash Example
  - text: What is the most compelling or interesting snippet from this text?
    context: >-
      TempleOS is a biblical-themed lightweight operating system designed to be
      the Third Temple from the Hebrew Bible. It was created by American
      computer programmer Terry A. Davis, who developed it alone over the course
      of a decade after a series of manic episodes that he later described as a
      revelation from God. The system was characterized as a modern x86-64
      Commodore 64, using an interface similar to a mixture of DOS and Turbo C.
    example_title: TempleOS Example
  - text: What is the most compelling or interesting snippet from this text?
    context: >-
      Lina Marcela Medina de Jurado is a Peruvian woman who became the youngest
      confirmed mother in history when she gave birth to her son Gerardo on 14
      May 1939 when she was five years, seven months, and 21 days of age. Based
      on the medical assessments of her pregnancy, she was four years old when
      she became pregnant, which was biologically possible due to precocious
      puberty.
    example_title: Medical Record Example
datasets:
  - custom
base_model: answerdotai/ModernBERT-large
---

# Snippet Extractor Model

This model extracts the most compelling or interesting snippets from text passages. It's fine-tuned for extractive question answering where the "question" is always:

"What is the most compelling or interesting snippet from this text?"

## Model Description

  • Task: Extractive Question Answering / Snippet Extraction
  • Base Model: answerdotai/ModernBERT-large
  • Training Data: Wikipedia article snippets curated for interesting/compelling content
  • Language: English

## Usage

### Quick Start with Pipeline (Recommended)

```python
from transformers import pipeline

# Load the model
qa_pipeline = pipeline("question-answering", model="derenrich/snippet-extractor")

# Extract a compelling snippet
context = """
The Crash at Crush was a one-day publicity stunt in the U.S. state of Texas
that took place on September 15, 1896, in which two uncrewed locomotives were
crashed into each other head-on at high speed. An estimated 40,000 people
attended the event. Unexpectedly, the impact caused both engine boilers to
explode, resulting in a shower of flying debris that killed two people.
"""

result = qa_pipeline(
    question="What is the most compelling or interesting snippet from this text?",
    context=context
)

print(f"Snippet: {result['answer']}")
print(f"Confidence: {result['score']:.4f}")
```

### Manual Loading

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("derenrich/snippet-extractor")
model = AutoModelForQuestionAnswering.from_pretrained("derenrich/snippet-extractor")

# Prepare inputs
question = "What is the most compelling or interesting snippet from this text?"
context = "Your text here..."

inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

# Decode answer (skip special tokens such as [CLS]/[SEP])
start_idx = outputs.start_logits.argmax()
end_idx = outputs.end_logits.argmax()
answer_tokens = inputs.input_ids[0][start_idx:end_idx + 1]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)

print(f"Extracted snippet: {answer}")
```
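
Note that taking independent argmaxes of the start and end logits, as above, can occasionally yield an invalid span (end before start). A more robust decoding step scores all valid (start, end) pairs, which is what the `question-answering` pipeline does internally. A minimal pure-Python sketch of that idea (the logit values below are made up for illustration):

```python
# Score all valid (start, end) pairs and keep the best one, instead of
# picking start and end independently. This guarantees end >= start and
# lets you cap the answer length.
def best_span(start_logits, end_logits, max_answer_len=30):
    best = (0, 0)
    best_score = float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length cap
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

# Toy logits: the best valid pair is start=1, end=2
start_logits = [0.1, 2.5, 0.3, 0.2]
end_logits = [0.0, 0.1, 3.0, 0.4]
print(best_span(start_logits, end_logits))  # → (1, 2)
```

In practice you would pass `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()` from the manual example above, then slice `inputs.input_ids` with the returned indices.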

## Training Details

  • Epochs: 3
  • Learning Rate: 2e-5
  • Batch Size: 8
  • Max Sequence Length: 384
  • Optimizer: AdamW with weight decay 0.01
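
For reference, these hyperparameters roughly correspond to the following `TrainingArguments` sketch. This is a reconstruction, not the published training script; `output_dir` and any settings not listed above are assumptions.

```python
from transformers import TrainingArguments

# Hypothetical configuration matching the hyperparameters listed above.
# AdamW is the Trainer's default optimizer; weight decay is set explicitly.
training_args = TrainingArguments(
    output_dir="snippet-extractor",   # assumed; not stated in the card
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    weight_decay=0.01,
)
```

The max sequence length of 384 is applied at tokenization time (see the `max_length=384` argument in the manual example above), not via `TrainingArguments`.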

## Intended Use

This model is designed to:

  • Extract interesting/compelling snippets from text for summaries
  • Highlight the most notable information in articles
  • Generate "hook" text for content previews

## Limitations

  • Works best on English text
  • Trained primarily on Wikipedia-style content
  • May not perform as well on highly technical or domain-specific text
  • The concept of "compelling" is subjective; results may vary

## Citation

If you use this model, please cite:

```bibtex
@misc{snippet-extractor,
  title={Snippet Extractor: Extracting Compelling Text Snippets},
  author={Daniel Erenrich},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/derenrich/snippet-extractor}
}
```