---
language:
- en
license: mit
library_name: transformers
tags:
- question-answering
- extractive-qa
- snippet-extraction
- text-extraction
pipeline_tag: question-answering
widget:
- text: "What is the most compelling or interesting snippet from this text?"
context: "The Crash at Crush was a one-day publicity stunt in the U.S. state of Texas that took place on September 15, 1896, in which two uncrewed locomotives were crashed into each other head-on at high speed. William George Crush conceived the idea to demonstrate a staged train wreck as a public spectacle. An estimated 40,000 people attended the event. Unexpectedly, the impact caused both engine boilers to explode, resulting in a shower of flying debris that killed two people and caused numerous injuries among the spectators."
example_title: "Train Crash Example"
- text: "What is the most compelling or interesting snippet from this text?"
context: "TempleOS is a biblical-themed lightweight operating system designed to be the Third Temple from the Hebrew Bible. It was created by American computer programmer Terry A. Davis, who developed it alone over the course of a decade after a series of manic episodes that he later described as a revelation from God. The system was characterized as a modern x86-64 Commodore 64, using an interface similar to a mixture of DOS and Turbo C."
example_title: "TempleOS Example"
- text: "What is the most compelling or interesting snippet from this text?"
context: "Lina Marcela Medina de Jurado is a Peruvian woman who became the youngest confirmed mother in history when she gave birth to her son Gerardo on 14 May 1939 when she was five years, seven months, and 21 days of age. Based on the medical assessments of her pregnancy, she was four years old when she became pregnant, which was biologically possible due to precocious puberty."
example_title: "Medical Record Example"
datasets:
- custom
base_model: answerdotai/ModernBERT-large
---
# Snippet Extractor Model
This model extracts the most compelling or interesting snippets from text passages. It's fine-tuned for extractive question answering where the "question" is always:
> **"What is the most compelling or interesting snippet from this text?"**
## Model Description
- **Task**: Extractive Question Answering / Snippet Extraction
- **Base Model**: `answerdotai/ModernBERT-large`
- **Training Data**: Wikipedia article snippets curated for interesting/compelling content
- **Language**: English
## Usage
### Quick Start with Pipeline (Recommended)
```python
from transformers import pipeline
# Load the model
qa_pipeline = pipeline("question-answering", model="derenrich/snippet-extractor")
# Extract a compelling snippet
context = """
The Crash at Crush was a one-day publicity stunt in the U.S. state of Texas
that took place on September 15, 1896, in which two uncrewed locomotives were
crashed into each other head-on at high speed. An estimated 40,000 people
attended the event. Unexpectedly, the impact caused both engine boilers to
explode, resulting in a shower of flying debris that killed two people.
"""
result = qa_pipeline(
    question="What is the most compelling or interesting snippet from this text?",
    context=context,
)
print(f"Snippet: {result['answer']}")
print(f"Confidence: {result['score']:.4f}")
```
### Manual Loading
```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("derenrich/snippet-extractor")
model = AutoModelForQuestionAnswering.from_pretrained("derenrich/snippet-extractor")
# Prepare inputs
question = "What is the most compelling or interesting snippet from this text?"
context = "Your text here..."
inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384)
# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
# Decode answer
start_idx = outputs.start_logits.argmax()
end_idx = outputs.end_logits.argmax()
answer_tokens = inputs.input_ids[0][start_idx:end_idx+1]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
print(f"Extracted snippet: {answer}")
```
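Note that taking the start and end argmaxes independently, as above, can occasionally produce an invalid span (an end position before the start). A more robust decoder scores every valid (start, end) pair and keeps the best one. A minimal sketch of that logic, operating on plain logit lists for illustration:

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) for the highest-scoring valid span.

    Scores each pair with start <= end <= start + max_answer_len by
    summing the two logits, so an invalid (end < start) pair can never win.
    """
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Toy logits where the independent argmaxes disagree:
# argmax(start) = 3 but argmax(end) = 1, an invalid pair.
start = [0.1, 0.2, 0.5, 2.0, 0.3]
end = [0.1, 3.0, 0.4, 0.2, 1.5]
print(best_span(start, end))  # → (3, 4, 3.5)
```

In practice you would pass `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()` to this helper; the `question-answering` pipeline in the Quick Start performs an equivalent valid-span search internally.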
## Training Details
- **Epochs**: 3
- **Learning Rate**: 2e-5
- **Batch Size**: 8
- **Max Sequence Length**: 384
- **Optimizer**: AdamW with weight decay 0.01
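The hyperparameters above map onto a standard `transformers` Trainer setup. A sketch of the corresponding configuration, assuming the usual QA preprocessing (the custom training dataset and its tokenization are omitted, so this is illustrative rather than the exact training script):

```python
from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")
model = AutoModelForQuestionAnswering.from_pretrained("answerdotai/ModernBERT-large")

args = TrainingArguments(
    output_dir="snippet-extractor",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    weight_decay=0.01,  # Trainer's default optimizer is AdamW
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=...,  # tokenized QA examples, max_length=384
#                   tokenizer=tokenizer)
# trainer.train()
```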
## Intended Use
This model is designed to:
- Extract interesting/compelling snippets from text for summaries
- Highlight the most notable information in articles
- Generate "hook" text for content previews
## Limitations
- Works best on English text
- Trained primarily on Wikipedia-style content
- May not perform as well on highly technical or domain-specific text
- The concept of "compelling" is subjective; results may vary
## Citation
If you use this model, please cite:
```bibtex
@misc{snippet-extractor,
  title={Snippet Extractor: Extracting Compelling Text Snippets},
  author={Daniel Erenrich},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/derenrich/snippet-extractor}
}
```