openenv-summarization / tasks /medium.py

Sagar Chapara

Update benchmark grading and docs

999c3ec about 2 months ago

	"""Medium task: SQuAD v1 longer passages with multiple-hop reasoning.

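	Context: 900–2500 characters, truncated to 65%.
	Loaded SQuAD samples are filtered so that every reference answer begins
	inside the truncated portion; the hard-coded fallback samples are not filtered.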
	Episode: 2 steps (summarize → answer).
	Grading: token-level F1 score (same as SQuAD official eval).
	"""
	import random
	import logging
	from typing import Dict, Any, List, Optional

	from .base import BaseSummarizationTask

	logger = logging.getLogger(__name__)

	FALLBACK_SAMPLES: List[Dict[str, Any]] = [
	    {
	        "context": (
	            "The Byzantine Empire, also referred to as the Eastern Roman Empire or Byzantium, "
	            "was the continuation of the Roman Empire primarily in its eastern provinces during "
	            "Late Antiquity and the Middle Ages, when its capital city was Constantinople. It "
	            "survived the fragmentation and fall of the Western Roman Empire in the 5th century "
	            "AD and continued to exist for an additional thousand years until the fall of "
	            "Constantinople to the Ottoman Empire in 1453. During most of its existence, the "
	            "empire was the most powerful economic, cultural, and military force in Europe. "
	            "Both the See of Constantinople and the Ecumenical Patriarchate, which are Christian "
	            "institutions, trace their origins to the foundation of Constantinople by Constantine "
	            "the Great in 330 AD. The empire's rich history, blending Greek, Roman, and Christian "
	            "traditions, produced important developments in art, architecture, and philosophy "
	            "that continue to influence Eastern Europe to this day."
	        ),
	        "question": "In what year did the Byzantine Empire fall to the Ottoman Empire?",
	        "answer_list": ["1453"],
	    },
	    {
	        "context": (
	            "The Industrial Revolution was the transition to new manufacturing processes in Great "
	            "Britain, continental Europe, and the United States, from about 1760 to sometime between "
	            "1820 and 1840. This transition included going from hand production methods to machines; "
	            "new chemical manufacturing and iron production processes; the increasing use of steam "
	            "power and water power; the development of machine tools; and the rise of the mechanised "
	            "factory system. Output greatly increased, and a result was an unprecedented rise in "
	            "population and the rate of population growth. The textile industry was the first to use "
	            "modern production methods, and textiles became the dominant industry in terms of "
	            "employment, value of output, and capital invested. Cotton was the leading textile of "
	            "the Industrial Revolution and assumed its dominant role because cotton could be cultivated "
	            "at scale in warm climates outside Europe, especially in what became the southern United States."
	        ),
	        "question": "Which industry was the first to use modern production methods during the Industrial Revolution?",
	        "answer_list": ["textile industry", "The textile industry", "textiles"],
	    },
	    {
	        "context": (
	            "Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to "
	            "intelligence of humans and other animals. Example tasks in which this is done include "
	            "speech recognition, computer vision, translation between (natural) languages, as well as "
	            "other mappings of inputs. AI applications include advanced web search engines (e.g., "
	            "Google Search), recommendation systems (used by YouTube, Amazon, and Netflix), "
	            "understanding human speech (such as Siri and Alexa), self-driving cars (e.g., Waymo), "
	            "generative or creative tools (ChatGPT and AI art), automated decision-making, and "
	            "competing at the highest level in strategic game systems (such as chess and Go). As "
	            "machines become increasingly capable, tasks considered to require 'intelligence' are "
	            "often removed from the definition of AI, a phenomenon known as the AI effect. For "
	            "instance, optical character recognition is frequently excluded from things considered "
	            "to be AI, having become a routine technology. Artificial intelligence was founded as "
	            "an academic discipline in 1956, and in the years since it has experienced several waves "
	            "of optimism, followed by disappointment and the loss of funding (known as an 'AI winter'), "
	            "followed by new approaches, success, and renewed funding."
	        ),
	        "question": "In what year was artificial intelligence founded as an academic discipline?",
	        "answer_list": ["1956"],
	    },
	    {
	        "context": (
	            "The human brain is the command center for the human nervous system. It receives signals "
	            "from the body's sensory organs and outputs information to the muscles. The human brain "
	            "has the same basic structure as other mammal brains, but is larger in relation to body "
	            "size than any other brains. The cerebral cortex is the outer layer of the brain and is "
	            "responsible for most higher-order functions, including consciousness, memory, reasoning, "
	            "and language. It is divided into four lobes: the frontal lobe, parietal lobe, temporal "
	            "lobe, and occipital lobe. The brain communicates with the rest of the body through the "
	            "spinal cord and a network of nerves, which together form the peripheral nervous system. "
	            "The average adult human brain weighs about 1.4 kilograms (3 lb) and contains approximately "
	            "86 billion neurons. These neurons are connected by trillions of synaptic connections, "
	            "making the brain one of the most complex structures in the known universe."
	        ),
	        "question": "What is the outer layer of the brain called?",
	        "answer_list": ["cerebral cortex", "The cerebral cortex"],
	    },
	    {
	        "context": (
	            "Climate change refers to long-term shifts in temperatures and weather patterns. Such "
	            "shifts can be natural, due to changes in the sun's activity or large volcanic eruptions. "
	            "But since the 1800s, human activities have been the main driver of climate change, "
	            "primarily due to the burning of fossil fuels like coal, oil and gas. Burning fossil fuels "
	            "generates greenhouse gas emissions that act like a blanket wrapped around the Earth, "
	            "trapping the sun's heat and raising temperatures. The main greenhouse gases that are "
	            "causing climate change include carbon dioxide and methane. These come from using gasoline "
	            "for driving a car or coal for heating a building, for example. Clearing land and forests "
	            "can also release carbon dioxide. Agriculture, oil and gas operations are major sources of "
	            "methane emissions. Energy, industry, transport, buildings, agriculture and land use are "
	            "among the main sectors causing greenhouse gas emissions. Between 1880 and 2012, the "
	            "average global temperature increased by 0.85 degrees Celsius."
	        ),
	        "question": "What has been the main driver of climate change since the 1800s?",
	        "answer_list": ["human activities", "burning of fossil fuels", "fossil fuels"],
	    },
	]

	TRUNCATION_RATIO = 0.65 # Show 65% of context (harder than easy)


	class MediumTask(BaseSummarizationTask):
	    """Medium-length context factual QA task using longer SQuAD passages."""

	    name = "medium"
	    max_steps = 2

	    def __init__(self):
	        self._samples: List[Dict[str, Any]] = []
	        self._load_dataset()

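	    def _load_dataset(self):
	        """Load SQuAD samples with longer passages; use FALLBACK_SAMPLES if none load."""
	        try:
	            from datasets import load_dataset

	            logger.info("Loading SQuAD dataset for medium task...")
	            ds = load_dataset("rajpurkar/squad", split="validation", trust_remote_code=False)

	            target_min, target_max = 900, 2500  # context length bounds, in characters
	            seen_contexts = set()

	            for item in ds:
	                context: str = item["context"]
	                if not (target_min <= len(context) <= target_max):
	                    continue
	                # Only the first question encountered for each passage is considered.
	                if context in seen_contexts:
	                    continue
	                seen_contexts.add(context)

	                answers = item["answers"]["text"]
	                answer_starts = item["answers"]["answer_start"]
	                cutoff = int(len(context) * TRUNCATION_RATIO)

	                # Keep the sample only if every reference answer span lies entirely
	                # inside context[:cutoff]; checking the start offset alone would
	                # admit answers that are cut off partway through.
	                if not answers or not all(
	                    start + len(text) <= cutoff
	                    for text, start in zip(answers, answer_starts)
	                ):
	                    continue

	                self._samples.append(
	                    {
	                        "context": context,
	                        "question": item["question"],
	                        # dict.fromkeys de-duplicates while keeping dataset order, so
	                        # answer_list[0] is stable across runs (set() ordering is not).
	                        "answer_list": list(dict.fromkeys(answers)),
	                    }
	                )

	                if len(self._samples) >= 500:
	                    break

	            logger.info("Medium task: loaded %d SQuAD samples", len(self._samples))
	        except Exception as e:
	            logger.warning("Could not load SQuAD dataset for medium: %s. Using fallback.", e)

	        if not self._samples:
	            # Copy so the module-level fallback list is never mutated through this task.
	            self._samples = list(FALLBACK_SAMPLES)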

	    def get_sample(self, seed: Optional[int] = None) -> Dict[str, Any]:
	        rng = random.Random(seed)
	        item = rng.choice(self._samples)

	        context = item["context"]
	        cutoff = int(len(context) * TRUNCATION_RATIO)
	        category = self.infer_category(item["question"])

	        return {
	            "context": context,
	            "truncated_context": context[:cutoff],
	            "truncation_ratio": TRUNCATION_RATIO,
	            "category": category,
	            "source_type": "long_form_reference",
	            "question": item["question"],
	            "answer": item["answer_list"][0],
	            "answer_list": item["answer_list"],
	        }

	    def get_summarize_prompt(self, truncated_context: str, truncation_ratio: float) -> str:
	        # round() rather than int(): float products such as 0.29 * 100 can land
	        # just below the intended integer and would otherwise be truncated.
	        pct = round(truncation_ratio * 100)
	        return (
	            f"Here is a document excerpt ({pct}% of the full text):\n\n"
	            f"{truncated_context}\n\n"
	            "Produce a retrieval-safe summary for a downstream assistant. Preserve "
	            "specific names, dates, numbers, causal relationships, and claims that "
	            "are likely to be queried later. Keep the summary under 200 words while "
	            "retaining the details needed for factual QA."
	        )
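
The module docstring says answers are graded with a token-level F1 score, as in the official SQuAD evaluation. The grader itself is not part of this file, so the block below is only a minimal sketch of that metric, assuming SQuAD v1.1-style answer normalization (lowercase, strip punctuation, drop the articles "a", "an", "the", collapse whitespace). The names `normalize_answer`, `token_f1`, and `best_f1` are hypothetical and not taken from this repository, and the handling of empty inputs is an assumption.

```python
import re
import string
from collections import Counter
from typing import List

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Assumption: two empty answers count as a match, one empty answer as a miss.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, answer_list: List[str]) -> float:
    """Score a prediction against every reference answer and keep the best."""
    return max((token_f1(prediction, ref) for ref in answer_list), default=0.0)
```

Taking the maximum over `answer_list` is why each sample carries several reference phrasings: a prediction such as "The textile industry" matches the reference "textile industry" exactly once articles are removed, even though it differs from "textiles".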