Space: farffadet/syllogym-env

farffadet committed on Mar 13

Commit c6837f1 (verified) · 1 parent: 25f6703

Upload folder using huggingface_hub

Files changed (18)

Dockerfile +6 -0
README.md +87 -6
openenv.yaml +8 -6
pyproject.toml +12 -2
server/Dockerfile +29 -0
server/__init__.py +1 -1
server/app.py +2 -2
server/core/__init__.py +6 -0
server/core/base_driver.py +110 -0
server/core/environment.py +233 -0
server/core/reward.py +78 -0
server/drivers/__init__.py +17 -0
server/drivers/fol_nli.py +151 -0
server/drivers/folio.py +141 -0
server/drivers/knights_knaves.py +147 -0
server/drivers/legalbench.py +271 -0
server/drivers/proofwriter.py +148 -0
server/drivers/rulebreakers.py +145 -0

Dockerfile CHANGED

@@ -9,12 +9,18 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
 # Copy the environment package (build context = envs/syllogym_env/)
 COPY . ./syllogym_env/
 # Install all dependencies from pyproject.toml
 RUN pip install --no-cache-dir -e ./syllogym_env/
 ENV PYTHONUNBUFFERED=1
 ENV PYTHONPATH="/app:$PYTHONPATH"
 EXPOSE 7860
 HEALTHCHECK --interval=30s --timeout=3s --start-period=30s --retries=5 \

 # Copy the environment package (build context = envs/syllogym_env/)
 COPY . ./syllogym_env/
+# Copy README to /app/README.md so the web interface can load it
+COPY README.md ./README.md
 # Install all dependencies from pyproject.toml
 RUN pip install --no-cache-dir -e ./syllogym_env/
 ENV PYTHONUNBUFFERED=1
 ENV PYTHONPATH="/app:$PYTHONPATH"
+# Enable the OpenEnv built-in web interface (served at /web)
+ENV ENABLE_WEB_INTERFACE=true
 EXPOSE 7860
 HEALTHCHECK --interval=30s --timeout=3s --start-period=30s --retries=5 \

README.md CHANGED

@@ -1,10 +1,91 @@
 ---
-title: Syllogym Env
-emoji: 🐢
-colorFrom: blue
-colorTo: yellow
 sdk: docker
-pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: SylloGym — Deductive Reasoning Environment
 sdk: docker
+app_port: 7860
+base_path: /web
+colorFrom: blue
+colorTo: green
+tags:
+  - openenv
+  - reasoning
+  - rlvr
+  - legalbench
+  - grpo
+  - deductive-reasoning
+  - first-order-logic
 ---
+# SylloGym — Deductive Reasoning Environment
+SylloGym is a multi-dataset deductive reasoning environment for training LLMs via reinforcement learning. The model receives a rule, a set of facts, and a question — and must apply strict logical deduction to reach the correct conclusion.
+## How to interact
+1. Click **Reset** to get a new reasoning problem
+2. Fill in the **reasoning** field with your chain-of-thought (wrap in `<reasoning>...</reasoning>`)
+3. Fill in the **answer** field with your final answer (wrap in `<answer>...</answer>`)
+4. Click **Step** to submit and see your reward
+**Example:**
+```
+reasoning: <reasoning>The rule states that if a contract contains a confidentiality clause, it is an NDA. The facts show a confidentiality clause is present. Therefore, this is an NDA.</reasoning>
+answer: <answer>Yes</answer>
+```
+## Reward breakdown
+| Component | Points | Condition |
+|-----------|--------|-----------|
+| Format    | +0.1   | Both `<reasoning>` and `<answer>` tags present |
+| Answer    | +1.0   | Correct answer |
+| Reasoning | +0.2   | Reasoning cites ≥2 significant keywords from the rule and ≥2 from the facts |
+| **Max**   | **1.3**| |
+## Datasets & Tasks
+SylloGym covers **6 datasets** and **20 tasks** across different reasoning domains:
+| Dataset | Tasks | Type | Difficulty |
+|---------|-------|------|------------|
+| **LegalBench** | `diversity_1–6`, `ucc_v_common_law`, `abercrombie`, `hearsay`, `telemarketing_sales_rule` | Yes/No or multi-class | 1–5 |
+| **Knights & Knaves** | `knights_knaves` | Who is knight/knave? | 1–3 |
+| **ProofWriter** | `proofwriter_d1–d5` | True/False | 1–5 |
+| **FOLIO** | `folio` | True/False/Uncertain | 2–5 |
+| **RuleBreakers** | `rulebreakers_mt`, `rulebreakers_ds` | True/False | 2 |
+| **FOL-NLI** | `fol_nli` | entailment/contradiction/neutral | 2–5 |
+## Reset options
+Pass optional parameters to `reset()` to control task selection:
+- `task_mode="mixed"` (default) — weighted random across all datasets
+- `task_mode="single"` + `task_name="hearsay"` — restrict to one task
+- `seed=42` — reproducible sampling
+## Connect from code
+```python
+from syllogym_env import SylloGymEnv, SylloAction
+env = SylloGymEnv(base_url="https://huggingface.co/spaces/eliot-gtn/syllogym-env")
+env.connect()
+result = env.reset(task_mode="mixed")
+obs = result.observation
+print(obs.rule)
+print(obs.facts)
+print(obs.question)
+result = env.step(SylloAction(
+    reasoning="<reasoning>Applying the rule to the facts...</reasoning>",
+    answer=f"<answer>{obs.valid_answers[0]}</answer>",
+))
+print(f"Reward: {result.reward}")
+env.disconnect()
+```
+## About
+**Competition:** [OpenEnv Challenge](https://huggingface.co/openenv) by Meta PyTorch × HuggingFace × Unsloth
+**Training:** GRPO (Group Relative Policy Optimization) via TRL + Unsloth
+**Base model:** Qwen2.5-3B-Instruct
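The reward breakdown in the README is additive, so the total for a submission can be checked by hand. The following is a minimal sketch of that arithmetic, assuming the three component values from the table; the helper name `total_reward` and the scenario labels are illustrative and not part of the package.

```python
# Component values as listed in the README reward table.
FORMAT_REWARD = 0.1
ANSWER_REWARD = 1.0
REASONING_REWARD = 0.2


def total_reward(format_ok: bool, answer_ok: bool, reasoning_ok: bool) -> float:
    """Sum the components that were earned, rounded to 4 decimals."""
    total = (
        FORMAT_REWARD * format_ok
        + ANSWER_REWARD * answer_ok
        + REASONING_REWARD * reasoning_ok
    )
    return round(total, 4)


# Hypothetical scenarios: (format_ok, answer_ok, reasoning_ok)
scenarios = {
    "everything right": (True, True, True),
    "right answer, thin reasoning": (True, True, False),
    "wrong answer, good reasoning": (True, False, True),
    "no tags, wrong answer": (False, False, False),
}

for name, flags in scenarios.items():
    print(f"{name}: {total_reward(*flags)}")
```

Because the answer component dominates, a well-formatted but wrong submission can earn at most 0.3, which stays below the 1.0 threshold the environment uses to count an episode as correct.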

openenv.yaml CHANGED

@@ -3,15 +3,17 @@ name: syllogym_env
 type: space
 runtime: fastapi
 app: server.app:app
-port: 8000
 description: >
-  SylloGym: Legal Syllogistic Reasoning Environment.
-  Trains LLMs to apply deductive reasoning on LegalBench tasks.
-  The model receives a legal rule + case facts and must derive the correct
-  legal conclusion (Yes/No or multi-class).
 tags:
-  - legal
   - reasoning
   - rlvr
   - legalbench
   - grpo

 type: space
 runtime: fastapi
 app: server.app:app
+port: 7860
 description: >
+  SylloGym: Multi-dataset Deductive Reasoning Environment.
+  Trains LLMs to apply strict logical deduction across 6 datasets and 20 tasks:
+  LegalBench, Knights & Knaves, ProofWriter, FOLIO, RuleBreakers, FOL-NLI.
+  The model receives premises/rules and must derive the correct conclusion.
 tags:
+  - openenv
   - reasoning
   - rlvr
   - legalbench
   - grpo
+  - deductive-reasoning
+  - first-order-logic

pyproject.toml CHANGED

@@ -21,5 +21,15 @@ server = "syllogym_env.server.app:main"
 [tool.setuptools]
 include-package-data = true
-packages = ["syllogym_env", "syllogym_env.server"]
-package-dir = {"syllogym_env" = ".", "syllogym_env.server" = "server"}

 [tool.setuptools]
 include-package-data = true
+packages = [
+    "syllogym_env",
+    "syllogym_env.server",
+    "syllogym_env.server.core",
+    "syllogym_env.server.drivers",
+]
+package-dir = {
+    "syllogym_env" = ".",
+    "syllogym_env.server" = "server",
+    "syllogym_env.server.core" = "server/core",
+    "syllogym_env.server.drivers" = "server/drivers",
+}

server/Dockerfile ADDED

	@@ -0,0 +1,29 @@

+FROM python:3.11-slim
+WORKDIR /app
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git curl \
+    && rm -rf /var/lib/apt/lists/*
+# Copy the environment package (build context = envs/syllogym_env/)
+COPY . ./syllogym_env/
+# Copy README to /app/README.md so the web interface can load it
+COPY README.md ./README.md
+# Install all dependencies from pyproject.toml
+RUN pip install --no-cache-dir -e ./syllogym_env/
+ENV PYTHONUNBUFFERED=1
+ENV PYTHONPATH="/app:$PYTHONPATH"
+# Enable the OpenEnv built-in web interface (served at /web)
+ENV ENABLE_WEB_INTERFACE=true
+EXPOSE 7860
+HEALTHCHECK --interval=30s --timeout=3s --start-period=30s --retries=5 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/health')" || exit 1
+CMD ["uvicorn", "syllogym_env.server.app:app", "--host", "0.0.0.0", "--port", "7860"]

server/__init__.py CHANGED

@@ -1,5 +1,5 @@
 """SylloGym environment server components."""
-from .syllogym_environment import SylloGymEnvironment
 __all__ = ["SylloGymEnvironment"]

 """SylloGym environment server components."""
+from .core.environment import SylloGymEnvironment
 __all__ = ["SylloGymEnvironment"]

server/app.py CHANGED

@@ -12,11 +12,11 @@ Usage:
 try:
     from openenv.core.env_server import create_app
     from syllogym_env.models import SylloAction, SylloObservation
-    from syllogym_env.server.syllogym_environment import SylloGymEnvironment
 except ImportError:
     from openenv.core.env_server import create_app
     from ..models import SylloAction, SylloObservation
-    from .syllogym_environment import SylloGymEnvironment
 app = create_app(
     SylloGymEnvironment,

 try:
     from openenv.core.env_server import create_app
     from syllogym_env.models import SylloAction, SylloObservation
+    from syllogym_env.server.core.environment import SylloGymEnvironment
 except ImportError:
     from openenv.core.env_server import create_app
     from ..models import SylloAction, SylloObservation
+    from .core.environment import SylloGymEnvironment
 app = create_app(
     SylloGymEnvironment,

server/core/__init__.py ADDED

	@@ -0,0 +1,6 @@

+"""SylloGym core abstractions."""
+from .base_driver import BaseDriver, RuleTask
+from .reward import compute_reward
+__all__ = ["BaseDriver", "RuleTask", "compute_reward"]

server/core/base_driver.py ADDED

	@@ -0,0 +1,110 @@

+"""
+server/core/base_driver.py
+--------------------------
+RuleTask — the shared data contract between drivers and the environment.
+BaseDriver — the interface every dataset driver must implement.
+"""
+from __future__ import annotations
+import random
+from abc import ABC, abstractmethod
+from dataclasses import dataclass, field
+@dataclass
+class RuleTask:
+    """
+    A fully resolved reasoning problem, ready to present to a model.
+    Drivers produce RuleTask objects; the environment consumes them.
+    All fields map directly to SylloObservation fields.
+    Fields shown to the model:
+        rule          — the explicit rule or principle to apply
+        facts         — the case / scenario to reason about
+        question      — what the model must answer
+        valid_answers — accepted answer strings (case-insensitive match)
+    Metadata:
+        task_name     — unique identifier for this task type
+        difficulty    — 1 (easiest) to N (hardest); used for curriculum sampling
+        task_type     — "binary" (Yes/No or two-class) | "multiclass"
+    Ground truth (server-side only, not shown to model):
+        correct_answer — the expected answer string
+    """
+    rule: str
+    facts: str
+    question: str
+    valid_answers: list[str]
+    task_name: str
+    difficulty: int
+    task_type: str  # "binary" | "multiclass"
+    correct_answer: str
+class BaseDriver(ABC):
+    """
+    Interface for dataset drivers.
+    A driver is responsible for:
+    - Maintaining its own example pool or procedural generator
+    - Sampling a single RuleTask on demand via sample()
+    - Reporting which task names it owns via task_names
+    The environment owns the random.Random instance and passes it to sample()
+    so that episode reproducibility is controlled by a single seed.
+    Minimal implementation example:
+        class MyDriver(BaseDriver):
+            @property
+            def task_names(self) -> list[str]:
+                return ["my_task"]
+            def sample(self, rng, task_name=None) -> RuleTask | None:
+                if task_name is not None and task_name != "my_task":
+                    return None
+                return RuleTask(
+                    rule="...", facts="...", question="...",
+                    valid_answers=["Yes", "No"],
+                    task_name="my_task", difficulty=1,
+                    task_type="binary", correct_answer="Yes",
+                )
+    """
+    @property
+    @abstractmethod
+    def task_names(self) -> list[str]:
+        """All task names this driver can produce. Used for single-task routing."""
+        ...
+    @abstractmethod
+    def sample(
+        self,
+        rng: random.Random,
+        task_name: str | None = None,
+    ) -> RuleTask | None:
+        """
+        Sample one RuleTask.
+        Args:
+            rng:        Shared Random instance from the environment.
+            task_name:  If provided, restrict to this task.
+                        Return None if this driver does not own that task_name.
+        Returns:
+            A RuleTask, or None if no examples are available
+            (dataset not loaded, generation failed, task not owned, etc.).
+        """
+        ...
+    @property
+    def weight(self) -> float:
+        """
+        Relative sampling weight for this driver in mixed mode.
+        Override to up- or down-weight a driver relative to others.
+        Default: 1.0 per task_name (so total weight scales with number of tasks).
+        """
+        return float(len(self.task_names))
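The driver contract in `base_driver.py` also fits procedural generators, not only dataset loaders. Below is a minimal sketch of such a driver. `RuleTask` and `BaseDriver` are redefined locally as stand-ins so the snippet runs on its own, and `ParityDriver` with its `"parity"` task is a hypothetical example, not one of the shipped drivers.

```python
import random
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RuleTask:
    """Local stand-in mirroring the fields of RuleTask in base_driver.py."""
    rule: str
    facts: str
    question: str
    valid_answers: List[str]
    task_name: str
    difficulty: int
    task_type: str
    correct_answer: str


class BaseDriver(ABC):
    """Local stand-in for the BaseDriver interface."""

    @property
    @abstractmethod
    def task_names(self) -> List[str]: ...

    @abstractmethod
    def sample(self, rng: random.Random, task_name: Optional[str] = None) -> Optional[RuleTask]: ...

    @property
    def weight(self) -> float:
        return float(len(self.task_names))


class ParityDriver(BaseDriver):
    """Hypothetical procedural driver that generates 'is N even?' problems."""

    @property
    def task_names(self) -> List[str]:
        return ["parity"]

    def sample(self, rng: random.Random, task_name: Optional[str] = None) -> Optional[RuleTask]:
        # Decline requests for tasks this driver does not own.
        if task_name is not None and task_name != "parity":
            return None
        n = rng.randint(1, 999)  # all randomness comes from the caller's rng
        return RuleTask(
            rule="An integer is even if and only if it is divisible by 2.",
            facts=f"The integer under consideration is {n}.",
            question=f"Is {n} even?",
            valid_answers=["Yes", "No"],
            task_name="parity",
            difficulty=1,
            task_type="binary",
            correct_answer="Yes" if n % 2 == 0 else "No",
        )


driver = ParityDriver()
task_a = driver.sample(random.Random(7))
task_b = driver.sample(random.Random(7))
print(task_a.question, "->", task_a.correct_answer)
print("reproducible:", task_a == task_b)
```

Drawing every random value from the `rng` argument, rather than from the module-level `random` functions, is what lets a single environment seed reproduce an episode.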

server/core/environment.py ADDED

	@@ -0,0 +1,233 @@

+"""
+server/core/environment.py
+--------------------------
+SylloGymEnvironment — driver-aware deductive reasoning environment.
+The environment is dataset-agnostic. It delegates task sampling to registered
+BaseDriver instances and uses the shared reward functions from core.reward.
+Each episode is a single step:
+  reset() → SylloObservation (rule + facts + question)
+  step(SylloAction) → SylloObservation (with reward + done=True)
+"""
+from __future__ import annotations
+import random
+import uuid
+from typing import Any, Optional
+from openenv.core.env_server.interfaces import Action, Environment, Observation
+try:
+    from ...models import SylloAction, SylloObservation, SylloState
+except ImportError:
+    from syllogym_env.models import SylloAction, SylloObservation, SylloState
+from .base_driver import BaseDriver, RuleTask
+from .reward import check_format, check_answer, check_reasoning_quality, compute_reward
+from ..drivers.legalbench import LegalBenchDriver
+from ..drivers.knights_knaves import KnightsKnavesDriver
+from ..drivers.proofwriter import ProofWriterDriver
+from ..drivers.folio import FOLIODriver
+from ..drivers.rulebreakers import RuleBreakersDriver
+from ..drivers.fol_nli import FOLNLIDriver
+def _default_drivers() -> list[BaseDriver]:
+    return [
+        LegalBenchDriver(),
+        KnightsKnavesDriver(),
+        ProofWriterDriver(),
+        FOLIODriver(),
+        RuleBreakersDriver(),
+        FOLNLIDriver(),
+    ]
+class SylloGymEnvironment(Environment):
+    """
+    SylloGym: Multi-dataset Deductive Reasoning Environment.
+    Trains LLMs to apply deductive (syllogistic) reasoning across domains.
+    The model receives a rule + facts and must derive the correct conclusion.
+    Drivers provide the actual tasks. The default set is all six built-in drivers (see _default_drivers()).
+    Additional drivers can be registered at construction time.
+    Args:
+        task_mode:  Sampling strategy.
+                    "mixed"  — weighted random across all drivers and tasks.
+                    "single" — restrict to one specific task_name.
+        task_name:  When task_mode="single", the specific task to use.
+        seed:       Optional random seed for reproducibility.
+        drivers:    List of BaseDriver instances. Defaults to _default_drivers() (all six built-in drivers).
+    Example:
+        >>> env = SylloGymEnvironment(task_mode="mixed")
+        >>> obs = env.reset()
+        >>> action = SylloAction(
+        ...     reasoning="<reasoning>The rule states...</reasoning>",
+        ...     answer="<answer>Yes</answer>"
+        ... )
+        >>> result = env.step(action)
+        >>> print(result.reward)   # 0.0 to 1.3
+        # Single-task mode (K&K only):
+        >>> env = SylloGymEnvironment(task_mode="single", task_name="knights_knaves")
+    """
+    def __init__(
+        self,
+        task_mode: str = "mixed",
+        task_name: Optional[str] = None,
+        seed: Optional[int] = None,
+        drivers: Optional[list[BaseDriver]] = None,
+    ):
+        self._task_mode = task_mode
+        self._task_name = task_name
+        self._rng = random.Random(seed)
+        self._drivers: list[BaseDriver] = drivers if drivers is not None else _default_drivers()
+        # Build task_name → driver lookup for O(1) single-task routing
+        self._task_to_driver: dict[str, BaseDriver] = {}
+        for driver in self._drivers:
+            for name in driver.task_names:
+                self._task_to_driver[name] = driver
+        self._state = SylloState(
+            episode_id=str(uuid.uuid4()),
+            task_mode=task_mode,
+        )
+        self._current_task: Optional[RuleTask] = None
+    # Public property for callers that need the full task list (e.g. eval callbacks)
+    @property
+    def task_registry(self) -> list[dict]:
+        """Return all tasks across all drivers as a list of {name, difficulty} dicts."""
+        tasks = []
+        for driver in self._drivers:
+            for name in driver.task_names:
+                # Ask the driver for a sample to get difficulty; use a placeholder rng
+                sample = driver.sample(random.Random(0), task_name=name)
+                tasks.append({
+                    "name": name,
+                    "difficulty": sample.difficulty if sample else 1,
+                })
+        return tasks
+    def _sample_task(self) -> Optional[RuleTask]:
+        """Sample one RuleTask from the appropriate driver."""
+        if self._task_mode == "single" and self._task_name:
+            driver = self._task_to_driver.get(self._task_name)
+            if driver is None:
+                return None
+            return driver.sample(self._rng, task_name=self._task_name)
+        # "mixed": weighted selection across drivers, then delegate internally
+        weights = [d.weight for d in self._drivers]
+        driver = self._rng.choices(self._drivers, weights=weights, k=1)[0]
+        return driver.sample(self._rng)
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        episode_id: Optional[str] = None,
+        task_mode: Optional[str] = None,
+        task_name: Optional[str] = None,
+        **kwargs: Any,
+    ) -> Observation:
+        if seed is not None:
+            self._rng = random.Random(seed)
+        if task_mode is not None:
+            self._task_mode = task_mode
+        if task_name is not None:
+            self._task_name = task_name
+        self._state = SylloState(
+            episode_id=episode_id or str(uuid.uuid4()),
+            task_mode=self._task_mode,
+            task_name=self._task_name or "",
+            total_correct=self._state.total_correct,
+            total_steps=self._state.total_steps,
+        )
+        self._current_task = self._sample_task()
+        if self._current_task is None:
+            return SylloObservation(
+                facts="[Dataset unavailable — check internet connection and datasets library]",
+                reward=None,
+                done=False,
+            )
+        t = self._current_task
+        return SylloObservation(
+            rule=t.rule,
+            facts=t.facts,
+            question=t.question,
+            task_type=t.task_type,
+            valid_answers=t.valid_answers,
+            task_name=t.task_name,
+            difficulty=t.difficulty,
+            correct_answer=t.correct_answer,
+            reward=None,
+            done=False,
+        )
+    def step(self, action: Action, **kwargs: Any) -> Observation:
+        if self._current_task is None:
+            return SylloObservation(
+                reward=0.0,
+                done=True,
+                metadata={"error": "Environment not initialized. Call reset() first."},
+            )
+        if not isinstance(action, SylloAction):
+            try:
+                action = SylloAction(
+                    reasoning=getattr(action, "reasoning", ""),
+                    answer=getattr(action, "answer", ""),
+                )
+            except Exception:
+                return SylloObservation(reward=0.0, done=True)
+        t = self._current_task
+        total, breakdown = compute_reward(
+            reasoning=action.reasoning,
+            answer=action.answer,
+            correct_answer=t.correct_answer,
+            valid_answers=t.valid_answers,
+            rule=t.rule,
+            facts=t.facts,
+        )
+        self._state.step_count += 1
+        self._state.total_steps += 1
+        if total >= 1.0:
+            self._state.total_correct += 1
+        return SylloObservation(
+            rule=t.rule,
+            facts=t.facts,
+            question=t.question,
+            task_type=t.task_type,
+            valid_answers=t.valid_answers,
+            task_name=t.task_name,
+            difficulty=t.difficulty,
+            correct_answer=t.correct_answer,
+            reward=total,
+            done=True,
+            metadata={
+                "predicted_answer": action.answer,
+                "correct_answer": t.correct_answer,
+                "format_reward": breakdown["format"],
+                "answer_reward": breakdown["answer"],
+                "reasoning_reward": breakdown["reasoning"],
+            },
+        )
+    @property
+    def state(self) -> SylloState:
+        return self._state
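In mixed mode, `_sample_task` first picks a driver with `random.choices` using each driver's `weight`, then lets that driver pick an example. The sketch below isolates only the first step to show the expected share per driver. The weights for LegalBench, ProofWriter, FOL-NLI and FOLIO come from comments and code in this commit; the Knights & Knaves and RuleBreakers values are assumptions based on the default of 1.0 per task name, since those driver files are not shown here.

```python
import random
from collections import Counter

DRIVER_WEIGHTS = {
    "legalbench": 10.0,     # stated in the FOLIO and FOL-NLI driver comments
    "proofwriter": 5.0,     # stated in the FOLIO and FOL-NLI driver comments
    "fol_nli": 4.0,         # FOLNLIDriver.weight
    "folio": 1.0,           # FOLIODriver.weight
    "knights_knaves": 1.0,  # assumption: default weight, 1 task name
    "rulebreakers": 2.0,    # assumption: default weight, 2 task names
}


def pick_drivers(seed: int, n: int) -> list:
    """Draw n driver names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(DRIVER_WEIGHTS)
    weights = [DRIVER_WEIGHTS[name] for name in names]
    return [rng.choices(names, weights=weights, k=1)[0] for _ in range(n)]


total = sum(DRIVER_WEIGHTS.values())
for name, w in DRIVER_WEIGHTS.items():
    print(f"{name:15s} expected share {w / total:.3f}")

counts = Counter(pick_drivers(seed=42, n=10_000))
print(counts.most_common())
```

In the real environment the same `rng` is also consumed by `driver.sample`, so the exact sequence of drivers for a given seed will differ from this isolated sketch; only the long-run proportions carry over.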

server/core/reward.py ADDED

	@@ -0,0 +1,78 @@

+"""
+server/core/reward.py
+---------------------
+Reward functions for SylloGym.
+Composite reward structure (max 1.3):
+  +0.1  format reward     — <reasoning> and <answer> tags present
+  +1.0  answer reward     — correct answer (exact match, case-insensitive)
+  +0.2  reasoning quality — reasoning references keywords from rule AND facts
+"""
+from __future__ import annotations
+import re
+_STOPWORDS = {
+    "under", "shall", "must", "with", "that", "this", "from", "into",
+    "have", "been", "were", "they", "their", "there", "when", "which",
+}
+def check_format(reasoning: str, answer: str) -> float:
+    """Return 0.1 if both <reasoning>...</reasoning> and <answer>...</answer> tags are present."""
+    combined = reasoning + answer
+    has_reasoning = bool(re.search(r"<reasoning>.*?</reasoning>", combined, re.DOTALL))
+    has_answer = bool(re.search(r"<answer>.*?</answer>", combined, re.DOTALL | re.IGNORECASE))
+    return 0.1 if (has_reasoning and has_answer) else 0.0
+def check_answer(predicted: str, correct: str, valid_answers: list[str]) -> float:
+    """Return 1.0 for exact match (case-insensitive). Extracts from <answer> tags if present."""
+    tag_match = re.search(r"<answer>(.*?)</answer>", predicted, re.DOTALL | re.IGNORECASE)
+    pred_clean = tag_match.group(1).strip().lower() if tag_match else predicted.strip().lower()
+    return 1.0 if pred_clean == correct.strip().lower() else 0.0
+def check_reasoning_quality(reasoning: str, rule: str, facts: str) -> float:
+    """
+    Return 0.2 if the reasoning mentions at least 2 significant keywords from the rule and at least 2 from the facts.
+    Heuristic for genuine deductive reasoning vs pattern-matching.
+    """
+    if not reasoning:
+        return 0.0
+    reasoning_lower = reasoning.lower()
+    rule_words = {w for w in re.findall(r"\b[a-z]{5,}\b", rule.lower()) if w not in _STOPWORDS}
+    facts_words = {w for w in re.findall(r"\b[a-z]{5,}\b", facts.lower()) if w not in _STOPWORDS}
+    rule_hits = sum(1 for w in rule_words if w in reasoning_lower)
+    facts_hits = sum(1 for w in facts_words if w in reasoning_lower)
+    return 0.2 if (rule_hits >= 2 and facts_hits >= 2) else 0.0
+def compute_reward(
+    reasoning: str,
+    answer: str,
+    correct_answer: str,
+    valid_answers: list[str],
+    rule: str,
+    facts: str,
+) -> tuple[float, dict[str, float]]:
+    """
+    Compute the composite reward and return (total, breakdown).
+    Args:
+        reasoning:      The model's reasoning text (may include <reasoning> tags).
+        answer:         The model's answer text (may include <answer> tags).
+        correct_answer: Ground truth answer string.
+        valid_answers:  List of accepted answer strings.
+        rule:           The rule text shown in the prompt.
+        facts:          The facts text shown in the prompt.
+    Returns:
+        (total_reward, {"format": float, "answer": float, "reasoning": float})
+    """
+    fmt = check_format(reasoning, answer)
+    ans = check_answer(answer, correct_answer, valid_answers)
+    rsn = check_reasoning_quality(reasoning, rule, facts)
+    total = round(fmt + ans + rsn, 4)
+    return total, {"format": fmt, "answer": ans, "reasoning": rsn}
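The reasoning-quality check is a keyword-overlap heuristic, and a small standalone run makes its behaviour easier to see. The sketch below re-implements the same logic (words of five or more letters, minus the stopword set, matched as substrings of the lowercased reasoning); the helper names `keywords` and `reasoning_quality` and the sample texts are illustrative only.

```python
import re

STOPWORDS = {
    "under", "shall", "must", "with", "that", "this", "from", "into",
    "have", "been", "were", "they", "their", "there", "when", "which",
}


def keywords(text: str) -> set:
    """Lowercase words of 5+ letters that are not stopwords."""
    return {w for w in re.findall(r"\b[a-z]{5,}\b", text.lower()) if w not in STOPWORDS}


def reasoning_quality(reasoning: str, rule: str, facts: str) -> float:
    """Return 0.2 when at least 2 rule keywords and 2 facts keywords appear."""
    if not reasoning:
        return 0.0
    lowered = reasoning.lower()
    rule_hits = sum(1 for w in keywords(rule) if w in lowered)
    facts_hits = sum(1 for w in keywords(facts) if w in lowered)
    return 0.2 if (rule_hits >= 2 and facts_hits >= 2) else 0.0


rule = "If a contract contains a confidentiality clause, it is an NDA."
facts = "The agreement between the parties includes a confidentiality clause."

grounded = "The contract contains a confidentiality clause, so it is an NDA."
thin = "The answer is yes."
plural = "Confidentiality clauses are present."  # 'clause' matches inside 'clauses'

for text in (grounded, thin, plural):
    print(f"{reasoning_quality(text, rule, facts)}  <- {text}")
```

Two properties follow from the substring test: inflected forms such as "clauses" still count as a hit, and a keyword shared by the rule and the facts counts toward both thresholds.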

server/drivers/__init__.py ADDED

	@@ -0,0 +1,17 @@

+"""SylloGym dataset drivers."""
+from .legalbench import LegalBenchDriver
+from .knights_knaves import KnightsKnavesDriver
+from .proofwriter import ProofWriterDriver
+from .folio import FOLIODriver
+from .rulebreakers import RuleBreakersDriver
+from .fol_nli import FOLNLIDriver
+__all__ = [
+    "LegalBenchDriver",
+    "KnightsKnavesDriver",
+    "ProofWriterDriver",
+    "FOLIODriver",
+    "RuleBreakersDriver",
+    "FOLNLIDriver",
+]

server/drivers/fol_nli.py ADDED

	@@ -0,0 +1,151 @@

+"""
+server/drivers/fol_nli.py
+--------------------------
+FOLNLIDriver — loads tasks from the FOL-NLI dataset
+(tasksource/FOL-nli on HuggingFace).
+FOL-NLI (First-Order Logic Natural Language Inference) is an 82,219-example
+dataset of NLI problems grounded in formal first-order logic. Each example
+consists of a set of premises (natural language rules + facts) and a hypothesis
+to evaluate. Labels are theorem-prover verified:
+  - entailment   : hypothesis follows from premises
+  - contradiction: hypothesis is inconsistent with premises
+  - neutral      : cannot be determined from premises alone
+Unlike SNLI/MultiNLI (linguistic NLI), FOL-NLI is derived from formal proofs,
+making it a rigorous test of deductive reasoning rather than linguistic intuition.
+Difficulty is estimated from rule_concentration (density of logical rules):
+  rule_concentration < 0.05  → difficulty 2  (few rules, mostly facts)
+  0.05 ≤ rc < 0.12           → difficulty 3
+  0.12 ≤ rc < 0.20           → difficulty 4
+  rc ≥ 0.20                  → difficulty 5  (rule-heavy, deepest inference)
+Dataset: https://huggingface.co/datasets/tasksource/FOL-nli
+"""
+from __future__ import annotations
+import random
+from typing import Optional
+from ..core.base_driver import BaseDriver, RuleTask
+TASK_NAME = "fol_nli"
+RULE_TEXT = (
+    "You are given a set of premises expressed in natural language. "
+    "Some premises are facts (direct statements about the world); others are "
+    "logical rules (conditional or universal statements). "
+    "Using only these premises and strict logical deduction, determine whether "
+    "the hypothesis is:\n"
+    "  - entailment   : it necessarily follows from the premises\n"
+    "  - contradiction: it is inconsistent with the premises\n"
+    "  - neutral      : it cannot be determined from the premises alone"
+)
+QUESTION_TEMPLATE = (
+    'Based solely on the premises above, what is the relationship to the '
+    'following hypothesis?\nHypothesis: "{hypothesis}"'
+)
+_VALID_LABELS = {"entailment", "contradiction", "neutral"}
+def _difficulty_from_rule_concentration(rc: float) -> int:
+    if rc < 0.05:
+        return 2
+    if rc < 0.12:
+        return 3
+    if rc < 0.20:
+        return 4
+    return 5
+def _load_examples() -> list[dict]:
+    """Load all FOL-NLI examples (train split only — large enough)."""
+    try:
+        from datasets import load_dataset
+        ds = load_dataset("tasksource/FOL-nli", split="train", trust_remote_code=True)
+        examples = []
+        for ex in ds:
+            label = ex.get("label", "")
+            if label not in _VALID_LABELS:
+                continue
+            premise = ex.get("premise", "").strip()
+            hypothesis = ex.get("hypothesis", "").strip()
+            if not premise or not hypothesis:
+                continue
+            rc = float(ex.get("rule_concentration") or 0.0)
+            examples.append({
+                "premise": premise,
+                "hypothesis": hypothesis,
+                "label": label,
+                "difficulty": _difficulty_from_rule_concentration(rc),
+            })
+        return examples
+    except Exception:
+        return []
+class FOLNLIDriver(BaseDriver):
+    """
+    Driver for the FOL-NLI formal logic NLI dataset.
+    Single task name "fol_nli", drawn from the 82K examples (subsampled to max_examples, default 20,000).
+    Three-class: entailment / contradiction / neutral.
+    Difficulty 2–5 based on rule_concentration.
+    This driver provides the largest pure-logic reasoning dataset in SylloGym,
+    with theorem-prover verified labels — no annotation noise.
+    """
+    def __init__(self, max_examples: int = 20000) -> None:
+        self._max = max_examples
+        self._cache: Optional[list[dict]] = None
+    @property
+    def task_names(self) -> list[str]:
+        return [TASK_NAME]
+    @property
+    def weight(self) -> float:
+        # Large and high-quality: weighted just below ProofWriter (5) and well below LegalBench (10)
+        return 4.0
+    def _ensure_loaded(self) -> list[dict]:
+        if self._cache is None:
+            examples = _load_examples()
+            if len(examples) > self._max:
+                # Random subsample with a fixed seed (label balance is preserved only approximately)
+                rng = random.Random(42)
+                rng.shuffle(examples)
+                examples = examples[: self._max]
+            self._cache = examples
+        return self._cache
+    def sample(
+        self,
+        rng: random.Random,
+        task_name: Optional[str] = None,
+    ) -> Optional[RuleTask]:
+        if task_name is not None and task_name != TASK_NAME:
+            return None
+        examples = self._ensure_loaded()
+        if not examples:
+            return None
+        ex = rng.choice(examples)
+        return RuleTask(
+            rule=RULE_TEXT,
+            facts=ex["premise"],
+            question=QUESTION_TEMPLATE.format(hypothesis=ex["hypothesis"]),
+            valid_answers=["entailment", "contradiction", "neutral"],
+            task_name=TASK_NAME,
+            difficulty=ex["difficulty"],
+            task_type="multiclass",
+            correct_answer=ex["label"],
+        )
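`_ensure_loaded` caps the pool by shuffling and truncating, which keeps label proportions only in expectation. If exact proportions matter, a stratified subsample is one alternative. The following is a sketch of that different technique, not what the driver does; the function name `stratified_subsample` and the toy label counts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_subsample(examples: list, max_examples: int, seed: int = 42) -> list:
    """Keep at most max_examples items while preserving label proportions."""
    if len(examples) <= max_examples:
        return list(examples)
    rng = random.Random(seed)

    # Group examples by label.
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)

    kept, leftovers = [], []
    for label in sorted(by_label):  # sorted for a deterministic order
        group = by_label[label]
        rng.shuffle(group)
        # Proportional quota, rounded down.
        quota = len(group) * max_examples // len(examples)
        kept.extend(group[:quota])
        leftovers.extend(group[quota:])

    # Rounding down can leave a few free slots; fill them at random.
    rng.shuffle(leftovers)
    kept.extend(leftovers[: max_examples - len(kept)])
    rng.shuffle(kept)
    return kept


# Toy pool with a 60/30/10 label split (made-up counts).
pool = (
    [{"label": "entailment", "id": i} for i in range(600)]
    + [{"label": "contradiction", "id": i} for i in range(300)]
    + [{"label": "neutral", "id": i} for i in range(100)]
)
subset = stratified_subsample(pool, max_examples=100)
label_counts = Counter(ex["label"] for ex in subset)
print(label_counts)
```

For a pool as large as FOL-NLI the plain shuffle is likely close enough in practice; the stratified version mainly helps when a class is rare or the cap is small.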

server/drivers/folio.py ADDED

	@@ -0,0 +1,141 @@

+"""
+server/drivers/folio.py
+------------------------
+FOLIODriver — loads tasks from the FOLIO dataset
+(tasksource/folio on HuggingFace, open-access mirror of yale-nlp/FOLIO).
+FOLIO (First-Order Logic Inference on Natural Language) is an expert-written
+dataset of 1,204 examples requiring first-order logic reasoning over natural
+language premises. Unlike ProofWriter (synthetic), FOLIO uses real-world
+knowledge and genuinely complex multi-step inference.
+Each example:
+    premises   — a set of natural language statements (the theory)
+    conclusion — a statement to evaluate
+    label      — "True", "False", or "Uncertain"
+Difficulty is estimated from the number of premises (proxy for reasoning depth):
+    ≤3 premises → difficulty 2
+    4–5 premises → difficulty 3
+    6–7 premises → difficulty 4
+    ≥8 premises → difficulty 5
+Dataset: https://huggingface.co/datasets/tasksource/folio
+Paper: "FOLIO: Natural Language Reasoning with First-Order Logic"
+       (Han et al., EMNLP 2022)
+"""
+from __future__ import annotations
+import random
+from typing import Optional
+from ..core.base_driver import BaseDriver, RuleTask
+TASK_NAME = "folio"
+RULE_TEXT = (
+    "You are given a set of premises (statements that are true). "
+    "Using only these premises and logical deduction, determine whether "
+    "the conclusion is True, False, or Uncertain (cannot be determined "
+    "from the premises alone)."
+)
+QUESTION_TEMPLATE = (
+    'Based solely on the premises above, is the following conclusion '
+    'True, False, or Uncertain?\n"{conclusion}"'
+)
+def _difficulty_from_n_premises(n: int) -> int:
+    if n <= 3:
+        return 2
+    if n <= 5:
+        return 3
+    if n <= 7:
+        return 4
+    return 5
+def _load_examples() -> list[dict]:
+    """Load all FOLIO examples (train + validation splits)."""
+    try:
+        from datasets import load_dataset
+        examples = []
+        for split in ("train", "validation"):
+            try:
+                ds = load_dataset("tasksource/folio", split=split, trust_remote_code=True)
+                for ex in ds:
+                    premises = ex["premises"].strip()
+                    conclusion = ex["conclusion"].strip()
+                    label = ex["label"].strip()  # "True" | "False" | "Uncertain"
+                    if not premises or not conclusion or label not in ("True", "False", "Uncertain"):
+                        continue
+                    n_premises = len([p for p in premises.split("\n") if p.strip()])
+                    examples.append({
+                        "premises": premises,
+                        "conclusion": conclusion,
+                        "label": label,
+                        "difficulty": _difficulty_from_n_premises(n_premises),
+                    })
+            except Exception:
+                continue
+        return examples
+    except Exception:
+        return []
+class FOLIODriver(BaseDriver):
+    """
+    Driver for the FOLIO first-order logic reasoning dataset.
+    Single task name "folio" covering all examples (True / False / Uncertain).
+    Difficulty ranges from 2 to 5 based on number of premises.
+    The dataset is small (~1200 examples total) and expert-written, making it
+    qualitatively different from ProofWriter (synthetic) and LegalBench (legal domain).
+    It is best used as a hard evaluation set or mixed in at low weight.
+    """
+    def __init__(self) -> None:
+        self._cache: Optional[list[dict]] = None
+    @property
+    def task_names(self) -> list[str]:
+        return [TASK_NAME]
+    @property
+    def weight(self) -> float:
+        # Small dataset — down-weight relative to ProofWriter (weight=5) and LegalBench (weight=10)
+        return 1.0
+    def _ensure_loaded(self) -> list[dict]:
+        if self._cache is None:
+            self._cache = _load_examples()
+        return self._cache
+    def sample(
+        self,
+        rng: random.Random,
+        task_name: Optional[str] = None,
+    ) -> Optional[RuleTask]:
+        if task_name is not None and task_name != TASK_NAME:
+            return None
+        examples = self._ensure_loaded()
+        if not examples:
+            return None
+        ex = rng.choice(examples)
+        return RuleTask(
+            rule=RULE_TEXT,
+            facts=ex["premises"],
+            question=QUESTION_TEMPLATE.format(conclusion=ex["conclusion"]),
+            valid_answers=["True", "False", "Uncertain"],
+            task_name=TASK_NAME,
+            difficulty=ex["difficulty"],
+            task_type="multiclass",
+            correct_answer=ex["label"],
+        )
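
Reviewer sketch (not part of the commit): the premise-count heuristic in folio.py can be exercised without downloading the dataset. The block below re-implements the same thresholds and the same "one premise per non-empty line" counting rule. The helper name `count_premises` and the sample premises are made up for illustration and are not taken from FOLIO.

```python
def difficulty_from_n_premises(n: int) -> int:
    """Same thresholds as the driver: <=3 -> 2, 4-5 -> 3, 6-7 -> 4, >=8 -> 5."""
    if n <= 3:
        return 2
    if n <= 5:
        return 3
    if n <= 7:
        return 4
    return 5


def count_premises(premises: str) -> int:
    """Count non-empty lines, assuming one premise per line (hypothetical helper)."""
    return len([p for p in premises.split("\n") if p.strip()])


# Hypothetical premises; the blank line is ignored by the counter.
sample = (
    "All cats are mammals.\n"
    "Tom is a cat.\n"
    "\n"
    "All mammals are animals.\n"
    "All animals breathe."
)
n = count_premises(sample)
level = difficulty_from_n_premises(n)
print(n, level)
```

Because the count depends on newline-separated premises, an example whose premises arrive on a single line would be scored as one premise and land in the lowest bucket.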

server/drivers/knights_knaves.py ADDED

	@@ -0,0 +1,147 @@

+"""
+server/drivers/knights_knaves.py
+---------------------------------
+KnightsKnavesDriver — procedural generation of Knights & Knaves logic puzzles.
+No dataset required. Problems are generated on demand with full control over
+difficulty (number of entities) and are infinitely varied.
+Puzzle format:
+    RULE:    "On this island, every person is either a Knight (who always tells
+              the truth) or a Knave (who always lies)..."
+    FACTS:   One statement per entity, e.g.:
+              "Alex says: 'Blake is a Knave.'"
+              "Blake says: 'Alex is a Knight.'"
+    QUESTION: "Based solely on the rule above, is Alex a Knight or a Knave?"
+    ANSWER:   "Knight" or "Knave"
+Difficulty scales with number of entities:
+    2 entities → difficulty 1
+    3 entities → difficulty 2
+    4 entities → difficulty 3
+    ...
+Multi-turn mode (future): the environment could reveal one statement at a time,
+asking the model to update its belief state at each step. This driver supports
+single-turn mode only for now.
+"""
+from __future__ import annotations
+import random
+from dataclasses import dataclass
+from typing import Optional
+from ..core.base_driver import BaseDriver, RuleTask
+TASK_NAME = "knights_knaves"
+RULE_TEXT = (
+    "On this island, every person is either a Knight (who always tells the truth) "
+    "or a Knave (who always lies). A Knight's statements are always true; "
+    "a Knave's statements are always false. "
+    "Use these facts to determine each person's type."
+)
+# A diverse pool of first names to avoid repetition within a puzzle
+_NAME_POOL = [
+    "Alex", "Blake", "Casey", "Dana", "Ellis",
+    "Fran", "Gray", "Harper", "Indra", "Jules",
+    "Kit", "Lane", "Morgan", "Noel", "Paige",
+    "Quinn", "Reese", "Sage", "Taylor", "Uma",
+]
+@dataclass
+class _Entity:
+    name: str
+    is_knight: bool
+def _generate_puzzle(
+    rng: random.Random,
+    n_entities: int,
+) -> tuple[list[_Entity], list[str]]:
+    """
+    Generate entity roles and statements for a K&K puzzle.
+    Each entity makes exactly one statement about a randomly chosen other entity.
+    A knight states the true type of the target; a knave states the false type.
+    Returns:
+        (entities, statements) where statements[i] is entity[i]'s statement.
+    """
+    names = rng.sample(_NAME_POOL, n_entities)
+    entities = [_Entity(name=name, is_knight=rng.choice([True, False])) for name in names]
+    statements: list[str] = []
+    for speaker in entities:
+        # Each speaker makes a claim about a randomly chosen other entity (never itself)
+        others = [e for e in entities if e.name != speaker.name]
+        target = rng.choice(others)
+        # Knight tells truth; knave lies about the target's type
+        if speaker.is_knight:
+            claimed = "Knight" if target.is_knight else "Knave"
+        else:
+            claimed = "Knave" if target.is_knight else "Knight"
+        statements.append(f'{speaker.name} says: "{target.name} is a {claimed}."')
+    return entities, statements
+class KnightsKnavesDriver(BaseDriver):
+    """
+    Procedural driver for Knights & Knaves logic puzzles.
+    Args:
+        min_entities: Minimum number of entities per puzzle (default 2).
+        max_entities: Maximum number of entities per puzzle (default 4).
+    Difficulty = n_entities - 1, so:
+        2 entities → difficulty 1
+        3 entities → difficulty 2
+        4 entities → difficulty 3
+    """
+    def __init__(self, min_entities: int = 2, max_entities: int = 4) -> None:
+        if min_entities < 2:
+            raise ValueError("min_entities must be at least 2")
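+        if max_entities > len(_NAME_POOL):
+            raise ValueError(f"max_entities cannot exceed {len(_NAME_POOL)}")
+        if min_entities > max_entities:
+            # rng.randint(min, max) would otherwise fail at sample time
+            raise ValueError("min_entities cannot exceed max_entities")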
+        self._min = min_entities
+        self._max = max_entities
+    @property
+    def task_names(self) -> list[str]:
+        return [TASK_NAME]
+    def sample(
+        self,
+        rng: random.Random,
+        task_name: Optional[str] = None,
+    ) -> Optional[RuleTask]:
+        if task_name is not None and task_name != TASK_NAME:
+            return None
+        n = rng.randint(self._min, self._max)
+        entities, statements = _generate_puzzle(rng, n)
+        facts = "\n".join(statements)
+        # Ask about a randomly chosen entity
+        subject = rng.choice(entities)
+        correct = "Knight" if subject.is_knight else "Knave"
+        return RuleTask(
+            rule=RULE_TEXT,
+            facts=facts,
+            question=f"Based solely on the rule above, is {subject.name} a Knight or a Knave?",
+            valid_answers=["Knight", "Knave"],
+            task_name=TASK_NAME,
+            difficulty=n - 1,
+            task_type="binary",
+            correct_answer=correct,
+        )
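
Reviewer sketch (not part of the commit): a brute-force consistency check for puzzles of the shape this driver generates, where every statement is "X is a Knight" or "X is a Knave". The function name `consistent_assignments` and the `(speaker, target, claimed_is_knight)` index-triple encoding are assumptions made for this illustration. The check suggests the answer is not uniquely determined by the statements alone: if an assignment is consistent, flipping every role keeps each statement's truth value in step with its speaker's role, so the flipped assignment is consistent too.

```python
from itertools import product


def consistent_assignments(statements):
    """Return every role assignment (True = Knight) consistent with all statements.

    statements: list of (speaker, target, claimed_is_knight) index triples.
    """
    n = 1 + max(max(s, t) for s, t, _ in statements)
    solutions = []
    for roles in product([True, False], repeat=n):
        # A statement is true iff the claim matches the target's role,
        # and it must be true exactly when the speaker is a Knight.
        if all((roles[t] == claim) == roles[s] for s, t, claim in statements):
            solutions.append(roles)
    return solutions


# Alex (0) says "Blake is a Knave."  Blake (1) says "Alex is a Knave."
puzzle = [(0, 1, False), (1, 0, False)]
solutions = consistent_assignments(puzzle)
print(solutions)
```

A generator that needs a single correct answer could add a statement type that breaks the flip symmetry, or reject puzzles where the queried entity's role differs across the consistent assignments.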

server/drivers/legalbench.py ADDED

	@@ -0,0 +1,271 @@

+"""
+server/drivers/legalbench.py
+----------------------------
+LegalBenchDriver — loads tasks from LegalBench (nguha/legalbench on HuggingFace).
+Selected tasks (Tier 1 — rule explicitly provided, pure deductive reasoning):
+  diversity_1–6         binary,     difficulty 1–6  (§1332 diversity jurisdiction)
+  ucc_v_common_law      binary,     difficulty 2    (UCC vs. Common Law)
+  abercrombie           multiclass, difficulty 3    (trademark distinctiveness)
+  hearsay               binary,     difficulty 4    (FRE 801 hearsay)
+  telemarketing_sales_rule  binary, difficulty 5    (FTC TSR)
+"""
+from __future__ import annotations
+import random
+from typing import Optional
+from ..core.base_driver import BaseDriver, RuleTask
+# ---------------------------------------------------------------------------
+# Task registry
+# ---------------------------------------------------------------------------
+TASK_REGISTRY: list[dict] = [
+    {
+        "name": "diversity_1",
+        "difficulty": 1,
+        "task_type": "binary",
+        "valid_answers": ["Yes", "No"],
+        "rule": (
+            "Under 28 U.S.C. § 1332, a federal district court has diversity jurisdiction "
+            "when: (1) the matter in controversy exceeds $75,000 exclusive of interest and "
+            "costs, AND (2) the action is between citizens of different States. "
+            "A corporation is deemed a citizen of its state of incorporation AND its "
+            "principal place of business."
+        ),
+        "question": "Based solely on the rule above, does the federal court have diversity jurisdiction?",
+    },
+    {
+        "name": "diversity_2",
+        "difficulty": 2,
+        "task_type": "binary",
+        "valid_answers": ["Yes", "No"],
+        "rule": (
+            "Under 28 U.S.C. § 1332, a federal district court has diversity jurisdiction "
+            "when: (1) the matter in controversy exceeds $75,000 exclusive of interest and "
+            "costs, AND (2) complete diversity exists — no plaintiff may be a citizen of the "
+            "same state as any defendant. A corporation is a citizen of its state of "
+            "incorporation AND its principal place of business."
+        ),
+        "question": "Based solely on the rule above, does the federal court have diversity jurisdiction?",
+    },
+    {
+        "name": "diversity_3",
+        "difficulty": 3,
+        "task_type": "binary",
+        "valid_answers": ["Yes", "No"],
+        "rule": (
+            "Under 28 U.S.C. § 1332, complete diversity jurisdiction requires: "
+            "(1) the amount in controversy exceeds $75,000 exclusive of interest and costs, "
+            "AND (2) every plaintiff is a citizen of a different state from every defendant. "
+            "A natural person's citizenship is determined by their domicile (the place they "
+            "reside with intent to remain). A corporation is a citizen of both its state of "
+            "incorporation and its principal place of business (the 'nerve center')."
+        ),
+        "question": "Based solely on the rule above, does the federal court have diversity jurisdiction?",
+    },
+    {
+        "name": "diversity_4",
+        "difficulty": 4,
+        "task_type": "binary",
+        "valid_answers": ["Yes", "No"],
+        "rule": (
+            "Under 28 U.S.C. § 1332, diversity jurisdiction requires: "
+            "(1) complete diversity — no plaintiff shares citizenship with any defendant, "
+            "(2) amount in controversy exceeds $75,000 (exclusive of interest and costs). "
+            "For aggregation of claims: a single plaintiff may aggregate all claims against "
+            "a single defendant to meet the amount requirement. When multiple plaintiffs sue "
+            "a single defendant, each plaintiff must independently satisfy the amount. "
+            "A corporation is a citizen of its state of incorporation and principal place of "
+            "business. An individual's citizenship is their domicile."
+        ),
+        "question": "Based solely on the rule above, does the federal court have diversity jurisdiction?",
+    },
+    {
+        "name": "diversity_5",
+        "difficulty": 5,
+        "task_type": "binary",
+        "valid_answers": ["Yes", "No"],
+        "rule": (
+            "Under 28 U.S.C. § 1332, diversity jurisdiction requires complete diversity and "
+            "an amount in controversy exceeding $75,000. Complete diversity means no plaintiff "
+            "is a citizen of the same state as any defendant. For unincorporated associations "
+            "(partnerships, LLCs), citizenship is determined by the citizenship of ALL members. "
+            "For class actions under CAFA (28 U.S.C. § 1332(d)), jurisdiction exists if any "
+            "member of the plaintiff class is diverse from any defendant and the aggregate "
+            "amount exceeds $5,000,000. For standard diversity (non-CAFA), each plaintiff's "
+            "claim must independently exceed $75,000."
+        ),
+        "question": "Based solely on the rule above, does the federal court have diversity jurisdiction?",
+    },
+    {
+        "name": "diversity_6",
+        "difficulty": 6,
+        "task_type": "binary",
+        "valid_answers": ["Yes", "No"],
+        "rule": (
+            "Under 28 U.S.C. § 1332, diversity jurisdiction (non-CAFA) requires: "
+            "(1) complete diversity: the citizenship of each plaintiff must differ from that "
+            "of each defendant across all claims; and (2) each plaintiff's amount in "
+            "controversy must independently exceed $75,000, unless the claims are so "
+            "intertwined that they constitute a single indivisible harm (permitting "
+            "aggregation). For an LLC or partnership, citizenship is the citizenship of ALL "
+            "members/partners (applied recursively for nested entities). For a corporation, "
+            "citizenship is the state of incorporation plus the principal place of business. "
+            "For a natural person, citizenship is domicile."
+        ),
+        "question": "Based solely on the rule above, does the federal court have diversity jurisdiction?",
+    },
+    {
+        "name": "ucc_v_common_law",
+        "difficulty": 2,
+        "task_type": "binary",
+        "valid_answers": ["UCC", "Common Law"],
+        "text_field": "contract",
+        "rule": (
+            "The Uniform Commercial Code (UCC) Article 2 governs contracts for the SALE OF "
+            "GOODS — tangible, movable personal property. The Common Law of contracts governs "
+            "all other contracts, including contracts for services, real estate, employment, "
+            "and intellectual property. When a contract involves both goods and services "
+            "(a 'mixed' contract), the predominant purpose test applies: if the predominant "
+            "purpose is the sale of goods, UCC applies; if the predominant purpose is services, "
+            "Common Law applies."
+        ),
+        "question": "Based solely on the rule above, does UCC or Common Law govern this contract?",
+    },
+    {
+        "name": "abercrombie",
+        "difficulty": 3,
+        "task_type": "multiclass",
+        "valid_answers": ["generic", "descriptive", "suggestive", "arbitrary", "fanciful"],
+        "rule": (
+            "Under the Abercrombie & Fitch spectrum, trademarks are classified by their "
+            "distinctiveness in relation to the goods/services they identify:\n"
+            "- GENERIC: the common name for the product itself (e.g., 'Apple' for apples). "
+            "Cannot be registered.\n"
+            "- DESCRIPTIVE: directly describes a feature, quality, or characteristic of the "
+            "product (e.g., 'Cold and Creamy' for ice cream). Registrable only with acquired "
+            "distinctiveness (secondary meaning).\n"
+            "- SUGGESTIVE: suggests a quality or characteristic but requires imagination to "
+            "connect to the product (e.g., 'Coppertone' for suntan lotion). Inherently "
+            "distinctive.\n"
+            "- ARBITRARY: a real word used in an unrelated context (e.g., 'Apple' for "
+            "computers). Inherently distinctive.\n"
+            "- FANCIFUL: an invented word with no prior meaning (e.g., 'Kodak', 'Xerox'). "
+            "Highest level of distinctiveness."
+        ),
+        "question": "Based solely on the rule above, how should this mark be classified?",
+    },
+    {
+        "name": "hearsay",
+        "difficulty": 4,
+        "task_type": "binary",
+        "valid_answers": ["Yes", "No"],
+        "rule": (
+            "Under Federal Rule of Evidence 801, hearsay is an out-of-court statement that "
+            "a party offers to prove the TRUTH OF THE MATTER ASSERTED in the statement. "
+            "A 'statement' is an oral assertion, written assertion, or assertive conduct. "
+            "Key exclusions: (1) a statement is NOT hearsay if offered for a purpose other "
+            "than proving its truth (e.g., to show effect on listener, to show knowledge, "
+            "to prove legally operative words, or to show the declarant's state of mind); "
+            "(2) prior inconsistent statements made under oath are not hearsay under FRE "
+            "801(d)(1)(A); (3) admissions by a party-opponent are not hearsay under FRE "
+            "801(d)(2)."
+        ),
+        "question": "Based solely on the rule above, is this evidence hearsay?",
+    },
+    {
+        "name": "telemarketing_sales_rule",
+        "difficulty": 5,
+        "task_type": "binary",
+        "valid_answers": ["Yes", "No"],
+        "rule": (
+            "The FTC Telemarketing Sales Rule (16 C.F.R. § 310) prohibits telemarketers from: "
+            "(1) misrepresenting the total costs or material restrictions of any goods or "
+            "services; (2) misrepresenting any material aspect of the performance or "
+            "efficacy of goods or services; (3) making false or misleading statements to "
+            "induce a charitable contribution; (4) calling any person who has registered "
+            "their phone number on the National Do Not Call Registry, unless the caller has "
+            "an established business relationship with the consumer (defined as a transaction "
+            "within the prior 18 months or an inquiry within the prior 3 months); "
+            "(5) abandoning an outbound telephone call — defined as failing to connect the "
+            "call to a sales representative within 2 seconds of the consumer's greeting."
+        ),
+        "question": "Based solely on the rule above, does this conduct violate the Telemarketing Sales Rule?",
+    },
+]
+def _load_examples(task_name: str, text_field: str = "text") -> list[dict]:
+    """Load examples from HuggingFace LegalBench dataset."""
+    try:
+        from datasets import load_dataset
+        ds = load_dataset(
+            "nguha/legalbench", task_name, split="test", trust_remote_code=True
+        )
+        examples = []
+        for item in ds:
+            text = item.get(text_field) or item.get("text", "")
+            label = str(item.get("answer", "")).strip()
+            if text and label:
+                examples.append({"text": text, "label": label})
+        return examples
+    except Exception:
+        return []
+class LegalBenchDriver(BaseDriver):
+    """
+    Driver for LegalBench (nguha/legalbench).
+    Loads examples from HuggingFace on first access and caches them in memory.
+    Sampling is weighted by inverse difficulty within the driver.
+    """
+    def __init__(self) -> None:
+        self._registry = TASK_REGISTRY
+        self._by_name: dict[str, dict] = {t["name"]: t for t in self._registry}
+        self._weights: list[float] = [1.0 / t["difficulty"] for t in self._registry]
+        self._cache: dict[str, list[dict]] = {}
+    @property
+    def task_names(self) -> list[str]:
+        return list(self._by_name.keys())
+    def sample(
+        self,
+        rng: random.Random,
+        task_name: Optional[str] = None,
+    ) -> Optional[RuleTask]:
+        if task_name is not None:
+            if task_name not in self._by_name:
+                return None
+            task = self._by_name[task_name]
+        else:
+            task = rng.choices(self._registry, weights=self._weights, k=1)[0]
+        examples = self._get_examples(task)
+        if not examples:
+            return None
+        ex = rng.choice(examples)
+        return RuleTask(
+            rule=task["rule"],
+            facts=ex["text"],
+            question=task["question"],
+            valid_answers=task["valid_answers"],
+            task_name=task["name"],
+            difficulty=task["difficulty"],
+            task_type=task["task_type"],
+            correct_answer=ex["label"],
+        )
+    def _get_examples(self, task: dict) -> list[dict]:
+        name = task["name"]
+        if name not in self._cache:
+            self._cache[name] = _load_examples(name, task.get("text_field", "text"))
+        return self._cache[name]
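
Reviewer sketch (not part of the commit): what the inverse-difficulty weighting in `LegalBenchDriver.__init__` implies for the per-task sampling probabilities. The difficulties are copied from TASK_REGISTRY above; the normalisation into `probs` and the simulated draws are additions for illustration only.

```python
import random
from collections import Counter

# Difficulties of the ten registered tasks, as listed in TASK_REGISTRY.
difficulties = {
    "diversity_1": 1, "diversity_2": 2, "diversity_3": 3,
    "diversity_4": 4, "diversity_5": 5, "diversity_6": 6,
    "ucc_v_common_law": 2, "abercrombie": 3, "hearsay": 4,
    "telemarketing_sales_rule": 5,
}

# Same rule as the driver: weight = 1 / difficulty.
weights = {name: 1.0 / d for name, d in difficulties.items()}
total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}

for name, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{name:26s} {p:.3f}")

# Simulate mixed-mode task selection with a seeded RNG.
rng = random.Random(0)
names = list(weights)
draws = Counter(rng.choices(names, weights=[weights[n] for n in names], k=10_000))
```

Under this rule `diversity_1` is drawn six times as often as `diversity_6`, so mixed-mode sampling leans toward the easiest tasks.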

server/drivers/proofwriter.py ADDED

	@@ -0,0 +1,148 @@

+"""
+server/drivers/proofwriter.py
+------------------------------
+ProofWriterDriver — loads tasks from the ProofWriter dataset
+(tasksource/proofwriter on HuggingFace).
+ProofWriter is a large-scale deductive reasoning dataset (585k examples).
+Each example contains a "theory" (natural language facts + rules) and a
+question (a proposition to evaluate as True / False / Unknown).
+Reasoning depth (QDep) maps directly to difficulty:
+    0 → difficulty 1  (direct fact lookup, no inference)
+    1 → difficulty 2  (one inference step)
+    2 → difficulty 3  (two steps)
+    3 → difficulty 4  (three steps)
+    ≥4 → difficulty 5 (deep chain)
+Dataset: https://huggingface.co/datasets/tasksource/proofwriter
+Paper: "ProofWriter: Generating Implications, Proofs, and Abductive
+        Statements over Natural Language" (Tafjord et al., 2021)
+"""
+from __future__ import annotations
+import random
+from typing import Optional
+from ..core.base_driver import BaseDriver, RuleTask
+TASK_NAME = "proofwriter"
+RULE_TEXT = (
+    "You are given a theory consisting of facts and inference rules expressed in "
+    "natural language. Facts state what is directly true about the world. Rules state "
+    "conditional relationships (e.g., 'If someone is X then they are Y'). "
+    "Apply the rules to the facts using deductive reasoning to determine whether the "
+    "given statement is True, False, or Unknown (cannot be determined from the theory)."
+)
+QUESTION_TEMPLATE = "Based solely on the theory above, is the following statement True, False, or Unknown?\n\"{statement}\""
+# Depth → difficulty mapping (capped at 5)
+_DEPTH_TO_DIFFICULTY = {0: 1, 1: 2, 2: 3, 3: 4}
+# We restrict to configs with clean, structured language (exclude NatLang variants
+# which have informal prose and are harder to verify programmatically).
+_VALID_CONFIGS = {"depth-0", "depth-1", "depth-2", "depth-3", "depth-3ext", "depth-5"}
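+def _load_examples() -> tuple[dict[int, list[dict]], int]:
+    """
+    Load ProofWriter examples grouped by difficulty (derived from QDep).
+    Returns (by_difficulty, total), where by_difficulty is
+    {difficulty: [{"facts_rules": str, "statement": str, "answer": str}, ...]}
+    and total is the number of examples kept across all difficulties.
+    """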
+    try:
+        from datasets import load_dataset
+        ds = load_dataset("tasksource/proofwriter", split="train", trust_remote_code=True)
+        by_difficulty: dict[int, list[dict]] = {1: [], 2: [], 3: [], 4: [], 5: []}
+        for ex in ds:
+            if ex["config"] not in _VALID_CONFIGS:
+                continue
+            # Skip Unknown answers for cleaner binary evaluation (True/False only)
+            if ex["answer"] == "Unknown":
+                continue
+            depth = ex.get("QDep", 0)
+            difficulty = _DEPTH_TO_DIFFICULTY.get(depth, 5)
+            by_difficulty[difficulty].append({
+                "facts_rules": ex["theory"].strip(),
+                "statement": ex["question"].strip(),
+                "answer": ex["answer"],   # "True" or "False"
+            })
+        total = sum(len(v) for v in by_difficulty.values())
+        return by_difficulty, total
+    except Exception:
+        return {1: [], 2: [], 3: [], 4: [], 5: []}, 0
+# Task names — one per difficulty level so single-task mode works
+_TASK_NAMES = [f"proofwriter_d{d}" for d in range(1, 6)]
+class ProofWriterDriver(BaseDriver):
+    """
+    Driver for ProofWriter deductive reasoning dataset.
+    Exposes 5 task names (proofwriter_d1 … proofwriter_d5) corresponding to
+    reasoning depths 0–4+. Each task name targets a specific difficulty level.
+    Args:
+        max_per_difficulty: Cap examples per difficulty to limit memory.
+                            Default: 5000 (from ~160k True/False examples total).
+    """
+    def __init__(self, max_per_difficulty: int = 5000) -> None:
+        self._max = max_per_difficulty
+        self._cache: Optional[dict[int, list[dict]]] = None
+        self._total: int = 0
+    @property
+    def task_names(self) -> list[str]:
+        return _TASK_NAMES
+    def _ensure_loaded(self) -> dict[int, list[dict]]:
+        if self._cache is None:
+            self._cache, self._total = _load_examples()
+            # Cap per difficulty
+            for d in self._cache:
+                if len(self._cache[d]) > self._max:
+                    self._cache[d] = self._cache[d][: self._max]
+        return self._cache
+    def sample(
+        self,
+        rng: random.Random,
+        task_name: Optional[str] = None,
+    ) -> Optional[RuleTask]:
+        by_difficulty = self._ensure_loaded()
+        if task_name is not None:
+            if task_name not in _TASK_NAMES:
+                return None
+            # task_name = "proofwriter_d{difficulty}"
+            difficulty = int(task_name[-1])
+            examples = by_difficulty.get(difficulty, [])
+        else:
+            # Mixed: pick a difficulty with probability proportional to its number of cached examples
+            weights = [len(by_difficulty.get(d, [])) for d in range(1, 6)]
+            if not any(weights):
+                return None
+            difficulty = rng.choices(range(1, 6), weights=weights, k=1)[0]
+            examples = by_difficulty.get(difficulty, [])
+        if not examples:
+            return None
+        ex = rng.choice(examples)
+        return RuleTask(
+            rule=RULE_TEXT,
+            facts=ex["facts_rules"],
+            question=QUESTION_TEMPLATE.format(statement=ex["statement"]),
+            valid_answers=["True", "False"],
+            task_name=task_name or f"proofwriter_d{difficulty}",
+            difficulty=difficulty,
+            task_type="binary",
+            correct_answer=ex["answer"],
+        )
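
Reviewer sketch (not part of the commit): the two selection rules in proofwriter.py in isolation. The depth-to-difficulty table is copied from the driver; the bucket sizes are hypothetical and only illustrate that mixed-mode sampling follows bucket size, so an empty bucket is never chosen and larger buckets are chosen more often.

```python
import random
from collections import Counter

# Copied from the driver: QDep 0..3 map to difficulty 1..4, anything deeper to 5.
DEPTH_TO_DIFFICULTY = {0: 1, 1: 2, 2: 3, 3: 4}


def difficulty_for_depth(depth: int) -> int:
    return DEPTH_TO_DIFFICULTY.get(depth, 5)


# Hypothetical bucket sizes after capping (difficulty -> number of cached examples).
bucket_sizes = {1: 5000, 2: 5000, 3: 3000, 4: 1000, 5: 0}

levels = list(range(1, 6))
weights = [bucket_sizes[d] for d in levels]

rng = random.Random(0)
picked = Counter(rng.choices(levels, weights=weights, k=1000))
print(dict(sorted(picked.items())))
```

If harder examples should be favoured instead, the weights would need to be chosen independently of bucket size, for example by using the difficulty level itself as the weight.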

server/drivers/rulebreakers.py ADDED

	@@ -0,0 +1,145 @@

+"""
+server/drivers/rulebreakers.py
+--------------------------------
+RuleBreakersDriver — loads tasks from the RULEBREAKERS dataset
+(jason-c/rulebreakers on HuggingFace, ICML 2025).
+RULEBREAKERS is a 25,600-example benchmark of classical syllogistic reasoning
+covering two inference patterns:
+  - MT  (Modus Tollens):         P→¬Q, Q ⊢ ¬P
+  - DS  (Disjunctive Syllogism): P∨Q, ¬Q ⊢ P
+Each example has two premises and a candidate conclusion. The label indicates
+whether the conclusion validly follows (True) or not (False, "rulebreaker" case).
+The dataset is perfectly balanced: 12,800 MT + 12,800 DS, 50% True / 50% False.
+Difficulty:
+  All examples share the same one-step inference depth → difficulty 2.
+  (Simple but adversarial: the False examples look plausible at first glance.)
+Dataset: https://huggingface.co/datasets/jason-c/rulebreakers
+Paper: "RULEBREAKERS: Challenging LLMs at the Crossroads between Formal Logic
+        and Human-like Reasoning" (ICML 2025)
+"""
+from __future__ import annotations
+import random
+from typing import Optional
+from ..core.base_driver import BaseDriver, RuleTask
+TASK_NAME_MT = "rulebreakers_mt"
+TASK_NAME_DS = "rulebreakers_ds"
+_TASK_NAMES = [TASK_NAME_MT, TASK_NAME_DS]
+_TYPE_TO_TASK = {"mt": TASK_NAME_MT, "ds": TASK_NAME_DS}
+RULE_TEXT_MT = (
+    "You are given two premises. The first is a conditional rule of the form "
+    "\"If P then not Q\". The second states that Q is true. "
+    "Using Modus Tollens, determine whether the given conclusion validly follows "
+    "from these two premises. Answer True if the conclusion follows, False if it does not."
+)
+RULE_TEXT_DS = (
+    "You are given two premises. The first states that either P or Q is true (or both). "
+    "The second states that one of the two options is not true. "
+    "Using Disjunctive Syllogism, determine whether the given conclusion validly follows "
+    "from these two premises. Answer True if the conclusion follows, False if it does not."
+)
+QUESTION_TEMPLATE = (
+    "Given the premises above, does the following conclusion validly follow?\n"
+    "Conclusion: \"{conclusion}\""
+)
+def _load_examples() -> dict[str, list[dict]]:
+    """Load RULEBREAKERS examples grouped by type (mt / ds)."""
+    try:
+        from datasets import load_dataset
+        ds = load_dataset("jason-c/rulebreakers", split="train", trust_remote_code=True)
+        by_type: dict[str, list[dict]] = {"mt": [], "ds": []}
+        for ex in ds:
+            rb_type = ex.get("rulebreaker_type", "")
+            if rb_type not in by_type:
+                continue
+            label = ex.get("label")
+            # label is a Python bool in this dataset
+            if not isinstance(label, bool):
+                continue
+            by_type[rb_type].append({
+                "premise1": ex["premise1"].strip(),
+                "premise2": ex["premise2"].strip(),
+                "conclusion": ex["conclusion"].strip(),
+                "label": "True" if label else "False",
+            })
+        return by_type
+    except Exception:
+        return {"mt": [], "ds": []}
+class RuleBreakersDriver(BaseDriver):
+    """
+    Driver for the RULEBREAKERS syllogistic reasoning dataset.
+    Exposes 2 task names:
+      - rulebreakers_mt : Modus Tollens examples
+      - rulebreakers_ds : Disjunctive Syllogism examples
+    The dataset is adversarial: half the conclusions look valid but are not.
+    It tests whether models can resist superficially plausible but
+    logically invalid inferences.
+    """
+    def __init__(self) -> None:
+        self._cache: Optional[dict[str, list[dict]]] = None
+    @property
+    def task_names(self) -> list[str]:
+        return _TASK_NAMES
+    @property
+    def weight(self) -> float:
+        return 2.0
+    def _ensure_loaded(self) -> dict[str, list[dict]]:
+        if self._cache is None:
+            self._cache = _load_examples()
+        return self._cache
+    def sample(
+        self,
+        rng: random.Random,
+        task_name: Optional[str] = None,
+    ) -> Optional[RuleTask]:
+        by_type = self._ensure_loaded()
+        if task_name is not None:
+            if task_name == TASK_NAME_MT:
+                rb_type = "mt"
+            elif task_name == TASK_NAME_DS:
+                rb_type = "ds"
+            else:
+                return None
+        else:
+            rb_type = rng.choice(["mt", "ds"])
+        examples = by_type.get(rb_type, [])
+        if not examples:
+            return None
+        ex = rng.choice(examples)
+        rule_text = RULE_TEXT_MT if rb_type == "mt" else RULE_TEXT_DS
+        return RuleTask(
+            rule=rule_text,
+            facts=f"{ex['premise1']}\n{ex['premise2']}",
+            question=QUESTION_TEMPLATE.format(conclusion=ex["conclusion"]),
+            valid_answers=["True", "False"],
+            task_name=task_name or _TYPE_TO_TASK[rb_type],
+            difficulty=2,
+            task_type="binary",
+            correct_answer=ex["label"],
+        )
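
Reviewer sketch (not part of the commit): a truth-table check of the two argument forms named in the module docstring, using the same shapes (P→¬Q, Q ⊢ ¬P and P∨Q, ¬Q ⊢ P). The helper `is_valid` is hypothetical. It checks propositional form only and says nothing about how the dataset assigns its True/False labels.

```python
from itertools import product


def is_valid(premises, conclusion) -> bool:
    """A form is valid if the conclusion holds in every row where all premises hold."""
    return all(
        conclusion(p, q)
        for p, q in product([True, False], repeat=2)
        if all(premise(p, q) for premise in premises)
    )


# Modus Tollens as written in the docstring: P -> not Q, Q |- not P
mt_valid = is_valid(
    [lambda p, q: (not p) or (not q), lambda p, q: q],
    lambda p, q: not p,
)

# Disjunctive Syllogism: P or Q, not Q |- P
ds_valid = is_valid(
    [lambda p, q: p or q, lambda p, q: not q],
    lambda p, q: p,
)

# For contrast, an invalid form (affirming the consequent): P -> Q, Q |- P
ac_valid = is_valid(
    [lambda p, q: (not p) or q, lambda p, q: q],
    lambda p, q: p,
)

print(mt_valid, ds_valid, ac_valid)
```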