farffadet committed
Commit c6837f1 · verified · Parent(s): 25f6703

Upload folder using huggingface_hub
Dockerfile CHANGED
@@ -9,12 +9,18 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
 # Copy the environment package (build context = envs/syllogym_env/)
 COPY . ./syllogym_env/
 
+# Copy README to /app/README.md so the web interface can load it
+COPY README.md ./README.md
+
 # Install all dependencies from pyproject.toml
 RUN pip install --no-cache-dir -e ./syllogym_env/
 
 ENV PYTHONUNBUFFERED=1
 ENV PYTHONPATH="/app:$PYTHONPATH"
 
+# Enable the OpenEnv built-in web interface (served at /web)
+ENV ENABLE_WEB_INTERFACE=true
+
 EXPOSE 7860
 
 HEALTHCHECK --interval=30s --timeout=3s --start-period=30s --retries=5 \
README.md CHANGED
@@ -1,10 +1,91 @@
 ---
-title: Syllogym Env
-emoji: 🐢
-colorFrom: blue
-colorTo: yellow
+title: SylloGym — Deductive Reasoning Environment
 sdk: docker
-pinned: false
+app_port: 7860
+base_path: /web
+colorFrom: blue
+colorTo: green
+tags:
+- openenv
+- reasoning
+- rlvr
+- legalbench
+- grpo
+- deductive-reasoning
+- first-order-logic
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# SylloGym Deductive Reasoning Environment
+
+SylloGym is a multi-dataset deductive reasoning environment for training LLMs via reinforcement learning. The model receives a rule, a set of facts, and a question — and must apply strict logical deduction to reach the correct conclusion.
+
+## How to interact
+
+1. Click **Reset** to get a new reasoning problem
+2. Fill in the **reasoning** field with your chain-of-thought (wrap in `<reasoning>...</reasoning>`)
+3. Fill in the **answer** field with your final answer (wrap in `<answer>...</answer>`)
+4. Click **Step** to submit and see your reward
+
+**Example:**
+```
+reasoning: <reasoning>The rule states that if a contract contains a confidentiality clause, it is an NDA. The facts show a confidentiality clause is present. Therefore, this is an NDA.</reasoning>
+answer: <answer>Yes</answer>
+```
+
+## Reward breakdown
+
+| Component | Points | Condition |
+|-----------|--------|-----------|
+| Format | +0.1 | Both `<reasoning>` and `<answer>` tags present |
+| Answer | +1.0 | Correct answer (exact match, case-insensitive) |
+| Reasoning | +0.2 | Reasoning references ≥2 keywords from both the rule and the facts |
+| **Max** | **1.3** | |
+
+## Datasets & Tasks
+
+SylloGym covers **6 datasets** and **20 tasks** across different reasoning domains:
+
+| Dataset | Tasks | Type | Difficulty |
+|---------|-------|------|------------|
+| **LegalBench** | `diversity_1–6`, `ucc_v_common_law`, `abercrombie`, `hearsay`, `telemarketing_sales_rule` | Yes/No or multi-class | 1–6 |
+| **Knights & Knaves** | `knights_knaves` | Who is knight/knave? | 1–3 |
+| **ProofWriter** | `proofwriter_d1–d5` | True/False | 1–5 |
+| **FOLIO** | `folio` | True/False/Uncertain | 2–5 |
+| **RuleBreakers** | `rulebreakers_mt`, `rulebreakers_ds` | True/False | 2 |
+| **FOL-NLI** | `fol_nli` | entailment/contradiction/neutral | 2–5 |
+
+## Reset options
+
+Pass optional parameters to `reset()` to control task selection:
+
+- `task_mode="mixed"` (default) — weighted random across all datasets
+- `task_mode="single"` + `task_name="hearsay"` — restrict to one task
+- `seed=42` — reproducible sampling
+
+## Connect from code
+
+```python
+from syllogym_env import SylloGymEnv, SylloAction
+
+env = SylloGymEnv(base_url="https://huggingface.co/spaces/eliot-gtn/syllogym-env")
+env.connect()
+
+result = env.reset(task_mode="mixed")
+obs = result.observation
+print(obs.rule)
+print(obs.facts)
+print(obs.question)
+
+result = env.step(SylloAction(
+    reasoning="<reasoning>Applying the rule to the facts...</reasoning>",
+    answer=f"<answer>{obs.valid_answers[0]}</answer>",
+))
+print(f"Reward: {result.reward}")
+env.disconnect()
+```
+
+## About
+
+**Competition:** [OpenEnv Challenge](https://huggingface.co/openenv) by Meta PyTorch × HuggingFace × Unsloth
+**Training:** GRPO (Group Relative Policy Optimization) via TRL + Unsloth
+**Base model:** Qwen2.5-3B-Instruct
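The `reset()` options above compose with the client snippet; a minimal continuation (a sketch, assuming the `env` object from "Connect from code" is still connected and the server exposes the documented reset parameters):

```python
# Restrict sampling to a single task and fix the seed so repeated
# resets draw reproducible episodes (see "Reset options" above).
result = env.reset(task_mode="single", task_name="hearsay", seed=42)
obs = result.observation
print(obs.task_name, obs.difficulty)  # expected: hearsay 4
```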
openenv.yaml CHANGED
@@ -3,15 +3,17 @@ name: syllogym_env
 type: space
 runtime: fastapi
 app: server.app:app
-port: 8000
+port: 7860
 description: >
-  SylloGym: Legal Syllogistic Reasoning Environment.
-  Trains LLMs to apply deductive reasoning on LegalBench tasks.
-  The model receives a legal rule + case facts and must derive the correct
-  legal conclusion (Yes/No or multi-class).
+  SylloGym: Multi-dataset Deductive Reasoning Environment.
+  Trains LLMs to apply strict logical deduction across 6 datasets and 20 tasks:
+  LegalBench, Knights & Knaves, ProofWriter, FOLIO, RuleBreakers, FOL-NLI.
+  The model receives premises/rules and must derive the correct conclusion.
 tags:
-- legal
+- openenv
 - reasoning
 - rlvr
 - legalbench
 - grpo
+- deductive-reasoning
+- first-order-logic
pyproject.toml CHANGED
@@ -21,5 +21,17 @@ server = "syllogym_env.server.app:main"
 
 [tool.setuptools]
 include-package-data = true
-packages = ["syllogym_env", "syllogym_env.server"]
-package-dir = {"syllogym_env" = ".", "syllogym_env.server" = "server"}
+packages = [
+    "syllogym_env",
+    "syllogym_env.server",
+    "syllogym_env.server.core",
+    "syllogym_env.server.drivers",
+]
+
+# TOML inline tables may not span multiple lines, so the package-dir
+# mapping is expressed as a standard table.
+[tool.setuptools.package-dir]
+"syllogym_env" = "."
+"syllogym_env.server" = "server"
+"syllogym_env.server.core" = "server/core"
+"syllogym_env.server.drivers" = "server/drivers"
server/Dockerfile ADDED
@@ -0,0 +1,29 @@
+FROM python:3.11-slim
+
+WORKDIR /app
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git curl \
+    && rm -rf /var/lib/apt/lists/*
+
+# Copy the environment package (build context = envs/syllogym_env/)
+COPY . ./syllogym_env/
+
+# Copy README to /app/README.md so the web interface can load it
+COPY README.md ./README.md
+
+# Install all dependencies from pyproject.toml
+RUN pip install --no-cache-dir -e ./syllogym_env/
+
+ENV PYTHONUNBUFFERED=1
+ENV PYTHONPATH="/app:$PYTHONPATH"
+
+# Enable the OpenEnv built-in web interface (served at /web)
+ENV ENABLE_WEB_INTERFACE=true
+
+EXPOSE 7860
+
+HEALTHCHECK --interval=30s --timeout=3s --start-period=30s --retries=5 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/health')" || exit 1
+
+CMD ["uvicorn", "syllogym_env.server.app:app", "--host", "0.0.0.0", "--port", "7860"]
server/__init__.py CHANGED
@@ -1,5 +1,5 @@
 """SylloGym environment server components."""
 
-from .syllogym_environment import SylloGymEnvironment
+from .core.environment import SylloGymEnvironment
 
 __all__ = ["SylloGymEnvironment"]
server/app.py CHANGED
@@ -12,11 +12,11 @@ Usage:
 try:
     from openenv.core.env_server import create_app
     from syllogym_env.models import SylloAction, SylloObservation
-    from syllogym_env.server.syllogym_environment import SylloGymEnvironment
+    from syllogym_env.server.core.environment import SylloGymEnvironment
 except ImportError:
     from openenv.core.env_server import create_app
     from ..models import SylloAction, SylloObservation
-    from .syllogym_environment import SylloGymEnvironment
+    from .core.environment import SylloGymEnvironment
 
 app = create_app(
     SylloGymEnvironment,
server/core/__init__.py ADDED
@@ -0,0 +1,6 @@
+"""SylloGym core abstractions."""
+
+from .base_driver import BaseDriver, RuleTask
+from .reward import compute_reward
+
+__all__ = ["BaseDriver", "RuleTask", "compute_reward"]
server/core/base_driver.py ADDED
@@ -0,0 +1,110 @@
+"""
+server/core/base_driver.py
+--------------------------
+RuleTask — the shared data contract between drivers and the environment.
+BaseDriver — the interface every dataset driver must implement.
+"""
+
+from __future__ import annotations
+
+import random
+from abc import ABC, abstractmethod
+from dataclasses import dataclass
+
+
+@dataclass
+class RuleTask:
+    """
+    A fully resolved reasoning problem, ready to present to a model.
+
+    Drivers produce RuleTask objects; the environment consumes them.
+    All fields map directly to SylloObservation fields.
+
+    Fields shown to the model:
+        rule          — the explicit rule or principle to apply
+        facts         — the case / scenario to reason about
+        question      — what the model must answer
+        valid_answers — accepted answer strings (case-insensitive match)
+
+    Metadata:
+        task_name  — unique identifier for this task type
+        difficulty — 1 (easiest) to N (hardest); used for curriculum sampling
+        task_type  — "binary" (Yes/No or two-class) | "multiclass"
+
+    Ground truth (server-side only, not shown to model):
+        correct_answer — the expected answer string
+    """
+
+    rule: str
+    facts: str
+    question: str
+    valid_answers: list[str]
+    task_name: str
+    difficulty: int
+    task_type: str  # "binary" | "multiclass"
+    correct_answer: str
+
+
+class BaseDriver(ABC):
+    """
+    Interface for dataset drivers.
+
+    A driver is responsible for:
+      - Maintaining its own example pool or procedural generator
+      - Sampling a single RuleTask on demand via sample()
+      - Reporting which task names it owns via task_names
+
+    The environment owns the random.Random instance and passes it to sample()
+    so that episode reproducibility is controlled by a single seed.
+
+    Minimal implementation example:
+        class MyDriver(BaseDriver):
+            @property
+            def task_names(self) -> list[str]:
+                return ["my_task"]
+
+            def sample(self, rng, task_name=None) -> RuleTask | None:
+                if task_name is not None and task_name != "my_task":
+                    return None
+                return RuleTask(
+                    rule="...", facts="...", question="...",
+                    valid_answers=["Yes", "No"],
+                    task_name="my_task", difficulty=1,
+                    task_type="binary", correct_answer="Yes",
+                )
+    """
+
+    @property
+    @abstractmethod
+    def task_names(self) -> list[str]:
+        """All task names this driver can produce. Used for single-task routing."""
+        ...
+
+    @abstractmethod
+    def sample(
+        self,
+        rng: random.Random,
+        task_name: str | None = None,
+    ) -> RuleTask | None:
+        """
+        Sample one RuleTask.
+
+        Args:
+            rng: Shared Random instance from the environment.
+            task_name: If provided, restrict to this task.
+                Return None if this driver does not own that task_name.
+
+        Returns:
+            A RuleTask, or None if no examples are available
+            (dataset not loaded, generation failed, task not owned, etc.).
+        """
+        ...
+
+    @property
+    def weight(self) -> float:
+        """
+        Relative sampling weight for this driver in mixed mode.
+        Override to up- or down-weight a driver relative to others.
+        Default: 1.0 per task_name (so total weight scales with number of tasks).
+        """
+        return float(len(self.task_names))
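The docstring's minimal driver works as written once imports are added; here is a runnable version (a sketch: `CoinFlipDriver` and its toy rule text are illustrative, not part of the package):

```python
import random

from syllogym_env.server.core.base_driver import BaseDriver, RuleTask


class CoinFlipDriver(BaseDriver):
    """Toy driver: one binary task whose answer is sampled per episode."""

    @property
    def task_names(self) -> list[str]:
        return ["coin_flip"]

    def sample(self, rng: random.Random, task_name=None) -> RuleTask | None:
        if task_name is not None and task_name != "coin_flip":
            return None  # single-task routing: not our task
        answer = rng.choice(["Yes", "No"])
        return RuleTask(
            rule="Answer exactly what the facts state.",
            facts=f"The stated answer is {answer}.",
            question="Is the stated answer Yes?",
            valid_answers=["Yes", "No"],
            task_name="coin_flip",
            difficulty=1,
            task_type="binary",
            correct_answer=answer,
        )


driver = CoinFlipDriver()
task = driver.sample(random.Random(0))
print(task.question, "->", task.correct_answer)
print(driver.weight)  # 1.0 (one task name, so the default weight is 1.0)
```

Because the environment owns the `random.Random` and passes it into `sample()`, a driver written this way is reproducible for free.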
server/core/environment.py ADDED
@@ -0,0 +1,233 @@
+"""
+server/core/environment.py
+--------------------------
+SylloGymEnvironment — driver-aware deductive reasoning environment.
+
+The environment is dataset-agnostic. It delegates task sampling to registered
+BaseDriver instances and uses the shared reward functions from core.reward.
+
+Each episode is a single step:
+    reset()           → SylloObservation (rule + facts + question)
+    step(SylloAction) → SylloObservation (with reward + done=True)
+"""
+
+from __future__ import annotations
+
+import random
+import uuid
+from typing import Any, Optional
+
+from openenv.core.env_server.interfaces import Action, Environment, Observation
+
+try:
+    from ...models import SylloAction, SylloObservation, SylloState
+except ImportError:
+    from syllogym_env.models import SylloAction, SylloObservation, SylloState
+
+from .base_driver import BaseDriver, RuleTask
+from .reward import compute_reward
+from ..drivers.legalbench import LegalBenchDriver
+from ..drivers.knights_knaves import KnightsKnavesDriver
+from ..drivers.proofwriter import ProofWriterDriver
+from ..drivers.folio import FOLIODriver
+from ..drivers.rulebreakers import RuleBreakersDriver
+from ..drivers.fol_nli import FOLNLIDriver
+
+
+def _default_drivers() -> list[BaseDriver]:
+    return [
+        LegalBenchDriver(),
+        KnightsKnavesDriver(),
+        ProofWriterDriver(),
+        FOLIODriver(),
+        RuleBreakersDriver(),
+        FOLNLIDriver(),
+    ]
+
+
+class SylloGymEnvironment(Environment):
+    """
+    SylloGym: Multi-dataset Deductive Reasoning Environment.
+
+    Trains LLMs to apply deductive (syllogistic) reasoning across domains.
+    The model receives a rule + facts and must derive the correct conclusion.
+
+    Drivers provide the actual tasks. The default set registers all six built-in
+    drivers; additional drivers can be registered at construction time.
+
+    Args:
+        task_mode: Sampling strategy.
+            "mixed"  — weighted random across all drivers and tasks.
+            "single" — restrict to one specific task_name.
+        task_name: When task_mode="single", the specific task to use.
+        seed: Optional random seed for reproducibility.
+        drivers: List of BaseDriver instances. Defaults to the six built-in
+            drivers returned by _default_drivers().
+
+    Example:
+        >>> env = SylloGymEnvironment(task_mode="mixed")
+        >>> obs = env.reset()
+        >>> action = SylloAction(
+        ...     reasoning="<reasoning>The rule states...</reasoning>",
+        ...     answer="<answer>Yes</answer>"
+        ... )
+        >>> result = env.step(action)
+        >>> print(result.reward)  # 0.0 to 1.3
+
+        # Single-task mode (K&K only):
+        >>> env = SylloGymEnvironment(task_mode="single", task_name="knights_knaves")
+    """
+
+    def __init__(
+        self,
+        task_mode: str = "mixed",
+        task_name: Optional[str] = None,
+        seed: Optional[int] = None,
+        drivers: Optional[list[BaseDriver]] = None,
+    ):
+        self._task_mode = task_mode
+        self._task_name = task_name
+        self._rng = random.Random(seed)
+        self._drivers: list[BaseDriver] = drivers if drivers is not None else _default_drivers()
+
+        # Build task_name → driver lookup for O(1) single-task routing
+        self._task_to_driver: dict[str, BaseDriver] = {}
+        for driver in self._drivers:
+            for name in driver.task_names:
+                self._task_to_driver[name] = driver
+
+        self._state = SylloState(
+            episode_id=str(uuid.uuid4()),
+            task_mode=task_mode,
+        )
+        self._current_task: Optional[RuleTask] = None
+
+    # Public property for callers that need the full task list (e.g. eval callbacks)
+    @property
+    def task_registry(self) -> list[dict]:
+        """Return all tasks across all drivers as a list of {name, difficulty} dicts."""
+        tasks = []
+        for driver in self._drivers:
+            for name in driver.task_names:
+                # Ask the driver for a sample to get difficulty; use a placeholder rng
+                sample = driver.sample(random.Random(0), task_name=name)
+                tasks.append({
+                    "name": name,
+                    "difficulty": sample.difficulty if sample else 1,
+                })
+        return tasks
+
+    def _sample_task(self) -> Optional[RuleTask]:
+        """Sample one RuleTask from the appropriate driver."""
+        if self._task_mode == "single" and self._task_name:
+            driver = self._task_to_driver.get(self._task_name)
+            if driver is None:
+                return None
+            return driver.sample(self._rng, task_name=self._task_name)
+
+        # "mixed": weighted selection across drivers, then delegate internally
+        weights = [d.weight for d in self._drivers]
+        driver = self._rng.choices(self._drivers, weights=weights, k=1)[0]
+        return driver.sample(self._rng)
+
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        episode_id: Optional[str] = None,
+        task_mode: Optional[str] = None,
+        task_name: Optional[str] = None,
+        **kwargs: Any,
+    ) -> Observation:
+        if seed is not None:
+            self._rng = random.Random(seed)
+        if task_mode is not None:
+            self._task_mode = task_mode
+        if task_name is not None:
+            self._task_name = task_name
+
+        self._state = SylloState(
+            episode_id=episode_id or str(uuid.uuid4()),
+            task_mode=self._task_mode,
+            task_name=self._task_name or "",
+            total_correct=self._state.total_correct,
+            total_steps=self._state.total_steps,
+        )
+
+        self._current_task = self._sample_task()
+
+        if self._current_task is None:
+            return SylloObservation(
+                facts="[Dataset unavailable — check internet connection and datasets library]",
+                reward=None,
+                done=False,
+            )
+
+        t = self._current_task
+        return SylloObservation(
+            rule=t.rule,
+            facts=t.facts,
+            question=t.question,
+            task_type=t.task_type,
+            valid_answers=t.valid_answers,
+            task_name=t.task_name,
+            difficulty=t.difficulty,
+            correct_answer=t.correct_answer,
+            reward=None,
+            done=False,
+        )
+
+    def step(self, action: Action, **kwargs: Any) -> Observation:
+        if self._current_task is None:
+            return SylloObservation(
+                reward=0.0,
+                done=True,
+                metadata={"error": "Environment not initialized. Call reset() first."},
+            )
+
+        if not isinstance(action, SylloAction):
+            try:
+                action = SylloAction(
+                    reasoning=getattr(action, "reasoning", ""),
+                    answer=getattr(action, "answer", ""),
+                )
+            except Exception:
+                return SylloObservation(reward=0.0, done=True)
+
+        t = self._current_task
+
+        total, breakdown = compute_reward(
+            reasoning=action.reasoning,
+            answer=action.answer,
+            correct_answer=t.correct_answer,
+            valid_answers=t.valid_answers,
+            rule=t.rule,
+            facts=t.facts,
+        )
+
+        self._state.step_count += 1
+        self._state.total_steps += 1
+        if total >= 1.0:
+            self._state.total_correct += 1
+
+        return SylloObservation(
+            rule=t.rule,
+            facts=t.facts,
+            question=t.question,
+            task_type=t.task_type,
+            valid_answers=t.valid_answers,
+            task_name=t.task_name,
+            difficulty=t.difficulty,
+            correct_answer=t.correct_answer,
+            reward=total,
+            done=True,
+            metadata={
+                "predicted_answer": action.answer,
+                "correct_answer": t.correct_answer,
+                "format_reward": breakdown["format"],
+                "answer_reward": breakdown["answer"],
+                "reasoning_reward": breakdown["reasoning"],
+            },
+        )
+
+    @property
+    def state(self) -> SylloState:
+        return self._state
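Since `reset(seed=...)` rebuilds the environment's `random.Random`, identical seeds and task settings yield identical episodes; a quick check (a sketch using the procedural Knights & Knaves task, which needs no dataset download):

```python
from syllogym_env.server.core.environment import SylloGymEnvironment

env_a = SylloGymEnvironment(task_mode="single", task_name="knights_knaves")
env_b = SylloGymEnvironment(task_mode="single", task_name="knights_knaves")

obs_a = env_a.reset(seed=123)
obs_b = env_b.reset(seed=123)
assert obs_a.facts == obs_b.facts  # same seed, same puzzle
print(obs_a.question)
```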
server/core/reward.py ADDED
@@ -0,0 +1,78 @@
+"""
+server/core/reward.py
+---------------------
+Reward functions for SylloGym.
+
+Composite reward structure (max 1.3):
+    +0.1  format reward     — <reasoning> and <answer> tags present
+    +1.0  answer reward     — correct answer (exact match, case-insensitive)
+    +0.2  reasoning quality — reasoning references keywords from rule AND facts
+"""
+
+from __future__ import annotations
+
+import re
+
+_STOPWORDS = {
+    "under", "shall", "must", "with", "that", "this", "from", "into",
+    "have", "been", "were", "they", "their", "there", "when", "which",
+}
+
+
+def check_format(reasoning: str, answer: str) -> float:
+    """Return 0.1 if both <reasoning>...</reasoning> and <answer>...</answer> tags are present."""
+    combined = reasoning + answer
+    has_reasoning = bool(re.search(r"<reasoning>.*?</reasoning>", combined, re.DOTALL | re.IGNORECASE))
+    has_answer = bool(re.search(r"<answer>.*?</answer>", combined, re.DOTALL | re.IGNORECASE))
+    return 0.1 if (has_reasoning and has_answer) else 0.0
+
+
+def check_answer(predicted: str, correct: str, valid_answers: list[str]) -> float:
+    """Return 1.0 for exact match (case-insensitive). Extracts from <answer> tags if present."""
+    tag_match = re.search(r"<answer>(.*?)</answer>", predicted, re.DOTALL | re.IGNORECASE)
+    pred_clean = tag_match.group(1).strip().lower() if tag_match else predicted.strip().lower()
+    return 1.0 if pred_clean == correct.strip().lower() else 0.0
+
+
+def check_reasoning_quality(reasoning: str, rule: str, facts: str) -> float:
+    """
+    Return 0.2 if the reasoning references ≥2 significant keywords from both rule and facts.
+    Heuristic for genuine deductive reasoning vs pattern-matching.
+    """
+    if not reasoning:
+        return 0.0
+    reasoning_lower = reasoning.lower()
+    rule_words = {w for w in re.findall(r"\b[a-z]{5,}\b", rule.lower()) if w not in _STOPWORDS}
+    facts_words = {w for w in re.findall(r"\b[a-z]{5,}\b", facts.lower()) if w not in _STOPWORDS}
+    rule_hits = sum(1 for w in rule_words if w in reasoning_lower)
+    facts_hits = sum(1 for w in facts_words if w in reasoning_lower)
+    return 0.2 if (rule_hits >= 2 and facts_hits >= 2) else 0.0
+
+
+def compute_reward(
+    reasoning: str,
+    answer: str,
+    correct_answer: str,
+    valid_answers: list[str],
+    rule: str,
+    facts: str,
+) -> tuple[float, dict[str, float]]:
+    """
+    Compute the composite reward and return (total, breakdown).
+
+    Args:
+        reasoning: The model's reasoning text (may include <reasoning> tags).
+        answer: The model's answer text (may include <answer> tags).
+        correct_answer: Ground truth answer string.
+        valid_answers: Accepted answer strings (kept for API symmetry; unused by the exact-match check).
+        rule: The rule text shown in the prompt.
+        facts: The facts text shown in the prompt.
+
+    Returns:
+        (total_reward, {"format": float, "answer": float, "reasoning": float})
+    """
+    fmt = check_format(reasoning, answer)
+    ans = check_answer(answer, correct_answer, valid_answers)
+    rsn = check_reasoning_quality(reasoning, rule, facts)
+    total = round(fmt + ans + rsn, 4)
+    return total, {"format": fmt, "answer": ans, "reasoning": rsn}
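A worked example of the breakdown (the rule/facts strings are toy inputs chosen so the keyword heuristic fires; the printed values follow from the three checks above):

```python
from syllogym_env.server.core.reward import compute_reward

total, breakdown = compute_reward(
    reasoning=(
        "<reasoning>The contract involves confidentiality obligations, "
        "and the agreement contains them, so it is protected.</reasoning>"
    ),
    answer="<answer>Yes</answer>",
    correct_answer="Yes",
    valid_answers=["Yes", "No"],
    rule="If a contract involves confidentiality obligations, classify it as protected.",
    facts="The signed agreement contains strict confidentiality obligations.",
)
# Tags present (+0.1), answer matches (+1.0), and the reasoning reuses
# >=2 keywords from both rule and facts (+0.2).
print(total)      # 1.3
print(breakdown)  # {'format': 0.1, 'answer': 1.0, 'reasoning': 0.2}
```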
server/drivers/__init__.py ADDED
@@ -0,0 +1,17 @@
+"""SylloGym dataset drivers."""
+
+from .legalbench import LegalBenchDriver
+from .knights_knaves import KnightsKnavesDriver
+from .proofwriter import ProofWriterDriver
+from .folio import FOLIODriver
+from .rulebreakers import RuleBreakersDriver
+from .fol_nli import FOLNLIDriver
+
+__all__ = [
+    "LegalBenchDriver",
+    "KnightsKnavesDriver",
+    "ProofWriterDriver",
+    "FOLIODriver",
+    "RuleBreakersDriver",
+    "FOLNLIDriver",
+]
server/drivers/fol_nli.py ADDED
@@ -0,0 +1,151 @@
+"""
+server/drivers/fol_nli.py
+--------------------------
+FOLNLIDriver — loads tasks from the FOL-NLI dataset
+(tasksource/FOL-nli on HuggingFace).
+
+FOL-NLI (First-Order Logic Natural Language Inference) is an 82,219-example
+dataset of NLI problems grounded in formal first-order logic. Each example
+consists of a set of premises (natural language rules + facts) and a hypothesis
+to evaluate. Labels are theorem-prover verified:
+    - entailment   : hypothesis follows from premises
+    - contradiction: hypothesis is inconsistent with premises
+    - neutral      : cannot be determined from premises alone
+
+Unlike SNLI/MultiNLI (linguistic NLI), FOL-NLI is derived from formal proofs,
+making it a rigorous test of deductive reasoning rather than linguistic intuition.
+
+Difficulty is estimated from rule_concentration (density of logical rules):
+    rule_concentration < 0.05 → difficulty 2 (few rules, mostly facts)
+    0.05 ≤ rc < 0.12          → difficulty 3
+    0.12 ≤ rc < 0.20          → difficulty 4
+    rc ≥ 0.20                 → difficulty 5 (rule-heavy, deepest inference)
+
+Dataset: https://huggingface.co/datasets/tasksource/FOL-nli
+"""
+
+from __future__ import annotations
+
+import random
+from typing import Optional
+
+from ..core.base_driver import BaseDriver, RuleTask
+
+
+TASK_NAME = "fol_nli"
+
+RULE_TEXT = (
+    "You are given a set of premises expressed in natural language. "
+    "Some premises are facts (direct statements about the world); others are "
+    "logical rules (conditional or universal statements). "
+    "Using only these premises and strict logical deduction, determine whether "
+    "the hypothesis is:\n"
+    "  - entailment   : it necessarily follows from the premises\n"
+    "  - contradiction: it is inconsistent with the premises\n"
+    "  - neutral      : it cannot be determined from the premises alone"
+)
+
+QUESTION_TEMPLATE = (
+    'Based solely on the premises above, what is the relationship to the '
+    'following hypothesis?\nHypothesis: "{hypothesis}"'
+)
+
+_VALID_LABELS = {"entailment", "contradiction", "neutral"}
+
+
+def _difficulty_from_rule_concentration(rc: float) -> int:
+    if rc < 0.05:
+        return 2
+    if rc < 0.12:
+        return 3
+    if rc < 0.20:
+        return 4
+    return 5
+
+
+def _load_examples() -> list[dict]:
+    """Load all FOL-NLI examples (train split only — large enough)."""
+    try:
+        from datasets import load_dataset
+        ds = load_dataset("tasksource/FOL-nli", split="train", trust_remote_code=True)
+        examples = []
+        for ex in ds:
+            label = ex.get("label", "")
+            if label not in _VALID_LABELS:
+                continue
+            premise = ex.get("premise", "").strip()
+            hypothesis = ex.get("hypothesis", "").strip()
+            if not premise or not hypothesis:
+                continue
+            rc = float(ex.get("rule_concentration") or 0.0)
+            examples.append({
+                "premise": premise,
+                "hypothesis": hypothesis,
+                "label": label,
+                "difficulty": _difficulty_from_rule_concentration(rc),
+            })
+        return examples
+    except Exception:
+        return []
+
+
+class FOLNLIDriver(BaseDriver):
+    """
+    Driver for the FOL-NLI formal logic NLI dataset.
+
+    Single task name "fol_nli" covering all 82K examples.
+    Three-class: entailment / contradiction / neutral.
+    Difficulty 2–5 based on rule_concentration.
+
+    This driver provides the largest pure-logic reasoning dataset in SylloGym,
+    with theorem-prover verified labels — no annotation noise.
+    """
+
+    def __init__(self, max_examples: int = 20000) -> None:
+        self._max = max_examples
+        self._cache: Optional[list[dict]] = None
+
+    @property
+    def task_names(self) -> list[str]:
+        return [TASK_NAME]
+
+    @property
+    def weight(self) -> float:
+        # Large and high-quality — weight between ProofWriter (5) and LegalBench (10)
+        return 4.0
+
+    def _ensure_loaded(self) -> list[dict]:
+        if self._cache is None:
+            examples = _load_examples()
+            if len(examples) > self._max:
+                # Uniform random subsample with a fixed seed
+                rng = random.Random(42)
+                rng.shuffle(examples)
+                examples = examples[: self._max]
+            self._cache = examples
+        return self._cache
+
+    def sample(
+        self,
+        rng: random.Random,
+        task_name: Optional[str] = None,
+    ) -> Optional[RuleTask]:
+        if task_name is not None and task_name != TASK_NAME:
+            return None
+
+        examples = self._ensure_loaded()
+        if not examples:
+            return None
+
+        ex = rng.choice(examples)
+
+        return RuleTask(
+            rule=RULE_TEXT,
+            facts=ex["premise"],
+            question=QUESTION_TEMPLATE.format(hypothesis=ex["hypothesis"]),
+            valid_answers=["entailment", "contradiction", "neutral"],
+            task_name=TASK_NAME,
+            difficulty=ex["difficulty"],
+            task_type="multiclass",
+            correct_answer=ex["label"],
+        )
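The rule_concentration thresholds are easy to sanity-check without downloading the dataset (one probe per difficulty band):

```python
from syllogym_env.server.drivers.fol_nli import _difficulty_from_rule_concentration

for rc in (0.0, 0.08, 0.15, 0.30):
    print(rc, "->", _difficulty_from_rule_concentration(rc))
# 0.0 -> 2, 0.08 -> 3, 0.15 -> 4, 0.30 -> 5
```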
server/drivers/folio.py ADDED
@@ -0,0 +1,141 @@
+"""
+server/drivers/folio.py
+------------------------
+FOLIODriver — loads tasks from the FOLIO dataset
+(tasksource/folio on HuggingFace, open-access mirror of yale-nlp/FOLIO).
+
+FOLIO (First-Order Logic Inference on Natural Language) is an expert-written
+dataset of 1,204 examples requiring first-order logic reasoning over natural
+language premises. Unlike ProofWriter (synthetic), FOLIO uses real-world
+knowledge and genuinely complex multi-step inference.
+
+Each example:
+    premises   — a set of natural language statements (the theory)
+    conclusion — a statement to evaluate
+    label      — "True", "False", or "Uncertain"
+
+Difficulty is estimated from the number of premises (proxy for reasoning depth):
+    ≤3 premises  → difficulty 2
+    4–5 premises → difficulty 3
+    6–7 premises → difficulty 4
+    ≥8 premises  → difficulty 5
+
+Dataset: https://huggingface.co/datasets/tasksource/folio
+Paper: "FOLIO: Natural Language Reasoning with First-Order Logic"
+       (Han et al., EMNLP 2022)
+"""
+
+from __future__ import annotations
+
+import random
+from typing import Optional
+
+from ..core.base_driver import BaseDriver, RuleTask
+
+
+TASK_NAME = "folio"
+
+RULE_TEXT = (
+    "You are given a set of premises (statements that are true). "
+    "Using only these premises and logical deduction, determine whether "
+    "the conclusion is True, False, or Uncertain (cannot be determined "
+    "from the premises alone)."
+)
+
+QUESTION_TEMPLATE = (
+    'Based solely on the premises above, is the following conclusion '
+    'True, False, or Uncertain?\n"{conclusion}"'
+)
+
+
+def _difficulty_from_n_premises(n: int) -> int:
+    if n <= 3:
+        return 2
+    if n <= 5:
+        return 3
+    if n <= 7:
+        return 4
+    return 5
+
+
+def _load_examples() -> list[dict]:
+    """Load all FOLIO examples (train + validation splits)."""
+    try:
+        from datasets import load_dataset
+        examples = []
+        for split in ("train", "validation"):
+            try:
+                ds = load_dataset("tasksource/folio", split=split, trust_remote_code=True)
+                for ex in ds:
+                    premises = ex["premises"].strip()
+                    conclusion = ex["conclusion"].strip()
+                    label = ex["label"].strip()  # "True" | "False" | "Uncertain"
+                    if not premises or not conclusion or label not in ("True", "False", "Uncertain"):
+                        continue
+                    n_premises = len([p for p in premises.split("\n") if p.strip()])
+                    examples.append({
+                        "premises": premises,
+                        "conclusion": conclusion,
+                        "label": label,
+                        "difficulty": _difficulty_from_n_premises(n_premises),
+                    })
+            except Exception:
+                continue
+        return examples
+    except Exception:
+        return []
+
+
+class FOLIODriver(BaseDriver):
+    """
+    Driver for the FOLIO first-order logic reasoning dataset.
+
+    Single task name "folio" covering all examples (True / False / Uncertain).
+    Difficulty ranges from 2 to 5 based on number of premises.
+
+    The dataset is small (~1200 examples total) and expert-written, making it
+    qualitatively different from ProofWriter (synthetic) and LegalBench (legal domain).
+    It is best used as a hard evaluation set or mixed in at low weight.
+    """
+
+    def __init__(self) -> None:
+        self._cache: Optional[list[dict]] = None
+
+    @property
+    def task_names(self) -> list[str]:
+        return [TASK_NAME]
+
+    @property
+    def weight(self) -> float:
+        # Small dataset — down-weight relative to ProofWriter (weight=5) and LegalBench (weight=10)
+        return 1.0
+
+    def _ensure_loaded(self) -> list[dict]:
+        if self._cache is None:
+            self._cache = _load_examples()
+        return self._cache
+
+    def sample(
+        self,
+        rng: random.Random,
+        task_name: Optional[str] = None,
+    ) -> Optional[RuleTask]:
+        if task_name is not None and task_name != TASK_NAME:
+            return None
+
+        examples = self._ensure_loaded()
+        if not examples:
+            return None
+
+        ex = rng.choice(examples)
+
+        return RuleTask(
+            rule=RULE_TEXT,
+            facts=ex["premises"],
+            question=QUESTION_TEMPLATE.format(conclusion=ex["conclusion"]),
+            valid_answers=["True", "False", "Uncertain"],
+            task_name=TASK_NAME,
+            difficulty=ex["difficulty"],
+            task_type="multiclass",
+            correct_answer=ex["label"],
+        )
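The premise-count heuristic can be probed the same way (the premise string is a toy example; the split mirrors the loader's counting):

```python
from syllogym_env.server.drivers.folio import _difficulty_from_n_premises

premises = "All birds lay eggs.\nA penguin is a bird.\nPenguins cannot fly."
n_premises = len([p for p in premises.split("\n") if p.strip()])
print(n_premises, "->", _difficulty_from_n_premises(n_premises))  # 3 -> 2
```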
server/drivers/knights_knaves.py ADDED
@@ -0,0 +1,147 @@
+"""
+server/drivers/knights_knaves.py
+---------------------------------
+KnightsKnavesDriver — procedural generation of Knights & Knaves logic puzzles.
+
+No dataset required. Problems are generated on demand with full control over
+difficulty (number of entities) and are infinitely varied.
+
+Puzzle format:
+    RULE: "On this island, every person is either a Knight (who always tells
+          the truth) or a Knave (who always lies)..."
+    FACTS: One statement per entity, e.g.:
+        Alex says: "Blake is a Knave."
+        Blake says: "Alex is a Knight."
+    QUESTION: "Based solely on the rule above, is Alex a Knight or a Knave?"
+    ANSWER: "Knight" or "Knave"
+
+Difficulty scales with number of entities:
+    2 entities → difficulty 1
+    3 entities → difficulty 2
+    4 entities → difficulty 3
+    ...
+
+Multi-turn mode (future): the environment could reveal one statement at a time,
+asking the model to update its belief state at each step. This driver supports
+single-turn mode only for now.
+"""
+
+from __future__ import annotations
+
+import random
+from dataclasses import dataclass
+from typing import Optional
+
+from ..core.base_driver import BaseDriver, RuleTask
+
+
+TASK_NAME = "knights_knaves"
+
+RULE_TEXT = (
+    "On this island, every person is either a Knight (who always tells the truth) "
+    "or a Knave (who always lies). A Knight's statements are always true; "
+    "a Knave's statements are always false. "
+    "Use these facts to determine each person's type."
+)
+
+# A diverse pool of first names to avoid repetition within a puzzle
+_NAME_POOL = [
+    "Alex", "Blake", "Casey", "Dana", "Ellis",
+    "Fran", "Gray", "Harper", "Indra", "Jules",
+    "Kit", "Lane", "Morgan", "Noel", "Paige",
+    "Quinn", "Reese", "Sage", "Taylor", "Uma",
+]
+
+
+@dataclass
+class _Entity:
+    name: str
+    is_knight: bool
+
+
+def _generate_puzzle(
+    rng: random.Random,
+    n_entities: int,
+) -> tuple[list[_Entity], list[str]]:
+    """
+    Generate entity roles and statements for a K&K puzzle.
+
+    Each entity makes exactly one statement about a randomly chosen other entity.
+    A knight states the true type of the target; a knave states the false type.
+
+    Returns:
+        (entities, statements) where statements[i] is entity[i]'s statement.
+    """
+    names = rng.sample(_NAME_POOL, n_entities)
+    entities = [_Entity(name=name, is_knight=rng.choice([True, False])) for name in names]
+
+    statements: list[str] = []
+    for speaker in entities:
+        # Each entity addresses a different entity (not itself)
+        others = [e for e in entities if e.name != speaker.name]
+        target = rng.choice(others)
+
+        # Knight tells truth; knave lies about the target's type
+        if speaker.is_knight:
+            claimed = "Knight" if target.is_knight else "Knave"
+        else:
+            claimed = "Knave" if target.is_knight else "Knight"
+
+        statements.append(f'{speaker.name} says: "{target.name} is a {claimed}."')
+
+    return entities, statements
+
+
+class KnightsKnavesDriver(BaseDriver):
+    """
+    Procedural driver for Knights & Knaves logic puzzles.
+
+    Args:
+        min_entities: Minimum number of entities per puzzle (default 2).
+        max_entities: Maximum number of entities per puzzle (default 4).
+
+    Difficulty = n_entities - 1, so:
+        2 entities → difficulty 1
+        3 entities → difficulty 2
+        4 entities → difficulty 3
+    """
+
+    def __init__(self, min_entities: int = 2, max_entities: int = 4) -> None:
+        if min_entities < 2:
+            raise ValueError("min_entities must be at least 2")
+        if max_entities > len(_NAME_POOL):
+            raise ValueError(f"max_entities cannot exceed {len(_NAME_POOL)}")
+        self._min = min_entities
+        self._max = max_entities
+
+    @property
+    def task_names(self) -> list[str]:
+        return [TASK_NAME]
+
+    def sample(
+        self,
+        rng: random.Random,
+        task_name: Optional[str] = None,
+    ) -> Optional[RuleTask]:
+        if task_name is not None and task_name != TASK_NAME:
+            return None
+
+        n = rng.randint(self._min, self._max)
+        entities, statements = _generate_puzzle(rng, n)
+
+        facts = "\n".join(statements)
+
+        # Ask about a randomly chosen entity
+        subject = rng.choice(entities)
+        correct = "Knight" if subject.is_knight else "Knave"
+
+        return RuleTask(
+            rule=RULE_TEXT,
+            facts=facts,
+            question=f"Based solely on the rule above, is {subject.name} a Knight or a Knave?",
+            valid_answers=["Knight", "Knave"],
+            task_name=TASK_NAME,
+            difficulty=n - 1,
+            task_type="binary",
+            correct_answer=correct,
+        )
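One caveat worth knowing: `_generate_puzzle` produces statements consistent with the sampled roles, but it does not verify that those roles are the only consistent assignment, so a question can occasionally be ambiguous (with two entities accusing each other, the mirrored assignment is also consistent). At these sizes a brute-force uniqueness check is cheap; a hedged sketch of such a filter (this helper is illustrative, not part of the driver, and assumes the exact statement format emitted above):

```python
import itertools
import re

def unique_type(statements: list[str], subject: str) -> str | None:
    """Return 'Knight'/'Knave' if all consistent assignments agree on
    the subject's type, else None (the puzzle is ambiguous)."""
    # Parse 'X says: "Y is a Knight."' back into (speaker, target, claim).
    claims = [
        re.match(r'(\w+) says: "(\w+) is a (Knight|Knave)\."', s).groups()
        for s in statements
    ]
    names = sorted({c[0] for c in claims} | {c[1] for c in claims})
    verdicts = set()
    for bits in itertools.product([True, False], repeat=len(names)):
        is_knight = dict(zip(names, bits))
        # A speaker is a Knight exactly when their statement is true.
        if all(
            is_knight[spk] == (is_knight[tgt] == (claim == "Knight"))
            for spk, tgt, claim in claims
        ):
            verdicts.add("Knight" if is_knight[subject] else "Knave")
    return verdicts.pop() if len(verdicts) == 1 else None
```

A driver variant could retry `_generate_puzzle` until `unique_type(statements, subject.name)` is not None, guaranteeing a determinate labeled answer.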
server/drivers/legalbench.py ADDED
@@ -0,0 +1,271 @@
+"""
+server/drivers/legalbench.py
+----------------------------
+LegalBenchDriver — loads tasks from LegalBench (nguha/legalbench on HuggingFace).
+
+Selected tasks (Tier 1 — rule explicitly provided, pure deductive reasoning):
+    diversity_1–6            binary, difficulty 1–6 (§1332 diversity jurisdiction)
+    ucc_v_common_law         binary, difficulty 2 (UCC vs. Common Law)
+    abercrombie              multiclass, difficulty 3 (trademark distinctiveness)
+    hearsay                  binary, difficulty 4 (FRE 801 hearsay)
+    telemarketing_sales_rule binary, difficulty 5 (FTC TSR)
+"""
+
+from __future__ import annotations
+
+import random
+from typing import Optional
+
+from ..core.base_driver import BaseDriver, RuleTask
+
+
+# ---------------------------------------------------------------------------
+# Task registry
+# ---------------------------------------------------------------------------
+
+TASK_REGISTRY: list[dict] = [
+    {
+        "name": "diversity_1",
+        "difficulty": 1,
+        "task_type": "binary",
+        "valid_answers": ["Yes", "No"],
+        "rule": (
+            "Under 28 U.S.C. § 1332, a federal district court has diversity jurisdiction "
+            "when: (1) the matter in controversy exceeds $75,000 exclusive of interest and "
+            "costs, AND (2) the action is between citizens of different States. "
+            "A corporation is deemed a citizen of its state of incorporation AND its "
+            "principal place of business."
+        ),
+        "question": "Based solely on the rule above, does the federal court have diversity jurisdiction?",
+    },
+    {
+        "name": "diversity_2",
+        "difficulty": 2,
+        "task_type": "binary",
+        "valid_answers": ["Yes", "No"],
+        "rule": (
+            "Under 28 U.S.C. § 1332, a federal district court has diversity jurisdiction "
+            "when: (1) the matter in controversy exceeds $75,000 exclusive of interest and "
+            "costs, AND (2) complete diversity exists — no plaintiff may be a citizen of the "
+            "same state as any defendant. A corporation is a citizen of its state of "
+            "incorporation AND its principal place of business."
+        ),
+        "question": "Based solely on the rule above, does the federal court have diversity jurisdiction?",
+    },
+    {
+        "name": "diversity_3",
+        "difficulty": 3,
+        "task_type": "binary",
+        "valid_answers": ["Yes", "No"],
+        "rule": (
+            "Under 28 U.S.C. § 1332, complete diversity jurisdiction requires: "
+            "(1) the amount in controversy exceeds $75,000 exclusive of interest and costs, "
+            "AND (2) every plaintiff is a citizen of a different state from every defendant. "
+            "A natural person's citizenship is determined by their domicile (the place they "
+            "reside with intent to remain). A corporation is a citizen of both its state of "
+            "incorporation and its principal place of business (the 'nerve center')."
+        ),
+        "question": "Based solely on the rule above, does the federal court have diversity jurisdiction?",
+    },
+    {
+        "name": "diversity_4",
+        "difficulty": 4,
+        "task_type": "binary",
+        "valid_answers": ["Yes", "No"],
+        "rule": (
+            "Under 28 U.S.C. § 1332, diversity jurisdiction requires: "
+            "(1) complete diversity — no plaintiff shares citizenship with any defendant, "
+            "(2) amount in controversy exceeds $75,000 (exclusive of interest and costs). "
+            "For aggregation of claims: a single plaintiff may aggregate all claims against "
+            "a single defendant to meet the amount requirement. When multiple plaintiffs sue "
+            "a single defendant, each plaintiff must independently satisfy the amount. "
+            "A corporation is a citizen of its state of incorporation and principal place of "
+            "business. An individual's citizenship is their domicile."
+        ),
+        "question": "Based solely on the rule above, does the federal court have diversity jurisdiction?",
+    },
+    {
+        "name": "diversity_5",
+        "difficulty": 5,
+        "task_type": "binary",
+        "valid_answers": ["Yes", "No"],
+        "rule": (
+            "Under 28 U.S.C. § 1332, diversity jurisdiction requires complete diversity and "
+            "an amount in controversy exceeding $75,000. Complete diversity means no plaintiff "
+            "is a citizen of the same state as any defendant. For unincorporated associations "
+            "(partnerships, LLCs), citizenship is determined by the citizenship of ALL members. "
+            "For class actions under CAFA (28 U.S.C. § 1332(d)), jurisdiction exists if any "
+            "member of the plaintiff class is diverse from any defendant and the aggregate "
+            "amount exceeds $5,000,000. For standard diversity (non-CAFA), each plaintiff's "
+            "claim must independently exceed $75,000."
+        ),
+        "question": "Based solely on the rule above, does the federal court have diversity jurisdiction?",
+    },
+    {
+        "name": "diversity_6",
+        "difficulty": 6,
+        "task_type": "binary",
+        "valid_answers": ["Yes", "No"],
+        "rule": (
+            "Under 28 U.S.C. § 1332, diversity jurisdiction (non-CAFA) requires: "
+            "(1) complete diversity: the citizenship of each plaintiff must differ from that "
+            "of each defendant across all claims; and (2) each plaintiff's amount in "
+            "controversy must independently exceed $75,000, unless the claims are so "
+            "intertwined that they constitute a single indivisible harm (permitting "
+            "aggregation). For an LLC or partnership, citizenship is the citizenship of ALL "
+            "members/partners (applied recursively for nested entities). For a corporation, "
+            "citizenship is the state of incorporation plus the principal place of business. "
+            "For a natural person, citizenship is domicile."
+        ),
+        "question": "Based solely on the rule above, does the federal court have diversity jurisdiction?",
+    },
+    {
+        "name": "ucc_v_common_law",
+        "difficulty": 2,
+        "task_type": "binary",
+        "valid_answers": ["UCC", "Common Law"],
+        "text_field": "contract",
+        "rule": (
+            "The Uniform Commercial Code (UCC) Article 2 governs contracts for the SALE OF "
+            "GOODS — tangible, movable personal property. The Common Law of contracts governs "
+            "all other contracts, including contracts for services, real estate, employment, "
+            "and intellectual property. When a contract involves both goods and services "
+            "(a 'mixed' contract), the predominant purpose test applies: if the predominant "
+            "purpose is the sale of goods, UCC applies; if the predominant purpose is services, "
+            "Common Law applies."
+        ),
+        "question": "Based solely on the rule above, does UCC or Common Law govern this contract?",
+    },
+    {
+        "name": "abercrombie",
+        "difficulty": 3,
+        "task_type": "multiclass",
+        "valid_answers": ["generic", "descriptive", "suggestive", "arbitrary", "fanciful"],
+        "rule": (
+            "Under the Abercrombie & Fitch spectrum, trademarks are classified by their "
+            "distinctiveness in relation to the goods/services they identify:\n"
+            "- GENERIC: the common name for the product itself (e.g., 'Apple' for apples). "
+            "Cannot be registered.\n"
+            "- DESCRIPTIVE: directly describes a feature, quality, or characteristic of the "
+            "product (e.g., 'Cold and Creamy' for ice cream). Registrable only with acquired "
+            "distinctiveness (secondary meaning).\n"
+            "- SUGGESTIVE: suggests a quality or characteristic but requires imagination to "
+            "connect to the product (e.g., 'Coppertone' for suntan lotion). Inherently "
+            "distinctive.\n"
+            "- ARBITRARY: a real word used in an unrelated context (e.g., 'Apple' for "
+            "computers). Inherently distinctive.\n"
+            "- FANCIFUL: an invented word with no prior meaning (e.g., 'Kodak', 'Xerox'). "
+            "Highest level of distinctiveness."
+        ),
+        "question": "Based solely on the rule above, how should this mark be classified?",
+    },
+    {
+        "name": "hearsay",
+        "difficulty": 4,
+        "task_type": "binary",
+        "valid_answers": ["Yes", "No"],
+        "rule": (
+            "Under Federal Rule of Evidence 801, hearsay is an out-of-court statement that "
+            "a party offers to prove the TRUTH OF THE MATTER ASSERTED in the statement. "
+            "A 'statement' is an oral assertion, written assertion, or assertive conduct. "
+            "Key exclusions: (1) a statement is NOT hearsay if offered for a purpose other "
+            "than proving its truth (e.g., to show effect on listener, to show knowledge, "
+            "to prove legally operative words, or to show the declarant's state of mind); "
+            "(2) prior inconsistent statements made under oath are not hearsay under FRE "
+            "801(d)(1)(A); (3) admissions by a party-opponent are not hearsay under FRE "
+            "801(d)(2)."
+        ),
+        "question": "Based solely on the rule above, is this evidence hearsay?",
+    },
+    {
+        "name": "telemarketing_sales_rule",
+        "difficulty": 5,
+        "task_type": "binary",
+        "valid_answers": ["Yes", "No"],
+        "rule": (
+            "The FTC Telemarketing Sales Rule (16 C.F.R. § 310) prohibits telemarketers from: "
+            "(1) misrepresenting the total costs or material restrictions of any goods or "
+            "services; (2) misrepresenting any material aspect of the performance or "
+            "efficacy of goods or services; (3) making false or misleading statements to "
+            "induce a charitable contribution; (4) calling any person who has registered "
+            "their phone number on the National Do Not Call Registry, unless the caller has "
+            "an established business relationship with the consumer (defined as a transaction "
+            "within the prior 18 months or an inquiry within the prior 3 months); "
+            "(5) abandoning an outbound telephone call — defined as failing to connect the "
+            "call to a sales representative within 2 seconds of the consumer's greeting."
+        ),
+        "question": "Based solely on the rule above, does this conduct violate the Telemarketing Sales Rule?",
+    },
+]
+
+
+def _load_examples(task_name: str, text_field: str = "text") -> list[dict]:
+    """Load examples from HuggingFace LegalBench dataset."""
+    try:
+        from datasets import load_dataset
+        ds = load_dataset(
+            "nguha/legalbench", task_name, split="test", trust_remote_code=True
+        )
+        examples = []
+        for item in ds:
+            text = item.get(text_field) or item.get("text", "")
+            label = str(item.get("answer", "")).strip()
+            if text and label:
+                examples.append({"text": text, "label": label})
+        return examples
+    except Exception:
+        return []
+
+
+class LegalBenchDriver(BaseDriver):
+    """
+    Driver for LegalBench (nguha/legalbench).
+
+    Loads examples from HuggingFace on first access and caches them in memory.
+    Sampling is weighted by inverse difficulty within the driver.
+    """
+
+    def __init__(self) -> None:
+        self._registry = TASK_REGISTRY
+        self._by_name: dict[str, dict] = {t["name"]: t for t in self._registry}
+        self._weights: list[float] = [1.0 / t["difficulty"] for t in self._registry]
+        self._cache: dict[str, list[dict]] = {}
+
+    @property
+    def task_names(self) -> list[str]:
+        return list(self._by_name.keys())
+
+    def sample(
+        self,
+        rng: random.Random,
+        task_name: Optional[str] = None,
+    ) -> Optional[RuleTask]:
+        if task_name is not None:
+            if task_name not in self._by_name:
+                return None
+            task = self._by_name[task_name]
+            examples = self._get_examples(task)
+        else:
+            task = rng.choices(self._registry, weights=self._weights, k=1)[0]
+            examples = self._get_examples(task)
+
+        if not examples:
+            return None
+
+        ex = rng.choice(examples)
+        return RuleTask(
+            rule=task["rule"],
+            facts=ex["text"],
+            question=task["question"],
+            valid_answers=task["valid_answers"],
+            task_name=task["name"],
+            difficulty=task["difficulty"],
+            task_type=task["task_type"],
+            correct_answer=ex["label"],
+        )
+
+    def _get_examples(self, task: dict) -> list[dict]:
+        name = task["name"]
+        if name not in self._cache:
+            self._cache[name] = _load_examples(name, task.get("text_field", "text"))
+        return self._cache[name]
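The in-driver weights are the inverse of each task's difficulty, so the sampling distribution can be inspected without downloading anything:

```python
from syllogym_env.server.drivers.legalbench import TASK_REGISTRY

weights = [1.0 / t["difficulty"] for t in TASK_REGISTRY]  # mirrors LegalBenchDriver.__init__
total = sum(weights)
for task, w in zip(TASK_REGISTRY, weights):
    print(f'{task["name"]:25s} difficulty={task["difficulty"]} p={w / total:.2f}')
# diversity_1 (difficulty 1) is drawn six times as often as diversity_6.
```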
server/drivers/proofwriter.py ADDED
@@ -0,0 +1,148 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ """
+ server/drivers/proofwriter.py
+ ------------------------------
+ ProofWriterDriver — loads tasks from the ProofWriter dataset
+ (tasksource/proofwriter on HuggingFace).
+
+ ProofWriter is a large-scale deductive reasoning dataset (585k examples).
+ Each example contains a "theory" (natural-language facts + rules) and a
+ question (a proposition to evaluate as True / False / Unknown).
+
+ Reasoning depth (QDep) maps directly to difficulty:
+     0  → difficulty 1 (direct fact lookup, no inference)
+     1  → difficulty 2 (one inference step)
+     2  → difficulty 3 (two steps)
+     3  → difficulty 4 (three steps)
+     ≥4 → difficulty 5 (deep chain)
+
+ Dataset: https://huggingface.co/datasets/tasksource/proofwriter
+ Paper: "ProofWriter: Generating Implications, Proofs, and Abductive
+        Statements over Natural Language" (Tafjord et al., 2021)
+ """
+
+ from __future__ import annotations
+
+ import random
+ from typing import Optional
+
+ from ..core.base_driver import BaseDriver, RuleTask
+
+
+ TASK_NAME = "proofwriter"
+
+ RULE_TEXT = (
+     "You are given a theory consisting of facts and inference rules expressed in "
+     "natural language. Facts state what is directly true about the world. Rules state "
+     "conditional relationships (e.g., 'If someone is X then they are Y'). "
+     "Apply the rules to the facts using deductive reasoning to determine whether the "
+     "given statement is True, False, or Unknown (cannot be determined from the theory)."
+ )
+
+ QUESTION_TEMPLATE = "Based solely on the theory above, is the following statement True, False, or Unknown?\n\"{statement}\""
+
+ # Depth → difficulty mapping; depths ≥ 4 fall through to difficulty 5 via .get()
+ _DEPTH_TO_DIFFICULTY = {0: 1, 1: 2, 2: 3, 3: 4}
+
+ # We restrict to configs with clean, structured language (exclude NatLang variants,
+ # which use informal prose and are harder to verify programmatically).
+ _VALID_CONFIGS = {"depth-0", "depth-1", "depth-2", "depth-3", "depth-3ext", "depth-5"}
+
+
+ def _load_examples() -> tuple[dict[int, list[dict]], int]:
+     """
+     Load ProofWriter examples grouped by difficulty (QDep).
+     Returns ({difficulty: [{"facts_rules": str, "statement": str, "answer": str}, ...]},
+     total_example_count).
+     """
+     try:
+         from datasets import load_dataset
+         ds = load_dataset("tasksource/proofwriter", split="train", trust_remote_code=True)
+         by_difficulty: dict[int, list[dict]] = {1: [], 2: [], 3: [], 4: [], 5: []}
+         for ex in ds:
+             if ex["config"] not in _VALID_CONFIGS:
+                 continue
+             # Skip Unknown answers for cleaner binary evaluation (True/False only)
+             if ex["answer"] == "Unknown":
+                 continue
+             depth = ex.get("QDep", 0)
+             difficulty = _DEPTH_TO_DIFFICULTY.get(depth, 5)
+             by_difficulty[difficulty].append({
+                 "facts_rules": ex["theory"].strip(),
+                 "statement": ex["question"].strip(),
+                 "answer": ex["answer"],  # "True" or "False"
+             })
+         total = sum(len(v) for v in by_difficulty.values())
+         return by_difficulty, total
+     except Exception:
+         return {1: [], 2: [], 3: [], 4: [], 5: []}, 0
+
+
+ # Task names — one per difficulty level so single-task mode works
+ _TASK_NAMES = [f"proofwriter_d{d}" for d in range(1, 6)]
+
+
+ class ProofWriterDriver(BaseDriver):
+     """
+     Driver for the ProofWriter deductive reasoning dataset.
+
+     Exposes 5 task names (proofwriter_d1 … proofwriter_d5) corresponding to
+     reasoning depths 0–4+. Each task name targets a specific difficulty level.
+
+     Args:
+         max_per_difficulty: Cap examples per difficulty to limit memory.
+             Default: 5000 (from ~160k True/False examples total).
+     """
+
+     def __init__(self, max_per_difficulty: int = 5000) -> None:
+         self._max = max_per_difficulty
+         self._cache: Optional[dict[int, list[dict]]] = None
+         self._total: int = 0
+
+     @property
+     def task_names(self) -> list[str]:
+         return _TASK_NAMES
+
+     def _ensure_loaded(self) -> dict[int, list[dict]]:
+         if self._cache is None:
+             self._cache, self._total = _load_examples()
+             # Cap per difficulty
+             for d in self._cache:
+                 if len(self._cache[d]) > self._max:
+                     self._cache[d] = self._cache[d][: self._max]
+         return self._cache
+
+     def sample(
+         self,
+         rng: random.Random,
+         task_name: Optional[str] = None,
+     ) -> Optional[RuleTask]:
+         by_difficulty = self._ensure_loaded()
+
+         if task_name is not None:
+             if task_name not in _TASK_NAMES:
+                 return None
+             # task_name = "proofwriter_d{difficulty}"
+             difficulty = int(task_name[-1])
+             examples = by_difficulty.get(difficulty, [])
+         else:
+             # Mixed mode: sample difficulties in proportion to how many
+             # examples each bucket holds.
+             weights = [len(by_difficulty.get(d, [])) for d in range(1, 6)]
+             if not any(weights):
+                 return None
+             difficulty = rng.choices(range(1, 6), weights=weights, k=1)[0]
+             examples = by_difficulty.get(difficulty, [])
+
+         if not examples:
+             return None
+
+         ex = rng.choice(examples)
+
+         return RuleTask(
+             rule=RULE_TEXT,
+             facts=ex["facts_rules"],
+             question=QUESTION_TEMPLATE.format(statement=ex["statement"]),
+             valid_answers=["True", "False"],
+             task_name=task_name or f"proofwriter_d{difficulty}",
+             difficulty=difficulty,
+             task_type="binary",
+             correct_answer=ex["answer"],
+         )
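A minimal usage sketch for the driver above (not part of the commit; the import path mirrors the file location, and the first `sample` call downloads the dataset from the Hub):

```
import random
from server.drivers.proofwriter import ProofWriterDriver

driver = ProofWriterDriver(max_per_difficulty=1000)
rng = random.Random(42)  # seeded for reproducible sampling

# Target two-step inference problems explicitly ...
task = driver.sample(rng, task_name="proofwriter_d3")
# ... or omit task_name to mix difficulties in proportion to bucket sizes.
if task is not None:  # None means the dataset failed to load or the bucket is empty
    print(task.rule)
    print(task.facts)
    print(task.question)  # gold label is in task.correct_answer ("True"/"False")
```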
server/drivers/rulebreakers.py ADDED
@@ -0,0 +1,145 @@
+ """
+ server/drivers/rulebreakers.py
+ --------------------------------
+ RuleBreakersDriver — loads tasks from the RULEBREAKERS dataset
+ (jason-c/rulebreakers on HuggingFace, ICML 2025).
+
+ RULEBREAKERS is a 25,600-example benchmark of classical syllogistic reasoning
+ covering two inference patterns:
+     - MT (Modus Tollens):         P→¬Q, Q ⊢ ¬P
+     - DS (Disjunctive Syllogism): P∨Q, ¬Q ⊢ P
+
+ Each example has two premises and a candidate conclusion. The label indicates
+ whether the conclusion validly follows (True) or not (False, the "rulebreaker"
+ case). The dataset is perfectly balanced: 12,800 MT + 12,800 DS,
+ 50% True / 50% False.
+
+ Difficulty:
+     All examples share the same one-step inference depth → difficulty 2.
+     (Simple but adversarial: the False examples look plausible at first glance.)
+
+ Dataset: https://huggingface.co/datasets/jason-c/rulebreakers
+ Paper: "RULEBREAKERS: Challenging LLMs at Modus Tollens Reasoning" (ICML 2025)
+ """
+
+ from __future__ import annotations
+
+ import random
+ from typing import Optional
+
+ from ..core.base_driver import BaseDriver, RuleTask
+
+
+ TASK_NAME_MT = "rulebreakers_mt"
+ TASK_NAME_DS = "rulebreakers_ds"
+ _TASK_NAMES = [TASK_NAME_MT, TASK_NAME_DS]
+
+ _TYPE_TO_TASK = {"mt": TASK_NAME_MT, "ds": TASK_NAME_DS}
+
+ RULE_TEXT_MT = (
+     "You are given two premises. The first is a conditional rule of the form "
+     "\"If P then not Q\". The second states that Q is true. "
+     "Using Modus Tollens, determine whether the given conclusion validly follows "
+     "from these two premises. Answer True if the conclusion follows, False if it does not."
+ )
+
+ RULE_TEXT_DS = (
+     "You are given two premises. The first states that either P or Q is true (or both). "
+     "The second states that one of the two options is not true. "
+     "Using Disjunctive Syllogism, determine whether the given conclusion validly follows "
+     "from these two premises. Answer True if the conclusion follows, False if it does not."
+ )
+
+ QUESTION_TEMPLATE = (
+     "Given the premises above, does the following conclusion validly follow?\n"
+     "Conclusion: \"{conclusion}\""
+ )
+
+
+ def _load_examples() -> dict[str, list[dict]]:
+     """Load RULEBREAKERS examples grouped by type (mt / ds)."""
+     try:
+         from datasets import load_dataset
+         ds = load_dataset("jason-c/rulebreakers", split="train", trust_remote_code=True)
+         by_type: dict[str, list[dict]] = {"mt": [], "ds": []}
+         for ex in ds:
+             rb_type = ex.get("rulebreaker_type", "")
+             if rb_type not in by_type:
+                 continue
+             label = ex.get("label")
+             # label is a Python bool in this dataset
+             if not isinstance(label, bool):
+                 continue
+             by_type[rb_type].append({
+                 "premise1": ex["premise1"].strip(),
+                 "premise2": ex["premise2"].strip(),
+                 "conclusion": ex["conclusion"].strip(),
+                 "label": "True" if label else "False",
+             })
+         return by_type
+     except Exception:
+         return {"mt": [], "ds": []}
+
+
+ class RuleBreakersDriver(BaseDriver):
+     """
+     Driver for the RULEBREAKERS syllogistic reasoning dataset.
+
+     Exposes 2 task names:
+         - rulebreakers_mt : Modus Tollens examples
+         - rulebreakers_ds : Disjunctive Syllogism examples
+
+     The dataset is adversarial: half the conclusions look valid but are not.
+     It tests whether models can resist superficially plausible but logically
+     invalid inferences.
+     """
+
+     def __init__(self) -> None:
+         self._cache: Optional[dict[str, list[dict]]] = None
+
+     @property
+     def task_names(self) -> list[str]:
+         return _TASK_NAMES
+
+     @property
+     def weight(self) -> float:
+         return 2.0
+
+     def _ensure_loaded(self) -> dict[str, list[dict]]:
+         if self._cache is None:
+             self._cache = _load_examples()
+         return self._cache
+
+     def sample(
+         self,
+         rng: random.Random,
+         task_name: Optional[str] = None,
+     ) -> Optional[RuleTask]:
+         by_type = self._ensure_loaded()
+
+         if task_name is not None:
+             if task_name == TASK_NAME_MT:
+                 rb_type = "mt"
+             elif task_name == TASK_NAME_DS:
+                 rb_type = "ds"
+             else:
+                 return None
+         else:
+             rb_type = rng.choice(["mt", "ds"])
+
+         examples = by_type.get(rb_type, [])
+         if not examples:
+             return None
+
+         ex = rng.choice(examples)
+         rule_text = RULE_TEXT_MT if rb_type == "mt" else RULE_TEXT_DS
+
+         return RuleTask(
+             rule=rule_text,
+             facts=f"{ex['premise1']}\n{ex['premise2']}",
+             question=QUESTION_TEMPLATE.format(conclusion=ex["conclusion"]),
+             valid_answers=["True", "False"],
+             task_name=task_name or _TYPE_TO_TASK[rb_type],
+             difficulty=2,
+             task_type="binary",
+             correct_answer=ex["label"],
+         )
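And a matching sketch for this driver, showing how a sampled task might be assembled into a single prompt (the prompt layout here is illustrative, not taken from the commit):

```
import random
from server.drivers.rulebreakers import RuleBreakersDriver

driver = RuleBreakersDriver()
rng = random.Random(0)

task = driver.sample(rng, task_name="rulebreakers_mt")  # Modus Tollens split
if task is not None:
    # Rule text, then the two premises, then the conclusion question.
    prompt = f"{task.rule}\n\nPremises:\n{task.facts}\n\n{task.question}"
    print(prompt)
    print("Expected:", task.correct_answer)  # "True" or "False"
```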