# OpenEnv Specification (Enhanced)

## Overview

This document defines the OpenEnv contract for WebScraper-OpenEnv, extended with advanced memory, MCP tooling, multi-model routing, and long-page batch handling.

## Core Interfaces

### Observation

```python
from pydantic import BaseModel


class Observation(BaseModel):
    episode_id: str
    task_id: str
    step_number: int
    current_url: str
    page_html: str
    page_title: str
    available_actions: list[str]
    extracted_so_far: dict
    pages_visited: list[str]
    budget_remaining: int
    task_description: str
    target_fields: list[str]
    hints: list[str]

    # Enhanced (optional; default to None so they can be omitted)
    memory_context: dict | None = None
    tool_registry_snapshot: list[dict] | None = None
    search_results: list[dict] | None = None
    page_chunks: list[dict] | None = None
```

### Action

```python
class Action(BaseModel):
    action_type: str

    # Existing
    target_field: str | None = None
    selector: str | None = None
    navigate_to: str | None = None
    submit_extraction: dict | None = None
    notes: str | None = None

    # Search
    query: str | None = None
    search_engine: str | None = None
    result_limit: int = 5

    # Verification
    field_name: str | None = None
    claimed_value: str | None = None
    verification_source: str | None = None

    # Conflict resolution
    conflicting_sources: list[str] | None = None
    chosen_source: str | None = None
    rationale: str | None = None

    # MCP + Memory
    tool_name: str | None = None
    tool_params: dict | None = None
    memory_layer: str | None = None
    memory_key: str | None = None
    memory_query: str | None = None
```

### Action Types

- `EXTRACT_FIELD`
- `NAVIGATE`
- `SEARCH_PAGE`
- `INSPECT_ELEMENT`
- `SUBMIT`
- `SKIP_PAGE`
- `SEARCH_ENGINE`
- `VERIFY_FACT`
- `RESOLVE_CONFLICT`
- `FETCH_URL`
- `MCP_TOOL_CALL`
- `WRITE_MEMORY`
- `READ_MEMORY`
- `SEARCH_MEMORY`
- `SUMMARIZE_MEMORY`
- `PRUNE_MEMORY`
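
Because `action_type` is a plain `str` in the `Action` model, a typo would only surface at dispatch time. One way to catch it earlier is an enum over the names listed above. This is a minimal sketch; `ActionType` and `parse_action_type` are illustrative names, not part of the specification.

```python
from enum import Enum


class ActionType(str, Enum):
    """The action types listed above, as string-valued members."""

    EXTRACT_FIELD = "EXTRACT_FIELD"
    NAVIGATE = "NAVIGATE"
    SEARCH_PAGE = "SEARCH_PAGE"
    INSPECT_ELEMENT = "INSPECT_ELEMENT"
    SUBMIT = "SUBMIT"
    SKIP_PAGE = "SKIP_PAGE"
    SEARCH_ENGINE = "SEARCH_ENGINE"
    VERIFY_FACT = "VERIFY_FACT"
    RESOLVE_CONFLICT = "RESOLVE_CONFLICT"
    FETCH_URL = "FETCH_URL"
    MCP_TOOL_CALL = "MCP_TOOL_CALL"
    WRITE_MEMORY = "WRITE_MEMORY"
    READ_MEMORY = "READ_MEMORY"
    SEARCH_MEMORY = "SEARCH_MEMORY"
    SUMMARIZE_MEMORY = "SUMMARIZE_MEMORY"
    PRUNE_MEMORY = "PRUNE_MEMORY"


def parse_action_type(raw: str) -> ActionType:
    """Normalize whitespace and case, then look up the member.

    Raises ValueError for names that are not in the list above.
    """
    return ActionType(raw.strip().upper())
```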

### Reward

```python
class Reward(BaseModel):
    value: float
    cumulative: float
    breakdown: dict
    message: str
```
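
The relationship between `value`, `cumulative`, and `breakdown` is not spelled out above. The sketch below assumes that `value` is the sum of the `breakdown` terms and that `cumulative` is the running total over the episode. `RewardTracker` and the breakdown keys used in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RewardTracker:
    """Accumulates per-step rewards for one episode.

    Assumption: a step's value equals the sum of its breakdown terms.
    """

    cumulative: float = 0.0

    def record(self, breakdown: dict[str, float], message: str = "") -> dict:
        """Return a dict shaped like the Reward model for this step."""
        value = sum(breakdown.values())
        self.cumulative += value
        return {
            "value": value,
            "cumulative": self.cumulative,
            "breakdown": dict(breakdown),  # copy so callers cannot mutate it
            "message": message,
        }
```

For example, `RewardTracker().record({"field_accuracy": 0.5, "step_cost": -0.25})` yields a step `value` of `0.25`; both key names are made up for illustration.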

## Episode Lifecycle

```text
reset(task_id, seed?)
  -> observation(step=0)

step(action)
  -> observation, reward, done, info

state(episode_id)
  -> current snapshot
```

An episode terminates when any of the following occurs:

- `SUBMIT` called
- budget exhausted
- max page limit reached
- fatal policy error

## State Machine

```text
RESET -> RUNNING -> TERMINAL
            |
            +-- NAVIGATE / EXTRACT / SEARCH / VERIFY / MCP / MEMORY
```
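
The diagram can be read as a transition table: `RESET` moves to `RUNNING`, `RUNNING` loops on itself for every non-terminal action, and `TERMINAL` has no outgoing edges. A minimal sketch of that reading follows; `Phase`, `ALLOWED`, and `transition` are illustrative names.

```python
from enum import Enum


class Phase(Enum):
    RESET = "RESET"
    RUNNING = "RUNNING"
    TERMINAL = "TERMINAL"


# Allowed target phases for each current phase, per the diagram above.
ALLOWED = {
    Phase.RESET: {Phase.RUNNING},
    Phase.RUNNING: {Phase.RUNNING, Phase.TERMINAL},
    Phase.TERMINAL: set(),
}


def transition(current: Phase, target: Phase) -> Phase:
    """Return the target phase, or raise ValueError if the edge is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```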

## Task Profiles

### Easy

- single-page extraction
- low noise
- hints enabled

### Medium

- pagination
- moderate noise
- partial hints

### Hard

- multi-hop search
- conflicting sources
- verification required
- no hints
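
One way to make the profiles machine-readable is a small table keyed by difficulty. The sketch below copies the traits listed above. How "partial hints" are selected is not defined in this document, so the rule used here (the first half of the hint list) is purely an assumption, as are the names `TaskProfile`, `PROFILES`, and `visible_hints`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    name: str
    traits: tuple[str, ...]
    hints: str  # "enabled", "partial", or "none"


PROFILES = {
    "easy": TaskProfile("easy", ("single-page extraction", "low noise"), "enabled"),
    "medium": TaskProfile("medium", ("pagination", "moderate noise"), "partial"),
    "hard": TaskProfile(
        "hard",
        ("multi-hop search", "conflicting sources", "verification required"),
        "none",
    ),
}


def visible_hints(profile: TaskProfile, hints: list[str]) -> list[str]:
    """Return the hints an agent would see under the given profile."""
    if profile.hints == "enabled":
        return list(hints)
    if profile.hints == "partial":
        # Assumption: "partial" means the first half of the hints.
        return list(hints[: len(hints) // 2])
    return []
```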

## Long Page Handling

When page HTML exceeds the token or size thresholds, the environment processes it in stages:

1. Semantic segmentation
2. Adaptive chunking
3. Batch extraction
4. Merge + dedupe + confidence rank
5. Optional diff-based incremental update
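
Steps 2 through 4 can be sketched as follows. Greedy splitting on blank lines stands in for semantic segmentation, the `extractor` callable is hypothetical, and candidates are assumed to be dicts with `field`, `value`, and `confidence` keys. The diff-based incremental update in step 5 is not covered.

```python
from typing import Callable

Extractor = Callable[[str], list[dict]]


def chunk_text(text: str, max_chars: int) -> list[str]:
    """Greedily pack blank-line-separated blocks into chunks of at most max_chars.

    A single block longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for block in text.split("\n\n"):
        if current and len(current) + len(block) + 2 > max_chars:
            chunks.append(current)
            current = block
        else:
            current = f"{current}\n\n{block}" if current else block
    if current:
        chunks.append(current)
    return chunks


def merge_candidates(batches: list[list[dict]]) -> list[dict]:
    """Dedupe on (field, normalized value), keep the highest confidence, then rank."""
    best: dict[tuple[str, str], dict] = {}
    for batch in batches:
        for cand in batch:
            key = (cand["field"], cand["value"].strip().lower())
            if key not in best or cand["confidence"] > best[key]["confidence"]:
                best[key] = cand
    return sorted(best.values(), key=lambda c: c["confidence"], reverse=True)


def extract_long_page(text: str, extractor: Extractor, max_chars: int = 2000) -> list[dict]:
    """Run the extractor on each chunk and merge the per-chunk results."""
    batches = [extractor(chunk) for chunk in chunk_text(text, max_chars)]
    return merge_candidates(batches)
```

Deduplicating on a normalized value keeps one entry per distinct claim, so conflicting values for the same field survive the merge and remain available for later conflict resolution.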

## MCP Integration Contract

On each step, the environment may expose:

- tool registry snapshot
- per-tool input/output schema
- timeout and retry policy

Tool calls are evaluated for:

- correctness
- efficiency
- safety constraints
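
A registry entry carrying the schema, timeout, and retry policy might look like the sketch below. `ToolSpec`, its field names, and the `required` key inside `input_schema` are assumptions rather than a defined MCP schema. The timeout is only recorded in the snapshot here, not enforced.

```python
from dataclasses import asdict, dataclass
from typing import Any, Callable


@dataclass
class ToolSpec:
    """Hypothetical registry entry for one tool."""

    name: str
    input_schema: dict
    output_schema: dict
    timeout_s: float = 10.0  # carried in the snapshot; not enforced in this sketch
    max_retries: int = 2


def registry_snapshot(tools: list[ToolSpec]) -> list[dict]:
    """Serialize the registry into plain dicts, e.g. for tool_registry_snapshot."""
    return [asdict(tool) for tool in tools]


def call_with_retry(spec: ToolSpec, params: dict, fn: Callable[..., Any]) -> Any:
    """Check required parameters, then call fn, retrying on any exception."""
    missing = [k for k in spec.input_schema.get("required", []) if k not in params]
    if missing:
        raise ValueError(f"missing parameters for {spec.name}: {missing}")
    last_error = None
    for _ in range(spec.max_retries + 1):
        try:
            return fn(**params)
        except Exception as exc:  # a real implementation would narrow this
            last_error = exc
    raise RuntimeError(f"tool {spec.name} failed after retries") from last_error
```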

## Search Engine Contract

The search action supports routing to the following providers:

- Google
- Bing
- Brave
- DuckDuckGo
- Perplexity
- custom providers

The environment stores the query and result metadata for observability.
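
Provider routing with a query log can be sketched as below. No real search API is called: providers are plain callables registered under a name, and `SearchRouter`, `query_log`, and the logged keys are illustrative assumptions. The `search_engine` and `result_limit` parameters mirror the `Action` fields.

```python
from typing import Callable

SearchFn = Callable[[str, int], list[dict]]


class SearchRouter:
    """Routes a query to a named provider and records metadata about it."""

    def __init__(self) -> None:
        self._providers: dict[str, SearchFn] = {}
        self.query_log: list[dict] = []

    def register(self, name: str, fn: SearchFn) -> None:
        """Register a provider callable under a case-insensitive name."""
        self._providers[name.lower()] = fn

    def search(self, query: str, search_engine: str, result_limit: int = 5) -> list[dict]:
        """Dispatch to the provider, cap the results, and log the call."""
        key = search_engine.lower()
        if key not in self._providers:
            raise KeyError(f"unknown search provider: {search_engine}")
        results = self._providers[key](query, result_limit)[:result_limit]
        self.query_log.append(
            {"query": query, "search_engine": key, "result_count": len(results)}
        )
        return results
```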

## Memory Contract

Layers:

- short-term (episode)
- working (reasoning)
- long-term (persistent)
- shared (multi-agent)

Mandatory metadata for write operations:

- `episode_id`
- `task_id`
- `confidence`
- `source`

## API Surface

- `POST /api/reset`
- `POST /api/step`
- `GET /api/state/{episode_id}`
- `GET /api/tasks`
- `GET /api/reward/{episode_id}`
- `GET /api/tool-registry`
- `POST /api/tool-test`

## Determinism

Given the same `task_id`, `seed`, and config, the environment should behave reproducibly so that grading and benchmarking runs are comparable.
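
One common way to get this property is to derive a single random generator from all three inputs and use it for every stochastic choice in the episode. The sketch below shows that approach. `episode_rng` is a hypothetical helper, and it assumes the config is JSON-serializable.

```python
import hashlib
import json
import random


def episode_rng(task_id: str, seed: int, config: dict) -> random.Random:
    """Derive a dedicated RNG from task_id, seed, and config.

    sort_keys makes the result independent of the config's key order.
    """
    material = json.dumps(
        {"task_id": task_id, "seed": seed, "config": config}, sort_keys=True
    )
    digest = hashlib.sha256(material.encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))
```

Using a per-episode `random.Random` instance, rather than the module-level functions, keeps concurrent episodes from disturbing each other's random streams.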

## Safety and Guardrails

- enforce max steps and request budgets
- enforce MCP tool allowlist/denylist
- prevent secret leakage from tool outputs
- sanitize logs and traces