---
title: Doc Sweeper Environment
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
---

# Doc Sweeper Environment

A virtual file system and text-editing environment for OpenEnv. This environment tasks autonomous LLM agents with acting as automated documentation engineers, requiring them to navigate a directory tree, read files, and apply precise string manipulations to complete complex refactoring tasks.

## Overview

The Doc Sweeper environment provides a sandboxed, in-memory file system where agents can interact with dummy codebases and documentation. It evaluates an agent's ability to retain context, plan multi-step operations, and use tools correctly.

## Features

- **Virtual File System:** In-memory directory tree with nested files.
- **Strict Tooling:** Requires agents to explicitly open files before applying edit commands.
- **Granular Feedback:** Provides immediate terminal feedback and linter issues upon illegal actions or formatting errors.
- **Three Distinct Scenarios:** Evaluates different logic flows (global search/replace, YAML refactoring, path resolution).

## Task Rules

The environment supports three primary tasks:

- **`version_bump`:** The agent must find all outdated version numbers (e.g., `v1.0.0` or `v1.00`) across all files and update them to `v2.0.0`.
- **`config_migration`:** The agent must open docker-compose files, update the `version` to `3.8`, and migrate `links` keys to `networks`.
- **`broken_links`:** The agent must find broken relative Markdown links and edit them to point to the correct file paths.
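For illustration, a `version_bump` episode might boil down to an action sequence like the following sketch (the file path is hypothetical; the field names mirror the `DocAction` schema described under Actions):

```python
# Hypothetical action sequence for the version_bump task.
# "docs/setup.md" is a made-up path; field names mirror the DocAction schema.
actions = [
    {"tool_name": "open", "path": "docs/setup.md"},
    {"tool_name": "edit", "old_str": "v1.0.0", "new_str": "v2.0.0"},
    {"tool_name": "done"},
]
```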

## Quick Start

### Running the Baseline Inference (Recommended)

The easiest way to test the environment is using the provided Chain-of-Thought agent script.

```bash
# Export your required credentials
export HF_TOKEN="your_api_key_here"
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"

# Run the inference script across all tasks
python inference.py
```

### Using a Local Server

You can host the environment locally to manually test the API endpoints.

```bash
# Install dependencies
pip install -r requirements.txt

# Run the server
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```

## Actions

The action space is defined by the `DocAction` schema. The agent must provide a single JSON object with a `tool_name` and the corresponding required fields:

- **`open`:** Opens a file. Requires the `path` parameter.
- **`edit`:** Replaces text in the currently active file. Requires exact string matching via `old_str` and `new_str`.
- **`grep`:** Searches the active file (or directory). Requires `search_query`.
- **`done`:** Signals that the task is complete.
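As a minimal sketch, the schema might be modeled with Pydantic roughly as follows (which fields are optional is an assumption; the environment's actual model may differ):

```python
from typing import Optional

from pydantic import BaseModel


class DocAction(BaseModel):
    # Sketch only: field names follow the list above; treating the
    # tool-specific fields as optional is an assumption.
    tool_name: str
    path: Optional[str] = None
    old_str: Optional[str] = None
    new_str: Optional[str] = None
    search_query: Optional[str] = None


action = DocAction(tool_name="open", path="README.md")
```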

## Observations

Each observation (`DocObservation`) returned by the environment includes:

- **`active_file`:** The file currently opened by the agent.
- **`terminal_feedback`:** Error messages, success logs, or system alerts resulting from the last action.
- **`directory_tree`:** A JSON representation of the current file system hierarchy.
- **`file_content`:** The textual content of the currently active file.
- **`issues_detected`:** A list of simulated linter errors (if the agent breaks a file's formatting).
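Put together, an observation payload might look like this (every value below is invented for illustration):

```python
# Illustrative DocObservation payload; all values here are made up.
observation = {
    "active_file": "docs/setup.md",
    "terminal_feedback": "Opened docs/setup.md",
    "directory_tree": {"docs": ["setup.md", "api.md"]},
    "file_content": "# Setup\nRequires v1.0.0 or later.\n",
    "issues_detected": [],
}
```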

## Configuration

### Reward Structure

The environment issues rewards based on the agent's efficiency and accuracy:

- **Valid Tool Usage:** `0.0` (neutral, but advances the state).
- **Tool Misuse Penalty:** `-0.1` (e.g., trying to edit without opening a file, or providing a bad file path).
- **Task Completion:** `1.0` (awarded only when `done` is called and all objective checks pass).
- **Early/Failed Completion:** `-1.0` (calling `done` before fixing all required strings).
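The scheme above can be summarized in a few lines of Python (a sketch of the logic, not the environment's actual implementation):

```python
def compute_reward(action_valid: bool, called_done: bool, all_checks_pass: bool) -> float:
    # Sketch of the reward table above, not the environment's real code.
    if called_done:
        # done is terminal: full credit only if every objective check passes.
        return 1.0 if all_checks_pass else -1.0
    # Non-terminal steps: neutral when the tool call is legal, penalty otherwise.
    return 0.0 if action_valid else -0.1
```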

## Building and Deployment

### Build Docker Image

From the repository root:

```bash
# Build the environment image
docker build -t doc-sweeper-env:latest .
```

The `Dockerfile` uses `pip install` with `requirements.txt` for maximum compatibility with Hugging Face Spaces.

```bash
# Run the container locally
docker run -p 8000:8000 doc-sweeper-env:latest
```

The FastAPI OpenEnv endpoints will be available at `http://localhost:8000/reset` and `http://localhost:8000/step`.
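A minimal client sketch using only the standard library (the exact request and response body shapes are assumptions; consult the OpenEnv HTTP contract for the real ones):

```python
import json
import urllib.request


def call(endpoint: str, payload: dict) -> dict:
    # POST a JSON body to the locally running Doc Sweeper server.
    req = urllib.request.Request(
        f"http://localhost:8000/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example usage (requires the server from the Quick Start to be running;
# the payload shapes below are assumptions):
# obs = call("reset", {"task": "version_bump"})
# obs = call("step", {"tool_name": "open", "path": "README.md"})
```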


## Dependencies

The Doc Sweeper environment requires:

- **`fastapi` & `uvicorn`:** For serving the OpenEnv endpoints.
- **`pydantic`:** For strict action and observation schema validation.
- **`openai` / `groq`:** For the baseline LLM inference script.

These are automatically installed when using Docker or installing via `pip install -r requirements.txt`.


## Example Evaluation Log Output

When running `inference.py`, the agent emits strictly formatted logs for the automated graders:

```text
[START] task=version_bump model=gpt-4o-mini
[STEP] step=1 action=open reward=0.00 done=False thought="Opening setup.md to check for versions."
[STEP] step=2 action=edit reward=0.00 done=False thought="Replacing v1.0.0 with v2.0.0."
[STEP] step=3 action=done reward=1.00 done=True thought="All files have been checked."
[END] task=version_bump score=1.00 total_steps=3 runtime_seconds=4.2
```