---
title: Doc Sweeper Environment
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
---
# Doc Sweeper Environment

A virtual file system and text-editing environment for OpenEnv. This environment tasks autonomous LLM agents with acting as automated documentation engineers, requiring them to navigate a directory tree, read files, and apply precise string manipulations to complete complex refactoring tasks.
## Overview

The Doc Sweeper environment provides a sandboxed, in-memory file system where agents can interact with dummy codebases and documentation. It evaluates an agent's ability to retain context, plan multi-step operations, and use tools correctly.
### Features

* **Virtual File System**: In-memory directory tree with nested files.
* **Strict Tooling**: Requires agents to explicitly `open` files before applying `edit` commands.
* **Granular Feedback**: Provides immediate terminal feedback and linter issues upon illegal actions or formatting errors.
* **Three Distinct Scenarios**: Evaluates different logic flows (global search/replace, YAML refactoring, path resolution).
### Task Rules

The environment supports three primary tasks:

* `version_bump`: The agent must find all outdated version numbers (e.g., `v1.0.0` or `v1.00`) across all files and update them to `v2.0.0`.
* `config_migration`: The agent must open docker-compose files, update the version to `3.8`, and migrate `links` keys to `networks`.
* `broken_links`: The agent must find broken relative markdown links and edit them to point to correct file paths.
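To illustrate what `version_bump` expects as an end state, the sweep below normalizes both outdated formats with a plain regex. This is only a sketch of the target transformation, not the environment's checker; inside the environment the agent must perform the equivalent change through `open` and `edit` calls.

```python
import re

# Matches outdated versions such as "v1.0.0" or "v1.00"
# (illustrative pattern, not the environment's actual validator).
OUTDATED = re.compile(r"v1\.0+(?:\.0+)?")

def bump_versions(text: str) -> str:
    """Replace every outdated version string with v2.0.0."""
    return OUTDATED.sub("v2.0.0", text)
```

Both `bump_versions("v1.0.0")` and `bump_versions("v1.00")` yield `"v2.0.0"`, while already-current `v2.0.0` strings are left untouched.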
---
## Quick Start

### Running the Baseline Inference (Recommended)

The easiest way to test the environment is using the provided Chain-of-Thought agent script.

```bash
# Export your required credentials
export HF_TOKEN="your_api_key_here"
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"

# Run the inference script across all tasks
python inference.py
```
## Using a Local Server

You can host the environment locally to manually test the API endpoints.

```bash
# Install dependencies
pip install -r requirements.txt

# Run the server
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```
## Actions

The action space is defined by the `DocAction` schema. The agent must provide a single JSON object with a `tool_name` and the corresponding required fields:

* **`open`**: Opens a file. Requires the `path` parameter.
* **`edit`**: Replaces text in the currently active file. Requires exact string matching via `old_str` and `new_str`.
* **`grep`**: Searches the active file (or directory). Requires `search_query`.
* **`done`**: Signals that the task is complete.
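Concretely, each step's action is a single JSON object. The payloads below are a sketch based only on the field names listed above; the `docs/setup.md` path and search string are invented for illustration, and the real `DocAction` schema may enforce additional constraints.

```python
import json

# Hypothetical example payloads using the fields documented above.
open_action = {"tool_name": "open", "path": "docs/setup.md"}
edit_action = {"tool_name": "edit", "old_str": "v1.0.0", "new_str": "v2.0.0"}
grep_action = {"tool_name": "grep", "search_query": "v1"}
done_action = {"tool_name": "done"}

# Each action is sent as one JSON object per step.
payloads = [json.dumps(a) for a in (open_action, edit_action, grep_action, done_action)]
```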
## Observations

Each observation (`DocObservation`) returned by the environment includes:

* **`active_file`**: The file currently opened by the agent.
* **`terminal_feedback`**: Error messages, success logs, or system alerts resulting from the last action.
* **`directory_tree`**: A JSON representation of the current file system hierarchy.
* **`file_content`**: The textual content of the currently active file.
* **`issues_detected`**: A list of simulated linter errors (if the agent breaks a file's formatting).
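A shape like the following is what an agent loop would consume each step. The field names come from the list above; every value is invented for illustration, and the real `DocObservation` may carry extra metadata.

```python
# Hypothetical observation payload built from the fields documented above.
observation = {
    "active_file": "docs/setup.md",
    "terminal_feedback": "File opened successfully.",
    "directory_tree": {"docs": ["setup.md", "api.md"], "README.md": None},
    "file_content": "# Setup\nRequires doc-sweeper v1.0.0.",
    "issues_detected": [],
}

# A minimal agent would branch on simulated linter feedback like this:
needs_repair = bool(observation["issues_detected"])
```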
## Configuration

### Reward Structure

The environment issues rewards based on the agent's efficiency and accuracy:

* **Valid Tool Usage**: `0.0` (neutral, but advances the state).
* **Tool Misuse Penalty**: `-0.1` (e.g., trying to edit without opening a file, or providing a bad file path).
* **Task Completion**: `1.0` (awarded only when `done` is called and all objective checks pass).
* **Early/Failed Completion**: `-1.0` (calling `done` before fixing all required strings).
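The schedule above can be sketched as a simple lookup. This mirrors only the values in the table; the environment's actual reward code also runs the per-task objective checks that decide `checks_pass`.

```python
def reward_for(action_valid: bool, is_done: bool, checks_pass: bool) -> float:
    """Sketch of the reward table above (not the environment's actual code)."""
    if not action_valid:
        return -0.1  # tool misuse penalty (e.g., edit without open)
    if is_done:
        # `done` is all-or-nothing: full credit only if every check passes.
        return 1.0 if checks_pass else -1.0
    return 0.0  # valid intermediate step: neutral, but advances the state
```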
## Building and Deployment

### Build Docker Image

From the repository root:

```bash
# Build the environment image
docker build -t doc-sweeper-env:latest .
```

The Dockerfile uses `pip install` with `requirements.txt` for maximum compatibility with Hugging Face Spaces.

```bash
# Run the container locally
docker run -p 8000:8000 doc-sweeper-env:latest
```

The FastAPI OpenEnv endpoints will be available at `http://localhost:8000/reset` and `http://localhost:8000/step`.
---

## Dependencies

The Doc Sweeper environment requires:

* **`fastapi` & `uvicorn`**: For serving the OpenEnv endpoints.
* **`pydantic`**: For strict action and observation schema validation.
* **`openai` / `groq`**: For the baseline LLM inference script.

These are automatically installed when using Docker or installing via `pip install -r requirements.txt`.
---

## Example Evaluation Log Output

When running `inference.py`, the agent emits strictly formatted logs for the automated graders:

```text
[START] task=version_bump model=gpt-4o-mini
[STEP] step=1 action=open reward=0.00 done=False thought="Opening setup.md to check for versions."
[STEP] step=2 action=edit reward=0.00 done=False thought="Replacing v1.0.0 with v2.0.0."
[STEP] step=3 action=done reward=1.00 done=True thought="All files have been checked."
[END] task=version_bump score=1.00 total_steps=3 runtime_seconds=4.2
```