---
title: Doc Sweeper Environment
emoji: 🧹
colorFrom: 'blue'
colorTo: 'green'
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
---


# Doc Sweeper Environment

A virtual file system and text-editing environment for OpenEnv. This environment tasks autonomous LLM agents with acting as automated documentation engineers, requiring them to navigate a directory tree, read files, and apply precise string manipulations to complete complex refactoring tasks.

## Overview

The Doc Sweeper environment provides a sandboxed, in-memory file system where agents can interact with dummy codebases and documentation. It evaluates an agent's ability to retain context, plan multi-step operations, and use tools correctly.

### Features

* **Virtual File System**: In-memory directory tree with nested files.
* **Strict Tooling**: Requires agents to explicitly `open` files before applying `edit` commands.
* **Granular Feedback**: Provides immediate terminal feedback and linter issues upon illegal actions or formatting errors.
* **Three Distinct Scenarios**: Evaluates different logic flows (global search/replace, YAML refactoring, path resolution).

### Task Rules

The environment supports three primary tasks:

* `version_bump`: The agent must find all outdated version numbers (e.g., `v1.0.0` or `v1.00`) across all files and update them to `v2.0.0`.
* `config_migration`: The agent must open docker-compose files, update the version to `3.8`, and migrate `links` keys to `networks`.
* `broken_links`: The agent must find broken relative markdown links and edit them to point to correct file paths.
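
As a rough illustration of the `version_bump` objective, the sketch below scans an in-memory mapping of paths to contents for the two example spellings mentioned above. The regular expression, the `find_outdated` and `bump_versions` helpers, and the sample file names are assumptions made for this example; the environment's actual grader may treat other spellings as outdated too.

```python
import re

# Assumption: only the two example spellings (v1.0.0 and v1.00) count as outdated.
OUTDATED = re.compile(r"\bv1\.(?:0\.0|00)\b")
TARGET = "v2.0.0"


def find_outdated(files):
    """Return (path, matched_text) pairs for every outdated version string."""
    hits = []
    for path, content in sorted(files.items()):
        for match in OUTDATED.finditer(content):
            hits.append((path, match.group(0)))
    return hits


def bump_versions(text):
    """Return the updated text and the number of replacements made."""
    return OUTDATED.subn(TARGET, text)


# Hypothetical file contents, invented for illustration.
sample_files = {
    "README.md": "Install v1.0.0 now.",
    "docs/setup.md": "Uses v1.00 and v2.0.0.",
}
print(find_outdated(sample_files))
print(bump_versions(sample_files["README.md"]))
```

Each hit corresponds to one `open` of the file followed by one `edit` whose `old_str` is the matched text and whose `new_str` is `v2.0.0`.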

---

## Quick Start

### Running the Baseline Inference (Recommended)

The easiest way to test the environment is using the provided Chain-of-Thought agent script.

```bash
# Export your required credentials
export HF_TOKEN="your_api_key_here"
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"

# Run the inference script across all tasks
python inference.py
```

### Using a Local Server

You can host the environment locally to manually test the API endpoints.

```bash
# Install dependencies
pip install -r requirements.txt

# Run server
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```

## Actions

The action space is defined by the `DocAction` schema. The agent must provide a single JSON object with a `tool_name` and the corresponding required fields:

* **`open`**: Opens a file. Requires the `path` parameter.
* **`edit`**: Replaces text in the currently active file. Requires exact string matching via `old_str` and `new_str`.
* **`grep`**: Searches the active file (or directory). Requires `search_query`.
* **`done`**: Signals that the task is complete.
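
The payloads below sketch what each tool call might look like as JSON. The field names come from the list above; the helper functions are hypothetical, and whether unused fields may simply be omitted (as done here) or must be sent explicitly is an assumption.

```python
import json


def open_action(path):
    """Build an `open` action for the given file path."""
    return {"tool_name": "open", "path": path}


def edit_action(old_str, new_str):
    """Build an `edit` action; old_str must match the file text exactly."""
    return {"tool_name": "edit", "old_str": old_str, "new_str": new_str}


def grep_action(search_query):
    """Build a `grep` action for the given query."""
    return {"tool_name": "grep", "search_query": search_query}


def done_action():
    """Build the `done` action that ends the episode."""
    return {"tool_name": "done"}


# A plausible sequence for a single file: open it, fix it, finish.
for action in (
    open_action("setup.md"),
    edit_action("v1.0.0", "v2.0.0"),
    done_action(),
):
    print(json.dumps(action))
```

Keeping each action to one flat JSON object makes it straightforward to validate against the `DocAction` schema before sending it.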

## Observations

Each observation (`DocObservation`) returned by the environment includes:

* **`active_file`**: The file currently opened by the agent.
* **`terminal_feedback`**: Error messages, success logs, or system alerts resulting from the last action.
* **`directory_tree`**: A JSON representation of the current file system hierarchy.
* **`file_content`**: The textual content of the currently active file.
* **`issues_detected`**: A list of simulated linter errors (if the agent breaks a file's formatting).
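
One way an agent loop might turn an observation into prompt text is sketched below. The field names follow the list above, but the value types (for example, a nested dict for `directory_tree`) and the `render_observation` helper are assumptions for illustration only.

```python
import json


def render_observation(obs):
    """Flatten a DocObservation-like dict into prompt text for the agent."""
    lines = [
        f"Active file: {obs.get('active_file') or '(none)'}",
        f"Terminal feedback: {obs.get('terminal_feedback') or '(empty)'}",
        "Directory tree: "
        + json.dumps(obs.get("directory_tree") or {}, sort_keys=True),
    ]
    issues = obs.get("issues_detected") or []
    if issues:
        lines.append("Linter issues:")
        lines.extend(f"  - {issue}" for issue in issues)
    content = obs.get("file_content")
    if content:
        lines.append("File content:")
        lines.append(content)
    return "\n".join(lines)


# Hypothetical observation; every value here is invented for illustration.
example = {
    "active_file": "setup.md",
    "terminal_feedback": "Opened setup.md",
    "directory_tree": {"setup.md": "file"},
    "file_content": "Install v1.0.0",
    "issues_detected": [],
}
print(render_observation(example))
```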

## Configuration

### Reward Structure

The environment issues rewards based on the agent's efficiency and accuracy:

* **Valid Tool Usage**: `0.0` (Neutral, but advances the state).
* **Tool Misuse Penalty**: `-0.1` (e.g., trying to edit without opening a file, or providing a bad file path).
* **Task Completion**: `1.0` (Awarded only when `done` is called and all objective checks pass).
* **Early/Failed Completion**: `-1.0` (Calling `done` before fixing all required strings).
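
The reward table above can be read as a small decision function. The sketch below is an interpretation, not the environment's actual code: the `step_reward` name and the `valid` and `checks_pass` flags are hypothetical stand-ins for the environment's internal checks.

```python
REWARD_VALID = 0.0         # valid tool usage
REWARD_MISUSE = -0.1       # e.g. edit without an open file, bad path
REWARD_SUCCESS = 1.0       # done called and all objective checks pass
REWARD_FAILED_DONE = -1.0  # done called too early


def step_reward(tool_name, valid=True, checks_pass=False):
    """Return the reward for one action under the table above.

    `valid` and `checks_pass` are assumed inputs standing in for the
    environment's own validation and grading logic.
    """
    if tool_name == "done":
        return REWARD_SUCCESS if checks_pass else REWARD_FAILED_DONE
    return REWARD_VALID if valid else REWARD_MISUSE


print(step_reward("open"))                     # valid tool usage
print(step_reward("edit", valid=False))        # misuse penalty
print(step_reward("done", checks_pass=True))   # successful completion
print(step_reward("done", checks_pass=False))  # early or failed completion
```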

## Building and Deployment

### Build Docker Image

From the repository root:

```bash
# Build the environment image
docker build -t doc-sweeper-env:latest .
```

The Dockerfile uses `pip install` with `requirements.txt` for maximum compatibility with Hugging Face Spaces.

```bash
# Run the container locally
docker run -p 8000:8000 doc-sweeper-env:latest
```

The FastAPI OpenEnv endpoints will be available at `http://localhost:8000/reset` and `http://localhost:8000/step`.
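
A minimal client for these two endpoints could look like the sketch below, using only the Python standard library. The use of `POST`, the empty body for `/reset`, and the `{"action": {...}}` wrapper for `/step` are assumptions about the request format and should be checked against the server code; `build_post` and `send` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_post(endpoint, payload):
    """Build a JSON POST request for the given endpoint (nothing is sent yet)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Assumed payload shapes; adjust to match the server's schema.
reset_request = build_post("reset", {})
step_request = build_post(
    "step", {"action": {"tool_name": "open", "path": "setup.md"}}
)
print(reset_request.full_url, step_request.full_url)
```

With the container running, calling `send(reset_request)` and then `send(step_request)` would issue the two requests in order.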

---

## Dependencies

The Doc Sweeper environment requires:

* **`fastapi` & `uvicorn`**: For serving the OpenEnv endpoints.
* **`pydantic`**: For strict action and observation schema validation.
* **`openai` / `groq`**: For the baseline LLM inference script.

These are automatically installed when using Docker or installing via `pip install -r requirements.txt`.

---

## Example Evaluation Log Output

When running `inference.py`, the agent emits strictly formatted logs for the automated graders:

```text
[START] task=version_bump model=gpt-4o-mini
[STEP] step=1 action=open reward=0.00 done=False thought="Opening setup.md to check for versions."
[STEP] step=2 action=edit reward=0.00 done=False thought="Replacing v1.0.0 with v2.0.0."
[STEP] step=3 action=done reward=1.00 done=True thought="All files have been checked."
[END] task=version_bump score=1.00 total_steps=3 runtime_seconds=4.2
```