---
title: Rust Coder OpenEnv
emoji: 🦀
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
base_path: /web
pinned: false
tags:
  - openenv
  - software-engineering
  - rust
---

# Rust Coder: Systems Engineering Environment

Rust Coder is a high-fidelity **OpenEnv** environment for evaluating and training LLM agents on real-world Rust systems programming tasks. Unlike toy environments, Rust Coder simulates realistic engineering scenarios involving the borrow checker, concurrency, and memory safety.

## Motivation

Rust is uniquely challenging for AI agents due to its strict compile-time safety guarantees. This environment provides a 10-task progression that measures an agent's ability to:

1. Fix borrow checker violations
2. Correctly annotate lifetimes
3. Resolve concurrency deadlocks
4. Write unsafe FFI code correctly
5. Identify and prevent memory leaks
6. Optimize data pipelines for performance

---

## Action Space

**Type**: `RustCoderAction`

The agent submits a single string containing the complete, fixed Rust source code.

| Field | Type   | Description                              |
|-------|--------|------------------------------------------|
| `code` | string | Full Rust source code to compile and test |

## Observation Space

**Type**: `RustCoderObservation`

The environment returns detailed feedback after each submission:

| Field                  | Type        | Description                                         |
|------------------------|-------------|-----------------------------------------------------|
| `problem_description`  | string      | Task requirements and context                       |
| `header_section`       | string      | LeetCode-style scaffold (imports + signatures/types) |
| `compilation_success`  | bool        | Whether `rustc` compiled the submitted code         |
| `compilation_output`   | string      | Raw compiler errors and warnings                    |
| `test_results`         | list[dict]  | Per-test pass/fail results with error details       |
| `reward_breakdown`     | dict        | Weighted score breakdown across 5 dimensions        |
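
Putting the two tables together, a single exchange might look like this in Python. This is a sketch: the action wrapper shape matches the `curl` example later in this README, but the observation values are illustrative, not real server output.

```python
import json

# Action: the agent submits one string containing the complete Rust source.
action_payload = {"action": {"code": 'fn main() { println!("hello"); }'}}

# Observation: illustrative shape of the feedback the environment returns.
# Field names come from the table above; the values are made up for this sketch.
observation = {
    "problem_description": "Fix the borrow checker error in the parser.",
    "header_section": "use std::collections::HashMap;",
    "compilation_success": True,
    "compilation_output": "",
    "test_results": [{"name": "test_parse", "passed": True, "error": None}],
    "reward_breakdown": {"compilation": 0.40, "correctness": 0.20},
}

# Both sides are plain JSON on the wire.
wire = json.dumps(action_payload)
assert json.loads(wire)["action"]["code"].startswith("fn main")
```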

---

## Reward Function

Total reward is a weighted sum of 5 dimensions, each normalized to [0, 1]:

| Dimension       | Weight | Metric                                            |
|-----------------|--------|---------------------------------------------------|
| Compilation     | 40%    | Binary success/failure of `rustc`                 |
| Correctness     | 20%    | Fraction of test assertions that pass             |
| Coverage        | 20%    | Fraction of tests that successfully ran           |
| Elegance        | 10%    | Code quality heuristics (penalizes `.unwrap()`, long lines, and `unsafe`) |
| Efficiency      | 10%    | Execution time vs. per-problem baseline           |

The reward provides a partial signal at every step: compilation alone earns 0.40, and passing all tests earns up to 1.0.
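
As a sanity check on the weights, the total can be computed as a plain weighted sum. This is an illustrative sketch, not the server's actual scoring code; only the weights come from the table above.

```python
# Weights from the table above; each dimension is normalized to [0, 1].
WEIGHTS = {
    "compilation": 0.40,
    "correctness": 0.20,
    "coverage":    0.20,
    "elegance":    0.10,
    "efficiency":  0.10,
}

def total_reward(scores: dict) -> float:
    """Weighted sum of per-dimension scores (illustrative, not the server's code)."""
    return sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)

# Code that compiles but passes no tests earns exactly the compilation weight.
assert total_reward({"compilation": 1.0}) == 0.40
# A perfect submission earns 1.0 (up to float rounding).
assert abs(total_reward({k: 1.0 for k in WEIGHTS}) - 1.0) < 1e-9
```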

---

## Tasks

10 sequential problems with increasing difficulty:

| ID | Title                              | Difficulty | Skill Evaluated               |
|----|------------------------------------|------------|-------------------------------|
| 1  | Broken CLI Argument Parser         | Easy       | Enums & pattern matching      |
| 2  | Conflicting Borrows                | Easy→Med   | Borrow checker                |
| 3  | Invalid Lifetime Annotations       | Medium     | Lifetime annotations          |
| 4  | Business Logic Errors              | Medium     | Math & correctness            |
| 5  | Linked List Management             | Medium     | Ownership & data structures   |
| 6  | Multi-threaded Deadlocks           | Hard       | Mutex & concurrency           |
| 7  | Async Borrowing Conflicts          | Hard       | Async/await lifetimes         |
| 8  | Unsafe FFI Integration             | Hard       | `unsafe` & C interop          |
| 9  | Inefficient Data Pipeline          | Hard       | Performance optimization      |
| 10 | Memory Leak Prevention             | Hard+      | Weak pointers & ownership     |

---

## Environment Variables / Secrets

The environment reads the following variables. Set them as **HF Space secrets** (Settings → Variables and Secrets) when deploying to Hugging Face, or in a local `.env` file for development.

| Variable       | Required | Default                              | Description                          |
|----------------|----------|--------------------------------------|--------------------------------------|
| `HF_TOKEN`     | Yes      | —                                    | Hugging Face API token for LLM calls |
| `API_BASE_URL` | No       | `https://router.huggingface.co/v1`   | Inference endpoint                   |
| `MODEL_NAME`   | No       | `Qwen/Qwen2.5-72B-Instruct`          | Model to use for evaluation          |

> **Note**: The `.env` file is excluded from Docker images by `.dockerignore`. On HF Spaces, the platform injects secrets as OS environment variables: `load_dotenv()` silently does nothing when no file is present, and `os.getenv()` reads the platform-injected values. This is the intended behavior.
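
The lookup described in the note reduces to plain `os.getenv` with defaults. The variable names and defaults come from the table above; the helper function itself is hypothetical.

```python
import os

def resolve_config() -> dict:
    """Read configuration from the process environment, falling back to the
    documented defaults (hypothetical helper; values mirror the table above)."""
    return {
        "HF_TOKEN": os.getenv("HF_TOKEN"),  # required; no default
        "API_BASE_URL": os.getenv("API_BASE_URL", "https://router.huggingface.co/v1"),
        "MODEL_NAME": os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
    }

cfg = resolve_config()
# With nothing set, the two optional variables fall back to their defaults.
```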

---

## Setup & Usage

### Local Development

```bash
# 1. Clone and enter the repo
git clone https://github.com/your-username/rust_coder
cd rust_coder

# 2. Create .env with your credentials
cat > .env << EOF
HF_TOKEN=hf_your_token_here
API_BASE_URL=https://router.huggingface.co/v1
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
EOF

# 3. Build the Docker image (uses root Dockerfile)
docker build -t rust_coder:latest .

# 4. Run the environment server
docker run -d -p 8000:8000 --env-file .env --name rust_env rust_coder:latest

# 5. Verify it's healthy
curl http://localhost:8000/health
# → {"status": "healthy"}

# 6. Run the inference benchmark
python inference.py
```

### Docker Commands Reference

```bash
# Build
docker build -t rust_coder:latest .

# Run with .env file
docker run -d -p 8000:8000 --env-file .env --name rust_env rust_coder:latest

# View logs
docker logs rust_env

# Stop
docker stop rust_env
```

### Environment API

```bash
# Reset (returns first problem)
curl -X POST http://localhost:8000/reset

# Step (submit Rust code)
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"code": "fn main() { println!(\"hello\"); }"}}'

# Health check
curl http://localhost:8000/health
```

### HF Spaces Deployment

```bash
# Install HF CLI
pip install huggingface_hub

# Login
huggingface-cli login

# Push to Space
openenv push --repo-id your-username/rust-coder
```

Then go to your Space settings and add secrets:
- `HF_TOKEN` → your Hugging Face API token
- `MODEL_NAME` → e.g. `Qwen/Qwen2.5-72B-Instruct`

---

## Baseline Scores

Baseline using **Qwen/Qwen2.5-72B-Instruct** via the Hugging Face router:

| Metric         | Score |
|----------------|-------|
| Average reward | 0.59  |
| Compilation %  | ~85%  |
| Correctness %  | ~45%  |

---

## Project Structure

```
rust_coder/
├── Dockerfile                     # Root Dockerfile (used by validator + HF Spaces)
├── server/Dockerfile              # Identical copy (used for -f flag builds)
├── openenv.yaml                   # OpenEnv spec metadata
├── pyproject.toml                 # Python package config
├── uv.lock                        # Locked dependencies
├── problems.json                  # 10 coding problems dataset
├── models.py                      # Pydantic action/observation types
├── client.py                      # WebSocket client for RustCoderEnv
├── inference.py                   # Baseline inference script (entry point)
├── __init__.py                    # Package exports
└── server/
    ├── app.py                     # FastAPI OpenEnv server entrypoint
    └── rust_coder_environment.py  # Core environment logic
```

## HF Space Runtime Model

- The Hugging Face Space serves the environment via `uvicorn server.app:app` (see `openenv.yaml` and `Dockerfile`).
- The built-in OpenEnv web UI may send an empty action on Step; this environment supports that by auto-calling the LLM when `action.code` is empty (unless disabled via `AUTO_LLM_ON_EMPTY_STEP=0`).
- `inference.py` is the required baseline runner used by the validator/judge. It connects to the running Space and drives `reset()`/`step()` in a loop, emitting strict `[START]`/`[STEP]`/`[END]` stdout lines.
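
The empty-step fallback described above can be gated with a one-line check. Only the variable name `AUTO_LLM_ON_EMPTY_STEP` and its default-on behavior come from this README; the surrounding logic is an assumed sketch.

```python
import os

# Clear any ambient value so the demo below is deterministic.
os.environ.pop("AUTO_LLM_ON_EMPTY_STEP", None)

def should_auto_call_llm(code: str) -> bool:
    """Return True when the submitted code is empty and the fallback is enabled
    (illustrative logic; only the env var name is taken from this README)."""
    enabled = os.getenv("AUTO_LLM_ON_EMPTY_STEP", "1") != "0"
    return enabled and not code.strip()

assert should_auto_call_llm("") is True             # empty action, default on
assert should_auto_call_llm("fn main() {}") is False
```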