---
title: Rust Coder OpenEnv
emoji: 🦀
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
base_path: /web
pinned: false
tags:
  - openenv
  - software-engineering
  - rust
---

# Rust Coder: Systems Engineering Environment

Rust Coder is a high-fidelity **OpenEnv** environment for evaluating and training LLM agents on real-world Rust systems programming tasks. Unlike toy environments, Rust Coder simulates realistic engineering scenarios involving the borrow checker, concurrency, and memory safety.

## Motivation

Rust is uniquely challenging for AI agents due to its strict compile-time safety guarantees. This environment provides a 10-task progression that measures an agent's ability to:

1. Fix borrow checker violations
2. Correctly annotate lifetimes
3. Resolve concurrency deadlocks
4. Write unsafe FFI code correctly
5. Identify and prevent memory leaks
6. Optimize data pipelines for performance

---

## Action Space

**Type**: `RustCoderAction`

The agent submits a single string containing the complete, fixed Rust source code.

| Field | Type   | Description                              |
|-------|--------|------------------------------------------|
| `code` | string | Full Rust source code to compile and test |

## Observation Space

**Type**: `RustCoderObservation`

The environment returns detailed feedback after each submission:

| Field                  | Type        | Description                                         |
|------------------------|-------------|-----------------------------------------------------|
| `problem_description`  | string      | Task requirements and context                       |
| `header_section`       | string      | LeetCode-style scaffold (imports + signatures/types) |
| `compilation_success`  | bool        | Whether `rustc` compiled the submitted code         |
| `compilation_output`   | string      | Raw compiler errors and warnings                    |
| `test_results`         | list[dict]  | Per-test pass/fail results with error details       |
| `reward_breakdown`     | dict        | Weighted score breakdown across 5 dimensions        |
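
Putting the two tables together, a single exchange might look like this in Python. This is a sketch: the action wrapper shape matches the `curl` example later in this README, but the observation values are illustrative, not real server output.

```python
import json

# Action: the agent submits one string containing the complete Rust source.
action_payload = {"action": {"code": 'fn main() { println!("hello"); }'}}

# Observation: illustrative shape of the feedback the environment returns.
# Field names come from the table above; the values are made up for this sketch.
observation = {
    "problem_description": "Fix the borrow checker error in the parser.",
    "header_section": "use std::collections::HashMap;",
    "compilation_success": True,
    "compilation_output": "",
    "test_results": [{"name": "test_parse", "passed": True, "error": None}],
    "reward_breakdown": {"compilation": 0.40, "correctness": 0.20},
}

# Both sides are plain JSON on the wire.
wire = json.dumps(action_payload)
assert json.loads(wire)["action"]["code"].startswith("fn main")
```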

---

## Reward Function

Total reward is a weighted sum of 5 dimensions, each normalized to [0, 1]:

| Dimension       | Weight | Metric                                            |
|-----------------|--------|---------------------------------------------------|
| Compilation     | 40%    | Binary success/failure of `rustc`                 |
| Correctness     | 20%    | Fraction of test assertions that pass             |
| Coverage        | 20%    | Fraction of tests that successfully ran           |
| Elegance        | 10%    | Code quality heuristics (penalizes `.unwrap()`, long lines, and `unsafe`) |
| Efficiency      | 10%    | Execution time vs. per-problem baseline           |

The reward provides a partial signal at every step: compilation alone earns 0.40, and passing all tests earns up to 1.0.
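
As a sanity check on the weights, the total can be computed as a plain weighted sum. This is an illustrative sketch, not the server's actual scoring code; only the weights come from the table above.

```python
# Weights from the table above; each dimension is normalized to [0, 1].
WEIGHTS = {
    "compilation": 0.40,
    "correctness": 0.20,
    "coverage":    0.20,
    "elegance":    0.10,
    "efficiency":  0.10,
}

def total_reward(scores: dict) -> float:
    """Weighted sum of per-dimension scores (illustrative, not the server's code)."""
    return sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)

# Code that compiles but passes no tests earns exactly the compilation weight.
assert total_reward({"compilation": 1.0}) == 0.40
# A perfect submission earns 1.0 (up to float rounding).
assert abs(total_reward({k: 1.0 for k in WEIGHTS}) - 1.0) < 1e-9
```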

---

## Tasks

10 sequential problems with increasing difficulty:

| ID | Title                              | Difficulty | Skill Evaluated               |
|----|------------------------------------|------------|-------------------------------|
| 1  | Broken CLI Argument Parser         | Easy       | Enums & pattern matching      |
| 2  | Conflicting Borrows                | Easy→Med   | Borrow checker                |
| 3  | Invalid Lifetime Annotations       | Medium     | Lifetime annotations          |
| 4  | Business Logic Errors              | Medium     | Math & correctness            |
| 5  | Linked List Management             | Medium     | Ownership & data structures   |
| 6  | Multi-threaded Deadlocks           | Hard       | Mutex & concurrency           |
| 7  | Async Borrowing Conflicts          | Hard       | Async/await lifetimes         |
| 8  | Unsafe FFI Integration             | Hard       | `unsafe` & C interop          |
| 9  | Inefficient Data Pipeline          | Hard       | Performance optimization      |
| 10 | Memory Leak Prevention             | Hard+      | Weak pointers & ownership     |

---

## Environment Variables / Secrets

The environment reads the following variables. Set them as **HF Space secrets** (Settings → Variables and Secrets) when deploying to Hugging Face, or in a local `.env` file for development.

| Variable       | Required | Default                              | Description                          |
|----------------|----------|--------------------------------------|--------------------------------------|
| `HF_TOKEN`     | Yes      | —                                    | Hugging Face API token for LLM calls |
| `API_BASE_URL` | No       | `https://router.huggingface.co/v1`   | Inference endpoint                   |
| `MODEL_NAME`   | No       | `Qwen/Qwen2.5-72B-Instruct`          | Model to use for evaluation          |

> **Note**: The `.env` file is excluded from Docker images by `.dockerignore`. On HF Spaces, the platform injects secrets as OS environment variables: `load_dotenv()` silently does nothing when no file is present, and `os.getenv()` reads the platform-injected values. This is the intended behavior.
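
The lookup described in the note reduces to plain `os.getenv` with defaults. The variable names and defaults come from the table above; the helper function itself is hypothetical.

```python
import os

def resolve_config() -> dict:
    """Read configuration from the process environment, falling back to the
    documented defaults (hypothetical helper; values mirror the table above)."""
    return {
        "HF_TOKEN": os.getenv("HF_TOKEN"),  # required; no default
        "API_BASE_URL": os.getenv("API_BASE_URL", "https://router.huggingface.co/v1"),
        "MODEL_NAME": os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
    }

cfg = resolve_config()
# With nothing set, the two optional variables fall back to their defaults.
```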

---

## Setup & Usage

### Local Development

```bash
# 1. Clone and enter the repo
git clone https://github.com/your-username/rust_coder
cd rust_coder

# 2. Create .env with your credentials
cat > .env << EOF
HF_TOKEN=hf_your_token_here
API_BASE_URL=https://router.huggingface.co/v1
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
EOF

# 3. Build the Docker image (uses root Dockerfile)
docker build -t rust_coder:latest .

# 4. Run the environment server
docker run -d -p 8000:8000 --env-file .env --name rust_env rust_coder:latest

# 5. Verify it's healthy
curl http://localhost:8000/health
# → {"status": "healthy"}

# 6. Run the inference benchmark
python inference.py
```

### Docker Commands Reference

```bash
# Build
docker build -t rust_coder:latest .

# Run with .env file
docker run -d -p 8000:8000 --env-file .env --name rust_env rust_coder:latest

# View logs
docker logs rust_env

# Stop
docker stop rust_env
```

### Environment API

```bash
# Reset (returns first problem)
curl -X POST http://localhost:8000/reset

# Step (submit Rust code)
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"code": "fn main() { println!(\"hello\"); }"}}'

# Health check
curl http://localhost:8000/health
```

### HF Spaces Deployment

```bash
# Install HF CLI
pip install huggingface_hub

# Login
huggingface-cli login

# Push to Space
openenv push --repo-id your-username/rust-coder
```

Then go to your Space settings and add secrets:
- `HF_TOKEN` → your Hugging Face API token
- `MODEL_NAME` → e.g. `Qwen/Qwen2.5-72B-Instruct`

---

## Baseline Scores

Baseline using **Qwen/Qwen2.5-72B-Instruct** via the Hugging Face router:

| Metric         | Score |
|----------------|-------|
| Average reward | 0.59  |
| Compilation %  | ~85%  |
| Correctness %  | ~45%  |

---

## Project Structure

```
rust_coder/
├── Dockerfile                     # Root Dockerfile (used by validator + HF Spaces)
├── server/Dockerfile              # Identical copy (used for -f flag builds)
├── openenv.yaml                   # OpenEnv spec metadata
├── pyproject.toml                 # Python package config
├── uv.lock                        # Locked dependencies
├── problems.json                  # 10 coding problems dataset
├── models.py                      # Pydantic action/observation types
├── client.py                      # WebSocket client for RustCoderEnv
├── inference.py                   # Baseline inference script (entry point)
├── __init__.py                    # Package exports
└── server/
    ├── app.py                     # FastAPI OpenEnv server entrypoint
    └── rust_coder_environment.py  # Core environment logic
```

## HF Space Runtime Model

- The Hugging Face Space serves the environment via `uvicorn server.app:app` (see `openenv.yaml` and `Dockerfile`).
- The built-in OpenEnv web UI may send an empty action on Step; this environment supports that by auto-calling the LLM when `action.code` is empty (unless disabled via `AUTO_LLM_ON_EMPTY_STEP=0`).
- `inference.py` is the required baseline runner used by the validator/judge. It connects to the running Space and drives `reset()`/`step()` in a loop, emitting strict `[START]`/`[STEP]`/`[END]` stdout lines.
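
The empty-step fallback described above can be gated with a one-line check. Only the variable name `AUTO_LLM_ON_EMPTY_STEP` and its default-on behavior come from this README; the surrounding logic is an assumed sketch.

```python
import os

# Clear any ambient value so the demo below is deterministic.
os.environ.pop("AUTO_LLM_ON_EMPTY_STEP", None)

def should_auto_call_llm(code: str) -> bool:
    """Return True when the submitted code is empty and the fallback is enabled
    (illustrative logic; only the env var name is taken from this README)."""
    enabled = os.getenv("AUTO_LLM_ON_EMPTY_STEP", "1") != "0"
    return enabled and not code.strip()

assert should_auto_call_llm("") is True             # empty action, default on
assert should_auto_call_llm("fn main() {}") is False
```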