---
title: Rust Coder OpenEnv
emoji: 🦀
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
base_path: /web
pinned: false
tags:
  - openenv
  - software-engineering
  - rust
---
# Rust Coder: Systems Engineering Environment

Rust Coder is a high-fidelity **OpenEnv** environment for evaluating and training LLM agents on real-world Rust systems programming tasks. Unlike toy environments, Rust Coder simulates realistic engineering scenarios involving the borrow checker, concurrency, and memory safety.
## Motivation

Rust is uniquely challenging for AI agents because of its strict compile-time safety guarantees. This environment provides a 10-task progression that measures an agent's ability to:

1. Fix borrow checker violations
2. Correctly annotate lifetimes
3. Resolve concurrency deadlocks
4. Write unsafe FFI code correctly
5. Identify and prevent memory leaks
6. Optimize data pipelines for performance

---
## Action Space

**Type**: `RustCoderAction`

The agent submits a single string containing the complete, fixed Rust source code.

| Field  | Type   | Description                               |
|--------|--------|-------------------------------------------|
| `code` | string | Full Rust source code to compile and test |

## Observation Space

**Type**: `RustCoderObservation`

The environment returns detailed feedback after each submission:

| Field                 | Type       | Description                                          |
|-----------------------|------------|------------------------------------------------------|
| `problem_description` | string     | Task requirements and context                        |
| `header_section`      | string     | LeetCode-style scaffold (imports + signatures/types) |
| `compilation_success` | bool       | Whether `rustc` compiled the submitted code          |
| `compilation_output`  | string     | Raw compiler errors and warnings                     |
| `test_results`        | list[dict] | Per-test pass/fail results with error details        |
| `reward_breakdown`    | dict       | Weighted score breakdown across 5 dimensions         |
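
As a rough illustration of the two types above, the following sketch models them with standard-library dataclasses. The field names come from the tables; everything else is an assumption. In particular, the project describes `models.py` as using Pydantic, and the `passed` key inside each `test_results` entry is a hypothetical name chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RustCoderAction:
    """One submission: the complete Rust source to compile and test."""
    code: str


@dataclass
class RustCoderObservation:
    """Feedback returned after a submission (field names from the table above)."""
    problem_description: str
    header_section: str
    compilation_success: bool
    compilation_output: str
    test_results: list[dict[str, Any]] = field(default_factory=list)
    reward_breakdown: dict[str, float] = field(default_factory=dict)


def summarize(obs: RustCoderObservation) -> str:
    """Condense an observation into a one-line status string."""
    if not obs.compilation_success:
        return "compile error"
    # "passed" is a hypothetical per-test key, not confirmed by the project.
    passed = sum(1 for t in obs.test_results if t.get("passed"))
    return f"compiled, {passed}/{len(obs.test_results)} tests passed"
```

An agent loop could use a helper like `summarize` to decide whether to feed `compilation_output` or the failing test details back into its next prompt.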

---

## Reward Function

Total reward is a weighted sum of 5 dimensions, each normalized to [0, 1]:

| Dimension   | Weight | Metric                                                             |
|-------------|--------|--------------------------------------------------------------------|
| Compilation | 40%    | Binary success/failure of `rustc`                                  |
| Correctness | 20%    | Fraction of test assertions that pass                              |
| Coverage    | 20%    | Fraction of tests that successfully ran                            |
| Elegance    | 10%    | Code quality heuristics (avoids `.unwrap()`, long lines, `unsafe`) |
| Efficiency  | 10%    | Execution time vs. per-problem baseline                            |

The reward provides a partial signal at every step: compilation alone contributes 0.40, and a submission that also passes all tests can earn up to 1.0.
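
The weighted sum can be sketched as below. The weights are taken from the table; the dictionary keys, the function name `total_reward`, and the rounding to four decimals are assumptions made for this example and may not match the environment's actual `reward_breakdown` keys.

```python
# Weights from the reward table; the key names are illustrative assumptions.
WEIGHTS = {
    "compilation": 0.40,
    "correctness": 0.20,
    "coverage": 0.20,
    "elegance": 0.10,
    "efficiency": 0.10,
}


def total_reward(scores: dict[str, float]) -> float:
    """Combine per-dimension scores in [0, 1] into one weighted reward."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = scores.get(name, 0.0)  # a missing dimension counts as 0
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
        total += weight * value
    # Round to hide floating-point noise in the sum (an assumption of this sketch).
    return round(total, 4)


print(total_reward({"compilation": 1.0}))  # 0.4
```

For instance, a submission that compiles, runs every test, and passes half of the assertions would score 0.40 + 0.20 × 0.5 + 0.20 = 0.70 before elegance and efficiency are counted.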

---

## Tasks

10 sequential problems with increasing difficulty:

| ID | Title                        | Difficulty  | Skill Evaluated             |
|----|------------------------------|-------------|-----------------------------|
| 1  | Broken CLI Argument Parser   | Easy        | Enums & pattern matching    |
| 2  | Conflicting Borrows          | Easy-Medium | Borrow checker              |
| 3  | Invalid Lifetime Annotations | Medium      | Lifetime annotations        |
| 4  | Business Logic Errors        | Medium      | Math & correctness          |
| 5  | Linked List Management       | Medium      | Ownership & data structures |
| 6  | Multi-threaded Deadlocks     | Hard        | Mutex & concurrency         |
| 7  | Async Borrowing Conflicts    | Hard        | Async/await lifetimes       |
| 8  | Unsafe FFI Integration       | Hard        | `unsafe` & C interop        |
| 9  | Inefficient Data Pipeline    | Hard        | Performance optimization    |
| 10 | Memory Leak Prevention       | Hard+       | Weak pointers & ownership   |

---

## Environment Variables / Secrets

The environment reads the following variables. Set them as **HF Space secrets** (Settings → Variables and Secrets) when deploying to Hugging Face, or in a local `.env` file for development.

| Variable       | Required | Default                            | Description                          |
|----------------|----------|------------------------------------|--------------------------------------|
| `HF_TOKEN`     | Yes      | (none)                             | Hugging Face API token for LLM calls |
| `API_BASE_URL` | No       | `https://router.huggingface.co/v1` | Inference endpoint                   |
| `MODEL_NAME`   | No       | `Qwen/Qwen2.5-72B-Instruct`        | Model to use for evaluation          |

> **Note**: The `.env` file is excluded from Docker images by `.dockerignore`. On HF Spaces, the platform injects secrets as OS environment variables. `load_dotenv()` silently does nothing when no file is present, and `os.getenv()` then reads the platform-injected values. This is the intended behavior.
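
The lookup described above can be sketched with the standard library alone. The variable names and defaults come from the table; the `load_settings` helper and its error message are hypothetical, and the sketch deliberately omits the `load_dotenv()` call so that it runs without third-party packages.

```python
import os
from collections.abc import Mapping

# Defaults copied from the table above.
DEFAULT_API_BASE_URL = "https://router.huggingface.co/v1"
DEFAULT_MODEL_NAME = "Qwen/Qwen2.5-72B-Instruct"


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Read the three variables, applying defaults to the optional ones."""
    token = env.get("HF_TOKEN")
    if not token:
        # HF_TOKEN has no default, so fail early with a clear message.
        raise RuntimeError("HF_TOKEN is required but not set")
    return {
        "HF_TOKEN": token,
        "API_BASE_URL": env.get("API_BASE_URL", DEFAULT_API_BASE_URL),
        "MODEL_NAME": env.get("MODEL_NAME", DEFAULT_MODEL_NAME),
    }
```

Passing the environment in as a mapping keeps the function easy to test: the same code path serves a local `.env`-populated environment and platform-injected secrets.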

---

## Setup & Usage

### Local Development

```bash
# 1. Clone and enter the repo
git clone https://github.com/your-username/rust_coder
cd rust_coder

# 2. Create .env with your credentials
cat > .env << EOF
HF_TOKEN=hf_your_token_here
API_BASE_URL=https://router.huggingface.co/v1
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
EOF

# 3. Build the Docker image (uses root Dockerfile)
docker build -t rust_coder:latest .

# 4. Run the environment server
docker run -d -p 8000:8000 --env-file .env --name rust_env rust_coder:latest

# 5. Verify it's healthy
curl http://localhost:8000/health
# → {"status": "healthy"}

# 6. Run the inference benchmark
python inference.py
```

### Docker Commands Reference

```bash
# Build
docker build -t rust_coder:latest .

# Run with .env file
docker run -d -p 8000:8000 --env-file .env --name rust_env rust_coder:latest

# View logs
docker logs rust_env

# Stop
docker stop rust_env
```

### Environment API

```bash
# Reset (returns first problem)
curl -X POST http://localhost:8000/reset

# Step (submit Rust code)
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"code": "fn main() { println!(\"hello\"); }"}}'

# Health check
curl http://localhost:8000/health
```
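
The same Step call can be issued from Python with only the standard library. This is a minimal sketch that mirrors the `curl` payload above; the helper names `build_step_request` and `send` are hypothetical, and it assumes the server replies with a JSON body, whose exact shape is not documented here.

```python
import json
import urllib.request


def build_step_request(
    code: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build a POST /step request carrying the Rust source as the action."""
    payload = json.dumps({"action": {"code": code}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/step",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and parse the reply, assuming it is a JSON object."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server from the steps above running, `send(build_step_request('fn main() { println!("hello"); }'))` would submit the same program as the `curl` example.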

### HF Spaces Deployment

```bash
# Install HF CLI
pip install huggingface_hub

# Login
huggingface-cli login

# Push to Space
openenv push --repo-id your-username/rust-coder
```

Then go to your Space settings and add secrets:

- `HF_TOKEN`: your Hugging Face API token
- `MODEL_NAME`: e.g. `Qwen/Qwen2.5-72B-Instruct`

---

## Baseline Scores

Baseline using **Qwen/Qwen2.5-72B-Instruct** via the Hugging Face router:

| Metric         | Score |
|----------------|-------|
| Average reward | 0.59  |
| Compilation %  | ~85%  |
| Correctness %  | ~45%  |

---

## Project Structure

```
rust_coder/
├── Dockerfile                     # Root Dockerfile (used by validator + HF Spaces)
├── server/Dockerfile              # Identical copy (used for -f flag builds)
├── openenv.yaml                   # OpenEnv spec metadata
├── pyproject.toml                 # Python package config
├── uv.lock                        # Locked dependencies
├── problems.json                  # 10 coding problems dataset
├── models.py                      # Pydantic action/observation types
├── client.py                      # WebSocket client for RustCoderEnv
├── inference.py                   # Baseline inference script (entry point)
├── __init__.py                    # Package exports
└── server/
    ├── app.py                     # FastAPI OpenEnv server entrypoint
    └── rust_coder_environment.py  # Core environment logic
```

## HF Space Runtime Model

- The Hugging Face Space serves the environment via `uvicorn server.app:app` (see `openenv.yaml` and `Dockerfile`).
- The built-in OpenEnv web UI may send an empty action on Step. The environment handles this by calling the LLM automatically when `action.code` is empty, unless that is disabled via `AUTO_LLM_ON_EMPTY_STEP=0`.
- `inference.py` is the required baseline runner used by the validator/judge. It connects to the running Space and drives `reset()`/`step()` in a loop, emitting strict `[START]`/`[STEP]`/`[END]` stdout lines.
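
The empty-action rule in the second bullet can be sketched as a small predicate. Only the variable name `AUTO_LLM_ON_EMPTY_STEP` and the meaning of `0` come from the text above; the function name, treating whitespace-only code as empty, and treating every value other than `0` as enabled are assumptions of this example.

```python
import os
from collections.abc import Mapping
from typing import Optional


def should_auto_call_llm(
    code: Optional[str], env: Mapping[str, str] = os.environ
) -> bool:
    """Return True when an empty Step action should be filled in by the LLM."""
    # Assumption: whitespace-only code counts as empty.
    if code and code.strip():
        return False
    # Enabled by default; only the literal value "0" switches it off.
    return env.get("AUTO_LLM_ON_EMPTY_STEP", "1") != "0"
```

A step handler could call this before compiling: when it returns `True`, the server generates code with the configured model; when the action is empty and it returns `False`, the empty submission is scored as-is.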