---
title: Openenv
emoji: ☁️
colorFrom: blue
colorTo: green
sdk: docker
sdk_version: "4.44.0"
app_file: app.py
pinned: false
---

# OpenEnv Hackathon Submission

## Environment Architecture (OpenEnv Contract)

This project uses an explicit OpenEnv contract layer in code:

- Core environment logic: `cloud_arena/llm_environment.py` -> `AWSCostEnv`
- OpenEnv interface adapter: `cloud_arena/llm_environment.py` -> `OpenEnvAdapter`
- Gym bridge used by training: `cloud_arena/llm_environment.py` -> `SB3Adapter`

Action space:
- `0`: NOOP
- `1`: CHECK_DEPENDENCIES
- `2`: RESIZE
- `3`: STOP
- `4`: DELETE
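
The mapping above can be expressed as an integer enum. This is a minimal sketch, not the code in `cloud_arena/llm_environment.py`: the names `Action` and `decode_action` are hypothetical, and falling back to `NOOP` for out-of-range indices is an assumption made for illustration.

```python
from enum import IntEnum


class Action(IntEnum):
    """Discrete action indices as listed in this README."""
    NOOP = 0
    CHECK_DEPENDENCIES = 1
    RESIZE = 2
    STOP = 3
    DELETE = 4


def decode_action(index: int) -> Action:
    """Map a raw policy output to an Action.

    Assumption: an index outside 0..4 is treated as NOOP rather than raising.
    """
    try:
        return Action(index)
    except ValueError:
        return Action.NOOP
```

Using an `IntEnum` keeps the values compatible with a Gym-style discrete space while giving log output readable names.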

Reward shaping includes cost delta, risk, reliability, action quality, anti-loop penalties, and terminal outcome components.
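
One way to combine these components is a weighted sum. The sketch below is illustrative only: the `RewardTerms` container, the `shaped_reward` function, the weight values, and the sign convention (risk and loop penalties subtract from the reward) are all assumptions, not the values used by `AWSCostEnv`.

```python
from dataclasses import dataclass


@dataclass
class RewardTerms:
    """Per-step reward components named in this README (hypothetical container)."""
    cost_delta: float      # cost saved by the action
    risk: float            # estimated risk introduced by the action
    reliability: float     # service reliability after the action
    action_quality: float  # how appropriate the action was for the state
    loop_penalty: float    # magnitude of repeated-action (anti-loop) penalty
    terminal: float        # bonus or penalty applied at episode end


# Hypothetical weights; negative weights turn a term into a penalty.
WEIGHTS = {
    "cost_delta": 1.0,
    "risk": -1.0,
    "reliability": 0.5,
    "action_quality": 0.5,
    "loop_penalty": -0.25,
    "terminal": 1.0,
}


def shaped_reward(terms: RewardTerms, weights: dict = WEIGHTS) -> float:
    """Return the weighted sum of all reward components."""
    return sum(w * getattr(terms, name) for name, w in weights.items())
```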

## Training Framework (Unsloth + GRPO)

The LLM training path in `cloud_arena/llm_training.py` uses the following Unsloth APIs:
- `from unsloth import FastLanguageModel`
- model loading via `FastLanguageModel.from_pretrained(...)`
- LoRA wrapping via `FastLanguageModel.get_peft_model(...)`
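
A minimal sketch of how these two calls typically fit together is shown below. The function name `load_policy`, the `LORA_KWARGS` values, the sequence length, and 4-bit loading are assumptions chosen for illustration, not the settings in `cloud_arena/llm_training.py`. Unsloth is imported inside the function because it requires a GPU runtime.

```python
# Hypothetical LoRA settings; tune these for the actual model and hardware.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 16,
    "lora_dropout": 0.0,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}


def load_policy(model_name: str, max_seq_length: int = 2048):
    """Load a base model with Unsloth and wrap it with LoRA adapters."""
    # Imported lazily so this module can be imported on machines without a GPU.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=max_seq_length,
        load_in_4bit=True,  # assumption: quantized loading to save memory
    )
    model = FastLanguageModel.get_peft_model(model, **LORA_KWARGS)
    return model, tokenizer
```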

The policy optimizer is a custom GRPO loop:
- generate K samples per state
- compute normalized relative advantages `(reward - mean) / std`
- backpropagate loss across all K samples
- step the real environment with the top-reward sample only
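
The advantage and selection steps above can be sketched as follows. This is a simplified illustration rather than the project's training loop: the function names are hypothetical, and using the population standard deviation with a small `eps` to guard against division by zero is an assumption.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize K rewards within one group: (reward - mean) / std.

    eps keeps the division finite when all K rewards are identical.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def select_sample_to_execute(samples, rewards):
    """Return the sample with the highest reward.

    Only this sample is used to step the real environment.
    """
    best = max(range(len(rewards)), key=rewards.__getitem__)
    return samples[best]
```

All K advantages feed the loss, so low-reward samples still push the policy away from poor actions even though only the top-reward sample changes the environment state.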

## Results and Evidence

The links below are temporary placeholders. Replace them with the final experiment images before the final leaderboard review:

- Reward / safety curve: [Reward Dashboard Image](https://placehold.co/1400x800/png?text=OpenEnv+GRPO+Reward+Curve)
- KL / entropy curve: [KL+Entropy Dashboard Image](https://placehold.co/1400x800/png?text=OpenEnv+GRPO+KL+Entropy)

## Artifact Links

- Live HF Space: [Openenv Space](https://huggingface.co/spaces/saravanatanjiro/Openenv)
- Training notebook entry: [Colab Landing](https://colab.research.google.com/)
- Technical writeup source: [Hugging Face Blog](https://huggingface.co/blog)
- Video platform entry: [YouTube](https://www.youtube.com/)

## Compliance Evidence Map

- OpenEnv structure: `cloud_arena/llm_environment.py`
- Unsloth integration: `cloud_arena/llm_training.py`
- Training UI and runtime controls: `app.py`
- Evidence/report document: `README.md`

Built for the OpenEnv Reinforcement Learning Hackathon.