Commit 6646bfe · Parent(s): 369a7c2
Clean repo without binary files

Files changed:
- .gitignore +1 -0
- Dockerfile +8 -0
- README.md +140 -0
- agents/__init__.py +0 -0
- agents/random_agent.py +5 -0
- app/app.py +60 -0
- data/posts.json +7 -0
- env/__init__.py +0 -0
- env/feed_env.py +29 -0
- env/reward.py +25 -0
- env/simulator.py +20 -0
- env/state.py +20 -0
- evaluation/__init__.py +0 -0
- evaluation/grader.py +24 -0
- inference.py +29 -0
- main.py +50 -0
- openenv.yaml +20 -0
- requirements.txt +1 -0
.gitignore
ADDED
@@ -0,0 +1 @@
assests/
Dockerfile
ADDED
@@ -0,0 +1,8 @@
FROM python:3.10

WORKDIR /app
COPY . .

RUN pip install --no-cache-dir numpy fastapi uvicorn

CMD ["uvicorn", "inference:app", "--host", "0.0.0.0", "--port", "7860"]
README.md
ADDED
@@ -0,0 +1,140 @@
# 🚀 Value-Aware RL Feed Ranking Environment (OpenEnv)

## 📌 Overview
This project implements a **real-world reinforcement learning environment** for feed ranking systems. It simulates how modern recommendation systems balance **user engagement, diversity, and responsible AI objectives** such as value alignment and toxicity reduction.

The environment follows the **OpenEnv specification**, enabling agents to interact through standard APIs: `step()`, `reset()`, and `state()`.

---
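The interaction loop implied by those three calls can be sketched with a stub environment (a hypothetical stand-in for the real `FeedRankingEnv` added in this commit; the reward and episode length here are made up for illustration):

```python
class StubEnv:
    """Minimal stand-in obeying the OpenEnv-style step/reset/state API."""

    def __init__(self):
        self._state, self._steps = None, 0

    def reset(self):
        # Start a fresh episode and return the initial state.
        self._state, self._steps = {"fatigue": 0.1}, 0
        return self._state

    def step(self, action):
        # Apply an action, return (state, reward, done, info).
        self._steps += 1
        reward = 1.0  # toy reward; the real env scores engagement etc.
        done = self._steps >= 3  # toy episode length
        return self._state, reward, done, {}

    def state(self):
        return self._state


env = StubEnv()
state = env.reset()
total, done = 0.0, False
while not done:
    state, reward, done, _ = env.step(["post_1"])
    total += reward
print(total)  # 3.0
```

The real environment follows the same loop but runs for 20 steps and computes a multi-objective reward.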
## 🧠 Motivation
Real-world recommendation systems (e.g., social media feeds) face a fundamental challenge:

> Maximizing engagement while ensuring responsible and aligned content delivery.

This project models that trade-off using a **multi-objective reward system**, making it suitable for studying **alignment, fairness, and long-term user behavior** in AI systems.

---

## 🏗️ Architecture

This system models a **value-aware RL pipeline for feed ranking**:



### Key Components
- **User State Representation** (embeddings, preferences, fatigue)
- **Candidate Post Selection**
- **Ranking Policy (Agent)**
- **User Behavior Simulator**
- **Reward Function (multi-objective)**
- **Evaluation / Grader System**

---
## ⚙️ Environment Design

### 🔹 Observation Space
The environment state is a `UserState` with:

- `user_embedding` (vector representation)
- `history` (past interactions)
- `interest` (engagement level)
- `fatigue` (content saturation)
- `value_pref` (value preferences / alignment sensitivity)

---
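A minimal sketch of such a state container (field names follow the list above; the authoritative definition is `env/state.py` in this commit, and the default values here are illustrative):

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Illustrative container mirroring the observation fields listed above.
@dataclass
class UserState:
    user_embedding: List[float]
    history: List[int] = field(default_factory=list)
    interest: float = 0.5   # engagement level
    fatigue: float = 0.1    # content saturation
    value_pref: Dict[str, float] = field(default_factory=dict)


u = UserState(user_embedding=[0.1] * 8, value_pref={"caring": 0.7})
print(u.interest, u.fatigue)  # 0.5 0.1
```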
| 48 |
+
|
| 49 |
+
### 🔹 Action Space
|
| 50 |
+
The agent selects:
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
Top-K ranked posts
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
Example:
|
| 57 |
+
|
| 58 |
+
[action] = [post_1, post_2, post_3]
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
---
|
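For instance, a score-and-slice policy (hypothetical; the repo's baseline `RandomAgent` simply samples posts at random) could produce such a Top-K action:

```python
posts = [
    {"id": 1, "caring": 0.8, "toxicity": 0.1},
    {"id": 2, "caring": 0.3, "toxicity": 0.6},
    {"id": 3, "caring": 0.7, "toxicity": 0.2},
    {"id": 4, "caring": 0.5, "toxicity": 0.3},
]

# Rank by a toy score (caring minus toxicity) and keep the top 3.
ranked = sorted(posts, key=lambda p: p["caring"] - p["toxicity"], reverse=True)
action = ranked[:3]
print([p["id"] for p in action])  # [1, 3, 4]
```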
### 🔹 Reward Function

The reward captures multiple objectives:

- ✅ Engagement (click / watch)
- ✅ Value Alignment
- ✅ Diversity
- ❌ Toxicity Penalty
- ❌ Fatigue Penalty

#### Hard Task Reward

```
Reward = 0.5 * Engagement
       + 0.2 * Alignment
       + 0.1 * Diversity
       - 0.2 * Toxicity
       - 0.1 * Fatigue
```
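Plugging in some illustrative numbers (the input values below are made up; the weights match the formula above and `env/reward.py`):

```python
# Hard-task reward with the weights from the formula above.
def hard_reward(engagement, alignment, diversity, toxicity, fatigue):
    return (0.5 * engagement + 0.2 * alignment
            + 0.1 * diversity - 0.2 * toxicity - 0.1 * fatigue)


# Hypothetical step: 2 clicks, fairly caring posts, fully diverse slate.
r = hard_reward(engagement=2.0, alignment=0.7, diversity=1.0,
                toxicity=0.2, fatigue=0.3)
print(round(r, 2))  # 1.17
```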
## 🎯 Tasks

| Task | Objective |
|------|-----------|
| 🟢 Easy | Maximize engagement |
| 🟡 Medium | Engagement + Diversity − Toxicity |
| 🔴 Hard | Engagement + Alignment + Diversity − Toxicity − Fatigue |

---

## 🧪 Evaluation

The environment includes an **agent grader** to evaluate performance across all tasks.

### Output Format (required for automated evaluation)

```
START
STEP task=easy score=...
STEP task=medium score=...
STEP task=hard score=...
END
```
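A consumer of this format could extract the scores like so (a sketch; the actual automated grader is not part of this repo):

```python
import re


def parse_scores(text):
    """Extract {task: score} from a START/STEP/END transcript."""
    scores = {}
    for m in re.finditer(r"STEP task=(\w+) score=([-\d.]+)", text):
        scores[m.group(1)] = float(m.group(2))
    return scores


sample = "START\nSTEP task=easy score=12.5\nSTEP task=hard score=3.1\nEND\n"
print(parse_scores(sample))  # {'easy': 12.5, 'hard': 3.1}
```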
---

## 📊 Key Insight

As task complexity increases from **Easy → Hard**, performance decreases.

> This demonstrates the real-world trade-off between **engagement optimization and responsible AI objectives**, a core challenge in modern recommendation systems.

---

## 🧰 Tech Stack

- Python
- NumPy
- Reinforcement Learning Concepts
- Simulation-based Evaluation

---

## 🚀 How to Run

### ▶️ Local Execution
```bash
python main.py
```
## 📚 References

This work draws inspiration from recent advances in reinforcement learning and value-aware ranking systems:

1. *Multi-Stage Feed Ranking Systems*
   https://arxiv.org/pdf/1906.03109

2. *Value-Aware Reinforcement Learning for Alignment*
   https://arxiv.org/pdf/2601.20083

3. *Sequential Optimization and Ranking in Dynamic Systems*
   https://arxiv.org/pdf/2509.14434v1
agents/__init__.py
ADDED
File without changes

agents/random_agent.py
ADDED
@@ -0,0 +1,5 @@
import random

class RandomAgent:
    def act(self, state, posts):
        return random.sample(posts, 3)
app/app.py
ADDED
@@ -0,0 +1,60 @@
import json
import random
import gradio as gr

from env.feed_env import FeedRankingEnv
from agents.random_agent import RandomAgent

# Load data
posts = json.load(open("data/posts.json"))

env = FeedRankingEnv(posts, task="hard")
agent = RandomAgent()

def run_demo(task):
    env.task = task
    state = env.reset()
    done = False

    log = []
    total_reward = 0

    while not done:
        action = agent.act(state, env.posts)
        state, reward, done, _ = env.step(action)

        total_reward += reward

        log.append({
            "action": [p["id"] for p in action],
            "reward": round(reward, 2),
            "fatigue": round(state.fatigue, 2)
        })

    return log, round(total_reward, 2)

def format_output(log, total):
    text = "=== Simulation Steps ===\n\n"
    for i, step in enumerate(log):
        text += f"Step {i+1}:\n"
        text += f"  Posts: {step['action']}\n"
        text += f"  Reward: {step['reward']}\n"
        text += f"  Fatigue: {step['fatigue']}\n\n"

    text += f"TOTAL REWARD: {total}\n"
    return text

def run(task):
    log, total = run_demo(task)
    return format_output(log, total)

demo = gr.Interface(
    fn=run,
    inputs=gr.Dropdown(["easy", "medium", "hard"], label="Select Task"),
    outputs="text",
    title="Value-Aware Feed Ranking Environment",
    description="Simulates RL-based feed ranking with engagement, alignment, and toxicity trade-offs."
)

if __name__ == "__main__":
    demo.launch()
data/posts.json
ADDED
@@ -0,0 +1,7 @@
[
  {"id": 1, "caring": 0.8, "toxicity": 0.1},
  {"id": 2, "caring": 0.3, "toxicity": 0.6},
  {"id": 3, "caring": 0.7, "toxicity": 0.2},
  {"id": 4, "caring": 0.5, "toxicity": 0.3},
  {"id": 5, "caring": 0.9, "toxicity": 0.05}
]
env/__init__.py
ADDED
File without changes

env/feed_env.py
ADDED
@@ -0,0 +1,29 @@
from env.state import init_user
from env.simulator import simulate_user
from env.reward import compute_reward

class FeedRankingEnv:
    def __init__(self, posts, task="hard"):
        self.posts = posts
        self.task = task
        self._state = None
        self.step_count = 0

    def reset(self):
        self._state = init_user()
        self.step_count = 0
        return self._state

    def step(self, action):
        responses = simulate_user(action, self._state)
        reward = compute_reward(action, responses, self._state, self.task)

        self._state.history.extend([p["id"] for p in action])
        self.step_count += 1

        done = self.step_count >= 20

        return self._state, reward, done, {}

    def state(self):
        return self._state
env/reward.py
ADDED
@@ -0,0 +1,25 @@
def compute_reward(action, responses, state, task):

    engagement = sum(
        [1 if r == "click" else 0.5 if r == "watch" else 0 for r in responses]
    )

    value_align = sum([post["caring"] for post in action]) / len(action)
    toxicity = sum([post["toxicity"] for post in action]) / len(action)

    diversity = len(set([p["id"] for p in action])) / len(action)

    if task == "easy":
        return engagement

    elif task == "medium":
        return engagement + 0.1 * diversity - 0.1 * toxicity

    else:  # HARD
        return (
            0.5 * engagement +
            0.2 * value_align +
            0.1 * diversity -
            0.2 * toxicity -
            0.1 * state.fatigue
        )
env/simulator.py
ADDED
@@ -0,0 +1,20 @@
import random

def simulate_user(action, state):
    responses = []

    for post in action:
        score = state.interest - state.fatigue

        if score > 0.6:
            responses.append("click")
        elif score > 0.3:
            responses.append("watch")
        else:
            responses.append("skip")

    # user drift
    state.interest += random.uniform(-0.05, 0.05)
    state.fatigue += 0.02

    return responses
env/state.py
ADDED
@@ -0,0 +1,20 @@
from dataclasses import dataclass
from typing import List, Dict
import random

@dataclass
class UserState:
    user_embedding: List[float]
    history: List[int]
    interest: float
    fatigue: float
    value_pref: Dict[str, float]

def init_user():
    return UserState(
        user_embedding=[random.random() for _ in range(8)],
        history=[],
        interest=random.uniform(0.4, 0.8),
        fatigue=0.1,
        value_pref={"caring": 0.7, "toxicity_tolerance": 0.2}
    )
evaluation/__init__.py
ADDED
File without changes

evaluation/grader.py
ADDED
@@ -0,0 +1,24 @@
def evaluate(env, agent):
    total = 0

    for _ in range(5):
        state = env.reset()
        done = False

        while not done:
            action = agent.act(state, env.posts)
            state, reward, done, _ = env.step(action)
            total += reward

    return total / 5


def evaluate_all_tasks(agent, posts, EnvClass):
    tasks = ["easy", "medium", "hard"]
    results = {}

    for t in tasks:
        env = EnvClass(posts, task=t)
        results[t] = evaluate(env, agent)

    return results
inference.py
ADDED
@@ -0,0 +1,29 @@
from fastapi import FastAPI
import json
from env.feed_env import FeedRankingEnv
from agents.random_agent import RandomAgent

app = FastAPI()

# Load data
with open("data/posts.json", "r") as f:
    posts = json.load(f)

env = FeedRankingEnv(posts, task="hard")
agent = RandomAgent()

@app.post("/reset")
def reset():
    state = env.reset()
    return {"state": state.__dict__}

@app.post("/step")
def step():
    action = agent.act(env.state(), env.posts)
    state, reward, done, _ = env.step(action)

    return {
        "state": state.__dict__,
        "reward": reward,
        "done": done
    }
main.py
ADDED
@@ -0,0 +1,50 @@
import json
import random
import numpy as np
import os

from env.feed_env import FeedRankingEnv
from agents.random_agent import RandomAgent
from evaluation.grader import evaluate_all_tasks

# ---------------------------
# Environment Variables (required format)
# ---------------------------
API_BASE_URL = os.getenv("API_BASE_URL", "default")
MODEL_NAME = os.getenv("MODEL_NAME", "baseline")
HF_TOKEN = os.getenv("HF_TOKEN")

# ---------------------------
# Reproducibility
# ---------------------------
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)

set_seed()

# ---------------------------
# Load data
# ---------------------------
with open("data/posts.json", "r") as f:
    posts = json.load(f)

# ---------------------------
# Initialize agent
# ---------------------------
agent = RandomAgent()

# ---------------------------
# Evaluate
# ---------------------------
results = evaluate_all_tasks(agent, posts, FeedRankingEnv)

# ---------------------------
# REQUIRED STRUCTURED OUTPUT
# ---------------------------
print("START")

for task, score in results.items():
    print(f"STEP task={task} score={round(score, 2)}")

print("END")
openenv.yaml
ADDED
@@ -0,0 +1,20 @@
name: value-aware-feed-ranking

tasks:
  - easy
  - medium
  - hard

observation_space:
  type: UserState

action_space:
  type: ranked_posts

reward:
  components:
    - engagement
    - value_alignment
    - diversity
    - toxicity_penalty
    - fatigue_penalty
requirements.txt
ADDED
@@ -0,0 +1 @@
numpy