ATISHAY005 committed on
Commit 6646bfe · 1 Parent(s): 369a7c2

Clean repo without binary files
.gitignore ADDED
@@ -0,0 +1 @@
+ assests/
Dockerfile ADDED
@@ -0,0 +1,8 @@
+ FROM python:3.10
+
+ WORKDIR /app
+ COPY . .
+
+ RUN pip install --no-cache-dir numpy fastapi uvicorn
+
+ CMD ["uvicorn", "inference:app", "--host", "0.0.0.0", "--port", "7860"]
README.md ADDED
@@ -0,0 +1,140 @@
+ # 🚀 Value-Aware RL Feed Ranking Environment (OpenEnv)
+
+ ## 📌 Overview
+ This project implements a **real-world reinforcement learning environment** for feed ranking systems. It simulates how modern recommendation systems balance **user engagement, diversity, and responsible AI objectives** such as value alignment and toxicity reduction.
+
+ The environment follows the **OpenEnv specification**, enabling agents to interact through standard APIs: `step()`, `reset()`, and `state()`.
+
+ ---
+
+ ## 🧠 Motivation
+ Real-world recommendation systems (e.g., social media feeds) face a fundamental challenge:
+
+ > Maximizing engagement while ensuring responsible and aligned content delivery.
+
+ This project models that trade-off using a **multi-objective reward system**, making it suitable for studying **alignment, fairness, and long-term user behavior** in AI systems.
+
+ ---
+
+ ## 🏗️ Architecture
+
+ This system models a **value-aware RL pipeline for feed ranking**:
+
+ ![Architecture](assests/architecture.jpeg)
+
+ ### Key Components:
+ - **User State Representation** (embeddings, preferences, fatigue)
+ - **Candidate Post Selection**
+ - **Ranking Policy (Agent)**
+ - **User Behavior Simulator**
+ - **Reward Function (multi-objective)**
+ - **Evaluation / Grader System**
+
+ ---
+
+ ## ⚙️ Environment Design
+
+ ### 🔹 Observation Space
+ The environment state is defined as:
+
+     UserState:
+         user_embedding     (vector representation)
+         history            (past interactions)
+         interest           (engagement level)
+         fatigue            (content saturation)
+         value preferences  (alignment sensitivity)
+
+ ---
+
+ ### 🔹 Action Space
+ The agent selects the **Top-K ranked posts**.
+
+ Example:
+
+     action = [post_1, post_2, post_3]
+
+ ---
+
+ ### 🔹 Reward Function
+
+ The reward captures multiple objectives:
+
+ - ✅ Engagement (click / watch)
+ - ✅ Value Alignment
+ - ✅ Diversity
+ - ❌ Toxicity Penalty
+ - ❌ Fatigue Penalty
+
+ #### Hard Task Reward:
+
+     Reward = 0.5 * Engagement
+            + 0.2 * Alignment
+            + 0.1 * Diversity
+            - 0.2 * Toxicity
+            - 0.1 * Fatigue
+
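As a quick sanity check, the Hard task formula can be evaluated on made-up numbers; the helper function and all input values below are illustrative, not part of the repo:

```python
def hard_reward(engagement, alignment, diversity, toxicity, fatigue):
    # Weighted sum from the Hard Task Reward definition above
    return (0.5 * engagement + 0.2 * alignment + 0.1 * diversity
            - 0.2 * toxicity - 0.1 * fatigue)

# e.g. two clicks, a well-aligned and fully diverse slate, mild toxicity/fatigue
r = hard_reward(engagement=2.0, alignment=0.7, diversity=1.0,
                toxicity=0.15, fatigue=0.3)
print(round(r, 2))  # → 1.18
```

Note how the engagement term dominates: its 0.5 weight means a single extra click outweighs the full toxicity penalty range.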
+ ## 🎯 Tasks
+
+ | Task | Objective |
+ |------|-----------|
+ | 🟢 Easy | Maximize engagement |
+ | 🟡 Medium | Engagement + Diversity |
+ | 🔴 Hard | Engagement + Alignment + Toxicity + Fatigue |
+
+ ---
+
+ ## 🧪 Evaluation
+
+ The environment includes an **agent grader** to evaluate performance across all tasks.
+
+ ### Output Format (required for automated evaluation):
+
+     START
+     STEP task=easy score=...
+     STEP task=medium score=...
+     STEP task=hard score=...
+     END
+
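Downstream tooling can consume this format with a few lines of parsing; a minimal sketch (the function name and sample string are illustrative):

```python
def parse_report(text):
    # Collect task -> score from lines like "STEP task=easy score=1.23"
    scores = {}
    for line in text.splitlines():
        if line.startswith("STEP "):
            fields = dict(part.split("=") for part in line.split()[1:])
            scores[fields["task"]] = float(fields["score"])
    return scores

sample = "START\nSTEP task=easy score=2.0\nSTEP task=hard score=0.5\nEND"
print(parse_report(sample))  # → {'easy': 2.0, 'hard': 0.5}
```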
+ ---
+
+ ## 📊 Key Insight
+
+ As task complexity increases from **Easy → Hard**, performance decreases.
+
+ > This demonstrates the real-world trade-off between **engagement optimization and responsible AI objectives**, a core challenge in modern recommendation systems.
+
+ ---
+
+ ## 🧰 Tech Stack
+
+ - Python
+ - NumPy
+ - Reinforcement Learning Concepts
+ - Simulation-based Evaluation
+
+ ---
+
+ ## 🚀 How to Run
+
+ ### ▶️ Local Execution
+ ```bash
+ python main.py
+ ```
+
+ ---
+
+ ## 📚 References
+
+ This work draws inspiration from recent advances in reinforcement learning and value-aware ranking systems:
+
+ 1. *Multi-Stage Feed Ranking Systems*
+    https://arxiv.org/pdf/1906.03109
+
+ 2. *Value-Aware Reinforcement Learning for Alignment*
+    https://arxiv.org/pdf/2601.20083
+
+ 3. *Sequential Optimization and Ranking in Dynamic Systems*
+    https://arxiv.org/pdf/2509.14434v1
agents/__init__.py ADDED
File without changes
agents/random_agent.py ADDED
@@ -0,0 +1,5 @@
+ import random
+
+ class RandomAgent:
+     def act(self, state, posts):
+         return random.sample(posts, 3)
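For reference, the agent can be exercised on its own; the class is reproduced here so the snippet runs without the repo, and the stand-in posts are illustrative:

```python
import random

class RandomAgent:
    def act(self, state, posts):
        # Pick 3 distinct posts uniformly at random; state is ignored
        return random.sample(posts, 3)

posts = [{"id": i} for i in range(1, 6)]
slate = RandomAgent().act(state=None, posts=posts)
print(len(slate), len({p["id"] for p in slate}))  # → 3 3
```

Because `random.sample` draws without replacement, the slate always contains three distinct posts, which is why the diversity term in the reward is trivially maximized by this baseline.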
app/app.py ADDED
@@ -0,0 +1,60 @@
+ import json
+ import random
+ import gradio as gr
+
+ from env.feed_env import FeedRankingEnv
+ from agents.random_agent import RandomAgent
+
+ # Load data
+ with open("data/posts.json") as f:
+     posts = json.load(f)
+
+ env = FeedRankingEnv(posts, task="hard")
+ agent = RandomAgent()
+
+ def run_demo(task):
+     env.task = task
+     state = env.reset()
+     done = False
+
+     log = []
+     total_reward = 0
+
+     while not done:
+         action = agent.act(state, env.posts)
+         state, reward, done, _ = env.step(action)
+
+         total_reward += reward
+
+         log.append({
+             "action": [p["id"] for p in action],
+             "reward": round(reward, 2),
+             "fatigue": round(state.fatigue, 2)
+         })
+
+     return log, round(total_reward, 2)
+
+ def format_output(log, total):
+     text = "=== Simulation Steps ===\n\n"
+     for i, step in enumerate(log):
+         text += f"Step {i+1}:\n"
+         text += f"  Posts: {step['action']}\n"
+         text += f"  Reward: {step['reward']}\n"
+         text += f"  Fatigue: {step['fatigue']}\n\n"
+
+     text += f"TOTAL REWARD: {total}\n"
+     return text
+
+ def run(task):
+     log, total = run_demo(task)
+     return format_output(log, total)
+
+ demo = gr.Interface(
+     fn=run,
+     inputs=gr.Dropdown(["easy", "medium", "hard"], label="Select Task"),
+     outputs="text",
+     title="Value-Aware Feed Ranking Environment",
+     description="Simulates RL-based feed ranking with engagement, alignment, and toxicity trade-offs."
+ )
+
+ if __name__ == "__main__":
+     demo.launch()
data/posts.json ADDED
@@ -0,0 +1,7 @@
+ [
+   {"id": 1, "caring": 0.8, "toxicity": 0.1},
+   {"id": 2, "caring": 0.3, "toxicity": 0.6},
+   {"id": 3, "caring": 0.7, "toxicity": 0.2},
+   {"id": 4, "caring": 0.5, "toxicity": 0.3},
+   {"id": 5, "caring": 0.9, "toxicity": 0.05}
+ ]
env/__init__.py ADDED
File without changes
env/feed_env.py ADDED
@@ -0,0 +1,29 @@
+ from env.state import init_user
+ from env.simulator import simulate_user
+ from env.reward import compute_reward
+
+ class FeedRankingEnv:
+     def __init__(self, posts, task="hard"):
+         self.posts = posts
+         self.task = task
+         self._state = None
+         self.step_count = 0
+
+     def reset(self):
+         self._state = init_user()
+         self.step_count = 0
+         return self._state
+
+     def step(self, action):
+         responses = simulate_user(action, self._state)
+         reward = compute_reward(action, responses, self._state, self.task)
+
+         self._state.history.extend([p["id"] for p in action])
+         self.step_count += 1
+
+         done = self.step_count >= 20
+
+         return self._state, reward, done, {}
+
+     def state(self):
+         return self._state
env/reward.py ADDED
@@ -0,0 +1,25 @@
+ def compute_reward(action, responses, state, task):
+
+     engagement = sum(
+         1 if r == "click" else 0.5 if r == "watch" else 0 for r in responses
+     )
+
+     value_align = sum(post["caring"] for post in action) / len(action)
+     toxicity = sum(post["toxicity"] for post in action) / len(action)
+
+     diversity = len({p["id"] for p in action}) / len(action)
+
+     if task == "easy":
+         return engagement
+
+     elif task == "medium":
+         return engagement + 0.1 * diversity - 0.1 * toxicity
+
+     else:  # HARD
+         return (
+             0.5 * engagement +
+             0.2 * value_align +
+             0.1 * diversity -
+             0.2 * toxicity -
+             0.1 * state.fatigue
+         )
env/simulator.py ADDED
@@ -0,0 +1,20 @@
+ import random
+
+ def simulate_user(action, state):
+     responses = []
+
+     # Note: in this simple simulator the response depends only on the
+     # user's interest/fatigue, not on the individual post.
+     for post in action:
+         score = state.interest - state.fatigue
+
+         if score > 0.6:
+             responses.append("click")
+         elif score > 0.3:
+             responses.append("watch")
+         else:
+             responses.append("skip")
+
+     # user drift
+     state.interest += random.uniform(-0.05, 0.05)
+     state.fatigue += 0.02
+
+     return responses
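The click/watch/skip thresholds can be checked in isolation; the helper below reproduces the per-post decision rule from `simulate_user`, with illustrative input values:

```python
def response_for(interest, fatigue):
    # Same decision rule simulate_user applies to each post in the slate
    score = interest - fatigue
    if score > 0.6:
        return "click"
    elif score > 0.3:
        return "watch"
    return "skip"

print(response_for(0.8, 0.1))  # → click (score 0.7)
print(response_for(0.5, 0.1))  # → watch (score 0.4)
print(response_for(0.4, 0.3))  # → skip  (score 0.1)
```

Since fatigue grows by 0.02 per step and interest only drifts, a user reliably degrades from clicking toward skipping over a 20-step episode.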
env/state.py ADDED
@@ -0,0 +1,20 @@
+ from dataclasses import dataclass
+ from typing import List, Dict
+ import random
+
+ @dataclass
+ class UserState:
+     user_embedding: List[float]
+     history: List[int]
+     interest: float
+     fatigue: float
+     value_pref: Dict[str, float]
+
+ def init_user():
+     return UserState(
+         user_embedding=[random.random() for _ in range(8)],
+         history=[],
+         interest=random.uniform(0.4, 0.8),
+         fatigue=0.1,
+         value_pref={"caring": 0.7, "toxicity_tolerance": 0.2}
+     )
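A standalone check of `init_user`'s invariants; the dataclass and factory are reproduced inline so the snippet runs without the repo on the path:

```python
import random
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class UserState:
    user_embedding: List[float]
    history: List[int]
    interest: float
    fatigue: float
    value_pref: Dict[str, float]

def init_user():
    return UserState(
        user_embedding=[random.random() for _ in range(8)],
        history=[],
        interest=random.uniform(0.4, 0.8),
        fatigue=0.1,
        value_pref={"caring": 0.7, "toxicity_tolerance": 0.2},
    )

u = init_user()
# Every fresh user starts with an 8-dim embedding, mid-range interest,
# and low fatigue, so early steps tend to produce clicks or watches.
print(len(u.user_embedding), 0.4 <= u.interest <= 0.8, u.fatigue)  # → 8 True 0.1
```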
evaluation/__init__.py ADDED
File without changes
evaluation/grader.py ADDED
@@ -0,0 +1,24 @@
+ def evaluate(env, agent):
+     total = 0
+
+     for _ in range(5):
+         state = env.reset()
+         done = False
+
+         while not done:
+             action = agent.act(state, env.posts)
+             state, reward, done, _ = env.step(action)
+             total += reward
+
+     return total / 5
+
+
+ def evaluate_all_tasks(agent, posts, EnvClass):
+     tasks = ["easy", "medium", "hard"]
+     results = {}
+
+     for t in tasks:
+         env = EnvClass(posts, task=t)
+         results[t] = evaluate(env, agent)
+
+     return results
inference.py ADDED
@@ -0,0 +1,29 @@
+ from fastapi import FastAPI
+ import json
+ from env.feed_env import FeedRankingEnv
+ from agents.random_agent import RandomAgent
+
+ app = FastAPI()
+
+ # Load data
+ with open("data/posts.json", "r") as f:
+     posts = json.load(f)
+
+ env = FeedRankingEnv(posts, task="hard")
+ agent = RandomAgent()
+
+ @app.post("/reset")
+ def reset():
+     state = env.reset()
+     return {"state": state.__dict__}
+
+ @app.post("/step")
+ def step():
+     if env.state() is None:
+         env.reset()  # guard: /step before an explicit /reset would otherwise fail
+     action = agent.act(env.state(), env.posts)
+     state, reward, done, _ = env.step(action)
+
+     return {
+         "state": state.__dict__,
+         "reward": reward,
+         "done": done
+     }
main.py ADDED
@@ -0,0 +1,50 @@
+ import json
+ import os
+ import random
+
+ import numpy as np
+
+ from env.feed_env import FeedRankingEnv
+ from agents.random_agent import RandomAgent
+ from evaluation.grader import evaluate_all_tasks
+
+ # ---------------------------
+ # Environment Variables (required format)
+ # ---------------------------
+ API_BASE_URL = os.getenv("API_BASE_URL", "default")
+ MODEL_NAME = os.getenv("MODEL_NAME", "baseline")
+ HF_TOKEN = os.getenv("HF_TOKEN")
+
+ # ---------------------------
+ # Reproducibility
+ # ---------------------------
+ def set_seed(seed=42):
+     random.seed(seed)
+     np.random.seed(seed)
+
+ set_seed()
+
+ # ---------------------------
+ # Load data
+ # ---------------------------
+ with open("data/posts.json", "r") as f:
+     posts = json.load(f)
+
+ # ---------------------------
+ # Initialize agent
+ # ---------------------------
+ agent = RandomAgent()
+
+ # ---------------------------
+ # Evaluate
+ # ---------------------------
+ results = evaluate_all_tasks(agent, posts, FeedRankingEnv)
+
+ # ---------------------------
+ # REQUIRED STRUCTURED OUTPUT
+ # ---------------------------
+ print("START")
+
+ for task, score in results.items():
+     print(f"STEP task={task} score={round(score, 2)}")
+
+ print("END")
openenv.yaml ADDED
@@ -0,0 +1,20 @@
+ name: value-aware-feed-ranking
+
+ tasks:
+   - easy
+   - medium
+   - hard
+
+ observation_space:
+   type: UserState
+
+ action_space:
+   type: ranked_posts
+
+ reward:
+   components:
+     - engagement
+     - value_alignment
+     - diversity
+     - toxicity_penalty
+     - fatigue_penalty
requirements.txt ADDED
@@ -0,0 +1 @@
+ numpy