Commit 6646bfe · Parent(s): 369a7c2
Clean repo without binary files

Files changed:
- .gitignore +1 -0
- Dockerfile +8 -0
- README.md +140 -0
- agents/__init__.py +0 -0
- agents/random_agent.py +5 -0
- app/app.py +60 -0
- data/posts.json +7 -0
- env/__init__.py +0 -0
- env/feed_env.py +29 -0
- env/reward.py +25 -0
- env/simulator.py +20 -0
- env/state.py +20 -0
- evaluation/__init__.py +0 -0
- evaluation/grader.py +24 -0
- inference.py +29 -0
- main.py +50 -0
- openenv.yaml +20 -0
- requirements.txt +1 -0
.gitignore
ADDED
@@ -0,0 +1 @@
assests/
Dockerfile
ADDED
@@ -0,0 +1,8 @@
FROM python:3.10

WORKDIR /app
COPY . .

RUN pip install --no-cache-dir numpy fastapi uvicorn

CMD ["uvicorn", "inference:app", "--host", "0.0.0.0", "--port", "7860"]
README.md
ADDED
@@ -0,0 +1,140 @@
# 🚀 Value-Aware RL Feed Ranking Environment (OpenEnv)

## 📌 Overview
This project implements a **real-world reinforcement learning environment** for feed ranking systems. It simulates how modern recommendation systems balance **user engagement, diversity, and responsible AI objectives** such as value alignment and toxicity reduction.

The environment follows the **OpenEnv specification**, enabling agents to interact through standard APIs: `step()`, `reset()`, and `state()`.

---
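The interaction loop implied by those three calls can be sketched with a stub environment (a hypothetical stand-in for the real `FeedRankingEnv` added in this commit; the reward and episode length here are made up for illustration):

```python
class StubEnv:
    """Minimal stand-in obeying the OpenEnv-style step/reset/state API."""

    def __init__(self):
        self._state, self._steps = None, 0

    def reset(self):
        # Start a fresh episode and return the initial state.
        self._state, self._steps = {"fatigue": 0.1}, 0
        return self._state

    def step(self, action):
        # Apply an action, return (state, reward, done, info).
        self._steps += 1
        reward = 1.0  # toy reward; the real env scores engagement etc.
        done = self._steps >= 3  # toy episode length
        return self._state, reward, done, {}

    def state(self):
        return self._state


env = StubEnv()
state = env.reset()
total, done = 0.0, False
while not done:
    state, reward, done, _ = env.step(["post_1"])
    total += reward
print(total)  # 3.0
```

The real environment follows the same loop but runs for 20 steps and computes a multi-objective reward.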
## 🧠 Motivation
Real-world recommendation systems (e.g., social media feeds) face a fundamental challenge:

> Maximizing engagement while ensuring responsible and aligned content delivery.

This project models that trade-off using a **multi-objective reward system**, making it suitable for studying **alignment, fairness, and long-term user behavior** in AI systems.

---

## 🏗️ Architecture

This system models a **value-aware RL pipeline for feed ranking**:



### Key Components
- **User State Representation** (embeddings, preferences, fatigue)
- **Candidate Post Selection**
- **Ranking Policy (Agent)**
- **User Behavior Simulator**
- **Reward Function (multi-objective)**
- **Evaluation / Grader System**

---
## ⚙️ Environment Design

### 🔹 Observation Space
The environment state is a `UserState` with:

- `user_embedding` (vector representation)
- `history` (past interactions)
- `interest` (engagement level)
- `fatigue` (content saturation)
- `value_pref` (value preferences / alignment sensitivity)

---
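A minimal sketch of such a state container (field names follow the list above; the authoritative definition is `env/state.py` in this commit, and the default values here are illustrative):

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Illustrative container mirroring the observation fields listed above.
@dataclass
class UserState:
    user_embedding: List[float]
    history: List[int] = field(default_factory=list)
    interest: float = 0.5   # engagement level
    fatigue: float = 0.1    # content saturation
    value_pref: Dict[str, float] = field(default_factory=dict)


u = UserState(user_embedding=[0.1] * 8, value_pref={"caring": 0.7})
print(u.interest, u.fatigue)  # 0.5 0.1
```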
| 48 |
+
|
| 49 |
+
### 🔹 Action Space
|
| 50 |
+
The agent selects:
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
Top-K ranked posts
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
Example:
|
| 57 |
+
|
| 58 |
+
[action] = [post_1, post_2, post_3]
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
---
|
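For instance, a score-and-slice policy (hypothetical; the repo's baseline `RandomAgent` simply samples posts at random) could produce such a Top-K action:

```python
posts = [
    {"id": 1, "caring": 0.8, "toxicity": 0.1},
    {"id": 2, "caring": 0.3, "toxicity": 0.6},
    {"id": 3, "caring": 0.7, "toxicity": 0.2},
    {"id": 4, "caring": 0.5, "toxicity": 0.3},
]

# Rank by a toy score (caring minus toxicity) and keep the top 3.
ranked = sorted(posts, key=lambda p: p["caring"] - p["toxicity"], reverse=True)
action = ranked[:3]
print([p["id"] for p in action])  # [1, 3, 4]
```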
### 🔹 Reward Function

The reward captures multiple objectives:

- ✅ Engagement (click / watch)
- ✅ Value Alignment
- ✅ Diversity
- ❌ Toxicity Penalty
- ❌ Fatigue Penalty

#### Hard Task Reward

```
Reward = 0.5 * Engagement
       + 0.2 * Alignment
       + 0.1 * Diversity
       - 0.2 * Toxicity
       - 0.1 * Fatigue
```
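Plugging in some illustrative numbers (the input values below are made up; the weights match the formula above and `env/reward.py`):

```python
# Hard-task reward with the weights from the formula above.
def hard_reward(engagement, alignment, diversity, toxicity, fatigue):
    return (0.5 * engagement + 0.2 * alignment
            + 0.1 * diversity - 0.2 * toxicity - 0.1 * fatigue)


# Hypothetical step: 2 clicks, fairly caring posts, fully diverse slate.
r = hard_reward(engagement=2.0, alignment=0.7, diversity=1.0,
                toxicity=0.2, fatigue=0.3)
print(round(r, 2))  # 1.17
```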
## 🎯 Tasks

| Task | Objective |
|------|-----------|
| 🟢 Easy | Maximize engagement |
| 🟡 Medium | Engagement + Diversity − Toxicity |
| 🔴 Hard | Engagement + Alignment + Diversity − Toxicity − Fatigue |

---

## 🧪 Evaluation

The environment includes an **agent grader** to evaluate performance across all tasks.

### Output Format (required for automated evaluation)

```
START
STEP task=easy score=...
STEP task=medium score=...
STEP task=hard score=...
END
```
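A consumer of this format could extract the scores like so (a sketch; the actual automated grader is not part of this repo):

```python
import re


def parse_scores(text):
    """Extract {task: score} from a START/STEP/END transcript."""
    scores = {}
    for m in re.finditer(r"STEP task=(\w+) score=([-\d.]+)", text):
        scores[m.group(1)] = float(m.group(2))
    return scores


sample = "START\nSTEP task=easy score=12.5\nSTEP task=hard score=3.1\nEND\n"
print(parse_scores(sample))  # {'easy': 12.5, 'hard': 3.1}
```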
---

## 📊 Key Insight

As task complexity increases from **Easy → Hard**, performance decreases.

> This demonstrates the real-world trade-off between **engagement optimization and responsible AI objectives**, a core challenge in modern recommendation systems.

---

## 🧰 Tech Stack

- Python
- NumPy
- Reinforcement Learning Concepts
- Simulation-based Evaluation

---

## 🚀 How to Run

### ▶️ Local Execution
```bash
python main.py
```
## 📚 References

This work draws inspiration from recent advances in reinforcement learning and value-aware ranking systems:

1. *Multi-Stage Feed Ranking Systems*
   https://arxiv.org/pdf/1906.03109

2. *Value-Aware Reinforcement Learning for Alignment*
   https://arxiv.org/pdf/2601.20083

3. *Sequential Optimization and Ranking in Dynamic Systems*
   https://arxiv.org/pdf/2509.14434v1
agents/__init__.py
ADDED
File without changes

agents/random_agent.py
ADDED
@@ -0,0 +1,5 @@
import random

class RandomAgent:
    def act(self, state, posts):
        return random.sample(posts, 3)
app/app.py
ADDED
@@ -0,0 +1,60 @@
import json
import random
import gradio as gr

from env.feed_env import FeedRankingEnv
from agents.random_agent import RandomAgent

# Load data
posts = json.load(open("data/posts.json"))

env = FeedRankingEnv(posts, task="hard")
agent = RandomAgent()

def run_demo(task):
    env.task = task
    state = env.reset()
    done = False

    log = []
    total_reward = 0

    while not done:
        action = agent.act(state, env.posts)
        state, reward, done, _ = env.step(action)

        total_reward += reward

        log.append({
            "action": [p["id"] for p in action],
            "reward": round(reward, 2),
            "fatigue": round(state.fatigue, 2)
        })

    return log, round(total_reward, 2)

def format_output(log, total):
    text = "=== Simulation Steps ===\n\n"
    for i, step in enumerate(log):
        text += f"Step {i+1}:\n"
        text += f"  Posts: {step['action']}\n"
        text += f"  Reward: {step['reward']}\n"
        text += f"  Fatigue: {step['fatigue']}\n\n"

    text += f"TOTAL REWARD: {total}\n"
    return text

def run(task):
    log, total = run_demo(task)
    return format_output(log, total)

demo = gr.Interface(
    fn=run,
    inputs=gr.Dropdown(["easy", "medium", "hard"], label="Select Task"),
    outputs="text",
    title="Value-Aware Feed Ranking Environment",
    description="Simulates RL-based feed ranking with engagement, alignment, and toxicity trade-offs."
)

if __name__ == "__main__":
    demo.launch()
data/posts.json
ADDED
@@ -0,0 +1,7 @@
[
  {"id": 1, "caring": 0.8, "toxicity": 0.1},
  {"id": 2, "caring": 0.3, "toxicity": 0.6},
  {"id": 3, "caring": 0.7, "toxicity": 0.2},
  {"id": 4, "caring": 0.5, "toxicity": 0.3},
  {"id": 5, "caring": 0.9, "toxicity": 0.05}
]
env/__init__.py
ADDED
File without changes

env/feed_env.py
ADDED
@@ -0,0 +1,29 @@
from env.state import init_user
from env.simulator import simulate_user
from env.reward import compute_reward

class FeedRankingEnv:
    def __init__(self, posts, task="hard"):
        self.posts = posts
        self.task = task
        self._state = None
        self.step_count = 0

    def reset(self):
        self._state = init_user()
        self.step_count = 0
        return self._state

    def step(self, action):
        responses = simulate_user(action, self._state)
        reward = compute_reward(action, responses, self._state, self.task)

        self._state.history.extend([p["id"] for p in action])
        self.step_count += 1

        done = self.step_count >= 20

        return self._state, reward, done, {}

    def state(self):
        return self._state
env/reward.py
ADDED
@@ -0,0 +1,25 @@
def compute_reward(action, responses, state, task):

    engagement = sum(
        [1 if r == "click" else 0.5 if r == "watch" else 0 for r in responses]
    )

    value_align = sum([post["caring"] for post in action]) / len(action)
    toxicity = sum([post["toxicity"] for post in action]) / len(action)

    diversity = len(set([p["id"] for p in action])) / len(action)

    if task == "easy":
        return engagement

    elif task == "medium":
        return engagement + 0.1 * diversity - 0.1 * toxicity

    else:  # HARD
        return (
            0.5 * engagement +
            0.2 * value_align +
            0.1 * diversity -
            0.2 * toxicity -
            0.1 * state.fatigue
        )
env/simulator.py
ADDED
@@ -0,0 +1,20 @@
import random

def simulate_user(action, state):
    responses = []

    for post in action:
        score = state.interest - state.fatigue

        if score > 0.6:
            responses.append("click")
        elif score > 0.3:
            responses.append("watch")
        else:
            responses.append("skip")

    # user drift
    state.interest += random.uniform(-0.05, 0.05)
    state.fatigue += 0.02

    return responses
env/state.py
ADDED
@@ -0,0 +1,20 @@
from dataclasses import dataclass
from typing import List, Dict
import random

@dataclass
class UserState:
    user_embedding: List[float]
    history: List[int]
    interest: float
    fatigue: float
    value_pref: Dict[str, float]

def init_user():
    return UserState(
        user_embedding=[random.random() for _ in range(8)],
        history=[],
        interest=random.uniform(0.4, 0.8),
        fatigue=0.1,
        value_pref={"caring": 0.7, "toxicity_tolerance": 0.2}
    )
evaluation/__init__.py
ADDED
File without changes

evaluation/grader.py
ADDED
@@ -0,0 +1,24 @@
def evaluate(env, agent):
    total = 0

    for _ in range(5):
        state = env.reset()
        done = False

        while not done:
            action = agent.act(state, env.posts)
            state, reward, done, _ = env.step(action)
            total += reward

    return total / 5


def evaluate_all_tasks(agent, posts, EnvClass):
    tasks = ["easy", "medium", "hard"]
    results = {}

    for t in tasks:
        env = EnvClass(posts, task=t)
        results[t] = evaluate(env, agent)

    return results
inference.py
ADDED
@@ -0,0 +1,29 @@
from fastapi import FastAPI
import json
from env.feed_env import FeedRankingEnv
from agents.random_agent import RandomAgent

app = FastAPI()

# Load data
with open("data/posts.json", "r") as f:
    posts = json.load(f)

env = FeedRankingEnv(posts, task="hard")
agent = RandomAgent()

@app.post("/reset")
def reset():
    state = env.reset()
    return {"state": state.__dict__}

@app.post("/step")
def step():
    action = agent.act(env.state(), env.posts)
    state, reward, done, _ = env.step(action)

    return {
        "state": state.__dict__,
        "reward": reward,
        "done": done
    }
main.py
ADDED
@@ -0,0 +1,50 @@
import json
import random
import numpy as np
import os

from env.feed_env import FeedRankingEnv
from agents.random_agent import RandomAgent
from evaluation.grader import evaluate_all_tasks

# ---------------------------
# Environment Variables (required format)
# ---------------------------
API_BASE_URL = os.getenv("API_BASE_URL", "default")
MODEL_NAME = os.getenv("MODEL_NAME", "baseline")
HF_TOKEN = os.getenv("HF_TOKEN")

# ---------------------------
# Reproducibility
# ---------------------------
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)

set_seed()

# ---------------------------
# Load data
# ---------------------------
with open("data/posts.json", "r") as f:
    posts = json.load(f)

# ---------------------------
# Initialize agent
# ---------------------------
agent = RandomAgent()

# ---------------------------
# Evaluate
# ---------------------------
results = evaluate_all_tasks(agent, posts, FeedRankingEnv)

# ---------------------------
# REQUIRED STRUCTURED OUTPUT
# ---------------------------
print("START")

for task, score in results.items():
    print(f"STEP task={task} score={round(score, 2)}")

print("END")
openenv.yaml
ADDED
@@ -0,0 +1,20 @@
name: value-aware-feed-ranking

tasks:
  - easy
  - medium
  - hard

observation_space:
  type: UserState

action_space:
  type: ranked_posts

reward:
  components:
    - engagement
    - value_alignment
    - diversity
    - toxicity_penalty
    - fatigue_penalty
requirements.txt
ADDED
@@ -0,0 +1 @@
numpy