KarlLearnsAI committed on
Commit
934b4ac
·
verified ·
1 Parent(s): a0b061b

Upload app.py with huggingface_hub

Browse files
Files changed (1) hide show
  1. app.py +16 -59
app.py CHANGED
@@ -94,7 +94,22 @@ with gr.Blocks(
94
  """)
95
  # ── Tab layout ──
96
  with gr.Tabs():
97
- # Tab 1: Training Results
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
98
  with gr.Tab("Training Results"):
99
  gr.Markdown(
100
  "### Reward Trend — GRPO Prompt Optimization",
@@ -108,63 +123,5 @@ with gr.Blocks(
108
  </div>
109
  """,
110
  )
111
- # Tab 2: Architecture (placeholder for future .png)
112
- with gr.Tab("Architecture"):
113
- gr.Markdown("""
114
- # The 3-Layer Architecture
115
- ```
116
- ┌─────────────────────────────────────────────────────────┐
117
- │ LAYER 0 — Reward Function │
118
- │ │
119
- │ Defines what "good" looks like for a conversation: │
120
- │ • +50 Correct intent classification │
121
- │ • +20 Resolved in ≤3 turns (efficiency) │
122
- │ • +40 Social engineering attack resisted │
123
- │ • −100 Social engineering attack succeeded │
124
- │ │
125
- │ Swapping domain (banking → telecom) auto-generates │
126
- │ a new reward function = a new RL environment. │
127
- └────────────────────────┬────────────────────────────────┘
128
- │ reward signal
129
- ┌────────────────────────▼────────────────────────────────┐
130
- │ LAYER 1 — RL Prompt Optimizer (GRPO) │
131
- │ │
132
- │ Model: Qwen2.5-3B-Instruct + LoRA (trained via GRPO) │
133
- │ │
134
- │ Each training step: │
135
- │ 1. Generate N candidate system prompts │
136
- │ 2. Test each prompt in Layer 2 (K customer episodes) │
137
- │ 3. Score via Layer 0 reward function │
138
- │ 4. GRPO gradient update — reinforce high-reward prompts│
139
- │ │
140
- │ Output: optimized system prompt for the support agent │
141
- └────────────────────────┬────────────────────────────────┘
142
- │ system prompt
143
- ┌────────────────────────▼────────────────────────────────┐
144
- │ LAYER 2 — Conversation Environment (OpenEnv 0.2.1) │
145
- │ │
146
- │ Two LLM actors (Llama 3.1 8B via HF Inference API): │
147
- │ │
148
- │ Customer (hidden intent + personality): │
149
- │ • 100 diverse personas │
150
- │ • Intents: transfer / check_balance / block_card │
151
- │ • Social engineering: none (60%), soft (20%), │
152
- │ hard prompt injection (20%) │
153
- │ │
154
- │ Support Agent (system prompt from Layer 1): │
155
- │ • Must classify customer intent in few turns │
156
- │ • Must resist manipulation attempts │
157
- │ • Outputs: {"intent": "<intent>"} when confident │
158
- │ │
159
- │ Episode ends when: intent classified / max turns / │
160
- │ security violation detected │
161
- └─────────────────────────────────────────────────────────┘
162
- ```
163
- ---
164
- ## Prize Targets
165
- - **Main Track — Statement 4:** Layer 0 generates reward functions → new domain = new RL environment automatically
166
- - **Fleet AI $10k:** Layer 1 provides scalable oversight — add intents, retrain
167
- - **Halluminate $10k:** Layer 2 is a multi-actor environment with 100 diverse adversarial customers
168
- """)
169
  if __name__ == "__main__":
170
  demo.launch(server_name="0.0.0.0", server_port=7860)
 
94
  """)
95
  # ── Tab layout ──
96
  with gr.Tabs():
97
+ # Tab 1: Architecture (default)
98
+ with gr.Tab("Architecture"):
99
+ gr.Image(
100
+ value="assets/architecture.png",
101
+ label="3-Layer Architecture",
102
+ show_label=False,
103
+ show_download_button=False,
104
+ )
105
+ gr.Markdown("""
106
+ ---
107
+ ## Prize Targets
108
+ - **Main Track — Statement 4:** Layer 0 generates reward functions → new domain = new RL environment automatically
109
+ - **Fleet AI $10k:** Layer 1 provides scalable oversight — add intents, retrain
110
+ - **Halluminate $10k:** Layer 2 is a multi-actor environment with 100 diverse adversarial customers
111
+ """)
112
+ # Tab 2: Training Results
113
  with gr.Tab("Training Results"):
114
  gr.Markdown(
115
  "### Reward Trend — GRPO Prompt Optimization",
 
123
  </div>
124
  """,
125
  )
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
126
  if __name__ == "__main__":
127
  demo.launch(server_name="0.0.0.0", server_port=7860)