Somuai12 committed on
Commit 82f6517 · 1 Parent(s): 1ad2a1f

Remove emojis from README and track reward_progression image

Files changed (2)
  1. .gitignore +0 -1
  2. README.md +6 -6
.gitignore CHANGED
@@ -13,4 +13,3 @@ venv.bak/
  *results.json
  .DS_Store
  .env
- reward_progression.png
 
README.md CHANGED
@@ -12,7 +12,7 @@ base_path: /dashboard/
 
  ---
 
- ### 🧪 Advanced Reward Shaping (RLVR Integration)
+ ### Advanced Reward Shaping (RLVR Integration)
  Unlike standard environments with static rewards, **PolicyEvolverEnv v2.0** implements a sophisticated, deterministic grading engine designed to harden LLM strategic reasoning:
 
  * **Tiered CoT Bonus**: Rewards analytical reasoning (up to +0.20) based on keyword density and length.
@@ -159,9 +159,9 @@ The agent uses **In-Context Reinforcement Learning (ICL-RL)**: no weight updates
 
  | Task | Step 1 | Step 2 | Step 3 | Step 4 | Step 5 | Converged |
  |------|--------|--------|--------|--------|--------|-----------|
- | task_easy | 0.94 | N/A | N/A | N/A | N/A | |
- | task_medium | 1.00 | N/A | N/A | N/A | N/A | |
- | task_hard | 0.90 | N/A | N/A | N/A | N/A | |
+ | task_easy | 0.94 | N/A | N/A | N/A | N/A | Yes |
+ | task_medium | 1.00 | N/A | N/A | N/A | N/A | Yes |
+ | task_hard | 0.90 | N/A | N/A | N/A | N/A | Yes |
 
  **Model:** llama-3.1-8b-instant (via Groq)
  **Reproducible:** temperature=0.0, seed=42 (**Bit-for-bit identical results verified**)
@@ -183,12 +183,12 @@ The agent uses **In-Context Reinforcement Learning (ICL-RL)**: no weight updates
  3. API Keys → Create API Key
  4. Export: `export HF_TOKEN=gsk_your_key_here`
 
- ## 📈 Strategic Reward Evolution & RLVR
+ ## Strategic Reward Evolution & RLVR
  PolicyEvolverEnv serves as the **Strategic Sandbox** for the **Reinforcement Learning from Verifiable Rewards (RLVR)** stage of the modern LLM inference pipeline. Unlike static evaluation, this environment enables agents to refine their strategies iteratively based on high-quality, verifiable feedback.
 
  ![Reward Progression](https://raw.githubusercontent.com/Luciferai04/PolicyEvolverEnv/master/reward_progression.png)
 
- ### 🧠 How It Works: The Iterative Refinement Process
+ ### How It Works: The Iterative Refinement Process
  1. **Refinement Hub**: The baseline agent tracks its previous rewards and actions through the observation's metadata (`info`).
  2. **Strategic pivoting**: If a policy proposal receives low rewards (due to lack of specificity or missing justifications), the agent identifies the failure points and pivots its strategy in subsequent steps.
  3. **Measurable Improvement**: As shown in the progression chart, iterative refinement leads to **Strategic Convergence**, where the policy quality reaches institutional standards (Score ≥ 0.85).
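
Editor's note: the three-step refinement process described in the README hunk above can be illustrated with a small, self-contained loop. This is only a sketch under stated assumptions, not the repository's code. `History`, `toy_grade`, `toy_revise`, and `refine` are hypothetical names, and the toy grader is a stand-in for the real grading engine, whose rules are not shown in this diff. Only the 0.85 convergence threshold and the five-step budget (the table's Step 1 to Step 5 columns) come from the README text.

```python
from dataclasses import dataclass, field

CONVERGENCE_THRESHOLD = 0.85  # "Score ≥ 0.85" from the README text
MAX_STEPS = 5                 # the results table has columns Step 1..Step 5


@dataclass
class History:
    """Hypothetical stand-in for the metadata (`info`) the agent reads back."""
    rewards: list = field(default_factory=list)
    actions: list = field(default_factory=list)


def has_number(text: str) -> bool:
    return any(ch.isdigit() for ch in text)


def toy_grade(proposal: str) -> float:
    """Hypothetical deterministic grader (NOT the PolicyEvolverEnv engine).

    Rewards a justification ("because") and a concrete number (specificity).
    """
    score = 0.5
    if "because" in proposal:
        score += 0.2
    if has_number(proposal):
        score += 0.2
    return round(score, 2)


def toy_revise(history: History) -> str:
    """Hypothetical pivot: patch the first weakness found in the last action."""
    last = history.actions[-1]
    if "because" not in last:
        return last + " because it reduces repeat violations"
    if not has_number(last):
        return last + " within 30 days"
    return last


def refine(initial: str):
    """Grade, record, and revise until the threshold or step budget is hit."""
    history = History()
    proposal = initial
    for _ in range(MAX_STEPS):
        reward = toy_grade(proposal)
        history.rewards.append(reward)
        history.actions.append(proposal)
        if reward >= CONVERGENCE_THRESHOLD:
            return history, True   # converged
        proposal = toy_revise(history)
    return history, False          # step budget exhausted


if __name__ == "__main__":
    history, converged = refine("Flag spam accounts")
    print(history.rewards, converged)
```

With this toy grader, a vague starting proposal improves over successive steps, while a proposal that already clears the threshold stops after its first step. That early stop would be consistent with the results table, where Steps 2 to 5 are N/A for tasks that score at least 0.85 on Step 1, although the diff itself does not state why those cells are N/A.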