# 🕉️ Sanskrit-PPO: Hopper-v5 SOTA

**2979.5 peak reward on Hopper-v5**: 125% of the CleanRL benchmark (2382 ± 271).

This is the base PPO policy from the SanskritLM project, a research initiative by ParamTatva.org exploring how Sanskrit linguistic embeddings can drive robotic control. This release establishes the SOTA baseline; the Sanskrit-conditioned multi-task policy (which accepts behavioral commands in Sanskrit) is coming in a future release.
## What's in This Release

| File | Description |
|---|---|
| `hopper_v5_sota.pt` | Trained PPO weights (135 KB), 125% of CleanRL SOTA |
| `model.py` | Agent architecture (2-layer MLP, ~10K params) |
| `train.py` | Fully reproducible training script |
| `inference.py` | CLI inference tool with `--render` support |
## Results

| Metric | Value |
|---|---|
| Peak avg return (last 20 episodes) | 2979.5 |
| Best checkpoint return | 2731.3 |
| CleanRL benchmark | 2382 ± 271 |
| Ratio vs. CleanRL SOTA | 125% |
| Training time | ~1 hour (single GPU) |
| Steps to reach SOTA | ~300K of 1M |
## Training Curve

```text
Update  50 | 102K steps | Return:  716
Update  80 | 163K steps | Return: 2023
Update 150 | 307K steps | Return: 2296
Update 230 | 471K steps | Return: 2620
Update 375 | 768K steps | Return: 2731
PEAK                    | Return: 2979   ← 125% of CleanRL SOTA
```
## Architecture
| Component | Details |
|---|---|
| Algorithm | PPO (Proximal Policy Optimization) |
| Actor | MLP: 11 → 64 → 64 → 3, Tanh activations |
| Critic | MLP: 11 → 64 → 64 → 1, Tanh activations |
| Initialization | Orthogonal (√2 hidden, 0.01 actor, 1.0 critic) |
| Parameters | ~10K |
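
For reference, the table corresponds to a standard CleanRL-style continuous-control agent. The released `model.py` is authoritative; the sketch below is only an approximation, and the state-independent `actor_logstd` parameter is an assumption not listed in the table:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

def layer_init(layer: nn.Linear, std: float = 2 ** 0.5) -> nn.Linear:
    """Orthogonal weight init with per-layer gain and zero bias, as in the table."""
    nn.init.orthogonal_(layer.weight, std)
    nn.init.zeros_(layer.bias)
    return layer

class Agent(nn.Module):
    def __init__(self, obs_dim: int = 11, act_dim: int = 3):
        super().__init__()
        # Critic: 11 -> 64 -> 64 -> 1, Tanh, output gain 1.0
        self.critic = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),
        )
        # Actor mean: 11 -> 64 -> 64 -> 3, Tanh, output gain 0.01
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, act_dim), std=0.01),
        )
        # Assumed: state-independent log-std for the Gaussian policy
        self.actor_logstd = nn.Parameter(torch.zeros(1, act_dim))

    def get_action_and_value(self, x: torch.Tensor, action=None):
        mean = self.actor_mean(x)
        dist = Normal(mean, self.actor_logstd.expand_as(mean).exp())
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action).sum(1), dist.entropy().sum(1), self.critic(x)
```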
## The Key Insight: Environment Wrappers

The breakthrough was using Gymnasium's built-in normalization wrappers. Without them, PPO plateaus at ~474. With them: 2979.
```python
import gymnasium as gym

env = gym.make("Hopper-v5")
env = gym.wrappers.RecordEpisodeStatistics(env)      # track episode returns/lengths
env = gym.wrappers.FlattenObservation(env)           # ensure a flat Box observation
env = gym.wrappers.NormalizeObservation(env)         # running mean/std over observations
env = gym.wrappers.NormalizeReward(env, gamma=0.99)  # scale rewards by return variance
```
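
One caveat worth noting: `NormalizeObservation` learns running statistics during training, so an evaluation environment should reuse and freeze those statistics rather than re-estimate them from scratch. A sketch, assuming the training wrapper's `obs_rms` was saved alongside the checkpoint (`saved_obs_rms` is a hypothetical name; recent Gymnasium versions expose the `update_running_mean` flag):

```python
import gymnasium as gym

eval_env = gym.make("Hopper-v5")
eval_env = gym.wrappers.NormalizeObservation(eval_env)
# eval_env.obs_rms = saved_obs_rms    # restore statistics captured during training
eval_env.update_running_mean = False  # freeze statistics so evaluation is stationary
```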
Full investigation: The 371 Wall — A Detective Story
## Quick Start

### Inference
```python
import torch
import gymnasium as gym

from model import Agent

# Load the trained policy
agent = Agent(obs_dim=11, act_dim=3)
ckpt = torch.load("hopper_v5_sota.pt", map_location="cpu")
agent.load_state_dict(ckpt["model_state_dict"])
agent.eval()

# Roll out one episode with rendering
env = gym.make("Hopper-v5", render_mode="human")
obs, _ = env.reset()
total_reward = 0.0
for _ in range(1000):
    with torch.no_grad():
        action, _, _, _ = agent.get_action_and_value(
            torch.FloatTensor(obs).unsqueeze(0)
        )
    obs, reward, term, trunc, _ = env.step(action.numpy().flatten())
    total_reward += reward
    if term or trunc:
        break
print(f"Episode reward: {total_reward:.0f}")
```
### Train from Scratch

```bash
pip install "gymnasium[mujoco]" torch numpy
python train.py
```
~1 hour on a single GPU (T4/A100), ~4 hours on CPU.
## 🕉️ Coming Soon: Sanskrit-Conditioned Multi-Task Policy

This release is the base PPO agent: it takes raw observations and produces actions without any language conditioning.
The full SanskritLM pipeline adds a proprietary encoder that accepts behavioral commands in Sanskrit (Devanagari script) and conditions the policy via Feature-wise Linear Modulation (FiLM). A single policy learns multiple behaviors from Sanskrit commands:
| Sanskrit Command | Transliteration | Meaning | Behavior |
|---|---|---|---|
| अग्रे गच्छ | agre gaccha | "go forward" | Forward locomotion |
| पृष्ठतः गच्छ | pṛṣṭhataḥ gaccha | "go backward" | Backward locomotion |
| ऊर्ध्वं कूर्द | ūrdhvaṃ kūrda | "jump up" | Hopping/jumping |
| तिष्ठ | tiṣṭha | "stand still" | Stationary balance |
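
FiLM itself is a published, general-purpose conditioning mechanism (Perez et al., 2018): the command embedding produces per-feature scale and shift vectors that modulate the policy's hidden activations. A minimal sketch with hypothetical dimensions (the proprietary Sanskrit encoder that would produce `cmd` is not shown):

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise Linear Modulation: h -> gamma(cmd) * h + beta(cmd)."""
    def __init__(self, cmd_dim: int, hidden_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(cmd_dim, hidden_dim)  # per-feature scale
        self.to_beta = nn.Linear(cmd_dim, hidden_dim)   # per-feature shift

    def forward(self, h: torch.Tensor, cmd: torch.Tensor) -> torch.Tensor:
        return self.to_gamma(cmd) * h + self.to_beta(cmd)

# Usage: modulate a 64-unit hidden layer with a 32-dim command embedding
film = FiLMLayer(cmd_dim=32, hidden_dim=64)
h = torch.randn(1, 64)    # policy hidden activations
cmd = torch.randn(1, 32)  # embedding of e.g. "agre gaccha"
h_mod = film(h, cmd)
```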
Why Sanskrit? Sanskrit's compositional morphology (sandhi, vibhakti, dhātu system) produces inherently structured embeddings. A single verb root (dhātu) encodes motion type, direction, intensity, and aspect — information that requires multiple English words. This linguistic density gives the encoder a natural advantage for encoding complex behavioral commands.
The multi-task release will include:
- Sanskrit-conditioned policy weights for Hopper, HalfCheetah, Walker2d, Humanoid, Ant, and Reacher
- The encoder interface (commands must be in Sanskrit — use an LLM or translation API to generate Devanagari input)
- Multi-environment benchmark results
🔔 Watch this repo for the multi-task release, or visit ParamTatva.org for updates.
## Hyperparameters
| Parameter | Value |
|---|---|
| Total timesteps | 1,000,000 |
| Learning rate | 3e-4 (linear anneal) |
| Rollout steps | 2,048 |
| Minibatch size | 64 |
| Update epochs | 10 |
| Gamma | 0.99 |
| GAE Lambda | 0.95 |
| Clip coefficient | 0.2 |
| Value function clipping | ✓ |
| Entropy coefficient | 0.0 |
| Value loss coefficient | 0.5 |
| Max gradient norm | 0.5 |
| Seed | 1 |
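
For concreteness, here is how two of these settings (the clip coefficient and the linear learning-rate anneal) typically enter a PPO update. This is an illustrative sketch, not an excerpt from `train.py`:

```python
import torch

def ppo_clip_loss(new_logprob: torch.Tensor, old_logprob: torch.Tensor,
                  advantages: torch.Tensor, clip_coef: float = 0.2) -> torch.Tensor:
    """Clipped surrogate policy loss (clip coefficient 0.2 from the table)."""
    ratio = (new_logprob - old_logprob).exp()
    loss_unclipped = -advantages * ratio
    loss_clipped = -advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
    return torch.max(loss_unclipped, loss_clipped).mean()

def annealed_lr(update: int, num_updates: int, base_lr: float = 3e-4) -> float:
    """Linear learning-rate anneal from base_lr down to zero over training."""
    return base_lr * (1.0 - (update - 1) / num_updates)
```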
## License
Released under the ParamTatva Commercial License. See LICENSE.
- ✅ Academic research and evaluation
- ✅ Personal/educational use
- ❌ Commercial deployment requires a separate license
- ❌ Redistribution of weights without attribution
## Citation

```bibtex
@misc{paramtatva2026ppohopper,
  title={Sanskrit-PPO: SOTA Reinforcement Learning with Linguistic Embeddings},
  author={ParamTatva Research},
  year={2026},
  url={https://huggingface.co/paramtatva/sanskrit-ppo-hopper-v5}
}
```
## Contact
For commercial licensing and multi-task model access, contact the ParamTatva team at ParamTatva.org.
## Evaluation Results

Self-reported results on the ParamTatva/mujoco-sota-benchmark suite:

| Environment | Best Return | Run details |
|---|---|---|
| Hopper-v5 | 3183.2 | seed=3, 2M steps |
| Walker2d-v5 | 4918.5 | seed=42, 3M steps |
| HalfCheetah-v5 | 5803.9 | seed=2, 3M steps, still training |
| Reacher-v5 | -4.2 | seeds 1, 3 |