# 🕉️ Sanskrit-PPO: Hopper-v5 SOTA

**2979.5 peak reward on Hopper-v5**: 125% of the CleanRL benchmark (2382 ± 271).

This is the base PPO policy from the SanskritLM project, a research initiative by ParamTatva.org exploring how Sanskrit linguistic embeddings can drive robotic control. This release establishes the SOTA baseline; the Sanskrit-conditioned multi-task policy (which accepts behavioral commands in Sanskrit) is coming in a future release.
## What's in This Release

| File | Description |
|---|---|
| `hopper_v5_sota.pt` | Trained PPO weights (135 KB), 125% of CleanRL SOTA |
| `model.py` | Agent architecture (2-layer MLP, ~10K params) |
| `train.py` | Fully reproducible training script |
| `inference.py` | CLI inference tool with `--render` support |
## Results

| Metric | Value |
|---|---|
| Peak avg return (last 20 episodes) | 2979.5 |
| Best checkpoint return | 2731.3 |
| CleanRL benchmark | 2382 ± 271 |
| Ratio vs. CleanRL SOTA | 125% |
| Training time | ~1 hour (single GPU) |
| Steps to reach SOTA | ~300K of 1M |
## Training Curve

```text
Update  50 | 102K steps | Return:  716
Update  80 | 163K steps | Return: 2023
Update 150 | 307K steps | Return: 2296
Update 230 | 471K steps | Return: 2620
Update 375 | 768K steps | Return: 2731
PEAK                    | Return: 2979   ← 125% of CleanRL SOTA
```
## Architecture
| Component | Details |
|---|---|
| Algorithm | PPO (Proximal Policy Optimization) |
| Actor | MLP: 11 → 64 → 64 → 3, Tanh activations |
| Critic | MLP: 11 → 64 → 64 → 1, Tanh activations |
| Initialization | Orthogonal (√2 hidden, 0.01 actor, 1.0 critic) |
| Parameters | ~10K |
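
For reference, the table corresponds to a standard CleanRL-style continuous-control agent. The released `model.py` is authoritative; the sketch below is only an approximation, and the state-independent `actor_logstd` parameter is an assumption not listed in the table:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

def layer_init(layer: nn.Linear, std: float = 2 ** 0.5) -> nn.Linear:
    """Orthogonal weight init with per-layer gain and zero bias, as in the table."""
    nn.init.orthogonal_(layer.weight, std)
    nn.init.zeros_(layer.bias)
    return layer

class Agent(nn.Module):
    def __init__(self, obs_dim: int = 11, act_dim: int = 3):
        super().__init__()
        # Critic: 11 -> 64 -> 64 -> 1, Tanh, output gain 1.0
        self.critic = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),
        )
        # Actor mean: 11 -> 64 -> 64 -> 3, Tanh, output gain 0.01
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, act_dim), std=0.01),
        )
        # Assumed: state-independent log-std for the Gaussian policy
        self.actor_logstd = nn.Parameter(torch.zeros(1, act_dim))

    def get_action_and_value(self, x: torch.Tensor, action=None):
        mean = self.actor_mean(x)
        dist = Normal(mean, self.actor_logstd.expand_as(mean).exp())
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action).sum(1), dist.entropy().sum(1), self.critic(x)
```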
## The Key Insight: Environment Wrappers

The breakthrough was using Gymnasium's built-in normalization wrappers. Without them, PPO plateaus at ~474. With them: 2979.
```python
import gymnasium as gym

env = gym.make("Hopper-v5")
env = gym.wrappers.RecordEpisodeStatistics(env)      # track episode returns/lengths
env = gym.wrappers.FlattenObservation(env)           # ensure a flat Box observation
env = gym.wrappers.NormalizeObservation(env)         # running mean/std over observations
env = gym.wrappers.NormalizeReward(env, gamma=0.99)  # scale rewards by return variance
```
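
One caveat worth noting: `NormalizeObservation` learns running statistics during training, so an evaluation environment should reuse and freeze those statistics rather than re-estimate them from scratch. A sketch, assuming the training wrapper's `obs_rms` was saved alongside the checkpoint (`saved_obs_rms` is a hypothetical name; recent Gymnasium versions expose the `update_running_mean` flag):

```python
import gymnasium as gym

eval_env = gym.make("Hopper-v5")
eval_env = gym.wrappers.NormalizeObservation(eval_env)
# eval_env.obs_rms = saved_obs_rms    # restore statistics captured during training
eval_env.update_running_mean = False  # freeze statistics so evaluation is stationary
```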
Full investigation: The 371 Wall — A Detective Story
## Quick Start

### Inference
```python
import torch
import gymnasium as gym

from model import Agent

# Load the trained policy
agent = Agent(obs_dim=11, act_dim=3)
ckpt = torch.load("hopper_v5_sota.pt", map_location="cpu")
agent.load_state_dict(ckpt["model_state_dict"])
agent.eval()

# Roll out one episode with rendering
env = gym.make("Hopper-v5", render_mode="human")
obs, _ = env.reset()
total_reward = 0.0
for _ in range(1000):
    with torch.no_grad():
        action, _, _, _ = agent.get_action_and_value(
            torch.FloatTensor(obs).unsqueeze(0)
        )
    obs, reward, term, trunc, _ = env.step(action.numpy().flatten())
    total_reward += reward
    if term or trunc:
        break
print(f"Episode reward: {total_reward:.0f}")
```
### Train from Scratch

```bash
pip install "gymnasium[mujoco]" torch numpy
python train.py
```
~1 hour on a single GPU (T4/A100), ~4 hours on CPU.
## 🕉️ Coming Soon: Sanskrit-Conditioned Multi-Task Policy

This release is the base PPO agent: it takes raw observations and produces actions without any language conditioning.
The full SanskritLM pipeline adds a proprietary encoder that accepts behavioral commands in Sanskrit (Devanagari script) and conditions the policy via Feature-wise Linear Modulation (FiLM). A single policy learns multiple behaviors from Sanskrit commands:
| Sanskrit Command | Transliteration | Meaning | Behavior |
|---|---|---|---|
| अग्रे गच्छ | agre gaccha | "go forward" | Forward locomotion |
| पृष्ठतः गच्छ | pṛṣṭhataḥ gaccha | "go backward" | Backward locomotion |
| ऊर्ध्वं कूर्द | ūrdhvaṃ kūrda | "jump up" | Hopping/jumping |
| तिष्ठ | tiṣṭha | "stand still" | Stationary balance |
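
FiLM itself is a published, general-purpose conditioning mechanism (Perez et al., 2018): the command embedding produces per-feature scale and shift vectors that modulate the policy's hidden activations. A minimal sketch with hypothetical dimensions (the proprietary Sanskrit encoder that would produce `cmd` is not shown):

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise Linear Modulation: h -> gamma(cmd) * h + beta(cmd)."""
    def __init__(self, cmd_dim: int, hidden_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(cmd_dim, hidden_dim)  # per-feature scale
        self.to_beta = nn.Linear(cmd_dim, hidden_dim)   # per-feature shift

    def forward(self, h: torch.Tensor, cmd: torch.Tensor) -> torch.Tensor:
        return self.to_gamma(cmd) * h + self.to_beta(cmd)

# Usage: modulate a 64-unit hidden layer with a 32-dim command embedding
film = FiLMLayer(cmd_dim=32, hidden_dim=64)
h = torch.randn(1, 64)    # policy hidden activations
cmd = torch.randn(1, 32)  # embedding of e.g. "agre gaccha"
h_mod = film(h, cmd)
```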
Why Sanskrit? Sanskrit's compositional morphology (sandhi, vibhakti, dhātu system) produces inherently structured embeddings. A single verb root (dhātu) encodes motion type, direction, intensity, and aspect — information that requires multiple English words. This linguistic density gives the encoder a natural advantage for encoding complex behavioral commands.
The multi-task release will include:
- Sanskrit-conditioned policy weights for Hopper, HalfCheetah, Walker2d, Humanoid, Ant, and Reacher
- The encoder interface (commands must be in Sanskrit — use an LLM or translation API to generate Devanagari input)
- Multi-environment benchmark results
🔔 Watch this repo for the multi-task release, or visit ParamTatva.org for updates.
## Hyperparameters
| Parameter | Value |
|---|---|
| Total timesteps | 1,000,000 |
| Learning rate | 3e-4 (linear anneal) |
| Rollout steps | 2,048 |
| Minibatch size | 64 |
| Update epochs | 10 |
| Gamma | 0.99 |
| GAE Lambda | 0.95 |
| Clip coefficient | 0.2 |
| Value function clipping | ✓ |
| Entropy coefficient | 0.0 |
| Value loss coefficient | 0.5 |
| Max gradient norm | 0.5 |
| Seed | 1 |
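
For concreteness, here is how two of these settings (the clip coefficient and the linear learning-rate anneal) typically enter a PPO update. This is an illustrative sketch, not an excerpt from `train.py`:

```python
import torch

def ppo_clip_loss(new_logprob: torch.Tensor, old_logprob: torch.Tensor,
                  advantages: torch.Tensor, clip_coef: float = 0.2) -> torch.Tensor:
    """Clipped surrogate policy loss (clip coefficient 0.2 from the table)."""
    ratio = (new_logprob - old_logprob).exp()
    loss_unclipped = -advantages * ratio
    loss_clipped = -advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
    return torch.max(loss_unclipped, loss_clipped).mean()

def annealed_lr(update: int, num_updates: int, base_lr: float = 3e-4) -> float:
    """Linear learning-rate anneal from base_lr down to zero over training."""
    return base_lr * (1.0 - (update - 1) / num_updates)
```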
## License
Released under the ParamTatva Commercial License. See LICENSE.
- ✅ Academic research and evaluation
- ✅ Personal/educational use
- ❌ Commercial deployment requires a separate license
- ❌ Redistribution of weights without attribution
## Citation

```bibtex
@misc{paramtatva2026ppohopper,
  title={Sanskrit-PPO: SOTA Reinforcement Learning with Linguistic Embeddings},
  author={ParamTatva Research},
  year={2026},
  url={https://huggingface.co/paramtatva/sanskrit-ppo-hopper-v5}
}
```
## Contact
For commercial licensing and multi-task model access, contact the ParamTatva team at ParamTatva.org.
## Evaluation Results

Self-reported results on the ParamTatva/mujoco-sota-benchmark suite:

| Environment | Best Return | Run details |
|---|---|---|
| Hopper-v5 | 3183.2 | seed=3, 2M steps |
| Walker2d-v5 | 4918.5 | seed=42, 3M steps |
| HalfCheetah-v5 | 5803.9 | seed=2, 3M steps, still training |
| Reacher-v5 | -4.2 | seeds 1, 3 |