
🕉️ Sanskrit-PPO: Hopper-v5 SOTA

2979.5 peak reward on Hopper-v5 — 125% of the CleanRL benchmark (2382 ± 271).

This is the base PPO policy from the SanskritLM project — a research initiative by ParamTatva.org exploring how Sanskrit linguistic embeddings can drive robotic control. This release establishes the SOTA baseline; the Sanskrit-conditioned multi-task policy (which accepts behavioral commands in Sanskrit) is coming in a future release.

What's in This Release

| File | Description |
|---|---|
| `hopper_v5_sota.pt` | Trained PPO weights (135 KB), 125% of CleanRL SOTA |
| `model.py` | Agent architecture (2-layer MLP, ~10K params) |
| `train.py` | Fully reproducible training script |
| `inference.py` | CLI inference tool with `--render` support |

Results

| Metric | Value |
|---|---|
| Peak avg return (last 20) | 2979.5 |
| Best checkpoint return | 2731.3 |
| CleanRL benchmark | 2382 ± 271 |
| Our ratio vs SOTA | 125% |
| Training time | ~1 hour (single GPU) |
| Steps to SOTA | ~300K of 1M |

Training Curve

```
Update  50  | 102K steps | Return:  716
Update  80  | 163K steps | Return: 2023
Update 150  | 307K steps | Return: 2296
Update 230  | 471K steps | Return: 2620
Update 375  | 768K steps | Return: 2731
PEAK                     | Return: 2979  ← 125% of CleanRL SOTA
```

Architecture

| Component | Details |
|---|---|
| Algorithm | PPO (Proximal Policy Optimization) |
| Actor | MLP: 11 → 64 → 64 → 3, Tanh activations |
| Critic | MLP: 11 → 64 → 64 → 1, Tanh activations |
| Initialization | Orthogonal (gain √2 hidden, 0.01 actor head, 1.0 critic head) |
| Parameters | ~10K |
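The table above can be sketched in PyTorch as follows. This is a minimal reconstruction from the listed dimensions and init gains, not the actual `model.py` (the method name `get_action_and_value` matches the inference snippet below, but internals may differ):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

def layer_init(layer, std=2**0.5, bias_const=0.0):
    # Orthogonal weight init with the gains from the table above.
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer

class Agent(nn.Module):
    def __init__(self, obs_dim=11, act_dim=3):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),         # critic head gain 1.0
        )
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, act_dim), std=0.01),  # actor head gain 0.01
        )
        # State-independent log-std for the Gaussian policy.
        self.actor_logstd = nn.Parameter(torch.zeros(1, act_dim))

    def get_action_and_value(self, x, action=None):
        mean = self.actor_mean(x)
        dist = Normal(mean, self.actor_logstd.exp().expand_as(mean))
        if action is None:
            action = dist.sample()
        return (action, dist.log_prob(action).sum(1),
                dist.entropy().sum(1), self.critic(x))
```

Counting parameters for this sketch gives roughly 10.1K, consistent with the "~10K" figure.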

The Key Insight: Environment Wrappers

The breakthrough was using Gymnasium's built-in normalization wrappers. Without them, PPO plateaus at ~474. With them: 2979.

```python
env = gym.make("Hopper-v5")
env = gym.wrappers.RecordEpisodeStatistics(env)
env = gym.wrappers.FlattenObservation(env)
env = gym.wrappers.NormalizeObservation(env)
env = gym.wrappers.NormalizeReward(env, gamma=0.99)
```

Full investigation: The 371 Wall — A Detective Story
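Under the hood, `NormalizeObservation` maintains running mean/variance statistics and whitens each observation. A minimal numpy sketch of that idea (simplified, not Gymnasium's exact implementation):

```python
import numpy as np

class RunningObsNorm:
    """Running mean/variance observation normalizer, in the spirit of
    gym.wrappers.NormalizeObservation (simplified sketch)."""
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps

    def __call__(self, obs):
        # Update running statistics with the new observation...
        delta = obs - self.mean
        tot = self.count + 1.0
        self.mean = self.mean + delta / tot
        self.var = (self.var * self.count + delta * (obs - self.mean)) / tot
        self.count = tot
        # ...then whiten the observation.
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)
```

Because the policy only ever sees whitened inputs during training, the same statistics must be applied (and frozen) at evaluation time; Gymnasium's wrapper exposes this via its `update_running_mean` flag.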

Quick Start

Inference

```python
import torch
import gymnasium as gym
from model import Agent

agent = Agent(obs_dim=11, act_dim=3)
ckpt = torch.load("hopper_v5_sota.pt", map_location="cpu")
agent.load_state_dict(ckpt["model_state_dict"])
agent.eval()

# Note: the policy was trained behind NormalizeObservation. If the checkpoint
# does not bake those statistics into the weights, apply the same running
# mean/variance to `obs` before feeding it to the agent.
env = gym.make("Hopper-v5", render_mode="human")
obs, _ = env.reset()
total_reward = 0.0

for _ in range(1000):
    with torch.no_grad():
        action, _, _, _ = agent.get_action_and_value(
            torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        )
    obs, reward, term, trunc, _ = env.step(action.numpy().flatten())
    total_reward += reward
    if term or trunc:
        break

print(f"Episode reward: {total_reward:.0f}")
```

Train from Scratch

```shell
pip install "gymnasium[mujoco]" torch numpy
python train.py
```

~1 hour on a single GPU (T4/A100), ~4 hours on CPU.

🕉️ Coming Soon: Sanskrit-Conditioned Multi-Task Policy

This release is the base PPO agent — it takes raw observations and produces actions without any language conditioning.

The full SanskritLM pipeline adds a proprietary encoder that accepts behavioral commands in Sanskrit (Devanagari script) and conditions the policy via Feature-wise Linear Modulation (FiLM). A single policy learns multiple behaviors from Sanskrit commands:

| Sanskrit Command | Transliteration | Meaning | Behavior |
|---|---|---|---|
| अग्रे गच्छ | agre gaccha | "go forward" | Forward locomotion |
| पृष्ठतः गच्छ | pṛṣṭhataḥ gaccha | "go backward" | Backward locomotion |
| ऊर्ध्वं कूर्द | ūrdhvaṃ kūrda | "jump up" | Hopping/jumping |
| तिष्ठ | tiṣṭha | "stand still" | Stationary balance |

Why Sanskrit? Sanskrit's compositional morphology (sandhi, vibhakti, dhātu system) produces inherently structured embeddings. A single verb root (dhātu) encodes motion type, direction, intensity, and aspect — information that requires multiple English words. This linguistic density gives the encoder a natural advantage for encoding complex behavioral commands.
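FiLM conditioning as described above can be sketched like this. The Sanskrit encoder itself is proprietary, so a random placeholder embedding stands in for its output here; all names in this snippet are illustrative, not the real pipeline's API:

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise Linear Modulation: scale and shift policy features
    with (gamma, beta) predicted from a command embedding."""
    def __init__(self, cmd_dim, feat_dim):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cmd_dim, 2 * feat_dim)

    def forward(self, features, cmd_emb):
        gamma, beta = self.to_gamma_beta(cmd_emb).chunk(2, dim=-1)
        # (1 + gamma) keeps the layer near identity at initialization.
        return (1 + gamma) * features + beta

# Placeholder: in the real pipeline, cmd_emb would come from the Sanskrit
# encoder applied to e.g. "अग्रे गच्छ" (agre gaccha, "go forward").
cmd_emb = torch.randn(1, 32)
features = torch.randn(1, 64)   # hidden features of the policy MLP
film = FiLMLayer(cmd_dim=32, feat_dim=64)
modulated = film(features, cmd_emb)
```

The design point is that one shared policy trunk can express several behaviors: only the per-command (gamma, beta) pair changes between "go forward" and "stand still".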

The multi-task release will include:

  • Sanskrit-conditioned policy weights for Hopper, HalfCheetah, Walker2d, Humanoid, Ant, and Reacher
  • The encoder interface (commands must be in Sanskrit — use an LLM or translation API to generate Devanagari input)
  • Multi-environment benchmark results

🔔 Watch this repo for the multi-task release, or visit ParamTatva.org for updates.

Hyperparameters

| Parameter | Value |
|---|---|
| Total timesteps | 1,000,000 |
| Learning rate | 3e-4 (linear anneal) |
| Rollout steps | 2,048 |
| Minibatch size | 64 |
| Update epochs | 10 |
| Gamma (γ) | 0.99 |
| GAE lambda (λ) | 0.95 |
| Clip coefficient | 0.2 |
| Value function clipping | |
| Entropy coefficient | 0.0 |
| Value loss coefficient | 0.5 |
| Max gradient norm | 0.5 |
| Seed | 1 |
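For reference, the Gamma and GAE lambda rows feed into the standard Generalized Advantage Estimation pass over each rollout. A textbook numpy sketch (not the repo's `train.py`):

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """GAE over one rollout of T steps.
    dones[t] = 1.0 if the episode terminated at step t."""
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        # TD error at step t.
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        # Exponentially weighted sum of TD errors.
        gae = delta + gamma * lam * next_nonterminal * gae
        adv[t] = gae
    return adv, adv + values  # advantages, value targets
```

With γ = 0.99 and λ = 0.95 as in the table, each TD error is discounted by γλ ≈ 0.94 per step when accumulated backward through the rollout.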

License

Released under the ParamTatva Commercial License. See LICENSE.

  • ✅ Academic research and evaluation
  • ✅ Personal/educational use
  • ❌ Commercial deployment requires a separate license
  • ❌ Redistribution of weights without attribution

Citation

```bibtex
@misc{paramtatva2026ppohopper,
  title={Sanskrit-PPO: SOTA Reinforcement Learning with Linguistic Embeddings},
  author={ParamTatva Research},
  year={2026},
  url={https://huggingface.co/paramtatva/sanskrit-ppo-hopper-v5}
}
```

Contact

For commercial licensing and multi-task model access, contact the ParamTatva team at ParamTatva.org.


Evaluation results (self-reported)

| Environment | Setting | Best Return |
|---|---|---|
| Hopper-v5 | seed=3, 2M steps | 3183.2 |
| Walker2d-v5 | seed=42, 3M steps | 4918.5 |
| HalfCheetah-v5 | seed=2, 3M steps (still training) | 5803.9 |
| Reacher-v5 | seeds 1 and 3 | -4.2 |