
Holosoma FastSAC Locomotion Policy Architecture

Model: fastsac_g1_29dof.onnx
Source: Holosoma Whole-Body Locomotion
Repository: nepyope/holosoma_locomotion
Framework: PyTorch 2.4.1 → ONNX (Opset 13)
Total Parameters: 221,727


Table of Contents

  1. Overview
  2. Input Space (Observation)
  3. Output Space (Action)
  4. Network Architecture
  5. Forward Pass Data Flow
  6. PD Gains
  7. Comparison with GR00T

Overview

The Holosoma FastSAC policy is a single unified actor network trained with the Fast Soft Actor-Critic (FastSAC) algorithm for bipedal whole-body locomotion. Unlike GR00T's dual-policy system (balance + walk), Holosoma uses one network to handle all locomotion states.

Key Features

  • Single frame input: 100D observation (no temporal stacking)
  • Full body control: All 29 DOF controlled simultaneously
  • Observation normalization: Running mean/std baked into model
  • Per-joint action scaling: Action scale factors embedded in ONNX
  • LayerNorm: Used instead of BatchNorm for stability
  • Gait phase encoding: Sin/cos phase signals for periodic locomotion

Input Space (Observation)

Observation Vector (100D)

Index     Dim  Content                                Scale
[0:29]    29   last_action (unscaled)                 1.0
[29:32]   3    angular_velocity (IMU gyro)            0.25
[32]      1    yaw_velocity_command                   1.0
[33:35]   2    linear_velocity_command (x, y)         1.0
[35:37]   2    gait_phase_cos                         1.0
[37:66]   29   joint_positions (relative to default)  1.0
[66:95]   29   joint_velocities                       0.05
[95:98]   3    gravity_orientation                    1.0
[98:100]  2    gait_phase_sin                         1.0

Total: 100D
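As a quick sanity check, the segment widths in the table above should exactly tile the 100D vector; a minimal sketch:

```python
# Segment widths from the observation layout table above.
SEGMENTS = {
    "last_action": 29,
    "angular_velocity": 3,
    "yaw_velocity_command": 1,
    "linear_velocity_command": 2,
    "gait_phase_cos": 2,
    "joint_positions": 29,
    "joint_velocities": 29,
    "gravity_orientation": 3,
    "gait_phase_sin": 2,
}
assert sum(SEGMENTS.values()) == 100  # tiles the full observation
```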

Observation Scaling Constants

ANG_VEL_SCALE = 0.25      # Angular velocity scaling
DOF_POS_SCALE = 1.0       # Joint position scaling
DOF_VEL_SCALE = 0.05      # Joint velocity scaling

Input Normalization

The model includes a baked-in observation normalizer applied internally:

# Applied inside ONNX graph
obs_normalized = (obs - obs_mean) / obs_std

# obs_mean: shape (1, 100), range [-0.99, 0.71]
# obs_std:  shape (1, 100), range [0.04, 1.40]

Default Joint Positions (29D Standing Pose)

DEFAULT_ANGLES = np.zeros(29, dtype=np.float32)
DEFAULT_ANGLES[[0, 6]] = -0.312    # hip pitch (left, right)
DEFAULT_ANGLES[[3, 9]] = 0.669     # knee
DEFAULT_ANGLES[[4, 10]] = -0.363   # ankle pitch
DEFAULT_ANGLES[[15, 22]] = 0.2     # shoulder pitch
DEFAULT_ANGLES[16] = 0.2           # left shoulder roll
DEFAULT_ANGLES[23] = -0.2          # right shoulder roll
DEFAULT_ANGLES[[18, 25]] = 0.6     # elbow

Gait Phase

GAIT_PERIOD = 1.0  # seconds
CONTROL_FREQ = 50  # Hz
phase_dt = 2 * np.pi / (CONTROL_FREQ * GAIT_PERIOD)  # phase advance per control step

# Two-phase system (left/right legs offset by pi)
phase = [phase_left, phase_right]

# When standing still (no command)
if cmd_magnitude < 0.01:
    phase = [np.pi, np.pi]  # Both legs synchronized (standing)
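The snippet above only shows the constants and the standing case; a minimal sketch of the per-step phase update, assuming (not confirmed by the source) that phases wrap modulo 2π and the right leg runs π out of phase, with `step_phase` as an illustrative helper name:

```python
import numpy as np

GAIT_PERIOD = 1.0   # seconds
CONTROL_FREQ = 50   # Hz
PHASE_DT = 2 * np.pi / (CONTROL_FREQ * GAIT_PERIOD)

def step_phase(phase_left, cmd_magnitude):
    """Advance the left-leg phase by one control step and derive the right.

    Assumption: phases wrap modulo 2*pi and the right leg is offset
    by pi; standing pins both legs to pi, as in the rule above.
    """
    if cmd_magnitude < 0.01:
        return np.pi, np.pi          # standing: both legs synchronized
    phase_left = (phase_left + PHASE_DT) % (2 * np.pi)
    phase_right = (phase_left + np.pi) % (2 * np.pi)
    return phase_left, phase_right

# One full gait period (50 steps at 50 Hz) returns the phase to ~0
p = 0.0
for _ in range(CONTROL_FREQ):
    p, p_r = step_phase(p, cmd_magnitude=0.5)
```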

Output Space (Action)

Action Vector (29D)

All 29 joints are controlled:

Index    Joint Group                                                              DOF
[0:6]    Left leg (hip_pitch, hip_roll, hip_yaw, knee, ankle_pitch, ankle_roll)   6
[6:12]   Right leg                                                                6
[12:15]  Waist (yaw, roll, pitch)                                                 3
[15:22]  Left arm (shoulder_pitch/roll/yaw, elbow, wrist_roll/pitch/yaw)          7
[22:29]  Right arm                                                                7

Total: 29D

Action Scaling & Application

ACTION_SCALE = 0.25  # External scaling applied in control loop

# Target joint positions
target_pos = DEFAULT_ANGLES + action * ACTION_SCALE

Command Ranges

Command                 Range
Linear velocity X       [-1.0, 1.0] m/s
Linear velocity Y       [-1.0, 1.0] m/s
Angular velocity (yaw)  [-1.0, 1.0] rad/s
Heading                 [-3.14, 3.14] rad
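In a deployment loop it is prudent to clamp operator commands to these training ranges before writing them into the observation; a minimal sketch (constant and function names are illustrative, not from the source):

```python
import numpy as np

# Clamp operator commands to the training ranges in the table above.
CMD_LIN_RANGE = (-1.0, 1.0)   # m/s, applies to both x and y
CMD_YAW_RANGE = (-1.0, 1.0)   # rad/s

def clip_commands(cmd_lin_xy, cmd_yaw):
    cmd_lin_xy = np.clip(np.asarray(cmd_lin_xy, dtype=np.float32), *CMD_LIN_RANGE)
    cmd_yaw = float(np.clip(cmd_yaw, *CMD_YAW_RANGE))
    return cmd_lin_xy, cmd_yaw

lin, yaw = clip_commands([1.5, -0.2], -2.0)   # lin -> [1.0, -0.2], yaw -> -1.0
```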

Network Architecture

Overview Diagram

Input (100D)
    │
    ▼
┌─────────────────────────┐
│  Observation Normalizer │  (mean/std normalization)
└─────────────────────────┘
    │
    ▼
┌─────────────────────────┐
│  Linear(100 → 512)      │  51,200 params
│  LayerNorm(512)         │  1,024 params
│  ELU()                  │
└─────────────────────────┘
    │
    ▼
┌─────────────────────────┐
│  Linear(512 → 256)      │  131,072 params
│  LayerNorm(256)         │  512 params
│  ELU()                  │
└─────────────────────────┘
    │
    ▼
┌─────────────────────────┐
│  Linear(256 → 128)      │  32,768 params
│  LayerNorm(128)         │  256 params
│  ELU()                  │
└─────────────────────────┘
    │
    ▼
┌─────────────────────────┐
│  Linear(128 → 29)       │  3,712 params
│  Tanh()                 │
└─────────────────────────┘
    │
    ▼
┌─────────────────────────┐
│  Action Scaling         │  action * scale + bias
└─────────────────────────┘
    │
    ▼
Output (29D)

Layer Details

Layer 1: Linear(100 β†’ 512) + LayerNorm + ELU

  • Input: 100D normalized observation
  • Output: 512D features
  • Parameters: 51,200 (weights) + 512 (bias) + 1,024 (LayerNorm) = 52,736

Layer 2: Linear(512 β†’ 256) + LayerNorm + ELU

  • Input: 512D features
  • Output: 256D features
  • Parameters: 131,072 (weights) + 256 (bias) + 512 (LayerNorm) = 131,840

Layer 3: Linear(256 β†’ 128) + LayerNorm + ELU

  • Input: 256D features
  • Output: 128D features
  • Parameters: 32,768 (weights) + 128 (bias) + 256 (LayerNorm) = 33,152

Layer 4: Linear(128 β†’ 29) (Action Head)

  • Input: 128D features
  • Output: 29D raw actions
  • Parameters: 3,712 (weights) + 29 (bias) = 3,741
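The per-layer counts can be cross-checked with simple arithmetic; this sketch tallies every tensor term by term:

```python
# Cross-check of the parameter counts listed above.
def linear_ln(in_dim, out_dim):
    # Linear weight + bias, plus LayerNorm weight + bias
    return in_dim * out_dim + out_dim + 2 * out_dim

mlp = (
    linear_ln(100, 512)      # Layer 1: 52,736
    + linear_ln(512, 256)    # Layer 2: 131,840
    + linear_ln(256, 128)    # Layer 3: 33,152
    + (128 * 29 + 29)        # Action head: 3,741
)
extras = 2 * 100 + 2 * 29    # obs normalizer mean/std + action scale/bias
total = mlp + extras
print(total)  # 221727
```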

Weight Summary

Weight Name              Shape       Parameters
obs_normalizer._mean     [1, 100]    100
obs_normalizer._std      [1, 100]    100
actor.net.0.weight       [512, 100]  51,200
actor.net.0.bias         [512]       512
actor.net.1.weight (LN)  [512]       512
actor.net.1.bias (LN)    [512]       512
actor.net.3.weight       [256, 512]  131,072
actor.net.3.bias         [256]       256
actor.net.4.weight (LN)  [256]       256
actor.net.4.bias (LN)    [256]       256
actor.net.6.weight       [128, 256]  32,768
actor.net.6.bias         [128]       128
actor.net.7.weight (LN)  [128]       128
actor.net.7.bias (LN)    [128]       128
actor.fc_mu.0.weight     [29, 128]   3,712
actor.fc_mu.0.bias       [29]        29
actor.action_scale       [29]        29
actor.action_bias        [29]        29

Total: 221,727 parameters


Forward Pass Data Flow

Complete Forward Pass

def forward(obs_100d):
    """
    Args:
        obs_100d: (batch, 100) observation vector
        
    Returns:
        actions: (batch, 29) joint position deltas
    """
    # 1. Normalize observation (baked into ONNX)
    x = (obs_100d - obs_mean) / obs_std
    
    # 2. Hidden layer 1
    x = linear_0(x)      # (batch, 512)
    x = layer_norm_1(x)
    x = elu(x)
    
    # 3. Hidden layer 2
    x = linear_3(x)      # (batch, 256)
    x = layer_norm_4(x)
    x = elu(x)
    
    # 4. Hidden layer 3
    x = linear_6(x)      # (batch, 128)
    x = layer_norm_7(x)
    x = elu(x)
    
    # 5. Action head
    x = fc_mu(x)         # (batch, 29)
    x = tanh(x)          # Bound to [-1, 1]
    
    # 6. Scale actions (baked into ONNX)
    actions = x * action_scale + action_bias
    
    return actions
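For reference, the same forward pass can be written in plain NumPy. The weights below are random stand-ins (the trained values live in the ONNX file); only the shapes and operations mirror the model:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, w, b, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * w + b

def elu(x):
    return np.where(x > 0, x, np.expm1(x))

# Random stand-in weights; shapes follow the weight summary above
dims = [100, 512, 256, 128]
Ws = [rng.standard_normal((o, i)) * 0.05 for i, o in zip(dims[:-1], dims[1:])]
bs = [np.zeros(o) for o in dims[1:]]
ln_params = [(np.ones(o), np.zeros(o)) for o in dims[1:]]
W_mu = rng.standard_normal((29, 128)) * 0.05
b_mu = np.zeros(29)
obs_mean, obs_std = np.zeros(100), np.ones(100)
action_scale, action_bias = np.full(29, 0.25), np.zeros(29)

def forward(obs):
    x = (obs - obs_mean) / obs_std              # 1. normalize
    for W, b, (g, beta) in zip(Ws, bs, ln_params):
        x = elu(layer_norm(x @ W.T + b, g, beta))  # 2-4. hidden layers
    x = np.tanh(x @ W_mu.T + b_mu)              # 5. action head, bound [-1, 1]
    return x * action_scale + action_bias       # 6. scale actions

actions = forward(rng.standard_normal((1, 100)))
print(actions.shape)  # (1, 29)
```

Because tanh bounds the head to [-1, 1] and the bias here is zero, every output lies within ±0.25 of the default pose scale.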

Computational Cost

Forward pass operations:
1. Normalization: 100D subtract + divide     (~200 ops)
2. Layer 1: 100β†’512 matmul + LN + ELU        (~52K MACs)
3. Layer 2: 512β†’256 matmul + LN + ELU        (~131K MACs)
4. Layer 3: 256β†’128 matmul + LN + ELU        (~33K MACs)
5. Action head: 128β†’29 matmul + tanh + scale (~4K MACs)

Total: ~220K multiply-accumulate operations per inference

PD Gains

PD gains are stored in ONNX metadata and loaded at runtime.

PD Gain Selection Methodology

Holosoma uses the Raibert-style rule (BeyondMimic method) to compute PD gains from joint armature values:

Kp = I × (2π × fn)²
Kd = 2 × I × ζ × (2π × fn)

Where:

  • I = Joint armature (rotational inertia, from URDF/MuJoCo)
  • fn = Natural frequency = 10 Hz
  • ζ = Damping ratio = 2.0 (over-damped for stability)

This physics-based approach ensures each joint's stiffness/damping is proportional to its inertia.
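The rule is easy to reproduce; a minimal sketch with the constants above (fn = 10 Hz, ζ = 2.0):

```python
import math

F_N = 10.0    # natural frequency, Hz
ZETA = 2.0    # damping ratio

def pd_gains(armature):
    """Raibert-style gains: Kp = I*(2*pi*fn)^2, Kd = 2*I*zeta*(2*pi*fn)."""
    w = 2 * math.pi * F_N
    return armature * w ** 2, 2 * armature * ZETA * w

kp, kd = pd_gains(0.01018)   # hip_yaw armature
# kp ~ 40.2, kd ~ 2.56 (matches the gain tables up to armature rounding)
```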

Verification

Joint     Armature (I)  Formula Kp  Actual Kp  Match
hip_yaw   0.01018       40.179      40.179     ✓
hip_roll  0.02510       99.098      99.098     ✓
ankle     0.00722       28.501      28.501     ✓
shoulder  0.00361       14.251      14.251     ✓
wrist     0.00425       16.778      16.778     ✓

Alternative Rule of Thumb (Kyle's Method)

Another common approach in RL locomotion:

Kp = τ_max / RoM   (or τ_max / (RoM/2) for stiffer response)
Kd = Kp / 20

Where τ_max is max torque and RoM is range of motion. This gives similar ballpark values but doesn't account for joint inertia.
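A worked example with deliberately made-up numbers (not G1 specs):

```python
# Hypothetical illustration of the rule of thumb; tau_max and rom
# are invented values, not Unitree G1 joint specs.
tau_max = 80.0   # N*m, hypothetical max torque
rom = 2.0        # rad, hypothetical range of motion

kp = tau_max / rom   # 40.0
kd = kp / 20         # 2.0
```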

Kp (Position Gains)

Joint Group       Kp Values  Armature
Hip pitch/yaw     40.18      0.0102
Hip roll, Knee    99.10      0.0251
Ankle             28.50      0.0072
Waist yaw         40.18      0.0102
Waist roll/pitch  28.50      0.0072
Shoulder/Elbow    14.25      0.0036
Wrist             16.78      0.0043

Kd (Velocity Gains)

Joint Group       Kd Values  Armature
Hip pitch/yaw     2.56       0.0102
Hip roll, Knee    6.31       0.0251
Ankle             1.81       0.0072
Waist yaw         2.56       0.0102
Waist roll/pitch  1.81       0.0072
Shoulder/Elbow    0.91       0.0036
Wrist             1.07       0.0043

Comparison with GR00T

Feature            Holosoma FastSAC            GR00T
Architecture       Single unified actor        Dual policy (Balance + Walk)
Observation        100D (single frame)         516D (6 frames × 86D)
Actions            29D (full body)             15D (legs + waist only)
Parameters         221,727                     470,578
Policy Switching   None                        Based on command magnitude
Temporal Context   Gait phase encoding         6-frame history stack
Normalization      LayerNorm + obs normalizer  L2 norm on latent
Control Frequency  50 Hz                       50 Hz

Key Differences

  1. Single vs Dual Policy: Holosoma handles balance and walking with one network; GR00T switches between dedicated balance/walk policies.

  2. Temporal Handling: GR00T stacks 6 frames (516D) for temporal context; Holosoma uses gait phase signals (sin/cos) to encode periodicity.

  3. Body Coverage: Holosoma controls all 29 DOF including arms; GR00T only controls 15 DOF (legs + waist), leaving arms for other tasks.

  4. Architecture: GR00T uses an encoder-decoder (estimator β†’ actor) design with latent normalization; Holosoma is a straightforward MLP with LayerNorm.


Usage

from huggingface_hub import hf_hub_download
import onnxruntime as ort
import numpy as np

# Load model
path = hf_hub_download(repo_id="nepyope/holosoma_locomotion", 
                       filename="fastsac_g1_29dof.onnx")
policy = ort.InferenceSession(path)

# Build 100D observation (robot-state variables below come from your own state estimator)
obs = np.zeros((1, 100), dtype=np.float32)
obs[0, 0:29] = last_action
obs[0, 29:32] = ang_vel * 0.25
obs[0, 32] = cmd_yaw
obs[0, 33:35] = cmd_lin
obs[0, 35:37] = np.cos(phase)
obs[0, 37:66] = (joint_pos - default_pos) * 1.0
obs[0, 66:95] = joint_vel * 0.05
obs[0, 95:98] = gravity
obs[0, 98:100] = np.sin(phase)

# Run inference
action = policy.run(None, {"actor_obs": obs})[0].squeeze()

# Apply to robot
target = default_pos + action * 0.25
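Downstream, the position targets are typically tracked by a joint-level PD loop; a minimal sketch with a hypothetical uniform gain (real deployments would use the per-joint Kp/Kd tables above; q and qd are hypothetical joint readings):

```python
import numpy as np

# Minimal 50 Hz tracking sketch. The uniform gain is a placeholder;
# real deployments use the per-joint Kp/Kd tables above.
KP = np.full(29, 40.0, dtype=np.float32)
KD = KP / 20.0

def pd_torque(target, q, qd):
    """tau = Kp * (target - q) - Kd * qd"""
    return KP * (target - q) - KD * qd

q = np.zeros(29, dtype=np.float32)    # joint positions (hypothetical reading)
qd = np.zeros(29, dtype=np.float32)   # joint velocities (hypothetical reading)
target = np.full(29, 0.1, dtype=np.float32)
tau = pd_torque(target, q, qd)        # ~4.0 N*m on every joint here
```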

References

  1. Holosoma Project: Whole-body locomotion for humanoid robots
  2. SAC Algorithm: Haarnoja et al., "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" (2018)
  3. Unitree G1: https://www.unitree.com/g1

Last Updated: 2025-12-17
