RL: A Structured Human Action & Intent Dataset for Physical AI and World Models

Community Article · Published April 21, 2026

(The accompanying dataset can be found here.)

The hardest unsolved problem in physical AI isn't compute or model architecture: it's data. Specifically, the kind of data that captures not just what a body did, but why, and whether it succeeded. Web scraping can't produce it. Synthetic generation approximates it but misses the texture of real human decision-making: intent. Robots collecting their own training data measure the result, not the intent behind it.

We're sharing a dataset that may be highly relevant to emerging vision–language–action models, world models, and real-world AI systems.

Over the past decade, FL-S has built a proprietary data platform that captures aligned intent, action, and outcome trajectories from human operators in physics-grounded environments. The data was recorded by human forklift operators completing structured, compliant training exercises in a high-fidelity VR simulator, instrumented at every layer: raw physics, operator body, vehicle mechanics, exercise progression, and scored outcomes, all locked to the same clock.

The dataset includes:

  • Structured state → action → outcome sequences
  • Explicit human intent, task context (via xAPI) and rewards
  • Continuous control inputs and multimodal streams
  • Large-scale edge cases and failure scenarios

Unlike existing datasets focused on perception or passive observation, this data is action-centered and learning-ready, enabling:

  • Policy learning
  • Goal-conditioned agents
  • Grounding of world models in human behavior

This could be relevant as a dataset category for physical AI, potentially complementing existing efforts around multimodal and agent-based systems.

What's in the Dataset

(Figure: ML pipeline overview)

This dataset records a training exercise designed to teach forklift operators how to pivot and interact with loads at various rack heights. Operators first engage and disengage a load using guided instructions, followed by an unguided repetition of the task to test their independent proficiency.

Each session captures a single operator completing one or more training exercises. Every episode produces:

Layer | Streams | Rate
Vehicle kinematics | forklift_state, forklift_actuators | 50 Hz (physics)
Scene dynamics | env_rigidbodies, env_contacts | 50 Hz / event
Operator body | hmd_pose, hand_controllers, gaze_rays | ~80 Hz (render)
Hardware inputs | human_controls | 120 Hz (on-change)
Task structure | step_ledger | event-driven
Episode lifecycle | episode_markers, sync, system_perf | event / periodic
Scene geometry | environment_snapshot, calibration_snapshot | one-shot
Semantic outcomes | xapi_statements | event-driven
Safety violations | rule_events | event-driven

This is just a small sample. Our customer-facing training software has been deployed over the years to 500+ simulators in 17 countries, generating a very large number of sessions and records; thousands of new sessions are recorded weekly. The number of records per session varies with the complexity of the exercise, each exercise being part of a full training curriculum. We expect the compressed dataset to be approximately 500 MB per session, scaling with the duration of each exercise.

All data is distributed as JSONL, one file per stream per episode, plus a derived trajectory.parquet per episode aligned to a deterministic 50 Hz physics grid.

Temporal Alignment

Every record across every stream carries the same 11-field common header:

{
  "session_id":               "a1b2c3d4-...",
  "episode_id":               "e5f6a7b8-...",
  "user_id_hash":             "sha256_abc123",
  "build_id":                 "1.2.3-beta",
  "telemetry_schema_version": "1.0.0",
  "t_sim":                    10.0,
  "t_sim_delta":              0.02,
  "t_wall_utc":               1710000000000000,
  "t_mono_ns":                12345678901234,
  "frame_index":              600,
  "fixed_step_index":         500
}
  • t_sim is simulation time in seconds since episode start, deterministic and unaffected by real-world clock drift.
  • t_mono_ns is a monotonic hardware nanosecond counter for sub-millisecond delta calculations.
  • fixed_step_index is the Unity physics step counter, incremented at a fixed 50 Hz. It is the canonical join key for aligning physics-rate streams.
  • frame_index is the render frame counter for aligning render-rate streams.

This means joining forklift_state to hmd_pose on a given timestep is a simple fixed_step_index lookup, not a fuzzy time-range merge.
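As a minimal sketch of such a join with pandas (assuming the per-episode JSONL layout shown under "Dataset Structure" below; downsampling the render-rate stream to the last sample per physics step is one reasonable choice, not something the dataset prescribes):

import pandas as pd

# Load two streams from one episode ({sid}/{eid} are placeholders).
state = pd.read_json("sessions/{sid}/{eid}/telemetry/forklift_state.jsonl", lines=True)
hmd = pd.read_json("sessions/{sid}/{eid}/telemetry/hmd_pose.jsonl", lines=True)

# hmd_pose is render-rate (~80 Hz): keep the last sample per physics step,
# then join on the shared fixed_step_index key -- no fuzzy time matching.
hmd_per_step = hmd.groupby("fixed_step_index").last()
joined = state.join(hmd_per_step, on="fixed_step_index", rsuffix="_hmd")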

Coordinate system: Unity left-handed, Y-up, Z-forward. All positions in metres; rotations as quaternions (x, y, z, w).

Stream Reference

Vehicle State and Actuation

forklift_state | 50 Hz, 16 physics fields

The vehicle's complete kinematic state every physics step: world-space position (3D), orientation quaternion (4D), linear velocity (3D), angular velocity (3D), steer angle, motor torque, and gear.

{
  "...header...",
  "pos_x": 5.12,    "pos_y": 0.05,    "pos_z": -3.45,
  "rot_x": 0.0,     "rot_y": 0.383,   "rot_z": 0.0,    "rot_w": 0.924,
  "lin_vel_x": 0.5, "lin_vel_y": 0.0, "lin_vel_z": 1.8,
  "ang_vel_x": 0.0, "ang_vel_y": 0.02, "ang_vel_z": 0.0,
  "steer_angle": 5.5,
  "motor_torque": 120.0,
  "gear": "Forward"
}

forklift_actuators | 50 Hz, 14 fields

The stream that uniquely bridges operator intent and mechanical execution. Raw joystick inputs (input_lift, input_tilt, input_sideshift, input_boost, all normalized [-1, 1]) are recorded alongside their resulting physical outputs: mast height, mast tilt, sideshift offset, and the 6DOF world-space pose of the carriage.

{
  "...header...",
  "mast_height": 1.85,   "mast_tilt": -2.0,  "mast_side": 0.05,
  "input_lift": 0.3,     "input_tilt": 0.0,  "input_sideshift": 0.0, "input_boost": 0.0,
  "carriage_pos_x": 5.12, "carriage_pos_y": 1.85, "carriage_pos_z": -3.10,
  "carriage_rot_x": -0.017, "carriage_rot_y": 0.383, "carriage_rot_z": 0.0, "carriage_rot_w": 0.924
}

A model learning from this stream knows not just where the mast is, but exactly how hard the operator pushed the joystick to get it there.
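As a quick sanity check of that intent-to-execution link, one could correlate the raw lift input against the mast's vertical rate. A sketch using only the documented fields and the JSONL paths shown later:

import pandas as pd

act = pd.read_json("sessions/{sid}/{eid}/telemetry/forklift_actuators.jsonl", lines=True)

# Mast vertical rate per physics step vs. the raw lift input that commanded it.
act["mast_rate"] = act["mast_height"].diff() / act["t_sim_delta"]
print(act[["input_lift", "mast_rate"]].corr())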

Environment

env_rigidbodies | 50 Hz per tracked object

Position, rotation, linear velocity, angular velocity, mass, and drag for every dynamic object in the scene. High-volume: in a busy warehouse exercise this is the largest stream.

env_contacts | event-driven

Physics collision events: the two entity IDs, relative velocity at impact, penetration depth, and up to 4 contact points with normals. This is the source for the collision reward channel in trajectory.parquet.
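trajectory.parquet ships this channel pre-computed, but re-deriving or re-shaping it from the raw events is straightforward. A sketch, with the caveat that the column name relative_velocity and the scale constant are illustrative assumptions, not the published schema:

import pandas as pd

contacts = pd.read_json("sessions/{sid}/{eid}/telemetry/env_contacts.jsonl", lines=True)

PENALTY_SCALE = 0.1  # illustrative constant, not the pipeline's actual value

# Assumed column name: the impact's relative speed as "relative_velocity".
penalty_per_step = (
    contacts.assign(penalty=-PENALTY_SCALE * contacts["relative_velocity"].abs())
            .groupby("fixed_step_index")["penalty"].sum()
)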

environment_snapshot | one-shot at episode start

The complete static scene layout: collider geometry, names, positions, and spatial properties of all static objects (aisles, pedestrian walkways, stop zones, ramps). Provides ground truth scene context for any spatial reasoning model.

calibration_snapshot | one-shot at episode start

VR play-area dimensions, floor offset, and per-controller calibration offsets. Essential context for correctly interpreting HMD and hand positions.

Operator Body

hmd_pose | render-rate (~80 Hz)

6DOF head position and rotation quaternion, plus a tracking validity flag. Captures exactly how the operator physically orients themselves throughout the task.

hand_controllers | render-rate, two records per frame (left/right)

Spatial pose plus analog finger inputs: grip (0–1) and trigger (0–1). Reveals anticipatory movements, capturing where hands move before the joystick input registers.

gaze_rays | render-rate

Eye-tracking: gaze ray origin and direction, hit distance, and hit normal. Tells you what the operator was actually looking at in the fraction of a second before every physical action.

human_controls | 120 Hz, on-change only

Raw OEM hardware input events: device path, control path (e.g. joystick axis, button), analog value, and a hardware-level input timestamp. Emitted only on state change, not on a fixed-rate poll, so each record represents a genuine operator decision.
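To use these events alongside the fixed-rate streams, one option is to pivot them into per-control columns and forward-fill onto the 50 Hz grid. A sketch; the column names control_path and value are assumptions based on the field descriptions above:

import pandas as pd

controls = pd.read_json("sessions/{sid}/{eid}/telemetry/human_controls.jsonl", lines=True)

# Pivot on-change events into one column per control, then forward-fill
# so every physics step carries each control's last known value.
wide = controls.pivot_table(index="fixed_step_index", columns="control_path",
                            values="value", aggfunc="last")
full_grid = range(int(wide.index.min()), int(wide.index.max()) + 1)
dense = wide.reindex(full_grid).ffill()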

Diagnostics and Sync

episode_markers | event-driven

Episode boundaries (episode_start, episode_end), pause events (pause_start, pause_end), and HMD connectivity events. Useful for filtering out paused or disconnected segments before training.
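A sketch of that filtering step, pairing pause_start / pause_end markers into exclusion spans (it assumes the marker label lives in an event_type column, mirroring step_ledger, and takes the ordering of the pairs on faith):

import pandas as pd

markers = pd.read_json("sessions/{sid}/{eid}/telemetry/episode_markers.jsonl", lines=True)

# Pair pause_start / pause_end timestamps into [start, end) exclusion spans.
starts = markers.loc[markers["event_type"] == "pause_start", "t_sim"].to_list()
ends = markers.loc[markers["event_type"] == "pause_end", "t_sim"].to_list()
pause_spans = list(zip(starts, ends))

def is_paused(t_sim: float) -> bool:
    """True if a simulation timestamp falls inside any pause span."""
    return any(s <= t_sim < e for s, e in pause_spans)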

sync | 1 Hz and system_perf | 2 Hz

Simulation heartbeat and GPU/render diagnostics: Unity timeScale, frame times, dropped frame counts, GPU timings, physics step counter. Used to detect degraded simulation fidelity and exclude affected segments.

The Intent Layer

Physics streams tell you what the operator's body did. The following three streams tell you what they were supposed to do, what they actually accomplished, and where they went wrong. This is what makes the dataset usable for RLHF: you have the reward signal, not just the trajectory.

step_ledger | Exercise Progression

The exercise activity tree is a hierarchical curriculum of named steps (e.g. "approach the pallet", "raise the mast", "insert the forks"). The step_ledger stream emits a step_start / step_end event pair for each step, including the step token, its canonical URI, its depth in the hierarchy, and a unique span ID that links start to end.

{
  "...header...",
  "event_type":        "step_start",
  "step_token":        "block_1.0",
  "step_uri":          "https://flsimulators.com/xapi/activities/exercise/step/1.0.0",
  "parent_step_token": "",
  "exercise_id":       "FLS-CON-001",
  "depth":             0,
  "step_uid":          "f1e2d3c4-..."
}

This stream segments every episode into named, time-bounded task intervals. A model can learn which physical behaviors belong to which task phase, and the step_uid makes it straightforward to compute per-step metrics: time-on-task, velocity profile during approach, gaze allocation during insertion.
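For instance, time-on-task per step falls out of a start/end self-join on step_uid. A minimal sketch over the documented fields:

import pandas as pd

ledger = pd.read_json("sessions/{sid}/{eid}/telemetry/step_ledger.jsonl", lines=True)

# Pair each step_start with its step_end via the shared step_uid.
starts = ledger[ledger["event_type"] == "step_start"].set_index("step_uid")
ends = ledger[ledger["event_type"] == "step_end"].set_index("step_uid")

time_on_task = (ends["t_sim"] - starts["t_sim"]).rename("seconds")
per_step = pd.concat([starts["step_token"], time_on_task], axis=1)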

xapi_statements | Scored Outcomes

xAPI (Experience API) statements record semantic learning events: the operator, the action (verb), the object (activity), and the result. Scored statements include success/failure and a normalized score. These are the signals used to derive reward_task in trajectory.parquet.

The statements are linked back to the step_ledger via canonical activity URIs: a step's step_uri corresponds to the activity identifier in the xAPI statement that scores it.

rule_events | Safety Violations and Completions

The rules engine evaluates operator behavior against a configurable set of safety rules in real time: speed limits in pedestrian zones, prohibited maneuvers, load height restrictions. Each event records the rule identifier, event type (triggered, violated, completed), severity, and timestamp.

These provide structured binary feedback: clean signal for both RLHF penalty shaping and for training classifiers on unsafe operator behavior.
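A sketch of turning those events into labels (the column name event_type is an assumption; the exact schema keys aren't shown here):

import pandas as pd

rules = pd.read_json("sessions/{sid}/{eid}/rules/rule_events.jsonl", lines=True)

violations = rules[rules["event_type"] == "violated"]
episode_had_violation = len(violations) > 0                           # episode-level label
violations_per_step = violations.groupby("fixed_step_index").size()   # shaping signal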

ML-Ready Format: trajectory.parquet

Each episode is also distributed as a pre-built trajectory.parquet, aligned to the 50 Hz physics grid with paused segments removed. This is the format intended for direct use with PyTorch datasets or Isaac Lab.

Schema per row (one physics timestep):

Column group | Dimensions | Source
obs_* (observation) | 171D float32 | forklift_state + forklift_actuators + hmd_pose + gaze_rays + hand_controllers + env_rigidbodies
act_* (action) | 7D float32 | forklift_state + forklift_actuators
reward_collision | 1D | env_contacts → penalty scaled by impact velocity
reward_step_completed | 1D | step_ledger → bonus on step_end
reward_task | 1D | xapi_statements → success/failure signal
reward_time | 1D | constant per-step cost (efficiency incentive)
step_token | string | current exercise step from step_ledger
done, truncated, paused | bool | episode state
fixed_step_index, t_sim | index | join key back to raw streams

Observation breakdown (171D):

  • Vehicle body pose + velocities + drivetrain: 17D (forklift_state)
  • Mast state + carriage 6DOF: 10D (forklift_actuators)
  • Operator head pose + tracking: 8D (hmd_pose)
  • Gaze direction + hit distance: 4D (gaze_rays)
  • Raw actuator inputs: 4D (forklift_actuators)
  • Hand controller poses: 20D (hand_controllers)
  • Scene rigid body states (8 tracked objects): 112D (env_rigidbodies)

Action vector (7D): normalized throttle (torque × gear sign), normalized steer, brake, lift, tilt, sideshift, and boost, all in [-1, 1].

The step_token column is the bridge from trajectory to task structure. You can slice any episode into per-step segments by filtering on step_token, then compute per-step reward sums, behavioral metrics, or train step-conditioned policies.

(Figure: ML pipeline overview)

Use Cases

Behavior cloning: The trajectory.parquet format is drop-in compatible with standard BC pipelines: obs_* → model → act_*. The 171D observation is compact enough to train on CPU for prototyping.
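A minimal sketch of that loop in PyTorch (single episode, no batching or validation split; the Tanh head matches the [-1, 1] action range):

import pandas as pd
import torch
from torch import nn

traj = pd.read_parquet("episodes/{sid}/{eid}/trajectory.parquet")
obs = torch.tensor(traj.filter(like="obs_").to_numpy(), dtype=torch.float32)
act = torch.tensor(traj.filter(like="act_").to_numpy(), dtype=torch.float32)

# Small MLP policy: observation in, normalized action out.
policy = nn.Sequential(nn.Linear(obs.shape[1], 256), nn.ReLU(),
                       nn.Linear(256, act.shape[1]), nn.Tanh())
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(policy(obs), act)
    loss.backward()
    opt.step()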

RLHF reward modeling: xapi_statements provides scored human judgments on task completion. rule_events provides structured safety feedback. Both are pre-aligned to the physics timeline via t_sim.

Step-conditioned imitation: The step_token column enables training policies conditioned on the current task phase, a natural fit for hierarchical RL or option-framework approaches.

Operator assessment models: Predict step completion time, collision probability, or score from the first N seconds of an episode. The step_ledger + rule_events combination provides the labels.

Gaze-informed action prediction: gaze_rays + hmd_pose provide the anticipatory signal. Operators look at their target before acting on it, and this lead time is measurable and consistent across operators.
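One crude way to quantify that lead time is to scan lags for the peak correlation between a gaze channel and an action channel. Purely illustrative; the column names obs_gaze_hit_distance and act_steer below are hypothetical, not the published schema:

import numpy as np
import pandas as pd

traj = pd.read_parquet("episodes/{sid}/{eid}/trajectory.parquet")
gaze = traj["obs_gaze_hit_distance"].to_numpy()   # hypothetical column name
steer = traj["act_steer"].to_numpy()              # hypothetical column name

# Correlation at each lag; the peak is a crude gaze-lead estimate.
lags = range(0, 100)  # 0-2 s at 50 Hz
corr = [np.corrcoef(gaze[: len(gaze) - k], steer[k:])[0, 1] for k in lags]
lead_time_s = lags[int(np.nanargmax(np.abs(corr)))] / 50.0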

Scene-conditioned policy transfer: environment_snapshot provides the warehouse layout as structured data. Policies can condition on scene geometry rather than treating each exercise as a distinct task.

Dataset Structure

Each session is a self-contained folder:

sessions/{session_id}/{episode_id}/
    telemetry/
        episode_manifest.json
        forklift_state.jsonl
        forklift_actuators.jsonl
        env_rigidbodies.jsonl
        env_contacts.jsonl
        hmd_pose.jsonl
        hand_controllers.jsonl
        gaze_rays.jsonl
        human_controls.jsonl
        step_ledger.jsonl
        episode_markers.jsonl
        environment_snapshot.jsonl
        calibration_snapshot.jsonl
        sync.jsonl
        system_perf.jsonl
    xapi/
        xapi_statements.jsonl
    rules/
        rule_events.jsonl

episodes/{session_id}/{episode_id}/
    trajectory.parquet
    meta.json

The episode_manifest.json lists every stream present and its schema version, so partial episodes (hardware disconnect, editor stop) are identifiable before loading.

Hugging Face Export

The pipeline includes an export-hf step that produces a ready-to-use Hugging Face dataset from the built trajectories.

Output layout:

exports/hf/
    README.md                           ← auto-generated dataset card
    catalog/
        episodes.parquet                ← episode-level metadata index
    data/
        train-00000-of-XXXXX.parquet    ← trajectory shards (~150 MB each)
        val-00000-of-XXXXX.parquet
        test-00000-of-XXXXX.parquet
    xapi/
        data.parquet                    ← xAPI statements companion table
    rule_events/
        data.parquet                    ← rule events companion table

Episodes are split 80 / 10 / 10 (train / val / test) using stratified sampling by exercise_id, so each split has the same proportion of exercise types. Shards target ~150 MB each for efficient streaming.
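One way to reproduce such a split with scikit-learn (a sketch over the episode catalog; this is not the pipeline's own code, and it assumes the catalog carries an exercise_id column):

import pandas as pd
from sklearn.model_selection import train_test_split

catalog = pd.read_parquet("exports/hf/catalog/episodes.parquet")

# 80/10/10, stratified so each split keeps the same mix of exercise types.
train, rest = train_test_split(catalog, test_size=0.2,
                               stratify=catalog["exercise_id"], random_state=0)
val, test = train_test_split(rest, test_size=0.5,
                             stratify=rest["exercise_id"], random_state=0)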

The companion tables (xapi/ and rule_events/) contain the raw semantic and safety annotation records for every exported episode, joinable back to the trajectory via session_id, episode_id, and t_sim.
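Because event timestamps can fall between 50 Hz grid points, an as-of merge is a safer join than exact matching. A sketch (the shard filename keeps the placeholder from the layout above):

import pandas as pd

shard = pd.read_parquet("exports/hf/data/train-00000-of-XXXXX.parquet")
xapi = pd.read_parquet("exports/hf/xapi/data.parquet")

# Attach each trajectory row's most recent xAPI statement (if any).
scored = pd.merge_asof(
    shard.sort_values("t_sim"),
    xapi.sort_values("t_sim"),
    on="t_sim",
    by=["session_id", "episode_id"],
    direction="backward",
    suffixes=("", "_xapi"),
)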

Loading with the datasets library:

from datasets import load_dataset

ds = load_dataset("parquet", data_files={
    "train": "exports/hf/data/train-*.parquet",
    "validation": "exports/hf/data/val-*.parquet",
    "test": "exports/hf/data/test-*.parquet",
})

# Observation and action tensors
import numpy as np
obs_cols = [c for c in ds["train"].column_names if c.startswith("obs_")]
act_cols = [c for c in ds["train"].column_names if c.startswith("act_")]
obs = np.stack([np.asarray(ds["train"][c]) for c in obs_cols], axis=1)  # (N, 171)
act = np.stack([np.asarray(ds["train"][c]) for c in act_cols], axis=1)  # (N, 7)

Getting Started

import pandas as pd

# Load the ML-ready trajectory for one episode
traj = pd.read_parquet("episodes/{sid}/{eid}/trajectory.parquet")

# Observation and action tensors
obs = traj[[c for c in traj.columns if c.startswith("obs_")]].values   # (T, 171)
act = traj[[c for c in traj.columns if c.startswith("act_")]].values   # (T, 7)

reward = traj[["reward_collision", "reward_step_completed", "reward_task", "reward_time"]].sum(axis=1).values

# Slice by exercise step
approach_phase = traj[traj["step_token"] == "block_1.0"]

# Load raw JSONL for a specific stream
forklift = pd.read_json(
    "sessions/{sid}/{eid}/telemetry/forklift_state.jsonl",
    lines=True,
)

Next

We plan to release the Unity simulation environment so that researchers can benchmark models on equivalent exercises and collect their own episodes. Please contact us with any questions.
