File size: 5,495 Bytes

b8d1b9e

---
license: mit
library_name: pytorch
tags:
- robotics
- progress-estimation
- behavior-cloning
---

# SARM Progress Prediction

Stage-aware progress prediction model for robot manipulation tasks

## Model Description

SARM predicts:
- **Progress**: How far through the task (0.0 to 1.0)
- **Stage**: Which stage of the task is being executed

The model uses a transformer architecture to process sequences of RGB images and robot states.

**Task**: clearing_food_from_table_into_fridge
**Dataset**: IliaLarchenko/behavior_224_rgb

## Model Details

### Architecture
- **Type**: Transformer with dual prediction heads (stage classification + progress regression)
- **Model dimension**: 768
- **Attention heads**: 12
- **Transformer layers**: 8
- **MLP dimension**: 512
- **Number of stages**: 100
- **Number of tasks**: 50

### Training Details
- **Checkpoint**: `best_model.pt`
- **Training step**: 4800
- **Epoch**: unknown
- **Training loss**: unknown
- **Validation loss**: 1.0865614609792829
- **Batch size**: 16
- **Learning rate**: 0.0001
- **Max sequence length**: 13

## Usage

### Download and Load Model

```python
from hf_model_hub import download_model_from_hub
from model import SARM
import torch
import json

# Download model and config
files = download_model_from_hub(
    repo_id="YOUR_USERNAME/YOUR_REPO",
    checkpoint_name="best_model.pt",
    output_dir="./downloaded_model"
)

# Load config
with open(files["config"], "r") as f:
    config = json.load(f)

# Create model
model_config = config["model"]
model = SARM(
    d_model=model_config["d_model"],
    n_heads=model_config["n_heads"],
    n_layers=model_config["n_layers"],
    d_mlp=model_config["d_mlp"],
    num_stages=model_config["num_stages"],
    d_state=model_config["d_state"],
    num_tasks=model_config["num_tasks"],
)

# Load checkpoint
checkpoint = torch.load(files["checkpoint"])
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
```

### Run Inference

```python
# Assuming you have images and states prepared
with torch.no_grad():
    stage_logits, progress = model(images, states, tasks, padding_mask)
    
    # Get predictions for the last frame
    predicted_stage = stage_logits[:, -1].argmax(dim=-1)
    predicted_progress = progress[:, -1]
```

## Training Data

This model was trained on the **IliaLarchenko/behavior_224_rgb** for robot manipulation tasks.

Training episodes: 90 episodes
Validation episodes: 15 episodes

## Intended Use

- Progress estimation for robot manipulation tasks
- Stage classification for multi-step tasks  
- Adaptive window sampling for VLA training
- Task monitoring and intervention detection

## Limitations

- Trained on specific tasks from BEHAVIOR dataset
- Requires RGB images (224x224) and robot state information
- Fixed sequence length input

## Citation

If you use this model, please cite:

```bibtex
@misc{sarm-model,
  author = {Your Name},
  title = {SARM Progress Prediction},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/YOUR_USERNAME/YOUR_REPO}
}
```

## Training Configuration

<details>
<summary>Click to expand full training configuration</summary>

```json
{
  "metadata": {
    "model_name": "SARM Progress Prediction",
    "description": "Stage-aware progress prediction model for robot manipulation tasks",
    "task": "clearing_food_from_table_into_fridge",
    "task_number": 25,
    "dataset": "IliaLarchenko/behavior_224_rgb",
    "version": "1.0",
    "author": "Your Name",
    "tags": [
      "robotics",
      "progress-estimation",
      "behavior-cloning"
    ]
  },
  "model": {
    "d_model": 768,
    "n_heads": 12,
    "n_layers": 8,
    "d_mlp": 512,
    "num_stages": 100,
    "d_state": 256,
    "num_tasks": 50
  },
  "training": {
    "max_steps": 10000,
    "learning_rate": 0.0001,
    "weight_decay": 0.0001,
    "batch_size": 16,
    "gradient_accumulation_steps": 4,
    "max_grad_norm": 1.0,
    "scheduler": "cosine",
    "stage_loss_weight": 1.0,
    "progress_loss_weight": 1.0,
    "validation_steps": 100,
    "save_steps": 200
  },
  "data": {
    "max_sequence_length": 13,
    "image_size": 224,
    "num_workers": 10,
    "val_workers": 10,
    "val_samples": 500,
    "train_episodes": [
      1,
      2,
      3,
      4,
      5,
      6,
      7,
      8,
      9,
      10,
      11,
      12,
      13,
      14,
      15,
      16,
      17,
      18,
      19,
      20,
      21,
      22,
      23,
      24,
      25,
      26,
      27,
      28,
      29,
      30,
      31,
      32,
      33,
      34,
      35,
      36,
      37,
      38,
      39,
      40,
      41,
      42,
      43,
      44,
      45,
      46,
      47,
      48,
      49,
      50,
      51,
      52,
      53,
      54,
      55,
      56,
      57,
      58,
      59,
      60,
      61,
      62,
      63,
      64,
      65,
      66,
      67,
      68,
      69,
      70,
      71,
      72,
      73,
      74,
      75,
      76,
      77,
      78,
      79,
      80,
      81,
      82,
      83,
      84,
      85,
      86,
      87,
      88,
      89,
      90
    ],
    "val_episodes": [
      91,
      92,
      93,
      94,
      95,
      96,
      97,
      98,
      99,
      100,
      101,
      102,
      103,
      104,
      105
    ],
    "seed": 42
  },
  "logging": {
    "project_name": "sarm-training",
    "run_name": null,
    "log_freq": 10,
    "checkpoint_dir": "checkpoints_sarm_25_2"
  }
}
```

</details>