---
license: apache-2.0
tags:
- motion-generation
- text-to-motion
- human-motion
- surveillance
- synthetic-data
- docker
- rest-api
- kimodo
- nvidia
pipeline_tag: text-to-video
---

# kimodo-api 🏃

A **REST API wrapper** around [NVIDIA Kimodo](https://github.com/nv-tlabs/kimodo), the state-of-the-art text-to-motion diffusion model trained on 700 hours of commercial mocap data.

This image turns Kimodo into a microservice you can call from any pipeline, no Python environment needed.

## Quick Start

```bash
docker pull ghcr.io/eyalenav/kimodo-api:latest

docker run --rm --gpus '"device=0"' -p 9551:9551 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGINGFACE_TOKEN=hf_... \
  ghcr.io/eyalenav/kimodo-api:latest
```

> ⚠️ First run downloads Llama-3-8B-Instruct (~16GB) for the text encoder. Requires a HuggingFace token with access to `meta-llama/Meta-Llama-3-8B-Instruct`.
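
To confirm your token has access before launching the container, a quick local check with `huggingface_hub` (a sketch; assumes the package is installed on the host) will raise if the gated repo is not visible to the token:

```python
from huggingface_hub import HfApi

# Raises an HTTP / gated-repo error if the token cannot see the repo.
HfApi(token="hf_...").model_info("meta-llama/Meta-Llama-3-8B-Instruct")
print("Token can access the Llama-3 text encoder weights.")
```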

## API

### `POST /generate`

Generate a motion clip from a text prompt.

```bash
curl -X POST http://localhost:9551/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "person pushing through a crowd aggressively"}'
```

**Response:** a binary NPZ file in the SOMA 77-joint skeleton format, compatible with BVH export.
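
For pipeline use, a minimal Python client is sketched below. The endpoint and prompt field come from the example above; the array names inside the NPZ are not documented here, so the code just lists whatever `np.load` finds:

```python
import io

import numpy as np
import requests

resp = requests.post(
    "http://localhost:9551/generate",
    json={"prompt": "person pushing through a crowd aggressively"},
    timeout=300,  # diffusion sampling can take a while
)
resp.raise_for_status()

# The body is a binary NPZ archive; load it in memory to inspect it.
motion = np.load(io.BytesIO(resp.content))
print(motion.files)  # array names, e.g. per-frame joint data

# Keep the raw bytes for downstream tools such as a BVH exporter.
with open("motion.npz", "wb") as f:
    f.write(resp.content)
```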

### `GET /health`

```bash
curl http://localhost:9551/health
# {"status": "ok"}
```
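
Because model weights are downloaded and loaded on first start, the server can take minutes before it answers, so downstream jobs should poll `/health` rather than connect immediately. A minimal sketch, assuming only the documented `{"status": "ok"}` response:

```python
import time

import requests

def wait_until_ready(base_url="http://localhost:9551",
                     timeout_s=1800, interval_s=5):
    """Block until GET /health reports ok, or raise after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            r = requests.get(f"{base_url}/health", timeout=5)
            if r.ok and r.json().get("status") == "ok":
                return
        except requests.RequestException:
            pass  # container still starting; port not open yet
        time.sleep(interval_s)
    raise TimeoutError("kimodo-api did not become healthy in time")

wait_until_ready()
```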

## Requirements

| Resource | Minimum |
|---|---|
| GPU | RTX 3090 / A100 / RTX 6000 Ada |
| VRAM | 24 GB |
| RAM | 32 GB |
| Disk | 50 GB (model weights) |

## What's inside

- **Kimodo**, NVIDIA's kinematic motion diffusion model (77-joint SOMA skeleton)
- **LLM2Vec** text encoder backed by **Llama-3-8B-Instruct**
- **FastAPI** server on port 9551
- Health check + graceful startup

## Part of VisionAI-Flywheel

This service is one component of a full synthetic surveillance data pipeline:

```
[kimodo-api] → NPZ motion
    ↓
[render-api] → SOMA mesh render (MP4)
    ↓
[cosmos-transfer] → Sim2Real photorealistic video
    ↓
[NVIDIA VSS] → VLM annotation → fine-tuning dataset
```
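
Each hop in the diagram is a plain HTTP handoff. Purely as an illustration (the `render-api` port and `POST /render` contract below are assumptions, not its documented interface; see the pipeline repo for the real one), chaining the first two stages could look like:

```python
import requests

KIMODO = "http://localhost:9551"
RENDER = "http://localhost:9552"  # assumed port for render-api

# Stage 1: text prompt -> NPZ motion clip (documented above).
motion = requests.post(f"{KIMODO}/generate",
                       json={"prompt": "person climbing over a fence"},
                       timeout=300)
motion.raise_for_status()

# Stage 2: hand the NPZ to render-api. This endpoint and field name
# are hypothetical -- check the VisionAI-Flywheel repo for the real API.
video = requests.post(f"{RENDER}/render",
                      files={"motion": ("motion.npz", motion.content)},
                      timeout=600)
video.raise_for_status()

with open("render.mp4", "wb") as f:
    f.write(video.content)
```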

🔗 Full pipeline: [github.com/EyalEnav/VisionAI-Flywheel](https://github.com/EyalEnav/VisionAI-Flywheel)

## License

Apache 2.0; see [LICENSE](https://github.com/EyalEnav/VisionAI-Flywheel/blob/main/LICENSE)

> Kimodo model weights are released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/) and downloaded at runtime. They are not bundled in this image.