| --- |
| license: apache-2.0 |
| tags: |
| - motion-generation |
| - text-to-motion |
| - human-motion |
| - surveillance |
| - synthetic-data |
| - docker |
| - rest-api |
| - kimodo |
| - nvidia |
| pipeline_tag: text-to-video |
| --- |
| |
# kimodo-api
|
|
A **REST API wrapper** around [NVIDIA Kimodo](https://github.com/nv-tlabs/kimodo), the state-of-the-art text-to-motion diffusion model trained on 700 hours of commercial mocap data.
|
|
| This image turns Kimodo into a microservice you can call from any pipeline, no Python environment needed. |
|
|
| ## Quick Start |
|
|
| ```bash |
| docker pull ghcr.io/eyalenav/kimodo-api:latest |
| |
| docker run --rm --gpus '"device=0"' -p 9551:9551 \ |
| -v ~/.cache/huggingface:/root/.cache/huggingface \ |
| -e HUGGINGFACE_TOKEN=hf_... \ |
| ghcr.io/eyalenav/kimodo-api:latest |
| ``` |
|
|
> ⚠️ First run downloads Llama-3-8B-Instruct (~16 GB) for the text encoder. Requires a HuggingFace token with access to `meta-llama/Meta-Llama-3-8B-Instruct`.
|
|
| ## API |
|
|
| ### `POST /generate` |
|
|
| Generate a motion clip from a text prompt. |
|
|
| ```bash |
| curl -X POST http://localhost:9551/generate \ |
| -H "Content-Type: application/json" \ |
| -d '{"prompt": "person pushing through a crowd aggressively"}' |
| ``` |
|
|
**Response:** a binary NPZ file in the SOMA 77-joint skeleton format, compatible with BVH export.
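For scripting, it can help to build the JSON body programmatically and save the binary response to disk. A minimal sketch; the `build_payload` helper and the `motion.npz` filename are illustrative conveniences, not part of the API:

```shell
# build_payload PROMPT -- emit the JSON body that /generate expects,
# escaping embedded double quotes so the payload stays valid JSON
build_payload() {
  printf '{"prompt": "%s"}' "$(printf '%s' "$1" | sed 's/"/\\"/g')"
}

# Save the generated clip (requires the container to be running):
# curl -sf -X POST http://localhost:9551/generate \
#   -H "Content-Type: application/json" \
#   -d "$(build_payload 'person pushing through a crowd aggressively')" \
#   -o motion.npz
```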
|
|
| ### `GET /health` |
|
|
| ```bash |
| curl http://localhost:9551/health |
| # {"status": "ok"} |
| ``` |
|
|
| ## Requirements |
|
|
| | Resource | Minimum | |
| |---|---| |
| | GPU | RTX 3090 / A100 / RTX 6000 Ada | |
| | VRAM | 24 GB | |
| | RAM | 32 GB | |
| | Disk | 50 GB (model weights) | |
|
|
| ## What's inside |
|
|
- **Kimodo** – NVIDIA's kinematic motion diffusion model (77-joint SOMA skeleton)
| - **LLM2Vec** text encoder backed by **Llama-3-8B-Instruct** |
| - **FastAPI** server on port 9551 |
| - Health check + graceful startup |
|
|
| ## Part of VisionAI-Flywheel |
|
|
| This service is one component of a full synthetic surveillance data pipeline: |
|
|
| ``` |
[kimodo-api]      →  NPZ motion
        ↓
[render-api]      →  SOMA mesh render (MP4)
        ↓
[cosmos-transfer] →  Sim2Real photorealistic video
        ↓
[NVIDIA VSS]      →  VLM annotation → fine-tuning dataset
| ``` |
|
|
Full pipeline: [github.com/EyalEnav/VisionAI-Flywheel](https://github.com/EyalEnav/VisionAI-Flywheel)
|
|
| ## License |
|
|
Apache 2.0 – see [LICENSE](https://github.com/EyalEnav/VisionAI-Flywheel/blob/main/LICENSE)
|
|
| > Kimodo model weights are released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/) and downloaded at runtime. They are not bundled in this image. |
|
|