| --- |
| license: apache-2.0 |
| tags: |
| - motion-generation |
| - text-to-motion |
| - human-motion |
| - surveillance |
| - synthetic-data |
| - docker |
| - rest-api |
| - kimodo |
| - nvidia |
| pipeline_tag: text-to-video |
| --- |
| |
# kimodo-api
|
|
A **REST API wrapper** around [NVIDIA Kimodo](https://github.com/nv-tlabs/kimodo), the state-of-the-art text-to-motion diffusion model trained on 700 hours of commercial mocap data.
|
|
| This image turns Kimodo into a microservice you can call from any pipeline, no Python environment needed. |
|
|
| ## Quick Start |
|
|
| ```bash |
| docker pull ghcr.io/eyalenav/kimodo-api:latest |
| |
| docker run --rm --gpus '"device=0"' -p 9551:9551 \ |
| -v ~/.cache/huggingface:/root/.cache/huggingface \ |
| -e HUGGINGFACE_TOKEN=hf_... \ |
| ghcr.io/eyalenav/kimodo-api:latest |
| ``` |
|
|
> ⚠️ First run downloads Llama-3-8B-Instruct (~16 GB) for the text encoder. Requires a HuggingFace token with access to `meta-llama/Meta-Llama-3-8B-Instruct`.
|
|
| ## API |
|
|
| ### `POST /generate` |
|
|
| Generate a motion clip from a text prompt. |
|
|
| ```bash |
| curl -X POST http://localhost:9551/generate \ |
| -H "Content-Type: application/json" \ |
| -d '{"prompt": "person pushing through a crowd aggressively"}' |
| ``` |
|
|
**Response:** a binary NPZ file in the SOMA 77-joint skeleton format, compatible with BVH export.
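For scripting, it can help to build the JSON body programmatically and save the binary response to disk. A minimal sketch; the `build_payload` helper and the `motion.npz` filename are illustrative conveniences, not part of the API:

```shell
# build_payload PROMPT -- emit the JSON body that /generate expects,
# escaping embedded double quotes so the payload stays valid JSON
build_payload() {
  printf '{"prompt": "%s"}' "$(printf '%s' "$1" | sed 's/"/\\"/g')"
}

# Save the generated clip (requires the container to be running):
# curl -sf -X POST http://localhost:9551/generate \
#   -H "Content-Type: application/json" \
#   -d "$(build_payload 'person pushing through a crowd aggressively')" \
#   -o motion.npz
```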
|
|
| ### `GET /health` |
|
|
| ```bash |
| curl http://localhost:9551/health |
| # {"status": "ok"} |
| ``` |
|
|
| ## Requirements |
|
|
| | Resource | Minimum | |
| |---|---| |
| | GPU | RTX 3090 / A100 / RTX 6000 Ada | |
| | VRAM | 24 GB | |
| | RAM | 32 GB | |
| | Disk | 50 GB (model weights) | |
|
|
| ## What's inside |
|
|
- **Kimodo** – NVIDIA's kinematic motion diffusion model (77-joint SOMA skeleton)
| - **LLM2Vec** text encoder backed by **Llama-3-8B-Instruct** |
| - **FastAPI** server on port 9551 |
| - Health check + graceful startup |
|
|
| ## Part of VisionAI-Flywheel |
|
|
| This service is one component of a full synthetic surveillance data pipeline: |
|
|
| ``` |
[kimodo-api]      →  NPZ motion
        ↓
[render-api]      →  SOMA mesh render (MP4)
        ↓
[cosmos-transfer] →  Sim2Real photorealistic video
        ↓
[NVIDIA VSS]      →  VLM annotation → fine-tuning dataset
| ``` |
|
|
Full pipeline: [github.com/EyalEnav/VisionAI-Flywheel](https://github.com/EyalEnav/VisionAI-Flywheel)
|
|
| ## License |
|
|
Apache 2.0 – see [LICENSE](https://github.com/EyalEnav/VisionAI-Flywheel/blob/main/LICENSE)
|
|
| > Kimodo model weights are released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/) and downloaded at runtime. They are not bundled in this image. |
|
|