# kimodo-api

REST API microservice wrapper around [NVIDIA Kimodo](https://github.com/nv-tlabs/kimodo), a text-to-motion diffusion model that generates 77-joint SOMA skeleton motion from natural-language prompts.

---

## Installation

```bash
docker pull ghcr.io/eyalenav/kimodo-api:latest
```

### Run

```bash
docker run --rm \
  --gpus '"device=0"' \
  -p 9551:9551 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGINGFACE_TOKEN=hf_... \
  ghcr.io/eyalenav/kimodo-api:latest
```

> **First run:** downloads Llama-3-8B-Instruct (~16 GB) and the Kimodo weights. Subsequent starts are fast (weights are cached in `/root/.cache/huggingface`).

---

## API Reference

### `GET /health`

Check server status.

**Request**
```
GET http://localhost:9551/health
```

**Response**
```json
{
  "status": "ok"
}
```

---

### `POST /generate`

Generate a motion clip from a text prompt.

**Request**
```
POST http://localhost:9551/generate
Content-Type: application/json
```

```json
{
  "prompt": "person pushing through a crowd aggressively",
  "num_frames": 120,
  "fps": 30
}
```

| Field | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | Natural-language motion description |
| `num_frames` | int | `120` | Number of frames to generate |
| `fps` | int | `30` | Frames per second (recorded as metadata only; does not change the generated frames) |
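
A minimal sketch of building the request body client-side with the documented defaults (the `build_generate_payload` helper is illustrative, not part of this API):

```python
def build_generate_payload(prompt: str, num_frames: int = 120, fps: int = 30) -> dict:
    """Build a /generate request body, applying the documented defaults."""
    if not prompt:
        raise ValueError("prompt is required")
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return {"prompt": prompt, "num_frames": num_frames, "fps": fps}

# Omitted fields fall back to the documented defaults
payload = build_generate_payload("person waving")
print(payload)  # {'prompt': 'person waving', 'num_frames': 120, 'fps': 30}
```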

**Response**

Binary NPZ file (`application/octet-stream`).

The NPZ contains:

| Key | Shape | Description |
|---|---|---|
| `poses` | `(T, 77, 3)` | Per-frame joint rotations (axis-angle) |
| `trans` | `(T, 3)` | Per-frame root translation |
| `betas` | `(16,)` | SMPL body shape parameters |
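
To sanity-check a downloaded clip against the shapes above, something like the following works (`validate_motion_npz` is an illustrative helper, shown here on a synthetic in-memory NPZ rather than a live response):

```python
import io

import numpy as np

def validate_motion_npz(data: bytes) -> int:
    """Check the documented NPZ keys and shapes; return the frame count T."""
    npz = np.load(io.BytesIO(data))
    poses, trans, betas = npz["poses"], npz["trans"], npz["betas"]
    T = poses.shape[0]
    assert poses.shape == (T, 77, 3), "poses must be (T, 77, 3)"
    assert trans.shape == (T, 3), "trans must be (T, 3)"
    assert betas.shape == (16,), "betas must be (16,)"
    return T

# Synthetic stand-in for a /generate response body
buf = io.BytesIO()
np.savez(buf, poses=np.zeros((120, 77, 3)), trans=np.zeros((120, 3)), betas=np.zeros(16))
print(validate_motion_npz(buf.getvalue()))  # 120
```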

**Example**
```bash
curl -X POST http://localhost:9551/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "person falling to the ground after being pushed"}' \
  --output output_motion.npz
```

---

### `POST /generate_bvh`

Generate motion and return it in BVH (Biovision Hierarchy) format.

**Request**
```
POST http://localhost:9551/generate_bvh
Content-Type: application/json
```

```json
{
  "prompt": "two people fighting, punches thrown",
  "num_frames": 150
}
```

**Response**

BVH text file (`text/plain`).

**Example**
```bash
curl -X POST http://localhost:9551/generate_bvh \
  -H "Content-Type: application/json" \
  -d '{"prompt": "drunk person stumbling and falling"}' \
  --output output_motion.bvh
```
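
Because BVH is plain text, basic metadata can be read without a full parser. A sketch that pulls the frame count and frame time out of the standard BVH `MOTION` header (`bvh_frame_info` is illustrative, not part of this API):

```python
def bvh_frame_info(bvh_text: str) -> tuple[int, float]:
    """Parse the 'Frames:' and 'Frame Time:' lines of a BVH MOTION section."""
    frames, frame_time = None, None
    for line in bvh_text.splitlines():
        line = line.strip()
        if line.startswith("Frames:"):
            frames = int(line.split(":", 1)[1])
        elif line.startswith("Frame Time:"):
            frame_time = float(line.split(":", 1)[1])
    if frames is None or frame_time is None:
        raise ValueError("not a valid BVH MOTION header")
    return frames, frame_time

# Example on a minimal MOTION header
sample = "MOTION\nFrames: 150\nFrame Time: 0.033333\n"
print(bvh_frame_info(sample))  # (150, 0.033333)
```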

---

## Hardware Requirements

| Resource | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3090 (24 GB VRAM) | RTX 6000 Ada / A100 |
| VRAM | 24 GB | 48 GB |
| RAM | 32 GB | 64 GB |
| Disk | 50 GB | 100 GB |
| CUDA | 12.1+ | 12.8 |

---

## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `HUGGINGFACE_TOKEN` | Yes | HF token with access to `meta-llama/Meta-Llama-3-8B-Instruct` |
| `CUDA_VISIBLE_DEVICES` | No | Limit to a specific GPU (e.g. `"0"`) |
| `PORT` | No | Override the default port `9551` |
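
A client that targets a server started with a `PORT` override can mirror the same variable when building its base URL; a small illustrative sketch:

```python
import os

def api_base_url(host: str = "localhost") -> str:
    """Build the kimodo-api base URL, honoring a PORT override (default 9551)."""
    port = os.environ.get("PORT", "9551")
    return f"http://{host}:{port}"

# With no PORT set, this yields the documented default
print(api_base_url())  # e.g. http://localhost:9551
```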

---

## Integration with VisionAI-Flywheel

`kimodo-api` is designed to run alongside `render-api` and `cosmos-transfer` as part of the full pipeline:

```yaml
# docker-compose.yml excerpt
services:
  kimodo-api:
    image: ghcr.io/eyalenav/kimodo-api:latest
    ports:
      - "9551:9551"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    volumes:
      - hf_cache:/root/.cache/huggingface
    environment:
      - HUGGINGFACE_TOKEN=${HUGGINGFACE_TOKEN}
```
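
Because a cold start can spend minutes downloading weights, dependent services should poll `GET /health` before sending work. A generic retry sketch (the probe callable is injected so any HTTP check can be used; names are illustrative):

```python
import time

def wait_for_health(probe, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll `probe()` (returns True when healthy) until success or timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

# Example with a stub probe that succeeds on the third attempt
attempts = iter([False, False, True])
print(wait_for_health(lambda: next(attempts), timeout_s=10.0, interval_s=0.01))  # True
```

In production the probe would be a real HTTP check, e.g. a `requests.get` against `/health` that returns `True` on status 200.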

Full `docker-compose.yml`: [github.com/EyalEnav/VisionAI-Flywheel](https://github.com/EyalEnav/VisionAI-Flywheel)

---

## Example: Full Python client

```python
import io

import numpy as np
import requests

def generate_motion(prompt: str, num_frames: int = 120) -> dict:
    """Generate motion NPZ from a text prompt."""
    response = requests.post(
        "http://localhost:9551/generate",
        json={"prompt": prompt, "num_frames": num_frames},
        timeout=120,
    )
    response.raise_for_status()

    npz = np.load(io.BytesIO(response.content))
    return {
        "poses": npz["poses"],  # (T, 77, 3)
        "trans": npz["trans"],  # (T, 3)
        "betas": npz["betas"],  # (16,)
    }

# Example usage
motion = generate_motion("security guard running toward an incident")
print(f"Generated {motion['poses'].shape[0]} frames")
```

---

## License

Apache 2.0

> Kimodo model weights are released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Weights are downloaded at runtime and are not bundled in this image.