| # cosmos-transfer |
|
|
| REST API microservice wrapper around [NVIDIA Cosmos-Transfer2.5-2B](https://huggingface.co/nvidia/Cosmos-Transfer2.5-2B), a video diffusion model that converts synthetic renders into photorealistic video (Sim2Real). |
|
|
| --- |
|
|
| ## Installation |
|
|
| ```bash |
| docker pull ghcr.io/eyalenav/cosmos-transfer:latest |
| ``` |
|
|
| ### Run |
|
|
| ```bash |
| docker run --rm \ |
|   --gpus '"device=0"' \ |
|   -p 8080:8080 \ |
|   -v ~/.cache/huggingface:/root/.cache/huggingface \ |
|   -e HUGGINGFACE_TOKEN=hf_... \ |
|   ghcr.io/eyalenav/cosmos-transfer:latest |
| ``` |
|
|
| > **First run:** downloads the Cosmos-Transfer2.5-2B weights (~20 GB). Subsequent starts reuse the weights cached in the mounted `~/.cache/huggingface` directory and are fast. |
|
|
| --- |
|
|
| ## API Reference |
|
|
| ### `GET /health` |
|
|
| Check server status. |
|
|
| **Request** |
| ``` |
| GET http://localhost:8080/health |
| ``` |
|
|
| **Response** |
| ```json |
| { |
|   "status": "ok", |
|   "model": "Cosmos-Transfer2.5-2B", |
|   "device": "cuda:0" |
| } |
| ``` |
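
Because the first start downloads the model weights, a client may want to wait for `/health` before submitting work. The following is a minimal sketch of that idea using only the standard library. The function names `fetch_health` and `wait_until_healthy`, the 30-minute default timeout, and the 5-second interval are assumptions made for illustration; only the `/health` path and the `"status": "ok"` field come from the response shown above.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_healthy(
    base_url: str = "http://localhost:8080",
    timeout_s: float = 1800.0,   # assumed budget for the first-run download
    interval_s: float = 5.0,     # assumed polling interval
    fetch: Callable[[str], dict] = fetch_health,
) -> dict:
    """Poll /health until it reports status "ok" or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            body = fetch(base_url)
            if body.get("status") == "ok":
                return body
        except (OSError, ValueError):
            # OSError covers refused connections and HTTP errors from urllib;
            # ValueError covers a body that is not valid JSON yet.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{base_url}/health not ready after {timeout_s}s")
        time.sleep(interval_s)
```

Calling `wait_until_healthy()` once after `docker run` and before the first `/transfer` request avoids connection errors while the container is still starting.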
|
|
| --- |
|
|
| ### `POST /transfer` |
|
|
| Convert a synthetic video into photorealistic video using multi-control conditioning (edge + visual). |
|
|
| **Request** |
| ``` |
| POST http://localhost:8080/transfer |
| Content-Type: multipart/form-data |
| ``` |
|
|
| | Field | Type | Default | Description | |
| |---|---|---|---| |
| | `video` | file | required | Input synthetic MP4 (max 10s @ 24fps recommended) | |
| | `prompt` | string | `""` | Text describing the scene (improves realism) | |
| | `edge_strength` | float | `0.85` | Canny edge control strength (geometry preservation) | |
| | `vis_strength` | float | `0.45` | Visual/blur control strength (scene structure) | |
| | `sigma` | int | `100` | Noise level: lower = more faithful, higher = more realistic | |
| | `num_steps` | int | `35` | Diffusion steps (more = slower but higher quality) | |
| | `seed` | int | `-1` | Random seed (`-1` = random) | |
|
|
| **Response** |
|
|
| Binary MP4 file (`video/mp4`). |
|
|
| **Example** |
| ```bash |
| curl -X POST http://localhost:8080/transfer \ |
|   -F "video=@synthetic_render.mp4" \ |
|   -F "prompt=surveillance camera footage of a crowded urban street, overcast day" \ |
|   -F "edge_strength=0.85" \ |
|   -F "vis_strength=0.45" \ |
|   -F "sigma=100" \ |
|   --output photorealistic.mp4 |
| ``` |
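
The curl example leaves `num_steps` and `seed` at their defaults. When calling from Python it can help to build the complete set of form fields in one place, so that a fixed `seed` or a different `num_steps` is sent explicitly. The sketch below is one way to do that. `TRANSFER_DEFAULTS` and `build_transfer_fields` are hypothetical names, the defaults are copied from the field table above, and no value ranges are enforced because the table does not state any.

```python
from typing import Any, Dict

# Defaults copied from the /transfer field table (the video file is sent separately).
TRANSFER_DEFAULTS: Dict[str, Any] = {
    "prompt": "",
    "edge_strength": 0.85,
    "vis_strength": 0.45,
    "sigma": 100,
    "num_steps": 35,
    "seed": -1,
}


def build_transfer_fields(**overrides: Any) -> Dict[str, str]:
    """Merge overrides into the documented defaults and return form fields."""
    unknown = set(overrides) - set(TRANSFER_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown field(s): {sorted(unknown)}")
    fields: Dict[str, str] = {}
    for name, default in TRANSFER_DEFAULTS.items():
        value = overrides.get(name, default)
        # Coerce to the documented type (str, float, or int), then to text,
        # since multipart form values are transmitted as strings.
        fields[name] = str(type(default)(value))
    return fields
```

The returned dictionary can be passed as the `data=` argument of `requests.post`, for example `build_transfer_fields(seed=42, num_steps=50)`.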
|
|
| --- |
|
|
| ### `POST /transfer_async` |
| |
| Submit a job and poll for completion (recommended for long clips). |
| |
| **Submit** |
| ```bash |
| curl -X POST http://localhost:8080/transfer_async \ |
|   -F "video=@render.mp4" \ |
|   -F "prompt=security incident, parking lot" \ |
|   -F "edge_strength=0.85" \ |
|   --output job.json |
| # job.json contains: {"job_id": "abc123", "status": "queued"} |
| ``` |
| |
| **Poll** |
| ```bash |
| curl http://localhost:8080/status/abc123 |
| # {"job_id": "abc123", "status": "running", "progress": 0.42} |
| # ... |
| # {"job_id": "abc123", "status": "done"} |
| ``` |
| |
| **Download** |
| ```bash |
| curl http://localhost:8080/result/abc123 --output photorealistic.mp4 |
| ``` |
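
The submit, poll, and download steps above can be sketched in Python as a polling loop. This is an illustration under assumptions, not the service's official client: `get_status`, `wait_for_job`, and `download_result` are hypothetical names, the polling interval and timeout are arbitrary, and because only the `queued`, `running`, and `done` states appear in this README, any other status is treated as an error.

```python
import json
import shutil
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8080"


def get_status(job_id: str) -> dict:
    """GET /status/<job_id> and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/status/{job_id}", timeout=10) as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    interval_s: float = 5.0,     # assumed polling interval
    timeout_s: float = 3600.0,   # assumed overall budget
    status_fn: Callable[[str], dict] = get_status,
) -> dict:
    """Poll until the job reports "done"; raise on timeout or unknown status."""
    deadline = time.monotonic() + timeout_s
    while True:
        body = status_fn(job_id)
        status = body.get("status")
        if status == "done":
            return body
        if status not in ("queued", "running"):
            # Only queued/running/done are documented; treat the rest as failure.
            raise RuntimeError(f"job {job_id} reported status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not done after {timeout_s}s")
        time.sleep(interval_s)


def download_result(job_id: str, output_path: str) -> None:
    """Stream /result/<job_id> to disk without loading it all into memory."""
    with urllib.request.urlopen(f"{BASE_URL}/result/{job_id}") as resp:
        with open(output_path, "wb") as out:
            shutil.copyfileobj(resp, out)
```

A typical call sequence would be `wait_for_job("abc123")` followed by `download_result("abc123", "photorealistic.mp4")`.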
| |
| --- |
| |
| ## Tuned Parameters |
| |
| Tested across 80+ surveillance clips. Confirmed sweet spot: |
| |
| ``` |
| edge_strength=0.85 + vis_strength=0.45 + sigma=100 |
| ``` |
| |
| | Parameter | Value | Effect | |
| |---|---|---| |
| | `edge_strength` | **0.85** | Strong silhouette/geometry preservation from Canny edges | |
| | `vis_strength` | **0.45** | Moderate scene structure via visual blur control | |
| | `sigma` | **100** | Balanced noise: realistic textures without losing layout | |
| |
| ### When to adjust |
| |
| | Scenario | Adjustment | |
| |---|---| |
| | Subject drifts from synthetic pose | Increase `edge_strength` → 0.90–0.95 | |
| | Background too synthetic-looking | Increase `vis_strength` → 0.55–0.65 | |
| | Output too faithful to render colors | Increase `sigma` → 120 | |
| | Too much motion blur | Decrease `sigma` → 80 | |
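
The adjustment table above can be encoded as data so that a retry with different parameters is a lookup instead of a manual edit. The sketch below is illustrative only: the symptom keys (`pose_drift` and so on) and the names `TUNED`, `ADJUSTMENTS`, and `adjust` are hypothetical, and where the table gives a range the lower end is used as a starting point.

```python
from typing import Dict, Optional, Union

Number = Union[int, float]

# Tuned baseline from the "Tuned Parameters" table.
TUNED: Dict[str, Number] = {"edge_strength": 0.85, "vis_strength": 0.45, "sigma": 100}

# Hypothetical symptom labels mapped to the table's suggested values
# (lower end of each suggested range).
ADJUSTMENTS: Dict[str, Dict[str, Number]] = {
    "pose_drift": {"edge_strength": 0.90},
    "synthetic_background": {"vis_strength": 0.55},
    "too_faithful_colors": {"sigma": 120},
    "motion_blur": {"sigma": 80},
}


def adjust(symptom: str, base: Optional[Dict[str, Number]] = None) -> Dict[str, Number]:
    """Return a copy of the baseline with the adjustment for one symptom applied."""
    if symptom not in ADJUSTMENTS:
        raise KeyError(f"unknown symptom: {symptom!r}")
    params = dict(TUNED if base is None else base)  # copy, never mutate the baseline
    params.update(ADJUSTMENTS[symptom])
    return params
```

For example, `adjust("pose_drift")` keeps `vis_strength` and `sigma` at the tuned values and raises only `edge_strength`.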
| |
| --- |
| |
| ## Hardware Requirements |
| |
| | Resource | Minimum | Recommended | |
| |---|---|---| |
| | GPU | A100 40GB / RTX 6000 Ada | H100 / RTX PRO 6000 Blackwell | |
| | VRAM | 40 GB | 48+ GB | |
| | RAM | 64 GB | 128 GB | |
| | Disk | 30 GB | 50 GB | |
| | CUDA | 12.1+ | 12.8 | |
| |
| **Processing time (RTX PRO 6000 Blackwell, 96GB VRAM):** |
| - 4s clip @ 24fps → ~3 min |
| - 10s clip @ 24fps → ~7 min |
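
The two timings above can be turned into a rough client-side timeout. The sketch below draws a straight line through the two measured points, which is an assumption: two measurements on one GPU do not establish that runtime is linear in clip length, and other hardware will differ. The function names and the 1.5x safety factor are hypothetical.

```python
import math


def estimate_minutes(clip_seconds: float) -> float:
    """Linear estimate through the measured points (4 s, ~3 min) and (10 s, ~7 min)."""
    (x0, y0), (x1, y1) = (4.0, 3.0), (10.0, 7.0)
    slope = (y1 - y0) / (x1 - x0)  # minutes of processing per second of clip
    return y0 + slope * (clip_seconds - x0)


def suggested_timeout_s(clip_seconds: float, safety_factor: float = 1.5) -> int:
    """Request timeout in seconds, padded by an assumed safety factor."""
    return math.ceil(estimate_minutes(clip_seconds) * 60 * safety_factor)
```

For a 10-second clip this gives roughly 7 minutes of processing, so a synchronous `/transfer` call needs a timeout comfortably above 420 seconds on comparable hardware.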
| |
| --- |
| |
| ## Environment Variables |
| |
| | Variable | Required | Description | |
| |---|---|---| |
| | `HUGGINGFACE_TOKEN` | Yes | HF token with access to `nvidia/Cosmos-Transfer2.5-2B` | |
| | `CUDA_VISIBLE_DEVICES` | No | Limit to specific GPU (e.g. `"1"`) | |
| | `PORT` | No | Override default port `8080` | |
| |
| --- |
| |
| ## Integration with VisionAI-Flywheel |
| |
| ```yaml |
| # docker-compose.yml excerpt |
| services: |
|   cosmos-transfer: |
|     image: ghcr.io/eyalenav/cosmos-transfer:latest |
|     ports: |
|       - "8080:8080" |
|     deploy: |
|       resources: |
|         reservations: |
|           devices: |
|             - driver: nvidia |
|               device_ids: ["1"] |
|               capabilities: [gpu] |
|     volumes: |
|       - hf_cache:/root/.cache/huggingface |
|     environment: |
|       - HUGGINGFACE_TOKEN=${HUGGINGFACE_TOKEN} |
| ``` |
| |
| Full `docker-compose.yml`: [github.com/EyalEnav/VisionAI-Flywheel](https://github.com/EyalEnav/VisionAI-Flywheel) |
|
|
| --- |
|
|
| ## Example: Full Python client |
|
|
| ```python |
| import requests |
| import time |
|
| def transfer_video( |
|     input_path: str, |
|     output_path: str, |
|     prompt: str = "", |
|     edge_strength: float = 0.85, |
|     vis_strength: float = 0.45, |
|     sigma: int = 100 |
| ): |
|     """Convert synthetic video to photorealistic.""" |
|     with open(input_path, "rb") as f: |
|         response = requests.post( |
|             "http://localhost:8080/transfer", |
|             files={"video": ("input.mp4", f, "video/mp4")}, |
|             data={ |
|                 "prompt": prompt, |
|                 "edge_strength": edge_strength, |
|                 "vis_strength": vis_strength, |
|                 "sigma": sigma, |
|             }, |
|             timeout=600 |
|         ) |
|     response.raise_for_status() |
|
|     with open(output_path, "wb") as f: |
|         f.write(response.content) |
|     print(f"Saved to {output_path}") |
|
| # Example usage |
| transfer_video( |
|     input_path="soma_render.mp4", |
|     output_path="photorealistic.mp4", |
|     prompt="surveillance camera, urban street, daytime, overcast sky" |
| ) |
| ``` |
|
|
| --- |
|
|
| ## License |
|
|
| Apache 2.0 |
|
|
| > Cosmos-Transfer2.5 model weights are released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Weights are downloaded at runtime and are not bundled in this image. |
|
|