+# cosmos-transfer
+REST API microservice wrapper around [NVIDIA Cosmos-Transfer2.5-2B](https://huggingface.co/nvidia/Cosmos-Transfer2.5-2B) — a video diffusion model that converts synthetic renders into photorealistic video (Sim2Real).
+---
+## Installation
+```bash
+docker pull ghcr.io/eyalenav/cosmos-transfer:latest
+```
+### Run
+```bash
+docker run --rm \
+  --gpus '"device=0"' \
+  -p 8080:8080 \
+  -v ~/.cache/huggingface:/root/.cache/huggingface \
+  -e HUGGINGFACE_TOKEN=hf_... \
+  ghcr.io/eyalenav/cosmos-transfer:latest
+```
+> **First run:** downloads Cosmos-Transfer2.5-2B weights (~20 GB). Subsequent starts are fast.
+---
+## API Reference
+### `GET /health`
+Check server status.
+**Request**
+```
+GET http://localhost:8080/health
+```
+**Response**
+```json
+{
+  "status": "ok",
+  "model": "Cosmos-Transfer2.5-2B",
+  "device": "cuda:0"
+}
+```
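Because the first start downloads the model weights, a client may want to wait for `/health` to report `"status": "ok"` before submitting work. The following is a minimal sketch, not part of the service: `wait_until_healthy` and the injected `fetch` callable are hypothetical names, and it assumes (unconfirmed by this README) that the endpoint is either unreachable or returns a non-`ok` status while the model is still loading.

```python
import time
from typing import Callable, Optional


def wait_until_healthy(
    fetch: Callable[[], dict],
    timeout_s: float = 1800.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until it returns a payload whose status is "ok".

    fetch is any callable returning the parsed /health JSON. It may raise
    OSError (for example a refused connection) while the server starts.
    """
    deadline = time.monotonic() + timeout_s
    last_error: Optional[Exception] = None
    while True:
        try:
            payload = fetch()
            if payload.get("status") == "ok":
                return payload
        except OSError as exc:  # server is not accepting connections yet
            last_error = exc
        if time.monotonic() >= deadline:
            raise TimeoutError(f"server not healthy after {timeout_s}s") from last_error
        sleep(interval_s)
```

With the `requests` library, `fetch` could be `lambda: requests.get("http://localhost:8080/health", timeout=5).json()`; connection errors raised by `requests` derive from `OSError`, so they are retried by the loop above.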
+---
+### `POST /transfer`
+Convert a synthetic video to photorealistic video using multicontrol conditioning (edge + visual).
+**Request**
+```
+POST http://localhost:8080/transfer
+Content-Type: multipart/form-data
+```
+| Field | Type | Default | Description |
+|---|---|---|---|
+| `video` | file | required | Input synthetic MP4 (clips of at most 10 s at 24 fps recommended) |
+| `prompt` | string | `""` | Text describing the scene (improves realism) |
+| `edge_strength` | float | `0.85` | Canny edge control strength (geometry preservation) |
+| `vis_strength` | float | `0.45` | Visual/blur control strength (scene structure) |
+| `sigma` | int | `100` | Noise level: lower values stay more faithful to the input render, higher values look more realistic |
+| `num_steps` | int | `35` | Diffusion steps (more = slower but higher quality) |
+| `seed` | int | `-1` | Random seed (`-1` = random) |
+**Response**
+Binary MP4 file (`video/mp4`).
+**Example**
+```bash
+curl -X POST http://localhost:8080/transfer \
+  -F "video=@synthetic_render.mp4" \
+  -F "prompt=surveillance camera footage of a crowded urban street, overcast day" \
+  -F "edge_strength=0.85" \
+  -F "vis_strength=0.45" \
+  -F "sigma=100" \
+  --output photorealistic.mp4
+```
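The defaults from the parameter table can be captured on the client side so that every request sends an explicit, reproducible set of fields. This is only a sketch: `TransferParams` and `to_form` are illustrative names, not something the service provides.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TransferParams:
    """Form fields for POST /transfer, using the documented defaults."""

    prompt: str = ""
    edge_strength: float = 0.85
    vis_strength: float = 0.45
    sigma: int = 100
    num_steps: int = 35
    seed: int = -1  # -1 asks the server for a random seed

    def to_form(self) -> dict:
        """Return the fields as strings, ready for a multipart form body."""
        return {name: str(value) for name, value in asdict(self).items()}


# Override only what differs from the defaults.
params = TransferParams(prompt="parking lot at dusk", seed=1234)
form = params.to_form()
```

The resulting `form` dictionary can be passed as `data=form` to `requests.post`, next to the `files={"video": ...}` upload.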
+---
+### `POST /transfer_async`
+Submit a job and poll for completion (recommended for long clips).
+**Submit**
+```bash
+curl -X POST http://localhost:8080/transfer_async \
+  -F "video=@render.mp4" \
+  -F "prompt=security incident, parking lot" \
+  -F "edge_strength=0.85" \
+  --output job.json
+# job.json then contains, for example: {"job_id": "abc123", "status": "queued"}
+```
+**Poll**
+```bash
+curl http://localhost:8080/status/abc123
+# {"job_id": "abc123", "status": "running", "progress": 0.42}
+# ...
+# {"job_id": "abc123", "status": "done"}
+```
+**Download**
+```bash
+curl http://localhost:8080/result/abc123 --output photorealistic.mp4
+```
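The submit, poll, download sequence above can be wrapped in a small polling loop. The sketch below is hedged: `wait_for_job` and the injected `get_status` callable are hypothetical helpers, and only the three states shown in the examples (`queued`, `running`, `done`) are treated as known. Failure states are not documented here, so any other value is raised as an error rather than guessed at.

```python
import time
from typing import Callable

IN_PROGRESS = {"queued", "running"}  # states shown in the examples above


def wait_for_job(
    get_status: Callable[[str], dict],
    job_id: str,
    interval_s: float = 5.0,
    max_polls: int = 720,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reports "done"; raise on any unrecognised state."""
    for _ in range(max_polls):
        status = get_status(job_id)
        state = status.get("status")
        if state == "done":
            return status
        if state not in IN_PROGRESS:
            # Failure states are undocumented, so treat anything else as an error.
            raise RuntimeError(f"job {job_id} in unexpected state: {state!r}")
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} not done after {max_polls} polls")
```

In practice `get_status` could be `lambda jid: requests.get(f"http://localhost:8080/status/{jid}", timeout=10).json()`, followed by a `GET /result/<job_id>` once the function returns.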
+---
+## Tuned Parameters
+Tested across 80+ surveillance clips; this combination gave the best results:
+```
+edge_strength=0.85 + vis_strength=0.45 + sigma=100
+```
+| Parameter | Value | Effect |
+|---|---|---|
+| `edge_strength` | **0.85** | Strong silhouette/geometry preservation from Canny edges |
+| `vis_strength` | **0.45** | Moderate scene structure via visual blur control |
+| `sigma` | **100** | Balanced noise — realistic textures without losing layout |
+### When to adjust
+| Scenario | Adjustment |
+|---|---|
+| Subject drifts from synthetic pose | Increase `edge_strength` → 0.90–0.95 |
+| Background too synthetic-looking | Increase `vis_strength` → 0.55–0.65 |
+| Output too faithful to render colors | Increase `sigma` → 120 |
+| Too much motion blur | Decrease `sigma` → 80 |
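The adjustment table can be encoded as a lookup so that a tuning script applies one change at a time from the tuned baseline. This is a sketch under stated assumptions: the symptom keys (`pose_drift` and so on) are made-up labels for the table rows, and where the table gives a range the low end is used as a conservative first step, which is an arbitrary choice.

```python
from typing import Optional

# Suggested values copied from the table above: (parameter, low, high).
ADJUSTMENTS = {
    "pose_drift": ("edge_strength", 0.90, 0.95),
    "synthetic_background": ("vis_strength", 0.55, 0.65),
    "render_colors_kept": ("sigma", 120, 120),
    "motion_blur": ("sigma", 80, 80),
}

# Tuned baseline from the "Tuned Parameters" table.
BASELINE = {"edge_strength": 0.85, "vis_strength": 0.45, "sigma": 100}


def adjust(symptom: str, params: Optional[dict] = None) -> dict:
    """Return a copy of params with the table's suggestion for symptom applied."""
    adjusted = dict(BASELINE if params is None else params)
    name, low, _high = ADJUSTMENTS[symptom]
    adjusted[name] = low  # start at the low end of the suggested range
    return adjusted
```

For example, `adjust("pose_drift")` raises `edge_strength` to 0.90 and leaves the other two values at the baseline.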
+---
+## Hardware Requirements
+| Resource | Minimum | Recommended |
+|---|---|---|
+| GPU | A100 40GB / RTX 6000 Ada | H100 / RTX PRO 6000 Blackwell |
+| VRAM | 40 GB | 48+ GB |
+| RAM | 64 GB | 128 GB |
+| Disk | 30 GB | 50 GB |
+| CUDA | 12.1+ | 12.8 |
+**Processing time (RTX PRO 6000 Blackwell, 96GB VRAM):**
+- 4s clip @ 24fps → ~3 min
+- 10s clip @ 24fps → ~7 min
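The two timings above can be turned into a rough client-side timeout. The sketch below fits a straight line through the two measured points. That is an assumption, since two data points on a single GPU do not establish that processing time scales linearly, and the 1.5 safety factor is arbitrary. `estimate_minutes` and `request_timeout_s` are hypothetical helper names.

```python
# Measured points from the list above: (clip seconds, processing minutes).
MEASURED = [(4.0, 3.0), (10.0, 7.0)]


def estimate_minutes(clip_seconds: float) -> float:
    """Estimate processing time from a line through the two measured points."""
    (x0, y0), (x1, y1) = MEASURED
    slope = (y1 - y0) / (x1 - x0)  # minutes of processing per second of clip
    return y0 + slope * (clip_seconds - x0)


def request_timeout_s(clip_seconds: float, safety_factor: float = 1.5) -> int:
    """Client timeout in seconds: the estimate padded by a safety factor."""
    return round(estimate_minutes(clip_seconds) * 60 * safety_factor)
```

Estimates for other GPUs or for clips well outside the 4 s to 10 s range should be re-measured rather than extrapolated from this line.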
+---
+## Environment Variables
+| Variable | Required | Description |
+|---|---|---|
+| `HUGGINGFACE_TOKEN` | Yes | HF token with access to `nvidia/Cosmos-Transfer2.5-2B` |
+| `CUDA_VISIBLE_DEVICES` | No | Limit to specific GPU (e.g. `"1"`) |
+| `PORT` | No | Override default port `8080` |
+---
+## Integration with VisionAI-Flywheel
+```yaml
+# docker-compose.yml excerpt
+services:
+  cosmos-transfer:
+    image: ghcr.io/eyalenav/cosmos-transfer:latest
+    ports:
+      - "8080:8080"
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              device_ids: ["1"]
+              capabilities: [gpu]
+    volumes:
+      - hf_cache:/root/.cache/huggingface
+    environment:
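+      - HUGGINGFACE_TOKEN=${HUGGINGFACE_TOKEN}
+
+# The named volume used above must be declared at the top level of the file.
+volumes:
+  hf_cache: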
+```
+Full `docker-compose.yml`: [github.com/EyalEnav/VisionAI-Flywheel](https://github.com/EyalEnav/VisionAI-Flywheel)
+---
+## Example: Full Python client
+```python
+import requests
+
+
+def transfer_video(
+    input_path: str,
+    output_path: str,
+    prompt: str = "",
+    edge_strength: float = 0.85,
+    vis_strength: float = 0.45,
+    sigma: int = 100,
+) -> None:
+    """Convert a synthetic video to photorealistic video via POST /transfer."""
+    # Upload the render as multipart form data along with the control parameters.
+    with open(input_path, "rb") as f:
+        response = requests.post(
+            "http://localhost:8080/transfer",
+            files={"video": ("input.mp4", f, "video/mp4")},
+            data={
+                "prompt": prompt,
+                "edge_strength": edge_strength,
+                "vis_strength": vis_strength,
+                "sigma": sigma,
+            },
+            timeout=600,  # seconds to wait for the server's response
+        )
+    # Raise an exception on any 4xx/5xx status instead of saving an error body.
+    response.raise_for_status()
+
+    # The response body is the generated MP4.
+    with open(output_path, "wb") as f:
+        f.write(response.content)
+    print(f"Saved to {output_path}")
+
+
+# Example usage
+transfer_video(
+    input_path="soma_render.mp4",
+    output_path="photorealistic.mp4",
+    prompt="surveillance camera, urban street, daytime, overcast sky",
+)
+```
+---
+## License
+Apache 2.0
+> Cosmos-Transfer2.5 model weights are released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Weights are downloaded at runtime and are not bundled in this image.