# LingBot-World Base Cam NF4 Quantized Server

Docker-ready inference server for the LingBot-World video generation model with pre-quantized NF4 weights.
## Features
- 4-bit NF4 quantization via bitsandbytes - fits in 32GB VRAM
- Pre-quantized weights - no runtime quantization overhead
- Docker image with HTTP API - deploy on any machine with an NVIDIA GPU
- Optional cloud upload - upload finished videos to Cloudflare R2, or download directly via HTTP
## Model Contents

| File | Size | Description |
|---|---|---|
| `high_noise_model_bnb_nf4/model.safetensors` | ~9.6 GB | NF4-quantized diffusion model (high noise) |
| `low_noise_model_bnb_nf4/model.safetensors` | ~9.6 GB | NF4-quantized diffusion model (low noise) |
| `models_t5_umt5-xxl-enc-bf16.pth` | ~10.6 GB | T5-XXL text encoder (bfloat16) |
| `Wan2.1_VAE.pth` | ~485 MB | VAE encoder/decoder |

Total: ~30 GB (vs. ~85 GB for the full-precision models)
## Requirements

- Python 3.10+
- CUDA 11.8+ (tested with CUDA 12.x)
- ~32 GB VRAM (RTX 5090, A100, etc.)
- ≥64 GB system RAM
## Quick Start

### Without Docker

To run this package without Docker, see the LingBot-World pre-quantized page here.

### RunPod Template

https://console.runpod.io/deploy?template=j6rpw8zhj2&ref=szjabwfp
## Docker Deployment

The included Dockerfile builds a self-contained image (~35 GB) with all weights baked in. Once built, it runs on any machine with an NVIDIA GPU.

### Build

```bash
docker build --platform linux/amd64 -t lingbot-nf4 .
```

### Run

```bash
# Basic — videos saved to /app/outputs/ inside the container
docker run --gpus all -p 8080:8080 lingbot-nf4

# With a local directory mounted for output
docker run --gpus all -p 8080:8080 -v ./outputs:/app/outputs lingbot-nf4

# With Cloudflare R2 upload (optional)
docker run --gpus all -p 8080:8080 \
  -e R2_ACCOUNT_ID=your_account_id \
  -e R2_ACCESS_KEY=your_access_key \
  -e R2_SECRET_KEY=your_secret_key \
  -e R2_BUCKET=your_bucket_name \
  lingbot-nf4
```

The server listens on port 8080 once the model has loaded.
## API

The server exposes an async job queue to handle long-running generation.

### Using the client script

```bash
python caller.py \
  --url http://localhost:8080 \
  --image photo.jpg \
  --prompt "A cinematic shot of the scene" \
  --frame_num 81 \
  --output output.mp4
```
### Without the client script

```http
POST /generate
Content-Type: application/json

{
  "image": "<base64-encoded JPEG/PNG>",
  "prompt": "A cinematic video of the scene",
  "frame_num": 81,
  "size": "480*832",
  "seed": -1,
  "guide_scale": 5.0,
  "sampling_steps": 40,
  "action_poses": "<base64-encoded poses.npy (optional)>",
  "action_intrinsics": "<base64-encoded intrinsics.npy (optional)>"
}
```

Returns immediately with HTTP 202:

```json
{"id": "job-uuid", "status": "IN_PROGRESS"}
```
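As a sketch, the request above can be assembled and submitted from Python using only the standard library. The helper names here (`build_generate_payload`, `submit_job`) are illustrative, not part of this repo:

```python
import base64
import json
import urllib.request

def build_generate_payload(image_bytes, prompt, frame_num=81, size="480*832",
                           seed=-1, guide_scale=5.0, sampling_steps=40):
    """Assemble the JSON body for POST /generate (optional action_* fields omitted)."""
    return {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "prompt": prompt,
        "frame_num": frame_num,
        "size": size,
        "seed": seed,
        "guide_scale": guide_scale,
        "sampling_steps": sampling_steps,
    }

def submit_job(base_url, payload):
    """POST to /generate; returns the parsed 202 body ({"id": ..., "status": ...})."""
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```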
### Poll for result

```http
GET /status/<job-id>
```

Returns:

```json
{
  "id": "job-uuid",
  "status": "COMPLETED",
  "output": {
    "video_url": "https://...",
    "seed": 42,
    "duration_sec": 185.3,
    "frame_num": 81,
    "size": "480*832"
  }
}
```

If R2 is not configured, `video_path` is returned instead of `video_url`. In that case the client script (`caller.py`) automatically downloads the video via the `/download` endpoint.
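A minimal polling loop over `/status` could look like the sketch below (`caller.py` already does this for you; the `fetch` hook is an assumption added here for testability, not a server feature):

```python
import json
import time
import urllib.request

def poll_status(base_url, job_id, fetch=None, interval=5.0, timeout=1800.0):
    """Poll GET /status/<job-id> until the job leaves IN_PROGRESS, then return the body."""
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch(f"{base_url}/status/{job_id}")
        if status.get("status") != "IN_PROGRESS":
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {timeout}s")
```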
### Download video

When R2 is not configured, completed videos can be downloaded directly:

```http
GET /download/<job-id>
```

Returns the MP4 file as a download.
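Streaming the file to disk is a one-liner with the standard library; a hedged sketch (the `opener` parameter is an assumption added for testability):

```python
import shutil
import urllib.request

def download_video(base_url, job_id, dest_path, opener=urllib.request.urlopen):
    """Stream GET /download/<job-id> into dest_path and return the path."""
    with opener(f"{base_url}/download/{job_id}") as resp, open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest_path
```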
### Health check

```http
GET /health
```
## Cloud Deployment (e.g. RunPod)

- Push the image to a container registry (Docker Hub, etc.)
- Create a GPU pod/instance with the image
- Expose port 8080
- Use `caller.py` with the pod's public URL
## Quantization Details

The diffusion models are quantized with bitsandbytes NF4 using double quantization:

```json
{
  "format": "bnb_nf4",
  "double_quant": true,
  "compute_dtype": "bfloat16",
  "blocksize": 64
}
```

This achieves ~3.9x compression while maintaining generation quality.
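The ~3.9x figure is consistent with a back-of-envelope estimate, assuming a bf16 baseline and bitsandbytes' usual double-quant layout (an 8-bit absmax per 64-weight block plus one fp32 constant per 256 blocks; these overhead numbers are general bitsandbytes conventions, not read from this repo):

```python
# Back-of-envelope bits-per-weight for NF4 with double quantization.
BASELINE_BITS = 16        # bf16 weight
NF4_BITS = 4              # 4-bit NF4 code per weight
BLOCKSIZE = 64            # weights per quantization block
ABSMAX_BITS = 8           # double-quantized absmax, one per block
NESTED_BLOCKSIZE = 256    # blocks per second-level fp32 constant

bits_per_weight = (NF4_BITS
                   + ABSMAX_BITS / BLOCKSIZE
                   + 32 / (BLOCKSIZE * NESTED_BLOCKSIZE))
compression = BASELINE_BITS / bits_per_weight
print(f"{bits_per_weight:.3f} bits/weight -> {compression:.2f}x vs bf16")
```

This lands at roughly 4.13 bits per weight, i.e. about 3.9x smaller than bf16, matching the observed on-disk sizes above.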
## License

This model is based on LingBot-World and follows its license terms.