Any idea what the RAM requirements are for offloading the 14B model, if 24 GB of VRAM is available?
sudo dmesg -T| grep -E -i -B100 'killed process'
[Mon Aug 11 23:23:41 2025] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-6b1f72d4-f529-4ffd-87ae-c99bd0d3f3ce.scope,task=python,pid=26749,uid=1000
[Mon Aug 11 23:23:41 2025] Out of memory: Killed process 26749 (python) total-vm:279241540kB, anon-rss:6808568kB, file-rss:65620kB, shmem-rss:52447240kB, UID:1000 pgtables:409380kB oom_score_adj:200
Probably more than 128 GB, since it fails for me too. (Update below.)
For context, I've got one GPU with 32 GB VRAM, running diffusers with torch 2.8.0+cu128, CUDA runtime 12.8, and cupy 13.6.0, in a WSL2 environment on a Windows machine with 128 GB RAM.
While it loads the shards, my system RAM fills up to about 75 GB before it crashes. I'm getting this error with the sample script taken directly from the model card:
...lib/python3.10/site-packages/dfloat11/dfloat11.py", line 260, in load_and_replace_tensors
module.offloaded_tensors[parts[-1]] = tensor_value.pin_memory() if pin_memory else tensor_value
torch.AcceleratorError: CUDA error: out of memory
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
When I add pin_memory=False to the DFloat11Model.from_pretrained() calls, the error changes to:
...lib/python3.10/site-packages/diffusers/pipelines/wan/pipeline_wan_i2v.py", line 226, in _get_t5_prompt_embeds
prompt_embeds = self.text_encoder(text_input_ids.to(device), mask.to(device)).last_hidden_state
RuntimeError: CUDA driver error: unknown error
I'm also getting a lot of these in dmesg:
[264274.151244] misc dxg: dxgk: dxgkio_make_resident: Ioctl failed: -12
[265007.551878] misc dxg: dxgk: dxgkio_reserve_gpu_va: Ioctl failed: -75
[265508.334985] mini_init (354): drop_caches: 1
[275259.334341] mini_init (354): drop_caches: 1
[276905.219715] hv_vmbus: Failed to establish GPADL: err = 0xc0000044
[276905.220868] misc dxg: dxgk: create_existing_sysmem: establish_gpadl failed: -122
[276905.221375] misc dxg: dxgk: dxgkio_create_allocation: Ioctl failed: -12
I'm wondering whether WSL2 is the problem, or whether it just thinks there isn't enough RAM left for the next step. I've had no trouble (and great results) with other DFloat11 models, including Qwen-Image-Edit-2509-DF11, in this same environment.
Update:
It ran fine after I updated the NVIDIA driver on the Windows host machine and set pin_memory=False. It never used more than 75 GB of system RAM, and VRAM peaked at around 22 GB.
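For anyone hitting the same thing, here's a minimal sketch of the working setup. The exact repo IDs, pipeline class, and DFloat11Model keyword arguments are assumptions based on typical DF11 model cards, not copied from my script, so check them against the model card you're using:

```python
import torch
from diffusers import WanImageToVideoPipeline  # pipeline class assumed from the Wan I2V traceback above
from dfloat11 import DFloat11Model

# Load the base pipeline in bfloat16 (hypothetical base repo ID).
pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers",
    torch_dtype=torch.bfloat16,
)

# Swap in the DF11-compressed transformer weights, keeping them in
# ordinary (unpinned) CPU memory. pin_memory=False is the key change:
# the pin_memory() call is what threw the CUDA OOM under WSL2.
DFloat11Model.from_pretrained(
    "DFloat11/Wan2.2-I2V-A14B-DF11",  # hypothetical DF11 repo ID
    device="cpu",
    cpu_offload=True,
    pin_memory=False,
    bfloat16_model=pipe.transformer,
)

pipe.enable_model_cpu_offload()
```

The trade-off is that unpinned host memory makes CPU-to-GPU transfers somewhat slower, but under WSL2 that seems preferable to the driver refusing to pin tens of GB at load time.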