Flux 2 Dev on a single RTX A6000 48GB GPU

Community Article Published January 21, 2026

First Draft, Under Revision

UPDATE! I've been running this pipeline almost non-stop since this article was published, with only minor changes to the LoRA(s) used and the source of the prompts (a story for another article about synthetic data and self-eval!). I can report that this pipeline has exceeded my expectations in terms of speed, quality, and stability.

We've set the resolution nearly as low as Diffusers will allow; given that we're aiming for 5:7 illustrations, this gives us pixel dimensions that are both in the low triple digits. This results in much faster generation times than we expected, and with two RTX A6000 cards running in parallel while sustaining under 40°C, it's a recipe for a veritable art factory. There are missing fingers, added fingers, fingers where they should definitely not be... you get the idea. But the majority are baseline good, and a significant portion are usable as-is. In other words, 20% of the time, it one-shots the art for a card!
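The dimension math is easy to sanity-check. A minimal sketch, assuming the snap-to-a-multiple-of-16 rule that the Flux2 script later in this article uses:

```python
def snap(v: int, multiple: int = 16, floor: int = 256) -> int:
    """Snap a dimension down to a multiple of 16, with a 256px floor."""
    return max(floor, (v // multiple) * multiple)

# A 5:7-ish portrait target with both sides in the low triple digits.
width, height = snap(376), snap(528)   # 376 snaps down to 368; 528 is already a multiple of 16
megapixels = width * height / 1e6      # well under 0.2 MP per image
```

At under a fifth of a megapixel per image, it's no surprise the generation times beat expectations.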

The following notes explain why we bothered to do all of this, with a brief digression (three paragraphs) into the game design process of "how to make 100,000 cards, make them all playable, and illustrate them all automatically". In other words, this is a key component of our "one click game" pipeline, which I will briefly explain below.

Key Notes:

And indeed that is what we have proven, to the quantity of 10,000+ illustrations for our game! The prompts were generated via a suite of LLMs, including frontier models in the cloud and local models on GTX 1660 Supers (see list), a 6GB low-end card that's still CUDA 12.x supported. Our card-generation algorithm created a TON of cards, but after teaching it how NOT to spam a million cards that nobody would ever play, we ended up with about 10,000 new, useful, balanced game cards.

For comparison, the longest-running card games top out around 20,000 unique cards. Extrapolating from this 10,000-card generation, which saturated the design space of "possible cards that can exist within the criteria", the implied card pool is around 40,000 cards at minimum: our generation targeted ~25% of the card pool, segregated by card type, and the remaining 75% splits into three roughly equal 25% chunks. With 40,000 cards as the floor of the card-pool size, we're realistically looking at a game with nearly 100,000 unique cards within the first year or two of development.
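The pool-size extrapolation above is simple arithmetic; a quick sketch, using the numbers from the paragraph:

```python
generated_cards = 10_000   # cards we actually generated and kept
targeted_share = 0.25      # our generation covered ~25% of the card pool
pool_floor = generated_cards / targeted_share  # implied minimum pool size
```

With a 40,000-card floor and the design space still growing, the ~100,000-card projection is an extrapolation, not a measurement.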

So in summary, a strong workstation GPU is sufficient to generate huge quantities of artwork for your game or creative project. You don't need a half-million-dollar Blackwell to say "we have midjourney at home". The key to this is the A6000's 48GB of VRAM, an absolute outlier in today's GPU market. We use about 85% of that VRAM, and we'd likely use more if we spent more time configuring the various Diffusers steps.

< WARNING! POTENTIAL SLOP BEYOND THIS POINT! >

<< This article was written by the LLM that helped us implement the stack that's running right now. No human edits have been made to this article. Therefore, you should treat this as an experiment and proceed with the understanding that *the rest of this essay could be, in theory, utter bollocks*. That said, the stack works and the smart robot wrote some smart words. Expect a human-written version of this adventure once we have concrete results, not just workflows. >>

How we got Flux2-dev 4‑bit running on a dual‑A6000 box without losing our minds

Running a big diffusion stack like FLUX2 on a Windows workstation sounds simple: “just load the model and hit go.” In practice, it took a couple of evenings of “why is my GPU on fire?” and “why is only one GPU on fire?” We started with a pretty straightforward goal: we already had a prompt pipeline that turns gameplay data (hybrid units in our card game) into clean, canonical art prompts, and we wanted to compare prompt sources (small local LLM vs. bigger remote models) and push them all through the same Flux2 image stack. We have two RTX A6000s and didn’t want one of them sitting there sipping 15W while the other one died doing all the work.

The naive attempts (a.k.a. “just use both GPUs, right?”)

First pass was the classic ML developer move:

- Load diffusers/FLUX.2-dev-bnb-4bit.
- Tell Diffusers/Accelerate to use device_map="balanced" across both GPUs.
- Hope it shards the 4‑bit model pieces automatically.

On paper this is great: 4‑bit text encoder + 4‑bit DiT, VAE in bf16, and 2×48GB of VRAM to play with. In reality, we slammed into a wall of:

- bitsandbytes meta-tensor weirdness: NotImplementedError: Cannot copy out of meta tensor; no data! This is what happens when model pieces are still on the “meta” device (no backing storage) and something tries to .to(cuda) them without fully loading the weights.
- CUDA OOM while trying to bring up a second pipeline: even with 4‑bit weights, trying to stand up two full Flux2 pipelines in one process, or to shard a single pipeline, led to fragmentation and 20MB “last straw” allocations that failed.

We also learned the hard way that some of the model-loading plumbing will happily allocate things on GPU 0 even if you intend to use cuda:1, unless you’ve set the current CUDA device very deliberately.

Lessons from the failed designs

We tried three main patterns:

1. One process, one pipeline, “balanced” over both GPUs → ran into bitsandbytes + meta-tensor mismatches and OOMs. Looked nice on paper, didn’t survive contact with reality.
2. One process, two pipelines (one per GPU) → getting both fully resident without OOM was unreliable, and sharing JSONL writes between them got messy. When it failed, it did so after a long load, which was extra painful.
3. Two totally separate scripts launched by hand → technically works, but brittle. Easy to misconfigure, easy to forget which GPU a given window is bound to, and no coordination around DB ranges or logging.

The takeaway: FLUX.2-dev-bnb-4bit is heavy enough that you really only want one full pipeline per process and one GPU per pipeline. Fancy sharding with this quant stack is possible, but the happy path is narrow and library-version-dependent.
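The “one GPU per pipeline, one pipeline per process” takeaway can be sketched as a tiny launcher. This is a hedged sketch, not our actual PowerShell launchers: the worker script name and flags are hypothetical, and CUDA_VISIBLE_DEVICES pins each worker to one physical GPU so that inside the worker the device is always cuda:0.

```python
def build_worker(job_name: str, gpu_index: int) -> tuple[list[str], dict]:
    """Build the command and environment for one single-GPU worker.

    CUDA_VISIBLE_DEVICES restricts the process to one physical GPU,
    so no sharding or device_map magic is needed inside the worker.
    """
    cmd = [
        "python", "run_flux2_single_job.py",  # hypothetical worker script
        "--job-name", job_name,
        "--cuda-index", "0",                  # only one GPU is visible
    ]
    env = {"CUDA_VISIBLE_DEVICES": str(gpu_index)}
    return cmd, env

jobs = [("your_local_job", 0), ("your_cloud_job", 1)]
workers = [build_worker(name, gpu) for name, gpu in jobs]
# To actually launch: subprocess.Popen(cmd, env={**os.environ, **env})
```

One process per GPU also means a crash in one worker never takes down the other, which matters for overnight runs.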

What we settled on

We backed off to something boring and solid:

- Use Hugging Face’s diffusers/FLUX.2-dev-bnb-4bit as-is, running the entire stack (text encoder + DiT + VAE) on one GPU per process.
- Control device selection explicitly: add a --cuda-index flag and call torch.cuda.set_device(cuda_index) before from_pretrained, then create torch.device(f"cuda:{cuda_index}") and .to(device) the pipeline.
- Split workloads by job, not by layers: one worker handles all prompts from one LLM job on one GPU, and a second worker handles a different prompt job on the other GPU. No cross-GPU sharding, no shared state, no sneaky device_map="auto" magic.

Concretely, we have:

- run_hybrid_units_flux2_4bit_from_db.py: reads canonical prompts from Postgres for a single job (e.g. hybrid_prompts_smollm3_lan), loads FLUX.2-dev-bnb-4bit on the selected cuda: device, generates images at 376×528 (5:7 aspect) with 4‑bit weights and bf16 compute, and logs images + timings into JSONL.
- Two PowerShell launchers:
  - Run-HybridFlux2-Smollm-NoExit.ps1 → GPU 0, job hybrid_prompts_smollm3_lan.
  - Run-HybridFlux2Pairs-Devstral-NoExit.ps1 → GPU 1, job hybrid_prompts_devstral_2512.
- A tiny orchestrator, Run-HybridFlux2-SmollmAndDevstral.ps1, which pops open two terminals (one per launcher) and mirrors sane defaults for width/height/steps. Both terminals are no‑exit and pause on error, so you can always see what went wrong.

The end result: both A6000s sit near 100% utilization, each chewing through a different prompt source, and the system is stable enough to leave running overnight.
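The JSONL logging mentioned above can be as simple as one JSON object per generated image. A sketch with guessed field names (not our exact schema):

```python
import json
import time

def log_record(log_path: str, image_path: str, unit_id: str,
               job_name: str, seconds: float) -> str:
    """Append one generation record as a single JSON line."""
    record = {
        "unit_id": unit_id,
        "job_name": job_name,
        "image_path": image_path,
        "gen_seconds": round(seconds, 2),
        "logged_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    line = json.dumps(record)
    # Append-only: each worker writes to its own file, so there is
    # no cross-process contention over a shared log.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
    return line
```

Giving each worker its own JSONL file is what avoids the messy shared-writes problem from the failed two-pipelines-in-one-process design.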

How you can reproduce this (minus our specific card prompts)

If you want to recreate a similar “dual‑boxing Flux2” setup on your own box:

1. Grab the models from Hugging Face
   - Flux2 4‑bit stack: diffusers/FLUX.2-dev-bnb-4bit — https://huggingface.co/diffusers/FLUX.2-dev-bnb-4bit
   - Optional tiny VAE for experiments: fal/FLUX.2-Tiny-AutoEncoder — https://huggingface.co/fal/FLUX.2-Tiny-AutoEncoder

2. Set up a prompt source pipeline
   Use whatever you like:
   - A small local LLM (e.g. 3–7B) behind an OpenAI-compatible API.
   - A bigger cloud model (Qwen, Devstral, etc.) via OpenRouter or direct HF Inference.

   The important part is structure, not the specific model:
   - Turn your gameplay items (units, weapons, abilities) into structured JSON.
   - Have the LLM convert that into a canonical art brief: subject, pose, camera, mood, palette, background.
   - Store results as one row per item per job:
     - job_name (which LLM + settings you used)
     - unit_id / item id
     - prompt_canonical
     - optional prompt_log_path back to the raw LLM conversation.

3. Write a small “prompt loader + Flux” script
   Pseudocode for a single job:

```python
import torch
from diffusers import Flux2Pipeline

def load_prompts(job_name):
    ...

def snap(v):
    # Snap a dimension down to a multiple of 16, with a 256px floor.
    return max(256, (v // 16) * 16)

def main(job_name: str, cuda_index: int = 0):
    prompts = load_prompts(job_name)

    # Pin the current CUDA device *before* loading anything.
    torch.cuda.set_device(cuda_index)
    device = torch.device(f"cuda:{cuda_index}")

    pipe = Flux2Pipeline.from_pretrained(
        "diffusers/FLUX.2-dev-bnb-4bit",
        torch_dtype=torch.bfloat16,
    )
    pipe.to(device)

    width, height = snap(376), snap(528)
    for row in prompts:
        result = pipe(
            prompt=row.prompt_canonical,
            num_inference_steps=30,
            guidance_scale=4.0,
            height=height,
            width=width,
        )
        image = result.images[0]
        image.save(f"out/{row.unit_id}_{job_name}.png")
```

4. Launch one process per GPU

```shell
# GPU 0
python run_flux2_single_job.py \
  --job-name your_local_job \
  --cuda-index 0

# GPU 1
python run_flux2_single_job.py \
  --job-name your_cloud_job \
  --cuda-index 1
```

5. Compare outputs
   For each item id, you’ll have:
   - item_local.png (small LLM prompt)
   - item_cloud.png (big LLM prompt)

   This makes it very easy to:
   - See where the small model is already “good enough.”
   - Spot prompt pathologies (overly literal, too generic, weird camera choices).
   - Decide where you actually need to pay for a bigger model.

We deliberately didn’t talk about our exact Manacaster system prompts here, but the pattern is simple:
- Encode gameplay logic into JSON.
- Let an LLM turn that into an art brief.
- Run that brief through a consistent Flux2 stack.
- Use dual GPUs not by sharding layers, but by sharding jobs.

And most importantly: always call torch.cuda.set_device(...) before you ask bitsandbytes + Diffusers to do anything clever, or you’ll be back in “why is only one GPU on fire?” land again.
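To line up local vs. cloud outputs per item, a small pairing helper is enough. A sketch, assuming the unit-id-plus-job-name PNG naming used by the generation script:

```python
from pathlib import Path

def pair_outputs(out_dir: str, local_job: str, cloud_job: str) -> dict:
    """Group generated images by item id across two jobs.

    Assumes filenames of the form <unit_id>_<job_name>.png.
    """
    pairs: dict = {}
    for png in Path(out_dir).glob("*.png"):
        stem = png.stem
        for job in (local_job, cloud_job):
            suffix = f"_{job}"
            if stem.endswith(suffix):
                item_id = stem[: -len(suffix)]
                pairs.setdefault(item_id, {})[job] = str(png)
    return pairs
```

Items that appear under only one job key immediately flag prompts where one of the pipelines failed or was skipped.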
