Loam: earthy, naturalistic, messy, rich, regenerative

Over a period of several months, on a sequence of spot-rented 4090s, I have taken the abandoned husk of Stable Diffusion 2.1 and ... if not revived it, then certainly brought it into a different sort of life.
- First I replaced the text encoder with Apple's DFN5B CLIP ViT-H-14 retrain, which, in regular CLIP configuration, claims to boost ImageNet zero-shot accuracy from 78% to 83% compared to CLIP-ViT-H-14-laion2B-s32B-b79K (SD2's baseline).
- Once the model recovered from that - which happened surprisingly quickly! - I upgraded the v-pred objective to Flow Matching.
The result is a model that retains SD2.1's fascinating weirdness while significantly boosting its baseline aesthetics and its coherence at higher resolutions.
I hope you like it - please reach out if you do! ♥️ -Damian
Demo
Running in a customised HF space here -> loam_dfn5b demo.
Quickstart
Works almost OOTB with 🧨diffusers `StableDiffusionPipeline`; just a tiny monkey patch is required on the scheduler:

```python
from diffusers import StableDiffusionPipeline, FlowMatchEulerDiscreteScheduler

scheduler = FlowMatchEulerDiscreteScheduler(shift=1)  # see the notes below for shift values

# monkey patch the scheduler to work with the SD pipeline
scheduler.init_noise_sigma = 1  # flow matching latents start at sigma=1
scheduler.scale_model_input = lambda x, t: x  # no input scaling needed
scheduler.add_noise = lambda l, n, t: scheduler.scale_noise(l, t, n)  # diffusers consistency FTW

pipe = StableDiffusionPipeline.from_pretrained('damian0815/loam_dfn5b')
pipe.to('cuda')
pipe.scheduler = scheduler

image = pipe(
    prompt="the sea is strange; the sea is wild. it draws breath with care and swallows no hope. miserable, depressing, lost sailor's nightmare",
    negative_prompt="humanity, calm, summer",
    width=1096, height=1536, num_inference_steps=40,
    guidance_scale=7.2, guidance_rescale=0.7,
).images[0]

from uuid import uuid4
image.save(f'{uuid4()}.jpg', quality=95)
```
- Prompts should be full English sentences, at least 10 tokens long, so as not to starve the cross-attention heads.
- Treat the negative prompt with care. It should be something, but keep it short.
- You can try different `shift` values for the scheduler - stay below 1.5 for resolutions up to 1024x1024, but higher values might help at higher resolutions.
- Note that `time_shift_type='exponential'` is ignored unless `use_dynamic_shifting=True`, and dynamic shifting requires more elaborate monkey-patching to inject `mu` when the pipeline calls `set_timesteps`. See the demo Space source for an example, or the sketch just after this list.
- I haven't found an upper bound to the resolution. 1800x1800 works just fine if you have the VRAM.
- `FlowMatchHeunDiscreteScheduler` also works (same monkey patch necessary).
- Suggested CFG `guidance_scale` range: 3-10.
- Try `guidance_rescale` = 0 or 0.7; 0.7 can improve sharpness but may harm coherence.
- 16 inference steps is often sufficient, but sometimes you need 50.
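If you want to try dynamic shifting without digging through the Space source, here's a minimal sketch of the extra monkey-patching. The `compute_mu` helper is a hypothetical stand-in modelled on the ramp that diffusers' SD3/Flux pipelines use; its constants are illustrative guesses, not values tuned for this model.

```python
from diffusers import FlowMatchEulerDiscreteScheduler

scheduler = FlowMatchEulerDiscreteScheduler(
    use_dynamic_shifting=True, time_shift_type='exponential')
# the Quickstart monkey patches still apply
scheduler.init_noise_sigma = 1
scheduler.scale_model_input = lambda x, t: x
scheduler.add_noise = lambda l, n, t: scheduler.scale_noise(l, t, n)

def compute_mu(width, height, base_seq_len=256, max_seq_len=4096,
               base_shift=0.5, max_shift=1.15):
    # hypothetical helper: ramp mu linearly with latent sequence length,
    # mirroring the calculate_shift helper in diffusers' SD3/Flux pipelines;
    # the constants are guesses, not values tuned for this model
    seq_len = (width // 8) * (height // 8)
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    return seq_len * m + base_shift - base_seq_len * m

mu = compute_mu(1096, 1536)

# wrap set_timesteps so that mu gets injected when the pipeline calls it
orig_set_timesteps = scheduler.set_timesteps
def set_timesteps_with_mu(num_inference_steps=None, device=None, **kwargs):
    kwargs.setdefault('mu', mu)
    return orig_set_timesteps(num_inference_steps, device=device, **kwargs)
scheduler.set_timesteps = set_timesteps_with_mu
```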
If you need prompt weighting or long-prompt support, you can use my prompt weighting library compel.
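For example, a minimal sketch of standard compel usage with the pipeline from the Quickstart (the prompts here are just placeholders):

```python
from compel import Compel

# build conditioning tensors with compel's weighting syntax (++ upweights, -- downweights)
compel = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)
prompt_embeds = compel("a wild grey sea++ under a bruised, windswept-- sky")
negative_embeds = compel("calm, summer")

# pad both tensors to the same length before handing them to the pipeline
[prompt_embeds, negative_embeds] = compel.pad_conditioning_tensors_to_same_length(
    [prompt_embeds, negative_embeds])

image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds,
             width=1096, height=1536, num_inference_steps=40,
             guidance_scale=7.2, guidance_rescale=0.7).images[0]
```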
Metrics
- forthcoming (does anybody want to donate the compute to run MS-COCO FID-30K?)
Dataset
30k high-res images randomly sampled from LAION-Aesthetic with the original noisy and sometimes hopelessly bad alt-tag captions. Sample caption:
St Peter's Square (Manchester) by Trevor Lingard, Local | Manchester | Transport.
20k images randomly sampled from AllenAI's fun and fascinating pixmo-cap (great caption quality, generated by a unique process - check it out). Sample caption:
This is an image of a woman holding a box of doughnuts out on the street. You see it looks like a I'd say Western maybe California looking state. I think you can make out some palm trees in the back, potentially a little bit of water in the distance. You see these like very palmy looking trees near the buildings. She's holding a cardboard box. You see just her left finger, her left thumb with the aqua nail polish on the thumb and there's four doughnuts in it. Upper left just looks like a regular glaze with some extra white frosting on it and it's like a zigzag pattern. The two on the right, there's two on the left and two on the right, the two on the right look like Boston cream doughnuts. So they all have holes in them. It has the chocolate coating and then you can see the tan doughnut underneath it so it looks like a chocolate covered doughnut or a Boston cream doughnut. I guess Boston cream doesn't have a hole in it so it just looks like a chocolate frosted doughnut. And then the bottom left doughnut has it looks like the red star on it that looks like the Satan sign, the sign of Satan. It's a red star and it has like what it looks like crumbled cake on top and a white glaze that was strung all along it and you could and they're all on top of a piece of parchment paper inside the box.
(At runtime a random ~77-token subsection is sampled; there's a sketch of this at the end of this section.)
20k images randomly sampled from Pexels-400k, augmented with machine-generated cinematography phrases. Sample caption:
Close-Up Photo of Wasp On Flower, centered composition, ambient light, saturated cool high contrast color, telephoto lens
A private collection of 20k film stills, each captioned 10x by telling gemma3-27b to write like a film school undergrad, while hinting it about 50% of the time with machine-generated cinematography tags. Sample caption:
Science fiction film still. Nebula’s expression suggests repressed trauma and a yearning for connection. The spaceship’s claustrophobic interior symbolizes her emotional confinement. Her metallic body represents a fractured self, a desperate attempt to control her inner demons. The overall atmosphere is one of melancholic intensity.
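As an aside, here's a minimal sketch of the ~77-token subsection sampling mentioned above. This is a hypothetical helper illustrating the idea, not the actual training code:

```python
import random

def sample_caption_window(token_ids: list[int], max_len: int = 77) -> list[int]:
    # pick a random contiguous window of up to max_len tokens, so very long
    # captions (like pixmo-cap's) still fit CLIP's 77-token context window
    if len(token_ids) <= max_len:
        return token_ids
    start = random.randint(0, len(token_ids) - max_len)
    return token_ids[start:start + max_len]
```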
Tools
- A set of forthcoming cinematography classifiers for precise lighting, framing, colour, lens, and shot type tagging that I can't name just yet.
- My fork of Victor Hall's excellent and well-documented EveryDream2trainer, the only open source trainer that's focussed on reproducing the principles of full scale pretraining at hobby scale. My fork adds a number of features to the upstream codebase including:
- Flow Matching objective support (obviously).
- Multi-resolution training, building on the single-res aspect bucketing that Victor pioneered in EveryDream and SD3 later adopted. The multi-res adaptation lets us train simultaneously on images at 384, 512, 768, 1024, whatever res, at arbitrary aspect ratios.
- Flexible, memory-adaptive forward & backward slicing to squeeze the absolute maximum performance out of limited VRAM. For example, in a single grad accum chunk of 128 samples on a 24GB GPU: for the chunk's 48 samples at res 384 we push all of them through the model in a single forward pass, compute a contrastive loss, and then backprop; for the res 1024 samples in the same chunk we subdivide the forward pass into slices of 2 and run loss/backward on sets of 4; the optimizer is stepped every 128th sample regardless of resolution, forward, backward, or loss slice sizes. On OOM, slice sizes are automatically reduced and reattempted. NaNs are detected on the fly and surgically rejected. (A simplified sketch of the accumulation pattern follows after this list.)
- Plus a mountain of custom filtering, tagging, grouping, sampling, and captioning code in a messy jumble of jupyter notebooks and python libs that will probably never see the light of day.
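To make the slicing idea concrete, here is a heavily simplified sketch of per-resolution gradient accumulation. The names and slice sizes are illustrative, and the OOM-retry and NaN-rejection machinery is omitted entirely:

```python
# illustrative slice sizes: higher resolutions get smaller forward slices
FORWARD_SLICE_SIZES = {384: 48, 512: 24, 768: 8, 1024: 2}

def accumulate_gradients(model, loss_fn, samples_by_res, total_samples=128):
    # one grad accum chunk: mixed resolutions, one optimizer step afterwards
    for res, samples in samples_by_res.items():
        step = FORWARD_SLICE_SIZES[res]
        for i in range(0, len(samples), step):
            chunk = samples[i:i + step]  # a small slice that fits in VRAM
            loss = loss_fn(model(chunk), chunk)
            # scale so the accumulated gradient matches one big batch;
            # the caller steps the optimizer once per total_samples samples
            (loss * len(chunk) / total_samples).backward()
```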
Training regime
Honestly, idk - this is just a hobby project that I've been hacking on in my spare time.
- The core forward-loss-backprop loop has probably been executed on no fewer than 10 million (non-unique) image/caption pairs, at various grad-accumulated batch sizes from 1 through 4096 (median around 64), typically on 4-5 resolutions simultaneously (384/512/640/768 with occasional 896/1024). Maybe 100 epochs if considered as a single training run - but it wasn't that; it was an iterative process of doing a run, subjectively evaluating the resulting model, switching up the training code and captioning strategies, and then resuming.
- Loss objective is always MSE or a timestep-dependent MSE/Huber hybrid, with some exotic contrastive strategies introduced in the last 5 million samples and timestep shifting in the last 500k. Pro tip: LLMs don't know shit about timestep shifting. I have been given much bad and contradictory advice, which I've learnt to catch by having Claude and Gemini argue with each other.
- I deliberately did not set out with rigour in mind. It's been a progressive, heterogeneous process. I rent cheap 4090s (typically <20c/hr) and just mess about, implementing different things and trying them out on the latest state of the model. Sorry! If you want rigour I would be very happy to help you out with your model on a contract basis.
Why?
Yes.
No, really, why?
I enjoy stupid challenges.
SD2 is the weirdest thing that any big player in this space has produced. It is unloved and although it has some cult fans, it's generally been forgotten. This is a shame because if you're not actively interested in the flawless commercial aesthetic that Flux does so well, and you don't need sharpshooter prompt-following, it can do some pretty unique things.
SD2 occupies an interesting spot compute-wise. Its text encoder, CLIP-H, has ~350M parameters, >2x as many as SD1's CLIP-L (~120M). Those extra 230M parameters mean it can respond to language with markedly higher precision than SD1 - once you've found a prompt that works, which admittedly can be a little frustrating. Meanwhile its unet, at ~860M parameters, is the same size as SD1's, so the VRAM requirements and training performance are almost the same as SD1. A well-optimised trainer doing a unet-only finetune on a single 4090 can rip through 8-12 image+caption pairs per second at 768x768 - try doing that with SDXL or Flux.
- Some more thoughts on this: the larger text encoder is a double-edged sword. To prompt it you can't get away with the soupy meaning sludge typical of SD1 or even many SDXL prompts - you actually have to form coherent sentences. It can be frustrating at first to find a phrasing of your intention that produces good images, but if you're willing to approach the prompting process as a collaboration rather than a command & control relationship, you'll find it responds to feather-light language tweaks in a uniquely enjoyable way.
- Due to its fully open training data it has notably coherent responses to keywords associated with the kind of art movements that public institutions (eg museums) have budgets to put online with high quality alt tags; it's also good at the sort of high-res images that amateur photographers are inclined to post online for free. For the parameter count it's incredible at aping art historical movements and nature photography, although it has only marginal response to the kind of internet-first """artstyle""" keywords that OpenAI's CLIP-L is good at.
- Of course 350M parameters is still much smaller than the current standard T5-XXL (as seen in Flux and Wan), which at 4.7B is 13x bigger than CLIP-H. Nobody's claiming that CLIP-H is going to get you anything near T5-XXL's precision, but if you can still generate interesting images with it, I say that needn't matter.
The shift in objective from v-pred to flow matching isn't very large, and in theory (and indeed in practice) it shouldn't take much training to do.
A v-pred model predicts the velocity (change) needed to adjust the latents at timestep `t` in the denoising process to what they should be at timestep `t-1` (note the sign is reversed):

`velocity_t = α_t * noise - σ_t * clean_latents`

Important to note here is the presence of the timestep `t` on the right-hand side - this means that the target velocity (the one that the v-pred objective needs the unet to predict) is different for each timestep. A flow matching model, on the other hand, also predicts the velocity from noise to clean image, but the objective is formulated such that the timestep doesn't matter: the target (again with the sign reversed) is always just

`velocity = noise - clean_latents`
and that's the same at every timestep. (To be specific: this is true only for "rectified" or "straight-line" flow matching; so-called "general" flow matching is more complicated.)
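For concreteness, here's a minimal sketch of what a rectified flow matching training step looks like under this objective. The names and the timestep scaling are assumptions for illustration; this is not the actual EveryDream2trainer code:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(unet, clean_latents, prompt_embeds):
    noise = torch.randn_like(clean_latents)
    t = torch.rand(clean_latents.shape[0], device=clean_latents.device)  # t in [0, 1]
    sigma = t.view(-1, 1, 1, 1)
    noisy = (1 - sigma) * clean_latents + sigma * noise  # straight-line path
    target = noise - clean_latents  # the same target at every timestep
    # assumed: map t onto the unet's 0-1000 timestep conditioning range
    pred = unet(noisy, t * 1000, encoder_hidden_states=prompt_embeds).sample
    return F.mse_loss(pred, target)
```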
- So when we adapt SD2 to flow matching we are actually reducing the complexity of the task. In contrast to v-pred, where the unet must learn to predict a different velocity at each timestep, for flow-matching the unet only needs to predict one velocity for a given noise+prompt pair, which stays the same for all timesteps.
- Intuitively, this means that a flow matching objective ought to result in significantly more efficient usage of those 860M unet parameters than the somewhat heterogeneous v-pred objective (not to mention the wildly heterogeneous epsilon objective that SD1 and, for some strange reason, SDXL were built upon).
(I'm afraid I can't back any of the above up with theory - my ability to hack far outstrips my ability to math.)
More images
There's just something about the combination of really, really f*n high-fidelity texture with very clearly sub-par "semantic" knowledge that makes the images this thing outputs - to me at least - really compelling.
Loam: earthy, naturalistic, messy, rich, regenerative
Hope you enjoy it - if you do, please reach out, I'd love to hear from you.