Anthos
⚠️ IMPORTANT NOTICE
- It's tiny. 984K params. 120K steps on Oxford Flowers 102. 256×256 out.
- Petals are negotiable. It can draw a rose. Sometimes the rose has too many petals, or stamens in odd places, or a stem that becomes a vine. That's the territory at 0.98M params.
- Not Stable Diffusion. Class-conditional DiT-Nano/2. No text encoder, no safety filter, no upscaler. The 102 Oxford Flowers classes are the entire vocabulary.
Quick Stats
| Stat | Value |
|---|---|
| Parameters | 983,808 |
| Architecture | DiT-Nano/2 (6 blocks, hidden 96, 4 heads, patch 2, SwiGLU) |
| Training Steps | 120,000 |
| Training Time | ~18 min on an RTX Pro 6000 |
| Precision | bfloat16 |
| Output | 256 × 256 |
| Latents | 32 × 32 × 4 |
| Classes | 102 |
| Loss | 1.843 → 0.880 (flow-matching MSE) |
| Sampling | Heun, 50 steps, CFG 4.0 |
What Is This?
Anthos means flower in Greek. That's the whole naming story.
Trained as a rectified flow rather than a diffusion model, so the network predicts the velocity field between noise and data. The architecture is a regular DiT: adaLN-Zero, SwiGLU MLPs, sin-cos positional embeddings. SD-VAE handles the latent side; Heun or Euler integrates at sample time.
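To make the velocity target concrete, here is a minimal sketch of the flow-matching objective on a toy 4-element vector. It assumes the common rectified-flow convention where t=0 is pure noise and t=1 is data; the card doesn't say which direction Anthos uses, and the helper names are illustrative, not taken from modeling.py.

```python
import random

def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1). Assumed convention."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]

def velocity_target(noise, data):
    """Velocity of the straight path, d x_t / dt = data - noise (constant in t)."""
    return [d - n for n, d in zip(noise, data)]

def flow_matching_mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

rng = random.Random(0)
data = [0.5, -1.0, 2.0, 0.0]                  # stand-in for a flattened latent
noise = [rng.gauss(0.0, 1.0) for _ in data]   # Gaussian noise of the same shape
t = 0.3

x_t = interpolate(noise, data, t)
v = velocity_target(noise, data)

# A perfect network would output v at (x_t, t), giving zero loss.
loss_perfect = flow_matching_mse(v, v)

# Following the true velocity for the remaining (1 - t) lands exactly on the data.
x_end = [x + (1.0 - t) * vi for x, vi in zip(x_t, v)]
```

The real model sees batches of 32 × 32 × 4 latents plus a class label, but the target has the same form per element.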
Dataset: every image in Oxford Flowers 102 (8,189 across train+val+test), plus a horizontal-flip copy. Encoded once through the VAE and dumped into VRAM as BF16 channels_last. 130 MB on the card. GPULatentLoader shuffles on-GPU and yields batches straight out of VRAM, so a step is just a forward pass.
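The footprint follows from the shapes alone. The arithmetic below is a back-of-envelope check, and the batching function is a hypothetical CPU stand-in for the idea behind GPULatentLoader (shuffle an index list, slice fixed-size batches), not its actual code. Batch size 256 comes from the training table; dropping the last partial batch is an assumption.

```python
import random

N_IMAGES = 8_189
N_LATENTS = 2 * N_IMAGES            # identity + horizontal flip
LATENT_SHAPE = (4, 32, 32)          # channels, height, width
BYTES_PER_BF16 = 2
BATCH_SIZE = 256

elems_per_latent = LATENT_SHAPE[0] * LATENT_SHAPE[1] * LATENT_SHAPE[2]
total_bytes = N_LATENTS * elems_per_latent * BYTES_PER_BF16

def epoch_batches(n_items, batch_size, rng):
    """Yield shuffled index batches for one epoch, dropping the partial tail (assumed)."""
    order = list(range(n_items))
    rng.shuffle(order)
    for start in range(0, n_items - batch_size + 1, batch_size):
        yield order[start:start + batch_size]

batches = list(epoch_batches(N_LATENTS, BATCH_SIZE, random.Random(0)))

print(f"{total_bytes / 1e6:.1f} MB, {total_bytes / 2**20:.1f} MiB")
print(f"{len(batches)} full batches per epoch")
```

That works out to 134,168,576 bytes, which is the "130 MB" above to two significant figures, and 63 full batches per pass over the data.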
120K steps in 18 minutes on the RTX Pro 6000. Loss went 1.84 → 0.88. We saved a sample grid every 2K steps to watch the convergence.
Meant as a sanity check on the training loop. It turned out to be a flower generator.
Samples
4×4, class-conditional, step 120K, CFG 4.0, Heun 50. Each tile is a different Oxford Flowers class. Some are clearly the right flower. Some are a guess.
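For reference, this is roughly what "Heun 50, CFG 4.0" means. The sketch integrates a toy scalar velocity field, not the DiT, so the result can be checked against a closed-form answer. The function names, the null label index 102, and the linear toy field are all assumptions made for illustration.

```python
import math

def guided_velocity(v_fn, x, t, label, null_label, cfg_scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    v_uncond = v_fn(x, t, null_label)
    v_cond = v_fn(x, t, label)
    return v_uncond + cfg_scale * (v_cond - v_uncond)

def sample(v_fn, x0, label, null_label, steps=50, cfg_scale=4.0, sampler="heun"):
    """Integrate dx/dt = v from t=0 to t=1 with Euler or Heun (improved Euler)."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        k1 = guided_velocity(v_fn, x, t, label, null_label, cfg_scale)
        if sampler == "euler":
            x = x + dt * k1
        else:
            # Heun: average the slope at the start and at the Euler-predicted end.
            x_pred = x + dt * k1
            k2 = guided_velocity(v_fn, x_pred, t + dt, label, null_label, cfg_scale)
            x = x + 0.5 * dt * (k1 + k2)
    return x

# Toy stand-in for the network: pulls x toward a per-label center.
CENTERS = {0: 1.0, 102: 0.0}   # 102 plays the null class here (assumed index)

def toy_velocity(x, t, label):
    return CENTERS[label] - x

x0 = -0.7
guided_center = CENTERS[102] + 4.0 * (CENTERS[0] - CENTERS[102])
exact = guided_center + (x0 - guided_center) * math.exp(-1.0)

heun_err = abs(sample(toy_velocity, x0, 0, 102, sampler="heun") - exact)
euler_err = abs(sample(toy_velocity, x0, 0, 102, sampler="euler") - exact)
```

At the same step count Heun costs two guided evaluations per step (four forward passes with CFG) and is second-order accurate, which is why it lands closer to the exact solution than Euler here.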
Model Specifications
| Parameter | Value |
|---|---|
| Architecture | DiT |
| Variant | DiT-Nano/2 |
| Depth | 6 |
| Hidden Size | 96 |
| Heads | 4 (head dim 24) |
| Patch Size | 2 |
| Grid | 16 × 16 = 256 tokens |
| MLP | SwiGLU, ratio 2.0 |
| Norm | LayerNorm, no affine on block norms (adaLN) |
| Attention | QK-LayerNorm, SDPA |
| Conditioning | AdaLN-Zero, t + y |
| Class Dropout | 0.1 |
| Class Embed | 102 + 1 (null) |
| Pos Embed | 2D sin-cos, frozen |
| VAE | stabilityai/sd-vae-ft-ema, 8× downsample, 4 ch |
| VAE Scale | 0.18215 |
| Output Ch | 4 (no learned sigma) |
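The shape numbers in the table can be re-derived from four constants, and the frozen positional table is small enough to build in plain Python. This is a sketch under assumptions: the frequency base of 10000 and the sin-then-cos, row-then-column layout follow the usual DiT/MAE recipe and may not match modeling.py exactly.

```python
import math

IMAGE, VAE_DOWN, PATCH, HIDDEN, HEADS, LATENT_CH = 256, 8, 2, 96, 4, 4

latent_side = IMAGE // VAE_DOWN          # 32
grid_side = latent_side // PATCH         # 16
n_tokens = grid_side * grid_side         # 256
head_dim = HIDDEN // HEADS               # 24
patch_dim = PATCH * PATCH * LATENT_CH    # values packed into each token

def sincos_1d(dim, pos):
    """1D sin-cos embedding of a scalar position: dim/2 sines, then dim/2 cosines."""
    half = dim // 2
    omegas = [1.0 / (10000 ** (i / half)) for i in range(half)]
    return [math.sin(pos * w) for w in omegas] + [math.cos(pos * w) for w in omegas]

def sincos_2d(dim, side):
    """2D table: half the channels encode the row, the other half the column."""
    return [
        sincos_1d(dim // 2, row) + sincos_1d(dim // 2, col)
        for row in range(side)
        for col in range(side)
    ]

TABLE = sincos_2d(HIDDEN, grid_side)     # 256 rows of 96 floats
```

Because the table is frozen, it adds nothing to the trainable parameter count.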
Training Details
| Parameter | Value |
|---|---|
| Dataset | Oxford Flowers 102 (train+val+test, 8,189 imgs) |
| Aug | identity + hflip = 16,378 latents |
| Storage | all in VRAM, channels_last BF16 |
| Batch | 256 |
| Grad Accum | 1 |
| Optim | AdamW, β=(0.9, 0.95), wd=0, fused |
| LR | 1e-4, 1K-step linear warmup, then constant |
| Grad Clip | 1.0 |
| EMA | 0.9999 |
| t Sampler | logit-normal (μ=0, σ=1) |
| Loss | flow-matching MSE on velocity |
| CFG Dropout | 0.1 (10% labels → null token) |
| Precision | BF16 autocast, FP32 reductions |
| Compile | torch.compile(mode="max-autotune") |
| GPU | RTX Pro 6000, 96 GB, sm_120 |
| Step Rate | ~111 it/s |
| Wall | 1078s for 120K steps |
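Three rows of the table are one-liners in code: the LR schedule, the logit-normal timestep sampler, and the EMA update. The sketch below is stdlib-only and illustrative; the exact warmup form (`step / warmup`) is an assumption, and the function names are not from the training script.

```python
import math
import random

BASE_LR = 1e-4
WARMUP_STEPS = 1_000
EMA_DECAY = 0.9999

def lr_at(step):
    """Linear warmup to BASE_LR over WARMUP_STEPS, then constant (assumed form)."""
    return BASE_LR * min(1.0, step / WARMUP_STEPS)

def sample_t(rng, mu=0.0, sigma=1.0):
    """Logit-normal timestep: sigmoid of a Gaussian draw, so t lies in (0, 1)."""
    z = rng.gauss(mu, sigma)
    return 1.0 / (1.0 + math.exp(-z))

def ema_update(ema_value, new_value, decay=EMA_DECAY):
    """One EMA step on a single weight; the real thing runs over every tensor."""
    return decay * ema_value + (1.0 - decay) * new_value

rng = random.Random(0)
ts = [sample_t(rng) for _ in range(10_000)]
mean_t = sum(ts) / len(ts)

print(f"lr at step 500: {lr_at(500):.1e}, at step 120000: {lr_at(120_000):.1e}")
print(f"mean t over 10k draws: {mean_t:.3f}")
```

With μ=0 the sampler concentrates timesteps around t=0.5, and a decay of 0.9999 gives the EMA an averaging horizon on the order of 10K steps, short next to the 120K-step run.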
Benchmarks
Loss curve, sampled from the log: 1.843 → 1.71 (1K) → 1.31 (10K) → 1.04 (50K) → 0.91 (100K) → 0.880 (120K). Monotone down. Didn't bother with FID or IS.
Usage
```python
from pipeline import AnthosPipeline

pipe = AnthosPipeline(repo_dir=".")

# every class
imgs = pipe(classes="all", seed=0)
imgs[0].save("out.png")  # class 0 is pink primrose, not a rose, by the way

# specific names or ids, comma-separated
imgs = pipe("rose,sunflower,daffodil", n_per_class=2, seed=42)
for i, im in enumerate(imgs):
    im.save(f"flower_{i:02d}.png")

# fiddling
imgs = pipe(73, steps=100, cfg_scale=2.5, sampler="euler", seed=7)
```

CLI:

```
python pipeline.py "rose,sunflower,daffodil" --n-per-class 2 --seed 42 --out out.png
```
For the Gradio demo, see this.
Files
| File | What it is |
|---|---|
| model.safetensors | EMA weights, 3.95 MB |
| config.json | arch + sampling config |
| modeling.py | DiT + samplers |
| pipeline.py | AnthosPipeline |
| classes.txt | 102 names, tab-separated id and name |
| convert_checkpoint.py | final.pt → safetensors |
| sample_grid.png | 4×4 grid, step 120K |
| requirements.txt | deps |
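If you want to map names to ids without going through the pipeline, classes.txt is easy to read by hand. The sketch below assumes one tab-separated `id`, `name` pair per line and mimics the comma-separated spec the pipeline accepts. Both functions are hypothetical helpers, and the second sample line is a placeholder rather than a real class name.

```python
def parse_classes(text):
    """Parse 'id<TAB>name' lines into an {id: name} dict (assumed file layout)."""
    id_to_name = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        idx, name = line.split("\t", 1)
        id_to_name[int(idx)] = name.strip()
    return id_to_name

def resolve_classes(spec, id_to_name):
    """Turn 'name,name' or 'id,id' (or a bare int) into a list of class ids."""
    name_to_id = {name.lower(): idx for idx, name in id_to_name.items()}
    ids = []
    for token in str(spec).split(","):
        token = token.strip()
        if token.isdigit():
            idx = int(token)
            if idx not in id_to_name:
                raise KeyError(f"unknown class id: {idx}")
        else:
            idx = name_to_id[token.lower()]   # KeyError on an unknown name
        ids.append(idx)
    return ids

# Two-line sample; the real file has 102 entries.
SAMPLE = "0\tpink primrose\n1\tplaceholder flower\n"
table = parse_classes(SAMPLE)
picked = resolve_classes("Pink Primrose, 1", table)
```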
Limitations
- 102 classes, hard-coded. There's no prompt, so no "sunset over a meadow."
- Output is 256×256. Bigger needs an upscaler.
- 0.98M params, which is enough for a rose but absolutely not enough for Stable Diffusion.
- A few classes stayed rough through training. "Barberton daisy" and "mexican petunia" in particular. Oxford Flowers 102 is class-imbalanced and we didn't rebalance it.
- No FID/IS numbers. We looked at the samples.
- Don't use it in a textbook or in production.
Citation
```bibtex
@misc{anthos2026,
  author    = {Glint Research},
  title     = {Anthos: a 984K-parameter class-conditional DiT on Oxford Flowers 102},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Glint-Research/Anthos}
}
```
Built by Glint Research.
