rootxhacker committed on
Commit 6750d81 · verified · 1 Parent(s): 6b8a9fa

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +5 -6
README.md CHANGED
@@ -17,8 +17,8 @@ is conditioned on **CLIP-L** text features, with classifier-free guidance.
 
  ## Intended use
 
- Text-to-image generation at 1024×1024. Good at single objects and cinematic scenes; usable single-person
- portraits. A sibling 512px checkpoint additionally does instruction-based image editing.
+ Text-to-image generation at 1024×1024. Strongest on single objects and cinematic scenes. A sibling 512px
+ checkpoint additionally does instruction-based image editing.
 
  ## How it works
 
@@ -51,10 +51,9 @@ rather than FID / GenEval (which we did not compute):
  | Resolution | 1024×1024 (32×32×32 latent) |
  | VAE reconstruction | ~26 dB PSNR @512px; sharper at 1024px (32×32 latent) |
 
- Qualitatively, the final checkpoint produces **watermark-free**, accurate objects, cinematic scenes, and
- coherent full-body humans. It is **soft on hands and multi-person scenes** — the real small-model /
- latent-resolution ceiling. Loss was still dropping at the end of training, so the 333M DiT is not yet
- saturated.
+ Qualitatively, the final checkpoint produces accurate objects and cinematic scenes. It is **soft on people,
+ hands, and multi-person scenes** — the real small-model / latent-resolution ceiling. Loss was still dropping
+ at the end of training, so the 333M DiT is not yet saturated.
 
  ## Files
 