---
license: creativeml-openrail-m
tags:
- computer vision
- stable-diffusion
- stable-diffusion-2-1
- photography
- photoreal
---

# Deprecation notice

This model was a research project and is deprecated in favour of ptx0/pseudo-flex-base.

# Capabilities

This model is capable of producing photorealistic images of people.

It retains much of the base 2.1-v model's knowledge, because its text encoder was only minimally tuned.

# Limitations

This model does not produce perfect results every time.

This model cannot reproduce most real people. Instead, it makes "Derp-a-Like" equivalents to real people, which I prefer.

This model is not great at abstract imagery or digital art, though it certainly can produce a variety of amazing art styles.

# Dataset

* Cushman (8,000 Kodachrome slides from 1939 to 1969)
* Midjourney v5.1-filtered (about 22,000 upscaled v5.1 images)
* National Geographic (about 3,000-4,000 images larger than 1024x768 of animals, wildlife, landscapes, history)
* a small dataset of stock images of people vaping / smoking

# Training parameters

* polynomial learning rate scheduler shared between the text encoder (TE) and the U-Net, starting at 4e-8 and decaying to 1e-8
* batch size 15 with 10 gradient accumulation steps, for an effective batch size of 150
* target is 30,000 steps, but training will likely stop sooner
* betas rescaled to enforce terminal SNR

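
As a rough illustration of the numbers above, the learning rate decay and the effective batch size can be sketched in plain Python. This is not the training code: the `polynomial_lr` helper, the `power=1.0` exponent, and the absence of a warmup phase are assumptions, since the card only gives the start value, the end value, and the target step count.

```python
def polynomial_lr(step, total_steps=30_000, lr_start=4e-8, lr_end=1e-8, power=1.0):
    """Polynomial decay from lr_start to lr_end over total_steps.

    power=1.0 (plain linear decay) is an assumption; the card does not
    state the exponent or whether any warmup was used.
    """
    step = min(max(step, 0), total_steps)      # clamp to the schedule range
    remaining = 1.0 - step / total_steps       # fraction of training left
    return (lr_start - lr_end) * remaining ** power + lr_end


# Effective batch size: samples seen per optimizer step.
micro_batch_size = 15
gradient_accumulation_steps = 10
effective_batch_size = micro_batch_size * gradient_accumulation_steps

print("effective batch size:", effective_batch_size)
for step in (0, 1_650, 15_000, 30_000):
    print(step, polynomial_lr(step))
```

With these assumed settings the rate moves only from 4e-8 to 1e-8 over the whole run, which fits the stated goal of keeping most of the base model's knowledge.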
# Training goals

* explore the effects of terminal SNR scheduling
* improve faces, especially "at a distance"
* improve composition, e.g. completeness of the resulting image
* improve prompt comprehension, e.g. "do what I want, even if it is weird"
* retain / introduce a slightly colourful flavour due to the Midjourney data
* enhance understanding of the past, through the Cushman collection
* retain the ability to produce natural landscapes and animals via National Geographic

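
For readers unfamiliar with terminal SNR scheduling, the idea is to rescale the noise schedule so that the final timestep is pure noise (a signal-to-noise ratio of zero). The sketch below follows the commonly published rescaling procedure in plain Python. It is not the code used for this model, and the schedule values (`scaled_linear_betas` with 1000 steps, `beta_start=0.00085`, `beta_end=0.012`) are assumed Stable Diffusion defaults that the card itself does not state.

```python
import math


def scaled_linear_betas(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Assumed baseline schedule: linear in sqrt(beta), then squared."""
    s, e = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(s + (e - s) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]


def enforce_zero_terminal_snr(betas):
    """Rescale betas so the cumulative alpha product reaches zero at the last step."""
    # sqrt of the cumulative product of alphas (alpha = 1 - beta)
    sqrt_alpha_bar = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        sqrt_alpha_bar.append(math.sqrt(running))

    first, last = sqrt_alpha_bar[0], sqrt_alpha_bar[-1]
    # Shift so the last value becomes zero, then scale so the first is unchanged.
    scale = first / (first - last)
    sqrt_alpha_bar = [(x - last) * scale for x in sqrt_alpha_bar]

    # Convert back from the cumulative product to per-step betas.
    alpha_bar = [x * x for x in sqrt_alpha_bar]
    alphas = [alpha_bar[0]] + [
        alpha_bar[i] / alpha_bar[i - 1] for i in range(1, len(alpha_bar))
    ]
    return [1.0 - a for a in alphas]


original = scaled_linear_betas()
rescaled = enforce_zero_terminal_snr(original)
print("original final beta:", original[-1])
print("rescaled final beta:", rescaled[-1])
```

After rescaling, the final beta is 1, so the last timestep carries no signal, while the first step is left essentially unchanged.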
# Observations

* at 1650 steps, we still haven't cracked the code on faces
* at 250 steps, we had amazing photoreal Mars landscapes, which have mostly carried forward to 1650 steps
* lighting and composition are at their best

# Future work

This model inspired the search for a solution to the proliferation issue, which led me to ttj/flex-diffusion-2-1 and, in turn, to the creation of ptx0/pseudo-flex-base, another photoreal model that supports multiple aspect ratios.

This model was trained **purely** on 768x768 square images, which were randomly resized and cropped. It can produce some higher-resolution landscapes, but it cannot reliably render higher-resolution subjects without deformities.
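
To make "randomly resized and cropped" concrete, here is a minimal sketch of the geometry for one such augmentation. It only computes the resize dimensions and the crop box and does not touch image data. The `random_resize_crop_box` helper and its `max_scale` parameter are hypothetical; the card does not say what resize range or cropping strategy was actually used.

```python
import random


def random_resize_crop_box(width, height, target=768, max_scale=1.25, rng=random):
    """Return ((new_w, new_h), (left, top, right, bottom)) for a square crop.

    The shorter side is resized to between target and target * max_scale
    (max_scale is a hypothetical value), then a target x target window is
    placed at a random position inside the resized image.
    """
    scale = rng.uniform(1.0, max_scale)
    short_side = max(target, round(target * scale))
    ratio = short_side / min(width, height)
    # Never let rounding push a side below the crop size.
    new_w = max(target, round(width * ratio))
    new_h = max(target, round(height * ratio))
    left = rng.randint(0, new_w - target)
    top = rng.randint(0, new_h - target)
    return (new_w, new_h), (left, top, left + target, top + target)


size, box = random_resize_crop_box(1920, 1080, rng=random.Random(0))
print("resized to:", size)
print("crop box:", box)
```

Every crop produced this way is a 768x768 square, so a model trained on such crops never sees a wider or taller frame, which is consistent with the deformities noted above at other resolutions.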