## Training Details
The ControlNet is trained using the [Realistic Vision V5.1](https://huggingface.co/SG161222/Realistic_Vision_V5.1_noVAE) diffusion model. The training process takes approximately two days on an NVIDIA TITAN RTX GPU, with a batch size of 4 and a learning rate of 1e-4. During training, the probability of randomly dropping conditioning inputs is set to 0.1.
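The conditioning-dropout step mentioned above (inputs dropped with probability 0.1) can be sketched as follows. This is a minimal illustration in NumPy, not the repository's actual training code, and the function name is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def maybe_drop_conditioning(cond, drop_prob=0.1):
    """With probability drop_prob, replace the conditioning input with
    an all-zero tensor; otherwise pass it through unchanged.
    (Classifier-free-guidance-style dropout, as described above.)"""
    if rng.random() < drop_prob:
        return np.zeros_like(cond)
    return cond

cond = np.ones((4, 512, 512), dtype=np.float32)  # 4-channel conditioning input
out = maybe_drop_conditioning(cond)
assert out.shape == cond.shape
```

Dropping the conditioning during training lets the model also learn the unconditional distribution, which enables classifier-free guidance at inference time.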
The conditional input consists of a concatenated normal map and segmentation map, resulting in a 4-channel input (3 channels for the normal map and 1 for the segmentation map). The resolution of training images is fixed at 512x512. To ensure approximately balanced quantities of face, mouth, and eye data, we duplicate relevant samples. For data augmentation, we employ random resized cropping during training.
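Assembling the 4-channel conditional input described above can be sketched as below, assuming channel-first NumPy arrays; `build_conditioning` is an illustrative helper, not part of the released code:

```python
import numpy as np

def build_conditioning(normal_map, seg_map):
    """Concatenate a 3-channel normal map and a 1-channel segmentation map
    along the channel axis into a single 4-channel conditioning input."""
    assert normal_map.shape[0] == 3 and seg_map.shape[0] == 1
    assert normal_map.shape[1:] == seg_map.shape[1:]  # spatial sizes must match
    return np.concatenate([normal_map, seg_map], axis=0)  # (4, H, W)

normal = np.random.rand(3, 512, 512).astype(np.float32)
seg = np.random.rand(1, 512, 512).astype(np.float32)
cond = build_conditioning(normal, seg)
print(cond.shape)  # (4, 512, 512)
```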
For the training dataset, please refer to [FaceNormalSeg-ControlNet-dataset](https://huggingface.co/datasets/onethousand/FaceNormalSeg-ControlNet-dataset).
For ControlNet guidance on the **face** region, we recommend utilizing the complete text prompt describing the full avatar (e.g., "a teen boy, pensive look, dark hair, preppy sweater, collared shirt, moody room, 80s memorabilia"). However, for the **mouth** and **eye** regions, which typically lack person-specific features, we observe that detailed prompts degrade image quality. Consequently, we use more abstract text prompts paired with region-specific prefixes for these areas (e.g., "right eye region, a boy"), broadly categorizing the avatar.
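The region-dependent prompt scheme above can be illustrated with a small helper. The function and the prefix strings for the left eye and mouth are assumptions extrapolated from the "right eye region" example; only that example appears in the text:

```python
def build_prompt(region, full_prompt, coarse_category):
    """Return the text prompt for a given region: the face uses the full
    avatar description, while mouth/eye regions use a region-specific
    prefix plus a broad category of the avatar."""
    if region == "face":
        return full_prompt
    prefixes = {
        "left_eye": "left eye region",   # assumed, by analogy
        "right_eye": "right eye region",  # from the example in the text
        "mouth": "mouth region",          # assumed, by analogy
    }
    return f"{prefixes[region]}, {coarse_category}"

print(build_prompt("right_eye", "a teen boy, pensive look, dark hair", "a boy"))
# → right eye region, a boy
```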