## Training Details
The ControlNet is trained using the [Realistic Vision V5.1](https://huggingface.co/SG161222/Realistic_Vision_V5.1_noVAE) diffusion model. The training process takes approximately two days on an NVIDIA TITAN RTX GPU, with a batch size of 4 and a learning rate of 1e-4. During training, the probability of randomly dropping conditioning inputs is set to 0.1.
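The conditioning-dropout step mentioned above (inputs dropped with probability 0.1) can be sketched as follows. This is a minimal illustration in NumPy, not the repository's actual training code, and the function name is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def maybe_drop_conditioning(cond, drop_prob=0.1):
    """With probability drop_prob, replace the conditioning input with
    an all-zero tensor; otherwise pass it through unchanged.
    (Classifier-free-guidance-style dropout, as described above.)"""
    if rng.random() < drop_prob:
        return np.zeros_like(cond)
    return cond

cond = np.ones((4, 512, 512), dtype=np.float32)  # 4-channel conditioning input
out = maybe_drop_conditioning(cond)
assert out.shape == cond.shape
```

Dropping the conditioning during training lets the model also learn the unconditional distribution, which enables classifier-free guidance at inference time.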
The conditional input consists of a concatenated normal map and segmentation map, resulting in a 4-channel input (3 channels for the normal map and 1 for the segmentation map). The resolution of training images is fixed at 512x512. To ensure approximately balanced quantities of face, mouth, and eye data, we duplicate relevant samples. For data augmentation, we employ random resized cropping during training.
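Assembling the 4-channel conditional input described above can be sketched as below, assuming channel-first NumPy arrays; `build_conditioning` is an illustrative helper, not part of the released code:

```python
import numpy as np

def build_conditioning(normal_map, seg_map):
    """Concatenate a 3-channel normal map and a 1-channel segmentation map
    along the channel axis into a single 4-channel conditioning input."""
    assert normal_map.shape[0] == 3 and seg_map.shape[0] == 1
    assert normal_map.shape[1:] == seg_map.shape[1:]  # spatial sizes must match
    return np.concatenate([normal_map, seg_map], axis=0)  # (4, H, W)

normal = np.random.rand(3, 512, 512).astype(np.float32)
seg = np.random.rand(1, 512, 512).astype(np.float32)
cond = build_conditioning(normal, seg)
print(cond.shape)  # (4, 512, 512)
```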
For the training dataset, please refer to [FaceNormalSeg-ControlNet-dataset](https://huggingface.co/datasets/onethousand/FaceNormalSeg-ControlNet-dataset).
For ControlNet guidance on the **face** region, we recommend utilizing the complete text prompt describing the full avatar (e.g., "a teen boy, pensive look, dark hair, preppy sweater, collared shirt, moody room, 80s memorabilia"). However, for the **mouth** and **eye** regions, which typically lack person-specific features, we observe that detailed prompts degrade image quality. Consequently, we use more abstract text prompts paired with region-specific prefixes for these areas (e.g., "right eye region, a boy"), broadly categorizing the avatar.
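The region-dependent prompt scheme above can be illustrated with a small helper. The function and the prefix strings for the left eye and mouth are assumptions extrapolated from the "right eye region" example; only that example appears in the text:

```python
def build_prompt(region, full_prompt, coarse_category):
    """Return the text prompt for a given region: the face uses the full
    avatar description, while mouth/eye regions use a region-specific
    prefix plus a broad category of the avatar."""
    if region == "face":
        return full_prompt
    prefixes = {
        "left_eye": "left eye region",   # assumed, by analogy
        "right_eye": "right eye region",  # from the example in the text
        "mouth": "mouth region",          # assumed, by analogy
    }
    return f"{prefixes[region]}, {coarse_category}"

print(build_prompt("right_eye", "a teen boy, pensive look, dark hair", "a boy"))
# → right eye region, a boy
```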