AtteConDA-SDE-Scratch-30K

Model Summary

AtteConDA-SDE-Scratch-30K is a checkpoint from the AtteConDA series (Attention-based Condition Disambiguation Architecture).

The AtteConDA series targets controllable image generation and synthetic data augmentation for autonomous-driving scenes using three local conditions:

  • semantic segmentation
  • depth
  • edge

SDE in the repository name denotes this Semantic-segmentation + Depth + Edge condition set.
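As an illustration of the SDE condition set, the three maps are typically supplied as per-pixel inputs aligned to one resolution. A minimal sketch of bundling them (function and argument names are hypothetical, not the project's actual API):

```python
def stack_sde_conditions(seg_map, depth_map, edge_map):
    """Bundle the three SDE local conditions for one image.

    Each argument is a 2-D list (H x W) of floats in [0, 1];
    the result is a 3 x H x W nested list ordered seg, depth, edge.
    """
    h, w = len(seg_map), len(seg_map[0])
    for m in (depth_map, edge_map):
        assert len(m) == h and len(m[0]) == w, "condition maps must share one resolution"
    return [seg_map, depth_map, edge_map]

# Tiny 2x2 example with made-up values
seg   = [[0.0, 0.5], [0.5, 1.0]]
depth = [[0.1, 0.2], [0.3, 0.4]]
edge  = [[0.0, 1.0], [1.0, 0.0]]
cond = stack_sde_conditions(seg, depth, edge)
```

In practice the segmentation input is typically a label map, the depth input a normalized depth render, and the edge input a binary edge-detection result, all resized to the training resolution.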

This repository contains the 30K-step scratch-initialized AtteConDA SDE variant.

Upstream Foundations and Provenance

This series is built on two upstream bases:

  1. Uni-ControlNet as the architectural/code reference for composable local/global control.
  2. Stable Diffusion v1.5 as the latent diffusion foundation model.

Checkpoint Status

  • Repository name: AtteConDA-SDE-Scratch-30K
  • Stage: fine-tuned for 30K steps with a scratch-initialized local-control branch
  • Condition set: SDE = semantic segmentation + depth + edge
  • PAM status: PAM is not used.

The trainable local-control branch was randomly initialized rather than using the AtteConDA UniCon initialization checkpoint. The architecture family still follows the Uni-ControlNet-style design and Stable Diffusion v1.5 backbone setting.

Files in This Repository

This repository is intended to contain:

  • model weight file(s), e.g. *.ckpt or *.safetensors
  • README.md (this model card)
  • LICENSE (repository-specific distribution notice)

Important:

  • This release ships model weights only.
  • The matching config file is not currently bundled here.
  • To run the checkpoint, use the companion project codebase and the matching config from that codebase.
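Because only weights are bundled, a quick sanity check before wiring up the companion codebase is to list the tensor keys in the weight file. A stdlib-only sketch for the `.safetensors` layout (the demo key name is illustrative, not this checkpoint's actual layout):

```python
import json
import os
import struct
import tempfile

def list_safetensors_keys(path):
    """List tensor names in a .safetensors file using only the stdlib.

    Layout: an 8-byte little-endian header length, then that many bytes
    of JSON mapping tensor names to dtype / shape / data offsets.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return sorted(k for k in header if k != "__metadata__")

# Demo on a tiny hand-built file; the key name is illustrative only.
demo_header = {"local_adapter.conv_in.weight":
               {"dtype": "F32", "shape": [1], "data_offsets": [0, 4]}}
blob = json.dumps(demo_header).encode("utf-8")
path = os.path.join(tempfile.gettempdir(), "demo.safetensors")
with open(path, "wb") as f:
    f.write(struct.pack("<Q", len(blob)) + blob + b"\x00" * 4)
print(list_safetensors_keys(path))
```

For `.ckpt` files, the equivalent check requires the companion codebase's loading utilities rather than a stdlib parse.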

Training Data

The trained AtteConDA variants in this release use the following training datasets:

  • BDD10K semantic segmentation subset: 8,000 images (train 7,000 + val 1,000)
  • Cityscapes train/val: 3,475 images (train 2,975 + val 500)
  • GTA5: 24,966 images
  • nuImages (front camera subset): 18,368 images
  • BDD100K (excluding BDD10K overlap): 92,000 images

Total training images used by the trained variants: 146,809

Not used for training: Waymo, which serves only as evaluation data in this release series.
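The per-dataset counts can be cross-checked against the stated total:

```python
# Training-set sizes as listed above
train_counts = {
    "BDD10K seg subset": 8_000,
    "Cityscapes train+val": 3_475,
    "GTA5": 24_966,
    "nuImages front camera": 18_368,
    "BDD100K minus BDD10K overlap": 92_000,
}
total = sum(train_counts.values())
assert total == 146_809  # matches the reported total
```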

Evaluation Data

This release series uses a Waymo front-camera evaluation subset only for evaluation.

Evaluation-set notes:

  • Waymo images are not part of training
  • evaluation subset size: 3,048 images
  • construction policy in the project materials: front-camera images extracted from the first / middle / last positions of segments
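The stated first / middle / last extraction policy can be sketched as per-segment index selection (the frame count and 0-based indexing are assumptions, not the project's exact implementation):

```python
def pick_eval_indices(num_frames):
    """Return the first, middle, and last frame indices of one segment."""
    if num_frames < 3:
        return list(range(num_frames))
    return [0, num_frames // 2, num_frames - 1]

# e.g. a hypothetical 199-frame segment
print(pick_eval_indices(199))  # -> [0, 99, 198]
```

If exactly three frames were taken per segment, the 3,048-image subset would correspond to 1,016 segments.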

Training Procedure

  • Fine-tuning steps: 30K
  • Optimizer: AdamW
  • Learning rate: 1e-5
  • Batch size: 4
  • Resolution: 512 x 512
  • Frozen components: Stable Diffusion denoising backbone, VAE, and text encoder
  • Trainable focus: local control branch
  • Initialization difference: local control branch randomly initialized
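The freeze/train split above can be sketched as a parameter filter: only local-control-branch parameters reach the optimizer, while backbone, VAE, and text-encoder parameters are excluded. The module prefixes here are assumptions for illustration, not the project's actual names:

```python
FROZEN_PREFIXES = ("unet.", "vae.", "text_encoder.")  # assumed module names

def trainable_param_names(all_names):
    """Keep only local-control-branch parameters for the optimizer;
    the SD backbone, VAE, and text encoder stay frozen."""
    return [n for n in all_names if not n.startswith(FROZEN_PREFIXES)]

names = ["unet.down.0.weight", "vae.encoder.weight",
         "text_encoder.emb.weight", "local_adapter.conv_in.weight"]
print(trainable_param_names(names))  # -> ['local_adapter.conv_in.weight']
```

The surviving parameters would then be handed to AdamW at the listed learning rate of 1e-5.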

Common project-side generation/evaluation settings for trained variants:

  • guidance backbone family: Stable Diffusion 1.5 latent diffusion
  • conditioning family: Uni-ControlNet-style controllable diffusion design
  • inference sampler used in project evaluation: DDIM
  • DDIM steps used in project evaluation: 50
  • intended domain: autonomous-driving scene appearance modification while preserving scene structure
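For reference, the DDIM sampler named above performs a deterministic per-step update (eta = 0). A scalar sketch of one step, with `alpha_bar` denoting the cumulative noise-schedule product (values passed in would come from the actual schedule):

```python
import math

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM step (eta = 0).

    x_t: current noisy sample, eps: predicted noise,
    alpha_bar_*: cumulative noise-schedule products.
    """
    # Predicted clean sample x0 from the noise estimate
    x0 = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)
    # Re-noise x0 toward the previous (less noisy) timestep
    return math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * eps
```

With `alpha_bar_prev = 1.0` the step returns the predicted clean sample directly; the project evaluation chains 50 such steps.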

Quantitative Results

The following quantitative results were reported for this 30K scratch-initialized variant under the project evaluation protocol:

Metric                             Value
Semantic Segmentation mIoU ↑       0.2445
Depth RMSE ↓                       40.41
Edge L1 Error ↓                    0.03759
Object Preservation F1 ↑           0.0455
Diversity (1 - MS-SSIM) ↑          0.8450
Reality (CLIP-CMMD) ↓              0.1827
Text Alignment (R-Precision@1) ↑   0.2894
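Of the metrics above, Text Alignment (R-Precision@1) is the fraction of generated images whose own prompt ranks first among candidate prompts by image-text similarity. A minimal sketch over a precomputed similarity matrix (the matrix values are illustrative, not project data):

```python
def r_precision_at_1(sim):
    """sim[i][j]: similarity of image i to prompt j; prompt i is the
    ground-truth match for image i. Returns the top-1 retrieval rate."""
    hits = sum(1 for i, row in enumerate(sim)
               if max(range(len(row)), key=row.__getitem__) == i)
    return hits / len(sim)

sim = [[0.9, 0.2, 0.1],   # image 0: correct prompt ranked first
       [0.3, 0.1, 0.8],   # image 1: wrong prompt ranked first
       [0.2, 0.1, 0.7]]   # image 2: correct prompt ranked first
print(r_precision_at_1(sim))  # -> 0.6666666666666666
```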

Intended Use

This repository is intended for:

  • research on controllable diffusion models
  • research on multi-condition generation
  • research on synthetic data augmentation for autonomous-driving perception and reasoning tasks
  • ablation studies on initialization, training steps, and PAM effects
  • reproducible comparison across AtteConDA variants

Out-of-Scope Use

This repository is not intended for:

  • commercial deployment
  • customer-facing or production systems
  • safety-critical decision making
  • real-world vehicle operation or vehicle assistance
  • any use that violates upstream model terms or dataset terms

Known Limitations

Known limitations of this release family include:

  • possible structural failures on small distant objects
  • possible distortion or disappearance of vehicles, traffic signs, or thin structures in difficult regions
  • possible imperfect preservation of text on signboards
  • evaluation is based on external projection models rather than full human relabeling
  • not yet a guarantee of downstream task improvement for every autonomous-driving task
  • current resolution and backbone scale may limit very fine-grained detail preservation

Bias, Domain Shift, and Generalization Notes

These checkpoints are trained on a mixture of road-scene datasets and should be treated as domain-dependent research artifacts. They may reflect:

  • geographic bias
  • weather / time imbalance
  • dataset-specific annotation conventions
  • camera viewpoint bias
  • urban-scene category bias

Generalization outside the project setting must not be assumed.

Licensing and Use Restrictions

Do not label this repository as MIT.

Why:

  • the Uni-ControlNet code repository is MIT-licensed, but
  • this checkpoint family is built on Stable Diffusion v1.5 and
  • Stable Diffusion v1.5 derivatives carry CreativeML Open RAIL-M obligations, while
  • multiple training datasets in this project are distributed under non-commercial and/or research-oriented terms.

Accordingly, this repository uses:

  • license: other in the Hugging Face metadata
  • a repository-root LICENSE file named AtteConDA Research-Only License

Practical summary:

  • non-commercial research, teaching, scientific publication, and personal experimentation only
  • preserve repository notices
  • do not relax restrictions when redistributing
  • comply with the upstream Stable Diffusion and dataset terms as well

Citation

If you use this repository, please cite the AtteConDA work and the upstream bases.

AtteConDA / thesis-level citation

@misc{noguchi2026atteconda,
  author = {Shogo Noguchi},
  title = {A Synthetic Data Augmentation Framework Based on a Multi-Condition Diffusion Model with an Attention Mechanism for Suppressing Condition Conflicts},
  year = {2026},
  note = {Bachelor thesis, Gunma University. Original title in Japanese: 条件競合を抑制する注意機構に基づく多条件拡散モデルによる合成データ拡張フレームワーク}
}

Upstream references

@inproceedings{zhao2023unicontrolnet,
  title={Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models},
  author={Zhao, Shihao and others},
  booktitle={NeurIPS},
  year={2023}
}

@inproceedings{rombach2022high,
  title={High-Resolution Image Synthesis with Latent Diffusion Models},
  author={Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Björn},
  booktitle={CVPR},
  year={2022}
}

Acknowledgements

This repository acknowledges the upstream foundations and datasets used in the AtteConDA project:

  • Uni-ControlNet
  • Stable Diffusion v1.5
  • BDD10K / BDD100K
  • Cityscapes
  • GTA5 (Playing for Data)
  • nuImages

Waymo is acknowledged as an evaluation dataset only for this release series and was not used for training.

Release Notes

This model card was written conservatively to avoid over-claiming. If you later publish exact benchmark tables, official project URLs, or bundled configs, update this card accordingly.
