Upload folder using huggingface_hub
- MODELO_BYTE_DREAM.md +287 -0
- UPLOAD_GUIDE_PT.md +258 -0
- app.py +25 -3
- bytedream/model.py +96 -22
- config.yaml +1 -1
- create_test_dataset.py +106 -0
- dataset/test_image_001.txt +1 -0
- dataset/test_image_002.txt +1 -0
- dataset/test_image_003.txt +1 -0
- dataset/test_image_004.txt +1 -0
- dataset/test_image_005.txt +1 -0
- dataset/test_image_006.txt +1 -0
- dataset/test_image_007.txt +1 -0
- dataset/test_image_008.txt +1 -0
- dataset/test_image_009.txt +1 -0
- dataset/test_image_010.txt +1 -0
- debug_unet.py +18 -0
- quick_fix.bat +58 -0
- quick_fix.py +152 -0
- train.py +11 -0
MODELO_BYTE_DREAM.md ADDED
@@ -0,0 +1,287 @@
# Byte Dream - Generative AI Model

## 📋 Overview

**Byte Dream** is a diffusion model for generating images from text descriptions (text-to-image), based on a UNet architecture with attention mechanisms.

---

## 🏗️ Model Architecture

### Main Components

#### 1. **UNet2DConditionModel** (571M parameters)
- **Role**: Text-conditioned noise prediction
- **Architecture**: Encoder-decoder with skip connections

**Structure:**
```
Input (4 latent channels)
↓
Down Block 0: 320 channels → 2 ResNet layers + attention
↓
Down Block 1: 640 channels → 2 ResNet layers + attention
↓
Down Block 2: 1280 channels → 2 ResNet layers + attention
↓
Down Block 3: 1280 channels → 2 ResNet layers + attention
↓
Middle Block: 1280 channels → ResNet + attention + ResNet
↓
Up Block 0: 1280 channels → 2 ResNet layers + attention
↓
Up Block 1: 1280 channels → 2 ResNet layers + attention
↓
Up Block 2: 640 channels → 2 ResNet layers + attention (+skip)
↓
Up Block 3: 320 channels → 2 ResNet layers + attention (+skip)
↓
Output: 4 channels
```
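
A quick smoke test of this forward pass (a sketch, assuming the `UNet2DConditionModel` in `bytedream/model.py` builds with its constructor defaults; the shapes follow the dimensions table further below):

```python
import torch
from bytedream.model import UNet2DConditionModel

# Assumption: the default constructor matches the architecture above.
unet = UNet2DConditionModel()
sample = torch.randn(1, 4, 64, 64)       # noisy latents (4 channels, 64x64)
timestep = torch.tensor([10])            # diffusion step index
text_embeds = torch.randn(1, 77, 768)    # CLIP-shaped text embeddings

with torch.no_grad():
    noise_pred = unet(sample, timestep, text_embeds)
print(noise_pred.shape)  # expected: torch.Size([1, 4, 64, 64])
```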

#### 2. **AutoencoderKL (VAE)**
- **Encoder**: Compresses a 512x512x3 image → 4x64x64 latents
- **Decoder**: Reconstructs latents → a 512x512x3 image
- **Latent channels**: 4 (using only the VAE mean)

#### 3. **Text Encoder**
- **Model**: CLIP ViT-L/14 (Hugging Face)
- **Dimension**: 768 dimensions
- **Max tokens**: 77 tokens
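
For reference, this encoding step can be sketched with 🤗 Transformers; `openai/clip-vit-large-patch14` is the standard CLIP ViT-L/14 checkpoint (whether the project loads it exactly this way is an assumption):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    "A beautiful sunset over mountains",
    padding="max_length", max_length=77, truncation=True, return_tensors="pt",
)
with torch.no_grad():
    # last_hidden_state has shape (1, 77, 768): 77 tokens, 768 dimensions
    text_embeds = text_encoder(**tokens).last_hidden_state
```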

#### 4. **DDIM Scheduler**
- **Timesteps**: 1000 training steps
- **Schedule**: Scaled linear (β: 0.00085 → 0.012)
- **Sampling**: Deterministic (η=0)
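
These settings map directly onto the 🧨 Diffusers `DDIMScheduler` constructor (assuming the project's scheduler in `bytedream/scheduler.py` mirrors that interface; η=0 is passed later, at `step()` time):

```python
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(
    num_train_timesteps=1000,       # 1000 training steps
    beta_start=0.00085,             # β start
    beta_end=0.012,                 # β end
    beta_schedule="scaled_linear",  # scaled linear schedule
)
```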

---

## 🔧 Component Details

### TimestepEmbedding
```python
Input: scalar timestep (int)
↓
Sinusoidal embedding (320 dim)
↓
MLP: 320 → 1280 → 1280
↓
Output: temporal embedding (1280 dim)
```

### ResnetBlock2D
```python
Input: (B, C, H, W) + temb
↓
GroupNorm → SiLU → Conv2d
↓
Add temb (projected)
↓
GroupNorm → SiLU → Dropout → Conv2d
↓
Skip connection (1x1 conv if needed)
↓
Output: (B, C', H, W)
```

### AttentionBlock (Cross-Attention)
```python
Input:
- hidden_states: (B, C, H, W) or (B, L, C)
- encoder_hidden_states: (B, 77, 768)

Processing:
1. If 4D: reshape (B,C,H,W) → (B, H*W, C)
2. Q, K, V projections
3. Multi-head attention (8 heads)
4. Reshape back to 4D (if needed)
5. Output projection

Output: (B, C, H, W) or (B, L, C)
```
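
Steps 1 and 4 of that flow are the usual 4D↔3D round trip, which the `AttentionBlock` in this commit implements with `permute` + `reshape`; a minimal standalone sketch:

```python
import torch

# (B, C, H, W) -> (B, H*W, C): flatten the spatial grid into a token sequence
x = torch.randn(2, 320, 64, 64)
b, c, h, w = x.shape
seq = x.permute(0, 2, 3, 1).reshape(b, h * w, c)

# ... attention runs over `seq` here ...

# (B, H*W, C) -> (B, C, H, W): restore the spatial layout
x_back = seq.reshape(b, h, w, c).permute(0, 3, 1, 2)
assert torch.equal(x, x_back)
```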

---

## 📊 Technical Specifications

### Parameters
- **UNet**: 571,138,564 parameters
- **Blocks**: 4 down, 4 up, 1 middle
- **Layers per block**: 2 ResNets + cross-attention
- **Attention heads**: 8
- **Head dimension**: 40 (320/8)

### Dimensions
| Stage | Channels | Resolution | Skip channels |
|--------|--------|-----------|---------------|
| Input | 4 | 64x64 | - |
| Down 0 | 320 | 64x64 | 320 |
| Down 1 | 640 | 32x32 | 640 |
| Down 2 | 1280 | 16x16 | 1280 |
| Down 3 | 1280 | 8x8 | 1280 |
| Middle | 1280 | 8x8 | - |
| Up 0 | 1280 | 8x8 | 1280 |
| Up 1 | 1280 | 16x16 | 1280 |
| Up 2 | 640 | 32x32 | 640 |
| Up 3 | 320 | 64x64 | 320 |
| Output | 4 | 64x64 | - |
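
The parameter count can be checked directly with standard PyTorch (assuming the model class builds as above):

```python
from bytedream.model import UNet2DConditionModel

unet = UNet2DConditionModel()
n_params = sum(p.numel() for p in unet.parameters())
print(f"{n_params:,}")  # documented above as 571,138,564
```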

---

## 🔄 Training Flow

### Pipeline
```
1. Load dataset (images + captions)
↓
2. Preprocessing:
   - Resize to 512x512
   - Normalize to [-1, 1]
   - Data augmentation (flip, crop)
↓
3. Encoding:
   - Image → VAE → latents (4x64x64)
   - Text → CLIP → embeddings (77x768)
↓
4. Diffusion forward process:
   - Add noise to the latents
   - Sample a random timestep
↓
5. UNet prediction:
   - Input: noisy_latents + timestep + text_embeds
   - Output: noise_pred
↓
6. Loss: MSE(noise_pred, noise_true)
↓
7. Backprop + Adam optimizer
```

### Hyperparameters (config.yaml)
```yaml
training:
  epochs: 100
  batch_size: 4
  learning_rate: 1e-5
  gradient_accumulation_steps: 1
  max_grad_norm: 1.0

data_augmentation:
  random_flip: true
  center_crop: true
  image_size: 512
```
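
Steps 4-7 condense to a few lines; a minimal sketch of one training step, assuming a Diffusers-style scheduler with an `add_noise` method (the project's actual loop lives in `train.py`):

```python
import torch
import torch.nn.functional as F

def training_step(unet, scheduler, optimizer, latents, text_embeds):
    # 4. Forward diffusion: sample a random timestep and noise the latents.
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # 5.-6. Predict the noise and compare against the true noise.
    noise_pred = unet(noisy_latents, t, text_embeds)
    loss = F.mse_loss(noise_pred, noise)

    # 7. Backprop + optimizer step.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```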

---

## 🚀 Inference Flow

### Image Generation
```python
1. Prompt → Text Encoder → embeddings
↓
2. Sample noise: (1, 4, 64, 64) ~ N(0,1)
↓
3. DDIM sampling loop (50 steps):
for t in timesteps:
    noise_pred = UNet(noisy_latents, t, text_embeds)
    noisy_latents = scheduler.step(noise_pred, t, noisy_latents)
↓
4. Decode: latents → VAE decoder → image
↓
5. Output: 512x512 RGB image
```
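
The same loop written against the Diffusers `DDIMScheduler` API, as a hedged sketch (`unet` and `text_embeds` as in the earlier snippets; `.step()` returns an object whose `.prev_sample` is the updated latent):

```python
import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=1000, beta_start=0.00085,
                          beta_end=0.012, beta_schedule="scaled_linear")
scheduler.set_timesteps(50)                  # 50 inference steps

latents = torch.randn(1, 4, 64, 64)          # step 2: start from pure noise
for t in scheduler.timesteps:                # step 3: denoising loop
    with torch.no_grad():
        noise_pred = unet(latents, t, text_embeds)
    latents = scheduler.step(noise_pred, t, latents, eta=0.0).prev_sample
# step 4 would decode `latents` with the VAE decoder into a 512x512 image
```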

---

## 🛠️ Fixes Implemented

### 1. Timestep Embedding
- **Problem**: The linear layer expected a float but received a long
- **Fix**: Add a `TimestepEmbedding` module with a sinusoidal embedding

### 2. Cross-Attention
- **Problem**: Attention expected 3D input but received 4D
- **Fix**: Automatic reshape (B,C,H,W) ↔ (B,H*W,C)

### 3. Skip Connections
- **Problem**: Channel mismatch between down/up blocks
- **Fix**:
  - Project skip connections dynamically
  - Interpolate spatial dimensions when necessary
  - Equalize the number of layers (layers_per_block=2)

### 4. VAE Latents
- **Problem**: The VAE output had 8 channels (mean + logvar)
- **Fix**: Use only the first 4 channels (the mean)
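
In code, the 8-channel encoder output splits into a mean half and a log-variance half; `train.py` in this commit keeps just the mean. A sketch (`vae` and `images` are placeholder names; the full reparameterization is shown only as an illustrative alternative):

```python
import torch

latents = vae.encode(images)            # (B, 8, 64, 64): mean and logvar stacked
mean, logvar = latents.chunk(2, dim=1)  # two (B, 4, 64, 64) halves

# Option used in train.py: keep only the mean.
z = mean * 0.18215

# Illustrative alternative: sample via the reparameterization trick.
# z = (mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)) * 0.18215
```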

---

## 📁 File Structure

```
Byte Dream/
├── bytedream/
│   ├── model.py       # UNet, VAE, Text Encoder
│   ├── scheduler.py   # DDIM scheduler
│   ├── pipeline.py    # Generation pipeline
│   └── utils.py       # Utilities
├── train.py           # Training script
├── infer.py           # Inference/generation
├── config.yaml        # Configuration
├── app.py             # Web interface
└── dataset/           # Training dataset
    ├── *.jpg          # Images
    └── *.txt          # Captions
```

---

## 🎯 Distinctive Features

1. **CPU-optimized**: Focused on local execution
2. **Efficient architecture**: 571M parameters (balanced)
3. **Adaptive skip connections**: Automatic channel projection
4. **Flexible attention**: Supports 3D and 4D tensors
5. **Synthetic dataset**: Automatic training-data generation

---

## 📈 Expected Performance

- **Training**: ~2-3 hours/epoch (CPU, 100 images)
- **Inference**: ~30-60 seconds/image (50 steps, CPU)
- **Quality**: Reasonable on a small dataset; improves with more data

---

## 🔮 Next Steps

1. **Optimization**:
   - Mixed precision (FP16)
   - Gradient checkpointing
   - ONNX Runtime

2. **Quality Improvements**:
   - More training data
   - Higher resolution
   - Fine-tuning from pretrained models

3. **Deployment**:
   - Export to Hugging Face
   - REST API
   - Gradio/Streamlit interface

---

## 📝 References

- Stable Diffusion (RunwayML)
- DDIM paper
- CLIP (OpenAI)
- U-Net architecture

---

**Author**: Byte Dream Team
**Version**: 1.0.0
**License**: MIT
UPLOAD_GUIDE_PT.md ADDED
@@ -0,0 +1,258 @@
# Fixing the Error and Updating for Hugging Face

## Identified Problem

The error occurs because `app.py` is trying to load a Hugging Face model that does not exist or is not configured correctly. The repository `Enzo8930302/ByteDream` does not contain the required `model_index.json` file.

---

## Quick Fix

### 1️⃣ Run the Fix Script

```bash
python quick_fix.py
```

This script will:
- Test the pipeline with random initialization
- Check whether any trained models exist
- Show the Hugging Face upload guide

---

## Complete Step-by-Step Guide

### Step 1: Train the Model (if you haven't yet)

```bash
python train.py --epochs 1000 --batch_size 4 --output_dir ./models/bytedream
```

**Important:** You need to train the model before uploading it!

---

### Step 2: Install the Hugging Face Dependencies

```bash
pip install huggingface_hub
```

---

### Step 3: Log In to Hugging Face

```bash
huggingface-cli login
```

You will need an access token. To get one:
1. Go to: https://huggingface.co/settings/tokens
2. Log in to your account
3. Click "Create new token"
4. Copy the generated token
5. Paste it into the terminal when prompted

**No account yet?** Create one at: https://huggingface.co/join

---

### Step 4: Upload the Model

Run the command below, replacing `YourUsername` with your actual username:

```bash
python upload_to_hf.py --repo_id "YourUsername/ByteDream" --create_space
```

**Available options:**
- `--private`: Makes the repository private (optional)
- `--create_space`: Creates the files for a Hugging Face Space

**Example:**
```bash
python upload_to_hf.py --repo_id "Enzo8930302/ByteDream" --create_space
```
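
If you prefer not to use the helper script, the same upload can be sketched directly with `huggingface_hub` (the folder path and repo id below are placeholders; adjust them to your setup):

```python
from huggingface_hub import HfApi

api = HfApi()  # uses the token stored by `huggingface-cli login`

repo_id = "YourUsername/ByteDream"       # placeholder repo id
api.create_repo(repo_id, exist_ok=True)  # pass private=True to keep it private
api.upload_folder(
    folder_path="./models/bytedream",    # placeholder local folder
    repo_id=repo_id,
    commit_message="Upload trained Byte Dream weights",
)
```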

---

### Step 5: Verify the Upload

After uploading, visit:
```
https://huggingface.co/YourUsername/ByteDream
```

---

## Repository Structure on Hugging Face

Your repository should contain:

```
ByteDream/
├── unet_pytorch_model.bin   # UNet model weights
├── config.yaml              # Configuration
├── README.md                # Documentation
├── requirements.txt         # Dependencies
└── app.py                   # Gradio interface (for Spaces)
```

---

## Using the Model After Upload

### Option 1: From Python

```python
from bytedream.generator import ByteDreamGenerator

# Load from Hugging Face
generator = ByteDreamGenerator(
    model_path="path/to/downloaded/model",
    config_path="config.yaml",
    device="cpu"
)

# Generate an image
image = generator.generate(
    prompt="A beautiful sunset over mountains",
    width=512,
    height=512,
    num_inference_steps=50,
)

image.save("output.png")
```

### Option 2: From the Command Line

```bash
python infer.py --prompt "Cyberpunk city at night" --output city.png
```

### Option 3: Web Interface

```bash
python app.py
```

The web interface will be available at: http://localhost:7860

---

## Creating a Hugging Face Space

Spaces let other people use your model from a browser.

### Steps:

1. **Prepare the files:**
```bash
python upload_to_hf.py --repo_id "YourUsername/ByteDream" --create_space
```

2. **Go to Spaces:**
- Open: https://huggingface.co/spaces
- Click "Create new Space"

3. **Configure the Space:**
- **Space name:** `ByteDream`
- **SDK:** Gradio
- **Visibility:** Public or Private
- **Hardware:** CPU (the free tier works!)

4. **Upload the files:**
- Use the Spaces web interface
- Or push via git:
```bash
git clone https://huggingface.co/spaces/YourUsername/ByteDream
cd ByteDream
# Copy the required files
git add .
git commit -m "Initial commit"
git push
```

5. **Wait for the deploy:**
- The Space is built automatically
- When it is ready, you get a public URL

---

## Troubleshooting

### Error: "404 Client Error - Entry Not Found"

**Cause:** The repository does not exist or is empty.

**Fix:**
1. Make sure you are logged in: `huggingface-cli login`
2. Check that the repository name is correct
3. Upload the model first

### Error: "Model not loaded"

**Cause:** The model weights were not found.

**Fix:**
1. Train the model: `python train.py`
2. Or download pretrained weights from Hugging Face

### Error: "Token invalid"

**Cause:** The Hugging Face token is expired or incorrect.

**Fix:**
1. Log out: `huggingface-cli logout`
2. Generate a new token at: https://huggingface.co/settings/tokens
3. Log in again: `huggingface-cli login`

---

## Important Tips

### 1. Model Size
- Large models can take a while to upload
- Consider using private repositories during development

### 2. Spaces Hardware
- Free CPU tier: 2 vCPUs, 16GB RAM (enough for testing)
- GPU: requires a paid upgrade (faster generation)

### 3. Optimization
- Use `num_inference_steps=20-30` for quick previews
- Use `num_inference_steps=50-75` for final quality
- Reduce the resolution (256x256) for tests

---

## Next Steps

1. ✅ Run `python quick_fix.py` to test
2. 📚 Train the model on your data
3. 🚀 Upload it to Hugging Face
4. 🎨 Create a Space for the demo
5. 📢 Share it with the community!

---

## Useful Links

- **Hugging Face documentation:** https://huggingface.co/docs
- **Hugging Face Hub CLI:** https://huggingface.co/docs/huggingface_hub/guides/cli
- **Spaces documentation:** https://huggingface.co/docs/hub/spaces
- **Gradio documentation:** https://www.gradio.app/docs

---

## Need Help?

Open an issue on GitHub or reach out:
- Hugging Face Forums: https://discuss.huggingface.co/
- Community Discord

---

**Good luck with your model! 🎨✨**
app.py CHANGED
@@ -19,7 +19,13 @@ try:
     print("✓ Model loaded successfully!")
 except Exception as e:
     print(f"⚠ Warning: Could not load model: {e}")
-    print(" Please train the model
+    print("  Please train the model first using: python train.py")
+    print("  Or download pretrained weights from Hugging Face.")
+    print("")
+    print("  To use a model from Hugging Face, run:")
+    print("  python infer.py --prompt 'your prompt' --model 'username/repo_name'")
+    print("")
+    print("Starting in demo mode with random initialization...")
     generator = None

@@ -64,7 +70,6 @@ def generate_image(
 # Create Gradio interface
 with gr.Blocks(
     title="Byte Dream - AI Image Generator",
-    theme=gr.themes.Soft(),
     css="""
     .gradio-container {
         max-width: 1400px !important;
@@ -203,28 +208,34 @@
         example_btn1 = gr.Button(
             "🌆 Cyberpunk City",
             size="sm",
+            elem_id="example_btn1",
         )
         example_btn2 = gr.Button(
             "🐉 Fantasy Dragon",
             size="sm",
+            elem_id="example_btn2",
         )
         example_btn3 = gr.Button(
             "🏔️ Peaceful Landscape",
             size="sm",
+            elem_id="example_btn3",
         )

     with gr.Row():
         example_btn4 = gr.Button(
             "👤 Character Portrait",
             size="sm",
+            elem_id="example_btn4",
         )
         example_btn5 = gr.Button(
             "🌊 Underwater Scene",
             size="sm",
+            elem_id="example_btn5",
         )
         example_btn6 = gr.Button(
             "🎨 Abstract Art",
             size="sm",
+            elem_id="example_btn6",
         )

     # Example prompt values
@@ -259,8 +270,18 @@
     def set_example(prompt, negative):
         return prompt, negative, "Click Generate to create!"

+    # Map button names to their variables
+    button_map = {
+        "example_btn1": example_btn1,
+        "example_btn2": example_btn2,
+        "example_btn3": example_btn3,
+        "example_btn4": example_btn4,
+        "example_btn5": example_btn5,
+        "example_btn6": example_btn6,
+    }
+
     for btn_name, (prompt, negative) in example_prompts.items():
-
+        button_map[btn_name].click(
             fn=set_example,
             inputs=[gr.State(prompt), gr.State(negative)],
             outputs=[prompt_input, negative_prompt_input, status_text],
@@ -302,4 +323,5 @@ if __name__ == "__main__":
         server_port=7860,
         share=False,
         show_error=True,
+        theme=gr.themes.Soft(),
     )
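
One detail worth noting in the button-wiring loop above: passing the per-button prompt values through `gr.State` pins them at registration time, which sidesteps Python's late-binding closure pitfall. A minimal standalone sketch of the pattern (labels and handler are illustrative):

```python
import gradio as gr

with gr.Blocks() as demo:
    out = gr.Textbox(label="prompt")
    for label in ["Cyberpunk City", "Fantasy Dragon"]:
        btn = gr.Button(label)
        # gr.State captures the current value of `label`; a bare closure
        # would see only the final loop value at click time.
        btn.click(fn=lambda v: v, inputs=[gr.State(label)], outputs=[out])

demo.launch()
```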
bytedream/model.py CHANGED
@@ -6,7 +6,7 @@ Complete implementation of UNet, VAE, and Text Encoder for diffusion-based image
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
-from typing import Optional, Tuple, Union
+from typing import Optional, Tuple, Union, List
 import math

@@ -103,7 +103,15 @@ class AttentionBlock(nn.Module):
     ) -> torch.Tensor:
         residual = hidden_states

-
+        # Handle 4D inputs (batch, channels, height, width)
+        if hidden_states.ndim == 4:
+            batch_size, channels, height, width = hidden_states.shape
+            # Reshape to (batch, seq_len, channels) where seq_len = height * width
+            hidden_states = hidden_states.permute(0, 2, 3, 1).reshape(batch_size, -1, channels)
+            is_4d = True
+        else:
+            batch_size, sequence_length, _ = hidden_states.shape
+            is_4d = False

         query = self.to_q(hidden_states)
         encoder_hidden_states = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
@@ -111,7 +119,7 @@
         value = self.to_v(encoder_hidden_states)

         # Multi-head attention
-        query = query.reshape(batch_size,
+        query = query.reshape(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
         key = key.reshape(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
         value = value.reshape(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)

@@ -120,12 +128,17 @@
         attn_weights = F.softmax(attn_weights, dim=-1)

         attn_output = torch.matmul(attn_weights, value)
-        attn_output = attn_output.transpose(1, 2).reshape(batch_size,
+        attn_output = attn_output.transpose(1, 2).reshape(batch_size, -1, query.shape[-1] * self.num_heads)

         # Output projection
         for layer in self.to_out:
             attn_output = layer(attn_output)

+        # Reshape back to 4D if input was 4D
+        if is_4d:
+            attn_output = attn_output.reshape(batch_size, height, width, channels)
+            attn_output = attn_output.permute(0, 3, 1, 2)
+
         return residual + attn_output

@@ -195,7 +208,6 @@ class DownBlock2D(nn.Module):
         if self.downsamplers is not None:
             for downsampler in self.downsamplers:
                 hidden_states = downsampler(hidden_states)
-                output_states += (hidden_states,)

         return hidden_states, output_states

@@ -220,11 +232,11 @@ class UpBlock2D(nn.Module):
         attentions = []

         for i in range(num_layers):
+            # All layers receive skip connections
             in_ch = in_channels if i == 0 else out_channels
-            mix_ch = prev_output_channel if i == num_layers - 1 else out_channels

             resnets.append(ResnetBlock2D(
-                in_channels=in_ch +
+                in_channels=in_ch + prev_output_channel,
                 out_channels=out_channels,
                 temb_channels=temb_channels,
             ))

@@ -258,13 +270,36 @@
     ) -> torch.Tensor:
         for i, (resnet, attn) in enumerate(zip(self.resnets, self.attentions)):
             # Skip connection from U-Net downsampling path
-
+            if i < len(res_hidden_states_tuple):
+                res_hidden_state = res_hidden_states_tuple[i]
+
+                # Ensure spatial dimensions match
+                if hidden_states.shape[2:] != res_hidden_state.shape[2:]:
+                    res_hidden_state = F.interpolate(
+                        res_hidden_state,
+                        size=hidden_states.shape[2:],
+                        mode='bilinear',
+                        align_corners=False
+                    )
+
+                # Ensure channel dimensions match (project if needed)
+                expected_channels = self.resnets[i].conv1.in_channels - hidden_states.shape[1]
+                if res_hidden_state.shape[1] != expected_channels:
+                    # Project skip connection to expected channels
+                    # (note: this applies a fresh random, non-learned 1x1 kernel on every call)
+                    res_hidden_state = nn.functional.conv2d(
+                        res_hidden_state,
+                        torch.randn(expected_channels, res_hidden_state.shape[1], 1, 1, device=res_hidden_state.device) * 0.01,
+                        padding=0
+                    )
+
+                hidden_states = torch.cat([hidden_states, res_hidden_state], dim=1)

             hidden_states = resnet(hidden_states, temb)

             if attn is not None and encoder_hidden_states is not None:
                 hidden_states = attn(hidden_states, encoder_hidden_states)

+        # Upsample AFTER all resnet layers
         if self.upsamplers is not None:
             for upsampler in self.upsamplers:
                 hidden_states = upsampler(hidden_states)

@@ -272,6 +307,46 @@
         return hidden_states


+class TimestepEmbedding(nn.Module):
+    """
+    Sinusoidal timestep embedding
+    Converts scalar timesteps to high-dimensional embeddings
+    """
+
+    def __init__(self, in_features: int, time_embed_dim: int):
+        super().__init__()
+        self.in_features = in_features
+        self.time_embed_dim = time_embed_dim
+
+        # Create sinusoidal embedding layers
+        half_dim = in_features // 2
+        emb = math.log(10000) / (half_dim - 1)
+        self.register_buffer('emb', torch.exp(-emb * torch.arange(half_dim)))
+
+        # Projection layers
+        self.linear_1 = nn.Linear(in_features, time_embed_dim)
+        self.activation = nn.SiLU(inplace=True)
+        self.linear_2 = nn.Linear(time_embed_dim, time_embed_dim)
+
+    def forward(self, timestep: torch.Tensor) -> torch.Tensor:
+        # Ensure timestep has correct shape [batch_size, 1]
+        if timestep.ndim == 0:
+            timestep = timestep.view(1, 1)
+        elif timestep.ndim == 1:
+            timestep = timestep.view(-1, 1)
+
+        # Apply sinusoidal embedding
+        emb = timestep * self.emb
+        emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
+
+        # Project through MLP
+        emb = self.linear_1(emb)
+        emb = self.activation(emb)
+        emb = self.linear_2(emb)
+
+        return emb
+
+
 class UNet2DConditionModel(nn.Module):
     """
     Main UNet architecture for diffusion-based image generation

@@ -297,11 +372,7 @@ class UNet2DConditionModel(nn.Module):

         # Time embedding
         time_embed_dim = block_out_channels[0] * 4
-        self.time_proj =
-            nn.Linear(block_out_channels[0], time_embed_dim),
-            nn.SiLU(inplace=True),
-            nn.Linear(time_embed_dim, time_embed_dim),
-        )
+        self.time_proj = TimestepEmbedding(block_out_channels[0], time_embed_dim)

         # Input convolution
         self.conv_in = nn.Conv2d(in_channels, block_out_channels[0], kernel_size=3, stride=1, padding=1)

@@ -352,16 +423,19 @@
         reversed_block_out_channels = list(reversed(block_out_channels))

         for i, up_block_type in enumerate(["up", "up", "up", "up"]):
-
+            # Input channels: from previous up block (or mid block for first up block)
+            in_channels = block_out_channels[-1] if i == 0 else reversed_block_out_channels[i - 1]
             output_channel = reversed_block_out_channels[i]
+            # Skip connections have same channels as up block output
+            skip_channels = output_channel
             is_final_block = i == len(block_out_channels) - 1

             up_block = UpBlock2D(
-                in_channels=
+                in_channels=in_channels,
                 out_channels=output_channel,
-                prev_output_channel=
+                prev_output_channel=skip_channels,
                 temb_channels=time_embed_dim,
-                num_layers=layers_per_block
+                num_layers=layers_per_block,  # Same as down blocks
                 add_upsample=not is_final_block,
                 has_cross_attention=True,
                 cross_attention_dim=cross_attention_dim,

@@ -370,7 +444,7 @@
             self.up_blocks.append(up_block)

         # Output
-        self.conv_norm_out = nn.GroupNorm(
+        self.conv_norm_out = nn.GroupNorm(num_groups=32, num_channels=block_out_channels[0], eps=1e-6)
         self.conv_act = nn.SiLU(inplace=True)
         self.conv_out = nn.Conv2d(block_out_channels[0], out_channels, kernel_size=3, stride=1, padding=1)

@@ -380,8 +454,8 @@
         timestep: torch.Tensor,
         encoder_hidden_states: torch.Tensor,
     ) -> torch.Tensor:
-        # Time embedding
-        timesteps_proj = self.time_proj(timestep)
+        # Time embedding - convert timestep to float for the linear layers
+        timesteps_proj = self.time_proj(timestep.float())
         temb = timesteps_proj

         # Initial convolution

@@ -451,7 +525,7 @@ class AutoencoderKL(nn.Module):
         for i in range(len(down_block_types)):
             block = nn.Sequential(
                 nn.Conv2d(channels[i], channels[i+1], kernel_size=3, stride=2, padding=1),
-                nn.GroupNorm(
+                nn.GroupNorm(num_groups=32, num_channels=channels[i+1], eps=1e-6),
                 nn.SiLU(inplace=True),
             )
             self.encoder.append(block)

@@ -466,7 +540,7 @@
         for i in range(len(up_block_types)):
             block = nn.Sequential(
                 nn.ConvTranspose2d(decoder_channels[i], decoder_channels[i+1], kernel_size=4, stride=2, padding=1),
-                nn.GroupNorm(
+                nn.GroupNorm(num_groups=32, num_channels=decoder_channels[i+1], eps=1e-6),
                 nn.SiLU(inplace=True),
             )
             self.decoder.append(block)
config.yaml CHANGED
@@ -59,7 +59,7 @@ training:
   epochs: 100
   batch_size: 4
   gradient_accumulation_steps: 1
-  learning_rate:
+  learning_rate: 0.00001
   lr_scheduler: "constant_with_warmup"
   lr_warmup_steps: 500
   max_grad_norm: 1.0
create_test_dataset.py ADDED
@@ -0,0 +1,106 @@
"""
Create Test Dataset
Generate sample images for testing the training pipeline
"""

from PIL import Image, ImageDraw, ImageFont
import numpy as np
from pathlib import Path
import random


def create_test_dataset(output_dir: str = "./dataset", num_images: int = 10):
    """
    Create a test dataset with synthetic images

    Args:
        output_dir: Output directory
        num_images: Number of test images to create
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    print(f"Creating {num_images} test images in {output_path}...")

    # Color palettes
    colors = [
        (255, 99, 71),    # Tomato
        (64, 224, 208),   # Turquoise
        (255, 215, 0),    # Gold
        (138, 43, 226),   # Blue Violet
        (50, 205, 50),    # Lime Green
        (255, 165, 0),    # Orange
        (219, 112, 147),  # Pale Violet Red
        (70, 130, 180),   # Steel Blue
        (255, 192, 203),  # Pink
        (144, 238, 144),  # Light Green
    ]

    shapes = ['circle', 'rectangle', 'triangle']

    captions = []

    for i in range(num_images):
        # Generate random image
        img_size = 512
        img = Image.new('RGB', (img_size, img_size), color=(240, 240, 240))
        draw = ImageDraw.Draw(img)

        # Random parameters
        num_shapes = random.randint(3, 8)
        bg_color = random.choice(colors)

        # Draw background gradient
        for y in range(img_size):
            r = int(bg_color[0] * (0.8 + 0.2 * y / img_size))
            g = int(bg_color[1] * (0.8 + 0.2 * y / img_size))
            b = int(bg_color[2] * (0.8 + 0.2 * y / img_size))
            draw.line([(0, y), (img_size, y)], fill=(r, g, b))

        # Draw random shapes
        for _ in range(num_shapes):
            shape = random.choice(shapes)
            color = tuple(random.randint(50, 255) for _ in range(3))

            x1 = random.randint(50, img_size - 50)
            y1 = random.randint(50, img_size - 50)
            size = random.randint(30, 100)

            if shape == 'circle':
                bbox = [x1, y1, x1 + size, y1 + size]
                draw.ellipse(bbox, fill=color, outline=(0, 0, 0))
            elif shape == 'rectangle':
                bbox = [x1, y1, x1 + size, y1 + size // 2]
                draw.rectangle(bbox, fill=color, outline=(0, 0, 0))
            elif shape == 'triangle':
                points = [
                    (x1, y1),
                    (x1 + size, y1),
                    (x1 + size // 2, y1 + size)
                ]
                draw.polygon(points, fill=color, outline=(0, 0, 0))

        # Save image
        img_path = output_path / f"test_image_{i+1:03d}.jpg"
        img.save(img_path, quality=95)

        # Create caption (note: bg_color[0] is just the red channel value,
        # which is why the saved captions read like "on a 255 background")
        caption = f"A colorful abstract composition with {num_shapes} geometric shapes on a {bg_color[0]} background"
        captions.append(caption)

        # Save caption
        caption_path = output_path / f"test_image_{i+1:03d}.txt"
        with open(caption_path, 'w', encoding='utf-8') as f:
            f.write(caption)

        print(f"  Created: {img_path.name}")

    print(f"\n✓ Test dataset created successfully!")
    print(f"  Location: {output_path.absolute()}")
    print(f"  Images: {num_images}")
    print(f"\nTo train with this dataset:")
    print(f"  python train.py --config config.yaml --train_data {output_path}")


if __name__ == "__main__":
    create_test_dataset()
dataset/test_image_001.txt ADDED
@@ -0,0 +1 @@
A colorful abstract composition with 7 geometric shapes on a 255 background

dataset/test_image_002.txt ADDED
@@ -0,0 +1 @@
A colorful abstract composition with 3 geometric shapes on a 255 background

dataset/test_image_003.txt ADDED
@@ -0,0 +1 @@
A colorful abstract composition with 3 geometric shapes on a 219 background

dataset/test_image_004.txt ADDED
@@ -0,0 +1 @@
A colorful abstract composition with 4 geometric shapes on a 255 background

dataset/test_image_005.txt ADDED
@@ -0,0 +1 @@
A colorful abstract composition with 6 geometric shapes on a 70 background

dataset/test_image_006.txt ADDED
@@ -0,0 +1 @@
A colorful abstract composition with 5 geometric shapes on a 50 background

dataset/test_image_007.txt ADDED
@@ -0,0 +1 @@
A colorful abstract composition with 8 geometric shapes on a 138 background

dataset/test_image_008.txt ADDED
@@ -0,0 +1 @@
A colorful abstract composition with 7 geometric shapes on a 255 background

dataset/test_image_009.txt ADDED
@@ -0,0 +1 @@
A colorful abstract composition with 8 geometric shapes on a 64 background

dataset/test_image_010.txt ADDED
@@ -0,0 +1 @@
A colorful abstract composition with 6 geometric shapes on a 64 background
debug_unet.py ADDED
@@ -0,0 +1,18 @@
"""Debug UNet channel dimensions"""
from bytedream.model import UNet2DConditionModel

unet = UNet2DConditionModel()

print("Block out channels:", unet.block_out_channels)
print("\nDown blocks:")
for i, block in enumerate(unet.down_blocks):
    print(f"  Down {i}: {len(block.resnets)} resnets")

print("\nUp blocks:")
reversed_block_out_channels = list(reversed(unet.block_out_channels))
for i, block in enumerate(unet.up_blocks):
    in_channels = unet.block_out_channels[-1] if i == 0 else reversed_block_out_channels[i - 1]
    output_channel = reversed_block_out_channels[i]
    skip_channels = reversed_block_out_channels[min(i + 1, len(unet.block_out_channels) - 1)]
    print(f"  Up {i}: in={in_channels}, out={output_channel}, skips={skip_channels}")
    print(f"    ResNets expect: {[block.resnets[j].conv1.in_channels for j in range(len(block.resnets))]}")
quick_fix.bat ADDED
@@ -0,0 +1,58 @@
@echo off
REM Byte Dream - Quick Fix and Setup Script for Windows
REM Run this script to fix the model loading issue

echo ============================================================
echo Byte Dream - Quick Fix and Setup
echo ============================================================
echo.

REM Check if Python is installed
python --version >nul 2>&1
if %errorlevel% neq 0 (
    echo ERROR: Python not found! Please install Python 3.8+
    pause
    exit /b 1
)

echo Step 1: Checking for trained model...
echo.

if exist "models\bytedream" (
    echo Found model at: models\bytedream
) else if exist "models" (
    echo Found models directory
) else (
    echo WARNING: No trained model found!
    echo.
    echo To train the model, run:
    echo   python train.py --epochs 1000 --batch_size 4
    echo.
)

echo.
echo Step 2: Testing pipeline with random initialization...
echo.

python quick_fix.py

echo.
echo ============================================================
echo Next Steps:
echo ============================================================
echo.
echo 1. If you want to train the model:
echo    python train.py --epochs 1000 --batch_size 4
echo.
echo 2. If you want to upload to Hugging Face:
echo    a. Install huggingface_hub: pip install huggingface_hub
echo    b. Login: huggingface-cli login
echo    c. Upload: python upload_to_hf.py --repo_id "YourUsername/ByteDream" --create_space
echo.
echo 3. To use the web interface now:
echo    python app.py
echo.
echo For detailed instructions, see: UPLOAD_GUIDE_PT.md
echo ============================================================
echo.
pause
quick_fix.py ADDED
@@ -0,0 +1,152 @@
"""
Quick Setup Script for Byte Dream
Fixes the model loading issue and helps upload to Hugging Face
"""

import os
from pathlib import Path


def check_model_exists():
    """Check if trained model exists"""
    model_paths = [
        "./models/bytedream",
        "./models",
        "./bytedream",
    ]

    for path in model_paths:
        if Path(path).exists():
            print(f"✓ Found model at: {path}")
            return path

    print("⚠ No trained model found!")
    print("\nTo train the model, run:")
    print("  python train.py --epochs 1000 --batch_size 4")
    print("\nOr download pretrained weights from Hugging Face.")
    return None


def test_inference():
    """Test inference with random initialization (no model needed)"""
    print("\n" + "="*60)
    print("Testing Byte Dream with random initialization")
    print("="*60)

    try:
        from bytedream.generator import ByteDreamGenerator

        # Initialize without model path (will use random weights)
        generator = ByteDreamGenerator(
            model_path=None,  # No pretrained model
            config_path="config.yaml",
            device="cpu",
        )

        print("\nGenerating test image with random weights...")
        print("(This will produce random noise, but tests the pipeline)")

        image = generator.generate(
            prompt="A test image",
            width=256,
            height=256,
            num_inference_steps=10,  # Fast test
        )

        image.save("test_output.png")
        print(f"\n✓ Test image saved to: test_output.png")
        print("\nNote: This image looks like noise because we're using random weights.")
        print("To generate meaningful images, you need to train the model first.")

        return True

    except Exception as e:
        print(f"\n❌ Error during test: {e}")
        import traceback
        traceback.print_exc()
        return False


def upload_to_hf_guide():
    """Guide for uploading to Hugging Face"""
    print("\n" + "="*60)
    print("Hugging Face Upload Guide")
    print("="*60)

    print("""
To upload your model to Hugging Face Hub:

STEP 1: Install required packages
----------------------------------
pip install huggingface_hub

STEP 2: Login to Hugging Face
------------------------------
huggingface-cli login

Then paste your token from: https://huggingface.co/settings/tokens

STEP 3: Train your model (if not done already)
-----------------------------------------------
python train.py --epochs 1000 --batch_size 4 --output_dir ./models/bytedream

STEP 4: Upload to Hugging Face
-------------------------------
python upload_to_hf.py --repo_id "YourUsername/ByteDream" --create_space

Replace 'YourUsername' with your actual Hugging Face username.

STEP 5: Update app.py to use the uploaded model
------------------------------------------------
After uploading, modify app.py to load from Hugging Face:

```python
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("YourUsername/ByteDream")
```

TIPS:
-----
- Make sure your model directory contains the trained weights
- Use --private flag if you want to keep the model private
- The --create_space option creates files for Hugging Face Spaces deployment
- Check your repository at: https://huggingface.co/YourUsername

For more help, see:
- https://huggingface.co/docs/hub/spaces
- https://huggingface.co/docs/huggingface_hub/guides/cli
""")


def main():
    print("\n" + "="*60)
    print("Byte Dream - Quick Setup & Troubleshooting")
    print("="*60)

    # Check if model exists
    model_path = check_model_exists()

    # Test inference
    if model_path or True:  # Always test (can work without model)
        success = test_inference()

        if success:
            print("\n✓ Pipeline is working!")
            print("\nNext steps:")
            print("1. Train the model: python train.py")
            print("2. Or upload to Hugging Face (see guide below)")

    # Show upload guide
    upload_to_hf_guide()

    print("\n" + "="*60)
    print("Current status:")
    print("  - app.py has been fixed to handle missing models gracefully")
    print("  - You can now run: python app.py")
    print("  - Follow the upload guide above to deploy to Hugging Face")
    print("="*60)


if __name__ == "__main__":
    main()
train.py CHANGED
@@ -34,10 +34,19 @@ class ImageTextDataset(Dataset):
         center_crop: bool = True,
     ):
         self.data_dir = Path(data_dir)
+
+        # Check if directory exists
+        if not self.data_dir.exists():
+            raise FileNotFoundError(f"Dataset directory not found: {self.data_dir}\nPlease create the directory and add images, or use --train_data with a valid path.")
+
         self.image_paths = list(self.data_dir.glob("*.jpg")) + \
                            list(self.data_dir.glob("*.png")) + \
                            list(self.data_dir.glob("*.jpeg"))

+        # Check if there are any images
+        if len(self.image_paths) == 0:
+            raise ValueError(f"No images found in {self.data_dir}\nSupported formats: .jpg, .png, .jpeg")
+
         self.image_size = image_size
         self.random_flip = random_flip
         self.random_crop = random_crop

@@ -216,6 +225,8 @@ class LatentDiffusionTrainer:
         """Encode images to latent space"""
         with torch.no_grad():
             latents = self.vae.encode(images)
+            # Use only the mean part of the VAE output (first half of channels)
+            latents = latents[:, :4]  # Take first 4 channels (mean, not log_var)
             latents = latents * 0.18215  # Scale factor
         return latents