dippoo Claude Opus 4.6 committed
Commit 27fea48 · 1 Parent(s): b051fff

Add pod generation with FLUX.2, persistent state, training improvements


- Pod management: network volume support, HTTPS proxy URLs, fire-and-forget SSH
- FLUX.2 workflow: separate UNETLoader + CLIPLoader + VAELoader with Comfy-Org text encoder
- Persist pod state to disk (survives server restarts)
- Training: persistent job tracking in DB, live log streaming
- Remove tracked __pycache__ files, add .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Files changed (32)
  1. .gitignore +3 -0
  2. CLAUDE.md +219 -0
  3. config/models.yaml +20 -14
  4. src/content_engine/__pycache__/__init__.cpython-311.pyc +0 -0
  5. src/content_engine/__pycache__/config.cpython-311.pyc +0 -0
  6. src/content_engine/__pycache__/main.cpython-311.pyc +0 -0
  7. src/content_engine/api/__pycache__/__init__.cpython-311.pyc +0 -0
  8. src/content_engine/api/__pycache__/routes_catalog.cpython-311.pyc +0 -0
  9. src/content_engine/api/__pycache__/routes_generation.cpython-311.pyc +0 -0
  10. src/content_engine/api/__pycache__/routes_pod.cpython-311.pyc +0 -0
  11. src/content_engine/api/__pycache__/routes_system.cpython-311.pyc +0 -0
  12. src/content_engine/api/__pycache__/routes_training.cpython-311.pyc +0 -0
  13. src/content_engine/api/__pycache__/routes_ui.cpython-311.pyc +0 -0
  14. src/content_engine/api/__pycache__/routes_video.cpython-311.pyc +0 -0
  15. src/content_engine/api/routes_pod.py +441 -112
  16. src/content_engine/api/routes_training.py +51 -21
  17. src/content_engine/api/ui.html +156 -20
  18. src/content_engine/models/__pycache__/__init__.cpython-311.pyc +0 -0
  19. src/content_engine/models/__pycache__/database.cpython-311.pyc +0 -0
  20. src/content_engine/models/__pycache__/schemas.cpython-311.pyc +0 -0
  21. src/content_engine/models/database.py +29 -7
  22. src/content_engine/services/__pycache__/__init__.cpython-311.pyc +0 -0
  23. src/content_engine/services/__pycache__/catalog.cpython-311.pyc +0 -0
  24. src/content_engine/services/__pycache__/comfyui_client.cpython-311.pyc +0 -0
  25. src/content_engine/services/__pycache__/lora_trainer.cpython-311.pyc +0 -0
  26. src/content_engine/services/__pycache__/runpod_trainer.cpython-311.pyc +0 -0
  27. src/content_engine/services/__pycache__/template_engine.cpython-311.pyc +0 -0
  28. src/content_engine/services/__pycache__/variation_engine.cpython-311.pyc +0 -0
  29. src/content_engine/services/__pycache__/workflow_builder.cpython-311.pyc +0 -0
  30. src/content_engine/services/runpod_trainer.py +494 -68
  31. src/content_engine/workers/__pycache__/__init__.cpython-311.pyc +0 -0
  32. src/content_engine/workers/__pycache__/local_worker.cpython-311.pyc +0 -0
.gitignore ADDED
@@ -0,0 +1,3 @@
+ __pycache__/
+ *.pyc
+ pod_state.json
CLAUDE.md ADDED
@@ -0,0 +1,219 @@
# Content Engine

Automated AI content generation system with cloud APIs, LoRA training, and multi-backend support.

## Repositories

- **Local Development**: `D:\AI automation\content_engine\`
- **HuggingFace Deployment**: `D:\AI automation\content-engine\` (deployed to https://huggingface.co/spaces/dippoo/content-engine)

Always sync changes between both directories when modifying code.

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│ Frontend (ui.html)                                          │
│ Generate | Batch | Gallery | Train LoRA | Status | Settings │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ FastAPI Backend (main.py)                                   │
│ routes_generation | routes_video | routes_training | etc    │
└─────────────────────────────────────────────────────────────┘
          ┌─────────────────────┼─────────────────────┐
          ▼                     ▼                     ▼
  ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
  │   Local GPU   │     │    RunPod     │     │  Cloud APIs   │
  │   (ComfyUI)   │     │ (Serverless)  │     │  (WaveSpeed)  │
  └───────────────┘     └───────────────┘     └───────────────┘
```

## Cloud Providers

### WaveSpeed (wavespeed_provider.py)
Primary cloud API for image/video generation. Uses direct HTTP API (SDK optional).

**Text-to-Image Models:**
- `seedream-4.5` - Best quality, NSFW OK (ByteDance)
- `seedream-4`, `seedream-3.1` - NSFW friendly
- `gpt-image-1.5`, `gpt-image-1-mini` - OpenAI models
- `nano-banana-pro`, `nano-banana` - Google models
- `wan-2.6`, `wan-2.5` - Alibaba models
- `kling-image-o3` - Kuaishou

**Image-to-Image (Edit) Models:**
- `seedream-4.5-edit` - Best for face preservation
- `seedream-4.5-multi`, `seedream-4-multi` - Multi-reference (up to 3 images)
- `kling-o1-multi` - Multi-reference (up to 10 images)
- `wan-2.6-edit`, `wan-2.5-edit` - NSFW friendly

**Image-to-Video Models:**
- `wan-2.6-i2v-pro` - Best quality ($0.05/s)
- `wan-2.6-i2v-flash` - Fast
- `kling-o3-pro`, `kling-o3` - Kuaishou
- `higgsfield-dop` - Cinematic 5s clips
- `veo-3.1`, `sora-2` - Premium models

**API Pattern:**
```python
# WaveSpeed returns async jobs - must poll for result
response = {"data": {"outputs": [], "urls": {"get": "poll_url"}}}
# Poll urls.get until outputs[] is populated
```

### RunPod
- **Training**: Cloud GPU for LoRA training (runpod_trainer.py)
- **Generation**: Serverless endpoint for inference (runpod_provider.py)

## Character System

Characters link a trained LoRA to generation:

**Config file** (`config/characters/alice.yaml`):
```yaml
id: alice
name: "Alice"
trigger_word: "alicechar"  # Activates the LoRA
lora_filename: "alice_v1.safetensors"  # In D:\ComfyUI\Models\Lora\
lora_strength: 0.85
```

**Generation flow:**
1. User selects character from dropdown
2. System prepends trigger word: `"alicechar, a woman in red dress"`
3. LoRA is loaded into workflow (local/RunPod only)
4. Character identity is preserved in output

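Step 2 amounts to a one-line prompt transform; a minimal sketch (the helper name is illustrative, not the app's actual function):

```python
def apply_trigger_word(trigger_word: str, user_prompt: str) -> str:
    """Prepend the character's trigger word so the LoRA activates (step 2 above)."""
    if not trigger_word:
        return user_prompt
    return f"{trigger_word}, {user_prompt}"
```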
**For cloud-only (no local GPU):**
- Use img2img with reference photo
- Or deploy LoRA to RunPod serverless endpoint

## Templates

Prompt recipes with variables (`config/templates/*.yaml`):
```yaml
id: portrait_glamour
name: "Glamour Portrait"
positive: "{{character}}, {{pose}}, {{lighting}}, professional photo"
variables:
  - name: pose
    options: ["standing", "sitting", "leaning"]
  - name: lighting
    options: ["studio", "natural", "dramatic"]
```

## Key Files

### API Routes
- `routes_generation.py` - txt2img, img2img endpoints
- `routes_video.py` - img2video, WaveSpeed/Higgsfield video
- `routes_training.py` - LoRA training jobs
- `routes_catalog.py` - Gallery/image management
- `routes_system.py` - Health checks, character list

### Services
- `wavespeed_provider.py` - WaveSpeed API client (SDK optional, uses httpx)
- `runpod_trainer.py` - Cloud LoRA training
- `runpod_provider.py` - Cloud generation endpoint
- `comfyui_client.py` - Local ComfyUI integration
- `workflow_builder.py` - ComfyUI workflow JSON builder
- `template_engine.py` - Prompt template rendering
- `variation_engine.py` - Batch variation generation

### Frontend
- `ui.html` - Single-page app with all UI

## Environment Variables

```env
# Cloud APIs
WAVESPEED_API_KEY=ws_xxx           # WaveSpeed.ai API key
RUNPOD_API_KEY=xxx                 # RunPod API key
RUNPOD_ENDPOINT_ID=xxx             # RunPod serverless endpoint (for generation)

# Optional
HIGGSFIELD_API_KEY=xxx             # Higgsfield (Kling 3.0, etc.)
COMFYUI_URL=http://127.0.0.1:8188  # Local ComfyUI
```

## Database

SQLite with async (aiosqlite):
- `images` - Generated image catalog
- `characters` - Character profiles
- `generation_jobs` - Job tracking
- `scheduled_posts` - Publishing queue
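A minimal sketch of the `images` table; the column names and the sample path here are illustrative assumptions, not the app's actual schema. The app uses aiosqlite, but the SQL is identical under the synchronous stdlib driver:

```python
import sqlite3

# In-memory DB for illustration; the real catalog lives on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS images ("
    "  id INTEGER PRIMARY KEY,"
    "  path TEXT NOT NULL,"
    "  character_id TEXT,"
    "  created_at TEXT DEFAULT CURRENT_TIMESTAMP"
    ")"
)
conn.execute(
    "INSERT INTO images (path, character_id) VALUES (?, ?)",
    ("output/pod/sfw/raw/img_001.png", "alice"),
)
rows = conn.execute(
    "SELECT path FROM images WHERE character_id = ?", ("alice",)
).fetchall()
```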

## UI Structure

**Generate Page:**
- Mode chips: Text to Image | Image to Image | Image to Video
- Backend chips: Local GPU | RunPod GPU | Cloud API
- Model dropdowns (conditional on mode/backend)
- Character/Template selectors (2-column grid)
- Prompt textareas
- Output settings (aspect ratio, seed)

**Controls Panel:** 340px width, compact styling
**Drop Zones:** For reference images (character + pose)

## Common Issues

### "Product not found" from WaveSpeed
The model ID doesn't exist. Check `MODEL_MAP`, `EDIT_MODEL_MAP`, `VIDEO_MODEL_MAP` in wavespeed_provider.py against https://wavespeed.ai/models

### "No image URL in output"
WaveSpeed returned an async job. Check that `outputs` is empty and `urls.get` exists, then poll that URL.

### HuggingFace Space startup hang
Check requirements.txt for missing packages. Common: `python-dotenv`, `runpod`, `wavespeed` (optional).

### Import errors on HF Spaces
Make imports optional with try/except:
```python
try:
    from wavespeed import Client
    SDK_AVAILABLE = True
except ImportError:
    SDK_AVAILABLE = False
```

## Development Commands

```bash
# Run locally
cd content_engine
python -m uvicorn content_engine.main:app --port 8000 --reload

# Push to HuggingFace
cd content-engine
git add . && git commit -m "message" && git push origin main

# Sync local ↔ HF
cp content_engine/src/content_engine/file.py content-engine/src/content_engine/file.py
```

## Multi-Reference Image Support

For img2img with 2 reference images (character + pose):

1. **UI**: Two drop zones side-by-side
2. **API**: `image` (required) + `image2` (optional) in FormData
3. **Backend**: Both uploaded to temp URLs, sent to WaveSpeed
4. **Models**: SeeDream Sequential, Kling O1 support multi-ref

## Pricing Notes

- **WaveSpeed**: ~$0.003-0.01 per image, $0.01-0.05/s for video
- **RunPod**: ~$0.0002/s for GPU time (training/generation)
- Cloud API cheaper for light use; RunPod better for volume

## Future Improvements

- [ ] RunPod serverless endpoint for LoRA-based generation
- [ ] Auto-captioning for training images
- [ ] Batch video generation
- [ ] Publishing integrations (social media APIs)
config/models.yaml CHANGED
@@ -5,25 +5,31 @@ training_models:
   # FLUX - Best for photorealistic images (recommended for realistic person)
   flux2_dev:
     name: "FLUX.2 Dev (Recommended)"
-    description: "Latest FLUX model, 32B params, best quality for realistic person. Also supports multi-reference without training."
+    description: "Latest FLUX model, 32B params, best quality for realistic person. Uses Mistral text encoder."
     hf_repo: "black-forest-labs/FLUX.2-dev"
-    hf_filename: "flux.2-dev.safetensors"
-    model_type: "flux"
+    hf_filename: "flux2-dev.safetensors"
+    model_type: "flux2"
+    training_framework: "musubi-tuner"
     resolution: 1024
-    learning_rate: 1e-3
-    text_encoder_lr: 1e-4
-    network_rank: 48
-    network_alpha: 24
-    clip_skip: 1
-    optimizer: "AdamW8bit"
-    lr_scheduler: "cosine"
-    min_snr_gamma: 5
-    max_train_steps: 1200
+    learning_rate: 1.0
+    network_rank: 64
+    network_alpha: 32
+    optimizer: "prodigy"
+    lr_scheduler: "constant"
+    timestep_sampling: "flux2_shift"
+    network_module: "networks.lora_flux_2"
+    max_train_steps: 50
     fp8_base: true
+    gradient_checkpointing: true
     use_case: "images"
-    vram_required_gb: 24
+    vram_required_gb: 48
+    recommended_gpu: "NVIDIA RTX A6000"
     recommended_images: "15-30 high quality photos with detailed captions"
-    training_script: "flux_train_network.py"
+    training_script: "flux_2_train_network.py"
+    # Model paths on network volume:
+    # DiT: /workspace/models/FLUX.2-dev/flux2-dev.safetensors
+    # VAE: /workspace/models/FLUX.2-dev/vae/diffusion_pytorch_model.safetensors
+    # Text encoder: /workspace/models/FLUX.2-dev/text_encoder/model-00001-of-00010.safetensors
 
   flux1_dev:
     name: "FLUX.1 Dev"
src/content_engine/__pycache__/__init__.cpython-311.pyc DELETED
Binary file (279 Bytes)
 
src/content_engine/__pycache__/config.cpython-311.pyc DELETED
Binary file (6.33 kB)
 
src/content_engine/__pycache__/main.cpython-311.pyc DELETED
Binary file (10.6 kB)
 
src/content_engine/api/__pycache__/__init__.cpython-311.pyc DELETED
Binary file (211 Bytes)
 
src/content_engine/api/__pycache__/routes_catalog.cpython-311.pyc DELETED
Binary file (7.2 kB)
 
src/content_engine/api/__pycache__/routes_generation.cpython-311.pyc DELETED
Binary file (28 kB)
 
src/content_engine/api/__pycache__/routes_pod.cpython-311.pyc DELETED
Binary file (25.1 kB)
 
src/content_engine/api/__pycache__/routes_system.cpython-311.pyc DELETED
Binary file (10.7 kB)
 
src/content_engine/api/__pycache__/routes_training.cpython-311.pyc DELETED
Binary file (12.6 kB)
 
src/content_engine/api/__pycache__/routes_ui.cpython-311.pyc DELETED
Binary file (1.23 kB)
 
src/content_engine/api/__pycache__/routes_video.cpython-311.pyc DELETED
Binary file (13.3 kB)
 
src/content_engine/api/routes_pod.py CHANGED
@@ -1,12 +1,18 @@
- """RunPod Pod management routes — start/stop GPU pods for generation and training."""
 
  from __future__ import annotations
 
  import asyncio
  import logging
  import os
  import time
  import uuid
  from typing import Any
 
  import runpod
@@ -17,28 +23,85 @@ logger = logging.getLogger(__name__)
 
  router = APIRouter(prefix="/api/pod", tags=["pod"])
 
  # Pod state
  _pod_state = {
      "pod_id": None,
-     "status": "stopped",  # stopped, starting, running, stopping
      "ip": None,
-     "port": None,
-     "gpu_type": "NVIDIA GeForce RTX 4090",
      "started_at": None,
-     "cost_per_hour": 0.44,
  }
 
- # Docker image with ComfyUI + FLUX
- COMFYUI_IMAGE = "timpietruskyblibla/runpod-worker-comfy:3.4.0-flux1-dev"
 
- # GPU options
  GPU_OPTIONS = {
-     "NVIDIA GeForce RTX 4090": {"name": "RTX 4090", "vram": 24, "cost": 0.44},
-     "NVIDIA RTX A6000": {"name": "RTX A6000", "vram": 48, "cost": 0.76},
-     "NVIDIA A100 80GB PCIe": {"name": "A100 80GB", "vram": 80, "cost": 1.89},
  }
 
 
  def _get_api_key() -> str:
      key = os.environ.get("RUNPOD_API_KEY")
      if not key:
@@ -48,7 +111,8 @@ def _get_api_key() -> str:
 
 
  class StartPodRequest(BaseModel):
-     gpu_type: str = "NVIDIA GeForce RTX 4090"
 
 
  class PodStatus(BaseModel):
@@ -57,7 +121,9 @@ class PodStatus(BaseModel):
      ip: str | None = None
      port: int | None = None
      gpu_type: str | None = None
      cost_per_hour: float | None = None
      uptime_minutes: float | None = None
      comfyui_url: str | None = None
 
@@ -67,44 +133,52 @@ async def get_pod_status():
      """Get current pod status."""
      _get_api_key()
 
-     # If we have a pod_id, check its actual status
      if _pod_state["pod_id"]:
          try:
-             pod = runpod.get_pod(_pod_state["pod_id"])
              if pod:
                  desired = pod.get("desiredStatus", "")
                  if desired == "RUNNING":
-                     runtime = pod.get("runtime", {})
-                     ports = runtime.get("ports", [])
                      for p in ports:
                          if p.get("privatePort") == 8188:
-                             _pod_state["ip"] = p.get("ip")
-                             _pod_state["port"] = p.get("publicPort")
-                             _pod_state["status"] = "running"
                  elif desired == "EXITED":
                      _pod_state["status"] = "stopped"
                      _pod_state["pod_id"] = None
                  else:
                      _pod_state["status"] = "stopped"
                      _pod_state["pod_id"] = None
          except Exception as e:
              logger.warning("Failed to check pod: %s", e)
 
      uptime = None
-     if _pod_state["started_at"] and _pod_state["status"] == "running":
          uptime = (time.time() - _pod_state["started_at"]) / 60
 
-     comfyui_url = None
-     if _pod_state["ip"] and _pod_state["port"]:
-         comfyui_url = f"http://{_pod_state['ip']}:{_pod_state['port']}"
 
      return PodStatus(
          status=_pod_state["status"],
          pod_id=_pod_state["pod_id"],
          ip=_pod_state["ip"],
-         port=_pod_state["port"],
          gpu_type=_pod_state["gpu_type"],
          cost_per_hour=_pod_state["cost_per_hour"],
          uptime_minutes=uptime,
          comfyui_url=comfyui_url,
      )
@@ -116,12 +190,24 @@ async def list_gpu_options():
      return {"gpus": GPU_OPTIONS}
 
 
  @router.post("/start")
  async def start_pod(request: StartPodRequest):
-     """Start a GPU pod for generation/training."""
      _get_api_key()
 
-     if _pod_state["status"] == "running":
          return {"status": "already_running", "pod_id": _pod_state["pod_id"]}
 
      if _pod_state["status"] == "starting":
@@ -134,74 +220,284 @@ async def start_pod(request: StartPodRequest):
      _pod_state["status"] = "starting"
      _pod_state["gpu_type"] = request.gpu_type
      _pod_state["cost_per_hour"] = gpu_info["cost"]
 
      try:
-         logger.info("Starting RunPod with %s...", request.gpu_type)
-
-         pod = runpod.create_pod(
-             name="content-engine-gpu",
-             image_name=COMFYUI_IMAGE,
-             gpu_type_id=request.gpu_type,
-             volume_in_gb=50,  # For models and LoRAs
-             container_disk_in_gb=20,
-             ports="8188/http",
-             env={
-                 # Pre-load FLUX model
-                 "MODEL_URL": "https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/flux1-dev.safetensors",
-             },
          )
 
          _pod_state["pod_id"] = pod["id"]
          _pod_state["started_at"] = time.time()
 
          logger.info("Pod created: %s", pod["id"])
 
-         # Start background task to wait for pod ready
-         asyncio.create_task(_wait_for_pod_ready(pod["id"]))
 
          return {
              "status": "starting",
              "pod_id": pod["id"],
-             "message": f"Starting {gpu_info['name']} pod (~2-3 min)",
          }
 
      except Exception as e:
          _pod_state["status"] = "stopped"
          logger.error("Failed to start pod: %s", e)
          raise HTTPException(500, f"Failed to start pod: {e}")
 
 
- async def _wait_for_pod_ready(pod_id: str, timeout: int = 300):
-     """Background task to wait for pod to be ready."""
      start = time.time()
 
      while time.time() - start < timeout:
          try:
-             pod = runpod.get_pod(pod_id)
-
              if pod and pod.get("desiredStatus") == "RUNNING":
-                 runtime = pod.get("runtime", {})
-                 ports = runtime.get("ports", [])
-
                  for p in ports:
                      if p.get("privatePort") == 8188:
-                         ip = p.get("ip")
-                         port = p.get("publicPort")
-
-                         if ip and port:
-                             _pod_state["ip"] = ip
-                             _pod_state["port"] = int(port)
-                             _pod_state["status"] = "running"
-                             logger.info("Pod ready at %s:%s", ip, port)
-                             return
-
          except Exception as e:
              logger.debug("Waiting for pod: %s", e)
-
          await asyncio.sleep(5)
 
-     logger.error("Pod did not become ready within %ds", timeout)
-     _pod_state["status"] = "stopped"
 
 
  @router.post("/stop")
@@ -221,20 +517,23 @@ async def stop_pod():
          pod_id = _pod_state["pod_id"]
          logger.info("Stopping pod: %s", pod_id)
 
-         runpod.terminate_pod(pod_id)
 
          _pod_state["pod_id"] = None
          _pod_state["ip"] = None
-         _pod_state["port"] = None
          _pod_state["status"] = "stopped"
          _pod_state["started_at"] = None
 
          logger.info("Pod stopped")
          return {"status": "stopped", "message": "Pod terminated"}
 
      except Exception as e:
          logger.error("Failed to stop pod: %s", e)
-         _pod_state["status"] = "running"  # Revert
          raise HTTPException(500, f"Failed to stop pod: {e}")
 
 
@@ -244,10 +543,11 @@ async def list_pod_loras():
      if _pod_state["status"] != "running" or not _pod_state["ip"]:
          return {"loras": [], "message": "Pod not running"}
 
      try:
          import httpx
          async with httpx.AsyncClient(timeout=30) as client:
-             url = f"http://{_pod_state['ip']}:{_pod_state['port']}/object_info/LoraLoader"
              resp = await client.get(url)
              if resp.status_code == 200:
                  data = resp.json()
@@ -256,15 +556,12 @@ async def list_pod_loras():
      except Exception as e:
          logger.warning("Failed to list pod LoRAs: %s", e)
 
-     return {"loras": [], "comfyui_url": f"http://{_pod_state['ip']}:{_pod_state['port']}"}
 
 
  @router.post("/upload-lora")
- async def upload_lora_to_pod(
-     file: UploadFile = File(...),
- ):
      """Upload a LoRA file to the running pod."""
-     from fastapi import UploadFile, File
      import httpx
 
      if _pod_state["status"] != "running":
@@ -275,13 +572,12 @@ async def upload_lora_to_pod(
 
      try:
          content = await file.read()
 
          async with httpx.AsyncClient(timeout=120) as client:
-             # Upload to ComfyUI's models/loras directory
-             url = f"http://{_pod_state['ip']}:{_pod_state['port']}/upload/image"
              files = {"image": (file.filename, content, "application/octet-stream")}
              data = {"subfolder": "loras", "type": "input"}
-
              resp = await client.post(url, files=files, data=data)
 
              if resp.status_code == 200:
@@ -326,7 +622,7 @@ async def generate_on_pod(request: PodGenerateRequest):
      job_id = str(uuid.uuid4())[:8]
      seed = request.seed if request.seed >= 0 else random.randint(0, 2**32 - 1)
 
-     # Build ComfyUI workflow
      workflow = _build_flux_workflow(
          prompt=request.prompt,
          negative_prompt=request.negative_prompt,
@@ -337,12 +633,14 @@ async def generate_on_pod(request: PodGenerateRequest):
          seed=seed,
          lora_name=request.lora_name,
          lora_strength=request.lora_strength,
      )
 
      try:
          async with httpx.AsyncClient(timeout=30) as client:
-             url = f"http://{_pod_state['ip']}:{_pod_state['port']}/prompt"
-             resp = await client.post(url, json={"prompt": workflow})
              resp.raise_for_status()
 
              data = resp.json()
@@ -356,15 +654,9 @@ async def generate_on_pod(request: PodGenerateRequest):
          }
 
          logger.info("Pod generation started: %s -> %s", job_id, prompt_id)
-
-         # Start background task to poll for completion
          asyncio.create_task(_poll_pod_job(job_id, prompt_id, request.content_rating))
 
-         return {
-             "job_id": job_id,
-             "status": "running",
-             "seed": seed,
-         }
 
      except Exception as e:
          logger.error("Pod generation failed: %s", e)
@@ -374,38 +666,33 @@ async def generate_on_pod(request: PodGenerateRequest):
  async def _poll_pod_job(job_id: str, prompt_id: str, content_rating: str):
      """Poll ComfyUI for job completion and save the result."""
      import httpx
-     from pathlib import Path
 
      start = time.time()
-     timeout = 300  # 5 minutes
 
      async with httpx.AsyncClient(timeout=60) as client:
          while time.time() - start < timeout:
              try:
-                 url = f"http://{_pod_state['ip']}:{_pod_state['port']}/history/{prompt_id}"
-                 resp = await client.get(url)
 
                  if resp.status_code == 200:
                      data = resp.json()
                      if prompt_id in data:
                          outputs = data[prompt_id].get("outputs", {})
 
-                         # Find SaveImage output
                          for node_id, node_output in outputs.items():
                              if "images" in node_output:
                                  image_info = node_output["images"][0]
                                  filename = image_info["filename"]
                                  subfolder = image_info.get("subfolder", "")
 
-                                 # Download the image
-                                 img_url = f"http://{_pod_state['ip']}:{_pod_state['port']}/view"
                                  params = {"filename": filename}
                                  if subfolder:
                                      params["subfolder"] = subfolder
 
-                                 img_resp = await client.get(img_url, params=params)
                                  if img_resp.status_code == 200:
-                                     # Save to local output directory
                                      from content_engine.config import settings
                                      output_dir = settings.paths.output_dir / "pod" / content_rating / "raw"
                                      output_dir.mkdir(parents=True, exist_ok=True)
@@ -419,7 +706,6 @@ async def _poll_pod_job(job_id: str, prompt_id: str, content_rating: str):
 
                                      logger.info("Pod generation completed: %s -> %s", job_id, local_path)
 
-                                     # Catalog the image
                                      try:
                                          from content_engine.services.catalog import CatalogService
                                          catalog = CatalogService()
@@ -463,29 +749,69 @@ def _build_flux_workflow(
      seed: int,
      lora_name: str | None,
      lora_strength: float,
  ) -> dict:
-     """Build a ComfyUI workflow for FLUX generation."""
 
-     # Basic FLUX workflow - compatible with ComfyUI FLUX setup
      workflow = {
-         "4": {
-             "class_type": "CheckpointLoaderSimple",
-             "inputs": {"ckpt_name": "flux1-dev.safetensors"},
          },
          "6": {
              "class_type": "CLIPTextEncode",
              "inputs": {
                  "text": prompt,
-                 "clip": ["4", 1],
              },
          },
          "7": {
              "class_type": "CLIPTextEncode",
              "inputs": {
                  "text": negative_prompt or "",
-                 "clip": ["4", 1],
              },
          },
          "5": {
              "class_type": "EmptyLatentImage",
              "inputs": {
@@ -494,7 +820,8 @@ def _build_flux_workflow(
                  "batch_size": 1,
              },
          },
-         "3": {
              "class_type": "KSampler",
              "inputs": {
                  "seed": seed,
@@ -503,19 +830,21 @@ def _build_flux_workflow(
                  "sampler_name": "euler",
                  "scheduler": "simple",
                  "denoise": 1.0,
-                 "model": ["4", 0],
                  "positive": ["6", 0],
                  "negative": ["7", 0],
                  "latent_image": ["5", 0],
              },
          },
          "8": {
              "class_type": "VAEDecode",
              "inputs": {
-                 "samples": ["3", 0],
-                 "vae": ["4", 2],
              },
          },
          "9": {
              "class_type": "SaveImage",
              "inputs": {
@@ -527,19 +856,19 @@ def _build_flux_workflow(
 
      # Add LoRA if specified
      if lora_name:
-         workflow["10"] = {
              "class_type": "LoraLoader",
              "inputs": {
                  "lora_name": lora_name,
                  "strength_model": lora_strength,
                  "strength_clip": lora_strength,
-                 "model": ["4", 0],
-                 "clip": ["4", 1],
              },
          }
-         # Rewire sampler to use LoRA output
-         workflow["3"]["inputs"]["model"] = ["10", 0]
-         workflow["6"]["inputs"]["clip"] = ["10", 1]
-         workflow["7"]["inputs"]["clip"] = ["10", 1]
 
      return workflow
+ """RunPod Pod management routes — start/stop GPU pods for generation.
+
+ Starts a persistent ComfyUI pod with network volume access.
+ Models and LoRAs are loaded from the shared network volume.
+ """
 
  from __future__ import annotations
 
  import asyncio
+ import json
  import logging
  import os
  import time
  import uuid
+ from pathlib import Path
  from typing import Any
 
  import runpod
 
  router = APIRouter(prefix="/api/pod", tags=["pod"])
 
+ # Persist pod state to disk so it survives server restarts
+ _POD_STATE_FILE = Path(__file__).parent.parent.parent.parent / "pod_state.json"
+
+
+ def _save_pod_state():
+     """Save pod state to disk."""
+     try:
+         data = {k: v for k, v in _pod_state.items() if k != "setup_status"}
+         _POD_STATE_FILE.write_text(json.dumps(data))
+     except Exception as e:
+         logger.warning("Failed to save pod state: %s", e)
+
+
+ def _load_pod_state():
+     """Load pod state from disk on startup."""
+     try:
+         if _POD_STATE_FILE.exists():
+             data = json.loads(_POD_STATE_FILE.read_text())
+             for k, v in data.items():
+                 if k in _pod_state:
+                     _pod_state[k] = v
+             logger.info("Restored pod state: pod_id=%s status=%s", _pod_state.get("pod_id"), _pod_state.get("status"))
+     except Exception as e:
+         logger.warning("Failed to load pod state: %s", e)
+
+ def _get_volume_config() -> tuple[str, str]:
+     """Get network volume config at runtime (after dotenv loads)."""
+     return (
+         os.environ.get("RUNPOD_VOLUME_ID", ""),
+         os.environ.get("RUNPOD_VOLUME_DC", ""),
+     )
+
+ # Docker image — PyTorch base with CUDA, we install ComfyUI ourselves
+ DOCKER_IMAGE = "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04"
+
  # Pod state
  _pod_state = {
      "pod_id": None,
+     "status": "stopped",  # stopped, starting, setting_up, running, stopping
      "ip": None,
+     "ssh_port": None,
+     "comfyui_port": None,
+     "gpu_type": "NVIDIA RTX A6000",
+     "model_type": "flux2",
      "started_at": None,
+     "cost_per_hour": 0.76,
+     "setup_status": None,
  }
 
+ _load_pod_state()
 
+ # GPU options (same as training)
  GPU_OPTIONS = {
+     "NVIDIA A40": {"name": "A40 48GB", "vram": 48, "cost": 0.64},
+     "NVIDIA RTX A6000": {"name": "RTX A6000 48GB", "vram": 48, "cost": 0.76},
+     "NVIDIA L40": {"name": "L40 48GB", "vram": 48, "cost": 0.89},
+     "NVIDIA L40S": {"name": "L40S 48GB", "vram": 48, "cost": 1.09},
+     "NVIDIA A100-SXM4-80GB": {"name": "A100 SXM 80GB", "vram": 80, "cost": 1.64},
+     "NVIDIA A100 80GB PCIe": {"name": "A100 PCIe 80GB", "vram": 80, "cost": 1.89},
+     "NVIDIA H100 80GB HBM3": {"name": "H100 80GB", "vram": 80, "cost": 3.89},
+     "NVIDIA GeForce RTX 5090": {"name": "RTX 5090 32GB", "vram": 32, "cost": 0.69},
+     "NVIDIA GeForce RTX 4090": {"name": "RTX 4090 24GB", "vram": 24, "cost": 0.44},
+     "NVIDIA GeForce RTX 3090": {"name": "RTX 3090 24GB", "vram": 24, "cost": 0.22},
  }
 
 
+ def _get_comfyui_url() -> str | None:
+     """Get the ComfyUI URL via RunPod's HTTPS proxy.
+
+     RunPod HTTP ports are only accessible through their proxy at
+     https://{pod_id}-{private_port}.proxy.runpod.net
+     The raw IP:port from the API is an internal address, not publicly routable.
+     """
+     pod_id = _pod_state.get("pod_id")
+     if pod_id:
+         return f"https://{pod_id}-8188.proxy.runpod.net"
+     return None
+
+
  def _get_api_key() -> str:
      key = os.environ.get("RUNPOD_API_KEY")
      if not key:
 
 
112
 
113
  class StartPodRequest(BaseModel):
114
+ gpu_type: str = "NVIDIA RTX A6000"
115
+ model_type: str = "flux2"
116
 
117
 
118
  class PodStatus(BaseModel):
 
121
  ip: str | None = None
122
  port: int | None = None
123
  gpu_type: str | None = None
124
+ model_type: str | None = None
125
  cost_per_hour: float | None = None
126
+ setup_status: str | None = None
127
  uptime_minutes: float | None = None
128
  comfyui_url: str | None = None
129
 
 
      """Get current pod status."""
      _get_api_key()

      if _pod_state["pod_id"]:
          try:
+             pod = await asyncio.wait_for(
+                 asyncio.to_thread(runpod.get_pod, _pod_state["pod_id"]),
+                 timeout=10,
+             )
              if pod:
                  desired = pod.get("desiredStatus", "")
                  if desired == "RUNNING":
+                     runtime = pod.get("runtime") or {}
+                     ports = runtime.get("ports") or []
                      for p in ports:
+                         if p.get("privatePort") == 22:
+                             _pod_state["ssh_ip"] = p.get("ip")
+                             _pod_state["ssh_port"] = p.get("publicPort")
                          if p.get("privatePort") == 8188:
+                             _pod_state["comfyui_ip"] = p.get("ip")
+                             _pod_state["comfyui_port"] = p.get("publicPort")
+                     # Use SSH IP as the main IP for display
+                     _pod_state["ip"] = _pod_state.get("ssh_ip") or _pod_state.get("comfyui_ip")
                  elif desired == "EXITED":
                      _pod_state["status"] = "stopped"
                      _pod_state["pod_id"] = None
              else:
                  _pod_state["status"] = "stopped"
                  _pod_state["pod_id"] = None
+         except asyncio.TimeoutError:
+             logger.warning("RunPod API timeout checking pod status")
          except Exception as e:
              logger.warning("Failed to check pod: %s", e)

      uptime = None
+     if _pod_state["started_at"] and _pod_state["status"] in ("running", "setting_up"):
          uptime = (time.time() - _pod_state["started_at"]) / 60

+     comfyui_url = _get_comfyui_url()

      return PodStatus(
          status=_pod_state["status"],
          pod_id=_pod_state["pod_id"],
          ip=_pod_state["ip"],
+         port=_pod_state.get("comfyui_port"),
          gpu_type=_pod_state["gpu_type"],
+         model_type=_pod_state.get("model_type", "flux2"),
          cost_per_hour=_pod_state["cost_per_hour"],
+         setup_status=_pod_state.get("setup_status"),
          uptime_minutes=uptime,
          comfyui_url=comfyui_url,
      )

      return {"gpus": GPU_OPTIONS}


+ @router.get("/model-options")
+ async def list_model_options():
+     """List available model types for the pod."""
+     return {
+         "models": {
+             "flux2": {"name": "FLUX.2 Dev", "description": "Best for realistic txt2img (requires 48GB+ VRAM)", "use_case": "txt2img"},
+             "flux1": {"name": "FLUX.1 Dev", "description": "Previous gen FLUX txt2img", "use_case": "txt2img"},
+             "wan22": {"name": "WAN 2.2", "description": "Image-to-video and general generation", "use_case": "img2video"},
+         }
+     }


  @router.post("/start")
  async def start_pod(request: StartPodRequest):
+     """Start a GPU pod with ComfyUI for generation."""
      _get_api_key()

+     if _pod_state["status"] in ("running", "setting_up"):
          return {"status": "already_running", "pod_id": _pod_state["pod_id"]}

      if _pod_state["status"] == "starting":

      _pod_state["status"] = "starting"
      _pod_state["gpu_type"] = request.gpu_type
      _pod_state["cost_per_hour"] = gpu_info["cost"]
+     _pod_state["model_type"] = request.model_type
+     _pod_state["setup_status"] = "Creating pod..."

      try:
+         logger.info("Starting RunPod with %s for %s...", request.gpu_type, request.model_type)
+
+         pod_kwargs = {
+             "container_disk_in_gb": 30,
+             "ports": "22/tcp,8188/http",
+             "docker_args": "bash -c 'apt-get update && apt-get install -y openssh-server && mkdir -p /run/sshd && echo root:runpod | chpasswd && /usr/sbin/sshd -o PermitRootLogin=yes && sleep infinity'",
+         }
+
+         volume_id, volume_dc = _get_volume_config()
+         if volume_id:
+             pod_kwargs["network_volume_id"] = volume_id
+             if volume_dc:
+                 pod_kwargs["data_center_id"] = volume_dc
+             logger.info("Using network volume: %s (DC: %s)", volume_id, volume_dc)
+         else:
+             pod_kwargs["volume_in_gb"] = 75
+             logger.warning("No network volume configured, using temporary volume")
+
+         pod = await asyncio.to_thread(
+             runpod.create_pod,
+             f"comfyui-gen-{request.model_type}",
+             DOCKER_IMAGE,
+             request.gpu_type,
+             **pod_kwargs,
          )

          _pod_state["pod_id"] = pod["id"]
          _pod_state["started_at"] = time.time()
+         _save_pod_state()

          logger.info("Pod created: %s", pod["id"])

+         asyncio.create_task(_wait_and_setup_pod(pod["id"], request.model_type))

          return {
              "status": "starting",
              "pod_id": pod["id"],
+             "message": f"Starting {gpu_info['name']} pod (~5-8 min for setup)",
          }

      except Exception as e:
          _pod_state["status"] = "stopped"
+         _pod_state["setup_status"] = None
          logger.error("Failed to start pod: %s", e)
          raise HTTPException(500, f"Failed to start pod: {e}")

+ async def _wait_and_setup_pod(pod_id: str, model_type: str, timeout: int = 600):
+     """Wait for pod to be ready, then install ComfyUI and link models via SSH."""
      start = time.time()
+     ssh_host = None
+     ssh_port = None

+     # Phase 1: Wait for SSH to be available
+     _pod_state["setup_status"] = "Waiting for pod to start..."
      while time.time() - start < timeout:
          try:
+             pod = await asyncio.to_thread(runpod.get_pod, pod_id)
              if pod and pod.get("desiredStatus") == "RUNNING":
+                 runtime = pod.get("runtime") or {}
+                 ports = runtime.get("ports") or []
                  for p in ports:
+                     if p.get("privatePort") == 22:
+                         ssh_host = p.get("ip")
+                         ssh_port = p.get("publicPort")
+                         _pod_state["ssh_ip"] = ssh_host
+                         _pod_state["ssh_port"] = ssh_port
+                         _pod_state["ip"] = ssh_host
                      if p.get("privatePort") == 8188:
+                         _pod_state["comfyui_ip"] = p.get("ip")
+                         _pod_state["comfyui_port"] = p.get("publicPort")
+                 if ssh_host and ssh_port:
+                     break
          except Exception as e:
              logger.debug("Waiting for pod: %s", e)
          await asyncio.sleep(5)

+     if not ssh_host or not ssh_port:
+         logger.error("Pod did not become ready within %ds", timeout)
+         _pod_state["status"] = "stopped"
+         _pod_state["setup_status"] = "Failed: pod did not start"
+         return
+
+     # Phase 2: SSH in and set up ComfyUI
+     _pod_state["status"] = "setting_up"
+     _pod_state["setup_status"] = "Connecting via SSH..."
+
+     import paramiko
+     ssh = paramiko.SSHClient()
+     ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
+
+     for attempt in range(30):
+         try:
+             await asyncio.to_thread(
+                 ssh.connect, ssh_host, port=int(ssh_port),
+                 username="root", password="runpod", timeout=10,
+             )
+             break
+         except Exception:
+             if attempt == 29:
+                 _pod_state["setup_status"] = "Failed: SSH connection error"
+                 _pod_state["status"] = "stopped"
+                 return
+             await asyncio.sleep(5)
+
+     transport = ssh.get_transport()
+     transport.set_keepalive(30)
+
+     try:
+         # Symlink network volume
+         volume_id, _ = _get_volume_config()
+         if volume_id:
+             await _ssh_exec_async(ssh, "mkdir -p /runpod-volume/models /runpod-volume/loras")
+             await _ssh_exec_async(ssh, "rm -rf /workspace/models 2>/dev/null; ln -sf /runpod-volume/models /workspace/models")
+
+         # Install ComfyUI (cache on volume for reuse)
+         comfy_dir = "/workspace/ComfyUI"
+         _pod_state["setup_status"] = "Installing ComfyUI..."
+
+         comfy_exists = (await _ssh_exec_async(ssh, f"test -f {comfy_dir}/main.py && echo EXISTS || echo MISSING")).strip()
+         if comfy_exists == "EXISTS":
+             logger.info("ComfyUI already installed")
+             _pod_state["setup_status"] = "ComfyUI found, updating..."
+             await _ssh_exec_async(ssh, f"cd {comfy_dir} && git pull 2>&1 | tail -3", timeout=120)
+         else:
+             # Check volume cache
+             vol_comfy = (await _ssh_exec_async(ssh, "test -f /runpod-volume/ComfyUI/main.py && echo EXISTS || echo MISSING")).strip()
+             if vol_comfy == "EXISTS":
+                 _pod_state["setup_status"] = "Restoring ComfyUI from volume..."
+                 await _ssh_exec_async(ssh, f"cp -r /runpod-volume/ComfyUI {comfy_dir}", timeout=300)
+             else:
+                 _pod_state["setup_status"] = "Cloning ComfyUI (first time, ~2 min)..."
+                 await _ssh_exec_async(ssh, "cd /workspace && git clone --depth 1 https://github.com/comfyanonymous/ComfyUI.git", timeout=300)
+                 await _ssh_exec_async(ssh, f"cd {comfy_dir} && pip install -r requirements.txt 2>&1 | tail -5", timeout=600)
+                 # Cache to volume for future pods (volume_id already fetched above)
+                 if volume_id:
+                     await _ssh_exec_async(ssh, f"cp -r {comfy_dir} /runpod-volume/ComfyUI", timeout=300)
+
+         # Install pip deps that aren't in ComfyUI requirements
+         _pod_state["setup_status"] = "Installing dependencies..."
+         await _ssh_exec_async(ssh, f"cd {comfy_dir} && pip install -r requirements.txt 2>&1 | tail -5", timeout=600)
+         await _ssh_exec_async(ssh, "pip install aiohttp einops sqlalchemy 2>&1 | tail -3", timeout=120)
+
+         # Symlink models into ComfyUI directories
+         _pod_state["setup_status"] = "Linking models..."
+         await _ssh_exec_async(ssh, f"mkdir -p {comfy_dir}/models/checkpoints {comfy_dir}/models/vae {comfy_dir}/models/loras {comfy_dir}/models/text_encoders")

+         if model_type == "flux2":
+             # FLUX.2 Dev ships as separate UNet, text encoder, and VAE files
+             await _ssh_exec_async(ssh, f"mkdir -p {comfy_dir}/models/diffusion_models")
+             await _ssh_exec_async(ssh, f"ln -sf /workspace/models/FLUX.2-dev/flux2-dev.safetensors {comfy_dir}/models/diffusion_models/flux2-dev.safetensors")
+             await _ssh_exec_async(ssh, f"ln -sf /workspace/models/FLUX.2-dev/ae.safetensors {comfy_dir}/models/vae/ae.safetensors")
+
+             # Text encoder: use Comfy-Org's pre-converted single-file version
+             # (HF sharded format is incompatible with ComfyUI's CLIPLoader)
+             te_file = "/runpod-volume/models/mistral_3_small_flux2_fp8.safetensors"
+             te_exists = (await _ssh_exec_async(ssh, f"test -f {te_file} && echo EXISTS || echo MISSING")).strip()
+             if te_exists != "EXISTS":
+                 _pod_state["setup_status"] = "Downloading FLUX.2 text encoder (~12GB, first time only)..."
+                 await _ssh_exec_async(ssh, "pip install huggingface_hub 2>&1 | tail -1", timeout=60)
+                 await _ssh_exec_async(ssh, f"""python -c "
+ from huggingface_hub import hf_hub_download
+ hf_hub_download(
+     repo_id='Comfy-Org/flux2-dev',
+     filename='split_files/text_encoders/mistral_3_small_flux2_fp8.safetensors',
+     local_dir='/tmp/flux2_te',
+ )
+ import shutil
+ shutil.move('/tmp/flux2_te/split_files/text_encoders/mistral_3_small_flux2_fp8.safetensors', '{te_file}')
+ print('Text encoder downloaded')
+ " 2>&1 | tail -5""", timeout=1800)
+             await _ssh_exec_async(ssh, f"ln -sf {te_file} {comfy_dir}/models/text_encoders/mistral_3_small_flux2_fp8.safetensors")
+             # Remove old sharded loader patch if present
+             await _ssh_exec_async(ssh, f"rm -f {comfy_dir}/custom_nodes/sharded_loader.py")
+         elif model_type == "flux1":
+             await _ssh_exec_async(ssh, f"ln -sf /workspace/models/flux1-dev.safetensors {comfy_dir}/models/checkpoints/flux1-dev.safetensors")
+             await _ssh_exec_async(ssh, f"ln -sf /workspace/models/ae.safetensors {comfy_dir}/models/vae/ae.safetensors")
+             await _ssh_exec_async(ssh, f"ln -sf /workspace/models/clip_l.safetensors {comfy_dir}/models/text_encoders/clip_l.safetensors")
+             await _ssh_exec_async(ssh, f"ln -sf /workspace/models/t5xxl_fp16.safetensors {comfy_dir}/models/text_encoders/t5xxl_fp16.safetensors")
+         elif model_type == "wan22":
+             # WAN 2.2 Image-to-Video (14B params)
+             wan_dir = "/workspace/models/Wan2.2-I2V-A14B"
+             wan_exists = (await _ssh_exec_async(ssh, f"test -d {wan_dir} && echo EXISTS || echo MISSING")).strip()
+             if wan_exists != "EXISTS":
+                 _pod_state["setup_status"] = "Downloading WAN 2.2 model (~28GB, first time only)..."
+                 await _ssh_exec_async(ssh, "pip install huggingface_hub 2>&1 | tail -1", timeout=60)
+                 await _ssh_exec_async(ssh, f"""python -c "
+ from huggingface_hub import snapshot_download
+ snapshot_download('Wan-AI/Wan2.2-I2V-A14B', local_dir='{wan_dir}', ignore_patterns=['*.md', '*.txt'])
+ print('WAN 2.2 downloaded')
+ " 2>&1 | tail -10""", timeout=3600)
+             # Symlink WAN model into ComfyUI's diffusion_models dir
+             await _ssh_exec_async(ssh, f"mkdir -p {comfy_dir}/models/diffusion_models")
+             await _ssh_exec_async(ssh, f"ln -sf {wan_dir} {comfy_dir}/models/diffusion_models/Wan2.2-I2V-A14B")
+             # WAN also needs its own VAE and text encoder
+             await _ssh_exec_async(ssh, f"ln -sf {wan_dir} {comfy_dir}/models/checkpoints/Wan2.2-I2V-A14B")
+
+             # Install ComfyUI-WanVideoWrapper custom nodes
+             _pod_state["setup_status"] = "Installing WAN 2.2 ComfyUI nodes..."
+             wan_nodes_dir = f"{comfy_dir}/custom_nodes/ComfyUI-WanVideoWrapper"
+             wan_nodes_exist = (await _ssh_exec_async(ssh, f"test -d {wan_nodes_dir} && echo EXISTS || echo MISSING")).strip()
+             if wan_nodes_exist != "EXISTS":
+                 await _ssh_exec_async(ssh, f"cd {comfy_dir}/custom_nodes && git clone --depth 1 https://github.com/kijai/ComfyUI-WanVideoWrapper.git", timeout=120)
+                 await _ssh_exec_async(ssh, f"cd {wan_nodes_dir} && pip install -r requirements.txt 2>&1 | tail -5", timeout=300)
+
+         # Symlink all LoRAs from the volume
+         await _ssh_exec_async(ssh, f"ls /runpod-volume/loras/*.safetensors 2>/dev/null | while read f; do ln -sf \"$f\" {comfy_dir}/models/loras/; done")
+
+         # Start ComfyUI in background (fire-and-forget; don't wait for output)
+         _pod_state["setup_status"] = "Starting ComfyUI..."
+         await asyncio.to_thread(
+             _ssh_exec_fire_and_forget,
+             ssh,
+             f"cd {comfy_dir} && python main.py --listen 0.0.0.0 --port 8188 --fp8_e4m3fn-unet > /tmp/comfyui.log 2>&1",
+         )
+         await asyncio.sleep(2)  # Give it a moment to start
+
+         # Wait for ComfyUI HTTP to respond
+         _pod_state["setup_status"] = "Waiting for ComfyUI to load model..."
+         import httpx
+         comfyui_url = _get_comfyui_url()
+         for attempt in range(120):  # Up to 10 minutes
+             try:
+                 async with httpx.AsyncClient(timeout=5) as client:
+                     resp = await client.get(f"{comfyui_url}/system_stats")
+                     if resp.status_code == 200:
+                         _pod_state["status"] = "running"
+                         _pod_state["setup_status"] = "Ready"
+                         _save_pod_state()
+                         logger.info("ComfyUI ready at %s", comfyui_url)
+                         return
+             except Exception:
+                 pass
+             await asyncio.sleep(5)
+
+         # If we get here, ComfyUI didn't start; check the log for errors
+         log_tail = await _ssh_exec_async(ssh, "tail -20 /tmp/comfyui.log")
+         logger.error("ComfyUI didn't start. Log: %s", log_tail)
+         _pod_state["setup_status"] = "ComfyUI failed to start. Check logs."
+         _pod_state["status"] = "setting_up"  # Keep pod running so user can debug
+
+     except Exception as e:
+         import traceback
+         err_msg = f"{type(e).__name__}: {e}"
+         logger.error("Pod setup failed: %s\n%s", err_msg, traceback.format_exc())
+         _pod_state["setup_status"] = f"Setup failed: {err_msg}"
+         _pod_state["status"] = "setting_up"  # Keep pod running so user can debug
+     finally:
+         try:
+             ssh.close()
+         except Exception:
+             pass

+ def _ssh_exec(ssh, cmd: str, timeout: int = 120) -> str:
+     """Execute a command over SSH and return stdout (blocking; call from async code via to_thread)."""
+     _, stdout, _ = ssh.exec_command(cmd, timeout=timeout)
+     out = stdout.read().decode("utf-8", errors="replace")
+     return out.strip()
+
+
+ async def _ssh_exec_async(ssh, cmd: str, timeout: int = 120) -> str:
+     """Async wrapper for SSH exec that doesn't block the event loop."""
+     return await asyncio.to_thread(_ssh_exec, ssh, cmd, timeout)
+
+
+ def _ssh_exec_fire_and_forget(ssh, cmd: str):
+     """Start a command over SSH without waiting for output (for background processes)."""
+     transport = ssh.get_transport()
+     channel = transport.open_session()
+     channel.exec_command(cmd)
+     # Don't read stdout/stderr; just let it run


  @router.post("/stop")

      pod_id = _pod_state["pod_id"]
      logger.info("Stopping pod: %s", pod_id)

+         await asyncio.to_thread(runpod.terminate_pod, pod_id)

          _pod_state["pod_id"] = None
          _pod_state["ip"] = None
+         _pod_state["ssh_port"] = None
+         _pod_state["comfyui_port"] = None
          _pod_state["status"] = "stopped"
          _pod_state["started_at"] = None
+         _pod_state["setup_status"] = None
+         _save_pod_state()

          logger.info("Pod stopped")
          return {"status": "stopped", "message": "Pod terminated"}

      except Exception as e:
          logger.error("Failed to stop pod: %s", e)
+         _pod_state["status"] = "running"
          raise HTTPException(500, f"Failed to stop pod: {e}")

      if _pod_state["status"] != "running" or not _pod_state["ip"]:
          return {"loras": [], "message": "Pod not running"}

+     comfyui_url = _get_comfyui_url()
      try:
          import httpx
          async with httpx.AsyncClient(timeout=30) as client:
+             url = f"{comfyui_url}/object_info/LoraLoader"
              resp = await client.get(url)
              if resp.status_code == 200:
                  data = resp.json()

      except Exception as e:
          logger.warning("Failed to list pod LoRAs: %s", e)

+     return {"loras": [], "comfyui_url": comfyui_url}


  @router.post("/upload-lora")
+ async def upload_lora_to_pod(file: UploadFile = File(...)):
      """Upload a LoRA file to the running pod."""
      import httpx

      if _pod_state["status"] != "running":

      try:
          content = await file.read()
+         comfyui_url = _get_comfyui_url()

          async with httpx.AsyncClient(timeout=120) as client:
+             url = f"{comfyui_url}/upload/image"
              files = {"image": (file.filename, content, "application/octet-stream")}
              data = {"subfolder": "loras", "type": "input"}
              resp = await client.post(url, files=files, data=data)

          if resp.status_code == 200:

      job_id = str(uuid.uuid4())[:8]
      seed = request.seed if request.seed >= 0 else random.randint(0, 2**32 - 1)

+     model_type = _pod_state.get("model_type", "flux2")
      workflow = _build_flux_workflow(
          prompt=request.prompt,
          negative_prompt=request.negative_prompt,

          seed=seed,
          lora_name=request.lora_name,
          lora_strength=request.lora_strength,
+         model_type=model_type,
      )

+     comfyui_url = _get_comfyui_url()
+
      try:
          async with httpx.AsyncClient(timeout=30) as client:
+             resp = await client.post(f"{comfyui_url}/prompt", json={"prompt": workflow})
              resp.raise_for_status()
              data = resp.json()

          }

          logger.info("Pod generation started: %s -> %s", job_id, prompt_id)
          asyncio.create_task(_poll_pod_job(job_id, prompt_id, request.content_rating))

+         return {"job_id": job_id, "status": "running", "seed": seed}

      except Exception as e:
          logger.error("Pod generation failed: %s", e)

  async def _poll_pod_job(job_id: str, prompt_id: str, content_rating: str):
      """Poll ComfyUI for job completion and save the result."""
      import httpx

      start = time.time()
+     timeout = 600  # 10 min; the first generation can take 5+ min for model loading
+     comfyui_url = _get_comfyui_url()

      async with httpx.AsyncClient(timeout=60) as client:
          while time.time() - start < timeout:
              try:
+                 resp = await client.get(f"{comfyui_url}/history/{prompt_id}")
                  if resp.status_code == 200:
                      data = resp.json()
                      if prompt_id in data:
                          outputs = data[prompt_id].get("outputs", {})

                          for node_id, node_output in outputs.items():
                              if "images" in node_output:
                                  image_info = node_output["images"][0]
                                  filename = image_info["filename"]
                                  subfolder = image_info.get("subfolder", "")

                                  params = {"filename": filename}
                                  if subfolder:
                                      params["subfolder"] = subfolder

+                                 img_resp = await client.get(f"{comfyui_url}/view", params=params)
                                  if img_resp.status_code == 200:
                                      from content_engine.config import settings
                                      output_dir = settings.paths.output_dir / "pod" / content_rating / "raw"
                                      output_dir.mkdir(parents=True, exist_ok=True)

      logger.info("Pod generation completed: %s -> %s", job_id, local_path)

      try:
          from content_engine.services.catalog import CatalogService
          catalog = CatalogService()

      seed: int,
      lora_name: str | None,
      lora_strength: float,
+     model_type: str = "flux2",
  ) -> dict:
+     """Build a ComfyUI workflow for FLUX generation.
+
+     FLUX.2 Dev uses separate model components (not a single checkpoint):
+     - UNETLoader for the diffusion model
+     - CLIPLoader (type=flux2) for the Mistral text encoder
+     - VAELoader for the autoencoder
+     """
+
+     if model_type == "flux2":
+         unet_name = "flux2-dev.safetensors"
+         clip_type = "flux2"
+         clip_name = "mistral_3_small_flux2_fp8.safetensors"
+     else:
+         unet_name = "flux1-dev.safetensors"
+         clip_type = "flux"
+         clip_name = "t5xxl_fp16.safetensors"
+
+     # Model node ID references
+     model_out = ["1", 0]  # UNETLoader -> MODEL
+     clip_out = ["2", 0]   # CLIPLoader -> CLIP
+     vae_out = ["3", 0]    # VAELoader -> VAE

      workflow = {
+         # Load diffusion model (UNet)
+         "1": {
+             "class_type": "UNETLoader",
+             "inputs": {
+                 "unet_name": unet_name,
+                 "weight_dtype": "fp8_e4m3fn",
+             },
+         },
+         # Load text encoder
+         "2": {
+             "class_type": "CLIPLoader",
+             "inputs": {
+                 "clip_name": clip_name,
+                 "type": clip_type,
+             },
+         },
+         # Load VAE
+         "3": {
+             "class_type": "VAELoader",
+             "inputs": {"vae_name": "ae.safetensors"},
          },
+         # Positive prompt
          "6": {
              "class_type": "CLIPTextEncode",
              "inputs": {
                  "text": prompt,
+                 "clip": clip_out,
              },
          },
+         # Negative prompt
          "7": {
              "class_type": "CLIPTextEncode",
              "inputs": {
                  "text": negative_prompt or "",
+                 "clip": clip_out,
              },
          },
+         # Empty latent
          "5": {
              "class_type": "EmptyLatentImage",
              "inputs": {

                  "batch_size": 1,
              },
          },
+         # Sampler
+         "10": {
              "class_type": "KSampler",
              "inputs": {
                  "seed": seed,

                  "sampler_name": "euler",
                  "scheduler": "simple",
                  "denoise": 1.0,
+                 "model": model_out,
                  "positive": ["6", 0],
                  "negative": ["7", 0],
                  "latent_image": ["5", 0],
              },
          },
+         # Decode
          "8": {
              "class_type": "VAEDecode",
              "inputs": {
+                 "samples": ["10", 0],
+                 "vae": vae_out,
              },
          },
+         # Save
          "9": {
              "class_type": "SaveImage",
              "inputs": {

      # Add LoRA if specified
      if lora_name:
+         workflow["20"] = {
              "class_type": "LoraLoader",
              "inputs": {
                  "lora_name": lora_name,
                  "strength_model": lora_strength,
                  "strength_clip": lora_strength,
+                 "model": model_out,
+                 "clip": clip_out,
              },
          }
+         # Rewire sampler and text encoders to use the LoRA output
+         workflow["10"]["inputs"]["model"] = ["20", 0]
+         workflow["6"]["inputs"]["clip"] = ["20", 1]
+         workflow["7"]["inputs"]["clip"] = ["20", 1]

      return workflow
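The LoRA rewiring at the end of `_build_flux_workflow` can be checked in isolation. The sketch below mirrors that wiring on a minimal graph; the node IDs and input keys match the workflow above, but the `rewire_lora` helper itself is illustrative and not part of the commit:

```python
def rewire_lora(workflow: dict, lora_name: str, strength: float) -> dict:
    """Mirror of the LoRA wiring in _build_flux_workflow (illustrative).

    Inserts a LoraLoader node "20" fed by the base model ("1") and CLIP
    ("2"), then repoints the sampler ("10") at the LoRA's MODEL output
    and both text encoders ("6", "7") at its CLIP output.
    """
    workflow["20"] = {
        "class_type": "LoraLoader",
        "inputs": {
            "lora_name": lora_name,
            "strength_model": strength,
            "strength_clip": strength,
            "model": ["1", 0],
            "clip": ["2", 0],
        },
    }
    workflow["10"]["inputs"]["model"] = ["20", 0]  # sampler takes LoRA MODEL
    workflow["6"]["inputs"]["clip"] = ["20", 1]    # positive prompt takes LoRA CLIP
    workflow["7"]["inputs"]["clip"] = ["20", 1]    # negative prompt takes LoRA CLIP
    return workflow


# Minimal graph containing only the nodes the rewiring touches
graph = {
    "10": {"inputs": {"model": ["1", 0]}},
    "6": {"inputs": {"clip": ["2", 0]}},
    "7": {"inputs": {"clip": ["2", 0]}},
}
graph = rewire_lora(graph, "my_lora.safetensors", 0.8)
```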
src/content_engine/api/routes_training.py CHANGED
@@ -213,8 +213,10 @@ async def list_training_jobs():
              "loss": j.loss, "started_at": j.started_at,
              "completed_at": j.completed_at, "output_path": j.output_path,
              "error": j.error, "backend": "local",
+             "log_lines": j.log_lines[-50:] if hasattr(j, 'log_lines') else [],
          })
      if _runpod_trainer:
+         await _runpod_trainer.ensure_loaded()
          for j in _runpod_trainer.list_jobs():
              jobs.append({
                  "id": j.id, "name": j.name, "status": j.status,
@@ -225,6 +227,7 @@ async def list_training_jobs():
                  "completed_at": j.completed_at, "output_path": j.output_path,
                  "error": j.error, "backend": "runpod",
                  "base_model": j.base_model, "model_type": j.model_type,
+                 "log_lines": j.log_lines[-50:],
              })
      return jobs

@@ -232,27 +235,35 @@ async def list_training_jobs():
  @router.get("/jobs/{job_id}")
  async def get_training_job(job_id: str):
      """Get details of a specific training job including logs."""
-     if _trainer is None:
-         raise HTTPException(503, "Trainer not initialized")
-     job = _trainer.get_job(job_id)
-     if job is None:
-         raise HTTPException(404, f"Training job not found: {job_id}")
-     return {
-         "id": job.id,
-         "name": job.name,
-         "status": job.status,
-         "progress": round(job.progress, 3),
-         "current_epoch": job.current_epoch,
-         "total_epochs": job.total_epochs,
-         "current_step": job.current_step,
-         "total_steps": job.total_steps,
-         "loss": job.loss,
-         "started_at": job.started_at,
-         "completed_at": job.completed_at,
-         "output_path": job.output_path,
-         "error": job.error,
-         "log_lines": job.log_lines[-50:],
-     }
+     # Check RunPod jobs first
+     if _runpod_trainer:
+         await _runpod_trainer.ensure_loaded()
+         job = _runpod_trainer.get_job(job_id)
+         if job:
+             return {
+                 "id": job.id, "name": job.name, "status": job.status,
+                 "progress": round(job.progress, 3),
+                 "current_epoch": job.current_epoch, "total_epochs": job.total_epochs,
+                 "current_step": job.current_step, "total_steps": job.total_steps,
+                 "loss": job.loss, "started_at": job.started_at,
+                 "completed_at": job.completed_at, "output_path": job.output_path,
+                 "error": job.error, "log_lines": job.log_lines[-50:],
+                 "backend": "runpod", "base_model": job.base_model,
+             }
+     # Then check local trainer
+     if _trainer:
+         job = _trainer.get_job(job_id)
+         if job:
+             return {
+                 "id": job.id, "name": job.name, "status": job.status,
+                 "progress": round(job.progress, 3),
+                 "current_epoch": job.current_epoch, "total_epochs": job.total_epochs,
+                 "current_step": job.current_step, "total_steps": job.total_steps,
+                 "loss": job.loss, "started_at": job.started_at,
+                 "completed_at": job.completed_at, "output_path": job.output_path,
+                 "error": job.error, "log_lines": job.log_lines[-50:],
+             }
+     raise HTTPException(404, f"Training job not found: {job_id}")


  @router.post("/jobs/{job_id}/cancel")
@@ -267,3 +278,22 @@ async def cancel_training_job(job_id: str):
      if cancelled:
          return {"status": "cancelled", "job_id": job_id}
      raise HTTPException(404, "Job not found or not running")
+
+
+ @router.delete("/jobs/{job_id}")
+ async def delete_training_job(job_id: str):
+     """Delete a training job from history."""
+     if _runpod_trainer:
+         deleted = await _runpod_trainer.delete_job(job_id)
+         if deleted:
+             return {"status": "deleted", "job_id": job_id}
+     raise HTTPException(404, f"Training job not found: {job_id}")
+
+
+ @router.delete("/jobs")
+ async def delete_failed_jobs():
+     """Delete all failed training jobs."""
+     if _runpod_trainer:
+         count = await _runpod_trainer.delete_failed_jobs()
+         return {"status": "ok", "deleted": count}
+     return {"status": "ok", "deleted": 0}
src/content_engine/api/ui.html CHANGED
@@ -636,6 +636,25 @@ select { cursor: pointer; }
636
  .job-status-completed { background: rgba(34,197,94,0.15); color: var(--green); }
637
  .job-status-failed { background: rgba(239,68,68,0.15); color: var(--red); }
638
  .job-status-pending { background: rgba(136,136,136,0.15); color: var(--text-secondary); }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
639
 
640
  /* --- Lightbox --- */
641
  .lightbox {
@@ -1245,10 +1264,22 @@ select { cursor: pointer; }
1245
  <div style="margin-top:8px">
1246
  <label style="margin:0">GPU Type</label>
1247
  <select id="train-gpu-type" style="margin-top:4px">
1248
- <option value="NVIDIA GeForce RTX 4090">RTX 4090 (~$0.44/hr) - Fastest</option>
1249
- <option value="NVIDIA GeForce RTX 3090">RTX 3090 (~$0.22/hr) - Good value</option>
1250
- <option value="NVIDIA RTX A5000">RTX A5000 (~$0.28/hr)</option>
1251
- <option value="NVIDIA RTX A4000">RTX A4000 (~$0.20/hr) - Cheapest</option>
 
 
 
 
 
 
 
 
 
 
 
 
1252
  </select>
1253
  </div>
1254
  </div>
@@ -1379,13 +1410,27 @@ select { cursor: pointer; }
1379
  </div>
1380
  <div id="pod-controls" style="display:flex; gap:8px; align-items:center; flex-wrap:wrap">
1381
  <select id="pod-model-type" style="padding:8px 12px; border-radius:6px; background:var(--bg-primary); border:1px solid var(--border); color:var(--text-primary)">
1382
- <option value="flux">FLUX.2 (Realistic)</option>
1383
- <option value="wan">WAN 2.2 (General/Anime)</option>
 
1384
  </select>
1385
  <select id="pod-gpu-select" style="padding:8px 12px; border-radius:6px; background:var(--bg-primary); border:1px solid var(--border); color:var(--text-primary)">
1386
- <option value="NVIDIA GeForce RTX 4090">RTX 4090 - $0.44/hr (24GB)</option>
1387
- <option value="NVIDIA RTX A6000">RTX A6000 - $0.76/hr (48GB)</option>
1388
- <option value="NVIDIA A100 80GB PCIe">A100 80GB - $1.89/hr (80GB)</option>
 
 
 
 
 
 
 
 
 
 
 
 
 
1389
  </select>
1390
  <button id="pod-start-btn" class="btn" onclick="startPod()">Start Pod</button>
1391
  <button id="pod-stop-btn" class="btn btn-danger" onclick="stopPod()" style="display:none">Stop Pod</button>
@@ -2761,9 +2806,10 @@ function updateModelDefaults() {
2761
  <span style="color:var(--accent)">Resolution: ${model.resolution}px | LR: ${model.learning_rate} | Rank: ${model.network_rank} | VRAM: ${model.vram_required_gb}GB</span>
2762
  `;
2763
 
2764
- // Update placeholder hints
2765
  document.getElementById('train-lr').placeholder = `Default: ${model.learning_rate}`;
2766
  document.getElementById('lr-default').textContent = `(default: ${model.learning_rate})`;
 
2767
 
2768
  // Update resolution default
2769
  const resSelect = document.getElementById('train-resolution');
@@ -2782,6 +2828,16 @@ function updateModelDefaults() {
2782
  break;
2783
  }
2784
  }
 
 
 
 
 
 
 
 
 
 
2785
  }
2786
 
2787
  function selectTrainBackend(chip, backend) {
@@ -2899,7 +2955,7 @@ async function pollTrainingJobs() {
2899
  renderTrainingJobs(jobs);
2900
 
2901
  // Stop polling if no active jobs
2902
- const active = jobs.filter(j => j.status === 'training' || j.status === 'preparing');
2903
  if (active.length === 0 && trainingPollInterval) {
2904
  clearInterval(trainingPollInterval);
2905
  trainingPollInterval = null;
@@ -2913,35 +2969,93 @@ function renderTrainingJobs(jobs) {
2913
  container.innerHTML = `<div class="empty-state" style="padding:30px"><p>No training jobs yet</p><p style="font-size:12px;margin-top:4px">Upload images and configure settings to start training</p></div>`;
2914
  return;
2915
  }
2916
- container.innerHTML = jobs.map(j => {
 
2917
  const pct = (j.progress * 100).toFixed(1);
2918
  const elapsed = j.started_at ? ((Date.now()/1000 - j.started_at) / 60).toFixed(0) : '?';
 
 
2919
  return `
2920
- <div class="job-card">
2921
  <div class="job-header">
2922
  <span class="job-name">${j.name} ${j.backend === 'runpod' ? '<span style="font-size:10px;color:var(--blue);font-weight:400">☁ RunPod</span>' : ''}</span>
2923
  <span class="job-status job-status-${j.status}">${j.status}</span>
2924
  </div>
2925
- ${['training','preparing','creating_pod','uploading','installing','downloading'].includes(j.status) ? `
2926
  <div class="progress-bar-container" style="margin-top:0">
2927
  <div class="progress-bar-fill" style="width:${pct}%"></div>
2928
  </div>
2929
  <div style="display:flex;gap:16px;margin-top:8px;font-size:12px;color:var(--text-secondary)">
2930
  <span>Progress: <strong style="color:var(--text-primary)">${pct}%</strong></span>
2931
  ${j.current_step ? `<span>Step: <strong style="color:var(--text-primary)">${j.current_step}/${j.total_steps}</strong></span>` : ''}
2932
- ${j.loss !== null ? `<span>Loss: <strong style="color:var(--text-primary)">${j.loss.toFixed(4)}</strong></span>` : ''}
2933
  <span>Time: <strong style="color:var(--text-primary)">${elapsed}m</strong></span>
2934
  </div>
2935
- <button class="btn btn-secondary btn-small" style="margin-top:8px" onclick="cancelTraining('${j.id}')">Cancel</button>
 
 
 
2936
  ` : ''}
2937
  ${j.status === 'completed' ? `
2938
  <div style="font-size:12px;color:var(--green);margin-top:4px">LoRA saved to ComfyUI models folder</div>
2939
  ${j.output_path ? `<div style="font-size:11px;color:var(--text-secondary);margin-top:2px;word-break:break-all">${j.output_path}</div>` : ''}
 
2940
  ` : ''}
2941
- ${j.status === 'failed' && j.error ? `<div style="font-size:12px;color:var(--red);margin-top:4px">${j.error}</div>` : ''}
 
 
2942
  </div>
2943
  `;
2944
  }).join('');
 
2945
  }
2946
 
2947
  async function cancelTraining(jobId) {
@@ -2954,6 +3068,27 @@ async function cancelTraining(jobId) {
2954
  }
2955
  }
2956
 
 
2957
  // --- Status ---
2958
  async function checkStatus() {
2959
  try {
@@ -3085,10 +3220,11 @@ function updatePodUI(pod) {
3085
  if (!podPollInterval) {
3086
  podPollInterval = setInterval(loadPodStatus, 30000);
3087
  }
3088
- } else if (pod.status === 'starting') {
3089
- statusText.innerHTML = '<span style="color:var(--yellow)">● Starting...</span> <span style="color:var(--text-secondary)">(~2-3 min)</span>';
 
3090
  startBtn.style.display = 'none';
3091
- stopBtn.style.display = 'none';
3092
  gpuSelect.style.display = 'none';
3093
  podInfo.style.display = 'none';
3094
  // Poll more frequently while starting
 
636
  .job-status-completed { background: rgba(34,197,94,0.15); color: var(--green); }
637
  .job-status-failed { background: rgba(239,68,68,0.15); color: var(--red); }
638
  .job-status-pending { background: rgba(136,136,136,0.15); color: var(--text-secondary); }
639
+ .job-logs-panel {
640
+ margin-top: 8px;
641
+ border-top: 1px solid var(--border);
642
+ padding-top: 8px;
643
+ }
644
+ .job-logs-content {
645
+ background: var(--bg-primary);
646
+ border: 1px solid var(--border);
647
+ border-radius: 6px;
648
+ padding: 8px 10px;
649
+ font-family: monospace;
650
+ font-size: 11px;
651
+ line-height: 1.5;
652
+ max-height: 300px;
653
+ overflow-y: auto;
654
+ white-space: pre-wrap;
655
+ word-break: break-all;
656
+ color: var(--text-secondary);
657
+ }
658
 
659
  /* --- Lightbox --- */
660
  .lightbox {
 
1264
  <div style="margin-top:8px">
1265
  <label style="margin:0">GPU Type</label>
1266
  <select id="train-gpu-type" style="margin-top:4px">
1267
+ <optgroup label="48GB+ (Required for FLUX.2 Dev)">
1268
+ <option value="NVIDIA A40">A40 48GB (~$0.64/hr) - Cheapest for FLUX.2</option>
1269
+ <option value="NVIDIA RTX A6000" selected>RTX A6000 48GB (~$0.76/hr) - Recommended</option>
1270
+ <option value="NVIDIA L40">L40 48GB (~$0.89/hr)</option>
1271
+ <option value="NVIDIA L40S">L40S 48GB (~$1.09/hr)</option>
1272
+ <option value="NVIDIA A100-SXM4-80GB">A100 SXM 80GB (~$1.64/hr)</option>
1273
+ <option value="NVIDIA A100 80GB PCIe">A100 PCIe 80GB (~$1.89/hr)</option>
1274
+ <option value="NVIDIA H100 80GB HBM3">H100 80GB (~$3.89/hr) - Fastest</option>
1275
+ </optgroup>
1276
+ <optgroup label="16-32GB (SD 1.5, SDXL, FLUX.1 only)">
1277
+ <option value="NVIDIA GeForce RTX 5090">RTX 5090 32GB (~$0.69/hr)</option>
1278
+ <option value="NVIDIA GeForce RTX 4090">RTX 4090 24GB (~$0.44/hr)</option>
1279
+ <option value="NVIDIA GeForce RTX 3090">RTX 3090 24GB (~$0.22/hr)</option>
1280
+ <option value="NVIDIA RTX A5000">RTX A5000 24GB (~$0.28/hr)</option>
1281
+ <option value="NVIDIA RTX A4000">RTX A4000 16GB (~$0.20/hr)</option>
1282
+ </optgroup>
1283
  </select>
1284
  </div>
1285
  </div>
 
1410
  </div>
1411
  <div id="pod-controls" style="display:flex; gap:8px; align-items:center; flex-wrap:wrap">
1412
  <select id="pod-model-type" style="padding:8px 12px; border-radius:6px; background:var(--bg-primary); border:1px solid var(--border); color:var(--text-primary)">
1413
+ <option value="flux2">FLUX.2 Dev (Realistic txt2img)</option>
1414
+ <option value="flux1">FLUX.1 Dev (txt2img)</option>
1415
+ <option value="wan22">WAN 2.2 (img2video)</option>
1416
  </select>
1417
  <select id="pod-gpu-select" style="padding:8px 12px; border-radius:6px; background:var(--bg-primary); border:1px solid var(--border); color:var(--text-primary)">
1418
+ <optgroup label="48GB+ (FLUX.2 / Large models)">
1419
+ <option value="NVIDIA A40">A40 48GB - $0.64/hr</option>
1420
+ <option value="NVIDIA RTX A6000" selected>A6000 48GB - $0.76/hr</option>
1421
+ <option value="NVIDIA L40">L40 48GB - $0.89/hr</option>
1422
+ <option value="NVIDIA L40S">L40S 48GB - $1.09/hr</option>
1423
+ <option value="NVIDIA A100-SXM4-80GB">A100 SXM 80GB - $1.64/hr</option>
1424
+ <option value="NVIDIA A100 80GB PCIe">A100 PCIe 80GB - $1.89/hr</option>
1425
+ <option value="NVIDIA H100 80GB HBM3">H100 80GB - $3.89/hr</option>
1426
+ </optgroup>
1427
+ <optgroup label="16-32GB (SD 1.5 / SDXL / FLUX.1)">
1428
+ <option value="NVIDIA GeForce RTX 5090">RTX 5090 32GB - $0.69/hr</option>
1429
+ <option value="NVIDIA GeForce RTX 4090">RTX 4090 24GB - $0.44/hr</option>
1430
+ <option value="NVIDIA GeForce RTX 3090">RTX 3090 24GB - $0.22/hr</option>
1431
+ <option value="NVIDIA RTX A5000">A5000 24GB - $0.28/hr</option>
1432
+ <option value="NVIDIA RTX A4000">A4000 16GB - $0.20/hr</option>
1433
+ </optgroup>
1434
  </select>
1435
  <button id="pod-start-btn" class="btn" onclick="startPod()">Start Pod</button>
1436
  <button id="pod-stop-btn" class="btn btn-danger" onclick="stopPod()" style="display:none">Stop Pod</button>
 
2806
  <span style="color:var(--accent)">Resolution: ${model.resolution}px | LR: ${model.learning_rate} | Rank: ${model.network_rank} | VRAM: ${model.vram_required_gb}GB</span>
2807
  `;
2808
 
2809
+ // Update placeholder hints and auto-fill LR
2810
  document.getElementById('train-lr').placeholder = `Default: ${model.learning_rate}`;
2811
  document.getElementById('lr-default').textContent = `(default: ${model.learning_rate})`;
2812
+ document.getElementById('train-lr').value = model.learning_rate;
2813
 
2814
  // Update resolution default
2815
  const resSelect = document.getElementById('train-resolution');
 
2828
  break;
2829
  }
2830
  }
2831
+
2832
+ // Update optimizer default
2833
+ const optSelect = document.getElementById('train-optimizer');
2834
+ const optName = (model.optimizer || 'AdamW8bit').toLowerCase();
2835
+ for (let opt of optSelect.options) {
2836
+ if (opt.value.toLowerCase() === optName) {
2837
+ opt.selected = true;
2838
+ break;
2839
+ }
2840
+ }
2841
  }
2842
 
2843
  function selectTrainBackend(chip, backend) {
 
2955
  renderTrainingJobs(jobs);
2956
 
2957
  // Stop polling if no active jobs
2958
+ const active = jobs.filter(j => ['training','preparing','creating_pod','uploading','installing','downloading','pending'].includes(j.status));
2959
  if (active.length === 0 && trainingPollInterval) {
2960
  clearInterval(trainingPollInterval);
2961
  trainingPollInterval = null;
 
2969
  container.innerHTML = `<div class="empty-state" style="padding:30px"><p>No training jobs yet</p><p style="font-size:12px;margin-top:4px">Upload images and configure settings to start training</p></div>`;
2970
  return;
2971
  }
2972
+
2973
+ // Store latest jobs for log viewer
2974
+ window._trainingJobs = jobs;
2975
+
2976
+ // Show "Clear Failed" button if there are any failed jobs
2977
+ const failedCount = jobs.filter(j => j.status === 'failed' || j.status === 'error').length;
2978
+ let html = failedCount > 0 ? `<div style="text-align:right;margin-bottom:8px"><button class="btn btn-secondary btn-small" style="color:var(--red)" onclick="clearFailedJobs()">Clear ${failedCount} Failed</button></div>` : '';
2979
+
2980
+ html += jobs.map(j => {
2981
  const pct = (j.progress * 100).toFixed(1);
2982
  const elapsed = j.started_at ? ((Date.now()/1000 - j.started_at) / 60).toFixed(0) : '?';
2983
+ const isActive = ['training','preparing','creating_pod','uploading','installing','downloading'].includes(j.status);
2984
+ const hasLogs = j.log_lines && j.log_lines.length > 0;
2985
  return `
2986
+ <div class="job-card" id="job-card-${j.id}">
2987
  <div class="job-header">
2988
  <span class="job-name">${j.name} ${j.backend === 'runpod' ? '<span style="font-size:10px;color:var(--blue);font-weight:400">☁ RunPod</span>' : ''}</span>
2989
  <span class="job-status job-status-${j.status}">${j.status}</span>
2990
  </div>
2991
+ ${isActive ? `
2992
  <div class="progress-bar-container" style="margin-top:0">
2993
  <div class="progress-bar-fill" style="width:${pct}%"></div>
2994
  </div>
2995
  <div style="display:flex;gap:16px;margin-top:8px;font-size:12px;color:var(--text-secondary)">
2996
  <span>Progress: <strong style="color:var(--text-primary)">${pct}%</strong></span>
2997
  ${j.current_step ? `<span>Step: <strong style="color:var(--text-primary)">${j.current_step}/${j.total_steps}</strong></span>` : ''}
2998
+ ${j.loss !== null && j.loss !== undefined ? `<span>Loss: <strong style="color:var(--text-primary)">${j.loss.toFixed(4)}</strong></span>` : ''}
2999
  <span>Time: <strong style="color:var(--text-primary)">${elapsed}m</strong></span>
3000
  </div>
3001
+ <div style="display:flex;gap:6px;margin-top:8px">
3002
+ <button class="btn btn-secondary btn-small" onclick="toggleJobLogs('${j.id}')">View Logs</button>
3003
+ <button class="btn btn-secondary btn-small" style="color:var(--red)" onclick="cancelTraining('${j.id}')">Cancel</button>
3004
+ </div>
3005
  ` : ''}
3006
  ${j.status === 'completed' ? `
3007
  <div style="font-size:12px;color:var(--green);margin-top:4px">LoRA saved to ComfyUI models folder</div>
3008
  ${j.output_path ? `<div style="font-size:11px;color:var(--text-secondary);margin-top:2px;word-break:break-all">${j.output_path}</div>` : ''}
3009
+ <button class="btn btn-secondary btn-small" style="margin-top:6px" onclick="toggleJobLogs('${j.id}')">View Logs</button>
3010
+ ` : ''}
3011
+ ${j.status === 'failed' ? `
3012
+ ${j.error ? `<div style="font-size:12px;color:var(--red);margin-top:4px">${j.error}</div>` : ''}
3013
+ <div style="display:flex;gap:6px;margin-top:6px">
3014
+ <button class="btn btn-secondary btn-small" onclick="toggleJobLogs('${j.id}')">View Logs</button>
3015
+ <button class="btn btn-secondary btn-small" style="color:var(--red)" onclick="deleteJob('${j.id}')">Delete</button>
3016
+ </div>
3017
  ` : ''}
3018
+ <div id="job-logs-${j.id}" class="job-logs-panel" style="display:none">
3019
+ <div class="job-logs-content"></div>
3020
+ </div>
3021
  </div>
3022
  `;
3023
  }).join('');
3024
+
3025
+ container.innerHTML = html;
3026
+
3027
+ // Auto-show logs for active jobs
3028
+ const activeJob = jobs.find(j => ['training','preparing','creating_pod','uploading','installing','downloading'].includes(j.status));
3029
+ if (activeJob && activeJob.log_lines && activeJob.log_lines.length > 0) {
3030
+ showJobLogs(activeJob.id);
3031
+ }
3032
+ }
3033
+
3034
+ function toggleJobLogs(jobId) {
3035
+ const panel = document.getElementById('job-logs-' + jobId);
3036
+ if (!panel) return;
3037
+ if (panel.style.display === 'none') {
3038
+ showJobLogs(jobId);
3039
+ } else {
3040
+ panel.style.display = 'none';
3041
+ }
3042
+ }
3043
+
3044
+ function showJobLogs(jobId) {
3045
+ const panel = document.getElementById('job-logs-' + jobId);
3046
+ if (!panel) return;
3047
+ panel.style.display = 'block';
3048
+
3049
+ // Find job data
3050
+ const job = (window._trainingJobs || []).find(j => j.id === jobId);
3051
+ if (!job || !job.log_lines) {
3052
+ panel.querySelector('.job-logs-content').textContent = 'No logs available';
3053
+ return;
3054
+ }
3055
+
3056
+ const content = panel.querySelector('.job-logs-content');
3057
+ content.textContent = job.log_lines.join('\n');
3058
+ content.scrollTop = content.scrollHeight;
3059
  }
3060
 
3061
  async function cancelTraining(jobId) {
 
3068
  }
3069
  }
3070
 
3071
+ async function deleteJob(jobId) {
3072
+ try {
3073
+ await fetch(API + `/api/training/jobs/${jobId}`, { method: 'DELETE' });
3074
+ toast('Job deleted', 'info');
3075
+ pollTrainingJobs();
3076
+ } catch(e) {
3077
+ toast('Failed to delete job', 'error');
3078
+ }
3079
+ }
3080
+
3081
+ async function clearFailedJobs() {
3082
+ try {
3083
+ const res = await fetch(API + '/api/training/jobs', { method: 'DELETE' });
3084
+ const data = await res.json();
3085
+ toast(`Cleared ${data.deleted} failed jobs`, 'info');
3086
+ pollTrainingJobs();
3087
+ } catch(e) {
3088
+ toast('Failed to clear jobs', 'error');
3089
+ }
3090
+ }
3091
+
3092
  // --- Status ---
3093
  async function checkStatus() {
3094
  try {
 
3220
  if (!podPollInterval) {
3221
  podPollInterval = setInterval(loadPodStatus, 30000);
3222
  }
3223
+ } else if (pod.status === 'starting' || pod.status === 'setting_up') {
3224
+ const setupMsg = pod.setup_status || 'Starting pod...';
3225
+ statusText.innerHTML = `<span style="color:var(--yellow)">● ${setupMsg}</span>`;
3226
  startBtn.style.display = 'none';
3227
+ stopBtn.style.display = ''; // Allow stopping during setup
3228
  gpuSelect.style.display = 'none';
3229
  podInfo.style.display = 'none';
3230
  // Poll more frequently while starting
src/content_engine/models/__pycache__/__init__.cpython-311.pyc DELETED
Binary file (802 Bytes)
 
src/content_engine/models/__pycache__/database.cpython-311.pyc DELETED
Binary file (10.7 kB)
 
src/content_engine/models/__pycache__/schemas.cpython-311.pyc DELETED
Binary file (5.59 kB)
 
src/content_engine/models/database.py CHANGED
@@ -2,9 +2,7 @@
2
 
3
  from __future__ import annotations
4
 
5
- import os
6
  from datetime import datetime
7
- from pathlib import Path
8
 
9
  from sqlalchemy import (
10
  Boolean,
@@ -149,12 +147,36 @@ class ScheduledPost(Base):
149
  )
150
 
151
 
152
- # --- Engine / Session factories ---
 
153
 
154
- # Ensure database directory exists before creating engine
155
- _db_path = settings.database.url.replace("sqlite+aiosqlite:///", "")
156
- _db_dir = Path(_db_path).parent
157
- _db_dir.mkdir(parents=True, exist_ok=True)
 
 
158
 
159
  _catalog_engine = create_async_engine(
160
  settings.database.url,
 
2
 
3
  from __future__ import annotations
4
 
 
5
  from datetime import datetime
 
6
 
7
  from sqlalchemy import (
8
  Boolean,
 
147
  )
148
 
149
 
150
+ class TrainingJob(Base):
151
+ __tablename__ = "training_jobs"
152
 
153
+ id: Mapped[str] = mapped_column(String(36), primary_key=True)
154
+ name: Mapped[str] = mapped_column(String(128), nullable=False)
155
+ status: Mapped[str] = mapped_column(String(32), default="pending", index=True)
156
+ progress: Mapped[float] = mapped_column(Float, default=0.0)
157
+ current_epoch: Mapped[int] = mapped_column(Integer, default=0)
158
+ total_epochs: Mapped[int] = mapped_column(Integer, default=0)
159
+ current_step: Mapped[int] = mapped_column(Integer, default=0)
160
+ total_steps: Mapped[int] = mapped_column(Integer, default=0)
161
+ loss: Mapped[float | None] = mapped_column(Float)
162
+ started_at: Mapped[float | None] = mapped_column(Float)
163
+ completed_at: Mapped[float | None] = mapped_column(Float)
164
+ output_path: Mapped[str | None] = mapped_column(String(512))
165
+ error: Mapped[str | None] = mapped_column(Text)
166
+ log_text: Mapped[str | None] = mapped_column(Text) # newline-separated log lines
167
+ pod_id: Mapped[str | None] = mapped_column(String(64))
168
+ gpu_type: Mapped[str | None] = mapped_column(String(64))
169
+ backend: Mapped[str] = mapped_column(String(16), default="runpod")
170
+ base_model: Mapped[str | None] = mapped_column(String(64))
171
+ model_type: Mapped[str | None] = mapped_column(String(16))
172
+ trigger_word: Mapped[str | None] = mapped_column(String(128))
173
+ image_upload_dir: Mapped[str | None] = mapped_column(String(512))
174
+ created_at: Mapped[datetime] = mapped_column(
175
+ DateTime, server_default=func.now()
176
+ )
177
+
178
+
179
+ # --- Engine / Session factories ---
180
 
181
  _catalog_engine = create_async_engine(
182
  settings.database.url,
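Review note: the new `TrainingJob` model stores logs in a single `log_text` column described as "newline-separated log lines", while the in-memory `CloudTrainingJob` keeps a `log_lines` list capped at 200 entries. A minimal sketch of the round-trip between the two representations (helper names are illustrative, not from this diff):

```python
from __future__ import annotations

# Mirrors the 200-line cap enforced by CloudTrainingJob._log().
MAX_LOG_LINES = 200


def logs_to_text(log_lines: list[str]) -> str:
    """Join the newest MAX_LOG_LINES entries into newline-separated text
    suitable for the TrainingJob.log_text column."""
    return "\n".join(log_lines[-MAX_LOG_LINES:])


def text_to_logs(log_text: str | None) -> list[str]:
    """Restore the list form the API exposes as `log_lines`."""
    return log_text.split("\n") if log_text else []
```

Keeping the cap on the write path bounds the column size, so restarted servers reload at most the last 200 lines per job.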
src/content_engine/services/__pycache__/__init__.cpython-311.pyc DELETED
Binary file (232 Bytes)
 
src/content_engine/services/__pycache__/catalog.cpython-311.pyc DELETED
Binary file (13.1 kB)
 
src/content_engine/services/__pycache__/comfyui_client.cpython-311.pyc DELETED
Binary file (17.5 kB)
 
src/content_engine/services/__pycache__/lora_trainer.cpython-311.pyc DELETED
Binary file (19 kB)
 
src/content_engine/services/__pycache__/runpod_trainer.cpython-311.pyc DELETED
Binary file (32 kB)
 
src/content_engine/services/__pycache__/template_engine.cpython-311.pyc DELETED
Binary file (11 kB)
 
src/content_engine/services/__pycache__/variation_engine.cpython-311.pyc DELETED
Binary file (8.52 kB)
 
src/content_engine/services/__pycache__/workflow_builder.cpython-311.pyc DELETED
Binary file (7.16 kB)
 
src/content_engine/services/runpod_trainer.py CHANGED
@@ -27,6 +27,7 @@ logger = logging.getLogger(__name__)
27
 
28
  import os
29
  from content_engine.config import settings, IS_HF_SPACES
 
30
 
31
  LORA_OUTPUT_DIR = settings.paths.lora_dir
32
  if IS_HF_SPACES:
@@ -36,16 +37,30 @@ else:
36
 
37
  # RunPod GPU options (id -> display name, approx cost/hr)
38
  GPU_OPTIONS = {
39
- "NVIDIA GeForce RTX 3090": "RTX 3090 (~$0.22/hr)",
40
- "NVIDIA GeForce RTX 4090": "RTX 4090 (~$0.44/hr)",
41
- "NVIDIA RTX A4000": "RTX A4000 (~$0.20/hr)",
42
- "NVIDIA RTX A5000": "RTX A5000 (~$0.28/hr)",
43
- "NVIDIA RTX A6000": "RTX A6000 (~$0.76/hr)",
 
 
44
  "NVIDIA A100 80GB PCIe": "A100 80GB (~$1.89/hr)",
 
 
45
  }
46
 
47
  DEFAULT_GPU = "NVIDIA GeForce RTX 4090"
48
 
 
  # Docker image with PyTorch + CUDA pre-installed
50
  DOCKER_IMAGE = "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04"
51
 
@@ -84,12 +99,15 @@ class CloudTrainingJob:
84
  cost_estimate: str | None = None
85
  base_model: str = "sd15_realistic"
86
  model_type: str = "sd15"
 
87
 
88
  def _log(self, msg: str):
89
  self.log_lines.append(msg)
90
  if len(self.log_lines) > 200:
91
  self.log_lines = self.log_lines[-200:]
92
  logger.info("[%s] %s", self.id, msg)
 
 
93
 
94
 
95
  class RunPodTrainer:
@@ -100,6 +118,7 @@ class RunPodTrainer:
100
  runpod.api_key = api_key
101
  self._jobs: dict[str, CloudTrainingJob] = {}
102
  self._model_registry = load_model_registry()
 
103
 
104
  @property
105
  def available(self) -> bool:
@@ -123,6 +142,8 @@ class RunPodTrainer:
123
  "learning_rate": cfg.get("learning_rate", 1e-4),
124
  "network_rank": cfg.get("network_rank", 32),
125
  "network_alpha": cfg.get("network_alpha", 16),
 
 
126
  "vram_required_gb": cfg.get("vram_required_gb", 8),
127
  "recommended_images": cfg.get("recommended_images", "15-30 photos"),
128
  }
@@ -179,6 +200,8 @@ class RunPodTrainer:
179
  model_type=model_type,
180
  )
181
  self._jobs[job_id] = job
 
 
182
 
183
  # Launch the full pipeline as a background task
184
  asyncio.create_task(self._run_cloud_training(
@@ -224,15 +247,26 @@ class RunPodTrainer:
224
  job.status = "creating_pod"
225
  job._log(f"Creating RunPod with {job.gpu_type}...")
226
 
 
227
  pod = await asyncio.to_thread(
228
  runpod.create_pod,
229
  f"lora-train-{job.id}",
230
  DOCKER_IMAGE,
231
  job.gpu_type,
232
- volume_in_gb=75, # Increased for FLUX models
233
- container_disk_in_gb=30,
234
- ports="22/tcp",
235
- docker_args="bash -c 'apt-get update && apt-get install -y openssh-server && mkdir -p /run/sshd && echo \"root:runpod\" | chpasswd && echo \"PermitRootLogin yes\" >> /etc/ssh/sshd_config && /usr/sbin/sshd && sleep infinity'",
236
  )
237
 
238
  job.pod_id = pod["id"]
@@ -263,89 +297,273 @@ class RunPodTrainer:
263
  await asyncio.sleep(5)
264
 
265
  job._log("SSH connected")
 
266
  sftp = ssh.open_sftp()
 
267
 
268
- # Step 3: Upload training images
269
  job.status = "uploading"
270
- job._log(f"Uploading {len(image_paths)} training images...")
 
 
272
  folder_name = f"10_{trigger_word or 'character'}"
273
  self._ssh_exec(ssh, f"mkdir -p /workspace/dataset/{folder_name}")
274
- for img_path in image_paths:
275
  p = Path(img_path)
276
  if p.exists():
277
- remote_path = f"/workspace/dataset/{folder_name}/{p.name}"
278
- sftp.put(str(p), remote_path)
 
279
  # Upload matching caption .txt file if it exists locally
280
  local_caption = p.with_suffix(".txt")
281
  if local_caption.exists():
282
- remote_caption = f"/workspace/dataset/{folder_name}/{local_caption.name}"
283
  sftp.put(str(local_caption), remote_caption)
284
  else:
285
  # Fallback: create caption from trigger word
286
- remote_caption = remote_path.rsplit(".", 1)[0] + ".txt"
287
  with sftp.open(remote_caption, "w") as f:
288
  f.write(trigger_word or "")
 
 
 
290
  job._log("Images uploaded")
291
 
292
- # Step 4: Install Kohya sd-scripts on the pod
293
  job.status = "installing"
294
- job._log("Installing Kohya sd-scripts (this takes a few minutes)...")
295
  job.progress = 0.05
296
 
297
- install_cmds = [
298
- "cd /workspace && git clone --depth 1 https://github.com/kohya-ss/sd-scripts.git",
299
- "cd /workspace/sd-scripts && pip install -r requirements.txt 2>&1 | tail -1",
300
- "pip install accelerate lion-pytorch prodigy-optimizer safetensors bitsandbytes xformers 2>&1 | tail -1",
301
- ]
 
 
302
  for cmd in install_cmds:
303
  out = self._ssh_exec(ssh, cmd, timeout=600)
304
  job._log(out[:200] if out else "done")
305
 
306
- # Download base model from HuggingFace
307
  hf_repo = model_cfg.get("hf_repo", "SG161222/Realistic_Vision_V5.1_noVAE")
308
  hf_filename = model_cfg.get("hf_filename", "Realistic_Vision_V5.1_fp16-no-ema.safetensors")
309
  model_name = model_cfg.get("name", job.base_model)
310
 
311
- job._log(f"Downloading base model: {model_name}...")
312
  job.progress = 0.1
313
-
314
  self._ssh_exec(ssh, """pip install huggingface_hub 2>&1 | tail -1""", timeout=120)
315
 
316
- # Download main model
317
- self._ssh_exec(ssh, f"""
318
- python -c "
 
 
319
  from huggingface_hub import hf_hub_download
320
  hf_hub_download('{hf_repo}', '{hf_filename}', local_dir='/workspace/models')
321
  " 2>&1 | tail -5
322
- """, timeout=1200) # Longer timeout for large models
323
 
324
- # For FLUX, download additional required models (CLIP, T5, VAE)
325
- if model_type == "flux":
326
- job._log("Downloading FLUX auxiliary models (CLIP, T5, VAE)...")
327
- job.progress = 0.12
328
-
329
- self._ssh_exec(ssh, """
330
- python -c "
 
 
 
331
  from huggingface_hub import hf_hub_download
332
- # CLIP text encoder
333
  hf_hub_download('comfyanonymous/flux_text_encoders', 'clip_l.safetensors', local_dir='/workspace/models')
334
- # T5 text encoder (fp16)
335
  hf_hub_download('comfyanonymous/flux_text_encoders', 't5xxl_fp16.safetensors', local_dir='/workspace/models')
336
- # VAE/AutoEncoder
337
  hf_hub_download('black-forest-labs/FLUX.1-dev', 'ae.safetensors', local_dir='/workspace/models')
338
  " 2>&1 | tail -5
339
- """, timeout=1200)
340
 
341
- job._log("Base model downloaded")
342
  job.progress = 0.15
343
 
344
  # Step 5: Run training
345
  job.status = "training"
346
  job._log(f"Starting {model_type.upper()} LoRA training...")
347
 
348
- model_path = f"/workspace/models/{hf_filename}"
 
 
349
 
350
  # Build training command based on model type
351
  train_cmd = self._build_training_command(
@@ -361,6 +579,7 @@ hf_hub_download('black-forest-labs/FLUX.1-dev', 'ae.safetensors', local_dir='/wo
361
  optimizer=optimizer,
362
  save_every_n_epochs=save_every_n_epochs,
363
  model_cfg=model_cfg,
 
364
  )
365
 
366
  # Execute training and stream output
@@ -371,19 +590,37 @@ hf_hub_download('black-forest-labs/FLUX.1-dev', 'ae.safetensors', local_dir='/wo
371
 
372
  # Read output progressively
373
  buffer = ""
 
374
  while not channel.exit_status_ready() or channel.recv_ready():
375
  if channel.recv_ready():
376
  chunk = channel.recv(4096).decode("utf-8", errors="replace")
377
  buffer += chunk
378
- # Process complete lines
379
- while "\n" in buffer:
380
- line, buffer = buffer.split("\n", 1)
381
- line = line.strip()
 
 
382
  if not line:
383
  continue
384
  job._log(line)
385
  self._parse_progress(job, line)
 
386
  else:
 
387
  await asyncio.sleep(2)
388
 
389
  exit_code = channel.recv_exit_status()
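Review note: the streaming loop hands each complete line to `self._parse_progress(job, line)`, whose body is not shown in this diff. kohya-ss training output typically contains `steps: <n>/<total>` and `avr_loss=<x>` fragments, so a parser along those lines (a sketch under that assumption, not the actual implementation) could look like:

```python
from __future__ import annotations

import re

# Patterns assumed from typical kohya-ss tqdm output; adjust to the real logs.
STEP_RE = re.compile(r"steps:\s*(\d+)/(\d+)")
LOSS_RE = re.compile(r"(?:avr_loss|loss)[=:]\s*([0-9.]+)")


def parse_progress(line: str) -> tuple[int | None, int | None, float | None]:
    """Extract (current_step, total_steps, loss) from one training log line.

    Returns None for any field the line doesn't carry, so callers can
    update job state incrementally without clobbering known values.
    """
    step = total = loss = None
    if m := STEP_RE.search(line):
        step, total = int(m.group(1)), int(m.group(2))
    if m := LOSS_RE.search(line):
        loss = float(m.group(1))
    return step, total, loss
```

With `step`/`total` in hand, `job.progress` can be mapped onto the 0.15–0.9 band the surrounding code reserves for the training phase.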
@@ -393,27 +630,33 @@ hf_hub_download('black-forest-labs/FLUX.1-dev', 'ae.safetensors', local_dir='/wo
393
  job._log("Training completed on RunPod!")
394
  job.progress = 0.9
395
 
396
- # Step 6: Download the LoRA file
397
  job.status = "downloading"
398
- job._log("Downloading trained LoRA...")
399
-
400
- LORA_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
401
- local_path = LORA_OUTPUT_DIR / f"{name}.safetensors"
402
 
403
- # Try the main output file first, then look for any .safetensors
404
- try:
405
- sftp.get(f"/workspace/output/{name}.safetensors", str(local_path))
406
- except FileNotFoundError:
407
- # Find any safetensors file
408
- remote_files = sftp.listdir("/workspace/output")
409
- safetensors = [f for f in remote_files if f.endswith(".safetensors")]
410
- if safetensors:
411
- sftp.get(f"/workspace/output/{safetensors[-1]}", str(local_path))
 
412
  else:
413
  raise RuntimeError("No .safetensors output found")
414
 
  job.output_path = str(local_path)
416
- job._log(f"LoRA saved to {local_path}")
417
 
418
  # Done!
419
  job.status = "completed"
@@ -449,6 +692,92 @@ hf_hub_download('black-forest-labs/FLUX.1-dev', 'ae.safetensors', local_dir='/wo
449
  except Exception as e:
450
  job._log(f"Warning: Failed to terminate pod {job.pod_id}: {e}")
451
 
452
  def _build_training_command(
453
  self,
454
  *,
@@ -464,6 +793,7 @@ hf_hub_download('black-forest-labs/FLUX.1-dev', 'ae.safetensors', local_dir='/wo
464
  optimizer: str,
465
  save_every_n_epochs: int,
466
  model_cfg: dict,
 
467
  ) -> str:
468
  """Build the training command based on model type."""
469
 
@@ -497,8 +827,70 @@ hf_hub_download('black-forest-labs/FLUX.1-dev', 'ae.safetensors', local_dir='/wo
497
  lr_scheduler = model_cfg.get("lr_scheduler", "cosine_with_restarts")
498
  base_args += f" \\\n --lr_scheduler={lr_scheduler}"
499
 
500
- if model_type == "flux":
501
- # FLUX-specific training
 
502
  script = "flux_train_network.py"
503
  flux_args = f"""
504
  --pretrained_model_name_or_path="{model_path}" \
@@ -539,24 +931,33 @@ hf_hub_download('black-forest-labs/FLUX.1-dev', 'ae.safetensors', local_dir='/wo
539
  --clip_skip={clip_skip} \
540
  --xformers {base_args} 2>&1"""
541
 
542
- async def _wait_for_pod_ready(self, job: CloudTrainingJob, timeout: int = 300) -> tuple[str, int]:
543
  """Wait for pod to be running and return (ssh_host, ssh_port)."""
544
  start = time.time()
545
  while time.time() - start < timeout:
546
- pod = await asyncio.to_thread(runpod.get_pod, job.pod_id)
 
547
 
548
  status = pod.get("desiredStatus", "")
549
  runtime = pod.get("runtime")
550
 
551
  if status == "RUNNING" and runtime:
552
- ports = runtime.get("ports", [])
553
- for port_info in ports:
554
  if port_info.get("privatePort") == 22:
555
  ip = port_info.get("ip")
556
  public_port = port_info.get("publicPort")
557
  if ip and public_port:
558
  return ip, int(public_port)
559
 
 
 
 
 
560
  await asyncio.sleep(5)
561
 
562
  raise RuntimeError(f"Pod did not become ready within {timeout}s")
@@ -623,3 +1024,28 @@ hf_hub_download('black-forest-labs/FLUX.1-dev', 'ae.safetensors', local_dir='/wo
623
  job.status = "failed"
624
  job.error = "Cancelled by user"
625
  return True
 
 
27
 
28
  import os
29
  from content_engine.config import settings, IS_HF_SPACES
30
+ from content_engine.models.database import catalog_session_factory, TrainingJob as TrainingJobDB
31
 
32
  LORA_OUTPUT_DIR = settings.paths.lora_dir
33
  if IS_HF_SPACES:
 
37
 
38
  # RunPod GPU options (id -> display name, approx cost/hr)
39
  GPU_OPTIONS = {
40
+ # 16-32GB - SD 1.5, SDXL, FLUX.1 only (NOT enough for FLUX.2)
41
+ "NVIDIA GeForce RTX 3090": "RTX 3090 24GB (~$0.22/hr)",
42
+ "NVIDIA GeForce RTX 4090": "RTX 4090 24GB (~$0.44/hr)",
43
+ "NVIDIA GeForce RTX 5090": "RTX 5090 32GB (~$0.69/hr)",
44
+ "NVIDIA RTX A4000": "RTX A4000 16GB (~$0.20/hr)",
45
+ "NVIDIA RTX A5000": "RTX A5000 24GB (~$0.28/hr)",
46
+ # 48GB+ - Required for FLUX.2 Dev (Mistral text encoder needs ~48GB)
47
+ "NVIDIA RTX A6000": "RTX A6000 48GB (~$0.76/hr)",
48
+ "NVIDIA A40": "A40 48GB (~$0.64/hr)",
49
+ "NVIDIA L40": "L40 48GB (~$0.89/hr)",
50
+ "NVIDIA L40S": "L40S 48GB (~$1.09/hr)",
51
  "NVIDIA A100 80GB PCIe": "A100 80GB (~$1.89/hr)",
52
+ "NVIDIA A100-SXM4-80GB": "A100 SXM 80GB (~$1.64/hr)",
53
+ "NVIDIA H100 80GB HBM3": "H100 80GB (~$3.89/hr)",
54
  }
55
 
56
  DEFAULT_GPU = "NVIDIA GeForce RTX 4090"
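Review note: the `GPU_OPTIONS` display strings embed an approximate hourly price, which is enough to derive the `cost_estimate` field on `CloudTrainingJob`. A sketch of extracting and using that figure (the real helper, if any, isn't shown in this diff):

```python
from __future__ import annotations

import re


def hourly_cost(label: str) -> float | None:
    """Pull the ~$X.XX/hr figure out of a GPU_OPTIONS display string."""
    m = re.search(r"\$([0-9.]+)/hr", label)
    return float(m.group(1)) if m else None


def estimate_cost(label: str, hours: float) -> str:
    """Rough run-cost string for the UI; prices are approximate anyway."""
    rate = hourly_cost(label) or 0.0
    return f"~${rate * hours:.2f} for ~{hours:g}h"
```

Parsing the label keeps the prices in one place, at the cost of coupling the estimate to the display-string format.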
57
 
58
+ # Network volume for persistent model storage (avoids re-downloading models each run)
59
+ # Set RUNPOD_VOLUME_ID in .env to use a persistent volume
60
+ # Set RUNPOD_VOLUME_DC to the datacenter ID where the volume lives (e.g. "EU-RO-1")
61
+ NETWORK_VOLUME_ID = os.environ.get("RUNPOD_VOLUME_ID", "")
62
+ NETWORK_VOLUME_DC = os.environ.get("RUNPOD_VOLUME_DC", "")
63
+
64
  # Docker image with PyTorch + CUDA pre-installed
65
  DOCKER_IMAGE = "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04"
66
 
 
99
  cost_estimate: str | None = None
100
  base_model: str = "sd15_realistic"
101
  model_type: str = "sd15"
102
+ _db_callback: Any = None # called on state changes to persist to DB
103
 
104
  def _log(self, msg: str):
105
  self.log_lines.append(msg)
106
  if len(self.log_lines) > 200:
107
  self.log_lines = self.log_lines[-200:]
108
  logger.info("[%s] %s", self.id, msg)
109
+ if self._db_callback:
110
+ self._db_callback(self)
111
 
112
 
113
  class RunPodTrainer:
 
118
  runpod.api_key = api_key
119
  self._jobs: dict[str, CloudTrainingJob] = {}
120
  self._model_registry = load_model_registry()
121
+ self._loaded_from_db = False
122
 
123
  @property
124
  def available(self) -> bool:
 
142
  "learning_rate": cfg.get("learning_rate", 1e-4),
143
  "network_rank": cfg.get("network_rank", 32),
144
  "network_alpha": cfg.get("network_alpha", 16),
145
+ "optimizer": cfg.get("optimizer", "AdamW8bit"),
146
+ "lr_scheduler": cfg.get("lr_scheduler", "cosine"),
147
  "vram_required_gb": cfg.get("vram_required_gb", 8),
148
  "recommended_images": cfg.get("recommended_images", "15-30 photos"),
149
  }
 
200
  model_type=model_type,
201
  )
202
  self._jobs[job_id] = job
203
+ job._db_callback = self._schedule_db_save
204
+ asyncio.ensure_future(self._save_to_db(job))
205
 
206
  # Launch the full pipeline as a background task
207
  asyncio.create_task(self._run_cloud_training(
 
247
  job.status = "creating_pod"
248
  job._log(f"Creating RunPod with {job.gpu_type}...")
249
 
250
+ # Use network volume if configured (persists models across runs)
251
+ pod_kwargs = {
252
+ "container_disk_in_gb": 30,
253
+ "ports": "22/tcp",
254
+ "docker_args": "bash -c 'apt-get update && apt-get install -y openssh-server && mkdir -p /run/sshd && echo root:runpod | chpasswd && /usr/sbin/sshd -o PermitRootLogin=yes && sleep infinity'",
255
+ }
256
+ if NETWORK_VOLUME_ID:
257
+ pod_kwargs["network_volume_id"] = NETWORK_VOLUME_ID
258
+ if NETWORK_VOLUME_DC:
259
+ pod_kwargs["data_center_id"] = NETWORK_VOLUME_DC
260
+ job._log(f"Using persistent network volume: {NETWORK_VOLUME_ID} (DC: {NETWORK_VOLUME_DC or 'auto'})")
261
+ else:
262
+ pod_kwargs["volume_in_gb"] = 75
263
+
264
  pod = await asyncio.to_thread(
265
  runpod.create_pod,
266
  f"lora-train-{job.id}",
267
  DOCKER_IMAGE,
268
  job.gpu_type,
269
+ **pod_kwargs,
 
 
 
270
  )
271
 
272
  job.pod_id = pod["id"]
 
297
  await asyncio.sleep(5)
298
 
299
  job._log("SSH connected")
300
+
301
+ # If using network volume, symlink to /workspace so all paths work
302
+ if NETWORK_VOLUME_ID:
303
+ self._ssh_exec(ssh, "mkdir -p /runpod-volume/models && rm -rf /workspace/models 2>/dev/null; ln -sf /runpod-volume/models /workspace/models")
304
+ job._log("Network volume symlinked to /workspace")
305
+
306
+ # Enable keepalive to prevent SSH timeout during uploads
307
+ transport = ssh.get_transport()
308
+ transport.set_keepalive(30)
309
+
310
  sftp = ssh.open_sftp()
311
+ sftp.get_channel().settimeout(300) # 5 min timeout per file
312
 
313
+ # Step 3: Upload training images (compress first to speed up transfer)
314
  job.status = "uploading"
315
+ resolution = model_cfg.get("resolution", 1024)
316
+ job._log(f"Compressing and uploading {len(image_paths)} training images...")
317
+
318
+ import tempfile
319
+ from PIL import Image
320
+ tmp_dir = Path(tempfile.mkdtemp(prefix="lora_upload_"))
321
 
322
  folder_name = f"10_{trigger_word or 'character'}"
323
  self._ssh_exec(ssh, f"mkdir -p /workspace/dataset/{folder_name}")
324
+ for i, img_path in enumerate(image_paths):
325
  p = Path(img_path)
326
  if p.exists():
327
+ # Resize and convert to JPEG to reduce upload size
328
+ try:
329
+ img = Image.open(p)
+ img.thumbnail((resolution * 2, resolution * 2), Image.LANCZOS)
+ img = img.convert("RGB")  # JPEG cannot store alpha; convert RGBA/P first
+ compressed = tmp_dir / f"{p.stem}.jpg"
+ img.save(compressed, "JPEG", quality=95)
333
+ upload_path = compressed
334
+ except Exception:
335
+ upload_path = p # fallback to original
336
+
337
+ remote_name = f"{p.stem}.jpg" if upload_path.suffix == ".jpg" else p.name
338
+ remote_path = f"/workspace/dataset/{folder_name}/{remote_name}"
339
+ for attempt in range(3):
340
+ try:
341
+ sftp.put(str(upload_path), remote_path)
342
+ break
343
+ except (EOFError, OSError):
344
+ if attempt == 2:
345
+ raise
346
+ job._log(f"Upload retry {attempt+1} for {p.name}")
347
+ sftp.close()
348
+ sftp = ssh.open_sftp()
349
+ sftp.get_channel().settimeout(300)
350
  # Upload matching caption .txt file if it exists locally
351
  local_caption = p.with_suffix(".txt")
352
  if local_caption.exists():
353
+ remote_caption = f"/workspace/dataset/{folder_name}/{p.stem}.txt"
354
  sftp.put(str(local_caption), remote_caption)
355
  else:
356
  # Fallback: create caption from trigger word
357
+ remote_caption = f"/workspace/dataset/{folder_name}/{p.stem}.txt"
358
  with sftp.open(remote_caption, "w") as f:
359
  f.write(trigger_word or "")
360
+ job._log(f"Uploaded {i+1}/{len(image_paths)}: {p.name}")
361
+
362
+ # Cleanup temp compressed images
363
+ import shutil
364
+ shutil.rmtree(tmp_dir, ignore_errors=True)
365
 
366
  job._log("Images uploaded")
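The per-file loop above retries a failed transfer up to three times, rebuilding the SFTP channel between attempts; the bare retry skeleton with paramiko abstracted away (`put` and `reconnect` are stand-in callables):

```python
def put_with_retry(put, reconnect, attempts=3):
    """Call put(); on EOFError/OSError, run reconnect() and try again.
    The final failed attempt re-raises so the caller sees the real error."""
    for attempt in range(attempts):
        try:
            return put()
        except (EOFError, OSError):
            if attempt == attempts - 1:
                raise
            reconnect()
```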
367
 
368
+ # Step 4: Install training framework on the pod (skip if cached on volume)
369
  job.status = "installing"
 
370
  job.progress = 0.05
371
 
372
+ training_framework = model_cfg.get("training_framework", "sd-scripts")
373
+
374
+ if training_framework == "musubi-tuner":
375
+ # FLUX.2 uses musubi-tuner (Kohya's newer framework)
376
+ tuner_dir = "/workspace/musubi-tuner"
377
+ install_cmds = []
378
+
379
+ # Check if already present in workspace
380
+ tuner_exist = self._ssh_exec(ssh, f"test -f {tuner_dir}/pyproject.toml && echo EXISTS || echo MISSING").strip()
381
+ if tuner_exist == "EXISTS":
382
+ job._log("musubi-tuner found in workspace")
383
+ else:
384
+ # Check volume cache
385
+ vol_exist = self._ssh_exec(ssh, "test -f /runpod-volume/musubi-tuner/pyproject.toml && echo EXISTS || echo MISSING").strip()
386
+ if vol_exist == "EXISTS":
387
+ job._log("Restoring musubi-tuner from volume cache...")
388
+ self._ssh_exec(ssh, f"rm -rf {tuner_dir} 2>/dev/null; cp -r /runpod-volume/musubi-tuner {tuner_dir}")
389
+ else:
390
+ job._log("Cloning musubi-tuner from GitHub...")
391
+ self._ssh_exec(ssh, f"rm -rf {tuner_dir} /runpod-volume/musubi-tuner 2>/dev/null; true")
392
+ install_cmds.append("cd /workspace && git clone --depth 1 https://github.com/kohya-ss/musubi-tuner.git")
393
+ # Save to volume for future pods
394
+ if NETWORK_VOLUME_ID:
395
+ install_cmds.append(f"cp -r {tuner_dir} /runpod-volume/musubi-tuner")
396
+
397
+ # Always install pip deps (they are pod-local, lost on every new pod)
398
+ job._log("Installing pip dependencies (accelerate, torch, etc.)...")
399
+ install_cmds.extend([
400
+ f"cd {tuner_dir} && pip install -e . 2>&1 | tail -5",
401
+ "pip install accelerate lion-pytorch prodigyopt safetensors bitsandbytes 2>&1 | tail -5",
402
+ ])
403
+ else:
404
+ # SD 1.5 / SDXL / FLUX.1 use sd-scripts
405
+ scripts_exist = self._ssh_exec(ssh, "test -f /workspace/sd-scripts/setup.py && echo EXISTS || echo MISSING").strip()
406
+ if scripts_exist == "EXISTS":
407
+ job._log("Kohya sd-scripts already present in workspace, updating...")
408
+ install_cmds = [
409
+ "cd /workspace/sd-scripts && git pull 2>&1 | tail -1",
410
+ ]
411
+ else:
412
+ job._log("Installing Kohya sd-scripts (this takes a few minutes)...")
413
+ install_cmds = [
414
+ "cd /workspace && git clone --depth 1 https://github.com/kohya-ss/sd-scripts.git",
415
+ ]
416
+ # Always install pip deps (pod-local, lost on new pods)
417
+ install_cmds.extend([
418
+ "cd /workspace/sd-scripts && pip install -r requirements.txt 2>&1 | tail -1",
419
+ "pip install accelerate lion-pytorch prodigyopt safetensors bitsandbytes xformers 2>&1 | tail -1",
420
+ ])
421
  for cmd in install_cmds:
422
  out = self._ssh_exec(ssh, cmd, timeout=600)
423
  job._log(out[:200] if out else "done")
424
 
425
+ # Download base model from HuggingFace (skip if already on network volume)
426
  hf_repo = model_cfg.get("hf_repo", "SG161222/Realistic_Vision_V5.1_noVAE")
427
  hf_filename = model_cfg.get("hf_filename", "Realistic_Vision_V5.1_fp16-no-ema.safetensors")
428
  model_name = model_cfg.get("name", job.base_model)
429
 
 
430
  job.progress = 0.1
 
431
  self._ssh_exec(ssh, """pip install huggingface_hub 2>&1 | tail -1""", timeout=120)
432
 
433
+ if model_type == "flux2":
434
+ # FLUX.2 models are stored in a directory structure on the volume
435
+ flux2_dir = "/workspace/models/FLUX.2-dev"
436
+ dit_path = f"{flux2_dir}/flux2-dev.safetensors"
437
+ vae_path = f"{flux2_dir}/ae.safetensors" # Original BFL format (not diffusers)
438
+ te_path = f"{flux2_dir}/text_encoder/model-00001-of-00010.safetensors"
439
+
440
+ dit_exists = self._ssh_exec(ssh, f"test -f {dit_path} && echo EXISTS || echo MISSING").strip()
441
+ vae_exists = self._ssh_exec(ssh, f"test -f {vae_path} && echo EXISTS || echo MISSING").strip()
442
+ te_exists = self._ssh_exec(ssh, f"test -f {te_path} && echo EXISTS || echo MISSING").strip()
443
+
444
+ if dit_exists != "EXISTS" or te_exists != "EXISTS":
445
+ missing = []
446
+ if dit_exists != "EXISTS":
447
+ missing.append("DiT")
448
+ if te_exists != "EXISTS":
449
+ missing.append("text encoder")
450
+ raise RuntimeError(f"FLUX.2 Dev missing on volume: {', '.join(missing)}. Please download models to the network volume first.")
451
+
452
+ # Download ae.safetensors (original format VAE) if not present
453
+ if vae_exists != "EXISTS":
454
+ job._log("Downloading FLUX.2 VAE (ae.safetensors, 336MB)...")
455
+ self._ssh_exec(ssh, """pip install huggingface_hub 2>&1 | tail -1""", timeout=120)
456
+ self._ssh_exec(ssh, f"""python -c "
457
+ from huggingface_hub import hf_hub_download
458
+ hf_hub_download('black-forest-labs/FLUX.2-dev', 'ae.safetensors', local_dir='{flux2_dir}')
459
+ print('Downloaded ae.safetensors')
460
+ " 2>&1 | tail -5""", timeout=600)
461
+ # Verify download
462
+ vae_check = self._ssh_exec(ssh, f"test -f {vae_path} && echo EXISTS || echo MISSING").strip()
463
+ if vae_check != "EXISTS":
464
+ raise RuntimeError("Failed to download ae.safetensors")
465
+ job._log("VAE downloaded")
466
+
467
+ job._log("FLUX.2 Dev models ready")
468
+
469
+ else:
470
+ # SD 1.5 / SDXL / FLUX.1 — download single model file
471
+ model_exists = self._ssh_exec(ssh, f"test -f /workspace/models/{hf_filename} && echo EXISTS || echo MISSING").strip()
472
+ if model_exists == "EXISTS":
473
+ job._log(f"Base model already cached on volume: {model_name}")
474
+ else:
475
+ job._log(f"Downloading base model: {model_name}...")
476
+ self._ssh_exec(ssh, f"""
477
+ python -c "
478
  from huggingface_hub import hf_hub_download
479
  hf_hub_download('{hf_repo}', '{hf_filename}', local_dir='/workspace/models')
480
  " 2>&1 | tail -5
481
+ """, timeout=1200)
482
 
483
+ # For FLUX.1, download additional required models (CLIP, T5, VAE)
484
+ if model_type == "flux":
485
+ flux_files_check = self._ssh_exec(ssh, "test -f /workspace/models/clip_l.safetensors && test -f /workspace/models/t5xxl_fp16.safetensors && test -f /workspace/models/ae.safetensors && echo EXISTS || echo MISSING").strip()
486
+ if flux_files_check == "EXISTS":
487
+ job._log("FLUX.1 auxiliary models already cached on volume")
488
+ else:
489
+ job._log("Downloading FLUX.1 auxiliary models (CLIP, T5, VAE)...")
490
+ job.progress = 0.12
491
+ self._ssh_exec(ssh, """
492
+ python -c "
493
  from huggingface_hub import hf_hub_download
 
494
  hf_hub_download('comfyanonymous/flux_text_encoders', 'clip_l.safetensors', local_dir='/workspace/models')
 
495
  hf_hub_download('comfyanonymous/flux_text_encoders', 't5xxl_fp16.safetensors', local_dir='/workspace/models')
 
496
  hf_hub_download('black-forest-labs/FLUX.1-dev', 'ae.safetensors', local_dir='/workspace/models')
497
  " 2>&1 | tail -5
498
+ """, timeout=1200)
499
 
500
+ job._log("Base model ready")
501
  job.progress = 0.15
502
 
503
  # Step 5: Run training
504
  job.status = "training"
505
  job._log(f"Starting {model_type.upper()} LoRA training...")
506
 
507
+ if model_type == "flux2":
508
+ model_path = "/workspace/models/FLUX.2-dev/flux2-dev.safetensors"
509
+ else:
510
+ model_path = f"/workspace/models/{hf_filename}"
511
+
512
+ # For musubi-tuner, create TOML dataset config
513
+ if training_framework == "musubi-tuner":
514
+ folder_name = f"10_{trigger_word or 'character'}"
515
+ toml_content = f"""[[datasets]]
516
+ image_directory = "/workspace/dataset/{folder_name}"
517
+ caption_extension = ".txt"
518
+ batch_size = 1
519
+ num_repeats = 10
520
+ resolution = [{resolution}, {resolution}]
521
+ """
522
+ self._ssh_exec(ssh, f"cat > /workspace/dataset.toml << 'TOMLEOF'\n{toml_content}TOMLEOF")
523
+ job._log("Created dataset.toml config")
524
+
525
+ # musubi-tuner requires pre-caching latents and text encoder outputs
526
+ flux2_dir = "/workspace/models/FLUX.2-dev"
527
+ vae_path = f"{flux2_dir}/ae.safetensors"
528
+ te_path = f"{flux2_dir}/text_encoder/model-00001-of-00010.safetensors"
529
+
530
+ job._log("Caching latents (VAE encoding)...")
531
+ job.progress = 0.15
532
+ self._schedule_db_save(job)
533
+ cache_latents_cmd = (
534
+ f"cd /workspace/musubi-tuner && PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python src/musubi_tuner/flux_2_cache_latents.py"
535
+ f" --dataset_config /workspace/dataset.toml"
536
+ f" --vae {vae_path}"
537
+ f" --model_version dev"
538
+ f" --vae_dtype bfloat16"
539
+ f" 2>&1 | tee /tmp/cache_latents.log; echo EXIT_CODE=${{PIPESTATUS[0]}}"
540
+ )
541
+ out = self._ssh_exec(ssh, cache_latents_cmd, timeout=600)
542
+ # Get last lines which have the real error
543
+ last_lines = out.split('\n')[-30:]
544
+ job._log('\n'.join(last_lines))
545
+ if "EXIT_CODE=0" not in out:
546
+ # Fetch the full error log
547
+ err_log = self._ssh_exec(ssh, "grep -i 'error\\|exception\\|traceback\\|failed' /tmp/cache_latents.log | tail -10")
548
+ job._log(f"Cache error details: {err_log}")
549
+ raise RuntimeError("Latent caching failed")
550
+
551
+ job._log("Caching text encoder outputs (bf16)...")
552
+ job.progress = 0.25
553
+ self._schedule_db_save(job)
554
+ cache_te_cmd = (
555
+ f"cd /workspace/musubi-tuner && PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True"
556
+ f" python src/musubi_tuner/flux_2_cache_text_encoder_outputs.py"
557
+ f" --dataset_config /workspace/dataset.toml"
558
+ f" --text_encoder {te_path}"
559
+ f" --model_version dev"
560
+ f" --batch_size 1"
561
+ f" 2>&1; echo EXIT_CODE=$?"
562
+ )
563
+ out = self._ssh_exec(ssh, cache_te_cmd, timeout=600)
564
+ job._log(out[-500:] if out else "done")
565
+ if "EXIT_CODE=0" not in out:
566
+ raise RuntimeError(f"Text encoder caching failed: {out[-200:]}")
567
 
568
  # Build training command based on model type
569
  train_cmd = self._build_training_command(
 
579
  optimizer=optimizer,
580
  save_every_n_epochs=save_every_n_epochs,
581
  model_cfg=model_cfg,
582
+ gpu_type=job.gpu_type,
583
  )
584
 
585
  # Execute training and stream output
 
590
 
591
  # Read output progressively
592
  buffer = ""
593
+ last_flush = time.time()
594
  while not channel.exit_status_ready() or channel.recv_ready():
595
  if channel.recv_ready():
596
  chunk = channel.recv(4096).decode("utf-8", errors="replace")
597
  buffer += chunk
598
+ # Process complete lines (handle both \n and \r for tqdm progress)
599
+ while "\n" in buffer or "\r" in buffer:
600
+ # Split on whichever comes first
601
+ n_pos = buffer.find("\n")
602
+ r_pos = buffer.find("\r")
603
+ if n_pos == -1:
604
+ split_pos = r_pos
605
+ elif r_pos == -1:
606
+ split_pos = n_pos
607
+ else:
608
+ split_pos = min(n_pos, r_pos)
609
+ line = buffer[:split_pos].strip()
610
+ buffer = buffer[split_pos + 1:]
611
  if not line:
612
  continue
613
  job._log(line)
614
  self._parse_progress(job, line)
615
+ self._schedule_db_save(job)
616
  else:
617
+ # Periodically flush buffer for partial tqdm lines
618
+ if buffer.strip() and time.time() - last_flush > 10:
619
+ job._log(buffer.strip())
620
+ self._parse_progress(job, buffer.strip())
621
+ buffer = ""
622
+ last_flush = time.time()
623
+ self._schedule_db_save(job)
624
  await asyncio.sleep(2)
625
 
626
  exit_code = channel.recv_exit_status()
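The \n/\r handling in the read loop exists because tqdm redraws its progress bar with bare carriage returns; the same splitting logic as a standalone helper (the name is ours):

```python
def pop_lines(buffer: str):
    """Split complete lines out of a stream buffer on either '\\n' or '\\r'
    (tqdm uses '\\r' to redraw); returns (lines, leftover_partial)."""
    lines = []
    while "\n" in buffer or "\r" in buffer:
        n_pos = buffer.find("\n")
        r_pos = buffer.find("\r")
        if n_pos == -1:
            split_pos = r_pos
        elif r_pos == -1:
            split_pos = n_pos
        else:
            split_pos = min(n_pos, r_pos)
        line, buffer = buffer[:split_pos].strip(), buffer[split_pos + 1:]
        if line:
            lines.append(line)
    return lines, buffer
```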
 
630
  job._log("Training completed on RunPod!")
631
  job.progress = 0.9
632
 
633
+ # Step 6: Save LoRA to network volume and download locally
634
  job.status = "downloading"
 
 
 
 
635
 
636
+ # First, copy to network volume for persistence
637
+ job._log("Saving LoRA to network volume...")
638
+ self._ssh_exec(ssh, "mkdir -p /runpod-volume/loras")
639
+ remote_output = f"/workspace/output/{name}.safetensors"
640
+ # Find the output file
641
+ check = self._ssh_exec(ssh, f"test -f {remote_output} && echo EXISTS || echo MISSING").strip()
642
+ if check == "MISSING":
643
+ remote_files = self._ssh_exec(ssh, "ls /workspace/output/*.safetensors 2>/dev/null").strip()
644
+ if remote_files:
645
+ remote_output = remote_files.split("\n")[-1].strip()
646
  else:
647
  raise RuntimeError("No .safetensors output found")
648
 
649
+ self._ssh_exec(ssh, f"cp {remote_output} /runpod-volume/loras/{name}.safetensors")
650
+ job._log(f"LoRA saved to volume: /runpod-volume/loras/{name}.safetensors")
651
+
652
+ # Then download locally
653
+ job._log("Downloading LoRA to local machine...")
654
+ LORA_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
655
+ local_path = LORA_OUTPUT_DIR / f"{name}.safetensors"
656
+ sftp.get(remote_output, str(local_path))
657
+
658
  job.output_path = str(local_path)
659
+ job._log(f"LoRA saved locally to {local_path}")
660
 
661
  # Done!
662
  job.status = "completed"
 
692
  except Exception as e:
693
  job._log(f"Warning: Failed to terminate pod {job.pod_id}: {e}")
694
 
695
+ def _schedule_db_save(self, job: CloudTrainingJob):
696
+ """Schedule a DB save (non-blocking)."""
697
+ try:
698
+ asyncio.get_running_loop().create_task(self._save_to_db(job))
699
+ except RuntimeError:
700
+ pass # no event loop
701
+
702
+ async def _save_to_db(self, job: CloudTrainingJob):
703
+ """Persist job state to database."""
704
+ try:
705
+ from sqlalchemy import text
706
+ async with catalog_session_factory() as session:
707
+ # Use raw INSERT OR REPLACE for SQLite upsert
708
+ await session.execute(
709
+ text("""INSERT OR REPLACE INTO training_jobs
710
+ (id, name, status, progress, current_epoch, total_epochs,
711
+ current_step, total_steps, loss, started_at, completed_at,
712
+ output_path, error, log_text, pod_id, gpu_type, backend,
713
+ base_model, model_type, created_at)
714
+ VALUES (:id, :name, :status, :progress, :current_epoch, :total_epochs,
715
+ :current_step, :total_steps, :loss, :started_at, :completed_at,
716
+ :output_path, :error, :log_text, :pod_id, :gpu_type, :backend,
717
+ :base_model, :model_type, COALESCE((SELECT created_at FROM training_jobs WHERE id = :id), CURRENT_TIMESTAMP))
718
+ """),
719
+ {
720
+ "id": job.id, "name": job.name, "status": job.status,
721
+ "progress": job.progress, "current_epoch": job.current_epoch,
722
+ "total_epochs": job.total_epochs, "current_step": job.current_step,
723
+ "total_steps": job.total_steps, "loss": job.loss,
724
+ "started_at": job.started_at, "completed_at": job.completed_at,
725
+ "output_path": job.output_path, "error": job.error,
726
+ "log_text": "\n".join(job.log_lines[-200:]),
727
+ "pod_id": job.pod_id, "gpu_type": job.gpu_type,
728
+ "backend": "runpod", "base_model": job.base_model,
729
+ "model_type": job.model_type,
730
+ }
731
+ )
732
+ await session.commit()
733
+ except Exception as e:
734
+ logger.warning("Failed to save training job to DB: %s", e)
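The raw upsert above leans on a COALESCE subselect to preserve the first `created_at` across re-saves; a self-contained sqlite3 demonstration of the idiom:

```python
import sqlite3

# Demo of the INSERT OR REPLACE idiom used in _save_to_db: the COALESCE
# subselect re-reads the existing row so created_at survives every re-save.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE training_jobs (id TEXT PRIMARY KEY, status TEXT, created_at TEXT)")

def save(job_id, status, now):
    conn.execute(
        "INSERT OR REPLACE INTO training_jobs (id, status, created_at) "
        "VALUES (?, ?, COALESCE((SELECT created_at FROM training_jobs WHERE id = ?), ?))",
        (job_id, status, job_id, now),
    )

save("job1", "running", "t0")    # first save stamps created_at = t0
save("job1", "completed", "t1")  # re-save keeps t0, replaces status
row = conn.execute("SELECT status, created_at FROM training_jobs WHERE id = 'job1'").fetchone()
```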
735
+
736
+ async def _load_jobs_from_db(self):
737
+ """Load previously saved jobs from database on startup."""
738
+ try:
739
+ from sqlalchemy import select
740
+ async with catalog_session_factory() as session:
741
+ result = await session.execute(
742
+ select(TrainingJobDB).order_by(TrainingJobDB.created_at.desc()).limit(20)
743
+ )
744
+ db_jobs = result.scalars().all()
745
+ for db_job in db_jobs:
746
+ if db_job.id not in self._jobs:
747
+ job = CloudTrainingJob(
748
+ id=db_job.id,
749
+ name=db_job.name,
750
+ status=db_job.status,
751
+ progress=db_job.progress or 0.0,
752
+ current_epoch=db_job.current_epoch or 0,
753
+ total_epochs=db_job.total_epochs or 0,
754
+ current_step=db_job.current_step or 0,
755
+ total_steps=db_job.total_steps or 0,
756
+ loss=db_job.loss,
757
+ started_at=db_job.started_at,
758
+ completed_at=db_job.completed_at,
759
+ output_path=db_job.output_path,
760
+ error=db_job.error,
761
+ log_lines=(db_job.log_text or "").split("\n") if db_job.log_text else [],
762
+ pod_id=db_job.pod_id,
763
+ gpu_type=db_job.gpu_type or DEFAULT_GPU,
764
+ base_model=db_job.base_model or "sd15_realistic",
765
+ model_type=db_job.model_type or "sd15",
766
+ )
767
+ # Mark interrupted jobs as failed
768
+ if job.status not in ("completed", "failed"):
769
+ job.status = "failed"
770
+ job.error = "Interrupted by server restart"
771
+ self._jobs[db_job.id] = job
772
+ except Exception as e:
773
+ logger.warning("Failed to load training jobs from DB: %s", e)
774
+
775
+ async def ensure_loaded(self):
776
+ """Load jobs from DB on first access."""
777
+ if not self._loaded_from_db:
778
+ self._loaded_from_db = True
779
+ await self._load_jobs_from_db()
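`ensure_loaded` flips its flag before awaiting the load, so concurrent callers on the same event loop cannot both trigger a DB read; the guard in isolation (class name is illustrative):

```python
import asyncio

class OnceLoader:
    """Flag-before-await guard: the boolean is set synchronously, so a second
    caller scheduled on the same loop sees it before any await point."""
    def __init__(self):
        self.loaded = False
        self.load_calls = 0

    async def ensure_loaded(self):
        if not self.loaded:
            self.loaded = True  # set BEFORE awaiting, closing the race window
            await self._load()

    async def _load(self):
        self.load_calls += 1
```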
780
+
781
  def _build_training_command(
782
  self,
783
  *,
 
793
  optimizer: str,
794
  save_every_n_epochs: int,
795
  model_cfg: dict,
796
+ gpu_type: str = "",
797
  ) -> str:
798
  """Build the training command based on model type."""
799
 
 
827
  lr_scheduler = model_cfg.get("lr_scheduler", "cosine_with_restarts")
828
  base_args += f" \\\n --lr_scheduler={lr_scheduler}"
829
 
830
+ if model_type == "flux2":
831
+ # FLUX.2 training via musubi-tuner
832
+ flux2_dir = "/workspace/models/FLUX.2-dev"
833
+ dit_path = f"{flux2_dir}/flux2-dev.safetensors"
834
+ vae_path = f"{flux2_dir}/ae.safetensors"
835
+ te_path = f"{flux2_dir}/text_encoder/model-00001-of-00010.safetensors"
836
+
837
+ network_mod = model_cfg.get("network_module", "networks.lora_flux_2")
838
+ ts_sampling = model_cfg.get("timestep_sampling", "flux2_shift")
839
+ lr_scheduler = model_cfg.get("lr_scheduler", "cosine")
840
+
841
+ # Build as list of args to avoid shell escaping issues
842
+ args = [
843
+ "cd /workspace/musubi-tuner && PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True",
844
+ "accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16",
845
+ "src/musubi_tuner/flux_2_train_network.py",
846
+ "--model_version dev",
847
+ f"--dit {dit_path}",
848
+ f"--vae {vae_path}",
849
+ f"--text_encoder {te_path}",
850
+ "--dataset_config /workspace/dataset.toml",
851
+ "--sdpa --mixed_precision bf16",
852
+ f"--timestep_sampling {ts_sampling} --weighting_scheme none",
853
+ f"--network_module {network_mod}",
854
+ f"--network_dim={network_rank}",
855
+ f"--network_alpha={network_alpha}",
856
+ "--gradient_checkpointing",
857
+ ]
858
+
859
+ # Only use fp8_base on GPUs with native fp8 support (RTX 4090/5090, L40S, H100)
860
+ # A100 and A6000 don't support fp8 tensor ops, and have enough VRAM without it
861
+ if gpu_type and ("4090" in gpu_type or "5090" in gpu_type or "L40S" in gpu_type or "H100" in gpu_type):
862
+ args.append("--fp8_base")
863
+
864
+ # Handle Prodigy optimizer (needs special class path and args)
865
+ if optimizer.lower() == "prodigy":
866
+ args.extend([
867
+ "--optimizer_type=prodigyopt.Prodigy",
868
+ f"--learning_rate={learning_rate}",
869
+ '--optimizer_args "weight_decay=0.01" "decouple=True" "use_bias_correction=True" "safeguard_warmup=True" "d_coef=2"',
870
+ ])
871
+ else:
872
+ args.extend([
873
+ f"--optimizer_type={optimizer}",
874
+ f"--learning_rate={learning_rate}",
875
+ ])
876
+
877
+ args.extend([
878
+ f"--save_every_n_epochs={save_every_n_epochs}",
879
+ "--seed=42",
880
+ '--output_dir=/workspace/output',
881
+ f'--output_name={name}',
882
+ f"--lr_scheduler={lr_scheduler}",
883
+ ])
884
+
885
+ if max_train_steps:
886
+ args.append(f"--max_train_steps={max_train_steps}")
887
+ else:
888
+ args.append(f"--max_train_epochs={num_epochs}")
889
+
890
+ return " ".join(args) + " 2>&1"
891
+
892
+ elif model_type == "flux":
893
+ # FLUX.1 training via sd-scripts
894
  script = "flux_train_network.py"
895
  flux_args = f"""
896
  --pretrained_model_name_or_path="{model_path}" \
 
931
  --clip_skip={clip_skip} \
932
  --xformers {base_args} 2>&1"""
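The `--fp8_base` gating above is a substring heuristic over the RunPod GPU display name; as a standalone predicate (helper name is ours, marker list copied from the condition):

```python
FP8_GPU_MARKERS = ("4090", "5090", "L40S", "H100")  # Ada/Hopper-class parts

def supports_fp8(gpu_type: str) -> bool:
    """Mirror of the inline check: enable fp8 weights only when the GPU name
    suggests native FP8 support; Ampere cards (A100/A6000) stay in bf16."""
    return any(marker in gpu_type for marker in FP8_GPU_MARKERS)
```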
933
 
934
+ async def _wait_for_pod_ready(self, job: CloudTrainingJob, timeout: int = 600) -> tuple[str, int]:
935
  """Wait for pod to be running and return (ssh_host, ssh_port)."""
936
  start = time.time()
937
  while time.time() - start < timeout:
938
+ try:
939
+ pod = await asyncio.to_thread(runpod.get_pod, job.pod_id)
940
+ except Exception as e:
941
+ job._log(f" API error: {e}")
942
+ await asyncio.sleep(10)
943
+ continue
944
 
945
  status = pod.get("desiredStatus", "")
946
  runtime = pod.get("runtime")
947
 
948
  if status == "RUNNING" and runtime:
949
+ ports = runtime.get("ports") or []
+ for port_info in ports:
951
  if port_info.get("privatePort") == 22:
952
  ip = port_info.get("ip")
953
  public_port = port_info.get("publicPort")
954
  if ip and public_port:
955
  return ip, int(public_port)
956
 
957
+ elapsed = int(time.time() - start)
958
+ if elapsed % 30 < 6:
959
+ job._log(f" Status: {status} | runtime: {'ports pending' if runtime else 'not ready yet'} ({elapsed}s)")
960
+
961
  await asyncio.sleep(5)
962
 
963
  raise RuntimeError(f"Pod did not become ready within {timeout}s")
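The port scan inside the polling loop can be factored out; a sketch of extracting the public SSH mapping from a pod payload (the dict shape is taken from the loop above):

```python
def find_ssh_endpoint(pod: dict):
    """Return (ip, port) for the public mapping of container port 22,
    or None while the proxy hasn't assigned one yet."""
    runtime = pod.get("runtime") or {}
    for port_info in runtime.get("ports") or []:
        if port_info.get("privatePort") == 22:
            ip = port_info.get("ip")
            public_port = port_info.get("publicPort")
            if ip and public_port:
                return ip, int(public_port)
    return None
```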
 
1024
  job.status = "failed"
1025
  job.error = "Cancelled by user"
1026
  return True
1027
+
1028
+ async def delete_job(self, job_id: str) -> bool:
1029
+ """Delete a training job from memory and database."""
1030
+ if job_id not in self._jobs:
1031
+ return False
1032
+ del self._jobs[job_id]
1033
+ try:
1034
+ async with catalog_session_factory() as session:
1035
+ from sqlalchemy import select
+ result = await session.execute(
+ select(TrainingJobDB).where(TrainingJobDB.id == job_id)
+ )
1038
+ db_job = result.scalar_one_or_none()
1039
+ if db_job:
1040
+ await session.delete(db_job)
1041
+ await session.commit()
1042
+ except Exception as e:
1043
+ logger.warning("Failed to delete job from DB: %s", e)
1044
+ return True
1045
+
1046
+ async def delete_failed_jobs(self) -> int:
1047
+ """Delete all failed/error training jobs."""
1048
+ failed_ids = [jid for jid, j in self._jobs.items() if j.status in ("failed", "error")]
1049
+ for jid in failed_ids:
1050
+ await self.delete_job(jid)
1051
+ return len(failed_ids)
src/content_engine/workers/__pycache__/__init__.cpython-311.pyc DELETED
Binary file (239 Bytes)
 
src/content_engine/workers/__pycache__/local_worker.cpython-311.pyc DELETED
Binary file (6.09 kB)