AhmadMustafa committed

Commit 068b511 · 0 Parent(s)

Initial commit for CogVideoXInterp
.vscode/settings.json ADDED
@@ -0,0 +1,5 @@
+ {
+     "python-envs.defaultEnvManager": "ms-python.python:conda",
+     "python-envs.defaultPackageManager": "ms-python.python:conda",
+     "python-envs.pythonProjects": []
+ }
FILES.txt ADDED
@@ -0,0 +1,61 @@
+ Bare Minimum Files for CogVideoX-Interpolation Gradio App
+ ===========================================================
+ 
+ ESSENTIAL FILES (must have all):
+ 
+ 1. app.py (7.7KB)
+    - Main Gradio application
+    - Handles the UI and video generation
+ 
+ 2. requirements.txt (103B)
+    - Python package dependencies
+    - Install with: pip install -r requirements.txt
+ 
+ 3. README.md (232B)
+    - HuggingFace Spaces configuration
+    - Contains the YAML frontmatter for Spaces
+ 
+ 4. cogvideox_interpolation/ (directory)
+    - pipeline.py (~38KB)
+      * Core CogVideoX interpolation pipeline
+      * Custom diffusion pipeline implementation
+ 
+    - datasets.py (~6KB)
+      * Dataset loading utilities
+      * Not used during inference but required for imports
+ 
+ OPTIONAL (helpful but not required):
+ 
+ 5. SETUP.md (3.1KB)
+    - Quick setup instructions
+    - Can be deleted after setup
+ 
+ TOTAL SIZE: ~64KB (excluding model weights)
+ 
+ MODEL DOWNLOAD:
+ - The model auto-downloads on first run (~20GB)
+ - Model: feizhengcong/CogvideoX-Interpolation
+ - Downloads to: ~/.cache/huggingface/ (an optional pre-download sketch follows below)
+ 
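+ OPTIONAL PRE-DOWNLOAD (a sketch; huggingface_hub ships with the dependencies above):
+ 
+     from huggingface_hub import snapshot_download
+     snapshot_download("feizhengcong/CogvideoX-Interpolation")
+ 
+ Running this once fetches the ~20GB checkpoint into ~/.cache/huggingface/, so the
+ first "Load Model" click in the app does not block on the download.
+ 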
+ WHAT'S NOT NEEDED:
+ ✗ Training scripts (finetune.py, finetune.sh)
+ ✗ Documentation files (CLAUDE.md, GPU_REQUIREMENTS.md, GRADIO_README.md)
+ ✗ Example cases (cases/ directory)
+ ✗ Git files (.git, .gitignore)
+ ✗ Compiled files (__pycache__, *.pyc)
+ ✗ The original README.md from the repo
+ ✗ requirement.txt (the original file; requirements.txt is used instead)
+ 
+ TO RUN LOCALLY:
+ 1. pip install -r requirements.txt
+ 2. python app.py
+ 3. Open http://localhost:7860
+ 
+ TO DEPLOY ON HUGGINGFACE SPACES:
+ 1. Upload all files in this directory
+ 2. Select GPU hardware (T4 minimum, A10G recommended)
+ 3. The Space auto-deploys
+ 
+ GPU REQUIREMENTS:
+ - Minimum: 16GB VRAM
+ - Recommended: 24GB VRAM
README.md ADDED
@@ -0,0 +1,12 @@
+ ---
+ title: CogVideoXInterp
+ emoji: ⚡
+ colorFrom: red
+ colorTo: gray
+ sdk: gradio
+ sdk_version: 5.47.2
+ app_file: app.py
+ pinned: false
+ ---
+ 
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
SETUP.md ADDED
@@ -0,0 +1,130 @@
+ # CogVideoX Keyframe Interpolation - Quick Setup
+ 
+ This directory contains the **bare minimum files** needed to run the CogVideoX Keyframe Interpolation Gradio app.
+ 
+ ## 📁 Contents
+ 
+ ```
+ CogVideoXInterp/
+ ├── README.md                  # HuggingFace Spaces README
+ ├── app.py                     # Main Gradio application
+ ├── requirements.txt           # Python dependencies
+ ├── cogvideox_interpolation/   # Core pipeline module
+ │   ├── datasets.py            # Dataset loading (not needed for inference)
+ │   └── pipeline.py            # Custom interpolation pipeline
+ └── SETUP.md                   # This file
+ ```
+ 
+ **Total size:** ~64KB (the model downloads separately)
+ 
+ ---
+ 
+ ## 🚀 Quick Start
+ 
+ ### Local Setup
+ 
+ 1. **Install dependencies:**
+    ```bash
+    pip install -r requirements.txt
+    ```
+ 
+ 2. **Run the app:**
+    ```bash
+    python app.py
+    ```
+ 
+ 3. **Open a browser:**
+    Navigate to `http://localhost:7860`
+ 
+ ### GPU Requirements
+ 
+ - **Minimum:** 16GB VRAM (RTX 4060 Ti 16GB, RTX 4080)
+ - **Recommended:** 24GB VRAM (RTX 3090, RTX 4090)
+ 
+ ---
+ 
+ ## 🤗 Deploy to HuggingFace Spaces
+ 
+ ### Method 1: Web Upload
+ 
+ 1. Go to https://huggingface.co/spaces
+ 2. Click "Create new Space"
+ 3. Choose **Gradio** as the SDK
+ 4. Upload all files from this directory
+ 5. Select GPU hardware (T4 minimum, A10G recommended)
+ 6. The Space will auto-deploy!
+ 
+ ### Method 2: Git Push
+ 
+ ```bash
+ # Create a Space on HuggingFace first, then:
+ git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
+ cd YOUR_SPACE_NAME
+ 
+ # Copy files
+ cp -r /path/to/CogVideoXInterp/* .
+ 
+ # Push
+ git add .
+ git commit -m "Initial commit"
+ git push
+ ```
+ 
+ ### HuggingFace Spaces Hardware Options
+ 
+ | Hardware | VRAM | Speed | Cost/hr |
+ |----------|------|-------|---------|
+ | CPU | 0GB | ❌ Won't work | Free |
+ | T4 | 16GB | ⚠️ Slow (5-8 min) | ~$0.60 |
+ | A10G | 24GB | ✅ Good (2-4 min) | ~$3.15 |
+ | A100 | 40GB | ✅ Fast (1-2 min) | ~$7.00 |
+ 
+ **Note:** The model auto-downloads on first run (~20GB).
+ 
+ ---
+ 
+ ## 📝 Usage
+ 
+ 1. **Load Model** - Enter a model path or use the default `feizhengcong/CogvideoX-Interpolation`
+ 2. **Upload Images** - Provide the start and end frames
+ 3. **Write Prompt** - Describe the motion/transition
+ 4. **Generate** - Wait 2-5 minutes for the video (for scripted use, see the sketch below)
+ 
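+ ### Programmatic Use (optional)
+ 
+ If you want to skip the UI, the same pipeline can be driven from a short script. The
+ sketch below mirrors what `app.py` in this directory does; it assumes a CUDA GPU, and
+ `start.png`, `end.png`, and `output.mp4` are placeholder paths you supply yourself.
+ 
+ ```python
+ import torch
+ from PIL import Image
+ from diffusers.utils import export_to_video
+ from cogvideox_interpolation.pipeline import CogVideoXInterpolationPipeline
+ 
+ model_path = "feizhengcong/CogvideoX-Interpolation"
+ # Same dtype heuristic as app.py
+ dtype = torch.bfloat16 if "5b" in model_path.lower() else torch.float16
+ 
+ # Downloads ~20GB on first use
+ pipe = CogVideoXInterpolationPipeline.from_pretrained(model_path, torch_dtype=dtype)
+ pipe.enable_sequential_cpu_offload()  # low VRAM, slower
+ pipe.vae.enable_tiling()
+ pipe.vae.enable_slicing()
+ 
+ first = Image.open("start.png").convert("RGB")
+ last = Image.open("end.png").convert("RGB")
+ 
+ video = pipe(
+     prompt="A dancer gracefully transitions from one pose to another.",
+     first_image=first,
+     last_image=last,
+     num_frames=49,
+     num_inference_steps=50,
+     guidance_scale=6.0,
+     generator=torch.Generator(device="cuda").manual_seed(42),
+ )[0]
+ 
+ export_to_video(video, "output.mp4", fps=8)
+ ```
+ 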
+ ### Example Prompts
+ 
+ ✅ "A person walks forward slowly, their body moving naturally with each step"
+ 
+ ✅ "The camera smoothly pans from left to right, revealing the scene"
+ 
+ ✅ "A dancer gracefully transitions from one pose to another"
+ 
+ ---
+ 
+ ## 🔧 Troubleshooting
+ 
+ ### Out of Memory
+ 
+ Reduce the parameters in the app:
+ - Frames: 49 → 25
+ - Steps: 50 → 30
+ 
+ ### Model Download Fails
+ 
+ Check your internet connection. The model is ~20GB and downloads to:
+ - Linux/Mac: `~/.cache/huggingface/`
+ - Windows: `C:\Users\USERNAME\.cache\huggingface\`
+ 
+ ### Import Errors
+ 
+ Make sure all files from this directory are in the same location, especially the `cogvideox_interpolation/` folder.
+ 
+ ---
+ 
+ ## 📚 More Information
+ 
+ For detailed documentation, see the parent repository at:
+ https://github.com/feizc/CogvideX-Interpolation
+ 
+ **Model:** https://huggingface.co/feizhengcong/CogvideoX-Interpolation
+ 
+ **License:** Apache 2.0
app.py ADDED
@@ -0,0 +1,254 @@
+ import gradio as gr
+ import torch
+ from diffusers.utils import export_to_video
+ from cogvideox_interpolation.pipeline import CogVideoXInterpolationPipeline
+ from PIL import Image
+ import tempfile
+ import os
+ 
+ # Global variable to store the pipeline
+ pipe = None
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ 
+ def load_model(model_path):
+     """Load the CogVideoX-Interpolation model"""
+     global pipe
+ 
+     print(f"Loading model from {model_path}...")
+     print(f"Using device: {device}")
+ 
+     # Determine dtype based on model variant
+     dtype = torch.bfloat16 if "5b" in model_path.lower() else torch.float16
+ 
+     pipe = CogVideoXInterpolationPipeline.from_pretrained(
+         model_path,
+         torch_dtype=dtype
+     )
+ 
+     # Memory optimization
+     if device == "cuda":
+         pipe.enable_sequential_cpu_offload()
+     else:
+         pipe = pipe.to(device)
+ 
+     pipe.vae.enable_tiling()
+     pipe.vae.enable_slicing()
+ 
+     print("Model loaded successfully!")
+     return "✓ Model loaded successfully!"
+ 
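+ # VRAM note: enable_sequential_cpu_offload() streams weights layer by layer, which keeps
+ # peak memory low but is slow. If roughly 24GB of VRAM is available, diffusers'
+ # pipe.enable_model_cpu_offload() is a commonly used, faster alternative.
+ 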
+ def generate_interpolation(
+     first_image,
+     last_image,
+     prompt,
+     num_frames=49,
+     num_inference_steps=50,
+     guidance_scale=6.0,
+     fps=8,
+     seed=42
+ ):
+     """Generate interpolated video between two keyframes"""
+ 
+     if pipe is None:
+         return None, "⚠️ Please load the model first!"
+ 
+     if first_image is None or last_image is None:
+         return None, "⚠️ Please upload both start and end frame images!"
+ 
+     if not prompt.strip():
+         return None, "⚠️ Please provide a text prompt describing the motion!"
+ 
+     try:
+         # Convert numpy arrays to PIL Images if needed
+         if not isinstance(first_image, Image.Image):
+             first_image = Image.fromarray(first_image)
+         if not isinstance(last_image, Image.Image):
+             last_image = Image.fromarray(last_image)
+ 
+         print(f"Generating video with prompt: {prompt}")
+         print(f"Parameters: frames={num_frames}, steps={num_inference_steps}, guidance={guidance_scale}")
+ 
+         # Generate video
+         generator = torch.Generator(device=device).manual_seed(seed)
+ 
+         video = pipe(
+             prompt=prompt,
+             first_image=first_image,
+             last_image=last_image,
+             num_videos_per_prompt=1,
+             num_inference_steps=num_inference_steps,
+             num_frames=num_frames,
+             guidance_scale=guidance_scale,
+             generator=generator,
+         )[0]
+ 
+         # Export to temporary file
+         temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".mp4")
+         output_path = temp_file.name
+         temp_file.close()
+ 
+         export_to_video(video, output_path, fps=fps)
+ 
+         status = f"✓ Video generated successfully! ({num_frames} frames at {fps} fps)"
+         print(status)
+ 
+         return output_path, status
+ 
+     except Exception as e:
+         error_msg = f"❌ Error: {str(e)}"
+         print(error_msg)
+         return None, error_msg
+ 
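+ # The frame-count slider below steps in increments of 4 starting at 13 because the
+ # CogVideoX VAE compresses time 4x, so the pipeline expects num_frames of the form
+ # 4k + 1 (13, 17, ..., 49).
+ 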
+ # Create Gradio interface
+ with gr.Blocks(title="CogVideoX Keyframe Interpolation") as demo:
+     gr.Markdown("""
+     # 🎬 CogVideoX Keyframe Interpolation
+ 
+     Generate smooth video transitions between two keyframe images using AI.
+ 
+     **Instructions:**
+     1. First, load the model by providing the path to your checkpoint
+     2. Upload start and end frame images
+     3. Describe the motion/transition in the text prompt
+     4. Adjust parameters and generate!
+     """)
+ 
+     with gr.Row():
+         with gr.Column():
+             gr.Markdown("### 🔧 Model Setup")
+             model_path_input = gr.Textbox(
+                 label="Model Path",
+                 placeholder="e.g., /path/to/CogVideoX-5b-I2V-inter or feizhengcong/CogvideoX-Interpolation",
+                 value="feizhengcong/CogvideoX-Interpolation"
+             )
+             load_btn = gr.Button("Load Model", variant="primary")
+             model_status = gr.Textbox(label="Status", interactive=False)
+ 
+     gr.Markdown("---")
+ 
+     with gr.Row():
+         with gr.Column():
+             gr.Markdown("### 🖼️ Input Keyframes")
+             first_image_input = gr.Image(
+                 label="Start Frame",
+                 type="pil",
+                 height=300
+             )
+             last_image_input = gr.Image(
+                 label="End Frame",
+                 type="pil",
+                 height=300
+             )
+ 
+         with gr.Column():
+             gr.Markdown("### ⚙️ Generation Settings")
+             prompt_input = gr.Textbox(
+                 label="Motion Description",
+                 placeholder="Describe the motion/transition between the frames...",
+                 lines=4
+             )
+ 
+             with gr.Row():
+                 num_frames_slider = gr.Slider(
+                     label="Number of Frames",
+                     minimum=13,
+                     maximum=49,
+                     step=4,
+                     value=49,
+                     info="Must be 4k+1 format (13, 17, 21, ..., 49)"
+                 )
+                 fps_slider = gr.Slider(
+                     label="FPS",
+                     minimum=4,
+                     maximum=16,
+                     step=2,
+                     value=8
+                 )
+ 
+             with gr.Row():
+                 num_steps_slider = gr.Slider(
+                     label="Inference Steps",
+                     minimum=20,
+                     maximum=100,
+                     step=5,
+                     value=50,
+                     info="More steps = better quality but slower"
+                 )
+                 guidance_slider = gr.Slider(
+                     label="Guidance Scale",
+                     minimum=1.0,
+                     maximum=15.0,
+                     step=0.5,
+                     value=6.0,
+                     info="Higher = stronger prompt following"
+                 )
+ 
+             seed_input = gr.Number(
+                 label="Random Seed",
+                 value=42,
+                 precision=0
+             )
+ 
+             generate_btn = gr.Button("🎬 Generate Video", variant="primary", size="lg")
+ 
+     gr.Markdown("---")
+ 
+     with gr.Row():
+         with gr.Column():
+             gr.Markdown("### 🎥 Generated Video")
+             output_video = gr.Video(label="Output")
+             generation_status = gr.Textbox(label="Generation Status", interactive=False)
+ 
+     # Examples
+     gr.Markdown("---")
+     gr.Markdown("### 💡 Example Prompts")
+     gr.Examples(
+         examples=[
+             ["A person walks forward slowly, their body moving naturally with each step."],
+             ["The camera smoothly pans from left to right, revealing the scene."],
+             ["A dancer gracefully transitions from one pose to another."],
+             ["The sun sets gradually, changing the lighting and colors of the scene."],
+             ["A car accelerates down the street, moving from standstill to motion."],
+         ],
+         inputs=prompt_input,
+         label="Click to use example prompts"
+     )
+ 
+     # Event handlers
+     load_btn.click(
+         fn=load_model,
+         inputs=[model_path_input],
+         outputs=[model_status]
+     )
+ 
+     generate_btn.click(
+         fn=generate_interpolation,
+         inputs=[
+             first_image_input,
+             last_image_input,
+             prompt_input,
+             num_frames_slider,
+             num_steps_slider,
+             guidance_slider,
+             fps_slider,
+             seed_input
+         ],
+         outputs=[output_video, generation_status]
+     )
+ 
+ if __name__ == "__main__":
+     print("="*50)
+     print("CogVideoX Keyframe Interpolation Gradio App")
+     print("="*50)
+     print(f"Device: {device}")
+     print(f"CUDA available: {torch.cuda.is_available()}")
+     if torch.cuda.is_available():
+         print(f"GPU: {torch.cuda.get_device_name(0)}")
+         print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
+     print("="*50)
+ 
+     demo.launch(
+         server_name="0.0.0.0",
+         server_port=7860,
+         share=False
+     )
cogvideox_interpolation/datasets.py ADDED
@@ -0,0 +1,154 @@
+ import json
+ import torch
+ import numpy as np  # required by _resize_for_rectangle_crop (np.random.randint)
+ from typing import Any, Dict, List, Optional, Tuple
+ from torch.utils.data import DataLoader, Dataset
+ import torchvision.transforms as TT
+ from torchvision import transforms
+ from torchvision.transforms.functional import center_crop, resize
+ from torchvision.transforms import InterpolationMode
+ import random
+ try:
+     import decord
+ except ImportError:
+     raise ImportError(
+         "The `decord` package is required for loading the video dataset. Install with `pip install decord`"
+     )
+ 
+ decord.bridge.set_bridge("torch")
+ 
+ class ImageVideoDataset(Dataset):
+     def __init__(
+         self,
+         data_root,
+         tokenizer,
+         max_sequence_length: int = 226,
+         height: int = 480,
+         width: int = 720,
+         video_reshape_mode: str = "center",
+         fps: int = 8,
+         stripe: int = 2,
+         max_num_frames: int = 49,
+         skip_frames_start: int = 0,
+         skip_frames_end: int = 0,
+         random_flip: Optional[float] = None,
+     ) -> None:
+         super().__init__()
+ 
+         with open(data_root, 'r') as f:
+             self.data_list = json.load(f)
+ 
+         self.tokenizer = tokenizer
+         self.max_sequence_length = max_sequence_length
+         self.height = height
+         self.width = width
+         self.video_reshape_mode = video_reshape_mode
+         self.fps = fps
+         self.max_num_frames = max_num_frames
+         self.skip_frames_start = skip_frames_start
+         self.skip_frames_end = skip_frames_end
+         self.stripe = stripe
+         self.video_transforms = transforms.Compose(
+             [
+                 transforms.RandomHorizontalFlip(random_flip) if random_flip else transforms.Lambda(lambda x: x),
+                 transforms.Lambda(lambda x: x / 255.0),
+                 transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5], inplace=True),
+             ]
+         )
+ 
+     def __len__(self):
+         return len(self.data_list)
+ 
+     def _resize_for_rectangle_crop(self, arr):
+         image_size = self.height, self.width
+         reshape_mode = self.video_reshape_mode
+         if arr.shape[3] / arr.shape[2] > image_size[1] / image_size[0]:
+             arr = resize(
+                 arr,
+                 size=[image_size[0], int(arr.shape[3] * image_size[0] / arr.shape[2])],
+                 interpolation=InterpolationMode.BICUBIC,
+             )
+         else:
+             arr = resize(
+                 arr,
+                 size=[int(arr.shape[2] * image_size[1] / arr.shape[3]), image_size[1]],
+                 interpolation=InterpolationMode.BICUBIC,
+             )
+ 
+         h, w = arr.shape[2], arr.shape[3]
+         arr = arr.squeeze(0)
+ 
+         delta_h = h - image_size[0]
+         delta_w = w - image_size[1]
+ 
+         if reshape_mode == "random" or reshape_mode == "none":
+             top = np.random.randint(0, delta_h + 1)
+             left = np.random.randint(0, delta_w + 1)
+         elif reshape_mode == "center":
+             top, left = delta_h // 2, delta_w // 2
+         else:
+             raise NotImplementedError
+         arr = TT.functional.crop(arr, top=top, left=left, height=image_size[0], width=image_size[1])
+         return arr
+ 
+     def __getitem__(self, index):
+         while True:
+             try:
+                 video_reader = decord.VideoReader(self.data_list[index]['file_path'], width=self.width, height=self.height)
+                 video_num_frames = len(video_reader)
+                 # print(video_num_frames, video_reader.get_avg_fps())
+                 if self.stripe * self.max_num_frames > video_num_frames:
+                     stripe = 1
+                 else:
+                     stripe = self.stripe
+ 
+                 random_range = video_num_frames - stripe * self.max_num_frames - 1
+                 random_range = max(1, random_range)
+                 start_frame = random.randint(1, random_range) if random_range > 0 else 1
+ 
+                 indices = list(range(start_frame, start_frame + stripe * self.max_num_frames, stripe))
+                 frames = video_reader.get_batch(indices)
+ 
+                 # Ensure that we don't go over the limit
+                 frames = frames[: self.max_num_frames]
+                 selected_num_frames = frames.shape[0]
+ 
+                 # Keep the first (4k + 1) frames, as this is what the VAE requires
+                 remainder = (3 + (selected_num_frames % 4)) % 4
+                 if remainder != 0:
+                     frames = frames[:-remainder]
+                 selected_num_frames = frames.shape[0]
+ 
+                 assert (selected_num_frames - 1) % 4 == 0
+                 if selected_num_frames == self.max_num_frames:
+                     break
+                 else:
+                     index = (index + 1) % len(self.data_list)
+                     continue
+ 
+             except Exception as e:
+                 index = (index + 1) % len(self.data_list)
+                 print("Error encountered while loading video sample: ", e)
+                 continue
+ 
+         # Training transforms
+         # frames = (frames - 127.5) / 127.5
+         frames = frames.permute(0, 3, 1, 2).contiguous()  # [F, C, H, W]
+         frames = self._resize_for_rectangle_crop(frames)
+         frames = torch.stack([self.video_transforms(frame) for frame in frames], dim=0)
+ 
+         text_inputs = self.tokenizer(
+             [self.data_list[index]['text']],
+             padding="max_length",
+             max_length=self.max_sequence_length,
+             truncation=True,
+             add_special_tokens=True,
+             return_tensors="pt",
+         )
+         text_input_ids = text_inputs.input_ids[0]
+ 
+         return frames.contiguous(), text_input_ids
+ 
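+ # Expected annotation format (inferred from the loader above, not an official spec):
+ # `data_root` points to a JSON file containing a list of entries, each with a
+ # 'file_path' to a video and a 'text' caption, e.g.
+ #
+ #     [
+ #         {"file_path": "/data/videos/clip_0001.mp4", "text": "A dancer spins slowly."},
+ #         {"file_path": "/data/videos/clip_0002.mp4", "text": "Waves roll onto a beach."}
+ #     ]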
cogvideox_interpolation/pipeline.py ADDED
@@ -0,0 +1,799 @@
1
+ import math
2
+ import PIL
3
+ import inspect
4
+ import torch
5
+ from typing import Callable, Dict, List, Optional, Tuple, Union
6
+ from transformers import T5EncoderModel, T5Tokenizer
7
+
8
+ from diffusers.pipelines.pipeline_utils import DiffusionPipeline
9
+ from diffusers.schedulers import CogVideoXDDIMScheduler, CogVideoXDPMScheduler
10
+ from diffusers.models import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel
11
+ from diffusers.utils import (
12
+ logging,
13
+ replace_example_docstring,
14
+ )
15
+ from diffusers.image_processor import PipelineImageInput
16
+ from diffusers.callbacks import MultiPipelineCallbacks, PipelineCallback
17
+ from diffusers.video_processor import VideoProcessor
18
+ from diffusers.utils.torch_utils import randn_tensor
19
+ from diffusers.models.embeddings import get_3d_rotary_pos_embed
+ 
+ # Module-level logger used by the warnings emitted below.
+ logger = logging.get_logger(__name__)
+ 
22
+ # Similar to diffusers.pipelines.hunyuandit.pipeline_hunyuandit.get_resize_crop_region_for_grid
23
+ def get_resize_crop_region_for_grid(src, tgt_width, tgt_height):
24
+ tw = tgt_width
25
+ th = tgt_height
26
+ h, w = src
27
+ r = h / w
28
+ if r > (th / tw):
29
+ resize_height = th
30
+ resize_width = int(round(th / h * w))
31
+ else:
32
+ resize_width = tw
33
+ resize_height = int(round(tw / w * h))
34
+
35
+ crop_top = int(round((th - resize_height) / 2.0))
36
+ crop_left = int(round((tw - resize_width) / 2.0))
37
+
38
+ return (crop_top, crop_left), (crop_top + resize_height, crop_left + resize_width)
39
+
40
+
41
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
42
+ def retrieve_timesteps(
43
+ scheduler,
44
+ num_inference_steps: Optional[int] = None,
45
+ device: Optional[Union[str, torch.device]] = None,
46
+ timesteps: Optional[List[int]] = None,
47
+ sigmas: Optional[List[float]] = None,
48
+ **kwargs,
49
+ ):
50
+ """
51
+ Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
52
+ custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
53
+
54
+ Args:
55
+ scheduler (`SchedulerMixin`):
56
+ The scheduler to get timesteps from.
57
+ num_inference_steps (`int`):
58
+ The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
59
+ must be `None`.
60
+ device (`str` or `torch.device`, *optional*):
61
+ The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
62
+ timesteps (`List[int]`, *optional*):
63
+ Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
64
+ `num_inference_steps` and `sigmas` must be `None`.
65
+ sigmas (`List[float]`, *optional*):
66
+ Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
67
+ `num_inference_steps` and `timesteps` must be `None`.
68
+
69
+ Returns:
70
+ `Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
71
+ second element is the number of inference steps.
72
+ """
73
+ if timesteps is not None and sigmas is not None:
74
+ raise ValueError("Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values")
75
+ if timesteps is not None:
76
+ accepts_timesteps = "timesteps" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
77
+ if not accepts_timesteps:
78
+ raise ValueError(
79
+ f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
80
+ f" timestep schedules. Please check whether you are using the correct scheduler."
81
+ )
82
+ scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
83
+ timesteps = scheduler.timesteps
84
+ num_inference_steps = len(timesteps)
85
+ elif sigmas is not None:
86
+ accept_sigmas = "sigmas" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
87
+ if not accept_sigmas:
88
+ raise ValueError(
89
+ f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
90
+ f" sigmas schedules. Please check whether you are using the correct scheduler."
91
+ )
92
+ scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
93
+ timesteps = scheduler.timesteps
94
+ num_inference_steps = len(timesteps)
95
+ else:
96
+ scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
97
+ timesteps = scheduler.timesteps
98
+ return timesteps, num_inference_steps
99
+
100
+
101
+
102
+
103
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.retrieve_latents
104
+ def retrieve_latents(
105
+ encoder_output: torch.Tensor, generator: Optional[torch.Generator] = None, sample_mode: str = "sample"
106
+ ):
107
+ if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
108
+ return encoder_output.latent_dist.sample(generator)
109
+ elif hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
110
+ return encoder_output.latent_dist.mode()
111
+ elif hasattr(encoder_output, "latents"):
112
+ return encoder_output.latents
113
+ else:
114
+ raise AttributeError("Could not access latents of provided encoder_output")
115
+
116
+
117
+
118
+
119
+ class CogVideoXInterpolationPipeline(DiffusionPipeline):
120
+ _optional_components = []
121
+ model_cpu_offload_seq = "text_encoder->transformer->vae"
122
+
123
+ _callback_tensor_inputs = [
124
+ "latents",
125
+ "prompt_embeds",
126
+ "negative_prompt_embeds",
127
+ ]
128
+ def __init__(
129
+ self,
130
+ tokenizer: T5Tokenizer,
131
+ text_encoder: T5EncoderModel,
132
+ vae: AutoencoderKLCogVideoX,
133
+ transformer: CogVideoXTransformer3DModel,
134
+ scheduler: Union[CogVideoXDDIMScheduler, CogVideoXDPMScheduler],
135
+ ):
136
+ super().__init__()
137
+
138
+ self.register_modules(
139
+ tokenizer=tokenizer,
140
+ text_encoder=text_encoder,
141
+ vae=vae,
142
+ transformer=transformer,
143
+ scheduler=scheduler,
144
+ )
145
+ self.vae_scale_factor_spatial = (
146
+ 2 ** (len(self.vae.config.block_out_channels) - 1) if hasattr(self, "vae") and self.vae is not None else 8
147
+ )
148
+ self.vae_scale_factor_temporal = (
149
+ self.vae.config.temporal_compression_ratio if hasattr(self, "vae") and self.vae is not None else 4
150
+ )
151
+
152
+ self.video_processor = VideoProcessor(vae_scale_factor=self.vae_scale_factor_spatial)
153
+
154
+
155
+ # Copied from diffusers.pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipeline._get_t5_prompt_embeds
156
+ def _get_t5_prompt_embeds(
157
+ self,
158
+ prompt: Union[str, List[str]] = None,
159
+ num_videos_per_prompt: int = 1,
160
+ max_sequence_length: int = 226,
161
+ device: Optional[torch.device] = None,
162
+ dtype: Optional[torch.dtype] = None,
163
+ ):
164
+ device = device or self._execution_device
165
+ dtype = dtype or self.text_encoder.dtype
166
+
167
+ prompt = [prompt] if isinstance(prompt, str) else prompt
168
+ batch_size = len(prompt)
169
+
170
+ text_inputs = self.tokenizer(
171
+ prompt,
172
+ padding="max_length",
173
+ max_length=max_sequence_length,
174
+ truncation=True,
175
+ add_special_tokens=True,
176
+ return_tensors="pt",
177
+ )
178
+ text_input_ids = text_inputs.input_ids
179
+ untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="pt").input_ids
180
+
181
+ if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(text_input_ids, untruncated_ids):
182
+ removed_text = self.tokenizer.batch_decode(untruncated_ids[:, max_sequence_length - 1 : -1])
183
+ logger.warning(
184
+ "The following part of your input was truncated because `max_sequence_length` is set to "
185
+ f" {max_sequence_length} tokens: {removed_text}"
186
+ )
187
+
188
+ prompt_embeds = self.text_encoder(text_input_ids.to(device))[0]
189
+ prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
190
+
191
+ # duplicate text embeddings for each generation per prompt, using mps friendly method
192
+ _, seq_len, _ = prompt_embeds.shape
193
+ prompt_embeds = prompt_embeds.repeat(1, num_videos_per_prompt, 1)
194
+ prompt_embeds = prompt_embeds.view(batch_size * num_videos_per_prompt, seq_len, -1)
195
+
196
+ return prompt_embeds
197
+
198
+ # Copied from diffusers.pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipeline.encode_prompt
199
+ def encode_prompt(
200
+ self,
201
+ prompt: Union[str, List[str]],
202
+ negative_prompt: Optional[Union[str, List[str]]] = None,
203
+ do_classifier_free_guidance: bool = True,
204
+ num_videos_per_prompt: int = 1,
205
+ prompt_embeds: Optional[torch.Tensor] = None,
206
+ negative_prompt_embeds: Optional[torch.Tensor] = None,
207
+ max_sequence_length: int = 226,
208
+ device: Optional[torch.device] = None,
209
+ dtype: Optional[torch.dtype] = None,
210
+ ):
211
+ r"""
212
+ Encodes the prompt into text encoder hidden states.
213
+
214
+ Args:
215
+ prompt (`str` or `List[str]`, *optional*):
216
+ prompt to be encoded
217
+ negative_prompt (`str` or `List[str]`, *optional*):
218
+ The prompt or prompts not to guide the image generation. If not defined, one has to pass
219
+ `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
220
+ less than `1`).
221
+ do_classifier_free_guidance (`bool`, *optional*, defaults to `True`):
222
+ Whether to use classifier free guidance or not.
223
+ num_videos_per_prompt (`int`, *optional*, defaults to 1):
224
+ Number of videos that should be generated per prompt. torch device to place the resulting embeddings on
225
+ prompt_embeds (`torch.Tensor`, *optional*):
226
+ Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
227
+ provided, text embeddings will be generated from `prompt` input argument.
228
+ negative_prompt_embeds (`torch.Tensor`, *optional*):
229
+ Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
230
+ weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
231
+ argument.
232
+ device: (`torch.device`, *optional*):
233
+ torch device
234
+ dtype: (`torch.dtype`, *optional*):
235
+ torch dtype
236
+ """
237
+ device = device or self._execution_device
238
+
239
+ prompt = [prompt] if isinstance(prompt, str) else prompt
240
+ if prompt is not None:
241
+ batch_size = len(prompt)
242
+ else:
243
+ batch_size = prompt_embeds.shape[0]
244
+
245
+ if prompt_embeds is None:
246
+ prompt_embeds = self._get_t5_prompt_embeds(
247
+ prompt=prompt,
248
+ num_videos_per_prompt=num_videos_per_prompt,
249
+ max_sequence_length=max_sequence_length,
250
+ device=device,
251
+ dtype=dtype,
252
+ )
253
+
254
+ if do_classifier_free_guidance and negative_prompt_embeds is None:
255
+ negative_prompt = negative_prompt or ""
256
+ negative_prompt = batch_size * [negative_prompt] if isinstance(negative_prompt, str) else negative_prompt
257
+
258
+ if prompt is not None and type(prompt) is not type(negative_prompt):
259
+ raise TypeError(
260
+ f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
261
+ f" {type(prompt)}."
262
+ )
263
+ elif batch_size != len(negative_prompt):
264
+ raise ValueError(
265
+ f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
266
+ f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
267
+ " the batch size of `prompt`."
268
+ )
269
+
270
+ negative_prompt_embeds = self._get_t5_prompt_embeds(
271
+ prompt=negative_prompt,
272
+ num_videos_per_prompt=num_videos_per_prompt,
273
+ max_sequence_length=max_sequence_length,
274
+ device=device,
275
+ dtype=dtype,
276
+ )
277
+
278
+ return prompt_embeds, negative_prompt_embeds
279
+
280
+ def prepare_latents(
281
+ self,
282
+ first_image: torch.Tensor,
283
+ last_image: torch.Tensor,
284
+ batch_size: int = 1,
285
+ num_channels_latents: int = 16,
286
+ num_frames: int = 13,
287
+ height: int = 60,
288
+ width: int = 90,
289
+ dtype: Optional[torch.dtype] = None,
290
+ device: Optional[torch.device] = None,
291
+ generator: Optional[torch.Generator] = None,
292
+ latents: Optional[torch.Tensor] = None,
293
+ ):
294
+ num_frames = (num_frames - 1) // self.vae_scale_factor_temporal + 1
295
+ shape = (
296
+ batch_size,
297
+ num_frames,
298
+ num_channels_latents,
299
+ height // self.vae_scale_factor_spatial,
300
+ width // self.vae_scale_factor_spatial,
301
+ )
302
+
303
+ if isinstance(generator, list) and len(generator) != batch_size:
304
+ raise ValueError(
305
+ f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
306
+ f" size of {batch_size}. Make sure the batch size matches the length of the generators."
307
+ )
308
+
309
+ first_image = first_image.unsqueeze(2) # [B, C, F, H, W]
310
+ last_image = last_image.unsqueeze(2) # [B, C, F, H, W]
311
+
312
+ if isinstance(generator, list):
313
+ first_image_latents = [
314
+ retrieve_latents(self.vae.encode(first_image[i].unsqueeze(0)), generator[i]) for i in range(batch_size)
315
+ ]
316
+ else:
317
+ first_image_latents = [retrieve_latents(self.vae.encode(first_img.unsqueeze(0)), generator) for first_img in first_image]
318
+
319
+ if isinstance(generator, list):
320
+ last_image_latents = [
321
+ retrieve_latents(self.vae.encode(last_image[i].unsqueeze(0)), generator[i]) for i in range(batch_size)
322
+ ]
323
+ else:
324
+ last_image_latents = [retrieve_latents(self.vae.encode(last_img.unsqueeze(0)), generator) for last_img in last_image]
325
+
326
+
327
+ first_image_latents = torch.cat(first_image_latents, dim=0).to(dtype).permute(0, 2, 1, 3, 4) # [B, F, C, H, W]
328
+ first_image_latents = self.vae.config.scaling_factor * first_image_latents
329
+ last_image_latents = torch.cat(last_image_latents, dim=0).to(dtype).permute(0, 2, 1, 3, 4) # [B, F, C, H, W]
330
+ last_image_latents = self.vae.config.scaling_factor * last_image_latents
331
+
332
+
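+ # Only the first and last latent frames carry encoded image content; the
+ # (num_frames - 2) latent frames in between are zero padding that the transformer
+ # learns to fill in.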
333
+ padding_shape = (
334
+ batch_size,
335
+ num_frames - 2,
336
+ num_channels_latents,
337
+ height // self.vae_scale_factor_spatial,
338
+ width // self.vae_scale_factor_spatial,
339
+ )
340
+ latent_padding = torch.zeros(padding_shape, device=device, dtype=dtype)
341
+ image_latents = torch.cat([first_image_latents, latent_padding, last_image_latents], dim=1)
342
+
343
+ if latents is None:
344
+ latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
345
+ else:
346
+ latents = latents.to(device)
347
+
348
+ # scale the initial noise by the standard deviation required by the scheduler
349
+ latents = latents * self.scheduler.init_noise_sigma
350
+ return latents, image_latents
351
+
352
+ # Copied from diffusers.pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipeline.decode_latents
353
+ def decode_latents(self, latents: torch.Tensor) -> torch.Tensor:
354
+ latents = latents.permute(0, 2, 1, 3, 4) # [batch_size, num_channels, num_frames, height, width]
355
+ latents = 1 / self.vae.config.scaling_factor * latents
356
+
357
+ frames = self.vae.decode(latents).sample
358
+ return frames
359
+
360
+ # Copied from diffusers.pipelines.animatediff.pipeline_animatediff_video2video.AnimateDiffVideoToVideoPipeline.get_timesteps
361
+ def get_timesteps(self, num_inference_steps, timesteps, strength, device):
362
+ # get the original timestep using init_timestep
363
+ init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
364
+
365
+ t_start = max(num_inference_steps - init_timestep, 0)
366
+ timesteps = timesteps[t_start * self.scheduler.order :]
367
+
368
+ return timesteps, num_inference_steps - t_start
369
+
370
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
371
+ def prepare_extra_step_kwargs(self, generator, eta):
372
+ # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
373
+ # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
374
+ # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
375
+ # and should be between [0, 1]
376
+
377
+ accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
378
+ extra_step_kwargs = {}
379
+ if accepts_eta:
380
+ extra_step_kwargs["eta"] = eta
381
+
382
+ # check if the scheduler accepts generator
383
+ accepts_generator = "generator" in set(inspect.signature(self.scheduler.step).parameters.keys())
384
+ if accepts_generator:
385
+ extra_step_kwargs["generator"] = generator
386
+ return extra_step_kwargs
387
+
388
+ def check_inputs(
389
+ self,
390
+ first_image,
391
+ last_image,
392
+ prompt,
393
+ height,
394
+ width,
395
+ negative_prompt,
396
+ callback_on_step_end_tensor_inputs,
397
+ video=None,
398
+ latents=None,
399
+ prompt_embeds=None,
400
+ negative_prompt_embeds=None,
401
+ ):
402
+ if (
403
+ not isinstance(first_image, torch.Tensor)
404
+ and not isinstance(first_image, PIL.Image.Image)
405
+ and not isinstance(first_image, list)
406
+ ):
407
+ raise ValueError(
408
+ "`image` has to be of type `torch.Tensor` or `PIL.Image.Image` or `List[PIL.Image.Image]` but is"
409
+ f" {type(first_image)}"
410
+ )
411
+
412
+ if (
413
+ not isinstance(last_image, torch.Tensor)
414
+ and not isinstance(last_image, PIL.Image.Image)
415
+ and not isinstance(last_image, list)
416
+ ):
417
+ raise ValueError(
418
+ "`image` has to be of type `torch.Tensor` or `PIL.Image.Image` or `List[PIL.Image.Image]` but is"
419
+ f" {type(last_image)}"
420
+ )
421
+
422
+
423
+ if height % 8 != 0 or width % 8 != 0:
424
+ raise ValueError(f"`height` and `width` have to be divisible by 8 but are {height} and {width}.")
425
+
426
+ if callback_on_step_end_tensor_inputs is not None and not all(
427
+ k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
428
+ ):
429
+ raise ValueError(
430
+ f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
431
+ )
432
+ if prompt is not None and prompt_embeds is not None:
433
+ raise ValueError(
434
+ f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
435
+ " only forward one of the two."
436
+ )
437
+ elif prompt is None and prompt_embeds is None:
438
+ raise ValueError(
439
+ "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
440
+ )
441
+ elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
442
+ raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
443
+
444
+ if prompt is not None and negative_prompt_embeds is not None:
445
+ raise ValueError(
446
+ f"Cannot forward both `prompt`: {prompt} and `negative_prompt_embeds`:"
447
+ f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
448
+ )
449
+
450
+ if negative_prompt is not None and negative_prompt_embeds is not None:
451
+ raise ValueError(
452
+ f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
453
+ f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
454
+ )
455
+
456
+ if prompt_embeds is not None and negative_prompt_embeds is not None:
457
+ if prompt_embeds.shape != negative_prompt_embeds.shape:
458
+ raise ValueError(
459
+ "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
460
+ f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
461
+ f" {negative_prompt_embeds.shape}."
462
+ )
463
+
464
+ if video is not None and latents is not None:
465
+ raise ValueError("Only one of `video` or `latents` should be provided")
466
+
467
+ # Copied from diffusers.pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipeline.fuse_qkv_projections
468
+ def fuse_qkv_projections(self) -> None:
469
+ r"""Enables fused QKV projections."""
470
+ self.fusing_transformer = True
471
+ self.transformer.fuse_qkv_projections()
472
+
473
+ # Copied from diffusers.pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipeline.unfuse_qkv_projections
474
+ def unfuse_qkv_projections(self) -> None:
475
+ r"""Disable QKV projection fusion if enabled."""
476
+ if not self.fusing_transformer:
477
+ logger.warning("The Transformer was not initially fused for QKV projections. Doing nothing.")
478
+ else:
479
+ self.transformer.unfuse_qkv_projections()
480
+ self.fusing_transformer = False
481
+
482
+ # Copied from diffusers.pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipeline._prepare_rotary_positional_embeddings
483
+ def _prepare_rotary_positional_embeddings(
484
+ self,
485
+ height: int,
486
+ width: int,
487
+ num_frames: int,
488
+ device: torch.device,
489
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
490
+ grid_height = height // (self.vae_scale_factor_spatial * self.transformer.config.patch_size)
491
+ grid_width = width // (self.vae_scale_factor_spatial * self.transformer.config.patch_size)
492
+ base_size_width = 720 // (self.vae_scale_factor_spatial * self.transformer.config.patch_size)
493
+ base_size_height = 480 // (self.vae_scale_factor_spatial * self.transformer.config.patch_size)
494
+
495
+ grid_crops_coords = get_resize_crop_region_for_grid(
496
+ (grid_height, grid_width), base_size_width, base_size_height
497
+ )
498
+ freqs_cos, freqs_sin = get_3d_rotary_pos_embed(
499
+ embed_dim=self.transformer.config.attention_head_dim,
500
+ crops_coords=grid_crops_coords,
501
+ grid_size=(grid_height, grid_width),
502
+ temporal_size=num_frames,
503
+ )
504
+
505
+ freqs_cos = freqs_cos.to(device=device)
506
+ freqs_sin = freqs_sin.to(device=device)
507
+ return freqs_cos, freqs_sin
508
+
509
+ @property
510
+ def guidance_scale(self):
511
+ return self._guidance_scale
512
+
513
+ @property
514
+ def num_timesteps(self):
515
+ return self._num_timesteps
516
+
517
+ @property
518
+ def interrupt(self):
519
+ return self._interrupt
520
+
521
+ @torch.no_grad()
522
+ def __call__(
523
+ self,
524
+ first_image: PipelineImageInput,
525
+ last_image: PipelineImageInput,
526
+ prompt: Optional[Union[str, List[str]]] = None,
527
+ negative_prompt: Optional[Union[str, List[str]]] = None,
528
+ height: int = 480,
529
+ width: int = 720,
530
+ num_frames: int = 49,
531
+ num_inference_steps: int = 50,
532
+ timesteps: Optional[List[int]] = None,
533
+ guidance_scale: float = 6,
534
+ use_dynamic_cfg: bool = False,
535
+ num_videos_per_prompt: int = 1,
536
+ eta: float = 0.0,
537
+ generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
538
+ latents: Optional[torch.FloatTensor] = None,
539
+ prompt_embeds: Optional[torch.FloatTensor] = None,
540
+ negative_prompt_embeds: Optional[torch.FloatTensor] = None,
541
+ output_type: str = "pil",
542
+ return_dict: bool = True,
543
+ callback_on_step_end: Optional[
544
+ Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks]
545
+ ] = None,
546
+ callback_on_step_end_tensor_inputs: List[str] = ["latents"],
547
+ max_sequence_length: int = 226,
548
+ ):
549
+ """
550
+ Function invoked when calling the pipeline for generation.
551
+
552
+ Args:
553
+ image (`PipelineImageInput`):
554
+ The input video to condition the generation on. Must be an image, a list of images or a `torch.Tensor`.
555
+ prompt (`str` or `List[str]`, *optional*):
556
+ The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
557
+ instead.
558
+ negative_prompt (`str` or `List[str]`, *optional*):
559
+ The prompt or prompts not to guide the image generation. If not defined, one has to pass
560
+ `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
561
+ less than `1`).
562
+ height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
563
+ The height in pixels of the generated image. This is set to 1024 by default for the best results.
564
+ width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
565
+ The width in pixels of the generated image. This is set to 1024 by default for the best results.
566
+ num_frames (`int`, defaults to `49`):
567
+ Number of frames to generate. Must be divisible by self.vae_scale_factor_temporal. Generated video will
568
+ contain 1 extra frame because CogVideoX is conditioned with (num_seconds * fps + 1) frames where
569
+ num_seconds is 6 and fps is 4. However, since videos can be saved at any fps, the only condition that
570
+ needs to be satisfied is that of divisibility mentioned above.
571
+ num_inference_steps (`int`, *optional*, defaults to 50):
572
+ The number of denoising steps. More denoising steps usually lead to a higher quality image at the
573
+ expense of slower inference.
574
+ timesteps (`List[int]`, *optional*):
575
+ Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
576
+ in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
577
+ passed will be used. Must be in descending order.
578
+ guidance_scale (`float`, *optional*, defaults to 6.0):
579
+ Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
580
+ `guidance_scale` is defined as `w` of equation 2. of [Imagen
581
+ Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
582
+ 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
583
+ usually at the expense of lower image quality.
584
+ num_videos_per_prompt (`int`, *optional*, defaults to 1):
585
+ The number of videos to generate per prompt.
586
+ generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
587
+ One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
588
+ to make generation deterministic.
589
+ latents (`torch.FloatTensor`, *optional*):
590
+ Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
591
+ generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
592
+ tensor will ge generated by sampling using the supplied random `generator`.
593
+ prompt_embeds (`torch.FloatTensor`, *optional*):
594
+ Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
595
+ provided, text embeddings will be generated from `prompt` input argument.
596
+ negative_prompt_embeds (`torch.FloatTensor`, *optional*):
597
+ Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
598
+ weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
599
+ argument.
600
+ output_type (`str`, *optional*, defaults to `"pil"`):
601
+ The output format of the generate image. Choose between
602
+ [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
603
+ return_dict (`bool`, *optional*, defaults to `True`):
604
+ Whether or not to return a [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] instead
605
+ of a plain tuple.
606
+ callback_on_step_end (`Callable`, *optional*):
607
+ A function that calls at the end of each denoising steps during the inference. The function is called
608
+ with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
609
+ callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
610
+ `callback_on_step_end_tensor_inputs`.
611
+ callback_on_step_end_tensor_inputs (`List`, *optional*):
612
+ The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
613
+ will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
614
+ `._callback_tensor_inputs` attribute of your pipeline class.
615
+ max_sequence_length (`int`, defaults to `226`):
616
+ Maximum sequence length in encoded prompt. Must be consistent with
617
+ `self.transformer.config.max_text_seq_length` otherwise may lead to poor results.
618
+
619
+ Examples:
620
+
621
+ Returns:
622
+ [`~pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput`] or `tuple`:
623
+ [`~pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput`] if `return_dict` is True, otherwise a
624
+ `tuple`. When returning a tuple, the first element is a list with the generated images.
625
+ """
626
+
627
+ if num_frames > 49:
628
+ raise ValueError(
629
+ "The number of frames must be less than 49 for now due to static positional embeddings. This will be updated in the future to remove this limitation."
630
+ )
631
+
632
+ if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
633
+ callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs
634
+
635
+ height = height or self.transformer.config.sample_size * self.vae_scale_factor_spatial
636
+ width = width or self.transformer.config.sample_size * self.vae_scale_factor_spatial
637
+ num_videos_per_prompt = 1
638
+
639
+ # 1. Check inputs. Raise error if not correct
640
+ self.check_inputs(
641
+ first_image,
642
+ last_image,
643
+ prompt,
644
+ height,
645
+ width,
646
+ negative_prompt,
647
+ callback_on_step_end_tensor_inputs,
+ # Pass these by keyword; positionally they would bind to check_inputs' `video`/`latents`.
+ prompt_embeds=prompt_embeds,
+ negative_prompt_embeds=negative_prompt_embeds,
650
+ )
651
+ self._guidance_scale = guidance_scale
652
+ self._interrupt = False
653
+
654
+ # 2. Default call parameters
655
+ if prompt is not None and isinstance(prompt, str):
656
+ batch_size = 1
657
+ elif prompt is not None and isinstance(prompt, list):
658
+ batch_size = len(prompt)
659
+ else:
660
+ batch_size = prompt_embeds.shape[0]
661
+
662
+ device = self._execution_device
663
+
664
+ # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
665
+ # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
666
+ # corresponds to doing no classifier free guidance.
667
+ do_classifier_free_guidance = guidance_scale > 1.0
668
+
669
+ # 3. Encode input prompt
670
+ prompt_embeds, negative_prompt_embeds = self.encode_prompt(
671
+ prompt=prompt,
672
+ negative_prompt=negative_prompt,
673
+ do_classifier_free_guidance=do_classifier_free_guidance,
674
+ num_videos_per_prompt=num_videos_per_prompt,
675
+ prompt_embeds=prompt_embeds,
676
+ negative_prompt_embeds=negative_prompt_embeds,
677
+ max_sequence_length=max_sequence_length,
678
+ device=device,
679
+ )
680
+ if do_classifier_free_guidance:
681
+ prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
682
+
683
+ # 4. Prepare timesteps
684
+ timesteps, num_inference_steps = retrieve_timesteps(self.scheduler, num_inference_steps, device, timesteps)
685
+ self._num_timesteps = len(timesteps)
686
+
687
+ # 5. Prepare latents
688
+ first_image = self.video_processor.preprocess(first_image, height=height, width=width).to(
689
+ device, dtype=prompt_embeds.dtype
690
+ )
691
+ last_image = self.video_processor.preprocess(last_image, height=height, width=width).to(
692
+ device, dtype=prompt_embeds.dtype
693
+ )
694
+
695
+ latent_channels = self.transformer.config.in_channels // 2
696
+ latents, image_latents = self.prepare_latents(
697
+ first_image,
698
+ last_image,
699
+ batch_size * num_videos_per_prompt,
700
+ latent_channels,
701
+ num_frames,
702
+ height,
703
+ width,
704
+ prompt_embeds.dtype,
705
+ device,
706
+ generator,
707
+ latents,
708
+ )
709
+
710
+ # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
711
+ extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
712
+
713
+ # 7. Create rotary embeds if required
714
+ image_rotary_emb = (
715
+ self._prepare_rotary_positional_embeddings(height, width, latents.size(1), device)
716
+ if self.transformer.config.use_rotary_positional_embeddings
717
+ else None
718
+ )
719
+
720
+ # 8. Denoising loop
721
+ num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
722
+
723
+ with self.progress_bar(total=num_inference_steps) as progress_bar:
724
+ # for DPM-solver++
725
+ old_pred_original_sample = None
726
+ for i, t in enumerate(timesteps):
727
+ if self.interrupt:
728
+ continue
729
+
730
+ latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
731
+ latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
732
+
733
+ latent_image_input = torch.cat([image_latents] * 2) if do_classifier_free_guidance else image_latents
734
+ latent_model_input = torch.cat([latent_model_input, latent_image_input], dim=2)
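+ # The noisy video latents and the first/last-frame conditioning latents are concatenated
+ # along the channel axis; this is why `latent_channels` above is
+ # `self.transformer.config.in_channels // 2`.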
735
+
736
+ # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
737
+ timestep = t.expand(latent_model_input.shape[0])
738
+
739
+ # predict noise model_output
740
+ noise_pred = self.transformer(
741
+ hidden_states=latent_model_input,
742
+ encoder_hidden_states=prompt_embeds,
743
+ timestep=timestep,
744
+ image_rotary_emb=image_rotary_emb,
745
+ return_dict=False,
746
+ )[0]
747
+ noise_pred = noise_pred.float()
748
+
749
+ # perform guidance
750
+ if use_dynamic_cfg:
751
+ self._guidance_scale = 1 + guidance_scale * (
752
+ (1 - math.cos(math.pi * ((num_inference_steps - t.item()) / num_inference_steps) ** 5.0)) / 2
753
+ )
754
+ if do_classifier_free_guidance:
755
+ noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
756
+ noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_text - noise_pred_uncond)
757
+
758
+ # compute the previous noisy sample x_t -> x_t-1
759
+ if not isinstance(self.scheduler, CogVideoXDPMScheduler):
760
+ latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]
761
+ else:
762
+ latents, old_pred_original_sample = self.scheduler.step(
763
+ noise_pred,
764
+ old_pred_original_sample,
765
+ t,
766
+ timesteps[i - 1] if i > 0 else None,
767
+ latents,
768
+ **extra_step_kwargs,
769
+ return_dict=False,
770
+ )
771
+ latents = latents.to(prompt_embeds.dtype)
772
+
773
+ # call the callback, if provided
774
+ if callback_on_step_end is not None:
775
+ callback_kwargs = {}
776
+ for k in callback_on_step_end_tensor_inputs:
777
+ callback_kwargs[k] = locals()[k]
778
+ callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
779
+
780
+ latents = callback_outputs.pop("latents", latents)
781
+ prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
782
+ negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)
783
+
784
+ if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
785
+ progress_bar.update()
786
+
787
+ if not output_type == "latent":
788
+ video = self.decode_latents(latents)
789
+ video = self.video_processor.postprocess_video(video=video, output_type=output_type)
790
+ else:
791
+ video = latents
792
+
793
+ # Offload all models
794
+ self.maybe_free_model_hooks()
795
+
796
+ if not return_dict:
797
+ return (video,)
798
+
799
+ return (video,)
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ diffusers==0.30.3
+ transformers==4.44.2
+ accelerate==0.34.0
+ gradio>=4.0.0
+ torch>=2.0.0
+ torchvision
+ Pillow
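+ # decord is not listed here: it is only needed by cogvideox_interpolation/datasets.py
+ # (training data loading), not by the Gradio app; install it separately with `pip install decord` if you use that module.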