<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# CogView4
> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase and weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM).
## CogView4Pipeline[[diffusers.CogView4Pipeline]]
#### diffusers.CogView4Pipeline[[diffusers.CogView4Pipeline]]
[Source](https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/cogview4/pipeline_cogview4.py#L137)
Pipeline for text-to-image generation using CogView4.
This model inherits from [DiffusionPipeline](/docs/diffusers/pr_11739/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, or running on a particular device).
#### __call__[[diffusers.CogView4Pipeline.__call__]]
[Source](https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/cogview4/pipeline_cogview4.py#L402)

```python
__call__(
    prompt: Optional[Union[str, List[str]]] = None,
    negative_prompt: Optional[Union[str, List[str]]] = None,
    height: Optional[int] = None,
    width: Optional[int] = None,
    num_inference_steps: int = 50,
    timesteps: Optional[List[int]] = None,
    sigmas: Optional[List[float]] = None,
    guidance_scale: float = 5.0,
    num_images_per_prompt: int = 1,
    generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
    latents: Optional[torch.FloatTensor] = None,
    prompt_embeds: Optional[torch.FloatTensor] = None,
    negative_prompt_embeds: Optional[torch.FloatTensor] = None,
    original_size: Optional[Tuple[int, int]] = None,
    crops_coords_top_left: Tuple[int, int] = (0, 0),
    output_type: str = "pil",
    return_dict: bool = True,
    attention_kwargs: Optional[Dict[str, Any]] = None,
    callback_on_step_end: Optional[Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks]] = None,
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    max_sequence_length: int = 1024,
)
```

**Parameters:**
- **prompt** (`str` or `List[str]`, *optional*) --
  The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
- **negative_prompt** (`str` or `List[str]`, *optional*) --
  The prompt or prompts not to guide the image generation. If not defined, one has to pass
  `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
  less than `1`).
- **height** (`int`, *optional*, defaults to `self.transformer.config.sample_size * self.vae_scale_factor`) --
  The height in pixels of the generated image. If not provided, it is set to 1024.
- **width** (`int`, *optional*, defaults to `self.transformer.config.sample_size * self.vae_scale_factor`) --
  The width in pixels of the generated image. If not provided, it is set to 1024.
- **num_inference_steps** (`int`, *optional*, defaults to `50`) --
  The number of denoising steps. More denoising steps usually lead to a higher quality image at the
  expense of slower inference.
- **timesteps** (`List[int]`, *optional*) --
  Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
  in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
  passed will be used. Must be in descending order.
- **sigmas** (`List[float]`, *optional*) --
  Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
  their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
  will be used.
- **guidance_scale** (`float`, *optional*, defaults to `5.0`) --
  Guidance scale as defined in [Classifier-Free Diffusion
  Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2
  of the [Imagen paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
  `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely
  linked to the text `prompt`, usually at the expense of lower image quality.
- **num_images_per_prompt** (`int`, *optional*, defaults to `1`) --
  The number of images to generate per prompt.
- **generator** (`torch.Generator` or `List[torch.Generator]`, *optional*) --
  One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
  to make generation deterministic.
- **latents** (`torch.FloatTensor`, *optional*) --
  Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
  generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
  tensor will be generated by sampling using the supplied random `generator`.
- **prompt_embeds** (`torch.FloatTensor`, *optional*) --
  Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
  provided, text embeddings will be generated from the `prompt` input argument.
- **negative_prompt_embeds** (`torch.FloatTensor`, *optional*) --
  Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
  weighting. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input
  argument.
- **original_size** (`Tuple[int, int]`, *optional*, defaults to `(1024, 1024)`) --
  If `original_size` is not the same as `target_size`, the image will appear to be down- or upsampled.
  `original_size` defaults to `(height, width)` if not specified. Part of SDXL's micro-conditioning as
  explained in section 2.2 of
  [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
- **crops_coords_top_left** (`Tuple[int, int]`, *optional*, defaults to `(0, 0)`) --
  `crops_coords_top_left` can be used to generate an image that appears to be "cropped" from the position
  `crops_coords_top_left` downwards. Favorable, well-centered images are usually achieved by setting
  `crops_coords_top_left` to `(0, 0)`. Part of SDXL's micro-conditioning as explained in section 2.2 of
  [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
- **output_type** (`str`, *optional*, defaults to `"pil"`) --
  The output format of the generated image. Choose between
  [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
- **return_dict** (`bool`, *optional*, defaults to `True`) --
  Whether or not to return a `~pipelines.cogview4.pipeline_CogView4.CogView4PipelineOutput` instead
  of a plain tuple.
- **attention_kwargs** (`dict`, *optional*) --
  A kwargs dictionary that, if specified, is passed along to the `AttentionProcessor` as defined under
  `self.processor` in
  [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
- **callback_on_step_end** (`Callable`, *optional*) --
  A function that is called at the end of each denoising step during inference (see the sketch after the
  example below). The function is called with the following arguments: `callback_on_step_end(self:
  DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a
  list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
- **callback_on_step_end_tensor_inputs** (`List`, *optional*) --
  The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
  will be passed as the `callback_kwargs` argument. You will only be able to include variables listed in
  the `._callback_tensor_inputs` attribute of your pipeline class.
- **max_sequence_length** (`int`, *optional*, defaults to `1024`) --
  Maximum sequence length in the encoded prompt. Can be set to other values but may lead to poorer
  results.
Function invoked when calling the pipeline for generation.
Examples:
```python
>>> import torch
>>> from diffusers import CogView4Pipeline

>>> pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> prompt = "A photo of an astronaut riding a horse on mars"
>>> image = pipe(prompt).images[0]
>>> image.save("output.png")
```
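The documented arguments compose in the usual way. Below is a slightly fuller sketch of a seeded, guided call with a step-end callback; the prompt strings, seed, callback body, and file name are illustrative and not part of the original example:
```python
>>> import torch
>>> from diffusers import CogView4Pipeline

>>> pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> def log_step(pipeline, step, timestep, callback_kwargs):
...     # Inspect intermediate tensors requested via callback_on_step_end_tensor_inputs
...     print(f"step {step}: latents shape {callback_kwargs['latents'].shape}")
...     return callback_kwargs

>>> # Fixing the generator seed makes the call reproducible
>>> generator = torch.Generator(device="cuda").manual_seed(42)
>>> image = pipe(
...     prompt="A photo of an astronaut riding a horse on mars",
...     negative_prompt="blurry, low quality",  # only used when guidance_scale > 1
...     height=1024,
...     width=1024,
...     guidance_scale=5.0,
...     num_inference_steps=50,
...     generator=generator,
...     callback_on_step_end=log_step,
...     callback_on_step_end_tensor_inputs=["latents"],
... ).images[0]
>>> image.save("output_seeded.png")
```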
**Parameters:**
vae ([AutoencoderKL](/docs/diffusers/pr_11739/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) : Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (`GLMModel`) : Frozen text-encoder. CogView4 uses [glm-4-9b-hf](https://huggingface.co/THUDM/glm-4-9b-hf).
tokenizer (`PreTrainedTokenizer`) : Tokenizer of class [PreTrainedTokenizer](https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#transformers.PreTrainedTokenizer).
transformer ([CogView4Transformer2DModel](/docs/diffusers/pr_11739/en/api/models/cogview4_transformer2d#diffusers.CogView4Transformer2DModel)) : A text-conditioned `CogView4Transformer2DModel` to denoise the encoded image latents.
scheduler ([SchedulerMixin](/docs/diffusers/pr_11739/en/api/schedulers/overview#diffusers.SchedulerMixin)) : A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
**Returns:**
`~pipelines.cogview4.pipeline_CogView4.CogView4PipelineOutput` or `tuple`
`~pipelines.cogview4.pipeline_CogView4.CogView4PipelineOutput` if `return_dict` is `True`, otherwise a
`tuple`. When returning a tuple, the first element is a list with the generated images.
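When `return_dict=False`, index the returned tuple directly instead of accessing `.images`; a minimal sketch (prompt and file name illustrative):
```python
>>> import torch
>>> from diffusers import CogView4Pipeline

>>> pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16).to("cuda")
>>> # The first tuple element is the list of generated images
>>> images = pipe("A photo of an astronaut riding a horse on mars", return_dict=False)[0]
>>> images[0].save("output.png")
```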
#### encode_prompt[[diffusers.CogView4Pipeline.encode_prompt]]
[Source](https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/cogview4/pipeline_cogview4.py#L221)
Encodes the prompt into text encoder hidden states.
**Parameters:**
prompt (`str` or `List[str]`, *optional*) : prompt to be encoded
negative_prompt (`str` or `List[str]`, *optional*) : The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
do_classifier_free_guidance (`bool`, *optional*, defaults to `True`) : Whether to use classifier-free guidance or not.
num_images_per_prompt (`int`, *optional*, defaults to 1) : Number of images that should be generated per prompt.
prompt_embeds (`torch.Tensor`, *optional*) : Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
negative_prompt_embeds (`torch.Tensor`, *optional*) : Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input argument.
device (`torch.device`, *optional*) : torch device to place the resulting embeddings on
dtype (`torch.dtype`, *optional*) : torch dtype
max_sequence_length (`int`, defaults to `1024`) : Maximum sequence length in the encoded prompt. Can be set to other values but may lead to poorer results.
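Precomputed embeddings can be reused across calls and passed to `__call__` via `prompt_embeds` and `negative_prompt_embeds`. A minimal sketch, assuming `encode_prompt` returns a `(prompt_embeds, negative_prompt_embeds)` tuple as it does in other diffusers pipelines:
```python
>>> import torch
>>> from diffusers import CogView4Pipeline

>>> pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16).to("cuda")

>>> # Encode once, then reuse the embeddings for several generations
>>> prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(
...     prompt="A photo of an astronaut riding a horse on mars",
...     negative_prompt="blurry, low quality",
...     do_classifier_free_guidance=True,
...     device=pipe.device,
... )
>>> image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds).images[0]
>>> image.save("output_from_embeds.png")
```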
## CogView4PipelineOutput[[diffusers.pipelines.cogview4.pipeline_output.CogView4PipelineOutput]]
#### diffusers.pipelines.cogview4.pipeline_output.CogView4PipelineOutput[[diffusers.pipelines.cogview4.pipeline_output.CogView4PipelineOutput]]
[Source](https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/pipelines/cogview4/pipeline_output.py#L11)
Output class for CogView4 pipelines.
**Parameters:**
images (`List[PIL.Image.Image]` or `np.ndarray`) : List of denoised PIL images of length `batch_size`, or a NumPy array of shape `(batch_size, height, width, num_channels)`. The PIL images or NumPy array represent the denoised images of the diffusion pipeline.