Instructions to use IamCreateAI/Ruyi-Mini-7B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Diffusers
How to use IamCreateAI/Ruyi-Mini-7B with Diffusers:
    pip install -U diffusers transformers accelerate

    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import load_image, export_to_video

    # switch to "mps" for apple devices
    pipe = DiffusionPipeline.from_pretrained("IamCreateAI/Ruyi-Mini-7B", dtype=torch.bfloat16, device_map="cuda")
    pipe.to("cuda")

    prompt = "A man with short gray hair plays a red electric guitar."
    image = load_image(
        "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/guitar-man.png"
    )

    output = pipe(image=image, prompt=prompt).frames[0]
    export_to_video(output, "output.mp4")

- Notebooks
- Google Colab
- Kaggle
Update README.md
README.md
CHANGED
@@ -46,7 +46,7 @@ Or use ComfyUI wrapper in our [github repo](https://github.com/IamCreateAI/Ruyi-
 ## Model Architecture
 
 Ruyi-Mini-7B is an advanced image-to-video model with about 7.1 billion parameters. The model architecture is modified from [EasyAnimate V4 model](https://github.com/aigc-apps/EasyAnimate), whose transformer module is inherited from [HunyuanDiT](https://github.com/Tencent/HunyuanDiT). It comprises three key components:
-1. Casual VAE Module: Handles video compression and decompression. It reduces spatial resolution to 1/8 and temporal resolution to 1/4, with each latent pixel is represented in 16
+1. Casual VAE Module: Handles video compression and decompression. It reduces spatial resolution to 1/8 and temporal resolution to 1/4, with each latent pixel is represented in 16 float numbers after compression.
 2. Diffusion Transformer Module: Generates compressed video data using 3D full attention, with:
 - 2D Normalized-RoPE for spatial dimensions;
 - Sin-cos position embedding for temporal dimensions;
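
The compression figures quoted in the README hunk (spatial resolution reduced to 1/8, temporal resolution to 1/4, 16 float numbers per latent pixel) can be turned into a rough latent-size estimate. The sketch below is only illustrative arithmetic: the function names `latent_shape` and `compression_ratio` are hypothetical, the ceiling-division rounding and the 3-channel RGB input are assumptions, and the actual VAE may round or pad frame counts differently.

```python
import math


def latent_shape(frames, height, width,
                 spatial_factor=8, temporal_factor=4, latent_channels=16):
    """Estimate the latent tensor shape (channels, frames, height, width).

    Assumes plain ceiling division; the real model may treat the
    first frame or edge padding differently.
    """
    return (
        latent_channels,
        math.ceil(frames / temporal_factor),
        math.ceil(height / spatial_factor),
        math.ceil(width / spatial_factor),
    )


def compression_ratio(frames, height, width, input_channels=3):
    """Ratio of input values (assumed RGB) to latent values."""
    c, t, h, w = latent_shape(frames, height, width)
    return (frames * height * width * input_channels) / (c * t * h * w)


shape = latent_shape(frames=120, height=512, width=512)
ratio = compression_ratio(frames=120, height=512, width=512)
print(shape)  # (16, 30, 64, 64)
print(ratio)  # 48.0
```

Under these assumptions, a 120-frame 512x512 clip maps to a 16x30x64x64 latent, about 48 times fewer values than the raw RGB video, which is the tensor the diffusion transformer would operate on.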