---
license: mit
pipeline_tag: any-to-any
---
**Quantized GGUF versions of Capybara-1.0**
**Original model:** [https://huggingface.co/xgen-universe/Capybara](https://huggingface.co/xgen-universe/Capybara)
**Watch us on YouTube:** [@VantageWithAI](https://www.youtube.com/@vantagewithai)
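To fetch these GGUF weights locally before loading them in your runtime, something like the following should work. This is a sketch, not an official install guide: the repo id `vantagewithai/Capybara-GGUF` is inferred from this card's origin, so adjust it to the repo you are actually viewing.

```shell
# Install the Hugging Face Hub CLI, then pull the whole repo into a local folder.
# The repo id below is an assumption; substitute the one shown on this model page.
pip install -U "huggingface_hub[cli]"
huggingface-cli download vantagewithai/Capybara-GGUF --local-dir ./Capybara-GGUF
```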
<p align="center">
<img src="https://huggingface.co/xgen-universe/Capybara/resolve/main/assets/misc/logo.png" style="width: 80%; height: auto;"/>
</p>
# CAPYBARA: A Unified Visual Creation Model
<div align="center">
<a href="https://lllydialee.github.io/Capybara-Project-Page" target="_blank"><img alt="Project Page" src="https://img.shields.io/badge/Project-Page-8A2BE2" height="22px" /></a>
<a href="https://huggingface.co/xgen-universe/Capybara" target="_blank"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Models-d96902.svg" height="22px"></a>
<a href="https://github.com/xgen-universe/Capybara" target="_blank"><img src="https://img.shields.io/badge/Page-bb8a2e.svg?logo=github" height="22px"></a>
<a href="https://inappetent-acrophonically-alison.ngrok-free.dev/" target="_blank"><img src="https://img.shields.io/badge/Gradio-Demo-orange?logo=gradio&logoColor=white" height="22px"></a>
<a href="https://github.com/xgen-universe/Capybara/blob/main/assets/docs/tech_report.pdf" target="_blank"><img src="https://img.shields.io/badge/Technical_Report-PDF-EC1C24?logo=adobeacrobatreader&logoColor=white" height="22px"></a>
<a href="https://pub-7ba77a763b8142cea73b9e48b46830ca.r2.dev/wechat.jpg" target="_blank"><img src="https://img.shields.io/badge/WeChat-07C160?style=flat&logo=wechat&logoColor=white" height="22px"></a>
</div>
<p align="center">
🎉 Welcome to visit our <a href="https://lllydialee.github.io/Capybara-Project-Page">Project Page</a> |
πŸ’» Visit our <a href="https://inappetent-acrophonically-alison.ngrok-free.dev/">Demo Website</a> to try our model!
</p>
**Capybara** is a unified visual creation model: a single framework designed for high-quality visual generation and editing tasks.
The framework leverages advanced diffusion models and transformer architectures to support versatile visual generation and editing capabilities with precise control over content, motion, and camera movements.
<table>
<tr>
<td align="center">
<video src="https://pub-7ba77a763b8142cea73b9e48b46830ca.r2.dev/demo_video.mp4" controls="controls" style="max-width: 100%;">
</video>
<br>
<sub>Speech-driven base clips generated by Seedance 2.0; all editing powered by CAPYBARA.</sub>
</td>
</tr>
</table>
**Key Features:**
* 🎬 **Multi-Task Support**: Supports Text-to-Video (T2V), Text-to-Image (T2I), Instruction-based Video-to-Video (TV2V), Instruction-based Image-to-Image (TI2I), and various editing tasks
* πŸš€ **High Performance**: Built with distributed inference support for efficient multi-GPU processing
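As a sketch of the multi-GPU path (Accelerate appears in the Acknowledgments below), a distributed run might look like the following. The script name `sample.py` and its flags are assumptions for illustration, not the project's documented interface:

```shell
# Hypothetical multi-GPU launch via Accelerate; replace sample.py and its
# flags with the actual entry point from the Capybara inference framework.
accelerate launch --num_processes 2 sample.py \
  --task t2v --prompt "a capybara surfing at sunset" --steps 50
```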
## πŸ”₯ News
* **[2026.02.20]** 🎨 Added [ComfyUI support](#-comfyui-support) with custom nodes for all task types (T2I, T2V, TI2I, TV2V), together with [FP8 quantization](#-fp8-quantization) support for the inference script and ComfyUI custom node.
* **[2026.02.17]** πŸš€ Initial release v0.1 of the Capybara inference framework supporting generation and instruction-based editing tasks (T2I, T2V, TI2I, TV2V).
## πŸ“ TODO List
- [x] Add support for ComfyUI.
- [ ] Release our unified creation model.
- [ ] Release training code.
## 🏞️ Show Cases
**Results of generation tasks.** We show two generation tasks under our unified model. The top section presents text-to-image results, illustrating high-fidelity synthesis across diverse styles. The bottom rows show text-to-video results, demonstrating temporally coherent generation with natural motion for both realistic and stylized content.
<p align="center">
<img src="https://huggingface.co/xgen-universe/Capybara/resolve/main/assets/misc/gen_teaser.png" style="width: 100%; height: auto;"/>
</p>
**Results of image editing tasks.** We show the results of both instruction-based image editing and in-context image editing. The examples cover local and global edits (e.g., time-of-day and style changes), background replacement, and expression control. We further demonstrate multi-turn editing, where edits are applied sequentially. We also show in-context editing guided by a reference image.
<p align="center">
<img src="https://huggingface.co/xgen-universe/Capybara/resolve/main/assets/misc/imageedit_teaser.png" style="width: 100%; height: auto;"/>
</p>
**Results of instruction-based video editing task.** We showcase instruction-based editing (TV2V) under our unified creation interface, covering local edits, global edits, dense prediction, and dynamic edits. Each example presents input frames and the edited outputs, highlighting temporally coherent transformations that preserve identity and overall structure.
<p align="center">
<img src="https://huggingface.co/xgen-universe/Capybara/resolve/main/assets/misc/videoedit_teaser5.png" style="width: 100%; height: auto;"/>
</p>
**Results of in-context visual creation.** We show in-context generation and in-context editing results, including subject-conditioned generation (S2V/S2I), conditional generation (C2V), image-to-video (I2V), and reference-driven editing (II2I/IV2V).
<p align="center">
<img src="https://huggingface.co/xgen-universe/Capybara/resolve/main/assets/misc/incontext_teaser2.png" style="width: 100%; height: auto;"/>
</p>
### Recommended Settings
For optimal quality and performance, we recommend the following settings:
| Task Type | Recommended Resolution | Recommended Steps | Note |
| --------- | --------------------- | ----------------- | -------------------------------------- |
| **Video** (T2V, TV2V) | `480p` | `50` | Balanced quality and generation speed |
| **Image** (T2I, TI2I) | `720p` | `50` | Higher quality for static images |
**Notes:**
- **Resolution**: You can experiment with higher resolutions (`1024` or `1080p`).
- **Inference Steps**: 50 steps provide a good balance between quality and speed. You can use 30-40 steps for faster generation.
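The table and notes above can be folded into a small helper for scripting. This is an illustrative sketch: the `RECOMMENDED` dict and `pick_settings` function are not part of the official inference script, only a convenient encoding of the recommendations on this card.

```python
# Recommended inference settings per task, taken from the table above.
RECOMMENDED = {
    "T2V":  {"resolution": "480p", "steps": 50},
    "TV2V": {"resolution": "480p", "steps": 50},
    "T2I":  {"resolution": "720p", "steps": 50},
    "TI2I": {"resolution": "720p", "steps": 50},
}

def pick_settings(task: str, fast: bool = False) -> dict:
    """Return a copy of the recommended settings for a task.

    With fast=True, drop to 35 steps (the card suggests 30-40 steps
    as a quality/speed trade-off).
    """
    cfg = dict(RECOMMENDED[task])  # copy so callers can't mutate the table
    if fast:
        cfg["steps"] = 35
    return cfg

print(pick_settings("T2V"))             # → {'resolution': '480p', 'steps': 50}
print(pick_settings("T2I", fast=True))  # → {'resolution': '720p', 'steps': 35}
```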
## πŸ“„ License
This project is released under the MIT License.
## πŸ™ Acknowledgments
This project is built upon:
- [HunyuanVideo-1.5](https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5) - Base Video Generation Model
- [Diffusers](https://github.com/huggingface/diffusers) - Diffusion pipeline infrastructure
- [Accelerate](https://github.com/huggingface/accelerate) - Distributed training/inference
- [SageAttention](https://github.com/thu-ml/SageAttention) - Efficient attention mechanism
## πŸ“ Citation
If you find Capybara useful for your research, please consider citing:
```bibtex
@misc{capybara2026rao,
  title={Capybara: A Unified Visual Creation Model},
  author={Rao, Zhefan and Che, Haoxuan and Hu, Ziwen and Zou, Bin and Liu, Yaofang and He, Xuanhua and Choi, Chong-Hou and He, Yuyang and Chen, Haoyu and Su, Jingran and Li, Yanheng and Chu, Meng and Lei, Chenyang and Zhao, Guanhua and Li, Zhaoqing and Zhang, Xichen and Li, Anping and Liu, Lin and Tu, Dandan and Liu, Rui},
  year={2026}
}
```
## πŸ“§ Contact
For questions and feedback, please open an issue on GitHub.
You can also contact us by email: zraoac@ust.hk and hche@ust.hk.
---
⭐ If you find this project helpful, please consider giving it a star!