# Mega Multimodal Model V5 (mega-multimodal-v5)

This repository contains an advanced multimodal model that combines numerous specialized models into a single architecture behind a unified `generate()` interface. This version adds selective loading, logging, enhanced input validation, and performance timing.
## Included Capabilities & Models (Based on Config)
- Text Generation: Qwen/Qwen1.5-1.8B (Loaded)
- Code Generation: bigcode/starcoder2-3b (Loaded)
- Speech Recognition (Whisper Tiny): openai/whisper-tiny (Loaded)
- Speech Recognition (Whisper Large V3): openai/whisper-large-v3 (Loaded)
- Audio Classification: microsoft/wavlm-base-plus-sd (Loaded)
- Music Generation: facebook/musicgen-medium (Loaded)
- Audio Generation (Bark TTS): suno/bark (Loaded)
- Text-to-Image (SD 1.5): runwayml/stable-diffusion-v1-5 (Loaded)
- Text-to-Image (SDXL): stabilityai/stable-diffusion-xl-base-1.0 (Refiner: stabilityai/stable-diffusion-xl-refiner-1.0) (Base Loaded, Refiner Loaded)
- Text-to-Image (Kandinsky 2.2): Prior: kandinsky-community/kandinsky-2-2-prior, Decoder: kandinsky-community/kandinsky-2-2-decoder (Loaded)
- Image Inpainting (SD): runwayml/stable-diffusion-inpainting (Not Loaded)
- ControlNet Integration (SD 1.5): Canny:lllyasviel/control_v11p_sd15_canny (Not Loaded), Pose:lllyasviel/control_v11p_sd15_openpose (Not Loaded), SoftEdge:lllyasviel/control_v11p_sd15_softedge (Not Loaded), Depth:lllyasviel/control_v11f1p_sd15_depth (Not Loaded)
- Image Editing (InstructPix2Pix): timbrooks/instruct-pix2pix (Loaded)
- Text-to-3D (Shap-E): stabilityai/shap-e (Not Loaded)
- Translation: facebook/nllb-200-distilled-600M (Loaded)
- Summarization: facebook/bart-large-cnn (Loaded)
- Text-to-Video: cerspense/zeroscope_v2_576w (Loaded)
- Image-to-Video: stabilityai/stable-video-diffusion-img2vid-xt (Loaded)
- Video Classification: MCG-NJU/videomae-base-finetuned-kinetics (Loaded)
- Sentence Embeddings: sentence-transformers/all-MiniLM-L6-v2 (Loaded)
- CLIP Embeddings: openai/clip-vit-base-patch32 (Loaded)
- Sentiment/Toxicity Analysis: unitary/toxic-bert (Loaded)
- Question Answering (Text): distilbert-base-uncased-distilled-squad (Loaded)
- Table Question Answering (TAPAS): google/tapas-base-finetuned-wtq (Loaded)
- Document Visual Question Answering: naver-clova-ix/donut-base-finetuned-docvqa (Not Loaded)
- Named Entity Recognition (NER): dbmdz/bert-large-cased-finetuned-conll03-english (Loaded)
- Optical Character Recognition (OCR): microsoft/trocr-small-handwritten (Loaded)
- Object Detection (DETR): facebook/detr-resnet-50 (Loaded)
- Zero-Shot Object Detection (OwlViT): google/owlvit-base-patch32 (Loaded)
- Semantic Segmentation (SegFormer): nvidia/segformer-b0-finetuned-ade-512-512 (Loaded)
- Image Captioning: nlpconnect/vit-gpt2-image-captioning (Loaded)
- Depth Estimation: Intel/dpt-large (Loaded)
- Visual Question Answering (VILT): dandelin/vilt-b32-finetuned-vqa (Loaded)
- Zero-Shot Image Classification: openai/clip-vit-large-patch14 (Loaded)
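Each capability above maps to one backing model, and `generate()` selects it by its `task` argument. The sketch below illustrates that routing; the dict, the `route_task` helper, and any task name other than `"text"` and `"image"` (which appear in the usage examples later in this card) are assumptions, not the model's actual internals.

```python
# Hypothetical task-to-model routing table; only "text" and "image" task
# names are confirmed by this card's usage examples, the rest are assumed.
TASK_COMPONENTS = {
    "text": "Qwen/Qwen1.5-1.8B",
    "image": "runwayml/stable-diffusion-v1-5",
    "asr": "openai/whisper-large-v3",
}

def route_task(task: str) -> str:
    """Return the backing model ID for a task, rejecting unknown tasks."""
    if task not in TASK_COMPONENTS:
        raise ValueError(f"Unsupported task: {task!r}")
    return TASK_COMPONENTS[task]
```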
## Optimizations & Mechanisms

- Selective Loading: Load only specified components via `from_pretrained(..., components_to_load=['text', 'sd'])`.
- Logging: Uses Python's `logging` module instead of `print`.
- Performance Timing: Optional timing via `generate(..., time_execution=True)`.
- Input Validation: Enhanced type/value checks in `generate`.
- Custom BitLinear: Configured: False.
- BitsAndBytes Quantization: Configured: False, Mode: 4bit.
- Global Pruning: Configured Amount: 0.0.
- Gradient Checkpointing: Configured: False.
- Flash Attention 2: Configured: False.
- Diffusers Optimizations: Attention Slicing (True), CPU Offload (True).
- Low CPU Mem Usage: Configured: True.
- Memory Management: Includes `gc.collect()` and `torch.cuda.empty_cache()` calls.
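The memory-management item above can be sketched as a small helper. The `free_component` function and the dict-based component store are hypothetical, but the `gc.collect()` / `torch.cuda.empty_cache()` pair mirrors what the card describes.

```python
import gc

def free_component(components: dict, name: str) -> list:
    """Drop one loaded sub-model from a component dict and reclaim memory.

    Storing components in a plain dict is an assumption about the model's
    internals; the cleanup calls follow the card's memory-management note.
    """
    dropped = components.pop(name, None)
    del dropped
    gc.collect()  # release Python-side references first
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached CUDA blocks to the driver
    except ImportError:
        pass  # torch not installed; nothing GPU-side to free
    return sorted(components)
```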
## Saving & Loading

Uses the standard `save_pretrained` / `from_pretrained` API. Models are saved with safetensors where possible, and the entire saved directory content is uploaded to the Hub.

```python
from mega_multimodal_model import MegaMultimodalModel

# Save the model
# model.save_pretrained("./my_mega_multimodal_model_v5")

# Load all components
# loaded_model = MegaMultimodalModel.from_pretrained("./my_mega_multimodal_model_v5")

# Load only specific components (example)
components = ['text', 'sd', 'caption_model']
loaded_model_subset = MegaMultimodalModel.from_pretrained(
    "./my_mega_multimodal_model_v5", components_to_load=components
)

# Or load from the Hub
# loaded_model = MegaMultimodalModel.from_pretrained("your_hf_username/your_repo_name")
# loaded_model_subset = MegaMultimodalModel.from_pretrained(
#     "your_hf_username/your_repo_name", components_to_load=components
# )

# Example usage
# text_output = loaded_model.generate("Hello!", task="text", time_execution=True)
# image_output = loaded_model.generate("A cat astronaut", task="image")
```
## Installation Dependencies

Core:

```shell
pip install torch torchvision torchaudio transformers diffusers huggingface_hub[hf_xet] hf_xet safetensors timm==0.9.16 Pillow accelerate bitsandbytes einops pandas decord ftfy pyav soundfile protobuf --upgrade
```

Optional:

```shell
pip install controlnet-aux
# pip install flash-attn --no-build-isolation
```
## Model Configuration (config.json)

Stores component model IDs and optimization settings.
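A `config.json` following that description might look like the fragment below; the exact key names are assumptions, shown only to illustrate the model-ID-plus-settings structure, and the values echo the settings listed under Optimizations & Mechanisms.

```json
{
  "text_model_id": "Qwen/Qwen1.5-1.8B",
  "sd_model_id": "runwayml/stable-diffusion-v1-5",
  "use_bitsandbytes": false,
  "quantization_mode": "4bit",
  "pruning_amount": 0.0,
  "use_flash_attention_2": false,
  "low_cpu_mem_usage": true
}
```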