Mega Multimodal Model V6 (mega-multimodal-v6)

A multimodal model with a unified interface and selective component loading. Components that fail to load are skipped automatically.

Included Capabilities & Models (Based on Config)

ControlNet (Canny): lllyasviel/control_v11p_sd15_canny (Not Loaded)
ControlNet (Depth): lllyasviel/control_v11f1p_sd15_depth (Not Loaded)
ControlNet (Pose): lllyasviel/control_v11p_sd15_openpose (Not Loaded)
ControlNet (Softedge): N/A (controlnet_softedge) (Not Loaded)
Audio_cls Model: microsoft/wavlm-base-plus-sd (Not Loaded)
bark: suno/bark (Not Loaded)
Caption Model: nlpconnect/vit-gpt2-image-captioning (Not Loaded)
clip: openai/clip-vit-base-patch32 (Not Loaded)
code: bigcode/starcoder2-3b (Not Loaded)
Depth Model: Intel/dpt-large (Not Loaded)
detr: facebook/detr-resnet-50 (Not Loaded)
Docvqa Model: naver-clova-ix/donut-base-finetuned-docvqa (Not Loaded)
music: facebook/musicgen-medium (Not Loaded)
ner: dbmdz/bert-large-cased-finetuned-conll03-english (Not Loaded)
qa: distilbert-base-uncased-distilled-squad (Not Loaded)
sentiment: unitary/toxic-bert (Not Loaded)
speech: openai/whisper-tiny (Not Loaded)
text: Qwen/Qwen1.5-1.8B (Not Loaded)
tqa: google/tapas-base-finetuned-wtq (Not Loaded)
trocr: microsoft/trocr-small-handwritten (Not Loaded)
Video_cls Model: MCG-NJU/videomae-base-finetuned-kinetics (Not Loaded)
Vqa Model: dandelin/vilt-b32-finetuned-vqa (Not Loaded)
zshot_cls: openai/clip-vit-large-patch14 (Not Loaded)
zshot_det: google/owlvit-base-patch32 (Not Loaded)
i2v: stabilityai/stable-video-diffusion-img2vid-xt (Not Loaded)
instruct_pix2pix: timbrooks/instruct-pix2pix (Not Loaded)
kandinsky_decoder: kandinsky-community/kandinsky-2-2-decoder (Not Loaded)
kandinsky_prior: kandinsky-community/kandinsky-2-2-prior (Not Loaded)
refine: stabilityai/stable-diffusion-xl-refiner-1.0 (Not Loaded)
Text-to-Image (SD 1.5): runwayml/stable-diffusion-v1-5 (Not Loaded)
sd_inpainting: runwayml/stable-diffusion-inpainting (Not Loaded)
Text-to-Image (SDXL): stabilityai/stable-diffusion-xl-base-1.0 (Not Loaded)
shape_pipe: stabilityai/shap-e (Not Loaded)
t2v: cerspense/zeroscope_v2_576w (Not Loaded)
Audio_cls Processor: microsoft/wavlm-base-plus-sd (Not Loaded)
Bark Processor: suno/bark (Not Loaded)
Caption Processor: nlpconnect/vit-gpt2-image-captioning (Not Loaded)
Clip Processor: openai/clip-vit-base-patch32 (Not Loaded)
Depth Processor: Intel/dpt-large (Not Loaded)
Detr Processor: facebook/detr-resnet-50 (Not Loaded)
Docvqa Processor: naver-clova-ix/donut-base-finetuned-docvqa (Not Loaded)
Speech Processor: openai/whisper-tiny (Not Loaded)
Trocr Processor: microsoft/trocr-small-handwritten (Not Loaded)
Video_cls Processor: MCG-NJU/videomae-base-finetuned-kinetics (Not Loaded)
Vqa Processor: dandelin/vilt-b32-finetuned-vqa (Not Loaded)
Zshot_cls Processor: openai/clip-vit-large-patch14 (Not Loaded)
Zshot_det Processor: google/owlvit-base-patch32 (Not Loaded)
Code Tokenizer: bigcode/starcoder2-3b (Not Loaded)
Music Tokenizer: facebook/musicgen-medium (Not Loaded)
Ner Tokenizer: dbmdz/bert-large-cased-finetuned-conll03-english (Not Loaded)
Qa Tokenizer: distilbert-base-uncased-distilled-squad (Not Loaded)
Sentiment Tokenizer: unitary/toxic-bert (Not Loaded)
Text Tokenizer: Qwen/Qwen1.5-1.8B (Not Loaded)
Tqa Tokenizer: google/tapas-base-finetuned-wtq (Not Loaded)

Optimizations & Mechanisms

Selective Loading: from_pretrained(..., components_to_load=[...]).
Auto-Skip Failed Loads: Logs errors and continues if a component fails.
Logging & Performance Timing: Optional generate(..., time_execution=True).
Input Validation: Enhanced type/value checks.
Custom BitLinear: Configured: False.
BitsAndBytes Quantization: Configured: False, Mode: 4bit.
Global Pruning: Configured Amount: 0.0.
Gradient Checkpointing: Configured: False.
Flash Attention 2: Configured: False.
Diffusers Optimizations: Slicing (True), Offload (True).
Low CPU Mem Usage: Configured: True.
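The selective-loading and auto-skip behavior listed above can be illustrated with a minimal sketch. This is not the model's actual implementation: `load_components`, the `loaders` mapping, and the stand-in loader functions are hypothetical names introduced only for illustration. Only the `components_to_load` argument name comes from this card.

```python
import logging
from typing import Any, Callable, Dict, Iterable, Optional

logger = logging.getLogger("selective_loading_sketch")


def load_components(
    loaders: Dict[str, Callable[[], Any]],
    components_to_load: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    """Load only the requested components; log and skip any that fail."""
    # None means "load everything that is configured".
    wanted = set(loaders) if components_to_load is None else set(components_to_load)
    loaded: Dict[str, Any] = {}
    for name, loader in loaders.items():
        if name not in wanted:
            continue  # selective loading: this component was not requested
        try:
            loaded[name] = loader()
        except Exception as exc:
            # Auto-skip: record the failure and continue with the next component.
            logger.error("Skipping component %r: %s", name, exc)
    return loaded


def _broken_loader() -> Any:
    # Hypothetical loader that simulates a failed download.
    raise OSError("weights not found")


# Hypothetical loaders standing in for real from_pretrained calls.
example_loaders = {
    "text": lambda: "text-model",
    "sd": _broken_loader,
    "canny": lambda: "canny-controlnet",
}

result = load_components(example_loaders, components_to_load=["text", "sd"])
print(sorted(result))
```

In this sketch, `sd` is requested but fails, so it is logged and left out of the result, while `canny` is never attempted because it was not requested.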

Saving & Loading

The model uses the standard save_pretrained / from_pretrained methods. Components are stored in subdirectories. Any component that fails to load during from_pretrained is skipped.

from mega_multimodal_model import MegaMultimodalModel  # Assumes the class definition has been saved as an importable module

# Save
model.save_pretrained("./my_mega_multimodal_model_v6")

# Load all
loaded_model = MegaMultimodalModel.from_pretrained("./my_mega_multimodal_model_v6")

# Load selectively (example)
components = ['text', 'text_tok', 'sd', 'canny']  # Use class attribute names or controlnet types
loaded_model_subset = MegaMultimodalModel.from_pretrained("./my_mega_multimodal_model_v6", components_to_load=components)

# Load from Hub
loaded_model = MegaMultimodalModel.from_pretrained("your_hf_username/your_repo_name")
loaded_model_subset = MegaMultimodalModel.from_pretrained("your_hf_username/your_repo_name", components_to_load=components)

Usage

text_output = loaded_model.generate("Hello!", task="text", time_execution=True)
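How `time_execution=True` measures a call is not documented here. One plausible approach is to wrap the call with a wall-clock timer, as in the sketch below. `run_timed` and `fake_generate` are hypothetical helpers for illustration and are not part of the model's API.

```python
import time
from typing import Any, Callable, Tuple


def run_timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_generate(prompt: str, task: str = "text") -> str:
    # Hypothetical stand-in for loaded_model.generate(...).
    return f"[{task}] {prompt}"


output, seconds = run_timed(fake_generate, "Hello!", task="text")
print(output, f"({seconds:.6f} s)")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock meant for measuring short durations.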

Installation Dependencies

Core: torch torchvision torchaudio transformers diffusers huggingface_hub[hf_xet] hf_xet safetensors timm Pillow accelerate bitsandbytes einops pandas decord ftfy pyav --upgrade

Optional: controlnet-aux, flash-attn --no-build-isolation
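Before loading the model, it can help to check which of the core dependencies are importable in the current environment. The sketch below is an assumption-laden convenience, not part of this repository: `check_importable` is a hypothetical helper, and the list uses import names rather than package names (for example `PIL` for Pillow and `av` for PyAV).

```python
import importlib.util
from typing import Dict, Iterable


def check_importable(modules: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


# Import names for the core dependencies listed above (assumed mapping).
CORE_MODULES = [
    "torch", "torchvision", "torchaudio", "transformers", "diffusers",
    "huggingface_hub", "safetensors", "timm", "PIL", "accelerate",
    "bitsandbytes", "einops", "pandas", "decord", "ftfy", "av",
]

status = check_importable(CORE_MODULES)
missing = [name for name, ok in status.items() if not ok]
print("Missing:", missing or "none")
```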

Model Configuration (config.json)

Stores component IDs and optimization settings.
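The exact schema of config.json is not shown in this card. As a rough illustration only, the sketch below builds a configuration with the same kind of content: a few component IDs and the optimization settings listed earlier. Every key name here is an assumption. Only the model IDs and setting values are taken from this card.

```python
import json

# Hypothetical layout; the real config.json may use different keys.
config_sketch = {
    "model_type": "mega-multimodal-v6",
    "components": {
        "text": "Qwen/Qwen1.5-1.8B",
        "speech": "openai/whisper-tiny",
        "clip": "openai/clip-vit-base-patch32",
        "controlnet_canny": "lllyasviel/control_v11p_sd15_canny",
    },
    "optimizations": {
        "use_bitlinear": False,
        "use_bnb_quantization": False,
        "bnb_mode": "4bit",
        "global_pruning_amount": 0.0,
        "gradient_checkpointing": False,
        "flash_attention_2": False,
        "diffusers_slicing": True,
        "diffusers_offload": True,
        "low_cpu_mem_usage": True,
    },
}

# Serialize to JSON text and parse it back to confirm it round-trips.
text = json.dumps(config_sketch, indent=2)
restored = json.loads(text)
print(text)
```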
