
Mega Multimodal Model V5 (mega-multimodal-v5)

This repository contains an advanced multimodal model that combines numerous specialized models into a single architecture behind a unified generate() interface. This version adds selective loading, structured logging, enhanced input validation, and performance timing.

Included Capabilities & Models (Based on Config)

  • Text Generation: Qwen/Qwen1.5-1.8B (Loaded)
  • Code Generation: bigcode/starcoder2-3b (Loaded)
  • Speech Recognition (Whisper Tiny): openai/whisper-tiny (Loaded)
  • Speech Recognition (Whisper Large V3): openai/whisper-large-v3 (Loaded)
  • Audio Classification: microsoft/wavlm-base-plus-sd (Loaded)
  • Music Generation: facebook/musicgen-medium (Loaded)
  • Audio Generation (Bark TTS): suno/bark (Loaded)
  • Text-to-Image (SD 1.5): runwayml/stable-diffusion-v1-5 (Loaded)
  • Text-to-Image (SDXL): stabilityai/stable-diffusion-xl-base-1.0 (Refiner: stabilityai/stable-diffusion-xl-refiner-1.0) (Base Loaded, Refiner Loaded)
  • Text-to-Image (Kandinsky 2.2): Prior: kandinsky-community/kandinsky-2-2-prior, Decoder: kandinsky-community/kandinsky-2-2-decoder (Loaded)
  • Image Inpainting (SD): runwayml/stable-diffusion-inpainting (Not Loaded)
  • ControlNet Integration (SD 1.5): Canny:lllyasviel/control_v11p_sd15_canny (Not Loaded), Pose:lllyasviel/control_v11p_sd15_openpose (Not Loaded), SoftEdge:lllyasviel/control_v11p_sd15_softedge (Not Loaded), Depth:lllyasviel/control_v11f1p_sd15_depth (Not Loaded)
  • Image Editing (InstructPix2Pix): timbrooks/instruct-pix2pix (Loaded)
  • Text-to-3D (Shap-E): stabilityai/shap-e (Not Loaded)
  • Translation: facebook/nllb-200-distilled-600M (Loaded)
  • Summarization: facebook/bart-large-cnn (Loaded)
  • Text-to-Video: cerspense/zeroscope_v2_576w (Loaded)
  • Image-to-Video: stabilityai/stable-video-diffusion-img2vid-xt (Loaded)
  • Video Classification: MCG-NJU/videomae-base-finetuned-kinetics (Loaded)
  • Sentence Embeddings: sentence-transformers/all-MiniLM-L6-v2 (Loaded)
  • CLIP Embeddings: openai/clip-vit-base-patch32 (Loaded)
  • Sentiment/Toxicity Analysis: unitary/toxic-bert (Loaded)
  • Question Answering (Text): distilbert-base-uncased-distilled-squad (Loaded)
  • Table Question Answering (TAPAS): google/tapas-base-finetuned-wtq (Loaded)
  • Document Visual Question Answering: naver-clova-ix/donut-base-finetuned-docvqa (Not Loaded)
  • Named Entity Recognition (NER): dbmdz/bert-large-cased-finetuned-conll03-english (Loaded)
  • Optical Character Recognition (OCR): microsoft/trocr-small-handwritten (Loaded)
  • Object Detection (DETR): facebook/detr-resnet-50 (Loaded)
  • Zero-Shot Object Detection (OwlViT): google/owlvit-base-patch32 (Loaded)
  • Semantic Segmentation (SegFormer): nvidia/segformer-b0-finetuned-ade-512-512 (Loaded)
  • Image Captioning: nlpconnect/vit-gpt2-image-captioning (Loaded)
  • Depth Estimation: Intel/dpt-large (Loaded)
  • Visual Question Answering (VILT): dandelin/vilt-b32-finetuned-vqa (Loaded)
  • Zero-Shot Image Classification: openai/clip-vit-large-patch14 (Loaded)
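To make the unified interface concrete, here is a minimal, hypothetical sketch of how a single generate() entry point might route a task string to the matching component. The class and handler names are illustrative only and are not the actual MegaMultimodalModel internals.

```python
# Hypothetical task-routing sketch; handlers are stubs standing in for
# the real component models (text LM, Stable Diffusion, NLLB, etc.).

class TaskRouter:
    def __init__(self):
        # Map task names to handler callables.
        self.handlers = {
            "text": lambda prompt: f"[text model output for: {prompt}]",
            "image": lambda prompt: f"[SD image for: {prompt}]",
            "translate": lambda prompt: f"[NLLB translation of: {prompt}]",
        }

    def generate(self, prompt, task="text"):
        # Input validation: fail early on unknown tasks.
        if task not in self.handlers:
            raise ValueError(
                f"Unknown task {task!r}; choose from {sorted(self.handlers)}"
            )
        return self.handlers[task](prompt)

router = TaskRouter()
print(router.generate("Hello!", task="text"))
```

The real model would dispatch to loaded pipelines rather than stubs, but the validate-then-dispatch shape is the same pattern the capability list above implies.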

Optimizations & Mechanisms

  • Selective Loading: Load only specified components via from_pretrained(..., components_to_load=['text', 'sd']).
  • Logging: Uses Python's logging module instead of print.
  • Performance Timing: Optional timing via generate(..., time_execution=True).
  • Input Validation: Enhanced type/value checks in generate.
  • Custom BitLinear: Configured: False.
  • BitsAndBytes Quantization: Configured: False, Mode: 4bit.
  • Global Pruning: Configured Amount: 0.0.
  • Gradient Checkpointing: Configured: False.
  • Flash Attention 2: Configured: False.
  • Diffusers Optimizations: Attention Slicing (True), CPU Offload (True).
  • Low CPU Mem Usage: Configured: True.
  • Memory Management: Includes gc.collect() and torch.cuda.empty_cache() calls.
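The timing and memory-management mechanisms above can be sketched as a small wrapper. This is a hedged, self-contained illustration using only the standard library; the real model would also call torch.cuda.empty_cache() in the cleanup step, and timed_generate is a name invented here, not part of the actual API.

```python
import gc
import time

def timed_generate(fn, *args, time_execution=False, **kwargs):
    """Run a generation callable, optionally report wall-clock time,
    and collect garbage afterwards (stand-in for the model's cleanup,
    which would also include torch.cuda.empty_cache())."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    gc.collect()  # free Python-level garbage after the heavy call
    if time_execution:
        print(f"{fn.__name__} took {elapsed:.3f}s")
    return result

out = timed_generate(str.upper, "hello", time_execution=True)
```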

Saving & Loading

Uses standard save_pretrained / from_pretrained. Saves models using safetensors where possible. Uploads entire saved directory content to the Hub.

from mega_multimodal_model import MegaMultimodalModel

# Save the model
# model.save_pretrained("./my_mega_multimodal_model_v5")

# Load all components
# loaded_model = MegaMultimodalModel.from_pretrained("./my_mega_multimodal_model_v5")

# Load only specific components (example)
components = ['text', 'sd', 'caption_model']
loaded_model_subset = MegaMultimodalModel.from_pretrained("./my_mega_multimodal_model_v5", components_to_load=components)

# Or load from Hub
# loaded_model = MegaMultimodalModel.from_pretrained("your_hf_username/your_repo_name")
# loaded_model_subset = MegaMultimodalModel.from_pretrained("your_hf_username/your_repo_name", components_to_load=components)

# Example usage
# text_output = loaded_model.generate("Hello!", task="text", time_execution=True)
# image_output = loaded_model.generate("A cat astronaut", task="image")

Installation Dependencies

Core:

pip install torch torchvision torchaudio transformers diffusers "huggingface_hub[hf_xet]" safetensors timm==0.9.16 Pillow accelerate bitsandbytes einops pandas decord ftfy pyav soundfile protobuf --upgrade

Optional:

pip install controlnet-aux
# pip install flash-attn --no-build-isolation

Model Configuration (config.json)

Stores component model IDs and optimization settings.
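The exact schema is not published in this card; the following is a plausible illustrative fragment only, with field names invented to match the settings listed under Optimizations & Mechanisms.

```json
{
  "model_type": "mega_multimodal",
  "text_model_id": "Qwen/Qwen1.5-1.8B",
  "sd_model_id": "runwayml/stable-diffusion-v1-5",
  "caption_model_id": "nlpconnect/vit-gpt2-image-captioning",
  "use_bitsandbytes": false,
  "bnb_mode": "4bit",
  "pruning_amount": 0.0,
  "use_flash_attention_2": false,
  "low_cpu_mem_usage": true
}
```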
