---
license: mit
tags:
- vision-language
- mixture-of-experts
- text-generation
- vision-transformer
- pytorch
model_index:
- name: SparseFusion
  results:
  - task:
      type: text-generation
    dataset:
      name: Custom Caption Dataset
      type: custom
    metrics:
    - name: Validation Loss
      type: loss
      value: 0.8
---

# SparseFusion

**SparseFusion** is a multimodal Mixture-of-Experts (MoE) model that combines a Vision Transformer (ViT) encoder with a transformer decoder for image-conditioned text generation. It is built entirely in PyTorch and extends [SeeMOE](https://github.com/AviSoori1x/seemore).

---

## Model Details

- **Name**: SparseFusion
- **Author**: Derrick Kirimi ([GitHub](https://github.com/DerrickKirimi) · [LinkedIn](https://www.linkedin.com/in/derrick-kirimi-22a470175/) · [Hugging Face](https://huggingface.co/Aptheos))
- **Model Type**: Vision-Language Model
- **Architecture**:
  - Vision Encoder: ViT (96×96 images, 16×16 patches, 512-dim patch embeddings)
  - Decoder: Transformer with MoE layers (8 layers, 128-dim, 8 heads)
  - MoE Setup: 8 experts, top-2 routing, expert capacity control
  - Token Fusion: Concatenation of image tokens and character-level encoded text
- **License**: MIT
- **Repository**: [GitHub - DerrickKirimi/SparseFusion](https://github.com/DerrickKirimi/SparseFusion)
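
The architecture figures above determine the sequence length the decoder sees. The following is a minimal sketch of that arithmetic, assuming RGB input, non-overlapping patches, and no extra class token; the card states none of these, and the helper names are illustrative rather than taken from the SparseFusion code.

```python
# Shape arithmetic implied by the architecture list.
# Assumptions (not stated in the card): RGB input, non-overlapping patches,
# and no extra [CLS] token prepended to the image tokens.
IMG_SIZE, PATCH_SIZE, CHANNELS = 96, 16, 3


def num_patches(img_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    per_side = img_size // patch_size
    return per_side * per_side


def fused_length(n_image_tokens: int, n_text_tokens: int) -> int:
    """Sequence length after concatenating image tokens and text tokens."""
    return n_image_tokens + n_text_tokens


patches = num_patches(IMG_SIZE, PATCH_SIZE)            # 6 * 6 = 36
values_per_patch = PATCH_SIZE * PATCH_SIZE * CHANNELS  # 16 * 16 * 3 = 768
prompt_len = len("A photo of")                         # 10 characters
print(patches, values_per_patch, fused_length(patches, prompt_len))  # 36 768 46
```

Under these assumptions, each image contributes 36 tokens regardless of the prompt, so short character-level prompts are dominated by image tokens.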

---

## Intended Use

- **Primary Use Case**: Image-conditioned text generation for educational and research experimentation
- **Intended Users**: ML researchers, students, developers
- **Out-of-Scope Uses**: Not suitable for deployment in production or for generating harmful content

---

## Training & Evaluation

### Dataset

- **Text**: Tiny Shakespeare (character-level)
- **Images**: 300 synthetic image-caption pairs
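
Character-level encoding maps every distinct character to an integer ID. The sketch below shows one way such a vocabulary could be built; the sample string, the `<pad>` entry at index 0, and the `pad_to` helper are assumptions for illustration, and the released `stoi.pkl` and `itos.pkl` files may be laid out differently.

```python
# Character-level vocabulary sketch. The sample text, the "<pad>" entry and
# the pad_to helper are illustrative assumptions, not the released vocabulary.
sample_text = "First Citizen:\nBefore we proceed any further, hear me speak."

PAD = "<pad>"  # hypothetical padding token, placed at index 0
chars = sorted(set(sample_text))
itos = {i: ch for i, ch in enumerate([PAD] + chars)}
stoi = {ch: i for i, ch in itos.items()}


def encode(s):
    """Map each character of s to its integer ID."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map integer IDs back to text."""
    return "".join(itos[i] for i in ids)


def pad_to(ids, length):
    """Right-pad a list of IDs with the padding ID up to the given length."""
    return ids + [stoi[PAD]] * (length - len(ids))


ids = pad_to(encode("hear me"), 10)
print(ids)
print(decode(ids))
```

Padding captions to a common length in this way is one plausible reason `<pad>` markers show up in decoded output.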

### Training

- Trained for 2 epochs on **Google Colab (1 GPU, 12 GB VRAM)**
- Logging via **Weights & Biases (wandb)**

### Hyperparameters

```yaml
epochs: 2
batch_size: 16
learning_rate: 0.001
n_embd: 128
n_head: 8
n_layer: 8
num_experts: 8
top_k: 2
expert_capacity: 32
img_size: 96
patch_size: 16
```
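
To make `num_experts`, `top_k`, and `expert_capacity` concrete, here is a framework-free sketch of top-k routing with a per-expert capacity limit. It assumes tokens are served in order and that a token-expert pair is simply dropped once the expert is full; the `route` function is hypothetical and SparseFusion's own router may handle ordering, overflow, and gate normalization differently.

```python
import math
import random

NUM_EXPERTS, TOP_K, EXPERT_CAPACITY = 8, 2, 32


def route(router_logits, top_k=TOP_K, capacity=EXPERT_CAPACITY):
    """Assign each token to its top_k experts, respecting per-expert capacity.

    router_logits: list of per-token lists holding one logit per expert.
    Returns (assignments, load): assignments[t] is a list of
    (expert_index, gate_weight) pairs, load[e] counts tokens kept by expert e.
    """
    num_experts = len(router_logits[0])
    load = [0] * num_experts
    assignments = []
    for logits in router_logits:
        # Indices of the top_k experts by logit, highest first.
        top = sorted(range(num_experts), key=lambda e: logits[e], reverse=True)[:top_k]
        # Softmax over the selected logits only, so the gates sum to 1.
        peak = max(logits[e] for e in top)
        exps = [math.exp(logits[e] - peak) for e in top]
        total = sum(exps)
        kept = []
        for e, w in zip(top, exps):
            # Keep the pair only while the expert has room; otherwise drop it.
            if load[e] < capacity:
                load[e] += 1
                kept.append((e, w / total))
        assignments.append(kept)
    return assignments, load


random.seed(0)
logits = [[random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)] for _ in range(100)]
assignments, load = route(logits)
print(load, sum(load))
```

With 100 tokens and top-2 routing there are 200 token-expert pairs to place, while 8 experts with capacity 32 can hold at most 256, so drops only occur when routing is noticeably skewed.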

### Evaluation

- **Validation Loss**: 0.8 after 2 epochs
- **Summary**:
  - Generates basic text, with the coherence limits described under Limitations
  - Expert utilization improves by 15% when routing control and load balancing are enabled
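
The card does not say how expert utilization is measured. As one possible definition, the sketch below scores how evenly tokens are spread across experts using normalized entropy; the `utilization` function and the token counts are made up for illustration and are not the measurements behind the 15% figure.

```python
import math


def utilization(token_counts):
    """Normalized entropy of the per-expert token distribution.

    Returns a value in [0, 1]: 1.0 means tokens are spread evenly over all
    experts, 0.0 means a single expert receives every token.
    """
    total = sum(token_counts)
    if total == 0 or len(token_counts) < 2:
        return 0.0
    fractions = [c / total for c in token_counts]
    entropy = -sum(f * math.log(f) for f in fractions if f > 0)
    return entropy / math.log(len(token_counts))


# Made-up per-expert token counts for 8 experts, for illustration only.
skewed = [120, 40, 10, 10, 5, 5, 5, 5]
balanced = [30, 28, 25, 25, 24, 24, 22, 22]

before, after = utilization(skewed), utilization(balanced)
print(f"before={before:.3f} after={after:.3f} relative_gain={(after - before) / before:.1%}")
```

Any comparison of this kind only makes sense when both runs use the same number of experts and the same token budget.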

---

## Usage

### Installation

```bash
pip install torch torchvision transformers huggingface_hub wandb
```

### Inference

```python
import pickle

import torch
import torchvision.transforms as transforms
from huggingface_hub import hf_hub_download
from PIL import Image

# VisionMoELanguageModel is not installed by pip: import it from (or paste in)
# the model source code linked under Repository before running this script.

REPO_ID = "Aptheos/SparseFusion"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load vocabulary mappings (only unpickle files from sources you trust)
with open(hf_hub_download(REPO_ID, "stoi.pkl"), "rb") as f:
    stoi = pickle.load(f)
with open(hf_hub_download(REPO_ID, "itos.pkl"), "rb") as f:
    itos = pickle.load(f)
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

# Define model architecture
model = VisionMoELanguageModel(
    n_embd=128, image_embed_dim=512, vocab_size=len(stoi), n_layer=8,
    img_size=96, patch_size=16, num_heads=8, num_blks=3,
    emb_dropout=0.1, blk_dropout=0.1, num_experts=8, top_k=2, expert_capacity=32
)
# map_location lets weights saved on a GPU load on a CPU-only machine
weights_path = hf_hub_download(REPO_ID, "vision_moe_model.pth")
model.load_state_dict(torch.load(weights_path, map_location=device))
model.to(device).eval()

# Preprocess image (convert to RGB so the 3-channel normalization applies)
transform = transforms.Compose([
    transforms.Resize((96, 96)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
image = transform(Image.open("example.jpg").convert("RGB")).unsqueeze(0).to(device)
prompt = torch.tensor([encode("A photo of")], dtype=torch.long, device=device)

# Generate text without tracking gradients
with torch.no_grad():
    generated = model.generate(image, prompt, max_new_tokens=50)
print(decode(generated[0].tolist()))
```
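
To run on CPU, keep the model and inputs on the `"cpu"` device and pass `map_location="cpu"` to `torch.load`:

```python
# map_location="cpu" remaps the weights in case they were saved from a GPU
model.load_state_dict(torch.load(hf_hub_download("Aptheos/SparseFusion", "vision_moe_model.pth"), map_location="cpu"))
model.eval().to("cpu")
image = image.to("cpu")
prompt = prompt.to("cpu")
```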

---

## Limitations & Biases

### Limitations

- The model generates incoherent text (e.g., `"A photo ofiecp ntti<pad><pad>..."`) due to training on a small, synthetic dataset of 300 identical images with simplistic captions.
- Vision encoder (ViT) is **not pre-trained**, reducing visual feature quality.
- Character-level tokenization limits text fluency and introduces `<pad>` tokens.
- Limited training time (2 epochs) restricts deep multimodal learning.
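
Until padding is handled during generation, trailing `<pad>` markers can be trimmed after decoding. The helper below is a small sketch that assumes the padding token decodes to the literal string `<pad>`, as in the sample output above; `strip_padding` is a hypothetical name and not part of the SparseFusion code.

```python
PAD = "<pad>"  # literal marker seen in the sample output


def strip_padding(text: str, pad: str = PAD) -> str:
    """Cut decoded text at the first padding marker and trim trailing spaces."""
    cut = text.find(pad)
    return (text if cut == -1 else text[:cut]).rstrip()


print(strip_padding("A photo ofiecp ntti<pad><pad>"))  # A photo ofiecp ntti
```

This only cleans up the display; it does not make the generated characters before the padding any more coherent.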

### Biases

- Synthetic captions create bias toward repetitive language structures.
- Lack of diverse image inputs may bias the model's visual representation.

---

## Future Work

- Train on larger datasets (e.g., COCO, Flickr30k) for better generalization
- Use pre-trained ViT backbone (e.g., `timm/vit_small_patch16_224`)
- Implement subword tokenization (e.g., SentencePiece, BPE)
- Add modality type embeddings and rotary positional embeddings (RoPE)
- Visualize expert routing and attention patterns for interpretability
- Increase training epochs and perform hyperparameter tuning

---

## License

Licensed under the **MIT License** for open research and educational use.

---

## Citation

```bibtex
@misc{sparsefusion2025,
  author = {Derrick Kirimi},
  title = {SparseFusion: A Multimodal Mixture-of-Experts Model},
  year = {2025},
  url = {https://huggingface.co/Aptheos/SparseFusion}
}
```