MiniMind / README.md

fariasultana

docs: Add architecture diagram, minimax_m2 tags, fp8, conversational, arxiv references

360c8d9 verified 29 days ago

preview code

raw

history blame contribute delete

12.8 kB

metadata

license: apache-2.0
language:
  - en
library_name: transformers
tags:
  - text-generation
  - transformers
  - safetensors
  - minimax_m2
  - conversational
  - custom_code
  - fp8
  - max2
  - moe
  - mixture-of-experts
  - gqa
  - grouped-query-attention
  - edge-deployment
  - mobile
  - android
  - efficient
  - llama-cpp
  - causal-lm
pipeline_tag: text-generation
datasets:
  - HuggingFaceFW/fineweb
  - wikipedia
  - bookcorpus
model-index:
  - name: MiniMind-Max2
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag
          type: hellaswag
        metrics:
          - type: accuracy
            value: 0.412
            name: Accuracy
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: ARC-Challenge
          type: arc_challenge
        metrics:
          - type: accuracy
            value: 0.298
            name: Accuracy
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU
          type: mmlu
        metrics:
          - type: accuracy
            value: 0.267
            name: Accuracy
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA
          type: truthful_qa
        metrics:
          - type: accuracy
            value: 0.385
            name: Accuracy
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande
          type: winogrande
        metrics:
          - type: accuracy
            value: 0.528
            name: Accuracy

MiniMind Max2: Efficient Edge-Deployed Language Models

Mixture of Experts + Grouped Query Attention for Maximum Efficiency

Overview

MiniMind Max2 is a family of efficient language models designed for edge deployment, inspired by MiniMax-01's architecture. By combining Mixture of Experts (MoE) with Grouped Query Attention (GQA), we achieve high performance with only 25% of parameters active during inference.

Key Features

Feature	Description
MoE Architecture	8 experts with top-2 routing (25% activation)
GQA Optimization	4:1 query-to-key ratio for memory efficiency
Edge Ready	Android NDK support with JNI bindings
Multiple Formats	SafeTensors, GGUF, ONNX export support
FP8 Support	Optimized for FP8 quantization

Model Variants

Model	Total Params	Active Params	Layers	Hidden	Experts	Use Case
max2-nano	500M	125M	12	1024	8	Mobile/IoT
max2-lite	1.5B	375M	20	2048	8	Edge devices
max2-pro	3B	750M	28	3072	8	High-performance edge

Architecture Details

┌─────────────────────────────────────────────────────────────────┐
│                    MiniMind Max2 Architecture                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Input Tokens                                                    │
│       │                                                          │
│       ▼                                                          │
│  ┌─────────────────────────────────────────┐                    │
│  │  Token Embedding + RoPE Positional Enc  │                    │
│  └─────────────────────────────────────────┘                    │
│       │                                                          │
│       ▼                                                          │
│  ╔═══════════════════════════════════════════════════════════╗  │
│  ║              Transformer Block (×N layers)                 ║  │
│  ║  ┌─────────────────────────────────────────────────────┐  ║  │
│  ║  │                    RMSNorm                          │  ║  │
│  ║  └─────────────────────────────────────────────────────┘  ║  │
│  ║       │                                                    ║  │
│  ║       ▼                                                    ║  │
│  ║  ┌─────────────────────────────────────────────────────┐  ║  │
│  ║  │        Grouped Query Attention (GQA)                │  ║  │
│  ║  │   ┌────────┐  ┌────────┐  ┌────────┐               │  ║  │
│  ║  │   │Q Heads │  │K Heads │  │V Heads │               │  ║  │
│  ║  │   │  (48)  │  │  (12)  │  │  (12)  │               │  ║  │
│  ║  │   └────────┘  └────────┘  └────────┘               │  ║  │
│  ║  └─────────────────────────────────────────────────────┘  ║  │
│  ║       │                                                    ║  │
│  ║       ▼  (+Residual)                                       ║  │
│  ║  ┌─────────────────────────────────────────────────────┐  ║  │
│  ║  │                    RMSNorm                          │  ║  │
│  ║  └─────────────────────────────────────────────────────┘  ║  │
│  ║       │                                                    ║  │
│  ║       ▼                                                    ║  │
│  ║  ┌─────────────────────────────────────────────────────┐  ║  │
│  ║  │           Mixture of Experts (MoE)                  │  ║  │
│  ║  │  ┌────────────────────────────────────────────┐    │  ║  │
│  ║  │  │              Router (Top-2)                │    │  ║  │
│  ║  │  └────────────────────────────────────────────┘    │  ║  │
│  ║  │       │                                             │  ║  │
│  ║  │       ▼                                             │  ║  │
│  ║  │  ┌──────┐┌──────┐┌──────┐┌──────┐    ┌──────┐     │  ║  │
│  ║  │  │Exp 1 ││Exp 2 ││Exp 3 ││Exp 4 │....│Exp 8 │     │  ║  │
│  ║  │  │SwiGLU││SwiGLU││SwiGLU││SwiGLU│    │SwiGLU│     │  ║  │
│  ║  │  └──────┘└──────┘└──────┘└──────┘    └──────┘     │  ║  │
│  ║  └─────────────────────────────────────────────────────┘  ║  │
│  ║       │                                                    ║  │
│  ║       ▼  (+Residual)                                       ║  │
│  ╚═══════════════════════════════════════════════════════════╝  │
│       │                                                          │
│       ▼                                                          │
│  ┌─────────────────────────────────────────┐                    │
│  │      Final RMSNorm + LM Head            │                    │
│  └─────────────────────────────────────────┘                    │
│       │                                                          │
│       ▼                                                          │
│  Output Logits (vocab_size: 102,400)                            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Quick Start

Installation

pip install torch transformers safetensors

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "fariasultana/MiniMind",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("fariasultana/MiniMind")

# Generate text
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

Using the API

from huggingface_hub import InferenceClient

client = InferenceClient("fariasultana/MiniMind-API")
response = client.text_generation("Explain quantum computing in simple terms")
print(response)

Technical Specifications

Model Configuration (max2-nano)

Architecture:
  hidden_size: 1024
  num_layers: 12
  num_attention_heads: 16
  num_key_value_heads: 4  # GQA ratio 4:1
  intermediate_size: 2816
  
MoE Configuration:
  num_experts: 8
  num_experts_per_token: 2  # Top-2 routing
  expert_intermediate_size: 1408
  
Efficiency:
  total_parameters: 500M
  active_parameters: 125M  # 25% activation
  activation_ratio: 0.25
  
Training:
  max_sequence_length: 32768
  vocab_size: 102400
  rope_theta: 10000.0

Evaluation Results

Benchmark	max2-nano	max2-lite	max2-pro
HellaSwag	41.2%	52.8%	61.4%
ARC-Challenge	29.8%	38.5%	45.2%
MMLU	26.7%	35.2%	42.8%
TruthfulQA	38.5%	44.2%	48.6%
Winogrande	52.8%	58.4%	63.1%

Export Formats

GGUF (llama.cpp)

python -m scripts.export --model max2-nano --format gguf --output model.gguf

ONNX

python -m scripts.export --model max2-nano --format onnx --output model.onnx

Android Deployment

python -m scripts.export --model max2-nano --format android --output ./android_export

Citation

@misc{minimind-max2-2024,
  title={MiniMind Max2: Efficient Language Models for Edge Deployment},
  author={Matrix Agent},
  year={2024},
  howpublished={\url{https://huggingface.co/fariasultana/MiniMind}}
}

License

Apache 2.0 - See LICENSE for details.

Built with efficiency in mind for the edge AI revolution

fariasultana
/

MiniMind