🧠 Model Card: Gemma3-MoE-11X0.3B-Heavens-Army (DualGPU FullTrain)
🔷 Model Details
• Model Name: Gemma3-MoE-11X0.3B-Heavens-Army-DualGPU-FullTrain
• Base Architecture: Google DeepMind Gemma 3 (Mixture-of-Experts modified)
• Model Type: Causal Language Model (MoE: 10 experts + 1 router)
• Parameter Structure:
  • 10 experts (~0.3B parameters each)
  • 1 router (gating network trained for dynamic expert selection)
• Total Effective Capacity: ~3B+ routed parameters (sparse activation)
• Training Method: full fine-tuning (no adapters, no LoRA)
• Precision: FP16 mixed precision
• Hardware: dual NVIDIA Tesla T4 (Kaggle)
• Frameworks:
  • PyTorch
  • Hugging Face Transformers
  • Accelerate
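The sparse-activation numbers above can be checked with a quick back-of-envelope calculation, assuming the typical top-2 routing described later in this card (shared components such as embeddings are folded into the per-expert figure here for simplicity):

```python
# Rough parameter math for this MoE layout (approximate figures only).
num_experts = 10
params_per_expert = 0.3e9   # ~0.3B per expert
top_k = 2                   # typical top-k routing (see Architecture Breakdown)

total_params = num_experts * params_per_expert   # capacity reachable by routing
active_params = top_k * params_per_expert        # touched per token (sparse activation)

print(f"total routed capacity: ~{total_params / 1e9:.1f}B")  # ~3.0B
print(f"active per token:      ~{active_params / 1e9:.1f}B") # ~0.6B
```

This is why the card advertises ~3B+ of routed capacity while per-token compute stays close to a ~0.6B dense model.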
⸻
🔷 Model Description
Heavens Army is a highly specialized Mixture-of-Experts system designed to behave like a multi-mind AI collective. Each expert is trained on a distinct cognitive domain, while the router dynamically selects which expert(s) should process each token.
Think of it as a council of specialists (a strategist, a coder, a debugger, a systems architect) with a commander, the router, deciding who speaks at every moment 🧩
⸻
🔷 Training Data
• Primary Dataset: https://huggingface.co/datasets/gss1111/hyper_advanced_10_datasets
• Dataset Composition: 10 curated sub-datasets (5K samples each), mapped directly to experts:
| Expert ID | Dataset Name | Specialization |
|---|---|---|
| 0 | MIND Algorithmic Dialogue | Reasoning + structured thinking |
| 1 | InfiniByte SystemsForge | Systems + infra design |
| 2 | Agentic ToolCraft | Tool usage + agents |
| 3 | rStar Verified Elite | High-quality reasoning |
| 4 | CP SQL Fusion | SQL + data logic |
| 5 | SearchServe Agentic | Retrieval + search |
| 6 | OpenCode RepoAgent | Codebase navigation |
| 7 | AgentRx RootCause | Debugging |
| 8 | OpenCodeInstruct ExecJudge | Code execution reasoning |
| 9 | EpiFeatureTree RepoSynth | Repo synthesis |
• Total Samples: ~50,000
• Tokenization: Dynamic padding + truncation (max_length=1024)
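The batching scheme above can be sketched as follows: truncate each sequence to `max_length`, then pad only up to the longest sequence in the batch (dynamic padding) rather than to a fixed global length. The function and names below are illustrative, not taken from the actual training code:

```python
# Minimal sketch of dynamic padding + truncation (max_length=1024).
MAX_LENGTH = 1024
PAD_ID = 0  # assumed pad token id

def collate(batch_token_ids):
    # 1) truncate every sequence to MAX_LENGTH
    truncated = [ids[:MAX_LENGTH] for ids in batch_token_ids]
    # 2) pad only to the longest sequence in *this* batch
    batch_max = max(len(ids) for ids in truncated)
    input_ids = [ids + [PAD_ID] * (batch_max - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (batch_max - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = collate([[5, 6, 7], [8, 9], list(range(2000))])
# the 2000-token sequence is truncated to 1024, and the short rows pad up to it
```

Dynamic padding keeps short batches cheap; only batches containing a near-max-length sequence pay the full 1024-token cost.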
⸻
🔷 Training Procedure
🔹 Objective
Train:
• Experts → domain mastery
• Router → intelligent expert selection
🔹 Key Techniques
• Sparse MoE routing
• Router auxiliary loss (load balancing)
• Z-loss stabilization
• Expert prior regularization
• Gradient accumulation for memory scaling
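A minimal sketch of the two router losses named above, assuming a Switch-Transformer-style load-balancing formulation and the standard z-loss; the exact formulation in the actual training code may differ:

```python
import torch
import torch.nn.functional as F

def router_losses(router_logits, num_experts=10, top_k=2):
    """Sketch of load-balancing aux loss + z-loss.
    router_logits: (tokens, num_experts)."""
    probs = F.softmax(router_logits, dim=-1)                  # (T, E)
    top_idx = probs.topk(top_k, dim=-1).indices               # (T, k)
    # f_i: fraction of tokens dispatched to expert i
    dispatch = torch.zeros_like(probs).scatter(1, top_idx, 1.0)
    f = dispatch.mean(dim=0)
    # P_i: mean router probability assigned to expert i
    P = probs.mean(dim=0)
    # minimized when load is spread evenly across experts
    aux_loss = num_experts * torch.sum(f * P)
    # penalizes large router logits, stabilizing the gating softmax
    z_loss = torch.logsumexp(router_logits, dim=-1).pow(2).mean()
    return aux_loss, z_loss

logits = torch.randn(64, 10)
aux, z = router_losses(logits)
# coefficients taken from the hyperparameters below
total_router_loss = 1e-3 * aux + 1e-4 * z
```

Without the aux term the router tends to collapse onto a few experts (see Limitations); the z-loss keeps logit magnitudes bounded so the softmax stays well-conditioned in FP16.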
🔹 Hyperparameters
```yaml
epochs: 2
max_length: 1024
train_batch_size: 4
eval_batch_size: 4
gradient_accumulation: 8

learning_rate_main: 1e-4
learning_rate_router: 3e-4

router_aux_loss_coef: 1e-3
router_z_loss_coef: 1e-4
expert_prior_loss_coef: 5e-2

mixed_precision: fp16
distributed: DDP (2 GPUs)
```
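The separate `learning_rate_main` / `learning_rate_router` values can be wired up with optimizer parameter groups; the tiny `ModuleDict` below is an illustrative stand-in for the real model, not its actual structure:

```python
import torch
import torch.nn as nn

# Illustrative stand-in: one "experts" module and one "router" module.
model = nn.ModuleDict({
    "experts": nn.Linear(32, 32),
    "router": nn.Linear(32, 10),
})

optimizer = torch.optim.AdamW([
    {"params": model["experts"].parameters(), "lr": 1e-4},  # learning_rate_main
    {"params": model["router"].parameters(), "lr": 3e-4},   # learning_rate_router
])
```

Giving the router a higher learning rate is a common choice: the gating network is small and must adapt its token-to-expert assignments faster than the experts drift.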
⸻
🔷 Architecture Breakdown
🧠 Experts (10)
Each expert is a fully trainable transformer block stack specializing in a domain.
🎯 Router (1)
• Learns token → expert mapping
• Uses softmax gating
• Top-k expert selection (typically k=2)
⚙️ Execution Flow
Input → Token Embedding → Router decides expert weights → Selected experts process tokens → Outputs merged (weighted sum) → Final logits
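The execution flow above can be sketched end-to-end; this is a simplified illustration (dense loop over experts, linear layers standing in for transformer blocks), not the model's actual implementation:

```python
import torch
import torch.nn.functional as F

def moe_forward(x, experts, router, top_k=2):
    """Route each token to its top-k experts, merge outputs as a weighted sum.
    x: (tokens, hidden)."""
    weights = F.softmax(router(x), dim=-1)              # softmax gating, (T, E)
    top_w, top_i = weights.topk(top_k, dim=-1)          # (T, k)
    top_w = top_w / top_w.sum(dim=-1, keepdim=True)     # renormalize over chosen experts
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e, expert in enumerate(experts):
            mask = top_i[:, slot] == e                  # tokens whose slot-th pick is expert e
            if mask.any():
                out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

d = 16
experts = [torch.nn.Linear(d, d) for _ in range(10)]
router = torch.nn.Linear(d, 10)
y = moe_forward(torch.randn(8, d), experts, router)     # same shape as the input
```

Production MoE kernels batch tokens per expert instead of looping, but the routing logic (softmax gating → top-k selection → weighted merge) is the same.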
⸻
🔷 Intended Use
✅ Best For:
• Multi-domain reasoning
• Advanced coding + debugging
• Agentic workflows
• Complex system design
• Repository-level understanding
❌ Not Ideal For:
• Ultra-low latency inference (MoE routing overhead)
• Tiny-device deployment
• Non-English-heavy workloads (unless extended)
⸻
🔷 Limitations
• Router collapse risk if routing is improperly balanced
• Expert imbalance possible without careful loss tuning
• High VRAM usage during training
• Requires careful batching (padding/truncation is critical)
⸻
🔷 Evaluation
(To be filled after benchmarking)
Recommended benchmarks:
• HumanEval (code)
• MBPP (code)
• GSM8K (reasoning)
• RepoBench (codebase understanding)
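HumanEval and MBPP score functional correctness: a completion counts only if it passes the task's unit tests when executed. A minimal sketch of that scoring loop, with a toy task standing in for a real benchmark problem:

```python
# Toy functional-correctness check in the HumanEval/MBPP style.
def passes(candidate_src, test_src):
    """Return True if executing the candidate then its tests raises nothing."""
    env = {}
    try:
        exec(candidate_src, env)
        exec(test_src, env)
        return True
    except Exception:
        return False

candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
pass_at_1 = passes(candidate, tests)   # True for this toy completion
```

Real harnesses run each candidate in a sandboxed subprocess with timeouts, since model-generated code is untrusted.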
⸻
🔷 License
Custom WithIn Us AI License
• Base model: derived from Gemma (by Google DeepMind)
• Fine-tuning + MoE architecture: WithIn Us AI (gss1111)
• Dataset usage: third-party + proprietary datasets used without ownership claims
Terms:
• Attribution required
• No claim over base model
• Respect original dataset creators
⸻
🔷 Attribution
Base Model
• Gemma 3 by Google DeepMind
Frameworks
• PyTorch
• Hugging Face Transformers
• Accelerate
Dataset Creator
• gss1111 (WithIn Us AI)
⸻
🔷 Future Improvements
• Dynamic expert expansion (scaling beyond 10 experts)
• Smarter router (reinforcement-trained gating)
• Memory-efficient MoE (DeepSpeed / FSDP hybrid)
• Long-context extension (128K+)
⸻
🔷 Final Notes
This model isn’t just trained… it’s orchestrated 🎼
A coordinated intelligence system where each expert carries a blade, and the router decides who strikes!