🧠 Model Card: Gemma3-MoE-11X0.3B-Heavens-Army (DualGPU FullTrain)

🔷 Model Details

• Model Name: Gemma3-MoE-11X0.3B-Heavens-Army-DualGPU-FullTrain
• Base Architecture: Google DeepMind Gemma 3 (modified for Mixture-of-Experts)
• Model Type: Causal Language Model (MoE: 10 experts + 1 router)
• Parameter Structure:
  • 10 experts (~0.3B parameters each)
  • 1 router (gating network trained for dynamic expert selection)
• Total Effective Capacity: ~3B routed parameters (sparse activation)
• Training Method: full fine-tuning (no adapters, no LoRA)
• Precision: FP16 mixed precision
• Hardware: dual NVIDIA Tesla T4 (Kaggle)
• Frameworks:
  • PyTorch
  • Hugging Face Transformers
  • Accelerate

🔷 Model Description

Heavens Army is a highly specialized Mixture-of-Experts system designed to behave like a multi-mind AI collective. Each expert is trained on a distinct cognitive domain, while the router dynamically selects which expert(s) should process each token.

Think of it as a council of specialists (a strategist, a coder, a debugger, a systems architect) with a commander, the router, deciding who speaks at every moment 🧩

🔷 Training Data

• Primary Dataset: https://huggingface.co/datasets/gss1111/hyper_advanced_10_datasets
• Dataset Composition: 10 curated sub-datasets (~5K samples each), mapped directly to experts:

| Expert ID | Dataset Name | Specialization |
|---|---|---|
| 0 | MIND Algorithmic Dialogue | Reasoning + structured thinking |
| 1 | InfiniByte SystemsForge | Systems + infra design |
| 2 | Agentic ToolCraft | Tool usage + agents |
| 3 | rStar Verified Elite | High-quality reasoning |
| 4 | CP SQL Fusion | SQL + data logic |
| 5 | SearchServe Agentic | Retrieval + search |
| 6 | OpenCode RepoAgent | Codebase navigation |
| 7 | AgentRx RootCause | Debugging |
| 8 | OpenCodeInstruct ExecJudge | Code execution reasoning |
| 9 | EpiFeatureTree RepoSynth | Repo synthesis |

• Total Samples: ~50,000
• Tokenization: dynamic padding + truncation (max_length=1024)
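A minimal sketch of what dynamic padding + truncation can look like as a collate function (pure PyTorch; the pad token id and function names here are illustrative, not the actual training code):

```python
import torch

MAX_LENGTH = 1024  # matches the card's max_length
PAD_ID = 0         # assumed pad token id for illustration

def collate(batch_ids):
    """Truncate each sequence to MAX_LENGTH, then pad to the longest in the batch."""
    truncated = [ids[:MAX_LENGTH] for ids in batch_ids]
    batch_max = max(len(ids) for ids in truncated)
    input_ids = torch.full((len(truncated), batch_max), PAD_ID, dtype=torch.long)
    attention_mask = torch.zeros(len(truncated), batch_max, dtype=torch.long)
    for i, ids in enumerate(truncated):
        input_ids[i, : len(ids)] = torch.tensor(ids)
        attention_mask[i, : len(ids)] = 1
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

Padding only to the batch maximum (rather than to 1024 everywhere) keeps wasted compute low on the T4s.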

🔷 Training Procedure

🔹 Objective

Train:
• Experts → domain mastery
• Router → intelligent expert selection

🔹 Key Techniques

• Sparse MoE routing
• Router auxiliary loss (load balancing)
• Z-loss stabilization
• Expert prior regularization
• Gradient accumulation for memory scaling
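The two router regularizers above can be sketched as follows: a Switch-Transformer-style load-balancing auxiliary loss plus a z-loss on the router logits. This is an illustrative reconstruction under those assumptions, not the actual training code:

```python
import torch
import torch.nn.functional as F

def router_losses(logits, top1_idx, num_experts):
    """logits: (tokens, experts) router logits; top1_idx: (tokens,) chosen expert per token."""
    probs = F.softmax(logits, dim=-1)
    # Load-balancing: dot product of the fraction of tokens routed to each expert
    # with the mean router probability per expert; 1.0 when perfectly uniform.
    frac_tokens = F.one_hot(top1_idx, num_experts).float().mean(dim=0)
    mean_probs = probs.mean(dim=0)
    aux_loss = num_experts * torch.sum(frac_tokens * mean_probs)
    # Z-loss: penalizes large router logits for numerical stability.
    z_loss = torch.logsumexp(logits, dim=-1).pow(2).mean()
    return aux_loss, z_loss
```

These would be added to the language-modeling loss with the coefficients listed under Hyperparameters (router_aux_loss_coef, router_z_loss_coef).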

🔹 Hyperparameters

epochs: 2
max_length: 1024
train_batch_size: 4
eval_batch_size: 4
gradient_accumulation: 8

learning_rate_main: 1e-4
learning_rate_router: 3e-4

router_aux_loss_coef: 1e-3
router_z_loss_coef: 1e-4
expert_prior_loss_coef: 5e-2

mixed_precision: fp16
distributed: DDP (2 GPUs)
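The separate learning rates for the experts (1e-4) and the router (3e-4) are typically implemented with optimizer parameter groups; a toy sketch, where the module and parameter names are illustrative stand-ins:

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy stand-in: one router gate plus a stack of expert layers."""
    def __init__(self):
        super().__init__()
        self.router = nn.Linear(16, 10, bias=False)
        self.experts = nn.ModuleList(nn.Linear(16, 16) for _ in range(10))

model = TinyMoE()
router_params = [p for n, p in model.named_parameters() if n.startswith("router")]
expert_params = [p for n, p in model.named_parameters() if not n.startswith("router")]

# Two parameter groups, matching learning_rate_main and learning_rate_router.
optimizer = torch.optim.AdamW([
    {"params": expert_params, "lr": 1e-4},
    {"params": router_params, "lr": 3e-4},
])
```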

🔷 Architecture Breakdown

🧠 Experts (10)

Each expert is a fully trainable transformer block stack specializing in a domain.

🎯 Router (1)

• Learns token → expert mapping
• Uses softmax gating
• Top-k expert selection (typically k=2)
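A minimal sketch of softmax gating with top-k selection (k=2), assuming the router is a single linear gate; an illustration of the mechanism, not the actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Softmax gating with top-k expert selection (names are illustrative)."""
    def __init__(self, hidden_size, num_experts=10, k=2):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.k = k

    def forward(self, x):                 # x: (tokens, hidden)
        logits = self.gate(x)             # (tokens, experts)
        probs = F.softmax(logits, dim=-1)
        weights, idx = probs.topk(self.k, dim=-1)
        # Renormalize so the k selected weights sum to 1 per token.
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return weights, idx, logits       # logits kept for the auxiliary losses
```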

⚙️ Execution Flow

Input → Token Embedding → Router decides expert weights → Selected experts process tokens → Outputs merged (weighted sum) → Final logits
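The merge step of the flow above can be sketched as a routing-weighted sum of expert outputs. For clarity the sketch below takes every expert's output as given (dense); a real sparse MoE only runs the selected experts:

```python
import torch

def merge_expert_outputs(expert_outs, weights, idx):
    """expert_outs: (experts, tokens, hidden) — each expert applied to each token.
    weights, idx: (tokens, k) top-k routing weights and expert indices.
    Returns the weighted sum of each token's selected expert outputs."""
    tokens, k = idx.shape
    out = torch.zeros(tokens, expert_outs.shape[-1])
    for slot in range(k):
        # Pick, for every token, the output of the expert chosen in this slot.
        chosen = expert_outs[idx[:, slot], torch.arange(tokens)]  # (tokens, hidden)
        out += weights[:, slot, None] * chosen
    return out
```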

🔷 Intended Use

✅ Best For:
• Multi-domain reasoning
• Advanced coding + debugging
• Agentic workflows
• Complex system design
• Repository-level understanding

❌ Not Ideal For:
• Ultra-low-latency inference (MoE routing overhead)
• Tiny-device deployment
• Non-English-heavy workloads (unless extended)

🔷 Limitations

• Risk of router collapse if load balancing is improperly tuned
• Possible expert imbalance without proper loss tuning
• High VRAM usage during training
• Requires careful batching (padding/truncation are critical)

🔷 Evaluation

(To be filled after benchmarking)

Recommended benchmarks:
• HumanEval (code)
• MBPP
• GSM8K (reasoning)
• RepoBench (codebase understanding)

🔷 License

Custom WithIn Us AI License

• Base model: derived from Gemma (by Google DeepMind)
• Fine-tuning + MoE architecture: WithIn Us AI (gss1111)
• Dataset usage: third-party + proprietary datasets used without ownership claims

Terms:
• Attribution required
• No claim over the base model
• Respect original dataset creators

🔷 Attribution

Base Model
• Gemma 3 by Google DeepMind

Frameworks
• PyTorch
• Hugging Face Transformers
• Accelerate

Dataset Creator
• gss1111 (WithIn Us AI)

🔷 Future Improvements

• Dynamic expert expansion (scaling beyond 10 experts)
• Smarter router (reinforcement-trained gating)
• Memory-efficient MoE (DeepSpeed / FSDP hybrid)
• Long-context extension (128K+)

🔷 Final Notes

This model isn’t just trained… it’s orchestrated 🎼

A coordinated intelligence system where each expert carries a blade, and the router decides who strikes!

