---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-360M-Instruct
- prithivMLmods/SmolLM2-CoT-360M
- summerstars/SolaraV2-coder-0517
- Fu01978/SmolLM2-360M-Instruct-Heretic
pipeline_tag: text-generation
tags:
- mixture-of-experts
- moe
- mergekit
- smollm2
- instruct
- reasoning
- code
- math
- creative
- merge
---

# SmolMoE-4x360M-Instruct

A Mixture-of-Experts model built by merging four SmolLM2-360M fine-tunes with [mergekit](https://github.com/arcee-ai/mergekit). Each expert specializes in a distinct domain, and 2 experts are active per token (~720M active parameters per forward pass out of ~1.4B total).

## Experts

| # | Model | Specialization |
|---|-------|----------------|
| E0 | [HuggingFaceTB/SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) | General knowledge, factual Q&A |
| E1 | [prithivMLmods/SmolLM2-CoT-360M](https://huggingface.co/prithivMLmods/SmolLM2-CoT-360M) | Chain-of-thought reasoning, logic |
| E2 | [summerstars/SolaraV2-coder-0517](https://huggingface.co/summerstars/SolaraV2-coder-0517) | Code generation, mathematics |
| E3 | [Fu01978/SmolLM2-360M-Instruct-Heretic](https://huggingface.co/Fu01978/SmolLM2-360M-Instruct-Heretic) | Creative writing, expressive language |

## Architecture

- **Base architecture:** Mixtral-style MoE (via mergekit)
- **Total experts:** 4
- **Active experts per token:** 2
- **Gate mode:** `hidden` (router initialized from real hidden states), with subsequent router fine-tuning
- **Active parameters per token:** ~720M

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Fu01978/SmolMoE-4x360M-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Fu01978/SmolMoE-4x360M-Instruct")

messages = [{"role": "user", "content": "Implement a binary search in Python."}]
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, temperature=0.2, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Limitations

- General factual accuracy is imperfect; the model can hallucinate details on knowledge questions.
- At 360M parameters per expert, complex multi-step reasoning has limits.
- E0 (General) is the weakest router target: its weights are close to those of E3 (Heretic), a direct fine-tune of the same base, which makes the two experts hard for the router to separate.

## Created With

- [mergekit](https://github.com/arcee-ai/mergekit) for MoE construction
- Kaggle dual T4 GPUs