---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-360M-Instruct
- prithivMLmods/SmolLM2-CoT-360M
- summerstars/SolaraV2-coder-0517
- Fu01978/SmolLM2-360M-Instruct-Heretic
pipeline_tag: text-generation
tags:
- mixture-of-experts
- moe
- mergekit
- smollm2
- instruct
- reasoning
- code
- math
- creative
- merge
---

# SmolMoE-4x360M-Instruct

A Mixture-of-Experts (MoE) model built by merging four SmolLM2-360M fine-tunes with [mergekit](https://github.com/arcee-ai/mergekit). Each expert specializes in a distinct domain, with 2 of the 4 experts active per token (~720M active parameters per forward pass out of ~1.4B total).

## Experts

| # | Model | Specialization |
|---|-------|----------------|
| E0 | [HuggingFaceTB/SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) | General knowledge, factual Q&A |
| E1 | [prithivMLmods/SmolLM2-CoT-360M](https://huggingface.co/prithivMLmods/SmolLM2-CoT-360M) | Chain-of-thought reasoning, logic |
| E2 | [summerstars/SolaraV2-coder-0517](https://huggingface.co/summerstars/SolaraV2-coder-0517) | Code generation, mathematics |
| E3 | [Fu01978/SmolLM2-360M-Instruct-Heretic](https://huggingface.co/Fu01978/SmolLM2-360M-Instruct-Heretic) | Creative writing, expressive language |

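Merges like this are typically described in a `mergekit-moe` YAML config. The sketch below is illustrative only, not the exact config used to build this model; in particular, the `positive_prompts` are hypothetical examples of how each expert's routing domain might be described.

```yaml
# Illustrative mergekit-moe config (hypothetical, not the actual one used).
base_model: HuggingFaceTB/SmolLM2-360M-Instruct
gate_mode: hidden        # derive router weights from hidden states of the prompts below
dtype: bfloat16
experts:
  - source_model: HuggingFaceTB/SmolLM2-360M-Instruct
    positive_prompts:
      - "What is the capital of France?"
  - source_model: prithivMLmods/SmolLM2-CoT-360M
    positive_prompts:
      - "Let's think through this step by step."
  - source_model: summerstars/SolaraV2-coder-0517
    positive_prompts:
      - "Write a Python function that"
  - source_model: Fu01978/SmolLM2-360M-Instruct-Heretic
    positive_prompts:
      - "Write a short story about"
```

With mergekit installed, a config like this is run with `mergekit-moe config.yaml ./output-dir`.
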
## Architecture

- **Base architecture:** Mixtral-style MoE (via mergekit)
- **Total experts:** 4
- **Active experts per token:** 2
- **Gate mode:** `hidden` (router initialized from real hidden states), with subsequent router fine-tuning
- **Active parameters per token:** ~720M

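The "2 of 4 experts per token" behavior is standard Mixtral-style top-2 routing: the router scores all experts, keeps the two highest-scoring ones, and renormalizes their weights before mixing the expert outputs. A minimal sketch in plain Python (the function name and example logits are illustrative, not the model's actual routing code):

```python
import math

def top2_route(gate_logits, num_active=2):
    """Pick the top-k experts from a token's router logits and
    renormalize their softmax weights so the kept weights sum to 1."""
    # Numerically stable softmax over all expert logits.
    m = max(gate_logits)
    exps = [math.exp(x - m) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the num_active highest-probability experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:num_active]
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]

# One token's (made-up) router logits over experts E0..E3;
# here E1 and E3 win and share the routing weight.
for expert, weight in top2_route([0.2, 1.5, 0.1, 0.9]):
    print(f"E{expert}: weight {weight:.3f}")
```

Only the two selected experts' feed-forward blocks run for each token, which is why roughly two experts' worth of parameters (~720M) are active per forward pass.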
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Fu01978/SmolMoE-4x360M-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Fu01978/SmolMoE-4x360M-Instruct")

messages = [{"role": "user", "content": "Implement a binary search in Python."}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(formatted, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256, temperature=0.2, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Limitations

- General factual accuracy is imperfect; the model can hallucinate details on knowledge questions
- At 360M parameters per expert, complex multi-step reasoning has limits
- E0 (General) is the weakest router target due to its weight similarity with E3 (Heretic), which is a direct fine-tune of the same base model

## Created With

- [mergekit](https://github.com/arcee-ai/mergekit) for MoE construction
- Kaggle dual T4 GPUs