---
library_name: transformers
license: mit
datasets:
- HuggingFaceTB/smollm-corpus
language:
- en
pipeline_tag: text-generation
tags:
- transformer
- language-model
- experimental
---

# **SmalLM**

<hr>
<div align="center">
  <a href="https://github.com/azrails/SmalLm" target="_blank" style="margin: 2px;">
    <img alt="GitHub" src="https://img.shields.io/badge/GitHub-SmalLM-181717?logo=github" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://github.com/azrails/SmalLm/blob/main/LICENSE" style="margin: 2px;">
    <img alt="License" src="https://img.shields.io/badge/License-MIT-blue.svg" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>

SmalLM is a series of small transformer language models built from scratch. The project explores modern transformer architecture techniques through modular pipelines for pretraining, fine-tuning, and alignment.

## Uses

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Azrail/smallm_70")
# trust_remote_code=True is required because SmalLM uses a custom model implementation
model = AutoModelForCausalLM.from_pretrained("Azrail/smallm_70", trust_remote_code=True)
inputs = tokenizer("How are you?", return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(out))
```

## Model Details

**Key Features:**

1. Grouped Query Attention (GQA); a minimal sketch follows this list.

2. Mixture-of-Experts with auxiliary-loss-free load balancing.

3. ALiBi (Attention with Linear Biases) or Rotary Position Embedding (RoPE).

4. NTK-by-parts RoPE interpolation for extending the context length.
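
As a concrete illustration of GQA, here is a minimal PyTorch sketch. It is not the repository's actual module; the tensor layout and the use of `scaled_dot_product_attention` are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """Illustrative GQA: fewer KV heads than query heads.

    q: (batch, seq, n_heads, head_dim)
    k, v: (batch, seq, n_kv_heads, head_dim)
    """
    n_heads, n_kv_heads = q.shape[2], k.shape[2]
    group = n_heads // n_kv_heads
    # Share each KV head across its group of query heads
    k = k.repeat_interleave(group, dim=2)
    v = v.repeat_interleave(group, dim=2)
    # scaled_dot_product_attention expects (batch, heads, seq, head_dim)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2)  # (batch, seq, n_heads, head_dim)

# 8 query heads sharing 2 KV heads (group size 4)
q = torch.randn(1, 16, 8, 64)
k = torch.randn(1, 16, 2, 64)
v = torch.randn(1, 16, 2, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 16, 8, 64])
```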

**Pre-Training**:

| Model | Training Data | Steps | Context Length | Tokens | LR | Batch Size (tokens) | Precision |
|----------------------|-------------------------------------------------------------------------------|-------|----------------|--------|-------|------------|-----------|
| [SmalLM-70M](https://huggingface.co/Azrail/smallm_70)      | [smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus)                | 70k    | 1024           | 18B     | 1e-3  | 0.25M       | bfloat16  |
| [SmalLM-150M](#)      | [smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus)                | -   | 1024           | -    | -  | -         | bfloat16  |
| [SmalLM-350M](#)     | [smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus)                | -   | 1024           | -    | -  | -       | bfloat16  |
| [SmalLM-500M](#)     | [smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus)                | -   | 1024           | -    | -  | -         | bfloat16  |

**Evaluation**:
Evaluation is run with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); a minimal invocation sketch follows the table.

| Model                | MMLU | ARC easy/hard | PIQA  | HellaSwag | OBQA  | Winogrande |
|----------------------|------|----------------|-------|-----------|-------|------------|
| [SmalLM-70M](#)      | 25.33 | 51.47/25.68    | 61.75 | 30.31     | 30.8  | 50.83      |
| [SmalLM-150M](#)     | -    | -              | -     | -         | -     | -          |
| [SmalLM-350M](#)     | -    | -              | -     | -         | -     | -          |
| [SmalLM-500M](#)     | -    | -              | -     | -         | -     | -          |
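
The sketch below shows one way to reproduce this kind of run through the harness's Python API (`lm_eval.simple_evaluate`); the task names and batch size here are assumptions, not the exact settings behind the table above.

```python
import lm_eval

# Hedged sketch: evaluate the released checkpoint on the benchmarks listed above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Azrail/smallm_70,trust_remote_code=True",
    tasks=["mmlu", "arc_easy", "arc_challenge", "piqa",
           "hellaswag", "openbookqa", "winogrande"],
    batch_size=8,  # assumed value; adjust to available memory
)
print(results["results"])
```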


**Training procedure** (logged to Weights & Biases):

[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://api.wandb.ai/links/azrails-main/58rwb1yb)

### Framework versions

- Transformers 4.50.3
- PyTorch 2.6.0+cu126
- Datasets 3.5.0
- Tokenizers 0.21.1