MicroBananaMind-v1 is a very small causal language model trained from scratch on FineWeb-Edu, FineMath, and Cosmopedia-v2.
The model has 902,272 parameters and uses a custom 1536-token byte-level BPE tokenizer with digit-aware tokenization. It is our smallest model to date that is not a TinyStories-only model.
| Field | Value |
|---|---|
| Parameters | 902,272 |
| Architecture | Custom Llama-style decoder |
| Layers | 4 |
| Hidden size | 128 |
| Intermediate size | 352 |
| Attention heads | 4 |
| KV heads | 1 |
| Vocabulary size | 1,536 |
| Context length | 1,024 |
| Embeddings | Tied input/output embeddings |
| Weight format | safetensors |
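The parameter count can be reproduced from the table above. The sketch below assumes a standard Llama-style layout: bias-free linear layers, a gated MLP with three projections, two norm weight vectors per layer plus a final norm, and a head dimension of hidden size divided by attention heads. These are assumptions about the custom architecture rather than details read from its code, and `llama_param_count` is a hypothetical helper name.

```python
def llama_param_count(vocab, hidden, intermediate, layers, heads, kv_heads, tied=True):
    """Count parameters of an assumed Llama-style decoder (no biases)."""
    head_dim = hidden // heads
    kv_dim = kv_heads * head_dim

    # Attention: q_proj and o_proj are hidden x hidden; k_proj and v_proj
    # are hidden x kv_dim because KV heads are shared across query heads.
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim

    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden * intermediate

    # Two norm weight vectors per layer (pre-attention and pre-MLP).
    norms = 2 * hidden

    per_layer = attention + mlp + norms
    embeddings = vocab * hidden
    lm_head = 0 if tied else vocab * hidden  # tied embeddings add nothing
    final_norm = hidden

    return embeddings + layers * per_layer + final_norm + lm_head


total = llama_param_count(
    vocab=1536, hidden=128, intermediate=352, layers=4, heads=4, kv_heads=1
)
print(f"{total:,}")
```

Under these assumptions the total matches the 902,272 parameters reported in the table, with the tied embedding matrix accounting for 196,608 of them.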
All training data was retokenized with our digit-aware 1536-token tokenizer. Token counts after retokenization:
| Dataset | Tokens |
|---|---|
| FineWeb-Edu (sample-10BT) | 16,799,039,898 |
| FineMath | 1,740,373,303 |
| Cosmopedia-v2 | 3,458,958,651 |
Training setup:
| Field | Value |
|---|---|
| Sequence length | 1,024 |
| FineWeb sampling ratio | 70% |
| FineMath sampling ratio | 10% |
| Cosmopedia sampling ratio | 20% |
| Batch size | 128 |
| Gradient accumulation | 8 |
| Tokens per optimizer step | 1,048,576 |
| Training steps | 20,963 |
| Approx training tokens seen | 21,981,298,688 |
| Learning rate | 8e-4 |
| Minimum learning rate | 8e-5 |
| Warmup steps | 500 |
| Weight decay | 0.1 |
| Seed | 1337 |
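The token budget in the table follows from the other settings. The sketch below recomputes it and estimates how many passes over each dataset the run makes. It assumes the batch size counts sequences per micro-batch and that the sampling ratios apply at the token level; neither is stated explicitly, although the first is consistent with the reported tokens per optimizer step. Variable names are illustrative.

```python
seq_len = 1024
batch_size = 128
grad_accum = 8
steps = 20_963

# Tokens consumed by one optimizer step and by the whole run.
tokens_per_step = seq_len * batch_size * grad_accum
tokens_seen = tokens_per_step * steps

# Dataset sizes after retokenization, and the sampling mix.
dataset_tokens = {
    "FineWeb-Edu": 16_799_039_898,
    "FineMath": 1_740_373_303,
    "Cosmopedia-v2": 3_458_958_651,
}
sampling_ratio = {"FineWeb-Edu": 0.70, "FineMath": 0.10, "Cosmopedia-v2": 0.20}

# Approximate number of passes over each dataset under the assumed mix.
approx_epochs = {
    name: tokens_seen * sampling_ratio[name] / size
    for name, size in dataset_tokens.items()
}

print(f"tokens per step: {tokens_per_step:,}")
print(f"tokens seen:     {tokens_seen:,}")
for name, epochs in approx_epochs.items():
    print(f"{name}: ~{epochs:.2f} passes")
```

Under these assumptions, FineWeb-Edu is seen for a little under one pass, while FineMath and Cosmopedia-v2 are each seen roughly 1.3 times.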
We recommend a temperature of 0 (greedy decoding, `do_sample=False`) or 0.1.
This model uses custom architecture code, so load it with `trust_remote_code=True`.
```bash
pip install -U transformers safetensors torch
```
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BananaMind/MicroBananaMind-v1"

# Use the GPU when available; otherwise fall back to CPU in float32.
if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
else:
    device = "cpu"
    dtype = torch.float32

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # required: the architecture code ships with the model repo
    torch_dtype=dtype,
).to(device).eval()

prompt = "The color of the sky is "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Greedy decoding (temperature 0) with a mild repetition penalty.
with torch.no_grad():
    output = model.generate(
        input_ids=input_ids,
        max_new_tokens=64,
        do_sample=False,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
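To switch between the two recommended temperatures, the generation arguments can be built by a small helper. This is a minimal sketch, not part of the model repository: `generation_kwargs` is a hypothetical name, and it treats a temperature of 0 as greedy decoding because `generate` expects a strictly positive temperature when sampling is enabled.

```python
def generation_kwargs(temperature: float, max_new_tokens: int = 64) -> dict:
    """Build keyword arguments for `model.generate` from a temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")

    kwargs = {"max_new_tokens": max_new_tokens, "repetition_penalty": 1.1}
    if temperature == 0:
        # Temperature 0 is expressed as greedy decoding.
        kwargs["do_sample"] = False
    else:
        # Low-temperature sampling, e.g. 0.1.
        kwargs["do_sample"] = True
        kwargs["temperature"] = temperature
    return kwargs


greedy = generation_kwargs(0)
low_temp = generation_kwargs(0.1)
print(greedy)
print(low_temp)
```

The result can be unpacked into the call from the example above, for instance `model.generate(input_ids=input_ids, pad_token_id=tokenizer.eos_token_id, **generation_kwargs(0.1))`.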
License: Apache 2.0