# Barynsor-1
A 21M parameter decoder-only Transformer language model trained from scratch.
## Model Details
| Parameter | Value |
|---|---|
| Parameters | 21,143,552 |
| Architecture | Decoder-only Transformer (Pre-LN) |
| Hidden size | 512 |
| Layers | 6 |
| Attention heads | 8 |
| FFN size | 2048 |
| Vocab size | 4096 |
| Max sequence length | 256 |
| Activation | GELU |
| Weight tying | Yes (embed ↔ lm_head) |
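The total in the table is consistent with learned positional embeddings, biased attention/FFN projections, Pre-LN with a final LayerNorm, and tied embed/lm_head weights. A quick arithmetic sanity check under those assumptions (the breakdown itself is inferred, not taken from the repo):

```python
# Parameter-count sanity check for the table above. Assumptions (not
# confirmed by the repo): learned positional embeddings, biases on the
# attention and FFN projections, two LayerNorms per block plus a final
# LayerNorm, and the lm_head weight tied to the token embedding.
d_model, n_layers, d_ff, vocab, seq_len = 512, 6, 2048, 4096, 256

embed = vocab * d_model                      # token embedding (shared with lm_head)
pos = seq_len * d_model                      # learned positional embedding
attn = 4 * (d_model * d_model + d_model)     # Q, K, V, O projections + biases
ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
norms = 2 * 2 * d_model                      # two LayerNorms per block (weight + bias)
final_ln = 2 * d_model

total = embed + pos + n_layers * (attn + ffn + norms) + final_ln
print(total)  # 21143552, matches the table
```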
## Tokenizer
- SentencePiece Unigram (vocab_size=4096)
- Special tokens: `<pad>` (0), `<unk>` (1), `<bos>` (2), `<eos>` (3)
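A minimal sketch of how inputs would be framed with the special-token IDs listed above. The `frame` helper and the right-padding convention are illustrative assumptions, not code from the repo:

```python
# Special-token IDs from the tokenizer spec above.
PAD, UNK, BOS, EOS = 0, 1, 2, 3

def frame(token_ids, max_len=256):
    """Illustrative helper (name is hypothetical): wrap a tokenized
    sequence in <bos>/<eos> and right-pad with <pad> to max_len."""
    seq = [BOS] + token_ids + [EOS]
    return seq + [PAD] * (max_len - len(seq))

ids = frame([42, 7, 99], max_len=8)
print(ids)  # [2, 42, 7, 99, 3, 0, 0, 0]
```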
## Training Data

### Pre-training Data

Target: 10B tokens; composition: en 50% / ja 35% / code 13% / math 2%.
| Category | Source | Config/Subset |
|---|---|---|
| English | wikimedia/wikipedia | 20231101.en |
| English | HuggingFaceFW/fineweb-2 | eng_Latn |
| Japanese | wikimedia/wikipedia | 20231101.ja |
| Japanese | HuggingFaceFW/fineweb-2 | jpn_Jpan |
| Python | bigcode/starcoderdata | python |
| JSON | bigcode/starcoderdata | json |
| YAML | bigcode/starcoderdata | yaml |
| Markdown | bigcode/starcoderdata | markdown |
| Math (en) | open-web-math/open-web-math | (none) |
| Math (ja) | AutoMathText | web-0.50-to-1.00-ja |
Supplemental English data from HuggingFaceFW/fineweb (CC-MAIN-2023-50) was also used.
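The 10B-token target and the mix above imply per-category token budgets; a simple arithmetic check:

```python
# Per-category token budgets implied by the 10B target and the
# en 50% / ja 35% / code 13% / math 2% composition stated above.
TARGET = 10_000_000_000
mix = {"en": 0.50, "ja": 0.35, "code": 0.13, "math": 0.02}

budgets = {k: round(TARGET * v) for k, v in mix.items()}
assert abs(sum(mix.values()) - 1.0) < 1e-9  # fractions cover the whole budget
print(budgets["en"], budgets["code"])       # 5000000000 1300000000
```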
### SFT Data
brulee-1/Barynsor-1-SFT-Conversations: 91,896 multilingual conversations generated with the Magpie method.
- Query model: qwen2.5:1.5b
- Answer model: gemma3:12b-cloud
- Languages: en (26,180), ja (23,432), zh (11,361), fr (10,739), es (10,727), ko (9,457)
- SFT used only the en/ja subset, with Japanese oversampled and 15% pre-training data mixed in
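The per-language counts above sum to the dataset total; a quick check, plus an illustrative sketch of the 15% pre-training mix (the per-example sampling scheme is an assumption, not the repo's code):

```python
import random

# Per-language conversation counts from the dataset description above.
counts = {"en": 26_180, "ja": 23_432, "zh": 11_361,
          "fr": 10_739, "es": 10_727, "ko": 9_457}
assert sum(counts.values()) == 91_896   # matches the stated total
print(counts["en"] + counts["ja"])      # 49612 conversations in the en/ja subset

# Sketch of "15% pre-training data mixing": for each training example,
# draw from the pre-training stream with probability 0.15.
def sample_source(rng, p_pretrain=0.15):
    return "pretrain" if rng.random() < p_pretrain else "sft"

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
frac = draws.count("pretrain") / len(draws)  # roughly 0.15
```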
## Training

### Pre-training
- Optimizer: AdamW (lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)
- LR schedule: Cosine with 2,000 step warmup
- Gradient clipping: 1.0
- Batch size: 64
- Initial sequence length: 32
- Context extension: 32 → 64 → 128 → 256 (linear interpolation of positional embeddings)
- Total pre-training steps: ~240,000 at seq_len=32, ~12,000 at 64, ~24,000 at 128, ~36,500 at 256
- Pre-training val_loss: 2.107 (seq_len=32 stage)
- Precision: bfloat16 (on Apple MPS)
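The context-extension step above stretches the learned positional embedding table by linear interpolation. A minimal NumPy sketch of that idea; the actual implementation may differ in details:

```python
import numpy as np

def extend_pos_emb(pos_emb, new_len):
    """Stretch a (old_len, d_model) learned positional embedding table to
    new_len rows by per-dimension linear interpolation, a sketch of the
    32 -> 64 -> 128 -> 256 context extension (details assumed)."""
    old_len, d_model = pos_emb.shape
    old_x = np.linspace(0.0, 1.0, old_len)
    new_x = np.linspace(0.0, 1.0, new_len)
    return np.stack([np.interp(new_x, old_x, pos_emb[:, j])
                     for j in range(d_model)], axis=1)

emb32 = np.random.randn(32, 512)
emb64 = extend_pos_emb(emb32, 64)
print(emb64.shape)  # (64, 512); first and last rows are preserved
```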
### SFT (Supervised Fine-Tuning)
- Base: pre-trained checkpoint at seq_len=256
- Data: en/ja instruction-response pairs (~19M tokens train, ~841K tokens val)
- Loss: assistant-only masking
- Optimizer: AdamW (lr=5e-5, betas=(0.9, 0.95), weight_decay=0.1)
- LR schedule: Cosine with 200 step warmup
- Batch size: 32
- Steps: 6,800
- SFT val_loss: 2.3163
- Early stopping: patience=5
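Assistant-only masking means only tokens produced by the assistant contribute to the loss. A sketch of the idea; the ignore index of -100 (the common cross-entropy convention) and the helper name are assumptions, not taken from the training code:

```python
IGNORE = -100  # common cross-entropy ignore index; an assumption here

def mask_labels(token_ids, roles):
    """Illustrative assistant-only masking: copy the targets but replace
    every token not produced by the assistant with IGNORE so it
    contributes no loss (helper is a sketch, not the repo's code)."""
    return [t if r == "assistant" else IGNORE
            for t, r in zip(token_ids, roles)]

ids = [2, 11, 12, 13, 21, 22, 3]
roles = ["system", "user", "user", "user",
         "assistant", "assistant", "assistant"]
print(mask_labels(ids, roles))  # [-100, -100, -100, -100, 21, 22, 3]
```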
## Usage
```python
from model import Barynsor1, Config
from safetensors.torch import load_file

# Build the model with the architecture from the table above.
cfg = Config(vocab_size=4096, seq_len=256, d_model=512,
             num_layers=6, num_heads=8, d_ff=2048)
model = Barynsor1(cfg)

# Load weights (strict=False tolerates keys absent due to weight tying).
state = load_file("model.safetensors")
model.load_state_dict(state, strict=False)
model.eval()

# Generate
import torch
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
prompt = "Hello"
ids = [sp.bos_id()] + sp.encode(prompt)
idx = torch.tensor([ids])
out = model.generate(idx, max_new_tokens=64, temperature=0.7, top_k=40)
print(sp.decode(out[0].tolist()))
```
## License
Apache 2.0