Barynsor-1

A 21M-parameter decoder-only Transformer language model trained from scratch.


Model Details

Parameter Value
Parameters 21,143,552
Architecture Decoder-only Transformer (Pre-LN)
Hidden size 512
Layers 6
Attention heads 8
FFN size 2048
Vocab size 4096
Max sequence length 256
Activation GELU
Weight tying Yes (embed ↔ lm_head)
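
The parameter count in the table can be reproduced from the hyperparameters above. A minimal sketch, assuming biased attention/FFN linear layers, affine LayerNorms (two per block plus one final), learned positional embeddings, and a tied lm_head; these details are not all stated explicitly in this card:

```python
d, L, H, V, T = 512, 6, 2048, 4096, 256  # d_model, layers, d_ff, vocab, seq_len

embed = V * d                    # token embedding (tied with lm_head)
pos = T * d                      # learned positional embeddings
attn = 4 * (d * d + d)           # Q, K, V, O projections with biases
ffn = (d * H + H) + (H * d + d)  # two biased linear layers
lns = 2 * 2 * d                  # two affine LayerNorms per block
per_layer = attn + ffn + lns
final_ln = 2 * d                 # final LayerNorm after the last block

total = embed + pos + L * per_layer + final_ln
print(total)  # -> 21143552, matching the table
```

Under these assumptions the count matches 21,143,552 exactly.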

Tokenizer

  • SentencePiece Unigram (vocab_size=4096)
  • Special tokens: <pad> (0), <unk> (1), <bos> (2), <eos> (3)

Training Data

Pre-training Data

Target: 10B tokens. Composition: en 50% / ja 35% / code 13% / math 2%

Category Source Config/Subset
English wikimedia/wikipedia 20231101.en
English HuggingFaceFW/fineweb-2 eng_Latn
Japanese wikimedia/wikipedia 20231101.ja
Japanese HuggingFaceFW/fineweb-2 jpn_Jpan
Python bigcode/starcoderdata python
JSON bigcode/starcoderdata json
YAML bigcode/starcoderdata yaml
Markdown bigcode/starcoderdata markdown
Math (en) open-web-math/open-web-math —
Math (ja) AutoMathText web-0.50-to-1.00-ja

Supplemental English data from HuggingFaceFW/fineweb (CC-MAIN-2023-50) was also used.
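
The 10B-token target and the composition above imply roughly the following per-category budgets (a back-of-the-envelope calculation, not figures taken from the training logs):

```python
target = 10_000_000_000
mix = {"en": 0.50, "ja": 0.35, "code": 0.13, "math": 0.02}

# round() avoids float truncation artifacts (e.g. 0.35 * 1e10 is not exact)
budgets = {k: round(target * v) for k, v in mix.items()}
print(budgets)  # en: 5.0B, ja: 3.5B, code: 1.3B, math: 0.2B
```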

SFT Data

brulee-1/Barynsor-1-SFT-Conversations: 91,896 multilingual conversations generated with the Magpie method.

  • Query model: qwen2.5:1.5b
  • Answer model: gemma3:12b-cloud
  • Languages: en (26,180), ja (23,432), zh (11,361), fr (10,739), es (10,727), ko (9,457)
  • SFT used only the en/ja subset, with Japanese oversampled and 15% pre-training data mixed in

Training

Pre-training

  • Optimizer: AdamW (lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)
  • LR schedule: Cosine with 2,000-step warmup
  • Gradient clipping: 1.0
  • Batch size: 64
  • Initial sequence length: 32
  • Context extension: 32 → 64 → 128 → 256 (linear interpolation of positional embeddings)
  • Total pre-training steps: ~240,000 (seq_len=32) + ~12,000 (64) + ~24,000 (128) + ~36,500 (256)
  • Pre-training val_loss: 2.107 (seq_len=32 stage)
  • Precision: bfloat16 (on Apple MPS)
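
Each context-extension step above re-grids the learned positional-embedding table to the new length. A minimal pure-Python sketch of the linear interpolation; the actual training code presumably operates on the model's positional-embedding tensor, and the names here are illustrative:

```python
def interpolate_positions(pe, new_len):
    """Linearly interpolate a positional-embedding table (a list of
    per-position vectors) from len(pe) rows to new_len rows."""
    old_len, dim = len(pe), len(pe[0])
    out = []
    for i in range(new_len):
        # Map the new index i onto the old [0, old_len - 1] grid
        x = i * (old_len - 1) / (new_len - 1)
        lo = min(int(x), old_len - 2)
        frac = x - lo
        row = [(1 - frac) * pe[lo][d] + frac * pe[lo + 1][d] for d in range(dim)]
        out.append(row)
    return out

# Toy example: a 2-position, 1-dim table stretched to 3 positions
print(interpolate_positions([[0.0], [1.0]], 3))  # [[0.0], [0.5], [1.0]]
```

Doubling from 32 to 64 positions in this way preserves the endpoints of the table while filling the interior with evenly spaced blends of neighboring rows.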

SFT (Supervised Fine-Tuning)

  • Base: pre-trained checkpoint at seq_len=256
  • Data: en/ja instruction-response pairs (~19M tokens train, ~841K tokens val)
  • Loss: assistant-only masking
  • Optimizer: AdamW (lr=5e-5, betas=(0.9, 0.95), weight_decay=0.1)
  • LR schedule: Cosine with 200-step warmup
  • Batch size: 32
  • Steps: 6,800
  • SFT val_loss: 2.3163
  • Early stopping: patience=5
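
Assistant-only masking means only assistant tokens contribute to the cross-entropy loss; user and prompt tokens receive the conventional ignore index of -100. A minimal sketch, assuming role-tagged token spans (the actual chat template is not documented in this card):

```python
IGNORE = -100  # conventional ignore_index for cross-entropy losses

def build_labels(spans):
    """spans: list of (role, token_ids) pairs. Returns one label per
    token, keeping assistant tokens and masking everything else."""
    labels = []
    for role, ids in spans:
        labels.extend(ids if role == "assistant" else [IGNORE] * len(ids))
    return labels

conv = [("user", [2, 17, 41]), ("assistant", [99, 100, 3])]
print(build_labels(conv))  # [-100, -100, -100, 99, 100, 3]
```

A loss function configured with ignore_index=-100 then skips the masked positions, so gradients flow only through the assistant's responses.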

Usage

import torch
import sentencepiece as spm
from safetensors.torch import load_file
from model import Barynsor1, Config

# Build the model with the published hyperparameters
cfg = Config(vocab_size=4096, seq_len=256, d_model=512, num_layers=6, num_heads=8, d_ff=2048)
model = Barynsor1(cfg)

# strict=False: the lm_head weight is tied to the embedding and is not
# stored as a separate tensor in the safetensors file
state = load_file("model.safetensors")
model.load_state_dict(state, strict=False)
model.eval()

# Generate
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
prompt = "Hello"
ids = [sp.bos_id()] + sp.encode(prompt)
idx = torch.tensor([ids])
out = model.generate(idx, max_new_tokens=64, temperature=0.7, top_k=40)
print(sp.decode(out[0].tolist()))
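
The generate call's temperature and top_k arguments correspond to standard top-k sampling: logits are divided by the temperature, all but the k highest are discarded, and the next token is drawn from a softmax over the survivors. A pure-Python sketch of the idea (the model's own generate implementation is not shown in this card):

```python
import math
import random

def sample_top_k(logits, temperature=0.7, top_k=40, rng=random):
    """Sample a token index from raw logits using temperature + top-k."""
    scaled = [x / temperature for x in logits]
    # Keep only the top_k highest-scoring indices
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (max-subtracted for numerical stability)
    m = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# With top_k=1 this degenerates to greedy (argmax) decoding
print(sample_top_k([0.1, 2.5, -1.0, 0.3], top_k=1))  # 1
```

Lower temperatures sharpen the distribution toward the argmax; smaller top_k values restrict sampling to fewer candidates.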

License

Apache 2.0
