Arkadiko V4 - Base (pretrained, no SFT)

A 214M-parameter causal decoder pretrained from scratch on ~100B tokens across 9 domains. Pretraining only: no instruction tuning, no chat alignment, no RLHF. Released as a research artifact.

This is V4, not V5. The Arkadiko model family advances to V5 only after demonstrating four post-SFT capabilities (multi-turn chat, ar↔en translation, tool calling, structured thinking). None of those have been validated on this checkpoint. See the Honest Limitations section before considering use.

Quick facts

Parameters 213,934,720
Architecture Pure causal decoder, 18 layers
Hidden size 640
Attention GQA, 10 query heads / 2 KV heads, head_dim=64
FFN SwiGLU, hidden=3456 (≈5.4×)
Vocab 60,000 (SentencePiece BPE)
Context 2,048 tokens
Position RoPE, theta=10000
Tied embeddings No (separate wte and lm_head)
Tokens trained 100,000,006,144 (~100B)
Training steps 9,114,584
Training hours 524.7
Hardware 1× NVIDIA RTX PRO 4000 Blackwell (24GB)
Run completed 2026-05-06
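
For readers who prefer the hyperparameters in code form, here is a minimal sketch of a configuration object mirroring the table above. The field names are hypothetical; the real ArkadikoConfig lives in the (not yet public) project repo.

from dataclasses import dataclass

@dataclass
class ArkadikoConfigSketch:
    # Hypothetical mirror of the quick facts above, not the real ArkadikoConfig.
    n_layers: int = 18
    hidden_size: int = 640
    n_query_heads: int = 10       # GQA: 10 query heads ...
    n_kv_heads: int = 2           # ... sharing 2 KV heads
    head_dim: int = 64            # 10 x 64 = 640 = hidden_size
    ffn_hidden: int = 3456        # SwiGLU, roughly 5.4x hidden
    vocab_size: int = 60_000      # SentencePiece BPE
    max_seq_len: int = 2_048
    rope_theta: float = 10_000.0
    tie_embeddings: bool = False  # separate wte and lm_head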

Final evaluation (held-out per-domain)

Loss is reported in nats; perplexity = exp(loss). The best overall validation PPL during the run was 26.6 at step 8,815k; the released final checkpoint (after the cosine-tail polish) sits at PPL ~28.8.

Domain Val loss (MA3) Perplexity
code 1.93 6.9
math 3.10 22.1
fr 3.32 27.7
es 3.43 30.9
it 3.50 32.9
de 3.57 35.6
classical (Arabic) 3.78 43.7
en 3.75 42.5
ar (modern) 3.80 44.5
overall 3.36 28.8
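
The loss-to-perplexity conversion can be spot-checked directly; small differences against the table are expected because the listed losses are rounded to two decimals. A minimal check in Python:

import math

# Perplexity is exp(loss in nats); spot-check a few rows of the table above.
val_loss = {"code": 1.93, "fr": 3.32, "en": 3.75, "overall": 3.36}
for domain, loss in val_loss.items():
    print(f"{domain:8s} loss={loss:.2f}  ppl={math.exp(loss):.1f}")
# overall: exp(3.36) ~= 28.8, matching the released checkpoint.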

Training data

Roughly:

Domain Tokens Source
Arabic (modern) 24B ArabicWeb24 + cc100-ar + CulturaX-ar
English 28B FineWeb-Edu
German 12B cc100-de
French 8B cc100-fr
Spanish 8B cc100-es
Italian 7B cc100-it
Code 8B CodeParrot + StarCoderData
Math 7B OpenWebMath
Classical Arabic 2.7B Custom (hadith, tafsir, OpenITI, poetry, tashkeela)

Single SentencePiece BPE tokenizer shared across all 9 domains. Token fertility is uneven: Arabic averages roughly 2× the tokens-per-word of English in this vocab, which we believe is a primary cause of the weaker Arabic perplexity. The next iteration uses an Arabic-aware tokenizer (see Roadmap).
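
A rough way to reproduce that fertility comparison, assuming the shipped tokenizer.model and two small plain-text samples (the file names below are placeholders):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

def fertility(path):
    # Tokens per whitespace-separated word: a crude but common fertility proxy.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return len(sp.encode(text)) / max(len(text.split()), 1)

print("ar:", fertility("sample_ar.txt"))  # placeholder sample files
print("en:", fertility("sample_en.txt"))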

Honest limitations

This base model has known structural failures verified through completion testing across the run. Use accordingly.

  1. Coherent generation horizon ≈ 50 tokens. Past that, expect drift, topic loops, or repetition. This is capacity-bound at this size; SFT cannot extend it.
  2. No factual recall in long form. Capitals, public figures, dates: the model produces fluent confabulation, not facts. Pair it with retrieval/tools; do not deploy it as a Q&A system.
  3. Cross-language code bleed. Code prompts in one language frequently produce output flavored by another (e.g., a JS prompt yields Python-styled output). This is a vocab-level issue.
  4. Arabic, the primary target language, is among the worst text domains by PPL. Surface fluency holds for spans of roughly 30-50 tokens; long-form Arabic reasoning is not present. The "Arabic-first" framing was not delivered at this scale.
  5. No safety alignment. No RLHF, no DPO, no toxicity filtering of training data beyond source-level curation. Outputs may be biased, false, or offensive.
  6. No instruction-following. Base model only. Will not reliably follow chat templates, refuse harmful requests, or call tools.

Configuration / tokenizer ID misalignment (read before using)

The config.json shipped here records the values used during training: bos_token_id=0, eos_token_id=2, pad_token_id=1. The actual SentencePiece model (tokenizer.model) defines these tokens at different IDs:

Token SPM ID config.json
<unk> 0 (not specified)
<bos> 1 bos_token_id=0
<eos> 2 eos_token_id=2
<pad> 3 pad_token_id=1

Use the IDs from the SPM model when serving. tokenizer_config.json lists the SPM-derived IDs in added_tokens. The misaligned values in config.json are preserved for reproducibility โ€” the model was trained with them โ€” but downstream code should treat the SPM model as the source of truth.
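
A quick way to confirm the serving-side IDs is to read them straight from tokenizer.model; a minimal sketch with the sentencepiece package (expected values taken from the table above):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Values from tokenizer.model, which serving code should trust over config.json.
print("unk:", sp.unk_id())               # expected 0
print("bos:", sp.bos_id())               # expected 1
print("eos:", sp.eos_id())               # expected 2
print("pad:", sp.piece_to_id("<pad>"))   # expected 3 (pad may be a user-defined piece)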

This also affects all other special tokens, which the SPM model places at IDs 7–14:

<system>=7  <user>=8  <assistant>=9
<think>=10  </think>=11  <tool_call>=12  <tool_result>=13  <eot>=14

<think> is the only special token with a paired closing tag; <tool_call> and <tool_result> content is bounded by <eos> rather than a closing tag.
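
At serving time it is safer to resolve these by surface form than to hard-code the IDs; continuing the sketch above (and assuming the surfaces match the listing exactly):

specials = ["<system>", "<user>", "<assistant>", "<think>", "</think>",
            "<tool_call>", "<tool_result>", "<eot>"]
for tok in specials:
    # Expected IDs 7-14 per the listing above; treat the SPM lookup as canonical.
    print(tok, sp.piece_to_id(tok))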

Loading

The model uses a custom architecture (ArkadikoForCausalLM) which is not part of transformers upstream. To load the weights, use the arkadiko/llm/model.py definition from the project repo, or read the tensors from the safetensors file directly:

import json
from safetensors.torch import load_file

# Raw weights plus the training-time config (see the ID caveats above).
state_dict = load_file("model.safetensors")
with open("config.json") as f:
    config = json.load(f)

# Initialize your ArkadikoConfig + ArkadikoForCausalLM
# (see https://github.com/... for the model code)
# model.load_state_dict(state_dict, strict=False)

The repository code is not yet public. Drop a note in the discussions tab if you need it earlier than the planned release.
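
Once the model class is available, a minimal greedy-decoding loop might look like the sketch below. It assumes a standard causal-LM forward pass that returns [batch, seq, vocab] logits and takes token IDs from the SPM model as described above; adjust to the actual ArkadikoForCausalLM interface.

import torch
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

def greedy_generate(model, prompt, max_new_tokens=50):
    # Keep generations short: coherence degrades past ~50 tokens (see limitations).
    ids = [sp.bos_id()] + sp.encode(prompt)
    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(torch.tensor([ids]))  # assumed shape [1, seq, 60000]
            next_id = int(logits[0, -1].argmax())
            if next_id == sp.eos_id():
                break
            ids.append(next_id)
    return sp.decode(ids)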

What this artifact is good for

  • Research baseline. Reproducible 214M / 100B-token Arabic-inclusive base.
  • SFT experiments. Suitable starting point for short-context, structured-output tasks (tool calling, format compliance) at small scale.
  • Capability-curve studies. Final eval and run log are included; full per-checkpoint curve available on request.

What this artifact is not good for

  • Production chat or assistant deployment.
  • Factual question answering.
  • Long-form generation (>50 tokens).
  • Translation as native generation. (A translation tool wrapper around any base may work better than this model alone.)

Roadmap

The next planned iteration drops German/French/Spanish/Italian, focuses on Arabic + English + Classical + Code + Math, and grows to ~700M parameters with a 128K Arabic-aware tokenizer. See ADR-210 / ADR-211 in the project repo. This V4 base remains the experimental control.

License

CC BY-NC 4.0: non-commercial use only. Attribution required. No warranty, no liability.

Citation

@misc{arkadiko_v4_base_2026,
  author       = {{VectorNomad}},
  title        = {Arkadiko V4: A 214M Arabic-Inclusive Pretrained Base Model},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/VectorNomad/arkadiko-v4-base}}
}

Acknowledgements

Trained on a single RTX PRO 4000 Blackwell. Bridges, not factories.
