Arkadiko V4 - Base (pretrained, no SFT)

A 214M-parameter causal decoder pretrained from scratch on ~100B tokens across 9 domains. Pretraining only: no instruction tuning, no chat alignment, no RLHF. Released as a research artifact.

This is V4, not V5. The Arkadiko model family advances to V5 only after demonstrating four post-SFT capabilities (multi-turn chat, ar↔en translation, tool calling, structured thinking). None of those have been validated on this checkpoint. See the Honest Limitations section before considering use.

Quick facts

Parameters 213,934,720
Architecture Pure causal decoder, 18 layers
Hidden size 640
Attention GQA, 10 query heads / 2 KV heads, head_dim=64
FFN SwiGLU, hidden=3456 (≈5.4×)
Vocab 60,000 (SentencePiece BPE)
Context 2,048 tokens
Position RoPE, theta=10000
Tied embeddings No (separate wte and lm_head)
Tokens trained 100,000,006,144 (~100B)
Training steps 9,114,584
Training hours 524.7
Hardware 1× NVIDIA RTX PRO 4000 Blackwell (24GB)
Run completed 2026-05-06
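
For readers who prefer the hyperparameters in code form, here is a minimal sketch of a configuration object mirroring the table above. The field names are hypothetical; the real ArkadikoConfig lives in the (not yet public) project repo.

from dataclasses import dataclass

@dataclass
class ArkadikoConfigSketch:
    # Hypothetical mirror of the quick facts above, not the real ArkadikoConfig.
    n_layers: int = 18
    hidden_size: int = 640
    n_query_heads: int = 10       # GQA: 10 query heads ...
    n_kv_heads: int = 2           # ... sharing 2 KV heads
    head_dim: int = 64            # 10 x 64 = 640 = hidden_size
    ffn_hidden: int = 3456        # SwiGLU, roughly 5.4x hidden
    vocab_size: int = 60_000      # SentencePiece BPE
    max_seq_len: int = 2_048
    rope_theta: float = 10_000.0
    tie_embeddings: bool = False  # separate wte and lm_head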

Final evaluation (held-out per-domain)

Loss is reported in nats; perplexity = exp(loss). The best overall validation PPL during the run was 26.6 at step 8,815k; the released final checkpoint (after the cosine-tail polish) sits at PPL ~28.8.

Domain Val loss (MA3) Perplexity
code 1.93 6.9
math 3.10 22.1
fr 3.32 27.7
es 3.43 30.9
it 3.50 32.9
de 3.57 35.6
classical (Arabic) 3.78 43.7
en 3.75 42.5
ar (modern) 3.80 44.5
overall 3.36 28.8
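
The loss-to-perplexity conversion can be spot-checked directly; small differences against the table are expected because the listed losses are rounded to two decimals. A minimal check in Python:

import math

# Perplexity is exp(loss in nats); spot-check a few rows of the table above.
val_loss = {"code": 1.93, "fr": 3.32, "en": 3.75, "overall": 3.36}
for domain, loss in val_loss.items():
    print(f"{domain:8s} loss={loss:.2f}  ppl={math.exp(loss):.1f}")
# overall: exp(3.36) ~= 28.8, matching the released checkpoint.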

Training data

Roughly:

Domain Tokens Source
Arabic (modern) 24B ArabicWeb24 + cc100-ar + CulturaX-ar
English 28B FineWeb-Edu
German 12B cc100-de
French 8B cc100-fr
Spanish 8B cc100-es
Italian 7B cc100-it
Code 8B CodeParrot + StarCoderData
Math 7B OpenWebMath
Classical Arabic 2.7B Custom (hadith, tafsir, OpenITI, poetry, tashkeela)

Single SentencePiece BPE tokenizer shared across all 9 domains. Token fertility is uneven: Arabic averages roughly 2× the tokens-per-word of English in this vocab, which we believe is a primary cause of the weaker Arabic perplexity. The next iteration uses an Arabic-aware tokenizer (see Roadmap).
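
A rough way to reproduce that fertility comparison, assuming the shipped tokenizer.model and two small plain-text samples (the file names below are placeholders):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

def fertility(path):
    # Tokens per whitespace-separated word: a crude but common fertility proxy.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return len(sp.encode(text)) / max(len(text.split()), 1)

print("ar:", fertility("sample_ar.txt"))  # placeholder sample files
print("en:", fertility("sample_en.txt"))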

Honest limitations

This base model has known structural failures verified through completion testing across the run. Use accordingly.

  1. Coherent generation horizon ≈ 50 tokens. Past that, expect drift, topic loops, or repetition. This is capacity-bound at this size; SFT cannot extend it.
  2. No factual recall in long form. Capitals, public figures, dates: the model produces fluent confabulation, not facts. Pair it with retrieval/tools; do not deploy it as a Q&A system.
  3. Cross-language code bleed. Code prompts in one language frequently produce output flavored by another (e.g., a JS prompt yields Python-styled output). This is a vocab-level issue.
  4. Arabic, the primary target language, is among the worst text domains by PPL. Surface fluency holds for spans of roughly 30-50 tokens; long-form Arabic reasoning is not present. The "Arabic-first" framing was not delivered at this scale.
  5. No safety alignment. No RLHF, no DPO, no toxicity filtering of training data beyond source-level curation. Outputs may be biased, false, or offensive.
  6. No instruction-following. Base model only. Will not reliably follow chat templates, refuse harmful requests, or call tools.

Configuration / tokenizer ID misalignment (read before using)

The config.json shipped here records the values used during training: bos_token_id=0, eos_token_id=2, pad_token_id=1. The actual SentencePiece model (tokenizer.model) defines these tokens at different IDs:

Token SPM ID config.json
<unk> 0 (not specified)
<bos> 1 bos_token_id=0
<eos> 2 eos_token_id=2
<pad> 3 pad_token_id=1

Use the IDs from the SPM model when serving. tokenizer_config.json lists the SPM-derived IDs in added_tokens. The misaligned values in config.json are preserved for reproducibility โ€” the model was trained with them โ€” but downstream code should treat the SPM model as the source of truth.
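
A quick way to confirm the serving-side IDs is to read them straight from tokenizer.model; a minimal sketch with the sentencepiece package (expected values taken from the table above):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Values from tokenizer.model, which serving code should trust over config.json.
print("unk:", sp.unk_id())               # expected 0
print("bos:", sp.bos_id())               # expected 1
print("eos:", sp.eos_id())               # expected 2
print("pad:", sp.piece_to_id("<pad>"))   # expected 3 (pad may be a user-defined piece)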

This also affects all other special tokens, which the SPM model places at IDs 7–14:

<system>=7  <user>=8  <assistant>=9
<think>=10  </think>=11  <tool_call>=12  <tool_result>=13  <eot>=14

<think> is the only special token with a paired closing tag; <tool_call> and <tool_result> content is bounded by <eos> rather than a closing tag.
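
At serving time it is safer to resolve these by surface form than to hard-code the IDs; continuing the sketch above (and assuming the surfaces match the listing exactly):

specials = ["<system>", "<user>", "<assistant>", "<think>", "</think>",
            "<tool_call>", "<tool_result>", "<eot>"]
for tok in specials:
    # Expected IDs 7-14 per the listing above; treat the SPM lookup as canonical.
    print(tok, sp.piece_to_id(tok))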

Loading

The model uses a custom architecture (ArkadikoForCausalLM) which is not part of transformers upstream. To load the weights, use the arkadiko/llm/model.py definition from the project repo, or read the tensors from the safetensors file directly:

import json
from safetensors.torch import load_file

# Raw weights plus the training-time config (see the ID caveats above).
state_dict = load_file("model.safetensors")
with open("config.json") as f:
    config = json.load(f)

# Initialize your ArkadikoConfig + ArkadikoForCausalLM
# (see https://github.com/... for the model code)
# model.load_state_dict(state_dict, strict=False)

The repository code is not yet public. Drop a note in the discussions tab if you need it earlier than the planned release.
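
Once the model class is available, a minimal greedy-decoding loop might look like the sketch below. It assumes a standard causal-LM forward pass that returns [batch, seq, vocab] logits and takes token IDs from the SPM model as described above; adjust to the actual ArkadikoForCausalLM interface.

import torch
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

def greedy_generate(model, prompt, max_new_tokens=50):
    # Keep generations short: coherence degrades past ~50 tokens (see limitations).
    ids = [sp.bos_id()] + sp.encode(prompt)
    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(torch.tensor([ids]))  # assumed shape [1, seq, 60000]
            next_id = int(logits[0, -1].argmax())
            if next_id == sp.eos_id():
                break
            ids.append(next_id)
    return sp.decode(ids)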

What this artifact is good for

  • Research baseline. Reproducible 214M / 100B-token Arabic-inclusive base.
  • SFT experiments. Suitable starting point for short-context, structured-output tasks (tool calling, format compliance) at small scale.
  • Capability-curve studies. Final eval and run log are included; full per-checkpoint curve available on request.

What this artifact is not good for

  • Production chat or assistant deployment.
  • Factual question answering.
  • Long-form generation (>50 tokens).
  • Translation as native generation. (A translation tool wrapper around any base may work better than this model alone.)

Roadmap

The next planned iteration drops German/French/Spanish/Italian, focuses on Arabic + English + Classical + Code + Math, and grows to ~700M parameters with a 128K Arabic-aware tokenizer. See ADR-210 / ADR-211 in the project repo. This V4 base remains the experimental control.

License

CC BY-NC 4.0: non-commercial use only. Attribution required. No warranty, no liability.

Citation

@misc{arkadiko_v4_base_2026,
  author       = {{VectorNomad}},
  title        = {Arkadiko V4: A 214M Arabic-Inclusive Pretrained Base Model},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/VectorNomad/arkadiko-v4-base}}
}

Acknowledgements

Trained on a single RTX PRO 4000 Blackwell. Bridges, not factories.
