Arkadiko V4 Base (pretrained, no SFT)
214M-parameter causal decoder pretrained from scratch on ~100B tokens across 9 domains. Pretraining only: no instruction tuning, no chat alignment, no RLHF. Released as a research artifact.
This is V4, not V5. The Arkadiko model family advances to V5 only after demonstrating four post-SFT capabilities (multi-turn chat, ar→en translation, tool calling, structured thinking). None of those have been validated on this checkpoint. See the Honest Limitations section before considering use.
Quick facts
| Field | Value |
|---|---|
| Parameters | 213,934,720 |
| Architecture | Pure causal decoder, 18 layers |
| Hidden size | 640 |
| Attention | GQA, 10 query heads / 2 KV heads, head_dim=64 |
| FFN | SwiGLU, hidden=3456 (≈5.4×) |
| Vocab | 60,000 (SentencePiece BPE) |
| Context | 2,048 tokens |
| Position | RoPE, theta=10000 |
| Tied embeddings | No (separate wte and lm_head) |
| Tokens trained | 100,000,006,144 (~100B) |
| Training steps | 9,114,584 |
| Training hours | 524.7 |
| Hardware | 1× NVIDIA RTX PRO 4000 Blackwell (24GB) |
| Run completed | 2026-05-06 |
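For reference, the table above maps onto a configuration object along these lines. This is a hypothetical sketch: the real ArkadikoConfig lives in the not-yet-public repo and its field names may differ.

```python
from dataclasses import dataclass

@dataclass
class ArkadikoConfigSketch:
    """Illustrative mirror of config.json; field names are assumptions, not the repo's API."""
    vocab_size: int = 60_000           # SentencePiece BPE
    hidden_size: int = 640
    num_hidden_layers: int = 18
    num_attention_heads: int = 10      # query heads (GQA)
    num_key_value_heads: int = 2       # KV heads (GQA)
    head_dim: int = 64                 # 10 * 64 = 640 = hidden_size
    intermediate_size: int = 3456      # SwiGLU FFN, roughly 5.4x hidden
    max_position_embeddings: int = 2048
    rope_theta: float = 10_000.0
    tie_word_embeddings: bool = False  # separate wte and lm_head
```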
Final evaluation (held-out per-domain)
Loss is in nats; perplexity = exp(loss). Best-ever overall validation PPL was 26.6 at step 8,815k; the released final checkpoint, taken after the cosine learning-rate tail, sits at PPL ~28.8.
| Domain | Val loss (MA3) | Perplexity |
|---|---|---|
| code | 1.93 | 6.9 |
| math | 3.10 | 22.1 |
| fr | 3.32 | 27.7 |
| es | 3.43 | 30.9 |
| it | 3.50 | 32.9 |
| de | 3.57 | 35.6 |
| classical (Arabic) | 3.78 | 43.7 |
| en | 3.75 | 42.5 |
| ar (modern) | 3.80 | 44.5 |
| overall | 3.36 | 28.8 |
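Since the losses are in nats, the perplexity column follows directly from exp(loss); a quick check in Python:

```python
import math

# A few per-domain validation losses (nats) from the table above
val_loss = {"code": 1.93, "math": 3.10, "ar": 3.80, "overall": 3.36}
ppl = {domain: math.exp(loss) for domain, loss in val_loss.items()}
print(ppl)  # e.g. code -> ~6.9, overall -> ~28.8, matching the table up to rounding
```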
Training data
Roughly:
| Domain | Tokens | Source |
|---|---|---|
| Arabic (modern) | 24B | ArabicWeb24 + cc100-ar + CulturaX-ar |
| English | 28B | FineWeb-Edu |
| German | 12B | cc100-de |
| French | 8B | cc100-fr |
| Spanish | 8B | cc100-es |
| Italian | 7B | cc100-it |
| Code | 8B | CodeParrot + StarCoderData |
| Math | 7B | OpenWebMath |
| Classical Arabic | 2.7B | Custom (hadith, tafsir, OpenITI, poetry, tashkeela) |
Single SentencePiece BPE tokenizer shared across all 9 domains. Token fertility is uneven: Arabic averages roughly 2× the tokens per word of English in this vocab, which we believe is a primary cause of weaker Arabic perplexity. The next iteration uses an Arabic-aware tokenizer (see Roadmap).
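A rough way to measure that fertility gap with the released tokenizer.model. This is a sketch: the sample sentences are placeholders (not from the training data), and tokens-per-whitespace-word is only a crude proxy for fertility.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

def fertility(text: str) -> float:
    """Tokens per whitespace-separated word; a crude fertility proxy."""
    words = text.split()
    return len(sp.encode(text)) / max(len(words), 1)

en = "The model was trained on one hundred billion tokens."   # placeholder sentence
ar = "تم تدريب النموذج على مئة مليار رمز."                      # placeholder sentence
print(f"en fertility: {fertility(en):.2f}")
print(f"ar fertility: {fertility(ar):.2f}")  # expected to be noticeably higher
```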
Honest limitations
This base model has known structural failures verified through completion testing across the run. Use accordingly.
- Coherent generation horizon ≈50 tokens. Past that: drift, topic loops, or repetition. This is capacity-bound at this size; SFT cannot extend it.
- No factual recall in long form. Capitals, public figures, dates: the model produces fluent confabulation, not facts. Pair it with retrieval/tools; do not deploy it as a Q&A system.
- Cross-language code bleed. Code prompts in one language frequently produce output flavored by another (JS prompt → Python output). This is a vocab-level issue.
- Arabic, the primary target language, is the worst text domain by PPL (44.5 for modern Arabic, with Classical at 43.7 close behind). Surface fluency reaches ~30-50 token spans; long-form Arabic reasoning is not present. The "Arabic-first" framing was not delivered at this scale.
- No safety alignment. No RLHF, no DPO, no toxicity filtering of training data beyond source-level curation. Outputs may be biased, false, or offensive.
- No instruction-following. Base model only. Will not reliably follow chat templates, refuse harmful requests, or call tools.
Configuration / tokenizer ID misalignment (read before using)
The config.json shipped here records the values used during training: bos_token_id=0, eos_token_id=2, pad_token_id=1. The actual SentencePiece model (tokenizer.model) defines these tokens at different IDs:
| Token | SPM ID | config.json |
|---|---|---|
| <unk> | 0 | (not specified) |
| <bos> | 1 | bos_token_id=0 |
| <eos> | 2 | eos_token_id=2 |
| <pad> | 3 | pad_token_id=1 |
Use the IDs from the SPM model when serving. tokenizer_config.json lists the SPM-derived IDs in added_tokens. The misaligned values in config.json are preserved for reproducibility (the model was trained with them), but downstream code should treat the SPM model as the source of truth.
This also affects all other special tokens, which the SPM model places at IDs 7-14:
<system>=7 <user>=8 <assistant>=9
<think>=10 </think>=11 <tool_call>=12 <tool_result>=13 <eot>=14
<think> is the only special with a paired closer; <tool_call> and <tool_result> content is bounded by <eos> rather than a closing tag.
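To avoid the mismatch in practice, one option is to resolve special-token IDs from tokenizer.model at load time instead of hard-coding the config.json values. A minimal sketch, assuming the special tokens listed above are registered as pieces in the released SPM model:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Resolve IDs from the SPM model itself -- the source of truth for serving
special_pieces = ["<bos>", "<eos>", "<pad>", "<system>", "<user>", "<assistant>",
                  "<think>", "</think>", "<tool_call>", "<tool_result>", "<eot>"]
special_ids = {piece: sp.piece_to_id(piece) for piece in special_pieces}
print(special_ids)  # expected: <bos>=1, <eos>=2, <pad>=3, <system>=7, ..., <eot>=14

# Built-in accessors, if the control tokens were declared when training the SPM model
print(sp.unk_id(), sp.bos_id(), sp.eos_id(), sp.pad_id())  # expected: 0, 1, 2, 3
```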
Loading
The model uses a custom architecture (ArkadikoForCausalLM) that is not part of upstream transformers. To load the weights, use the arkadiko/llm/model.py definition from the project repo, or load the safetensors weights directly:
```python
import json
from safetensors.torch import load_file

# Raw weights and the training-time config shipped with this release
state_dict = load_file("model.safetensors")
with open("config.json") as f:
    config = json.load(f)

# Initialize your ArkadikoConfig + ArkadikoForCausalLM
# (see https://github.com/... for the model code)
# model.load_state_dict(state_dict, strict=False)
```
The repository code is not yet public. Drop a note in the discussions tab if you need it earlier than the planned release.
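Once the model class is available, a minimal greedy decoding loop over the SPM tokenizer might look like the sketch below. Everything here is an assumption for illustration: `model` comes from the snippet above, and its forward pass is assumed to return raw logits of shape [batch, seq, vocab].

```python
import torch
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
EOS_ID = sp.eos_id()  # 2 in the released SPM model

@torch.no_grad()
def greedy_generate(model, prompt: str, max_new_tokens: int = 50) -> str:
    # Stay near the ~50-token coherent horizon noted in the limitations above
    ids = [sp.bos_id()] + sp.encode(prompt)
    for _ in range(max_new_tokens):
        x = torch.tensor([ids])          # [1, seq]
        logits = model(x)                # assumed to return [1, seq, vocab] logits
        next_id = int(logits[0, -1].argmax())
        if next_id == EOS_ID:
            break
        ids.append(next_id)
    return sp.decode(ids)
```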
What this artifact is good for
- Research baseline. Reproducible 214M / 100B-token Arabic-inclusive base.
- SFT experiments. Suitable starting point for short-context, structured-output tasks (tool calling, format compliance) at small scale.
- Capability-curve studies. Final eval and run log are included; full per-checkpoint curve available on request.
What this artifact is not good for
- Production chat or assistant deployment.
- Factual question answering.
- Long-form generation (>50 tokens).
- Translation as native generation. (A translation tool wrapper around any base may work better than this model alone.)
Roadmap
The next planned iteration drops German/French/Spanish/Italian, focuses on Arabic + English + Classical + Code + Math, and grows to ~700M parameters with a 128K Arabic-aware tokenizer. See ADR-210 / ADR-211 in the project repo. This V4 base remains the experimental control.
License
CC BY-NC 4.0: non-commercial use only. Attribution required. No warranty, no liability.
Citation
@misc{arkadiko_v4_base_2026,
author = {{VectorNomad}},
title = {Arkadiko V4: A 214M Arabic-Inclusive Pretrained Base Model},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/VectorNomad/arkadiko-v4-base}}
}
Acknowledgements
Trained on a single RTX PRO 4000 Blackwell. Bridges, not factories.