---
license: cc-by-nc-4.0
language:
- ar
- en
- de
- fr
- es
- it
tags:
- arkadiko
- arabic
- bilingual
- pretrained
- causal-lm
- research
library_name: transformers
pipeline_tag: text-generation
---
# Arkadiko V4: Base (pretrained, no SFT)
214M-parameter causal decoder pretrained from scratch on ~100B tokens across 9 domains. **Pretraining only: no instruction tuning, no chat alignment, no RLHF.** Released as a research artifact.
This is **V4**, not V5. The Arkadiko model family advances to V5 only after demonstrating four post-SFT capabilities (multi-turn chat, ar↔en translation, tool calling, structured thinking). None of those have been validated on this checkpoint. See the [Honest Limitations](#honest-limitations) section before considering use.
## Quick facts
| | |
|---|---|
| Parameters | 213,934,720 |
| Architecture | Pure causal decoder, 18 layers |
| Hidden size | 640 |
| Attention | GQA, 10 query heads / 2 KV heads, head_dim=64 |
| FFN | SwiGLU, hidden=3456 (≈5.4×) |
| Vocab | 60,000 (SentencePiece BPE) |
| Context | 2,048 tokens |
| Position | RoPE, theta=10000 |
| Tied embeddings | No (separate `wte` and `lm_head`) |
| Tokens trained | 100,000,006,144 (~100B) |
| Training steps | 9,114,584 |
| Training hours | 524.7 |
| Hardware | 1× NVIDIA RTX PRO 4000 Blackwell (24GB) |
| Run completed | 2026-05-06 |
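As a sanity check, the parameter total can be reproduced from the dimensions listed in the table. The sketch below assumes bias-free linear layers and a single 640-element norm weight; both are assumptions chosen because they match the stated total, not details confirmed by the model code.

```python
# Reproduce the stated parameter count from the Quick facts table.
# Assumptions (not confirmed by the model code): no biases anywhere,
# and only one learned norm weight vector of size `hidden`.
vocab, hidden, layers = 60_000, 640, 18
q_heads, kv_heads, head_dim = 10, 2, 64
ffn_hidden = 3456

embeddings = 2 * vocab * hidden            # untied wte + lm_head
attn = (
    hidden * q_heads * head_dim            # Q projection
    + 2 * hidden * kv_heads * head_dim     # K and V projections (GQA)
    + q_heads * head_dim * hidden          # output projection
)
ffn = 3 * hidden * ffn_hidden              # SwiGLU: gate, up, down
final_norm = hidden                        # assumed single norm weight

total = embeddings + layers * (attn + ffn) + final_norm
print(f"{total:,}")  # 213,934,720
```

Under these assumptions the untied embeddings account for 76.8M parameters, roughly 36% of the model.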
## Final evaluation (held-out per-domain)
Loss in nats, perplexity = exp(loss). Best-ever overall val PPL was **26.6** at step 8,815k; the released final checkpoint is at PPL ~28.8 (cosine-tail polish).
| Domain | Val loss (MA3) | Perplexity |
|---|---|---|
| code | 1.93 | 6.9 |
| math | 3.10 | 22.1 |
| fr | 3.32 | 27.7 |
| es | 3.43 | 30.9 |
| it | 3.50 | 32.9 |
| de | 3.57 | 35.6 |
| en | 3.75 | 42.5 |
| classical (Arabic) | 3.78 | 43.7 |
| **ar (modern)** | **3.80** | **44.5** |
| **overall** | 3.36 | 28.8 |
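Since perplexity is exp(loss), the table can be cross-checked in a few lines. The sketch below recomputes perplexity from the rounded losses; small gaps against the reported column are expected, presumably because the reported values come from unrounded losses (an assumption).

```python
import math

# (domain, val loss in nats, reported perplexity), copied from the table above.
rows = [
    ("code", 1.93, 6.9),
    ("math", 3.10, 22.1),
    ("fr", 3.32, 27.7),
    ("es", 3.43, 30.9),
    ("it", 3.50, 32.9),
    ("de", 3.57, 35.6),
    ("classical", 3.78, 43.7),
    ("en", 3.75, 42.5),
    ("ar", 3.80, 44.5),
    ("overall", 3.36, 28.8),
]

for domain, loss, reported in rows:
    recomputed = math.exp(loss)  # perplexity = exp(loss in nats)
    print(f"{domain:<10} loss={loss:.2f} ppl={recomputed:5.1f} (reported {reported})")

# Largest gap between recomputed and reported perplexity across all rows.
max_gap = max(abs(math.exp(loss) - reported) for _, loss, reported in rows)
print(f"max gap: {max_gap:.2f}")
```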
## Training data
Approximate token counts by domain and source:
| Domain | Tokens | Source |
|---|---|---|
| Arabic (modern) | 24B | ArabicWeb24 + cc100-ar + CulturaX-ar |
| English | 28B | FineWeb-Edu |
| German | 12B | cc100-de |
| French | 8B | cc100-fr |
| Spanish | 8B | cc100-es |
| Italian | 7B | cc100-it |
| Code | 8B | CodeParrot + StarCoderData |
| Math | 7B | OpenWebMath |
| Classical Arabic | 2.7B | Custom (hadith, tafsir, OpenITI, poetry, tashkeela) |
Single SentencePiece BPE tokenizer shared across all 9 domains. **Token-fertility is uneven**: Arabic averages roughly 2× the tokens-per-word of English in this vocab, which we believe is a primary cause of weaker Arabic perplexity. The next iteration uses an Arabic-aware tokenizer (see [Roadmap](#roadmap)).
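Token fertility (tokens per whitespace-separated word) can be measured directly. The sketch below is one way to do it: the `fertility` helper and the toy encoder are illustrative, not part of the project code, and the commented usage assumes the `sentencepiece` package plus the `tokenizer.model` file from this repo.

```python
from typing import Callable, Iterable, Sequence


def fertility(texts: Iterable[str], encode: Callable[[str], Sequence]) -> float:
    """Average number of tokens per whitespace-separated word."""
    tokens = words = 0
    for text in texts:
        tokens += len(encode(text))
        words += len(text.split())
    if words == 0:
        raise ValueError("no words in input")
    return tokens / words


# Stand-in encoder so the sketch runs without the tokenizer file:
# each word is cut into chunks of up to 3 characters.
def toy_encode(text: str) -> list[str]:
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


print(fertility(["the model reads text"], toy_encode))  # 1.75

# With the real tokenizer (hypothetical sample lists `ar_texts`, `en_texts`):
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
#   ratio = fertility(ar_texts, sp.encode) / fertility(en_texts, sp.encode)
```

Comparing the two languages on parallel or topically matched text keeps the ratio from being skewed by differences in content.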
## Honest limitations
This base model has known structural failures verified through completion testing across the run. Use accordingly.
1. **Coherent generation horizon ≈ 50 tokens.** Past that, output drifts, loops on a topic, or repeats. Capacity-bound at this size; SFT cannot extend it.
2. **No factual recall in long form.** For capitals, public figures, and dates, the model produces fluent confabulation, not facts. Pair it with retrieval or tools; do not deploy it as a Q&A system.
3. **Cross-language code bleed.** Code prompts in one language frequently produce output flavored by another (JS prompt → Python output). Vocab-level issue.
4. **Modern Arabic, the primary target language, is the worst domain by PPL (44.5); classical Arabic is second-worst (43.7).** Surface fluency reaches ~30-50 token spans; long-form Arabic reasoning is not present. The "Arabic-first" framing was not delivered at this scale.
5. **No safety alignment.** No RLHF, no DPO, no toxicity filtering of training data beyond source-level curation. Outputs may be biased, false, or offensive.
6. **No instruction-following.** Base model only. Will not reliably follow chat templates, refuse harmful requests, or call tools.
### Configuration / tokenizer ID misalignment (read before using)
The `config.json` shipped here records the values used during training: `bos_token_id=0, eos_token_id=2, pad_token_id=1`. The actual SentencePiece model (`tokenizer.model`) defines these tokens at different IDs:
| Token | SPM ID | config.json |
|---|---|---|
| `<unk>` | 0 | (not specified) |
| `<bos>` | 1 | `bos_token_id=0` |
| `<eos>` | 2 | `eos_token_id=2` |
| `<pad>` | 3 | `pad_token_id=1` |
**Use the IDs from the SPM model when serving.** `tokenizer_config.json` lists the SPM-derived IDs in `added_tokens`. The misaligned values in `config.json` are preserved for reproducibility (the model was trained with them), but downstream code should treat the SPM model as the source of truth.
This also affects all other special tokens, which the SPM model places at IDs 7–14:
```
<system>=7 <user>=8 <assistant>=9
<think>=10 </think>=11 <tool_call>=12 <tool_result>=13 <eot>=14
```
`<think>` is the only special with a paired closer; `<tool_call>` and `<tool_result>` content is bounded by `<eos>` rather than a closing tag.
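A serving stack can guard against the mismatch by comparing the two sets of IDs before use. The sketch below hard-codes the values from the table in this section; `find_mismatches` is a hypothetical helper, and the commented lines show how the SPM side could be read from the file with the `sentencepiece` package instead.

```python
# IDs as recorded in config.json (training-time values).
config_ids = {"bos": 0, "eos": 2, "pad": 1}
# IDs as defined by tokenizer.model, per the table above.
spm_ids = {"unk": 0, "bos": 1, "eos": 2, "pad": 3}


def find_mismatches(config: dict, spm: dict) -> dict:
    """Return {name: (config_id, spm_id)} for every ID that disagrees."""
    return {
        name: (cfg_id, spm[name])
        for name, cfg_id in config.items()
        if name in spm and cfg_id != spm[name]
    }


mismatches = find_mismatches(config_ids, spm_ids)
print(mismatches)  # {'bos': (0, 1), 'pad': (1, 3)}

# Reading the SPM side from the file instead (requires `sentencepiece`):
#   import sentencepiece as spm_lib
#   sp = spm_lib.SentencePieceProcessor(model_file="tokenizer.model")
#   spm_ids = {"unk": sp.unk_id(), "bos": sp.bos_id(),
#              "eos": sp.eos_id(), "pad": sp.pad_id()}
```

Note that `bos_token_id=0` in `config.json` collides with the SPM ID of `<unk>`, which is one more reason to pass the SPM IDs explicitly when generating.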
## Loading
The model uses a custom architecture (`ArkadikoForCausalLM`) which is not part of `transformers` upstream. To load weights, use the `arkadiko/llm/model.py` definition from the project repo, or load the `safetensors` tensors directly:
```python
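import json

from safetensors.torch import load_file

# Load the raw tensors and the training-time config from the downloaded repo files.
state_dict = load_file("model.safetensors")
with open("config.json", encoding="utf-8") as f:
    config = json.load(f)
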
# Initialize your ArkadikoConfig + ArkadikoForCausalLM
# (see https://github.com/... for the model code)
# model.load_state_dict(state_dict, strict=False)
```
The repository code is not yet public. Drop a note in the discussions tab if you need it earlier than the planned release.
## What this artifact is good for
- **Research baseline.** Reproducible 214M / 100B-token Arabic-inclusive base.
- **SFT experiments.** Suitable starting point for short-context, structured-output tasks (tool calling, format compliance) at small scale.
- **Capability-curve studies.** Final eval and run log are included; full per-checkpoint curve available on request.
## What this artifact is **not** good for
- Production chat or assistant deployment.
- Factual question answering.
- Long-form generation (>50 tokens).
- Translation as native generation. (A translation tool wrapper around any base may work better than this model alone.)
## Roadmap
The next planned iteration drops German/French/Spanish/Italian, focuses on Arabic + English + Classical + Code + Math, and grows to ~700M parameters with a 128K Arabic-aware tokenizer. See ADR-210 / ADR-211 in the project repo. This V4 base remains the experimental control.
## License
**CC BY-NC 4.0**: non-commercial use only. Attribution required. No warranty, no liability.
## Citation
```bibtex
@misc{arkadiko_v4_base_2026,
author = {{VectorNomad}},
title = {Arkadiko V4: A 214M Arabic-Inclusive Pretrained Base Model},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/VectorNomad/arkadiko-v4-base}}
}
```
## Acknowledgements
Trained on a single RTX PRO 4000 Blackwell. Bridges, not factories.