---
license: cc-by-nc-4.0
language:
  - ar
  - en
  - de
  - fr
  - es
  - it
tags:
  - arkadiko
  - arabic
  - bilingual
  - pretrained
  - causal-lm
  - research
library_name: transformers
pipeline_tag: text-generation
---

# Arkadiko V4 — Base (pretrained, no SFT)

214M-parameter causal decoder pretrained from scratch on ~100B tokens across 9 domains. **Pretraining only — no instruction tuning, no chat alignment, no RLHF.** Released as a research artifact.

This is **V4**, not V5. The Arkadiko model family advances to V5 only after demonstrating four post-SFT capabilities (multi-turn chat, ar↔en translation, tool calling, structured thinking). None of those have been validated on this checkpoint. See the [Honest Limitations](#honest-limitations) section before considering use.

## Quick facts

| | |
|---|---|
| Parameters | 213,934,720 |
| Architecture | Pure causal decoder, 18 layers |
| Hidden size | 640 |
| Attention | GQA, 10 query heads / 2 KV heads, head_dim=64 |
| FFN | SwiGLU, hidden=3456 (≈5.4×) |
| Vocab | 60,000 (SentencePiece BPE) |
| Context | 2,048 tokens |
| Position | RoPE, theta=10000 |
| Tied embeddings | No (separate `wte` and `lm_head`) |
| Tokens trained | 100,000,006,144 (~100B) |
| Training steps | 9,114,584 |
| Training hours | 524.7 |
| Hardware | 1× NVIDIA RTX PRO 4000 Blackwell (24GB) |
| Run completed | 2026-05-06 |

## Final evaluation (held-out per-domain)

Loss is reported in nats, and perplexity = exp(loss). The best overall validation PPL during the run was **26.6** at step 8,815k; the released final checkpoint sits at PPL ~28.8 (cosine-tail polish), so it is not the best-PPL checkpoint.

| Domain | Val loss (MA3) | Perplexity |
|---|---|---|
| code | 1.93 | 6.9 |
| math | 3.10 | 22.1 |
| fr | 3.32 | 27.7 |
| es | 3.43 | 30.9 |
| it | 3.50 | 32.9 |
| de | 3.57 | 35.6 |
| en | 3.75 | 42.5 |
| classical (Arabic) | 3.78 | 43.7 |
| **ar (modern)** | **3.80** | **44.5** |
| **overall** | 3.36 | 28.8 |
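
The loss-to-perplexity conversion used in the table can be reproduced in a few lines. This is a minimal sketch that starts from the two-decimal losses shown above, so the last digit may differ slightly from the table, which was presumably computed from unrounded values.

```python
import math

# Rounded validation losses (nats) copied from the table above.
VAL_LOSS = {
    "code": 1.93,
    "math": 3.10,
    "fr": 3.32,
    "es": 3.43,
    "it": 3.50,
    "de": 3.57,
    "en": 3.75,
    "classical": 3.78,
    "ar": 3.80,
    "overall": 3.36,
}


def perplexity(loss_nats: float) -> float:
    """Perplexity is the exponential of the mean per-token loss in nats."""
    return math.exp(loss_nats)


for domain, loss in VAL_LOSS.items():
    print(f"{domain:>10}  loss={loss:.2f}  ppl={perplexity(loss):.1f}")
```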

## Training data

Approximate token counts by domain (rounded, so they do not sum exactly to the ~100B trained):

| Domain | Tokens | Source |
|---|---|---|
| Arabic (modern) | 24B | ArabicWeb24 + cc100-ar + CulturaX-ar |
| English | 28B | FineWeb-Edu |
| German | 12B | cc100-de |
| French | 8B | cc100-fr |
| Spanish | 8B | cc100-es |
| Italian | 7B | cc100-it |
| Code | 8B | CodeParrot + StarCoderData |
| Math | 7B | OpenWebMath |
| Classical Arabic | 2.7B | Custom (hadith, tafsir, OpenITI, poetry, tashkeela) |

A single SentencePiece BPE tokenizer is shared across all 9 domains. **Token fertility is uneven**: Arabic averages roughly 2× the tokens per word of English in this vocab, which we believe is a primary cause of the weaker Arabic perplexity. The next iteration uses an Arabic-aware tokenizer (see [Roadmap](#roadmap)).
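
Fertility here means tokens per word. One way to measure it is sketched below, under the assumption that a "word" is a whitespace-delimited string; the function name and the stand-in encoder are illustrative and are not the project's measurement code.

```python
from typing import Callable, Iterable, List


def token_fertility(
    texts: Iterable[str], encode: Callable[[str], List[int]]
) -> float:
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def byte_encode(text: str) -> List[int]:
    """Stand-in encoder for demonstration only: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


demo_fertility = token_fertility(["ab cd"], byte_encode)
print(f"demo fertility: {demo_fertility}")

# With the real tokenizer (assumes the `sentencepiece` package is installed):
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
#   token_fertility(arabic_samples, sp.encode)
```

Comparing the value on matched Arabic and English samples would give the ratio described above.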

## Honest limitations

This base model has known structural failures verified through completion testing across the run. Use accordingly.

1. **Coherent generation horizon ≈ 50 tokens.** Past that, drift, topic-loop, or repetition. Capacity-bound at this size; SFT cannot extend it.
2. **No factual recall in long form.** Capitals, public figures, dates — the model produces fluent confabulation, not facts. Pair with retrieval/tools, do not deploy as a Q&A system.
3. **Cross-language code bleed.** Code prompts in one language frequently produce output flavored by another (JS prompt → Python output). Vocab-level issue.
4. **Arabic, the primary target language, is the weakest domain by PPL.** Modern Arabic has the worst held-out perplexity (44.5) and Classical Arabic the second worst (43.7). Surface fluency reaches ~30-50 token spans; long-form Arabic reasoning is not present. The "Arabic-first" framing was not delivered at this scale.
5. **No safety alignment.** No RLHF, no DPO, no toxicity filtering of training data beyond source-level curation. Outputs may be biased, false, or offensive.
6. **No instruction-following.** Base model only. Will not reliably follow chat templates, refuse harmful requests, or call tools.
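
Given the ~50-token horizon in item 1, experiments may want a hard cap on generation length. The loop below is a minimal sketch: `next_token` is a hypothetical callable standing in for a forward pass plus sampling, and the default cap is taken from the limitation above rather than from any tuned setting.

```python
from typing import Callable, List, Sequence

EOS_ID = 2           # <eos> in the SentencePiece model
MAX_NEW_TOKENS = 50  # approximate coherent horizon from limitation 1


def generate_capped(
    prompt_ids: Sequence[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int = MAX_NEW_TOKENS,
    eos_id: int = EOS_ID,
) -> List[int]:
    """Return newly generated token IDs, stopping at <eos> or the cap."""
    context = list(prompt_ids)
    new_ids: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(context)  # hypothetical: model forward + sampling
        if token == eos_id:
            break
        context.append(token)
        new_ids.append(token)
    return new_ids
```

The cap only truncates output; it does not prevent drift or repetition within the first 50 tokens.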

### Configuration / tokenizer ID misalignment (read before using)

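The `config.json` shipped here records the values used during training: `bos_token_id=0, eos_token_id=2, pad_token_id=1`. The actual SentencePiece model (`tokenizer.model`) places `<bos>` and `<pad>` at different IDs; only `<eos>` matches. Read against the SPM vocabulary, the config's `bos_token_id=0` points at `<unk>` and its `pad_token_id=1` points at `<bos>`: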

| Token | SPM ID | config.json |
|---|---|---|
| `<unk>` | 0 | (not specified) |
| `<bos>` | 1 | `bos_token_id=0` |
| `<eos>` | 2 | `eos_token_id=2` |
| `<pad>` | 3 | `pad_token_id=1` |

**Use the IDs from the SPM model when serving.** `tokenizer_config.json` lists the SPM-derived IDs in `added_tokens`. The misaligned values in `config.json` are preserved for reproducibility — the model was trained with them — but downstream code should treat the SPM model as the source of truth.
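
One way to apply this when serving is to compute the overrides explicitly instead of editing the shipped file. The sketch below hard-codes the values from the table above; the function and dictionary names are illustrative.

```python
from typing import Dict

# Special-token IDs as defined by tokenizer.model (from the table above).
SPM_IDS: Dict[str, int] = {"unk": 0, "bos": 1, "eos": 2, "pad": 3}

# Training-time values recorded in the shipped config.json.
CONFIG_IDS: Dict[str, int] = {"bos": 0, "eos": 2, "pad": 1}


def serving_overrides(
    config_ids: Dict[str, int], spm_ids: Dict[str, int]
) -> Dict[str, int]:
    """Return config keys whose value should be replaced by the SPM ID."""
    overrides = {}
    for name, config_value in config_ids.items():
        if spm_ids[name] != config_value:
            overrides[f"{name}_token_id"] = spm_ids[name]
    return overrides


overrides = serving_overrides(CONFIG_IDS, SPM_IDS)
print(overrides)
```

A serving wrapper could apply the result to its in-memory copy of the config, leaving `config.json` untouched for reproducibility.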

The same caution applies to all other special tokens, which the SPM model places at IDs 7–14:

```
<system>=7  <user>=8  <assistant>=9
<think>=10  </think>=11  <tool_call>=12  <tool_result>=13  <eot>=14
```

`<think>` is the only special token with a paired closer; `<tool_call>` and `<tool_result>` content is bounded by `<eos>` rather than a closing tag.
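
The bounding rules can be illustrated with a small span extractor over token IDs. This is a sketch for SFT experiments only: the base model is not trained to emit these tokens, and the function name and return shape are assumptions made for illustration.

```python
from typing import Dict, List

# IDs from the SPM listing above.
EOS, THINK_OPEN, THINK_CLOSE, TOOL_CALL, TOOL_RESULT = 2, 10, 11, 12, 13

# Opening token -> (span name, token that ends the span).
CLOSERS = {
    THINK_OPEN: ("think", THINK_CLOSE),
    TOOL_CALL: ("tool_call", EOS),
    TOOL_RESULT: ("tool_result", EOS),
}


def extract_spans(ids: List[int]) -> Dict[str, List[List[int]]]:
    """Collect the token IDs inside each special-token span.

    A span with no closing token runs to the end of the sequence.
    """
    spans: Dict[str, List[List[int]]] = {
        "think": [], "tool_call": [], "tool_result": []
    }
    i = 0
    while i < len(ids):
        if ids[i] in CLOSERS:
            name, closer = CLOSERS[ids[i]]
            j = i + 1
            while j < len(ids) and ids[j] != closer:
                j += 1
            spans[name].append(ids[i + 1 : j])
            i = j  # continue after the closing token
        i += 1
    return spans
```

For example, `extract_spans([10, 100, 101, 11, 12, 200, 201, 2])` yields one `think` span `[100, 101]` and one `tool_call` span `[200, 201]`.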

## Loading

The model uses a custom architecture (`ArkadikoForCausalLM`) that is not part of upstream `transformers`. To load the weights, use the `arkadiko/llm/model.py` definition from the project repo, or load the `safetensors` tensors directly:

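```python
import json

from safetensors.torch import load_file

# Raw weights: a dict mapping parameter names to torch tensors.
state_dict = load_file("model.safetensors")
# Training-time config; its bos/pad IDs differ from the SPM model (see above).
with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

# Initialize your ArkadikoConfig + ArkadikoForCausalLM
# (see https://github.com/... for the model code)
# strict=False tolerates missing/unexpected keys, so inspect the return value.
# model.load_state_dict(state_dict, strict=False)
```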

The repository code is not yet public. Drop a note in the discussions tab if you need it earlier than the planned release.
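
Because `strict=False` does not raise on mismatched parameter names, it is worth comparing key sets when pairing the weights with a reimplemented model. The helper below is a framework-free sketch; the function name and the example key names are hypothetical.

```python
from typing import Dict, Iterable, List


def compare_keys(
    model_keys: Iterable[str], checkpoint_keys: Iterable[str]
) -> Dict[str, List[str]]:
    """Report parameter names present on one side but not the other."""
    model_set = set(model_keys)
    checkpoint_set = set(checkpoint_keys)
    return {
        # Expected by the model but absent from the checkpoint.
        "missing": sorted(model_set - checkpoint_set),
        # Present in the checkpoint but unknown to the model.
        "unexpected": sorted(checkpoint_set - model_set),
    }


# Hypothetical key names, for illustration only.
report = compare_keys(
    ["wte.weight", "lm_head.weight"],
    ["wte.weight", "extra.bias"],
)
print(report)
```

In practice this would be called as `compare_keys(model.state_dict().keys(), state_dict.keys())`. PyTorch's `load_state_dict` also returns `missing_keys` and `unexpected_keys`, which carry the same information after loading.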

## What this artifact is good for

- **Research baseline.** Reproducible 214M / 100B-token Arabic-inclusive base.
- **SFT experiments.** Suitable starting point for short-context, structured-output tasks (tool calling, format compliance) at small scale.
- **Capability-curve studies.** Final eval and run log are included; full per-checkpoint curve available on request.

## What this artifact is **not** good for

- Production chat or assistant deployment.
- Factual question answering.
- Long-form generation (>50 tokens).
- Translation as native generation. (A translation tool wrapper around any base may work better than this model alone.)

## Roadmap

The next planned iteration drops German/French/Spanish/Italian, focuses on Arabic + English + Classical + Code + Math, and grows to ~700M parameters with a 128K Arabic-aware tokenizer. See ADR-210 / ADR-211 in the project repo. This V4 base remains the experimental control.

## License

**CC BY-NC 4.0** — non-commercial use only. Attribution required. No warranty, no liability.

## Citation

```bibtex
@misc{arkadiko_v4_base_2026,
  author       = {{VectorNomad}},
  title        = {Arkadiko V4: A 214M Arabic-Inclusive Pretrained Base Model},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/VectorNomad/arkadiko-v4-base}}
}
```

## Acknowledgements

Trained on a single RTX PRO 4000 Blackwell. Bridges, not factories.