	---
	license: cc-by-nc-4.0
	language:
	- ar
	- en
	- de
	- fr
	- es
	- it
	tags:
	- arkadiko
	- arabic
	- bilingual
	- pretrained
	- causal-lm
	- research
	library_name: transformers
	pipeline_tag: text-generation
	---

	# Arkadiko V4 — Base (pretrained, no SFT)

	214M-parameter causal decoder pretrained from scratch on ~100B tokens across 9 domains. Pretraining only — no instruction tuning, no chat alignment, no RLHF. Released as a research artifact.

	This is V4, not V5. The Arkadiko model family advances to V5 only after demonstrating four post-SFT capabilities (multi-turn chat, ar↔en translation, tool calling, structured thinking). None of those have been validated on this checkpoint. See the [Honest Limitations](#honest-limitations) section before considering use.

	## Quick facts

	| | |
	|---|---|
	| Parameters | 213,934,720 |
	| Architecture | Pure causal decoder, 18 layers |
	| Hidden size | 640 |
	| Attention | GQA, 10 query heads / 2 KV heads, head_dim=64 |
	| FFN | SwiGLU, hidden=3456 (≈5.4×) |
	| Vocab | 60,000 (SentencePiece BPE) |
	| Context | 2,048 tokens |
	| Position | RoPE, theta=10000 |
	| Tied embeddings | No (separate `wte` and `lm_head`) |
	| Tokens trained | 100,000,006,144 (~100B) |
	| Training steps | 9,114,584 |
	| Training hours | 524.7 |
	| Hardware | 1× NVIDIA RTX PRO 4000 Blackwell (24 GB) |
	| Run completed | 2026-05-06 |
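
The reported parameter count can be cross-checked against the table. The sketch below assumes bias-free linear layers, a three-matrix SwiGLU (gate, up, down), and per-layer norms that contribute no counted weights; none of this is confirmed by the model code. Under those assumptions the weight matrices account for all but 640 of the reported parameters, which would be consistent with a single hidden-size vector such as a final norm gain.

```python
# Cross-check of the reported parameter count from the Quick facts table.
# Assumptions (not confirmed by the model code): no biases, SwiGLU uses three
# matrices (gate, up, down), and per-layer norms carry no counted weights.
VOCAB, HIDDEN, LAYERS = 60_000, 640, 18
Q_HEADS, KV_HEADS, HEAD_DIM = 10, 2, 64
FFN_HIDDEN = 3456
REPORTED = 213_934_720

embeddings = 2 * VOCAB * HIDDEN                # untied wte + lm_head
attn = (
    HIDDEN * Q_HEADS * HEAD_DIM                # query projection
    + 2 * HIDDEN * KV_HEADS * HEAD_DIM         # key and value projections (GQA)
    + Q_HEADS * HEAD_DIM * HIDDEN              # output projection
)
ffn = 3 * HIDDEN * FFN_HIDDEN                  # gate, up, down
matrices = embeddings + LAYERS * (attn + ffn)
residual = REPORTED - matrices

print(f"weight matrices: {matrices:,}")
print(f"unaccounted:     {residual:,}")
```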

	## Final evaluation (held-out per-domain)

	Loss in nats, perplexity = exp(loss). Best-ever overall val PPL was 26.6 at step 8,815k; the released final checkpoint is at PPL ~28.8 (cosine-tail polish).

	| Domain | Val loss (MA3) | Perplexity |
	|---|---|---|
	| code | 1.93 | 6.9 |
	| math | 3.10 | 22.1 |
	| fr | 3.32 | 27.7 |
	| es | 3.43 | 30.9 |
	| it | 3.50 | 32.9 |
	| de | 3.57 | 35.6 |
	| en | 3.75 | 42.5 |
	| classical (Arabic) | 3.78 | 43.7 |
	| ar (modern) | 3.80 | 44.5 |
	| overall | 3.36 | 28.8 |
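
Because the table rounds loss to two decimals, exp(loss) computed from the printed values can differ from the printed perplexity in the last digit. A minimal sketch that recomputes perplexity from the table and measures the relative gap per domain (the values are copied from the table; nothing else is assumed):

```python
import math

# (val loss in nats, reported perplexity), copied from the table above.
REPORTED = {
    "code": (1.93, 6.9),
    "math": (3.10, 22.1),
    "fr": (3.32, 27.7),
    "es": (3.43, 30.9),
    "it": (3.50, 32.9),
    "de": (3.57, 35.6),
    "en": (3.75, 42.5),
    "classical": (3.78, 43.7),
    "ar": (3.80, 44.5),
    "overall": (3.36, 28.8),
}

def perplexity(loss_nats: float) -> float:
    """Perplexity for a mean cross-entropy loss given in nats."""
    return math.exp(loss_nats)

# Relative gap between exp(rounded loss) and the reported perplexity.
gaps = {
    domain: abs(perplexity(loss) - ppl) / ppl
    for domain, (loss, ppl) in REPORTED.items()
}
worst = max(gaps, key=gaps.get)
print(f"largest gap: {worst} ({gaps[worst]:.2%})")
```

Every gap stays under 1%, which is what two-decimal rounding of the loss would explain.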

	## Training data

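	Approximate token counts per domain. The rows sum to about 104.7B, slightly above the ~100B tokens trained, so treat them as rough figures: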

	| Domain | Tokens | Source |
	|---|---|---|
	| Arabic (modern) | 24B | ArabicWeb24 + cc100-ar + CulturaX-ar |
	| English | 28B | FineWeb-Edu |
	| German | 12B | cc100-de |
	| French | 8B | cc100-fr |
	| Spanish | 8B | cc100-es |
	| Italian | 7B | cc100-it |
	| Code | 8B | CodeParrot + StarCoderData |
	| Math | 7B | OpenWebMath |
	| Classical Arabic | 2.7B | Custom (hadith, tafsir, OpenITI, poetry, tashkeela) |
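
To see the mix as shares rather than absolute counts, the table can be summed directly. This is a small sketch over the approximate figures above, so the percentages inherit the same roughness:

```python
# Approximate per-domain token counts in billions, copied from the table above.
TOKENS_B = {
    "Arabic (modern)": 24.0,
    "English": 28.0,
    "German": 12.0,
    "French": 8.0,
    "Spanish": 8.0,
    "Italian": 7.0,
    "Code": 8.0,
    "Math": 7.0,
    "Classical Arabic": 2.7,
}

total = sum(TOKENS_B.values())
shares = {name: count / total for name, count in TOKENS_B.items()}
arabic_share = shares["Arabic (modern)"] + shares["Classical Arabic"]

# Print domains from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<18} {share:6.1%}")
print(f"total: {total:.1f}B, Arabic (modern + classical): {arabic_share:.1%}")
```

On these figures, modern plus Classical Arabic make up roughly a quarter of the listed tokens, slightly less than English alone.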

	Single SentencePiece BPE tokenizer shared across all 9 domains. Token-fertility is uneven — Arabic averages roughly 2× the tokens-per-word of English in this vocab, which we believe is a primary cause of weaker Arabic perplexity. The next iteration uses an Arabic-aware tokenizer (see [Roadmap](#roadmap)).
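
Token fertility here means tokens per whitespace-delimited word. A minimal sketch of how it could be measured follows; `fertility` accepts any `encode` callable, and `toy_encode` is a hypothetical stand-in used only so the example runs without the tokenizer file.

```python
from typing import Callable, Iterable, Sequence

def fertility(encode: Callable[[str], Sequence], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(encode(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words in the sample")
    return n_tokens / n_words

def toy_encode(text: str) -> list:
    """Hypothetical stand-in tokenizer: cuts each word into 3-character pieces."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

print(fertility(toy_encode, ["the cat sat on the mat"]))
print(fertility(toy_encode, ["tokenizer"]))
```

With the real vocabulary, `encode` could be the `encode` method of a `sentencepiece.SentencePieceProcessor` loaded from `tokenizer.model`, applied to comparable Arabic and English samples. Whitespace splitting is a coarse word definition for Arabic, where clitics attach to the word, so the resulting ratio is indicative rather than exact.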

	## Honest limitations

	This base model has known structural failures verified through completion testing across the run. Use accordingly.

	1. Coherent generation horizon ≈ 50 tokens. Past that, drift, topic-loop, or repetition. Capacity-bound at this size; SFT cannot extend it.
	2. No factual recall in long form. Capitals, public figures, dates — the model produces fluent confabulation, not facts. Pair with retrieval/tools, do not deploy as a Q&A system.
	3. Cross-language code bleed. Code prompts in one language frequently produce output flavored by another (JS prompt → Python output). Vocab-level issue.
	4. Arabic, the primary target language, is the weakest text domain by PPL: modern Arabic (44.5) is the worst and Classical Arabic (43.7) the second-worst. Surface fluency reaches ~30-50 token spans; long-form Arabic reasoning is not present. The "Arabic-first" framing was not delivered at this scale.
	5. No safety alignment. No RLHF, no DPO, no toxicity filtering of training data beyond source-level curation. Outputs may be biased, false, or offensive.
	6. No instruction-following. Base model only. Will not reliably follow chat templates, refuse harmful requests, or call tools.
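
One way to work within limitation 1 is to cap output near the ~50-token horizon and cut at the first repeated n-gram. The sketch below operates on plain token IDs; the function name, the 4-gram window, and the cap of 50 are illustrative choices, not settings used in the author's completion testing.

```python
from typing import List, Sequence

def truncate_at_repeat(
    token_ids: Sequence[int], max_tokens: int = 50, ngram: int = 4
) -> List[int]:
    """Return the prefix that ends before the first repeated n-gram,
    capped at max_tokens."""
    seen = set()
    limit = min(len(token_ids), max_tokens)
    for end in range(ngram, limit + 1):
        gram = tuple(token_ids[end - ngram:end])
        if gram in seen:
            # The n-gram ending at `end` already occurred: drop it entirely.
            return list(token_ids[:end - ngram])
        seen.add(gram)
    return list(token_ids[:limit])

looping = [1, 2, 3, 4, 1, 2, 3, 4, 9]
print(truncate_at_repeat(looping))
```

This only catches exact loops. Topic drift and paraphrased repetition would need a different check.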

	### Configuration / tokenizer ID misalignment (read before using)

	The `config.json` shipped here records the values used during training: `bos_token_id=0, eos_token_id=2, pad_token_id=1`. The actual SentencePiece model (`tokenizer.model`) places `<bos>` and `<pad>` at different IDs; only `<eos>` matches:

	| Token | SPM ID | config.json |
	|---|---|---|
	| `<unk>` | 0 | (not specified) |
	| `<bos>` | 1 | `bos_token_id=0` |
	| `<eos>` | 2 | `eos_token_id=2` |
	| `<pad>` | 3 | `pad_token_id=1` |

	Use the IDs from the SPM model when serving. `tokenizer_config.json` lists the SPM-derived IDs in `added_tokens`. The misaligned values in `config.json` are preserved for reproducibility — the model was trained with them — but downstream code should treat the SPM model as the source of truth.
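
A minimal sketch of that rule, using only the values in the table above: it finds the `config.json` fields that disagree with the SPM model and builds the overrides a serving stack would apply. The helper name `serving_overrides` is hypothetical.

```python
# IDs defined by the SentencePiece model (the source of truth).
SPM_IDS = {"<unk>": 0, "<bos>": 1, "<eos>": 2, "<pad>": 3}

# Values recorded in the shipped config.json.
CONFIG_IDS = {"bos_token_id": 0, "eos_token_id": 2, "pad_token_id": 1}

# Which SPM token each config field is meant to name.
FIELD_TO_TOKEN = {
    "bos_token_id": "<bos>",
    "eos_token_id": "<eos>",
    "pad_token_id": "<pad>",
}

def serving_overrides(config_ids: dict) -> dict:
    """Map each config field that disagrees with the SPM model to the SPM ID."""
    return {
        field: SPM_IDS[token]
        for field, token in FIELD_TO_TOKEN.items()
        if config_ids.get(field) != SPM_IDS[token]
    }

overrides = serving_overrides(CONFIG_IDS)
serving_config = {**CONFIG_IDS, **overrides}
print(overrides)  # {'bos_token_id': 1, 'pad_token_id': 3}
```

Applying the overrides to an in-memory copy leaves the shipped `config.json` untouched, so the training-time values remain available for reproducibility.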

	The SPM model is also the source of truth for the remaining special tokens, which it places at IDs 7–14:

	```
	<system>=7 <user>=8 <assistant>=9
	<think>=10 </think>=11 <tool_call>=12 <tool_result>=13 <eot>=14
	```

	`<think>` is the only special token with a paired closer; `<tool_call>` and `<tool_result>` content is terminated by `<eos>` rather than a closing tag.
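
The delimiter convention can be sketched on raw token IDs: `<think>` content ends at `</think>`, while `<tool_call>` content runs until `<eos>`. The helper names below are hypothetical, and the base model is not expected to emit these tokens reliably; the example only illustrates the layout.

```python
# Special-token IDs as placed by the SPM model.
SPECIAL_IDS = {
    "<system>": 7, "<user>": 8, "<assistant>": 9,
    "<think>": 10, "</think>": 11,
    "<tool_call>": 12, "<tool_result>": 13, "<eot>": 14,
}
EOS_ID = 2

def span_after(ids, open_id, close_id):
    """Return the IDs between the first open_id and the next close_id.

    Returns None if open_id is absent; runs to the end of the sequence
    if close_id never appears.
    """
    if open_id not in ids:
        return None
    start = ids.index(open_id) + 1
    try:
        end = ids.index(close_id, start)
    except ValueError:
        end = len(ids)
    return ids[start:end]

def extract_think(ids):
    """Content between <think> and its paired </think>."""
    return span_after(ids, SPECIAL_IDS["<think>"], SPECIAL_IDS["</think>"])

def extract_tool_call(ids):
    """Content after <tool_call>, bounded by <eos> rather than a closing tag."""
    return span_after(ids, SPECIAL_IDS["<tool_call>"], EOS_ID)

sequence = [9, 10, 100, 101, 11, 12, 200, 201, 2]
print(extract_think(sequence), extract_tool_call(sequence))
```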

	## Loading

	The model uses a custom architecture (`ArkadikoForCausalLM`) which is not part of `transformers` upstream. To load weights, use the `arkadiko/llm/model.py` definition from the project repo, or load the `safetensors` tensors directly:

	```python
	import json

	from safetensors.torch import load_file

	# Load the raw tensors and the training-time config from the model repo files.
	state_dict = load_file("model.safetensors")
	with open("config.json", encoding="utf-8") as f:
	    config = json.load(f)

	# Initialize your ArkadikoConfig + ArkadikoForCausalLM
	# (see https://github.com/... for the model code)
	# model.load_state_dict(state_dict, strict=False)
	```
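
With `strict=False`, `load_state_dict` skips mismatched keys instead of raising, so a renamed or missing tensor can go unnoticed. A minimal sketch of a key comparison to run first; the key names in the usage lines are hypothetical and not taken from the checkpoint.

```python
from typing import Iterable, List, Tuple

def diff_state_keys(
    model_keys: Iterable[str], checkpoint_keys: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected): keys the model expects but the checkpoint
    lacks, and keys the checkpoint has but the model does not."""
    model_set = set(model_keys)
    ckpt_set = set(checkpoint_keys)
    return sorted(model_set - ckpt_set), sorted(ckpt_set - model_set)

# Hypothetical key names, for illustration only.
model_keys = ["wte.weight", "lm_head.weight", "norm.weight"]
checkpoint_keys = ["wte.weight", "lm_head.weight", "extra.bias"]

missing, unexpected = diff_state_keys(model_keys, checkpoint_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

With real objects, the inputs would be `model.state_dict().keys()` and `state_dict.keys()`. PyTorch's `load_state_dict` also returns the same information as `missing_keys` and `unexpected_keys`, which is worth inspecting after a non-strict load.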

	The repository code is not yet public. Drop a note in the discussions tab if you need it earlier than the planned release.

	## What this artifact is good for

	- Research baseline. Reproducible 214M / 100B-token Arabic-inclusive base.
	- SFT experiments. Suitable starting point for short-context, structured-output tasks (tool calling, format compliance) at small scale.
	- Capability-curve studies. Final eval and run log are included; full per-checkpoint curve available on request.

	## What this artifact is not good for

	- Production chat or assistant deployment.
	- Factual question answering.
	- Long-form generation (>50 tokens).
	- Translation as native generation. (A translation tool wrapper around any base may work better than this model alone.)

	## Roadmap

	The next planned iteration drops German/French/Spanish/Italian, focuses on Arabic + English + Classical + Code + Math, and grows to ~700M parameters with a 128K Arabic-aware tokenizer. See ADR-210 / ADR-211 in the project repo. This V4 base remains the experimental control.

	## License

	CC BY-NC 4.0 — non-commercial use only. Attribution required. No warranty, no liability.

	## Citation

	```bibtex
	@misc{arkadiko_v4_base_2026,
	author = {{VectorNomad}},
	title = {Arkadiko V4: A 214M Arabic-Inclusive Pretrained Base Model},
	year = {2026},
	publisher = {Hugging Face},
	howpublished = {\url{https://huggingface.co/VectorNomad/arkadiko-v4-base}}
	}
	```

	## Acknowledgements

	Trained on a single RTX PRO 4000 Blackwell. Bridges, not factories.