---
license: cc-by-nc-4.0
language:
- ar
- en
- de
- fr
- es
- it
tags:
- arkadiko
- arabic
- bilingual
- pretrained
- causal-lm
- research
library_name: transformers
pipeline_tag: text-generation
---

# Arkadiko V4 - Base (pretrained, no SFT)

214M-parameter causal decoder pretrained from scratch on ~100B tokens across 9 domains. **Pretraining only: no instruction tuning, no chat alignment, no RLHF.** Released as a research artifact.

This is **V4**, not V5. The Arkadiko model family advances to V5 only after demonstrating four post-SFT capabilities (multi-turn chat, ar↔en translation, tool calling, structured thinking). None of those have been validated on this checkpoint. See the [Honest Limitations](#honest-limitations) section before considering use.

## Quick facts

| | |
|---|---|
| Parameters | 213,934,720 |
| Architecture | Pure causal decoder, 18 layers |
| Hidden size | 640 |
| Attention | GQA, 10 query heads / 2 KV heads, head_dim=64 |
| FFN | SwiGLU, hidden=3456 (≈5.4×) |
| Vocab | 60,000 (SentencePiece BPE) |
| Context | 2,048 tokens |
| Position | RoPE, theta=10000 |
| Tied embeddings | No (separate `wte` and `lm_head`) |
| Tokens trained | 100,000,006,144 (~100B) |
| Training steps | 9,114,584 |
| Training hours | 524.7 |
| Hardware | 1× NVIDIA RTX PRO 4000 Blackwell (24GB) |
| Run completed | 2026-05-06 |

## Final evaluation (held-out per-domain)

Loss in nats, perplexity = exp(loss). Best-ever overall val PPL was **26.6** at step 8,815k; the released final checkpoint is at PPL ~28.8 (cosine-tail polish).

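
The conversion between loss and perplexity stated above can be checked in a few lines. This is a minimal sketch, not code from the training run: the function names `loss_to_perplexity` and `perplexity_to_loss` are illustrative, and the only inputs are the two overall perplexities quoted in this section.

```python
import math


def loss_to_perplexity(loss_nats: float) -> float:
    """Convert a mean cross-entropy loss in nats to perplexity."""
    return math.exp(loss_nats)


def perplexity_to_loss(perplexity: float) -> float:
    """Inverse conversion: perplexity back to a loss in nats."""
    return math.log(perplexity)


# Overall perplexities quoted in this section (hypothetical variable names).
best_loss = perplexity_to_loss(26.6)   # best checkpoint, step 8,815k
final_loss = perplexity_to_loss(28.8)  # released final checkpoint

print(f"best:  {best_loss:.2f} nats")
print(f"final: {final_loss:.2f} nats")
print(f"gap:   {final_loss - best_loss:.3f} nats")
```

Because perplexity is exponential in the loss, the gap between the best and the released checkpoint is small when expressed in nats (roughly 3.28 versus 3.36), even though the perplexities differ by about two points.
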

| Domain | Val loss (MA3) | Perplexity |
|---|---|---|
| code | 1.93 | 6.9 |
| math | 3.10 | 22.1 |
| fr | 3.32 | 27.7 |
| es | 3.43 | 30.9 |
| it | 3.50 | 32.9 |
| de | 3.57 | 35.6 |
| en | 3.75 | 42.5 |
| classical (Arabic) | 3.78 | 43.7 |
| **ar (modern)** | **3.80** | **44.5** |
| **overall** | 3.36 | 28.8 |

## Training data

Roughly:

| Domain | Tokens | Source |
|---|---|---|
| Arabic (modern) | 24B | ArabicWeb24 + cc100-ar + CulturaX-ar |
| English | 28B | FineWeb-Edu |
| German | 12B | cc100-de |
| French | 8B | cc100-fr |
| Spanish | 8B | cc100-es |
| Italian | 7B | cc100-it |
| Code | 8B | CodeParrot + StarCoderData |
| Math | 7B | OpenWebMath |
| Classical Arabic | 2.7B | Custom (hadith, tafsir, OpenITI, poetry, tashkeela) |

A single SentencePiece BPE tokenizer is shared across all 9 domains. **Token fertility is uneven:** Arabic averages roughly 2× the tokens per word of English in this vocab, which we believe is a primary cause of the weaker Arabic perplexity. The next iteration uses an Arabic-aware tokenizer (see [Roadmap](#roadmap)).

## Honest limitations

This base model has known structural failures, verified through completion testing across the run. Use accordingly.

1. **Coherent generation horizon ≈ 50 tokens.** Past that, output drifts, loops on a topic, or repeats. Capacity-bound at this size; SFT cannot extend it.
2. **No factual recall in long form.** For capitals, public figures, and dates, the model produces fluent confabulation, not facts. Pair it with retrieval or tools; do not deploy it as a Q&A system.
3. **Cross-language code bleed.** Code prompts in one programming language frequently produce output flavored by another (JS prompt → Python output). Vocab-level issue.
4. **Arabic, the primary target language, is the worst text domain by PPL.** Modern Arabic (44.5) and Classical Arabic (43.7) have the two highest perplexities in the evaluation table. Surface fluency reaches ~30-50 token spans; long-form Arabic reasoning is not present. The "Arabic-first" framing was not delivered at this scale.
5.
   **No safety alignment.** No RLHF, no DPO, no toxicity filtering of training data beyond source-level curation. Outputs may be biased, false, or offensive.
6. **No instruction-following.** Base model only. Will not reliably follow chat templates, refuse harmful requests, or call tools.

### Configuration / tokenizer ID misalignment (read before using)

The `config.json` shipped here records the values used during training: `bos_token_id=0, eos_token_id=2, pad_token_id=1`. The actual SentencePiece model (`tokenizer.model`) defines these tokens at different IDs:

| Token | SPM ID | config.json |
|---|---|---|
| `` | 0 | (not specified) |
| `` | 1 | `bos_token_id=0` |
| `` | 2 | `eos_token_id=2` |
| `` | 3 | `pad_token_id=1` |

Only the EOS ID agrees between the two; the BOS and PAD IDs differ.

**Use the IDs from the SPM model when serving.** `tokenizer_config.json` lists the SPM-derived IDs in `added_tokens`. The misaligned values in `config.json` are preserved for reproducibility, because the model was trained with them, but downstream code should treat the SPM model as the source of truth.

This also affects all other special tokens, which the SPM model places at IDs 7–14:

```
=7 =8 =9 =10 =11 =12 =13 =14
```

`` is the only special with a paired closer; `` and `` content is bounded by `` rather than a closing tag.

## Loading

The model uses a custom architecture (`ArkadikoForCausalLM`) which is not part of `transformers` upstream. To load weights, use the `arkadiko/llm/model.py` definition from the project repo, or load the `safetensors` tensors directly:

```python
import json

from safetensors.torch import load_file

# Load the raw tensors and the training-time configuration.
state_dict = load_file("model.safetensors")
with open("config.json") as f:
    config = json.load(f)

# Initialize your ArkadikoConfig + ArkadikoForCausalLM
# (see https://github.com/... for the model code)
# model.load_state_dict(state_dict, strict=False)
```

The repository code is not yet public. Drop a note in the discussions tab if you need it earlier than the planned release.

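
When wiring up your own loader, the ID misalignment described earlier can be handled with a small override step before serving. The following is a minimal sketch under stated assumptions, not part of the Arkadiko codebase: `reconcile_special_ids` and `SPM_SPECIAL_IDS` are hypothetical names, and the IDs are copied by hand from the misalignment table rather than read from `tokenizer.model`.

```python
import json

# SPM-derived IDs, copied from the misalignment table (assumption: hardcoded
# here for illustration). With the `sentencepiece` package installed, they
# could instead be read from tokenizer.model via
# SentencePieceProcessor(model_file="tokenizer.model").bos_id() / .eos_id() / .pad_id().
SPM_SPECIAL_IDS = {"bos_token_id": 1, "eos_token_id": 2, "pad_token_id": 3}


def reconcile_special_ids(config: dict, spm_ids: dict) -> tuple[dict, dict]:
    """Return (serving_config, changes) without mutating `config`.

    `changes` maps each overridden key to (config_value, spm_value).
    Hypothetical helper, not part of the released code.
    """
    serving = dict(config)
    changes = {}
    for key, spm_value in spm_ids.items():
        old = config.get(key)
        if old != spm_value:
            changes[key] = (old, spm_value)
        serving[key] = spm_value
    return serving, changes


# Training-time IDs as recorded in the shipped config.json.
shipped = {"bos_token_id": 0, "eos_token_id": 2, "pad_token_id": 1}
serving_config, changes = reconcile_special_ids(shipped, SPM_SPECIAL_IDS)
print(json.dumps(changes))
```

Run against the shipped values, the helper reports overrides for BOS and PAD only, since EOS already matches. Returning the list of changes instead of overwriting silently keeps the training-time values visible, which matches the intent of preserving `config.json` for reproducibility.
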
## What this artifact is good for

- **Research baseline.** Reproducible 214M / 100B-token Arabic-inclusive base.
- **SFT experiments.** Suitable starting point for short-context, structured-output tasks (tool calling, format compliance) at small scale.
- **Capability-curve studies.** Final eval and run log are included; full per-checkpoint curve available on request.

## What this artifact is **not** good for

- Production chat or assistant deployment.
- Factual question answering.
- Long-form generation (>50 tokens).
- Translation as native generation. (A translation tool wrapper around any base may work better than this model alone.)

## Roadmap

The next planned iteration drops German/French/Spanish/Italian, focuses on Arabic + English + Classical + Code + Math, and grows to ~700M parameters with a 128K Arabic-aware tokenizer. See ADR-210 / ADR-211 in the project repo. This V4 base remains the experimental control.

## License

**CC BY-NC 4.0**: non-commercial use only. Attribution required. No warranty, no liability.

## Citation

```bibtex
@misc{arkadiko_v4_base_2026,
  author = {{VectorNomad}},
  title = {Arkadiko V4: A 214M Arabic-Inclusive Pretrained Base Model},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/VectorNomad/arkadiko-v4-base}}
}
```

## Acknowledgements

Trained on a single RTX PRO 4000 Blackwell. Bridges, not factories.