---
license: apache-2.0
language:
- ar
- en
base_model: Qwen/Qwen2.5-0.5B
pipeline_tag: text-generation
tags:
- arabic
- edge
- small-language-model
- sft
- dpo
- gguf
- qwen2
library_name: transformers
model-index:
- name: RightNow-Arabic-0.5B-Turbo
  results:
  - task:
      type: text-generation
      name: COPA-ar
    dataset:
      name: copa_ar
      type: copa_ar
    metrics:
    - type: acc_norm
      value: 58.4
      name: Accuracy (norm)
  - task:
      type: text-generation
      name: Arabic HellaSwag
    dataset:
      name: arabic_mt_hellaswag
      type: arabic_mt_hellaswag
    metrics:
    - type: acc_norm
      value: 26.0
      name: Accuracy (norm)
  - task:
      type: text-generation
      name: ArabicMMLU
    dataset:
      name: arabic_leaderboard_arabic_mmlu
      type: arabic_leaderboard_arabic_mmlu
    metrics:
    - type: acc
      value: 23.2
      name: Accuracy
---
# RightNow-Arabic-0.5B-Turbo

### The smallest open Arabic-specialized decoder LLM

**518M parameters | 398 MB on disk (q4_k_m) | 635 tok/s on H100**

*Built by [RightNow AI](https://rightnowai.co)*

---

## What is this?

RightNow-Arabic-0.5B-Turbo is a **518M-parameter Arabic-specialized language model** built on top of Qwen2.5-0.5B via vocabulary injection, continued pretraining, supervised fine-tuning, and direct preference optimization. It is the **smallest open Arabic-specialized decoder LLM** with publicly available weights.

The model targets **edge deployment**: phones, laptops, embedded devices, and browsers. Quantized to q4_k_m, it fits in 398 MB and generates 635 tokens/s at batch size 1 on an H100.

## Key Features

- **27,032 new Arabic tokens** added via mean-subtoken initialization, cutting Arabic tokenizer fertility by 17.3% (2.18 to 1.80 tokens/word)
- **504M Arabic pretraining tokens** (Arabic Wikipedia) on 8xH100 SXM5 with FSDP + FlashAttention varlen + Liger fused kernels
- **129,116 Arabic instruction pairs** for SFT with response-only loss masking
- **6,750 Arabic preference pairs** for DPO
- **Weight soup merging** (DPO 50%, SFT 25%, Pretrain 25%), which adds 0.44 points of mean accuracy over the DPO checkpoint alone
- **4 GGUF variants** (f16, q8_0, q5_k_m, q4_k_m) for instant edge deployment

## Benchmarks

All models were evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.11 using `apply_chat_template=True` and `limit=200`. `acc_norm` is reported where the task provides it, otherwise `acc`.

### Head-to-head comparison

| Model | Params | COPA-ar | HellaSwag-ar | ArabicMMLU | Mean |
|-------|--------|---------|-------------|------------|------|
| **RightNow-Arabic-0.5B-Turbo (ours)** | **518M** | **58.4%** | **26.0%** | 23.2% | **35.9%** |
| Qwen2.5-0.5B-Instruct | 494M | 53.9% | 22.5% | **26.0%** | 34.1% |
| Falcon-H1-0.5B-Instruct | 524M | 44.9% | 23.0% | 24.2% | 30.7% |
| Falcon-H1-1.5B-Instruct | 1.5B | 58.4% | 27.5% | 32.7% | 39.5% |
| AceGPT-7B-chat | 7B | 69.7% | 27.0% | 35.0% | 43.9% |
| ALLaM-7B-Instruct | 7B | 68.5% | 29.0% | 52.2% | 49.9% |
| SILMA-9B-Instruct | 9B | 69.7% | 38.0% | 52.9% | 53.5% |

Bold marks the best score among the 0.5B models. All deltas below are in percentage points.

**Among 0.5B models**: best on COPA-ar (+4.5 vs Qwen), best on HellaSwag-ar (+3.5 vs Qwen), best mean (+1.8 vs Qwen, +5.2 vs Falcon).
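
The Mean column and the mean deltas quoted for the 0.5B models can be reproduced from the per-task scores. The snippet below is a minimal sketch with the scores copied from the comparison table; the `SCORES` dictionary and the `mean_score` helper are illustrative names, not part of lm-evaluation-harness, and the unweighted three-task mean is assumed from the table values.

```python
# Per-task scores (percent) copied from the comparison table:
# (COPA-ar, HellaSwag-ar, ArabicMMLU)
SCORES = {
    "RightNow-Arabic-0.5B-Turbo": (58.4, 26.0, 23.2),
    "Qwen2.5-0.5B-Instruct": (53.9, 22.5, 26.0),
    "Falcon-H1-0.5B-Instruct": (44.9, 23.0, 24.2),
}


def mean_score(model: str) -> float:
    """Unweighted mean over the three tasks, rounded to one decimal."""
    values = SCORES[model]
    return round(sum(values) / len(values), 1)


ours = mean_score("RightNow-Arabic-0.5B-Turbo")
for rival in ("Qwen2.5-0.5B-Instruct", "Falcon-H1-0.5B-Instruct"):
    # Deltas are computed on the rounded means, as in the table.
    delta = round(ours - mean_score(rival), 1)
    print(f"{rival}: mean {mean_score(rival)}, ours {ours}, delta {delta:+}")
```

Because the deltas are taken on means already rounded to one decimal, they match the figures in the table rather than the unrounded differences.
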

**Ties Falcon-H1-1.5B** on COPA-ar (both 58.4%) at one-third the parameters.

**Recovers 67%** of SILMA-9B mean accuracy at 5.8% of the parameters.

## Available Formats

| Format | Size | Speed (tok/s, bs=1, H100) | Use case |
|--------|------|---------------------------|----------|
| bf16 | 1.04 GB | 82 (HF generate) | Fine-tuning, research |
| int8 | 664 MB | -- | Reduced memory inference |
| GGUF f16 | 988 MB | 582 | Maximum quality |
| **GGUF q8_0** | **525 MB** | **646** | Best speed |
| GGUF q5_k_m | 419 MB | 634 | Balanced |
| **GGUF q4_k_m** | **398 MB** | **635** | Smallest footprint |

## Quick Start

### With Transformers (bf16)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RightNowAI/RightNow-Arabic-0.5B-Turbo"

# The bf16 weights and tokenizer live in the "bf16" subfolder of the repo.
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="bf16")
model = AutoModelForCausalLM.from_pretrained(
    model_id, subfolder="bf16", torch_dtype="bfloat16", device_map="auto"
)

messages = [
    # System: "You are a smart assistant that answers in Modern Standard Arabic"
    {"role": "system", "content": "أنت مساعد ذكي يجيب باللغة العربية الفصحى"},
    # User: "What is the capital of Saudi Arabia?"
    {"role": "user", "content": "ما هي عاصمة المملكة العربية السعودية؟"}
]

# Render the chat template to text, then tokenize and move to the model's device.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

### With llama.cpp (GGUF)

```bash
# Download the q8_0 quantization (best speed)
huggingface-cli download RightNowAI/RightNow-Arabic-0.5B-Turbo \
  gguf/RightNow-Arabic-0.5B-Turbo-q8_0.gguf --local-dir .

# Run inference (the download keeps the repo's gguf/ subfolder under --local-dir)
# Prompt: "What is the largest city in Egypt?"
./llama-cli -m gguf/RightNow-Arabic-0.5B-Turbo-q8_0.gguf \
  -p "ما هي أكبر مدينة في مصر؟" \
  -n 128 --temp 0.7
```

## Training Pipeline

```
Qwen2.5-0.5B (494M, 151,665 vocab)
    |
    v
Tokenizer Surgery (+27,032 Arabic tokens -> 178,697 vocab)
    - SentencePiece unigram 32k on 12.5 GB Arabic corpus
    - Mean-subtoken embedding initialization
    - Fertility: 2.18 -> 1.80 tok/word (-17.3%)
    |
    v
Continued Pretraining (504M arwiki tokens)
    - 2,500 steps, 8xH100 SXM5
    - FSDP _HYBRID_SHARD_ZERO2 + FlashAttention varlen + Liger
    - Loss: 14.21 -> 1.69 | Wall time: 6h 57m
    |
    v
Supervised Fine-Tuning (129,116 instructions)
    - 5 datasets: evol-instruct-arabic, alpaca-gpt4-arabic,
      sharegpt-arabic, CIDAR, aya_dataset
    - Response-only loss masking (72.1% of tokens carry loss)
    - 5 epochs, 418 steps | Wall time: 12m
    |
    v
Direct Preference Optimization (6,750 pairs)
    - argilla-dpo-mix-7k-arabic
    - 2 epochs, 844 steps | Wall time: 34m
    |
    v
Weight Soup Merging
    - Linear(DPO 0.5, SFT 0.25, Pretrain 0.25)
    - +0.44 points mean accuracy over DPO alone
    |
    v
Export: bf16, int8, GGUF {f16, q8_0, q5_k_m, q4_k_m}
```

## Training Data

| Dataset | Examples/Tokens | Use |
|---------|----------------|-----|
| Arabic Wikipedia (wikimedia/wikipedia 20231101.ar) | 504M tokens | Continued pretraining |
| FreedomIntelligence/evol-instruct-arabic | 59,022 | SFT |
| FreedomIntelligence/alpaca-gpt4-arabic | 49,969 | SFT |
| FreedomIntelligence/sharegpt-arabic | 5,231 | SFT |
| arbml/CIDAR | 10,000 | SFT |
| CohereForAI/aya_dataset (Arabic) | 4,947 | SFT |
| 2A2I/argilla-dpo-mix-7k-arabic | 6,750 | DPO |

## Limitations

- **Knowledge ceiling**: At 518M parameters, the model lags 7B+ models by 12-30 points on ArabicMMLU-style knowledge tasks. This is a parameter-count limit, not a training limit.
- **MSA only**: Trained on Wikipedia (Modern Standard Arabic). Prompts in dialect (Egyptian, Gulf, Levantine) receive MSA responses.
- **504M pretraining tokens**: About one token per parameter, which is below the Chinchilla-optimal ratio. More Arabic data would improve knowledge tasks.
- **DPO was weak**: 6,750 machine-translated preference pairs provided minimal signal at 0.5B scale. The weight soup merge was more impactful.
- **GGUF tile alignment**: q4_k_m and q5_k_m fall back to higher-bit quantization for 144/290 tensors due to the expanded vocabulary not aligning with k-quant tile sizes.

## Hardware

All training ran on a single Nebius `gpu-h100-sxm` node:

- 8x NVIDIA H100 80 GB SXM5 HBM3, NVLink4
- 128 vCPUs, 1.5 TiB RAM
- CUDA 13.0, PyTorch 2.11, flash-attn 2.8.3, transformers 5.5.0

## Citation

```bibtex
@article{jaber2025rightnow,
  title={RightNow-Arabic-0.5B-Turbo: An Open Sub-1B Arabic Language Model via Vocabulary Injection and Edge-First Deployment},
  author={Jaber, Jaber and Jaber, Osama},
  year={2025},
  url={https://huggingface.co/RightNowAI/RightNow-Arabic-0.5B-Turbo}
}
```

## License

Apache 2.0 (same as the base Qwen2.5-0.5B model).

---

*Built by [RightNow AI](https://rightnowai.co)*