Hakeem-7B is a RESEARCH PREVIEW — not a medical device and not a substitute for a licensed clinician. It can produce confident but incorrect or unsafe answers, including wrong drug and dosing information. By requesting access you agree to use it for research and evaluation only — never for clinical, diagnostic, triage, or treatment decisions — and to the Falcon-LLM license inherited from the base model.

Hakeem-7B — Arabic Medical Reasoning

حكيم-7B  ·  Arabic Medical Reasoning  ·  by Vionex Digital Solutions
Built on Falcon-H1-7B · reasons step-by-step before answering


⚠️ RESEARCH PREVIEW — NOT MEDICAL ADVICE

Hakeem-7B is an experimental research model — in active research, not a finished product. It is NOT a medical device, NOT a clinical tool, and NOT a substitute for a qualified healthcare professional. It can and does produce confident but INCORRECT and potentially UNSAFE answers — including wrong drug information, wrong dosing, and mismanagement of emergencies. Do not use it to make any health decision. Always consult a licensed clinician. Released for research and evaluation only.


TL;DR

English: Hakeem-7B is an Arabic medical reasoning model by Vionex Digital Solutions, built on Falcon-H1-7B-Instruct. It reasons step-by-step before answering medical questions in Arabic (Modern Standard, Egyptian, Gulf) and English. On the MedAraBench Arabic-medical benchmark it scores above every other model in its 7–8B class, improves on its own base model by 6.2 points on the clean harness (3.7 points on the strict shared-parser ladder), outscores the 70B-parameter medical specialist OpenBioLLM-70B, and is statistically level with Gemma-2-27B and Med42-70B, models 4–10× its size. A blind LLM-judge panel ranks its reasoning quality above Gemma-2-27B and OpenBioLLM-70B. It is a research preview only.

Arabic (translated): «Hakeem-7B» is an Arabic medical AI model developed by Vionex Digital Solutions, built on Falcon-H1-7B. It is trained to think step by step before answering medical questions in Arabic (Modern Standard, Egyptian, and Gulf) and English. On the MedAraBench benchmark it outperforms all models in its class (7–8 billion parameters), exceeds its base model by +6.2 points, outperforms the specialized medical model OpenBioLLM-70B, and rivals models 4 to 10 times its size. ⚠️ A research model for experimentation only; it is not a medical tool and not a substitute for a specialist physician.


Model overview

| Field | Value |
|---|---|
| Model | Hakeem-7B (حكيم-7B) |
| Developer | Vionex Digital Solutions |
| Base model | tiiuae/Falcon-H1-7B-Instruct |
| Architecture | Falcon-H1 hybrid Mamba-2 (SSM) + Attention decoder, 44 layers, hidden size 3072 |
| Parameters | ≈ 7.6 B (bf16, 4 safetensors shards ≈ 15.2 GB) |
| Max context | up to 256K positions (per base config; long-context not separately evaluated here) |
| Languages | Arabic (MSA · Egyptian · Gulf) + English |
| Domain | Medicine / basic medical science |
| Training | DAPT → SFT (chain-of-thought) → DPO → identity-card SFT |
| Inference mode | Think-first (reason, then answer); the default and the evaluated mode |
| License | Falcon-LLM license (inherited from base) |
| Status | Research preview |

Intended use

  • Research on Arabic medical NLP, reasoning, and domain adaptation of hybrid-SSM models.
  • Benchmarking and evaluation of Arabic medical question answering.
  • Education and exploration of medical reasoning with a human expert in the loop.

Out of scope (do not use for)

  • Any real clinical, diagnostic, triage, dosing, or treatment decision.
  • Patient-facing medical advice, or anything resembling a medical device.
  • High-stakes or autonomous deployment of any kind.

⚠️ Two things that ARE the recipe — don't skip them

  1. Serve the identity card as the system prompt. Identity lives in the prompt, not the weights. The card ships baked into chat_template.jinja, so apply_chat_template prepends it automatically. With the card, Hakeem correctly identifies as Hakeem / Vionex on 29 of 30 adversarial probes (jailbreak and impersonation attempts).
  2. Use think-first mode (the default). Hakeem reasons step-by-step, then writes «الإجابة النهائية:» / "Final answer:". In answer-only mode it over-refuses and is weaker. All reported benchmark numbers are think-first.
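
Because think-first output ends with a fixed marker, downstream code usually needs to separate the reasoning from the final answer. The following is a minimal sketch of one way to do that; the two marker strings come from the text above, while the function name `split_reasoning` and the choice to take the last marker occurrence are assumptions, not part of the released tooling.

```python
from typing import Optional, Tuple

# Marker strings as given in the model card; everything else here is illustrative.
FINAL_MARKERS = ("الإجابة النهائية:", "Final answer:")

def split_reasoning(text: str) -> Tuple[str, Optional[str]]:
    """Return (reasoning, final_answer); final_answer is None if no marker is present."""
    best_idx, best_marker = -1, ""
    for marker in FINAL_MARKERS:
        idx = text.rfind(marker)  # last occurrence, in case the reasoning quotes the marker
        if idx > best_idx:
            best_idx, best_marker = idx, marker
    if best_idx < 0:
        return text.strip(), None
    reasoning = text[:best_idx].strip()
    # Drop surrounding whitespace and guillemets from the answer span.
    answer = text[best_idx + len(best_marker):].strip(" \n«»")
    return reasoning, answer

reasoning, answer = split_reasoning("Step 1: compare the options.\nFinal answer: B")
print(answer)
```

Returning `None` when no marker is found lets an evaluation harness count that response as a parse failure instead of guessing an answer.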

Quick start

Think-first and the identity card are applied automatically by the chat template.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Vionex-digital/Hakeem-7B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

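messages = [
    {"role": "user",
     # English gloss: "A patient has hypertension and type 2 diabetes; which
     # blood-pressure drug is most suitable, and why?"
     "content": "مريض عنده ضغط مرتفع وسكري نوع 2، ما أنسب دواء لخفض الضغط ولماذا؟"}
]
# The identity card + think-first instruction are injected by chat_template.jinja.
# return_dict=True also returns the attention mask, which is then passed to generate().
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=768, do_sample=False, temperature=None)
prompt_len = inputs["input_ids"].shape[-1]  # decode only the newly generated tokens
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))
# -> step-by-step reasoning, then «الإجابة النهائية: ...»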

Convenience wrapper (for fast evaluation or serving, prefer vLLM):

# Convenience helper (ships in deploy/hakeem_chat.py); the card and think-first mode are handled for you.
# The import below assumes deploy/ is on your Python path.
from hakeem_chat import Hakeem
h = Hakeem("Vionex-digital/Hakeem-7B")
# English gloss: "What is the difference between type 1 and type 2 diabetes?"
print(h.chat("ما الفرق بين السكري من النوع الأول والنوع الثاني؟"))   # mode="think" default; mode="direct" for answer-only

Falcon-H1 needs transformers >= 4.57 (developed on 4.57.6). For batched generation prefer vLLM; some transformers versions have batched-generation quirks on Falcon-H1 — use batch size 1 if you see degenerate output on raw HF batched calls.


Training pipeline

Hakeem is the base model carried through four stages. GRPO was explored but did not beat this stack on the clean benchmark and is not part of the released model.

| Stage | What | Data | Purpose |
|---|---|---|---|
| 0. Base | Falcon-H1-7B-Instruct | | Hybrid Mamba/attention foundation |
| 1. DAPT | Domain-adaptive pre-training | 14,404 Arabic medical docs (~2.08M words), 3 epochs | Inject Arabic basic-medical-science knowledge |
| 2. SFT | Supervised fine-tuning on chain-of-thought | 9,714 think-first pairs distilled from Claude Opus 4.6 (decontaminated against the test set) | Teach step-by-step medical reasoning |
| 3. DPO | Direct preference optimization (β = 0.1) | 387 hard preference ("bleed") pairs | Sharpen answer selection / reduce failure modes |
| 4. Identity-card SFT | Light identity install + served card | identity/jailbreak probes | Consistent "Hakeem / Vionex" identity |

The +6.2-point capability gain comes primarily from DAPT + SFT + DPO; the identity stage is about who the model says it is, not medical capability (medical accuracy held, ≈ +0.5pt).
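
For readers unfamiliar with the DPO stage, the sketch below evaluates the standard DPO objective for a single preference pair with β = 0.1, the value quoted above. The training code for Hakeem is not shown in this card, so this is the textbook formula rather than the authors' implementation; the function name and the example log-probabilities are made up for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair, given summed sequence log-probabilities."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(x)) = softplus(-x), written in a numerically stable form
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# Illustrative numbers only: the policy has moved toward the chosen answer.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
# When the policy equals the reference, the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

A small β such as 0.1 keeps the policy close to the reference model: the log-probability margin must grow substantially before the loss drops much below log 2.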


Training data

DAPT corpus composition

DAPT v2 corpus — 14,404 documents / ~2.08M words. An Arabic basic-medical-science corpus, generated and then critic-refined by an LLM judge for factual precision (≈ 68% clean / 29% minor issues / 2.5% major after refinement). Composition:

| Topic | Share |
|---|---|
| Anatomy | 26% |
| Physiology | 14% |
| Biochemistry | 12% |
| Physics | 11% |
| Cell Biology | 9% |
| Other (genetics, histology, microbiology, pharmacology, …) | 28% |

SFT: 9,714 think-first pairs distilled from Claude Opus 4.6 chain-of-thought, then decontaminated against the evaluation set (a 7.35% leak was found and removed before the released run).

DPO: 387 hard preference pairs targeting the model's own residual error modes.
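
The card does not say how the SFT set was decontaminated. As one plausible approach, the sketch below drops any training item that shares a word n-gram with an evaluation item and reports the fraction removed. The function names, the whitespace tokenization, and the default `n=8` are all assumptions chosen for illustration, not the procedure actually used.

```python
def ngrams(text, n=8):
    """Set of whitespace-token n-grams; empty for texts shorter than n tokens."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_items, test_items, n=8):
    """Drop training items that share any n-gram with the test set.

    Returns (kept_items, leak_rate), where leak_rate is the fraction removed.
    """
    test_grams = set()
    for item in test_items:
        test_grams |= ngrams(item, n)
    kept = [x for x in train_items if not (ngrams(x, n) & test_grams)]
    leak_rate = 1 - len(kept) / len(train_items) if train_items else 0.0
    return kept, leak_rate

# Toy data with n=3 so the overlap is visible at a glance.
train = ["a b c d e", "x y z w", "q r s t"]
test = ["zz b c d yy"]
kept, leak = decontaminate(train, test, n=3)
print(kept, round(leak, 3))
```

Exact n-gram matching misses paraphrased leaks, and items shorter than `n` tokens are never flagged, so a real pipeline would likely add normalization or fuzzy matching on top.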


Architecture

Hakeem inherits Falcon-H1's hybrid design: every decoder block runs a Mamba-2 state-space mixer and attention heads in parallel, combining linear-time long-range mixing with attention's precision. Key dimensions (from config.json):

| Field | Value |
|---|---|
| model_type | falcon_h1 (FalconH1ForCausalLM) |
| Decoder layers | 44 |
| Hidden size | 3072 |
| Attention heads / KV heads | 12 / 2 (grouped-query) · head dim 128 |
| Mamba-2 | SSM state 256 · 24 heads · d_ssm 3072 · conv 4 |
| MLP intermediate | 12288 (SiLU) |
| Vocab | 130,049 |
| Max position embeddings | 262,144 |
| Dtype | bfloat16 |
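
The dimensions above are enough for a rough memory estimate. The sketch below recomputes the weight footprint and estimates the attention KV cache, assuming that all 44 blocks keep a bf16 KV cache (consistent with every block having attention heads) and ignoring the Mamba-2 state, activations, and framework overhead. Treat the result as a back-of-the-envelope figure, not a measured one.

```python
PARAMS = 7.6e9             # approximate parameter count from the overview
BYTES_PER_VALUE = 2        # bfloat16
LAYERS, KV_HEADS, HEAD_DIM = 44, 2, 128
MAX_POSITIONS = 262_144    # 256 * 1024

weights_gb = PARAMS * BYTES_PER_VALUE / 1e9

# Keys and values: two tensors per layer, each KV_HEADS * HEAD_DIM values per token.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
kv_gib_full_context = kv_bytes_per_token * MAX_POSITIONS / 2**30

print(f"weights ≈ {weights_gb:.1f} GB")
print(f"KV cache ≈ {kv_bytes_per_token} bytes/token, "
      f"{kv_gib_full_context:.1f} GiB at {MAX_POSITIONS} tokens")
```

With only 2 KV heads (grouped-query attention), the cache stays near 44 KiB per token, which is why even the full 262,144-position window fits in about 11 GiB under these assumptions.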

Identity card mechanism

Identity card flow

Identity and the think-first instruction are served, not memorized. chat_template.jinja prepends a bilingual (AR + EN) system card on every turn:

  1. "You are «حكيم» / Hakeem-7B by Vionex Digital Solutions, built on Falcon-H1."
  2. An anti-impersonation clause — refuse to claim to be ChatGPT / Claude / Gemini / DeepSeek or to role-play another AI under pressure.
  3. A think-first instruction — reason step-by-step, then state the final answer.

When a caller supplies their own system message, the card is prepended before it, so identity and safety framing survive custom system prompts. This is why the card is baked into the template rather than left to the integrator.
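
The prepend behaviour described above can be mirrored in plain Python for testing or for backends that bypass the Jinja template. This is a sketch under assumptions: the real logic lives in chat_template.jinja, the card text below is a shortened placeholder rather than the shipped card, and merging the card into a single system message (instead of emitting two) is a design guess.

```python
# Placeholder card text; the shipped bilingual card is longer.
IDENTITY_CARD = "You are Hakeem-7B by Vionex Digital Solutions, built on Falcon-H1."

def with_identity_card(messages, card=IDENTITY_CARD):
    """Return a new message list whose system message starts with the card."""
    if messages and messages[0]["role"] == "system":
        # Caller supplied a system prompt: keep it, but place the card first.
        merged = {"role": "system",
                  "content": card + "\n\n" + messages[0]["content"]}
        return [merged] + list(messages[1:])
    return [{"role": "system", "content": card}] + list(messages)

custom = [{"role": "system", "content": "Be brief."},
          {"role": "user", "content": "Who are you?"}]
for message in with_identity_card(custom):
    print(message["role"], "->", message["content"][:40])
```

Placing the card before the caller's text means a custom system prompt can add instructions but cannot silently replace the identity and safety framing.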


Evaluation

Benchmark: MedAraBench — native Arabic medical-school multiple-choice questions. We evaluate on a cleaned split (manually de-duplicated / fixed-key, n = 3692) in think-first mode.

Two harnesses; read both honestly.

  • A clean capability harness (vLLM, our default) measures Hakeem at 0.600 vs base 0.538, a gain of 6.2 points.
  • A strict shared-parser ladder scores all 14 models identically (n = 3692); there Hakeem is 0.575 (accuracy over all questions) / 0.585 (valid-parse only) vs base 0.538, a gain of 3.7 points.

The base scores 0.538 under both harnesses. Absolute numbers are parser-dependent, so trust the ranking.
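
The difference between "accuracy over all questions" and "valid-parse only" is purely a matter of the denominator. The sketch below shows both on toy data; the function name and the convention that an unparseable response is recorded as `None` are assumptions, since the harness code is not part of this card.

```python
def accuracies(predictions, gold):
    """predictions: parsed option letters, or None when no answer could be parsed."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold) if p is not None)
    valid = sum(p is not None for p in predictions)
    acc_all = correct / len(gold)                    # parse failures count as wrong
    acc_valid = correct / valid if valid else 0.0    # parse failures are excluded
    return acc_all, acc_valid

preds = ["A", "B", None, "D", "C"]
gold = ["A", "C", "B", "D", "C"]
print(accuracies(preds, gold))  # (0.6, 0.75)
```

Applied to the reported figures, 0.575 / 0.585 ≈ 0.98 implies that about 98% of Hakeem's responses parsed, i.e. roughly 60 of the 3692 were counted as wrong only because no answer could be extracted (approximate, since both inputs are rounded).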

1 · Capability vs base

Base → Hakeem delta

| Model | Accuracy (clean vLLM harness) |
|---|---|
| Falcon-H1-7B-Instruct (base) | 0.538 |
| Hakeem-7B | 0.600 (+6.2 pt) |

2 · 14-model leaderboard (shared parser, n = 3692)

Benchmark ladder

| # | Model | Size | Acc | Tier |
|---|---|---|---|---|
| 1 | DeepSeek-R1-Distill-Llama-70B | 70B | 0.698 | frontier reasoner |
| 2 | Llama-3.3-70B-Instruct | 70B | 0.665 | frontier |
| 3 | DeepSeek-R1-Distill-Qwen-32B | 32B | 0.642 | frontier reasoner |
| 4 | Qwen2.5-32B-Instruct | 32B | 0.598 | frontier |
| 5 | Med42-70B | 70B | 0.584 | medical specialist |
| 6 | Gemma-2-27B-it | 27B | 0.578 | general |
| 7 | ★ Hakeem-7B | 7B | 0.575 | this model |
| 8 | Falcon-H1-7B-Instruct (base) | 7B | 0.538 | base |
| 9 | OpenBioLLM-70B | 70B | 0.530 | medical specialist |
| 10 | Llama-3.1-70B-Instruct | 70B | 0.516 | frontier |
| 11 | Llama-2-13B-Chat | 13B | 0.460 | older |
| 12 | Llama-3.1-8B-Instruct | 8B | 0.421 | small |
| 13 | Llama-2-7B-Chat | 7B | 0.378 | older |
| 14 | DeepSeek-R1-Distill-Llama-8B | 8B | 0.367 | small reasoner |

Hakeem-7B ranks 7th of 14: above every other ≤8B model, above its own base, above the 70B-parameter OpenBioLLM-70B and Llama-3.1-70B-Instruct, and statistically level with Gemma-2-27B and Med42-70B. Only the frontier models at 32B and above are clearly ahead. Note the reasoning-distillation pattern: it helps at scale (DeepSeek-R1-Distill-Llama-70B tops the board) but hurts small models (DeepSeek-R1-Distill-Llama-8B is last); Hakeem reaches this band without that fragility.
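
The card does not state which test underlies "statistically level". As a rough check, the sketch below builds a 95% normal-approximation (Wald) interval around Hakeem's 0.575 at n = 3692 and sees which neighbouring scores fall inside it. This treats the scores as independent samples; because all models answer the same questions, a paired test such as McNemar's would be the sharper tool, so read this only as a sanity check.

```python
import math

def wald_interval(acc, n, z=1.96):
    """Normal-approximation confidence interval for a binomial proportion."""
    se = math.sqrt(acc * (1 - acc) / n)
    return acc - z * se, acc + z * se

low, high = wald_interval(0.575, 3692)
print(f"Hakeem-7B 95% interval: [{low:.3f}, {high:.3f}]")

neighbours = [("Qwen2.5-32B-Instruct", 0.598), ("Med42-70B", 0.584),
              ("Gemma-2-27B-it", 0.578), ("Falcon-H1-7B-Instruct (base)", 0.538)]
for name, acc in neighbours:
    status = "inside" if low <= acc <= high else "outside"
    print(f"{name}: {acc:.3f} is {status} the interval")
```

Under this approximation the interval is about ±1.6 points, which is consistent with the ranking claims above: Med42-70B and Gemma-2-27B fall inside it, while Qwen2.5-32B-Instruct and the base model fall outside.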

3 · Per-specialty accuracy

Per-specialty heatmap

| Specialty | Acc | n |
|---|---|---|
| Statistics | 0.746 | 268 |
| Cell Biology | 0.646 | 189 |
| Physics | 0.623 | 459 |
| Chemistry | 0.612 | 67 |
| Biochemistry | 0.604 | 280 |
| Physiology | 0.600 | 538 |
| Histology | 0.596 | 151 |
| Pharmacology | 0.545 | 55 |
| Genetics | 0.536 | 196 |
| Surgery | 0.483 | 87 |
| Anatomy | 0.452 | 772 |
| Ophthalmology | 0.435 | 214 |
| Microbiology | 0.350 | 103 |

Anatomy is the bottleneck: it is both the largest slice (772 questions, ~21%) and a weak one (0.452). DAPT raised most specialties (Cell/Molecular, Internal Medicine, Physiology, Biochemistry) but Anatomy is retention-bound, not coverage-bound — the corpus covered the topics, yet the model does not retain them. Closing Anatomy is the main lever toward ~0.62 overall.
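
The arithmetic behind the "Anatomy is the main lever" argument can be reproduced from the table. The sketch below sums the listed question counts, computes the question-weighted accuracy of the listed specialties, and estimates how much overall accuracy would move if Anatomy reached a given level. The target of 0.600 (Physiology's current level) is a hypothetical chosen for illustration, not a projection from the authors.

```python
# (accuracy, n) per specialty, copied from the table above.
SPECIALTIES = {
    "Statistics": (0.746, 268), "Cell Biology": (0.646, 189),
    "Physics": (0.623, 459), "Chemistry": (0.612, 67),
    "Biochemistry": (0.604, 280), "Physiology": (0.600, 538),
    "Histology": (0.596, 151), "Pharmacology": (0.545, 55),
    "Genetics": (0.536, 196), "Surgery": (0.483, 87),
    "Anatomy": (0.452, 772), "Ophthalmology": (0.435, 214),
    "Microbiology": (0.350, 103),
}
TOTAL_N = 3692  # size of the cleaned split

listed_n = sum(n for _, n in SPECIALTIES.values())
weighted_acc = sum(acc * n for acc, n in SPECIALTIES.values()) / listed_n
anatomy_acc, anatomy_n = SPECIALTIES["Anatomy"]

def overall_gain(target_acc):
    """Change in overall accuracy if Anatomy alone moved to target_acc."""
    return (target_acc - anatomy_acc) * anatomy_n / TOTAL_N

print(f"listed questions: {listed_n} of {TOTAL_N}")
print(f"weighted accuracy over listed specialties: {weighted_acc:.3f}")
print(f"Anatomy share of the split: {anatomy_n / TOTAL_N:.1%}")
print(f"overall gain if Anatomy reached 0.600: {overall_gain(0.600):+.3f}")
```

Two things fall out of this. The table accounts for 3,379 of the 3,692 questions, so some specialties are not listed. And lifting Anatomy to 0.600 would by itself add about 3 points overall, which is the scale of improvement the paragraph above has in mind.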

4 · Reasoning quality (blind LLM-judge panel)

CoT quality radar

Blind dual-judge panel (Claude Opus 4.5 + Sonnet 4.6), scored 1–5 (n = 90) on coherence, medical soundness, faithfulness, language, and structure:

| Model | CoT-quality (1–5) |
|---|---|
| Llama-3.3-70B | 3.25 |
| Qwen2.5-32B | 3.18 |
| Hakeem-7B | 2.82 |
| Falcon-H1-7B (base) | 2.57 |
| Gemma-2-27B | 2.55 |
| OpenBioLLM-70B | 2.41 |

Hakeem's reasoning quality ranks above the 27B general model (Gemma-2-27B) and the 70B medical specialist OpenBioLLM-70B, and is +0.25 over its own base; that is, SFT/DPO improved how it reasons, not just final accuracy. The remaining gap to the two frontier models above it (Llama-3.3-70B and Qwen2.5-32B) is concentrated in medical soundness (knowledge); structure, faithfulness, coherence, and language are competitive.

Per-dialect results — planned, not yet released

Hakeem is trained on Arabic data in three varieties (MSA, Egyptian, Gulf), but dedicated per-dialect evaluation sets (Khaleeji-200 / Egyptian-200) are not yet built, so per-dialect numbers are intentionally omitted rather than estimated. They will be added when the sets exist.


Robustness & safety findings

What we have verified qualitatively (a formal quantitative robustness suite is still pending):

  • Identity / anti-impersonation: with the served card, correct "Hakeem / Vionex" identity on 29/30 adversarial probes (jailbreak, role-play, "ignore your instructions").
  • Self-prescription safety: improved — declines to hand out specific drug doses on request.
  • ⚠️ Drug disambiguation: can confuse similarly-named drugs (e.g. Captagon ↔ Methadone) — a dangerous failure mode. Verify any pharmacology output.
  • ⚠️ Dialect mistranslation: occasional Egyptian-colloquial symptom mismapping (e.g. fever ↔ migraine).
  • ⚠️ Token bleed: stray CJK / Cyrillic / Latin fragments in longer generations (a Falcon-H1 base artifact), plus occasional repetition / runaway on long outputs.

A formal, scored robustness battery (identity probes, format-perturbation, drug-safety, refusal calibration) is on the roadmap and will be published with results when run.
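
Until that battery exists, the token-bleed issue listed above can at least be flagged automatically. The sketch below scans generated text for CJK, Cyrillic, and related characters using Unicode character names. Latin characters are deliberately not flagged, since drug names and abbreviations legitimately appear in Latin script within Arabic medical text. The function name and the script list are illustrative choices, not part of the released tooling.

```python
import unicodedata

# Unicode-name prefixes for scripts that should not appear in Arabic/English output.
UNEXPECTED_PREFIXES = ("CJK", "CYRILLIC", "HIRAGANA", "KATAKANA", "HANGUL")

def stray_script_chars(text):
    """Return (index, char, script) for each character from an unexpected script."""
    hits = []
    for i, ch in enumerate(text):
        name = unicodedata.name(ch, "")  # "" for unnamed characters such as newlines
        if name.startswith(UNEXPECTED_PREFIXES):
            hits.append((i, ch, name.split()[0]))
    return hits

sample = "ارتفاع ضغط 中 الدم д"
for index, char, script in stray_script_chars(sample):
    print(index, char, script)
```

A filter like this is cheap enough to run on every generation; responses with hits can be regenerated or routed to manual review.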


Limitations

  • Research preview — NOT for real clinical decisions. Always consult a licensed clinician.
  • Reasoning structure is strong; factual recall has holes. Well-structured reasoning can carry confident pharmacology/anatomy errors. Verify specifics.
  • Acute clinical management is unreliable (e.g. it can recognize DKA but mismanage treatment).
  • Anatomy is the weakest major specialty (see per-specialty table).
  • Not a polished chat product. Occasional token-bleed, repetition, and meta-leaks on long generations.
  • Benchmark caveats. Absolute accuracy is parser/protocol-dependent; the robust signal is the ranking and the controlled base-vs-Hakeem delta.

Bias, risks & responsible use

Medical content carries real-world harm potential. This model may reflect biases in its training sources, under-represent some populations and conditions, and is strongest on exam-style basic medical science rather than bedside management. It must not be used to provide medical advice, and any research use should keep a qualified human expert in the loop. It is released publicly for research and evaluation only — not for clinical, diagnostic, or treatment use.


Compute & environmental footprint

Training used cloud 8×A100 (40/80GB) and 2×H100 nodes for the DAPT → SFT → DPO → identity stages (GRPO experiments are excluded from the released model). Precise GPU-hour and CO₂e figures will be finalized from the training logs and training_args; the values below are approximate in the interim:

| | Estimate (to be finalized) |
|---|---|
| Hardware | NVIDIA A100 / H100, cloud |
| GPU-hours (shipped stack) | order of a few hundred A100-equivalent GPU-hours |
| CO₂e | to be computed (GPU-hours × TDP × PUE × regional grid intensity) |

(We prefer to leave this approximate rather than publish an unverified number; exact figures pending a pass over the training logs.)
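
The CO₂e formula in the table is straightforward to apply once the inputs are known. The sketch below implements it as written; every number in the example call is a placeholder chosen to show the units, not a measurement or an estimate for this model. Using TDP as the power draw gives an upper bound, since GPUs rarely run at full rated power for an entire job.

```python
def co2e_kg(gpu_hours, tdp_watts, pue, grid_kg_per_kwh):
    """CO2e in kg = GPU-hours x TDP x PUE x regional grid intensity."""
    energy_kwh = gpu_hours * tdp_watts / 1000 * pue  # facility-level energy
    return energy_kwh * grid_kg_per_kwh

# Placeholder inputs for illustration only (not Hakeem's actual figures):
example = co2e_kg(gpu_hours=300, tdp_watts=400, pue=1.2, grid_kg_per_kwh=0.4)
print(f"{example:.1f} kg CO2e")
```

The grid-intensity term typically dominates the uncertainty, because it varies several-fold between regions; it should be taken from the cloud provider's figures for the region actually used.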


Citation

@misc{hakeem7b2026,
  title        = {Hakeem-7B: An Arabic Medical Reasoning Model},
  author       = {Vionex Digital Solutions},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Vionex-digital/Hakeem-7B}},
  note         = {Built on Falcon-H1-7B-Instruct; DAPT + chain-of-thought SFT + DPO}
}

Acknowledgments

  • TII for the Falcon-H1 base model and hybrid Mamba/attention architecture.
  • MedAraBench for the Arabic medical evaluation benchmark.
  • Chain-of-thought distillation supervision via Claude Opus 4.6; reasoning-quality adjudication via Claude Opus 4.5 + Sonnet 4.6.

License

Released under the Falcon-LLM license inherited from tiiuae/Falcon-H1-7B-Instruct. By using Hakeem-7B you agree to that license and to the research-only, non-clinical access terms above.


© 2026 Vionex Digital Solutions · Hakeem-7B (حكيم-7B) · Research preview
