Arabic performance and multilingual deployment
The multilingual coverage here is impressive: 23 languages, including Arabic. I've been testing smaller models on Arabic dialect tasks (Egyptian, Levantine, Gulf) and finding that most multilingual models perform well on Modern Standard Arabic (MSA) but struggle significantly with dialectal variation.
Curious: have you benchmarked Arabic performance specifically? The paper mentions arXiv:2503.01743; does that evaluation include dialect-level coverage, or primarily MSA?
Also, for edge deployment: the 4-bit quantized variants are popular in production, but I've noticed Arabic generation quality degrades more than English at aggressive quantization levels. Any observations on language-specific quantization sensitivity? We're seeing this pattern across several multilingual models.
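For concreteness, here's roughly how we quantify the per-language degradation: compute perplexity on matched Arabic and English evaluation sets under the full-precision and 4-bit models, then compare the ratios. The `perplexity` helper and the synthetic logits below are illustrative stand-ins (in practice the logits come from running both model variants over real eval text), so treat this as a sketch of the measurement, not our exact harness.

```python
import numpy as np

def perplexity(logits: np.ndarray, targets: np.ndarray) -> float:
    """Perplexity from per-token logits (T, V) and gold token ids (T,)."""
    # Log-softmax with max-subtraction for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets].mean()
    return float(np.exp(nll))

# Synthetic stand-in data: in a real run, collect logits from the
# full-precision and 4-bit variants on the same tokenized eval set,
# separately per language, and compare the perplexity ratios.
rng = np.random.default_rng(0)
T, V = 128, 32000
targets = rng.integers(0, V, size=T)
logits_fp16 = rng.normal(size=(T, V))
logits_fp16[np.arange(T), targets] += 5.0          # model favors gold tokens
noise = rng.normal(scale=0.5, size=(T, V))          # crude proxy for quant error
logits_4bit = logits_fp16 + noise

ppl_fp16 = perplexity(logits_fp16, targets)
ppl_4bit = perplexity(logits_4bit, targets)
print(f"fp16 ppl: {ppl_fp16:.2f}  4-bit ppl: {ppl_4bit:.2f}  "
      f"ratio: {ppl_4bit / ppl_fp16:.2f}x")
```

The pattern we keep seeing is that this ratio is consistently larger on Arabic (especially dialectal text) than on English for the same model and quantization config, which is what prompted the question above.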