Multimodal Arabic and speech processing capabilities
The multimodal architecture here is compelling: 23 languages plus audio and vision capabilities in a single model. I'm particularly interested in the Arabic speech processing (ASR and speech translation).
For multilingual audio pipelines: have you evaluated cross-lingual transfer? We've seen models trained primarily on English speech data struggle with Arabic phonemes that have no English counterpart, such as the emphatic (pharyngealized) consonants and the pharyngeal fricatives /ʕ/ and /ħ/. The 23-language claim is impressive, but I'm curious about the per-language performance distribution, not just the aggregate number.
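To make the per-language question concrete, this is the kind of breakdown I'd want to see. A minimal sketch (hand-rolled word-level WER; a real evaluation would normalize text and use a library like jiwer, and all sample data here is hypothetical):

```python
from collections import defaultdict


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


def per_language_wer(samples):
    """samples: iterable of (lang, reference, hypothesis) triples.

    Returns mean WER per language so outliers aren't hidden in a
    single averaged score across all 23 languages.
    """
    scores = defaultdict(list)
    for lang, ref, hyp in samples:
        scores[lang].append(wer(ref, hyp))
    return {lang: sum(v) / len(v) for lang, v in scores.items()}
```

The point is the shape of the report: a per-language table (ideally with variance), not one pooled WER, since pooling lets strong English performance mask weak tail-language performance.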
Also, any benchmarks on code-switched speech? Arabic-English mixed speech is common in many regions, and most ASR systems struggle significantly with the language identification boundary.
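To illustrate why the boundary is hard: even on a perfect transcript, a naive script-based tagger only recovers coarse runs, and the acoustic model has to make this call mid-utterance with no script cues at all. A toy sketch (Unicode-block heuristic, hypothetical; real code-switching ASR needs acoustic and language-model evidence):

```python
def script_of(token: str) -> str:
    """Crude per-token tag: Arabic Unicode block vs. everything else."""
    if any("\u0600" <= ch <= "\u06FF" for ch in token):
        return "ar"
    return "en"


def segment_by_script(text: str):
    """Group consecutive same-script tokens into (lang, phrase) runs."""
    runs = []
    for tok in text.split():
        lang = script_of(tok)
        if runs and runs[-1][0] == lang:
            runs[-1] = (lang, runs[-1][1] + " " + tok)
        else:
            runs.append((lang, tok))
    return runs
```

Note what this heuristic can't capture: romanized Arabic ("Arabizi"), borrowed words pronounced with the other language's phonology, and single-word switches, which are exactly the cases where ASR language identification breaks down.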
The vision-language-audio integration raises interesting deployment questions: what's the inference cost breakdown across modalities? Can you run audio-only inference without loading the full multimodal weights?
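On the second question: if the checkpoint is a flat name-to-tensor map (as in a PyTorch state dict), partial loading by key prefix is at least mechanically possible. A schematic sketch with a plain dict and made-up parameter names (the real checkpoint layout depends entirely on the model implementation):

```python
def filter_submodule_weights(state_dict, prefixes):
    """Keep only entries whose key starts with one of the prefixes.

    state_dict: mapping of parameter name -> tensor (placeholders here).
    prefixes: module-name prefixes for the audio path; these names are
    guesses, not the actual layout of any released checkpoint.
    """
    return {
        name: tensor
        for name, tensor in state_dict.items()
        if any(name.startswith(p) for p in prefixes)
    }


# Hypothetical checkpoint layout:
ckpt = {
    "audio_encoder.conv.weight": "...",
    "vision_encoder.patch_embed.weight": "...",
    "decoder.layers.0.attn.weight": "...",
}
audio_only = filter_submodule_weights(ckpt, ("audio_encoder.", "decoder."))
```

Whether this actually saves memory depends on the serialization format: safetensors supports lazy per-tensor reads, while a single pickled checkpoint generally has to be deserialized in full before you can filter it. So the practical question is whether the weights ship per-modality or as one monolithic file.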