Multimodal Arabic and speech processing capabilities

#92
by O96a - opened

The multimodal architecture here is compelling: 23 languages plus audio/vision capabilities in a single model. I'm particularly interested in the Arabic speech processing (ASR, speech translation).

For multilingual audio pipelines: have you evaluated cross-lingual transfer? We've seen models trained primarily on English speech data struggle with Arabic phonemes that don't exist in English (like emphatic consonants and guttural sounds). The 23-language claim is impressive, but I'm curious about the per-language performance distribution.
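To make the question concrete, here's a minimal sketch of the breakdown we'd want to see, using Common Voice test splits as a stand-in benchmark and `jiwer` for WER. The `transcribe` function and the three-language subset are placeholders for whatever ASR entry point and language coverage this model actually has:

```python
from itertools import islice

from datasets import load_dataset
from jiwer import wer

def transcribe(audio_array, sampling_rate, lang):
    # Placeholder for whatever ASR call this model exposes.
    raise NotImplementedError

per_lang_wer = {}
for lang in ["ar", "en", "fr"]:  # hypothetical subset of the 23 languages
    # Common Voice is gated; requires accepting its terms on the Hub.
    ds = load_dataset("mozilla-foundation/common_voice_17_0", lang,
                      split="test", streaming=True)
    refs, hyps = [], []
    for sample in islice(ds, 500):  # fixed slice per language for comparability
        refs.append(sample["sentence"])
        hyps.append(transcribe(sample["audio"]["array"],
                               sample["audio"]["sampling_rate"], lang))
    per_lang_wer[lang] = wer(refs, hyps)

print(per_lang_wer)  # the per-language distribution, not just the pooled mean
```

For Arabic in particular, WER is usually reported after diacritic stripping and alef/ta-marbuta normalization, so it would help to state which normalization any published numbers use.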

Also, any benchmarks on code-switched speech? Arabic-English mixed speech is common in many regions, and most ASR systems degrade sharply at the points where the language switches.
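For what it's worth, one diagnostic we've found useful on code-switched test sets is bucketing reference-side errors by script, which shows whether errors concentrate on the embedded Latin-script tokens. A rough sketch (the parallel `refs`/`hyps` transcript lists are hypothetical):

```python
import re

from jiwer import process_words

ARABIC = re.compile(r"[\u0600-\u06FF]")

def script(token):
    # Classify a reference token by writing system.
    return "arabic" if ARABIC.search(token) else "latin"

def errors_by_script(refs, hyps):
    # Attribute substitution/deletion errors to reference tokens, bucketed
    # by script. Insertions have no reference token and are skipped.
    counts = {"arabic": [0, 0], "latin": [0, 0]}  # [errors, total]
    out = process_words(refs, hyps)
    for ref_words, chunks in zip(out.references, out.alignments):
        for chunk in chunks:
            for i in range(chunk.ref_start_idx, chunk.ref_end_idx):
                bucket = counts[script(ref_words[i])]
                bucket[1] += 1
                if chunk.type != "equal":
                    bucket[0] += 1
    return {s: errs / total if total else 0.0
            for s, (errs, total) in counts.items()}

# e.g. errors_by_script(["رحت ال meeting امبارح"], ["رحت المدرسة امبارح"])
```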

The vision-language-audio integration raises interesting deployment questions: what's the inference cost breakdown across modalities? Can you run audio-only inference without loading the full multimodal weights?
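On that last point, the pattern I'm hoping is possible is something like the sketch below: if the checkpoint keys are namespaced per modality (the `audio_encoder.` prefix and `build_audio_encoder` constructor here are hypothetical), a safetensors file lets you materialize only the audio tower:

```python
from safetensors.torch import safe_open

# Read only the audio-tower tensors from the (hypothetical) multimodal
# checkpoint; vision/language weights are never materialized in memory.
audio_state = {}
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        if key.startswith("audio_encoder."):  # hypothetical key namespace
            audio_state[key.removeprefix("audio_encoder.")] = f.get_tensor(key)

audio_encoder = build_audio_encoder()  # placeholder for the real constructor
audio_encoder.load_state_dict(audio_state)
```

Whether that actually works depends on how entangled any shared layers are with the other modalities, hence the question.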
