Better than Whisper?
A 100M-parameter text-to-speech (TTS) model by Kyutai-Labs
Extract structured layout and text from PDFs or images