---
license: apache-2.0
language:
- en
---

# ELLSA: End-to-end Listen, Look, Speak and Act

The **first** end-to-end model that unifies **vision, speech, text and action** in a **streaming full-duplex** framework, enabling joint multimodal perception and concurrent generation.

    ELLSA/
    ├── configs/                # Model configuration files
    ├── models/                 # Tokenizer and diffusion test
    ├── train/                  # Training dataset and pipeline
    ├── reference/              # Reference code
    │   ├── cosyvoice/          # Speech synthesizer
    │   ├── Emu3/               # Base code
    │   ├── RoboVLMs/           # Evaluation code
    │   └── spear_encoder/      # Speech encoder
    ├── scripts/                # Shell scripts for training
    ├── tools/                  # Data preprocessing tools
    └── README.md               # Project description and user guide