Vocal LLM

Cost-Efficient Joint Audio-Language Modeling via Lightweight Projector Training over Frozen Foundations

Vocal LLM is a joint audio-language model that bridges a frozen Whisper speech encoder with the Sarvam-M 24B Indic LLM through a lightweight trainable projector. The entire model was trained for ~$10 on a single NVIDIA A100 GPU in approximately 6 hours.

Architecture

Figure: joint embedding model combining Sarvam-M with Whisper.

Vocal LLM consists of three components:

| Component | Model | Parameters | Status |
|---|---|---|---|
| Speech Encoder | openai/whisper-medium | ~300M | Frozen |
| Multimodal Projector | Two-layer MLP (GELU + LayerNorm) | ~60M | Trained |
| Language Model | sarvamai/sarvam-m (Mistral-based, 24B) | ~24B | LoRA-adapted (~103M trainable) |

Total trainable parameters: ~163M (projector plus LoRA adapters), under 3% of the full model.
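The trainable share can be checked with a few lines of arithmetic. This is a rough sketch using the approximate counts listed for the three components; treating the LoRA adapters as extra parameters on top of a frozen 24B base is an assumption made here for illustration, and the dictionary keys are hypothetical names.

```python
# Approximate parameter counts in millions, as listed for each component.
# Assumption: LoRA adapters are counted separately from the frozen 24B base weights.
COMPONENTS = {
    "speech_encoder": {"total": 300, "trainable": 0},       # frozen
    "projector": {"total": 60, "trainable": 60},            # fully trained
    "language_model": {"total": 24_000, "trainable": 0},    # base weights untouched
    "lora_adapters": {"total": 103, "trainable": 103},      # trained in stage 2
}


def trainable_fraction(components):
    """Return trainable parameters divided by all parameters."""
    total = sum(c["total"] for c in components.values())
    trainable = sum(c["trainable"] for c in components.values())
    return trainable / total


fraction = trainable_fraction(COMPONENTS)
print(f"trainable share: {fraction:.2%}")
```

With these rounded inputs the share comes out well below the 3% bound stated above.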

How it works

  1. Audio encoding: Raw audio is resampled to 16 kHz, converted to a log-mel spectrogram, and processed by the frozen Whisper encoder to produce 1024-dim embeddings at 50 frames/sec.
  2. Projection: The MLP projector stacks 8 consecutive frames (8x temporal downsampling) and maps them into the LLM's 2048-dim input space. A 30-second clip becomes ~188 pseudo-tokens.
  3. Text generation: Projected audio tokens are concatenated with text instruction tokens and processed by the LoRA-adapted Sarvam-M LLM to generate the response.
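The frame-stacking step and the resulting token count can be sketched in plain Python. This is a minimal illustration, not the model's actual code: the function names are hypothetical, and zero-padding the final incomplete group is an assumption (the source only states the 8x factor and the ~188 figure). With 1024-dim frames, each stacked vector would hold 8 x 1024 = 8192 values before the MLP maps it to the LLM input width.

```python
import math

WHISPER_FRAME_RATE = 50  # encoder frames per second, as stated in the text
STACK_FACTOR = 8         # consecutive frames merged into one pseudo-token


def pseudo_token_count(duration_s, frame_rate=WHISPER_FRAME_RATE, k=STACK_FACTOR):
    """Number of pseudo-tokens for a clip, rounding the last partial group up."""
    frames = round(duration_s * frame_rate)
    return math.ceil(frames / k)


def stack_frames(frames, k=STACK_FACTOR):
    """Concatenate every k consecutive frame vectors into one longer vector.

    Assumption: the last group is zero-padded when len(frames) is not a
    multiple of k.
    """
    if not frames:
        return []
    dim = len(frames[0])
    pad = (-len(frames)) % k
    padded = frames + [[0.0] * dim for _ in range(pad)]
    return [
        [value for frame in padded[i:i + k] for value in frame]
        for i in range(0, len(padded), k)
    ]


print(pseudo_token_count(30))  # 1500 frames grouped by 8

# Toy input: 20 frames of width 4 instead of 1500 frames of width 1024.
toy = [[float(t)] * 4 for t in range(20)]
stacked = stack_frames(toy)
print(len(stacked), len(stacked[0]))
```

Stacking trades sequence length for vector width, which keeps the number of positions the LLM has to attend over small.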

Training

Training follows a two-stage pipeline:

Stage 1: Projector Pre-training. Aligns Whisper's speech representations with Sarvam-M's text embedding space using 10K audio continuation pairs from Mozilla Common Voice (Hindi). Only the projector MLP is trained. 1 epoch, AdamW, lr=1e-4, bfloat16.

Stage 2: Instruction Fine-tuning. Uses 3,000 synthetic Hindi audio question-answer pairs. Both the projector and the LoRA adapters (rank 16, alpha=32, applied to all attention projections) are trained. 3 epochs, lr=5e-5.
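The two stages can be written down as a small configuration sketch. The `StageConfig` class, its field names, and the module-group labels are hypothetical and only mirror the hyperparameters listed above; fields the text does not specify for a stage are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class StageConfig:
    name: str
    num_examples: int
    epochs: int
    learning_rate: float
    trainable: Tuple[str, ...]        # module groups that receive gradients
    optimizer: Optional[str] = None   # None = not stated in the text
    precision: Optional[str] = None
    lora_rank: Optional[int] = None
    lora_alpha: Optional[int] = None


STAGES = (
    StageConfig(
        name="projector_pretraining",
        num_examples=10_000,
        epochs=1,
        learning_rate=1e-4,
        trainable=("projector",),
        optimizer="AdamW",
        precision="bfloat16",
    ),
    StageConfig(
        name="instruction_finetuning",
        num_examples=3_000,
        epochs=3,
        learning_rate=5e-5,
        trainable=("projector", "lora_adapters"),
        lora_rank=16,
        lora_alpha=32,
    ),
)


def is_trainable(module_group, stage):
    """True if the named module group is updated during the given stage."""
    return module_group in stage.trainable


for stage in STAGES:
    passes = stage.num_examples * stage.epochs
    print(stage.name, stage.trainable, f"{passes} example passes")

# In the standard LoRA formulation the low-rank update is scaled by alpha / rank.
lora_scale = STAGES[1].lora_alpha / STAGES[1].lora_rank
print("LoRA scale:", lora_scale)
```

The Whisper encoder and the base LLM weights never appear in `trainable`, which matches the frozen-foundation design.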

The synthetic dataset was generated by prompting a text-only LLM with ASR transcripts to create instruction-answer pairs, which is 10-50x cheaper than processing raw audio through multimodal APIs.
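A transcript-to-pair generation step might look like the sketch below. The prompt wording, the JSON reply format, and both function names are assumptions made for illustration; the source does not publish the actual prompt or the LLM used, so the model call itself is left out.

```python
import json

# Hypothetical prompt; the real generation prompt is not given in the source.
PROMPT_TEMPLATE = (
    "You are given the transcript of a Hindi audio clip.\n"
    "Write one question a listener could ask about it and a short answer in Hindi.\n"
    'Reply with JSON of the form {{"question": "...", "answer": "..."}}.\n\n'
    "Transcript: {transcript}"
)


def build_generation_prompt(transcript):
    """Fill the template with an ASR transcript."""
    return PROMPT_TEMPLATE.format(transcript=transcript.strip())


def parse_pair(llm_reply):
    """Return (question, answer), or None if the reply is malformed."""
    try:
        data = json.loads(llm_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    question, answer = data.get("question"), data.get("answer")
    if not (isinstance(question, str) and isinstance(answer, str)):
        return None
    if not question.strip() or not answer.strip():
        return None
    return question.strip(), answer.strip()


prompt = build_generation_prompt("आज मौसम बहुत अच्छा है।")
reply = '{"question": "मौसम कैसा है?", "answer": "मौसम बहुत अच्छा है।"}'
print(parse_pair(reply))
print(parse_pair("not json"))
```

Each accepted pair would then be matched with the original audio clip, so the audio itself never has to pass through a paid multimodal API.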

Capabilities

  • Hindi audio question answering: Given audio and a question, generates contextually relevant Hindi responses
  • Cross-lingual understanding: Translates Hindi speech to English text
  • Audio transcription: Transcribes Hindi speech, leveraging Whisper's multilingual capabilities
  • Content summarization: Summarizes audio content in Hindi or English

Usage

# Inference format
# User: [INST] Based on the provided audio, answer the following question: {Q} <|audio|> [/INST]
# Assistant: {Answer}

# During the forward pass, the <|audio|> placeholder is replaced
# with the projected audio pseudo-tokens from the Whisper encoder + MLP projector.
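The placeholder substitution described above can be illustrated with plain lists standing in for embedding tensors. This is a simplified sketch: `build_prompt` and `splice_audio` are hypothetical helpers, the "User:" role label is omitted from the prompt string, and the placeholder is assumed to occupy exactly one position in the token sequence.

```python
AUDIO_PLACEHOLDER = "<|audio|>"


def build_prompt(question):
    """Format a question using the inference template from this section."""
    return (
        "[INST] Based on the provided audio, answer the following question: "
        f"{question} {AUDIO_PLACEHOLDER} [/INST]"
    )


def splice_audio(text_embeds, placeholder_index, audio_embeds):
    """Replace the single placeholder embedding with the audio pseudo-tokens.

    Assumption: the placeholder occupies exactly one position.
    """
    if not 0 <= placeholder_index < len(text_embeds):
        raise IndexError("placeholder_index is outside the text sequence")
    return (
        text_embeds[:placeholder_index]
        + audio_embeds
        + text_embeds[placeholder_index + 1:]
    )


# Toy sequence: four "tokens" with 2-dim embeddings, placeholder at position 2.
tokens = ["[INST]", "question", AUDIO_PLACEHOLDER, "[/INST]"]
text_embeds = [[float(i), float(i)] for i in range(len(tokens))]
audio_embeds = [[9.0, 9.0] for _ in range(3)]  # three projected pseudo-tokens

merged = splice_audio(text_embeds, tokens.index(AUDIO_PLACEHOLDER), audio_embeds)
print(build_prompt("What is the speaker talking about?"))
print(len(merged))
```

In a real forward pass the same idea would apply to embedding tensors, with the attention mask extended to cover the inserted positions.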

Limitations

  • Hallucination: May occasionally generate fluent but factually incorrect responses
  • Limited vocabulary: Instruction fine-tuning used only 3,000 samples, so Hindi vocabulary coverage is restricted
  • Length sensitivity: Audio clips significantly longer or shorter than the training distribution may produce degraded outputs
  • Noise sensitivity: Background noise or atypical speaking patterns can cause incoherent output

Citation

@article{vocalllm2026,
  title={Vocal LLM: Cost-Efficient Joint Audio-Language Modeling
         via Lightweight Projector Training over Frozen Foundations},
  author={Team Vizuara},
  year={2026}
}
