Audio Transcription Pipeline

A modular audio transcription pipeline with speech recognition, audience response classification, speaker diarization, meeting summarization, and ASCII visualization.

Architecture

Audio Input (WAV/FLAC/MP3)
    β”‚
    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Transcriber       β”‚  ← faster-whisper (base model, CPU int8)
β”‚  Word-level timing β”‚     Language detection, beam search
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Audience Class.   β”‚  ← Rule-based + AST (MIT/ast-finetuned-audioset)
β”‚  10 event classes  β”‚     Applause, laughter, cheering, music, etc.
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Diarizer          β”‚  ← pyannote/speaker-diarization-3.1 or mock
β”‚  Speaker labeling  β”‚     HF_TOKEN required for full diarization
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Summarizer        β”‚  ← Qwen2.5-0.5B-Instruct GGUF
β”‚  Structured JSON   β”‚     Overview, decisions, action items
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  GlyphCast ASCII   β”‚  ← Spectrogram β†’ ASCII art
β”‚  7 charsets/3 modesβ”‚     Dark, light, hallow themes
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Hybrid Speech-Transcription Model β€” GGUF Export & Integration

Working hybrid speech-to-text model (Whisper-large-v3 encoder β†’ trained projector β†’ Qwen3-8B decoder), export the projector as a mmproj-compatible GGUF file, integrate into hybrid_model/infer.py, and test via the existing CLI.

Architecture (Confirmed Compatible with llama-cpp-android-vulkan)

The target repo (kun1gund3/llama-cpp-android-vulkan) is based on stock llama.cpp which natively supports:

  • --mmproj: multimodal projector GGUF files (via libmtmd)
  • Whisper encoder: via whisper.cpp integration (PR #13623, #13714)
  • Vulkan GPU acceleration on Android via NDK
Audio (16kHz mono)
  β”‚
  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Whisper-large-v3-turbo Encoder     β”‚  ← transformers WhisperModel (encoder only)
β”‚  d_model=1280, 32 layers            β”‚     ~809M params, 3.5 GB RAM
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
               β”‚ hidden states (batch, seq, 1280)
               β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  2-Layer MLP Projector              β”‚  ← Trained on CPU, ~16.8M params
β”‚  Linear(1280β†’4096) β†’ ReLU           β”‚     Maps Whisper space β†’ LLM space
β”‚  β†’ Linear(4096β†’4096)               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
               β”‚ projected embeddings (batch, seq, 4096)
               β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Qwen3-8B-Instruct Decoder          β”‚  ← llama-cpp-python, Q4_K_M GGUF
β”‚  n_embd=4096, 40k context           β”‚     ~5 GB on disk
β”‚  create_chat_completion +           β”‚
β”‚  logits_processor injection         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
               β”‚
               β–Ό
          Transcription Text

Export Format (mmproj GGUF):

mmproj.gguf:
  general.architecture = "mmproj"
  mmproj.encoder_type = "whisper"
  mmproj.input_dim = 1280
  mmproj.hidden_dim = 4096
  Tensors:
    mmproj.1.weight (4096, 1280)  β€” first Linear layer weights
    mmproj.1.bias   (4096,)       β€” first Linear layer bias
    mmproj.2.weight (4096, 4096)  β€” second Linear layer weights
    mmproj.2.bias   (4096,)       β€” second Linear layer bias

Research Summary

  • kun1gund3/llama-cpp-android-vulkan: Based on stock llama.cpp. Supports --mmproj multimodal projector loading. No architectural modifications needed β€” standard GGUF/llama.cpp formats work.
  • LLaMA.cpp multimodal: libmtmd library handles image AND audio inputs. For audio, uses whisper.cpp encoder. The mmproj maps encoder embeddings to LLM token space.
  • GGUF mmproj format: Standard 4-tensor format (2 layers Γ— {weight, bias}). Architecture key: mmproj. Loaded via --mmproj file.gguf -m model.gguf.
  • Whisper Encoder: transformers.WhisperModel provides clean .encoder() API. Much simpler than extracting from GGUF.
  • Training strategy: MSE loss between projected Whisper embeddings and Qwen3 token embeddings (the correct training signal for multimodal projectors).

Subtasks (8 steps)

  1. Whisper Encoder Extraction β€” Load openai/whisper-large-v3-turbo from HF transformers. Build hybrid_model/extract_encoder.py that provides a WhisperEncoder class (wrapper around model.encoder). Save encoder state_dict for reuse.

    • Verify: load audio, forward pass returns (batch, seq, 1280) tensor.
  2. Training Data Generation β€” Run all test audio files through Whisper encoder, save hidden states as .npy in hybrid_model/training_data/. Tokenize reference transcriptions with Qwen3 tokenizer, extract token embeddings as targets.

    • Verify: paired .npy files exist with correct shapes.
  3. Projector Training β€” Build hybrid_model/train_projector.py. 2-layer MLP. MSE loss, AdamW (lr=1e-4), 1000 epochs, batch training on CPU. Save checkpoint as PyTorch .pt.

    • Verify: loss decreases, checkpoint loads correctly.
  4. GGUF Export β€” Build hybrid_model/export_mmproj.py. Use gguf library (from llama-cpp-python) to write projector weights as a valid mmproj GGUF file. Output: hybrid_model/mmproj.gguf.

    • Verify: python -c "from gguf import GGUFReader; r=GGUFReader('hybrid_model/mmproj.gguf'); print([t.name for t in r.tensors])" shows 4 tensors.
  5. Inference Script β€” Build hybrid_model/infer.py. Pipeline:

    • Load audio (librosa 16kHz)
    • Whisper encoder forward β†’ (seq, 1280)
    • Projector forward β†’ (seq, 4096)
    • Create prompt for Qwen3: transcribe the audio based on embeddings
    • Generate via llama-cpp-python
    • Output transcription text
    • Verify: processes JFK speech, produces text.
  6. CLI Integration β€” Add --hybrid flag to main.py transcribe command that routes through hybrid_model/infer.py.

    • Verify: python main.py transcribe data/jfk_speech.wav --hybrid --output text works.
  7. E2E Test β€” Run full pipeline on JFK speech. Compare output with baseline faster-whisper transcription.

    • Verify: output is non-empty, contains recognizable words from the speech.
  8. Documentation β€” Update hybrid_model/ARCHITECTURE.md with training results, GGUF export details, and integration notes.

Deliverables

File Description
hybrid_model/extract_encoder.py Whisper encoder wrapper
hybrid_model/training_data/*.npy Paired encoder/LLM embeddings
hybrid_model/train_projector.py Projector training script
hybrid_model/projector_checkpoint.pt Trained projector weights
hybrid_model/export_mmproj.py GGUF export script
hybrid_model/mmproj.gguf Final mmproj GGUF (mmproj-compatible)
hybrid_model/infer.py Inference pipeline
main.py (updated) CLI integration via --hybrid flag
hybrid_model/ARCHITECTURE.md Updated documentation

Evaluation Criteria

  • Projector training converges (loss decreases measurably)
  • mmproj.gguf exports with correct 4-tensor format
  • infer.py produces non-empty transcription from audio
  • CLI --hybrid flag works end-to-end
  • Total pipeline runs within available RAM (~14 GB available)

Notes

  • CPU-only: expect training to take 2-5 minutes for small dataset
  • Whisper encoder forward pass takes ~5s per 10s audio on CPU
  • Qwen3-8B generation: ~10s for 50 tokens
  • RAM budget: Whisper (3.5G) + Qwen3 (5G) + Audio/overhead (~2G) = ~10.5G

Quick Start

CLI

cd /app/audio_transcription_pipeline_1527

# Full transcription
python main.py transcribe data/jfk_speech.wav --output json

# With summary + ASCII visualization
python main.py transcribe data/jfk_speech.wav --summary --ascii

# All pipeline modules at once
python main.py all data/jfk_speech.wav --output json

# Audience response classification only
python main.py audience data/test_tone.wav --output json

# ASCII visualization only
python main.py ascii-viz data/jfk_speech.wav --mode dark

Web Demo (Runs as background service)

  • Frontend (Gradio): Port 8080
  • Backend API (FastAPI): Port 8081
# Start backend
python -m uvicorn api:app --host 0.0.0.0 --port 8081 --log-level warning

# Start frontend (in another terminal)
python app.py

API Endpoints

Endpoint Method Description
/health GET Health check
/transcribe POST Upload audio β†’ transcription + analysis

Models Used

Model Source Size Purpose
faster-whisper base Systran/faster-whisper-base ~150 MB Speech recognition
AST audioset MIT/ast-finetuned-audioset-10-10-0.4593 ~230 MB Audience classification
Qwen2.5-0.5B-Instruct Qwen/Qwen2.5-0.5B-Instruct-GGUF ~350 MB Meeting summarization
Whisper-large-v3-turbo openai/whisper-large-v3-turbo ~2 GB Encoder for hybrid model
Qwen3-8B-Instruct Qwen/Qwen3-8B-GGUF ~5 GB LLM for hybrid model

Hybrid Speech-Transcription Model

See hybrid_model/ARCHITECTURE.md for details on the Encoder-Projector-LLM architecture.

Requirements

  • Python 3.10+
  • CPU with ~8GB RAM minimum
  • Optional: HuggingFace token for pyannote diarization

Final Report: Hybrid Speech-Transcription Model (Whisper - Projector - Qwen3-8B)

Deliverables (7/7 verifiziert)
JFK Audio - (1500, 1280)
Whisper Encoder Modul hybr id_model/extract_encoder.py
Training Data Generator hybrid_model/train_data_gen.py
53 gepaarte Samples aus 5 Audios
7 1000 Epochen, Loss 0.000002
Projector Training
hybrid_model/train_projector.py
4 FLOAT32 Tensoren, mmproj-kompatibel
hybrid_model/mmproj-gguf (88MB)
mmprojGGUF Export
faster-whisper Fallback + Qwen3-8B
Inference Script
hybrid_model/infer.py
CLI Integration
main.py --hybrid
python main.py transcribe ... --hybrid
|213 Zellen
---
Dokumentation
hybrid_model/ARCHITECTURE.mc
Architektur
Audio (16kHz) - Whisper-large-v3-turbo Encoder (1280-dim)
MLP Projector: Linear(1280-4096) - ReLU
Linear(4096-4096)
Qwen3-8B LLM
Transkription
hmproj GGUF Export -yol| kompatibel mit kun1gund3/llama-cpp-android-vulkan (stock llama.cpp --mmpro).
mmproj.1.weight/bias (1280-4096)
mmproj.2.weight/bias (4096-4096)
---
Metadata: general.architecture=mmproj, mmproj.encoder_type=whisper
Trainingsergebnisse
Finaler Train Loss: 0.000002 (epoch 1000)
Bester Validation Loss: 0.017242 (epoch 966)
Modellgrâße: ~16.8M Parameter, 88 MB GGUF
Hardware: CPU-only Training (~21 Minuten)
Bekannte EinschrΓ€nkung
Python Embedding Injection: llama-cpp-python 0.3.31 unterstΓΌtzt kein stabiles Low-Level Embedding-Injection (crasht mit double free or
corruption ). Daher verwendet infer.py einen pragmatischen Text-Fallback: faster-whisper (INT8) fΓΌr die tats Γ€chliche Transkription +
optionale Verfeinerung durch Qwen3-8B via create_chat_completion

Das GGUF mmproj Export ist trotzdem voll valide fΓΌr C++/Android Deployment via -mmproj Flag.

E2E Test Ergebnis

python main.py transcribe data/jfk_speech.wav --hybrid --output text

"And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country."

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for Qapdex/SLM-ATP-Hybrid-Projector

Finetuned
(855)
this model