Audio Transcription Pipeline

A modular audio transcription pipeline with speech recognition, audience response classification, speaker diarization, meeting summarization, and ASCII visualization.

Architecture

Audio Input (WAV/FLAC/MP3)
    │
    ▼
┌────────────────────┐
│  Transcriber       │  ← faster-whisper (base model, CPU int8)
│  Word-level timing │     Language detection, beam search
└────────┬───────────┘
         │
         ▼
┌────────────────────┐
│  Audience Class.   │  ← Rule-based + AST (MIT/ast-finetuned-audioset)
│  10 event classes  │     Applause, laughter, cheering, music, etc.
└────────┬───────────┘
         │
         ▼
┌────────────────────┐
│  Diarizer          │  ← pyannote/speaker-diarization-3.1 or mock
│  Speaker labeling  │     HF_TOKEN required for full diarization
└────────┬───────────┘
         │
         ▼
┌────────────────────┐
│  Summarizer        │  ← Qwen2.5-0.5B-Instruct GGUF
│  Structured JSON   │     Overview, decisions, action items
└────────┬───────────┘
         │
         ▼
┌────────────────────┐
│  GlyphCast ASCII   │  ← Spectrogram → ASCII art
│  7 charsets/3 modes│     Dark, light, hallow themes
└────────────────────┘

Hybrid Speech-Transcription Model — GGUF Export & Integration

Working hybrid speech-to-text model (Whisper-large-v3 encoder → trained projector → Qwen3-8B decoder), export the projector as a mmproj-compatible GGUF file, integrate into hybrid_model/infer.py, and test via the existing CLI.

Architecture (Confirmed Compatible with llama-cpp-android-vulkan)

The target repo (kun1gund3/llama-cpp-android-vulkan) is based on stock llama.cpp which natively supports:

--mmproj: multimodal projector GGUF files (via libmtmd)
Whisper encoder: via whisper.cpp integration (PR #13623, #13714)
Vulkan GPU acceleration on Android via NDK

Audio (16kHz mono)
  │
  ▼
┌─────────────────────────────────────┐
│  Whisper-large-v3-turbo Encoder     │  ← transformers WhisperModel (encoder only)
│  d_model=1280, 32 layers            │     ~809M params, 3.5 GB RAM
└──────────────┬──────────────────────┘
               │ hidden states (batch, seq, 1280)
               ▼
┌─────────────────────────────────────┐
│  2-Layer MLP Projector              │  ← Trained on CPU, ~16.8M params
│  Linear(1280→4096) → ReLU           │     Maps Whisper space → LLM space
│  → Linear(4096→4096)               │
└──────────────┬──────────────────────┘
               │ projected embeddings (batch, seq, 4096)
               ▼
┌─────────────────────────────────────┐
│  Qwen3-8B-Instruct Decoder          │  ← llama-cpp-python, Q4_K_M GGUF
│  n_embd=4096, 40k context           │     ~5 GB on disk
│  create_chat_completion +           │
│  logits_processor injection         │
└─────────────────────────────────────┘
               │
               ▼
          Transcription Text

Export Format (mmproj GGUF):

mmproj.gguf:
  general.architecture = "mmproj"
  mmproj.encoder_type = "whisper"
  mmproj.input_dim = 1280
  mmproj.hidden_dim = 4096
  Tensors:
    mmproj.1.weight (4096, 1280)  — first Linear layer weights
    mmproj.1.bias   (4096,)       — first Linear layer bias
    mmproj.2.weight (4096, 4096)  — second Linear layer weights
    mmproj.2.bias   (4096,)       — second Linear layer bias

Research Summary

kun1gund3/llama-cpp-android-vulkan: Based on stock llama.cpp. Supports --mmproj multimodal projector loading. No architectural modifications needed — standard GGUF/llama.cpp formats work.
LLaMA.cpp multimodal: libmtmd library handles image AND audio inputs. For audio, uses whisper.cpp encoder. The mmproj maps encoder embeddings to LLM token space.
GGUF mmproj format: Standard 4-tensor format (2 layers × {weight, bias}). Architecture key: mmproj. Loaded via --mmproj file.gguf -m model.gguf.
Whisper Encoder: transformers.WhisperModel provides clean .encoder() API. Much simpler than extracting from GGUF.
Training strategy: MSE loss between projected Whisper embeddings and Qwen3 token embeddings (the correct training signal for multimodal projectors).

Subtasks (8 steps)

Whisper Encoder Extraction — Load openai/whisper-large-v3-turbo from HF transformers. Build hybrid_model/extract_encoder.py that provides a WhisperEncoder class (wrapper around model.encoder). Save encoder state_dict for reuse.
- Verify: load audio, forward pass returns (batch, seq, 1280) tensor.
Training Data Generation — Run all test audio files through Whisper encoder, save hidden states as .npy in hybrid_model/training_data/. Tokenize reference transcriptions with Qwen3 tokenizer, extract token embeddings as targets.
- Verify: paired .npy files exist with correct shapes.
Projector Training — Build hybrid_model/train_projector.py. 2-layer MLP. MSE loss, AdamW (lr=1e-4), 1000 epochs, batch training on CPU. Save checkpoint as PyTorch .pt.
- Verify: loss decreases, checkpoint loads correctly.
GGUF Export — Build hybrid_model/export_mmproj.py. Use gguf library (from llama-cpp-python) to write projector weights as a valid mmproj GGUF file. Output: hybrid_model/mmproj.gguf.
- Verify: python -c "from gguf import GGUFReader; r=GGUFReader('hybrid_model/mmproj.gguf'); print([t.name for t in r.tensors])" shows 4 tensors.
Inference Script — Build hybrid_model/infer.py. Pipeline:
- Load audio (librosa 16kHz)
- Whisper encoder forward → (seq, 1280)
- Projector forward → (seq, 4096)
- Create prompt for Qwen3: transcribe the audio based on embeddings
- Generate via llama-cpp-python
- Output transcription text
- Verify: processes JFK speech, produces text.
CLI Integration — Add --hybrid flag to main.py transcribe command that routes through hybrid_model/infer.py.
- Verify: python main.py transcribe data/jfk_speech.wav --hybrid --output text works.
E2E Test — Run full pipeline on JFK speech. Compare output with baseline faster-whisper transcription.
- Verify: output is non-empty, contains recognizable words from the speech.
Documentation — Update hybrid_model/ARCHITECTURE.md with training results, GGUF export details, and integration notes.

Deliverables

File	Description
`hybrid_model/extract_encoder.py`	Whisper encoder wrapper
`hybrid_model/training_data/*.npy`	Paired encoder/LLM embeddings
`hybrid_model/train_projector.py`	Projector training script
`hybrid_model/projector_checkpoint.pt`	Trained projector weights
`hybrid_model/export_mmproj.py`	GGUF export script
`hybrid_model/mmproj.gguf`	Final mmproj GGUF (mmproj-compatible)
`hybrid_model/infer.py`	Inference pipeline
`main.py` (updated)	CLI integration via --hybrid flag
`hybrid_model/ARCHITECTURE.md`	Updated documentation

Evaluation Criteria

Projector training converges (loss decreases measurably)
mmproj.gguf exports with correct 4-tensor format
infer.py produces non-empty transcription from audio
CLI --hybrid flag works end-to-end
Total pipeline runs within available RAM (~14 GB available)

Notes

CPU-only: expect training to take 2-5 minutes for small dataset
Whisper encoder forward pass takes ~5s per 10s audio on CPU
Qwen3-8B generation: ~10s for 50 tokens
RAM budget: Whisper (~~3.5G) + Qwen3 (~~5G) + Audio/overhead (~2G) = ~10.5G

Quick Start

CLI

cd /app/audio_transcription_pipeline_1527

# Full transcription
python main.py transcribe data/jfk_speech.wav --output json

# With summary + ASCII visualization
python main.py transcribe data/jfk_speech.wav --summary --ascii

# All pipeline modules at once
python main.py all data/jfk_speech.wav --output json

# Audience response classification only
python main.py audience data/test_tone.wav --output json

# ASCII visualization only
python main.py ascii-viz data/jfk_speech.wav --mode dark

Web Demo (Runs as background service)

Frontend (Gradio): Port 8080
Backend API (FastAPI): Port 8081

# Start backend
python -m uvicorn api:app --host 0.0.0.0 --port 8081 --log-level warning

# Start frontend (in another terminal)
python app.py

API Endpoints

Endpoint	Method	Description
`/health`	GET	Health check
`/transcribe`	POST	Upload audio → transcription + analysis

Models Used

Model	Source	Size	Purpose
faster-whisper base	Systran/faster-whisper-base	~150 MB	Speech recognition
AST audioset	MIT/ast-finetuned-audioset-10-10-0.4593	~230 MB	Audience classification
Qwen2.5-0.5B-Instruct	Qwen/Qwen2.5-0.5B-Instruct-GGUF	~350 MB	Meeting summarization
Whisper-large-v3-turbo	openai/whisper-large-v3-turbo	~2 GB	Encoder for hybrid model
Qwen3-8B-Instruct	Qwen/Qwen3-8B-GGUF	~5 GB	LLM for hybrid model

Hybrid Speech-Transcription Model

See hybrid_model/ARCHITECTURE.md for details on the Encoder-Projector-LLM architecture.

Requirements

Python 3.10+
CPU with ~8GB RAM minimum
Optional: HuggingFace token for pyannote diarization

Final Report: Hybrid Speech-Transcription Model (Whisper - Projector - Qwen3-8B)

Deliverables (7/7 verifiziert)
JFK Audio - (1500, 1280)
Whisper Encoder Modul hybr id_model/extract_encoder.py
Training Data Generator hybrid_model/train_data_gen.py
53 gepaarte Samples aus 5 Audios
7 1000 Epochen, Loss 0.000002
Projector Training
hybrid_model/train_projector.py
4 FLOAT32 Tensoren, mmproj-kompatibel
hybrid_model/mmproj-gguf (88MB)
mmprojGGUF Export
faster-whisper Fallback + Qwen3-8B
Inference Script
hybrid_model/infer.py
CLI Integration
main.py --hybrid
python main.py transcribe ... --hybrid
|213 Zellen
---
Dokumentation
hybrid_model/ARCHITECTURE.mc
Architektur
Audio (16kHz) - Whisper-large-v3-turbo Encoder (1280-dim)
MLP Projector: Linear(1280-4096) - ReLU
Linear(4096-4096)
Qwen3-8B LLM
Transkription
hmproj GGUF Export -yol| kompatibel mit kun1gund3/llama-cpp-android-vulkan (stock llama.cpp --mmpro).
mmproj.1.weight/bias (1280-4096)
mmproj.2.weight/bias (4096-4096)
---
Metadata: general.architecture=mmproj, mmproj.encoder_type=whisper
Trainingsergebnisse
Finaler Train Loss: 0.000002 (epoch 1000)
Bester Validation Loss: 0.017242 (epoch 966)
Modellgröße: ~16.8M Parameter, 88 MB GGUF
Hardware: CPU-only Training (~21 Minuten)
Bekannte Einschränkung
Python Embedding Injection: llama-cpp-python 0.3.31 unterstützt kein stabiles Low-Level Embedding-Injection (crasht mit double free or
corruption ). Daher verwendet infer.py einen pragmatischen Text-Fallback: faster-whisper (INT8) für die tats ächliche Transkription +
optionale Verfeinerung durch Qwen3-8B via create_chat_completion

Das GGUF mmproj Export ist trotzdem voll valide für C++/Android Deployment via -mmproj Flag.

E2E Test Ergebnis

python main.py transcribe data/jfk_speech.wav --hybrid --output text

"And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country."

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

Any-to-Any

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Qapdex/SLM-ATP-Hybrid-Projector

Base model

Qwen/Qwen2.5-0.5B

Finetuned

Qwen/Qwen2.5-0.5B-Instruct

Finetuned

(855)

this model