metadata
title: Parakeet ASR for Cadayn
emoji: 🦜
colorFrom: yellow
colorTo: green
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: apache-2.0
hardware: zero-a10g
python_version: '3.10'

Parakeet TDT Dedicated GPU Space

Ultra-fast ASR using NVIDIA Parakeet TDT 0.6B on a dedicated A10G GPU.

Features

  • ~3000x Real-Time Speed: Transcribe 1 hour of audio in ~1.2 seconds on GPU
  • Word-Level Timestamps: Precise timing for every word
  • Long Audio Support: Auto-chunking for files >10 minutes
  • Always-On: Model loaded at startup, no cold start per request
  • State-of-the-Art Accuracy: Best WER on LibriSpeech benchmark
  • Multi-Format Support: MP3, WAV, M4A, MP4, WebM, OGG, FLAC
  • No Quota Limits: Dedicated A10G GPU — no ZeroGPU quota constraints
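The auto-chunking mentioned in the feature list can be pictured with a small sketch. The Space's actual chunking strategy is not documented in this README, so the function name `chunk_bounds` and the fixed 10-minute chunk length (taken from the ">10 minutes" threshold) are assumptions for illustration only.

```python
def chunk_bounds(duration_s: float, chunk_s: float = 600.0) -> list[tuple[float, float]]:
    """Split [0, duration_s] into consecutive (start, end) windows of at most chunk_s seconds.

    Hypothetical helper: the 600 s default is an assumption, not the Space's real setting.
    """
    if duration_s <= 0 or chunk_s <= 0:
        raise ValueError("duration_s and chunk_s must be positive")
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)  # last chunk may be shorter
        bounds.append((start, end))
        start = end
    return bounds


# A 25-minute file yields two full chunks and one 5-minute remainder.
bounds_25_min = chunk_bounds(25 * 60)
```

A real implementation would likely overlap adjacent chunks or cut at silences so that words are not split at a boundary; this sketch only shows the boundary arithmetic.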

API Usage

Full Transcription

from gradio_client import Client
import base64

client = Client("Cadayn/parakeet-zerogpu")

# From URL
result = client.predict(
    audio_url="https://example.com/audio.mp3",
    api_name="/api_transcribe"
)
print(result)
# {"success": True, "text": "...", "words": [{"start_s": 0.0, "end_s": 0.5, "text": "Hello"}, ...]}

# From file
with open("audio.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

result = client.predict(
    audio_base64=audio_b64,
    api_name="/api_transcribe"
)
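
Once a response comes back, the word list can be turned into readable timestamped lines. The following is a minimal sketch that assumes the response shape shown in the comment above (`success`, `text`, `words` with `start_s`, `end_s`, `text`); the helper name `format_words` and the sample data are hypothetical.

```python
def format_words(result: dict) -> list[str]:
    """Render each word as '[start-end] text', using the response shape shown above."""
    if not result.get("success"):
        raise ValueError("transcription failed")
    return [
        f"[{w['start_s']:.2f}-{w['end_s']:.2f}] {w['text']}"
        for w in result.get("words", [])
    ]


# Sample data mimicking the documented response; not real API output.
sample = {
    "success": True,
    "text": "Hello world",
    "words": [
        {"start_s": 0.0, "end_s": 0.5, "text": "Hello"},
        {"start_s": 0.5, "end_s": 1.0, "text": "world"},
    ],
}
lines = format_words(sample)
```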

Segment Transcription (for streaming)

# For real-time or segment-by-segment processing
with open("segment.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

result = client.predict(
    audio_base64=audio_b64,
    start_s=30.0,  # Offset for timestamp alignment
    api_name="/api_transcribe_segment"
)
print(result)
# {"success": True, "words": [{"start_s": 30.5, "end_s": 31.0, "text": "word"}, ...]}
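
Because each segment response already carries timestamps shifted by `start_s`, combining segments is mostly a matter of concatenating and sorting the word lists. The sketch below assumes the response shape shown above; the helper name `merge_segment_results` and the shape of a failed response (`{"success": False}`) are assumptions.

```python
from typing import Iterable


def merge_segment_results(results: Iterable[dict]) -> list[dict]:
    """Combine per-segment responses into one word list ordered by start time."""
    words = []
    for result in results:
        if not result.get("success"):
            continue  # assumption: failed segments are skipped rather than retried
        words.extend(result.get("words", []))
    return sorted(words, key=lambda w: w["start_s"])


# Sample responses, possibly arriving out of order; not real API output.
segment_results = [
    {"success": True, "words": [{"start_s": 30.5, "end_s": 31.0, "text": "world"}]},
    {"success": True, "words": [{"start_s": 0.2, "end_s": 0.6, "text": "hello"}]},
    {"success": False},
]
merged = merge_segment_results(segment_results)
transcript = " ".join(w["text"] for w in merged)
```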

Model

Uses nvidia/parakeet-tdt-0.6b-v3:

  • 600M parameters
  • TDT (Token-and-Duration Transducer) architecture
  • 25 European languages with automatic language detection
  • Long audio support (up to 24 minutes with full attention, up to 3 hours with local attention)
  • Runs on dedicated A10G Small (24 GB VRAM)

Performance

| Audio Length | Transcription Time |
|--------------|--------------------|
| 1 minute     | ~20 ms             |
| 10 minutes   | ~200 ms            |
| 1 hour       | ~1.2 s             |
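
The figures above correspond to a constant real-time factor (audio duration divided by transcription time). A short calculation using only the numbers from the table makes this explicit; the function name is illustrative.

```python
# (audio length in seconds, transcription time in milliseconds), from the table above
measurements = [(60, 20), (600, 200), (3600, 1200)]


def real_time_factor(audio_s: float, transcribe_ms: float) -> float:
    """How many seconds of audio are processed per second of compute."""
    return audio_s * 1000 / transcribe_ms


factors = [real_time_factor(audio_s, ms) for audio_s, ms in measurements]
```

All three rows work out to roughly 3000x real time, which suggests the approximate timings scale linearly with audio length.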

Integration with EagleEye

Set the following environment variable in EagleEye:

PARAKEET_HF_SPACE_URL=https://cadayn-parakeet-zerogpu.hf.space
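
How EagleEye consumes this variable is not shown in this README. As a rough sketch, a client could resolve the Space address from the environment like this; the helper names `resolve_space` and `make_client` and the fallback to the Space ID are assumptions.

```python
import os

DEFAULT_SPACE = "Cadayn/parakeet-zerogpu"  # Space ID used in the API examples


def resolve_space(env=None) -> str:
    """Return the Space URL from PARAKEET_HF_SPACE_URL, or the Space ID as a fallback."""
    env = os.environ if env is None else env
    url = env.get("PARAKEET_HF_SPACE_URL", "").strip()
    return url.rstrip("/") if url else DEFAULT_SPACE


def make_client(env=None):
    """Build a gradio_client.Client; it accepts either a Space ID or a full URL."""
    from gradio_client import Client  # imported lazily so resolve_space has no dependency

    return Client(resolve_space(env))
```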

Limitations

  • Runs on a dedicated A10G Small GPU, billed at $1.00/hr while the Space is running
  • Pause the Space from its Hugging Face settings when it is not in use to save credits
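
To get a feel for the billing impact, the hourly rate above can be turned into a quick estimate. This is a sketch only: the usage patterns are made-up examples, and `pause_parakeet_space` is a hypothetical wrapper around `HfApi.pause_space` from `huggingface_hub`, which needs a token with write access to the Space.

```python
HOURLY_RATE_USD = 1.00  # rate quoted in the limitations above


def running_cost(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Cost in USD of keeping the Space running for the given number of hours."""
    return round(hours * rate, 2)


always_on_30_days = running_cost(24 * 30)  # never paused for a 30-day month
workdays_only = running_cost(8 * 22)       # example: 8 hours on 22 working days


def pause_parakeet_space(repo_id: str = "Cadayn/parakeet-zerogpu") -> None:
    """Pause the Space programmatically instead of via the settings page."""
    from huggingface_hub import HfApi  # imported lazily; requires an authenticated token

    HfApi().pause_space(repo_id)
```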