{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "gpuType": "T4" }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Rade-ASR-CTC-3B-fa — Persian Speech-to-Text\n", "\n", "Run Meta's **Omnilingual ASR CTC-3B**, fine-tuned on **Persian** by [Rade AI](https://huggingface.co/RadeAI).\n", "\n", "**Steps:** set a **GPU** runtime (`Runtime ▸ Change runtime type ▸ T4 GPU`), run **Cell 1**, then **`Runtime ▸ Restart session`**, then run the rest. Audio clips must be **< 40 s**.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cell 1 — install (then RESTART the session)\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "!apt-get -qq install -y libsndfile1\n", "# need omnilingual-asr 0.2.0 (it registers the 3b_v2 architecture this model uses).\n", "# --ignore-requires-python: 0.2.0's metadata caps python at '<=3.12', which pip reads as\n", "# <=3.12.0 and wrongly rejects Colab's 3.12.x — the flag installs it anyway (it works on 3.12).\n", "!pip install -q --ignore-requires-python omnilingual-asr==0.2.0 huggingface_hub\n", "# fairseq2 needs the CUDA 12.8 torch build; pin all three or you hit libcudart/torchvision errors\n", "!pip install -q torch==2.8.0 torchaudio==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128\n", "print('installed — now click Runtime ▸ Restart session, then run the cells below')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ⚠️ Now do `Runtime ▸ Restart session`, then continue ↓\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cell 2 — check GPU\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "import torch\n", "print('torch', torch.__version__, '| CUDA', torch.cuda.is_available(),\n", "      torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'NO GPU')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cell 3 — download the fine-tuned weights (single fp16 file, ~6.2 GB)\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "from huggingface_hub import hf_hub_download\n", "# single consolidated fp16 file — half the download of the fp32 shards, identical output\n", "ckpt = hf_hub_download('RadeAI/Rade-ASR-CTC-3B-fa', 'model_fp16.pt')\n", "print('weights at', ckpt)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cell 4 — register the model with fairseq2\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "import pathlib\n", "ad = pathlib.Path.home()/'.config/fairseq2/assets'; ad.mkdir(parents=True, exist_ok=True)\n", "(ad/'rade.yaml').write_text(f'''name: rade_CTC_3B_fa\n", "model_family: wav2vec2_asr\n", "model_arch: 3b_v2\n", "checkpoint: \"{ckpt}\"\n", "tokenizer_ref: omniASR_tokenizer_written_v2\n", "''')\n", "print('asset card ready')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cell 5 — get a Persian clip (< 40 s)\n", "\n", "By default this grabs the sample clip shipped in the repo. To use **your own** audio, run the commented `files.upload()` lines instead.\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "# default: use the sample clip from the repo\n", "audio_path = hf_hub_download('RadeAI/Rade-ASR-CTC-3B-fa', 'sample_fa.wav')\n", "\n", "# --- or upload your own (uncomment) ---\n", "# from google.colab import files\n", "# up = files.upload(); audio_path = list(up.keys())[0]\n", "print('using', audio_path)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cell 6 — transcribe 🎙️ → 📝\n" ] }, { "cell_type": "code", "metadata": {}, "execution_count": null, "outputs": [], "source": [ "import torch  # re-imported so this cell still runs if Cell 2 was skipped after the restart\n", "from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline\n", "pipe = ASRInferencePipeline(model_card='rade_CTC_3B_fa',\n", " device='cuda' if torch.cuda.is_available() else 'cpu',\n", " dtype=torch.float16) # ~199x real time, 6.4 GB VRAM\n", "text = pipe.transcribe([audio_path], lang=['pes_Arab'], batch_size=1)\n", "print('📝', text[0])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "Made by [Rade AI](https://huggingface.co/RadeAI) · base: [facebook/omniASR-CTC-3B](https://huggingface.co/facebook/omniASR-CTC-3B) · Apache-2.0\n" ] } ] }