base_model: unsloth/gemma-3n-E4B-it
tags:
  - text-generation-inference
  - transformers
  - unsloth
  - gemma3n
  - egyptian
  - codeswitch
  - arabic
  - masry
license: apache-2.0
language:
  - en
  - ar
datasets:
  - MohamedRashad/arabic-english-code-switching
pipeline_tag: automatic-speech-recognition

🇪🇬🎙 MasriSwitch-Gemma3n-Transcriber-v1

MasriSwitch-Gemma3n-Transcriber is an automatic speech transcription model specialized for Egyptian Arabic with strong English code-switching capabilities.

It is one of the few publicly available models explicitly optimized for:

- Egyptian Arabic dialect transcription
- Natural Arabic ↔ English code-switching
- Short and medium-length real-world audio

The model was trained on:

- The MohamedRashad/arabic-english-code-switching dataset
- A private Egyptian speech dataset containing real conversational audio, voice notes, and mixed Arabic/English speech recordings

🔍 Overview

MasriSwitch-Gemma3n-Transcriber is built on the Gemma3n conditional generation architecture and fine-tuned to understand natural Egyptian speech patterns, including mixed Arabic/English utterances commonly used in daily life, workplaces, and online content.

It is suitable for:

- Social media content transcription
- Customer support calls
- Meetings, voice notes, and interviews
- Research in dialectal ASR
- Multilingual speech processing
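
For research on code-switched output, it can help to see where a transcription moves between Arabic and English. The sketch below labels each whitespace-separated token by script. The function name `tag_scripts` and the example sentence are illustrative assumptions (the sentence is made up, not model output), and only the basic Arabic Unicode block (U+0600 to U+06FF) is checked.

```python
def tag_scripts(text):
    """Label each whitespace-separated token as 'ar', 'en', or 'other'."""
    tags = []
    for token in text.split():
        if any("\u0600" <= ch <= "\u06FF" for ch in token):
            # At least one character from the basic Arabic block
            tags.append((token, "ar"))
        elif any(ch.isascii() and ch.isalpha() for ch in token):
            # At least one ASCII letter and no Arabic characters
            tags.append((token, "en"))
        else:
            # Digits, punctuation, or any other script
            tags.append((token, "other"))
    return tags


if __name__ == "__main__":
    # Hypothetical code-switched sentence, used only for illustration
    for token, tag in tag_scripts("أنا عندي meeting بكرة"):
        print(tag, token)
```

A token that mixes both scripts is counted as Arabic here; a finer-grained analysis would need character-level handling.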

✨ Features

🗣 Egyptian Arabic dialect-aware transcription
🔀 Accurate English code-switching support
🎧 Strong performance on informal, real-world speech
⚡ Optimized for short (10–30s) audio segments
🤖 Built using the Gemma3n generation-based ASR pipeline
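
Since the model targets short clips, a quick duration check before transcription can flag files that are too long. This is a minimal sketch using only Python's standard `wave` module, so it handles uncompressed PCM WAV files only. The helper names and the 30-second threshold constant are assumptions taken from the upper end of the range listed above.

```python
import wave

MAX_CLIP_SECONDS = 30.0  # assumed threshold: upper end of the 10–30 s range


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def needs_segmentation(path, max_seconds=MAX_CLIP_SECONDS):
    """Return True when the clip is longer than max_seconds."""
    return wav_duration_seconds(path) > max_seconds
```

Other formats (MP3, OGG, and so on) would need a different loader; the threshold logic stays the same.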

🎯 Intended Use

Use this model for:

- Speech-to-text systems
- Captioning and subtitling
- Chat or voice assistant pipelines
- Indexing/searching Arabic audio content
- Research and experimentation
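
For the captioning and subtitling use case, transcriptions need to be paired with timings. The model returns plain text without timestamps, so the sketch below assumes you already have coarse timings, such as the start and end of each audio segment you transcribed. It formats `(start_seconds, end_seconds, text)` tuples as SubRip (SRT) captions. The function names are illustrative and not part of any library.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    # Caption blocks are separated by a blank line
    return "\n".join(blocks)
```

Because the timings are per segment rather than per word, captions produced this way are only as precise as the segmentation.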

⚠️ Limitations

- Best results with clean audio and single speakers
- Not optimized for Gulf, Levantine, or MSA-only speech
- Struggles with:
  - Heavy noise
  - Overlapping speakers
  - Fast speech
- Long recordings should be split into segments before transcription (20–30 s per segment recommended)
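
One way to plan the segmentation of a long recording is to compute fixed-length windows up front and then cut the audio with whichever library you use. The sketch below is an assumption-laden illustration, not the author's pipeline: `segment_bounds` is a hypothetical helper, and the optional overlap between windows is an added idea intended to reduce the chance of cutting a word in half.

```python
def segment_bounds(total_s, max_len_s=30.0, overlap_s=0.0):
    """Return (start, end) times in seconds that cover [0, total_s]."""
    if max_len_s <= 0 or not 0 <= overlap_s < max_len_s:
        raise ValueError("need max_len_s > 0 and 0 <= overlap_s < max_len_s")

    bounds = []
    start = 0.0
    step = max_len_s - overlap_s  # how far each new window advances
    while start < total_s:
        end = min(start + max_len_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break  # the last window reached the end of the recording
        start += step
    return bounds


if __name__ == "__main__":
    # A 65-second recording split into windows of at most 30 seconds
    for start, end in segment_bounds(65.0, max_len_s=30.0):
        print(f"{start:.1f}s to {end:.1f}s")
```

Each window can then be exported as its own clip and transcribed separately. With a non-zero overlap, neighbouring transcriptions may repeat a few words at the boundaries.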

🛡 Safety & Privacy

- Transcriptions may include sensitive user data; handle them with care.
- The model should not be used for high-stakes decisions without human review.
- Biases in the training data may affect accuracy.

🧪 Inference Example (Python)

import torch
from transformers import AutoProcessor, Gemma3nForConditionalGeneration

MODEL_ID = "oddadmix/egyptian-code-switching-b4-g2-merged"

def load_model_and_processor(model_id=MODEL_ID, device=None):
    """Load the model and its processor, preferring a GPU when one is available."""
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"

    print(f"Loading model {model_id} to device {device}...")

    # On GPU, load in bfloat16 and let device_map place the weights.
    # On CPU, no dtype or device map is forced.
    model = Gemma3nForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16 if device == "cuda" else None,
        device_map="auto" if device == "cuda" else None,
    ).eval()

    # Fallback: move the model if device_map placed no weights on the GPU.
    if device == "cuda" and not any(p.device.type == "cuda" for p in model.parameters()):
        model.to("cuda")

    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor, device


def transcribe_file(model, processor, audio_path, max_new_tokens=128):
    if not audio_path:
        raise ValueError("audio_path must point to an audio file")

    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are an assistant that transcribes speech accurately."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "audio", "url": audio_path},
                {"type": "text", "text": "Please transcribe this audio."}
            ],
        },
    ]

    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )

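    # Move inputs to the model's device and cast floating-point tensors
    # (the audio features) to the model's dtype; integer token IDs keep their type.
    inputs = inputs.to(model.device, dtype=model.dtype)
    input_len = inputs["input_ids"].shape[-1]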

    with torch.inference_mode():
        generated = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
        )

    gen_tokens = generated[0][input_len:]
    text = processor.decode(gen_tokens, skip_special_tokens=True)
    return text


if __name__ == "__main__":
    audio_path = "path/to/audio.wav"
    model, processor, device = load_model_and_processor()
    transcription = transcribe_file(model, processor, audio_path, max_new_tokens=256)
    print("Transcription:", transcription)