merge2
- README.md +434 -3
- __pycache__/evaluate_common_voice.cpython-310.pyc +0 -0
- __pycache__/generate_plots.cpython-310.pyc +0 -0
- evaluate_common_voice.py +404 -0
- generate_plots.py +478 -0
README.md CHANGED
@@ -1,3 +1,434 @@

Removed (the old README was a license-only YAML stub; the closing `---` on line 3 is reconstructed from the frontmatter format):

    - ---
    - license: openrail
    - ---

Added (the new 434-line model card, reproduced below):
---
license: openrail
language:
- da
base_model: Qwen/Qwen3-ASR-1.7B
tags:
- automatic-speech-recognition
- danish
- qwen
- asr
- speech-to-text
- coral
- streaming
datasets:
- alexandrainst/coral
- mozilla-foundation/common_voice_17_0
library_name: transformers
pipeline_tag: automatic-speech-recognition
metrics:
- wer
- cer
model-index:
- name: hvisketiske-v2
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      type: alexandrainst/coral
      name: CoRal v2 Test
      split: test
    metrics:
    - type: wer
      value: 18.47
      name: WER
    - type: cer
      value: 7.86
      name: CER
---

# hvisketiske-v2: Danish ASR Model

**hvisketiske-v2** is a state-of-the-art Danish automatic speech recognition (ASR) model based on [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B), finetuned on the [CoRal v2 dataset](https://huggingface.co/datasets/alexandrainst/coral) for improved Danish transcription accuracy.

## Key Highlights

| Feature | Value |
|---------|-------|
| **WER on CoRal v2** | 18.47% (14% relative improvement over Whisper v3) |
| **CER on CoRal v2** | 7.86% (11% relative improvement over Whisper v3) |
| **Real-Time Factor** | 0.086 (45% lower than Whisper v3) |
| **Model Size** | ~1.7B parameters |

### Inherited Features from Qwen3-ASR

- **Streaming/real-time transcription** via the vLLM backend
- **Singing detection** - can transcribe singing voice and songs with BGM
- **Word-level timestamps** via forced alignment
- **30+ language support** (Danish optimized)
- **Long audio support** - up to 20 minutes per request

---

## Performance Comparison

### CoRal v2 Test Set (9,123 samples, 17.3 hours)

| Model | WER | CER | RTF | Throughput | Parameters |
|-------|-----|-----|-----|------------|------------|
| **hvisketiske-v2** | **18.47%** | **7.86%** | **0.086** | 1.71 samples/s | ~1.7B |
| hviske-v3 (Whisper Large v3) | 21.47% | 8.79% | 0.156 | 0.94 samples/s | ~2B |

**Improvements over Whisper Large v3:**

- **14% relative reduction** in Word Error Rate (21.47% → 18.47%)
- **11% relative reduction** in Character Error Rate (8.79% → 7.86%)
- **45% lower** Real-Time Factor (0.156 → 0.086), i.e. roughly 1.8× the throughput
- **15% fewer** parameters (~1.7B vs ~2B)

### Comparison Plots





---

## Quick Start

### Installation

```bash
pip install qwen-asr transformers torch
```
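The streaming examples further down additionally require the vLLM backend (see the Limitations section); assuming the standard PyPI package name, it can be installed with `pip install vllm`.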
### Basic Usage

```python
from qwen_asr import Qwen3ASRModel

# Load the model
model = Qwen3ASRModel.from_pretrained(
    "pluttodk/hvisketiske-v2",
    dtype="bfloat16",
    device_map="cuda:0",
)

# Transcribe an audio file
results = model.transcribe(
    audio="path/to/danish_audio.wav",
    language="Danish",
)

print(results[0].text)
```

---

## Advanced Usage

### Batch Transcription (Fast Processing)

Process multiple audio files efficiently in a single call:

```python
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "pluttodk/hvisketiske-v2",
    dtype="bfloat16",
    device_map="cuda:0",
    max_inference_batch_size=16,  # process up to 16 files at once
)

# Batch transcribe multiple files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = model.transcribe(
    audio=audio_files,
    language="Danish",
)

for i, result in enumerate(results):
    print(f"File {i + 1}: {result.text}")
```

### Transcription with Timestamps

Get word-level timestamps using the forced aligner:

```python
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "pluttodk/hvisketiske-v2",
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    dtype="bfloat16",
    device_map="cuda:0",
)

results = model.transcribe(
    audio="path/to/audio.wav",
    language="Danish",
    return_time_stamps=True,
)

# Access word-level timestamps
for item in results[0].time_stamps.items:
    print(f"{item.start_time:.2f}s - {item.end_time:.2f}s: {item.text}")
```

### Streaming/Real-time Transcription (vLLM Backend)

For real-time streaming transcription, use the vLLM backend:

```python
from qwen_asr import Qwen3ASRModel

# Initialize with the vLLM backend for streaming
model = Qwen3ASRModel.LLM(
    model="pluttodk/hvisketiske-v2",
    gpu_memory_utilization=0.8,
)

# Initialize streaming state
state = model.init_streaming_state(
    language="Danish",
    chunk_size_sec=2.0,  # process audio in 2-second chunks
)

# Simulate streaming audio (16kHz mono float32)
import numpy as np

def audio_stream():
    """Replace with an actual audio stream from a microphone."""
    for chunk in audio_chunks:
        yield np.array(chunk, dtype=np.float32)

# Process streaming audio
for audio_chunk in audio_stream():
    state = model.streaming_transcribe(audio_chunk, state)
    print(f"Current transcription: {state.text}")

# Finalize the stream
state = model.finish_streaming_transcribe(state)
print(f"Final transcription: {state.text}")
```

### Using with Transformers Directly

For more control, use the model directly with transformers:

```python
from transformers import AutoModel, AutoProcessor
import torch
import librosa

# Load model and processor
model = AutoModel.from_pretrained(
    "pluttodk/hvisketiske-v2",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
)
processor = AutoProcessor.from_pretrained(
    "pluttodk/hvisketiske-v2",
    trust_remote_code=True,
)

# Load and preprocess audio
audio, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)

# Build input using the chat template
messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": [{"type": "audio", "audio": audio}]},
]

text = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)
text = text + "language Danish<asr_text>"

# Process and generate
inputs = processor(text=[text], audio=[audio], return_tensors="pt", padding=True)
inputs = inputs.to(model.device).to(model.dtype)

output_ids = model.generate(**inputs, max_new_tokens=512)
transcription = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]

print(transcription)
```

### Singing Detection & Multi-Audio Support

The model inherits Qwen3-ASR's ability to handle singing and background music:

```python
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "pluttodk/hvisketiske-v2",
    dtype="bfloat16",
    device_map="cuda:0",
)

# Transcribe audio with singing or background music
results = model.transcribe(
    audio="path/to/song.wav",
    language="Danish",  # or None for auto-detection
)

print(results[0].text)
```

---

## Model Details

### Model Description

hvisketiske-v2 is a Danish-specialized automatic speech recognition model created by finetuning [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) on the [CoRal v2 dataset](https://huggingface.co/datasets/alexandrainst/coral). The model achieves state-of-the-art performance on Danish speech recognition while maintaining fast inference speeds.

- **Developed by:** Mathias Oliver Valdbjørn Rønnelund
- **Model type:** Encoder-decoder speech recognition model
- **Language:** Danish (primary), with inherited multilingual capabilities
- **License:** Apache 2.0
- **Finetuned from:** [Qwen/Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B)

### Architecture

The model inherits the Qwen3-ASR architecture:

| Component | Specification |
|-----------|--------------|
| Audio Encoder | 24-layer transformer (1024 hidden dim, 16 attention heads) |
| Text Decoder | 28-layer transformer (2048 hidden dim, 16 attention heads) |
| Total Parameters | ~1.7 billion |
| Precision | bfloat16 |
| Audio Input | 16kHz mono WAV |
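As a quick sanity check of the parameter count in the table above, a minimal sketch (assuming the model loads via `AutoModel` with `trust_remote_code=True`, as in the transformers example earlier; this downloads the full weights):

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("pluttodk/hvisketiske-v2", trust_remote_code=True)

# Sum parameter counts across all weight tensors; should land near 1.7B.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters")
```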
---

## Training Details

### Training Data

The model was finetuned on the [CoRal v2 dataset](https://huggingface.co/datasets/alexandrainst/coral), a comprehensive Danish speech corpus containing:
- Diverse Danish speakers across demographics
- Various recording conditions and audio qualities
- Natural conversational speech
- Read-aloud speech

### Training Procedure

**Training Approach:** Supervised Fine-Tuning (SFT) with chat template formatting

**Preprocessing:**
- Audio resampled to 16kHz mono
- Chat template applied with system prompt, audio input, and target transcription
- Prefix masking to train only on transcription tokens (see the sketch below)
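The training code itself is not part of this upload; as a rough illustration of the prefix-masking step, here is a minimal sketch assuming Hugging Face-style label construction, where `-100` is the standard ignore index for the cross-entropy loss:

```python
import torch

IGNORE_INDEX = -100  # positions with this label are excluded from the loss


def build_labels(input_ids: torch.Tensor, prompt_length: int) -> torch.Tensor:
    """Train only on transcription tokens: mask the chat/audio prefix."""
    labels = input_ids.clone()
    labels[:prompt_length] = IGNORE_INDEX  # system prompt, audio tokens, template
    return labels
```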
**Training Hyperparameters:**

| Parameter | Value |
|-----------|-------|
| Base model | Qwen/Qwen3-ASR-1.7B |
| Learning rate | 2e-5 |
| Batch size (per device) | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 32 |
| Epochs | 3 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Max gradient norm | 1.0 |
| Precision | bfloat16 |
| Optimizer | AdamW |
| LR scheduler | Linear decay |
| Total training steps | 23,448 |

**Hardware:** Training performed on NVIDIA GPUs (~25GB of GPU memory per device)
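For orientation, a minimal sketch of how these hyperparameters map onto standard Hugging Face `TrainingArguments` (the argument names are stock `transformers` ones; the actual training script is not included here, and the output path is illustrative):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs/hvisketiske-v2",  # illustrative path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # effective batch size: 8 * 4 = 32
    num_train_epochs=3,
    warmup_ratio=0.1,
    weight_decay=0.01,
    max_grad_norm=1.0,
    bf16=True,
    lr_scheduler_type="linear",
    optim="adamw_torch",
)
```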
---

## Evaluation

### Test Data

Evaluated on the CoRal v2 test split:
- **9,123 samples**
- **17.3 hours** of audio (about 6.8 seconds per clip on average)
- Diverse Danish speakers and recording conditions

### Metrics

| Metric | Description |
|--------|-------------|
| **WER** | Word Error Rate - percentage of words incorrectly transcribed (lower is better) |
| **CER** | Character Error Rate - percentage of characters incorrectly transcribed (lower is better) |
| **RTF** | Real-Time Factor - ratio of processing time to audio duration (< 1.0 = faster than real-time) |
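For reference, a minimal sketch of how these metrics are computed with the `jiwer` package (the same package the evaluation script below uses); the sentences are made-up examples:

```python
from jiwer import cer, wer

references = ["det er en dejlig dag"]
predictions = ["det er en dejlig tag"]

print(f"WER: {wer(references, predictions):.2%}")  # 1 wrong word out of 5 -> 20.00%
print(f"CER: {cer(references, predictions):.2%}")  # 1 wrong character out of 20 -> 5.00%

# RTF is simply processing time divided by audio duration,
# e.g. 1.5 s of compute for a 17.4 s clip:
print(f"RTF: {1.5 / 17.4:.3f}")  # < 1.0 means faster than real-time
```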
### Results Summary

| Model | WER | CER | RTF | Throughput |
|-------|-----|-----|-----|------------|
| **hvisketiske-v2** | **18.47%** | **7.86%** | **0.086** | 1.71 samples/s |
| hviske-v3 (Whisper v3) | 21.47% | 8.79% | 0.156 | 0.94 samples/s |

---

## Limitations

- **Language:** Optimized for Danish; other languages may show degraded performance compared to the base Qwen3-ASR
- **Audio quality:** Best results with clear speech; noisy environments may reduce accuracy
- **Domain:** Trained on CoRal v2, which is primarily conversational/read-aloud speech; specialized domains (medical, legal, technical) may have higher error rates
- **Streaming:** Real-time streaming requires the vLLM backend to be installed

## Intended Use

### Primary Use Cases
- Danish speech-to-text transcription
- Subtitle generation for Danish content
- Voice assistant backends
- Meeting transcription
- Accessibility applications

### Out-of-Scope Use
- Non-Danish languages (use the base Qwen3-ASR instead)
- Real-time speaker diarization (not supported)
- Emotion/sentiment detection from speech

---

## Citation

If you use this model, please cite:

```bibtex
@misc{hvisketiske-v2,
  author    = {Rønnelund, Mathias Oliver Valdbjørn},
  title     = {hvisketiske-v2: Danish ASR Model based on Qwen3-ASR},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/pluttodk/hvisketiske-v2}
}
```

Also consider citing the base model and dataset:

```bibtex
@article{qwen3asr,
  title   = {Qwen3-ASR Technical Report},
  author  = {Qwen Team},
  journal = {arXiv preprint arXiv:2601.21337},
  year    = {2025}
}

@dataset{coral,
  title  = {CoRal: A Danish Speech Corpus},
  author = {Alexandra Institute},
  year   = {2024},
  url    = {https://huggingface.co/datasets/alexandrainst/coral}
}
```

---

## Acknowledgements

- [Qwen Team](https://github.com/QwenLM) for the excellent Qwen3-ASR base model
- [Alexandra Institute](https://alexandra.dk/) for the CoRal v2 Danish speech corpus
__pycache__/evaluate_common_voice.cpython-310.pyc ADDED
Binary file (11.1 kB).

__pycache__/generate_plots.cpython-310.pyc ADDED
Binary file (11.2 kB).

evaluate_common_voice.py ADDED
@@ -0,0 +1,404 @@
```python
#!/usr/bin/env python
"""
Benchmark ASR models on the Common Voice Danish dataset.

This script evaluates hvisketiske-v2 (Qwen3-ASR) and hviske-v3 (Whisper)
on the Mozilla Common Voice Danish test set for comparison.

IMPORTANT: Common Voice requires authentication and agreement to its terms of use.
Before running this script:
1. Create a HuggingFace account at https://huggingface.co
2. Visit https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0
3. Agree to the dataset terms of use
4. Create an access token at https://huggingface.co/settings/tokens
5. Log in via CLI: `huggingface-cli login`

Usage:
    # After logging in:
    python huggingface/evaluate_common_voice.py \
        --hvisketiske-path ./outputs/hvisketiske-v2/checkpoint-23448 \
        --max-samples 1000 \
        --output-file ./results/common_voice_comparison.json

    # Quick test with fewer samples:
    python huggingface/evaluate_common_voice.py --max-samples 100

    # Use a specific token:
    python huggingface/evaluate_common_voice.py --hf-token YOUR_TOKEN
"""

import argparse
import json
import sys
import tempfile
import time
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional

import soundfile as sf
from datasets import load_dataset
from jiwer import cer, wer
from tqdm import tqdm

# Add src to path for imports
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))

from hvisketiske.evaluation.model_adapters import (
    ASRModelAdapter,
    HviskeV3Adapter,
    Qwen3ASRAdapter,
    TranscriptionResult,
)
from hvisketiske.evaluation.timing import AggregatedTimingStats


@dataclass
class CommonVoiceSample:
    """A single Common Voice sample."""

    audio_path: str
    reference: str
    audio_duration: float


def load_common_voice_danish(
    split: str = "test",
    max_samples: Optional[int] = None,
    cache_dir: Optional[str] = None,
    hf_token: Optional[str] = None,
) -> List[CommonVoiceSample]:
    """
    Load the Common Voice Danish dataset and prepare samples.

    Args:
        split: Dataset split to load (test, validation, train).
        max_samples: Maximum number of samples to load.
        cache_dir: Directory to cache audio files.
        hf_token: HuggingFace API token for authentication.

    Returns:
        List of CommonVoiceSample objects.
    """
    print(f"Loading Common Voice Danish ({split} split)...")
    print("Note: This requires HuggingFace authentication and agreement to dataset terms.")
    print("Visit: https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0")
    print()

    try:
        ds = load_dataset(
            "mozilla-foundation/common_voice_17_0",
            "da",
            split=split,
            trust_remote_code=True,
            token=hf_token,
        )
    except Exception as e:
        error_msg = str(e)
        if "EmptyDatasetError" in error_msg or "doesn't contain any data" in error_msg:
            print("\n" + "=" * 70)
            print("ERROR: Cannot access Common Voice dataset.")
            print("=" * 70)
            print("\nThis dataset requires authentication. Please:")
            print("1. Visit https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0")
            print("2. Log in and agree to the terms of use")
            print("3. Run: huggingface-cli login")
            print("4. Or pass --hf-token YOUR_TOKEN to this script")
            print("=" * 70 + "\n")
        raise

    if max_samples:
        ds = ds.select(range(min(max_samples, len(ds))))

    print(f"Loaded {len(ds)} samples")

    # Create temp directory for audio files if not provided
    if cache_dir is None:
        cache_dir = tempfile.mkdtemp(prefix="cv_danish_")

    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)

    samples = []
    print("Preparing audio files...")
    for i, item in enumerate(tqdm(ds, desc="Preparing samples")):
        # Extract audio array and sample rate
        audio_array = item["audio"]["array"]
        sample_rate = item["audio"]["sampling_rate"]

        # Save to temp file
        audio_path = cache_path / f"sample_{i:06d}.wav"
        sf.write(str(audio_path), audio_array, sample_rate)

        # Calculate duration
        duration = len(audio_array) / sample_rate

        samples.append(
            CommonVoiceSample(
                audio_path=str(audio_path),
                reference=item["sentence"],
                audio_duration=duration,
            )
        )

    return samples


def normalize_text(text: str) -> str:
    """Normalize text for fair comparison."""
    text = text.lower()
    text = " ".join(text.split())
    return text


def evaluate_model(
    model: ASRModelAdapter,
    samples: List[CommonVoiceSample],
    warmup_samples: int = 3,
) -> dict:
    """
    Evaluate a model on the Common Voice samples.

    Args:
        model: Model adapter to evaluate.
        samples: List of samples to evaluate.
        warmup_samples: Number of warmup iterations.

    Returns:
        Dictionary with evaluation results.
    """
    print(f"\nEvaluating: {model.model_name}")
    print("Loading model...")
    model.load()

    # Warmup
    if warmup_samples > 0 and samples:
        print(f"Running {warmup_samples} warmup iterations...")
        model.warmup(samples[0].audio_path, num_runs=warmup_samples)

    # Transcribe all samples
    predictions = []
    individual_times = []
    total_audio_duration = 0.0
    total_inference_time = 0.0

    print(f"Transcribing {len(samples)} samples...")
    for sample in tqdm(samples, desc=f"Evaluating {model.model_name[:30]}"):
        result = model.transcribe(sample.audio_path)
        predictions.append(result.text)
        individual_times.append(result.inference_time_seconds)
        total_audio_duration += sample.audio_duration
        total_inference_time += result.inference_time_seconds

    # Normalize text
    predictions_norm = [normalize_text(p) for p in predictions]
    references_norm = [normalize_text(s.reference) for s in samples]

    # Calculate metrics
    word_error_rate = wer(references_norm, predictions_norm)
    char_error_rate = cer(references_norm, predictions_norm)

    timing_stats = AggregatedTimingStats(
        total_inference_time_seconds=total_inference_time,
        total_audio_duration_seconds=total_audio_duration,
        num_samples=len(samples),
        individual_times=individual_times,
    )

    return {
        "model_name": model.model_name,
        "model_size": model.model_size_params,
        "accuracy": {
            "wer": word_error_rate,
            "cer": char_error_rate,
        },
        "performance": {
            "total_inference_time_seconds": timing_stats.total_inference_time_seconds,
            "total_audio_duration_seconds": timing_stats.total_audio_duration_seconds,
            "real_time_factor": timing_stats.real_time_factor,
            "throughput_samples_per_second": timing_stats.throughput_samples_per_second,
            "mean_time_per_sample_seconds": timing_stats.mean_time_per_sample,
            "std_time_per_sample_seconds": timing_stats.std_time_per_sample,
        },
        "num_samples": len(samples),
    }


def print_summary(results: dict) -> None:
    """Print a formatted comparison summary."""
    print("\n" + "=" * 80)
    print("COMMON VOICE DANISH - ASR MODEL COMPARISON")
    print("=" * 80)
    print("Dataset: mozilla-foundation/common_voice_17_0 (Danish)")
    print(f"Number of models: {len(results['models'])}")

    sample_count = next(iter(results["models"].values()))["num_samples"]
    print(f"Samples evaluated: {sample_count}")

    # Accuracy comparison table
    print("\n" + "-" * 80)
    print("ACCURACY METRICS (lower is better)")
    print("-" * 80)
    print(f"{'Model':<45} {'WER':>12} {'CER':>12}")
    print("-" * 80)
    for name, result in sorted(
        results["models"].items(), key=lambda x: x[1]["accuracy"]["wer"]
    ):
        print(
            f"{result['model_name'][:45]:<45} "
            f"{result['accuracy']['wer']:>11.2%} "
            f"{result['accuracy']['cer']:>11.2%}"
        )

    # Performance comparison table
    print("\n" + "-" * 80)
    print("PERFORMANCE METRICS (RTF < 1.0 = faster than real-time)")
    print("-" * 80)
    print(f"{'Model':<35} {'RTF':>8} {'Throughput':>12} {'Mean Time':>12}")
    print(f"{'':35} {'':>8} {'(samples/s)':>12} {'(s/sample)':>12}")
    print("-" * 80)
    for name, result in sorted(
        results["models"].items(), key=lambda x: x[1]["performance"]["real_time_factor"]
    ):
        perf = result["performance"]
        print(
            f"{result['model_name'][:35]:<35} "
            f"{perf['real_time_factor']:>8.3f} "
            f"{perf['throughput_samples_per_second']:>12.2f} "
            f"{perf['mean_time_per_sample_seconds']:>12.3f}"
        )

    print("=" * 80)


def parse_args() -> argparse.Namespace:
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(
        description="Benchmark ASR models on Common Voice Danish"
    )

    parser.add_argument(
        "--output-file",
        type=Path,
        default=Path("results/common_voice_comparison.json"),
        help="Path to save comparison report (JSON)",
    )
    parser.add_argument(
        "--max-samples",
        type=int,
        default=None,
        help="Maximum samples to evaluate (for quick testing)",
    )
    parser.add_argument(
        "--warmup",
        type=int,
        default=3,
        help="Number of warmup iterations per model (default: 3)",
    )
    parser.add_argument(
        "--device",
        type=str,
        default="cuda:0",
        help="Device for inference (default: cuda:0)",
    )
    parser.add_argument(
        "--cache-dir",
        type=str,
        default=None,
        help="Directory to cache audio files",
    )
    parser.add_argument(
        "--hf-token",
        type=str,
        default=None,
        help="HuggingFace API token for authentication (or use huggingface-cli login)",
    )

    # Model selection
    parser.add_argument(
        "--skip-hviske-v3",
        action="store_true",
        help="Skip hviske-v3-conversation model",
    )
    parser.add_argument(
        "--skip-hvisketiske",
        action="store_true",
        help="Skip hvisketiske-v2 model",
    )
    parser.add_argument(
        "--hvisketiske-path",
        type=str,
        default="./outputs/hvisketiske-v2/checkpoint-23448",
        help="Path to local hvisketiske checkpoint",
    )

    return parser.parse_args()


def main() -> None:
    """Main entry point for Common Voice evaluation."""
    args = parse_args()

    # Load dataset
    samples = load_common_voice_danish(
        split="test",
        max_samples=args.max_samples,
        cache_dir=args.cache_dir,
        hf_token=args.hf_token,
    )

    # Configure models to evaluate
    models = []

    if not args.skip_hviske_v3:
        models.append(
            HviskeV3Adapter(
                model_id="syvai/hviske-v3-conversation",
                device=args.device,
            )
        )

    if not args.skip_hvisketiske:
        models.append(
            Qwen3ASRAdapter(
                model_path=args.hvisketiske_path,
                device=args.device,
            )
        )

    if not models:
        print("Error: No models selected for evaluation")
        sys.exit(1)

    print("=" * 60)
    print("Common Voice Danish ASR Evaluation")
    print("=" * 60)
    print("Dataset: mozilla-foundation/common_voice_17_0")
    print(f"Samples: {len(samples)}")
    print(f"Device: {args.device}")
    print(f"Warmup iterations: {args.warmup}")
    print(f"Models to evaluate: {len(models)}")
    for m in models:
        print(f"  - {m.model_name} ({m.model_size_params})")
    print("=" * 60)

    # Evaluate all models
    results = {"dataset": "mozilla-foundation/common_voice_17_0", "models": {}}

    for model in models:
        model_results = evaluate_model(model, samples, warmup_samples=args.warmup)
        results["models"][model.model_name] = model_results

    # Print summary
    print_summary(results)

    # Save results
    args.output_file.parent.mkdir(parents=True, exist_ok=True)
    with open(args.output_file, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    print(f"\nResults saved to: {args.output_file}")


if __name__ == "__main__":
    main()
```
generate_plots.py ADDED
@@ -0,0 +1,478 @@
```python
#!/usr/bin/env python
"""
Generate comparison plots for ASR model benchmarks.

Creates publication-quality visualizations comparing hvisketiske-v2
against other Danish ASR models on accuracy and performance metrics.

Usage:
    python huggingface/generate_plots.py

    # Specify custom result files:
    python huggingface/generate_plots.py \
        --coral-results ./results/full_comparison2.json \
        --cv-results ./results/common_voice_comparison.json

Output:
    huggingface/plots/
    ├── wer_comparison.png
    ├── cer_comparison.png
    ├── rtf_comparison.png
    └── accuracy_vs_speed.png
"""

import argparse
import json
from pathlib import Path
from typing import Dict, List, Optional, Tuple

import matplotlib.pyplot as plt
import numpy as np

# Use a clean style
plt.style.use("seaborn-v0_8-whitegrid")

# Color palette - distinct colors for models
COLORS = {
    "hvisketiske": "#2ecc71",  # green for our model (best)
    "qwen3-base": "#27ae60",   # darker green for base Qwen
    "hviske-v2": "#3498db",    # blue for hviske-v2
    "hviske-v3": "#2980b9",    # darker blue for hviske-v3
    "faster": "#e74c3c",       # red for faster-whisper models
    "turbo": "#e67e22",        # orange for turbo
    "default": "#95a5a6",      # gray for others
}

# Model display names mapping
MODEL_DISPLAY_NAMES = {
    "Qwen3-ASR (checkpoint-23448)": "hvisketiske-v2\n(Qwen3-ASR finetuned)",
    "hviske-v3-conversation (Whisper Large v3)": "hviske-v3\n(Whisper v3)",
    "hviske-v2 (Whisper Large v2)": "hviske-v2\n(Whisper v2)",
    "faster-hviske-v2 (CT2 distilled)": "faster-hviske-v2\n(CT2 distilled)",
    "Whisper Large v3 Turbo": "Whisper v3 Turbo\n(faster-whisper)",
    "Qwen3-ASR-1.7B (base)": "Qwen3-ASR-1.7B\n(base, not finetuned)",
}


def get_model_color(model_name: str) -> str:
    """Get the color for a model based on its name."""
    name_lower = model_name.lower()

    # Our finetuned model (highest priority)
    if "hvisketiske" in name_lower or "checkpoint" in name_lower:
        return COLORS["hvisketiske"]
    # Base Qwen3-ASR (not finetuned)
    elif "qwen3-asr-1.7b" in name_lower and "base" in name_lower:
        return COLORS["qwen3-base"]
    elif "qwen" in name_lower:
        return COLORS["hvisketiske"]
    # Turbo model
    elif "turbo" in name_lower:
        return COLORS["turbo"]
    # Faster-whisper models
    elif "faster" in name_lower or "ct2" in name_lower:
        return COLORS["faster"]
    # hviske-v3
    elif "hviske-v3" in name_lower or "v3" in name_lower:
        return COLORS["hviske-v3"]
    # hviske-v2
    elif "hviske-v2" in name_lower or "v2" in name_lower:
        return COLORS["hviske-v2"]
    return COLORS["default"]


def get_display_name(model_name: str) -> str:
    """Get the display name for a model."""
    return MODEL_DISPLAY_NAMES.get(model_name, model_name)


def load_results(path: Path) -> Optional[dict]:
    """Load benchmark results from a JSON file."""
    if not path.exists():
        print(f"Warning: Results file not found: {path}")
        return None
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def extract_metrics(results: dict) -> Tuple[List[str], List[float], List[float], List[float], List[str]]:
    """
    Extract metrics from a results dictionary.

    Returns:
        Tuple of (names, wer_values, cer_values, rtf_values, colors)
    """
    names = []
    wer_values = []
    cer_values = []
    rtf_values = []
    colors = []

    for model_name, data in results["models"].items():
        display_name = get_display_name(model_name)
        names.append(display_name)
        wer_values.append(data["accuracy"]["wer"] * 100)  # convert to percentage
        cer_values.append(data["accuracy"]["cer"] * 100)
        rtf_values.append(data["performance"]["real_time_factor"])
        colors.append(get_model_color(model_name))

    return names, wer_values, cer_values, rtf_values, colors


def plot_wer_comparison(
    results: dict,
    output_path: Path,
    dataset_name: str = "CoRal v2",
) -> None:
    """Generate a WER comparison bar chart."""
    names, wer_values, _, _, colors = extract_metrics(results)

    fig, ax = plt.subplots(figsize=(8, 5))

    bars = ax.bar(names, wer_values, color=colors, edgecolor="white", linewidth=1.5)

    # Add value labels on bars
    for bar, val in zip(bars, wer_values):
        height = bar.get_height()
        ax.annotate(
            f"{val:.1f}%",
            xy=(bar.get_x() + bar.get_width() / 2, height),
            xytext=(0, 5),
            textcoords="offset points",
            ha="center",
            va="bottom",
            fontsize=12,
            fontweight="bold",
        )

    ax.set_ylabel("Word Error Rate (%)", fontsize=12)
    ax.set_title(f"WER Comparison on {dataset_name}", fontsize=14, fontweight="bold")
    ax.set_ylim(0, max(wer_values) * 1.2)

    # Add grid
    ax.yaxis.grid(True, linestyle="--", alpha=0.7)
    ax.set_axisbelow(True)

    plt.tight_layout()
    plt.savefig(output_path, dpi=300, bbox_inches="tight", facecolor="white")
    plt.close()
    print(f"Saved: {output_path}")


def plot_cer_comparison(
    results: dict,
    output_path: Path,
    dataset_name: str = "CoRal v2",
) -> None:
    """Generate a CER comparison bar chart."""
    names, _, cer_values, _, colors = extract_metrics(results)

    fig, ax = plt.subplots(figsize=(8, 5))

    bars = ax.bar(names, cer_values, color=colors, edgecolor="white", linewidth=1.5)

    # Add value labels on bars
    for bar, val in zip(bars, cer_values):
        height = bar.get_height()
        ax.annotate(
            f"{val:.1f}%",
            xy=(bar.get_x() + bar.get_width() / 2, height),
            xytext=(0, 5),
            textcoords="offset points",
            ha="center",
            va="bottom",
            fontsize=12,
            fontweight="bold",
        )

    ax.set_ylabel("Character Error Rate (%)", fontsize=12)
    ax.set_title(f"CER Comparison on {dataset_name}", fontsize=14, fontweight="bold")
    ax.set_ylim(0, max(cer_values) * 1.2)

    # Add grid
    ax.yaxis.grid(True, linestyle="--", alpha=0.7)
    ax.set_axisbelow(True)

    plt.tight_layout()
    plt.savefig(output_path, dpi=300, bbox_inches="tight", facecolor="white")
    plt.close()
    print(f"Saved: {output_path}")


def plot_rtf_comparison(
    results: dict,
    output_path: Path,
    dataset_name: str = "CoRal v2",
) -> None:
    """Generate an RTF/speed comparison bar chart."""
    names, _, _, rtf_values, colors = extract_metrics(results)

    fig, ax = plt.subplots(figsize=(8, 5))

    bars = ax.bar(names, rtf_values, color=colors, edgecolor="white", linewidth=1.5)

    # Add value labels on bars
    for bar, val in zip(bars, rtf_values):
        height = bar.get_height()
        ax.annotate(
            f"{val:.3f}",
            xy=(bar.get_x() + bar.get_width() / 2, height),
            xytext=(0, 5),
            textcoords="offset points",
            ha="center",
            va="bottom",
            fontsize=12,
            fontweight="bold",
        )

    # Add reference line at RTF=1.0 (real-time)
    ax.axhline(y=1.0, color="red", linestyle="--", linewidth=1.5, label="Real-time (RTF=1.0)")

    ax.set_ylabel("Real-Time Factor (lower is faster)", fontsize=12)
    ax.set_title(f"Speed Comparison on {dataset_name}", fontsize=14, fontweight="bold")
    ax.set_ylim(0, max(max(rtf_values) * 1.3, 1.1))
    ax.legend(loc="upper right")

    # Add grid
    ax.yaxis.grid(True, linestyle="--", alpha=0.7)
    ax.set_axisbelow(True)

    plt.tight_layout()
    plt.savefig(output_path, dpi=300, bbox_inches="tight", facecolor="white")
    plt.close()
    print(f"Saved: {output_path}")


def plot_accuracy_vs_speed(
    results: dict,
    output_path: Path,
    dataset_name: str = "CoRal v2",
) -> None:
    """Generate an accuracy vs speed scatter plot."""
    fig, ax = plt.subplots(figsize=(9, 6))

    for model_name, data in results["models"].items():
        wer = data["accuracy"]["wer"] * 100
        rtf = data["performance"]["real_time_factor"]
        color = get_model_color(model_name)
        display_name = get_display_name(model_name)

        # Extract parameter count for bubble size
        size_str = data["model_size"]
        if "1.7B" in size_str:
            size = 400
        elif "2B" in size_str:
            size = 500
        else:
            size = 300

        ax.scatter(
            rtf,
            wer,
            s=size,
            c=color,
            alpha=0.7,
            edgecolors="white",
            linewidth=2,
            label=display_name.replace("\n", " "),
        )

        # Add label
        ax.annotate(
            display_name.replace("\n", " "),
            xy=(rtf, wer),
            xytext=(10, 10),
            textcoords="offset points",
            fontsize=10,
            ha="left",
        )

    # Add reference line at RTF=1.0
    ax.axvline(x=1.0, color="red", linestyle="--", linewidth=1, alpha=0.5, label="Real-time")

    ax.set_xlabel("Real-Time Factor (lower is faster)", fontsize=12)
    ax.set_ylabel("Word Error Rate (%)", fontsize=12)
    ax.set_title(
        f"Accuracy vs Speed Trade-off on {dataset_name}\n(bubble size = model parameters)",
        fontsize=14,
        fontweight="bold",
    )

    # Set axis limits with padding
    all_wer = [d["accuracy"]["wer"] * 100 for d in results["models"].values()]
    all_rtf = [d["performance"]["real_time_factor"] for d in results["models"].values()]
    ax.set_xlim(0, max(all_rtf) * 1.5)
    ax.set_ylim(min(all_wer) * 0.8, max(all_wer) * 1.2)

    # Add grid
    ax.grid(True, linestyle="--", alpha=0.7)

    # Add annotation for the best region
    ax.annotate(
        "Better",
        xy=(0.02, min(all_wer) * 0.85),
        fontsize=10,
        color="green",
        fontweight="bold",
    )
    ax.annotate(
        "Faster & More Accurate",
        xy=(0.02, min(all_wer) * 0.9),
        fontsize=8,
        color="gray",
    )

    plt.tight_layout()
    plt.savefig(output_path, dpi=300, bbox_inches="tight", facecolor="white")
    plt.close()
    print(f"Saved: {output_path}")


def plot_multi_dataset_comparison(
    coral_results: dict,
    cv_results: Optional[dict],
    output_path: Path,
) -> None:
    """Generate a multi-dataset WER comparison plot."""
    fig, ax = plt.subplots(figsize=(10, 6))

    # Prepare data
    datasets = ["CoRal v2"]
    if cv_results:
        datasets.append("Common Voice")

    # Get model names from the CoRal results
    model_names = list(coral_results["models"].keys())
    x = np.arange(len(datasets))
    width = 0.35

    for i, model_name in enumerate(model_names):
        display_name = get_display_name(model_name)
        color = get_model_color(model_name)

        wer_values = [coral_results["models"][model_name]["accuracy"]["wer"] * 100]
        if cv_results and model_name in cv_results["models"]:
            wer_values.append(cv_results["models"][model_name]["accuracy"]["wer"] * 100)
        elif cv_results:
            wer_values.append(0)  # model not evaluated on this dataset

        offset = (i - len(model_names) / 2 + 0.5) * width
        bars = ax.bar(
            x + offset,
            wer_values,
            width,
            label=display_name.replace("\n", " "),
            color=color,
            edgecolor="white",
            linewidth=1.5,
        )

        # Add value labels
        for bar, val in zip(bars, wer_values):
            if val > 0:
                height = bar.get_height()
                ax.annotate(
                    f"{val:.1f}%",
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha="center",
                    va="bottom",
                    fontsize=10,
                    fontweight="bold",
                )

    ax.set_ylabel("Word Error Rate (%)", fontsize=12)
    ax.set_title("WER Comparison Across Datasets", fontsize=14, fontweight="bold")
    ax.set_xticks(x)
    ax.set_xticklabels(datasets, fontsize=11)
    ax.legend(loc="upper right")
    ax.yaxis.grid(True, linestyle="--", alpha=0.7)
    ax.set_axisbelow(True)

    plt.tight_layout()
    plt.savefig(output_path, dpi=300, bbox_inches="tight", facecolor="white")
    plt.close()
    print(f"Saved: {output_path}")


def parse_args() -> argparse.Namespace:
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(description="Generate ASR comparison plots")

    parser.add_argument(
        "--coral-results",
        type=Path,
        default=Path("results/full_comparison2.json"),
        help="Path to CoRal benchmark results",
    )
    parser.add_argument(
        "--cv-results",
        type=Path,
        default=Path("results/common_voice_comparison.json"),
        help="Path to Common Voice benchmark results",
    )
    parser.add_argument(
        "--output-dir",
        type=Path,
        default=Path(__file__).parent / "plots",
        help="Output directory for plots",
    )

    return parser.parse_args()


def main() -> None:
    """Main entry point for plot generation."""
    args = parse_args()

    # Create output directory
    args.output_dir.mkdir(parents=True, exist_ok=True)

    # Load results
    coral_results = load_results(args.coral_results)
    cv_results = load_results(args.cv_results)

    if coral_results is None:
        print("Error: CoRal results file is required")
        return

    print("=" * 60)
    print("Generating ASR Comparison Plots")
    print("=" * 60)
    print(f"Output directory: {args.output_dir}")
    print()

    # Generate CoRal plots
    print("Generating CoRal v2 plots...")
    plot_wer_comparison(coral_results, args.output_dir / "wer_comparison.png", "CoRal v2")
    plot_cer_comparison(coral_results, args.output_dir / "cer_comparison.png", "CoRal v2")
    plot_rtf_comparison(coral_results, args.output_dir / "rtf_comparison.png", "CoRal v2")
    plot_accuracy_vs_speed(coral_results, args.output_dir / "accuracy_vs_speed.png", "CoRal v2")

    # Generate Common Voice plots if available
    if cv_results:
        print("\nGenerating Common Voice plots...")
        plot_wer_comparison(
            cv_results, args.output_dir / "wer_comparison_cv.png", "Common Voice Danish"
        )
        plot_cer_comparison(
            cv_results, args.output_dir / "cer_comparison_cv.png", "Common Voice Danish"
        )
        plot_rtf_comparison(
            cv_results, args.output_dir / "rtf_comparison_cv.png", "Common Voice Danish"
        )

        # Multi-dataset comparison
        print("\nGenerating multi-dataset comparison...")
        plot_multi_dataset_comparison(
            coral_results, cv_results, args.output_dir / "multi_dataset_wer.png"
        )

    print("\n" + "=" * 60)
    print("Plot generation complete!")
    print("=" * 60)


if __name__ == "__main__":
    main()
```