mazesmazes committed
Commit 71301b8 · verified · Parent(s): ab03ae3

Update custom model files, README, and requirements

Files changed (5)
  1. README.md +49 -185
  2. asr_config.py +24 -3
  3. asr_modeling.py +250 -1
  4. asr_pipeline.py +27 -47
  5. asr_processing.py +9 -2
README.md CHANGED
@@ -1,207 +1,71 @@
1
  ---
2
- base_model: Qwen/Qwen3-1.7B
3
- library_name: peft
4
- pipeline_tag: text-generation
5
  tags:
6
- - base_model:adapter:Qwen/Qwen3-1.7B
7
- - lora
8
- - transformers
 
 
 
9
  ---
10
 
11
- # Model Card for Model ID
12
 
13
- <!-- Provide a quick summary of what the model is/does. -->
14
 
 
15
 
 
 
 
16
 
17
- ## Model Details
18
-
19
- ### Model Description
20
-
21
- <!-- Provide a longer summary of what this model is. -->
22
-
23
-
24
-
25
- - **Developed by:** [More Information Needed]
26
- - **Funded by [optional]:** [More Information Needed]
27
- - **Shared by [optional]:** [More Information Needed]
28
- - **Model type:** [More Information Needed]
29
- - **Language(s) (NLP):** [More Information Needed]
30
- - **License:** [More Information Needed]
31
- - **Finetuned from model [optional]:** [More Information Needed]
32
-
33
- ### Model Sources [optional]
34
-
35
- <!-- Provide the basic links for the model. -->
36
-
37
- - **Repository:** [More Information Needed]
38
- - **Paper [optional]:** [More Information Needed]
39
- - **Demo [optional]:** [More Information Needed]
40
-
41
- ## Uses
42
-
43
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
44
-
45
- ### Direct Use
46
-
47
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
48
-
49
- [More Information Needed]
50
-
51
- ### Downstream Use [optional]
52
-
53
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
54
-
55
- [More Information Needed]
56
-
57
- ### Out-of-Scope Use
58
-
59
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
60
-
61
- [More Information Needed]
62
-
63
- ## Bias, Risks, and Limitations
64
-
65
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
66
-
67
- [More Information Needed]
68
-
69
- ### Recommendations
70
-
71
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
72
-
73
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
74
-
75
- ## How to Get Started with the Model
76
-
77
- Use the code below to get started with the model.
78
-
79
- [More Information Needed]
80
 
81
  ## Training Details
82
 
83
- ### Training Data
84
-
85
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
86
-
87
- [More Information Needed]
88
-
89
- ### Training Procedure
90
-
91
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
92
-
93
- #### Preprocessing [optional]
94
-
95
- [More Information Needed]
96
-
97
-
98
- #### Training Hyperparameters
99
-
100
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
101
-
102
- #### Speeds, Sizes, Times [optional]
103
-
104
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
105
-
106
- [More Information Needed]
107
-
108
- ## Evaluation
109
-
110
- <!-- This section describes the evaluation protocols and provides the results. -->
111
-
112
- ### Testing Data, Factors & Metrics
113
-
114
- #### Testing Data
115
-
116
- <!-- This should link to a Dataset Card if possible. -->
117
-
118
- [More Information Needed]
119
-
120
- #### Factors
121
-
122
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
123
-
124
- [More Information Needed]
125
-
126
- #### Metrics
127
-
128
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
129
-
130
- [More Information Needed]
131
-
132
- ### Results
133
-
134
- [More Information Needed]
135
-
136
- #### Summary
137
-
138
-
139
-
140
- ## Model Examination [optional]
141
-
142
- <!-- Relevant interpretability work for the model goes here -->
143
-
144
- [More Information Needed]
145
-
146
- ## Environmental Impact
147
-
148
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
149
-
150
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
151
-
152
- - **Hardware Type:** [More Information Needed]
153
- - **Hours used:** [More Information Needed]
154
- - **Cloud Provider:** [More Information Needed]
155
- - **Compute Region:** [More Information Needed]
156
- - **Carbon Emitted:** [More Information Needed]
157
-
158
- ## Technical Specifications [optional]
159
-
160
- ### Model Architecture and Objective
161
-
162
- [More Information Needed]
163
-
164
- ### Compute Infrastructure
165
-
166
- [More Information Needed]
167
-
168
- #### Hardware
169
-
170
- [More Information Needed]
171
-
172
- #### Software
173
-
174
- [More Information Needed]
175
-
176
- ## Citation [optional]
177
-
178
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
179
-
180
- **BibTeX:**
181
-
182
- [More Information Needed]
183
-
184
- **APA:**
185
 
186
- [More Information Needed]
187
 
188
- ## Glossary [optional]
189
 
190
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
191
 
192
- [More Information Needed]
193
 
194
- ## More Information [optional]
 
195
 
196
- [More Information Needed]
197
 
198
- ## Model Card Authors [optional]
 
 
199
 
200
- [More Information Needed]
201
 
202
- ## Model Card Contact
 
 
 
203
 
204
- [More Information Needed]
205
- ### Framework versions
206
 
207
- - PEFT 0.18.0
 
 
1
  ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ datasets:
6
+ - speechbrain/LoquaciousSet
7
+ base_model:
8
+ - openai/whisper-large-v3-turbo
9
+ - HuggingFaceTB/SmolLM3-3B
10
+ pipeline_tag: automatic-speech-recognition
11
  tags:
12
+ - asr
13
+ - speech-recognition
14
+ - audio
15
+ - smollm
16
+ - whisper
17
+ - mlp
18
  ---
19
 
20
+ # Tiny Audio
21
 
22
+ A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with the [Tiny Audio](https://github.com/alexkroman/tiny-audio) codebase—a minimal, hackable framework for training ASR models.
23
 
24
+ ## Architecture
25
 
26
+ ```
27
+ Audio (16kHz) → Whisper Encoder (frozen) → MLP Projector (trained) → SmolLM3-3B (frozen) → Text
28
+ ```
29
 
30
+ **MLP Projector:**
31
+ - Convolutional downsampling: 4x sequence compression via two stride-2 conv layers
32
+ - Linear (1280 → 2048) → GELU → Linear (2048 → 2048)
33
+ - Output normalization: RMSNorm
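
A minimal PyTorch sketch of the projector shape described above (layer names, conv widths, and padding are assumptions rather than the repository's exact module; `nn.RMSNorm` needs PyTorch ≥ 2.4):

```python
import torch
import torch.nn as nn

class ProjectorSketch(nn.Module):
    """Illustrative only: Whisper hidden size 1280 -> SmolLM3 hidden size 2048."""

    def __init__(self, enc_dim: int = 1280, llm_dim: int = 2048):
        super().__init__()
        # Two stride-2 convolutions along time give the 4x token compression
        self.conv1 = nn.Conv1d(enc_dim, enc_dim, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv1d(enc_dim, enc_dim, kernel_size=3, stride=2, padding=1)
        self.fc1 = nn.Linear(enc_dim, llm_dim)
        self.fc2 = nn.Linear(llm_dim, llm_dim)
        self.norm = nn.RMSNorm(llm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, frames, enc_dim)
        x = self.conv2(self.conv1(x.transpose(1, 2)))      # (batch, enc_dim, frames // 4)
        x = x.transpose(1, 2)                               # (batch, frames // 4, enc_dim)
        return self.norm(self.fc2(nn.functional.gelu(self.fc1(x))))

# Whisper-large-v3 emits 1500 encoder frames per 30 s clip -> 375 projected tokens
print(ProjectorSketch()(torch.randn(1, 1500, 1280)).shape)  # torch.Size([1, 375, 2048])
```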
34
 
35
  ## Training Details
36
 
37
+ | | |
38
+ |---|---|
39
+ | **Dataset** | LoquaciousSet (25,000 hours) |
40
+ | **Hardware** | Single NVIDIA A40 48GB |
41
+ | **Training Time** | ~24 hours |
42
+ | **Cost** | ~$12 |
43
+ | **Trainable Parameters** | ~12M (projector only) |
44
 
45
+ ## Performance
46
 
47
+ **Word Error Rate (WER): 12.14%** on the LoquaciousSet test set.
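
WER is the usual edit-distance metric (substitutions + insertions + deletions over reference words); to reproduce the number on your own split you can use the `jiwer` package (text normalization choices here are assumptions):

```python
import jiwer

references = ["the quick brown fox", "hello world"]
hypotheses = ["the quick brown fox", "hello word"]

# 1 substitution across 6 reference words -> WER ~= 0.167
print(jiwer.wer(references, hypotheses))
```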
48
 
 
49
 
50
+ ## Usage
51
 
52
+ ```python
53
+ from transformers import pipeline
54
 
55
+ pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
56
 
57
+ result = pipe("path/to/audio.wav")
58
+ print(result["text"])
59
+ ```
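
The pipeline also accepts in-memory waveforms in the standard `AutomaticSpeechRecognitionPipeline` input formats (sketched below, reusing `pipe` from the snippet above with a placeholder array; real audio should be a mono float array at 16 kHz):

```python
import numpy as np

audio = np.zeros(16_000 * 5, dtype=np.float32)  # replace with a real waveform
result = pipe({"array": audio, "sampling_rate": 16_000})
print(result["text"])
```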
60
 
61
+ ## Limitations
62
 
63
+ - English only
64
+ - Optimized for 16kHz audio; other sample rates are resampled automatically
65
+ - Performance may degrade on heavily accented speech, noisy environments, or domain-specific jargon
66
+ - Maximum audio length limited by context window
67
 
68
+ ## Learn More
 
69
 
70
+ - **[Train your own model](https://github.com/alexkroman/tiny-audio)** — The full codebase with training scripts
71
+ - **[Free 3.5-hour course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md)** — Build your own ASR system from scratch
asr_config.py CHANGED
@@ -22,12 +22,12 @@ class ASRConfig(transformers.PretrainedConfig):
22
  # Default is Whisper/GLM-ASR structure: conv1(k=3,s=1,p=1) + conv2(k=3,s=2,p=1)
23
  encoder_conv_layers: Optional[list] = None,
24
  audio_sample_rate: int = 16000,
25
- projector_init_std: float = 0.02,
26
  projector_pool_stride: int = 4,
27
  downsample_rate: int = 5, # Granite default
28
  projector_hidden_dim: Optional[int] = None,
29
- projector_type: str = "moe", # "moe", "swiglu", "residual", "shared_moe", "mlp", "qformer"
30
- projector_num_layers: int = 2, # Number of layers (for residual projector)
 
31
  projector_dropout: float = 0.0, # Dropout rate for projector layers
32
  # MoE-specific configuration
33
  num_experts: int = 4, # Number of experts in MoE projectors
@@ -41,7 +41,16 @@ class ASRConfig(transformers.PretrainedConfig):
41
  qformer_intermediate_size: Optional[int] = None, # FFN size (defaults to 4x hidden)
42
  label_smoothing: float = 0.0, # Label smoothing for cross-entropy loss
43
  inference_warmup_tokens: int = 10,
44
  max_new_tokens: Optional[int] = None,
 
45
  repetition_penalty: Optional[float] = None,
46
  length_penalty: Optional[float] = None,
47
  no_repeat_ngram_size: Optional[int] = None,
@@ -52,6 +61,7 @@ class ASRConfig(transformers.PretrainedConfig):
52
  generation_defaults = {
53
  "num_beams": 1,
54
  "max_new_tokens": 256,
 
55
  "repetition_penalty": 1.0,
56
  "length_penalty": 1.0,
57
  "no_repeat_ngram_size": 0,
@@ -91,12 +101,23 @@ class ASRConfig(transformers.PretrainedConfig):
91
  self.qformer_intermediate_size = qformer_intermediate_size
92
  self.label_smoothing = label_smoothing
93
  self.inference_warmup_tokens = inference_warmup_tokens
94
 
95
  # Generation parameters (use explicit value if provided, else use default)
96
  self.num_beams = num_beams if num_beams is not None else generation_defaults["num_beams"]
97
  self.max_new_tokens = (
98
  max_new_tokens if max_new_tokens is not None else generation_defaults["max_new_tokens"]
99
  )
 
 
 
100
  self.repetition_penalty = (
101
  repetition_penalty
102
  if repetition_penalty is not None
 
22
  # Default is Whisper/GLM-ASR structure: conv1(k=3,s=1,p=1) + conv2(k=3,s=2,p=1)
23
  encoder_conv_layers: Optional[list] = None,
24
  audio_sample_rate: int = 16000,
 
25
  projector_pool_stride: int = 4,
26
  downsample_rate: int = 5, # Granite default
27
  projector_hidden_dim: Optional[int] = None,
28
+ projector_type: str = "mlp", # "mlp", "mosa", "moe", "qformer"
29
+ projector_num_layers: int = 2, # Number of layers in MLP projector
30
+ projector_init_std: float = 0.02, # Weight initialization std
31
  projector_dropout: float = 0.0, # Dropout rate for projector layers
32
  # MoE-specific configuration
33
  num_experts: int = 4, # Number of experts in MoE projectors
 
41
  qformer_intermediate_size: Optional[int] = None, # FFN size (defaults to 4x hidden)
42
  label_smoothing: float = 0.0, # Label smoothing for cross-entropy loss
43
  inference_warmup_tokens: int = 10,
44
+ # SpecAugment settings (Whisper defaults)
45
+ use_specaugment: bool = False,
46
+ mask_time_prob: float = 0.05, # Probability of masking time steps
47
+ mask_time_length: int = 10, # Max length of time mask
48
+ mask_time_min_masks: int = 2, # Min number of time masks
49
+ mask_feature_prob: float = 0.0, # Probability of masking frequency bins (disabled by default)
50
+ mask_feature_length: int = 10, # Max length of frequency mask
51
+ mask_feature_min_masks: int = 0, # Min number of frequency masks
52
  max_new_tokens: Optional[int] = None,
53
+ min_new_tokens: Optional[int] = None,
54
  repetition_penalty: Optional[float] = None,
55
  length_penalty: Optional[float] = None,
56
  no_repeat_ngram_size: Optional[int] = None,
 
61
  generation_defaults = {
62
  "num_beams": 1,
63
  "max_new_tokens": 256,
64
+ "min_new_tokens": 1,
65
  "repetition_penalty": 1.0,
66
  "length_penalty": 1.0,
67
  "no_repeat_ngram_size": 0,
 
101
  self.qformer_intermediate_size = qformer_intermediate_size
102
  self.label_smoothing = label_smoothing
103
  self.inference_warmup_tokens = inference_warmup_tokens
104
+ # SpecAugment configuration
105
+ self.use_specaugment = use_specaugment
106
+ self.mask_time_prob = mask_time_prob
107
+ self.mask_time_length = mask_time_length
108
+ self.mask_time_min_masks = mask_time_min_masks
109
+ self.mask_feature_prob = mask_feature_prob
110
+ self.mask_feature_length = mask_feature_length
111
+ self.mask_feature_min_masks = mask_feature_min_masks
112
 
113
  # Generation parameters (use explicit value if provided, else use default)
114
  self.num_beams = num_beams if num_beams is not None else generation_defaults["num_beams"]
115
  self.max_new_tokens = (
116
  max_new_tokens if max_new_tokens is not None else generation_defaults["max_new_tokens"]
117
  )
118
+ self.min_new_tokens = (
119
+ min_new_tokens if min_new_tokens is not None else generation_defaults["min_new_tokens"]
120
+ )
121
  self.repetition_penalty = (
122
  repetition_penalty
123
  if repetition_penalty is not None
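
Taken together, the new fields let SpecAugment and the generation floor be switched on from the config; a minimal sketch (all other constructor arguments keep the defaults shown above):

```python
from asr_config import ASRConfig

config = ASRConfig(
    use_specaugment=True,    # time masking is applied only while the model is training
    mask_time_prob=0.05,     # Whisper-style defaults listed above
    mask_time_length=10,
    mask_time_min_masks=2,
    mask_feature_prob=0.0,   # frequency masking stays disabled
    min_new_tokens=1,        # new generation floor introduced in this commit
)
```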
asr_modeling.py CHANGED
@@ -1,6 +1,7 @@
1
  import json
2
  from pathlib import Path
3
- from typing import Optional, Union
 
4
 
5
  import torch
6
  import torch.nn as nn
@@ -10,6 +11,7 @@ from transformers import (
10
  AutoModelForCausalLM,
11
  AutoTokenizer,
12
  PreTrainedModel,
 
13
  )
14
  from transformers.generation import GenerationMixin
15
  from transformers.modeling_outputs import CausalLMOutputWithPast
@@ -22,6 +24,122 @@ except ImportError:
22
  from projectors import PROJECTOR_CLASSES # type: ignore[no-redef]
23
 
24
 
25
  class ASRModel(PreTrainedModel, GenerationMixin):
26
  """Audio-to-text model combining an audio encoder, projector, and language model."""
27
 
@@ -110,6 +228,7 @@ class ASRModel(PreTrainedModel, GenerationMixin):
110
  # Set up generation config with greedy decoding defaults
111
  self.generation_config = self.language_model.generation_config
112
  self.generation_config.max_new_tokens = config.max_new_tokens
 
113
  self.generation_config.num_beams = config.num_beams
114
  self.generation_config.do_sample = False
115
  # Clear sampling params (inherited from LLM) since we use greedy decoding
@@ -383,6 +502,18 @@ class ASRModel(PreTrainedModel, GenerationMixin):
383
  inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
384
 
385
  if input_features is not None and input_ids is not None:
386
  # Encode audio -> flattened (total_audio_tokens, hidden_dim)
387
  audio_embeds = self._encode_audio(input_features, audio_attention_mask)
388
 
@@ -515,6 +646,120 @@ class ASRModel(PreTrainedModel, GenerationMixin):
515
  return output
516
  return output.sequences
517
 
518
  def save_pretrained(self, save_directory: Union[str, Path], **kwargs):
519
  """Save model, tokenizer, and processor."""
520
  import shutil
@@ -568,6 +813,10 @@ class ASRModel(PreTrainedModel, GenerationMixin):
568
  # Copy projectors module
569
  shutil.copy(src_dir / "projectors.py", save_dir / "projectors.py")
570
 
 
 
 
 
571
 
572
  # Register with transformers Auto classes
573
  AutoConfig.register("asr_model", ASRConfig)
 
1
  import json
2
  from pathlib import Path
3
+ from threading import Thread
4
+ from typing import Iterator, Optional, Union
5
 
6
  import torch
7
  import torch.nn as nn
 
11
  AutoModelForCausalLM,
12
  AutoTokenizer,
13
  PreTrainedModel,
14
+ TextIteratorStreamer,
15
  )
16
  from transformers.generation import GenerationMixin
17
  from transformers.modeling_outputs import CausalLMOutputWithPast
 
24
  from projectors import PROJECTOR_CLASSES # type: ignore[no-redef]
25
 
26
 
27
+ def _compute_mask_indices(
28
+ shape: tuple[int, int],
29
+ mask_prob: float,
30
+ mask_length: int,
31
+ min_masks: int = 0,
32
+ device: torch.device = None,
33
+ ) -> torch.Tensor:
34
+ """Compute random mask spans for SpecAugment.
35
+
36
+ Based on transformers' _compute_mask_indices for Wav2Vec2/Whisper.
37
+
38
+ Args:
39
+ shape: (batch_size, sequence_length)
40
+ mask_prob: Probability for each token to be chosen as start of mask span
41
+ mask_length: Maximum length of mask span
42
+ min_masks: Minimum number of masks per sample
43
+ device: Device to create tensor on
44
+
45
+ Returns:
46
+ Boolean mask tensor of shape (batch_size, sequence_length)
47
+ """
48
+ batch_size, sequence_length = shape
49
+
50
+ if mask_length < 1:
51
+ raise ValueError(f"mask_length must be >= 1, got {mask_length}")
52
+
53
+ if mask_length > sequence_length:
54
+ raise ValueError(
55
+ f"mask_length {mask_length} must be <= sequence_length {sequence_length}"
56
+ )
57
+
58
+ # Compute number of masked spans per sample
59
+ num_masked_spans = int(mask_prob * sequence_length / mask_length + torch.rand(1).item())
60
+ num_masked_spans = max(num_masked_spans, min_masks)
61
+
62
+ # Clamp to ensure we don't exceed sequence length
63
+ if num_masked_spans * mask_length > sequence_length:
64
+ num_masked_spans = sequence_length // mask_length
65
+
66
+ if num_masked_spans == 0:
67
+ return torch.zeros((batch_size, sequence_length), dtype=torch.bool, device=device)
68
+
69
+ # Uniformly sample span start indices
70
+ mask = torch.zeros((batch_size, sequence_length), dtype=torch.bool, device=device)
71
+
72
+ for i in range(batch_size):
73
+ # Random start indices for this sample
74
+ spec_aug_start_indices = torch.randint(
75
+ 0, sequence_length - mask_length + 1, (num_masked_spans,), device=device
76
+ )
77
+
78
+ # Create mask spans
79
+ for start_idx in spec_aug_start_indices:
80
+ mask[i, start_idx : start_idx + mask_length] = True
81
+
82
+ return mask
83
+
84
+
85
+ def apply_specaugment(
86
+ input_features: torch.Tensor,
87
+ mask_time_prob: float = 0.05,
88
+ mask_time_length: int = 10,
89
+ mask_time_min_masks: int = 2,
90
+ mask_feature_prob: float = 0.0,
91
+ mask_feature_length: int = 10,
92
+ mask_feature_min_masks: int = 0,
93
+ ) -> torch.Tensor:
94
+ """Apply SpecAugment to mel spectrogram features.
95
+
96
+ Args:
97
+ input_features: Mel spectrogram of shape (batch, n_mels, time)
98
+ mask_time_prob: Probability of masking time steps
99
+ mask_time_length: Max length of time mask
100
+ mask_time_min_masks: Min number of time masks
101
+ mask_feature_prob: Probability of masking frequency bins
102
+ mask_feature_length: Max length of frequency mask
103
+ mask_feature_min_masks: Min number of frequency masks
104
+
105
+ Returns:
106
+ Augmented mel spectrogram with same shape
107
+ """
108
+ batch_size, n_mels, time_steps = input_features.shape
109
+ device = input_features.device
110
+
111
+ # Clone to avoid modifying original
112
+ augmented = input_features.clone()
113
+
114
+ # Time masking (along time dimension)
115
+ if mask_time_prob > 0:
116
+ time_mask = _compute_mask_indices(
117
+ shape=(batch_size, time_steps),
118
+ mask_prob=mask_time_prob,
119
+ mask_length=mask_time_length,
120
+ min_masks=mask_time_min_masks,
121
+ device=device,
122
+ )
123
+ # Expand to (batch, 1, time) for broadcasting
124
+ time_mask = time_mask.unsqueeze(1)
125
+ augmented = augmented.masked_fill(time_mask, 0.0)
126
+
127
+ # Frequency masking (along mel dimension)
128
+ if mask_feature_prob > 0:
129
+ feature_mask = _compute_mask_indices(
130
+ shape=(batch_size, n_mels),
131
+ mask_prob=mask_feature_prob,
132
+ mask_length=mask_feature_length,
133
+ min_masks=mask_feature_min_masks,
134
+ device=device,
135
+ )
136
+ # Expand to (batch, n_mels, 1) for broadcasting
137
+ feature_mask = feature_mask.unsqueeze(2)
138
+ augmented = augmented.masked_fill(feature_mask, 0.0)
139
+
140
+ return augmented
141
+
142
+
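
A quick smoke test for `apply_specaugment` (shapes follow Whisper-large-v3's 128 mel bins by 3000 frames; purely illustrative):

```python
feats = torch.randn(2, 128, 3000)                     # (batch, n_mels, time)
aug = apply_specaugment(feats, mask_time_prob=0.05, mask_time_length=10)
print(aug.shape)                                       # unchanged: torch.Size([2, 128, 3000])
print((aug == 0.0).all(dim=1).float().mean().item())   # approx. fraction of zeroed time frames
```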
143
  class ASRModel(PreTrainedModel, GenerationMixin):
144
  """Audio-to-text model combining an audio encoder, projector, and language model."""
145
 
 
228
  # Set up generation config with greedy decoding defaults
229
  self.generation_config = self.language_model.generation_config
230
  self.generation_config.max_new_tokens = config.max_new_tokens
231
+ self.generation_config.min_new_tokens = config.min_new_tokens
232
  self.generation_config.num_beams = config.num_beams
233
  self.generation_config.do_sample = False
234
  # Clear sampling params (inherited from LLM) since we use greedy decoding
 
502
  inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
503
 
504
  if input_features is not None and input_ids is not None:
505
+ # Apply SpecAugment during training if enabled
506
+ if self.training and getattr(self.config, "use_specaugment", False):
507
+ input_features = apply_specaugment(
508
+ input_features,
509
+ mask_time_prob=self.config.mask_time_prob,
510
+ mask_time_length=self.config.mask_time_length,
511
+ mask_time_min_masks=self.config.mask_time_min_masks,
512
+ mask_feature_prob=self.config.mask_feature_prob,
513
+ mask_feature_length=self.config.mask_feature_length,
514
+ mask_feature_min_masks=self.config.mask_feature_min_masks,
515
+ )
516
+
517
  # Encode audio -> flattened (total_audio_tokens, hidden_dim)
518
  audio_embeds = self._encode_audio(input_features, audio_attention_mask)
519
 
 
646
  return output
647
  return output.sequences
648
 
649
+ def generate_streaming(
650
+ self,
651
+ input_features: torch.Tensor,
652
+ audio_attention_mask: torch.Tensor,
653
+ system_prompt: Optional[str] = None,
654
+ **generate_kwargs,
655
+ ) -> Iterator[str]:
656
+ """Generate transcription with streaming token output.
657
+
658
+ Yields partial transcript strings as tokens are generated.
659
+ Reduces time-to-first-word by streaming tokens as they're decoded.
660
+
661
+ Args:
662
+ input_features: Mel spectrogram features (batch, n_mels, mel_len)
663
+ audio_attention_mask: Mask for real vs padded mel frames (batch, mel_len)
664
+ system_prompt: Optional system prompt override
665
+ **generate_kwargs: Additional generation arguments
666
+
667
+ Yields:
668
+ Partial transcript text as each token is generated
669
+ """
670
+ device = input_features.device
671
+ batch_size = input_features.shape[0]
672
+
673
+ # Encode audio -> flattened embeddings
674
+ audio_embeds = self._encode_audio(input_features, audio_attention_mask)
675
+
676
+ # Build prompt with correct number of audio tokens
677
+ num_audio_tokens = self._get_num_audio_tokens(audio_attention_mask)
678
+ audio_placeholder = "<audio>" * num_audio_tokens
679
+
680
+ system_prompt = system_prompt or self.system_prompt
681
+
682
+ messages: list[dict[str, str]] = []
683
+ if system_prompt:
684
+ messages.append({"role": "system", "content": system_prompt})
685
+ messages.append({"role": "user", "content": self.TRANSCRIBE_PROMPT + audio_placeholder})
686
+
687
+ chat_result = self.tokenizer.apply_chat_template(
688
+ messages,
689
+ tokenize=True,
690
+ add_generation_prompt=True,
691
+ return_tensors="pt",
692
+ )
693
+ input_ids = chat_result.input_ids.to(device)
694
+
695
+ if input_ids.dim() == 1:
696
+ input_ids = input_ids.unsqueeze(0)
697
+ if input_ids.shape[0] == 1 and batch_size > 1:
698
+ input_ids = input_ids.expand(batch_size, -1)
699
+
700
+ attention_mask = torch.ones_like(input_ids)
701
+
702
+ # Get text embeddings and replace audio tokens with audio embeddings
703
+ inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
704
+ audio_token_mask = (input_ids == self.audio_token_id).unsqueeze(-1)
705
+ inputs_embeds = inputs_embeds.masked_scatter(
706
+ audio_token_mask.to(inputs_embeds.device),
707
+ audio_embeds.to(inputs_embeds.device, dtype=inputs_embeds.dtype),
708
+ )
709
+
710
+ # Setup streamer for token-by-token output
711
+ streamer = TextIteratorStreamer(
712
+ self.tokenizer,
713
+ skip_prompt=True,
714
+ skip_special_tokens=True,
715
+ )
716
+
717
+ # Prepare generation kwargs
718
+ gen_kwargs = {
719
+ "inputs_embeds": inputs_embeds,
720
+ "attention_mask": attention_mask,
721
+ "generation_config": self.generation_config,
722
+ "streamer": streamer,
723
+ **generate_kwargs,
724
+ }
725
+
726
+ # Run generation in background thread
727
+ thread = Thread(target=self.language_model.generate, kwargs=gen_kwargs)
728
+ thread.start()
729
+
730
+ # Yield tokens as they're generated, filtering out <think>...</think> blocks
731
+ # SmolLM3 always starts in thinking mode, so assume we're in a think block
732
+ in_think_block = True
733
+ buffer = ""
734
+
735
+ for text in streamer:
736
+ buffer += text
737
+
738
+ # Check for think block start (in case model outputs multiple think blocks)
739
+ while "<think>" in buffer:
740
+ in_think_block = True
741
+ # Yield any text before <think>
742
+ before_think = buffer.split("<think>")[0]
743
+ if before_think:
744
+ yield before_think
745
+ buffer = buffer.split("<think>", 1)[-1]
746
+
747
+ # Check for think block end
748
+ while in_think_block and "</think>" in buffer:
749
+ in_think_block = False
750
+ buffer = buffer.split("</think>", 1)[-1]
751
+
752
+ # Yield text if not in think block
753
+ if not in_think_block and buffer:
754
+ yield buffer
755
+ buffer = ""
756
+
757
+ # Yield any remaining buffer
758
+ if buffer and not in_think_block:
759
+ yield buffer
760
+
761
+ thread.join()
762
+
763
  def save_pretrained(self, save_directory: Union[str, Path], **kwargs):
764
  """Save model, tokenizer, and processor."""
765
  import shutil
 
813
  # Copy projectors module
814
  shutil.copy(src_dir / "projectors.py", save_dir / "projectors.py")
815
 
816
+ def create_or_update_model_card(self, output_dir: Union[str, Path]):
817
+ """No-op for model card creation - we use MODEL_CARD.md in repo instead."""
818
+ pass
819
+
820
 
821
  # Register with transformers Auto classes
822
  AutoConfig.register("asr_model", ASRConfig)
asr_pipeline.py CHANGED
@@ -476,57 +476,37 @@ class ASRPipeline(transformers.AutomaticSpeechRecognitionPipeline):
476
  text = self.tokenizer.decode(tokens, skip_special_tokens=True).strip()
477
  # Strip <think>...</think> tags (Qwen3 doesn't respect /no_think prompt)
478
  text = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
479
- # Collapse spaced-out acronyms (e.g., "I S D S" -> "ISDS")
480
- text = self._collapse_acronyms(text)
481
- # Truncate if a word repeats more than 3 times consecutively
482
- text = self._truncate_repetitions(text, max_repeats=3)
483
  return {"text": text}
484
 
485
- def _collapse_acronyms(self, text: str) -> str:
486
- """Collapse spaced-out acronyms into single words.
 
 
487
 
488
- Converts patterns like "I S D S" to "ISDS" when 2+ single letters
489
- are separated by spaces.
490
 
491
- Args:
492
- text: Input text with potential spaced acronyms
493
-
494
- Returns:
495
- Text with acronyms collapsed
496
- """
497
- # Match 2+ single letters (case-insensitive) separated by spaces
498
- # Pattern: single letter, then one or more (space + single letter)
499
- pattern = r"\b([A-Za-z])((?:\s[A-Za-z]){1,})\b"
500
-
501
- def collapse_match(match: re.Match) -> str:
502
- # Get the full match and remove spaces
503
- full = match.group(0)
504
- return full.replace(" ", "").upper()
505
-
506
- return re.sub(pattern, collapse_match, text)
507
-
508
- def _truncate_repetitions(self, text: str, max_repeats: int = 3) -> str:
509
- """Truncate text when a word repeats more than max_repeats times consecutively.
510
-
511
- Args:
512
- text: Input text to check for repetitions
513
- max_repeats: Maximum allowed consecutive repetitions (default 3)
514
-
515
- Returns:
516
- Truncated text if repetition detected, otherwise original text
517
- """
518
  words = text.split()
519
- if len(words) <= max_repeats:
520
- return text
521
-
522
- repeat_count = 1
523
- for i in range(1, len(words)):
524
- if words[i].lower() == words[i - 1].lower():
525
- repeat_count += 1
526
- if repeat_count > max_repeats:
527
- # Keep up to max_repeats of the repeated word
528
- return " ".join(words[:i])
529
- else:
530
- repeat_count = 1
531
 
532
  return text
 
476
  text = self.tokenizer.decode(tokens, skip_special_tokens=True).strip()
477
  # Strip <think>...</think> tags (Qwen3 doesn't respect /no_think prompt)
478
  text = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
479
+ # Post-process prediction
480
+ text = self._post_process_prediction(text)
 
 
481
  return {"text": text}
482
 
483
+ def _post_process_prediction(self, text: str) -> str:
484
+ """Post-process model output to fix common issues."""
485
+ if not text:
486
+ return ""
487
 
488
+ # 1. LOWERCASE
489
+ text = text.lower()
490
 
491
+ # 2. REMOVE REPETITIVE LOOPS
492
+ # If the model repeats the same phrase more than twice, cut it off.
 
493
  words = text.split()
494
+ if len(words) > 10:
495
+ # Check for repeating n-grams (1 to 4 words long)
496
+ for n in range(1, 5):
497
+ last_sequence = words[-n:]
498
+ repeat_count = 0
499
+ idx = len(words) - n
500
+ while idx >= n and words[idx - n : idx] == last_sequence:
501
+ repeat_count += 1
502
+ idx -= n
503
+
504
+ # If more than 2 exact repetitions at the end, truncate
505
+ if repeat_count > 2:
506
+ text = " ".join(words[: idx + n])
507
+ break
508
+
509
+ # 3. STRIP WHITESPACE
510
+ text = re.sub(r'\s+', ' ', text).strip()
511
 
512
  return text
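
For intuition, the repetition guard trims trailing n-gram loops after lowercasing; an illustrative call to the (private) helper:

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

text = "So we went to the store to the store to the store to the store"
print(pipe._post_process_prediction(text))
# -> "so we went to the store"  (lowercased; the trailing "to the store" loop is truncated)
```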
asr_processing.py CHANGED
@@ -94,14 +94,21 @@ class ASRProcessor(ProcessorMixin):
94
  messages.append({"role": "assistant", "content": text})
95
 
96
  # Tokenize
97
- input_ids = self.tokenizer.apply_chat_template(
98
  messages,
99
  tokenize=True,
100
  add_generation_prompt=(text is None),
101
  return_tensors=return_tensors,
102
  )
103
 
104
- if isinstance(input_ids, torch.Tensor) and input_ids.dim() == 1:
105
  input_ids = input_ids.unsqueeze(0)
106
 
107
  result["input_ids"] = input_ids
 
94
  messages.append({"role": "assistant", "content": text})
95
 
96
  # Tokenize
97
+ tokenized = self.tokenizer.apply_chat_template(
98
  messages,
99
  tokenize=True,
100
  add_generation_prompt=(text is None),
101
  return_tensors=return_tensors,
102
  )
103
 
104
+ # Handle both tensor and BatchEncoding returns
105
+ if isinstance(tokenized, torch.Tensor):
106
+ input_ids = tokenized
107
+ else:
108
+ # BatchEncoding or dict-like object
109
+ input_ids = tokenized["input_ids"] if "input_ids" in tokenized else tokenized.input_ids
110
+
111
+ if input_ids.dim() == 1:
112
  input_ids = input_ids.unsqueeze(0)
113
 
114
  result["input_ids"] = input_ids
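
The tensor/`BatchEncoding` guard exists because `apply_chat_template` changes its return container depending on the arguments (and, in older releases, the transformers version); roughly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
messages = [{"role": "user", "content": "Transcribe the audio."}]

ids = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt")
print(type(ids))   # plain tensor of token ids

enc = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt", return_dict=True)
print(type(enc))   # BatchEncoding with "input_ids" (and "attention_mask")
```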