Upload Wren-ASR-0.5B-multi checkpoint

Files changed (15)

.gitattributes +1 -0
README.md +201 -0
added_tokens.json +26 -0
chat_template.jinja +54 -0
config.json +21 -0
configuration_wren_asr.py +30 -0
merges.txt +0 -0
model.safetensors +3 -0
modeling_wren_asr.py +145 -0
processing_wren_asr.py +78 -0
processor_config.json +8 -0
special_tokens_map.json +32 -0
tokenizer.json +3 -0
tokenizer_config.json +212 -0
vocab.json +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,201 @@

+---
+license: apache-2.0
+language:
+- en
+- de
+- fr
+- es
+- nl
+- it
+- pl
+- pt
+library_name: pytorch
+tags:
+- automatic-speech-recognition
+- asr
+- audio
+- speech-recognition
+- multilingual
+- wren
+- mimi
+- qwen2.5
+- neural-codec
+pipeline_tag: automatic-speech-recognition
+datasets:
+- shangeth/mls-mimi-codes
+- shangeth/libritts-r-mimi-codes
+- shangeth/vctk-mimi-codes
+- shangeth/jenny-mimi-codes
+- shangeth/ljspeech-mimi-codes
+- shangeth/expresso-mimi-codes-tagged
+- facebook/multilingual_librispeech
+- mythicinfinity/libritts_r
+- keithito/lj_speech
+- CSTR-Edinburgh/vctk
+- reach-vb/jenny_tts_dataset
+- ylacombe/expresso
+---
+# Wren-ASR-0.5B-multi
+**Multilingual** automatic speech recognition model in the Wren series. Encodes
+audio with the [Kyutai Mimi](https://huggingface.co/kyutai/mimi) neural codec,
+then transcribes with a [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
+backbone — no acoustic encoder, no CTC, just a small LLM consuming Mimi codes as
+input embeddings.
+Supports **8 languages**: English, German, French, Spanish, Dutch, Italian, Polish, Portuguese.
+## Links
+- **Training & inference code:** [github.com/shangeth/wren-asr](https://github.com/shangeth/wren-asr)
+- **Wren research project:** [github.com/shangeth/wren](https://github.com/shangeth/wren)
+- **TTS counterpart:** [shangeth/Wren-TTS-0.5B-multi](https://huggingface.co/shangeth/Wren-TTS-0.5B-multi)
+- **Dataset extraction (Mimi codes):** [github.com/shangeth/wren-datasets](https://github.com/shangeth/wren-datasets)
+- **Demo Space:** [huggingface.co/spaces/shangeth/Wren-ASR-0.5B-multi-demo](https://huggingface.co/spaces/shangeth/Wren-ASR-0.5B-multi-demo)
+## Architecture
+```
+audio ──► Mimi encoder (k=3) ──► Qwen2.5-0.5B (audio prefix → text) ──► transcript
+```
+Mimi codes serve as a discrete audio prefix in the LLM's input embedding space.
+At each audio frame the k=3 codebook codes go through k separate input embedding
+tables; their sum (scaled by 1/√k) is the input embedding for that step. The
+audio prefix is wrapped in `<|audio_start|>` / `<|audio_end|>` tokens, after
+which the LLM autoregressively emits text using its native vocabulary and
+`lm_head` — no new output heads were added.
+- **Backbone:** Qwen2.5-0.5B (causal LM; transformer body ~358M params, 151k-token multilingual vocab)
+- **Audio tokenizer:** Mimi (`kyutai/mimi`), 12.5 fps, 2048-entry codebooks
+- **Codebooks used:** first 3 (semantic-content-rich); cuts audio embedding-table parameters by a factor of 8/3 vs 8-codebook variants
+- **Audio prefix:** `<|audio_start|>` + summed-codebook embeds × T_frames + `<|audio_end|>`
+- **Output:** standard text autoregression via `model.llm.generate(inputs_embeds=...)`
+## Training data
+Trained on the **union of every dataset used to train Wren-TTS** — the same
+6 corpora that power the en/multi/expressive TTS recipes, with text used as the
+ASR target:
+| Dataset | Rows | Language(s) |
+|---|---|---|
+| VCTK              | ~44k    | en (109 speakers, multiple accents) |
+| Jenny             | ~21k    | en (single speaker) |
+| LibriTTS-R        | ~360k   | en (clean_100 + clean_360 + other_500) |
+| LJSpeech          | ~13k    | en (single speaker) |
+| MLS               | ~6.0M   | de · fr · es · it · nl · pl · pt |
+| Expresso (tagged) | ~26k    | en (style tags stripped at load time) |
+| **Total**         | **~6.46M** rows / epoch | |
+Mimi codes are pre-extracted and published as the per-corpus mimi-codes datasets
+(listed under `datasets` in the metadata above), so there is no online encoding
+during training. Single-pass from-scratch training with k=3 codebooks. Held-out
+validation combines LibriTTS-R `dev_clean` + MLS `dev` (all 7 langs) + Expresso
+`dev` (tags stripped) + 5% per single-speaker English source. All dataset
+sampling weights are set to 1.0 (every row, every epoch, no subsampling).
+Trained on a single A100-40GB.
+Text casing and punctuation are preserved in the ground-truth transcripts.
+## Usage
+```bash
+pip install torch torchaudio transformers
+```
+```python
+import torch
+import torchaudio
+from transformers import AutoModel, AutoProcessor
+model_id  = "shangeth/Wren-ASR-0.5B-multi"
+device    = "cuda" if torch.cuda.is_available() else "cpu"
+processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+model     = AutoModel.from_pretrained(model_id, trust_remote_code=True).to(device).eval()
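+# Load any short clip (one of the 8 supported languages, ≤ 30 s)
+wav, sr = torchaudio.load("input.wav")          # [channels, samples]
+wav = wav.mean(dim=0, keepdim=True)             # downmix to mono; Mimi expects a single channel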
+inputs = processor(audio=wav, sampling_rate=sr)
+inputs = {k: v.to(device) for k, v in inputs.items()}
+ids  = model.generate(**inputs, max_new_tokens=200)
+text = processor.batch_decode(ids, skip_special_tokens=True)[0]
+print(text)
+```
+## Sampling tips
+Default: greedy decoding (`do_sample=False`). For longer / harder utterances:
+- Pass `do_sample=True, temperature=0.7, top_p=0.9` to sample more varied transcripts (this is sampling, not beam search)
+- Raise `max_new_tokens` if transcripts are getting cut off
+- Audio is hard-capped at 30 s (375 frames @ 12.5 fps) by the training recipe;
+  for longer audio, segment first
+## Limitations & known issues
+- **Language coverage:** only the 8 trained languages. Out-of-distribution
+  audio produces noise / hallucinated text in the closest matching language.
+- **Per-language quality varies with data volume:** German / Dutch / French
+  are strongest (largest training shares); Polish / Portuguese / Italian have
+  less training data and may be less accurate.
+- **Audiobook-style audio dominates training:** MLS / LibriTTS-R / LJSpeech /
+  Jenny are all studio-style read speech. Performance on conversational audio,
+  noisy environments, or accented far-field input may degrade.
+- **0.5B backbone** — quality is below frontier ASR systems (Whisper-large-v3,
+  USM, etc.). The pitch is "small enough to run anywhere" + "shares architecture
+  with Wren-TTS-0.5B-multi for unified speech-text experimentation".
+- **30s audio cap.** Hard-cap at training time; longer audio needs to be
+  segmented externally.
+- **No speaker diarization.** Single-stream transcription only.
+## The Wren series
+Wren is a family of compact (<3B parameter) multimodal speech LLMs — small
+enough to run on a single consumer GPU, designed for open research on unified
+speech understanding and synthesis.
+- **Wren-TTS** — text → speech (English + multilingual + expressive variants)
+- **Wren-ASR** — speech → text (this release)
+- **Wren-LM** — speech-language modelling / dialog (planned)
+- **Wren-Omni** — unified ASR + TTS + LM in one checkpoint (planned)
+All Wren models share the same design principles: small backbone LLM + neural
+audio codec, open weights, simple PyTorch checkpoints, reproducible training
+recipes. Wren-ASR uses the same Qwen2.5-0.5B backbone as Wren-TTS-0.5B-multi
+and is trained on the same corpora — making the pair a natural starting point
+for unified speech-text modelling research.
+## Repository contents
+| File | Purpose |
+|---|---|
+| `model.safetensors` | Model weights |
+| `config.json` | `WrenASRConfig` (with `auto_map` for `trust_remote_code`) |
+| `tokenizer.json` + friends | Qwen2.5 tokenizer with Wren-ASR's 2 special tokens added |
+| `processor_config.json` | `WrenASRProcessor` auto_map |
+| `configuration_wren_asr.py` | `WrenASRConfig(PretrainedConfig)` |
+| `modeling_wren_asr.py` | `WrenForASR(PreTrainedModel)` — loads Mimi codec lazily on first call |
+| `processing_wren_asr.py` | `WrenASRProcessor(ProcessorMixin)` — audio → Mimi codes + text decode |
+| `README.md` | This model card |
+## Citation
+```bibtex
+@misc{wren2026,
+  title  = {Wren: A Family of Small Open-Weight Models for Unified Speech-Text Modelling},
+  author = {Shangeth Rajaa},
+  year   = {2026},
+  url    = {https://github.com/shangeth/wren}
+}
+```
+## License
+Apache-2.0 for the checkpoint weights and code in this repo.
+Upstream components carry their own licenses — review before redistribution.
+The Expresso dataset (used for English style robustness) is CC-BY-NC-4.0; if
+you build derived models on this checkpoint and want to release them
+commercially, retrain with Expresso excluded.
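
The README caps input at 30 s and leaves segmentation to the caller. The following is a minimal sketch of one way to do that, assuming naive fixed-length windows with no overlap. `split_into_segments` is a hypothetical helper, not part of this repo, and a splitter based on voice activity would likely avoid cutting words at window boundaries.

```python
def split_into_segments(num_samples, sample_rate, max_seconds=30.0):
    """Return (start, end) sample ranges, each at most `max_seconds` long.

    Hypothetical helper: fixed, non-overlapping windows; the last may be shorter.
    """
    if num_samples < 0 or sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("num_samples must be >= 0; sample_rate and max_seconds must be > 0")
    # Window length in samples; never let it collapse to zero.
    window = max(1, int(sample_rate * max_seconds))
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# 70 s of 24 kHz audio: two full 30 s windows plus a 10 s remainder.
segments = split_into_segments(70 * 24000, 24000)
```

Each `(start, end)` pair can then slice the waveform (for example `wav[:, start:end]`) before it is handed to the processor, and the per-segment transcripts can be joined afterwards.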

added_tokens.json ADDED

	@@ -0,0 +1,26 @@

+{
+  "</tool_call>": 151658,
+  "<tool_call>": 151657,
+  "<|audio_end|>": 151666,
+  "<|audio_start|>": 151665,
+  "<|box_end|>": 151649,
+  "<|box_start|>": 151648,
+  "<|endoftext|>": 151643,
+  "<|file_sep|>": 151664,
+  "<|fim_middle|>": 151660,
+  "<|fim_pad|>": 151662,
+  "<|fim_prefix|>": 151659,
+  "<|fim_suffix|>": 151661,
+  "<|im_end|>": 151645,
+  "<|im_start|>": 151644,
+  "<|image_pad|>": 151655,
+  "<|object_ref_end|>": 151647,
+  "<|object_ref_start|>": 151646,
+  "<|quad_end|>": 151651,
+  "<|quad_start|>": 151650,
+  "<|repo_name|>": 151663,
+  "<|video_pad|>": 151656,
+  "<|vision_end|>": 151653,
+  "<|vision_pad|>": 151654,
+  "<|vision_start|>": 151652
+}

chat_template.jinja ADDED

	@@ -0,0 +1,54 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- messages[0]['content'] }}
+    {%- else %}
+        {{- 'You are a helpful assistant.' }}
+    {%- endif %}
+    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
+    {%- else %}
+        {{- '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- for message in messages %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role }}
+        {%- if message.content %}
+            {{- '\n' + message.content }}
+        {%- endif %}
+        {%- for tool_call in message.tool_calls %}
+            {%- if tool_call.function is defined %}
+                {%- set tool_call = tool_call.function %}
+            {%- endif %}
+            {{- '\n<tool_call>\n{"name": "' }}
+            {{- tool_call.name }}
+            {{- '", "arguments": ' }}
+            {{- tool_call.arguments | tojson }}
+            {{- '}\n</tool_call>' }}
+        {%- endfor %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- message.content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}

config.json ADDED

	@@ -0,0 +1,21 @@

+{
+  "architectures": [
+    "WrenForASR"
+  ],
+  "audio_end_id": 151666,
+  "audio_start_id": 151665,
+  "auto_map": {
+    "AutoConfig": "configuration_wren_asr.WrenASRConfig",
+    "AutoModel": "modeling_wren_asr.WrenForASR"
+  },
+  "codebook_size": 2048,
+  "dtype": "bfloat16",
+  "eos_token_id": 151643,
+  "k_codebooks": 3,
+  "llm_name": "Qwen/Qwen2.5-0.5B",
+  "mimi_model_name": "kyutai/mimi",
+  "model_type": "wren_asr",
+  "sampling_rate": 24000,
+  "transformers_version": "4.57.6",
+  "vocab_size": 151672
+}
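
A quick cross-check of the IDs in this config against `added_tokens.json` from the same commit. The values are copied from the diff. The observation that 151672 equals the tokenizer length rounded up to a multiple of 8 is arithmetic only; whether the training code pads the embedding matrix this way is an assumption, and `round_up` is a hypothetical helper.

```python
# Values copied from added_tokens.json and config.json in this commit.
added_tokens = {
    "<|endoftext|>": 151643,
    "<|audio_start|>": 151665,
    "<|audio_end|>": 151666,
}
config = {
    "audio_start_id": 151665,
    "audio_end_id": 151666,
    "eos_token_id": 151643,
    "vocab_size": 151672,
}


def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


# The special-token IDs in the config should agree with the tokenizer files.
ids_match = (
    config["audio_start_id"] == added_tokens["<|audio_start|>"]
    and config["audio_end_id"] == added_tokens["<|audio_end|>"]
    and config["eos_token_id"] == added_tokens["<|endoftext|>"]
)

# Highest token ID is 151666, so the tokenizer covers IDs 0..151666.
tokenizer_len = max(added_tokens.values()) + 1
padded_len = round_up(tokenizer_len, 8)
```

Under that assumption the embedding matrix has a few unused rows above the highest token ID, which is harmless as long as `vocab_size` is never smaller than the tokenizer length.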

configuration_wren_asr.py ADDED

	@@ -0,0 +1,30 @@

+"""Wren ASR configuration — transformers-compatible."""
+from transformers import PretrainedConfig
+class WrenASRConfig(PretrainedConfig):
+    model_type = "wren_asr"
+    def __init__(
+        self,
+        llm_name:        str = "Qwen/Qwen2.5-0.5B",
+        mimi_model_name: str = "kyutai/mimi",
+        k_codebooks:     int = 3,
+        codebook_size:   int = 2048,
+        vocab_size:      int = 151944,
+        # Special-token IDs (in the resized text vocab)
+        audio_start_id:  int = None,    # <|audio_start|> — opens audio prefix
+        audio_end_id:    int = None,    # <|audio_end|>   — closes audio prefix; text begins after
+        eos_token_id:    int = None,    # end of transcript (LLM's existing eos)
+        sampling_rate:   int = 24000,
+        **kwargs,
+    ):
+        self.llm_name        = llm_name
+        self.mimi_model_name = mimi_model_name
+        self.k_codebooks     = k_codebooks
+        self.codebook_size   = codebook_size
+        self.vocab_size      = vocab_size
+        self.audio_start_id  = audio_start_id
+        self.audio_end_id    = audio_end_id
+        self.sampling_rate   = sampling_rate
+        super().__init__(eos_token_id=eos_token_id, **kwargs)

merges.txt ADDED

The diff for this file is too large to render. See raw diff

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4c209c4f1d83e711a818eaabf310ce0f9df18bbfdd7771ba2f59ca49e94f78ac
+size 1009646296

modeling_wren_asr.py ADDED

	@@ -0,0 +1,145 @@

+"""
+Wren-ASR model — a transformers-compatible wrapper over Qwen2.5-0.5B + Mimi
+input embedding tables.
+Designed for use with `AutoModel.from_pretrained(..., trust_remote_code=True)`.
+Self-contained: no imports from a `src/` folder.
+Sequence layout:
+  [ <audio_start> | sum_q embed_q(codes[q, t]) for t in 0..T-1 | <audio_end> | text... | <eos> ]
+Audio positions feed a single summed-codebook embedding per real frame (no delay
+pattern). Text-token prediction uses the LLM's existing `lm_head`; no new output
+heads are added.
+"""
+import math
+from typing import Optional
+import torch
+import torch.nn as nn
+from transformers import AutoConfig, AutoModelForCausalLM, PreTrainedModel
+try:
+    from .configuration_wren_asr import WrenASRConfig          # package context (HF trust_remote_code)
+except ImportError:
+    import importlib
+    WrenASRConfig = importlib.import_module("configuration_wren_asr").WrenASRConfig
+class WrenForASR(PreTrainedModel):
+    config_class      = WrenASRConfig
+    base_model_prefix = "wren_asr"
+    def __init__(self, config: WrenASRConfig):
+        super().__init__(config)
+        self.k = config.k_codebooks
+        # Build backbone from its config only. Pretrained backbone weights are
+        # already in our state_dict; no need to re-download.
+        llm_cfg            = AutoConfig.from_pretrained(config.llm_name)
+        llm_cfg.vocab_size = config.vocab_size
+        self.llm           = AutoModelForCausalLM.from_config(llm_cfg)
+        hidden = self.llm.config.hidden_size
+        # k input embedding tables (codes are inputs only — no PAD row needed).
+        self.audio_embeds = nn.ModuleList([
+            nn.Embedding(config.codebook_size, hidden)
+            for _ in range(self.k)
+        ])
+        self.embed_scale = 1.0 / math.sqrt(self.k)
+        self._mimi = None  # lazy-loaded on first use
+    # --- Mimi codec (lazy-loaded encoder for raw-waveform input) ---
+    @property
+    def mimi(self):
+        if self._mimi is None:
+            from transformers import MimiModel
+            self._mimi = MimiModel.from_pretrained(self.config.mimi_model_name).to(self.device)
+            self._mimi.eval()
+            for p in self._mimi.parameters():
+                p.requires_grad_(False)
+        return self._mimi
+    @torch.no_grad()
+    def encode_audio(
+        self,
+        waveform:        torch.Tensor,
+        src_sample_rate: int = 24000,
+    ) -> torch.LongTensor:
+        """Encode a waveform to Mimi codes [k, n_frames]."""
+        if waveform.dim() == 1:
+            waveform = waveform.unsqueeze(0)
+        if src_sample_rate != self.config.sampling_rate:
+            import torchaudio.transforms as T
+            waveform = T.Resample(src_sample_rate, self.config.sampling_rate)(waveform)
+        x   = waveform.unsqueeze(0).to(self.device)
+        out = self.mimi.encode(x, num_quantizers=self.k)
+        return out.audio_codes[0].cpu()  # [k, n_frames]
+    # --- Generation ---
+    @torch.no_grad()
+    def generate(
+        self,
+        audio_codes:    torch.LongTensor,    # [k, T] or [B, k, T]
+        max_new_tokens: int   = 200,
+        do_sample:      bool  = False,
+        temperature:    float = 1.0,
+        top_k:          int   = 50,
+        top_p:          float = 1.0,
+        eos_token_id:   Optional[int] = None,
+        pad_token_id:   Optional[int] = None,
+        **kwargs,
+    ) -> torch.LongTensor:
+        """Transcribe Mimi codes to text-token IDs.
+        Returns the generated token IDs (without the audio prefix). Decode them
+        with your tokenizer to get text — typically:
+            ids = model.generate(audio_codes=codes)
+            text = tokenizer.decode(ids[0], skip_special_tokens=True)
+        """
+        device = next(self.parameters()).device
+        self.eval()
+        audio_codes = audio_codes.to(device)
+        if audio_codes.dim() == 2:
+            audio_codes = audio_codes.unsqueeze(0)   # [1, k, T]
+        B, k, T = audio_codes.shape
+        assert k == self.k, f"expected k={self.k}, got {k}"
+        embed_tokens = self.llm.get_input_embeddings()
+        llm_dtype    = next(self.llm.parameters()).dtype
+        start_emb = embed_tokens(torch.tensor([[self.config.audio_start_id]], device=device))
+        end_emb   = embed_tokens(torch.tensor([[self.config.audio_end_id]],   device=device))
+        clamped   = audio_codes.clamp(0, self.config.codebook_size - 1)
+        audio_sum = self.audio_embeds[0](clamped[:, 0, :])
+        for q in range(1, self.k):
+            audio_sum = audio_sum + self.audio_embeds[q](clamped[:, q, :])
+        audio_sum = audio_sum * self.embed_scale
+        prompt_embeds = torch.cat([
+            start_emb.expand(B, -1, -1),
+            audio_sum,
+            end_emb.expand(B, -1, -1),
+        ], dim=1).to(llm_dtype)
+        gen_ids = self.llm.generate(
+            inputs_embeds  = prompt_embeds,
+            max_new_tokens = max_new_tokens,
+            do_sample      = do_sample,
+            temperature    = temperature if do_sample else 1.0,
+            top_k          = top_k       if do_sample else 0,
+            top_p          = top_p       if do_sample else 1.0,
+            eos_token_id   = eos_token_id if eos_token_id is not None else self.config.eos_token_id,
+            pad_token_id   = pad_token_id,
+        )
+        # When called with `inputs_embeds`, HF generate returns ONLY the
+        # generated ids (the prompt has no token ids to echo back).
+        return gen_ids
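
The prompt construction in `generate` can be restated without PyTorch. This is a dependency-free sketch with toy sizes, not the model code: `summed_frame_embedding` and `prompt_length` are hypothetical names, and plain lists stand in for the `nn.Embedding` tables. The 1/√k factor presumably keeps the summed vector at roughly the scale of a single embedding, since the variance of a sum of k independent, similarly scaled vectors grows with k; that motivation is an inference and is not stated in the source.

```python
import math


def summed_frame_embedding(tables, frame_codes):
    """Sum one row per codebook table for a single frame, then scale by 1/sqrt(k)."""
    k = len(tables)
    if len(frame_codes) != k:
        raise ValueError(f"expected {k} codes, got {len(frame_codes)}")
    hidden = len(tables[0][0])
    total = [0.0] * hidden
    for table, code in zip(tables, frame_codes):
        # Look up this codebook's row and accumulate it element-wise.
        total = [acc + value for acc, value in zip(total, table[code])]
    scale = 1.0 / math.sqrt(k)
    return [value * scale for value in total]


def prompt_length(num_frames):
    """<|audio_start|> + one summed embedding per frame + <|audio_end|>."""
    return 1 + num_frames + 1


# Toy setup: k=3 tables, 4 codes each, hidden size 2; every row is [1.0, 1.0].
toy_tables = [[[1.0, 1.0] for _ in range(4)] for _ in range(3)]
frame = summed_frame_embedding(toy_tables, [0, 3, 1])
```

With all-ones rows, each component of `frame` is 3/√3 = √3, and a 375-frame clip yields a prompt of 377 positions before any text is generated.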

processing_wren_asr.py ADDED

	@@ -0,0 +1,78 @@

+"""
+Wren-ASR processor: audio → Mimi codes (and optionally back to text via the
+tokenizer for decoding model outputs).
+Usage:
+  processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
+  inputs    = processor(audio=wav, sampling_rate=sr)        # → {"audio_codes": [1, k, T]}
+  ids       = model.generate(**inputs, max_new_tokens=200)
+  text      = processor.batch_decode(ids, skip_special_tokens=True)[0]
+"""
+from typing import Optional, Union
+import torch
+from transformers.processing_utils import ProcessorMixin
+class WrenASRProcessor(ProcessorMixin):
+    attributes      = ["tokenizer"]
+    tokenizer_class = "AutoTokenizer"
+    def __init__(self, tokenizer, mimi_model_name: str = "kyutai/mimi", k_codebooks: int = 3, **kwargs):
+        super().__init__(tokenizer=tokenizer)
+        self.mimi_model_name = mimi_model_name
+        self.k_codebooks     = k_codebooks
+        self._mimi = None
+    @property
+    def mimi(self):
+        if self._mimi is None:
+            from transformers import MimiModel
+            self._mimi = MimiModel.from_pretrained(self.mimi_model_name).eval()
+            for p in self._mimi.parameters():
+                p.requires_grad_(False)
+        return self._mimi
+    @torch.no_grad()
+    def __call__(
+        self,
+        audio:         Optional[torch.Tensor]      = None,
+        sampling_rate: Optional[int]               = None,
+        audio_codes:   Optional[torch.LongTensor]  = None,
+        **kwargs,
+    ):
+        """Either pass `audio` (raw waveform) + `sampling_rate`, or pre-computed
+        `audio_codes` of shape [k, T] / [B, k, T].
+        Returns: {"audio_codes": LongTensor [B, k, T]}.
+        """
+        if audio_codes is not None:
+            codes = audio_codes
+            if codes.dim() == 2:
+                codes = codes.unsqueeze(0)
+            return {"audio_codes": codes}
+        if audio is None:
+            raise ValueError("Provide either `audio` (waveform) or `audio_codes`.")
+        if sampling_rate is None:
+            raise ValueError("`sampling_rate` is required when passing `audio`.")
+        wav = audio
+        if wav.dim() == 1:
+            wav = wav.unsqueeze(0)
+        if sampling_rate != 24000:
+            import torchaudio.transforms as T
+            wav = T.Resample(sampling_rate, 24000)(wav)
+        x   = wav.unsqueeze(0)                              # [1, 1, T]
+        out = self.mimi.encode(x, num_quantizers=self.k_codebooks)
+        codes = out.audio_codes                             # [1, k, T]
+        return {"audio_codes": codes}
+    def batch_decode(self, *args, **kwargs):
+        return self.tokenizer.batch_decode(*args, **kwargs)
+    def decode(self, *args, **kwargs):
+        return self.tokenizer.decode(*args, **kwargs)
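
For sanity-checking the `audio_codes` shape the processor returns, the expected frame count follows from the rates quoted in this commit: 24000 Hz / 12.5 fps = 1920 samples per frame. This is a rough sketch. `expected_frames` is a hypothetical helper, and rounding partial frames up is an assumption about Mimi's padding, so treat the result as approximate at clip boundaries.

```python
import math

MIMI_SAMPLE_RATE = 24000   # `sampling_rate` in config.json
MIMI_FRAME_RATE = 12.5     # frames per second, per the README
SAMPLES_PER_FRAME = int(MIMI_SAMPLE_RATE / MIMI_FRAME_RATE)  # 1920


def expected_frames(num_samples, source_rate):
    """Approximate T in audio_codes [1, k, T] for `num_samples` at `source_rate` Hz."""
    if source_rate <= 0:
        raise ValueError("source_rate must be positive")
    # Length after resampling to Mimi's rate, then frames (assumed to round up).
    resampled = num_samples * MIMI_SAMPLE_RATE / source_rate
    return math.ceil(resampled / SAMPLES_PER_FRAME)


frames_30s_24k = expected_frames(30 * 24000, 24000)
frames_10s_16k = expected_frames(10 * 16000, 16000)
```

A 30 s clip lands exactly on the 375-frame training cap mentioned in the README, which makes this a convenient pre-flight check before calling `model.generate`.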

processor_config.json ADDED

	@@ -0,0 +1,8 @@

+{
+  "processor_class": "WrenASRProcessor",
+  "auto_map": {
+    "AutoProcessor": "processing_wren_asr.WrenASRProcessor"
+  },
+  "mimi_model_name": "kyutai/mimi",
+  "k_codebooks": 3
+}

special_tokens_map.json ADDED

	@@ -0,0 +1,32 @@

+{
+  "additional_special_tokens": [
+    {
+      "content": "<|audio_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|audio_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    }
+  ],
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3e6df08c131c86d31c1366adc78bf78a1c59f2ef0e05bfd93f48c963c3abcea9
+size 11422278

tokenizer_config.json ADDED

	@@ -0,0 +1,212 @@

+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "151643": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151644": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151645": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151646": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151647": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151648": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151649": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151650": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151651": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151652": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151653": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151654": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151655": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151656": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151657": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151658": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151659": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151660": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151661": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151662": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151663": {
+      "content": "<|repo_name|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151664": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151665": {
+      "content": "<|audio_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151666": {
+      "content": "<|audio_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|audio_start|>",
+    "<|audio_end|>"
+  ],
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|endoftext|>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}

vocab.json ADDED

The diff for this file is too large to render. See raw diff