Add missing `config.json`

#8
by Akshat - opened

Problem:
Currently, AutoFeatureExtractor.from_pretrained("kyutai/moshiko-pytorch-bf16"), as shown on the model doc page at huggingface.co/docs/transformers/en/model_doc/moshi under the heading "1. Model generation", fails with an OSError because preprocessor_config.json is missing from the repo. This is inconsistent with other repos in the collection, such as kyutai/moshiko-pytorch-q8 and kmhf/hf-moshiko, which do contain the necessary configuration files.

from datasets import load_dataset, Audio
import torch, math
from transformers import MoshiForConditionalGeneration, AutoFeatureExtractor, AutoTokenizer


librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/moshiko-pytorch-bf16")
tokenizer = AutoTokenizer.from_pretrained("kyutai/moshiko-pytorch-bf16")
device = "cuda"
dtype = torch.bfloat16

# `model` is undefined in the doc snippet; it comes from the doc's earlier
# model-loading step, reproduced here so the example runs standalone
model = MoshiForConditionalGeneration.from_pretrained("kyutai/moshiko-pytorch-bf16", torch_dtype=dtype).to(device)

# prepare user input audio 
librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
audio_sample = librispeech_dummy[-1]["audio"]["array"]
user_input_values = feature_extractor(raw_audio=audio_sample, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to(device=device, dtype=dtype)

# prepare moshi input values - we suppose moshi didn't say anything while the user spoke
moshi_input_values = torch.zeros_like(user_input_values.input_values)

# prepare moshi input ids - we suppose moshi didn't say anything while the user spoke
# `waveform_to_token_ratio` is also undefined in the doc snippet; the intent (assumed
# here) is audio-token frames per waveform sample, i.e. Mimi's 12.5 Hz frame rate
# over the 24 kHz sampling rate
waveform_to_token_ratio = model.config.audio_encoder_config.frame_rate / feature_extractor.sampling_rate
num_tokens = math.ceil(moshi_input_values.shape[-1] * waveform_to_token_ratio)
input_ids = torch.ones((1, num_tokens), device=device, dtype=torch.int64) * tokenizer.encode("<pad>")[0]

# generate 25 new tokens (around 2s of audio)
output = model.generate(input_ids=input_ids, user_input_values=user_input_values.input_values, moshi_input_values=moshi_input_values, max_new_tokens=25)

text_tokens = output.sequences
audio_waveforms = output.audio_sequences
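
For completeness, once the missing file is in place and the snippet above runs, the outputs can be inspected; a hedged sketch, assuming soundfile is installed and that audio_sequences holds one mono waveform per batch item:

import soundfile as sf

# decode Moshi's text stream and dump its audio stream to a WAV for listening
print(tokenizer.batch_decode(text_tokens, skip_special_tokens=True))
waveform = audio_waveforms[0].squeeze().cpu().float().numpy()
sf.write("moshi_reply.wav", waveform, feature_extractor.sampling_rate)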

Error:

OSError: kyutai/moshiko-pytorch-bf16 does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co/kyutai/moshiko-pytorch-bf16/tree/main' for available files.
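
Until the file lands, a possible stopgap is to load the preprocessing pieces from a repo that does ship them; a minimal sketch, assuming kmhf/hf-moshiko stays consistent with this checkpoint:

from transformers import AutoFeatureExtractor

# workaround sketch: kmhf/hf-moshiko ships preprocessor_config.json,
# so the feature extractor loads from there without the OSError
feature_extractor = AutoFeatureExtractor.from_pretrained("kmhf/hf-moshiko")
print(feature_extractor.sampling_rate)  # 24000 for Moshi/Mimi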

Confirmation from Source Repository:
This has been confirmed by the model's authors as an issue for the Transformers port to handle (see https://github.com/kyutai-labs/moshi/issues/234).

Expected behavior:
AutoFeatureExtractor.from_pretrained("kyutai/moshiko-pytorch-bf16") loads without errors, as it does for the other repos in the collection.

Proposed Solution:
Adding the missing configuration files (config.json and preprocessor_config.json) will resolve this.

Proposed config.json:
(Based on the corresponding files in kyutai/moshiko-pytorch-q8 and kyutai/moshiko-pytorch-bf16)

{
    "moshi_name": "model.safetensors",
    "mimi_name": "tokenizer-e351c8d8-checkpoint125.safetensors",
    "tokenizer_name": "tokenizer_spm_32k_3.model",
    "quantize": false,
    "dim": 4096,
    "text_card": 32000,
    "existing_text_padding_id": 3,
    "n_q": 16,
    "dep_q": 8,
    "card": 2048,
    "num_heads": 32,
    "num_layers": 32,
    "hidden_scale": 4.125,
    "causal": true,
    "layer_scale": null,
    "context": 3000,
    "max_period": 10000,
    "gating": "silu",
    "norm": "rms_norm_f32",
    "positional_embedding": "rope",
    "depformer_dim": 1024,
    "depformer_dim_feedforward": 4224,
    "depformer_num_heads": 16,
    "depformer_num_layers": 6,
    "depformer_causal": true,
    "depformer_layer_scale": null,
    "depformer_multi_linear": true,
    "depformer_context": 8,
    "depformer_max_period": 10000,
    "depformer_gating": "silu",
    "depformer_pos_emb": "none",
    "depformer_weights_per_step": true,
    "delays": [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
}
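
As a sanity check, the proposed file can be validated locally and submitted as a Hub pull request; a sketch with huggingface_hub, assuming the JSON above is saved as config.json and the caller has the needed permissions:

import json
from huggingface_hub import HfApi

# fail fast if the proposed JSON was copied with a syntax error
with open("config.json") as f:
    json.load(f)

# open a pull request against the repo rather than pushing to main
HfApi().upload_file(
    path_or_fileobj="config.json",
    path_in_repo="config.json",
    repo_id="kyutai/moshiko-pytorch-bf16",
    create_pr=True,
)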
Akshat changed pull request status to open

I have added config.json as proposed above, but I am not sure whether preprocessor_config.json should look like the following:
Proposed preprocessor_config.json:
(Copied from kmhf/hf-moshiko)

{
  "feature_extractor_type": "EncodecFeatureExtractor",
  "sampling_rate": 24000,
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "chunk_length_s": null,
  "overlap": null
}
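
One way to test this proposal without touching the Hub: write the file to a local directory and check that AutoFeatureExtractor resolves it to an EncodecFeatureExtractor. A sketch under that assumption:

import json, os, tempfile
from transformers import AutoFeatureExtractor

proposed = {
    "feature_extractor_type": "EncodecFeatureExtractor",
    "sampling_rate": 24000,
    "feature_size": 1,
    "padding_side": "right",
    "padding_value": 0.0,
    "return_attention_mask": True,
    "chunk_length_s": None,
    "overlap": None,
}
with tempfile.TemporaryDirectory() as d:
    # write the proposed file and let AutoFeatureExtractor pick it up
    with open(os.path.join(d, "preprocessor_config.json"), "w") as f:
        json.dump(proposed, f)
    fe = AutoFeatureExtractor.from_pretrained(d)
    print(type(fe).__name__, fe.sampling_rate)  # expect: EncodecFeatureExtractor 24000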

If someone can clear up this doubt, I can go ahead and add preprocessor_config.json to this PR as well!

yep... would've been nice if I'd seen all that BEFORE spending who knows how long setting up and downloading everything... only to find out that the person doesn't care enough to either upload the proper files initially or actually monitor their own repo for messages like these...
Thanks for nothing again, HF...

