Kimi-Audio random / test fixture

A tiny, randomly initialized bundle of Kimi-Audio-7B-Instruct for vLLM-Omni's L1/L2 core_model CI tests. It exercises the full pipeline end-to-end without paying the ~42 GB checkpoint cost.

It follows the same on-disk schema as upstream, but every transformer-style component has shrunk dimensions and random weights:

| Component | File | Upstream | Random |
|---|---|---|---|
| LM (Qwen-2-style + MIMO) | model.safetensors | 16 GB, sharded | 555 MB (single shard) |
| Whisper encoder | whisper-large-v3/model.safetensors | 3 GB | 17 MB (encoder only) |
| Audio detokenizer (FM DiT) | audio_detokenizer/model.pt | 19 GB | 35 MB |

Shrunk dims (token IDs / vocab sizes kept at upstream values):

- LM: hidden_size 3584 → 512, num_hidden_layers 28 → 4, num_attention_heads 28 → 8, intermediate_size 18944 → 1536, kimia_mimo_layers 6 → 2, kimia_mimo_transformer_from_layer_index 21 → 2, kimia_adaptor_input_dim 5120 → 1536
- Whisper: d_model 1280 → 384, encoder_layers 32 → 4, encoder_ffn_dim 5120 → 1536, encoder_attention_heads 20 → 6 (decoder weights dropped; vLLM only uses the encoder)
- FM DiT: hidden_size 2304 → 384, depth 16 → 4, num_heads 18 → 6, condition_input_dim 1280 → 384
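A minimal sketch of how the shrink works: start from an upstream-style config dict and overwrite only the dimension keys listed above, leaving token IDs and vocab sizes untouched. The helper name and the example dict are illustrative, not the actual build script.

```python
# Illustrative only: apply the LM dimension overrides from the list above
# to an upstream-style config dict. Token-ID / vocab keys are left alone.

LM_OVERRIDES = {
    "hidden_size": 512,
    "num_hidden_layers": 4,
    "num_attention_heads": 8,
    "intermediate_size": 1536,
    "kimia_mimo_layers": 2,
    "kimia_mimo_transformer_from_layer_index": 2,
    "kimia_adaptor_input_dim": 1536,
}

def shrink_config(upstream: dict, overrides: dict) -> dict:
    """Return a copy of the config with shrunk dims; other keys pass through."""
    cfg = dict(upstream)
    cfg.update(overrides)
    return cfg

# Upstream values taken from the list above (other keys omitted here).
upstream = {
    "hidden_size": 3584,
    "num_hidden_layers": 28,
    "num_attention_heads": 28,
    "intermediate_size": 18944,
}
tiny = shrink_config(upstream, LM_OVERRIDES)
```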

The bundle does not ship a vocoder/ subdir; KimiBigVGAN is loaded from zhangj1an/kimi-audio-bigvgan-hf at runtime.

modeling_moonshot_kimia.py was patched to stub the flash_attn symbols (instead of raising an ImportError) so that AutoModelForCausalLM.from_config(trust_remote_code=True) works in CI without flash_attn installed; vLLM-Omni replaces the attention implementation anyway.
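The patch follows a common import-guard pattern, sketched below. The exact symbol list in modeling_moonshot_kimia.py may differ; the names here are illustrative.

```python
# Sketch of the flash_attn stubbing pattern: if the package is missing, bind
# placeholder callables instead of letting the import fail, so the model
# skeleton can still be built in CI. vLLM-Omni swaps in its own attention
# implementation, so the stubs should never actually be called.
try:
    from flash_attn import flash_attn_func, flash_attn_varlen_func  # real impl
except ImportError:
    def _stub(*args, **kwargs):
        # Only reached if something really invokes flash attention.
        raise RuntimeError("flash_attn is not installed (stubbed for CI)")

    flash_attn_func = flash_attn_varlen_func = _stub
```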

Do not use for actual generation: the outputs are noise.
