llm-jp-4-8b-speech-chat

English (Japanese version below)

Overview

llm-jp-4-8b-speech-chat is a Japanese speech-language model that takes audio as input and produces text as output. It supports spoken dialogue, instruction following, speech transcription, and audio description.

Provenance

This model belongs to the LLM-jp-4 8B family, but it is not a direct derivative of the publicly released llm-jp/llm-jp-4-8b-base or llm-jp/llm-jp-4-8b-thinking. Instead, it was initialized from a pre-release intermediate checkpoint of the llm-jp-4-8b development line that was distributed for a competition (or a pseudo-base model derived from that checkpoint).

Usage

Quick start

pip install git+https://github.com/Atotti/ja-speech-llm.git

Minimal inference example

import torch
import torchaudio
from transformers import AutoProcessor, AutoTokenizer
from speech_llm_ja import LlamaForSpeechLM, LlamaForSpeechLMConfig

MODEL_ID = "Atotti/llm-jp-4-8b-speech-chat"

config = LlamaForSpeechLMConfig.from_pretrained(MODEL_ID)
model = LlamaForSpeechLM.from_pretrained(
    MODEL_ID,
    config=config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval()

encoder_processor = AutoProcessor.from_pretrained(model.config.encoder_id)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

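# Load audio, downmix to mono, and resample to the 16 kHz rate passed to the encoder processor
waveform, sample_rate = torchaudio.load("path/to/your_audio_file.wav")
if waveform.size(0) > 1:
    # Average the channels of multi-channel audio into a single mono channel
    waveform = waveform.mean(dim=0, keepdim=True)
waveform = waveform.squeeze(0)  # (1, num_samples) -> (num_samples,)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)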

# Build prompt; `instruction` can be any of the strings listed under "Recommended prompts"
instruction = "音声の指示に従ってください。"
prompt = f"""あなたは音声を理解できるAIアシスタントです。

<|reserved_343|><|reserved_342|>### 指示:
{instruction}

### 応答:
"""

# Encode
encoder_inputs = encoder_processor(
    [waveform.numpy()],
    return_tensors="pt",
    return_attention_mask=True,
    sampling_rate=16000,
)
decoder_inputs = tokenizer(prompt, return_tensors="pt")

# Generate
with torch.no_grad():
    output_ids = model.generate(
        input_features=encoder_inputs.input_features.to(model.device),
        input_ids=decoder_inputs.input_ids.to(model.device),
        encoder_attention_mask=encoder_inputs.attention_mask.to(model.device),
        decoder_attention_mask=decoder_inputs.attention_mask.to(model.device),
        max_new_tokens=1024,
        do_sample=False,
    )

# Keep only the newly generated tokens (drop the prompt tokens) before decoding
generated_ids = output_ids[0, decoder_inputs.input_ids.shape[1]:]
print(tokenizer.decode(generated_ids, skip_special_tokens=True))

Gradio demo

git clone https://github.com/Atotti/ja-speech-llm.git
cd ja-speech-llm
uv sync
uv run python gradio_demo.py

Recommended prompts

Task	Prompt
Dialogue / Instruction	`音声の指示に従ってください。`
Transcription	`音声を書き起こしてください。`
Description	`音声を説明してください。`
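
The table above can be combined with the prompt layout from the minimal inference example. The following is a minimal sketch, not part of the ja-speech-llm package: the names `RECOMMENDED_PROMPTS`, `PROMPT_TEMPLATE`, and `build_prompt`, and the task keys `dialogue`, `transcription`, and `description`, are hypothetical and chosen only for illustration. The instruction strings and the template text are copied from this model card.

```python
# Recommended instruction strings, copied from the table above.
# The dictionary keys are hypothetical labels, not part of any library API.
RECOMMENDED_PROMPTS = {
    "dialogue": "音声の指示に従ってください。",
    "transcription": "音声を書き起こしてください。",
    "description": "音声を説明してください。",
}

# Prompt layout copied from the minimal inference example.
PROMPT_TEMPLATE = (
    "あなたは音声を理解できるAIアシスタントです。\n"
    "\n"
    "<|reserved_343|><|reserved_342|>### 指示:\n"
    "{instruction}\n"
    "\n"
    "### 応答:\n"
)


def build_prompt(task: str) -> str:
    """Return the full text prompt for one of the recommended tasks."""
    try:
        instruction = RECOMMENDED_PROMPTS[task]
    except KeyError:
        expected = sorted(RECOMMENDED_PROMPTS)
        raise ValueError(f"unknown task {task!r}; expected one of {expected}") from None
    return PROMPT_TEMPLATE.format(instruction=instruction)


if __name__ == "__main__":
    # Print the prompt that would be tokenized for a transcription request.
    print(build_prompt("transcription"))
```

The returned string can be passed to the tokenizer in place of `prompt` in the inference example. Keeping the template in one place makes it harder to drift from the layout shown in this model card when switching tasks.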

Evaluation

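Higher is better for ADU-Bench (↑) and lower is better for CER (↓). A "-" marks a score that is not reported.

Model	ADU-Bench (ja) ↑	Common Voice 8 (ja) CER (%) ↓
Whisper-large-v3	-	8.51
SALMONN	1.37	-
Qwen-Audio-Chat	1.08	-
Voxtral Mini 3B-2507	5.181	15.65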
Gemma3n E4B-it	5.143	51.23
llm-jp-4-8b-speech-asr	-	8.36
llm-jp-4-8b-speech-chat	5.335	10.25
llm-jp-4-8b-speech-chat-dpo-exp	5.165	10.42
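
For readers unfamiliar with the CER column, the sketch below shows the standard definition of character error rate: the character-level edit distance between hypothesis and reference, divided by the number of reference characters. It is an illustration only. This model card does not state the text normalization or aggregation used for the reported numbers, so the corpus-level aggregation, the percentage scaling, and the function names `edit_distance` and `cer_percent` are assumptions.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def cer_percent(references, hypotheses):
    """Corpus-level CER in percent: total edits / total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return 100.0 * total_edits / total_chars


if __name__ == "__main__":
    # Two of five reference characters are substituted, so the CER is 40.0.
    print(cer_percent(["こんにちは"], ["こんばんは"]))
```

In practice, evaluation scripts usually normalize punctuation and whitespace before scoring, and that choice can shift CER noticeably for Japanese text.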

Limitations

The model is primarily optimized for Japanese.
It may make recognition or reasoning errors.
Performance may degrade on noisy, long, or spontaneous speech.

日本語

概要

llm-jp-4-8b-speech-chat は、音声入力 → テキスト出力の日本語音声言語モデルです。音声対話、指示追従、音声書き起こし、音声説明をサポートします。

モデルの由来

本モデルは LLM-jp-4 8B 系列に属しますが、公開されている llm-jp/llm-jp-4-8b-base や llm-jp/llm-jp-4-8b-thinking から直接派生したものではありません。 llm-jp-4-8b 開発系列のコンペ配布中間チェックポイント / そこから派生した仮モデルを初期値として使用しています。

使い方

クイックスタート

pip install git+https://github.com/Atotti/ja-speech-llm.git

最小推論例

import torch
import torchaudio
from transformers import AutoProcessor, AutoTokenizer
from speech_llm_ja import LlamaForSpeechLM, LlamaForSpeechLMConfig

MODEL_ID = "Atotti/llm-jp-4-8b-speech-chat"

config = LlamaForSpeechLMConfig.from_pretrained(MODEL_ID)
model = LlamaForSpeechLM.from_pretrained(
    MODEL_ID,
    config=config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).eval()

encoder_processor = AutoProcessor.from_pretrained(model.config.encoder_id)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# 音声読み込み
waveform, sample_rate = torchaudio.load("path/to/your_audio_file.wav")
if waveform.size(0) > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
waveform = waveform.squeeze(0)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# プロンプト構築
instruction = "音声の指示に従ってください。"
prompt = f"""あなたは音声を理解できるAIアシスタントです。

<|reserved_343|><|reserved_342|>### 指示:
{instruction}

### 応答:
"""

# エンコード
encoder_inputs = encoder_processor(
    [waveform.numpy()],
    return_tensors="pt",
    return_attention_mask=True,
    sampling_rate=16000,
)
decoder_inputs = tokenizer(prompt, return_tensors="pt")

# 生成
with torch.no_grad():
    output_ids = model.generate(
        input_features=encoder_inputs.input_features.to(model.device),
        input_ids=decoder_inputs.input_ids.to(model.device),
        encoder_attention_mask=encoder_inputs.attention_mask.to(model.device),
        decoder_attention_mask=decoder_inputs.attention_mask.to(model.device),
        max_new_tokens=1024,
        do_sample=False,
    )

generated_ids = output_ids[0, decoder_inputs.input_ids.shape[1]:]
print(tokenizer.decode(generated_ids, skip_special_tokens=True))

Gradio デモ

git clone https://github.com/Atotti/ja-speech-llm.git
cd ja-speech-llm
uv sync
uv run python gradio_demo.py

推奨プロンプト

タスク	プロンプト
対話 / 指示追従	`音声の指示に従ってください。`
書き起こし	`音声を書き起こしてください。`
説明	`音声を説明してください。`

評価

モデル	ADU-Bench (ja) ↑	CommonVoice 8 (ja) CER ↓
Whisper-large-v3	-	8.51
SALMONN	1.37	-
Qwen-Audio-Chat	1.08	-
Voxtral Mini 3B-2507	5.181	15.65
Gemma3n E4B-it	5.143	51.23
llm-jp-4-8b-speech-asr	-	8.36
llm-jp-4-8b-speech-chat	5.335	10.25
llm-jp-4-8b-speech-chat-dpo-exp	5.165	10.42

制限

主対象は日本語です。
認識・推論の誤りが発生する可能性があります。
雑音環境、長時間音声、自発性の高い発話では性能が低下する可能性があります。

Reference

@misc{tsutsumi2026jaspeechllm,
  title={atotti/llm-jp-4-8b-speech-chat},
  url={https://huggingface.co/atotti/llm-jp-4-8b-speech-chat},
  author={Ayuto Tsutsumi and Haruki Oshiro},
  year={2026},
}
