Seems like `chat(stream=True)` is broken - requires `do_sample=True` + `num_beams=1`, then crashes with `AttributeError: 'dict' object has no attribute 'sequences'`

#13
by Volotat - opened

Hi, I'm not sure if I'm missing something from the docs, but I believe `chat(stream=True)` is currently broken due to two sequential bugs in `modeling_minicpmo.py`. Happy to be corrected if there's a recommended workaround I missed.

Environment: transformers==4.51.0, torch==2.4.x, Python 3.10, CUDA GPU

Minimal reproduction:

from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("openbmb/MiniCPM-o-4_5", trust_remote_code=True,
    torch_dtype=torch.bfloat16, attn_implementation="sdpa",
    init_vision=True, init_audio=True, init_tts=False).eval().cuda()

msgs = [{"role": "user", "content": "Write a haiku about AI."}]

# Attempt 1 - fails immediately (assert at line ~1117):
result = model.chat(msgs=msgs, stream=True)
# AssertionError: if use stream mode, make sure do_sample=True

# Attempt 2 - fails inside generate() with beam search + streamer conflict:
result = model.chat(msgs=msgs, stream=True, do_sample=True)
# ValueError: `streamer` cannot be used with beam search (yet!).
# Make sure that `num_beams` is set to 1.
# (because prepare_generation_config() defaults num_beams=3)

# Attempt 3 - the streamer runs, but chat() crashes reading its own return value:
result = model.chat(msgs=msgs, stream=True, do_sample=True, num_beams=1)
# AttributeError: 'dict' object has no attribute 'sequences'

Bug 1 - prepare_generation_config defaults num_beams=3, incompatible with streaming

prepare_generation_config (line ~1023) always starts with num_beams=3 regardless of the stream flag:

def prepare_generation_config(do_sample, max_new_tokens=50, min_new_tokens=0, **kwargs):
    num_beams = kwargs.get("num_beams", 3)  # <-- defaults to 3
    ...

When stream=True, _decode_stream() eventually calls self.llm.generate(streamer=streamer, num_beams=3, ...), which Hugging Face correctly rejects because beam search is incompatible with streamers. Suggested fix: in chat(), force num_beams=1 (and do_sample=True) when stream=True, before calling prepare_generation_config.
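A minimal sketch of that suggested fix. The `chat_generation_config` helper is hypothetical (the real guard would live inside `chat()` itself), and `prepare_generation_config` is reduced to the default-handling shown above:

```python
def prepare_generation_config(do_sample, max_new_tokens=50, min_new_tokens=0, **kwargs):
    # Mirrors the snippet above: beam search with 3 beams by default.
    num_beams = kwargs.get("num_beams", 3)
    return {
        "do_sample": do_sample,
        "num_beams": num_beams,
        "max_new_tokens": max_new_tokens,
        "min_new_tokens": min_new_tokens,
    }

def chat_generation_config(stream, do_sample=False, **kwargs):
    # Proposed guard: streaming requires sampling and a single beam,
    # so override both before the config is assembled.
    if stream:
        do_sample = True
        kwargs["num_beams"] = 1
    return prepare_generation_config(do_sample, **kwargs)
```

With this guard, `chat_generation_config(stream=True)` yields `num_beams=1, do_sample=True`, while the non-stream path keeps the existing `num_beams=3` default untouched.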


Bug 2 (main) - chat() reads outputs.sequences[0] unconditionally after generate(), but outputs={} in stream mode

In generate() (line ~884–891), the stream path correctly sets outputs = {}:

if stream:
    result = self._decode_stream(model_inputs["inputs_embeds"], tokenizer, **kwargs)
    outputs = {}  # if stream return TextIteratorStreamer and output is empty

But back in chat() (line ~1203), the code unconditionally accesses outputs.sequences regardless of the stream flag:

res, outputs = self.generate(..., stream=stream, ...)

# ... no guard for stream mode ...

generated_ids = outputs.sequences[0]  # <-- crashes: 'dict' has no attribute 'sequences'

The generation thread started inside _decode_stream() is alive and correctly filling the TextIteratorStreamer, but chat() crashes before it can return the streamer to the caller.

Suggested fix:

res, outputs = self.generate(..., stream=stream, ...)

# Early return for stream mode: res is the TextIteratorStreamer, outputs is {}
if stream:
    return res  # caller iterates tokens directly

generated_ids = outputs.sequences[0]
# ... rest of the existing post-processing ...
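An alternative, if restructuring chat()'s control flow is undesirable, is to duck-type the return value rather than branch on the flag. A sketch under stated assumptions: `extract_generated_ids` is a hypothetical helper, and `SimpleNamespace` merely mimics the `.sequences`-bearing object returned in non-stream mode:

```python
from types import SimpleNamespace

def extract_generated_ids(outputs):
    # Stream mode returns an empty dict; non-stream mode returns an object
    # carrying .sequences. Only touch .sequences when it actually exists.
    if not hasattr(outputs, "sequences"):
        return None  # stream mode: caller should iterate the streamer instead
    return outputs.sequences[0]

# Non-stream mode: outputs carries sequences.
non_stream = SimpleNamespace(sequences=[[1, 2, 3]])
assert extract_generated_ids(non_stream) == [1, 2, 3]
# Stream mode: outputs == {} -> no crash, just a "use the streamer" signal.
assert extract_generated_ids({}) is None
```

The explicit `if stream: return res` above is clearer, though; this variant only helps if other call sites also need to tolerate both return shapes.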

Workaround I'm currently using (for reference):

Since generate() itself works correctly, I temporarily wrap it to capture the TextIteratorStreamer before chat() crashes:

streamer_ref = []
original_generate = model.generate

def _capturing_generate(*args, **kwargs):
    result = original_generate(*args, **kwargs)
    if isinstance(result, tuple) and isinstance(result[1], dict) and not result[1]:
        streamer_ref.append(result[0])  # save the TextIteratorStreamer
    return result

model.generate = _capturing_generate
try:
    model.chat(msgs=msgs, stream=True, do_sample=True, num_beams=1)
except AttributeError as e:
    if 'sequences' not in str(e):
        raise  # unexpected
    # generation thread is still alive, streamer_ref[0] is valid
finally:
    model.generate = original_generate

for token in streamer_ref[0]:
    print(token, end='', flush=True)

This works, but it's obviously fragile.

OpenBMB org

Thanks for your effort in investigating this! We've just pushed a fix - see https://huggingface.co/openbmb/MiniCPM-o-4_5/discussions/16/files.

result = model.chat(msgs=msgs, stream=True)
for token in result:
    print(token, end='', flush=True)

# example output
Silent circuits hum,
Mind's echo in digital streams -
Future learns to think.<|im_end|>

Thank you very much. Now everything works as expected without extra voodoo in the code. Thanks for the great model, guys - it works extremely well. For some reason it is the only omni model that fits my 8GB GPU, at least with 4-bit quantization. Even Gemma-3n-E2B doesn't work for me with all modalities active, even though it should theoretically be smaller.

Is there any chance we can expect even smaller omni models from you in the future? Having an even bigger context window on a modest GPU like mine would be amazing to see.

OpenBMB org

Thanks.

More powerful and more efficient omni models are definitely on our radar. We're actively exploring ways to push the efficiency frontier further, so stay tuned - we hope to have exciting updates in the future!

In the meantime, feel free to share any feedback or feature requests. Your use case is exactly the kind of scenario we care about supporting.
