Batch vs individual inference output mismatch

#9
by E1eMental - opened

Description

I'm experiencing inconsistent outputs when comparing individual inference vs batch inference with Qwen3-VL-2B-Instruct. Despite using deterministic settings (temperature=0.0, do_sample=False, num_beams=1), one sample produces different results depending on whether it's processed individually or in a batch.

Issue Details

  • Sample 1: Outputs differ between individual (1153 chars) and batch inference (1159 chars) ❌
  • Sample 2: Outputs match perfectly ✓
  • Using the same image and prompts in both modes
  • Tried both left and right padding - neither resolves the issue (a quick input-level sanity check is sketched right after this list)
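
To rule out a tokenization difference, here is a minimal sanity check that strips pad tokens and compares the exact prompt ids each mode feeds the model. This is only a sketch: it assumes the processor, messages1, and messages2 defined in the repro script below, and non_pad_ids is just a throwaway helper name.

from qwen_vl_utils import process_vision_info

pad_id = processor.tokenizer.pad_token_id

def non_pad_ids(msg_batch):
    # Tokenize a batch of conversations exactly as the repro script does
    texts = [processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in msg_batch]
    images, _ = process_vision_info(msg_batch)
    enc = processor(text=texts, images=images, padding=True, return_tensors="pt")
    # Drop pad tokens so padded and unpadded encodings are comparable
    return [row[row != pad_id].tolist() for row in enc.input_ids]

single = non_pad_ids([messages1]) + non_pad_ids([messages2])
batched = non_pad_ids([messages1, messages2])
print("prompt ids identical:", all(s == b for s, b in zip(single, batched)))

If the ids already differ here, the problem is in preprocessing rather than in the model's forward pass.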

Environment

  • Model: Qwen/Qwen3-VL-2B-Instruct
  • Device: CUDA
  • Precision: torch.bfloat16
  • Attention: flash_attention_2
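
Exact library versions are not listed above, though they matter for reproducing numeric issues; they can be captured with:

import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)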

Reproducible Code

import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoModelForImageTextToText, AutoProcessor

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load model and processor
model_id = "Qwen/Qwen3-VL-2B-Instruct"
print(f"\nLoading model: {model_id}")
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id)
processor.tokenizer.padding_side = "left"  # Tried both "left" and "right"


messages1 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
            },
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]
messages2 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
            },
            {"type": "text", "text": "What do you see in this image? Please provide a comprehensive description."},
        ],
    }
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Run each sample separately
print("\n=== INDIVIDUAL INFERENCE ===")

individual_outputs = []
for idx, msg in enumerate([messages1, messages2], 1):
    text = processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info([msg])
    inputs = processor(
        text=[text],
        images=image_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(device)  # use the device selected above instead of hard-coding "cuda"

    generated_ids = model.generate(**inputs, max_new_tokens=256, num_beams=1, do_sample=False, temperature=0.0)
    generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    individual_outputs.append(output_text[0])
    print(f"Sample {idx}: {output_text[0]}")

# Batch Inference
print("\n=== BATCH INFERENCE ===")

texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in messages]
image_inputs, _ = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(device)

generated_ids = model.generate(**inputs, max_new_tokens=256, num_beams=1, do_sample=False, temperature=0.0)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
for idx, output in enumerate(output_texts, 1):
    print(f"Sample {idx}: {output}")

# Text Comparison
print("\n=== COMPARISON ===")
for idx in range(len(output_texts)):
    match = individual_outputs[idx] == output_texts[idx]
    status = "✓ MATCH" if match else "✗ MISMATCH"
    print(f"Sample {idx + 1}: {status}")
    if not match:
        print(f"  Length: {len(individual_outputs[idx])} vs {len(output_texts[idx])}")

all_match = all(individual_outputs[i] == output_texts[i] for i in range(len(output_texts)))
print(f"\nOverall: {'βœ“ All match' if all_match else 'βœ— Differences found'}")

Output

=== COMPARISON ===
Sample 1: ✗ MISMATCH
  Length: 1153 vs 1159
Sample 2: ✓ MATCH

Overall: ✗ Differences found
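
To see where Sample 1 actually diverges, a character-level comparison of the two outputs helps. This is just a debugging sketch; it reuses individual_outputs and output_texts from the script above:

a, b = individual_outputs[0], output_texts[0]
for i, (ca, cb) in enumerate(zip(a, b)):
    if ca != cb:
        # Print a window around the first differing character in each output
        print(f"First divergence at char {i}:")
        print("  individual:", repr(a[max(0, i - 40):i + 40]))
        print("  batch:     ", repr(b[max(0, i - 40):i + 40]))
        break
else:
    # No differing character in the overlap: the shorter output is a prefix of the longer one
    print("Outputs share a common prefix; length difference:", abs(len(a) - len(b)))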

What I've Tried

✅ Set temperature=0.0, do_sample=False, num_beams=1 for deterministic generation
✅ Tried padding_side="left" - still produces mismatches
✅ Tried padding_side="right" - still produces mismatches
✅ Used the same generation parameters in both individual and batch modes
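
One variable I have not isolated yet is the attention backend: in bfloat16, padded batches can take slightly different numeric paths through the attention kernels, and a single changed logit early in greedy decoding can snowball into a different continuation. A quick check (a sketch, not yet run on my side) is to reload the model with PyTorch's built-in SDPA backend instead of flash_attention_2 and rerun the comparison:

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",
)

If the mismatch disappears under "sdpa" (or "eager"), that points at kernel numerics under padding rather than a preprocessing bug.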

How should the Qwen3-VL model be run in batch mode with transformers so that batched outputs match individual inference?
