Batch vs individual inference output mismatch

#9
by E1eMental - opened

Description

I'm experiencing inconsistent outputs when comparing individual inference vs batch inference with Qwen3-VL-2B-Instruct. Despite using deterministic settings (temperature=0.0, do_sample=False, num_beams=1), one sample produces different results depending on whether it's processed individually or in a batch.

Issue Details

  • Sample 1: Outputs differ between individual (1153 chars) and batch inference (1159 chars) ❌
  • Sample 2: Outputs match perfectly ✓
  • Using the same image and prompts in both modes
  • Tried both left and right padding - neither resolves the issue
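One plausible explanation worth ruling out (my speculation, not confirmed anywhere in this thread): even with fully deterministic generation settings, batching changes tensor shapes, so the CUDA kernels may reduce sums in a different order. In bfloat16 that can nudge a logit just enough to flip a greedy argmax once, after which every subsequent token differs. A minimal pure-Python illustration of the underlying effect:

```python
# Floating-point addition is not associative: the same numbers summed
# in a different order can give different results. Batched vs. single
# inference can hit exactly this, especially in low precision.
a, b, c = 1e17, -1e17, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0  (the 1.0 is absorbed by the huge magnitude first)
```

If this is the cause, the outputs are both "correct" up to numerical noise, and exact parity is not guaranteed by determinism flags alone.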

Environment

  • Model: Qwen/Qwen3-VL-2B-Instruct
  • Device: CUDA
  • Precision: torch.bfloat16
  • Attention: flash_attention_2

Reproducible Code

import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoModelForImageTextToText, AutoProcessor

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load model and processor
model_id = "Qwen/Qwen3-VL-2B-Instruct"
print(f"\nLoading model: {model_id}")
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id)
processor.tokenizer.padding_side = "left"  # Tried both "left" and "right"


messages1 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
            },
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]
messages2 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
            },
            {"type": "text", "text": "What do you see in this image? Please provide a comprehensive description."},
        ],
    }
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Run each sample separately
print("\n=== INDIVIDUAL INFERENCE ===")

individual_outputs = []
for idx, msg in enumerate([messages1, messages2], 1):
    text = processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info([msg])
    inputs = processor(
        text=[text],
        images=image_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=256, num_beams=1, do_sample=False, temperature=0.0)
    generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    individual_outputs.append(output_text[0])
    print(f"Sample {idx}: {output_text[0]}")

# Batch Inference
print("\n=== BATCH INFERENCE ===")

texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in messages]
image_inputs, _ = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256, num_beams=1, do_sample=False, temperature=0.0)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
for idx, output in enumerate(output_texts, 1):
    print(f"Sample {idx}: {output}")

# Text Comparison
print("\n=== COMPARISON ===")
for idx in range(len(output_texts)):
    match = individual_outputs[idx] == output_texts[idx]
    status = "✓ MATCH" if match else "✗ MISMATCH"
    print(f"Sample {idx + 1}: {status}")
    if not match:
        print(f"  Length: {len(individual_outputs[idx])} vs {len(output_texts[idx])}")

all_match = all(individual_outputs[i] == output_texts[i] for i in range(len(output_texts)))
print(f"\nOverall: {'✓ All match' if all_match else '✗ Differences found'}")

Output

=== COMPARISON ===
Sample 1: ✗ MISMATCH
  Length: 1153 vs 1159
Sample 2: ✓ MATCH

Overall: ✗ Differences found

What I've Tried

✅ Set temperature=0.0, do_sample=False, num_beams=1 for deterministic generation
✅ Tried padding_side="left" - still produces mismatches
✅ Tried padding_side="right" - still produces mismatches
✅ Used the same generation parameters in both individual and batch modes

How can I run the Qwen3-VL model in batch mode with transformers?

@E1eMental
Maybe you should try saving the random states to a file at inference time so you get a 100% deterministic, reproducible response every time; that's my standard practice when training any kind of model so I can resume at any point.

Example:

import random

import numpy as np
import torch

def save_random_states(filepath="random_states.pth"):
    """Snapshot all RNG states so a run can be reproduced or resumed exactly."""
    states = {
        'python_random': random.getstate(),
        'numpy': np.random.get_state(),
        'torch_cpu': torch.get_rng_state(),
        'torch_cuda': torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
    }
    torch.save(states, filepath)
    print(f"Random states saved to {filepath}")

OR YOU CAN TRY:

inputs = processor(
    text=texts,
    images=image_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Create position ids that ignore padding, ensuring the first non-pad
# token becomes position 0 (pad slots come out as -1, but they are
# masked out by the attention mask anyway)
position_ids = inputs.attention_mask.long().cumsum(-1) - 1

# pass them explicitly
inputs["position_ids"] = position_ids
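To make the intent of that `cumsum` concrete, here is a pure-Python sketch of the same computation on a toy attention mask (the helper name is mine, not from any library):

```python
def padding_aware_positions(attention_mask):
    """Mimics attention_mask.cumsum(-1) - 1 for a single sequence:
    positions count only non-pad tokens, so the first real token gets 0."""
    positions, running = [], 0
    for m in attention_mask:
        running += m
        positions.append(running - 1)
    return positions

# left-padded: two pad tokens, then four real tokens
left = [0, 0, 1, 1, 1, 1]
print(padding_aware_positions(left))   # [-1, -1, 0, 1, 2, 3]
```

With right padding the trailing pads stay frozen at the last real index (e.g. `[1, 1, 1, 0, 0]` gives `[0, 1, 2, 2, 2]`), which is why some implementations additionally clamp or `masked_fill` the pad positions.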

@VINAYU7
I’ve done some further research and found that it is currently impossible to run Qwen3-VL correctly in batch mode within the Hugging Face Transformers infrastructure. This is due to the Qwen3 positional encoding logic.

When using right padding, the positional encoding for the input tokens is correct, but the padding tokens inserted after the prompt disrupt the encoding for all subsequent generated tokens. Conversely, left padding shifts the sequence so that the positional encodings are offset from the very first prompt token.

I experimented with manually passing position_ids to bypass this, but was unsuccessful in achieving parity with non-batched inference. In my testing, left padding is the only way to get coherent results, though they still differ from single-inference. Right padding results in almost entirely random output.

Currently, there doesn't seem to be a way to achieve 1:1 parity with non-batched inference using standard HF padding.
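When chasing parity bugs like this, it also helps to see exactly where the two greedy decodes diverge, rather than just comparing lengths. A small helper (`first_divergence` is a hypothetical name of mine, not part of the code above) that reports the first differing character with context:

```python
def first_divergence(a: str, b: str):
    """Return (index, context_a, context_b) for the first differing
    character, or None if the strings are identical."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i, a[max(0, i - 20):i + 20], b[max(0, i - 20):i + 20]
    if len(a) != len(b):  # one string is a strict prefix of the other
        i = min(len(a), len(b))
        return i, a[max(0, i - 20):], b[max(0, i - 20):]
    return None

print(first_divergence("the red car", "the blue car"))
# (4, 'the red car', 'the blue car')
```

If the divergence appears far into otherwise identical output, that points at a single flipped greedy token (numerical noise) rather than wholesale positional corruption, which would garble the text from the start.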
