Batch vs individual inference output mismatch
Description
I'm experiencing inconsistent outputs when comparing individual inference vs batch inference with Qwen3-VL-2B-Instruct. Despite using deterministic settings (temperature=0.0, do_sample=False, num_beams=1), one sample produces different results depending on whether it's processed individually or in a batch.
Issue Details
- Sample 1: Outputs differ between individual (1153 chars) and batch inference (1159 chars) ❌
- Sample 2: Outputs match perfectly ✅
- Using the same image and prompts in both modes
- Tried both left and right padding - neither resolves the issue
Environment
- Model: Qwen/Qwen3-VL-2B-Instruct
- Device: CUDA
- Precision: torch.bfloat16
- Attention: flash_attention_2
Reproducible Code
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoModelForImageTextToText, AutoProcessor
# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
# Load model and processor
model_id = "Qwen/Qwen3-VL-2B-Instruct"
print(f"\nLoading model: {model_id}")
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id)
processor.tokenizer.padding_side = "left" # Tried both "left" and "right"
messages1 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
            },
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]
messages2 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
            },
            {"type": "text", "text": "What do you see in this image? Please provide a comprehensive description."},
        ],
    }
]
# Combine messages for batch processing
messages = [messages1, messages2]
# Run each sample separately
print("\n=== INDIVIDUAL INFERENCE ===")
individual_outputs = []
for idx, msg in enumerate([messages1, messages2], 1):
    text = processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info([msg])
    inputs = processor(
        text=[text],
        images=image_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")
    generated_ids = model.generate(**inputs, max_new_tokens=256, num_beams=1, do_sample=False, temperature=0.0)
    generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    individual_outputs.append(output_text[0])
    print(f"Sample {idx}: {output_text[0]}")
# Batch Inference
print("\n=== BATCH INFERENCE ===")
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in messages]
image_inputs, _ = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=256, num_beams=1, do_sample=False, temperature=0.0)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
for idx, output in enumerate(output_texts, 1):
    print(f"Sample {idx}: {output}")
# Text Comparison
print("\n=== COMPARISON ===")
for idx in range(len(output_texts)):
    match = individual_outputs[idx] == output_texts[idx]
    status = "✅ MATCH" if match else "❌ MISMATCH"
    print(f"Sample {idx + 1}: {status}")
    if not match:
        print(f"  Length: {len(individual_outputs[idx])} vs {len(output_texts[idx])}")
all_match = all(individual_outputs[i] == output_texts[i] for i in range(len(output_texts)))
print(f"\nOverall: {'✅ All match' if all_match else '❌ Differences found'}")
Output
=== COMPARISON ===
Sample 1: ❌ MISMATCH
  Length: 1153 vs 1159
Sample 2: ✅ MATCH

Overall: ❌ Differences found
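Comparing only lengths hides where the drift starts. A small stdlib-only helper (hypothetical, not part of the original script) can locate the first character at which the two generations diverge, which makes it easier to see whether the difference is a single token swap or a full trajectory change:

```python
def first_divergence(a, b):
    """Return the index of the first differing character, or None if equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one string is a prefix of the other
    return None

print(first_divergence("The red car", "The red cab"))  # 10
print(first_divergence("same", "same"))                # None
```

Running this on `individual_outputs[0]` and `output_texts[0]` would show the first point of divergence rather than just the 1153-vs-1159 length delta.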
What I've Tried
✅ Set temperature=0.0, do_sample=False, num_beams=1 for deterministic generation
✅ Tried padding_side="left" - still produces mismatches
✅ Tried padding_side="right" - still produces mismatches
✅ Used the same generation parameters in both individual and batch modes
How can I run the Qwen3-VL model in batch mode with transformers?
@E1eMental
Maybe you should try saving the random states to a file at inference time, so the response is fully deterministic and reproducible every time. This is my standard practice when training any kind of model, since it lets me resume at any point.
Example:
import random

import numpy as np
import torch

def save_random_states(filepath="random_states.pth"):
    states = {
        'python_random': random.getstate(),
        'numpy': np.random.get_state(),
        'torch_cpu': torch.get_rng_state(),
        'torch_cuda': torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
    }
    torch.save(states, filepath)
    print(f"Random states saved to {filepath}")
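To actually restore a run, a matching load function is needed. Below is a stdlib-only sketch of the save/load round trip using `pickle` and Python's `random` module (the numpy/torch states from the snippet above would be restored the same way via `torch.load` plus `np.random.set_state` / `torch.set_rng_state`; the function names here are illustrative):

```python
import pickle
import random

def save_python_random_state(filepath="random_state.pkl"):
    # Persist only Python's RNG state; extend with numpy/torch as in the snippet above
    with open(filepath, "wb") as f:
        pickle.dump({"python_random": random.getstate()}, f)

def load_python_random_state(filepath="random_state.pkl"):
    with open(filepath, "rb") as f:
        states = pickle.load(f)
    random.setstate(states["python_random"])

random.seed(0)
save_python_random_state()
a = random.random()
load_python_random_state()  # rewind the RNG to the saved state
b = random.random()
print(a == b)  # restoring the state reproduces the identical draw -> True
```

Note, though, that with `do_sample=False` generation is greedy, so RNG state alone cannot explain a batch-vs-individual mismatch; this only helps reproducibility of sampled runs.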
OR YOU CAN TRY:
inputs = processor(
    text=texts,
    images=image_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
# Create position ids that ignore padding, ensuring the first non-pad token becomes position 0
position_ids = inputs.attention_mask.long().cumsum(-1) - 1
# Avoid negative positions on the pad slots themselves
position_ids.masked_fill_(inputs.attention_mask == 0, 1)
# Add them to the model inputs
inputs["position_ids"] = position_ids
@VINAYU7
I've done some further research and found that it is currently impossible to run Qwen3-VL correctly in batch mode within the Hugging Face Transformers infrastructure. This is due to the Qwen3 positional encoding logic.
When using right padding, the positional encoding for the input tokens is correct, but the padding tokens inserted after the prompt disrupt the encoding for all subsequent generated tokens. Conversely, left padding shifts the sequence so that the positional encodings are offset from the very first prompt token.
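To make the effect concrete, here is a small pure-Python sketch (illustrative only, not the model's actual implementation) of how token positions line up under each padding side. A naive `range(len)` scheme offsets every real token under left padding, while a cumsum-style, padding-aware scheme gives the first real token position 0 but leaves right-padded sequences generating after a run of stale pad positions:

```python
def padding_aware_positions(mask):
    # cumsum(mask) - 1, clamped at 0: the first non-pad token gets position 0
    out, total = [], 0
    for m in mask:
        total += m
        out.append(max(total - 1, 0))
    return out

left_padded  = [0, 0, 1, 1, 1]   # 2 pads, then 3 real tokens
right_padded = [1, 1, 1, 0, 0]   # 3 real tokens, then 2 pads

# naive positions ignore padding entirely
print(list(range(len(left_padded))))          # [0, 1, 2, 3, 4]
print(padding_aware_positions(left_padded))   # [0, 0, 0, 1, 2]
print(padding_aware_positions(right_padded))  # [0, 1, 2, 2, 2]
```

In the right-padded case the next generated token is appended at sequence index 5 even though the last real token sat at position 2, which is exactly the disruption described above.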
I experimented with manually passing position_ids to bypass this, but was unsuccessful in achieving parity with non-batched inference. In my testing, left padding is the only way to get coherent results, though they still differ from single-inference. Right padding results in almost entirely random output.
Currently, there doesn't seem to be a way to achieve 1:1 parity with non-batched inference using standard HF padding.