Empty response from chat call when max_tokens is set low on vLLM
Dear all, I encountered a problem when running this model on vllm/vllm-openai:gptoss. For a simple chat call like the one below, when max_tokens is set below 1000, the model does not return message.content even though "completion_tokens" in the output is maxed out. The lower the max_tokens parameter, the more likely the model is to return an empty response.
Has anybody run into the same problem?
Input
...
"messages": [
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "What are current and historical capitals of Poland? Prepare description of each of the capital with date of establishing and reason behind movement."
      }
    ]
  },
  {
    "role": "assistant",
    "content": ""
  }
],
"max_tokens": 300
...
Output
...
"choices": [
  {
    "index": 0,
    "message": {
      "role": "assistant"
    },
    "finish_reason": "length"
  }
]
...
"usage": {
  "completion_tokens": 300,
  "prompt_tokens": 100,
  "total_tokens": 400
},
If max_tokens is set to 1000, the model gives a proper response (still cut off at the limit, as finish_reason shows).
...
"choices": [
  {
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "**Poland’s Capitals – a chronological overview**\n\n| Capital (city) | Period when it served as *de‑facto* or *de‑jure* capital* | Approx. date of establishment as capital | Why the capital was moved there (or away from it) |\n|----------------|-----------------------------------------------------------|------------------------------------------|---------------------------------------------------|\n| **Gniezno** | Early Piast Poland (c. 966 – 1038) | c. 966 – the year Duke Mieszko I adopted Christianity and made his court sit in Gniezno, the oldest Polish episcopal see. | The city was the cradle of the Polish state and the seat of the first royal coronation (Bolesław I the Brave, 1025). After"
    },
    "finish_reason": "length"
  }
],
...
"usage": {
  "completion_tokens": 1000,
  "prompt_tokens": 100,
  "total_tokens": 1100
}
The model generally needs some room for its chain of thought, so if you set max_tokens too low, the budget can run out before any final answer is produced and you won't receive an output. Our hosted reasoning models behave the same way, so we recommend always leaving some breathing room.
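One way to leave that breathing room can be sketched as a retry loop that raises the budget whenever a reply is cut off with no visible content. This is only an illustration, not an official recommendation: `complete_with_headroom`, the `send` callable, and the `ceiling` value are hypothetical names and defaults, and `send` is assumed to wrap whatever client call you already use and to return a response shaped like the JSON in the first post.

```python
from typing import Any, Callable, Dict, Optional


def complete_with_headroom(
    send: Callable[[int], Dict[str, Any]],
    max_tokens: int = 300,
    ceiling: int = 4096,
) -> Optional[str]:
    """Retry with a doubled max_tokens until the reply has visible content.

    `send` is a hypothetical wrapper: it takes a max_tokens value, performs
    the chat call, and returns the parsed response as a dict.
    """
    budget = max_tokens
    while True:
        response = send(budget)
        choice = response["choices"][0]
        content = choice["message"].get("content")
        if content:
            return content
        # Only a length cutoff is worth retrying with a larger budget;
        # stop once the (arbitrary) ceiling has been reached.
        if choice.get("finish_reason") != "length" or budget >= ceiling:
            return None
        budget = min(budget * 2, ceiling)
```

Doubling keeps the number of retries small, but every retry re-spends the reasoning tokens, so starting from a generous budget is usually cheaper than relying on the loop.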
Thank you. So the first few hundred tokens are consumed by the "thinking" rather than by generating the answer?
I see the tokens being generated under reasoning_content. Thank you.
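A small check along these lines can tell this failure mode apart from other empty replies. It is a sketch that assumes the response has the shape shown in the first post and that the server exposes the reasoning under message.reasoning_content as observed above (whether that field appears may depend on the server version and configuration). `diagnose_choice` and its return strings are made up for illustration.

```python
from typing import Any, Dict


def diagnose_choice(choice: Dict[str, Any]) -> str:
    """Classify one entry of the `choices` list from a chat completion."""
    message = choice.get("message", {})
    content = message.get("content") or ""
    # Field name taken from this thread; it may be absent on some servers.
    reasoning = message.get("reasoning_content") or ""
    truncated = choice.get("finish_reason") == "length"

    if content and truncated:
        return "answer truncated"
    if content:
        return "ok"
    if truncated and reasoning:
        return "budget spent on reasoning"
    if truncated:
        return "budget exhausted before any answer"
    return "empty for another reason"
```

Applied to the max_tokens=300 output from the first post, which has no content and finish_reason "length", this returns "budget exhausted before any answer", or "budget spent on reasoning" when reasoning_content is present.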
Is this the same reason for getting empty responses from the o4-mini API? I set the max token limit to 20000, but the problems are very hard, and around 5% of them get empty responses. When I use a dataset with lower difficulty, all the problems get valid responses.
For me, the cause was not max_tokens. I was getting a blank response when I used the model for structured output and asked the same question twice, back to back. This was the error:
groq.BadRequestError: Error code: 400 - {'error': {'message': \"Failed to validate JSON. Please adjust your prompt. See 'failed_generation' for more details.\", 'type': 'invalid_request_error', 'code': 'json_validate_failed', 'failed_generation': ''}}
So when I changed my prompt from this:
You are an Order Recipient Manager. Your responsibility is to pick the user order and respond it with a structured format. For the fields not available in the data, keep them blank.
to this:
You are an Order Recipient Manager. Your responsibility is to pick the user order and respond it with a structured format. For the fields not available in the data, keep them blank. User may ask the same question multiple times, response the same result.
The problem was gone.
In both cases, the structured output schema was:
{
  "title": "Product",
  "type": "object",
  "properties": {
    "productName": {
      "description": "Name of the product",
      "type": "string"
    },
    "productType": {
      "description": "Type of the product",
      "type": "string",
      "enum": [
        "LAPTOP",
        "SMARTPHONE",
        "APPLIANCE"
      ]
    }
  },
  "required": [
    "productName",
    "productType"
  ]
}
So, even though the query was the same ("I want to buy a washing machine. Do you have that?"), once I added the explicit instruction "User may ask the same question multiple times, response the same result." to the prompt, the model no longer returned a blank response.
So, in my experience, gpt-oss-* tends to return a blank response when it is confused or unsure whether its response would be correct.
Note: I was using LangGraph's agent.
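Independently of the prompt change, a blank or invalid structured output can be caught on the client side before it reaches the rest of the agent. The sketch below hand-checks the two required fields of the Product schema above using only the standard library. `parse_product` and `PRODUCT_TYPES` are hypothetical names, and this is not how LangGraph or Groq validate the output internally.

```python
import json
from typing import Any, Dict, Optional

# Allowed values copied from the "enum" in the schema above.
PRODUCT_TYPES = {"LAPTOP", "SMARTPHONE", "APPLIANCE"}


def parse_product(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed product, or None if the output is blank or invalid."""
    if not raw or not raw.strip():
        return None  # blank generation, like the empty 'failed_generation'
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    name = data.get("productName")
    kind = data.get("productType")
    if not isinstance(name, str):
        return None
    if not isinstance(kind, str) or kind not in PRODUCT_TYPES:
        return None
    return data
```

A caller could treat a None result as a signal to retry the request or to fall back to a clarifying question, rather than passing an empty result downstream.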