Increase max_tokens to 4000 for thinking model compatibility
Browse filesThinking models (Gemini 3.1 Pro, o3-mini) consume tokens internally
for reasoning before generating output. max_tokens=200 caused content=None.
4000 ensures both thought budget and response fit within the limit.
- inference.py +3 -1
inference.py
CHANGED
|
@@ -111,8 +111,10 @@ def _call_llm(
|
|
| 111 |
model=provider["model"],
|
| 112 |
messages=messages,
|
| 113 |
temperature=0.2,
|
| 114 |
-
max_tokens=
|
| 115 |
)
|
|
|
|
|
|
|
| 116 |
return completion.choices[0].message.content or ""
|
| 117 |
except Exception as e:
|
| 118 |
last_err = e
|
|
|
|
| 111 |
model=provider["model"],
|
| 112 |
messages=messages,
|
| 113 |
temperature=0.2,
|
| 114 |
+
max_tokens=4000, # Thinking models (Gemini 3.1 Pro, o3) use tokens for reasoning
|
| 115 |
)
|
| 116 |
+
# content can be None for thinking models if limit was too low;
|
| 117 |
+
# 4000 ensures thinking budget + response both fit
|
| 118 |
return completion.choices[0].message.content or ""
|
| 119 |
except Exception as e:
|
| 120 |
last_err = e
|