ThinkingBudgetRouter
A fast, lightweight 3-class classifier that decides how many thinking tokens a query needs before you spend them.
Built on DistilBERT (66M params), fine-tuned to classify any user message into one of three thinking budget tiers:
| Label | Budget | Meaning |
|---|---|---|
| no_thinking | 0 tokens | Direct lookup or trivial query; answer immediately |
| brief_thinking | ~512 tokens | Structured reasoning needed, but not exhaustive |
| deep_thinking | 8192+ tokens | Full chain-of-thought required |
Why This Exists
Modern reasoning models (Qwen3, DeepSeek-R1, Claude 3.7 Sonnet with extended thinking, Gemini 2.0 Flash Thinking) each expose some control over how much they think, whether a token budget or an on/off switch. But most users either always use maximum thinking (slow and expensive) or never use it (and get weaker answers on hard problems).
ThinkingBudgetRouter makes this decision in ~10ms on CPU, before any tokens are spent reasoning.
Quick Start

    from transformers import pipeline

    router = pipeline("text-classification", model="tripathyShaswata/ThinkingBudgetRouter")

    # Single prediction
    result = router("What is the capital of France?")
    print(result)
    # [{'label': 'no_thinking', 'score': 0.97}]

    # Batch
    queries = [
        "What is 15 + 27?",                                        # no_thinking
        "What does HTTP stand for?",                               # no_thinking
        "Write a Python function to merge two sorted arrays.",     # brief_thinking
        "Explain how Dijkstra's algorithm works step by step.",    # brief_thinking
        "Design a distributed rate limiter for 1 billion users.",  # deep_thinking
        "Prove there are infinitely many prime numbers.",          # deep_thinking
        "Debug this race condition in async code: ...",            # deep_thinking
    ]
    results = router(queries)
    for q, r in zip(queries, results):
        print(f"  {r['label']:>16} ({r['score']:.2f}) → {q}")

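
The `score` in each result can guard against under-thinking on ambiguous queries. The sketch below is one possible policy, not part of the model: when the top score falls under a threshold, it escalates to the next tier up. The function name `pick_budget`, the 0.6 threshold, and the escalation rule are all illustrative assumptions.

```python
# Illustrative policy (not part of ThinkingBudgetRouter itself).
TIERS = ["no_thinking", "brief_thinking", "deep_thinking"]
BUDGET_MAP = {"no_thinking": 0, "brief_thinking": 512, "deep_thinking": 8192}


def pick_budget(prediction: dict, min_score: float = 0.6) -> int:
    """Map one pipeline prediction to a token budget.

    `prediction` has the shape returned by the text-classification pipeline:
    {"label": str, "score": float}. If the classifier is unsure
    (score < min_score), move one tier up so a hard query is not starved.
    """
    tier = TIERS.index(prediction["label"])
    if prediction["score"] < min_score:
        tier = min(tier + 1, len(TIERS) - 1)  # deep_thinking is the ceiling
    return BUDGET_MAP[TIERS[tier]]


print(pick_budget({"label": "no_thinking", "score": 0.97}))     # 0
print(pick_budget({"label": "brief_thinking", "score": 0.41}))  # 8192
```

With the pipeline from the snippet above, `pick_budget(router(query)[0])` would return the budget for a single query.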
Use With Qwen3 (Thinking Budget Control)
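
    from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

    budget_router = pipeline("text-classification", model="tripathyShaswata/ThinkingBudgetRouter")

    BUDGET_MAP = {
        "no_thinking": 0,
        "brief_thinking": 512,
        "deep_thinking": 8192,
    }
    ANSWER_TOKENS = 2048  # room for the final answer on top of the thinking budget

    def call_with_budget(user_message: str, model, tokenizer):
        label = budget_router(user_message)[0]["label"]
        budget = BUDGET_MAP[label]
        messages = [{"role": "user", "content": user_message}]
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
            # Qwen3's chat template exposes an on/off switch (enable_thinking),
            # not a numeric budget, so thinking is disabled only for budget 0.
            enable_thinking=budget > 0,
        )
        inputs = tokenizer([text], return_tensors="pt").to(model.device)
        # Approximate the budget by capping generation length. This is a soft
        # cap on thinking plus answer, not a hard limit on thinking alone.
        output = model.generate(**inputs, max_new_tokens=budget + ANSWER_TOKENS)
        # Decode only the newly generated tokens, not the echoed prompt.
        new_tokens = output[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)
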
Full 3-Stage Agent Pipeline
Pair with AgentIntentRouter and QueryComplexityRouter for a complete routing stack:

    from transformers import pipeline

    intent_router = pipeline("text-classification", model="tripathyShaswata/AgentIntentRouter")
    complexity_router = pipeline("text-classification", model="tripathyShaswata/QueryComplexityRouter")
    budget_router = pipeline("text-classification", model="tripathyShaswata/ThinkingBudgetRouter")

    BUDGET_MAP = {"no_thinking": 0, "brief_thinking": 512, "deep_thinking": 8192}

    # handle_with_rules, call_small_model, and call_large_model are placeholders
    # for application-specific handlers that you supply.
    def route(user_message: str):
        intent = intent_router(user_message)[0]
        complexity = complexity_router(user_message)[0]
        budget = budget_router(user_message)[0]

        print(f"Intent:     {intent['label']} ({intent['score']:.2f})")
        print(f"Complexity: {complexity['label']} ({complexity['score']:.2f})")
        print(f"Budget:     {budget['label']} → {BUDGET_MAP[budget['label']]} tokens")

        # Queries that need no LLM skip the thinking budget entirely.
        if complexity["label"] == "no_llm":
            return handle_with_rules(user_message)

        thinking_tokens = BUDGET_MAP[budget["label"]]
        if complexity["label"] == "small_llm":
            return call_small_model(user_message, thinking_budget=thinking_tokens)
        else:
            return call_large_model(user_message, thinking_budget=thinking_tokens)

The three stages run in sequence:

    User Message
         │
         ▼
    AgentIntentRouter      → WHAT: code / search / chat / math / ...
         │
         ▼
    QueryComplexityRouter  → WHICH MODEL: no_llm / small_llm / large_llm
         │
         ▼
    ThinkingBudgetRouter   → HOW MUCH THINKING: 0 / 512 / 8192 tokens
         │
         ▼
    Model call with exact budget

Budget Labels
no_thinking: 0 tokens
Answer immediately, no chain-of-thought needed:
- "What is the capital of France?"
- "What does HTTP stand for?"
- "Is 17 divisible by 3?"
- "Convert 100 km to miles."
- "What HTTP status code means Not Found?"
brief_thinking: ~512 tokens
Structured reasoning, but not exhaustive:
- "Write a Python function to binary search a sorted list."
- "What is the difference between REST and GraphQL?"
- "Fix this off-by-one error: for i in range(len(arr)): total += arr[i+1]"
- "How would you detect a cycle in a linked list?"
- "Prove that the sum of two even numbers is even."
deep_thinking: 8192+ tokens
Full extended chain-of-thought required:
- "Design a distributed rate limiter for 1 billion users."
- "Prove there are infinitely many prime numbers."
- "Debug this race condition: two asyncio coroutines sharing a counter."
- "Solve: given a matrix of costs, find the minimum-cost path from top-left to bottom-right."
- "Compare PostgreSQL, MongoDB, and Cassandra across consistency, latency, and cost."
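
To see what the tiers mean for cost, the arithmetic below compares the thinking-token ceiling of routed traffic against always using the deep tier. The traffic mix (60% / 30% / 10%) is a made-up assumption for illustration, not a measurement, and the figures are upper bounds since a model may stop thinking before it exhausts its budget.

```python
BUDGET_MAP = {"no_thinking": 0, "brief_thinking": 512, "deep_thinking": 8192}

# Hypothetical share of queries per tier (assumed, not measured).
traffic_mix = {"no_thinking": 0.60, "brief_thinking": 0.30, "deep_thinking": 0.10}


def expected_budget(mix: dict) -> float:
    """Average thinking-token ceiling per query for a given tier mix."""
    return sum(share * BUDGET_MAP[label] for label, share in mix.items())


routed = expected_budget(traffic_mix)  # 0.6*0 + 0.3*512 + 0.1*8192
always_deep = BUDGET_MAP["deep_thinking"]

print(f"routed:      {routed:.1f} tokens/query")
print(f"always deep: {always_deep} tokens/query")
print(f"reduction:   {1 - routed / always_deep:.1%}")
```

Under this assumed mix, the routed ceiling works out to about 973 thinking tokens per query, roughly 88% below the always-deep ceiling of 8192.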
Model Details
| Property | Value |
|---|---|
| Base model | distilbert-base-uncased |
| Parameters | ~66M |
| Inference (CPU) | ~10ms |
| Inference (GPU) | ~2ms |
| Input max length | 128 tokens |
| Task | 3-class sequence classification |
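
Because the input length is capped at 128 tokens, long prompts are best truncated explicitly. A minimal sketch, assuming the classifier is a Hugging Face text-classification pipeline, which forwards `truncation` and `max_length` to its tokenizer; the helper name `classify_long` is hypothetical.

```python
MAX_INPUT_TOKENS = 128  # "Input max length" from the table above


def classify_long(router, text: str) -> dict:
    """Classify `text`, truncating it to the model's input limit.

    `router` is any callable with the text-classification pipeline's
    interface; the extra keyword arguments are passed to the tokenizer.
    """
    results = router(text, truncation=True, max_length=MAX_INPUT_TOKENS)
    return results[0]
```

By default truncation keeps the start of the message, so a long prompt whose hard part comes at the end may be assigned too small a budget.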
Compatible Reasoning Models
| Model | Thinking control param |
|---|---|
| Qwen3 (all sizes) | enable_thinking in chat template (on/off); thinking_budget where the serving API supports it |
| DeepSeek-R1 | /think token budget |
| Claude 3.7 Sonnet | thinking.budget_tokens in API |
| Gemini 2.0 Flash Thinking | thinkingConfig.thinkingBudget |
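
As one example of wiring a tier into a provider parameter, the sketch below builds keyword arguments for Anthropic's Messages API, where extended thinking is configured through `thinking.budget_tokens`. It assumes the API's constraints that the budget is at least 1024 tokens and smaller than `max_tokens`, so the 512-token tier is rounded up. The helper name `claude_kwargs` and the `answer_tokens` allowance are illustrative assumptions.

```python
MIN_CLAUDE_BUDGET = 1024  # assumed minimum for thinking.budget_tokens


def claude_kwargs(model: str, user_message: str, budget: int,
                  answer_tokens: int = 2048) -> dict:
    """Build keyword arguments for a Messages API call (no request is sent)."""
    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": answer_tokens,
    }
    if budget > 0:
        thinking_budget = max(budget, MIN_CLAUDE_BUDGET)
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": thinking_budget}
        # max_tokens must exceed the thinking budget, so add it on top.
        kwargs["max_tokens"] = thinking_budget + answer_tokens
    return kwargs
```

The result can be passed to the `anthropic` Python SDK as `client.messages.create(**kwargs)`. For the no_thinking tier the `thinking` key is simply omitted.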