ThinkingBudgetRouter

A fast, lightweight 3-class classifier that decides how many thinking tokens a query needs, before you spend them.

Built on DistilBERT (66M params), fine-tuned to classify any user message into one of three thinking budget tiers:

Label            Budget         Meaning
no_thinking      0 tokens       Direct lookup or trivial; answer immediately
brief_thinking   ~512 tokens    Structured reasoning needed, but not exhaustive
deep_thinking    8192+ tokens   Full chain-of-thought required

Why This Exists

Modern reasoning models such as Qwen3, DeepSeek-R1, Claude 3.7 Sonnet (extended thinking), and Gemini 2.0 Flash Thinking expose some way to control how much they think before answering. But most users either always use maximum thinking (slow, expensive) or never use it (and get worse answers on hard problems).

ThinkingBudgetRouter makes this decision in ~10ms on CPU, before any tokens are spent reasoning.

Quick Start

from transformers import pipeline

router = pipeline("text-classification", model="tripathyShaswata/ThinkingBudgetRouter")

# Single prediction
result = router("What is the capital of France?")
print(result)
# [{'label': 'no_thinking', 'score': 0.97}]

# Batch
queries = [
    "What is 15 + 27?",                                          # no_thinking
    "What does HTTP stand for?",                                  # no_thinking
    "Write a Python function to merge two sorted arrays.",        # brief_thinking
    "Explain how Dijkstra's algorithm works step by step.",       # brief_thinking
    "Design a distributed rate limiter for 1 billion users.",     # deep_thinking
    "Prove there are infinitely many prime numbers.",              # deep_thinking
    "Debug this race condition in async code: ...",               # deep_thinking
]
results = router(queries)
for q, r in zip(queries, results):
    print(f"  {r['label']:>16} ({r['score']:.2f}) - {q}")
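Each prediction carries a score, so one option is to escalate to the next tier when the router is unsure. The following is a minimal sketch, not part of the model: the helper name `pick_budget_label` and the 0.6 threshold are illustrative assumptions, and the threshold has not been tuned.

```python
# Tiers ordered from cheapest to most expensive.
TIERS = ["no_thinking", "brief_thinking", "deep_thinking"]


def pick_budget_label(prediction: dict, min_score: float = 0.6) -> str:
    """Return the predicted tier, bumped up one tier when the score is low.

    `prediction` is one item of the pipeline output, e.g.
    {"label": "brief_thinking", "score": 0.55}.
    The 0.6 default is an illustrative assumption, not a tuned value.
    """
    label = prediction["label"]
    if prediction["score"] >= min_score:
        return label
    # Low confidence: prefer over-thinking to under-thinking.
    idx = TIERS.index(label)
    return TIERS[min(idx + 1, len(TIERS) - 1)]
```

With the pipeline from the Quick Start, this would be called as `pick_budget_label(router(query)[0])`. Escalating rather than de-escalating trades some extra cost for fewer under-reasoned answers; the opposite policy may suit latency-sensitive deployments.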

Use With Qwen3 (Thinking Budget Control)

from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

budget_router = pipeline("text-classification", model="tripathyShaswata/ThinkingBudgetRouter")

BUDGET_MAP = {
    "no_thinking":    0,
    "brief_thinking": 512,
    "deep_thinking":  8192,
}

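def call_with_budget(user_message: str, model, tokenizer, answer_tokens: int = 2048):
    label = budget_router(user_message)[0]["label"]
    budget = BUDGET_MAP[label]

    messages = [{"role": "user", "content": user_message}]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        # The Qwen3 chat template takes an on/off switch (enable_thinking),
        # not a token count, so thinking is disabled only for a zero budget.
        enable_thinking=budget > 0,
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    # Approximate the budget by capping generation at
    # thinking budget + room for the final answer.
    output = model.generate(**inputs, max_new_tokens=budget + answer_tokens)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)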
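When thinking is enabled, Qwen3-style output typically places the reasoning between `<think>` and `</think>` ahead of the final answer. The sketch below separates the two. It assumes those markers survive decoding, which depends on the tokenizer configuration, and the helper name `split_thinking` is hypothetical.

```python
from typing import Tuple


def split_thinking(decoded: str) -> Tuple[str, str]:
    """Split decoded model output into (thinking, answer).

    Assumes the reasoning is wrapped as <think>...</think>. If no closing
    tag is present, the whole text is treated as the answer.
    """
    head, sep, tail = decoded.rpartition("</think>")
    if not sep:
        return "", decoded.strip()
    # Keep only the text after the last opening tag, so any prompt text
    # that precedes it is not mistaken for reasoning.
    thinking = head.rpartition("<think>")[2].strip()
    return thinking, tail.strip()
```

Counting the tokens in the returned thinking string is one way to check whether the model actually stayed near the budget the router chose.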

Full 3-Stage Agent Pipeline

Pair with AgentIntentRouter and QueryComplexityRouter for a complete routing stack:

from transformers import pipeline

intent_router     = pipeline("text-classification", model="tripathyShaswata/AgentIntentRouter")
complexity_router = pipeline("text-classification", model="tripathyShaswata/QueryComplexityRouter")
budget_router     = pipeline("text-classification", model="tripathyShaswata/ThinkingBudgetRouter")

BUDGET_MAP = {"no_thinking": 0, "brief_thinking": 512, "deep_thinking": 8192}

def route(user_message: str):
    intent     = intent_router(user_message)[0]
    complexity = complexity_router(user_message)[0]
    budget     = budget_router(user_message)[0]

    print(f"Intent:     {intent['label']} ({intent['score']:.2f})")
    print(f"Complexity: {complexity['label']} ({complexity['score']:.2f})")
    print(f"Budget:     {budget['label']} → {BUDGET_MAP[budget['label']]} tokens")

    # handle_with_rules, call_small_model, and call_large_model are
    # placeholders for your own handlers; they are not defined here.
    if complexity["label"] == "no_llm":
        return handle_with_rules(user_message)

    thinking_tokens = BUDGET_MAP[budget["label"]]

    if complexity["label"] == "small_llm":
        return call_small_model(user_message, thinking_budget=thinking_tokens)
    else:
        return call_large_model(user_message, thinking_budget=thinking_tokens)

User Message
    │
    ▼
AgentIntentRouter          ← WHAT: code / search / chat / math / ...
    │
    ▼
QueryComplexityRouter       ← WHICH MODEL: no_llm / small_llm / large_llm
    │
    ▼
ThinkingBudgetRouter        ← HOW MUCH THINKING: 0 / 512 / 8192 tokens
    │
    ▼
Model call with exact budget
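
The branching in this flow can be kept separate from the model calls, which makes it easy to unit-test without loading any classifier. The following is a sketch under that assumption; `RoutePlan` and `plan_route` are hypothetical names, not part of any of the three routers.

```python
from dataclasses import dataclass

BUDGET_MAP = {"no_thinking": 0, "brief_thinking": 512, "deep_thinking": 8192}


@dataclass(frozen=True)
class RoutePlan:
    target: str           # "rules", "small_llm", or "large_llm"
    thinking_tokens: int  # always 0 when no model is called


def plan_route(complexity_label: str, budget_label: str) -> RoutePlan:
    """Combine the complexity and budget labels into one routing decision."""
    if complexity_label == "no_llm":
        # Rule-based handling never spends thinking tokens,
        # whatever the budget router predicted.
        return RoutePlan(target="rules", thinking_tokens=0)
    target = "small_llm" if complexity_label == "small_llm" else "large_llm"
    return RoutePlan(target=target, thinking_tokens=BUDGET_MAP[budget_label])
```

For example, `plan_route("small_llm", "brief_thinking")` yields a plan targeting the small model with 512 thinking tokens.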

Budget Labels

no_thinking (0 tokens)

Answer immediately, no chain-of-thought needed:

  • "What is the capital of France?"
  • "What does HTTP stand for?"
  • "Is 17 divisible by 3?"
  • "Convert 100 km to miles."
  • "What HTTP status code means Not Found?"

brief_thinking (~512 tokens)

Structured reasoning, but not exhaustive:

  • "Write a Python function to binary search a sorted list."
  • "What is the difference between REST and GraphQL?"
  • "Fix this off-by-one error: for i in range(len(arr)): total += arr[i+1]"
  • "How would you detect a cycle in a linked list?"
  • "Prove that the sum of two even numbers is even."

deep_thinking (8192+ tokens)

Full extended chain-of-thought required:

  • "Design a distributed rate limiter for 1 billion users."
  • "Prove there are infinitely many prime numbers."
  • "Debug this race condition: two asyncio coroutines sharing a counter."
  • "Solve: given a matrix of costs, find the minimum-cost path from top-left to bottom-right."
  • "Compare PostgreSQL, MongoDB, and Cassandra across consistency, latency, and cost."
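
The example queries above double as a small sanity check for the router on your own traffic. The sketch below scores any classifier callable against labeled examples; the `evaluate` helper is a hypothetical name, and the example list is a subset of the queries listed in this section, far too small to stand in for a real evaluation set.

```python
from collections import Counter

LABELED_EXAMPLES = [
    ("What is the capital of France?", "no_thinking"),
    ("Convert 100 km to miles.", "no_thinking"),
    ("How would you detect a cycle in a linked list?", "brief_thinking"),
    ("Prove that the sum of two even numbers is even.", "brief_thinking"),
    ("Design a distributed rate limiter for 1 billion users.", "deep_thinking"),
    ("Prove there are infinitely many prime numbers.", "deep_thinking"),
]


def evaluate(classify, examples):
    """Return (accuracy, Counter of (expected, predicted) mismatches).

    `classify` is any callable mapping a query string to a label, for
    example: lambda q: router(q)[0]["label"].
    """
    mismatches = Counter()
    correct = 0
    for query, expected in examples:
        predicted = classify(query)
        if predicted == expected:
            correct += 1
        else:
            mismatches[(expected, predicted)] += 1
    return correct / len(examples), mismatches
```

The mismatch counter shows which direction the errors go: `("deep_thinking", "no_thinking")` entries are the costly ones, since they mean hard queries were answered without reasoning.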

Model Details

Property            Value
Base model          distilbert-base-uncased
Parameters          ~66M
Inference (CPU)     ~10ms
Inference (GPU)     ~2ms
Input max length    128 tokens
Task                3-class sequence classification
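
Latency depends on hardware, batch size, and input length, so the figures above are worth re-measuring in your own environment. The following is a minimal timing sketch using only the standard library; `median_latency_ms` is a hypothetical helper, and the warmup and run counts are arbitrary defaults.

```python
import statistics
import time


def median_latency_ms(fn, inputs, warmup: int = 3, runs: int = 20) -> float:
    """Median wall-clock time of fn(x) in milliseconds over `runs` calls."""
    for i in range(warmup):
        fn(inputs[i % len(inputs)])  # untimed calls to warm caches
    timings = []
    for i in range(runs):
        x = inputs[i % len(inputs)]
        start = time.perf_counter()
        fn(x)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

With the Quick Start pipeline, a call might look like `median_latency_ms(lambda q: router(q, truncation=True, max_length=128), queries)`. Passing `truncation=True` and `max_length=128` keeps long prompts within the input limit listed above.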

Compatible Reasoning Models

Model                        Thinking control param
Qwen3 (all sizes)            enable_thinking in chat template (on/off switch)
DeepSeek-R1                  /think token budget
Claude 3.7 Sonnet            thinking.budget_tokens in API
Gemini 2.0 Flash Thinking    thinkingConfig.thinkingBudget
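
As an illustration of mapping a router label onto one of these parameters, the sketch below builds the thinking-related arguments for a Claude request. It makes two assumptions that should be checked against the current API documentation: that `budget_tokens` has a minimum of 1024, and that `max_tokens` must exceed the thinking budget. The helper name `claude_thinking_kwargs` and the 2048-token answer allowance are hypothetical.

```python
BUDGET_MAP = {"no_thinking": 0, "brief_thinking": 512, "deep_thinking": 8192}

# Assumed API minimum for budget_tokens; verify against current docs.
MIN_CLAUDE_BUDGET = 1024


def claude_thinking_kwargs(label: str, answer_tokens: int = 2048) -> dict:
    """Build the thinking-related request arguments for a router label."""
    budget = BUDGET_MAP[label]
    if budget == 0:
        # No thinking: omit the thinking block entirely.
        return {"max_tokens": answer_tokens}
    # Clamp small budgets (such as 512) up to the assumed minimum.
    budget = max(budget, MIN_CLAUDE_BUDGET)
    return {
        # Leave room for the answer on top of the thinking budget.
        "max_tokens": budget + answer_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget},
    }
```

The returned dictionary is meant to be unpacked into the request call alongside the model name and messages. Under the assumed minimum, the brief_thinking tier ends up at 1024 tokens rather than 512.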