ThinkingBudgetRouter

A fast, lightweight 3-class classifier that decides how many thinking tokens a query needs — before you spend them.

Built on DistilBERT (66M params), fine-tuned to classify any user message into one of three thinking budget tiers:

Label	Budget	Meaning
`no_thinking`	0 tokens	Direct lookup or trivial — answer immediately
`brief_thinking`	~512 tokens	Structured reasoning needed, but not exhaustive
`deep_thinking`	8192+ tokens	Full chain-of-thought required

Why This Exists

Modern reasoning models such as Qwen3, DeepSeek-R1, Claude 3.7 Sonnet (extended thinking), and Gemini 2.0 Flash Thinking let callers influence how much the model reasons before it answers, although the exact control differs by model and serving stack. But most users either always use maximum thinking (slow, expensive) or never use it (and so miss hard problems).

ThinkingBudgetRouter makes this decision in ~10ms on CPU, before any tokens are spent reasoning.

Quick Start

from transformers import pipeline

router = pipeline("text-classification", model="tripathyShaswata/ThinkingBudgetRouter")

# Single prediction
result = router("What is the capital of France?")
print(result)
# [{'label': 'no_thinking', 'score': 0.97}]

# Batch
queries = [
    "What is 15 + 27?",                                          # no_thinking
    "What does HTTP stand for?",                                  # no_thinking
    "Write a Python function to merge two sorted arrays.",        # brief_thinking
    "Explain how Dijkstra's algorithm works step by step.",       # brief_thinking
    "Design a distributed rate limiter for 1 billion users.",     # deep_thinking
    "Prove there are infinitely many prime numbers.",              # deep_thinking
    "Debug this race condition in async code: ...",               # deep_thinking
]
results = router(queries)
for q, r in zip(queries, results):
    print(f"  {r['label']:>16} ({r['score']:.2f}) — {q}")
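
Each prediction carries a score as well as a label, so a caller may want a guard for low-confidence results. The following is a minimal sketch, not part of ThinkingBudgetRouter: the `pick_budget` helper, the 0.6 threshold, and the policy of escalating one tier when unsure are all assumptions chosen for illustration.

```python
# Tier order and budgets taken from the label table in this README.
TIERS = ["no_thinking", "brief_thinking", "deep_thinking"]
BUDGETS = {"no_thinking": 0, "brief_thinking": 512, "deep_thinking": 8192}


def pick_budget(prediction: dict, min_score: float = 0.6) -> int:
    """Map one pipeline result ({'label': ..., 'score': ...}) to a token budget.

    `min_score` and the escalate-one-tier policy are illustrative
    assumptions, not behaviour of the router itself.
    """
    idx = TIERS.index(prediction["label"])
    if prediction["score"] < min_score:
        # Unsure: prefer spending slightly more thinking over too little.
        idx = min(idx + 1, len(TIERS) - 1)
    return BUDGETS[TIERS[idx]]


print(pick_budget({"label": "no_thinking", "score": 0.97}))
print(pick_budget({"label": "brief_thinking", "score": 0.41}))
```

With the real pipeline, `pick_budget(router(query)[0])` would replace the hand-written dictionaries. Escalating on low confidence trades some cost for fewer under-reasoned answers; the opposite policy is equally easy to express.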

Use With Qwen3 (Thinking Budget Control)

from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

budget_router = pipeline("text-classification", model="tripathyShaswata/ThinkingBudgetRouter")

BUDGET_MAP = {
    "no_thinking":    0,
    "brief_thinking": 512,
    "deep_thinking":  8192,
}

def call_with_budget(user_message: str, model, tokenizer):
    label = budget_router(user_message)[0]["label"]
    budget = BUDGET_MAP[label]

    messages = [{"role": "user", "content": user_message}]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
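        # Qwen3's chat template exposes an on/off switch (enable_thinking),
        # not a token budget, so the budget is approximated below.
        enable_thinking=budget > 0,
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    # Cap total new tokens at the thinking budget plus room for the final answer.
    output = model.generate(**inputs, max_new_tokens=budget + 2048)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)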

Full 3-Stage Agent Pipeline

Pair with AgentIntentRouter and QueryComplexityRouter for a complete routing stack:

from transformers import pipeline

intent_router     = pipeline("text-classification", model="tripathyShaswata/AgentIntentRouter")
complexity_router = pipeline("text-classification", model="tripathyShaswata/QueryComplexityRouter")
budget_router     = pipeline("text-classification", model="tripathyShaswata/ThinkingBudgetRouter")

BUDGET_MAP = {"no_thinking": 0, "brief_thinking": 512, "deep_thinking": 8192}

def route(user_message: str):
    intent     = intent_router(user_message)[0]
    complexity = complexity_router(user_message)[0]
    budget     = budget_router(user_message)[0]

    print(f"Intent:     {intent['label']} ({intent['score']:.2f})")
    print(f"Complexity: {complexity['label']} ({complexity['score']:.2f})")
    print(f"Budget:     {budget['label']} → {BUDGET_MAP[budget['label']]} tokens")

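    # handle_with_rules, call_small_model and call_large_model are placeholders
    # for your own handlers; they are not provided by these routers.
    if complexity["label"] == "no_llm":
        return handle_with_rules(user_message)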

    thinking_tokens = BUDGET_MAP[budget["label"]]

    if complexity["label"] == "small_llm":
        return call_small_model(user_message, thinking_budget=thinking_tokens)
    else:
        return call_large_model(user_message, thinking_budget=thinking_tokens)

User Message
    │
    ▼
AgentIntentRouter          ← WHAT: code / search / chat / math / ...
    │
    ▼
QueryComplexityRouter       ← WHICH MODEL: no_llm / small_llm / large_llm
    │
    ▼
ThinkingBudgetRouter        ← HOW MUCH THINKING: 0 / 512 / 8192 tokens
    │
    ▼
Model call with exact budget

Budget Labels

`no_thinking` — 0 tokens

Answer immediately, no chain-of-thought needed:

"What is the capital of France?"
"What does HTTP stand for?"
"Is 17 divisible by 3?"
"Convert 100 km to miles."
"What HTTP status code means Not Found?"

`brief_thinking` — ~512 tokens

Structured reasoning, but not exhaustive:

"Write a Python function to binary search a sorted list."
"What is the difference between REST and GraphQL?"
"Fix this off-by-one error: for i in range(len(arr)): total += arr[i+1]"
"How would you detect a cycle in a linked list?"
"Prove that the sum of two even numbers is even."

`deep_thinking` — 8192+ tokens

Full extended chain-of-thought required:

"Design a distributed rate limiter for 1 billion users."
"Prove there are infinitely many prime numbers."
"Debug this race condition: two asyncio coroutines sharing a counter."
"Solve: given a matrix of costs, find the minimum-cost path from top-left to bottom-right."
"Compare PostgreSQL, MongoDB, and Cassandra across consistency, latency, and cost."

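The example lists above double as a small sanity check for any deployment of the router. The sketch below assumes a pipeline-style callable that takes a list of strings and returns a list of `{"label", "score"}` dictionaries; `evaluate` and the `always_brief` stand-in are hypothetical names used only for this illustration.

```python
from collections import Counter

# A few labelled examples copied from the lists above.
EXAMPLES = [
    ("What is the capital of France?", "no_thinking"),
    ("Convert 100 km to miles.", "no_thinking"),
    ("How would you detect a cycle in a linked list?", "brief_thinking"),
    ("Prove that the sum of two even numbers is even.", "brief_thinking"),
    ("Design a distributed rate limiter for 1 billion users.", "deep_thinking"),
    ("Prove there are infinitely many prime numbers.", "deep_thinking"),
]


def evaluate(router, examples):
    """Return (accuracy, Counter of (expected, predicted) mistakes)."""
    texts = [text for text, _ in examples]
    predictions = router(texts)  # pipeline-style batch call
    mistakes = Counter()
    correct = 0
    for (_, expected), prediction in zip(examples, predictions):
        if prediction["label"] == expected:
            correct += 1
        else:
            mistakes[(expected, prediction["label"])] += 1
    return correct / len(examples), mistakes


def always_brief(texts):
    """Stand-in router for a dry run; replace with the real pipeline."""
    return [{"label": "brief_thinking", "score": 1.0} for _ in texts]


accuracy, mistakes = evaluate(always_brief, EXAMPLES)
print(f"accuracy={accuracy:.2f}")
for (expected, predicted), count in mistakes.items():
    print(f"{expected} predicted as {predicted}: {count}")
```

Passing the `router` pipeline from Quick Start in place of `always_brief` gives a quick picture of which tiers get confused on your own traffic.
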
Model Details

Property	Value
Base model	`distilbert-base-uncased`
Parameters	~66M
Inference (CPU)	~10ms
Inference (GPU)	~2ms
Input max length	128 tokens
Task	3-class sequence classification
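
Because the input limit is 128 tokens and the latency figures depend on hardware, it can help to truncate explicitly and measure on your own machine. This is a rough sketch: `classify` and `median_latency_ms` are hypothetical helpers, and the only library behaviour relied on is that the text-classification pipeline forwards tokenizer arguments such as `truncation` and `max_length` given at call time.

```python
import statistics
import time

MAX_LENGTH = 128  # input limit from the table above


def classify(router, text: str) -> dict:
    """Classify one message, truncating it to the 128-token limit."""
    # Tokenizer kwargs passed at call time are forwarded to the tokenizer.
    return router(text, truncation=True, max_length=MAX_LENGTH)[0]


def median_latency_ms(router, texts, warmup: int = 3, runs: int = 20) -> float:
    """Median wall-clock time of a single-message call, in milliseconds."""
    for text in texts[:warmup]:
        classify(router, text)  # warm up before timing
    timings = []
    for i in range(runs):
        text = texts[i % len(texts)]
        start = time.perf_counter()
        classify(router, text)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Usage with the pipeline from Quick Start (not executed here):
# print(median_latency_ms(router, ["What is 15 + 27?", "Prove P != NP."]))
```

The median is used rather than the mean so that a single slow call (for example, the first one after loading) does not dominate the result.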

Compatible Reasoning Models

Model	Thinking control param
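Qwen3 (all sizes)	`enable_thinking` in chat template, or `/think` and `/no_think` in the prompt (on/off only; token budgets depend on the serving stack)
DeepSeek-R1	No dedicated budget parameter; cap reasoning via the output token limit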
Claude 3.7 Sonnet	`thinking.budget_tokens` in API
Gemini 2.0 Flash Thinking	`thinkingConfig.thinkingBudget` (documented for the Gemini 2.5 series; check support for your model version)
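
The parameter names in the table can be wired to the router's budget with a small translation layer. The sketch below only builds request fragments as plain dictionaries and calls no SDK; `thinking_kwargs` and the provider keys `"anthropic"` and `"gemini"` are names invented for this example. It assumes Anthropic's documented minimum of 1024 for `budget_tokens`, which means the 512-token `brief_thinking` tier has to be rounded up there.

```python
def thinking_kwargs(provider: str, budget: int) -> dict:
    """Translate a token budget into provider-specific request parameters.

    The provider keys are labels chosen for this sketch, not official names.
    """
    if provider == "anthropic":
        if budget <= 0:
            return {"thinking": {"type": "disabled"}}
        # Assumed minimum of 1024; budget_tokens must also stay below max_tokens.
        return {"thinking": {"type": "enabled", "budget_tokens": max(budget, 1024)}}
    if provider == "gemini":
        # Nested under the generation config in the actual request.
        return {"thinkingConfig": {"thinkingBudget": budget}}
    raise ValueError(f"no thinking-budget mapping for provider: {provider}")


for tier_budget in (0, 512, 8192):
    print(tier_budget, thinking_kwargs("anthropic", tier_budget))
```

Keeping this mapping in one function means the router's three tiers stay provider-agnostic, and provider quirks such as minimum budgets are handled in a single place.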

Safetensors: 67M params, tensor type F32