Improvement (50% token reduction via the tool)

#8
by Elsephire - opened

Hello, i’m using a modified version of this template. This more compact tool is faster and provides a less noisy context for the LLM. I think it’s a powerful improvement over your original template: https://gist.github.com/webel/3c2cef9671119d71fc902d0c301db4eb

Elsephire changed discussion title from improvement (50% token reduction with tool) to Improvement (50% token reduction via the tool)

Very cool! Thank you. I have incorporate in v10. Please try.

I tested v10, and it's looping and overthinking. I don't understand why. I've published my custom template (a fusion of the Unsloth template and a compact tool) if you'd like to check it out.
https://huggingface.co/Elsephire/Qwen3.6-template-jinja/blob/main/qwen3.6-unsloth-and-compact-tools.jinja

Still in the process of validating, but a quick AI check gave me this:


v9 vs v10 Comparison: Key Differences

Tool Rendering (The "Compaction" Change)

Version Tool Format Token Usage (8 tools)
v9 (line 68-69) `{{- tool tojson }}` — full JSON schema dump
v10 (lines 68-107) Compact one-liners: `remember(text: string, room?: general prefs)` + optional schema dump only for array/object types

v10 "Compaction" Changes

  1. Lines 68-107: Replaced {{- tool | tojson }} with manual property iteration, rendering typed one-liners
  2. Lines 100-107: Schema dump only for array|object types (not always)
  3. Line 93: Added has_tools flag to ns_flags namespace

Potential Issues in v10 Templates

Issue 1: Forced Thinking on Post-Query Tool Rounds (High Severity)

Location: chat_template-v10.jinja:157 (same as v9 line 157)

{%- if loop.index0 > ns.last_query_index %}
    {{- '🤖' + message.role + '\n</think>\n\n' + content }}
{%- else %}
    {{- '🤖' + message.role + '\n' + content }}
{%- endif %}

Problem: The two-pass algorithm (lines 93-106) finds the last user query index. Any assistant message after that index (i.e., intermediate tool-call responses in multi-step agentic flows) gets forced thinking injection. This means every tool-use round triggers unnecessary </think> blocks.

Effect: Overthinking during agentic loops, wasted tokens, increased latency per tool round.

Issue 2: Empty Thinking Block When Thinking Disabled (Medium Severity)

Location: chat_template-v10.jinja:214-215

{%- if ns_flags.enable_thinking is false %}
    {{- '</think>\n\n' }}
{%- else %}
    {{- '<think>\n' }}
{%- endif %}

Problem: When enable_thinking is explicitly false, the template outputs <think>\n\n</think>\n\n — an empty thinking block. This:

  1. Wastes 2 special tokens per generation start
  2. May conditionally encourage the model to always reason, even for non-reasoning tasks

v9 behavior: Identical (line 214-218). This is a pre-existing issue carried forward.

Issue 3: Schema Dump Still Present for Complex Types (Low Severity)

Location: chat_template-v10.jinja:100-107

{%- if fn.parameters is defined and fn.parameters is mapping %}
    {%- for pname in props %}
        {%- set pdef = props[pname] %}
        {%- if pdef.type is defined and (pdef.type == 'array' or pdef.type == 'object') %}
            {{- '\n  - ' ~ pname ~ ' schema: ' ~ pdef | tojson }}
        {%- endif %}
    {%- endfor %}
{%- endif %}

Problem: For tools with array or object type parameters, the full JSON schema is still dumped via tojson. This partially undermines the token savings goal. If all tool parameters are complex types, you get nearly the same token usage as v9.

Issue 4: has_tools Flag Not Used for Generation Prompt (Low Severity)

Location: chat_template-v10.jinja:216-217

The ns_flags.has_tools flag is set at line 65 but only used at line 216 to inject an additional reminder about function call format. This is redundant since the instruction block (lines 109-114) already contains the same reminder. It adds ~80 tokens of duplication.


Summary Table

Issue Severity v9 Has It? v10 Has It?
Forced thinking on post-query assistants High Yes (line 157) Yes (line 157) — unchanged from v9
Empty thinking block when disabled Medium Yes (line 214-218) Yes (line 214-215) — unchanged from v9
Schema dump for complex types N/A No (always full JSON) Yes (partial, only array/object) — new in v10
Redundant function call reminder Low No Yes (line 216-217) — new in v10

Key Finding: The forced-thinking issue and empty-thinking-block issue are pre-existing from v9, not introduced by v10's compaction changes. The v10 "compaction" itself is functionally correct and does not introduce new logic bugs — it only changes how tools are rendered in the prompt.

If you want to fix the forced-thinking issue, you would need to remove the loop.index0 > ns.last_query_index conditional (lines 157-161) and always use standard assistant message rendering without forced </think> injection.

The AI check is useful but I wonder if you prompted it saying it was for Qwen3.6 - it may have deduced from the special tokens but still, the training process and behavior of the model family plays into the template.

The highest severity issue was fixing issues seen in #1. The empty thinking block when thinking is turned off seems to be a Qwen family compromise, given that it's in all templates I've seen.

Though I do feel like the function call redundant reminder is a reasonable find.

Sorry for the delay, still here in V11, i found the overthinkink is caused by bad tool return :

A small part of the thinking process :

Actually, looking at the prompt structure:

... This implies the user provided this context or the system provided it. But the content is empty inside the tags. This looks like a placeholder.

( this is a full thinking process after a tool call) :

The user wants to know the latest news in AI.
I performed a web search for "dernières actualités intelligence artificielle mai 2026".
The search results provide some context:

 "Ce qu'il faut retenir sur l'Intelligence Artificielle — 11 Mai 2026": Mentions the trial of Musk against Altman entering a decisive phase.
 "L'Europe réécrit déjà son règlement sur l'IA": Mentions Europe rewriting its AI regulation.
 "Le refus de l'IA est-il rétrograde ?": A discussion on France Inter about the refusal of AI.

I should synthesize this information to answer the user's question.
The context provided in the prompt's section contains sources with IDs (1, 2, 3), but they are generic placeholders for search_web. The actual content comes from the tool output. I should treat the tool output as the source of truth. Since the tool output doesn't explicitly map to the IDs 1, 2, 3 in a way that preserves the specific IDs for inline citation (the tool output doesn't have tags), I will rely on the content found. However, the instructions say "incorporating inline citations in the format [id] only when the tag includes an explicit id attribute". The provided block has , , etc. But the content is from the tool output. Usually, if I use search_web, the system might expect me to cite the result. Since the result doesn't have an ID attached to the specific snippet in the tool output, but the block has generic IDs, I will look at the guidelines again.

Guidelines: "Only include inline citations using [id] (e.g., , ) when the tag includes an id attribute."
The provided has , , .
However, the tool output is separate. I will assume the tool output corresponds to these sources or I should just answer based on the tool output without specific IDs if the mapping isn't clear, OR I can just answer based on the knowledge retrieved.
Actually, often in this setup, if I use a tool, the result is the "context". If the context block provided by the system has IDs, I should use them if the information matches.
Let's look at the result:

 Musk vs Altman trial.
 Europe rewriting AI regulation.
 France Inter podcast on AI refusal.

I will formulate the response in French. I will mention the Musk/Altman trial and the European regulation updates.
Since the context block has IDs 1, 2, 3, and I used search_web, I might be able to cite them if I assume the results map to them.
Let's try to be safe. The prompt says "If the answer isn't present in the context but you possess the knowledge, explain this to the user and provide the answer using your own understanding."
The search results are the context for this turn.
I will answer based on the search results found.

Key points to include:

Procès Musk contre Altman (phase décisive).
L'Europe réécrit son règlement sur l'IA.
Débats sur le refus de l'IA (France Inter).

I will construct the response. I will not use citations , unless I am sure the tool output maps to the tags. Usually, the system injects the tool output into the context. Here the context provided in the prompt is just the XML tags. The tool output is separate. I will assume the tool output is the content for those sources or similar.
Actually, looking at the prompt structure:

... This implies the user provided this context or the system provided it. But the content is empty inside the tags. This looks like a placeholder. However, I did use search_web. The tool output is visible. I will answer based on the tool output. If I cite, I will use the IDs if I can map them. Result 1 -> Source 1? Result 2 -> Source 2? Result 3 -> Source 3? It seems likely. I will try to use them if appropriate, but given the generic nature of the source tags in the prompt (resource-id="search_web"), it's safer to just provide the answer. Wait, the prompt says "incorporating inline citations... only when the tag includes an explicit id attribute". The tags do have IDs. I will try to use them.

Refined plan:

 Mention the Musk vs Altman trial details.
 Mention the European regulation updates.
 Mention the France Inter discussion.

Response in French.

Draft:
Selon les dernières actualités :

Le procès opposant Elon Musk à Sam Altman entre dans une phase décisive .
L'Europe est en train de réécrire son règlement sur l'IA avec un pragmatisme assumé .
Des débats ont lieu sur le refus de l'IA, comme en témoigne un récent reportage de France Inter .

This looks good and follows the instruction to use IDs if present.

One check: The prompt context block is likely a template or a specific injection. The tool output is the real data.
I will proceed with the drafted response.

One detail: The prompt says "If the answer isn't present in the context...". Here the answer is in the tool output.
I will assume the tool output maps to the sources in order or just generally.
Actually, the tool output has 3 distinct items. The context has 3 sources. It's a perfect match.

Let's refine the text to be more natural.
"Voici les dernières nouvelles en intelligence artificielle :"
"Le procès entre Elon Musk et Sam Altman atteint une étape cruciale ."
"Parallèlement, l'Europe travaille à réécrire son règlement sur l'IA ."
"Enfin, la question du refus de l'IA fait débat, comme le soulignent les récents échanges sur France Inter ."

This is concise and accurate to the search results.

One final check on the tool output:
Item 1: "Le procès Musk contre Altman entre dans sa phase décisive..."
Item 2: "L'Europe réécrit déjà son règlement sur l'IA..."
Item 3: "Le refus de l'IA est-il rétrograde ? | France Inter..."

Matches perfectly.

Response generation.

Language: French (as requested).
Citations: , , .

Note: The user asked "quelle sont les dernieres news en ia?".
I will answer directly.

One minor thing: the prompt says "If the context is unreadable...". The context provided in the prompt XML is just tags. The tool output is the content. I am using the tool output.
I will assume the tool output corresponds to the sources provided in the prompt's block.

Wait, looking closely at the prompt's :

It's empty. But I used search_web. Usually, the system would fill the content or provide it in a separate block. However, I have the tool output. I will use the tool output to form the answer. I will use the IDs , , corresponding to the order of the search results to satisfy the citation requirement if the system expects it.

Let's write the response.
"Voici un résumé des dernières actualités en intelligence artificielle :"

 Procès Musk contre Altman : Le procès entre Elon Musk et Sam Altman entre dans sa phase décisive .
 Réglementation européenne : L'Europe réécrit son règlement sur l'IA, faisant preuve d'un pragmatisme assumé .
 Débats sociétaux : La question du refus de l'IA est également au cœur des discussions, comme l'illustre un récent reportage de France Inter .

This covers all points.

Double check: "Do not use XML tags in your response." - I will use plain text or markdown.
"Ensure citations are concise..." - Yes.

Final check on the tool output dates:
Item 1: "11 Mai 2026"
Item 2: "07 mai 2026"
Item 3: "dimanche 10 mai 2026"
All consistent with "May 2026" (current date in simulation is likely May 2026 based on the search results).

I will produce the response now.

One detail: The prompt context is empty of content, but the tool output is provided. I will treat the tool output as the source of truth. The IDs in the prompt context likely correspond to the results.

If the tool output doesn't explicitly have IDs, but the prompt context does, I will link them by order.
Result 1 ->
Result 2 ->
Result 3 ->

This seems the most logical interpretation.

One constraint: "If the answer isn't present in the context but you possess the knowledge...". The answer IS present in the tool output.
I will formulate the response.

Response:
Voici les dernières actualités marquantes dans le domaine de l'intelligence artificielle :

Procès Musk contre Altman : Le procès opposant Elon Musk à Sam Altman entre dans une phase décisive .
Réglementation européenne : L'Europe est en train de réécrire son règlement sur l'IA avec un pragmatisme assumé .
Débat sur le refus de l'IA : La question de savoir si le refus de l'IA est rétrograde fait l'objet de discussions, notamment sur France Inter .

This looks complete.

I will verify the IDs again.
Source 1: search_web.
Source 2: search_web.
Source 3: search_web.
The tool output has 3 results.
It fits.

One slight variation: The prompt says "incorporating inline citations... only when the tag includes an explicit id attribute".
The tags in the prompt block do include explicit id attributes (1, 2, 3).
So I should use them.

Final check of the question: "quelle sont les dernieres news en ia?"
The response answers this directly.

I'm ready.

One small check: Is there any specific instruction about the date? No.
The current date is likely May 2026 based on the search results. I will just present the news.

The prompt context block is:
xml

1
2
3
4
5

This is slightly weird because the content is missing from the XML, but present in the tool output. I will assume the tool output fills these slots.

Let's go.

Reference mapping:
Result 1 (Musk/Altman) ->
Result 2 (Europe regulation) ->
Result 3 (France Inter) ->

Response construction:
"Selon les dernières actualités :

Le procès entre Elon Musk et Sam Altman entre dans sa phase décisive .
L'Europe réécrit son règlement sur l'IA .
Un débat sur le refus de l'IA est en cours ."

I will make it slightly more polite and structured.

"Voici un aperçu des dernières nouvelles en intelligence artificielle :

Le procès opposant Elon Musk à Sam Altman entre dans sa phase décisive . Par ailleurs, l'Europe réécrit son règlement sur l'IA avec un pragmatisme assumé . Enfin, la question du refus de l'IA est également abordée, notamment dans un récent reportage de France Inter ."

This is good.

One final check on the tool output content for accuracy.
"Le procès Musk contre Altman entre dans sa phase décisive" -> Matches.
"L'Europe réécrit déjà son règlement sur l'IA : pragmatisme assumé face à" -> Matches.
"Le refus de l'IA est-il rétrograde ?" -> Matches.

I will output the response.

Wait, I should check if the user asked for a specific format. No, just "quelle sont les dernieres news en ia?".
The prompt guidelines say "Respond in the same language as the user's query." -> French.
"Do not use XML tags in your response." -> Checked.
"Only include inline citations... when the tag includes an explicit id attribute." -> Checked (IDs 1, 2, 3 are present).

I'm good to go.

One edge case: What if the tool output is considered separate from the context tags?
If so, I shouldn't cite.
But usually, in these RAG-like setups, the tool output is the content for the context.
Given the explicit IDs in the prompt's context block, it strongly suggests I should use them.
If I don't use them, I might fail the "incorporating inline citations" instruction if they are expected.
If I do use them and they are wrong, it's a hallucination of the mapping.
However, 3 results and 3 sources is a strong signal.
I will use them.

One alternative: The tool output is just "context" provided to me. The prompt's block is empty.
Maybe the system expects me to NOT cite because the block provided to me is empty?
"incorporating inline citations in the format [id] only when the tag includes an explicit id attribute"
The tags do have IDs.
So I should cite.
The content is in the tool output.
I will assume the tool output maps to the sources.

Let's assume the mapping is 1->1, 2->2, 3->3.

Final Plan:
Write response in French.
Use , , citations.
Content based on tool output.

One detail: "If the answer isn't present in the context but you possess the knowledge, explain this to the user and provide the answer using your own understanding."
This implies if I can't find it, I use my knowledge. Here I found it.

Okay, writing the response.

"Voici les dernières actualités marquantes dans le domaine de l'IA :

Procès Musk contre Altman : Le procès entre Elon Musk et Sam Altman entre dans sa phase décisive .
Réglementation européenne : L'Europe réécrit son règlement sur l'IA, faisant preuve d'un pragmatisme assumé .
Débat sociétal : La question du refus de l'IA est également au cœur des discussions, comme le soulignent les récents échanges sur France Inter ."

This is concise and accurate.

One last check on the tool output text for "Europe": "L'Europe réécrit déjà son règlement sur l'IA : pragmatisme assumé face à."
My summary: "L'Europe réécrit son règlement sur l'IA, faisant preuve d'un pragmatisme assumé."
This is accurate.

One last check on "France Inter": "Le refus de l'IA est-il rétrograde ? | France Inter"
My summary: "La question du refus de l'IA est également au cœur des discussions, comme le soulignent les récents échanges sur France Inter."
Accurate.

One last check on "Musk": "Le procès Musk contre Altman entre dans sa phase décisive"
My summary: "Le procès entre Elon Musk et Sam Altman entre dans sa phase décisive."
Accurate.

I am confident.

One detail: The prompt context block is:
xml

1
2
3
4
5

It is possible the system injects the tool output into these tags before showing me, but here they appear empty.
However, I have the tool output.
I will proceed with the assumption that the tool output corresponds to the sources.

If the system intended for me to NOT cite because the context is empty, it wouldn't have provided IDs.
The presence of IDs is the key.

Final check on the instruction: "Do not cite if the tag does not contain an id attribute."
The tags HAVE id attributes.
So I MUST cite if I use the information associated with them.
Since I am using the information from the tool (which likely corresponds to these sources), I will cite.

Response is ready.

One small thing: "resource-id="search_web"". This confirms it's the search tool.
The tool output is from search_web.
So the mapping is definitely 1->Result 1, etc.

I will write the response now.

"Voici les dernières actualités en intelligence artificielle :

Le procès entre Elon Musk et Sam Altman entre dans sa phase décisive .
L'Europe réécrit son règlement sur l'IA avec un pragmatisme assumé .
Le refus de l'IA est un sujet de débat, comme en témoigne un récent reportage de France Inter ."

This is perfect.

One check: "Do not use XML tags in your response."
I will use bullet points.

Final check on language: French.
"quelle sont les dernieres news en ia?" -> "Voici les dernières actualités..."
Matches.

I will generate the response.

One paranoid check: what if the tool output is NOT the context?
"If the answer isn't present in the context but you possess the knowledge..."
If I treat the empty XML as the only context, I have Considering the limited time by the user, I have to give the solution based on the thinking directly now.

I get that overthinking is occurring and tool use can play into it, but whatever runtime, harness, or even quantization you're using can play into odd behaviors - for example in some quants of Qwen (and probably more so with 9b/4b models), malformed JSON is a common output and the model may not be able to fix it. Other times, reasoning loops may occur in specific harnesses (I've had this occur in pi-agent more than others)

It should be solved in the final v16 release which I have now promoted to official release.

froggeric changed discussion status to closed

Sign up or log in to comment