---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
  - llm
  - debugging
  - inference
  - qwen
---

# Q2 — Debugging LLMs (Friendli AI Take-Home)

## Fixed Model Repository

https://huggingface.co/syedahmedsoftware/broken-model-fixed

## Original Broken Model

https://huggingface.co/yunmorning/broken-model

This repository contains the minimal, production-safe fixes required to make the original model usable behind an OpenAI-compatible `/chat/completions` API server.

No model weights were modified; only configuration metadata was corrected.

## Problem (a) — Why inference failed

The original repository could not serve chat requests because two required pieces of metadata were missing or incorrect.

### Root causes

#### 1) Missing `chat_template` (`tokenizer_config.json`)

Modern OpenAI-style runtimes render chat messages through:

```python
tokenizer.apply_chat_template(messages)
```

The original model did not define a chat template, which meant:

- Chat messages could not be rendered into prompts
- `/chat/completions` servers had no deterministic formatting
- Inference either failed or produced undefined behavior
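As a sketch of what `apply_chat_template` produces for a ChatML-style template (the helper below is illustrative, not the library implementation):

```python
# Illustrative re-implementation of ChatML rendering; the real rendering is
# done by tokenizer.apply_chat_template via a Jinja2 chat_template string.
def render_chatml(messages, add_generation_prompt=True):
    out = ""
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        out += "<|im_start|>assistant\n"  # cue the model to produce a reply
    return out

prompt = render_chatml([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Say hello."},
])
```

Without a `chat_template` in `tokenizer_config.json`, the server has no equivalent of this function and cannot turn a message list into a prompt.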

#### 2) Missing / incorrect `pad_token_id` (`config.json`)

Production inference relies on:

- Batched decoding
- Attention masking
- Padded sequences

The original `config.json` did not define a valid `pad_token_id`, which makes batching unsafe and causes runtime instability.
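A minimal sketch of why batching needs a pad token (token IDs below are made up for illustration):

```python
# Right-pad a batch of variable-length sequences to a common length.
# Without a valid pad_token_id there is no safe filler ID, so rectangular
# batched tensors cannot be built. All IDs here are illustrative.
pad_token_id = 151643

sequences = [[101, 7, 8], [101, 9]]
max_len = max(len(s) for s in sequences)

input_ids = [s + [pad_token_id] * (max_len - len(s)) for s in sequences]
attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
# The attention mask zeros out padding so it cannot influence the model.
```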

## Minimal fixes applied (no weight changes)

Only metadata was changed:

- `tokenizer_config.json`: added a ChatML-style `chat_template` so chat messages render correctly.
- `config.json`: set `pad_token_id` to match the tokenizer's pad token so batching and masking are safe.
- `generation_config.json`: normalized `pad_token_id` and `eos_token_id` to prevent decoding edge cases.

No architecture, tokenizer vocabulary, or model weights were modified.
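For reference, a ChatML-style `chat_template` is a Jinja2 string stored in `tokenizer_config.json`; a minimal version (the exact template shipped in the fixed repo may differ slightly) looks like:

```jinja
{% for message in messages %}<|im_start|>{{ message['role'] }}
{{ message['content'] }}<|im_end|>
{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
{% endif %}
```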

## Verification (remote)

```bash
python - <<'PY'
from transformers import AutoTokenizer

repo_id = "syedahmedsoftware/broken-model-fixed"
tok = AutoTokenizer.from_pretrained(repo_id, use_fast=True)

print("tokenizer:", tok.__class__.__name__)
print("has_chat_template:", bool(getattr(tok, "chat_template", None)))
print("pad_token_id:", tok.pad_token_id, "eos_token_id:", tok.eos_token_id)

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Say hello."},
]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print("REMOTE prompt renders. Preview:")
print(prompt[:250])
PY
```

### Expected output

```text
tokenizer: Qwen2Tokenizer
has_chat_template: True
pad_token_id: 151643
eos_token_id: 151645
```

Rendered prompt:

```text
<|im_start|>system
You are helpful.<|im_end|>
<|im_start|>user
Say hello.<|im_end|>
<|im_start|>assistant
```

This confirms that the model now supports OpenAI-style chat formatting.

## Problem (b) — Why `reasoning_effort` has no effect

### Root cause

`reasoning_effort` is not a native Transformers generation parameter.

Unless the serving runtime explicitly maps this field to real compute or decoding policies, it is silently ignored. The base model itself has no internal mechanism to interpret the concept of "effort."

### What is required to make `reasoning_effort` meaningful

#### 1) Runtime orchestration (required)

The server must translate `reasoning_effort` into real execution strategies, for example:

- `low` → single pass, greedy decoding
- `medium` → temperature + nucleus sampling
- `high` → multi-sample + rerank
- `very_high` → tree-of-thought or verifier loop

The runtime must explicitly log which policy is applied.
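A hypothetical server-side mapping of this kind might look as follows (parameter names follow the Transformers `GenerationConfig` conventions; the specific values are examples, not recommendations):

```python
# Hypothetical mapping from reasoning_effort to concrete decoding policies.
EFFORT_POLICIES = {
    "low": {"do_sample": False, "num_return_sequences": 1},
    "medium": {"do_sample": True, "temperature": 0.7, "top_p": 0.9},
    "high": {"do_sample": True, "temperature": 0.8, "top_p": 0.95,
             "num_return_sequences": 4},
}

def resolve_effort(effort: str) -> dict:
    """Translate a request's reasoning_effort into generation kwargs."""
    if effort not in EFFORT_POLICIES:
        raise ValueError(f"unknown reasoning_effort: {effort!r}")
    policy = EFFORT_POLICIES[effort]
    # The runtime should log the applied policy so the effect is observable.
    print(f"reasoning_effort={effort} -> {policy}")
    return policy
```

Rejecting unknown values (rather than silently ignoring them) is what makes the parameter's behavior observable to clients.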

#### 2) Architectural support (at least one)

One or more of the following must exist:

- Multi-pass verifier or self-consistency loop
- Tree search or branch-and-bound reasoning
- Reflection / critique agent
- Budget-controlled decoding controller
- Control-token-trained model

Without these, `reasoning_effort` remains a no-op.
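As a toy illustration of the first option, a self-consistency loop samples several candidate answers and keeps the majority vote (`generate` here is a stand-in for a real model call, not part of any library):

```python
from collections import Counter

def self_consistency(generate, prompt, n_samples=5):
    # Sample n candidate answers and return the most common one.
    candidates = [generate(prompt) for _ in range(n_samples)]
    answer, votes = Counter(candidates).most_common(1)[0]
    return answer

# Stand-in "model" that answers correctly 3 times out of 5 on a fixed schedule.
_canned = iter(["4", "5", "4", "4", "5"])
majority = self_consistency(lambda p: next(_canned), "What is 2+2?")
```

Scaling `n_samples` with `reasoning_effort` is one concrete way to trade compute for reliability.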

## Final notes

These fixes:

- Restore deterministic chat formatting
- Enable production-safe batching
- Make the model compatible with OpenAI-style `/chat/completions` servers

The model is now deployable in real inference environments.