syedahmedsoftware committed on
Commit 4ab37ae · verified · 1 Parent(s): ac9bcec

Fix: README

Files changed (1): README.md (+91 -77)
tags:
  - qwen
---
Q2 Debugging LLMs (Friendli AI Take-Home)

Fixed Model Repo: https://huggingface.co/syedahmedsoftware/broken-model-fixed
Original Model: https://huggingface.co/yunmorning/broken-model

This repository contains the minimal, production-safe fixes required to make the original model usable behind an OpenAI-compatible /chat/completions API server. No model weights were modified; only configuration metadata was corrected.
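For context, a minimal sketch of the request shape an OpenAI-compatible /chat/completions endpoint accepts (the endpoint URL mentioned in the comment and the exact payload fields beyond `model` and `messages` are illustrative assumptions, not part of this repo):

```python
import json

# Illustrative /chat/completions request body. The server must turn
# `messages` into a prompt string -- exactly the step that fails when
# the tokenizer defines no chat_template.
payload = {
    "model": "syedahmedsoftware/broken-model-fixed",
    "messages": [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Say hello."},
    ],
}

# This JSON would be POSTed to the server's /v1/chat/completions route.
body = json.dumps(payload)
print(body[:60])
```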
  Problem (a) — Why inference failed
 
The original repository could not serve chat requests because two required metadata components were missing or incorrect.

Root Causes

1) Missing chat_template (tokenizer_config.json)

Modern OpenAI-style runtimes rely on:

```python
tokenizer.apply_chat_template(messages)
```

The original model did not define a chat template, so:

• Chat messages could not be rendered into prompts
• /chat/completions servers had no deterministic formatting
• Inference failed or produced undefined behavior

2) Missing/incorrect pad_token_id (config.json)

Production inference relies on:

• Batched decoding
• Attention masking
• Padded sequences

The original config.json did not define a valid pad_token_id, making batching unsafe and causing runtime instability.

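To make the chat_template failure concrete, here is a minimal sketch of the ChatML-style rendering a template provides. The template string and the pure-Python stand-in below are simplified assumptions, not necessarily the exact template committed to tokenizer_config.json:

```python
# An assumed, simplified ChatML-style Jinja2 template of the kind added
# to tokenizer_config.json.
CHATML_TEMPLATE = (
    "{% for message in messages %}"
    "<|im_start|>{{ message['role'] }}\n{{ message['content'] }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

def render_chatml(messages, add_generation_prompt=True):
    # Pure-Python stand-in for tokenizer.apply_chat_template(...) under the
    # template above, so the formatting rule is visible without loading a model.
    out = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        out += "<|im_start|>assistant\n"
    return out

prompt = render_chatml([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Say hello."},
])
print(prompt)
```

Without a template, there is no deterministic mapping from `messages` to this string, which is why serving fails.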

Minimal Fixes Applied (No Weight Changes)

| File | Change | Why |
| --- | --- | --- |
| tokenizer_config.json | Added ChatML-style chat_template | Required for OpenAI-style chat formatting |
| config.json | Set pad_token_id to the tokenizer’s pad token | Enables safe batching and attention masks |
| generation_config.json | Normalized pad/eos fields (kept defaults) | Prevents decoding edge cases |

No architecture, tokenizer vocabulary, or model weights were changed.
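A short sketch of why the pad_token_id fix matters: batching sequences of unequal length requires a fill token plus an attention mask that hides it. The short token-id lists below are illustrative; 151643 matches the pad_token_id set in the fixed config.json:

```python
PAD_ID = 151643  # pad_token_id from the fixed config.json
seqs = [[101, 7592], [101, 7592, 2088, 999]]  # two tokenized requests, unequal length

# Right-pad every sequence to the batch maximum so the batch is rectangular.
max_len = max(len(s) for s in seqs)
input_ids = [s + [PAD_ID] * (max_len - len(s)) for s in seqs]
# Mask value 0 tells the model to ignore the padding positions.
attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]

print(input_ids)       # safe to stack into a tensor
print(attention_mask)
```

With no valid pad_token_id, this fill value is undefined, so batched serving either crashes or silently attends to garbage tokens.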
Verification (Remote)

```bash
python - <<'PY'
from transformers import AutoTokenizer

repo_id = "syedahmedsoftware/broken-model-fixed"
tok = AutoTokenizer.from_pretrained(repo_id, use_fast=True)

print("tokenizer:", tok.__class__.__name__)
print("has_chat_template:", bool(getattr(tok, "chat_template", None)))
print("pad_token_id:", tok.pad_token_id, "eos_token_id:", tok.eos_token_id)

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Say hello."},
]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print("✅ REMOTE prompt renders. Preview:")
print(prompt[:250])
PY
```

Expected output:

```
tokenizer: Qwen2Tokenizer
has_chat_template: True
pad_token_id: 151643 eos_token_id: 151645
✅ REMOTE prompt renders. Preview:
<|im_start|>system
You are helpful.<|im_end|>
<|im_start|>user
Say hello.<|im_end|>
<|im_start|>assistant
```

This confirms that the model now supports OpenAI-style chat formatting.
 

Problem (b) — Why reasoning_effort has no effect

Root Cause

reasoning_effort is not a native Transformers generation parameter. Unless the serving runtime explicitly maps it to real compute or decoding policies, it is silently ignored. The base model itself has no internal mechanism to interpret the concept of “effort.”
 
 
What is required to make reasoning_effort meaningful

1) Runtime orchestration (required)

The server must map reasoning_effort to actual strategies, for example:

| Effort Level | Runtime Policy |
| --- | --- |
| low | single pass, greedy |
| medium | temperature + nucleus sampling |
| high | multi-sample + rerank |
| very_high | tree-of-thought / verifier loop |

The runtime must log which policy is applied.
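Such a mapping can be sketched in a few lines. The policy table and generation kwargs below are illustrative assumptions, not an existing serving API; the rerank and verifier steps would need their own implementations:

```python
# Assumed effort-to-policy table; values are plausible generation kwargs,
# not tuned settings.
EFFORT_POLICIES = {
    "low":       {"do_sample": False, "num_return_sequences": 1},
    "medium":    {"do_sample": True, "temperature": 0.7, "top_p": 0.9},
    "high":      {"do_sample": True, "temperature": 0.8, "num_return_sequences": 4},
    "very_high": {"do_sample": True, "temperature": 1.0, "num_return_sequences": 8},
}

def resolve_policy(reasoning_effort):
    # Fall back to "low" for unknown values, and log the chosen policy so the
    # field is auditable instead of silently ignored.
    policy = EFFORT_POLICIES.get(reasoning_effort, EFFORT_POLICIES["low"])
    print(f"reasoning_effort={reasoning_effort!r} -> {policy}")
    return policy

resolve_policy("high")
```

The returned kwargs would then be passed to the model's generate call, with the multi-sample outputs reranked or verified at higher effort levels.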
2) Architectural support (one of)

• Multi-pass verifier loop
• Tree search / self-consistency
• Reflection / critique agent
• Budgeted decoding controller
• Control-token trained model

Without these, reasoning_effort remains a no-op.
 
Final Notes

These fixes:

• Restore deterministic chat formatting
• Enable production-safe batching
• Make the model compatible with OpenAI-style /chat/completions servers

The model is now deployable in real inference environments.
 
 
 