---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- llm
- debugging
- inference
- qwen
---
# Q2 — Debugging LLMs (Friendli AI Take-Home)

**Fixed model repository:** https://huggingface.co/syedahmedsoftware/broken-model-fixed

**Original broken model:** https://huggingface.co/yunmorning/broken-model

This repository contains the minimal, production-safe fixes required to make the original model usable behind an OpenAI-compatible `/chat/completions` API server. No model weights were modified; only configuration metadata was corrected.
## Problem (a) — Why inference failed

The original repository could not serve chat requests because two required pieces of metadata were missing or incorrect.

### Root causes

#### 1) Missing `chat_template` (`tokenizer_config.json`)

Modern OpenAI-style runtimes rely on:

```python
tokenizer.apply_chat_template(messages)
```

The original model did not define a chat template, which meant:

- Chat messages could not be rendered into prompts
- `/chat/completions` servers had no deterministic formatting
- Inference either failed or produced undefined behavior

#### 2) Missing / incorrect `pad_token_id` (`config.json`)

Production inference relies on:

- Batched decoding
- Attention masking
- Padded sequences

The original `config.json` did not define a valid `pad_token_id`, which made batching unsafe and caused runtime instability.
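To see why a dedicated pad token matters, here is a minimal, dependency-free sketch (plain Python, not the actual runtime code) of what any batched inference server must do: pad variable-length sequences to a common length and mark the pad positions in an attention mask. Without a valid `pad_token_id`, there is no safe id to fill those positions with.

```python
# Sketch of right-padding a batch of token-id sequences.
# PAD_TOKEN_ID matches the id set in the fixed config; here it is illustrative.
PAD_TOKEN_ID = 151643

def pad_batch(sequences, pad_id=PAD_TOKEN_ID):
    """Right-pad token-id lists to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 102, 103], [7, 8]])
print(ids)   # [[101, 102, 103], [7, 8, 151643]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

The mask is what keeps the model from attending to pad positions; an undefined pad id forces runtimes to guess, which is exactly the instability described above.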
## Minimal fixes applied (no weight changes)

Only metadata was changed:

- **`tokenizer_config.json`**: added a ChatML-style `chat_template` so chat messages can be rendered correctly.
- **`config.json`**: set `pad_token_id` to match the tokenizer's pad token so batching and masking are safe.
- **`generation_config.json`**: normalized `pad_token_id` and `eos_token_id` to prevent decoding edge cases.

No architecture, tokenizer vocabulary, or model weights were modified.
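For reference, the format the added ChatML-style template renders can be sketched in plain Python. The actual `chat_template` in `tokenizer_config.json` is a Jinja string; this is only an illustrative equivalent of its output.

```python
# Illustrative plain-Python equivalent of a ChatML-style chat template.
# The real template shipped in tokenizer_config.json is Jinja; this sketch
# only demonstrates the prompt format it produces.
def render_chatml(messages, add_generation_prompt=True):
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = render_chatml([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Say hello."},
])
print(prompt)
```

The trailing `<|im_start|>assistant\n` is the generation prompt: it cues the model to begin its reply, which is why `/chat/completions` servers pass `add_generation_prompt=True`.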
Verification (remote)
python - <<'PY' from transformers import AutoTokenizer
repo_id = "syedahmedsoftware/broken-model-fixed" tok = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
print("tokenizer:", tok.class.name) print("has_chat_template:", bool(getattr(tok, "chat_template", None))) print("pad_token_id:", tok.pad_token_id, "eos_token_id:", tok.eos_token_id)
messages = [ {"role":"system","content":"You are helpful."}, {"role":"user","content":"Say hello."} ] prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False) print(" REMOTE prompt renders. Preview:") print(prompt[:250]) PY
### Expected output

```
tokenizer: Qwen2Tokenizer
has_chat_template: True
pad_token_id: 151643
eos_token_id: 151645
```

Rendered prompt:

```
<|im_start|>system
You are helpful.<|im_end|>
<|im_start|>user
Say hello.<|im_end|>
<|im_start|>assistant
```
This confirms that the model now supports OpenAI-style chat formatting.
## Problem (b) — Why `reasoning_effort` has no effect

### Root cause

`reasoning_effort` is not a native Transformers generation parameter. Unless the serving runtime explicitly maps this field to real compute or decoding policies, it is silently ignored. The base model itself has no internal mechanism to interpret the concept of "effort."
### What is required to make `reasoning_effort` meaningful

#### 1) Runtime orchestration (required)

The server must translate `reasoning_effort` into real execution strategies, for example:

- `low` → single pass, greedy decoding
- `medium` → temperature + nucleus sampling
- `high` → multi-sample + rerank
- `very_high` → tree-of-thought or verifier loop

The runtime must explicitly log which policy is applied.
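The mapping above can be sketched as a lookup table in the serving layer. Everything here is hypothetical: the policy names, parameter values, and `resolve_generation_kwargs` helper are illustrative, not part of any real runtime.

```python
# Hypothetical serving-layer mapping from the API-level reasoning_effort
# field to concrete decoding policies. Values are illustrative only.
EFFORT_POLICIES = {
    "low":       {"do_sample": False, "num_return_sequences": 1},
    "medium":    {"do_sample": True, "temperature": 0.7, "top_p": 0.9},
    "high":      {"do_sample": True, "temperature": 0.8, "top_p": 0.95,
                  "num_return_sequences": 4},  # sample N, rerank downstream
    "very_high": {"do_sample": True, "temperature": 1.0,
                  "num_return_sequences": 8},  # feed into a verifier loop
}

def resolve_generation_kwargs(reasoning_effort: str) -> dict:
    """Translate reasoning_effort into kwargs a generate() call understands,
    logging which policy was applied so the behavior is observable."""
    try:
        policy = EFFORT_POLICIES[reasoning_effort]
    except KeyError:
        raise ValueError(f"unknown reasoning_effort: {reasoning_effort!r}")
    print(f"applying policy for reasoning_effort={reasoning_effort}: {policy}")
    return dict(policy)

kwargs = resolve_generation_kwargs("high")
```

Rejecting unknown values and logging the applied policy are the key design points: without them, a typo in the request would silently fall back to default decoding, recreating the original no-op behavior.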
#### 2) Architectural support (at least one)

One or more of the following must exist:

- Multi-pass verifier or self-consistency loop
- Tree search or branch-and-bound reasoning
- Reflection / critique agent
- Budget-controlled decoding controller
- Control-token trained model

Without these, `reasoning_effort` remains a no-op.
## Final notes

These fixes:

- Restore deterministic chat formatting
- Enable production-safe batching
- Make the model compatible with OpenAI-style `/chat/completions` servers

The model is now deployable in real inference environments.