---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- llm
- debugging
- inference
- qwen
---

# Q2 — Debugging LLMs (Friendli AI Take-Home)

**Fixed model repository:** https://huggingface.co/syedahmedsoftware/broken-model-fixed
**Original broken model:** https://huggingface.co/yunmorning/broken-model

This repository contains the minimal, production-safe fixes required to make the original model usable behind an OpenAI-compatible `/chat/completions` API server. No model weights were modified; only configuration metadata was corrected.

## Problem (a) — Why inference failed

The original repository could not serve chat requests because two required metadata components were missing or incorrect.

### Root causes

**1) Missing `chat_template` (`tokenizer_config.json`)**

Modern OpenAI-style runtimes rely on `tokenizer.apply_chat_template(messages)`. The original model did not define a chat template, which meant:

- Chat messages could not be rendered into prompts
- `/chat/completions` servers had no deterministic formatting
- Inference either failed or produced undefined behavior

**2) Missing / incorrect `pad_token_id` (`config.json`)**

Production inference relies on:

- Batched decoding
- Attention masking
- Padded sequences

The original `config.json` did not define a valid `pad_token_id`, which makes batching unsafe and causes runtime instability.

## Minimal fixes applied (no weight changes)

Only metadata was changed:

- **`tokenizer_config.json`** — added a ChatML-style `chat_template` so chat messages can be rendered correctly.
- **`config.json`** — set `pad_token_id` to match the tokenizer's pad token so batching and masking are safe.
- **`generation_config.json`** — normalized `pad_token_id` and `eos_token_id` to prevent decoding edge cases.

No architecture, tokenizer vocabulary, or model weights were modified.
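To make the ChatML format concrete, here is a minimal pure-Python sketch of what the added `chat_template` produces. The `render_chatml` helper is an illustrative stand-in for `tokenizer.apply_chat_template`, not the actual Jinja template stored in `tokenizer_config.json`:

```python
# Illustrative stand-in for tokenizer.apply_chat_template: shows the
# ChatML layout the added chat_template renders messages into.
def render_chatml(messages, add_generation_prompt=True):
    parts = []
    for m in messages:
        # Each turn is wrapped in <|im_start|>{role} ... <|im_end|> markers.
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model generates the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Say hello."},
]
print(render_chatml(messages))
# <|im_start|>system
# You are helpful.<|im_end|>
# <|im_start|>user
# Say hello.<|im_end|>
# <|im_start|>assistant
```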
## Verification (remote)

```bash
python - <<'PY'
from transformers import AutoTokenizer

repo_id = "syedahmedsoftware/broken-model-fixed"
tok = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
print("tokenizer:", tok.__class__.__name__)
print("has_chat_template:", bool(getattr(tok, "chat_template", None)))
print("pad_token_id:", tok.pad_token_id, "eos_token_id:", tok.eos_token_id)

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Say hello."},
]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print("REMOTE prompt renders. Preview:")
print(prompt[:250])
PY
```

Expected output:

```
tokenizer: Qwen2Tokenizer
has_chat_template: True
pad_token_id: 151643 eos_token_id: 151645
```

Rendered prompt:

```
<|im_start|>system
You are helpful.<|im_end|>
<|im_start|>user
Say hello.<|im_end|>
<|im_start|>assistant
```

This confirms that the model now supports OpenAI-style chat formatting.

## Problem (b) — Why `reasoning_effort` has no effect

### Root cause

`reasoning_effort` is not a native Transformers generation parameter. Unless the serving runtime explicitly maps this field to real compute or decoding policies, it is silently ignored. The base model itself has no internal mechanism to interpret the concept of "effort."

### What is required to make `reasoning_effort` meaningful

**1) Runtime orchestration (required)**

The server must translate `reasoning_effort` into real execution strategies, for example:

- `low` → single pass, greedy decoding
- `medium` → temperature + nucleus sampling
- `high` → multi-sample + rerank
- `very_high` → tree-of-thought or verifier loop

The runtime must explicitly log which policy is applied.

**2) Architectural support (at least one)**

One or more of the following must exist:

- Multi-pass verifier or self-consistency loop
- Tree search or branch-and-bound reasoning
- Reflection / critique agent
- Budget-controlled decoding controller
- Control-token trained model

Without these, `reasoning_effort` remains a no-op.
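The runtime orchestration described above can be sketched as a server-side routing table. This is a hedged illustration: the policy values, the `resolve_effort` name, and the log format are assumptions chosen for clarity, not a standard API of any serving runtime.

```python
# Illustrative sketch: mapping reasoning_effort to concrete decoding
# policies server-side. Field names/values are assumptions, not a spec.
EFFORT_POLICIES = {
    "low":       {"do_sample": False, "num_return_sequences": 1},   # greedy, single pass
    "medium":    {"do_sample": True, "temperature": 0.7, "top_p": 0.9},
    "high":      {"do_sample": True, "temperature": 0.8, "top_p": 0.95,
                  "num_return_sequences": 4},   # sample N, rerank downstream
    "very_high": {"do_sample": True, "temperature": 0.9,
                  "num_return_sequences": 8},   # feed a verifier / ToT loop
}

def resolve_effort(effort: str) -> dict:
    """Translate a reasoning_effort string into decoding kwargs."""
    if effort not in EFFORT_POLICIES:
        raise ValueError(f"unknown reasoning_effort: {effort!r}")
    policy = EFFORT_POLICIES[effort]
    # Explicitly log the applied policy, as required of the runtime.
    print(f"[effort-router] reasoning_effort={effort} -> {policy}")
    return policy
```

A server handler would merge the returned dict into the request's generation parameters before calling the model; without a mapping layer like this, the field is parsed and then dropped.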
## Final notes

These fixes:

- Restore deterministic chat formatting
- Enable production-safe batching
- Make the model compatible with OpenAI-style `/chat/completions` servers

The model is now deployable in real inference environments.