---
license: apache-2.0
language:
- en
- kk
base_model:
- Qwen/Qwen3-0.6B
datasets:
- issai/foggen-data
- issai/KazCulture
pipeline_tag: text-generation
tags:
- edge-cloud-routing
- verbalized-confidence
- self-aware
- routing
- continual-learning
- multi-round
library_name: transformers
---

# FogGen: Self-Aware Edge–Cloud LLM Router

> **A 0.6B parameter edge LLM trained to emit a calibrated verbalized confidence score before its answer, enabling efficient edge–cloud routing without an external router.**

![FogGen overview: (a) self-aware routing at inference, (b) self-evolving training loop](./foggen_overview.png)

FogGen is a small, self-aware edge model that knows when to answer locally and when to defer to a stronger cloud model. At inference (figure (a)) it emits a confidence score followed by an answer in one forward pass; if the confidence satisfies `c ≥ τ` the local answer is returned, otherwise the query is routed to the cloud.

Training (figure (b)) is a self-evolving loop: in each round, the current checkpoint self-samples N=8 generations per question to derive confidence buckets, and is then fine-tuned (SFT) on `(question, confidence, answer)` triples. The released checkpoint is the endpoint (`R14`) of a 14-round chain trained across seven domains: finance, science, coding, law, math, Kazakh culture, and medical.

## Quick demo

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("issai/foggen", torch_dtype="bfloat16", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("issai/foggen")

SYSTEM = """You are a self-aware multiple-choice assistant.
Rules:
- Do not output tags.
- First, assess your confidence in solving this question.
- Then give your answer.
- Output format:
Confidence: <0.0|0.25|0.5|0.75|1.0>
Final answer: """

question = """A firm reports $400M in total liabilities and $600M in shareholders' equity. What is the firm's debt-to-equity ratio?
A. 0.67
B. 1.00
C. 1.50
D. 2.00"""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": question},
]

# return_dict=True yields a mapping with input_ids and attention_mask,
# which can be unpacked directly into generate().
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True,
    enable_thinking=False,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
# Expected:
# Confidence: 1.0
# Final answer: A
```

## How routing works

```python
import re

def route_query(model_output: str, tau: float = 0.5):
    """Parse FogGen output. Returns (action, confidence, answer).

    action is 'keep_local' if confidence >= tau and an answer was parsed,
    else 'route_to_cloud'.
    """
    conf_match = re.search(r"Confidence\s*:\s*(\d+(?:\.\d+)?)", model_output)
    ans_match = re.search(r"Final\s+answer\s*:\s*([A-D])", model_output)
    if not conf_match:
        # No parseable confidence: defer to the cloud.
        return "route_to_cloud", None, None
    confidence = float(conf_match.group(1))
    answer = ans_match.group(1) if ans_match else None
    # A confident output without a parseable answer is also deferred,
    # so the caller never receives 'keep_local' with answer=None.
    if answer is None or confidence < tau:
        return "route_to_cloud", confidence, answer
    return "keep_local", confidence, answer
```

At τ=0.5 on the trained domains, the model routes ~22% of queries to the cloud while achieving 67.8% mean system accuracy.
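
The trade-off between system accuracy and cloud usage depends on the threshold τ. The following is a minimal sketch of how both quantities could be computed for a given τ, assuming each evaluated query is recorded with its verbalized confidence and the correctness of the local and cloud answers. The `Example` record, the `evaluate_threshold` helper, and the `TOY` data are hypothetical and invented purely for illustration; they are not part of the FogGen release and do not reproduce the reported numbers.

```python
from dataclasses import dataclass


@dataclass
class Example:
    confidence: float    # verbalized confidence bucket emitted by the edge model
    local_correct: bool  # whether the edge model's own answer was correct
    cloud_correct: bool  # whether the cloud model's answer was correct


def evaluate_threshold(examples, tau):
    """Return (system_accuracy, cloud_fraction) for routing threshold tau.

    A query is kept local when confidence >= tau and routed to the cloud otherwise.
    """
    routed = [ex.confidence < tau for ex in examples]
    correct = [
        ex.cloud_correct if is_routed else ex.local_correct
        for ex, is_routed in zip(examples, routed)
    ]
    n = len(examples)
    return sum(correct) / n, sum(routed) / n


# Toy records, invented for illustration only (not FogGen evaluation data).
TOY = [
    Example(1.0, True, True),
    Example(1.0, True, True),
    Example(0.75, True, True),
    Example(0.75, False, True),
    Example(0.5, True, False),
    Example(0.25, False, True),
    Example(0.25, False, True),
    Example(0.0, False, False),
]

# Sweep the five confidence buckets as candidate thresholds.
for tau in (0.0, 0.25, 0.5, 0.75, 1.0):
    acc, frac = evaluate_threshold(TOY, tau)
    print(f"tau={tau:.2f}  system_acc={acc:.3f}  cloud_fraction={frac:.3f}")
```

On this toy data, raising τ from 0.0 to 0.5 increases system accuracy from 0.5 to 0.75 at the price of sending 37.5% of queries to the cloud; raising it further to 1.0 doubles the cloud share without improving accuracy.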
## Model details

| | |
|---|---|
| **Base model** | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) |
| **Parameters** | 0.6 B |
| **Training method** | LoRA SFT (rank=16, α=32, all-linear), bf16, 2 epochs/round |
| **Rounds** | 14 sequential rounds (R0 → R14) |
| **Training data** | ~1800 SFT rows × 14 rounds |
| **Domains** | finance, science, coding, law, math, Kazakh culture, medical |
| **Cloud teacher** | [Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) |
| **Output format** | `Confidence: \nFinal answer: ` |
| **Confidence buckets** | 5 discrete values: 0.0, 0.25, 0.5, 0.75, 1.0 |
| **License** | Apache 2.0 (inherited from base) |

## Performance

System accuracy at τ=0.5 on seven MCQ domains (full test sets, ~16,200 questions), compared against random routing and a cloud-only baseline (Qwen3-30B-A3B-Instruct-2507). "R14 raw" is the accuracy of the edge model answering every query locally; "Cloud routed" is the share of queries FogGen sends to the cloud.

| Domain | Cloud only | R14 raw | Random @ τ=0.5 | **FogGen @ τ=0.5** | Cloud routed |
|---|---|---|---|---|---|
| Finance | 69.5% | 57.0% | 59.9% | **65.8%** | 23.3% |
| Science | 72.7% | 56.9% | 60.1% | **64.5%** | 20.4% |
| Coding | 74.2% | 61.8% | 64.2% | **69.5%** | 19.7% |
| Law | 70.7% | 55.3% | 58.4% | **62.4%** | 20.1% |
| Math | 60.1% | 42.2% | 50.8% | **58.1%** | 47.7% |
| Kazakh culture | 95.8% | 91.3% | 91.4% | **91.9%** | 1.0% |
| Medical | 74.0% | 52.6% | 57.1% | **62.2%** | 20.9% |
| **Mean** | **73.9%** | **59.6%** | **63.1%** | **67.8%** | **21.9%** |

Mean lift over Random at τ=0.5: **+4.6** percentage points (system accuracy minus random-routing accuracy, averaged across the seven domains).
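
The mean lift can be recomputed from the per-domain rows of the table above. The short sketch below does this using only the Random and FogGen columns as printed; the variable names are illustrative. Averaging the per-domain differences gives about 4.64, which rounds to the reported +4.6, whereas subtracting the already-rounded column means (67.8 − 63.1) would give 4.7.

```python
# (domain, random-routing accuracy, FogGen accuracy) at tau = 0.5,
# copied from the Performance table.
ROWS = [
    ("Finance", 59.9, 65.8),
    ("Science", 60.1, 64.5),
    ("Coding", 64.2, 69.5),
    ("Law", 58.4, 62.4),
    ("Math", 50.8, 58.1),
    ("Kazakh culture", 91.4, 91.9),
    ("Medical", 57.1, 62.2),
]

# Per-domain lift in percentage points, rounded to one decimal for display.
lifts = {domain: round(foggen - random_acc, 1) for domain, random_acc, foggen in ROWS}

# Mean lift: average of the unrounded per-domain differences.
mean_lift = sum(foggen - random_acc for _, random_acc, foggen in ROWS) / len(ROWS)

for domain, lift in lifts.items():
    print(f"{domain:15s} +{lift:.1f}")
print(f"Mean lift: +{mean_lift:.1f}")
```

The largest lift is in math, where nearly half of the queries are routed, and the smallest is in Kazakh culture, where the edge model is already close to the cloud model and only 1.0% of queries are routed.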
### Baseline comparison

Direct comparison against AutoMix (Aggarwal et al., 2024) on the same R14 checkpoint, same evaluation sets:

| Method | SysAcc | Cloud routed | Δ over Random | Fwd passes / query |
|---|---|---|---|---|
| AutoMix | 67.2% | 29.0% | +3.7 | 9 (1 answer + 8 verify) |
| **FogGen (ours)** | **67.8%** | **21.9%** | **+4.6** | **1** |

FogGen achieves higher accuracy at lower cloud cost and 9× lower per-query inference cost.

## Open-ended generalization

The MCQ-trained chain transfers to open-ended task types zero-shot. Local accuracy and routing benefit at τ=0.5 on three held-out OE benchmarks:

| Benchmark | Format | R14 raw | R14 Δ@τ=0.5 |
|---|---|---|---|
| [SQuAD v1.1](https://huggingface.co/datasets/rajpurkar/squad) | extractive RC | 81.0% | +1.4 |
| [TruthfulQA gen](https://huggingface.co/datasets/truthfulqa/truthful_qa) | adversarial factual | 36.5% | −0.7 (anti-calibrated) |
| [GSM8K](https://huggingface.co/datasets/openai/gsm8k) (CoT) | math word-problems | 52.0% | +2.2 |

One additional round of OE training (R15, 1876 SFT rows) lifts local accuracy on these three benchmarks to 86.5% / 40.0% / 58.0% respectively; see [`issai/foggen-r15-oe`](https://huggingface.co/issai/foggen-r15-oe).

## Citation

Paper coming soon.

## Acknowledgements

Thanks to the Qwen team at Alibaba for the base model and cloud teacher.