KU-DFI
/

TelecomGPT-R1

Safetensors

qwen3_5

wbhVince829 committed on May 18

Commit

fd7747b

1 Parent(s): af478d3

update outreach 0518

README.md +211 -1

@@ -1,4 +1,214 @@
 ---
 license: apache-2.0
 ---
-hello

+# TelecomGPT-R1: The Best Open-Source Telecom Large Language Model
+> A 27B open model that ranks **#1 on the GSMA Open Telco Leaderboard**, **#1 among all open-source models by a 27-point margin**, and **beats GPT-5 on 6 of 7 benchmarks**.
+---
+## 1 — A New State of the Art for Telecom LLMs
+**TelecomGPT-R1 is the strongest publicly available large language model for telecommunications.** On the public **[GSMA Open Telco Leaderboard](https://huggingface.co/spaces/GSMA/open-telco-leaderboard)** — the standard benchmark suite that aggregates seven public telecom benchmarks across knowledge QA, protocol understanding, fault analysis, and modeling & computation — a single 27B open-source policy ranks **#1 overall out of 86 evaluated models**, **#1 among all open-source models by a 27-point margin**, **wins 6 of 7 benchmark match-ups against GPT-5**, and **leads the leaderboard's hardest axis (TeleTables) by +29.8 points over every other model on the board — open or closed.** No prior open model — and no general-purpose frontier model from OpenAI, Google, or Anthropic — comes close to this combination of breadth and depth on telecom tasks.
+![radar_chart_v0](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/-ZkxlB0p1XHmJCEDS6MKb.png)
+**Figure 1 | TelecomGPT-R1 vs frontier closed-source models on the GSMA Open Telco Leaderboard.** *Each spoke is one benchmark (plus the overall average), normalized by its per-axis leaderboard best so that `1.0` = best score on that benchmark. Our 27B open-source policy reaches `1.0` on four of eight axes (3GPP-TSG, TeleLogs, TeleTables, Average) and stays at or above `0.89` on every other axis — visibly tracing the outer edge of the radar where no other model can match it on all axes simultaneously.*
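The per-axis normalization described in the Figure 1 caption can be sketched as below. This is a minimal illustration only: `normalize_by_axis_best` is a hypothetical helper name, the scores are made-up placeholders rather than leaderboard values, and all scores are assumed to be positive.

```python
def normalize_by_axis_best(scores):
    """Divide every model's score on each axis by the best score on that axis.

    scores: {model_name: {axis_name: raw_score}}, all raw scores assumed > 0.
    Returns the same structure with values in (0, 1]; 1.0 marks the axis leader.
    """
    axes = {axis for per_model in scores.values() for axis in per_model}
    best = {
        axis: max(per_model[axis] for per_model in scores.values() if axis in per_model)
        for axis in axes
    }
    return {
        model: {axis: value / best[axis] for axis, value in per_model.items()}
        for model, per_model in scores.items()
    }


# Placeholder numbers for illustration only; they are NOT leaderboard results.
toy_scores = {
    "model_a": {"axis_1": 80.0, "axis_2": 45.0},
    "model_b": {"axis_1": 60.0, "axis_2": 90.0},
}
normalized = normalize_by_axis_best(toy_scores)
print(normalized["model_a"])
```

Under this scheme a model that leads an axis sits exactly on the outer ring for that spoke, which is why the caption reads `1.0` as "best score on that benchmark".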
+### Three takeaways
+> **1. #1 among all open-source models by +25.9 points.** TelecomGPT-R1 beats the next-best open model (DeepSeek-V3 at 685B parameters) by a margin larger than DeepSeek's own margin over a 14B base. Our 27B is the new floor for open-source telecom LLMs.
+>
+> **2. #1 on the leaderboard's hardest axis by +26.2 points.** TeleTables is the axis where every frontier closed-source model collapses below 50%. The self-rubric reward (introduced in §5) was designed for exactly this regime — and it shows.
+>
+> **3. #1 overall — past every operator-internal model too.** TelecomGPT-R1 edges out AT&T's operator-internal OTel-LLM-8.3B on the leaderboard's overall average by **+0.6 points**, carried by a **+29.8-point lead on TeleTables** that more than offsets per-benchmark losses on three knowledge-heavy axes. Open, closed, or operator-trained — no model on the GSMA Open Telco Leaderboard ranks above us.
+---
+## 2 — Toward Universal Telecom Reasoning
+Large language models have entered a new era. **General-purpose frontier models** (GPT-5, Claude-Opus-4.6, Gemini-3.1-Pro) write code, prove theorems, and solve olympiad-level problems; their long-CoT reasoning ability sets the modern bar. But step into telecommunications and these same models stumble. A workflow that asks an engineer to *recall a 3GPP clause, follow a multi-step procedure, read a log, and close a link-budget derivation in the same session* breaks them, because what looks like "one telecom problem" is really **four very different kinds of thinking layered on top of dense domain knowledge that their pretraining covers only thinly and never targets with reasoning-grade supervision**. **Telecom-specialized LLMs** (TelecomGPT, Tele-LLMs, and operator-internal models like AT&T's OTel-LLM and SoftBank's LTM) narrow this gap by training on domain corpora, but they treat each task in isolation, supervise only on *extractive / classification* outputs (answer-the-MCQ, label-the-Tdoc, fill-the-equation, summarize-the-code), and remain closed-source. The reasoning engineers actually do day-to-day (*chain three procedure steps, trace a KPI dip to a root cause, derive capacity bounds, write a MATLAB beamformer that compiles*) is exactly the part current telecom AI cannot do.
+**TelecomGPT-R1 closes this gap with a single open-source model built on top of [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B).** It inherits the TelecomGPT substrate and extends it into the **reasoning** regime across four telecom axes — knowledge QA, protocol understanding, fault analysis, and modeling & computation — under one policy.
+![four_axes_radia](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/IYs4rpe9Ij1e6KJf5qKy5.png)
+**Figure 2 | The four kinds of reasoning a telecom engineer juggles.** *Each scope shows one axis of telecom work — knowledge QA (15.3%), protocol understanding (22.7%), fault analysis (18.5%), modeling & computation (43.5%) — and the share of the 158,915-example TelecomGPT-R1 training corpus that targets it. The cross-axis distribution explains why we train one unified policy rather than four specialists: a real workflow mixes all four in the same session.*
+### What a telecom engineer actually does
+A telecom engineer's day cuts across four very different kinds of thinking — and a useful AI has to fluidly switch between them:
+| Job | What it looks like in practice |
+|---|---|
+| **Knowledge QA** | "What does 3GPP Release 18 say about [feature]? What's the typical PRACH timing budget?" |
+| **Protocol understanding** | Reading 3GPP / ITU / IETF specs and following multi-step procedure flows |
+| **Fault analysis** | Looking at a PCAP, RAN log, or KPI dashboard and finding the root cause |
+| **Modeling & Computation** | Closing a link-budget or queueing-theory derivation; reconstructing a system-model equation from a paper; writing a MATLAB beamformer that actually runs |
+### Why TelecomGPT-R1 matters
+> **1. It reasons instead of just retrieving.** Existing telecom LLMs stop at extractive answers; TelecomGPT-R1 chains multi-step 3GPP procedures, traces KPI dips to root causes from raw logs, closes link-budget and Shannon-capacity derivations end to end, and writes srsRAN-style code that compiles. It thinks through a telecom problem instead of pulling the closest paragraph.
+>
+> **2. An autonomous-agent core, not a chat sidekick.** Under one 27B policy, the model covers all four axes a telecom engineer rotates through in a day — knowledge QA, protocol understanding, fault analysis, modeling & computation. That makes it deployable as the reasoning core of an autonomous NOC operator, a spec-compliance bot, or a fault-triage copilot — automating slices of the engineer's workflow rather than living alongside it as another chatbot.
+>
+> **3. The strongest open brain for telecom — built to be extended.** TelecomGPT-R1 is **#1 on the GSMA Open Telco Leaderboard**, **+27 points clear of every other open model**, and runnable on a single H100 with weights, recipe, and training data all public. This is the open foundation that operator-specific fine-tunes, downstream telecom agents, and standards-grade drafting tools can build on — without routing operator-confidential traffic through a closed-API black box.
+---
+## 3 — How We Did It: The Recipe at a Glance
+The recipe rests on two design decisions: **(i)** treat the entire training corpus as **one unified whole** — 158,915 examples flowing through one shared eight-step curation pipeline before being indexed by axis, never as a stack of benchmark-specific subsets; and **(ii)** post-train with a **three-pillar GRPO reinforcement-learning recipe** that combines DAPO stabilization, an offline difficulty-mined curriculum with multi-stage continual KL anchoring, and a self-rubric reward that decomposes each rollout's score over a set of teacher- or reference-derived rubrics covering structure, logic, format, and key facts — never reducing to a single 0/1 outcome signal. The first decision makes ablations and reweighting modular; the second makes reinforcement learning *survive* on derivation-heavy axes where outcome-only rewards starve the gradient.
+![recipe_4stage_v0.png-2026-05-19-00-20-52-309](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/v0pnV58Y3uu3hqPQ6ZQeu.png)
+**Figure 3 | The TelecomGPT-R1 three-stage post-training recipe.** *Stage ① curates heterogeneous telecom sources through an eight-step pipeline into one axis-indexed 158,915-example corpus. Stage ② installs cross-axis long-CoT reasoning on [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) via LoRA-SFT. Stage ③ runs a single GRPO loop in which **DAPO** stabilizes the gradient, a **difficulty-mined curriculum** advances the prompt distribution from easy to hard, and a **self-rubric reward** — rubrics generated by a teacher LLM or projected from the expert reference, then scored as a decomposed sum of per-rubric binary indicators — replaces the sparse 0/1 outcome signal, yielding the final TelecomGPT-R1 27B open-weight policy.*
+Two design decisions, one idea:
+> **Treat all four reasoning axes as slices of one corpus and one policy — not as four specialists glued together at inference time.** Everything that follows is engineered around that single-policy claim.
+---
+## 4 — Stage 1: One Corpus, Eight Steps
+Rather than train one specialist per benchmark, we curate a **single** telecom post-training corpus that passes through **one shared eight-step pipeline** before being indexed by the four reasoning axes.
+### The eight pipeline steps at a glance
+| Step | What it does | Why it matters |
+|---|---|---|
+| **S1** — Source-grounded extraction | Modality-specific extractors (AST for code, VLM PDF parsing for textbooks, working-group label projection for specs, row-window slicing for tables, formula masking for math papers, engineering-feature aggregation for raw logs) | Different sources, **one common output schema** |
+| **S2** — Long-CoT generation | Three trace generators chosen by **reasoning type**: teacher LLM (with self-validation) for QA, executable-Python-grounded CoT for derivations, deterministic rule-replay CoT for diagnosis | Right tool for each reasoning type — not one teacher for everything |
+| **S3** — Multi-pass verification | Axis-matched verifiers (exact match / unit-tolerant numeric closeness / rule-replay accuracy / on-policy re-answering) | Bad CoTs never enter the corpus |
+| **S4** — Augmentation | Variable resampling 5×–20×; prefix/suffix decomposition into intermediate-target + final-target pairs | One seed → many supervised rows, with intermediate-step supervision |
+| **S5** — Leakage prevention | Cross-benchmark dedup vs. all public eval splits; SHA-256-archived test sets; "no implicit reference" prompt guards | No train→eval leakage; no hallucinated citations |
+| **S6** — Difficulty stratification | Offline difficulty mining; per-axis class rebalancing | The **same** difficulty pass that filters SFT data also feeds the RL curriculum (Stage 2) |
+| **S7** — Format unification | One `{system, user, assistant}` chat schema; fixed answer-format vocabulary; `meta.axis` and `meta.source_track` tags on every row | Train and ablate the corpus **as one whole**, with axis-/source-wise reweighting |
+| **S8** — Style mixing | Small fraction of general-domain long-CoT mixed in | Preserve reflective markers — *"wait…"*, *"hmm…"*, self-correction — that pure telecom traces lack |
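Step S5 in the table above mentions cross-benchmark deduplication and SHA-256-archived test sets. A minimal sketch of hash-based overlap removal follows. The normalization rule (lowercasing, collapsing whitespace), the exact-match criterion, and the function names are assumptions made for illustration; the README does not say how matches are detected, and a real pipeline might also use fuzzy or n-gram matching.

```python
import hashlib
import re


def fingerprint(text):
    """SHA-256 of a lightly normalized prompt (lowercased, whitespace collapsed)."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_eval_overlap(train_rows, eval_prompts):
    """Keep only training rows whose user prompt does not match any eval prompt."""
    eval_hashes = {fingerprint(prompt) for prompt in eval_prompts}
    return [row for row in train_rows if fingerprint(row["user"]) not in eval_hashes]


# Toy data for illustration only.
eval_prompts = ["What is the  subcarrier spacing for numerology 1?"]
train_rows = [
    {"user": "what is the subcarrier spacing for numerology 1?"},  # overlaps after normalization
    {"user": "Explain the purpose of the PRACH preamble."},
]
clean_rows = drop_eval_overlap(train_rows, eval_prompts)
print(len(clean_rows))
```

Storing only the hashes of the evaluation prompts lets the check run without keeping the test sets next to the training data.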
+### The result: one corpus, indexed by axis
+All eight steps converge on **one 158,915-example corpus indexed by reasoning axis** — and those same four axes are exactly the lens through which the seven public benchmarks on the GSMA Open Telco Leaderboard evaluate every telecom LLM.
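To make the unified row format concrete, here is a hedged sketch of one corpus row and a validator. The `{system, user, assistant}` keys and the `meta.axis` / `meta.source_track` tags come from step S7; the axis tag spellings, the `source_track` value, the row text, and `validate_row` itself are hypothetical.

```python
# Hypothetical tag vocabulary; the README names the four axes but not their tag strings.
AXES = {"knowledge_qa", "protocol_understanding", "fault_analysis", "modeling_computation"}
REQUIRED_KEYS = ("system", "user", "assistant", "meta")


def validate_row(row):
    """Return True if the row follows the unified chat schema, else raise ValueError."""
    missing = [key for key in REQUIRED_KEYS if key not in row]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    meta = row["meta"]
    if meta.get("axis") not in AXES:
        raise ValueError(f"unknown axis: {meta.get('axis')!r}")
    if not meta.get("source_track"):
        raise ValueError("meta.source_track must be a non-empty string")
    return True


# Illustrative row; the text and the source_track value are invented.
example_row = {
    "system": "You are a telecom reasoning assistant.",
    "user": "A cell shows a sudden drop in handover success rate. What do you check first?",
    "assistant": "<think>...</think> Final answer: ...",
    "meta": {"axis": "fault_analysis", "source_track": "ran_logs"},
}
print(validate_row(example_row))
```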
+#### Benchmark → axis mapping
+| Benchmark | Tests |
+|---|---|
+| **TeleQnA** (10k MCQ) | Knowledge QA |
+| **3GPP-TSG** (working-group classification) | Protocol understanding |
+| **ORANBench** (1.5k MCQ on O-RAN specs) | Protocol understanding |
+| **TeleTables** (table-grounded MCQ on 3GPP tables) | Protocol understanding |
+| **TeleLogs** (5G RAN root-cause analysis) | Fault analysis |
+| **TeleMath** (telecom math problem solving) | Modeling & computation |
+| **srsRANBench** (5G code understanding) | Modeling & computation |
+![data_radar_v0](https://cdn-uploads.huggingface.co/production/uploads/6882f57510e86d9f80580702/82qavUtMTYeOs3fEGcIWm.png)
+**Figure 4 | The 158,915-example unified telecom corpus, broken down by source track.** *Outer ring: five source families that fold into the four reasoning axes. Middle ring: individual sub-corpora. Inner radial bars: per-corpus row counts on a log scale.*
+> **Why this matters.** Because every row carries the same chat schema, the same answer-format vocabulary, and the same axis tag, **the corpus is trained and ablated as a single whole** with axis-aware reweighting — not as a stack of benchmark-specific subsets. That single-whole property is what makes the downstream RL recipe modular.
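Axis-aware reweighting over a single tagged corpus can be sketched as weighted sampling. This is an illustration under assumptions: `sample_with_axis_shares`, the target shares, and the toy rows are hypothetical, and the README does not describe the actual sampler.

```python
import random
from collections import Counter


def sample_with_axis_shares(rows, target_share, k, seed=0):
    """Draw k rows (with replacement) so each axis appears at roughly its target share.

    Each row's weight is its axis share divided by the number of rows on that axis,
    so a share of 0.0 ablates the axis entirely.
    """
    counts = Counter(row["meta"]["axis"] for row in rows)
    weights = [
        target_share.get(row["meta"]["axis"], 0.0) / counts[row["meta"]["axis"]]
        for row in rows
    ]
    return random.Random(seed).choices(rows, weights=weights, k=k)


# Toy corpus: 8 rows on one axis, 2 on another.
rows = (
    [{"id": i, "meta": {"axis": "knowledge_qa"}} for i in range(8)]
    + [{"id": 100 + i, "meta": {"axis": "fault_analysis"}} for i in range(2)]
)
ablated = sample_with_axis_shares(rows, {"knowledge_qa": 0.0, "fault_analysis": 1.0}, k=50)
balanced = sample_with_axis_shares(rows, {"knowledge_qa": 0.5, "fault_analysis": 0.5}, k=1000)
print(Counter(row["meta"]["axis"] for row in balanced))
```

Because every row carries the same tag, an ablation is a one-line change to the share dictionary rather than a rebuild of the dataset.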
+---
+## 5 — Stage 2: Three Reinforcement-Learning Pillars
+After supervised fine-tuning (LoRA adapters on a 27B open base), we refine the model with reinforcement learning. Vanilla RL on long-trace mixed-domain training breaks in well-known ways. We stack **three orthogonal fixes**, each tackling a different failure mode:
+| Pillar | What it fixes | What it does, in plain language |
+|---|---|---|
+| **1 — DAPO Stabilization** | Entropy collapse · zero-gradient groups · long-trace dilution | Asymmetric trust region (lets rare-but-good tokens through), dynamic-sampling filter (drops groups where every attempt is right or every attempt is wrong), token-level loss (long reasoning traces stop being underweighted). Plus: keep a small KL anchor to the SFT model so the structured output layout doesn't drift. |
+| **2 — Difficulty-Mined Curriculum** | Easy axes saturate too fast · hard axes never get gradient | Pre-filter prompts by their pass rate against the SFT model — keep only the ones where the model gets it right *some* of the time. Then train in stages, where each stage is anchored to the *previous* stage rather than to the static SFT model. Anchoring to the latest capable policy avoids catastrophic forgetting when harder slices are introduced. |
+| **3 — Self-Rubric Reward** | A 0/1 outcome reward starves derivation-heavy axes of any signal | For each training prompt, **pre-generate a set of fine-grained rubrics** with a strong teacher LLM (or project them directly from the expert reference solution), jointly covering *structure*, *logic*, *format*, and *key factual content*. At training time the current policy samples K responses; each response is scored as a **decomposed sum of per-rubric binary indicators**, yielding a dense per-attempt reward instead of a sparse 0/1 outcome signal. **GRPO** then computes its group-relative advantage from this rubric-decomposed reward and updates the policy parameters, so gradient still propagates through the model even when no rollout reaches the correct final answer. |
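For Pillar 1, the asymmetric trust region and token-level loss can be sketched as a scalar objective in plain Python. This is a simplified illustration, not the training code: the clip bounds `eps_low` and `eps_high` are illustrative values not given in this README, the KL anchor term is omitted, and a real implementation would operate on tensors with autograd.

```python
def dapo_token_level_loss(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate loss averaged over ALL tokens in the group.

    ratios:     one list per response, holding per-token ratios pi_new / pi_old.
    advantages: one scalar per response, shared by every token of that response.
    The upper clip bound (1 + eps_high) is looser than the lower one (1 - eps_low),
    which is the asymmetric trust region; dividing by the total token count
    (rather than averaging per response first) is the token-level loss.
    """
    total, n_tokens = 0.0, 0
    for token_ratios, adv in zip(ratios, advantages):
        for ratio in token_ratios:
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    return -total / n_tokens


# Toy group: a 1-token response with positive advantage, a 2-token one with negative.
loss = dapo_token_level_loss(ratios=[[1.5], [0.5, 1.0]], advantages=[1.0, -1.0])
print(round(loss, 4))
```

In the toy call, the ratio `1.5` is clipped to `1.28` for the positive-advantage token and the ratio `0.5` is clipped to `0.8` for the negative-advantage token, so each contributes the more pessimistic of its two surrogate values.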
+### How the three pillars compose into one training loop
+```
+prompt
+  → keep only if difficulty is "just right"        (Pillar 2)
+  → pre-generate rubrics per prompt                (Pillar 3, offline)
+       teacher LLM or expert reference solution
+       cover: structure · logic · format · facts
+  → current policy samples K responses
+  → score each response against its rubrics        (Pillar 3)
+       per-rubric binary indicator → summed dense reward
+  → drop groups with no learning signal            (Pillars 1 + 2 step-level)
+  → compute group-relative advantage
+  → stabilized GRPO update                          (Pillar 1)
+  → repeat
+```
+Each pillar plugs into the next. Pillar 2 supplies the prompt distribution, Pillar 3 shapes the per-attempt signal, and Pillar 1 turns the signal into a stable gradient step. **Remove any one of the three and final accuracy measurably drops**; the lessons in §6 explain why.
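Two steps of the loop above, the offline difficulty filter and the group-relative advantage with the "no learning signal" drop, can be sketched as follows. The function names, the pass-rate thresholds, and the use of the population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev


def is_just_right(pass_rate, low=0.0, high=1.0):
    """Keep a prompt only if the SFT model solves it some, but not all, of the time."""
    return low < pass_rate < high


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of K rollouts.

    Returns None when every rollout got (almost) the same reward, because such a
    group carries no learning signal and is dropped before the update.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma < eps:
        return None
    return [(reward - mu) / sigma for reward in rewards]


# Offline filter on hypothetical pass rates measured against the SFT model.
pass_rates = {"prompt_a": 0.0, "prompt_b": 0.5, "prompt_c": 1.0}
kept = [name for name, rate in pass_rates.items() if is_just_right(rate)]

# Dense rubric rewards for two groups of rollouts (toy values).
flat_group = group_relative_advantages([2.0, 2.0, 2.0])
mixed_group = group_relative_advantages([0.0, 1.0])
print(kept, flat_group, mixed_group)
```

With a dense reward, a group is dropped only when all rollouts tie, which happens less often than with a 0/1 outcome reward where "all wrong" groups are common on hard prompts.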
+### Three properties that make Pillar 3 the "universal" enabler
+| Property | Plain-language explanation |
+|---|---|
+| **Dense, decomposed credit** | A response that satisfies even a subset of its rubrics still receives non-zero reward, so groups of rollouts that all fail the final answer continue to produce a usable gradient — escaping the early "all rollouts wrong → no gradient" trap that plagues outcome-only RL. |
+| **Multi-dimensional supervision in one reward** | The rubrics per prompt jointly score *structure*, *logic*, *format*, and *key factual content* — so a single reward simultaneously shapes everything a long-CoT telecom response must satisfy, with no separate format-preservation or factuality-loss terms hanging off the loss. |
+| **Reference-grounded, runtime-cheap** | Rubrics are authored once per prompt — by a strong teacher LLM or projected from the expert reference solution — and at training time the per-rubric checks reduce to lightweight binary indicators. This buys teacher-quality grading criteria without paying a full LLM-judge inference on every rollout. |
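The rubric-decomposed reward can be sketched as a sum of binary checks. Everything concrete here is an assumption: the README does not say how the per-rubric indicators are implemented, so this sketch models each rubric as a simple predicate on the response text, and the rubric wording and example responses are invented.

```python
import re


def rubric_reward(response, rubrics):
    """Sum of per-rubric binary indicators (1 if the check passes, else 0)."""
    return sum(1 for check in rubrics.values() if check(response))


# Hypothetical rubrics for one Shannon-capacity prompt, one per dimension.
rubrics = {
    "structure: has a final-answer line": lambda r: "Final answer:" in r,
    "logic: applies the Shannon capacity formula": lambda r: "log2(1 + SNR)" in r,
    "format: reports the result in Mbit/s": lambda r: re.search(r"\bMbit/s\b", r) is not None,
    "facts: uses the 20 MHz bandwidth": lambda r: "20 MHz" in r,
}

partial = "C = B * log2(1 + SNR) with B = 20 MHz, but I am unsure about the number."
complete = "C = B * log2(1 + SNR) with B = 20 MHz and SNR = 15.\nFinal answer: 80 Mbit/s"

print(rubric_reward(partial, rubrics), rubric_reward(complete, rubrics))
```

The partial response never states a final answer, yet it still earns credit for the logic and facts rubrics, which is the "dense, decomposed credit" property from the table above.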
+---
+## 6 — Five Things We Learned
+| # | Lesson | Why it matters |
+|:---:|---|---|
+| **1** | **Domain knowledge — not reasoning ability — is the bottleneck.** | A strong general reasoner produces well-formed chains operating on **wrong telecom facts**. RL cannot manufacture knowledge that was never in the model. Invest in SFT data curation *first*. |
+| **2** | **Self-rubric reward is what makes the model universal.** | Without rubric-decomposed credit, a 27B base produces zero correct rollouts on derivation-heavy axes for hundreds of training steps, and RL gets no gradient. Pillar 3 is the difference between "knowledge-QA specialist that guesses on hard axes" and "universal reasoner". |
+| **3** | **Verifier rigor matters as much as reward weights.** | A permissive verifier silently rewards lucky digit matches and penalizes correct reasoning in the wrong format. Unit normalization, tolerance bands, symbolic equivalence, and code execution were all as important as choosing the reward weights themselves. |
+| **4** | **Difficulty-mined curriculum prevents axis collapse.** | Easy axes (RFC-style knowledge QA) saturate within hundreds of RL steps; hard axes (math, code, complex logs) keep improving. Without curriculum, easy axes hog the gradient and stall the rest. |
+| **5** | **Mixing general-domain CoT preserves reasoning style.** | Reflective markers — *"wait…"*, *"hmm…"*, self-correction — are too thin in pure telecom traces. A small general-domain mix improves both naturalness and hardest-axis accuracy. |
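Lesson 3 names unit normalization and tolerance bands as verifier ingredients. A minimal sketch of a unit-tolerant numeric check follows. The unit table, the 1% relative tolerance, and the function names are assumptions, and matching units case-insensitively is a simplification (it would conflate `mHz` with `MHz`).

```python
import math
import re

# Small illustrative unit table; a real verifier would cover many more units.
UNIT_SCALE = {"hz": 1.0, "khz": 1e3, "mhz": 1e6, "ghz": 1e9}
QUANTITY = re.compile(r"\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*([A-Za-z]+)\s*")


def parse_quantity(text):
    """Parse '<number> <unit>' into a value in base units, or None if malformed."""
    match = QUANTITY.fullmatch(text)
    if match is None:
        return None
    value, unit = match.groups()
    scale = UNIT_SCALE.get(unit.lower())
    return None if scale is None else float(value) * scale


def numerically_equal(predicted, reference, rel_tol=1e-2):
    """True if both answers parse and agree within the relative tolerance band."""
    pred, ref = parse_quantity(predicted), parse_quantity(reference)
    if pred is None or ref is None:
        return False
    return math.isclose(pred, ref, rel_tol=rel_tol)


print(numerically_equal("3.5 GHz", "3500 MHz"))   # same quantity, different units
print(numerically_equal("3.5", "3500 MHz"))       # missing unit is rejected
```

Rejecting an answer with no unit is the strict choice: it avoids rewarding a lucky digit match, at the cost of penalizing responses that state the unit elsewhere in the text.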
+---
+## 7 — In One Line
+> **For vertical-domain LLMs, knowledge curation deserves at least as much attention as the choice of RL algorithm.**
+TelecomGPT-R1 is what happens when you treat telecom reasoning as **one universal capability** — one corpus, one policy, four axes — and engineer the recipe around that single-whole property end-to-end.
+The model doesn't just *quote* the standard. It *reasons* through it.
+---
+## 8 — What This Opens Up Next
+TelecomGPT-R1 is a foundation, not an endpoint. Three directions are within immediate reach, and one is the long-term ambition:
+1. **Production telecom copilots.** Incident-response assistants for NOC operators, real-time fault-diagnosis bots over live log feeds, and spec-compliance automation for vendor implementations — all benefit from a single model that *reasons over heterogeneous evidence* (logs + tables + math + code) instead of stitching together a RAG pipeline per task. With an open-weight 27B reasoner that already leads TeleLogs and TeleTables, the path from research artifact to operations tooling is short.
+2. **Scaling the recipe — bigger model, more modalities.** The unified-corpus + three-pillar-GRPO recipe is parameter- and modality-agnostic. The same eight-step pipeline scales naturally to a 70B / 200B telecom reasoner; the axis-indexed corpus extends naturally to KPI dashboards, network-topology graphs, RF spectrum images, and protocol message-flow diagrams. Both extensions are mechanical to engineer once the single-whole property of the corpus is in place.
+3. **A transferable recipe for any structured-derivation vertical.** The pattern — *heterogeneous sources curated into one axis-indexed corpus, then trained with self-rubric rewards decomposed across structure / logic / format / facts* — is not telecom-specific. Power-grid operations, semiconductor manufacturing, clinical decision support, automotive safety analysis, and other infrastructure verticals all share the same shape (heterogeneous tasks with structured intermediate sub-goals) and should be directly amenable to it.
+4. **Standards-grade co-drafting (the long bet).** Once a reasoning model can simulate a 3GPP procedure flow, verify an equation derivation, and flag cross-spec inconsistencies, the line between *AI that learns from telecom* and *AI that contributes to telecom* begins to blur. We see a plausible path where a future descendant of TelecomGPT-R1 sits inside a 3GPP / IEEE / IETF working group as a drafting assistant — detecting protocol-flow ambiguities, suggesting equation simplifications, and surfacing inconsistencies across releases. This is the direction we are most excited about, and it is what motivates keeping the recipe open: standards are a public good, and the AI that helps draft them should be too.
+---
+### Resources
+- **Paper.** [arXiv link coming soon!]
+- **Model weights.** [HuggingFace link coming soon!]
+- **Unified benchmark.** [GSMA Open Telco Leaderboard](https://huggingface.co/spaces/GSMA/open-telco-leaderboard)
+### Citation
+```bibtex
+@inproceedings{wang2026telecomgptr1,
+  title     = {TelecomGPT-R1: Post-Training Recipes for Universal Reasoning in Telecom},
+  author    = {Wang, Bohao and Wu, Chenwei and Li, Haoyu and Zou, Hang and Tian, Yu
+               and Bariah, Lina and Huang, Chongwen and Shen, Zhang, Zhaoyang and Debbah, M\'{e}rouane},
+  booktitle = {[Venue coming soon!]},
+  year      = {2026}
+}
+@article{zou2025telecomgpt,
+  title     = {{TelecomGPT}: A Framework to Build Telecom-Specific Large Language Models},
+  author    = {Zou, Hang and Zhao, Qiyang and Tian, Yu and Bariah, Lina and Bader, Faouzi and Lestable, Thierry and Debbah, M\'{e}rouane},
+  journal   = {IEEE Transactions on Machine Learning in Communications and Networking},
+  year      = {2025},
+  publisher = {IEEE}
+}
+```
+### Acknowledgements
+This work was supported by the Digital Future Institute of Khalifa University; the College of Information Science and Electronic Engineering, Zhejiang University; the College of Computer Science and Technology, Zhejiang University; and the Research Computing team of Khalifa University.