KU-DFI/TelecomGPT-R1

Safetensors

qwen3_5

wbhVince829 committed 26 days ago

Commit 3cde620 (verified) · 1 parent: 99f638a

update quickstart after local test

Files changed (1)

README.md +12 -3

@@ -28,7 +28,9 @@ license: apache-2.0
 ### Quickstart
-**Requirements:** `transformers >= 4.51.0`, `torch >= 2.1`.
 Here is a code snippet demonstrating how to load TelecomGPT-R1 with `transformers` and generate a telecom-grounded response:
@@ -83,6 +85,14 @@ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
 print(response)
 ```
 For production / batch serving on operator-confidential data, host with [vLLM](https://github.com/vllm-project/vllm):
 ```bash
@@ -94,11 +104,10 @@ vllm serve KU-DFI/TelecomGPT-R1 \
 (Scale `--tensor-parallel-size`, `--max-model-len`, and `--gpu-memory-utilization` up as needed for multi-GPU nodes or higher-throughput serving.)
-**Hardware**: Following the official [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) deployment guidance, TelecomGPT-R1 (27B, bf16) runs on a single **A100 80GB** (or equivalent **H100 80GB** / **MI300X**) with the default settings above. Multi-GPU nodes allow longer contexts and larger batches behind an operator firewall.
 **Smaller TelecomGPT-R1 variants — coming soon.** Lighter checkpoints better suited to **edge / device-side** inference are currently in training, extending the family from data-center GPU deployment down toward on-device telecom intelligence.
 ---

 ### Quickstart
+**Requirements (verified):** `transformers >= 5.0`, `torch >= 2.4`, and — for vLLM serving — `vllm >= 0.19`. TelecomGPT-R1's tokenizer relies on the `TokenizersBackend` class introduced in `transformers 5.x`, so older 4.x releases fail at tokenizer load. The end-to-end stack we verified is `transformers 5.3.0.dev0` + `torch 2.10.0+cu128` + `vllm 0.19.1`.
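The version floors in the Requirements line can be checked before loading the model. The following is a minimal sketch, not part of the README: the helper names (`release_tuple`, `meets_minimum`, `check_environment`) are hypothetical, and it compares only the leading numeric release components, so a pre-release such as `5.0.0rc1` is treated as `5.0.0`. Comparing tuples rather than strings matters here because `"2.10"` sorts before `"2.4"` as text.

```python
import re
from importlib import metadata

# Minimum versions copied from the Requirements line; vllm is only needed for serving.
MINIMUMS = {"transformers": (5, 0), "torch": (2, 4), "vllm": (0, 19)}


def release_tuple(version: str) -> tuple:
    """Return the leading numeric components, e.g. '2.10.0+cu128' -> (2, 10, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version: str, minimum: tuple) -> bool:
    """Compare numerically so that 2.10 counts as newer than 2.4."""
    return release_tuple(version) >= minimum


def check_environment(minimums=MINIMUMS) -> dict:
    """Map each package name to 'ok', 'too old (<version>)', or 'not installed'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed})"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```

A `not installed` result for `vllm` is harmless if only the `transformers` path is used.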
+**First-call download.** The first `from_pretrained("KU-DFI/TelecomGPT-R1")` pulls ~51 GB into `~/.cache/huggingface/`; expect ~15 minutes on a 50 MB/s connection and at least 60 GB of free disk on the cache filesystem.
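The download figures above can be sanity-checked with a short script. This is a sketch under stated assumptions: it uses decimal units (1 GB = 1000 MB), assumes the default cache location quoted above has not been relocated, and the helper names are hypothetical. With the quoted 51 GB and 50 MB/s, the plain arithmetic gives about 17 minutes, the same order as the rough ~15-minute estimate.

```python
import shutil
from pathlib import Path

DOWNLOAD_GB = 51        # approximate checkpoint size quoted above
REQUIRED_FREE_GB = 60   # free-disk recommendation quoted above


def download_minutes(size_gb: float, rate_mb_per_s: float) -> float:
    """Transfer time in minutes, using decimal units (1 GB = 1000 MB)."""
    return size_gb * 1000 / rate_mb_per_s / 60


def free_gb(path: Path) -> float:
    """Free space, in decimal GB, on the filesystem that holds `path`."""
    probe = path
    while not probe.exists():   # walk up to the nearest existing directory
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1e9


if __name__ == "__main__":
    cache_dir = Path("~/.cache/huggingface").expanduser()  # assumed default location
    print(f"estimated download: {download_minutes(DOWNLOAD_GB, 50):.0f} min at 50 MB/s")
    available = free_gb(cache_dir)
    verdict = "enough" if available >= REQUIRED_FREE_GB else "NOT enough"
    print(f"free on cache filesystem: {available:.0f} GB ({verdict})")
```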
 Here is a code snippet demonstrating how to load TelecomGPT-R1 with `transformers` and generate a telecom-grounded response:
 print(response)
 ```
+**Expected timing & fast-path libraries.** Loading the bf16 weights takes roughly 30–60 s on a single 80 GB GPU (we measured 22 s sharded across 8× H200). Out of the box, generation runs in the slow torch fallback at ~10–20 tok/s on a single sequence (we measured 15.8 tok/s on 8× H200, bf16). For faster inference, install Qwen3.5's optional fast-path kernels:
+```bash
+pip install flash-linear-attention causal-conv1d
+```
+(See [flash-linear-attention](https://github.com/fla-org/flash-linear-attention) and [causal-conv1d](https://github.com/Dao-AILab/causal-conv1d) for build details.)
 For production / batch serving on operator-confidential data, host with [vLLM](https://github.com/vllm-project/vllm):
 ```bash
 (Scale `--tensor-parallel-size`, `--max-model-len`, and `--gpu-memory-utilization` up as needed for multi-GPU nodes or higher-throughput serving.)
+**Hardware**: Following the official [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) deployment guidance, TelecomGPT-R1 (27B, bf16) runs on a single **A100 80GB** (or equivalent **H100 80GB** / **MI300X**) with the default settings above. The bf16 weights occupy ~54 GB, leaving roughly 14 GB of an 80 GB card for KV cache at `--gpu-memory-utilization 0.85` — enough for context lengths up to ~8K tokens on a single GPU. Longer contexts (16K, 32K) or larger batches require multi-GPU sharding (e.g. `--tensor-parallel-size 2` or more) behind an operator firewall.
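The single-GPU memory budget in the Hardware paragraph is simple arithmetic, sketched below with hypothetical helper names. It deliberately stops at gigabytes: converting the budget into a token count needs the model's per-token KV size, which the README does not state, and the sketch ignores activation memory and other runtime overhead that the serving engine reserves.

```python
def kv_cache_budget_gb(card_gb: float, utilization: float, weights_gb: float) -> float:
    """Memory left for KV cache after the weights, inside the utilization cap."""
    return card_gb * utilization - weights_gb


def utilization_for(card_gb: float, weights_gb: float, kv_gb: float) -> float:
    """Utilization fraction needed to fit the weights plus a target KV budget."""
    return (weights_gb + kv_gb) / card_gb


if __name__ == "__main__":
    # Figures from the paragraph above: 80 GB card, 0.85 utilization, ~54 GB of weights.
    budget = kv_cache_budget_gb(80, 0.85, 54)
    print(f"KV cache budget: {budget:.1f} GB")
    # A negative budget means the weights alone do not fit on one card of that size.
    print(f"on a 40 GB card: {kv_cache_budget_gb(40, 0.85, 54):.1f} GB")
    print(f"utilization needed for 18 GB of KV cache: {utilization_for(80, 54, 18):.2f}")
```

When the budget for a target context length exceeds what one card can offer, the remaining option is the one the paragraph names: shard with a larger `--tensor-parallel-size`.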
 **Smaller TelecomGPT-R1 variants — coming soon.** Lighter checkpoints better suited to **edge / device-side** inference are currently in training, extending the family from data-center GPU deployment down toward on-device telecom intelligence.
 ---