update quickstart after local test
README.md
CHANGED
|
@@ -28,7 +28,9 @@ license: apache-2.0
|
|
| 28 |
|
| 29 |
### Quickstart
|
| 30 |
|
| 31 |
-
**Requirements:** `transformers >=
|
| 32 |
|
| 33 |
Here is a code snippet demonstrating how to load TelecomGPT-R1 with `transformers` and generate a telecom-grounded response:
|
| 34 |
|
|
@@ -83,6 +85,14 @@ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
|
|
| 83 |
print(response)
|
| 84 |
```
|
| 85 |
|
| 86 |
For production / batch serving on operator-confidential data, host with [vLLM](https://github.com/vllm-project/vllm):
|
| 87 |
|
| 88 |
```bash
|
|
@@ -94,11 +104,10 @@ vllm serve KU-DFI/TelecomGPT-R1 \
|
|
| 94 |
|
| 95 |
(Scale `--tensor-parallel-size`, `--max-model-len`, and `--gpu-memory-utilization` up as needed for multi-GPU nodes or higher-throughput serving.)
|
| 96 |
|
| 97 |
-
**Hardware**: Following the official [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) deployment guidance, TelecomGPT-R1 (27B, bf16) runs on a single **A100 80GB** (or equivalent **H100 80GB** / **MI300X**) with the default settings above.
|
| 98 |
|
| 99 |
**Smaller TelecomGPT-R1 variants — coming soon.** Lighter checkpoints better suited to **edge / device-side** inference are currently in training, extending the family from data-center GPU deployment down toward on-device telecom intelligence.
|
| 100 |
|
| 101 |
-
|
| 102 |
---
|
| 103 |
|
| 104 |
|
|
|
|
| 28 |
|
| 29 |
### Quickstart
|
| 30 |
|
| 31 |
+
**Requirements (verified):** `transformers >= 5.0`, `torch >= 2.4`, and — for vLLM serving — `vllm >= 0.19`. TelecomGPT-R1's tokenizer relies on the `TokenizersBackend` class introduced in `transformers 5.x`, so older 4.x releases fail at tokenizer load. The end-to-end stack we verified is `transformers 5.3.0.dev0` + `torch 2.10.0+cu128` + `vllm 0.19.1`.
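
The version floors above can be checked before loading the model. The following is a minimal sketch, not part of the original README: the helper names are hypothetical, and it compares only the numeric release segment, so suffixes such as `.dev0` or `+cu128` are ignored.

```python
import re
from importlib import metadata

# Minimum versions quoted in the requirements line above.
MINIMUMS = {"transformers": "5.0", "torch": "2.4", "vllm": "0.19"}


def release_tuple(version: str) -> tuple[int, ...]:
    """Leading numeric release segment, e.g. '2.10.0+cu128' -> (2, 10, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release segments numerically, so 2.10 counts as newer than 2.4."""
    have, need = release_tuple(installed), release_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def check_environment(minimums=MINIMUMS) -> dict[str, str]:
    """Report the installed version of each package against its floor."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "not installed"
            continue
        ok = meets_minimum(installed, minimum)
        report[package] = f"{installed} ({'ok' if ok else 'need >= ' + minimum})"
    return report


if __name__ == "__main__":
    for package, line in check_environment().items():
        print(f"{package:14s}{line}")
```

`vllm` is only needed for serving, so a "not installed" line for it is harmless when using `transformers` alone.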
|
| 32 |
+
|
| 33 |
+
**First-call download.** The first `from_pretrained("KU-DFI/TelecomGPT-R1")` pulls ~51 GB into `~/.cache/huggingface/`. Expect about 17 minutes at a sustained 50 MB/s, and keep at least 60 GB free on the cache filesystem.
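
As a back-of-envelope check on the download figures, the sketch below estimates transfer time and inspects free space on the cache filesystem. It is an illustration, not part of the original README: the function names are hypothetical, and it assumes the default cache location (setting `HF_HOME` relocates the cache).

```python
import shutil
from pathlib import Path

CHECKPOINT_GB = 51      # download size quoted above
REQUIRED_FREE_GB = 60   # free-space recommendation quoted above


def download_minutes(size_gb: float, rate_mb_per_s: float) -> float:
    """Transfer time in minutes, using decimal units (1 GB = 1000 MB)."""
    return size_gb * 1000 / rate_mb_per_s / 60


def free_gb(path: Path) -> float:
    """Free space in GB on the filesystem that holds `path`."""
    probe = path.expanduser()
    # Walk up to the nearest existing directory so a not-yet-created cache still works.
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1e9


if __name__ == "__main__":
    cache = Path("~/.cache/huggingface")  # assumed default cache location
    print(f"estimated download: {download_minutes(CHECKPOINT_GB, 50):.0f} min at 50 MB/s")
    available = free_gb(cache)
    verdict = "ok" if available >= REQUIRED_FREE_GB else "insufficient"
    print(f"free space: {available:.0f} GB ({verdict}, need {REQUIRED_FREE_GB} GB)")
```

Real transfers rarely hold a constant rate, so treat the time as a lower bound.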
|
| 34 |
|
| 35 |
Here is a code snippet demonstrating how to load TelecomGPT-R1 with `transformers` and generate a telecom-grounded response:
|
| 36 |
|
|
|
|
| 85 |
print(response)
|
| 86 |
```
|
| 87 |
|
| 88 |
+
**Expected timing & fast-path libraries.** Loading the bf16 weights takes roughly 30–60 s on a single 80 GB GPU (we measured 22 s sharded across 8× H200). Out of the box, generation runs in the slow torch fallback at ~10–20 tok/s on a single sequence (we measured 15.8 tok/s on 8× H200, bf16). For faster inference, install Qwen3.5's optional fast-path kernels:
|
| 89 |
+
|
| 90 |
+
```bash
|
| 91 |
+
pip install flash-linear-attention causal-conv1d
|
| 92 |
+
```
|
| 93 |
+
|
| 94 |
+
(See [flash-linear-attention](https://github.com/fla-org/flash-linear-attention) and [causal-conv1d](https://github.com/Dao-AILab/causal-conv1d) for build details.)
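
To confirm that the optional kernels are visible to the current Python environment, a small import probe can help. This is a sketch under assumptions: the import names `fla` and `causal_conv1d` are taken to be the modules the two packages install, and finding them only shows they are installed, not that the model code path actually uses them.

```python
from importlib.util import find_spec

# Assumed mapping from pip package name to top-level import name.
FAST_PATH_MODULES = {
    "flash-linear-attention": "fla",
    "causal-conv1d": "causal_conv1d",
}


def fast_path_available(modules=FAST_PATH_MODULES) -> dict[str, bool]:
    """Map each package to whether its top-level module can be located."""
    return {package: find_spec(module) is not None for package, module in modules.items()}


if __name__ == "__main__":
    for package, found in fast_path_available().items():
        print(f"{package:24s}{'found' if found else 'missing (torch fallback likely)'}")
```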
|
| 95 |
+
|
| 96 |
For production / batch serving on operator-confidential data, host with [vLLM](https://github.com/vllm-project/vllm):
|
| 97 |
|
| 98 |
```bash
|
|
|
|
| 104 |
|
| 105 |
(Scale `--tensor-parallel-size`, `--max-model-len`, and `--gpu-memory-utilization` up as needed for multi-GPU nodes or higher-throughput serving.)
|
| 106 |
|
| 107 |
+
**Hardware**: Following the official [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) deployment guidance, TelecomGPT-R1 (27B, bf16) runs on a single **A100 80GB** (or equivalent **H100 80GB** / **MI300X**) with the default settings above. The bf16 weights occupy ~54 GB. At `--gpu-memory-utilization 0.85`, vLLM may use 68 GB of an 80 GB card, which leaves roughly 14 GB for KV cache: enough for context lengths up to ~8K tokens on a single GPU. Longer contexts (16K, 32K) or larger batches require multi-GPU sharding (e.g. `--tensor-parallel-size 2` or more), deployed behind an operator firewall.
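
The KV-cache headroom quoted above follows from simple arithmetic, sketched below. This is an illustrative estimate, not part of the original README: the function name is hypothetical, and the model ignores activations, CUDA context, and other runtime overhead, so real headroom will be somewhat smaller.

```python
def kv_cache_budget_gb(
    gpu_gb: float = 80.0,
    utilization: float = 0.85,
    weights_gb: float = 54.0,
    tensor_parallel: int = 1,
) -> float:
    """Memory left for KV cache across all GPUs after the weights are loaded.

    Assumes the weights are split evenly across `tensor_parallel` GPUs and
    that each GPU may use `utilization` of its memory.
    """
    usable = gpu_gb * utilization * tensor_parallel
    return usable - weights_gb


if __name__ == "__main__":
    for tp in (1, 2, 4):
        budget = kv_cache_budget_gb(tensor_parallel=tp)
        print(f"--tensor-parallel-size {tp}: ~{budget:.0f} GB for KV cache")
```

Translating the budget into a token count depends on the per-token KV footprint of the architecture, which this sketch deliberately leaves out.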
|
| 108 |
|
| 109 |
**Smaller TelecomGPT-R1 variants — coming soon.** Lighter checkpoints better suited to **edge / device-side** inference are currently in training, extending the family from data-center GPU deployment down toward on-device telecom intelligence.
|
| 110 |
|
|
|
|
| 111 |
---
|
| 112 |
|
| 113 |
|