Safetensors
qwen3_5
wbhVince829 committed
Commit 3cde620 · verified · 1 Parent(s): 99f638a

update quickstart after local test

Files changed (1)
  1. README.md +12 -3
README.md CHANGED
@@ -28,7 +28,9 @@ license: apache-2.0
 
  ### Quickstart
 
- **Requirements:** `transformers >= 4.51.0`, `torch >= 2.1`.
+ **Requirements (verified):** `transformers >= 5.0`, `torch >= 2.4`, and — for vLLM serving — `vllm >= 0.19`. TelecomGPT-R1's tokenizer relies on the `TokenizersBackend` class introduced in `transformers 5.x`, so older 4.x releases fail at tokenizer load. The end-to-end stack we verified is `transformers 5.3.0.dev0` + `torch 2.10.0+cu128` + `vllm 0.19.1`.
+
+ **First-call download.** The first `from_pretrained("KU-DFI/TelecomGPT-R1")` pulls ~51 GB into `~/.cache/huggingface/`; expect ~15 minutes on a 50 MB/s connection and at least 60 GB of free disk on the cache filesystem.
 
  Here is a code snippet demonstrating how to load TelecomGPT-R1 with `transformers` and generate a telecom-grounded response:
 
@@ -83,6 +85,14 @@ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  print(response)
  ```
 
+ **Expected timing & fast-path libraries.** Loading the bf16 weights takes roughly 30–60 s on a single 80 GB GPU (we measured 22 s sharded across 8× H200). Out of the box, generation runs in the slow torch fallback at ~10–20 tok/s on a single sequence (we measured 15.8 tok/s on 8× H200, bf16). For faster inference, install Qwen3.5's optional fast-path kernels:
+
+ ```bash
+ pip install flash-linear-attention causal-conv1d
+ ```
+
+ (See [flash-linear-attention](https://github.com/fla-org/flash-linear-attention) and [causal-conv1d](https://github.com/Dao-AILab/causal-conv1d) for build details.)
+
  For production / batch serving on operator-confidential data, host with [vLLM](https://github.com/vllm-project/vllm):
 
  ```bash
@@ -94,11 +104,10 @@ vllm serve KU-DFI/TelecomGPT-R1 \
 
  (Scale `--tensor-parallel-size`, `--max-model-len`, and `--gpu-memory-utilization` up as needed for multi-GPU nodes or higher-throughput serving.)
 
- **Hardware**: Following the official [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) deployment guidance, TelecomGPT-R1 (27B, bf16) runs on a single **A100 80GB** (or equivalent **H100 80GB** / **MI300X**) with the default settings above. Multi-GPU nodes allow longer contexts and larger batches behind an operator firewall.
+ **Hardware**: Following the official [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) deployment guidance, TelecomGPT-R1 (27B, bf16) runs on a single **A100 80GB** (or equivalent **H100 80GB** / **MI300X**) with the default settings above. The bf16 weights occupy ~54 GB, leaving roughly 14 GB of an 80 GB card for KV cache at `--gpu-memory-utilization 0.85`, enough for context lengths up to ~8K tokens on a single GPU. Longer contexts (16K, 32K) or larger batches require multi-GPU sharding (e.g. `--tensor-parallel-size 2` or more) behind an operator firewall.
 
  **Smaller TelecomGPT-R1 variants — coming soon.** Lighter checkpoints better suited to **edge / device-side** inference are currently in training, extending the family from data-center GPU deployment down toward on-device telecom intelligence.
 
-
  ---
 
 
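
As a rough aid for checking the requirements and sizing figures this commit adds to the README, the sketch below turns them into a small pre-flight script. It is an illustration only: the helper names (`release_tuple`, `meets_minimum`, `download_minutes`, `kv_cache_headroom_gb`) are hypothetical and not part of the README, the version floors and sizes are copied from the diff above, and the headroom figure simply repeats the README's own arithmetic (card size times utilization, minus weights), which ignores activation and runtime overhead.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Version floors quoted in the updated README (vllm only matters for serving).
MINIMUMS = {"transformers": (5, 0), "torch": (2, 4), "vllm": (0, 19)}


def release_tuple(ver: str) -> tuple:
    """Leading numeric release of a version string, e.g. '2.10.0+cu128' -> (2, 10, 0).

    Simplification: dev/rc/local suffixes are dropped, so '5.0.0rc1' counts as 5.0.0.
    """
    match = re.match(r"\d+(?:\.\d+)*", ver)
    if match is None:
        raise ValueError(f"unparseable version: {ver!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: tuple) -> bool:
    """Compare numerically, so that '2.10' correctly ranks above '2.4'."""
    have = release_tuple(installed)
    width = max(len(have), len(minimum))
    have_padded = have + (0,) * (width - len(have))
    min_padded = minimum + (0,) * (width - len(minimum))
    return have_padded >= min_padded


def download_minutes(size_gb: float, rate_mb_per_s: float) -> float:
    """Minutes to pull size_gb at a sustained rate (decimal units, 1 GB = 1000 MB)."""
    return size_gb * 1000 / rate_mb_per_s / 60


def kv_cache_headroom_gb(card_gb: float, utilization: float, weights_gb: float) -> float:
    """Memory left for KV cache inside the --gpu-memory-utilization budget."""
    return card_gb * utilization - weights_gb


if __name__ == "__main__":
    for package, minimum in MINIMUMS.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            print(f"{package}: not installed")
            continue
        status = "ok" if meets_minimum(installed, minimum) else "too old"
        print(f"{package} {installed}: {status}")

    # Figures from the README: ~51 GB download at 50 MB/s; 80 GB card, 0.85, ~54 GB weights.
    print(f"download estimate: {download_minutes(51, 50):.0f} min")
    print(f"KV-cache headroom: {kv_cache_headroom_gb(80, 0.85, 54):.0f} GB")
```

With the README's numbers, the download estimate comes out at 17 minutes (the same ballpark as the ~15 minutes quoted) and the headroom at 14 GB, matching the figure in the updated **Hardware** paragraph. The numeric tuple comparison is there because a plain string comparison would rank `torch 2.10` below `2.4`.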