# qwen3-moe-aclnn
Pure C++ inference of **Qwen3-235B-A22B-Instruct** BF16 on **Ascend 910 × 16 NPUs**, built directly on the aclnn EAGER API (no graph compilation, no PyTorch, no ggml).
Chinese version: [README_zh.md](README_zh.md)
---
## Performance
Measured on Ascend 910 (initial generation) × 16 NPUs (TP=16) with Qwen3-235B-A22B-Instruct-2507 BF16 weights.
All numbers are **quality-preserving token-generation throughput (TG)**, with output manually verified and greedy decoding (`temperature=0`).
| Configuration | TG | Applicable prompts |
|---|---|---|
| Untuned baseline | 12 t/s | All |
| **Default recommended** (no PLD) | **~27 t/s** | **All prompts, stable output** |
| PLD with degeneration guard | 29-45 t/s | Structured text (essays, long-form answers) |
| PLD on creative prompts | 25-40 t/s | Stories / varied generation |
| PLD on factual / code prompts | unstable (21-95 t/s, high variance) | Not recommended |
Reference: `cann-recipes-infer` GE graph baseline reports ~54 t/s on the same hardware. **This project does not exceed that baseline**; it trades some peak speed for (a) no graph compilation, (b) no PyTorch dependency, (c) full control over operator scheduling.
### Key optimizations that contributed (in order of magnitude)
| Rank | Optimization | Gain | Where |
|---|---|---|---|
| 🥇 | HCCL env tuning (`AIV` + `FFTS` + `TASK_QUEUE=2`) | +89% (12→23 t/s) | `scripts/tp_launch.sh` |
| 🥈 | Fused RoPE via `aclnnApplyRotaryPosEmbV2` | +17% (23→27 t/s) | `include/rope.h` |
| 🥉 | Prompt Lookup Decoding (PLD) w/ degeneration guard | +10-60% on applicable prompts | `src/main_cli.cpp` |
| ○ | Device-side topk-w normalize, MoE argsort, cos/sin cache | ~+15% cumulative | `include/engine.h` |
| ○ | WorkspacePool (thread-local + retain-old, sketched below) | reduces alloc overhead | `include/workspace_pool.h` |
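The WorkspacePool row deserves a one-glance illustration. Below is a minimal sketch of the thread-local "retain-old" idea, assuming the standard `aclrt` allocation calls; names and error handling are illustrative only, and the real pool lives in `include/workspace_pool.h`.
```cpp
// Illustrative sketch only -- the real pool is include/workspace_pool.h.
// Idea: each thread keeps one device buffer and only reallocates when an op
// asks for more space than is currently retained ("retain-old").
#include <acl/acl.h>
#include <cstddef>

struct WorkspaceSketch {
    void*  ptr  = nullptr;
    size_t size = 0;

    // Returns a buffer of at least `needed` bytes, reusing the old one if large enough.
    void* acquire(size_t needed) {
        if (needed > size) {
            if (ptr) aclrtFree(ptr);
            aclrtMalloc(&ptr, needed, ACL_MEM_MALLOC_HUGE_FIRST);
            size = needed;
        }
        return ptr;
    }
    ~WorkspaceSketch() { if (ptr) aclrtFree(ptr); }
};

// One pool per thread, so concurrent layers never contend on the same buffer.
thread_local WorkspaceSketch g_workspace;
```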
---
## Architecture
**Model**: Qwen3-235B-A22B, 94 layers, 128 experts (top-k=8), GQA (64 Q heads, 4 KV heads), BF16.
**Parallelism**: TP=16 via HCCL ring AllReduce. Each rank holds a single KV head (there are only 4 KV heads for 16 ranks), so the 4 local Q heads (0-3) on each rank share that rank's local KV head 0.
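For concreteness, the head arithmetic implied by those numbers looks like the sketch below. It is purely illustrative; which global KV head lands on which rank is an assumption here, and the real logic is `compute_derived` in `src/model_config.cpp`.
```cpp
// Head counts from the model card: 64 Q heads, 4 KV heads, TP=16.
constexpr int N_Q_HEADS  = 64;
constexpr int N_KV_HEADS = 4;
constexpr int TP_SIZE    = 16;

constexpr int q_heads_per_rank  = N_Q_HEADS / TP_SIZE;   // 4 Q heads per rank
constexpr int kv_heads_per_rank = 1;                     // 4 KV heads < 16 ranks -> replicate
constexpr int ranks_per_kv_head = TP_SIZE / N_KV_HEADS;  // each KV head serves 4 ranks

// Assumed mapping (illustrative): rank r holds global Q heads [r*4, r*4+4) and one
// replicated KV head; locally, Q heads 0-3 all attend against local KV head 0.
int kv_head_for_rank(int rank) { return rank / ranks_per_kv_head; }
```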
**Execution**: aclnn EAGER mode. Every op goes through an `aclnn*` single-op API backed by the workspace pool; there is no graph capture and no GE IR. Async stream execution with `TASK_QUEUE_ENABLE=2` overlaps kernel submission.
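Every `aclnn*` single op in CANN follows the same two-phase EAGER call pattern: query the workspace size, then launch with an explicit workspace buffer on a stream. A minimal sketch with `aclnnCast` (chosen only for illustration; consult the CANN headers for exact signatures):
```cpp
// Two-phase EAGER pattern used for every aclnn* single op (error checks omitted).
#include <acl/acl.h>
#include <aclnnop/aclnn_cast.h>

void cast_to_fp32(const aclTensor* in, aclTensor* out, aclrtStream stream) {
    uint64_t ws_size = 0;
    aclOpExecutor* exec = nullptr;

    // Phase 1: ask the op how much device workspace this call needs.
    aclnnCastGetWorkspaceSize(in, ACL_FLOAT, out, &ws_size, &exec);

    // Phase 2: launch asynchronously on the stream with an explicit workspace buffer.
    // In this project the buffer comes from the WorkspacePool and is retained rather
    // than freed, since the launch is async.
    void* ws = nullptr;
    if (ws_size > 0) aclrtMalloc(&ws, ws_size, ACL_MEM_MALLOC_HUGE_FIRST);
    aclnnCast(ws, ws_size, exec, stream);
}
```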
**Tokenizer**: Uses HuggingFace `transformers` via a Python subprocess for encoding; vocab decode is pure C++ from an exported `vocab.bin`.
### Per-layer forward flow
```
x_in [S, D=4096]
  ↓
┌── Attention branch (TP: Q_DIM=512=4h×128, KV_DIM=128=1h×128) ──┐
│ RmsNorm(input_layernorm)
│ linear_hf q_proj / k_proj / v_proj → q, k, v
│ Per-head RmsNorm q_norm, k_norm
│ Fused RoPE: aclnnApplyRotaryPosEmbV2 (layout=1, "half")
│ Append K, V to per-layer KV cache
│ Mask selection:
│   prefill:      2048×2048 causal + sparse_mode=3
│   decode S=1:   nullptr + sparse_mode=0
│   batch decode: [1,1,S,past+S] custom bool mask + sparse_mode=0
│ FIAS (aclnnFusedInferAttentionScore)
│ o_proj linear_hf → partial per-rank
│ HCCL AllReduce (ring + AIV + FFTS) → full
└─────────┘
  ↓ residual add
┌── MoE branch ──┐
│ RmsNorm(post_attention_layernorm)
│ router linear_hf → logits [S, 128]
│ moe_gating_topk_softmax → topk_w[S,8], topk_idx[S,8]
│ Device-side normalize (reduce_sum + adds + cast + div)
│ moe_init_routing_v3 → expanded_x, expanded_ri, tokens_per_expert
│ grouped_matmul_v4 gate/up/down (SwiGLU activation)
│ Device-side argsort × 2 → fwd permutation (avoids host sync)
│ IndexSelect → packed
│ Broadcast-mul by topk_w + ReduceSum axis=1
│ HCCL AllReduce → full
└─────────┘
  ↓ residual add
x_out
```
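The "Device-side normalize" step rescales each token's 8 selected router weights so they sum to 1 before the expert outputs are mixed. A host-side reference of that math is sketched below; the real path stays on device via reduce_sum / adds / cast / div, and the epsilon-free form here is an assumption.
```cpp
// Host-side reference for the "Device-side normalize" step in the MoE branch:
// after moe_gating_topk_softmax, each token's top-k weights are renormalized
// so they sum to 1 before the weighted ReduceSum over expert outputs.
#include <cstddef>
#include <vector>

void normalize_topk_weights(std::vector<float>& topk_w, size_t S, size_t K = 8) {
    for (size_t s = 0; s < S; ++s) {
        float sum = 0.f;
        for (size_t k = 0; k < K; ++k) sum += topk_w[s * K + k];
        for (size_t k = 0; k < K; ++k) topk_w[s * K + k] /= sum;  // device path: reduce_sum + div
    }
}
```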
---
## Model weights
This project targets **Qwen3-235B-A22B-Instruct-2507** (BF16). About **470 GB** of safetensors shards.
**Download sources**:
- HuggingFace: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507
- ModelScope: https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B-Instruct-2507
Download via `huggingface-cli` or `modelscope` CLI:
```bash
# HuggingFace
huggingface-cli download Qwen/Qwen3-235B-A22B-Instruct-2507 --local-dir /path/to/Qwen3-235B-A22B-Instruct-2507-BF16
# ModelScope
modelscope download --model Qwen/Qwen3-235B-A22B-Instruct-2507 --local_dir /path/to/Qwen3-235B-A22B-Instruct-2507-BF16
```
**Weights format**: the binary reads HuggingFace `.safetensors` shards (multi-shard mmap), `config.json`, and `tokenizer.json` directly from the model directory. No conversion step is needed: point `--model-dir` at the downloaded directory.
**Expected directory contents**:
```
Qwen3-235B-A22B-Instruct-2507-BF16/
├── config.json
├── tokenizer.json
├── tokenizer_config.json
├── model-00001-of-000XX.safetensors
├── ...
└── model.safetensors.index.json
```
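Shard lookup uses the standard HuggingFace index file: `model.safetensors.index.json` carries a `weight_map` from tensor name to shard filename. A minimal resolution sketch with the bundled nlohmann/json (illustrative only; the real multi-shard mmap loader is `include/safetensors_loader.h`):
```cpp
// Minimal sketch: resolve which shard holds a given tensor via the standard
// HuggingFace index file. Illustration only -- no error handling.
#include <fstream>
#include <string>
#include "external/json.hpp"

std::string shard_for_tensor(const std::string& model_dir, const std::string& tensor_name) {
    std::ifstream f(model_dir + "/model.safetensors.index.json");
    nlohmann::json index;
    f >> index;
    // "weight_map": { "model.layers.0....weight": "model-00001-of-000XX.safetensors", ... }
    return index["weight_map"].at(tensor_name).get<std::string>();
}
```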
---
## Build
```bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
cmake -B build
cmake --build build -j8 --target qwen3-moe-aclnn
```
**Requires**:
- CANN 8.5.1 or compatible
- Python 3 + `transformers` + `torch_npu` (for tokenizer subprocess and reference-data generation only)
- C++17 compiler
- Ascend 910 × 16 NPUs
- nlohmann/json (bundled as `external/json.hpp`)
**Python environment setup**: the tokenizer calls a Python subprocess. Override the activation command via `QWEN3_PYENV_INIT` if your conda / venv layout differs from the default:
```bash
export QWEN3_PYENV_INIT="source /opt/my_conda/etc/profile.d/conda.sh && conda activate my_env && "
```
If unset, the default tries `${HOME}/miniconda3` with env `qwen3` and auto-sources the Ascend toolkit.
---
## Quick-start inference
```bash
# 1. Export tokenizer vocab to binary (one-time setup)
python3 scripts/export_vocab.py /path/to/Qwen3-235B-A22B-Instruct-2507-BF16
# 2. Run inference (TP=16)
./scripts/tp_launch.sh 16 ./build/qwen3-moe-aclnn \
--model-dir /path/to/Qwen3-235B-A22B-Instruct-2507-BF16 \
--prompt "The capital of France is" \
--n-predict 100 \
--temperature 0 \
--vocab tokenizer_data/vocab.bin
```
Expected: ~27 t/s, coherent output.
### Recommended flags by use case
**Universal default (stable, any prompt)**, no PLD:
```bash
./scripts/tp_launch.sh 16 ./build/qwen3-moe-aclnn --model-dir ... --temperature 0 --no-stream
```
**Structured / long-form (essays, explanations)**: PLD with the degeneration guard gives the largest gains (29-45 t/s):
```bash
./scripts/tp_launch.sh 16 ./build/qwen3-moe-aclnn --model-dir ... --pld --temperature 0 --no-stream
```
**Interactive REPL (multi-turn chat)**:
```bash
./scripts/tp_launch.sh 16 ./build/qwen3-moe-aclnn --model-dir ... \
--interactive --chat --temperature 0.7 --top-p 0.8
```
---
## PLD degeneration guard
Prompt Lookup Decoding (PLD) speeds up generation by having the model verify a batch of "draft" tokens in a single forward pass. The drafts are copied from the generation history via an n-gram match, as sketched below.
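A simplified version of the draft construction, assuming a single n-gram level (the real loop in `src/main_cli.cpp` adds multi-level n-gram fallback and the flags listed further down):
```cpp
// Simplified prompt-lookup draft: find the most recent earlier occurrence of the
// last `ngram` history tokens and propose the up-to-K tokens that followed it.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<int32_t> pld_draft(const std::vector<int32_t>& hist, size_t ngram, size_t K) {
    if (hist.size() < ngram + 1) return {};
    const auto pat_begin = hist.end() - ngram;                 // pattern = tail n-gram
    for (size_t start = hist.size() - ngram; start-- > 0; ) {  // scan backwards
        if (std::equal(pat_begin, hist.end(), hist.begin() + start)) {
            size_t cont = start + ngram;                       // continuation after the match
            size_t n = std::min(K, hist.size() - cont);
            return std::vector<int32_t>(hist.begin() + cont, hist.begin() + cont + n);
        }
    }
    return {};                                                 // no match: single-token decode
}
```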
**Known failure mode**: on prompts the model tends to repeat on (factual Q&A, code generation), the n-gram match feeds the model's own repetition back as drafts, creating a positive feedback loop that accelerates degenerate output. Early versions of this project reported misleading peak TG numbers driven by this loop.
**This project's guard** blocks suspect drafts with two heuristics:
1. **low-distinct**: the draft's distinct-token count is below a threshold → reject
2. **tail-echo**: all of the last N history tokens equal draft[0] → reject
Rejected drafts fall back to single-token decode. A `[warn]` line is emitted once if the generated tail shows 8 consecutive identical tokens.
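Both heuristics fit in a few lines. A hedged sketch using the documented defaults (`--pld-guard-distinct 3`, `--pld-guard-tail 6`); the real guard lives in `src/main_cli.cpp`:
```cpp
// Sketch of the two reject heuristics above (defaults mirror the flags below).
#include <algorithm>
#include <cstdint>
#include <unordered_set>
#include <vector>

bool guard_rejects(const std::vector<int32_t>& draft,
                   const std::vector<int32_t>& hist,
                   size_t min_distinct = 3, size_t tail_window = 6) {
    if (draft.empty()) return true;

    // 1) low-distinct: a draft made of too few distinct tokens is likely a loop.
    std::unordered_set<int32_t> distinct(draft.begin(), draft.end());
    if (distinct.size() < min_distinct) return true;

    // 2) tail-echo: if the last N history tokens all equal draft[0], the draft
    //    just feeds the model's own repetition back as speculation.
    if (hist.size() >= tail_window) {
        bool echo = std::all_of(hist.end() - tail_window, hist.end(),
                                [&](int32_t t) { return t == draft.front(); });
        if (echo) return true;
    }
    return false;  // accept: verify the draft in one batched forward pass
}
```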
Flags:
```
--pld enable PLD (opt-in)
--pld-k N draft window size (default: 10)
--pld-ngram N n-gram match size (default: 1, with multi-level fallback)
--pld-min-hist N skip PLD until history >= N tokens (default: 20)
--pld-no-guard disable the degeneration guard (dangerous: can produce dead loops)
--pld-guard-distinct N minimum distinct tokens in draft (default: 3)
--pld-guard-tail N tail-echo window (default: 6)
--pld-loop-warn N emit warning on N consecutive identical tokens (default: 8)
```
**Honest benchmarking**: use `scripts/bench_pld_safe.sh`, which classifies each run's output as OK / LOOP_N / LOW_DIVERSITY and separates TG statistics for OK-only vs degraded runs.
---
## Correctness verification
15+ unit / integration tests checked against Python (HuggingFace Transformers) reference:
```bash
./build/test_attention_layer # rel=4.9e-4 vs Python prefill
./build/test_attention_decode # rel=0 (bit-exact)
./build/test_moe_layer # rel=3.6e-3
./build/test_layer_forward # full single layer
./build/test_runner # multi-layer runner
./build/test_rope_fused # aclnnApplyRotaryPosEmbV2 vs manual HF rotate_half
./build/test_batch_decode # S=1..8 timing
./build/test_batch_correctness # argmax consistency
./build/test_op_support # 910-specific op availability
# Integration smoke:
./tests/test_chat_flow.sh # 7/7 PASS
```
Tests expect reference data under `tests/<name>_data/` generated by `scripts/gen_*_reference.py`. See each script's docstring.
---
## Environment tuning (auto-applied by `tp_launch.sh`)
```bash
HCCL_WHITELIST_DISABLE=1
HCCL_ALGO=level0:ring # ring, not fullmesh (fullmesh causes garbled output)
HCCL_BUFFSIZE=200 # sweet spot; 100 and 400 both slower
HCCL_OP_EXPANSION_MODE=AIV # key: AI Vector cores participate in reduce scheduling
HCCL_OP_BASE_FFTS_MODE_ENABLE=1 # key: Fast Frequently-used Transfer Scheduling
TASK_QUEUE_ENABLE=2 # key: aggressive async task submission
```
Removing any of the three "key" env vars drops TG by 20-40%.
---
## Directory layout
```
include/
├── acl_common.h          RAII wrappers, DeviceBuffer, make_contig_tensor
├── aclnn_ops.h           single-op wrappers + WorkspacePool integration
├── acl_runtime.h         AclRuntime (device + stream management)
├── device_weights.h      safetensors → device loading + TP sharding
├── engine.h              attention_forward + moe_forward + RopeCache
├── hccl_comm.h           HCCL init + allreduce + broadcast
├── model_config.h        Qwen3 hyperparameters + compute_derived
├── rope.h                apply_rope_fused (aclnnApplyRotaryPosEmbV2 wrapper)
├── runner.h              Runner class (prefill/decode/decode_batch/rewind/profile)
├── safetensors_loader.h  multi-shard safetensors mmap parser
├── tokenizer.h           vocab decode + Python subprocess encode
└── workspace_pool.h      thread-local aclnn workspace pool (retain-old)
src/
├── device_weights.cpp    load_attention (GQA fix), load_moe (permute sync fix)
├── main_cli.cpp          CLI entry + PLD main loop + degeneration guard + multi-turn
├── model_config.cpp      compute_derived (GQA KV sharding)
├── runner.cpp            Runner (build_batch_decode_mask_ etc.)
├── safetensors_loader.cpp
└── tokenizer.cpp
scripts/
├── tp_launch.sh          production launcher (auto-applies HCCL env)
├── bench_tg.sh           stable N-run TG measurement
├── bench_pld_safe.sh     PLD benchmark with output-correctness classifier
├── bench_hccl[_adv].sh   HCCL parameter sweep
├── bench_pld[_k].sh      PLD K × ngram sweep (legacy, prefer bench_pld_safe.sh)
├── export_vocab.py       vocab.bin exporter from HF tokenizer
└── gen_*_reference.py    per-op Python reference data generators
tests/
├── test_attention_*      attention correctness (prefill / decode)
├── test_moe_layer        MoE correctness
├── test_layer_forward    full single layer
├── test_runner           multi-layer Runner
├── test_rope_fused       fused RoPE vs manual HF
├── test_batch_*          batch decode timing + correctness
├── test_op_support       910-specific op availability probe
└── test_chat_flow.sh     end-to-end integration smoke
```
---
## CLI reference
```
--model-dir <path> (required) HF safetensors directory
--prompt "<text>" prompt text
--prompt-file FILE read prompt from file (avoids shell-escape issues)
--n-predict N maximum tokens to generate
--tp-size N tensor parallelism (or set TP_SIZE env)
--max-seq N KV cache + context cap (default: 512)
--temperature F 0 = greedy; typical 0.7
--top-k N 0 = disabled
--top-p F 1.0 = disabled
--seed N 0 = time-based
--chat apply Qwen3 chat template
--system "<text>" system role text (with --chat)
--interactive, -i REPL mode (multi-turn memory with --chat)
--reset force stateless REPL (reset KV between turns)
--no-stream batch-print final text instead of per-token streaming
--vocab <path> vocab.bin path (default: tokenizer_data/vocab.bin)
--pld* see "PLD degeneration guard" section
```
---
## Known limitations
- **Not yet reaching cann-recipes GE graph 54 t/s baseline** (currently ~27 t/s stable / up to ~45 t/s PLD).
Closing the gap requires one of: (a) real graph compilation, (b) fused collectives (`MatmulAllReduce`, `GroupedMatmulAllReduce`) which are absent on 910 initial-gen, (c) migration to 910B/A2/A3.
- **Only `tp_size` ∈ {1, 2, 4, 8, 16}** is supported; values that don't evenly divide the 64 Q heads will error.
- **PLD on factual/code prompts is unreliable**: it either produces baseline TG (the guard rejects most drafts) or enters partial degeneration that the classifier may not catch at low severity. Use `bench_pld_safe.sh` to evaluate honestly.
- **Tokenizer requires a Python subprocess**, which adds ~1 s of startup for the first encode. Override via the `QWEN3_PYENV_INIT` env var if the default conda path doesn't match.
- **NPU performance has high run-to-run variance** (up to 4× in some configurations) due to BF16 + MoE intrinsic non-determinism and shared hardware resources. Report medians over ≥5 runs.
---
## Future directions (prioritized)
1. **Draft Model Speculative Decoding** with Qwen3-0.6B: a more stable accept rate than n-gram PLD, expected +60-100% TG across prompt types (1-2 week implementation).
2. **HCCL AllReduce / compute overlap**: ~+10-15% in theory, limited by serial dependencies in the EAGER path.
3. **KV cache INT8 quantization**: reduces memory-bandwidth pressure, ~+15-25% on long contexts (pending op-support verification on initial-gen 910).
4. **W8 weight quantization**: ~+10-20% if aclnn quantization kernels exist on initial-gen 910.
Not recommended:
- `aclmdlRI` stream-capture-style graph recording (POC proved a 1.13× ceiling, not worth the engineering cost).
- Custom AscendC fused ops (high maintenance cost unless dedicated kernel engineer).
- torchair / torch.compile migration (breaks pure-C++ design).
---
## Documentation
- [`docs/optimization-summary-zh.md`](docs/optimization-summary-zh.md): phased optimization summary (in Chinese) covering the reasons behind the key optimizations, PLD correctness boundaries, and project-level lessons learned
- [`docs/next-steps-draft-model-speculative.md`](docs/next-steps-draft-model-speculative.md): execution spec for Draft Model Speculative Decoding (Qwen3-0.6B) covering the M1-M4 milestones, correctness-test protocol, and risk fallbacks
---
## License
Apache License 2.0; see `LICENSE`.