---
license: other
license_name: sequential-hidden-decoding
license_link: LICENSE
base_model:
- Qwen/Qwen3-8B-Base
tags:
- sequential-hidden-decoding
- pretrained
- base-model
---

# Sequential-Hidden-Decoding-8B-n2

This is the **n=2** variant of Sequential Hidden Decoding, a method that scales sequence length by n× using only additional embedding parameters: the same Transformer, with more compute per token.

- **Base model:** [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base)
- **Scale:** 2×
- **Additional embedding params:** 1.9B
- **Training tokens:** 75B
- **Dtype:** bfloat16

> **Note:** This is a **base model** (not instruction-tuned). It is intended for benchmarking, text completion, and as a foundation for downstream fine-tuning (SFT / RLHF). For conversational or instruction-following use cases, please fine-tune on your own data.

## Key Idea

Prepare *n* independent embedding matrices that encode the same token sequence *n* times, interleave the results, and feed the resulting *n*×-length sequence into the same Transformer. Only the last embedding of each token computes the next-token loss; the preceding embeddings serve as implicit reasoning steps in a continuous latent space.
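
As a concrete sketch of the interleaving, here is a minimal PyTorch toy for n=2; the shapes and variable names are illustrative assumptions, not the released model's actual code:

```python
# Toy sketch of n=2 input construction for Sequential Hidden Decoding.
# Shapes and names here are illustrative, not the model's real implementation.
import torch
import torch.nn as nn

n = 2                                   # number of embedding copies per token
vocab_size, d_model = 100, 16
token_ids = torch.tensor([[5, 9, 3]])   # (batch=1, seq_len=3)

# n independent embedding matrices over the same vocabulary
embeds = nn.ModuleList(nn.Embedding(vocab_size, d_model) for _ in range(n))

# Embed the same sequence n times: (n, batch, seq_len, d_model)
stacked = torch.stack([e(token_ids) for e in embeds])

# Interleave so each token's n embeddings are adjacent: (batch, n*seq_len, d_model)
b, s = token_ids.shape
interleaved = stacked.permute(1, 2, 0, 3).reshape(b, s * n, d_model)

# Only the last embedding of each token carries the next-token loss;
# the earlier copies act as latent "reasoning" positions.
loss_mask = torch.zeros(s * n, dtype=torch.bool)
loss_mask[n - 1 :: n] = True
```

The 2×-long sequence is then fed to the unchanged Transformer, so the extra capacity comes from added compute per token rather than from new Transformer weights.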

## Results

| Benchmark | # Shots | 8B Baseline | 8B scale n=2 | 8B scale n=4 | 8B scale n=8 |
|-----------|:-------:|:-----------:|:------------:|:------------:|:------------:|
| BBH (EM) | 3-shot | 78.8 | **81.3** | 83.0 | 83.9 |
| MMLU (EM) | 5-shot | 79.8 | **80.9** | 81.9 | 82.2 |
| MBPP+ (Pass@1) | 1-shot | 66.7 | **69.4** | 68.7 | 69.4 |
| MATH (LLM-judge) | 4-shot | 56.0 | **58.2** | 60.0 | 61.1 |
| ARC-C | 25-shot | 93.9 | **94.3** | 94.4 | 94.7 |
| HellaSwag | 10-shot | 79.7 | **83.1** | 85.0 | 85.3 |
| GSM8K | 4-shot | 92.5 | **93.3** | 93.9 | 94.6 |

## Serving (SGLang)

This model requires a patched version of [SGLang](https://github.com/sgl-project/sglang) for inference. See the [project page](https://huggingface.co/collections/tencent/sequential-hidden-decoding) for installation options (Docker image, forked repo, or manual patch).

```bash
python -m sglang.launch_server \
  --model-path tencent/Sequential-Hidden-Decoding-8B-n2 \
  --trust-remote-code \
  --tp-size 1 \
  --port 30000 --host 0.0.0.0 \
  --chunked-prefill-size -1 \
  --attention-backend fa3 \
  --mem-fraction-static 0.82 \
  --max-running-requests 32 \
  --context-length 131072 \
  --cuda-graph-max-bs 128 \
  --cuda-graph-bs 1 2 4 8 16 32 64 128
```

Query the running server through the OpenAI-compatible completions API:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.completions.create(
    model="tencent/Sequential-Hidden-Decoding-8B-n2",
    prompt="The meaning of life is",
    max_tokens=128,
    temperature=0,
)
print(response.choices[0].text)
```

## All Models

| Model | Scale | Embedding Params | Training Tokens |
|-------|:-----:|:----------------:|:---------------:|
| [Sequential-Hidden-Decoding-8B-n2](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n2) | 2× | 1.9B | 75B |
| [Sequential-Hidden-Decoding-8B-n4](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n4) | 4× | 3.1B | 150B |
| [Sequential-Hidden-Decoding-8B-n8](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n8) | 8× | 5.6B | 187B |

## Citation

```bibtex
@article{hidden_decoding_2026,
  title = {Sequential Hidden Decoding: Scaling Sequence Length in Pretraining},
  year  = {2026},
  url   = {https://welm.weixin.qq.com/posts/hidden_decoding/}
}
```

## Contact

Sijun Zhang (nepheloturbulence@gmail.com), Aiwei Liu (liuaiwei20@gmail.com)

## License

This model is released under the [License Terms of Sequential-Hidden-Decoding](LICENSE).