exlaw committed on
Commit c74b4fa · verified · 1 Parent(s): 0d62c3c

Upload README.md with huggingface_hub

Files changed (1): README.md (+97 −0)

README.md (ADDED):
---
license: other
license_name: sequential-hidden-decoding
license_link: LICENSE
base_model:
- Qwen/Qwen3-8B-Base
tags:
- sequential-hidden-decoding
- pretrained
- base-model
---

# Sequential-Hidden-Decoding-8B-n2

This is the **n=2** variant of Sequential Hidden Decoding, a method that scales sequence length by n× while adding only extra Embedding parameters: the same Transformer, with more compute per token.

- **Base model:** [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base)
- **Scale:** 2×
- **Additional Embedding Params:** 1.9B
- **Training Tokens:** 75B
- **Dtype:** bfloat16

> **Note:** This is a **base model** (not instruction-tuned). It is intended for benchmarking, text completion, and as a foundation for downstream fine-tuning (SFT / RLHF). For conversational or instruction-following use cases, please fine-tune it on your own data.

## Key Idea

Prepare *n* independent Embedding matrices that encode the same token sequence *n* times, interleave the results, and feed the *n*×-length sequence into the same Transformer. Only the last embedding of each token computes the next-token loss; the preceding embeddings serve as implicit reasoning steps in a continuous latent space.

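As a toy illustration of the idea above (plain Python with made-up sizes and variable names, not the actual training code), the interleaving and the loss mask can be sketched as:

```python
import random

random.seed(0)
vocab, d_model, n = 100, 8, 2        # toy sizes; this checkpoint uses n = 2
token_ids = [5, 17, 42]              # a length-3 input sequence

# n independent embedding tables: the only parameters added to the base model.
embeddings = [
    [[random.random() for _ in range(d_model)] for _ in range(vocab)]
    for _ in range(n)
]

# Encode the same sequence with every table and interleave the copies:
# [E0[t0], E1[t0], E0[t1], E1[t1], ...] -> an (n * T)-length input
# for the unchanged Transformer.
interleaved = [embeddings[k][t] for t in token_ids for k in range(n)]
assert len(interleaved) == n * len(token_ids)

# Only the last copy of each token is supervised with next-token loss;
# the earlier copies act as latent reasoning steps.
loss_positions = [i for i in range(len(interleaved)) if i % n == n - 1]
print(loss_positions)  # -> [1, 3, 5]
```

At n = 2 the Transformer therefore processes a sequence twice as long as the token sequence, which is where the extra per-token compute comes from.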
## Results

| Benchmark | # Shots | 8B Baseline | 8B scale n=2 | 8B scale n=4 | 8B scale n=8 |
|-----------|:-------:|:-----------:|:------------:|:------------:|:------------:|
| BBH (EM) | 3-shot | 78.8 | **81.3** | 83.0 | 83.9 |
| MMLU (EM) | 5-shot | 79.8 | **80.9** | 81.9 | 82.2 |
| MBPP+ (Pass@1) | 1-shot | 66.7 | **69.4** | 68.7 | 69.4 |
| MATH (LLM-judge) | 4-shot | 56.0 | **58.2** | 60.0 | 61.1 |
| ARC-C | 25-shot | 93.9 | **94.3** | 94.4 | 94.7 |
| HellaSwag | 10-shot | 79.7 | **83.1** | 85.0 | 85.3 |
| GSM8K | 4-shot | 92.5 | **93.3** | 93.9 | 94.6 |

## Serving (SGLang)

This model requires a patched version of [SGLang](https://github.com/sgl-project/sglang) for inference. See the [project page](https://huggingface.co/collections/tencent/sequential-hidden-decoding) for installation options (Docker image, forked repo, or manual patch).

```bash
python -m sglang.launch_server \
  --model-path tencent/Sequential-Hidden-Decoding-8B-n2 \
  --trust-remote-code \
  --tp-size 1 \
  --port 30000 --host 0.0.0.0 \
  --chunked-prefill-size -1 \
  --attention-backend fa3 \
  --mem-fraction-static 0.82 \
  --max-running-requests 32 \
  --context-length 131072 \
  --cuda-graph-max-bs 128 \
  --cuda-graph-bs 1 2 4 8 16 32 64 128
```

Once the server is running, query it through the OpenAI-compatible completions API:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.completions.create(
    model="tencent/Sequential-Hidden-Decoding-8B-n2",
    prompt="The meaning of life is",
    max_tokens=128,
    temperature=0,
)
print(response.choices[0].text)
```

## All Models

| Model | Scale | Additional Embedding Params | Training Tokens |
|-------|:-----:|:---------------------------:|:---------------:|
| [Sequential-Hidden-Decoding-8B-n2](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n2) | 2× | 1.9B | 75B |
| [Sequential-Hidden-Decoding-8B-n4](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n4) | 4× | 3.1B | 150B |
| [Sequential-Hidden-Decoding-8B-n8](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n8) | 8× | 5.6B | 187B |

## Citation

```bibtex
@article{hidden_decoding_2026,
  title = {Sequential Hidden Decoding: Scaling Sequence Length in Pretraining},
  year  = {2026},
  url   = {https://welm.weixin.qq.com/posts/hidden_decoding/}
}
```

## Contact

Sijun Zhang (nepheloturbulence@gmail.com), Aiwei Liu (liuaiwei20@gmail.com)

## License

This model is released under the [License Terms of Sequential-Hidden-Decoding](LICENSE).