---
license: other
license_name: sequential-hidden-decoding
license_link: LICENSE
base_model:
  - Qwen/Qwen3-8B-Base
tags:
  - sequential-hidden-decoding
  - pretrained
  - base-model
---

# Sequential-Hidden-Decoding-8B-n2

This is the **n=2** variant of Sequential Hidden Decoding, a method that scales sequence length by n× while adding only Embedding parameters: the Transformer is unchanged, but each token receives more compute.

- **Base model:** [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base)
- **Scale:** 2×
- **Additional Embedding Params:** 1.9B
- **Training Tokens:** 75B
- **Dtype:** bfloat16

> **Note:** This is a **base model** (not instruction-tuned). It is intended for benchmarking, text completion, and as a foundation for downstream fine-tuning (SFT / RLHF). For conversational or instruction-following use cases, please fine-tune on your own data.

## Key Idea

Prepare *n* independent Embedding matrices, encode the same token sequence once with each, interleave the results, and feed the resulting *n*×-length sequence into the same Transformer. The next-token loss is computed only at the last embedding of each token; the preceding embeddings serve as implicit reasoning steps in a continuous latent space.

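
The interleaving described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the released implementation: the helper names (`make_embedding_table`, `interleave_embeddings`, `loss_mask`), the tiny vocabulary and dimension, and the random table values are all hypothetical, and the Transformer itself is omitted.

```python
import random


def make_embedding_table(vocab_size, dim, seed):
    """Hypothetical stand-in for one learned Embedding matrix (random values)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def interleave_embeddings(token_ids, tables):
    """Encode token_ids once per table and interleave the results.

    Output position t * n + k holds tables[k][token_ids[t]], so each token
    occupies n consecutive positions in the expanded sequence.
    """
    sequence = []
    for token_id in token_ids:
        for table in tables:
            sequence.append(table[token_id])
    return sequence


def loss_mask(num_tokens, n):
    """True only at the last of each token's n positions."""
    return [k == n - 1 for _ in range(num_tokens) for k in range(n)]


n = 2                    # this model's scale factor
vocab_size, dim = 10, 4  # toy sizes, chosen only for illustration
tables = [make_embedding_table(vocab_size, dim, seed=k) for k in range(n)]
token_ids = [3, 1, 4]

expanded = interleave_embeddings(token_ids, tables)
mask = loss_mask(len(token_ids), n)

print(len(expanded))  # 6: three tokens, two positions each
print(mask)           # [False, True, False, True, False, True]
```

In this sketch the expanded sequence is what the unchanged Transformer would consume, and the mask marks the positions where a next-token loss would be applied.
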
## Results

Scores for this model (n=2) are shown in bold.

| Benchmark | # Shots | 8B Baseline | 8B scale n=2 | 8B scale n=4 | 8B scale n=8 |
|-----------|:-------:|:-----------:|:------------:|:------------:|:------------:|
| BBH (EM) | 3-shot | 78.8 | **81.3** | 83.0 | 83.9 |
| MMLU (EM) | 5-shot | 79.8 | **80.9** | 81.9 | 82.2 |
| MBPP+ (Pass@1) | 1-shot | 66.7 | **69.4** | 68.7 | 69.4 |
| MATH (LLM-judge) | 4-shot | 56.0 | **58.2** | 60.0 | 61.1 |
| ARC-C | 25-shot | 93.9 | **94.3** | 94.4 | 94.7 |
| Hellaswag | 10-shot | 79.7 | **83.1** | 85.0 | 85.3 |
| GSM8K | 4-shot | 92.5 | **93.3** | 93.9 | 94.6 |

## Serving (SGLang)

This model requires a patched version of [SGLang](https://github.com/sgl-project/sglang) for inference. See the [project page](https://github.com/Tencent/Sequential-Hidden-Decoding) for installation options (Docker image, forked repo, or manual patch).

```bash
python -m sglang.launch_server \
    --model-path tencent/Sequential-Hidden-Decoding-8B-n2 \
    --trust-remote-code \
    --tp-size 1 \
    --port 30000 --host 0.0.0.0 \
    --chunked-prefill-size -1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.82 \
    --max-running-requests 32 \
    --context-length 131072 \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128
```

```python
from openai import OpenAI

# Point the OpenAI client at the local SGLang server started above.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Base model: use the plain completions endpoint, not chat completions.
response = client.completions.create(
    model="tencent/Sequential-Hidden-Decoding-8B-n2",
    prompt="The meaning of life is",
    max_tokens=128,
    temperature=0,
)
print(response.choices[0].text)
```

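
Because this is a base model, benchmark-style queries are usually phrased as few-shot completions rather than instructions. The helper below is a minimal sketch of building such a prompt; `build_few_shot_prompt` and its `Question:`/`Answer:` template are illustrative assumptions, not the format used for the reported results.

```python
def build_few_shot_prompt(examples, question):
    """Format (question, answer) pairs followed by an unanswered question.

    The "Question:/Answer:" template is an illustrative assumption, not the
    format used for the reported benchmark results.
    """
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


examples = [
    ("What is 2 + 3?", "5"),
    ("What is 7 - 4?", "3"),
]
prompt = build_few_shot_prompt(examples, "What is 6 + 1?")
print(prompt)
```

The resulting string can be passed as `prompt=` in the client call above. Adding a `stop` sequence such as `"\n\nQuestion:"` would likely keep the model from continuing into a new example, though the best stop string depends on the template you choose.
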
## All Models

| Model | Scale | Embedding Params | Training Tokens |
|-------|:-----:|:----------------:|:---------------:|
| [Sequential-Hidden-Decoding-8B-n2](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n2) | 2× | 1.9B | 75B |
| [Sequential-Hidden-Decoding-8B-n4](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n4) | 4× | 3.1B | 150B |
| [Sequential-Hidden-Decoding-8B-n8](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n8) | 8× | 5.6B | 187B |

## Citation

```bibtex
@article{hidden_decoding_2026,
  title = {Hidden Decoding: Scaling Sequence Length in Pretraining},
  year  = {2026},
  url   = {https://welm.weixin.qq.com/posts/hidden_decoding/}
}
```

## License

This model is released under the [License Terms of Sequential-Hidden-Decoding](LICENSE).