---
license: other
license_name: sequential-hidden-decoding
license_link: LICENSE
base_model:
- tencent/Sequential-Hidden-Decoding-8B-n8
- Qwen/Qwen3-8B-Base
tags:
- sequential-hidden-decoding
- instruct
- text-generation
- conversational
---
# Sequential-Hidden-Decoding-8B-n8-Instruct
This is the instruction-tuned variant of **Sequential Hidden Decoding 8B n=8**, designed for conversational and instruction-following use cases.
- **Base model:** [Sequential-Hidden-Decoding-8B-n8](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n8)
- **Underlying architecture:** [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base)
- **Scale:** 8× (n=8)
- **Context length:** 131,072 tokens
- **Dtype:** bfloat16
## Key Idea
Sequential Hidden Decoding scales the effective sequence length by embedding the same token sequence with multiple embedding matrices, interleaving the resulting vectors, and feeding the expanded (n×-longer) sequence through the same Transformer backbone. This model is the instruction-tuned release of the 8B n=8 variant.
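The interleaving step can be sketched as follows. This is a toy illustration with made-up shapes, not the actual implementation (which lives in `modeling_qwen3_scale_seq.py`): each token id is looked up in n separate embedding matrices and the n vectors are interleaved per token, so a T-token prompt becomes an (n×T)-vector input sequence.

```python
import numpy as np

# Toy dimensions for illustration only (the real model uses n=8 with
# Qwen3-8B's vocabulary and hidden size).
n, vocab, d = 8, 32, 4
rng = np.random.default_rng(0)

# n independent embedding matrices E_0 .. E_{n-1} for the same vocabulary.
embeddings = [rng.standard_normal((vocab, d)) for _ in range(n)]

token_ids = np.array([3, 7, 11])  # a toy 3-token sequence (T=3)

# Embed each token with every matrix: shape (T, n, d).
per_matrix = np.stack([E[token_ids] for E in embeddings], axis=1)

# Interleave per token: row order is t0/E0, t0/E1, ..., t0/E7, t1/E0, ...
expanded = per_matrix.reshape(-1, d)  # shape (n*T, d) = (24, 4)

print(expanded.shape)
```

The `expanded` sequence is what would then be fed through the unmodified Transformer, which is why the serving notes below stress that the model processes n×-length sequences internally.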
## Serving (SGLang)
This model requires a patched version of [SGLang](https://github.com/sgl-project/sglang) for inference. See the [project page](https://github.com/Tencent/Sequential-Hidden-Decoding) for installation options.
```bash
python -m sglang.launch_server \
--model-path tencent/Sequential-Hidden-Decoding-8B-n8-Instruct \
--trust-remote-code \
--tp-size 1 \
--port 30000 --host 0.0.0.0 \
--chunked-prefill-size -1 \
--attention-backend fa3 \
--mem-fraction-static 0.82 \
--max-running-requests 32 \
--context-length 131072 \
--cuda-graph-max-bs 128 \
--cuda-graph-bs 1 2 4 8 16 32 64 128
```
> **Note:** Sequential Hidden Decoding models process n×-length sequences internally, so `--chunked-prefill-size -1`, `--attention-backend fa3`, and conservative batch sizing are important for stability and performance.
## Chat Usage
This is an instruction-tuned model; query it through the OpenAI-compatible `/v1/chat/completions` endpoint:
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="tencent/Sequential-Hidden-Decoding-8B-n8-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain the idea of hidden decoding in simple terms."},
],
max_tokens=512,
temperature=0.7,
)
print(response.choices[0].message.content)
```
## Files
This repository includes the custom architecture files that are loaded when `trust_remote_code=True` is set:
- `configuration_qwen3_scale_seq.py`
- `modeling_qwen3_scale_seq.py`
## Related Models
| Model | Type | Notes |
|-------|:----:|-------|
| [Sequential-Hidden-Decoding-8B-n2](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n2) | Base | 2x scale base model |
| [Sequential-Hidden-Decoding-8B-n4](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n4) | Base | 4x scale base model |
| [Sequential-Hidden-Decoding-8B-n8](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n8) | Base | 8x scale base model |
| [Sequential-Hidden-Decoding-8B-n8-Instruct](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n8-Instruct) | Instruct | Instruction-tuned 8x scale model |
## Citation
```bibtex
@article{hidden_decoding_2026,
title = {Hidden Decoding: Scaling Sequence Length in Pretraining},
year = {2026},
url = {https://welm.weixin.qq.com/posts/hidden_decoding/}
}
```
## License
This model is released under the [License Terms of Sequential-Hidden-Decoding](LICENSE).