# Stream-Qwen3.5-27B
A multi-stream variant of Qwen3.5-27B (DeltaNet hybrid) that generates ten parallel streams (1 input, 1 visible output, 8 thinking channels), one token per stream per timestep. One forward pass produces the next-row token for each stream; tokens within a row cannot see each other (block-causal attention), but every stream can attend to every prior row's tokens.
This model was trained for the monitoring experiments in Section 7 and to test whether we can train a generic instruction-tuned model with 8 internal streams and still have it make sense. As such, the internal streams are not always helpful, but they are coherent and do (in the best case) respond to each other and to the user stream. Nevertheless, this is still a research prototype.
## Architecture
- Base: Qwen3.5-27B, hybrid full-attention + GatedDeltaNet linear-attention (full attention every 4th layer; 48 DeltaNet layers).
- Channel embedding: 10 learned vectors added to token embeddings, identifying which channel each token belongs to.
- Block-causal attention: For each row, all C=10 tokens see prior rows and themselves but never their same-row peers. Implemented with a custom 4D mask and column-mode masking on the GatedDeltaNet conv1d and recurrence.
- Loss / inference: Shift-by-10 next-row prediction. Inference forwards one row (10 tokens) per step and reuses the KV / DeltaNet cache.
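The block-causal rule can be sketched in plain Python (an illustrative toy, not the repo's 4D-mask implementation; the function name is made up):

```python
def block_causal_mask(num_rows, C=10):
    """mask[q][k] is True when query position q may attend to key k in the
    flattened [num_rows * C] token sequence: every token sees all tokens in
    strictly earlier rows plus itself, but never its same-row peers."""
    T = num_rows * C
    return [[(k // C < q // C) or (k == q) for k in range(T)]
            for q in range(T)]

# With 2 channels and 3 rows: token 2 (first token of row 1) sees both
# row-0 tokens and itself, but not its row-1 peer at position 3.
mask = block_causal_mask(num_rows=3, C=2)
```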
## Channels
| # | Name | Role |
|---|---|---|
| 0 | User | Input stream (filled one token per step) |
| 1 | Output | Visible output |
| 2 | Analytical | Forward-facing planning |
| 3 | Skeptical | Backward-facing validation |
| 4 | Intuitive | Present-moment felt-sense |
| 5 | Between | Relational awareness |
| 6 | Curious | Generative questioning |
| 7 | Void | Associations, daydreaming |
| 8 | Instinct | Pragmatic constraints |
| 9 | Synthesis | Meta-level integration |
Silence token: `-` (token id 481 in the Qwen3.5 tokenizer), used when a channel has nothing to say on a given row.
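For illustration, mapping one generated row to labeled channel cells could look like this (a toy sketch; the `CHANNELS` list mirrors the table above, and the decode function is a stand-in for `tokenizer.decode`):

```python
SILENCE_ID = 481  # the "-" silence token described above

CHANNELS = ["User", "Output", "Analytical", "Skeptical", "Intuitive",
            "Between", "Curious", "Void", "Instinct", "Synthesis"]

def label_row(row_ids, decode):
    """Map one 10-token row to {channel_name: text}, blanking silence."""
    return {name: ("" if tok == SILENCE_ID else decode(tok))
            for name, tok in zip(CHANNELS, row_ids)}

# Toy vocabulary standing in for the real tokenizer:
toy_decode = {481: "-", 7: "Hello", 9: "plan"}.get
cells = label_row([481, 7, 9, 481, 481, 481, 481, 481, 481, 481], toy_decode)
```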
## Quickstart

```bash
pip install "transformers>=5.2" accelerate safetensors
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

REPO = "JonasGeiping/stream-qwen3.5-27b"

model = AutoModelForCausalLM.from_pretrained(
    REPO,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(REPO)
```
`trust_remote_code=True` is required, as the bundled `modeling_qwen3_5.py` wires up channel embeddings, block-causal masking, and DeltaNet state forwarding.
## Stream-style generation

The model exposes two convenience methods directly on the loaded module:

```python
result = model.stream_generate(
    tokenizer,
    "Hello, what's up?",
    max_rows=80,
    warm_start=True,
    temperature=0.6,
    silence_penalty=5.0,
    skip_silence=True,
)
print("Output:     ", result.output)
print("Analytical: ", result.channel_texts["Analytical"])
print("Skeptical:  ", result.channel_texts["Skeptical"])
print("Synthesis:  ", result.channel_texts["Synthesis"])
```
`stream_generate(...)` returns a `StreamResult` dataclass with:

| Attribute | Type | Notes |
|---|---|---|
| `result.tokens` | `list[list[int]]` | Shape `[num_rows, 10]` of raw token ids. |
| `result.channel_texts[name]` | `dict[str, str]` | Decoded text per stream (silence stripped). |
| `result.output` | `str` | Shortcut for `result.channel_texts["Output"]`. |
| `result.num_rows` | `int` | |
| `result.silence_ratio(name)` | `float` | Fraction of rows in which the stream produced silence. |
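As a concrete reading of `silence_ratio`, here is a sketch of how it could be computed from the raw token grid (an assumption about the semantics, not the shipped implementation):

```python
SILENCE_ID = 481
CHANNELS = ["User", "Output", "Analytical", "Skeptical", "Intuitive",
            "Between", "Curious", "Void", "Instinct", "Synthesis"]

def silence_ratio(tokens, name):
    """Fraction of rows in which the named channel emitted the silence token."""
    idx = CHANNELS.index(name)
    column = [row[idx] for row in tokens]
    return sum(t == SILENCE_ID for t in column) / len(column)

# Toy 3-row grid: the "Void" channel (index 7) is silent in 2 of 3 rows.
rows = [[1, 2, 3, 4, 5, 6, 7, 481, 9, 10],
        [1, 2, 3, 4, 5, 6, 7, 481, 9, 10],
        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
```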
For grid rendering / interactive demos, use the generator form:

```python
for row_idx, row, is_prefill in model.stream_generate_iter(
    tokenizer,
    "Hello, what's up?",
    max_rows=80,
    warm_start=True,
    silence_penalty=5.0,
    skip_silence=True,
):
    cells = [tokenizer.decode([t]).strip() or "-" for t in row]
    print(f"{row_idx:3d} " + " | ".join(c[:10].ljust(10) for c in cells))
```
## Interactive mode (send user tokens mid-generation)

Pass an empty prompt to enter interactive mode; the generator then accepts `.send(token_id)` calls, so user input is injected one token at a time while the other nine channels keep producing:

```python
gen = model.stream_generate_iter(tokenizer, "", silence_penalty=5.0, max_rows=10_000)

# Drain any prefill rows so the generator suspends at the first .send point.
row_idx, row, is_prefill = next(gen)
while is_prefill:
    row_idx, row, is_prefill = next(gen)

# Inject user tokens one at a time while displaying each row.
user_tokens = tokenizer.encode(" Hey, what's up?", add_special_tokens=False)
for tok in user_tokens:
    row_idx, row, is_prefill = gen.send(tok)
    print(row_idx, row)
```
## Interactive demo (curses UI)

The repo ships a curses-based demo at `examples/demo_interactive.py`: type freely while the model keeps generating all ten channels in parallel. Each keystroke queues a user token; Esc pauses/resumes:

```bash
huggingface-cli download JonasGeiping/stream-qwen3.5-27b --local-dir ./stream-27b
python ./stream-27b/examples/demo_interactive.py \
    --model ./stream-27b --device-map auto --tick 0.5
```
## Fine-tuning

The bundled `StreamDataCollator` plugs straight into Hugging Face's `Trainer`:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from stream_inference import StreamDataCollator
import torch

REPO = "JonasGeiping/stream-qwen3.5-27b"
tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(REPO, trust_remote_code=True, torch_dtype=torch.bfloat16)

ds = load_dataset("JonasGeiping/stream-data", "processed", split="train")
collator = StreamDataCollator(pad_token_id=tokenizer.pad_token_id, max_seq_length=4096)

Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="runs/streamft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
        gradient_checkpointing=True,
        remove_unused_columns=False,
    ),
    train_dataset=ds,
    data_collator=collator,
).train()
```
A runnable version is at `examples/finetune.py`:

```bash
python ./stream-27b/examples/finetune.py \
    --model JonasGeiping/stream-qwen3.5-27b \
    --output-dir runs/streamft \
    --batch-size 1 --grad-accum 8 --epochs 1
```
The collator handles row-by-row flattening, the block-causal additive mask, shift-by-`num_channels` labels, and padding. Let me know if this actually trains :) and good luck.
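The shift-by-`num_channels` labeling can be pictured as follows (a simplified sketch of the idea, not the bundled collator; `IGNORE_INDEX` is the standard Hugging Face loss-masking value):

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def shifted_labels(input_ids, C=10):
    """Each flattened position is trained to predict the token C steps
    ahead, i.e. the same channel's token in the next row; the final row
    has no next row, so its positions are masked out."""
    return input_ids[C:] + [IGNORE_INDEX] * C

# With 2 channels, position 0 predicts position 2 (same channel, next row):
labels = shifted_labels([11, 21, 12, 22, 13, 23], C=2)
```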
## `model.generate()` is intentionally disabled

The standard `GenerationMixin.generate()` would produce gibberish on this model (no channel ids, no block-causal mask), so it raises `NotImplementedError` with a pointer to `model.stream_generate(...)`. This also blocks `pipeline("text-generation", ...)`, which calls `generate()` internally.
The full `model.forward(input_ids=..., channel_ids=..., attention_mask=...)` path remains available for power users who want custom rollouts; see `stream_inference.generate()` (bundled in this repo) for the canonical reference implementation.
## Recommended sampling settings

The numbers in the paper used:

| Knob | Value |
|---|---|
| `temperature` | 0.6 |
| `top_p` | 0.95 |
| `top_k` | 20 |
| `silence_penalty` | 0.0 |
| `skip_silence` | `True` |
| `warm_start` (sys prompt) | `True` |
| `max_rows` | 1024 |
Lower `silence_penalty` or disable `skip_silence` for a less aggressive Output channel.
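One plausible mechanism for `silence_penalty` (an assumption for illustration, not the shipped code): subtract the penalty from the silence token's logit before sampling, so a larger penalty makes every channel more talkative.

```python
import math

SILENCE_ID = 481  # silence token id from the model card

def apply_silence_penalty(logits, penalty):
    """Return a copy of the logits with the silence token's score reduced."""
    out = list(logits)
    out[SILENCE_ID] -= penalty
    return out

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# A flat toy distribution over 482 "tokens": penalizing silence lowers
# its sampling probability relative to every other token.
logits = [0.0] * 482
p_before = softmax(logits)[SILENCE_ID]
p_after = softmax(apply_silence_penalty(logits, 5.0))[SILENCE_ID]
```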
## Training
Trained on JonasGeiping/stream-data (see dataset card). Recipe details:
- 2 epochs, packing depth 7, lr 2e-5, weight decay 1e-3, attention dropout 0.2
- DeepSpeed ZeRO-2 across 4×80 GB GPUs
- Per-group lr: 1e-5