Stream-Qwen3.5-27B

A multi-stream variant of Qwen3.5-27B (DeltaNet hybrid) that generates ten parallel streams (1 input, 1 visible output, 8 thinking channels), one token per stream per timestep. A single forward pass produces the next-row token for every stream; tokens within a row cannot see each other (block-causal attention), but every stream can attend to every prior row's tokens.

This model was trained for the monitoring experiments in Section 7 and to test whether a generic instruction-tuned model can be trained with 8 internal streams and still make sense. The internal streams are not always helpful, but they are coherent and, in the best case, respond to each other and to the user stream. This is still a research prototype.

Architecture

  • Base: Qwen3.5-27B, hybrid full-attention + GatedDeltaNet linear-attention (full attention every 4th layer; 48 DeltaNet layers).
  • Channel embedding: 10 learned vectors added to token embeddings, identifying which channel each token belongs to.
  • Block-causal attention: For each row, all C=10 tokens see prior rows and themselves but never their same-row peers. Implemented with a custom 4D mask and column-mode masking on the GatedDeltaNet conv1d and recurrence.
  • Loss / inference: Shift-by-10 next-row prediction. Inference forwards one row (10 tokens) per step and reuses the KV / DeltaNet cache.
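The block-causal rule can be sketched as a plain-Python mask builder. This is illustrative only; the shipped modeling_qwen3_5.py builds a 4D additive mask instead, and the function name here is made up for the example:

```python
def block_causal_mask(num_rows: int, channels: int = 10) -> list[list[bool]]:
    """True = query may attend to key. Token i lives in row i // channels.
    A token may attend to every token in strictly earlier rows, plus itself,
    but never to its same-row peers."""
    n = num_rows * channels
    return [
        [(k // channels < q // channels) or (k == q) for k in range(n)]
        for q in range(n)
    ]
```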

Channels

#  Name        Role
0  User        Input stream (filled per step)
1  Output      Visible output
2  Analytical  Forward-facing planning
3  Skeptical   Backward-facing validation
4  Intuitive   Present-moment felt-sense
5  Between     Relational awareness
6  Curious     Generative questioning
7  Void        Associations, daydreaming
8  Instinct    Pragmatic constraints
9  Synthesis   Meta-level integration

Silence token: - → token id 481 in the Qwen3.5 tokenizer (used when a channel has nothing to say on a given row).
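For convenience, the channel order above as a Python list (the constant names here are illustrative, not part of the shipped API):

```python
CHANNELS = ["User", "Output", "Analytical", "Skeptical", "Intuitive",
            "Between", "Curious", "Void", "Instinct", "Synthesis"]
SILENCE_ID = 481  # token id of "-" in the Qwen3.5 tokenizer
```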

Quickstart

pip install "transformers>=5.2" accelerate safetensors

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

REPO = "JonasGeiping/stream-qwen3.5-27b"

model = AutoModelForCausalLM.from_pretrained(
    REPO,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(REPO)

trust_remote_code=True is required as the bundled modeling_qwen3_5.py wires up channel embeddings, block-causal masking, and DeltaNet state forwarding.

Stream-style generation

The model exposes two convenience methods directly on the loaded module:

result = model.stream_generate(
    tokenizer,
    "Hello, what's up?",
    max_rows=80,
    warm_start=True,
    temperature=0.6,
    silence_penalty=5.0,
    skip_silence=True,
)

print("Output:     ", result.output)
print("Analytical: ", result.channel_texts["Analytical"])
print("Skeptical:  ", result.channel_texts["Skeptical"])
print("Synthesis:  ", result.channel_texts["Synthesis"])

stream_generate(...) returns a StreamResult dataclass with:

Attribute                    Type             Notes
result.tokens                list[list[int]]  Shape [num_rows, 10] of raw token ids.
result.channel_texts         dict[str, str]   Decoded text per stream, keyed by channel name (silence stripped).
result.output                str              Shortcut for result.channel_texts["Output"].
result.num_rows              int              Number of generated rows.
result.silence_ratio(name)   float            Fraction of rows on which the stream produced silence.
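For instance, silence_ratio can be recomputed from the raw token grid; a minimal sketch, assuming silence token id 481 and the channel indices from the table above:

```python
def silence_ratio(tokens: list[list[int]], channel: int, silence_id: int = 481) -> float:
    """Fraction of rows in which the given channel emitted the silence token."""
    if not tokens:
        return 0.0
    silent = sum(1 for row in tokens if row[channel] == silence_id)
    return silent / len(tokens)
```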

For grid rendering / interactive demos, use the generator form:

for row_idx, row, is_prefill in model.stream_generate_iter(
    tokenizer,
    "Hello, what's up?",
    max_rows=80,
    warm_start=True,
    silence_penalty=5.0,
    skip_silence=True,
):
    cells = [tokenizer.decode([t]).strip() or "-" for t in row]
    print(f"{row_idx:3d}  " + " | ".join(c[:10].ljust(10) for c in cells))

Interactive mode (send user tokens mid-generation)

Pass an empty prompt to enter interactive mode. The generator then accepts .send(token_id) calls, so user input can be injected one token at a time while the other nine channels keep producing:

gen = model.stream_generate_iter(tokenizer, "", silence_penalty=5.0, max_rows=10_000)

# Drain any prefill rows so the generator suspends at the first .send point.
row_idx, row, is_prefill = next(gen)
while is_prefill:
    row_idx, row, is_prefill = next(gen)

# Inject user tokens one at a time while displaying each row.
user_tokens = tokenizer.encode(" Hey, what's up?", add_special_tokens=False)
for tok in user_tokens:
    row_idx, row, is_prefill = gen.send(tok)
    print(row_idx, row)
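The drain-then-send pattern above can be wrapped in a small driver. This is an illustrative helper, not part of the shipped API; it works with any generator that follows the (row_idx, row, is_prefill) protocol:

```python
def drive(gen, user_tokens, on_row=lambda idx, row: None):
    """Drain prefill rows, then inject user tokens one per step.
    Calls on_row for every row seen; returns the last row index."""
    row_idx, row, is_prefill = next(gen)
    while is_prefill:
        on_row(row_idx, row)
        row_idx, row, is_prefill = next(gen)
    on_row(row_idx, row)  # first non-prefill row
    for tok in user_tokens:
        row_idx, row, is_prefill = gen.send(tok)
        on_row(row_idx, row)
    return row_idx
```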

Interactive demo (curses UI)

The repo ships a curses-based demo at examples/demo_interactive.py — type freely while the model keeps generating all ten channels in parallel. Each keystroke queues a user token; Esc pauses/resumes:

huggingface-cli download JonasGeiping/stream-qwen3.5-27b --local-dir ./stream-27b
python ./stream-27b/examples/demo_interactive.py \
    --model ./stream-27b --device-map auto --tick 0.5

Fine-tuning

The bundled StreamDataCollator plugs straight into HuggingFace's Trainer:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from stream_inference import StreamDataCollator
import torch

REPO = "JonasGeiping/stream-qwen3.5-27b"

tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(REPO, trust_remote_code=True, torch_dtype=torch.bfloat16)
ds = load_dataset("JonasGeiping/stream-data", "processed", split="train")

collator = StreamDataCollator(pad_token_id=tokenizer.pad_token_id, max_seq_length=4096)

Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="runs/streamft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
        gradient_checkpointing=True,
        remove_unused_columns=False,
    ),
    train_dataset=ds,
    data_collator=collator,
).train()

A runnable version is at examples/finetune.py:

python ./stream-27b/examples/finetune.py \
    --model JonasGeiping/stream-qwen3.5-27b \
    --output-dir runs/streamft \
    --batch-size 1 --grad-accum 8 --epochs 1

The collator handles row-by-row flattening, the block-causal additive mask, shift-by-num_channels labels, and padding. Let me know if this actually trains :) — good luck.
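The shift-by-num_channels labeling can be sketched in a few lines. This is a simplification of what StreamDataCollator does (the real collator also handles the additive mask and padding), and the function name is illustrative:

```python
IGNORE_INDEX = -100  # standard HF ignore index for the cross-entropy loss
NUM_CHANNELS = 10

def shift_labels(input_ids: list[int]) -> list[int]:
    """Position t predicts the token NUM_CHANNELS steps ahead, i.e. the same
    channel's token in the next row. The final row predicts nothing."""
    return input_ids[NUM_CHANNELS:] + [IGNORE_INDEX] * NUM_CHANNELS
```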

model.generate() is intentionally disabled

The standard GenerationMixin.generate() would produce gibberish on this model (no channel ids, no block-causal mask). It raises NotImplementedError with a pointer to model.stream_generate(...). This also blocks pipeline("text-generation", ...) which calls generate() internally.

The full model.forward(input_ids=..., channel_ids=..., attention_mask=...) path remains available for power users who want custom rollouts — see stream_inference.generate() for the canonical reference implementation (also bundled in this repo).

Recommended sampling settings

The experiments in the paper used:

Knob                     Value
temperature              0.6
top_p                    0.95
top_k                    20
silence_penalty          0.0
skip_silence             True
warm_start (sys prompt)  True
max_rows                 1024

Lower silence_penalty or disable skip_silence if the Output channel is too verbose.

Training

Trained on JonasGeiping/stream-data (see dataset card). Recipe details:

  • 2 epochs, packing depth 7, lr 2e-5, weight decay 1e-3, attention dropout 0.2
  • DeepSpeed ZeRO-2 across 4×80 GB GPUs
  • Per-group lr: 1e-5