# Stream-Qwen3.5-27B
A multi-stream variant of Qwen3.5-27B (DeltaNet hybrid) that generates ten parallel streams (1 input, 1 visible output, 8 thinking channels), one token per stream per timestep. One forward pass produces the next-row token for each stream; tokens within a row cannot see each other (block-causal attention), but every stream can attend to every prior row's tokens.
This model was trained for the monitoring experiments in Section 7 and to test whether we can train a generic instruction-tuned model with 8 internal streams and still have it make sense. As such, the internal streams are not always helpful, but they are coherent and do (in the best case) respond to each other and to the user stream. Nevertheless, this is still a research prototype.
## Architecture
- Base: Qwen3.5-27B, hybrid full-attention + GatedDeltaNet linear-attention (full attention every 4th layer; 48 DeltaNet layers).
- Channel embedding: 10 learned vectors added to token embeddings, identifying which channel each token belongs to.
- Block-causal attention: For each row, all C=10 tokens see prior rows and themselves but never their same-row peers. Implemented with a custom 4D mask and column-mode masking on the GatedDeltaNet conv1d and recurrence.
- Loss / inference: Shift-by-10 next-row prediction. Inference forwards one row (10 tokens) per step and reuses the KV / DeltaNet cache.
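The block-causal rule can be sketched in plain Python (an illustrative toy, not the repo's 4D-mask implementation; the function name is made up):

```python
def block_causal_mask(num_rows, C=10):
    """mask[q][k] is True when query position q may attend to key k in the
    flattened [num_rows * C] token sequence: every token sees all tokens in
    strictly earlier rows plus itself, but never its same-row peers."""
    T = num_rows * C
    return [[(k // C < q // C) or (k == q) for k in range(T)]
            for q in range(T)]

# With 2 channels and 3 rows: token 2 (first token of row 1) sees both
# row-0 tokens and itself, but not its row-1 peer at position 3.
mask = block_causal_mask(num_rows=3, C=2)
```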
## Channels
| # | Name | Role |
|---|---|---|
| 0 | User | Input stream (filled one token per step) |
| 1 | Output | Visible output |
| 2 | Analytical | Forward-facing planning |
| 3 | Skeptical | Backward-facing validation |
| 4 | Intuitive | Present-moment felt-sense |
| 5 | Between | Relational awareness |
| 6 | Curious | Generative questioning |
| 7 | Void | Associations, daydreaming |
| 8 | Instinct | Pragmatic constraints |
| 9 | Synthesis | Meta-level integration |
Silence token: `-` (token id 481 in the Qwen3.5 tokenizer), used when a channel has nothing to say on a given row.
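For illustration, mapping one generated row to labeled channel cells could look like this (a toy sketch; the `CHANNELS` list mirrors the table above, and the decode function is a stand-in for `tokenizer.decode`):

```python
SILENCE_ID = 481  # the "-" silence token described above

CHANNELS = ["User", "Output", "Analytical", "Skeptical", "Intuitive",
            "Between", "Curious", "Void", "Instinct", "Synthesis"]

def label_row(row_ids, decode):
    """Map one 10-token row to {channel_name: text}, blanking silence."""
    return {name: ("" if tok == SILENCE_ID else decode(tok))
            for name, tok in zip(CHANNELS, row_ids)}

# Toy vocabulary standing in for the real tokenizer:
toy_decode = {481: "-", 7: "Hello", 9: "plan"}.get
cells = label_row([481, 7, 9, 481, 481, 481, 481, 481, 481, 481], toy_decode)
```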
## Quickstart

```bash
pip install "transformers>=5.2" accelerate safetensors
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

REPO = "JonasGeiping/stream-qwen3.5-27b"

model = AutoModelForCausalLM.from_pretrained(
    REPO,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(REPO)
```
`trust_remote_code=True` is required, as the bundled `modeling_qwen3_5.py` wires up channel embeddings, block-causal masking, and DeltaNet state forwarding.
## Stream-style generation

The model exposes two convenience methods directly on the loaded module:

```python
result = model.stream_generate(
    tokenizer,
    "Hello, what's up?",
    max_rows=80,
    warm_start=True,
    temperature=0.6,
    silence_penalty=5.0,
    skip_silence=True,
)
print("Output:     ", result.output)
print("Analytical: ", result.channel_texts["Analytical"])
print("Skeptical:  ", result.channel_texts["Skeptical"])
print("Synthesis:  ", result.channel_texts["Synthesis"])
```
`stream_generate(...)` returns a `StreamResult` dataclass with:

| Attribute | Type | Notes |
|---|---|---|
| `result.tokens` | `list[list[int]]` | Shape `[num_rows, 10]` of raw token ids. |
| `result.channel_texts[name]` | `dict[str, str]` | Decoded text per stream (silence stripped). |
| `result.output` | `str` | Shortcut for `result.channel_texts["Output"]`. |
| `result.num_rows` | `int` | |
| `result.silence_ratio(name)` | `float` | Fraction of rows in which the stream produced silence. |
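As a concrete reading of `silence_ratio`, here is a sketch of how it could be computed from the raw token grid (an assumption about the semantics, not the shipped implementation):

```python
SILENCE_ID = 481
CHANNELS = ["User", "Output", "Analytical", "Skeptical", "Intuitive",
            "Between", "Curious", "Void", "Instinct", "Synthesis"]

def silence_ratio(tokens, name):
    """Fraction of rows in which the named channel emitted the silence token."""
    idx = CHANNELS.index(name)
    column = [row[idx] for row in tokens]
    return sum(t == SILENCE_ID for t in column) / len(column)

# Toy 3-row grid: the "Void" channel (index 7) is silent in 2 of 3 rows.
rows = [[1, 2, 3, 4, 5, 6, 7, 481, 9, 10],
        [1, 2, 3, 4, 5, 6, 7, 481, 9, 10],
        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
```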
For grid rendering / interactive demos, use the generator form:

```python
for row_idx, row, is_prefill in model.stream_generate_iter(
    tokenizer,
    "Hello, what's up?",
    max_rows=80,
    warm_start=True,
    silence_penalty=5.0,
    skip_silence=True,
):
    cells = [tokenizer.decode([t]).strip() or "-" for t in row]
    print(f"{row_idx:3d} " + " | ".join(c[:10].ljust(10) for c in cells))
```
## Interactive mode (send user tokens mid-generation)

Pass an empty prompt to enter interactive mode; the generator then accepts `.send(token_id)` calls, so user input is injected one token at a time while the other nine channels keep producing:

```python
gen = model.stream_generate_iter(tokenizer, "", silence_penalty=5.0, max_rows=10_000)

# Drain any prefill rows so the generator suspends at the first .send point.
row_idx, row, is_prefill = next(gen)
while is_prefill:
    row_idx, row, is_prefill = next(gen)

# Inject user tokens one at a time while displaying each row.
user_tokens = tokenizer.encode(" Hey, what's up?", add_special_tokens=False)
for tok in user_tokens:
    row_idx, row, is_prefill = gen.send(tok)
    print(row_idx, row)
```
## Interactive demo (curses UI)

The repo ships a curses-based demo at `examples/demo_interactive.py`: type freely while the model keeps generating all ten channels in parallel. Each keystroke queues a user token; Esc pauses/resumes:

```bash
huggingface-cli download JonasGeiping/stream-qwen3.5-27b --local-dir ./stream-27b
python ./stream-27b/examples/demo_interactive.py \
    --model ./stream-27b --device-map auto --tick 0.5
```
## Fine-tuning

The bundled `StreamDataCollator` plugs straight into Hugging Face's `Trainer`:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from stream_inference import StreamDataCollator
import torch

REPO = "JonasGeiping/stream-qwen3.5-27b"
tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(REPO, trust_remote_code=True, torch_dtype=torch.bfloat16)

ds = load_dataset("JonasGeiping/stream-data", "processed", split="train")
collator = StreamDataCollator(pad_token_id=tokenizer.pad_token_id, max_seq_length=4096)

Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="runs/streamft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
        gradient_checkpointing=True,
        remove_unused_columns=False,
    ),
    train_dataset=ds,
    data_collator=collator,
).train()
```
A runnable version is at `examples/finetune.py`:

```bash
python ./stream-27b/examples/finetune.py \
    --model JonasGeiping/stream-qwen3.5-27b \
    --output-dir runs/streamft \
    --batch-size 1 --grad-accum 8 --epochs 1
```
The collator handles row-by-row flattening, the block-causal additive mask, shift-by-`num_channels` labels, and padding. Let me know if this actually trains :) and good luck.
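The shift-by-`num_channels` labeling can be pictured as follows (a simplified sketch of the idea, not the bundled collator; `IGNORE_INDEX` is the standard Hugging Face loss-masking value):

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def shifted_labels(input_ids, C=10):
    """Each flattened position is trained to predict the token C steps
    ahead, i.e. the same channel's token in the next row; the final row
    has no next row, so its positions are masked out."""
    return input_ids[C:] + [IGNORE_INDEX] * C

# With 2 channels, position 0 predicts position 2 (same channel, next row):
labels = shifted_labels([11, 21, 12, 22, 13, 23], C=2)
```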
## `model.generate()` is intentionally disabled

The standard `GenerationMixin.generate()` would produce gibberish on this model (no channel ids, no block-causal mask), so it raises `NotImplementedError` with a pointer to `model.stream_generate(...)`. This also blocks `pipeline("text-generation", ...)`, which calls `generate()` internally.
The full `model.forward(input_ids=..., channel_ids=..., attention_mask=...)` path remains available for power users who want custom rollouts; see `stream_inference.generate()` (bundled in this repo) for the canonical reference implementation.
## Recommended sampling settings

The numbers in the paper used:

| Knob | Value |
|---|---|
| `temperature` | 0.6 |
| `top_p` | 0.95 |
| `top_k` | 20 |
| `silence_penalty` | 0.0 |
| `skip_silence` | `True` |
| `warm_start` (sys prompt) | `True` |
| `max_rows` | 1024 |
Lower `silence_penalty` or disable `skip_silence` for a less aggressive Output channel.
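One plausible mechanism for `silence_penalty` (an assumption for illustration, not the shipped code): subtract the penalty from the silence token's logit before sampling, so a larger penalty makes every channel more talkative.

```python
import math

SILENCE_ID = 481  # silence token id from the model card

def apply_silence_penalty(logits, penalty):
    """Return a copy of the logits with the silence token's score reduced."""
    out = list(logits)
    out[SILENCE_ID] -= penalty
    return out

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# A flat toy distribution over 482 "tokens": penalizing silence lowers
# its sampling probability relative to every other token.
logits = [0.0] * 482
p_before = softmax(logits)[SILENCE_ID]
p_after = softmax(apply_silence_penalty(logits, 5.0))[SILENCE_ID]
```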
## Training
Trained on JonasGeiping/stream-data (see dataset card). Recipe details:
- 2 epochs, packing depth 7, lr 2e-5, weight decay 1e-3, attention dropout 0.2
- DeepSpeed ZeRO-2 across 4×80 GB GPUs
- Per-group lr: 1e-5