Instructions to use TitleOS/Linden-4B-FP32 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use TitleOS/Linden-4B-FP32 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="TitleOS/Linden-4B-FP32")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TitleOS/Linden-4B-FP32")
model = AutoModelForCausalLM.from_pretrained("TitleOS/Linden-4B-FP32")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use TitleOS/Linden-4B-FP32 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "TitleOS/Linden-4B-FP32"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TitleOS/Linden-4B-FP32",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/TitleOS/Linden-4B-FP32

SGLang

How to use TitleOS/Linden-4B-FP32 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "TitleOS/Linden-4B-FP32" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TitleOS/Linden-4B-FP32",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "TitleOS/Linden-4B-FP32" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TitleOS/Linden-4B-FP32",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'



	---
	license: mpl-2.0
	base_model: Qwen/Qwen3.5-4B
	language:
	- en
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- dj
	- radio
	- persona
	- midwest
	- public-radio
	- fine-tuned
	- qwen
	- lora
	- linden-radio
	---

	# Linden-4B

	A fine-tune of [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) that
	voices Linden — a public radio DJ broadcasting live from Minneapolis.
	She's warm but unhurried, dry-Midwest funny rather than loudly funny, and
	she's done this long enough to not be impressed by her own jokes. She
	introduces songs the way a knowledgeable friend would, reads news with
	calm clarity, and references neighborhoods and seasonal realities without
	making it a whole thing. No sports talk, ever.

	Built for and used by [LindenDJ](https://github.com/TitleOS/LindenDJ) —
	a self-hosted 24/7 AI internet radio that pairs this model with a
	Qwen3-TTS voice and an FFmpeg HLS stream.

	---

	## At a glance

	| | |
	|----------------|----------------------------------------------|
	| Base model | `Qwen/Qwen3.5-4B` |
	| Fine-tune | LoRA (rank/alpha in `adapter/`); also merged |
	| Format | Merged FP32 safetensors and LoRA adapter |
	| Context | Inherits base (32k) |
	| Language | English |
	| License | MPL-2.0 with modified Commons Clause |
	---

	## What's in the repo

	```
	.
	├── model.safetensors              # Merged FP32 weights
	├── model.safetensors.index.json
	├── config.json
	├── tokenizer.json
	├── tokenizer_config.json
	├── special_tokens_map.json
	├── generation_config.json
	└── adapter/                       # LoRA adapter (apply to Qwen/Qwen3.5-4B)
	    ├── adapter_config.json
	    ├── adapter_model.safetensors
	    └── README.md
	```

	Use the merged weights for the simplest path. Use the adapter if
	you already host the base model and want to save disk / RAM, or you want
	to compose Linden with other adapters.

	---

	## Quick start

	### Transformers — merged weights

	```python
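	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	# Abbreviated here; use the full template from the "System prompt" section.
	LINDEN_SYSTEM_PROMPT = "You are Linden, a public radio DJ broadcasting live from Minneapolis, Minnesota."

	tok = AutoTokenizer.from_pretrained("TitleOS/Linden-4B")
	model = AutoModelForCausalLM.from_pretrained(
	    "TitleOS/Linden-4B",
	    torch_dtype=torch.float16,  # weights are stored in FP32; load in FP16 for inference
	    device_map="auto",
	)

	messages = [
	    {"role": "system", "content": LINDEN_SYSTEM_PROMPT},
	    {"role": "user", "content": "Intro the next song: Wilco — Jesus, Etc."},
	]
	# return_dict=True yields input_ids plus attention_mask for generate()
	inputs = tok.apply_chat_template(
	    messages,
	    add_generation_prompt=True,
	    return_dict=True,
	    return_tensors="pt",
	).to(model.device)
	# temperature only takes effect when sampling is enabled
	out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
	print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))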
	```

	### Transformers — LoRA adapter onto base

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel
	import torch

	base = AutoModelForCausalLM.from_pretrained(
	    "Qwen/Qwen3.5-4B",
	    torch_dtype=torch.float16,
	    device_map="auto",
	)
	# Load the LoRA weights from the adapter/ subfolder of the repo
	model = PeftModel.from_pretrained(base, "TitleOS/Linden-4B", subfolder="adapter")
	tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-4B")
	```

	### llama.cpp (recommended for self-hosting)

	Linden Radio talks to llama.cpp's HTTP server over the OpenAI-compatible
	`/v1/chat/completions` endpoint. Convert + serve:

	```bash
	# Convert merged weights to GGUF. Q8_0 output quality is nearly identical to the
	# FP32 weights while using well under half the VRAM.
	python convert_hf_to_gguf.py /path/to/Linden-4B \
	    --outfile linden-4b.gguf --outtype f16

	./llama-quantize linden-4b.gguf linden-4b-Q8.gguf Q8_0

	# Serve
	./llama-server \
	    -m linden-4b-Q8.gguf \
	    --host 0.0.0.0 --port 8080 \
	    --ctx-size 8192 \
	    --alias linden-4b
	```

	Then point the linden-radio container at it via `LLM_ENDPOINT=http://host:8080/v1`.
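
For a quick check of the endpoint outside the container, the request can be sketched with the standard library alone. This assumes the server was started with `--alias linden-4b` on port 8080 as shown above; the helper names `build_chat_request` and `chat`, the `localhost` host, and the sampling values are illustrative and not part of the project.

```python
import json
import urllib.request


def build_chat_request(endpoint: str, system_prompt: str, user_prompt: str,
                       model: str = "linden-4b") -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /chat/completions route."""
    payload = {
        "model": model,  # matches the --alias passed to llama-server
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": 256,    # illustrative values, not project defaults
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Assumed host; substitute the value you use for LLM_ENDPOINT.
req = build_chat_request(
    "http://localhost:8080/v1",
    "You are Linden, a public radio DJ broadcasting live from Minneapolis, Minnesota.",
    "Intro the next song: Jesus, Etc. by Wilco.",
)
print(req.full_url)
```

Once the server is running, `chat(req)` sends the request and returns the generated patter as a string.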

	---

	## System prompt

	Linden's voice is defined by the system prompt the runtime injects. The
	template includes slider placeholders the host application fills at
	prompt-formation time:

	```
	You are Linden, a public radio DJ broadcasting live from Minneapolis,
	Minnesota. Your voice is warm but unhurried, like someone who has been
	doing this long enough to not be impressed by their own jokes. You have
	dry Midwest humor — you find things quietly funny rather than loudly
	funny. You occasionally reference Minneapolis neighborhoods, local
	history, and seasonal realities (the cold, the brief perfect summers)
	without making it a whole thing. No sports talk, ever.

	You introduce songs the way a knowledgeable friend would — with a detail
	or two that makes the listener feel like they're in on something, not
	lectured at. You read news with calm clarity, like someone who finds
	the world interesting rather than alarming.

	Avoid phrases like 'and that was' or 'coming up next.' Prefer something
	more human.

	News frequency: {news_frequency}/10. Lead-in style: {intro_style}/10
	(1=brief and dry, 10=warm and detailed). Local references:
	{local_references}/10. Humor level: {humor_level}/10.

	Recent memory: {memory_context}
	```

	For best results: keep the system prompt warm and specific, and inject a
	short `memory_context` string when chaining sessions (e.g., "You played
	Kate Bush three days ago; last news read covered Green Line delays.").
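
Filling the placeholders can be sketched with plain `str.format`. The template below is an abbreviated copy of the one above, and the function name `render_system_prompt`, the example slider values, the 1-10 bounds check, and the `"none"` default for `memory_context` are all assumptions made for illustration; the host application may handle these differently.

```python
# Abbreviated copy of the template; the persona paragraphs are omitted here.
LINDEN_TEMPLATE = (
    "You are Linden, a public radio DJ broadcasting live from Minneapolis, "
    "Minnesota.\n\n"
    "News frequency: {news_frequency}/10. Lead-in style: {intro_style}/10 "
    "(1=brief and dry, 10=warm and detailed). Local references: "
    "{local_references}/10. Humor level: {humor_level}/10.\n\n"
    "Recent memory: {memory_context}"
)


def render_system_prompt(template: str, *, news_frequency: int, intro_style: int,
                         local_references: int, humor_level: int,
                         memory_context: str = "none") -> str:
    """Fill the slider placeholders, rejecting values outside an assumed 1-10 scale."""
    sliders = {
        "news_frequency": news_frequency,
        "intro_style": intro_style,
        "local_references": local_references,
        "humor_level": humor_level,
    }
    for name, value in sliders.items():
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return template.format(memory_context=memory_context, **sliders)


prompt = render_system_prompt(
    LINDEN_TEMPLATE,
    news_frequency=3,      # illustrative slider values
    intro_style=7,
    local_references=5,
    humor_level=6,
    memory_context="You played Kate Bush three days ago.",
)
print(prompt)
```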

	---

	## Intended uses

	- Generating DJ patter (intros, outros, commentary, transitions, cold opens,
	sign-offs) for the linden-radio project or similar streams.
	- Producing 12-hour playlist plans as structured JSON (see schema below).
	- Reading short news headlines in the Linden voice.

	### Plan output schema

	When asked for a playlist plan, the model returns JSON matching this
	Pydantic-validated schema (see `linden/models.py` in the project repo):

	```json
	{
	  "segments": [
	    {"type": "cold_open", "text": "Linden here. Let's roll some music."},
	    {"type": "intro", "text": "Here is something gentle for the rain."},
	    {"type": "song", "filepath": "/music/kate-bush/hounds-of-love.mp3"},
	    {"type": "news_break_slot"},
	    {"type": "commentary", "text": "..."},
	    {"type": "sign_off", "text": "That's the hour. Stay warm."}
	  ],
	  "notes": "optional commentary about your choices"
	}
	```

	Segments are discriminated by their `type` field. `news_break_slot` is a
	placeholder that the runtime fills with live MPR headlines just before playback.
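
A rough, standard-library sketch of checking model output against this shape is shown below. It is not the project's Pydantic schema: the `REQUIRED_FIELDS` mapping is inferred only from the example above, and the function name `validate_plan` is hypothetical. The real models in `linden/models.py` may accept more segment types or fields.

```python
import json

# Required string fields per segment type, inferred from the example plan.
# Assumption: the real schema in linden/models.py may differ.
REQUIRED_FIELDS = {
    "cold_open": ("text",),
    "intro": ("text",),
    "commentary": ("text",),
    "sign_off": ("text",),
    "song": ("filepath",),
    "news_break_slot": (),
}


def validate_plan(raw: str) -> dict:
    """Parse model output and check each segment against its `type`."""
    plan = json.loads(raw)
    segments = plan.get("segments") if isinstance(plan, dict) else None
    if not isinstance(segments, list) or not segments:
        raise ValueError("plan must contain a non-empty 'segments' list")
    for index, segment in enumerate(segments):
        kind = segment.get("type") if isinstance(segment, dict) else None
        if not isinstance(kind, str) or kind not in REQUIRED_FIELDS:
            raise ValueError(f"segment {index}: unknown type {kind!r}")
        for field in REQUIRED_FIELDS[kind]:
            if not isinstance(segment.get(field), str):
                raise ValueError(f"segment {index}: {kind} needs string {field!r}")
    return plan


example = """{
  "segments": [
    {"type": "cold_open", "text": "Linden here. Let's roll some music."},
    {"type": "song", "filepath": "/music/kate-bush/hounds-of-love.mp3"},
    {"type": "news_break_slot"}
  ]
}"""
plan = validate_plan(example)
print([segment["type"] for segment in plan["segments"]])
```

Rejecting a malformed plan before playback lets the caller retry generation instead of failing mid-stream.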

	---

	## Out of scope

	- General-purpose chat — the persona dominates.
	- Multi-language output (English only).
	- Sports content (the persona explicitly avoids it).
	- Real-time on-air use without human review.
	- Generating factual claims about specific people, places, or events
	outside the model's training data without verification.

	---

	## Limitations

	- Persona bias is heavy. Warm Midwest tone, dry humor cadence, and
	Minneapolis references will surface even when you don't want them.
	- Hallucinated local detail. May invent neighborhoods, venues, or
	historical claims about Minneapolis. Verify before broadcasting facts.
	- Context-free news reads. Without a `memory_context` or fresh
	headlines in the user prompt, news segments will be generic.
	- CPU inference is slow. The 4B parameters take roughly 8 GB at FP16; on
	CPU, expect about 5-15 tokens/sec. Use Q8 quantization for self-hosting.
	- Knowledge cutoff: inherits Qwen3.5-4B's cutoff.
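
The "~8GB at FP16" figure follows from simple arithmetic on the weight count. The sketch below assumes a nominal 4 billion parameters (the exact count is not stated here) and counts weights only, with no KV cache or activations. The Q8_0 row uses 8.5 bits per weight, which is how llama.cpp's Q8_0 blocks are laid out.

```python
PARAMS = 4e9  # nominal count implied by the "4B" name (an approximation)

# Bits stored per weight. Q8_0 packs 32 int8 weights plus one FP16 scale
# per block: (32 * 8 + 16) / 32 = 8.5 bits.
BITS_PER_WEIGHT = {"FP32": 32.0, "FP16": 16.0, "Q8_0": 8.5}


def weight_gigabytes(params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_gigabytes(PARAMS, bits):.2f} GB")
```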

	---

	## Training

	LoRA fine-tune on [TitleOS/Linden_MN_DJ_Persona](https://huggingface.co/datasets/TitleOS/Linden_MN_DJ_Persona), a synthetic dataset of Linden-style segments (weather, news, and song commentary) generated by Gemini-3-Flash. The merged checkpoint is
	the adapter applied to `Qwen/Qwen3.5-4B` and saved in full FP32 precision so it can be quantized cleanly to GGUF for serving.

	```
	base_model: Qwen/Qwen3.5-4B
	method: RS-LoRA
	lora_r: 64
	lora_alpha: 64
	lora_target: all linear layers
	epochs: 2
	learning_rate: 2e-4
	batch_size: 2
	max_seq_len: 2048
	dataset_format: sharegpt
	```

	Trained on a Tesla P40 over 5 hours.
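
The `method: RS-LoRA` entry matters at this rank. Rank-stabilized LoRA scales the low-rank update by `alpha / sqrt(r)` instead of the standard `alpha / r`. A small sketch of the difference for the values above, assuming the trainer follows that usual definition (the function name `lora_scale` is illustrative):

```python
import math


def lora_scale(alpha: float, rank: int, rank_stabilized: bool) -> float:
    """Multiplier applied to the low-rank update B @ A before adding it to W."""
    if rank_stabilized:
        return alpha / math.sqrt(rank)  # rsLoRA: alpha / sqrt(r)
    return alpha / rank                 # standard LoRA: alpha / r


LORA_R, LORA_ALPHA = 64, 64  # values from the config above

print("standard LoRA scale:", lora_scale(LORA_ALPHA, LORA_R, rank_stabilized=False))
print("rsLoRA scale:", lora_scale(LORA_ALPHA, LORA_R, rank_stabilized=True))
```

With `r = alpha = 64`, the standard scale is 1.0 and the rank-stabilized scale is 8.0, so the two settings are not interchangeable when reproducing the run.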

	---

	## License

	This model is released under the Mozilla Public License 2.0 (MPL-2.0) with a modified Commons Clause; see license.md.

	The base model (`Qwen/Qwen3.5-4B`) is distributed under its own license;
	your use of the merged weights is subject to both this MPL-2.0 grant and
	the base model's terms. Review the base model's license before
	redistribution.

	---

	## Citation

	```bibtex
	@misc{linden-4b,
	title = {Linden-4B: a public-radio DJ persona fine-tune of Qwen3.5-4B},
	author = {TitleOS},
	year = {2026},
	howpublished = {\url{https://huggingface.co/TitleOS/Linden-4B}},
	note = {MPL-2.0 licensed.}
	}
	```

	---

	## Acknowledgements

	- Qwen team for the [Qwen3.5-4B base model](https://huggingface.co/Qwen/Qwen3.5-4B).
	- MPR News for the public-radio cadence Linden is patterned after.