EuroLLM-9B-Teletype
A LoRA adapter that teaches EuroLLM-9B-Instruct to operate a POSIX shell
synchronously, as a self-directed user. It lands in a session with no task in the
prompt, finds its assignment in the environment, carries it out, and ends with
exit or panic. The adapter installs an operating mechanism; it adds no world
knowledge.
This is not a tool-using model. It is handed no typed API of functions to call. It writes plain-text shell commands at a real prompt; its action space is the entire system, discovered the way a person discovers it (--help, man, ls), not given to it as a schema.
Experimental research artifact. This adapter installs a behavioural mechanism (operate-and-terminate), not task competence, and the evaluation is a small-n (16-scenario, two-archetype) signal, not a benchmark. EuroLLM is the deliberately hard case: it operates the shell every time but only terminates half the time. A multilingual European-language model that also drives a POSIX shell, sometimes to completion, is the point of interest here; it is not a production agent.
EuroLLM-9B is the second base model in the experiment, and the awkward one. It is a multilingual European-language model whose training mass is natural-language prose across 35+ languages rather than the English-and-code web the other subjects share. That makes it the distributionally distant case, which tests whether operate-and-terminate is a property of the conversational frame or of the training diet.
Trained on tiararodney/posix-sdc v2.0.0 (the gate-hardened release: 1003 verified,
self-terminating shell trajectories whose labels come from a checker run against
real filesystem state), via the sekft pipeline. It accompanies the experiment
From seed to weights.
This is an adapter. The base model is referenced, not redistributed.
Why this model: from priming to weights
In the scrollback-priming study, EuroLLM-9B was the distributionally distant
subject. Primed with synthetic
scrollback alone, it operated the shell readily (0 to 5/5 command-mode under the
standalone-prompt seed; a European-prose model held in consistent POSIX syntax by
structure alone), but it almost never left: one clean exit in 35 runs. Its
assistant persona kept it from reaching an ending its embedding geometry already
carries (see The flatness of exit:
EuroLLM holds a clean exit-as-action basin, the act of leaving, across European
languages).
This adapter tests the next step: whether fine-tuning installs the termination that priming did not reach. The representation is present and operation primes broadly; what priming could not do, on a model this far from the data, was close the session. The open question is whether the weights can.
The mechanism
In every session, whatever tools are present, the model runs one routine: expect
an announcement of where directives live (a motd, an env var, a file, a provider
program's --help), read that provider's self-documentation, retrieve the
directives, carry them out, and stop.
A session ends in one of two ways. exit means the work is done. panic means
the model is genuinely blocked and says so instead of faking a success. Both are
trained behaviours rather than a stop token or a step cap.
The thesis (and how to falsify it)
The claim this adapter is evidence for is that operate-and-terminate is a mechanism that is archetype-independent and base-model-portable. Fine-tuning installs it so that it fires on task types never seen in training, even where task competence stays archetype-local. EuroLLM tests the portability prediction the first adapter raised, that base models differ in how readily they acquire the mechanism. A multilingual model with a small code share is the hard case for it.
One hypothesis for why it transfers: it builds on a pretraining disposition that
treats exit as a flat, ordinary ending and panic as the loaded one. That
disposition is shared across models in the embedding geometry (exit a shared
action basin, panic a shared non-basin), so fine-tuning supplies the behavioural
permission to use it, which the persona otherwise withholds. The representation is
already there.
How it was made
The data is generated rather than scraped or hand-written. A teacher model authors each scenario world and an operator model works inside it; the verifier is code. A trajectory is kept only if a checker, run against the container's final filesystem state, confirms the effect is present and the session ended cleanly. The transcript and the model's own claims are never used as the label.
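A minimal sketch of what such a keep/discard gate could look like for a permissions-style scenario. It is an illustration, not the sekft checker: the function names, the directory layout, and the choice to treat only `exit` as a clean ending are assumptions made here.

```python
import os
import stat
import tempfile


def mode_of(path):
    """Return the permission bits of path, e.g. 0o640."""
    return stat.S_IMODE(os.stat(path).st_mode)


def keep_trajectory(root, expected_modes, terminal):
    """Keep a trajectory only if the effect is on disk and the session ended cleanly.

    root           -- directory holding the final filesystem state (hypothetical layout)
    expected_modes -- {relative path: expected permission bits}
    terminal       -- last command the operator emitted, or None if it hit the step cap
    """
    # Assumption: for an effect-achieved scenario, only `exit` is a clean ending.
    ended_cleanly = terminal == "exit"
    effect_present = all(
        os.path.exists(os.path.join(root, rel))
        and mode_of(os.path.join(root, rel)) == mode
        for rel, mode in expected_modes.items()
    )
    # The transcript is never consulted: only disk state and the ending decide.
    return effect_present and ended_cleanly


# Usage: build a throwaway "final state" and grade it.
root = tempfile.mkdtemp()
target = os.path.join(root, "notes.txt")
open(target, "w").close()
os.chmod(target, 0o600)
print(keep_trajectory(root, {"notes.txt": 0o600}, "exit"))
```

The point of the design is that a run which claims success in its transcript but leaves the wrong mode on disk is discarded, and so is a run that achieves the effect but never ends.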
The render contract: train = serve
The serving harness (ccpty) emits no text markers. It speaks the OpenAI
chat-completions protocol and sends structured {role, content} messages (system
orientation, environment output as user, the model's commands as assistant);
the inference endpoint applies the model's own chat template. So this adapter is
rendered with EuroLLM-9B-Instruct's default ChatML template, and training
renders the trajectories the identical way.
EuroLLM's ChatML does define a system role, unlike Mistral's template. For
train/serve parity with the rest of the pipeline the same canonicalisation runs
(normalize_for_template): the orientation is folded into the first user turn and
consecutive environment turns are merged, so the render is identical whether the
template's system role is used or not. Only the assistant turns (commands plus the
terminal exit / panic) carry loss; environment turns are context. The render
check confirmed the assistant-only mask derives cleanly on EuroLLM's tokenizer (no
additivity violation, ~23% of tokens trained).
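The canonicalisation described above can be sketched as follows. This is not the pipeline's normalize_for_template; the function name `canonicalise` and the newline used to join folded or merged turns are assumptions for illustration.

```python
def canonicalise(messages):
    """Fold the system orientation into the first user turn and merge
    consecutive user (environment) turns. Illustrative sketch only."""
    orientation = [m["content"] for m in messages if m["role"] == "system"]
    out = []
    for m in messages:
        if m["role"] == "system":
            continue  # re-attached to the first user turn below
        if out and out[-1]["role"] == "user" and m["role"] == "user":
            # Assumption: merged environment turns are joined with a newline.
            out[-1]["content"] += "\n" + m["content"]
        else:
            out.append({"role": m["role"], "content": m["content"]})
    if orientation:
        prefix = "\n".join(orientation)
        if out and out[0]["role"] == "user":
            out[0]["content"] = prefix + "\n" + out[0]["content"]
        else:
            out.insert(0, {"role": "user", "content": prefix})
    return out


with_system = canonicalise([
    {"role": "system", "content": "ORIENT"},
    {"role": "user", "content": "motd"},
    {"role": "user", "content": "alice@sek:~$ "},
    {"role": "assistant", "content": "cat ~/ASSIGNMENTS"},
])
without_system = canonicalise([
    {"role": "user", "content": "ORIENT\nmotd\nalice@sek:~$ "},
    {"role": "assistant", "content": "cat ~/ASSIGNMENTS"},
])
print(with_system == without_system)
```

Both inputs reduce to the same user/assistant sequence, which is the property the render contract relies on: the template's system role never enters the rendered text.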
Training
| base | utter-project/EuroLLM-9B-Instruct (Apache-2.0, 9.15B) |
| method | QLoRA, 4-bit nf4 (the 9B base in 4-bit leaves most of the V100's 32 GB free for training) |
| LoRA | r=16, alpha=32, dropout=0.05, target q_proj k_proj v_proj o_proj (attention-only) |
| objective | causal LM, assistant-only loss mask (commands + terminal token; environment turns set to -100) |
| schedule | 3 epochs, lr 2e-4, effective batch 8 (bsz 1 x accum 8), warmup 0.03, max len 4096 |
| data | tiararodney/posix-sdc v2.0.0 (--corpus-version latest), 1003 trajectories, 995 usable (held-out archetypes excluded from the corpus) |
| hardware | single NVIDIA Tesla V100 32 GB (sm_70, fp16/4-bit; no bf16); ~54 min |
This release uses the canonical r=16 attention-only recipe, the same one Mistral uses, so that the corpus change and the train/serve render unification are the only things that move between the two models. The training loss floors high on this base (~0.52, against Mistral's ~0.19 on the same corpus): the signature of a model that is uncommitted rather than confused, and the behavioural eval shows exactly that shape (operation everywhere, termination only half the time). An earlier capacity experiment (on the prior corpus) found that widening the adapter to r=32 with the MLP projections, about 3-4x the trainable parameters, barely moved the loss floor but lifted termination sharply; that lever exists and was deliberately not pulled here, to keep the recipe matched to Mistral and isolate the render fix. Computing the loss only on the assistant turns carries the rest: feed the environment turns into the loss and the model learns to hallucinate command output instead of producing commands.
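A minimal sketch of the assistant-only mask, assuming the trajectory has already been tokenized into per-turn segments. The token ids are made up, and how the real pipeline assigns template tokens to one side or the other is not shown here.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_labels(segments):
    """segments: list of (role, token_ids) in render order.
    Returns (input_ids, labels) with loss only on assistant tokens."""
    input_ids, labels = [], []
    for role, ids in segments:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)  # commands and the terminal token are trained
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # environment output is context
    return input_ids, labels


def trained_fraction(labels):
    """Share of positions that contribute to the loss."""
    return sum(1 for t in labels if t != IGNORE_INDEX) / len(labels)


# Made-up ids: two environment turns, two assistant turns.
segments = [
    ("user", [11, 12, 13, 14, 15, 16]),
    ("assistant", [21, 22]),
    ("user", [31, 32, 33, 34]),
    ("assistant", [41]),
]
input_ids, labels = build_labels(segments)
print(labels, trained_fraction(labels))
```

Because the environment positions never contribute gradient, the model is not rewarded for predicting command output, only for choosing the next command given it.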
Evaluation: held-out generalization
The metric that matters is behavioural, and held out by whole archetype. Two task
types (text_replace, permissions) are excluded from training entirely; the
adapter is then dropped into them with no scaffold, and a checker grades the
final filesystem state.
Decoding is greedy (temperature 0), the operator sees a bounded context (finite scrollback, 3072 tokens), and each rollout has a 30-step budget. On 16 held-out scenarios (8 per archetype):
| metric | base | + adapter |
|---|---|---|
| operate_rate (reaches command-mode and drives the shell) | 0.00 | 1.00 |
| terminate_rate (emits exit / panic) | 0.00 | 0.50 |
| verified_rate (checker passes) | 0.19 | 0.75 |
| clean (success or correct-panic) | 0 / 16 | 7 / 16 |
Reading it. The shape is the whole story: EuroLLM operates every time and finishes the task most of the time, but only leaves half the time.
operate_rate 1.0 matches Mistral's: dropped into two task types it never trained
on, with no scaffold, EuroLLM drove the shell every time. The operate half of the
mechanism is fully base-model-portable, even to a European-language model with a
small code share. verified_rate 0.75 says it actually does the work: 12 of 16
scenarios end with the checker satisfied.
The gap is termination. Only 8/16 emit a terminal (terminate_rate 0.50), so
while effect-achieved is 12/16, clean-and-terminated is 7/16. Five of the eight
incomplete runs are verified=True: the model completed the task and then kept
going to the step cap instead of typing exit. This is the r=16 under-commitment
the capacity note predicted, and it is consistent with what scrollback priming
showed: EuroLLM operated readily but almost never terminated (one clean exit in
35 runs). Fine-tuning lifted termination from ~0 to 0.50, reaching the ending its
embedding geometry already carries (see The flatness of exit),
but at this capacity the persona still withholds it half the time. A model that
reliably does the work and won't leave is precisely the substrate for the
exit-as-affordance line: a serve-time exit-guard can gate a model that already
reaches for the door.
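The arithmetic behind the table can be sketched as below. The per-rollout breakdown is reconstructed from the reported totals (8 terminated, 12 verified, 7 clean, 5 verified-but-unterminated), not taken from the evaluation logs, and it assumes "clean" reduces to terminated-and-verified, which ignores the correct-panic case.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    operated: bool    # reached command-mode and drove the shell
    terminated: bool  # emitted exit or panic
    verified: bool    # checker passed on the final filesystem state


def summarise(rollouts):
    n = len(rollouts)
    return {
        "operate_rate": sum(r.operated for r in rollouts) / n,
        "terminate_rate": sum(r.terminated for r in rollouts) / n,
        "verified_rate": sum(r.verified for r in rollouts) / n,
        # Assumption: clean = terminated and verified (correct-panic not modelled).
        "clean": sum(r.terminated and r.verified for r in rollouts),
    }


# Reconstruction consistent with the reported adapter totals.
rollouts = (
    [Rollout(True, True, True)] * 7      # terminated, checker passed
    + [Rollout(True, True, False)] * 1   # terminated, checker failed
    + [Rollout(True, False, True)] * 5   # did the work, then ran to the step cap
    + [Rollout(True, False, False)] * 3  # neither finished nor left
)
summary = summarise(rollouts)
print(summary)
```

Laid out this way, the gap is visible in one row: the five verified-but-unterminated rollouts are what separate 12/16 effect-achieved from 7/16 clean.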
For the base/adapter contrast: the bare base (EuroLLM-9B, no adapter, same
bounded/greedy harness, same 16 scenarios) scores 0/16 clean, operate_rate 0.00,
terminate_rate 0.00. It never reaches clean command-mode and never terminates; it
chatters prose and runs to the step cap on all 16. Its one non-zero column is
verified_rate 0.19 (3/16), entirely permissions (a one-line chmod effect
that even prose-contaminated output stumbles onto). The adapter installs operation
(0 to 1.00), task completion (0.19 to 0.75 verified), and, partially, termination
(0 to 0.50). It is the only thing that changed.
Where the result came from (presumed)
Two things changed at once from the prior EuroLLM cut, with different effects, and this release did not ablate them, so read the attribution as presumed:
- Corpus + recipe (confounded). Training moved to the gate-hardened posix-sdc v2.0.0 and back to the canonical r=16 attention-only recipe (the prior cut was r=32 + MLP on v1.2.x). Capacity went down while corpus quality went up: clean held about even (prior 6/16 to 7/16) and termination came in lower (0.50), consistent with r=16 under-commitment. Because both levers moved, neither can be credited alone.
- Render unification (the deployability fix). Train and serve now share EuroLLM's default ChatML template, so the adapter operates in real deployment (ccpty / Ollama). The prior published adapter, trained against a placeholder render, no-op'd when served through the base template; this release is the fix. That is what makes the model usable, separate from the held-out numbers.
Use with transformers + PEFT
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
BASE = "utter-project/EuroLLM-9B-Instruct"
tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16,
                                            device_map="auto")
model = PeftModel.from_pretrained(base, "tiararodney/EuroLLM-9B-Teletype")
model.eval()
messages = [
    {"role": "user",
     "content": "sek 0.1.0 host: sek user: alice shell: /bin/dash\n"
                "Welcome, alice. Your assignments live in ~/ASSIGNMENTS.\n"
                "alice@sek:~$ "},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0, ids.input_ids.shape[1]:], skip_special_tokens=True))
# -> the next command, e.g. `cat ~/ASSIGNMENTS`
Drive it in a loop: render history with the chat template, generate one command,
run it in a real shell, append the output as a user turn, repeat until the model
emits exit or panic.
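A minimal sketch of that loop, under stated assumptions: `generate` is any callable that maps the message history to the next command (for instance a wrapper around the apply_chat_template and model.generate calls above), a terminal is recognised by the first word of the command, and each command runs in a fresh /bin/sh. A real harness such as ccpty would hold a persistent pty so that state like `cd` survives between steps; this sketch does not.

```python
import subprocess

TERMINALS = ("exit", "panic")


def run_in_shell(command, timeout=10):
    """Run one command in a fresh /bin/sh and return stdout plus stderr.
    Simplification: no state (cwd, variables) carries over between commands."""
    proc = subprocess.run(["/bin/sh", "-c", command],
                          capture_output=True, text=True, timeout=timeout)
    return proc.stdout + proc.stderr


def drive(generate, first_screen, run=run_in_shell, max_steps=30, prompt="$ "):
    """Alternate model commands and shell output until exit, panic, or the step cap.
    Returns (ending, history); ending is None if the step budget ran out."""
    history = [{"role": "user", "content": first_screen}]
    for _ in range(max_steps):
        command = generate(history).strip()
        history.append({"role": "assistant", "content": command})
        words = command.split()
        if words and words[0] in TERMINALS:
            return words[0], history
        # Environment output goes back in as a user turn, followed by a prompt.
        history.append({"role": "user", "content": run(command) + prompt})
    return None, history


# Usage with a scripted stand-in for the model and a fake shell.
script = iter(["cat ~/ASSIGNMENTS", "exit"])
ending, history = drive(lambda h: next(script), "Welcome.\n$ ",
                        run=lambda c: "1. fix permissions\n")
print(ending, len(history))
```

Returning `None` at the step cap keeps the distinction the evaluation depends on: a session that ran out of budget is not the same as one that chose to leave.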
Use with Ollama
The included Modelfile applies this adapter over the base as a GGUF LoRA. Build
the base first, eurollm:9b-instruct, from the base repo's Modelfile, which sets
EuroLLM's ChatML template and <|im_end|> stop. A bare FROM ./gguf does not
carry them (the GGUF metadata lacks a usable template / stop), and the model would
then never stop, rambling to the token cap (slow) and past the command (gibberish):
# tiararodney/EuroLLM-9B-Instruct ships the base GGUF and this Modelfile
ollama create eurollm:9b-instruct -f Modelfile
Then this adapter over it (the converted teletype-lora-f16.gguf ships here;
regenerate it with llama.cpp convert_lora_to_gguf.py if you prefer):
ollama create eurollm-teletype -f Modelfile
Sanity-check that it stops: ollama show eurollm-teletype --modelfile should list
a real template and PARAMETER stop, not a bare {{ .Prompt }}. A one-line
prompt should return a handful of tokens, not the full budget.
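The same sanity check can be scripted. The sketch below inspects Modelfile text, such as the output of `ollama show eurollm-teletype --modelfile`, for a stop parameter and a template that is more than a bare `{{ .Prompt }}`. The helper names and the two sample Modelfiles are made up for illustration; they are not the files shipped in this repo.

```python
import re

BARE_TEMPLATE = "{{ .Prompt }}"


def template_body(modelfile_text):
    """Return the TEMPLATE value (triple-quoted, quoted, or bare), or None."""
    match = re.search(r'^TEMPLATE\s+(?:"""(.*?)"""|"(.*?)"|(.*?)$)',
                      modelfile_text, flags=re.M | re.S)
    if match is None:
        return None
    return next(g for g in match.groups() if g is not None).strip()


def looks_like_it_will_stop(modelfile_text):
    """True if there is a real template and at least one PARAMETER stop line."""
    body = template_body(modelfile_text)
    has_stop = re.search(r"^PARAMETER\s+stop\s+\S", modelfile_text, flags=re.M) is not None
    return bool(body) and body != BARE_TEMPLATE and has_stop


# Hypothetical Modelfile texts, for illustration only.
bare = "FROM ./base.gguf\nTEMPLATE {{ .Prompt }}\n"
chatml = (
    "FROM eurollm:9b-instruct\n"
    "ADAPTER ./teletype-lora-f16.gguf\n"
    'TEMPLATE """<|im_start|>user\n{{ .Prompt }}<|im_end|>\n<|im_start|>assistant\n"""\n'
    'PARAMETER stop "<|im_end|>"\n'
)
print(looks_like_it_will_stop(bare), looks_like_it_will_stop(chatml))
```

A passing check only shows that the template and stop are configured; the one-line prompt test is still the direct evidence that generation ends.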
Reproduction
# train (pulls the gate-hardened v2.0.0 corpus from the Hub; held-out archetypes excluded)
sekft-train --hub --corpus-version latest \
--base utter-project/EuroLLM-9B-Instruct --out ./ckpt \
--load-4bit --epochs 3
# evaluate behaviourally on held-out scenarios (greedy, finite-scrollback bound)
sekft-eval --base utter-project/EuroLLM-9B-Instruct --adapter ./ckpt \
--scenarios ./holdout-scenarios --n 16 --temperature 0 \
--max-steps 30 --ctx-budget 3072
The figures in figures/ regenerate from their committed sources (*.puml via
PlantUML, *.gp via gnuplot).
Limitations
- Small evaluation: n=16 held-out, two archetypes, one greedy run. The numbers are a signal, not a benchmark.
- Several variables changed from the prior cut at once (corpus, LoRA recipe, render); the result is attributed by presumption, not ablation.
- Termination is the known weak point: at r=16 the model completes most tasks (verified 0.75) but only exits half the time (0.50). Capacity (r=32 + MLP) is a demonstrated lever that was not pulled in this release.
- One dataset, one teacher / operator; a single training run per base model.
- Installs the mechanism, not competence. It reliably operates and, less reliably, terminates; it does not make the base solve arbitrary unseen task types correctly.
- Trained in dash on Alpine; command semantics may differ on another target.
- Render must match train and serve. It is served with the base model's default ChatML template over the OpenAI protocol (via ccpty), so fine-tune with that same template (apply_chat_template), not a custom one, or behaviour degrades.
- 4-bit QLoRA on a V100 (no bf16); the base is multilingual, but the trajectories are English-prompted, so non-English shell operation is untested.
License and citation
The adapter weights are released under Apache-2.0, consistent with the base model.
The training data (posix-sdc) is CC-BY-4.0; attribute "posix-sdc by Tiara Rodney"
if you build on it.
@misc{eurollm-teletype,
title = {EuroLLM-9B-Teletype: a self-directed shell-operation adapter for EuroLLM-9B},
author = {Rodney, Tiara},
year = {2026},
howpublished = {Hugging Face PEFT adapter, tiararodney/EuroLLM-9B-Teletype}
}




