Instructions for using LLMWildling/gpt-oss-180b-goomba with libraries and local apps follow.

Libraries

How to use LLMWildling/gpt-oss-180b-goomba with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="LLMWildling/gpt-oss-180b-goomba")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("LLMWildling/gpt-oss-180b-goomba")
model = AutoModelForCausalLM.from_pretrained("LLMWildling/gpt-oss-180b-goomba", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use LLMWildling/gpt-oss-180b-goomba with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "LLMWildling/gpt-oss-180b-goomba"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LLMWildling/gpt-oss-180b-goomba",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
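
The same request can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `post_chat` are illustrative and not part of vLLM, and the sketch assumes the server started above is listening on localhost:8000. The network call is left commented out so the snippet runs without a server.

```python
import json
import urllib.request

MODEL_ID = "LLMWildling/gpt-oss-180b-goomba"


def build_chat_payload(prompt, model=MODEL_ID):
    """Build the JSON body for an OpenAI-compatible chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def post_chat(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/chat/completions and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_payload("What is the capital of France?")
print(json.dumps(payload, indent=2))

# With a server running, the reply text would be read like this:
# reply = post_chat(payload)
# print(reply["choices"][0]["message"]["content"])
```

If the server was started with a different `--served-model-name`, pass that name as the `model` argument instead of the repository ID.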

Use Docker

docker model run hf.co/LLMWildling/gpt-oss-180b-goomba

SGLang

How to use LLMWildling/gpt-oss-180b-goomba with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "LLMWildling/gpt-oss-180b-goomba" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LLMWildling/gpt-oss-180b-goomba",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "LLMWildling/gpt-oss-180b-goomba" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LLMWildling/gpt-oss-180b-goomba",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use LLMWildling/gpt-oss-180b-goomba with Docker Model Runner:
```
docker model run hf.co/LLMWildling/gpt-oss-180b-goomba
```

gpt-oss-180b-goomba

gpt-oss-180b-goomba is an agentic coding model derived from GPT-OSS 120B.

Goomba expands the GPT-OSS 120B base with additional specialist MoE capacity and is intended for agentic coding, repository work, SWE-style tasks, and tool-using automation.

Goomba is the first release in this line to use a new post-training data formulation. It differs completely from the previous releases and is much stronger than them at tool calling, raw SWE-style coding, and math-assisted reasoning.

This model was trained on just two GPUs.

Overview

Base model: openai/gpt-oss-120b
Approx total parameters: 181B
Approx active parameters: 16.5B per token at top-k=16
Total expert rows: 200
Added specialist experts: 72
Format: MXFP4
Out-of-box active experts: top-k=16
Intended use: agentic coding, SWE-style workflows, repository exploration, tool-using automation, raw SWE coding, math-assisted coding
Status: research preview
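
The figures in this overview can be cross-checked with a little arithmetic. The sketch below assumes that "expert rows" means experts per MoE layer and that the base model has 128 of them; that base count comes from the public openai/gpt-oss-120b configuration, not from this card. All other numbers are copied from the list above.

```python
# Figures from the overview above.
TOTAL_EXPERT_ROWS = 200
ADDED_SPECIALIST_EXPERTS = 72
ACTIVE_EXPERTS_TOP_K = 16
TOTAL_PARAMS_B = 181.0   # approximate, in billions
ACTIVE_PARAMS_B = 16.5   # approximate, in billions

# Assumption: not stated in this card; taken from the base model's public config.
BASE_EXPERTS = 128

# Base experts plus added specialists should account for every expert row.
assert BASE_EXPERTS + ADDED_SPECIALIST_EXPERTS == TOTAL_EXPERT_ROWS

expert_fraction = ACTIVE_EXPERTS_TOP_K / TOTAL_EXPERT_ROWS
param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(f"Experts routed per token: {expert_fraction:.1%}")        # 8.0%
print(f"Parameters active per token: {param_fraction:.1%}")      # 9.1%
```

The active-parameter share is somewhat higher than the routed-expert share, which would be expected if non-expert weights such as attention and embeddings are used for every token.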

Recommended vLLM

This model was primarily tested with vLLM using the GPT-OSS reasoning parser and OpenAI tool-call parser.

vllm serve /path/to/model \
  --served-model-name vllm/doobee \
  --tensor-parallel-size 2 \
  --max-model-len 60000 \
  --gpu-memory-utilization 0.88 \
  --enforce-eager \
  --trust-remote-code \
  --reasoning-parser openai_gptoss \
  --tool-call-parser openai \
  --enable-auto-tool-choice

Recommended parameters:

num_experts_per_tok=16 is already set in config.json
tensor-parallel-size=2
max-model-len=60000
gpu-memory-utilization=0.88
reasoning-parser=openai_gptoss
tool-call-parser=openai
enable-auto-tool-choice
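
When the server is launched from a script, the recommended parameters can be kept in one place and turned into an argument list. This is a minimal sketch; `RECOMMENDED_FLAGS` and `build_vllm_command` are hypothetical names introduced here, and the flags are limited to those in the list above (the served model name, `--enforce-eager`, and `--trust-remote-code` from the full command are left out).

```python
import shlex

# Flag values copied from the recommended parameters above.
RECOMMENDED_FLAGS = {
    "tensor-parallel-size": 2,
    "max-model-len": 60000,
    "gpu-memory-utilization": 0.88,
    "reasoning-parser": "openai_gptoss",
    "tool-call-parser": "openai",
    "enable-auto-tool-choice": True,  # boolean switch, takes no value
}


def build_vllm_command(model_path, flags=RECOMMENDED_FLAGS):
    """Return the `vllm serve` argument list for the given model path."""
    args = ["vllm", "serve", model_path]
    for name, value in flags.items():
        if value is True:
            args.append(f"--{name}")
        else:
            args.extend([f"--{name}", str(value)])
    return args


command = build_vllm_command("/path/to/model")
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command)` as is, which avoids shell quoting issues.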

The config ships with both num_experts_per_tok=16 and experts_per_token=16, so runtimes that respect the model config should use top-k 16 automatically. If your runtime overrides or ignores those fields, pass this explicitly:

--hf-overrides '{"num_experts_per_tok": 16}'
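
Before relying on the defaults, it can help to confirm what the shipped config actually contains and to generate the override string programmatically. The sketch below is illustrative: `read_top_k` and `hf_overrides_for` are hypothetical helpers, and `example_config` is a stand-in that reproduces only the two fields named above rather than the full config.json.

```python
import json


def read_top_k(config):
    """Return the top-k values found under both field names (None if absent)."""
    return {
        "num_experts_per_tok": config.get("num_experts_per_tok"),
        "experts_per_token": config.get("experts_per_token"),
    }


def hf_overrides_for(top_k):
    """Build the JSON string to pass to --hf-overrides."""
    return json.dumps({"num_experts_per_tok": top_k})


# Stand-in for the parsed contents of /path/to/model/config.json.
example_config = {"num_experts_per_tok": 16, "experts_per_token": 16}

print(read_top_k(example_config))
print(hf_overrides_for(16))
```

To check a real download, replace `example_config` with the result of `json.load` on the model's config.json.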

Tool Calling

Goomba was primarily tested as an agentic coding model. Basic OpenAI-compatible tool calling is expected to work best with the vLLM GPT-OSS reasoning parser and OpenAI tool-call parser enabled.

Suggested temperatures:

0.3 for steady coding-agent work
0.5 for broader agentic exploration

Recommended range: 0.3-0.5.
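
A tool-calling request that follows these temperature suggestions might be built as below. This is a sketch under stated assumptions: the `list_directory` tool is hypothetical and defined only for illustration, `build_tool_request` is not part of any library, and the model name assumes the server was started with the `--served-model-name` shown in the recommended command. The range check is a convenience of this sketch, not a server requirement.

```python
MODEL_NAME = "vllm/doobee"  # assumes the --served-model-name from the recommended command

# Hypothetical tool, defined only for illustration.
LIST_DIRECTORY_TOOL = {
    "type": "function",
    "function": {
        "name": "list_directory",
        "description": "List the files and folders at a path in the repository.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_tool_request(prompt, temperature=0.3):
    """Build a chat completion body with one tool and a temperature in 0.3-0.5."""
    if not 0.3 <= temperature <= 0.5:
        raise ValueError("temperature is outside the recommended 0.3-0.5 range")
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [LIST_DIRECTORY_TOOL],
        "tool_choice": "auto",
        "temperature": temperature,
    }


request_body = build_tool_request("Show me what is in the src directory.")
print(request_body["temperature"], request_body["tools"][0]["function"]["name"])
```

The body can be sent to the server's `/v1/chat/completions` endpoint; pass `temperature=0.5` for broader exploration.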

For repository exploration tasks, use an agent prompt that asks the model to inspect subdirectories, identify entry points, and summarize the project structure rather than stopping after a single directory listing.
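
One way to phrase such an agent prompt is sketched below. The wording is illustrative only, since the card does not prescribe an exact prompt, and `build_exploration_messages` is a hypothetical helper.

```python
# Illustrative wording only; the model card does not prescribe an exact prompt.
REPO_EXPLORATION_PROMPT = (
    "You are a coding agent exploring a repository. "
    "Do not stop after a single directory listing. "
    "Inspect subdirectories, identify the entry points, "
    "and finish with a summary of the project structure."
)


def build_exploration_messages(task):
    """Pair the exploration system prompt with the user's task."""
    return [
        {"role": "system", "content": REPO_EXPLORATION_PROMPT},
        {"role": "user", "content": task},
    ]


messages = build_exploration_messages("Explore the repository in the current directory.")
for message in messages:
    print(f"{message['role']}: {message['content']}")
```

The returned list can be used as the `messages` field of a chat completion request.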

License

Replace the placeholder "license: other" metadata with the actual license you intend to publish under, after confirming that it is compatible with the base model and your added weights.

Safetensors

Model size: 187B params
Tensor type: BF16, F32

Model tree for LLMWildling/gpt-oss-180b-goomba

Base model: openai/gpt-oss-120b
Quantized (121): this model