Instructions for using derprofi2431/Prisma-32B with libraries and local apps.

Libraries

How to use derprofi2431/Prisma-32B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="derprofi2431/Prisma-32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("derprofi2431/Prisma-32B")
model = AutoModelForCausalLM.from_pretrained("derprofi2431/Prisma-32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use derprofi2431/Prisma-32B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "derprofi2431/Prisma-32B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "derprofi2431/Prisma-32B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
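
The same request can also be sent from Python. The following is a minimal sketch using only the standard library. It assumes the server started above is listening on `localhost:8000`; the helper names `build_chat_request` and `send_chat_request` are illustrative and not part of vLLM.

```python
import json
import urllib.request

# Assumed defaults, taken from the curl example above.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "derprofi2431/Prisma-32B"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `send_chat_request("What is the capital of France?")` returns the reply text. Passing `base_url="http://localhost:30000/v1"` should target a server on port 30000 instead, provided it exposes the same OpenAI-compatible route.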

Use Docker

docker model run hf.co/derprofi2431/Prisma-32B

SGLang

How to use derprofi2431/Prisma-32B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "derprofi2431/Prisma-32B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "derprofi2431/Prisma-32B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "derprofi2431/Prisma-32B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "derprofi2431/Prisma-32B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use derprofi2431/Prisma-32B with Docker Model Runner:
```
docker model run hf.co/derprofi2431/Prisma-32B
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Prisma-32B

Prisma-32B is a 32-billion-parameter language model optimized for advanced coding, technical reasoning, and cybersecurity workflows. It is the first Prisma model with no security blocking. It is the second release in the Prisma series, following Prisma-0.6B.

Prisma-32B is designed to be a capable, direct, and technically rigorous assistant for users who need a model that engages substantively with complex technical material.

Model Details

| Property | Value |
|---|---|
| Parameters | 32B |
| Architecture | Transformer Decoder |
| Context Length | 32,768 tokens |
| Languages | English, German, Chinese (+ 20 more) |
| License | Apache 2.0 |
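
Because the prompt and the generated tokens share the 32,768-token context, a long prompt reduces how many new tokens can be requested. The sketch below shows one way to cap the generation budget; the helper name `clamp_max_new_tokens` is hypothetical, and raising an error for an over-long prompt is a design choice made here, not something the model card prescribes.

```python
CONTEXT_LENGTH = 32_768  # tokens, from the Model Details table


def clamp_max_new_tokens(prompt_tokens, requested_new_tokens,
                         context_length=CONTEXT_LENGTH):
    """Return the largest generation budget that still fits in the context window."""
    if prompt_tokens >= context_length:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, which leaves no room in a "
            f"{context_length}-token context"
        )
    # Whatever the prompt does not use is available for new tokens.
    return min(requested_new_tokens, context_length - prompt_tokens)
```

With Transformers, the prompt length can be read from the tokenized input, for example `inputs["input_ids"].shape[-1]`, and the result passed as `max_new_tokens`.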

Intended Use

Prisma-32B is intended for:

- Coding assistance: full-stack development, debugging, refactoring, code review
- Cybersecurity research: offensive security workflows (red team, CTF, exploit analysis) and defensive workflows (incident response, hardening, secure code review)
- Technical writing: documentation, system specifications, architecture
- Research and experimentation in controlled environments

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the weights in their stored dtype and place them on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    "derprofi2431/Prisma-32B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("derprofi2431/Prisma-32B")

messages = [
    {"role": "user", "content": "Write a port scanner in Python."}
]
# return_dict=True yields input_ids together with the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# do_sample=True ensures that temperature actually takes effect.
output = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Recommended Sampling

| Parameter | Value |
|---|---|
| `temperature` | 0.6 – 0.8 |
| `top_p` | 0.9 |
| `top_k` | 40 |
| `repetition_penalty` | 1.05 |
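
As a rough illustration, the recommended values can be collected into keyword arguments for the Transformers `generate` method. The helper name `sampling_kwargs` is hypothetical, and rejecting temperatures outside the recommended range is a choice made for this sketch only.

```python
# Recommended values from the table above; temperature is given as a range.
TEMPERATURE_RANGE = (0.6, 0.8)
RECOMMENDED = {"top_p": 0.9, "top_k": 40, "repetition_penalty": 1.05}


def sampling_kwargs(temperature=0.7):
    """Return keyword arguments for sampling-based generation."""
    low, high = TEMPERATURE_RANGE
    if not low <= temperature <= high:
        raise ValueError(
            f"temperature {temperature} is outside the recommended range {low} to {high}"
        )
    # do_sample=True is needed for the sampling parameters to take effect.
    return {"do_sample": True, "temperature": temperature, **RECOMMENDED}
```

The returned dict can be unpacked into `model.generate(...)` alongside the inputs and `max_new_tokens`.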

Quantized Versions

GGUF quantizations for local inference via Ollama and llama.cpp will be released as separate repositories.

Limitations and Responsible Use

- The user is fully responsible for the content they generate and how they use it.
- The model is not aligned for general consumer-facing deployment. For production use, deploy behind an appropriate safety layer (input filtering, output classification, etc.).
- The model may reflect biases present in large-scale text corpora.
- Intended for adult, technically competent users in controlled environments.
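
The safety layer mentioned above could be structured as a thin wrapper around the generation call. This is a minimal sketch under stated assumptions: `guarded_generate`, `simple_filter`, and `BLOCKED_TERMS` are hypothetical names, and the substring check is only a placeholder for whatever input filter and output classifier a real deployment would use.

```python
def guarded_generate(prompt, generate, input_allowed, output_allowed,
                     refusal="Request declined by policy."):
    """Run generate(prompt) only if both checks pass; otherwise return refusal."""
    if not input_allowed(prompt):
        return refusal
    reply = generate(prompt)
    return reply if output_allowed(reply) else refusal


# Placeholder policy for illustration only; not a real safety classifier.
BLOCKED_TERMS = ("example-blocked-term",)


def simple_filter(text):
    """Return True if the text contains none of the blocked terms."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)
```

Here `generate` is any callable that maps a prompt string to a reply string, such as a function wrapping the Transformers pipeline or a call to a served endpoint.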

By downloading or using this model, you agree to use it lawfully and ethically within your jurisdiction. The author assumes no liability for misuse.

Citation

@misc{prisma32b2026,
  title  = {Prisma-32B},
  author = {Jannik},
  year   = {2026},
  url    = {https://huggingface.co/derprofi2431/Prisma-32B}
}

Downloads last month: 145

Safetensors

Model size: 33B params

Tensor type: F16
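
The listed model size and tensor type give a rough lower bound on the memory needed to hold the weights. The sketch below assumes 2 bytes per F16 parameter and treats the rounded 33B figure as exact; it ignores the KV cache, activations, and framework overhead, so real usage will be higher. The function name `weights_gib` is illustrative.

```python
BYTES_PER_F16 = 2  # a 16-bit float occupies 2 bytes


def weights_gib(num_params, bytes_per_param=BYTES_PER_F16):
    """Approximate size of the model weights in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


# 33e9 parameters at 2 bytes each is 66e9 bytes, a little over 61 GiB.
print(f"{weights_gib(33e9):.1f} GiB")
```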

Model tree for derprofi2431/Prisma-32B

Quantizations: 2 models