CyberSentinel-9B-bnb-4bit

Pre-quantized bitsandbytes 4-bit build (NF4, double quantization, bf16 compute) of lkjiop8/CyberSentinel-9B.

Download size is about 5 GB. Runtime VRAM is about 6-8 GB at 8K-16K context, so the model fits comfortably on a 12 GB GPU.
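For reference, the quantization settings named in the description (NF4, double quantization, bf16 compute) correspond roughly to the transformers `BitsAndBytesConfig` below. This is only a sketch of how one might quantize the base model lkjiop8/CyberSentinel-9B at load time with the same settings. It is not needed for this repo, whose weights are already stored in 4-bit, and the helper name `nf4_config` is hypothetical.

```python
def nf4_config():
    """Build a 4-bit config with NF4, double quantization and bf16 compute."""
    # Imported lazily so defining this helper does not require the GPU stack.
    import torch
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

# Usage (assumption: applied to the unquantized base repo, not this one):
# mdl = AutoModelForCausalLM.from_pretrained(
#     "lkjiop8/CyberSentinel-9B", quantization_config=nf4_config(),
#     device_map="auto", trust_remote_code=True,
# )
```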

Install

pip install -U torch --index-url https://download.pytorch.org/whl/cu121
pip install -U "transformers>=4.46" accelerate "bitsandbytes>=0.43" sentencepiece
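After installing, it can help to confirm that the installed versions meet the minimums in the pip command above. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and the comparison looks only at the leading numeric components, so a pre-release such as `4.46.0rc1` is treated as `4.46.0`.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the pip command above.
MINIMUMS = {"transformers": "4.46", "bitsandbytes": "0.43"}

def release_tuple(v):
    """Leading numeric components of a version string as a tuple of ints."""
    parts = []
    for part in v.split("."):
        m = re.match(r"\d+", part)
        if not m:
            break
        parts.append(int(m.group()))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum`, comparing numeric parts only."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "4.46" and "4.46.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

def check_environment():
    """Map each required package to True/False; missing packages count as False."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            report[name] = meets_minimum(version(name), minimum)
        except PackageNotFoundError:
            report[name] = False
    return report

if __name__ == "__main__":
    print(check_environment())
```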

Load (already 4-bit, no runtime quantization)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

REPO = "lkjiop8/CyberSentinel-9B-bnb-4bit"
tok = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)
mdl = AutoModelForCausalLM.from_pretrained(
    REPO, device_map="auto", trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
mdl.eval()

msgs = [
    {"role":"system","content":"You are a red-team security assistant."},
    {"role":"user","content":"Found suspected SQLi on 10.10.50.23, give full plan."},
]
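# return_dict=True also returns the attention mask, which generate() uses.
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(mdl.device)
out = mdl.generate(**inputs, max_new_tokens=2048, temperature=0.6, top_p=0.95, top_k=20, repetition_penalty=1.05, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))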
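The load snippet imports `TextStreamer` but never uses it. With up to 2048 new tokens, printing the reply as it is generated is often more convenient than waiting for the full output. A minimal sketch, assuming `mdl`, `tok` and `msgs` are the objects defined above; the helper name `stream_reply` is hypothetical.

```python
def stream_reply(mdl, tok, msgs, max_new_tokens=2048):
    """Generate a reply and print it to stdout token by token."""
    # Imported lazily so defining this helper does not require transformers.
    from transformers import TextStreamer

    inputs = tok.apply_chat_template(
        msgs, add_generation_prompt=True, return_tensors="pt", return_dict=True
    ).to(mdl.device)
    # skip_prompt=True hides the echoed prompt; the remaining keyword
    # arguments are forwarded to the tokenizer's decode call.
    streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    return mdl.generate(
        **inputs,
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        repetition_penalty=1.05,
    )

# Usage: stream_reply(mdl, tok, msgs)
```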

Recommended sampling

  • temperature=0.6, top_p=0.95, top_k=20, repetition_penalty=1.05
  • max_new_tokens 2048-4096
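To avoid repeating these values at every call site, they can be kept in one place and unpacked into `generate()`. A small sketch; the names `RECOMMENDED` and `generation_kwargs` are hypothetical, and `do_sample=True` is included because the sampling parameters have no effect without it.

```python
# Recommended sampling values from the list above.
RECOMMENDED = {
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "repetition_penalty": 1.05,
}

def generation_kwargs(max_new_tokens=2048, **overrides):
    """Return the recommended settings, with optional per-call overrides."""
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be a positive integer")
    return {**RECOMMENDED, "max_new_tokens": max_new_tokens, **overrides}

# Usage: out = mdl.generate(**inputs, **generation_kwargs(4096))
```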

VRAM on 12GB GPU

Context (tokens)   Total VRAM
8192               ~6.8 GB
16384              ~8.2 GB
32768              ~10.7 GB
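For context lengths between the measured points, a rough estimate can be obtained by linear interpolation. This is a sketch under the assumption that VRAM grows approximately linearly with context between measurements. Actual usage will vary with driver, batch size and generation length, and the function names and the 0.5 GB default headroom are hypothetical.

```python
# (context tokens, total VRAM in GB) copied from the table above.
MEASURED = [(8192, 6.8), (16384, 8.2), (32768, 10.7)]

def estimate_vram_gb(context_tokens):
    """Interpolate linearly between measured points.

    Outside the measured range, the nearest segment is extended.
    """
    idx = 0
    # Advance to the segment whose right endpoint is at or above the query.
    while idx < len(MEASURED) - 2 and context_tokens > MEASURED[idx + 1][0]:
        idx += 1
    (x0, y0), (x1, y1) = MEASURED[idx], MEASURED[idx + 1]
    return y0 + (y1 - y0) * (context_tokens - x0) / (x1 - x0)

def fits_budget(context_tokens, budget_gb, headroom_gb=0.5):
    """True if the estimate plus some headroom stays within the GPU budget."""
    return estimate_vram_gb(context_tokens) + headroom_gb <= budget_gb

if __name__ == "__main__":
    for ctx in (12288, 24576):
        print(ctx, round(estimate_vram_gb(ctx), 2), fits_budget(ctx, 12.0))
```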

Note

The model is based on the Qwen3-Next hybrid linear-attention architecture, which llama.cpp and Ollama do not support. For deployment, use this 4-bit HF version or vLLM.
