Instructions to use seedboxai/KafkaLM-15B-Base with libraries and local apps.

Libraries

How to use seedboxai/KafkaLM-15B-Base with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="seedboxai/KafkaLM-15B-Base")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("seedboxai/KafkaLM-15B-Base")
model = AutoModelForCausalLM.from_pretrained("seedboxai/KafkaLM-15B-Base")
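
Because this is a base model, it continues raw text rather than following a chat template. The following is a minimal generation sketch, assuming `torch`, `transformers`, and `accelerate` are installed and that enough accelerator memory is available for the BF16 weights. The helper name `complete` and the sampling settings are illustrative assumptions, not part of the model card.

```python
MODEL_ID = "seedboxai/KafkaLM-15B-Base"


def complete(prompt: str, max_new_tokens: int = 64) -> str:
    """Continue `prompt` with the base model (hypothetical helper)."""
    # Imported lazily so that defining the helper does not need the GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # the checkpoint is listed as BF16
        device_map="auto",           # needs `accelerate`
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.5,  # assumed value, same as in the curl examples
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `print(complete("Once upon a time,"))` downloads the weights on first use and prints the prompt followed by its continuation.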

Local Apps

vLLM

How to use seedboxai/KafkaLM-15B-Base with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "seedboxai/KafkaLM-15B-Base"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "seedboxai/KafkaLM-15B-Base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
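
The same request can be issued from Python. This is a sketch using only the standard library. The helper names `build_completion_request` and `request_completion` are hypothetical, and the response is assumed to follow the OpenAI completions shape (`choices[0].text`). For the SGLang server, pass `base_url="http://localhost:30000"`.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build the same POST request as the curl example (hypothetical helper)."""
    payload = {
        "model": "seedboxai/KafkaLM-15B-Base",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_completion(prompt, **kwargs):
    """Send the request and return the text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `request_completion("Once upon a time,")` returns the generated continuation as a string.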

Use Docker

docker model run hf.co/seedboxai/KafkaLM-15B-Base

SGLang

How to use seedboxai/KafkaLM-15B-Base with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "seedboxai/KafkaLM-15B-Base" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "seedboxai/KafkaLM-15B-Base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "seedboxai/KafkaLM-15B-Base" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "seedboxai/KafkaLM-15B-Base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use seedboxai/KafkaLM-15B-Base with Docker Model Runner:
```
docker model run hf.co/seedboxai/KafkaLM-15B-Base
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Disclaimer

This is a base model that has undergone aggressive pruning and knowledge distillation. It must be fine‑tuned before it is usable for a specific application.

Model Description

KafkaLM‑15B‑Base is a 15‑billion‑parameter, sparsity‑aware language model distilled from Mistral‑Small‑24B‑Base‑2501.
This experimental model was created in three stages:

| Stage | What we did | Why it matters |
|---|---|---|
| 1. SimplePrune | Applied a hierarchical, hardware‑aware pruning pipeline that combines block‑, channel‑, and layer‑selective pruning with 2:4 structured sparsity (≈ 37.5 % parameter reduction) | Cuts the memory footprint while minimizing perplexity degradation |
| 2. Teacher calibration | Briefly fine‑tuned the unpruned 24 B teacher on a 10 B‑token multilingual European corpus on an AMD MI300A cluster | Produces stable logits and hidden states for distillation |
| 3. Knowledge distillation | Distilled the calibrated teacher into the pruned 15 B student using a fused loss: `L_PooledSquareHead + L_KL + 0.25 * L_CE` | Transfers teacher capabilities effectively with < 15 B tokens (< 2 epochs) on 64 MI300A nodes |
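
The ≈ 37.5 % figure follows directly from the nominal parameter counts. The sketch below reproduces it and estimates weight memory, assuming dense BF16 storage (2 bytes per parameter) and ignoring activations, KV cache, and any compressed sparse format. The function names are illustrative.

```python
def parameter_reduction(teacher_params: float, student_params: float) -> float:
    """Fraction of parameters removed when going from teacher to student."""
    return (teacher_params - student_params) / teacher_params


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (10^9 bytes); BF16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9


teacher, student = 24e9, 15e9  # nominal sizes from the model description
print(f"reduction: {parameter_reduction(teacher, student):.1%}")  # reduction: 37.5%
print(f"teacher weights: {weight_memory_gb(teacher):.0f} GB")     # teacher weights: 48 GB
print(f"student weights: {weight_memory_gb(student):.0f} GB")     # student weights: 30 GB
```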

Key capabilities

Balanced for multitask and multilingual conversation as well as long‑context handling
Structured 2:4 sparsity → runs up to 40 % faster on sparsity‑aware kernels
Distilled on a combination of multilingual pretraining and synthetic data
Training pipeline optimized for unified‑memory GPUs (AMD MI300A) but runs on any CUDA / ROCm device
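
For readers unfamiliar with the 2:4 pattern: in every contiguous group of four weights, at most two are non‑zero, which is the layout that sparsity‑aware kernels can exploit. The sketch below uses plain weight magnitude to choose which two to keep. That criterion is an assumption made for illustration and is simpler than the selection the model card describes.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every group of four.

    `weights` is a flat list whose length is a multiple of 4.
    """
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
print(prune_2_of_4(row))
# [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```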

Pruning Process

Pruning & Distillation Strategy: SimplePrune

Hardware‑aware, hierarchical pipeline. SimplePrune starts with coarse block‑level pruning and drills down to channel‑ and neuron‑level removals, finishing with 2:4 structured sparsity. This staged approach converts compression ratios into real memory‑bandwidth and latency gains.

Sensitivity‑guided selection. Each stage is driven by activation‑magnitude profiles and Hessian‑based importance scores captured asynchronously during training, allowing the framework to run inside the MI300A’s 512 GB unified memory without OOM interruptions.

Two‑phase optimisation. A fast greedy pass prunes low‑impact blocks in MLP expansion layers, after which a Tabu‑Search meta‑heuristic explores cross‑layer combinations for a better global trade‑off between sparsity and perplexity/KL divergence.
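
The greedy‑then‑tabu idea can be illustrated on a toy problem. Everything in this sketch is an assumption made for illustration: the per‑block importance scores are made up, the `cost` function (summed importance plus a penalty for pruning adjacent blocks) merely stands in for a perplexity/KL measurement, and the swap neighbourhood is one common tabu‑search choice rather than SimplePrune's documented one.

```python
from collections import deque
from itertools import product


def cost(pruned, importance, penalty=0.5):
    """Hypothetical quality loss: summed importance plus a penalty for each
    pair of adjacent pruned blocks (a stand-in for cross-layer interactions)."""
    base = sum(importance[i] for i in pruned)
    adjacent = sum(1 for i in pruned if i + 1 in pruned)
    return base + penalty * adjacent


def greedy_prune(importance, budget):
    """Phase 1: prune the `budget` blocks with the lowest individual importance."""
    order = sorted(range(len(importance)), key=lambda i: importance[i])
    return set(order[:budget])


def tabu_search(importance, start, iterations=50, tenure=3):
    """Phase 2: swap one pruned block for one kept block per step, forbidding
    the `tenure` most recently moved blocks so the search can leave local optima."""
    current, best = set(start), set(start)
    tabu = deque(maxlen=tenure)
    for _ in range(iterations):
        kept = [i for i in range(len(importance)) if i not in current]
        moves = [(out, inn) for out, inn in product(sorted(current), kept)
                 if out not in tabu and inn not in tabu]
        if not moves:
            break
        # Take the best admissible swap even if it worsens the current cost.
        out, inn = min(
            moves, key=lambda m: cost((current - {m[0]}) | {m[1]}, importance))
        current = (current - {out}) | {inn}
        tabu.extend([out, inn])
        if cost(current, importance) < cost(best, importance):
            best = set(current)
    return best


importance = [0.10, 0.12, 0.11, 0.50, 0.30, 0.90]  # made-up per-block scores
start = greedy_prune(importance, budget=3)
best = tabu_search(importance, start)
print("greedy:", sorted(start), "tabu:", sorted(best))
```

In this toy setup the greedy pass picks the three individually cheapest blocks, which happen to be adjacent and therefore incur the interaction penalty. The tabu phase swaps one of them for a non‑adjacent block and ends with a lower total cost.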

Post‑pruning knowledge distillation. The pruned 15 B student is distilled from a calibrated 24 B teacher using a fused `L_SquareHead + L_KL + 0.25 · L_CE` loss across 20 B multilingual tokens, restoring > 96 % of the original quality in ≤ 2 epochs on up to 64 MI300A nodes.
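
A per‑token sketch of how the three loss terms combine, written in plain Python for readability. Several details are assumptions because the model card does not specify them: the SquareHead term is taken to be the MSE between teacher and student hidden states normalized by the teacher's mean squared activation, the KL term is computed as KL(teacher ‖ student) without temperature scaling, and the "pooled" variant named in the stage table is not modelled at all.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) over one token's vocabulary distribution."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(student_logits, target):
    """Negative log-likelihood of the ground-truth token under the student."""
    return -math.log(softmax(student_logits)[target])


def squarehead(teacher_hidden, student_hidden, eps=1e-8):
    """Assumed SquareHead form: MSE between hidden states, normalized by the
    teacher's mean squared activation."""
    n = len(teacher_hidden)
    mse = sum((t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)) / n
    norm = sum(t ** 2 for t in teacher_hidden) / n
    return mse / (norm + eps)


def fused_loss(teacher_logits, student_logits, teacher_hidden, student_hidden,
               target, ce_weight=0.25):
    """L_SquareHead + L_KL + 0.25 * L_CE for a single token position."""
    return (squarehead(teacher_hidden, student_hidden)
            + kl_divergence(teacher_logits, student_logits)
            + ce_weight * cross_entropy(student_logits, target))
```

When the student reproduces the teacher exactly, the first two terms vanish and only the weighted cross‑entropy against the ground‑truth token remains. This is why the cross‑entropy weight can stay small without the loss collapsing to zero.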

Results

A ≈ 37.5 % parameter reduction (24 B → 15 B) delivers 2× lower time‑to‑first‑token (TTFT) and ≈ 40 % higher tokens/s versus the uncompressed teacher while matching perplexity and divergence metrics. This validates SimplePrune as an effective route to deploying KafkaLM in memory‑constrained, sparsity‑accelerated environments.

| Metric | Mistral‑24B | KafkaLM‑15B | Δ |
|---|---|---|---|
| Time‑to‑First‑Token | 4.91 s | 2.46 s | −50% |
| Prompts / s | 4.70 | 6.55 | +38% |
| Tokens / s | 579 | 812 | +40% |

Training scalability (distillation run, MI300A cluster)

| Nodes | Tokens / s | Speed‑up |
|---|---|---|
| 4 | 1 461 | – |
| 8 | 3 327 | 2.3 × |
| 16 | 7 423 | 5.1 × |
| 32 | 15 286 | 10.5 × |
| 64 | 25 455 | 17.4 × |

Near‑linear scaling thanks to sharded ZeRO‑3 + RCCL optimisations.
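
The speed‑up column can be recomputed from the throughput figures, together with a parallel efficiency (speed‑up divided by the increase in node count). This is a small sketch. The efficiency metric and the helper name `scaling_report` are additions for illustration, with the 4‑node run taken as the baseline.

```python
throughput = {4: 1461, 8: 3327, 16: 7423, 32: 15286, 64: 25455}  # nodes -> tokens/s


def scaling_report(measurements):
    """Return (nodes, speed-up, efficiency) rows relative to the smallest run."""
    base_nodes = min(measurements)
    base_rate = measurements[base_nodes]
    rows = []
    for nodes in sorted(measurements):
        speedup = measurements[nodes] / base_rate
        efficiency = speedup / (nodes / base_nodes)
        rows.append((nodes, round(speedup, 1), round(efficiency, 2)))
    return rows


for nodes, speedup, efficiency in scaling_report(throughput):
    print(f"{nodes:>2} nodes: {speedup:>4.1f}x speed-up, efficiency {efficiency:.2f}")
```

With these figures the efficiency stays at or above 1.0 at every scale, meaning throughput grows at least as fast as the node count relative to the 4‑node baseline.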

Citation

@misc{kafkalm2025,
  title={Evaluating AMD's MI300A APU: Performance Insights on LLM Training via Knowledge Distillation},
  author={Dennis Dickmann and Philipp Offenhäuser and Rishabh Saxena and George S. Markomanolis and Alessandro Rigazzi and Patrick Keller and Dennis Hoppe},
  howpublished={Cray User Group Conference, 2025},
  note={to be published},
  year={2025}
}

Safetensors

Model size: 16B params

Tensor type: BF16
