AlemLLM - BF16 GGUF

BF16 GGUF conversion of astanahub/alemllm for use with llama.cpp and compatible tools.

Quantized variants (Q4_K_M, Q5_K_M, Q8_0) and GGUF benchmarks coming soon.


Description

AlemLLM is a large language model developed by Astana Hub to improve the helpfulness of LLM-generated responses in the Kazakh language. Built on a Mixture-of-Experts (MoE) architecture, AlemLLM achieves state-of-the-art results across Kazakh, Russian, and English benchmarks while keeping inference efficient with only 22B active parameters per token.

This repository provides the BF16 GGUF conversion for use with llama.cpp, Ollama, LM Studio, vLLM, Jan, and other GGUF-compatible inference engines.

Model Specification

Parameter Value
Architecture DeepSeek Mixture of Experts (MoE)
Total Parameters 247B
Active Parameters 22B per token
Routed Experts 112 (6 active per token)
Shared Experts 2
Number of Layers 56
Vocabulary Size 100,352
Tokenizer SentencePiece
Activation Function SwiGLU
Positional Encoding RoPE
Precision BF16 (unquantized)
GGUF Size 461 GB (split into 11 parts)

Evaluation Metrics

Evaluations were conducted on established benchmarks: MMLU, Winogrande, Hellaswag, ARC, GSM8k, and DROP. AlemLLM ranks first by average score on the Kazakh, Russian, and English leaderboards below.

Kazakh Leaderboard

Model Average MMLU Winogrande Hellaswag ARC GSM8k DROP
AlemLLM 0.826 0.757 0.837 0.775 0.949 0.917 0.719
Yi-Lightning 0.812 0.720 0.852 0.820 0.940 0.880 0.660
DeepSeek R1 0.798 0.753 0.764 0.680 0.868 0.937 0.784
GPT-4o 0.776 0.730 0.704 0.830 0.940 0.900 0.550
KazLLM-1.0-70B 0.766 0.660 0.806 0.790 0.920 0.770 0.650
DeepSeek V3 37A 0.715 0.650 0.628 0.640 0.900 0.890 0.580
Llama-3.1-70b-inst. 0.639 0.610 0.585 0.520 0.820 0.780 0.520
QwQ 32B 0.628 0.591 0.613 0.499 0.661 0.826 0.576

Russian Leaderboard

Model Average MMLU Winogrande Hellaswag ARC GSM8k DROP
AlemLLM 0.848 0.801 0.858 0.843 0.959 0.896 0.729
DeepSeek R1 0.845 0.838 0.811 0.827 0.972 0.928 0.694
QwQ 32B 0.840 0.810 0.807 0.823 0.964 0.926 0.709
Yi-Lightning 0.834 0.750 0.854 0.870 0.960 0.890 0.680
DeepSeek V3 37A 0.818 0.784 0.756 0.840 0.960 0.910 0.660
GPT-4o 0.808 0.776 0.771 0.880 0.960 0.890 0.570
Llama-3.1-70b-inst. 0.752 0.660 0.691 0.730 0.920 0.880 0.630
KazLLM-1.0-70B 0.748 0.650 0.806 0.860 0.790 0.810 0.570

English Leaderboard

Model Average MMLU Winogrande Hellaswag ARC GSM8k DROP
AlemLLM 0.921 0.874 0.928 0.909 0.978 0.926 0.911
QwQ 32B 0.914 0.864 0.886 0.897 0.969 0.969 0.896
Yi-Lightning 0.909 0.820 0.936 0.930 0.980 0.930 0.860
DeepSeek R1 0.908 0.855 0.857 0.882 0.977 0.960 0.915
DeepSeek V3 37A 0.880 0.840 0.790 0.900 0.980 0.950 0.820
GPT-4o 0.862 0.830 0.793 0.940 0.980 0.910 0.720
KazLLM-1.0-70B 0.855 0.820 0.843 0.920 0.970 0.820 0.760
Llama-3.1-70b-inst. 0.841 0.770 0.718 0.880 0.960 0.900 0.820
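
The Average column in each leaderboard is the mean of the six per-benchmark scores. As a sanity check, the AlemLLM rows from the three tables above can be recomputed directly:

```python
# Recompute the leaderboard "Average" column as the mean of the six
# per-benchmark scores (MMLU, Winogrande, Hellaswag, ARC, GSM8k, DROP),
# using AlemLLM's rows from the tables above.
alemllm_scores = {
    "Kazakh":  [0.757, 0.837, 0.775, 0.949, 0.917, 0.719],
    "Russian": [0.801, 0.858, 0.843, 0.959, 0.896, 0.729],
    "English": [0.874, 0.928, 0.909, 0.978, 0.926, 0.911],
}

for lang, scores in alemllm_scores.items():
    avg = round(sum(scores) / len(scores), 3)
    print(f"{lang}: {avg}")
# Kazakh: 0.826, Russian: 0.848, English: 0.921 -- matching the tables
```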

GGUF-specific perplexity benchmarks for quantized variants will be published soon.

Files

File Size Description
alemllm-bf16.gguf.part00of11 to part09of11 46 GB each BF16 GGUF split parts
alemllm-bf16.gguf.part10of11 0.5 GB Final split part
imatrix.dat 354 MB Importance matrix (partial, 24/192 chunks)

Total size: 461 GB

How to Use

Step 1: Download & Reassemble

# Download all parts
huggingface-cli download Sherkhan243/alemllm-GGUF --local-dir ./alemllm-GGUF

# Reassemble into a single GGUF file
cat alemllm-GGUF/alemllm-bf16.gguf.part*of11 > alemllm-bf16.gguf

# Verify (optional)
# SHA256: 2effdb42be5e37561da0a5a458f9b9519b9a407e89fa21f57813d34183765979
sha256sum alemllm-bf16.gguf

Step 2: Run with llama.cpp

# Requires ~470 GB RAM or multi-GPU setup
./llama-server \
  -m alemllm-bf16.gguf \
  -c 8192 \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8080
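
Once the server is up, llama-server exposes an OpenAI-compatible HTTP API on the host and port given above. A minimal stdlib-only client sketch (the payload shape follows the standard chat-completions convention):

```python
import json
import urllib.request

def build_request(prompt, temperature=0.7):
    """Build an OpenAI-style chat-completions payload."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt, host="http://localhost:8080"):
    """POST to llama-server's OpenAI-compatible endpoint; return the reply text."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the server started with the command above):
# print(chat("Қазақстан туралы қысқаша айтып бер."))
```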

Run with Ollama

# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./alemllm-bf16.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF

# Create and run
ollama create alemllm -f Modelfile
ollama run alemllm
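
After `ollama create`, the model is also reachable through the local Ollama daemon's REST API (default port 11434). A minimal sketch, assuming default daemon settings:

```python
import json
import urllib.request

def build_payload(prompt, model="alemllm"):
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, host="http://localhost:11434"):
    """Call the local Ollama daemon and return the generated text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Example (requires the Ollama daemon and the model created above):
# print(generate("Сәлем!"))
```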

Run with LM Studio

  1. Place the reassembled alemllm-bf16.gguf in your LM Studio models directory
  2. Open LM Studio and select the model
  3. Adjust context length and GPU layers as needed

Run with vLLM (GGUF support)

python -m vllm.entrypoints.openai.api_server \
  --model alemllm-bf16.gguf \
  --tokenizer-mode slow \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.98 \
  --max-model-len 8192 \
  --port 8000

Hardware Requirements

Configuration RAM/VRAM Required Notes
CPU only ~470 GB RAM Slow but works
Multi-GPU (4x 128GB) 512 GB VRAM Full GPU offload, best performance
Multi-GPU (8x 80GB) 640 GB VRAM Full GPU offload
Hybrid (GPU + CPU) Varies Partial GPU offload with -ngl N

Quantized versions (Q4_K_M ~130 GB, Q8_0 ~260 GB) will significantly reduce these requirements.
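
These footprints follow directly from bits-per-weight times total parameter count. A rough estimator; the bits-per-weight figures below are approximate community values for each GGUF type (an assumption), not measurements from this model:

```python
def gguf_size_gib(total_params, bits_per_weight):
    """Rough GGUF file size in GiB: params * bits-per-weight / 8 bytes."""
    return total_params * bits_per_weight / 8 / 2**30

PARAMS = 247e9  # total parameters from the spec table

# Approximate bits per weight for common GGUF types (assumed values):
for name, bpw in [("BF16", 16), ("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.5)]:
    print(f"{name}: ~{gguf_size_gib(PARAMS, bpw):.0f} GiB")
# BF16 works out to ~460 GiB, consistent with the 461 GB listed above.
```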

Upcoming

  • Q4_K_M quantization (~130 GB) - most practical for multi-GPU setups
  • Q5_K_M quantization (~170 GB)
  • Q8_0 quantization (~260 GB)
  • GGUF perplexity benchmarks
  • Complete importance matrix (full 192 chunks)

License

This model follows the same license as the original astanahub/alemllm.

Research and non-commercial use only. See the original LICENSE for full terms.

Intended Use & Limitations

Intended Use: Research and development in line with Kazakhstan's AI initiatives.

Limitations: The model may generate inaccurate, biased, or unsafe content; users must apply responsible use practices.

Safety & Compliance: Publication is subject to applicable laws, export control, and cybersecurity regulations.
