AlemLLM - BF16 GGUF

BF16 GGUF conversion of astanahub/alemllm for use with llama.cpp and compatible tools.

Quantized variants (Q4_K_M, Q5_K_M, Q8_0) and GGUF benchmarks coming soon.


Description

AlemLLM is a large language model developed by Astana Hub to improve the helpfulness of LLM-generated responses in the Kazakh language. Built on a Mixture-of-Experts (MoE) architecture, AlemLLM achieves state-of-the-art results across Kazakh, Russian, and English benchmarks while keeping inference efficient with only 22B active parameters per token.

This repository provides the BF16 GGUF conversion for use with llama.cpp, Ollama, LM Studio, vLLM, Jan, and other GGUF-compatible inference engines.

Model Specification

Parameter Value
Architecture DeepSeek Mixture of Experts (MoE)
Total Parameters 247B
Active Parameters 22B per token
Routed Experts 112 (6 active per token)
Shared Experts 2
Number of Layers 56
Vocabulary Size 100,352
Tokenizer SentencePiece
Activation Function SwiGLU
Positional Encoding RoPE
Precision BF16 (unquantized)
GGUF Size 461 GB (split into 11 parts)

Evaluation Metrics

Evaluations were conducted on established benchmarks: MMLU, Winogrande, Hellaswag, ARC, GSM8k, and DROP. AlemLLM ranks first by average score on the Kazakh, Russian, and English leaderboards below.

Kazakh Leaderboard

Model Average MMLU Winogrande Hellaswag ARC GSM8k DROP
AlemLLM 0.826 0.757 0.837 0.775 0.949 0.917 0.719
Yi-Lightning 0.812 0.720 0.852 0.820 0.940 0.880 0.660
DeepSeek R1 0.798 0.753 0.764 0.680 0.868 0.937 0.784
GPT-4o 0.776 0.730 0.704 0.830 0.940 0.900 0.550
KazLLM-1.0-70B 0.766 0.660 0.806 0.790 0.920 0.770 0.650
DeepSeek V3 37A 0.715 0.650 0.628 0.640 0.900 0.890 0.580
Llama-3.1-70b-inst. 0.639 0.610 0.585 0.520 0.820 0.780 0.520
QwQ 32B 0.628 0.591 0.613 0.499 0.661 0.826 0.576

Russian Leaderboard

Model Average MMLU Winogrande Hellaswag ARC GSM8k DROP
AlemLLM 0.848 0.801 0.858 0.843 0.959 0.896 0.729
DeepSeek R1 0.845 0.838 0.811 0.827 0.972 0.928 0.694
QwQ 32B 0.840 0.810 0.807 0.823 0.964 0.926 0.709
Yi-Lightning 0.834 0.750 0.854 0.870 0.960 0.890 0.680
DeepSeek V3 37A 0.818 0.784 0.756 0.840 0.960 0.910 0.660
GPT-4o 0.808 0.776 0.771 0.880 0.960 0.890 0.570
Llama-3.1-70b-inst. 0.752 0.660 0.691 0.730 0.920 0.880 0.630
KazLLM-1.0-70B 0.748 0.650 0.806 0.860 0.790 0.810 0.570

English Leaderboard

Model Average MMLU Winogrande Hellaswag ARC GSM8k DROP
AlemLLM 0.921 0.874 0.928 0.909 0.978 0.926 0.911
QwQ 32B 0.914 0.864 0.886 0.897 0.969 0.969 0.896
Yi-Lightning 0.909 0.820 0.936 0.930 0.980 0.930 0.860
DeepSeek R1 0.908 0.855 0.857 0.882 0.977 0.960 0.915
DeepSeek V3 37A 0.880 0.840 0.790 0.900 0.980 0.950 0.820
GPT-4o 0.862 0.830 0.793 0.940 0.980 0.910 0.720
KazLLM-1.0-70B 0.855 0.820 0.843 0.920 0.970 0.820 0.760
Llama-3.1-70b-inst. 0.841 0.770 0.718 0.880 0.960 0.900 0.820
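
The Average column in each leaderboard is the mean of the six per-benchmark scores. As a sanity check, the AlemLLM rows from the three tables above can be recomputed directly:

```python
# Recompute the leaderboard "Average" column as the mean of the six
# per-benchmark scores (MMLU, Winogrande, Hellaswag, ARC, GSM8k, DROP),
# using AlemLLM's rows from the tables above.
alemllm_scores = {
    "Kazakh":  [0.757, 0.837, 0.775, 0.949, 0.917, 0.719],
    "Russian": [0.801, 0.858, 0.843, 0.959, 0.896, 0.729],
    "English": [0.874, 0.928, 0.909, 0.978, 0.926, 0.911],
}

for lang, scores in alemllm_scores.items():
    avg = round(sum(scores) / len(scores), 3)
    print(f"{lang}: {avg}")
# Kazakh: 0.826, Russian: 0.848, English: 0.921 -- matching the tables
```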

GGUF-specific perplexity benchmarks for quantized variants will be published soon.

Files

File Size Description
alemllm-bf16.gguf.part00of11 to part09of11 46 GB each BF16 GGUF split parts
alemllm-bf16.gguf.part10of11 0.5 GB Final split part
imatrix.dat 354 MB Importance matrix (partial, 24/192 chunks)

Total size: 461 GB

How to Use

Step 1: Download & Reassemble

# Download all parts
huggingface-cli download Sherkhan243/alemllm-GGUF --local-dir ./alemllm-GGUF

# Reassemble into a single GGUF file
cat alemllm-GGUF/alemllm-bf16.gguf.part*of11 > alemllm-bf16.gguf

# Verify (optional)
# SHA256: 2effdb42be5e37561da0a5a458f9b9519b9a407e89fa21f57813d34183765979
sha256sum alemllm-bf16.gguf

Step 2: Run with llama.cpp

# Requires ~470 GB RAM or multi-GPU setup
./llama-server \
  -m alemllm-bf16.gguf \
  -c 8192 \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8080
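
Once the server is up, llama-server exposes an OpenAI-compatible HTTP API on the host and port given above. A minimal stdlib-only client sketch (the payload shape follows the standard chat-completions convention):

```python
import json
import urllib.request

def build_request(prompt, temperature=0.7):
    """Build an OpenAI-style chat-completions payload."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt, host="http://localhost:8080"):
    """POST to llama-server's OpenAI-compatible endpoint; return the reply text."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the server started with the command above):
# print(chat("Қазақстан туралы қысқаша айтып бер."))
```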

Run with Ollama

# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./alemllm-bf16.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF

# Create and run
ollama create alemllm -f Modelfile
ollama run alemllm
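
After `ollama create`, the model is also reachable through the local Ollama daemon's REST API (default port 11434). A minimal sketch, assuming default daemon settings:

```python
import json
import urllib.request

def build_payload(prompt, model="alemllm"):
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, host="http://localhost:11434"):
    """Call the local Ollama daemon and return the generated text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Example (requires the Ollama daemon and the model created above):
# print(generate("Сәлем!"))
```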

Run with LM Studio

  1. Place the reassembled alemllm-bf16.gguf in your LM Studio models directory
  2. Open LM Studio and select the model
  3. Adjust context length and GPU layers as needed

Run with vLLM (GGUF support)

python -m vllm.entrypoints.openai.api_server \
  --model alemllm-bf16.gguf \
  --tokenizer-mode slow \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.98 \
  --max-model-len 8192 \
  --port 8000

Hardware Requirements

Configuration RAM/VRAM Required Notes
CPU only ~470 GB RAM Slow but works
Multi-GPU (4x 128GB) 512 GB VRAM Full GPU offload, best performance
Multi-GPU (8x 80GB) 640 GB VRAM Full GPU offload
Hybrid (GPU + CPU) Varies Partial GPU offload with -ngl N

Quantized versions (Q4_K_M ~130 GB, Q8_0 ~260 GB) will significantly reduce these requirements.
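
These footprints follow directly from bits-per-weight times total parameter count. A rough estimator; the bits-per-weight figures below are approximate community values for each GGUF type (an assumption), not measurements from this model:

```python
def gguf_size_gib(total_params, bits_per_weight):
    """Rough GGUF file size in GiB: params * bits-per-weight / 8 bytes."""
    return total_params * bits_per_weight / 8 / 2**30

PARAMS = 247e9  # total parameters from the spec table

# Approximate bits per weight for common GGUF types (assumed values):
for name, bpw in [("BF16", 16), ("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.5)]:
    print(f"{name}: ~{gguf_size_gib(PARAMS, bpw):.0f} GiB")
# BF16 works out to ~460 GiB, consistent with the 461 GB listed above.
```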

Upcoming

  • Q4_K_M quantization (~130 GB) - most practical for multi-GPU setups
  • Q5_K_M quantization (~170 GB)
  • Q8_0 quantization (~260 GB)
  • GGUF perplexity benchmarks
  • Complete importance matrix (full 192 chunks)

License

This model follows the same license as the original astanahub/alemllm.

Research and non-commercial use only. See the original LICENSE for full terms.

Intended Use & Limitations

Intended Use: Research and development in line with Kazakhstan's AI initiatives.

Limitations: The model may generate inaccurate, biased, or unsafe content; users must apply responsible use practices.

Safety & Compliance: Publication is subject to applicable laws, export control, and cybersecurity regulations.
