# AlemLLM – BF16 GGUF
BF16 GGUF conversion of astanahub/alemllm for use with llama.cpp and compatible tools.
Quantized variants (Q4_K_M, Q5_K_M, Q8_0) and GGUF benchmarks coming soon.
## Description
AlemLLM is a large language model developed by Astana Hub to improve the helpfulness of LLM-generated responses in the Kazakh language. Built on a Mixture-of-Experts (MoE) architecture, AlemLLM achieves state-of-the-art results across Kazakh, Russian, and English benchmarks while keeping inference efficient with only 22B active parameters per token.
This repository provides the BF16 GGUF conversion for use with llama.cpp, Ollama, LM Studio, vLLM, Jan, and other GGUF-compatible inference engines.
## Model Specification
| Parameter | Value |
|---|---|
| Architecture | DeepSeek Mixture of Experts (MoE) |
| Total Parameters | 247B |
| Active Parameters | 22B per token |
| Routed Experts | 112 (6 active per token) |
| Shared Experts | 2 |
| Number of Layers | 56 |
| Vocabulary Size | 100,352 |
| Tokenizer | SentencePiece |
| Activation Function | SwiGLU |
| Positional Encoding | RoPE |
| Quantization | BF16 |
| GGUF Size | 461 GB (split into 11 parts) |
## Evaluation Metrics

Evaluations cover six established benchmarks: MMLU, Winogrande, HellaSwag, ARC, GSM8K, and DROP. AlemLLM posts the highest average score on all three language leaderboards below.
### Kazakh Leaderboard
| Model | Average | MMLU | Winogrande | HellaSwag | ARC | GSM8K | DROP |
|---|---|---|---|---|---|---|---|
| AlemLLM | 0.826 | 0.757 | 0.837 | 0.775 | 0.949 | 0.917 | 0.719 |
| Yi-Lightning | 0.812 | 0.720 | 0.852 | 0.820 | 0.940 | 0.880 | 0.660 |
| DeepSeek R1 | 0.798 | 0.753 | 0.764 | 0.680 | 0.868 | 0.937 | 0.784 |
| GPT-4o | 0.776 | 0.730 | 0.704 | 0.830 | 0.940 | 0.900 | 0.550 |
| KazLLM-1.0-70B | 0.766 | 0.660 | 0.806 | 0.790 | 0.920 | 0.770 | 0.650 |
| DeepSeek V3 37A | 0.715 | 0.650 | 0.628 | 0.640 | 0.900 | 0.890 | 0.580 |
| Llama-3.1-70b-inst. | 0.639 | 0.610 | 0.585 | 0.520 | 0.820 | 0.780 | 0.520 |
| QwQ 32B | 0.628 | 0.591 | 0.613 | 0.499 | 0.661 | 0.826 | 0.576 |
### Russian Leaderboard
| Model | Average | MMLU | Winogrande | HellaSwag | ARC | GSM8K | DROP |
|---|---|---|---|---|---|---|---|
| AlemLLM | 0.848 | 0.801 | 0.858 | 0.843 | 0.959 | 0.896 | 0.729 |
| DeepSeek R1 | 0.845 | 0.838 | 0.811 | 0.827 | 0.972 | 0.928 | 0.694 |
| QwQ 32B | 0.840 | 0.810 | 0.807 | 0.823 | 0.964 | 0.926 | 0.709 |
| Yi-Lightning | 0.834 | 0.750 | 0.854 | 0.870 | 0.960 | 0.890 | 0.680 |
| DeepSeek V3 37A | 0.818 | 0.784 | 0.756 | 0.840 | 0.960 | 0.910 | 0.660 |
| GPT-4o | 0.808 | 0.776 | 0.771 | 0.880 | 0.960 | 0.890 | 0.570 |
| Llama-3.1-70b-inst. | 0.752 | 0.660 | 0.691 | 0.730 | 0.920 | 0.880 | 0.630 |
| KazLLM-1.0-70B | 0.748 | 0.650 | 0.806 | 0.860 | 0.790 | 0.810 | 0.570 |
### English Leaderboard
| Model | Average | MMLU | Winogrande | HellaSwag | ARC | GSM8K | DROP |
|---|---|---|---|---|---|---|---|
| AlemLLM | 0.921 | 0.874 | 0.928 | 0.909 | 0.978 | 0.926 | 0.911 |
| QwQ 32B | 0.914 | 0.864 | 0.886 | 0.897 | 0.969 | 0.969 | 0.896 |
| Yi-Lightning | 0.909 | 0.820 | 0.936 | 0.930 | 0.980 | 0.930 | 0.860 |
| DeepSeek R1 | 0.908 | 0.855 | 0.857 | 0.882 | 0.977 | 0.960 | 0.915 |
| DeepSeek V3 37A | 0.880 | 0.840 | 0.790 | 0.900 | 0.980 | 0.950 | 0.820 |
| GPT-4o | 0.862 | 0.830 | 0.793 | 0.940 | 0.980 | 0.910 | 0.720 |
| KazLLM-1.0-70B | 0.855 | 0.820 | 0.843 | 0.920 | 0.970 | 0.820 | 0.760 |
| Llama-3.1-70b-inst. | 0.841 | 0.770 | 0.718 | 0.880 | 0.960 | 0.900 | 0.820 |
GGUF-specific perplexity benchmarks for quantized variants will be published soon.
## Files

| File | Size | Description |
|---|---|---|
| `alemllm-bf16.gguf.part00of11` – `part09of11` | 46 GB each | BF16 GGUF split parts |
| `alemllm-bf16.gguf.part10of11` | 0.5 GB | Final split part |
| `imatrix.dat` | 354 MB | Importance matrix (partial, 24/192 chunks) |
Total size: 461 GB
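As a sketch of what `imatrix.dat` is for: llama.cpp's `llama-quantize` tool can consume an importance matrix to improve low-bit quantizations. The command below is shown dry-run style; the output filename and quant type are illustrative, and actually running it requires the reassembled 461 GB BF16 file (note also that the supplied matrix covers only 24 of 192 chunks).

```shell
# Dry-run sketch: how the importance matrix would feed into quantization.
# llama-quantize ships with llama.cpp; the output name is hypothetical.
CMD="./llama-quantize --imatrix imatrix.dat alemllm-bf16.gguf alemllm-q4_k_m.gguf Q4_K_M"
echo "$CMD"
# (Not executed here: requires the reassembled BF16 file on disk.)
```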
## How to Use

### Step 1: Download & Reassemble

```shell
# Download all parts (requires the huggingface_hub CLI)
huggingface-cli download Sherkhan243/alemllm-GGUF --local-dir ./alemllm-GGUF

# Reassemble into a single GGUF file
cat alemllm-GGUF/alemllm-bf16.gguf.part*of11 > alemllm-bf16.gguf

# Verify (optional)
# Expected SHA256: 2effdb42be5e37561da0a5a458f9b9519b9a407e89fa21f57813d34183765979
sha256sum alemllm-bf16.gguf
```
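If you want to sanity-check the split/concatenate workflow before committing to the full 461 GB download, here is a minimal round-trip sketch on dummy data (all filenames here are illustrative, not the real parts):

```shell
# Create a 1 MiB dummy file standing in for the model
dd if=/dev/urandom of=dummy.gguf bs=1024 count=1024 2>/dev/null

# Split into numbered ~100 KiB parts, as the real upload was split
split -b 100k -d -a 2 dummy.gguf dummy.gguf.part

# Reassemble with cat: the shell glob expands in lexicographic order,
# which is why zero-padded numeric suffixes are essential
cat dummy.gguf.part* > dummy-reassembled.gguf

# The reassembled file must hash identically to the original
sha256sum dummy.gguf dummy-reassembled.gguf
```

The same property holds for the real parts: `part00of11` through `part10of11` sort correctly under the glob, so a single `cat` reproduces the original file byte for byte.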
### Step 2: Run with llama.cpp

```shell
# Requires ~470 GB RAM or a multi-GPU setup
./llama-server \
  -m alemllm-bf16.gguf \
  -c 8192 \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8080
```
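Recent llama-server builds expose an OpenAI-compatible HTTP API. As a sketch (the endpoint path and port follow the server flags above; the payload fields are standard chat-completion parameters, and the Kazakh prompt is just an example), you can build and validate a request before sending it:

```shell
# Build a sample chat-completion payload for the running server
cat > payload.json << 'EOF'
{
  "messages": [
    {"role": "user", "content": "Қазақстанның астанасы қай қала?"}
  ],
  "temperature": 0.7,
  "max_tokens": 128
}
EOF

# Validate the JSON locally before sending
python3 -m json.tool payload.json > /dev/null && echo "payload OK"

# Send it once the server is up (uncomment to use):
# curl -s http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" -d @payload.json
```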
### Run with Ollama

```shell
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./alemllm-bf16.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF

# Create and run
ollama create alemllm -f Modelfile
ollama run alemllm
```
### Run with LM Studio

- Place the reassembled `alemllm-bf16.gguf` in your LM Studio models directory
- Open LM Studio and select the model
- Adjust context length and GPU layers as needed
### Run with vLLM (GGUF support)

```shell
python -m vllm.entrypoints.openai.api_server \
  --model alemllm-bf16.gguf \
  --tokenizer-mode slow \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.98 \
  --max-model-len 8192 \
  --port 8000
```
## Hardware Requirements
| Configuration | RAM/VRAM Required | Notes |
|---|---|---|
| CPU only | ~470 GB RAM | Slow but works |
| Multi-GPU (4x 128GB) | 512 GB VRAM | Full GPU offload, best performance |
| Multi-GPU (8x 80GB) | 640 GB VRAM | Full GPU offload |
| Hybrid (GPU + CPU) | Varies | Partial GPU offload with `-ngl N` |
Quantized versions (Q4_K_M ~130 GB, Q8_0 ~260 GB) will significantly reduce these requirements.
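These figures can be sanity-checked with simple arithmetic. A rough sketch, assuming effective bits-per-weight of 16 for BF16, ~8.5 for Q8_0, and ~4.5 for Q4_K_M, and ignoring metadata overhead (all three ratios are assumptions, not measured values for this model):

```shell
# Back-of-envelope GGUF size estimates from the 247B parameter count
awk 'BEGIN {
  p = 247e9                                    # total parameters
  printf "BF16  : %.0f GiB\n", p * 16  / 8 / 1024^3
  printf "Q8_0  : %.0f GiB\n", p * 8.5 / 8 / 1024^3
  printf "Q4_K_M: %.0f GiB\n", p * 4.5 / 8 / 1024^3
}'
```

The BF16 estimate lines up with the 461 GB quoted above; the quantized figures are only as accurate as the assumed bits-per-weight, which varies by tensor mix.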
## Upcoming

- Q4_K_M quantization (~130 GB) – most practical for multi-GPU setups
- Q5_K_M quantization (~170 GB)
- Q8_0 quantization (~260 GB)
- GGUF perplexity benchmarks
- Complete importance matrix (full 192 chunks)
## License
This model follows the same license as the original astanahub/alemllm.
Research and non-commercial use only. See the original LICENSE for full terms.
## Attribution

- Original model: Astana Hub – developed with technical support from 01.AI
- GGUF conversion: Sherkhan243
## Intended Use & Limitations

**Intended Use:** Research and development in line with Kazakhstan's AI initiatives.

**Limitations:** The model may generate inaccurate, biased, or unsafe content; users must apply responsible-use practices.

**Safety & Compliance:** Publication is subject to applicable laws, export controls, and cybersecurity regulations.