# GGUF Inference Performance Benchmarks

Original model: issai/Qolda

Benchmarks were conducted with `llama.cpp` using the `llama-bench` tool.
## About Qolda
Qolda is a 4.3B parameter vision-language model designed for Kazakh, Russian, and English. Built on InternVL3.5 and Qwen3, it combines the InternViT-300M vision encoder with the Qwen3-4B language model. The name reflects both accessibility ("қолда" — in hand) and support ("қолдау" — to support).
## Test Configuration
| Parameter | Value |
|---|---|
| Prompt tokens (pp) | 1024 |
| Generation tokens (tg) | 256 |
| Runs per test | 3 |
| Flash Attention | Enabled |
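The parameters above map directly onto `llama-bench` flags. A representative invocation might look like the following (the GGUF filename is hypothetical, and `-ngl 99` applies only to the GPU runs):

```shell
# Benchmark one quantization with llama-bench:
#   -p 1024  prompt tokens (pp1024)
#   -n 256   generation tokens (tg256)
#   -r 3     runs per test
#   -fa 1    flash attention enabled
#   -ngl 99  offload all layers to the GPU (omit for CPU-only runs)
./llama-bench -m Qolda-Q4_K_M.gguf -p 1024 -n 256 -r 3 -fa 1 -ngl 99
```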
## Hardware Configurations

### NVIDIA A100-SXM4-40GB
- Backend: CUDA + NVIDIA Tensor Cores (Ampere architecture)
- Memory: 40 GB HBM2e
- Memory Bandwidth: 1,555 GB/s
- Compute Performance (Peak):
  - FP16 / BF16 (Tensor Core): 312 TFLOPS
  - FP16 / BF16 with sparsity (Tensor Core): 624 TFLOPS
  - INT8 (Tensor Core): 624 TOPS
  - INT8 with sparsity (Tensor Core): 1,248 TOPS
  - FP32 (CUDA cores): 19.5 TFLOPS
  - FP64 (CUDA cores): 9.7 TFLOPS
### Apple MacBook Pro M4 Pro
- Backend: Metal + Apple Silicon GPU (unified memory architecture)
- GPU: Apple M4 Pro 16-core GPU
- Memory: 24 GB unified LPDDR5X
- Memory Bandwidth: 273 GB/s
- Compute Performance (Peak):
  - FP32: ~7.4 TFLOPS
### Qualcomm Snapdragon 8 Gen 3 (OnePlus 13)
- Backend: CPU (ARM NEON / I8MM-enabled)
- CPU: Octa-core (1×3.3 GHz Cortex-X4 + 5×3.2 GHz Cortex-A720 + 2×2.3 GHz Cortex-A520)
- GPU: Adreno 750 (not used in this benchmark)
- Memory: 16 GB LPDDR5X
- Memory Bandwidth: 77 GB/s
- Compute Performance (Peak, GPU):
  - FP32: ~2.7 TFLOPS
### Qualcomm Snapdragon 8 Elite Gen 5 (2026 Flagship, Projected)
- Backend: CPU (ARM NEON / I8MM-enabled)
- Process: 4 nm (TSMC)
- CPU: Octa-core (2+6 cluster), up to 4.6 GHz
- GPU: Adreno 840 (not used in this benchmark)
- Memory: 16 GB LPDDR5X
- Compute Performance (Peak, GPU):
  - FP32: ~5.5 TFLOPS
## Benchmark Results

### Prompt Processing (pp1024), tokens/second
| Quantization | Size | A100 | M4 Pro | SD 8 Gen 3 | SD 8 Elite Gen 5 (proj.) |
|---|---|---|---|---|---|
| F16 | 7.49 GiB | 13,576.25 | 797.20 | 13.72 | 31.55 |
| Q8_0 | 3.98 GiB | 7,731.88 | 726.31 | 14.71 | 33.83 |
| Q6_K | 3.07 GiB | 6,953.67 | 651.65 | 8.59 | 19.76 |
| Q5_K_M | 2.69 GiB | 7,387.72 | 686.72 | 8.84 | 20.33 |
| Q5_K_S | 2.62 GiB | 7,556.95 | 654.02 | 7.72 | 17.76 |
| Q4_K_M | 2.32 GiB | 7,589.05 | 742.13 | 18.07 | 41.56 |
| Q4_K_S | 2.21 GiB | 7,783.02 | 734.80 | 20.70 | 47.61 |
| Q4_1 | 2.41 GiB | 7,751.64 | 792.08 | 11.60 | 26.68 |
| Q4_0 | 2.20 GiB | 7,853.97 | 739.13 | 36.96 | 85.00 |
| IQ4_NL | 2.22 GiB | 7,799.43 | 746.65 | 18.99 | 43.68 |
| IQ4_XS | 2.12 GiB | 8,032.40 | 726.70 | 8.74 | 20.10 |
| Q3_K_M | 1.93 GiB | 6,264.09 | 643.15 | 10.92 | 25.12 |
| Q3_K_S | 1.75 GiB | 5,717.55 | 633.62 | 7.75 | 17.82 |
| Q2_K | 1.55 GiB | 5,154.50 | 660.86 | 8.70 | 20.01 |
| TQ1_0 | 1.01 GiB | — | — | 11.96 | 27.50 |
### Text Generation (tg256), tokens/second
| Quantization | Size | A100 | M4 Pro | SD 8 Gen 3 | SD 8 Elite Gen 5 (proj.) |
|---|---|---|---|---|---|
| F16 | 7.49 GiB | 123.46 | 26.03 | 4.34 | 9.98 |
| Q8_0 | 3.98 GiB | 154.43 | 47.66 | 7.50 | 17.25 |
| Q6_K | 3.07 GiB | 150.04 | 43.91 | 6.31 | 14.52 |
| Q5_K_M | 2.69 GiB | 169.12 | 57.77 | 7.16 | 16.46 |
| Q5_K_S | 2.62 GiB | 174.78 | 58.70 | 6.70 | 15.41 |
| Q4_K_M | 2.32 GiB | 179.73 | 68.92 | 9.73 | 22.38 |
| Q4_K_S | 2.21 GiB | 186.12 | 71.96 | 10.67 | 24.54 |
| Q4_1 | 2.41 GiB | 207.61 | 71.21 | 6.72 | 15.45 |
| Q4_0 | 2.20 GiB | 199.58 | 71.02 | 12.37 | 28.45 |
| IQ4_NL | 2.22 GiB | 194.92 | 69.07 | 10.77 | 24.77 |
| IQ4_XS | 2.12 GiB | 201.83 | 69.76 | 7.40 | 17.02 |
| Q3_K_M | 1.93 GiB | 150.29 | 61.93 | 8.19 | 18.83 |
| Q3_K_S | 1.75 GiB | 138.07 | 58.16 | 6.97 | 16.03 |
| Q2_K | 1.55 GiB | 166.89 | 70.61 | 8.21 | 18.88 |
| TQ1_0 | 1.01 GiB | — | — | 8.82 | 20.28 |
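Generation throughput on all three machines tracks memory bandwidth more closely than raw compute. A rough roofline sketch, under the simplifying assumption that each generated token streams the full weight set once (ignoring KV cache and activation traffic), using the Q4_0 row and the bandwidth figures from the hardware section:

```python
# Bandwidth-bound ceiling for token generation: tg_max ≈ bandwidth / model_size.
GIB = 1024**3
model_bytes = 2.20 * GIB  # Q4_0 weights (2.20 GiB)

bandwidth_gb_s = {"A100": 1555, "M4 Pro": 273, "SD 8 Gen 3": 77}
measured_tg = {"A100": 199.58, "M4 Pro": 71.02, "SD 8 Gen 3": 12.37}  # Q4_0 row

for dev, bw in bandwidth_gb_s.items():
    ceiling = bw * 1e9 / model_bytes  # tokens/s if purely bandwidth-bound
    eff = measured_tg[dev] / ceiling
    print(f"{dev}: ceiling ≈ {ceiling:.0f} t/s, "
          f"measured {measured_tg[dev]} t/s ({eff:.0%} of roofline)")
```

The gap between measured and ceiling reflects compute overhead, dequantization cost, and (on the A100) kernel launch latency at this small model size.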
## Vision Encoder Benchmarks (InternViT-300M)
The vision encoder processes images separately from the LLM backbone. These benchmarks measure end-to-end image processing latency including encoding and decoding stages.
### Image Processing Latency
| Stage | A100 | M4 Pro | SD 8 Gen 3 |
|---|---|---|---|
| Image Slice Encoding | 17 ms | 295 ms | 7,285 ms |
| Image Decoding (total) | 79 ms | 1,375 ms | 9,112 ms |
| Total Processing | 96 ms | 1,670 ms | 16,397 ms |
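The "Total Processing" row is the sum of the encoding and decoding stages, which also gives a per-image throughput figure. A small sketch verifying the totals from the table:

```python
# Per-stage image-processing latencies (ms) from the table above.
latency_ms = {
    "A100":       {"encode": 17,   "decode": 79},
    "M4 Pro":     {"encode": 295,  "decode": 1375},
    "SD 8 Gen 3": {"encode": 7285, "decode": 9112},
}

for device, stages in latency_ms.items():
    total = sum(stages.values())
    print(f"{device}: {total} ms total ≈ {1000 / total:.2f} images/s")
```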
### Memory Usage
| Device | BF16 | Q8_0 |
|---|---|---|
| Desktop (M4 Pro) | 9.6 GB | 1.4 GB |
| Mobile (SD 8 Gen 3) | 1.5 GB | 1.2 GB |
## Notes
- A100 results: full GPU offload with flash attention enabled
- MacBook Pro M4 Pro results: Metal backend leveraging the unified memory architecture
- Snapdragon results: CPU-only inference; Q4_0 benefits from highly optimized ARM NEON kernels, which explains its lead over the other 4-bit formats
- Reported values are means across 3 runs (`llama-bench` also reports standard deviation, omitted here)
- Higher values indicate better performance
- These benchmarks cover the LLM backbone (Qwen3-4B) only; the vision encoder runs separately
## License
This model is licensed under the Apache License 2.0.