GGUF Inference Performance Benchmarks

Original Model: issai/Qolda

Benchmarks conducted using llama.cpp with llama-bench.

About Qolda

Qolda is a 4.3B-parameter vision-language model designed for Kazakh, Russian, and English. Built on InternVL3.5 and Qwen3, it combines the InternViT-300M vision encoder with the Qwen3-4B language model. The name reflects both accessibility ("қолда", meaning "in hand") and support ("қолдау", meaning "to support").


Test Configuration

| Parameter | Value |
| --- | --- |
| Prompt tokens (pp) | 1024 |
| Generation tokens (tg) | 256 |
| Runs per test | 3 |
| Flash Attention | Enabled |
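
The configuration above maps onto a llama-bench invocation roughly as sketched below. This is a minimal sketch, not the authors' actual command: the model file name is hypothetical, and the `-fa 1` form is an assumption about the llama-bench version in use (some builds may expect a different value syntax, so check `llama-bench --help`).

```python
import shlex


def build_llama_bench_cmd(model_path, n_prompt=1024, n_gen=256, reps=3, flash_attn=True):
    """Build the llama-bench argument list for one GGUF file."""
    return [
        "llama-bench",
        "-m", model_path,                   # GGUF model file
        "-p", str(n_prompt),                # prompt tokens (pp)
        "-n", str(n_gen),                   # generation tokens (tg)
        "-r", str(reps),                    # runs per test
        "-fa", "1" if flash_attn else "0",  # flash attention (assumed 0/1 syntax)
    ]


# "Qolda-Q4_K_M.gguf" is a hypothetical file name used for illustration.
cmd = build_llama_bench_cmd("Qolda-Q4_K_M.gguf")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd)` on a machine where llama-bench is on the `PATH`.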

Hardware Configurations

NVIDIA A100-SXM4-40GB

  • Backend: CUDA + NVIDIA Tensor Cores (Ampere architecture)
  • Memory: 40 GB HBM2e
  • Memory Bandwidth: 1,555 GB/s
  • Compute Performance (Peak):
    • FP16 / BF16 (Tensor-Core): 312 TFLOPS
    • FP16 / BF16 w/ Sparsity (Tensor-Core): 624 TFLOPS
    • INT8 (Tensor-Core): 624 TOPS
    • INT8 w/ Sparsity (Tensor-Core): 1,248 TOPS
    • FP32 (CUDA cores): 19.5 TFLOPS
    • FP64 (CUDA cores): 9.7 TFLOPS

Apple MacBook Pro M4 Pro

  • Backend: Metal + Apple Silicon GPU (unified memory architecture)
  • GPU: Apple M4 Pro 16-core GPU
  • Memory: 24 GB unified LPDDR5X
  • Memory Bandwidth: 273 GB/s
  • Compute Performance (Peak):
    • FP32: ~7.4 TFLOPS

Qualcomm Snapdragon 8 Gen 3 (OnePlus 13)

  • Backend: CPU (ARM NEON / I8MM-enabled)
  • CPU: Octa-core (1×3.3 GHz Cortex-X4 + 5×3.2 GHz Cortex-A720 + 2×2.3 GHz Cortex-A520)
  • GPU: Adreno 750 (not used in this benchmark)
  • Memory: 16 GB LPDDR5X
  • Memory Bandwidth: 77 GB/s
  • Compute Performance (Peak, GPU):
    • FP32: ~2.7 TFLOPS

Qualcomm Snapdragon 8 Elite Gen 5 (2026 Flagship — Projected)

  • Backend: CPU (ARM NEON / I8MM-enabled)
  • Process: 4 nm (TSMC)
  • CPU: Octa-core (2+6 Cluster), 4.6 GHz
  • GPU: Adreno 840 (not used in this benchmark)
  • Memory: 16 GB LPDDR5X
  • Compute Performance (Peak, GPU):
    • FP32: ~5.5 TFLOPS

Benchmark Results

Prompt Processing (pp1024) — tokens/second

| Quantization | Size | A100 | M4 Pro | SD 8 Gen 3 | SD 8 Elite Gen 5 (proj.) |
| --- | --- | --- | --- | --- | --- |
| F16 | 7.49 GiB | 13,576.25 | 797.20 | 13.72 | 31.55 |
| Q8_0 | 3.98 GiB | 7,731.88 | 726.31 | 14.71 | 33.83 |
| Q6_K | 3.07 GiB | 6,953.67 | 651.65 | 8.59 | 19.76 |
| Q5_K_M | 2.69 GiB | 7,387.72 | 686.72 | 8.84 | 20.33 |
| Q5_K_S | 2.62 GiB | 7,556.95 | 654.02 | 7.72 | 17.76 |
| Q4_K_M | 2.32 GiB | 7,589.05 | 742.13 | 18.07 | 41.56 |
| Q4_K_S | 2.21 GiB | 7,783.02 | 734.80 | 20.70 | 47.61 |
| Q4_1 | 2.41 GiB | 7,751.64 | 792.08 | 11.60 | 26.68 |
| Q4_0 | 2.20 GiB | 7,853.97 | 739.13 | 36.96 | 85.00 |
| IQ4_NL | 2.22 GiB | 7,799.43 | 746.65 | 18.99 | 43.68 |
| IQ4_XS | 2.12 GiB | 8,032.40 | 726.70 | 8.74 | 20.10 |
| Q3_K_M | 1.93 GiB | 6,264.09 | 643.15 | 10.92 | 25.12 |
| Q3_K_S | 1.75 GiB | 5,717.55 | 633.62 | 7.75 | 17.82 |
| Q2_K | 1.55 GiB | 5,154.50 | 660.86 | 8.70 | 20.01 |
| TQ1_0 | 1.01 GiB | n/a | n/a | 11.96 | 27.50 |
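
The card does not say how the SD 8 Elite Gen 5 projections were derived. A quick check on a few rows of the prompt-processing table suggests the projected column is the measured SD 8 Gen 3 column multiplied by a near-constant factor. This is an observation from the published numbers only, not a documented method.

```python
# (measured SD 8 Gen 3, projected SD 8 Elite Gen 5) in tokens/s,
# copied from the pp1024 table for a sample of quantizations.
pp = {
    "F16": (13.72, 31.55),
    "Q8_0": (14.71, 33.83),
    "Q4_K_M": (18.07, 41.56),
    "Q4_0": (36.96, 85.00),
    "TQ1_0": (11.96, 27.50),
}

# Ratio of projected to measured throughput per quantization.
ratios = {name: proj / meas for name, (meas, proj) in pp.items()}

for name, ratio in ratios.items():
    print(f"{name:7s} projected/measured = {ratio:.3f}")
```

If the ratios agree to within rounding, the projected figures carry no information about per-quantization behaviour on the newer chip beyond what the SD 8 Gen 3 column already shows.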

Text Generation (tg256) — tokens/second

| Quantization | Size | A100 | M4 Pro | SD 8 Gen 3 | SD 8 Elite Gen 5 (proj.) |
| --- | --- | --- | --- | --- | --- |
| F16 | 7.49 GiB | 123.46 | 26.03 | 4.34 | 9.98 |
| Q8_0 | 3.98 GiB | 154.43 | 47.66 | 7.50 | 17.25 |
| Q6_K | 3.07 GiB | 150.04 | 43.91 | 6.31 | 14.52 |
| Q5_K_M | 2.69 GiB | 169.12 | 57.77 | 7.16 | 16.46 |
| Q5_K_S | 2.62 GiB | 174.78 | 58.70 | 6.70 | 15.41 |
| Q4_K_M | 2.32 GiB | 179.73 | 68.92 | 9.73 | 22.38 |
| Q4_K_S | 2.21 GiB | 186.12 | 71.96 | 10.67 | 24.54 |
| Q4_1 | 2.41 GiB | 207.61 | 71.21 | 6.72 | 15.45 |
| Q4_0 | 2.20 GiB | 199.58 | 71.02 | 12.37 | 28.45 |
| IQ4_NL | 2.22 GiB | 194.92 | 69.07 | 10.77 | 24.77 |
| IQ4_XS | 2.12 GiB | 201.83 | 69.76 | 7.40 | 17.02 |
| Q3_K_M | 1.93 GiB | 150.29 | 61.93 | 8.19 | 18.83 |
| Q3_K_S | 1.75 GiB | 138.07 | 58.16 | 6.97 | 16.03 |
| Q2_K | 1.55 GiB | 166.89 | 70.61 | 8.21 | 18.88 |
| TQ1_0 | 1.01 GiB | n/a | n/a | 8.82 | 20.28 |
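
Token generation is often limited by memory bandwidth, because each generated token has to read the model weights once. A rough ceiling can be estimated from the bandwidth figures in the hardware section and the F16 file size. The sketch below assumes this simple model only; it ignores KV-cache traffic, compute time, and kernel launch overhead, so it is an upper bound rather than a prediction.

```python
GIB = 2**30  # bytes per GiB


def bandwidth_bound_tps(bandwidth_gb_s, model_size_gib):
    """Upper bound on tokens/s if each token reads all weights exactly once."""
    return bandwidth_gb_s * 1e9 / (model_size_gib * GIB)


# Memory bandwidth (GB/s) from the hardware section.
bandwidth = {"A100": 1555, "M4 Pro": 273, "SD 8 Gen 3": 77}
# Measured F16 tg256 throughput (tokens/s) from the table above.
measured_f16 = {"A100": 123.46, "M4 Pro": 26.03, "SD 8 Gen 3": 4.34}

bounds = {name: bandwidth_bound_tps(bw, 7.49) for name, bw in bandwidth.items()}

for name, bound in bounds.items():
    share = measured_f16[name] / bound
    print(f"{name}: bound {bound:.1f} t/s, measured {measured_f16[name]} t/s ({share:.0%} of bound)")
```

Under this model every measured F16 figure sits below its ceiling, which is what a bandwidth-limited workload would show. The smaller quantizations on the A100 fall well short of their ceilings, which suggests other costs dominate there.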

Vision Encoder Benchmarks (InternViT-300M)

The vision encoder processes images separately from the LLM backbone. These benchmarks measure end-to-end image processing latency including encoding and decoding stages.

Image Processing Latency

| Stage | A100 | M4 Pro | SD 8 Gen 3 |
| --- | --- | --- | --- |
| Image Slice Encoding | 17 ms | 295 ms | 7,285 ms |
| Image Decoding (total) | 79 ms | 1,375 ms | 9,112 ms |
| Total Processing | 96 ms | 1,670 ms | 16,397 ms |
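
As a sanity check, the totals in the table are the sum of the two stages, and the per-device slowdown relative to the A100 follows directly. The snippet below is only an illustration of that arithmetic using the published values.

```python
# Per-stage latency in milliseconds, copied from the table above.
latency_ms = {
    "A100": {"encode": 17, "decode": 79},
    "M4 Pro": {"encode": 295, "decode": 1375},
    "SD 8 Gen 3": {"encode": 7285, "decode": 9112},
}

# Total latency per device is the sum of both stages.
totals = {dev: stages["encode"] + stages["decode"] for dev, stages in latency_ms.items()}

# Slowdown of each device relative to the A100 total.
slowdown = {dev: total / totals["A100"] for dev, total in totals.items()}

for dev in latency_ms:
    print(f"{dev}: {totals[dev]} ms total, {slowdown[dev]:.1f}x the A100 latency")
```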

Memory Usage

| Device | BF16 | Q8_0 |
| --- | --- | --- |
| Desktop (M4 Pro) | 9.6 GB | 1.4 GB |
| Mobile (SD 8 Gen 3) | 1.5 GB | 1.2 GB |

Notes

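  • A100 results: full GPU offload with Flash Attention enabled
  • MacBook Pro M4 Pro results: Metal backend on unified memory
  • Snapdragon results: CPU-only inference; on the SD 8 Gen 3, Q4_0 is the fastest quantization for both prompt processing (36.96 t/s) and text generation (12.37 t/s), which likely reflects ARM NEON / I8MM-optimized Q4_0 kernels
  • SD 8 Elite Gen 5 figures are projections, not measurements
  • Throughput values are means across 3 runs; standard deviations are not shown in the tables
  • For throughput (tokens/second), higher is better; for latency (ms), lower is better
  • The pp1024 and tg256 results cover the LLM backbone (Qwen3-4B) only; the vision encoder runs separately and is reported under Vision Encoder Benchmarks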
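
For readers reproducing the numbers, the mean and standard deviation over repeated runs can be computed as sketched below. The sample values are hypothetical, not taken from these benchmarks, and the use of the sample (n-1) standard deviation is an assumption about how the figures were aggregated.

```python
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation) of per-run throughput values."""
    mean = statistics.mean(samples)
    # stdev needs at least two samples; report 0.0 for a single run.
    std = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, std


# Hypothetical tokens/s from 3 runs; NOT measured values from this card.
samples_tps = [71.5, 72.0, 72.4]
mean, std = summarize(samples_tps)
print(f"{mean:.2f} ± {std:.2f} t/s")
```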

License

This model is licensed under the Apache License 2.0.
