llcuda Models

Optimized GGUF models for llcuda, a zero-config, CUDA-accelerated LLM inference package.

Models

google_gemma-3-1b-it-Q4_K_M.gguf

  • Model: Google Gemma 3 1B Instruct
  • Quantization: Q4_K_M (4-bit)
  • Size: 769 MB
  • Use case: General-purpose chat, Q&A, code assistance
  • Recommended for: GPUs with 1 GB+ VRAM (see the rough estimate below)
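
A back-of-envelope check of that 1 GB figure (a sketch in Python; the 769 MB weight figure is the GGUF file size above, while the cache and overhead numbers are rough assumptions):

weights_mb = 769     # Q4_K_M GGUF file size (from this card)
kv_cache_mb = 100    # rough KV-cache budget for a ~2k-token context (assumption)
overhead_mb = 150    # CUDA context + compute buffers (assumption)
print(f"~{weights_mb + kv_cache_mb + overhead_mb} MB total")  # ~1019 MB, about 1 GB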

Performance:

  • Tesla T4 (Colab/Kaggle): ~15 tok/s
  • Tesla P100 (Colab): ~18 tok/s
  • GeForce 940M (1GB): ~15 tok/s
  • RTX 30xx/40xx: ~25+ tok/s

Usage

With llcuda (Recommended)

pip install llcuda

import llcuda

# Create the engine and load the quantized model by its short name.
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M")

# Run a prompt and print the generated text.
result = engine.infer("What is AI?")
print(result.text)
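
To sanity-check the throughput figures above on your own GPU, you can time a single generation. A rough sketch using only the calls shown above; whitespace-split words stand in crudely for tokens:

import time

import llcuda

engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M")

start = time.perf_counter()
result = engine.infer("Explain quantization in one paragraph.")
elapsed = time.perf_counter() - start

# Word count is only a rough proxy for token count.
words = len(result.text.split())
print(f"~{words / elapsed:.1f} words/s over {elapsed:.2f}s")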

With llama.cpp

# Download model
huggingface-cli download waqasm86/llcuda-models google_gemma-3-1b-it-Q4_K_M.gguf --local-dir ./models

# Run the llama.cpp server (-ngl 26 offloads the model's layers to the GPU)
./llama-server -m ./models/google_gemma-3-1b-it-Q4_K_M.gguf -ngl 26
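
Once llama-server is up, it exposes an OpenAI-compatible HTTP API (port 8080 unless --port says otherwise). A minimal query from Python:

import requests

# Ask the running llama-server for a chat completion.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What is AI?"}],
        "max_tokens": 128,
    },
)
print(resp.json()["choices"][0]["message"]["content"])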

Supported Platforms

  • ✅ Google Colab (T4, P100, V100, A100)
  • ✅ Kaggle (Tesla T4)
  • ✅ Local GPUs (GeForce, RTX, Tesla)
  • ✅ All NVIDIA GPUs with compute capability 5.0+ (see the check below)
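
To confirm a GPU meets the 5.0 threshold, you can ask nvidia-smi (the compute_cap query field needs a reasonably recent driver); a small Python sketch:

import subprocess

# Query GPU name and compute capability via nvidia-smi.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in out.stdout.strip().splitlines():
    name, cap = (s.strip() for s in line.split(","))
    status = "supported" if float(cap) >= 5.0 else "below 5.0"
    print(f"{name}: compute capability {cap} ({status})")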

License

Apache 2.0 - Models are provided as-is for educational and research purposes.

Credits

  • Model: Google Gemma 3 1B
  • Quantization: llama.cpp GGUF format
  • Package: llcuda by Waqas Muhammad