---
license: apache-2.0
tags:
- llama.cpp
- gguf
- gemma
- quantized
- cuda
language:
- en
pipeline_tag: text-generation
---

# llcuda Models

Optimized GGUF models for llcuda, a zero-config, CUDA-accelerated LLM inference package.

## Models

### google_gemma-3-1b-it-Q4_K_M.gguf

- **Model**: Google Gemma 3 1B Instruct
- **Quantization**: Q4_K_M (4-bit)
- **Size**: 769 MB
- **Use case**: General-purpose chat, Q&A, code assistance
- **Recommended for**: GPUs with 1 GB+ VRAM

**Performance:**

- Tesla T4 (Colab/Kaggle): ~15 tok/s
- Tesla P100 (Colab): ~18 tok/s
- GeForce 940M (1 GB): ~15 tok/s
- RTX 30xx/40xx: ~25+ tok/s

## Usage

### With llcuda (Recommended)

```python
# Install first: pip install llcuda
import llcuda

engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M")
result = engine.infer("What is AI?")
print(result.text)
```

### With llama.cpp

```bash
# Download the model
huggingface-cli download waqasm86/llcuda-models google_gemma-3-1b-it-Q4_K_M.gguf --local-dir ./models

# Serve it with llama-server, offloading 26 layers to the GPU
./llama-server -m ./models/google_gemma-3-1b-it-Q4_K_M.gguf -ngl 26
```

See the Python examples at the end of this card for downloading the model and querying the server programmatically.

## Supported Platforms

- ✅ Google Colab (T4, P100, V100, A100)
- ✅ Kaggle (Tesla T4)
- ✅ Local GPUs (GeForce, RTX, Tesla)
- ✅ All NVIDIA GPUs with compute capability 5.0+

## Links

- **PyPI**: [pypi.org/project/llcuda](https://pypi.org/project/llcuda/)
- **GitHub**: [github.com/waqasm86/llcuda](https://github.com/waqasm86/llcuda)
- **Documentation**: [waqasm86.github.io](https://waqasm86.github.io/)

## License

Apache 2.0. Models are provided as-is for educational and research purposes.

## Credits

- Model: Google Gemma 3 1B
- Quantization: llama.cpp GGUF format
- Package: llcuda by Waqas Muhammad
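
## Python Examples

### Downloading the model programmatically

As an alternative to the `huggingface-cli` command shown in the llama.cpp section, the GGUF file can be fetched with the `huggingface_hub` Python API. A minimal sketch; the target directory `./models` is just an example:

```python
# Download the quantized GGUF file from this repository.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="waqasm86/llcuda-models",
    filename="google_gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="./models",  # destination for the ~769 MB file
)
print(f"Model saved to: {model_path}")
```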
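
### Querying llama-server from Python

Once `llama-server` is running as shown above, it exposes an OpenAI-compatible HTTP API. A minimal sketch using `requests`, assuming the server's default port 8080 (adjust the URL if you started the server with a different `--host`/`--port`):

```python
# Send a chat completion request to a running llama-server instance.
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What is AI?"}],
        "max_tokens": 128,  # cap the reply length
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```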