---
license: apache-2.0
tags:
- llama.cpp
- gguf
- gemma
- quantized
- cuda
language:
- en
pipeline_tag: text-generation
---

# llcuda Models

Optimized GGUF models for llcuda, a zero-config, CUDA-accelerated LLM inference package.

## Models

### google_gemma-3-1b-it-Q4_K_M.gguf

- **Model**: Google Gemma 3 1B Instruct
- **Quantization**: Q4_K_M (4-bit)
- **Size**: 769 MB
- **Use case**: General-purpose chat, Q&A, code assistance
- **Recommended for**: GPUs with 1 GB+ VRAM

**Performance:**

- Tesla T4 (Colab/Kaggle): ~15 tok/s
- Tesla P100 (Colab): ~18 tok/s
- GeForce 940M (1 GB): ~15 tok/s
- RTX 30xx/40xx: ~25+ tok/s

## Usage

### With llcuda (Recommended)

```python
# Install first: pip install llcuda
import llcuda

engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M")
result = engine.infer("What is AI?")
print(result.text)
```

### With llama.cpp

```bash
# Download the model
huggingface-cli download waqasm86/llcuda-models google_gemma-3-1b-it-Q4_K_M.gguf --local-dir ./models

# Serve it with llama-server, offloading 26 layers to the GPU
./llama-server -m ./models/google_gemma-3-1b-it-Q4_K_M.gguf -ngl 26
```

See the Python examples at the end of this card for downloading the model and querying the server programmatically.

## Supported Platforms

- ✅ Google Colab (T4, P100, V100, A100)
- ✅ Kaggle (Tesla T4)
- ✅ Local GPUs (GeForce, RTX, Tesla)
- ✅ All NVIDIA GPUs with compute capability 5.0+

## Links

- **PyPI**: [pypi.org/project/llcuda](https://pypi.org/project/llcuda/)
- **GitHub**: [github.com/waqasm86/llcuda](https://github.com/waqasm86/llcuda)
- **Documentation**: [waqasm86.github.io](https://waqasm86.github.io/)

## License

Apache 2.0. Models are provided as-is for educational and research purposes.

## Credits

- Model: Google Gemma 3 1B
- Quantization: llama.cpp GGUF format
- Package: llcuda by Waqas Muhammad
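
## Python Examples

### Downloading the model programmatically

As an alternative to the `huggingface-cli` command shown in the llama.cpp section, the GGUF file can be fetched with the `huggingface_hub` Python API. A minimal sketch; the target directory `./models` is just an example:

```python
# Download the quantized GGUF file from this repository.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="waqasm86/llcuda-models",
    filename="google_gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="./models",  # destination for the ~769 MB file
)
print(f"Model saved to: {model_path}")
```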
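
### Querying llama-server from Python

Once `llama-server` is running as shown above, it exposes an OpenAI-compatible HTTP API. A minimal sketch using `requests`, assuming the server's default port 8080 (adjust the URL if you started the server with a different `--host`/`--port`):

```python
# Send a chat completion request to a running llama-server instance.
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What is AI?"}],
        "max_tokens": 128,  # cap the reply length
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```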