---
license: apache-2.0
tags:
- llama.cpp
- gguf
- gemma
- quantized
- cuda
language:
- en
pipeline_tag: text-generation
---
# llcuda Models
Optimized GGUF models for llcuda, a zero-config, CUDA-accelerated LLM inference package.
## Models
### google_gemma-3-1b-it-Q4_K_M.gguf
- **Model**: Google Gemma 3 1B Instruct
- **Quantization**: Q4_K_M (4-bit)
- **Size**: 769 MB
- **Use case**: General-purpose chat, Q&A, code assistance
- **Recommended for**: 1GB+ VRAM GPUs
**Performance:**
- Tesla T4 (Colab/Kaggle): ~15 tok/s
- Tesla P100 (Colab): ~18 tok/s
- GeForce 940M (1GB): ~15 tok/s
- RTX 30xx/40xx: ~25+ tok/s
## Usage
### With llcuda (Recommended)
```python
# Install first: pip install llcuda
import llcuda

# The engine auto-detects the CUDA device (zero-config)
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M")

result = engine.infer("What is AI?")
print(result.text)
```
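If you prefer to manage the model file yourself, you can fetch it with the `huggingface_hub` Python API. The sketch below assumes `load_model` also accepts a local file path; check the llcuda documentation for the supported arguments.

```python
# A minimal sketch: download the GGUF manually, then point llcuda at it.
# Assumes load_model() also accepts a local file path (check the llcuda docs).
from huggingface_hub import hf_hub_download

import llcuda

model_path = hf_hub_download(
    repo_id="waqasm86/llcuda-models",
    filename="google_gemma-3-1b-it-Q4_K_M.gguf",
)

engine = llcuda.InferenceEngine()
engine.load_model(model_path)
print(engine.infer("What is AI?").text)
```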
### With llama.cpp
```bash
# Download model
huggingface-cli download waqasm86/llcuda-models google_gemma-3-1b-it-Q4_K_M.gguf --local-dir ./models
# Run with llama.cpp
./llama-server -m ./models/google_gemma-3-1b-it-Q4_K_M.gguf -ngl 26
```
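Once `llama-server` is running (it listens on port 8080 by default), you can query its OpenAI-compatible chat endpoint from Python. A minimal sketch using `requests`:

```python
# Query a running llama-server instance via its OpenAI-compatible endpoint.
# Assumes the server was started as shown above on the default port 8080.
import requests

response = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What is AI?"}],
        "max_tokens": 128,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```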
## Supported Platforms
- βœ… Google Colab (T4, P100, V100, A100)
- βœ… Kaggle (Tesla T4)
- βœ… Local GPUs (GeForce, RTX, Tesla)
- βœ… All NVIDIA GPUs with compute capability 5.0+
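To verify that a GPU meets the compute capability 5.0+ requirement, you can query it from Python. A minimal sketch, assuming PyTorch is available in the environment (`nvidia-smi` offers an equivalent check):

```python
# Quick check of the CUDA compute capability requirement (5.0+).
# Assumes PyTorch is installed; it is not required by llcuda itself here,
# only used for this check.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    print(f"{name}: compute capability {major}.{minor}")
    print("Supported" if (major, minor) >= (5, 0) else "Not supported (needs 5.0+)")
else:
    print("No CUDA device detected")
```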
## Links
- **PyPI**: [pypi.org/project/llcuda](https://pypi.org/project/llcuda/)
- **GitHub**: [github.com/waqasm86/llcuda](https://github.com/waqasm86/llcuda)
- **Documentation**: [waqasm86.github.io](https://waqasm86.github.io/)
## License
Apache 2.0. Models are provided as-is for educational and research purposes.
## Credits
- Model: Google Gemma 3 1B
- Quantization: llama.cpp GGUF format
- Package: llcuda by Waqas Muhammad