---
license: apache-2.0
tags:
- llama.cpp
- gguf
- gemma
- quantized
- cuda
language:
- en
pipeline_tag: text-generation
---
# llcuda Models
Optimized GGUF models for llcuda, a zero-config, CUDA-accelerated LLM inference package.
## Models
### google_gemma-3-1b-it-Q4_K_M.gguf
- **Model**: Google Gemma 3 1B Instruct
- **Quantization**: Q4_K_M (4-bit)
- **Size**: 769 MB
- **Use case**: General-purpose chat, Q&A, code assistance
- **Recommended for**: 1GB+ VRAM GPUs
**Performance:**
- Tesla T4 (Colab/Kaggle): ~15 tok/s
- Tesla P100 (Colab): ~18 tok/s
- GeForce 940M (1GB): ~15 tok/s
- RTX 30xx/40xx: ~25+ tok/s
## Usage
### With llcuda (Recommended)
```python
# Install first: pip install llcuda
import llcuda

# The engine auto-detects the CUDA device (zero-config)
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M")

result = engine.infer("What is AI?")
print(result.text)
```
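If you prefer to manage the model file yourself, you can fetch it with the `huggingface_hub` Python API. The sketch below assumes `load_model` also accepts a local file path; check the llcuda documentation for the supported arguments.

```python
# A minimal sketch: download the GGUF manually, then point llcuda at it.
# Assumes load_model() also accepts a local file path (check the llcuda docs).
from huggingface_hub import hf_hub_download

import llcuda

model_path = hf_hub_download(
    repo_id="waqasm86/llcuda-models",
    filename="google_gemma-3-1b-it-Q4_K_M.gguf",
)

engine = llcuda.InferenceEngine()
engine.load_model(model_path)
print(engine.infer("What is AI?").text)
```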
### With llama.cpp
```bash
# Download model
huggingface-cli download waqasm86/llcuda-models google_gemma-3-1b-it-Q4_K_M.gguf --local-dir ./models
# Run with llama.cpp
./llama-server -m ./models/google_gemma-3-1b-it-Q4_K_M.gguf -ngl 26
```
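Once `llama-server` is running (it listens on port 8080 by default), you can query its OpenAI-compatible chat endpoint from Python. A minimal sketch using `requests`:

```python
# Query a running llama-server instance via its OpenAI-compatible endpoint.
# Assumes the server was started as shown above on the default port 8080.
import requests

response = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What is AI?"}],
        "max_tokens": 128,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```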
## Supported Platforms
- βœ… Google Colab (T4, P100, V100, A100)
- βœ… Kaggle (Tesla T4)
- βœ… Local GPUs (GeForce, RTX, Tesla)
- βœ… All NVIDIA GPUs with compute capability 5.0+
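To verify that a GPU meets the compute capability 5.0+ requirement, you can query it from Python. A minimal sketch, assuming PyTorch is available in the environment (`nvidia-smi` offers an equivalent check):

```python
# Quick check of the CUDA compute capability requirement (5.0+).
# Assumes PyTorch is installed; it is not required by llcuda itself here,
# only used for this check.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    print(f"{name}: compute capability {major}.{minor}")
    print("Supported" if (major, minor) >= (5, 0) else "Not supported (needs 5.0+)")
else:
    print("No CUDA device detected")
```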
## Links
- **PyPI**: [pypi.org/project/llcuda](https://pypi.org/project/llcuda/)
- **GitHub**: [github.com/waqasm86/llcuda](https://github.com/waqasm86/llcuda)
- **Documentation**: [waqasm86.github.io](https://waqasm86.github.io/)
## License
Apache 2.0. Models are provided as-is for educational and research purposes.
## Credits
- Model: Google Gemma 3 1B
- Quantization: llama.cpp GGUF format
- Package: llcuda by Waqas Muhammad