# llamatelemetry Models
Curated collection of GGUF models optimized for llamatelemetry on Kaggle dual Tesla T4 GPUs (2× 15 GB VRAM).
## About This Repository
This repository contains GGUF models tested and verified to work with:
- llamatelemetry v0.1.0 - CUDA-first OpenTelemetry Python SDK for LLM inference observability
- Platform: Kaggle Notebooks (2× Tesla T4, 30 GB total VRAM; see the environment check below)
- CUDA: 12.5
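Before installing anything, it is worth confirming that the session actually exposes both T4s. A minimal sketch using PyTorch, which is normally preinstalled on Kaggle GPU images:

```python
import torch

# Confirm both Tesla T4s are visible and report the CUDA runtime version.
assert torch.cuda.is_available(), "No CUDA device found; enable the GPU accelerator."
print("CUDA runtime:", torch.version.cuda)
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```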
## Available Models
**Status:** Repository created, models coming soon!
### Planned Models (v0.1.0)
| Model | Size | Quantization | VRAM | Speed (tok/s) | Status |
|---|---|---|---|---|---|
| Gemma 3 1B Instruct | 1B | Q4_K_M | ~1.5 GB | ~80 | Coming soon |
| Gemma 3 3B Instruct | 3B | Q4_K_M | ~3 GB | ~50 | Coming soon |
| Llama 3.2 3B Instruct | 3B | Q4_K_M | ~3 GB | ~50 | Coming soon |
| Qwen 2.5 1.5B Instruct | 1.5B | Q4_K_M | ~2 GB | ~70 | Coming soon |
| Mistral 7B Instruct | 7B | Q4_K_M | ~6 GB | ~25 | Coming soon |
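The VRAM figures above are rough estimates. For sizing other models, a useful rule of thumb is that Q4_K_M lands around 4.85 bits per weight; the sketch below applies that figure plus an assumed 1.2× overhead factor for KV cache and CUDA buffers (both numbers are approximations, not measurements):

```python
def estimate_q4_k_m_vram_gb(n_params_billion: float,
                            bits_per_weight: float = 4.85,
                            overhead: float = 1.2) -> float:
    """Back-of-the-envelope VRAM estimate for a Q4_K_M GGUF model.

    bits_per_weight and overhead are assumed constants; real usage
    varies with context length, KV cache, and batch settings.
    """
    weights_gb = n_params_billion * bits_per_weight / 8
    return weights_gb * overhead

# Example: a 7B model comes out around 5 GB, the same ballpark
# as the ~6 GB shown in the table above.
print(f"{estimate_q4_k_m_vram_gb(7):.1f} GB")
```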
### Model Selection Criteria
Models in this repository are:
- ✅ Tested on Kaggle dual T4 GPUs
- ✅ Verified to fit in 15 GB VRAM on a single GPU (a quick check is sketched below)
- ✅ Compatible with llamatelemetry's observability features
- ✅ Optimized for GGUF + CUDA acceleration
- ✅ Documented with performance benchmarks
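To confirm a model will actually fit before loading it, check free VRAM per GPU. A minimal sketch using the NVML bindings (`pip install nvidia-ml-py` if they are not already present):

```python
import pynvml

# Report free/total memory on each Tesla T4.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: {mem.free / 1e9:.1f} GB free of {mem.total / 1e9:.1f} GB")
pynvml.nvmlShutdown()
```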
## Quick Start
### Install llamatelemetry
```bash
# On Kaggle with GPU T4 × 2
pip install --no-cache-dir --force-reinstall \
    git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0
```
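To confirm the pinned tag installed cleanly, query the package metadata (this assumes the distribution name matches the repo name):

```python
from importlib.metadata import version

# Expect "0.1.0" if the v0.1.0 tag installed correctly.
print(version("llamatelemetry"))
```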
### Download and Run a Model
```python
from llamatelemetry import InferenceEngine
from huggingface_hub import hf_hub_download

# Download model (example - not yet available)
model_path = hf_hub_download(
    repo_id="waqasm86/llamatelemetry-models",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)

# Load model on GPU 0
engine = InferenceEngine()
engine.load_model(model_path, silent=True)

# Run inference with telemetry
result = engine.infer("Explain quantum computing in simple terms", max_tokens=150)
print(result.text)
```
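Since the models are not published yet, the filename above is a placeholder. Once files land, you can enumerate what the repo actually contains instead of hardcoding a name:

```python
from huggingface_hub import list_repo_files

# List any GGUF files currently uploaded to the model repo.
files = list_repo_files("waqasm86/llamatelemetry-models")
print([f for f in files if f.endswith(".gguf")])
```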
## Recommended Models by Use Case
### For Fast Prototyping
- Gemma 3 1B - Fastest inference, good for testing
- Qwen 2.5 1.5B - Balance of speed and quality
### For Production Quality
- Gemma 3 3B - High quality, reasonable speed
- Llama 3.2 3B - Strong reasoning capabilities
### For Complex Tasks
- Mistral 7B - Best quality; slower, but fits on a single T4
## Model Sources
Models are sourced from reputable providers:
- Unsloth GGUF Models - Optimized GGUF conversions
- TheBloke GGUF Models - Community standard
- Bartowski GGUF Models - High-quality quants
All models are:
- ✅ Publicly available under permissive licenses
- ✅ Re-hosted here for convenience and verification (see the checksum sketch below)
- ✅ Credited to original authors
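One way to verify that a re-hosted file matches the upstream release is to compare SHA-256 checksums against the original model card. A minimal sketch, reusing the download path from the Quick Start:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so multi-GB GGUFs never need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("/kaggle/working/models/gemma-3-1b-it-Q4_K_M.gguf"))
```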
## Dual GPU Strategies
llamatelemetry supports multi-GPU workloads:
### Strategy 1: LLM on GPU 0, Observability on GPU 1
```python
from llamatelemetry.server import ServerManager

# Start llama-server on GPU 0 only
server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # 100% GPU 0, 0% GPU 1
    flash_attn=1,
)

# GPU 1 is now free for RAPIDS/Graphistry visualization
```
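With the model pinned to GPU 0, any CUDA-aware library can claim GPU 1 explicitly. A hedged sketch using CuPy (commonly available on Kaggle GPU images; RAPIDS libraries accept the same device selection):

```python
import cupy as cp

# Allocate and compute on the second T4 while llama-server owns GPU 0.
with cp.cuda.Device(1):
    x = cp.random.random((4096, 4096), dtype=cp.float32)
    print(float((x @ x).sum()))
```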
### Strategy 2: Model Sharding Across Both GPUs
```python
# Split a large model across both T4s
server.start_server(
    model_path=large_model_path,
    gpu_layers=99,
    tensor_split="0.5,0.5",  # 50% GPU 0, 50% GPU 1
)
```
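The `tensor_split` values follow llama.cpp's convention: they are relative proportions assigned to each device, so `"0.5,0.5"` spreads the layers roughly evenly across the two T4s, while an uneven split such as `"0.7,0.3"` can be useful when one GPU is also hosting other work.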
## Documentation & Links
- GitHub: https://github.com/llamatelemetry/llamatelemetry
- Installation Guide: KAGGLE_INSTALL_GUIDE.md
- Binaries: https://huggingface.co/waqasm86/llamatelemetry-binaries
- Tutorials: notebooks/
## Getting Help
- GitHub Issues: https://github.com/llamatelemetry/llamatelemetry/issues
- Documentation: https://llamatelemetry.github.io (planned)
## License
This repository: MIT License
Individual models: See model cards for specific licenses (Apache 2.0, MIT, Gemma License, etc.)
Maintained by: waqasm86
Status: Repository initialized, models coming soon
Target Platform: Kaggle dual Tesla T4 (CUDA 12.5)
Last Updated: 2026-02-03