---
license: mit
tags:
  - llm
  - gguf
  - llama
  - gemma
  - mistral
  - qwen
  - inference
  - opentelemetry
  - observability
  - kaggle
library_name: llamatelemetry
---

# llamatelemetry Models (v1.2.0)

Curated collection of GGUF models optimized for **llamatelemetry v1.2.0** on Kaggle dual Tesla T4 GPUs (2× 15 GB VRAM), using `gen_ai.*` OpenTelemetry semantic conventions.

## 🎯 About This Repository

This repository contains GGUF models tested and verified to work with:

- **llamatelemetry v1.2.0** — CUDA-first OpenTelemetry Python SDK for LLM inference observability
- **Platform**: Kaggle Notebooks (2× Tesla T4, 30 GB total VRAM)
- **CUDA**: 12.5 | **Compute Capability**: SM 7.5

## 📦 Available Models

> **Status**: Repository initialized. Models will be added as they are verified on Kaggle T4x2.

### Planned Models (v1.2.0)

| Model | Size | Quantization | VRAM | Speed (tok/s) | Status |
|-------|------|--------------|------|---------------|--------|
| Gemma 3 1B Instruct | 1B | Q4_K_M | ~1.5 GB | ~80 | 🔄 Coming soon |
| Gemma 3 4B Instruct | 4B | Q4_K_M | ~3.5 GB | ~50 | 🔄 Coming soon |
| Llama 3.2 3B Instruct | 3B | Q4_K_M | ~3 GB | ~50 | 🔄 Coming soon |
| Qwen 2.5 1.5B Instruct | 1.5B | Q4_K_M | ~2 GB | ~70 | 🔄 Coming soon |
| Mistral 7B Instruct v0.3 | 7B | Q4_K_M | ~6 GB | ~25 | 🔄 Coming soon |

### Model Selection Criteria

All models in this repository are:

1. ✅ **Tested** on Kaggle dual T4 GPUs with llamatelemetry v1.2.0
2. ✅ **Verified** to fit in 15 GB VRAM (single GPU) or 30 GB (split)
3. ✅ **Compatible** with GenAI semconv (`gen_ai.*` attributes)
4. ✅ **Instrumented** — TTFT, TPOT, and token usage captured automatically
5. ✅ **Documented** with performance benchmarks

## 🚀 Quick Start

### Install llamatelemetry v1.2.0

```bash
pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v1.2.0
```

### Verify CUDA (v1.2.0 requirement)

```python
import llamatelemetry

llamatelemetry.require_cuda()  # Raises RuntimeError if no GPU
```

### Download and Run a Model

```python
import llamatelemetry
from llamatelemetry import ServerManager, ServerConfig
from llamatelemetry.llama import LlamaCppClient
from huggingface_hub import hf_hub_download

# Initialize the SDK with GenAI metrics
llamatelemetry.init(
    service_name="kaggle-inference",
    otlp_endpoint="http://localhost:4317",
    enable_metrics=True,
    gpu_enrichment=True,
)

# Download a model from this repo (once available)
model_path = hf_hub_download(
    repo_id="waqasm86/llamatelemetry-models",
    filename="gemma-3-4b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)

# Start the server on dual T4s
config = ServerConfig(
    model_path=model_path,
    tensor_split=[0.5, 0.5],
    n_gpu_layers=-1,
    flash_attn=True,
)
server = ServerManager(config)
server.start()

# Instrumented inference — emits gen_ai.* spans + metrics
client = LlamaCppClient(base_url=server.url, strict_operation_names=True)
response = client.chat(
    messages=[{"role": "user", "content": "Explain CUDA tensor cores."}],
    max_tokens=512,
)
print(response.choices[0].message.content)

llamatelemetry.shutdown()
server.stop()
```

## 📊 GenAI Metrics Captured (v1.2.0)

Every inference call automatically records:

| Metric | Unit | Description |
|--------|------|-------------|
| `gen_ai.client.token.usage` | `{token}` | Input + output token count |
| `gen_ai.client.operation.duration` | `s` | Total request duration |
| `gen_ai.server.time_to_first_token` | `s` | TTFT latency |
| `gen_ai.server.time_per_output_token` | `s` | Per-token decode time |
| `gen_ai.server.request.active` | `{request}` | Concurrent in-flight requests |

## 🎯 Dual GPU Strategies

### Strategy 1: Inference on GPU 0, Analytics on GPU 1

```python
config = ServerConfig(
    model_path=model_path,
    tensor_split=[1.0, 0.0],  # 100% of the model on GPU 0
    n_gpu_layers=-1,
)
# GPU 1 stays free for RAPIDS / Graphistry / cuDF
```

### Strategy 2: Model Split Across Both T4s (for larger models)

```python
config = ServerConfig(
    model_path=large_model_path,
    tensor_split=[0.5, 0.5],  # 50% of the model on each GPU
    n_gpu_layers=-1,
)
```

## 🔧 Benchmarking Models

```python
from llamatelemetry.bench import BenchmarkRunner, BenchmarkProfile

runner = BenchmarkRunner(client=client, profile=BenchmarkProfile.STANDARD)
results = runner.run(
    model_name="gemma-3-4b-it-Q4_K_M",
    prompts=[
        "Explain attention mechanisms.",
        "Write a Python function to sort a list.",
    ],
)
print(results.summary())
# Output: TTFT p50/p95, tokens/sec, prefill_ms, decode_ms
```

## 🔗 Links

- **GitHub Repository**: https://github.com/llamatelemetry/llamatelemetry
- **GitHub Releases**: https://github.com/llamatelemetry/llamatelemetry/releases/tag/v1.2.0
- **Binaries Repository**: https://huggingface.co/waqasm86/llamatelemetry-binaries
- **Kaggle Guide**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/KAGGLE_GUIDE.md
- **Integration Guide**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/INTEGRATION_GUIDE.md
- **API Reference**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/API_REFERENCE.md

## 🔗 Model Sources

Models are sourced from reputable community providers:

- [Unsloth GGUF Models](https://huggingface.co/unsloth) — Optimized GGUF conversions
- [Bartowski GGUF Models](https://huggingface.co/bartowski) — High-quality quants
- [LM Studio Community](https://huggingface.co/lmstudio-community) — Curated GGUF models

All models are:

- ✅ Publicly available under permissive licenses
- ✅ Verified on llamatelemetry v1.2.0 + Kaggle T4x2
- ✅ Credited to original authors

## 🆘 Getting Help

- **GitHub Issues**: https://github.com/llamatelemetry/llamatelemetry/issues
- **Discussions**: https://github.com/llamatelemetry/llamatelemetry/discussions

## 📄 License

This repository: MIT License.

Individual models: see each model card for its specific license (Apache 2.0, MIT, Gemma License, etc.).

---

**Maintained by**: [waqasm86](https://huggingface.co/waqasm86)
**SDK Version**: 1.2.0
**Last Updated**: 2026-02-20
**Target Platform**: Kaggle dual Tesla T4 (CUDA 12.5, SM 7.5)
**Status**: Active — models being added
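
## 📐 Appendix: Rough VRAM Fit Check

The Model Selection Criteria above require every model to fit in 15 GB of VRAM (single T4) or 30 GB (split). As a minimal back-of-envelope sketch — not part of the llamatelemetry API — a Q4_K_M GGUF needs roughly its file size in VRAM for weights, plus headroom for the KV cache and CUDA context. The 1.25 GB overhead below is an illustrative placeholder, not a measured figure; the real KV-cache footprint depends on context length, layer count, and KV head dimensions.

```python
def fits_in_vram(model_file_gb: float, vram_gb: float = 15.0,
                 overhead_gb: float = 1.25) -> bool:
    """Rough single-GPU fit check: weights plus fixed KV-cache/CUDA headroom.

    overhead_gb = 1.25 is an assumed placeholder for illustration only;
    measure the real footprint with nvidia-smi on your own model/context.
    """
    return model_file_gb + overhead_gb <= vram_gb

# VRAM figures taken from the planned-models table above
print(fits_in_vram(6.0))                 # Mistral 7B Q4_K_M (~6 GB) on one 15 GB T4 → True
print(fits_in_vram(3.5, vram_gb=30.0))   # Gemma 3 4B split across both T4s → True
```

For borderline cases, prefer checking actual usage with `nvidia-smi` after the server loads the model rather than trusting the estimate.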
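
## 📐 Appendix: Rough Latency Estimate from tok/s

The Speed column in the planned-models table can be converted into a rough end-to-end latency estimate: total time ≈ TTFT + output tokens ÷ decode speed. A minimal sketch follows; the 0.5 s TTFT default is an assumed placeholder, not a benchmark result — the real value comes from the `gen_ai.server.time_to_first_token` metric.

```python
def estimated_latency_s(output_tokens: int, tok_per_s: float,
                        ttft_s: float = 0.5) -> float:
    """Rough wall-clock estimate: time-to-first-token plus decode time.

    ttft_s = 0.5 is an assumed placeholder; substitute the measured
    gen_ai.server.time_to_first_token value for your setup.
    """
    return ttft_s + output_tokens / tok_per_s

# 512 output tokens at ~25 tok/s (Mistral 7B row): roughly 21 seconds
print(round(estimated_latency_s(512, 25.0), 1))
```

This is only a planning aid; the benchmarking section's `results.summary()` reports the measured TTFT and tokens/sec for a given model.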