---
license: mit
tags:
- llm
- gguf
- llama
- gemma
- mistral
- qwen
- inference
- opentelemetry
- observability
- kaggle
library_name: llamatelemetry
---

# llamatelemetry Models (v1.2.0)

Curated collection of GGUF models optimized for **llamatelemetry v1.2.0** on Kaggle dual Tesla T4
GPUs (2× 15 GB VRAM), using the `gen_ai.*` OpenTelemetry semantic conventions.

## About This Repository

This repository contains GGUF models tested and verified to work with:
- **llamatelemetry v1.2.0**: CUDA-first OpenTelemetry Python SDK for LLM inference observability
- **Platform**: Kaggle Notebooks (2× Tesla T4, 30 GB total VRAM)
- **CUDA**: 12.5 | **Compute Capability**: SM 7.5

## Available Models

> **Status**: Repository initialized. Models will be added as they are verified on Kaggle T4x2.

### Planned Models (v1.2.0)

| Model | Size | Quantization | VRAM | Speed (tok/s) | Status |
|-------|------|--------------|------|---------------|--------|
| Gemma 3 1B Instruct | 1B | Q4_K_M | ~1.5 GB | ~80 | Coming soon |
| Gemma 3 4B Instruct | 4B | Q4_K_M | ~3.5 GB | ~50 | Coming soon |
| Llama 3.2 3B Instruct | 3B | Q4_K_M | ~3 GB | ~50 | Coming soon |
| Qwen 2.5 1.5B Instruct | 1.5B | Q4_K_M | ~2 GB | ~70 | Coming soon |
| Mistral 7B Instruct v0.3 | 7B | Q4_K_M | ~6 GB | ~25 | Coming soon |
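
The VRAM figures above can be sanity-checked with a rough rule of thumb: resident weights track the GGUF file size, plus a context-dependent KV cache and runtime buffers. The sketch below is illustrative only; the overhead constants are assumptions, not measurements from llamatelemetry.

```python
# Rough VRAM-fit estimate for a quantized GGUF model on a 15 GB T4.
# The KV-cache and runtime-overhead constants are illustrative assumptions.

def estimate_vram_gb(gguf_file_gb: float, n_ctx: int = 4096,
                     kv_gb_per_4k_ctx: float = 0.5,
                     runtime_overhead_gb: float = 0.8) -> float:
    """Estimate total VRAM: weights + KV cache (scales with context) + runtime buffers."""
    kv_cache_gb = kv_gb_per_4k_ctx * (n_ctx / 4096)
    return gguf_file_gb + kv_cache_gb + runtime_overhead_gb

def fits_single_t4(gguf_file_gb: float, n_ctx: int = 4096, budget_gb: float = 15.0) -> bool:
    """True if the estimate fits in one T4's 15 GB; otherwise consider tensor_split."""
    return estimate_vram_gb(gguf_file_gb, n_ctx) <= budget_gb

print(fits_single_t4(2.5))   # True: a ~2.5 GB Q4_K_M file fits comfortably
print(fits_single_t4(14.0))  # False: KV cache and overhead push it past 15 GB
```

Models that fail the single-GPU check are candidates for splitting across both T4s (see the dual GPU strategies below).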

### Model Selection Criteria

All models in this repository are:
1. ✅ **Tested** on Kaggle dual T4 GPUs with llamatelemetry v1.2.0
2. ✅ **Verified** to fit in 15 GB VRAM (single GPU) or 30 GB (split)
3. ✅ **Compatible** with GenAI semconv (`gen_ai.*` attributes)
4. ✅ **Instrumented**: TTFT, TPOT, and token usage captured automatically
5. ✅ **Documented** with performance benchmarks

## Quick Start

### Install llamatelemetry v1.2.0

```bash
pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v1.2.0
```

### Verify CUDA (v1.2.0 requirement)

```python
import llamatelemetry

llamatelemetry.require_cuda()  # Raises RuntimeError if no GPU
```
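
Before calling `require_cuda()`, it can be useful to probe for the NVIDIA driver directly. The helper below is a stdlib-only sketch of such a pre-flight check, not part of the llamatelemetry API:

```python
# Portable pre-flight GPU check using only the standard library: probe for
# nvidia-smi and confirm it reports at least one GPU. This is a sketch of a
# sanity check, not part of the llamatelemetry API.
import shutil
import subprocess

def cuda_gpu_available() -> bool:
    """Return True if nvidia-smi is on PATH and reports at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10,
        )
        return out.returncode == 0 and bool(out.stdout.strip())
    except (subprocess.TimeoutExpired, OSError):
        return False

print(cuda_gpu_available())  # True on Kaggle T4x2, False on CPU-only machines
```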

### Download and Run a Model

```python
import llamatelemetry
from llamatelemetry import ServerManager, ServerConfig
from llamatelemetry.llama import LlamaCppClient
from huggingface_hub import hf_hub_download

# Initialize SDK with GenAI metrics
llamatelemetry.init(
    service_name="kaggle-inference",
    otlp_endpoint="http://localhost:4317",
    enable_metrics=True,
    gpu_enrichment=True,
)

# Download model from this repo (once available)
model_path = hf_hub_download(
    repo_id="waqasm86/llamatelemetry-models",
    filename="gemma-3-4b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)

# Start server on dual T4
config = ServerConfig(
    model_path=model_path,
    tensor_split=[0.5, 0.5],
    n_gpu_layers=-1,
    flash_attn=True,
)
server = ServerManager(config)
server.start()

# Instrumented inference: emits gen_ai.* spans + metrics
client = LlamaCppClient(base_url=server.url, strict_operation_names=True)
response = client.chat(
    messages=[{"role": "user", "content": "Explain CUDA tensor cores."}],
    max_tokens=512,
)
print(response.choices[0].message.content)

llamatelemetry.shutdown()
server.stop()
```
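
For reference, the OpenTelemetry GenAI semantic conventions define `gen_ai.*` span attributes like the ones sketched below. This illustrates the convention itself; the exact attribute set llamatelemetry emits may differ.

```python
# Illustrative gen_ai.* span attributes, per the OpenTelemetry GenAI semantic
# conventions. The exact set llamatelemetry attaches to its spans may differ.
span_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "gemma-3-4b-it-Q4_K_M",
    "gen_ai.request.max_tokens": 512,
    "gen_ai.usage.input_tokens": 9,
    "gen_ai.usage.output_tokens": 384,
}

# By convention, the span name is "<operation> <model>"
span_name = f"{span_attributes['gen_ai.operation.name']} {span_attributes['gen_ai.request.model']}"
print(span_name)  # chat gemma-3-4b-it-Q4_K_M
```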

## GenAI Metrics Captured (v1.2.0)

Every inference call automatically records:

| Metric | Unit | Description |
|--------|------|-------------|
| `gen_ai.client.token.usage` | `{token}` | Input + output token count |
| `gen_ai.client.operation.duration` | `s` | Total request duration |
| `gen_ai.server.time_to_first_token` | `s` | TTFT latency |
| `gen_ai.server.time_per_output_token` | `s` | Per-token decode time |
| `gen_ai.server.request.active` | `{request}` | Concurrent in-flight requests |
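
These metrics are related by simple arithmetic: total duration is roughly TTFT plus output tokens times TPOT, and decode throughput is the reciprocal of TPOT. A quick back-of-the-envelope check (plain Python, not SDK code):

```python
# Relating the duration metrics: a back-of-the-envelope check, not SDK code.
# total duration ~= TTFT + output_tokens * TPOT; decode throughput = 1 / TPOT.

ttft_s = 0.35            # gen_ai.server.time_to_first_token
tpot_s = 0.02            # gen_ai.server.time_per_output_token (50 tok/s decode)
output_tokens = 512

duration_s = ttft_s + output_tokens * tpot_s   # ~= gen_ai.client.operation.duration
decode_tok_per_s = 1.0 / tpot_s

print(f"{duration_s:.2f} s total, {decode_tok_per_s:.0f} tok/s decode")  # 10.59 s total, 50 tok/s decode
```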

## Dual GPU Strategies

### Strategy 1: Inference on GPU 0, Analytics on GPU 1

```python
config = ServerConfig(
    model_path=model_path,
    tensor_split=[1.0, 0.0],  # 100% GPU 0
    n_gpu_layers=-1,
)
# GPU 1 free for RAPIDS / Graphistry / cuDF
```
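
To keep the analytics workload off GPU 0, a common general-CUDA approach (not an llamatelemetry feature) is to restrict the analytics process to GPU 1 before any CUDA library initializes:

```python
# General CUDA practice, not an llamatelemetry feature: restrict the analytics
# process to GPU 1 via CUDA_VISIBLE_DEVICES, set before cuDF/RAPIDS imports.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # this process now sees only GPU 1
# Inside this process, that GPU is enumerated as device 0.
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 1
```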

### Strategy 2: Model Split Across Both T4s (for larger models)

```python
config = ServerConfig(
    model_path=large_model_path,
    tensor_split=[0.5, 0.5],  # 50% each
    n_gpu_layers=-1,
)
```
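
`tensor_split` values are per-GPU proportions. When GPU 0 is partly occupied by other work, the split can be weighted by free VRAM instead of a fixed 50/50; the helper below is a hypothetical sketch, not part of llamatelemetry:

```python
# Hypothetical helper (not part of llamatelemetry): derive tensor_split
# fractions from per-GPU free VRAM so layers land where memory is available.

def tensor_split_from_free(free_gb: list[float]) -> list[float]:
    """Normalize per-GPU free VRAM into tensor_split fractions summing to ~1.0."""
    total = sum(free_gb)
    return [round(g / total, 3) for g in free_gb]

# e.g. GPU 0 has 12 GB free (some used by analytics), GPU 1 has 15 GB free
print(tensor_split_from_free([12.0, 15.0]))  # [0.444, 0.556]
```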

## Benchmarking Models

```python
from llamatelemetry.bench import BenchmarkRunner, BenchmarkProfile

runner = BenchmarkRunner(client=client, profile=BenchmarkProfile.STANDARD)
results = runner.run(
    model_name="gemma-3-4b-it-Q4_K_M",
    prompts=[
        "Explain attention mechanisms.",
        "Write a Python function to sort a list.",
    ],
)
print(results.summary())
# Output: TTFT p50/p95, tokens/sec, prefill_ms, decode_ms
```

## Links

- **GitHub Repository**: https://github.com/llamatelemetry/llamatelemetry
- **GitHub Releases**: https://github.com/llamatelemetry/llamatelemetry/releases/tag/v1.2.0
- **Binaries Repository**: https://huggingface.co/waqasm86/llamatelemetry-binaries
- **Kaggle Guide**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/KAGGLE_GUIDE.md
- **Integration Guide**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/INTEGRATION_GUIDE.md
- **API Reference**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/API_REFERENCE.md

## Model Sources

Models are sourced from reputable community providers:
- [Unsloth GGUF Models](https://huggingface.co/unsloth): optimized GGUF conversions
- [Bartowski GGUF Models](https://huggingface.co/bartowski): high-quality quants
- [LM Studio Community](https://huggingface.co/lmstudio-community): curated GGUF models

All models are:
- ✅ Publicly available under permissive licenses
- ✅ Verified on llamatelemetry v1.2.0 + Kaggle T4x2
- ✅ Credited to original authors

## Getting Help

- **GitHub Issues**: https://github.com/llamatelemetry/llamatelemetry/issues
- **Discussions**: https://github.com/llamatelemetry/llamatelemetry/discussions

## License

This repository is released under the MIT License.
Individual models: see each model card for its specific license (Apache 2.0, MIT, Gemma License, etc.).

---

**Maintained by**: [waqasm86](https://huggingface.co/waqasm86)
**SDK Version**: 1.2.0
**Last Updated**: 2026-02-20
**Target Platform**: Kaggle dual Tesla T4 (CUDA 12.5, SM 7.5)
**Status**: Active, models being added