---
license: mit
tags:
- llm
- gguf
- llama
- gemma
- mistral
- qwen
- inference
- opentelemetry
- observability
- kaggle
library_name: llamatelemetry
---
# llamatelemetry Models (v1.2.0)

A curated collection of GGUF models optimized for llamatelemetry v1.2.0 on Kaggle dual Tesla T4 GPUs (2× 15 GB VRAM), instrumented with the `gen_ai.*` OpenTelemetry semantic conventions.
## About This Repository
This repository contains GGUF models tested and verified to work with:
- **llamatelemetry v1.2.0**: CUDA-first OpenTelemetry Python SDK for LLM inference observability
- **Platform**: Kaggle Notebooks (2× Tesla T4, 30 GB total VRAM)
- **CUDA**: 12.5 | **Compute capability**: SM 7.5
## Available Models
**Status:** Repository initialized. Models will be added as they are verified on Kaggle T4×2.
### Planned Models (v1.2.0)
| Model | Size | Quantization | VRAM | Speed (tok/s) | Status |
|---|---|---|---|---|---|
| Gemma 3 1B Instruct | 1B | Q4_K_M | ~1.5 GB | ~80 | Coming soon |
| Gemma 3 4B Instruct | 4B | Q4_K_M | ~3.5 GB | ~50 | Coming soon |
| Llama 3.2 3B Instruct | 3B | Q4_K_M | ~3 GB | ~50 | Coming soon |
| Qwen 2.5 1.5B Instruct | 1.5B | Q4_K_M | ~2 GB | ~70 | Coming soon |
| Mistral 7B Instruct v0.3 | 7B | Q4_K_M | ~6 GB | ~25 | Coming soon |
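The VRAM column follows a simple rule of thumb: Q4_K_M averages roughly 4.8 bits (~0.6 bytes) per weight, plus overhead for the KV cache and CUDA context. A quick sanity-check sketch (the 0.6 B/param figure and the flat overhead constant are approximations, not llamatelemetry API):

```python
def estimate_q4km_vram_gb(n_params_billion: float, overhead_gb: float = 1.0) -> float:
    """Rough VRAM estimate for a Q4_K_M GGUF model.

    Q4_K_M averages ~4.8 bits (~0.6 bytes) per weight; overhead_gb is a
    crude allowance for KV cache + CUDA context. Both figures are heuristics.
    """
    return n_params_billion * 0.6 + overhead_gb

# A 4B model comes out around 3.4 GB, in line with the ~3.5 GB in the table
print(round(estimate_q4km_vram_gb(4.0), 1))  # 3.4
```

Actual usage varies with context length and batch size, so treat the table values as starting points.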
### Model Selection Criteria
All models in this repository are:
- Tested on Kaggle dual T4 GPUs with llamatelemetry v1.2.0
- Verified to fit in 15 GB VRAM (single GPU) or 30 GB (split across both)
- Compatible with the GenAI semantic conventions (`gen_ai.*` attributes)
- Instrumented: TTFT, TPOT, and token usage captured automatically
- Documented with performance benchmarks
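For reference, the OpenTelemetry GenAI semantic conventions define span attributes in the `gen_ai.*` namespace; the keys below follow the semconv at the time of writing, while the values are purely illustrative:

```python
# Illustrative gen_ai.* span attributes per the OpenTelemetry GenAI
# semantic conventions; the values shown are made-up examples.
example_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "gemma-3-4b-it-Q4_K_M",
    "gen_ai.usage.input_tokens": 12,
    "gen_ai.usage.output_tokens": 256,
}

# Every attribute key shares the gen_ai.* namespace
assert all(key.startswith("gen_ai.") for key in example_attributes)
print(sorted(example_attributes))
```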
## Quick Start

### Install llamatelemetry v1.2.0

```bash
pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v1.2.0
```

### Verify CUDA (v1.2.0 requirement)

```python
import llamatelemetry

llamatelemetry.require_cuda()  # Raises RuntimeError if no GPU is available
```

### Download and Run a Model
```python
import llamatelemetry
from llamatelemetry import ServerManager, ServerConfig
from llamatelemetry.llama import LlamaCppClient
from huggingface_hub import hf_hub_download

# Initialize the SDK with GenAI metrics
llamatelemetry.init(
    service_name="kaggle-inference",
    otlp_endpoint="http://localhost:4317",
    enable_metrics=True,
    gpu_enrichment=True,
)

# Download a model from this repo (once available)
model_path = hf_hub_download(
    repo_id="waqasm86/llamatelemetry-models",
    filename="gemma-3-4b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)

# Start the server on dual T4s
config = ServerConfig(
    model_path=model_path,
    tensor_split=[0.5, 0.5],
    n_gpu_layers=-1,
    flash_attn=True,
)
server = ServerManager(config)
server.start()

# Instrumented inference: emits gen_ai.* spans and metrics
client = LlamaCppClient(base_url=server.url, strict_operation_names=True)
response = client.chat(
    messages=[{"role": "user", "content": "Explain CUDA tensor cores."}],
    max_tokens=512,
)
print(response.choices[0].message.content)

llamatelemetry.shutdown()
server.stop()
```
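The `otlp_endpoint` above assumes an OpenTelemetry Collector listening for gRPC on port 4317. A minimal collector configuration for local testing might look like the following sketch (the `debug` exporter just prints telemetry to stdout; swap in the exporter for your actual backend):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      exporters: [debug]
```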
## GenAI Metrics Captured (v1.2.0)
Every inference call automatically records:
| Metric | Unit | Description |
|---|---|---|
| `gen_ai.client.token.usage` | `{token}` | Input + output token count |
| `gen_ai.client.operation.duration` | `s` | Total request duration |
| `gen_ai.server.time_to_first_token` | `s` | TTFT latency |
| `gen_ai.server.time_per_output_token` | `s` | Per-token decode time |
| `gen_ai.server.request.active` | `{request}` | Concurrent in-flight requests |
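TPOT relates to the other metrics by a simple identity: decode time is total duration minus TTFT, spread over the output tokens. A toy calculation (plain arithmetic, not llamatelemetry API):

```python
def time_per_output_token(duration_s: float, ttft_s: float, output_tokens: int) -> float:
    """TPOT = (total request duration - time to first token) / output tokens."""
    return (duration_s - ttft_s) / output_tokens

# A 10.4 s request with 0.4 s TTFT producing 500 tokens
# decodes at 0.02 s/token, i.e. 50 tok/s
tpot = time_per_output_token(10.4, 0.4, 500)
print(round(tpot, 3), round(1 / tpot, 1))  # 0.02 50.0
```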
## Dual GPU Strategies

### Strategy 1: Inference on GPU 0, Analytics on GPU 1
```python
config = ServerConfig(
    model_path=model_path,
    tensor_split=[1.0, 0.0],  # 100% on GPU 0
    n_gpu_layers=-1,
)
# GPU 1 stays free for RAPIDS / Graphistry / cuDF
```
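One way to confirm GPU 1 really is idle is to parse `nvidia-smi` output. The sketch below uses a hypothetical helper around the standard NVIDIA `--query-gpu` CSV format, which is not part of llamatelemetry:

```python
def gpu_memory_used_mib(csv_text: str) -> dict[int, int]:
    """Parse `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits`."""
    usage: dict[int, int] = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage

# On a GPU machine, capture live output with:
#   import subprocess
#   csv_text = subprocess.check_output(
#       ["nvidia-smi", "--query-gpu=index,memory.used",
#        "--format=csv,noheader,nounits"], text=True)
sample = "0, 6901\n1, 3\n"  # illustrative Strategy 1 reading, in MiB
print(gpu_memory_used_mib(sample))  # {0: 6901, 1: 3} -> GPU 1 is free
```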
### Strategy 2: Model Split Across Both T4s (for larger models)
```python
config = ServerConfig(
    model_path=large_model_path,
    tensor_split=[0.5, 0.5],  # 50% on each GPU
    n_gpu_layers=-1,
)
```
## Benchmarking Models
```python
from llamatelemetry.bench import BenchmarkRunner, BenchmarkProfile

runner = BenchmarkRunner(client=client, profile=BenchmarkProfile.STANDARD)
results = runner.run(
    model_name="gemma-3-4b-it-Q4_K_M",
    prompts=[
        "Explain attention mechanisms.",
        "Write a Python function to sort a list.",
    ],
)
print(results.summary())
# Reports TTFT p50/p95, tokens/sec, prefill_ms, decode_ms
```
## Links
- GitHub Repository: https://github.com/llamatelemetry/llamatelemetry
- GitHub Releases: https://github.com/llamatelemetry/llamatelemetry/releases/tag/v1.2.0
- Binaries Repository: https://huggingface.co/waqasm86/llamatelemetry-binaries
- Kaggle Guide: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/KAGGLE_GUIDE.md
- Integration Guide: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/INTEGRATION_GUIDE.md
- API Reference: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/API_REFERENCE.md
## Model Sources
Models are sourced from reputable community providers:
- Unsloth GGUF Models: optimized GGUF conversions
- Bartowski GGUF Models: high-quality quants
- LM Studio Community: curated GGUF models
All models are:
- Publicly available under permissive licenses
- Verified on llamatelemetry v1.2.0 + Kaggle T4×2
- Credited to their original authors
## Getting Help
- GitHub Issues: https://github.com/llamatelemetry/llamatelemetry/issues
- Discussions: https://github.com/llamatelemetry/llamatelemetry/discussions
## License

- **This repository:** MIT License
- **Individual models:** see each model card for its specific license (Apache 2.0, MIT, Gemma License, etc.)

**Maintained by:** waqasm86 · **SDK Version:** 1.2.0 · **Last Updated:** 2026-02-20 · **Target Platform:** Kaggle dual Tesla T4 (CUDA 12.5, SM 7.5) · **Status:** Active, models being added