---
license: mit
tags:
  - llm
  - gguf
  - llama
  - gemma
  - mistral
  - qwen
  - inference
  - opentelemetry
  - observability
  - kaggle
library_name: llamatelemetry
---

llamatelemetry Models (v1.2.0)

A curated collection of GGUF models optimized for llamatelemetry v1.2.0 on Kaggle dual Tesla T4 GPUs (2× 15 GB VRAM), instrumented with the gen_ai.* OpenTelemetry semantic conventions.

🎯 About This Repository

This repository contains GGUF models tested and verified to work with:

  • llamatelemetry v1.2.0 — CUDA-first OpenTelemetry Python SDK for LLM inference observability
  • Platform: Kaggle Notebooks (2× Tesla T4, 30 GB total VRAM)
  • CUDA: 12.5 | Compute Capability: SM 7.5

📦 Available Models

Status: Repository initialized. Models will be added as they are verified on Kaggle T4x2.

Planned Models (v1.2.0)

| Model | Size | Quantization | VRAM | Speed (tok/s) | Status |
|-------|------|--------------|------|---------------|--------|
| Gemma 3 1B Instruct | 1B | Q4_K_M | ~1.5 GB | ~80 | 🔄 Coming soon |
| Gemma 3 4B Instruct | 4B | Q4_K_M | ~3.5 GB | ~50 | 🔄 Coming soon |
| Llama 3.2 3B Instruct | 3B | Q4_K_M | ~3 GB | ~50 | 🔄 Coming soon |
| Qwen 2.5 1.5B Instruct | 1.5B | Q4_K_M | ~2 GB | ~70 | 🔄 Coming soon |
| Mistral 7B Instruct v0.3 | 7B | Q4_K_M | ~6 GB | ~25 | 🔄 Coming soon |
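A rough way to sanity-check the VRAM column: Q4_K_M averages on the order of 4.8 bits per weight, plus runtime overhead for the KV cache and CUDA buffers. A minimal sketch, where the helper name, the 4.8-bit figure, and the flat 1 GB overhead are illustrative assumptions (the table's figures were measured with a full serving context, so they run somewhat higher):

```python
def estimate_q4_k_m_vram_gb(n_params_billions: float, overhead_gb: float = 1.0) -> float:
    """Ballpark VRAM estimate for a Q4_K_M GGUF model.

    Assumptions: ~4.8 bits per weight on average for Q4_K_M (the exact
    figure varies per model) and a flat overhead allowance for KV cache
    and CUDA buffers. Not a measured value.
    """
    bits_per_weight = 4.8
    weight_gb = n_params_billions * 1e9 * bits_per_weight / 8 / 1024**3
    return weight_gb + overhead_gb

for size in (1, 3, 7):
    print(f"{size}B -> ~{estimate_q4_k_m_vram_gb(size):.1f} GB")
```

For the 1B model this lands near the table's ~1.5 GB; for the 7B model it gives a lower bound below the ~6 GB measured with a full context window.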

Model Selection Criteria

All models in this repository are:

  1. ✅ Tested on Kaggle dual T4 GPUs with llamatelemetry v1.2.0
  2. ✅ Verified to fit in 15 GB VRAM (single GPU) or 30 GB (split)
  3. ✅ Compatible with GenAI semconv (gen_ai.* attributes)
  4. ✅ Instrumented — TTFT, TPOT, token usage captured automatically
  5. ✅ Documented with performance benchmarks

🚀 Quick Start

Install llamatelemetry v1.2.0

```bash
pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v1.2.0
```

Verify CUDA (v1.2.0 requirement)

```python
import llamatelemetry
llamatelemetry.require_cuda()  # Raises RuntimeError if no GPU
```

Download and Run a Model

```python
import llamatelemetry
from llamatelemetry import ServerManager, ServerConfig
from llamatelemetry.llama import LlamaCppClient
from huggingface_hub import hf_hub_download

# Initialize SDK with GenAI metrics
llamatelemetry.init(
    service_name="kaggle-inference",
    otlp_endpoint="http://localhost:4317",
    enable_metrics=True,
    gpu_enrichment=True,
)

# Download model from this repo (once available)
model_path = hf_hub_download(
    repo_id="waqasm86/llamatelemetry-models",
    filename="gemma-3-4b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models"
)

# Start server on dual T4
config = ServerConfig(
    model_path=model_path,
    tensor_split=[0.5, 0.5],
    n_gpu_layers=-1,
    flash_attn=True,
)
server = ServerManager(config)
server.start()

# Instrumented inference — emits gen_ai.* spans + metrics
client = LlamaCppClient(base_url=server.url, strict_operation_names=True)
response = client.chat(
    messages=[{"role": "user", "content": "Explain CUDA tensor cores."}],
    max_tokens=512,
)
print(response.choices[0].message.content)

llamatelemetry.shutdown()
server.stop()
```

📊 GenAI Metrics Captured (v1.2.0)

Every inference call automatically records:

| Metric | Unit | Description |
|--------|------|-------------|
| gen_ai.client.token.usage | {token} | Input + output token count |
| gen_ai.client.operation.duration | s | Total request duration |
| gen_ai.server.time_to_first_token | s | TTFT latency |
| gen_ai.server.time_per_output_token | s | Per-token decode time |
| gen_ai.server.request.active | {request} | Concurrent in-flight requests |
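The two server-side latency metrics are derived from per-token arrival timestamps: TTFT is the wait for the first output token, and TPOT averages the gaps between subsequent tokens. A minimal sketch of that arithmetic (the helper below is hypothetical, not part of the SDK):

```python
def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Derive TTFT and TPOT from a request start time and per-token
    arrival timestamps, all in seconds.

    TTFT: time from request start to the first output token.
    TPOT: average gap between output tokens after the first
    (decode only), 0.0 if only one token was produced.
    """
    ttft = token_times[0] - request_start
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot

ttft, tpot = ttft_and_tpot(0.0, [0.20, 0.25, 0.30, 0.35])
print(ttft, tpot)  # TTFT ≈ 0.20 s, TPOT ≈ 0.05 s/token
```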

🎯 Dual GPU Strategies

Strategy 1: Inference on GPU 0, Analytics on GPU 1

```python
config = ServerConfig(
    model_path=model_path,
    tensor_split=[1.0, 0.0],  # 100% GPU 0
    n_gpu_layers=-1,
)
# GPU 1 free for RAPIDS / Graphistry / cuDF
```

Strategy 2: Model Split Across Both T4s (for larger models)

```python
config = ServerConfig(
    model_path=large_model_path,
    tensor_split=[0.5, 0.5],  # 50% each
    n_gpu_layers=-1,
)
```
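tensor_split values are proportions, which llama.cpp normalizes before distributing layers. A small sketch of how a split maps to per-GPU memory against the T4's ~15 GB budget (hypothetical helper; assumes weights divide exactly proportionally and ignores KV-cache placement):

```python
def split_vram(model_gb: float, tensor_split: list[float],
               gpu_capacity_gb: tuple[float, ...] = (15.0, 15.0)):
    """Estimate per-GPU weight memory for a given tensor_split.

    Fractions are normalized (so [1.0, 0.0] and [2, 0] behave the same),
    mirroring how llama.cpp treats the split as proportions. Returns the
    per-GPU estimate and whether each share fits its GPU's capacity.
    """
    total = sum(tensor_split)
    shares = [f / total for f in tensor_split]
    per_gpu = [model_gb * s for s in shares]
    fits = all(need <= cap for need, cap in zip(per_gpu, gpu_capacity_gb))
    return per_gpu, fits

print(split_vram(6.0, [0.5, 0.5]))  # ([3.0, 3.0], True)
print(split_vram(6.0, [1.0, 0.0]))  # ([6.0, 0.0], True) -- GPU 1 left free
```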

🔧 Benchmarking Models

```python
from llamatelemetry.bench import BenchmarkRunner, BenchmarkProfile

runner = BenchmarkRunner(client=client, profile=BenchmarkProfile.STANDARD)
results = runner.run(
    model_name="gemma-3-4b-it-Q4_K_M",
    prompts=[
        "Explain attention mechanisms.",
        "Write a Python function to sort a list.",
    ],
)
print(results.summary())
# Output: TTFT p50/p95, tokens/sec, prefill_ms, decode_ms
```
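The p50/p95 figures in the summary are plain empirical percentiles over the collected samples. A sketch with the standard library (hypothetical helper for illustration, not the BenchmarkRunner internals):

```python
import statistics

def latency_summary(ttft_samples_ms: list[float]) -> dict[str, float]:
    """p50/p95 over raw TTFT samples (milliseconds), using linearly
    interpolated empirical percentiles from the standard library."""
    qs = statistics.quantiles(ttft_samples_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94]}

samples = [110, 120, 125, 130, 140, 150, 160, 180, 220, 300]
print(latency_summary(samples))  # {'p50': 145.0, 'p95': 264.0}
```

p95 is the usual headline number for tail latency: a single slow request (the 300 ms sample here) pulls p95 far more than p50.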

🔗 Links

🔗 Model Sources

Models are sourced from reputable community providers.

All models are:

  • ✅ Publicly available under permissive licenses
  • ✅ Verified on llamatelemetry v1.2.0 + Kaggle T4x2
  • ✅ Credited to original authors

🆘 Getting Help

📄 License

This repository: MIT License. Individual models: see each model card for its specific license (Apache 2.0, MIT, Gemma license, etc.).


Maintained by: waqasm86
SDK Version: 1.2.0
Last Updated: 2026-02-20
Target Platform: Kaggle dual Tesla T4 (CUDA 12.5, SM 7.5)
Status: Active — models being added