---
license: mit
tags:
  - llm
  - gguf
  - llama
  - gemma
  - mistral
  - qwen
  - inference
  - opentelemetry
  - observability
  - kaggle
library_name: llamatelemetry
---

llamatelemetry Models (v1.2.0)

A curated collection of GGUF models optimized for llamatelemetry v1.2.0 on Kaggle dual Tesla T4 GPUs (2× 15 GB VRAM), instrumented with the gen_ai.* OpenTelemetry semantic conventions.

🎯 About This Repository

This repository contains GGUF models tested and verified to work with:

  • llamatelemetry v1.2.0 — CUDA-first OpenTelemetry Python SDK for LLM inference observability
  • Platform: Kaggle Notebooks (2× Tesla T4, 30 GB total VRAM)
  • CUDA: 12.5 | Compute Capability: SM 7.5

📦 Available Models

Status: Repository initialized. Models will be added as they are verified on Kaggle T4x2.

Planned Models (v1.2.0)

| Model | Size | Quantization | VRAM | Speed (tok/s) | Status |
|-------|------|--------------|------|---------------|--------|
| Gemma 3 1B Instruct | 1B | Q4_K_M | ~1.5 GB | ~80 | 🔄 Coming soon |
| Gemma 3 4B Instruct | 4B | Q4_K_M | ~3.5 GB | ~50 | 🔄 Coming soon |
| Llama 3.2 3B Instruct | 3B | Q4_K_M | ~3 GB | ~50 | 🔄 Coming soon |
| Qwen 2.5 1.5B Instruct | 1.5B | Q4_K_M | ~2 GB | ~70 | 🔄 Coming soon |
| Mistral 7B Instruct v0.3 | 7B | Q4_K_M | ~6 GB | ~25 | 🔄 Coming soon |
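A rough way to sanity-check the VRAM column: Q4_K_M averages on the order of 4.8 bits per weight, plus runtime overhead for the KV cache and CUDA buffers. A minimal sketch, where the helper name, the 4.8-bit figure, and the flat 1 GB overhead are illustrative assumptions (the table's figures were measured with a full serving context, so they run somewhat higher):

```python
def estimate_q4_k_m_vram_gb(n_params_billions: float, overhead_gb: float = 1.0) -> float:
    """Ballpark VRAM estimate for a Q4_K_M GGUF model.

    Assumptions: ~4.8 bits per weight on average for Q4_K_M (the exact
    figure varies per model) and a flat overhead allowance for KV cache
    and CUDA buffers. Not a measured value.
    """
    bits_per_weight = 4.8
    weight_gb = n_params_billions * 1e9 * bits_per_weight / 8 / 1024**3
    return weight_gb + overhead_gb

for size in (1, 3, 7):
    print(f"{size}B -> ~{estimate_q4_k_m_vram_gb(size):.1f} GB")
```

For the 1B model this lands near the table's ~1.5 GB; for the 7B model it gives a lower bound below the ~6 GB measured with a full context window.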

Model Selection Criteria

All models in this repository are:

  1. ✅ Tested on Kaggle dual T4 GPUs with llamatelemetry v1.2.0
  2. ✅ Verified to fit in 15 GB VRAM (single GPU) or 30 GB (split)
  3. ✅ Compatible with GenAI semconv (gen_ai.* attributes)
  4. ✅ Instrumented — TTFT, TPOT, token usage captured automatically
  5. ✅ Documented with performance benchmarks

🚀 Quick Start

Install llamatelemetry v1.2.0

```bash
pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v1.2.0
```

Verify CUDA (v1.2.0 requirement)

```python
import llamatelemetry
llamatelemetry.require_cuda()  # Raises RuntimeError if no GPU
```

Download and Run a Model

```python
import llamatelemetry
from llamatelemetry import ServerManager, ServerConfig
from llamatelemetry.llama import LlamaCppClient
from huggingface_hub import hf_hub_download

# Initialize SDK with GenAI metrics
llamatelemetry.init(
    service_name="kaggle-inference",
    otlp_endpoint="http://localhost:4317",
    enable_metrics=True,
    gpu_enrichment=True,
)

# Download model from this repo (once available)
model_path = hf_hub_download(
    repo_id="waqasm86/llamatelemetry-models",
    filename="gemma-3-4b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models"
)

# Start server on dual T4
config = ServerConfig(
    model_path=model_path,
    tensor_split=[0.5, 0.5],
    n_gpu_layers=-1,
    flash_attn=True,
)
server = ServerManager(config)
server.start()

# Instrumented inference — emits gen_ai.* spans + metrics
client = LlamaCppClient(base_url=server.url, strict_operation_names=True)
response = client.chat(
    messages=[{"role": "user", "content": "Explain CUDA tensor cores."}],
    max_tokens=512,
)
print(response.choices[0].message.content)

llamatelemetry.shutdown()
server.stop()
```

📊 GenAI Metrics Captured (v1.2.0)

Every inference call automatically records:

| Metric | Unit | Description |
|--------|------|-------------|
| gen_ai.client.token.usage | {token} | Input + output token count |
| gen_ai.client.operation.duration | s | Total request duration |
| gen_ai.server.time_to_first_token | s | TTFT latency |
| gen_ai.server.time_per_output_token | s | Per-token decode time |
| gen_ai.server.request.active | {request} | Concurrent in-flight requests |
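The two server-side latency metrics are derived from per-token arrival timestamps: TTFT is the wait for the first output token, and TPOT averages the gaps between subsequent tokens. A minimal sketch of that arithmetic (the helper below is hypothetical, not part of the SDK):

```python
def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Derive TTFT and TPOT from a request start time and per-token
    arrival timestamps, all in seconds.

    TTFT: time from request start to the first output token.
    TPOT: average gap between output tokens after the first
    (decode only), 0.0 if only one token was produced.
    """
    ttft = token_times[0] - request_start
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot

ttft, tpot = ttft_and_tpot(0.0, [0.20, 0.25, 0.30, 0.35])
print(ttft, tpot)  # TTFT ≈ 0.20 s, TPOT ≈ 0.05 s/token
```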

🎯 Dual GPU Strategies

Strategy 1: Inference on GPU 0, Analytics on GPU 1

```python
config = ServerConfig(
    model_path=model_path,
    tensor_split=[1.0, 0.0],  # 100% GPU 0
    n_gpu_layers=-1,
)
# GPU 1 free for RAPIDS / Graphistry / cuDF
```

Strategy 2: Model Split Across Both T4s (for larger models)

```python
config = ServerConfig(
    model_path=large_model_path,
    tensor_split=[0.5, 0.5],  # 50% each
    n_gpu_layers=-1,
)
```
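tensor_split values are proportions, which llama.cpp normalizes before distributing layers. A small sketch of how a split maps to per-GPU memory against the T4's ~15 GB budget (hypothetical helper; assumes weights divide exactly proportionally and ignores KV-cache placement):

```python
def split_vram(model_gb: float, tensor_split: list[float],
               gpu_capacity_gb: tuple[float, ...] = (15.0, 15.0)):
    """Estimate per-GPU weight memory for a given tensor_split.

    Fractions are normalized (so [1.0, 0.0] and [2, 0] behave the same),
    mirroring how llama.cpp treats the split as proportions. Returns the
    per-GPU estimate and whether each share fits its GPU's capacity.
    """
    total = sum(tensor_split)
    shares = [f / total for f in tensor_split]
    per_gpu = [model_gb * s for s in shares]
    fits = all(need <= cap for need, cap in zip(per_gpu, gpu_capacity_gb))
    return per_gpu, fits

print(split_vram(6.0, [0.5, 0.5]))  # ([3.0, 3.0], True)
print(split_vram(6.0, [1.0, 0.0]))  # ([6.0, 0.0], True) -- GPU 1 left free
```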

🔧 Benchmarking Models

```python
from llamatelemetry.bench import BenchmarkRunner, BenchmarkProfile

runner = BenchmarkRunner(client=client, profile=BenchmarkProfile.STANDARD)
results = runner.run(
    model_name="gemma-3-4b-it-Q4_K_M",
    prompts=[
        "Explain attention mechanisms.",
        "Write a Python function to sort a list.",
    ],
)
print(results.summary())
# Output: TTFT p50/p95, tokens/sec, prefill_ms, decode_ms
```
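The p50/p95 figures in the summary are plain empirical percentiles over the collected samples. A sketch with the standard library (hypothetical helper for illustration, not the BenchmarkRunner internals):

```python
import statistics

def latency_summary(ttft_samples_ms: list[float]) -> dict[str, float]:
    """p50/p95 over raw TTFT samples (milliseconds), using linearly
    interpolated empirical percentiles from the standard library."""
    qs = statistics.quantiles(ttft_samples_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94]}

samples = [110, 120, 125, 130, 140, 150, 160, 180, 220, 300]
print(latency_summary(samples))  # {'p50': 145.0, 'p95': 264.0}
```

p95 is the usual headline number for tail latency: a single slow request (the 300 ms sample here) pulls p95 far more than p50.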

🔗 Links

🔗 Model Sources

Models are sourced from reputable community providers.

All models are:

  • ✅ Publicly available under permissive licenses
  • ✅ Verified on llamatelemetry v1.2.0 + Kaggle T4x2
  • ✅ Credited to original authors

🆘 Getting Help

📄 License

This repository: MIT License. Individual models: see each model card for its specific license (Apache 2.0, MIT, Gemma license, etc.).


Maintained by: waqasm86
SDK Version: 1.2.0
Last Updated: 2026-02-20
Target Platform: Kaggle dual Tesla T4 (CUDA 12.5, SM 7.5)
Status: Active — models being added