---
license: mit
tags:
  - llm
  - gguf
  - llama
  - gemma
  - mistral
  - qwen
  - inference
  - opentelemetry
  - observability
  - kaggle
library_name: llamatelemetry
---

# llamatelemetry Models (v1.2.0)

Curated collection of GGUF models optimized for **llamatelemetry v1.2.0** on Kaggle dual Tesla T4 GPUs (2× 15 GB VRAM), using `gen_ai.*` OpenTelemetry semantic conventions.

## 🎯 About This Repository

This repository contains GGUF models tested and verified to work with:

- **llamatelemetry v1.2.0** — CUDA-first OpenTelemetry Python SDK for LLM inference observability
- **Platform**: Kaggle Notebooks (2× Tesla T4, 30 GB total VRAM)
- **CUDA**: 12.5 | **Compute Capability**: SM 7.5

## 📦 Available Models

> **Status**: Repository initialized. Models will be added as they are verified on Kaggle T4x2.

### Planned Models (v1.2.0)

| Model | Size | Quantization | VRAM | Speed (tok/s) | Status |
|-------|------|--------------|------|---------------|--------|
| Gemma 3 1B Instruct | 1B | Q4_K_M | ~1.5 GB | ~80 | 🔄 Coming soon |
| Gemma 3 4B Instruct | 4B | Q4_K_M | ~3.5 GB | ~50 | 🔄 Coming soon |
| Llama 3.2 3B Instruct | 3B | Q4_K_M | ~3 GB | ~50 | 🔄 Coming soon |
| Qwen 2.5 1.5B Instruct | 1.5B | Q4_K_M | ~2 GB | ~70 | 🔄 Coming soon |
| Mistral 7B Instruct v0.3 | 7B | Q4_K_M | ~6 GB | ~25 | 🔄 Coming soon |

### Model Selection Criteria

All models in this repository are:

1. ✅ **Tested** on Kaggle dual T4 GPUs with llamatelemetry v1.2.0
2. ✅ **Verified** to fit in 15 GB VRAM (single GPU) or 30 GB (split)
3. ✅ **Compatible** with GenAI semconv (`gen_ai.*` attributes)
4. ✅ **Instrumented** — TTFT, TPOT, and token usage captured automatically
5. ✅ **Documented** with performance benchmarks

## 🚀 Quick Start

### Install llamatelemetry v1.2.0

```bash
pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v1.2.0
```

### Verify CUDA (v1.2.0 requirement)

```python
import llamatelemetry

llamatelemetry.require_cuda()  # Raises RuntimeError if no GPU
```

### Download and Run a Model

```python
import llamatelemetry
from llamatelemetry import ServerManager, ServerConfig
from llamatelemetry.llama import LlamaCppClient
from huggingface_hub import hf_hub_download

# Initialize the SDK with GenAI metrics
llamatelemetry.init(
    service_name="kaggle-inference",
    otlp_endpoint="http://localhost:4317",
    enable_metrics=True,
    gpu_enrichment=True,
)

# Download a model from this repo (once available)
model_path = hf_hub_download(
    repo_id="waqasm86/llamatelemetry-models",
    filename="gemma-3-4b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)

# Start the server on dual T4s
config = ServerConfig(
    model_path=model_path,
    tensor_split=[0.5, 0.5],
    n_gpu_layers=-1,
    flash_attn=True,
)
server = ServerManager(config)
server.start()

# Instrumented inference — emits gen_ai.* spans + metrics
client = LlamaCppClient(base_url=server.url, strict_operation_names=True)
response = client.chat(
    messages=[{"role": "user", "content": "Explain CUDA tensor cores."}],
    max_tokens=512,
)
print(response.choices[0].message.content)

llamatelemetry.shutdown()
server.stop()
```

## 📊 GenAI Metrics Captured (v1.2.0)

Every inference call automatically records:

| Metric | Unit | Description |
|--------|------|-------------|
| `gen_ai.client.token.usage` | `{token}` | Input + output token count |
| `gen_ai.client.operation.duration` | `s` | Total request duration |
| `gen_ai.server.time_to_first_token` | `s` | TTFT latency |
| `gen_ai.server.time_per_output_token` | `s` | Per-token decode time |
| `gen_ai.server.request.active` | `{request}` | Concurrent in-flight requests |

## 🎯 Dual GPU Strategies

### Strategy 1: Inference on GPU 0, Analytics on GPU 1

```python
config = ServerConfig(
    model_path=model_path,
    tensor_split=[1.0, 0.0],  # 100% of the model on GPU 0
    n_gpu_layers=-1,
)
# GPU 1 stays free for RAPIDS / Graphistry / cuDF
```

### Strategy 2: Model Split Across Both T4s (for larger models)

```python
config = ServerConfig(
    model_path=large_model_path,
    tensor_split=[0.5, 0.5],  # 50% of the model on each GPU
    n_gpu_layers=-1,
)
```

## 🔧 Benchmarking Models

```python
from llamatelemetry.bench import BenchmarkRunner, BenchmarkProfile

runner = BenchmarkRunner(client=client, profile=BenchmarkProfile.STANDARD)
results = runner.run(
    model_name="gemma-3-4b-it-Q4_K_M",
    prompts=[
        "Explain attention mechanisms.",
        "Write a Python function to sort a list.",
    ],
)
print(results.summary())
# Output: TTFT p50/p95, tokens/sec, prefill_ms, decode_ms
```

## 🔗 Links

- **GitHub Repository**: https://github.com/llamatelemetry/llamatelemetry
- **GitHub Releases**: https://github.com/llamatelemetry/llamatelemetry/releases/tag/v1.2.0
- **Binaries Repository**: https://huggingface.co/waqasm86/llamatelemetry-binaries
- **Kaggle Guide**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/KAGGLE_GUIDE.md
- **Integration Guide**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/INTEGRATION_GUIDE.md
- **API Reference**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/API_REFERENCE.md

## 🔗 Model Sources

Models are sourced from reputable community providers:

- [Unsloth GGUF Models](https://huggingface.co/unsloth) — Optimized GGUF conversions
- [Bartowski GGUF Models](https://huggingface.co/bartowski) — High-quality quants
- [LM Studio Community](https://huggingface.co/lmstudio-community) — Curated GGUF models

All models are:

- ✅ Publicly available under permissive licenses
- ✅ Verified on llamatelemetry v1.2.0 + Kaggle T4x2
- ✅ Credited to original authors

## 🆘 Getting Help

- **GitHub Issues**: https://github.com/llamatelemetry/llamatelemetry/issues
- **Discussions**: https://github.com/llamatelemetry/llamatelemetry/discussions

## 📄 License

This repository: MIT License.

Individual models: see each model card for its specific license (Apache 2.0, MIT, Gemma License, etc.).

---

**Maintained by**: [waqasm86](https://huggingface.co/waqasm86)
**SDK Version**: 1.2.0
**Last Updated**: 2026-02-20
**Target Platform**: Kaggle dual Tesla T4 (CUDA 12.5, SM 7.5)
**Status**: Active — models being added
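
## 📐 Appendix: Rough VRAM Fit Check

The Model Selection Criteria above require every model to fit in 15 GB of VRAM (single T4) or 30 GB (split). As a minimal back-of-envelope sketch — not part of the llamatelemetry API — a Q4_K_M GGUF needs roughly its file size in VRAM for weights, plus headroom for the KV cache and CUDA context. The 1.25 GB overhead below is an illustrative placeholder, not a measured figure; the real KV-cache footprint depends on context length, layer count, and KV head dimensions.

```python
def fits_in_vram(model_file_gb: float, vram_gb: float = 15.0,
                 overhead_gb: float = 1.25) -> bool:
    """Rough single-GPU fit check: weights plus fixed KV-cache/CUDA headroom.

    overhead_gb = 1.25 is an assumed placeholder for illustration only;
    measure the real footprint with nvidia-smi on your own model/context.
    """
    return model_file_gb + overhead_gb <= vram_gb

# VRAM figures taken from the planned-models table above
print(fits_in_vram(6.0))                 # Mistral 7B Q4_K_M (~6 GB) on one 15 GB T4 → True
print(fits_in_vram(3.5, vram_gb=30.0))   # Gemma 3 4B split across both T4s → True
```

For borderline cases, prefer checking actual usage with `nvidia-smi` after the server loads the model rather than trusting the estimate.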
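
## 📐 Appendix: Rough Latency Estimate from tok/s

The Speed column in the planned-models table can be converted into a rough end-to-end latency estimate: total time ≈ TTFT + output tokens ÷ decode speed. A minimal sketch follows; the 0.5 s TTFT default is an assumed placeholder, not a benchmark result — the real value comes from the `gen_ai.server.time_to_first_token` metric.

```python
def estimated_latency_s(output_tokens: int, tok_per_s: float,
                        ttft_s: float = 0.5) -> float:
    """Rough wall-clock estimate: time-to-first-token plus decode time.

    ttft_s = 0.5 is an assumed placeholder; substitute the measured
    gen_ai.server.time_to_first_token value for your setup.
    """
    return ttft_s + output_tokens / tok_per_s

# 512 output tokens at ~25 tok/s (Mistral 7B row): roughly 21 seconds
print(round(estimated_latency_s(512, 25.0), 1))
```

This is only a planning aid; the benchmarking section's `results.summary()` reports the measured TTFT and tokens/sec for a given model.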