---
license: mit
tags:
- llm
- gguf
- llama
- gemma
- mistral
- qwen
- inference
- opentelemetry
- observability
- kaggle
library_name: llamatelemetry
---
# llamatelemetry Models (v1.2.0)
Curated collection of GGUF models optimized for **llamatelemetry v1.2.0** on Kaggle dual Tesla T4
GPUs (2× 15 GB VRAM), using `gen_ai.*` OpenTelemetry semantic conventions.
## 🎯 About This Repository
This repository contains GGUF models tested and verified to work with:
- **llamatelemetry v1.2.0**: CUDA-first OpenTelemetry Python SDK for LLM inference observability
- **Platform**: Kaggle Notebooks (2× Tesla T4, 30 GB total VRAM)
- **CUDA**: 12.5 | **Compute Capability**: SM 7.5
## 📦 Available Models
> **Status**: Repository initialized. Models will be added as they are verified on Kaggle T4x2.
### Planned Models (v1.2.0)
| Model | Size | Quantization | VRAM | Speed (tok/s) | Status |
|-------|------|--------------|------|---------------|--------|
| Gemma 3 1B Instruct | 1B | Q4_K_M | ~1.5 GB | ~80 | 🔄 Coming soon |
| Gemma 3 4B Instruct | 4B | Q4_K_M | ~3.5 GB | ~50 | 🔄 Coming soon |
| Llama 3.2 3B Instruct | 3B | Q4_K_M | ~3 GB | ~50 | 🔄 Coming soon |
| Qwen 2.5 1.5B Instruct | 1.5B | Q4_K_M | ~2 GB | ~70 | 🔄 Coming soon |
| Mistral 7B Instruct v0.3 | 7B | Q4_K_M | ~6 GB | ~25 | 🔄 Coming soon |
### Model Selection Criteria
All models in this repository are:
1. ✅ **Tested** on Kaggle dual T4 GPUs with llamatelemetry v1.2.0
2. ✅ **Verified** to fit in 15 GB VRAM (single GPU) or 30 GB (split across both)
3. ✅ **Compatible** with GenAI semconv (`gen_ai.*` attributes)
4. ✅ **Instrumented**: TTFT, TPOT, and token usage captured automatically
5. ✅ **Documented** with performance benchmarks
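The VRAM column in the table above can be sanity-checked with back-of-the-envelope arithmetic. The helper below is an illustrative sketch, not part of the llamatelemetry API: it assumes Q4_K_M averages roughly 4.5 bits per weight and adds a flat ~1 GB for KV cache and CUDA context.

```python
# Illustrative only: rough VRAM estimate for a Q4_K_M GGUF model.
# Assumes ~4.5 bits per weight for Q4_K_M plus ~1 GB of headroom for
# the KV cache and CUDA context; real usage varies with context length.
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.5,
                     overhead_gb: float = 1.0) -> float:
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return round(weights_gb + overhead_gb, 1)

def fits_single_t4(params_billion: float, budget_gb: float = 15.0) -> bool:
    """True if the estimate fits a single 15 GB T4."""
    return estimate_vram_gb(params_billion) <= budget_gb

print(estimate_vram_gb(4.0))  # same ballpark as the ~3.5 GB quoted above
print(fits_single_t4(7.0))    # Mistral 7B Q4_K_M fits on one T4
```

By this estimate, every model in the planned table fits comfortably on a single T4.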
## 🚀 Quick Start
### Install llamatelemetry v1.2.0
```bash
pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v1.2.0
```
### Verify CUDA (v1.2.0 requirement)
```python
import llamatelemetry
llamatelemetry.require_cuda() # Raises RuntimeError if no GPU
```
### Download and Run a Model
```python
import llamatelemetry
from llamatelemetry import ServerManager, ServerConfig
from llamatelemetry.llama import LlamaCppClient
from huggingface_hub import hf_hub_download

# Initialize SDK with GenAI metrics
llamatelemetry.init(
    service_name="kaggle-inference",
    otlp_endpoint="http://localhost:4317",
    enable_metrics=True,
    gpu_enrichment=True,
)

# Download model from this repo (once available)
model_path = hf_hub_download(
    repo_id="waqasm86/llamatelemetry-models",
    filename="gemma-3-4b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)

# Start server on dual T4
config = ServerConfig(
    model_path=model_path,
    tensor_split=[0.5, 0.5],
    n_gpu_layers=-1,
    flash_attn=True,
)
server = ServerManager(config)
server.start()

# Instrumented inference - emits gen_ai.* spans + metrics
client = LlamaCppClient(base_url=server.url, strict_operation_names=True)
response = client.chat(
    messages=[{"role": "user", "content": "Explain CUDA tensor cores."}],
    max_tokens=512,
)
print(response.choices[0].message.content)

llamatelemetry.shutdown()
server.stop()
```
## 📊 GenAI Metrics Captured (v1.2.0)
Every inference call automatically records:
| Metric | Unit | Description |
|--------|------|-------------|
| `gen_ai.client.token.usage` | `{token}` | Input + output token count |
| `gen_ai.client.operation.duration` | `s` | Total request duration |
| `gen_ai.server.time_to_first_token` | `s` | TTFT latency |
| `gen_ai.server.time_per_output_token` | `s` | Per-token decode time |
| `gen_ai.server.request.active` | `{request}` | Concurrent in-flight requests |
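As a sanity check on these units: TPOT is the reciprocal of decode throughput, and end-to-end latency decomposes into TTFT plus one TPOT per remaining output token. The function names below are illustrative, not llamatelemetry APIs.

```python
# Illustrative arithmetic relating the GenAI metrics above.
def decode_throughput(tpot_s: float) -> float:
    """Decode-phase tokens/sec is the reciprocal of TPOT."""
    return 1.0 / tpot_s

def total_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """TTFT covers the first token; each later token costs one TPOT."""
    return ttft_s + tpot_s * (output_tokens - 1)

print(decode_throughput(0.02))         # 50 tok/s at 20 ms per token
print(total_latency(0.35, 0.02, 256))  # ~5.45 s for 256 output tokens
```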
## 🎯 Dual GPU Strategies
### Strategy 1: Inference on GPU 0, Analytics on GPU 1
```python
config = ServerConfig(
    model_path=model_path,
    tensor_split=[1.0, 0.0],  # 100% of the model on GPU 0
    n_gpu_layers=-1,
)
# GPU 1 stays free for RAPIDS / Graphistry / cuDF
```
### Strategy 2: Model Split Across Both T4s (for larger models)
```python
config = ServerConfig(
    model_path=large_model_path,
    tensor_split=[0.5, 0.5],  # 50% on each GPU
    n_gpu_layers=-1,
)
```
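A small helper can encode the choice between the two strategies: keep the model on GPU 0 whenever its estimated VRAM fits one 15 GB T4 with some headroom, otherwise split it evenly. This is a hypothetical sketch, not a llamatelemetry API.

```python
# Hypothetical helper (not a llamatelemetry API): pick a tensor_split
# for 2x T4 from a rough VRAM estimate. Prefers Strategy 1 so GPU 1
# stays free for analytics whenever the model fits one card.
def choose_tensor_split(model_vram_gb: float, per_gpu_gb: float = 15.0,
                        headroom_gb: float = 1.0) -> list:
    if model_vram_gb <= per_gpu_gb - headroom_gb:
        return [1.0, 0.0]  # Strategy 1: everything on GPU 0
    return [0.5, 0.5]      # Strategy 2: split across both T4s

print(choose_tensor_split(6.0))   # Mistral 7B Q4_K_M -> [1.0, 0.0]
print(choose_tensor_split(22.0))  # larger model -> [0.5, 0.5]
```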
## 🔧 Benchmarking Models
```python
from llamatelemetry.bench import BenchmarkRunner, BenchmarkProfile

runner = BenchmarkRunner(client=client, profile=BenchmarkProfile.STANDARD)
results = runner.run(
    model_name="gemma-3-4b-it-Q4_K_M",
    prompts=[
        "Explain attention mechanisms.",
        "Write a Python function to sort a list.",
    ],
)
print(results.summary())  # reports TTFT p50/p95, tokens/sec, prefill_ms, decode_ms
```
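The p50/p95 figures in the benchmark summary are plain percentiles over per-request samples. The sketch below shows the underlying math with only the standard library; `BenchmarkRunner`'s internals may differ.

```python
import statistics

# Compute p50/p95 over per-request TTFT samples (milliseconds).
def p50_p95(samples_ms):
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]  # 50th and 95th percentile cut points

ttft_ms = [180, 195, 210, 205, 650, 190, 200, 185, 198, 192]
p50, p95 = p50_p95(ttft_ms)
print(f"TTFT p50={p50:.1f} ms, p95={p95:.1f} ms")
```

Note how p95 surfaces the 650 ms outlier while p50 stays near the typical latency, which is why both are reported.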
## 🔗 Links
- **GitHub Repository**: https://github.com/llamatelemetry/llamatelemetry
- **GitHub Releases**: https://github.com/llamatelemetry/llamatelemetry/releases/tag/v1.2.0
- **Binaries Repository**: https://huggingface.co/waqasm86/llamatelemetry-binaries
- **Kaggle Guide**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/KAGGLE_GUIDE.md
- **Integration Guide**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/INTEGRATION_GUIDE.md
- **API Reference**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/API_REFERENCE.md
## 🔗 Model Sources
Models are sourced from reputable community providers:
- [Unsloth GGUF Models](https://huggingface.co/unsloth): optimized GGUF conversions
- [Bartowski GGUF Models](https://huggingface.co/bartowski): high-quality quants
- [LM Studio Community](https://huggingface.co/lmstudio-community): curated GGUF models
All models are:
- ✅ Publicly available under permissive licenses
- ✅ Verified on llamatelemetry v1.2.0 + Kaggle T4x2
- ✅ Credited to original authors
## 🆘 Getting Help
- **GitHub Issues**: https://github.com/llamatelemetry/llamatelemetry/issues
- **Discussions**: https://github.com/llamatelemetry/llamatelemetry/discussions
## 📄 License
This repository: MIT License.
Individual models: see each model card for its specific license (Apache 2.0, MIT, Gemma License, etc.).
---
**Maintained by**: [waqasm86](https://huggingface.co/waqasm86)
**SDK Version**: 1.2.0
**Last Updated**: 2026-02-20
**Target Platform**: Kaggle dual Tesla T4 (CUDA 12.5, SM 7.5)
**Status**: Active; models are being added