---
license: mit
tags:
- llm
- gguf
- llama
- gemma
- mistral
- qwen
- inference
- opentelemetry
- observability
- kaggle
library_name: llamatelemetry
---
# llamatelemetry Models (v1.2.0)
Curated collection of GGUF models optimized for **llamatelemetry v1.2.0** on Kaggle dual Tesla T4
GPUs (2× 15 GB VRAM), using `gen_ai.*` OpenTelemetry semantic conventions.
## 🎯 About This Repository
This repository contains GGUF models tested and verified to work with:
- **llamatelemetry v1.2.0**: CUDA-first OpenTelemetry Python SDK for LLM inference observability
- **Platform**: Kaggle Notebooks (2× Tesla T4, 30 GB total VRAM)
- **CUDA**: 12.5 | **Compute Capability**: SM 7.5
## 📦 Available Models
> **Status**: Repository initialized. Models will be added as they are verified on Kaggle T4x2.
### Planned Models (v1.2.0)
| Model | Size | Quantization | VRAM | Speed (tok/s) | Status |
|-------|------|--------------|------|---------------|--------|
| Gemma 3 1B Instruct | 1B | Q4_K_M | ~1.5 GB | ~80 | 🔄 Coming soon |
| Gemma 3 4B Instruct | 4B | Q4_K_M | ~3.5 GB | ~50 | 🔄 Coming soon |
| Llama 3.2 3B Instruct | 3B | Q4_K_M | ~3 GB | ~50 | 🔄 Coming soon |
| Qwen 2.5 1.5B Instruct | 1.5B | Q4_K_M | ~2 GB | ~70 | 🔄 Coming soon |
| Mistral 7B Instruct v0.3 | 7B | Q4_K_M | ~6 GB | ~25 | 🔄 Coming soon |
### Model Selection Criteria
All models in this repository are:
1. ✅ **Tested** on Kaggle dual T4 GPUs with llamatelemetry v1.2.0
2. ✅ **Verified** to fit in 15 GB VRAM (single GPU) or 30 GB (split across both)
3. ✅ **Compatible** with GenAI semconv (`gen_ai.*` attributes)
4. ✅ **Instrumented**: TTFT, TPOT, and token usage captured automatically
5. ✅ **Documented** with performance benchmarks
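The VRAM column in the table above can be sanity-checked with back-of-the-envelope arithmetic. The helper below is an illustrative sketch, not part of the llamatelemetry API: it assumes Q4_K_M averages roughly 4.5 bits per weight and adds a flat ~1 GB for KV cache and CUDA context.

```python
# Illustrative only: rough VRAM estimate for a Q4_K_M GGUF model.
# Assumes ~4.5 bits per weight for Q4_K_M plus ~1 GB of headroom for
# the KV cache and CUDA context; real usage varies with context length.
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.5,
                     overhead_gb: float = 1.0) -> float:
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return round(weights_gb + overhead_gb, 1)

def fits_single_t4(params_billion: float, budget_gb: float = 15.0) -> bool:
    """True if the estimate fits a single 15 GB T4."""
    return estimate_vram_gb(params_billion) <= budget_gb

print(estimate_vram_gb(4.0))  # same ballpark as the ~3.5 GB quoted above
print(fits_single_t4(7.0))    # Mistral 7B Q4_K_M fits on one T4
```

By this estimate, every model in the planned table fits comfortably on a single T4.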
## 🚀 Quick Start
### Install llamatelemetry v1.2.0
```bash
pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v1.2.0
```
### Verify CUDA (v1.2.0 requirement)
```python
import llamatelemetry
llamatelemetry.require_cuda() # Raises RuntimeError if no GPU
```
### Download and Run a Model
```python
import llamatelemetry
from llamatelemetry import ServerManager, ServerConfig
from llamatelemetry.llama import LlamaCppClient
from huggingface_hub import hf_hub_download

# Initialize SDK with GenAI metrics
llamatelemetry.init(
    service_name="kaggle-inference",
    otlp_endpoint="http://localhost:4317",
    enable_metrics=True,
    gpu_enrichment=True,
)

# Download model from this repo (once available)
model_path = hf_hub_download(
    repo_id="waqasm86/llamatelemetry-models",
    filename="gemma-3-4b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)

# Start server on dual T4
config = ServerConfig(
    model_path=model_path,
    tensor_split=[0.5, 0.5],
    n_gpu_layers=-1,
    flash_attn=True,
)
server = ServerManager(config)
server.start()

# Instrumented inference - emits gen_ai.* spans + metrics
client = LlamaCppClient(base_url=server.url, strict_operation_names=True)
response = client.chat(
    messages=[{"role": "user", "content": "Explain CUDA tensor cores."}],
    max_tokens=512,
)
print(response.choices[0].message.content)

llamatelemetry.shutdown()
server.stop()
```
## 📊 GenAI Metrics Captured (v1.2.0)
Every inference call automatically records:
| Metric | Unit | Description |
|--------|------|-------------|
| `gen_ai.client.token.usage` | `{token}` | Input + output token count |
| `gen_ai.client.operation.duration` | `s` | Total request duration |
| `gen_ai.server.time_to_first_token` | `s` | TTFT latency |
| `gen_ai.server.time_per_output_token` | `s` | Per-token decode time |
| `gen_ai.server.request.active` | `{request}` | Concurrent in-flight requests |
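As a sanity check on these units: TPOT is the reciprocal of decode throughput, and end-to-end latency decomposes into TTFT plus one TPOT per remaining output token. The function names below are illustrative, not llamatelemetry APIs.

```python
# Illustrative arithmetic relating the GenAI metrics above.
def decode_throughput(tpot_s: float) -> float:
    """Decode-phase tokens/sec is the reciprocal of TPOT."""
    return 1.0 / tpot_s

def total_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """TTFT covers the first token; each later token costs one TPOT."""
    return ttft_s + tpot_s * (output_tokens - 1)

print(decode_throughput(0.02))         # 50 tok/s at 20 ms per token
print(total_latency(0.35, 0.02, 256))  # ~5.45 s for 256 output tokens
```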
## 🎯 Dual GPU Strategies
### Strategy 1: Inference on GPU 0, Analytics on GPU 1
```python
config = ServerConfig(
    model_path=model_path,
    tensor_split=[1.0, 0.0],  # 100% of the model on GPU 0
    n_gpu_layers=-1,
)
# GPU 1 stays free for RAPIDS / Graphistry / cuDF
```
### Strategy 2: Model Split Across Both T4s (for larger models)
```python
config = ServerConfig(
    model_path=large_model_path,
    tensor_split=[0.5, 0.5],  # 50% on each GPU
    n_gpu_layers=-1,
)
```
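A small helper can encode the choice between the two strategies: keep the model on GPU 0 whenever its estimated VRAM fits one 15 GB T4 with some headroom, otherwise split it evenly. This is a hypothetical sketch, not a llamatelemetry API.

```python
# Hypothetical helper (not a llamatelemetry API): pick a tensor_split
# for 2x T4 from a rough VRAM estimate. Prefers Strategy 1 so GPU 1
# stays free for analytics whenever the model fits one card.
def choose_tensor_split(model_vram_gb: float, per_gpu_gb: float = 15.0,
                        headroom_gb: float = 1.0) -> list:
    if model_vram_gb <= per_gpu_gb - headroom_gb:
        return [1.0, 0.0]  # Strategy 1: everything on GPU 0
    return [0.5, 0.5]      # Strategy 2: split across both T4s

print(choose_tensor_split(6.0))   # Mistral 7B Q4_K_M -> [1.0, 0.0]
print(choose_tensor_split(22.0))  # larger model -> [0.5, 0.5]
```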
## 🔧 Benchmarking Models
```python
from llamatelemetry.bench import BenchmarkRunner, BenchmarkProfile

runner = BenchmarkRunner(client=client, profile=BenchmarkProfile.STANDARD)
results = runner.run(
    model_name="gemma-3-4b-it-Q4_K_M",
    prompts=[
        "Explain attention mechanisms.",
        "Write a Python function to sort a list.",
    ],
)
print(results.summary())  # reports TTFT p50/p95, tokens/sec, prefill_ms, decode_ms
```
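The p50/p95 figures in the benchmark summary are plain percentiles over per-request samples. The sketch below shows the underlying math with only the standard library; `BenchmarkRunner`'s internals may differ.

```python
import statistics

# Compute p50/p95 over per-request TTFT samples (milliseconds).
def p50_p95(samples_ms):
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]  # 50th and 95th percentile cut points

ttft_ms = [180, 195, 210, 205, 650, 190, 200, 185, 198, 192]
p50, p95 = p50_p95(ttft_ms)
print(f"TTFT p50={p50:.1f} ms, p95={p95:.1f} ms")
```

Note how p95 surfaces the 650 ms outlier while p50 stays near the typical latency, which is why both are reported.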
## 🔗 Links
- **GitHub Repository**: https://github.com/llamatelemetry/llamatelemetry
- **GitHub Releases**: https://github.com/llamatelemetry/llamatelemetry/releases/tag/v1.2.0
- **Binaries Repository**: https://huggingface.co/waqasm86/llamatelemetry-binaries
- **Kaggle Guide**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/KAGGLE_GUIDE.md
- **Integration Guide**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/INTEGRATION_GUIDE.md
- **API Reference**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/API_REFERENCE.md
## 🔗 Model Sources
Models are sourced from reputable community providers:
- [Unsloth GGUF Models](https://huggingface.co/unsloth): optimized GGUF conversions
- [Bartowski GGUF Models](https://huggingface.co/bartowski): high-quality quants
- [LM Studio Community](https://huggingface.co/lmstudio-community): curated GGUF models
All models are:
- ✅ Publicly available under permissive licenses
- ✅ Verified on llamatelemetry v1.2.0 + Kaggle T4x2
- ✅ Credited to original authors
## 🆘 Getting Help
- **GitHub Issues**: https://github.com/llamatelemetry/llamatelemetry/issues
- **Discussions**: https://github.com/llamatelemetry/llamatelemetry/discussions
## 📄 License
This repository: MIT License.
Individual models: see each model card for its specific license (Apache 2.0, MIT, Gemma License, etc.).
---
**Maintained by**: [waqasm86](https://huggingface.co/waqasm86)
**SDK Version**: 1.2.0
**Last Updated**: 2026-02-20
**Target Platform**: Kaggle dual Tesla T4 (CUDA 12.5, SM 7.5)
**Status**: Active; models are being added