@@ -14,158 +14,188 @@ tags:
 library_name: llamatelemetry
 ---
-# llamatelemetry Models
-Curated collection of GGUF models optimized for **llamatelemetry** on Kaggle dual Tesla T4 GPUs (2× 15GB VRAM).
 ## 🎯 About This Repository
 This repository contains GGUF models tested and verified to work with:
-- **llamatelemetry v1.0.0** - CUDA-first OpenTelemetry Python SDK for LLM inference observability
-- **Platform**: Kaggle Notebooks (2× Tesla T4, 30GB total VRAM)
-- **CUDA**: 12.5
 ## 📦 Available Models
-> **Status**: Repository created, models coming soon!
-### Planned Models (v1.0.0)
 | Model | Size | Quantization | VRAM | Speed (tok/s) | Status |
 |-------|------|--------------|------|---------------|--------|
-| Gemma 3 1B Instruct | 1B | Q4_K_M | ~1.5GB | ~80 | 🔄 Coming soon |
-| Gemma 3 3B Instruct | 3B | Q4_K_M | ~3GB | ~50 | 🔄 Coming soon |
-| Llama 3.2 3B Instruct | 3B | Q4_K_M | ~3GB | ~50 | 🔄 Coming soon |
-| Qwen 2.5 1.5B Instruct | 1.5B | Q4_K_M | ~2GB | ~70 | 🔄 Coming soon |
-| Mistral 7B Instruct | 7B | Q4_K_M | ~6GB | ~25 | 🔄 Coming soon |
 ### Model Selection Criteria
-Models in this repository are:
-1. ✅ **Tested** on Kaggle dual T4 GPUs
-2. ✅ **Verified** to fit in 15GB VRAM (single GPU)
-3. ✅ **Compatible** with llamatelemetry's observability features
-4. ✅ **Optimized** for GGUF + CUDA acceleration
 5. ✅ **Documented** with performance benchmarks
 ## 🚀 Quick Start
-### Install llamatelemetry
 ```bash
-# On Kaggle with GPU T4 × 2
-pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v1.0.0
 ```
 ### Download and Run a Model
 ```python
 import llamatelemetry
 from huggingface_hub import hf_hub_download
-# Initialize SDK
-llamatelemetry.init(service_name="my-llm-app")
-# Download model (example - not yet available)
 model_path = hf_hub_download(
     repo_id="waqasm86/llamatelemetry-models",
-    filename="gemma-3-1b-it-Q4_K_M.gguf",
     local_dir="/kaggle/working/models"
 )
-# Start server with model on GPU 0
-from llamatelemetry.llama import ServerManager
-server = ServerManager()
-server.start_server(model_path=model_path, gpu_layers=99)
-# Run inference with telemetry
-from llamatelemetry.llama import LlamaCppClient
-client = LlamaCppClient()
-result = client.completion("Explain quantum computing in simple terms", max_tokens=150)
-print(result)
-# Cleanup
 llamatelemetry.shutdown()
 ```
-## 📊 Recommended Models by Use Case
-### For Fast Prototyping
-- **Gemma 3 1B** - Fastest inference, good for testing
-- **Qwen 2.5 1.5B** - Balance of speed and quality
-### For Production Quality
-- **Gemma 3 3B** - High quality, reasonable speed
-- **Llama 3.2 3B** - Strong reasoning capabilities
-### For Complex Tasks
-- **Mistral 7B** - Best quality, slower but fits in single T4
-## 🔗 Model Sources
-Models are sourced from reputable providers:
-- [Unsloth GGUF Models](https://huggingface.co/unsloth) - Optimized GGUF conversions
-- [TheBloke GGUF Models](https://huggingface.co/TheBloke) - Community standard
-- [Bartowski GGUF Models](https://huggingface.co/bartowski) - High-quality quants
-All models are:
-- ✅ Publicly available under permissive licenses
-- ✅ Re-hosted here for convenience and verification
-- ✅ Credited to original authors
 ## 🎯 Dual GPU Strategies
-llamatelemetry supports multi-GPU workloads:
-### Strategy 1: LLM on GPU 0, Observability on GPU 1
 ```python
-from llamatelemetry.llama import ServerManager
-# Start llama-server on GPU 0 only
-server = ServerManager()
-server.start_server(
     model_path=model_path,
-    gpu_layers=99,
-    tensor_split="1.0,0.0",  # 100% GPU 0, 0% GPU 1
-    flash_attn=1,
 )
-# GPU 1 is now free for RAPIDS/Graphistry visualization
 ```
-### Strategy 2: Model Sharding Across Both GPUs
 ```python
-# Split large model across both T4s
-server.start_server(
     model_path=large_model_path,
-    gpu_layers=99,
-    tensor_split="0.5,0.5",  # 50% GPU 0, 50% GPU 1
 )
 ```
-## 📚 Documentation & Links
-- **GitHub**: https://github.com/llamatelemetry/llamatelemetry
-- **Installation Guide**: [KAGGLE_INSTALL_GUIDE.md](https://github.com/llamatelemetry/llamatelemetry/blob/main/KAGGLE_INSTALL_GUIDE.md)
-- **Binaries**: https://huggingface.co/waqasm86/llamatelemetry-binaries
-- **Tutorials**: [notebooks/](https://github.com/llamatelemetry/llamatelemetry/tree/main/notebooks)
 ## 🆘 Getting Help
 - **GitHub Issues**: https://github.com/llamatelemetry/llamatelemetry/issues
-- **Documentation**: https://llamatelemetry.github.io (planned)
 ## 📄 License
-This repository: MIT License
-Individual models: See model cards for specific licenses (Apache 2.0, MIT, Gemma License, etc.)
 ---
-**Maintained by**: [waqasm86](https://huggingface.co/waqasm86)
-**Status**: Repository initialized, models coming soon
-**Target Platform**: Kaggle dual Tesla T4 (CUDA 12.5)
-**Last Updated**: 2026-02-16

 library_name: llamatelemetry
 ---
+# llamatelemetry Models (v1.2.0)
+Curated collection of GGUF models tested with **llamatelemetry v1.2.0** on Kaggle dual Tesla T4
+GPUs (2× 15 GB VRAM). The SDK reports inference telemetry using the `gen_ai.*` OpenTelemetry semantic conventions.
 ## 🎯 About This Repository
 This repository contains GGUF models tested and verified to work with:
+- **llamatelemetry v1.2.0** — CUDA-first OpenTelemetry Python SDK for LLM inference observability
+- **Platform**: Kaggle Notebooks (2× Tesla T4, 30 GB total VRAM)
+- **CUDA**: 12.5 | **Compute Capability**: SM 7.5
 ## 📦 Available Models
+> **Status**: Repository initialized. Models will be added as they are verified on Kaggle T4x2.
+### Planned Models (v1.2.0)
 | Model | Size | Quantization | VRAM | Speed (tok/s) | Status |
 |-------|------|--------------|------|---------------|--------|
+| Gemma 3 1B Instruct | 1B | Q4_K_M | ~1.5 GB | ~80 | 🔄 Coming soon |
+| Gemma 3 4B Instruct | 4B | Q4_K_M | ~3.5 GB | ~50 | 🔄 Coming soon |
+| Llama 3.2 3B Instruct | 3B | Q4_K_M | ~3 GB | ~50 | 🔄 Coming soon |
+| Qwen 2.5 1.5B Instruct | 1.5B | Q4_K_M | ~2 GB | ~70 | 🔄 Coming soon |
+| Mistral 7B Instruct v0.3 | 7B | Q4_K_M | ~6 GB | ~25 | 🔄 Coming soon |
 ### Model Selection Criteria
+Models are added to this repository only once they are:
+1. ✅ **Tested** on Kaggle dual T4 GPUs with llamatelemetry v1.2.0
+2. ✅ **Verified** to fit in 15 GB VRAM (single GPU) or 30 GB (split)
+3. ✅ **Compatible** with GenAI semconv (`gen_ai.*` attributes)
+4. ✅ **Instrumented** — TTFT, TPOT, token usage captured automatically
 5. ✅ **Documented** with performance benchmarks
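The VRAM criterion above can be read as a simple placement rule. The following is a minimal sketch, not part of llamatelemetry: `choose_placement` and the 1 GB headroom are hypothetical, and the VRAM figures are the approximate estimates from the planned-models table.

```python
# Hypothetical helper for illustration; this is not a llamatelemetry API.
T4_VRAM_GB = 15.0   # per-GPU VRAM on Kaggle, as stated in this README
HEADROOM_GB = 1.0   # assumed reserve for the CUDA context and KV cache growth


def choose_placement(model_vram_gb: float, num_gpus: int = 2) -> str:
    """Return 'single', 'split', or 'too_large' for an estimated VRAM need."""
    usable_single = T4_VRAM_GB - HEADROOM_GB
    usable_total = usable_single * num_gpus
    if model_vram_gb <= usable_single:
        return "single"
    if model_vram_gb <= usable_total:
        return "split"
    return "too_large"


# Approximate VRAM estimates (GB) copied from the planned-models table.
planned = {
    "Gemma 3 1B Instruct": 1.5,
    "Gemma 3 4B Instruct": 3.5,
    "Llama 3.2 3B Instruct": 3.0,
    "Qwen 2.5 1.5B Instruct": 2.0,
    "Mistral 7B Instruct v0.3": 6.0,
}
for name, vram_gb in planned.items():
    print(f"{name}: {choose_placement(vram_gb)}")
```

Under these assumptions every planned model fits on a single T4, so splitting only becomes relevant for larger models.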
 ## 🚀 Quick Start
+### Install llamatelemetry v1.2.0
 ```bash
+pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v1.2.0
+```
+### Verify CUDA (v1.2.0 requirement)
+```python
+import llamatelemetry
+llamatelemetry.require_cuda()  # Raises RuntimeError if no GPU
 ```
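If it helps to see which GPUs the notebook exposes before calling into the SDK, a standalone check might look like the sketch below. It does not use llamatelemetry at all: `list_nvidia_gpus` is a hypothetical helper that shells out to `nvidia-smi`, and it assumes that tool is on `PATH` when a GPU runtime is attached.

```python
import shutil
import subprocess


def list_nvidia_gpus() -> list[str]:
    """Return one 'name, memory' line per visible GPU, or [] if none are found."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []  # no NVIDIA driver tooling on this machine
    try:
        completed = subprocess.run(
            [exe, "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=10,
        )
    except (subprocess.SubprocessError, OSError):
        return []  # tool present but the query failed or timed out
    return [line.strip() for line in completed.stdout.splitlines() if line.strip()]


gpus = list_nvidia_gpus()
print(f"{len(gpus)} GPU(s) visible")
for gpu in gpus:
    print(" ", gpu)
```

On a Kaggle session with the T4 × 2 accelerator selected, two entries would be expected; an empty list suggests the accelerator is not enabled.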
 ### Download and Run a Model
 ```python
 import llamatelemetry
+from llamatelemetry import ServerManager, ServerConfig
+from llamatelemetry.llama import LlamaCppClient
 from huggingface_hub import hf_hub_download
+# Initialize SDK with GenAI metrics
+llamatelemetry.init(
+    service_name="kaggle-inference",
+    otlp_endpoint="http://localhost:4317",
+    enable_metrics=True,
+    gpu_enrichment=True,
+)
+# Download model from this repo (once available)
 model_path = hf_hub_download(
     repo_id="waqasm86/llamatelemetry-models",
+    filename="gemma-3-4b-it-Q4_K_M.gguf",
     local_dir="/kaggle/working/models"
 )
+# Start server on dual T4
+config = ServerConfig(
+    model_path=model_path,
+    tensor_split=[0.5, 0.5],
+    n_gpu_layers=-1,
+    flash_attn=True,
+)
+server = ServerManager(config)
+server.start()
+# Instrumented inference — emits gen_ai.* spans + metrics
+client = LlamaCppClient(base_url=server.url, strict_operation_names=True)
+response = client.chat(
+    messages=[{"role": "user", "content": "Explain CUDA tensor cores."}],
+    max_tokens=512,
+)
+print(response.choices[0].message.content)
 llamatelemetry.shutdown()
+server.stop()
 ```
+## 📊 GenAI Metrics Captured (v1.2.0)
+Every inference call automatically records:
+| Metric | Unit | Description |
+|--------|------|-------------|
+| `gen_ai.client.token.usage` | `{token}` | Input + output token count |
+| `gen_ai.client.operation.duration` | `s` | Total request duration |
+| `gen_ai.server.time_to_first_token` | `s` | TTFT latency |
+| `gen_ai.server.time_per_output_token` | `s` | Per-token decode time |
+| `gen_ai.server.request.active` | `{request}` | Concurrent in-flight requests |
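The latency metrics in this table are related: to a first approximation, server-side request time is the time to first token plus the per-token decode time for each remaining output token. The sketch below works through that arithmetic with made-up numbers; the function names are hypothetical, and the client-side `gen_ai.client.operation.duration` would additionally include network and queueing overhead.

```python
def estimate_server_duration(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate duration: TTFT covers token 1, TPOT covers each later token."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + tpot_s * (output_tokens - 1)


def decode_tokens_per_second(tpot_s: float) -> float:
    """Decode throughput is the reciprocal of time per output token."""
    if tpot_s <= 0:
        raise ValueError("tpot_s must be positive")
    return 1.0 / tpot_s


# Illustrative values only: 0.02 s/token corresponds to the ~50 tok/s
# estimate listed for the 3B-4B models; the TTFT value is assumed.
ttft_s, tpot_s, output_tokens = 0.25, 0.02, 512
print(f"decode speed: {decode_tokens_per_second(tpot_s):.0f} tok/s")
print(f"estimated duration: {estimate_server_duration(ttft_s, tpot_s, output_tokens):.2f} s")
```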
 ## 🎯 Dual GPU Strategies
+### Strategy 1: Inference on GPU 0, Analytics on GPU 1
 ```python
+config = ServerConfig(
     model_path=model_path,
+    tensor_split=[1.0, 0.0],  # 100% GPU 0
+    n_gpu_layers=-1,
 )
+# GPU 1 free for RAPIDS / Graphistry / cuDF
 ```
+### Strategy 2: Model Split Across Both T4s (for larger models)
 ```python
+config = ServerConfig(
     model_path=large_model_path,
+    tensor_split=[0.5, 0.5],  # 50% each
+    n_gpu_layers=-1,
+)
+```
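Both strategies pass `tensor_split` as a list of per-GPU proportions. When the GPUs do not have equal free memory (for example, GPU 1 is already running analytics), the proportions could be derived from the free memory on each device. This is a sketch under that assumption; `tensor_split_from_free_memory` is a hypothetical helper, not a llamatelemetry function.

```python
def tensor_split_from_free_memory(free_gb: list[float]) -> list[float]:
    """Normalize per-GPU free memory (GB) into proportions that sum to ~1.0."""
    if not free_gb:
        raise ValueError("at least one GPU is required")
    if any(value < 0 for value in free_gb):
        raise ValueError("free memory cannot be negative")
    total = sum(free_gb)
    if total <= 0:
        raise ValueError("no free memory on any GPU")
    return [round(value / total, 3) for value in free_gb]


# Two idle T4s reproduce the even split from Strategy 2.
print(tensor_split_from_free_memory([15.0, 15.0]))
# Reserving all of GPU 1 reproduces Strategy 1.
print(tensor_split_from_free_memory([15.0, 0.0]))
# GPU 1 half occupied by other work: weight GPU 0 more heavily.
print(tensor_split_from_free_memory([14.0, 7.0]))
```

The resulting list could then be passed as the `tensor_split` argument in the configurations shown above.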
+## 🔧 Benchmarking Models
+```python
+from llamatelemetry.bench import BenchmarkRunner, BenchmarkProfile
+runner = BenchmarkRunner(client=client, profile=BenchmarkProfile.STANDARD)
+results = runner.run(
+    model_name="gemma-3-4b-it-Q4_K_M",
+    prompts=[
+        "Explain attention mechanisms.",
+        "Write a Python function to sort a list.",
+    ],
 )
+print(results.summary())
+# Output: TTFT p50/p95, tokens/sec, prefill_ms, decode_ms
 ```
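The summary above reports TTFT at p50 and p95. For readers unfamiliar with percentile reporting, the sketch below computes both from a list of TTFT samples using the nearest-rank method. The sample values are invented for illustration, and the method `BenchmarkRunner` actually uses is not stated in this README, so its numbers may differ slightly.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Invented TTFT measurements in seconds, including one slow outlier.
ttft_samples = [0.21, 0.18, 0.25, 0.19, 0.40, 0.22, 0.20, 0.23, 0.24, 0.26]
print(f"TTFT p50: {percentile(ttft_samples, 50):.2f} s")
print(f"TTFT p95: {percentile(ttft_samples, 95):.2f} s")
```

Reporting p95 alongside p50 exposes tail latency, such as the outlier here, that a median or mean alone would hide.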
+## 🔗 Links
+- **GitHub Repository**: https://github.com/llamatelemetry/llamatelemetry
+- **GitHub Releases**: https://github.com/llamatelemetry/llamatelemetry/releases/tag/v1.2.0
+- **Binaries Repository**: https://huggingface.co/waqasm86/llamatelemetry-binaries
+- **Kaggle Guide**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/KAGGLE_GUIDE.md
+- **Integration Guide**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/INTEGRATION_GUIDE.md
+- **API Reference**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/API_REFERENCE.md
+## 🔗 Model Sources
+Models are sourced from reputable community providers:
+- [Unsloth GGUF Models](https://huggingface.co/unsloth) — Optimized GGUF conversions
+- [Bartowski GGUF Models](https://huggingface.co/bartowski) — High-quality quants
+- [LM Studio Community](https://huggingface.co/lmstudio-community) — Curated GGUF models
+All models are:
+- ✅ Publicly available under permissive licenses
+- ✅ Verified on llamatelemetry v1.2.0 + Kaggle T4x2
+- ✅ Credited to original authors
 ## 🆘 Getting Help
 - **GitHub Issues**: https://github.com/llamatelemetry/llamatelemetry/issues
+- **Discussions**: https://github.com/llamatelemetry/llamatelemetry/discussions
 ## 📄 License
+This repository: MIT License.
+Individual models: See each model card for its specific license (Apache 2.0, MIT, Gemma License, etc.)
 ---
+**Maintained by**: [waqasm86](https://huggingface.co/waqasm86)
+**SDK Version**: 1.2.0
+**Last Updated**: 2026-02-20
+**Target Platform**: Kaggle dual Tesla T4 (CUDA 12.5, SM 7.5)
+**Status**: Active — models being added