waqasm86 committed on
Commit 8a9dffe · verified · 1 Parent(s): 95cae10

docs: update README for v1.2.0 — gen_ai semconv, GenAI metrics, new API

Files changed (1)
  1. README.md +118 -88
README.md CHANGED
@@ -14,158 +14,188 @@ tags:
  library_name: llamatelemetry
  ---

- # llamatelemetry Models

- Curated collection of GGUF models optimized for **llamatelemetry** on Kaggle dual Tesla T4 GPUs (2× 15GB VRAM).

  ## 🎯 About This Repository

  This repository contains GGUF models tested and verified to work with:
- - **llamatelemetry v1.0.0** - CUDA-first OpenTelemetry Python SDK for LLM inference observability
- - **Platform**: Kaggle Notebooks (2× Tesla T4, 30GB total VRAM)
- - **CUDA**: 12.5

  ## 📦 Available Models

- > **Status**: Repository created, models coming soon!

- ### Planned Models (v1.0.0)

  | Model | Size | Quantization | VRAM | Speed (tok/s) | Status |
  |-------|------|--------------|------|---------------|--------|
- | Gemma 3 1B Instruct | 1B | Q4_K_M | ~1.5GB | ~80 | 🔄 Coming soon |
- | Gemma 3 3B Instruct | 3B | Q4_K_M | ~3GB | ~50 | 🔄 Coming soon |
- | Llama 3.2 3B Instruct | 3B | Q4_K_M | ~3GB | ~50 | 🔄 Coming soon |
- | Qwen 2.5 1.5B Instruct | 1.5B | Q4_K_M | ~2GB | ~70 | 🔄 Coming soon |
- | Mistral 7B Instruct | 7B | Q4_K_M | ~6GB | ~25 | 🔄 Coming soon |

  ### Model Selection Criteria

- Models in this repository are:
- 1. ✅ **Tested** on Kaggle dual T4 GPUs
- 2. ✅ **Verified** to fit in 15GB VRAM (single GPU)
- 3. ✅ **Compatible** with llamatelemetry's observability features
- 4. ✅ **Optimized** for GGUF + CUDA acceleration
  5. ✅ **Documented** with performance benchmarks

  ## 🚀 Quick Start

- ### Install llamatelemetry

  ```bash
- # On Kaggle with GPU T4 × 2
- pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v1.0.0
  ```

  ### Download and Run a Model

  ```python
  import llamatelemetry
  from huggingface_hub import hf_hub_download

- # Initialize SDK
- llamatelemetry.init(service_name="my-llm-app")

- # Download model (example - not yet available)
  model_path = hf_hub_download(
      repo_id="waqasm86/llamatelemetry-models",
-     filename="gemma-3-1b-it-Q4_K_M.gguf",
      local_dir="/kaggle/working/models"
  )

- # Start server with model on GPU 0
- from llamatelemetry.llama import ServerManager
- server = ServerManager()
- server.start_server(model_path=model_path, gpu_layers=99)
-
- # Run inference with telemetry
- from llamatelemetry.llama import LlamaCppClient
- client = LlamaCppClient()
- result = client.completion("Explain quantum computing in simple terms", max_tokens=150)
- print(result)

- # Cleanup
  llamatelemetry.shutdown()
  ```

- ## 📊 Recommended Models by Use Case
-
- ### For Fast Prototyping
- - **Gemma 3 1B** - Fastest inference, good for testing
- - **Qwen 2.5 1.5B** - Balance of speed and quality
-
- ### For Production Quality
- - **Gemma 3 3B** - High quality, reasonable speed
- - **Llama 3.2 3B** - Strong reasoning capabilities
-
- ### For Complex Tasks
- - **Mistral 7B** - Best quality, slower but fits in single T4
-
- ## 🔗 Model Sources
-
- Models are sourced from reputable providers:
- - [Unsloth GGUF Models](https://huggingface.co/unsloth) - Optimized GGUF conversions
- - [TheBloke GGUF Models](https://huggingface.co/TheBloke) - Community standard
- - [Bartowski GGUF Models](https://huggingface.co/bartowski) - High-quality quants
-
- All models are:
- - ✅ Publicly available under permissive licenses
- - ✅ Re-hosted here for convenience and verification
- - ✅ Credited to original authors

  ## 🎯 Dual GPU Strategies

- llamatelemetry supports multi-GPU workloads:
-
- ### Strategy 1: LLM on GPU 0, Observability on GPU 1

  ```python
- from llamatelemetry.llama import ServerManager
-
- # Start llama-server on GPU 0 only
- server = ServerManager()
- server.start_server(
      model_path=model_path,
-     gpu_layers=99,
-     tensor_split="1.0,0.0",  # 100% GPU 0, 0% GPU 1
-     flash_attn=1,
  )
-
- # GPU 1 is now free for RAPIDS/Graphistry visualization
  ```

- ### Strategy 2: Model Sharding Across Both GPUs

  ```python
- # Split large model across both T4s
- server.start_server(
      model_path=large_model_path,
-     gpu_layers=99,
-     tensor_split="0.5,0.5",  # 50% GPU 0, 50% GPU 1
  )
  ```

- ## 📚 Documentation & Links

- - **GitHub**: https://github.com/llamatelemetry/llamatelemetry
- - **Installation Guide**: [KAGGLE_INSTALL_GUIDE.md](https://github.com/llamatelemetry/llamatelemetry/blob/main/KAGGLE_INSTALL_GUIDE.md)
- - **Binaries**: https://huggingface.co/waqasm86/llamatelemetry-binaries
- - **Tutorials**: [notebooks/](https://github.com/llamatelemetry/llamatelemetry/tree/main/notebooks)

  ## 🆘 Getting Help

  - **GitHub Issues**: https://github.com/llamatelemetry/llamatelemetry/issues
- - **Documentation**: https://llamatelemetry.github.io (planned)

  ## 📄 License

- This repository: MIT License
-
- Individual models: See model cards for specific licenses (Apache 2.0, MIT, Gemma License, etc.)

  ---

- **Maintained by**: [waqasm86](https://huggingface.co/waqasm86)
- **Status**: Repository initialized, models coming soon
- **Target Platform**: Kaggle dual Tesla T4 (CUDA 12.5)
- **Last Updated**: 2026-02-16

  library_name: llamatelemetry
  ---

+ # llamatelemetry Models (v1.2.0)

+ Curated collection of GGUF models optimized for **llamatelemetry v1.2.0** on Kaggle dual Tesla T4
+ GPUs (2× 15 GB VRAM), using `gen_ai.*` OpenTelemetry semantic conventions.

  ## 🎯 About This Repository

  This repository contains GGUF models tested and verified to work with:
+ - **llamatelemetry v1.2.0** - CUDA-first OpenTelemetry Python SDK for LLM inference observability
+ - **Platform**: Kaggle Notebooks (2× Tesla T4, 30 GB total VRAM)
+ - **CUDA**: 12.5 | **Compute Capability**: SM 7.5

  ## 📦 Available Models

+ > **Status**: Repository initialized. Models will be added as they are verified on Kaggle T4x2.

+ ### Planned Models (v1.2.0)

  | Model | Size | Quantization | VRAM | Speed (tok/s) | Status |
  |-------|------|--------------|------|---------------|--------|
+ | Gemma 3 1B Instruct | 1B | Q4_K_M | ~1.5 GB | ~80 | 🔄 Coming soon |
+ | Gemma 3 4B Instruct | 4B | Q4_K_M | ~3.5 GB | ~50 | 🔄 Coming soon |
+ | Llama 3.2 3B Instruct | 3B | Q4_K_M | ~3 GB | ~50 | 🔄 Coming soon |
+ | Qwen 2.5 1.5B Instruct | 1.5B | Q4_K_M | ~2 GB | ~70 | 🔄 Coming soon |
+ | Mistral 7B Instruct v0.3 | 7B | Q4_K_M | ~6 GB | ~25 | 🔄 Coming soon |
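As a rough guide for choosing among these, the table's VRAM estimates can be checked against a GPU budget programmatically. A hypothetical helper (plain Python, not part of llamatelemetry; the model names and 2 GB headroom figure are illustrative assumptions):

```python
# Hypothetical helper: pick the largest planned model whose estimated
# Q4_K_M VRAM footprint (from the table above) fits a per-GPU budget.
PLANNED_MODELS = {
    "gemma-3-1b-it": 1.5,
    "gemma-3-4b-it": 3.5,
    "llama-3.2-3b-it": 3.0,
    "qwen-2.5-1.5b-it": 2.0,
    "mistral-7b-it-v0.3": 6.0,
}

def largest_fitting_model(budget_gb, headroom_gb=2.0):
    """Return the planned model with the largest VRAM need that still fits,
    leaving `headroom_gb` for KV cache and CUDA context. None if nothing fits."""
    candidates = {m: v for m, v in PLANNED_MODELS.items()
                  if v + headroom_gb <= budget_gb}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

print(largest_fitting_model(15.0))  # single-T4 budget -> mistral-7b-it-v0.3
```

On a 15 GB T4 this picks Mistral 7B; with only ~4 GB free it falls back to the 1.5B-class models.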

  ### Model Selection Criteria

+ All models in this repository are:
+ 1. ✅ **Tested** on Kaggle dual T4 GPUs with llamatelemetry v1.2.0
+ 2. ✅ **Verified** to fit in 15 GB VRAM (single GPU) or 30 GB (split)
+ 3. ✅ **Compatible** with GenAI semconv (`gen_ai.*` attributes)
+ 4. ✅ **Instrumented**: TTFT, TPOT, and token usage captured automatically
  5. ✅ **Documented** with performance benchmarks

  ## 🚀 Quick Start

+ ### Install llamatelemetry v1.2.0

  ```bash
+ pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v1.2.0
+ ```
+
+ ### Verify CUDA (v1.2.0 requirement)
+
+ ```python
+ import llamatelemetry
+ llamatelemetry.require_cuda()  # Raises RuntimeError if no GPU
  ```

  ### Download and Run a Model

  ```python
  import llamatelemetry
+ from llamatelemetry import ServerManager, ServerConfig
+ from llamatelemetry.llama import LlamaCppClient
  from huggingface_hub import hf_hub_download

+ # Initialize SDK with GenAI metrics
+ llamatelemetry.init(
+     service_name="kaggle-inference",
+     otlp_endpoint="http://localhost:4317",
+     enable_metrics=True,
+     gpu_enrichment=True,
+ )

+ # Download model from this repo (once available)
  model_path = hf_hub_download(
      repo_id="waqasm86/llamatelemetry-models",
+     filename="gemma-3-4b-it-Q4_K_M.gguf",
      local_dir="/kaggle/working/models"
  )

+ # Start server on dual T4
+ config = ServerConfig(
+     model_path=model_path,
+     tensor_split=[0.5, 0.5],
+     n_gpu_layers=-1,
+     flash_attn=True,
+ )
+ server = ServerManager(config)
+ server.start()
+
+ # Instrumented inference — emits gen_ai.* spans + metrics
+ client = LlamaCppClient(base_url=server.url, strict_operation_names=True)
+ response = client.chat(
+     messages=[{"role": "user", "content": "Explain CUDA tensor cores."}],
+     max_tokens=512,
+ )
+ print(response.choices[0].message.content)

  llamatelemetry.shutdown()
+ server.stop()
  ```

+ ## 📊 GenAI Metrics Captured (v1.2.0)

+ Every inference call automatically records:

+ | Metric | Unit | Description |
+ |--------|------|-------------|
+ | `gen_ai.client.token.usage` | `{token}` | Input + output token count |
+ | `gen_ai.client.operation.duration` | `s` | Total request duration |
+ | `gen_ai.server.time_to_first_token` | `s` | TTFT latency |
+ | `gen_ai.server.time_per_output_token` | `s` | Per-token decode time |
+ | `gen_ai.server.request.active` | `{request}` | Concurrent in-flight requests |
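These metrics are related by simple arithmetic. A plain-Python sketch (not SDK API) of deriving TPOT and decode throughput from `gen_ai.client.operation.duration`, `gen_ai.server.time_to_first_token`, and the output token count, assuming the first output token is accounted for by TTFT:

```python
def derive_decode_stats(operation_duration_s, time_to_first_token_s, output_tokens):
    """Derive TPOT and decode throughput from gen_ai metric values.
    Assumes the first output token is covered by TTFT, so the remaining
    (output_tokens - 1) tokens share the decode phase."""
    decode_s = operation_duration_s - time_to_first_token_s
    tpot_s = decode_s / (output_tokens - 1)
    tokens_per_s = 1.0 / tpot_s
    return tpot_s, tokens_per_s

tpot, tps = derive_decode_stats(2.0, 0.5, 16)
print(f"TPOT={tpot:.3f}s, throughput={tps:.1f} tok/s")
# TPOT=0.100s, throughput=10.0 tok/s
```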
 
 
 
 

  ## 🎯 Dual GPU Strategies

+ ### Strategy 1: Inference on GPU 0, Analytics on GPU 1

  ```python
+ config = ServerConfig(
      model_path=model_path,
+     tensor_split=[1.0, 0.0],  # 100% GPU 0
+     n_gpu_layers=-1,
  )
+ # GPU 1 free for RAPIDS / Graphistry / cuDF
  ```
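To confirm GPU 1 really stays idle under Strategy 1, the output of `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` can be inspected. A hypothetical sketch (the parser and sample values below are illustrative, not llamatelemetry API):

```python
def parse_nvidia_smi_csv(text):
    """Parse `nvidia-smi --query-gpu=memory.used,memory.total
    --format=csv,noheader,nounits` output into (used_mb, total_mb) per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(x.strip()) for x in line.split(","))
        rows.append((used, total))
    return rows

sample = "13500, 15360\n3, 15360\n"  # made-up values: GPU 0 loaded, GPU 1 idle
print(parse_nvidia_smi_csv(sample))  # [(13500, 15360), (3, 15360)]
```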

+ ### Strategy 2: Model Split Across Both T4s (for larger models)

  ```python
+ config = ServerConfig(
      model_path=large_model_path,
+     tensor_split=[0.5, 0.5],  # 50% each
+     n_gpu_layers=-1,
+ )
+ ```
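Instead of hard-coding the ratios, the `tensor_split` values can be derived from measured free VRAM. A hypothetical helper (plain Python, not a llamatelemetry API):

```python
def tensor_split_from_free_vram(free_gb):
    """Proportionally split model layers across GPUs by free VRAM.
    free_gb: list of free VRAM per GPU (e.g. read from nvidia-smi)."""
    total = sum(free_gb)
    if total <= 0:
        raise ValueError("no free VRAM reported")
    return [round(g / total, 2) for g in free_gb]

print(tensor_split_from_free_vram([15.0, 15.0]))  # [0.5, 0.5]
print(tensor_split_from_free_vram([12.0, 4.0]))   # [0.75, 0.25]
```

This keeps the split balanced even when another workload (RAPIDS, cuDF) already occupies part of one GPU.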
+
+ ## 🔧 Benchmarking Models
+
+ ```python
+ from llamatelemetry.bench import BenchmarkRunner, BenchmarkProfile
+
+ runner = BenchmarkRunner(client=client, profile=BenchmarkProfile.STANDARD)
+ results = runner.run(
+     model_name="gemma-3-4b-it-Q4_K_M",
+     prompts=[
+         "Explain attention mechanisms.",
+         "Write a Python function to sort a list.",
+     ],
  )
+ print(results.summary())
+ # Output: TTFT p50/p95, tokens/sec, prefill_ms, decode_ms
+ ```
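The p50/p95 figures in such a summary are ordinary percentiles over raw samples. An illustrative stdlib-only sketch of computing them from TTFT measurements (the sample values are made up):

```python
def percentile(samples, p):
    """Percentile with linear interpolation between the two nearest ranks."""
    xs = sorted(samples)
    k = (len(xs) - 1) * p
    lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
    frac = k - lo
    return xs[lo] * (1 - frac) + xs[hi] * frac

ttft_s = [0.42, 0.45, 0.44, 0.50, 0.43, 0.47, 0.46, 0.44, 0.48, 0.60]
print(f"TTFT p50={percentile(ttft_s, 0.50):.3f}s p95={percentile(ttft_s, 0.95):.3f}s")
# TTFT p50=0.455s p95=0.555s
```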

+ ## 🔗 Links
+
+ - **GitHub Repository**: https://github.com/llamatelemetry/llamatelemetry
+ - **GitHub Releases**: https://github.com/llamatelemetry/llamatelemetry/releases/tag/v1.2.0
+ - **Binaries Repository**: https://huggingface.co/waqasm86/llamatelemetry-binaries
+ - **Kaggle Guide**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/KAGGLE_GUIDE.md
+ - **Integration Guide**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/INTEGRATION_GUIDE.md
+ - **API Reference**: https://github.com/llamatelemetry/llamatelemetry/blob/main/docs/API_REFERENCE.md

+ ## 🔗 Model Sources
+
+ Models are sourced from reputable community providers:
+ - [Unsloth GGUF Models](https://huggingface.co/unsloth) — Optimized GGUF conversions
+ - [Bartowski GGUF Models](https://huggingface.co/bartowski) — High-quality quants
+ - [LM Studio Community](https://huggingface.co/lmstudio-community) — Curated GGUF models
+
+ All models are:
+ - ✅ Publicly available under permissive licenses
+ - ✅ Verified on llamatelemetry v1.2.0 + Kaggle T4x2
+ - ✅ Credited to original authors

186
 
187
  - **GitHub Issues**: https://github.com/llamatelemetry/llamatelemetry/issues
188
+ - **Discussions**: https://github.com/llamatelemetry/llamatelemetry/discussions
189
 
190
  ## 📄 License
191
 
192
+ This repository: MIT License.
193
+ Individual models: See each model card for specific license (Apache 2.0, MIT, Gemma License, etc.)
 
194
 
195
  ---
196
 
197
+ **Maintained by**: [waqasm86](https://huggingface.co/waqasm86)
198
+ **SDK Version**: 1.2.0
199
+ **Last Updated**: 2026-02-20
200
+ **Target Platform**: Kaggle dual Tesla T4 (CUDA 12.5, SM 7.5)
201
+ **Status**: Active — models being added