waqasm86 committed c0806ea (verified) · 1 parent: db26482

Initialize repository with README

  1. README.md +164 -0
README.md ADDED
@@ -0,0 +1,164 @@
---
license: mit
tags:
- llm
- gguf
- llama
- gemma
- mistral
- qwen
- inference
- opentelemetry
- observability
- kaggle
library_name: llamatelemetry
---

# llamatelemetry Models

Curated collection of GGUF models optimized for **llamatelemetry** on Kaggle dual Tesla T4 GPUs (2× 15GB VRAM).

## 🎯 About This Repository

This repository contains GGUF models tested and verified to work with:
- **llamatelemetry v0.1.0** - CUDA-first OpenTelemetry Python SDK for LLM inference observability
- **Platform**: Kaggle Notebooks (2× Tesla T4, 30GB total VRAM)
- **CUDA**: 12.5

## 📦 Available Models

> **Status**: Repository created, models coming soon!

### Planned Models (v0.1.0)

| Model | Size | Quantization | VRAM | Speed (tok/s) | Status |
|-------|------|--------------|------|---------------|--------|
| Gemma 3 1B Instruct | 1B | Q4_K_M | ~1.5GB | ~80 | 🔄 Coming soon |
| Gemma 3 3B Instruct | 3B | Q4_K_M | ~3GB | ~50 | 🔄 Coming soon |
| Llama 3.2 3B Instruct | 3B | Q4_K_M | ~3GB | ~50 | 🔄 Coming soon |
| Qwen 2.5 1.5B Instruct | 1.5B | Q4_K_M | ~2GB | ~70 | 🔄 Coming soon |
| Mistral 7B Instruct | 7B | Q4_K_M | ~6GB | ~25 | 🔄 Coming soon |

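To make the VRAM column easier to act on, here is a minimal sketch that filters the planned models by a memory budget. The figures are copied from the table above and are approximate. The `PLANNED_MODELS` dictionary, the `models_within_budget` helper, and the 2 GB headroom default are illustrative assumptions, not part of llamatelemetry.

```python
# Approximate VRAM (GB) and speed (tok/s) copied from the "Planned Models" table.
PLANNED_MODELS = {
    "Gemma 3 1B Instruct": {"vram_gb": 1.5, "tok_s": 80},
    "Gemma 3 3B Instruct": {"vram_gb": 3.0, "tok_s": 50},
    "Llama 3.2 3B Instruct": {"vram_gb": 3.0, "tok_s": 50},
    "Qwen 2.5 1.5B Instruct": {"vram_gb": 2.0, "tok_s": 70},
    "Mistral 7B Instruct": {"vram_gb": 6.0, "tok_s": 25},
}


def models_within_budget(budget_gb, headroom_gb=2.0):
    """Return names of models that fit the budget, fastest first.

    `headroom_gb` is an assumed allowance for KV cache and CUDA context;
    tune it for your context length.
    """
    fitting = [
        name
        for name, spec in PLANNED_MODELS.items()
        if spec["vram_gb"] + headroom_gb <= budget_gb
    ]
    # sorted() is stable, so ties keep the table order.
    return sorted(fitting, key=lambda n: PLANNED_MODELS[n]["tok_s"], reverse=True)


if __name__ == "__main__":
    print(models_within_budget(15))  # budget of a single T4
    print(models_within_budget(5))   # tighter budget, e.g. a shared GPU
```

With a 15 GB budget all five planned models pass the filter; a 5 GB budget excludes Mistral 7B.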
### Model Selection Criteria

Models in this repository are:
1. ✅ **Tested** on Kaggle dual T4 GPUs
2. ✅ **Verified** to fit in 15GB VRAM (single GPU)
3. ✅ **Compatible** with llamatelemetry's observability features
4. ✅ **Optimized** for GGUF + CUDA acceleration
5. ✅ **Documented** with performance benchmarks

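One way to check the "fits in 15GB VRAM" criterion on a live notebook is to read per-GPU memory from `nvidia-smi`. The sketch below is not part of llamatelemetry; the helper names are hypothetical, and the sample numbers in the demo are made up for illustration. It relies only on the standard `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse `index, memory.total, memory.used` rows (values in MiB) into dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, total, used = (field.strip() for field in line.split(","))
        gpus.append(
            {"index": int(index), "total_mib": int(total), "used_mib": int(used)}
        )
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return per-GPU memory figures (needs an NVIDIA driver)."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.total,memory.used",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_memory(out)


def fits_on_gpu(gpu, required_gb):
    """True if the GPU's free memory covers `required_gb` (1 GB taken as 1024 MiB)."""
    free_mib = gpu["total_mib"] - gpu["used_mib"]
    return free_mib >= required_gb * 1024


if __name__ == "__main__":
    # Illustrative sample text; call query_gpu_memory() on a real GPU machine.
    sample = "0, 15360, 3\n1, 15360, 3\n"
    for gpu in parse_gpu_memory(sample):
        print(gpu["index"], fits_on_gpu(gpu, required_gb=6))
```

Parsing is kept separate from the subprocess call so the logic can be exercised on machines without a GPU.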
## 🚀 Quick Start

### Install llamatelemetry

```bash
# On Kaggle with GPU T4 × 2
pip install --no-cache-dir --force-reinstall \
    git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0
```

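After installing, it can help to confirm which version actually landed in the environment. The following is a small sketch using only the standard library; it assumes the installed distribution is named `llamatelemetry`, which this README does not state explicitly. `installed_version` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "llamatelemetry" as the distribution name is an assumption.
    found = installed_version("llamatelemetry")
    if found is None:
        print("llamatelemetry is not installed in this environment")
    else:
        print(f"llamatelemetry {found} is installed")
```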
### Download and Run a Model

```python
import llamatelemetry
from llamatelemetry import InferenceEngine
from huggingface_hub import hf_hub_download

# Download model (example - not yet available)
model_path = hf_hub_download(
    repo_id="waqasm86/llamatelemetry-models",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models"
)

# Load model on GPU 0
engine = InferenceEngine()
engine.load_model(model_path, silent=True)

# Run inference with telemetry
result = engine.infer("Explain quantum computing in simple terms", max_tokens=150)
print(result.text)
```

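The tok/s figures in this README are estimates, so you may want to measure throughput yourself. Below is a rough sketch of a timing wrapper around any text-generation callable. `measure_throughput` is a hypothetical helper, not a llamatelemetry API, and the default whitespace-based token count only approximates the model's real tokenizer.

```python
import time


def measure_throughput(generate, prompt, count_tokens=None):
    """Time one call to `generate(prompt)` and return (text, tokens_per_second).

    `count_tokens` defaults to whitespace splitting, which only approximates
    the number of tokens the model actually produced.
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())
    start = time.perf_counter()
    text = generate(prompt)
    elapsed = time.perf_counter() - start
    tokens = count_tokens(text)
    return text, (tokens / elapsed if elapsed > 0 else float("inf"))


if __name__ == "__main__":
    # Stand-in generator so the sketch runs without a GPU or a model.
    def fake_generate(prompt):
        time.sleep(0.05)
        return "a short canned answer of eight words total"

    text, rate = measure_throughput(fake_generate, "Explain quantum computing")
    print(f"{rate:.1f} tok/s (approximate)")
```

With the Quick Start code, the callable could be something like `lambda p: engine.infer(p, max_tokens=150).text`, reusing the `engine` object created there.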
## 📊 Recommended Models by Use Case

### For Fast Prototyping
- **Gemma 3 1B** - Fastest inference, good for testing
- **Qwen 2.5 1.5B** - Balance of speed and quality

### For Production Quality
- **Gemma 3 3B** - High quality, reasonable speed
- **Llama 3.2 3B** - Strong reasoning capabilities

### For Complex Tasks
- **Mistral 7B** - Best quality, slower but fits in single T4

## 🔗 Model Sources

Models are sourced from reputable providers:
- [Unsloth GGUF Models](https://huggingface.co/unsloth) - Optimized GGUF conversions
- [TheBloke GGUF Models](https://huggingface.co/TheBloke) - Community standard
- [Bartowski GGUF Models](https://huggingface.co/bartowski) - High-quality quants

All models are:
- ✅ Publicly available under permissive licenses
- ✅ Re-hosted here for convenience and verification
- ✅ Credited to original authors

## 🎯 Dual GPU Strategies

llamatelemetry supports multi-GPU workloads:

### Strategy 1: LLM on GPU 0, Observability on GPU 1

```python
from llamatelemetry.server import ServerManager

# Start llama-server on GPU 0 only
server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # 100% GPU 0, 0% GPU 1
    flash_attn=1,
)

# GPU 1 is now free for RAPIDS/Graphistry visualization
```

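To keep a separate visualization process on GPU 1, a common approach is to restrict that process with the standard `CUDA_VISIBLE_DEVICES` environment variable. The sketch below is independent of llamatelemetry; `env_for_gpu` and `run_on_gpu` are hypothetical helpers, and the demo child process only echoes the variable back.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one physical GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_gpu(gpu_index, argv):
    """Launch `argv` as a child process that can only see `gpu_index`."""
    return subprocess.run(argv, env=env_for_gpu(gpu_index), check=True)


if __name__ == "__main__":
    # Inside the child, physical GPU 1 appears as CUDA device 0.
    run_on_gpu(
        1,
        [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    )
```

The variable must be set before the child initializes CUDA, which is why it is passed through the process environment rather than changed afterwards.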
### Strategy 2: Model Sharding Across Both GPUs

```python
# Split large model across both T4s
# (reuses the `server` instance created in Strategy 1)
server.start_server(
    model_path=large_model_path,
    gpu_layers=99,
    tensor_split="0.5,0.5",  # 50% GPU 0, 50% GPU 1
)
```

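Choosing between the two strategies comes down to whether the model fits on one T4. Here is a minimal sketch of that decision, producing a string in the same format as the `tensor_split` values above. `choose_tensor_split`, the 15 GB per-GPU figure, and the 2 GB per-GPU headroom are illustrative assumptions rather than llamatelemetry behavior.

```python
def choose_tensor_split(model_vram_gb, per_gpu_gb=15.0, headroom_gb=2.0):
    """Pick a two-GPU tensor_split string from an estimated model footprint.

    Returns "1.0,0.0" when the model fits on GPU 0 alone, "0.5,0.5" when it
    needs both GPUs, and raises ValueError when it fits on neither layout.
    `headroom_gb` is an assumed per-GPU allowance for KV cache and CUDA context.
    """
    if model_vram_gb + headroom_gb <= per_gpu_gb:
        shares = (1.0, 0.0)  # Strategy 1: leave GPU 1 free
    elif model_vram_gb + 2 * headroom_gb <= 2 * per_gpu_gb:
        shares = (0.5, 0.5)  # Strategy 2: shard evenly across both GPUs
    else:
        raise ValueError(
            f"~{model_vram_gb} GB does not fit in 2 x {per_gpu_gb} GB with headroom"
        )
    return ",".join(f"{share:.1f}" for share in shares)


if __name__ == "__main__":
    print(choose_tensor_split(6))   # e.g. the ~6GB Mistral 7B estimate
    print(choose_tensor_split(20))  # a hypothetical larger model
```

The returned string could then be passed as the `tensor_split` argument shown in the strategies above.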
## 📚 Documentation & Links

- **GitHub**: https://github.com/llamatelemetry/llamatelemetry
- **Installation Guide**: [KAGGLE_INSTALL_GUIDE.md](https://github.com/llamatelemetry/llamatelemetry/blob/main/KAGGLE_INSTALL_GUIDE.md)
- **Binaries**: https://huggingface.co/waqasm86/llamatelemetry-binaries
- **Tutorials**: [notebooks/](https://github.com/llamatelemetry/llamatelemetry/tree/main/notebooks)

## 🆘 Getting Help

- **GitHub Issues**: https://github.com/llamatelemetry/llamatelemetry/issues
- **Documentation**: https://llamatelemetry.github.io (planned)

## 📄 License

This repository: MIT License

Individual models: See model cards for specific licenses (Apache 2.0, MIT, Gemma License, etc.)

---

**Maintained by**: [waqasm86](https://huggingface.co/waqasm86)
**Status**: Repository initialized, models coming soon
**Target Platform**: Kaggle dual Tesla T4 (CUDA 12.5)
**Last Updated**: 2026-02-03