R-Kentaren committed
Commit ada8101 · verified · 1 Parent(s): cb2ef9f

Update README.md

Files changed (1)
  1. README.md +3 -186
README.md CHANGED
@@ -1,8 +1,8 @@
  ---
- title: ZeroGPU-LLM-Inference
+ title: CPU-LLM-Inference
  emoji: 🧠
- colorFrom: indigo
- colorTo: purple
+ colorFrom: blue
+ colorTo: blue
  sdk: gradio
  sdk_version: 5.49.1
  app_file: app.py
@@ -10,186 +10,3 @@ pinned: false
  license: apache-2.0
  short_description: Streaming LLM chat with web search and controls
  ---
-
- # 🧠 ZeroGPU LLM Inference
-
- A modern, user-friendly Gradio interface for **token-streaming, chat-style inference** across a wide variety of Transformer models—powered by ZeroGPU for free GPU acceleration on Hugging Face Spaces.
-
- ## ✨ Key Features
-
- ### 🎨 Modern UI/UX
- - **Clean, intuitive interface** with organized layout and visual hierarchy
- - **Collapsible advanced settings** for both simple and power users
- - **Smooth animations and transitions** for better user experience
- - **Responsive design** that works on all screen sizes
- - **Copy-to-clipboard** functionality for easy sharing of responses
-
- ### 🔍 Web Search Integration
- - **Real-time DuckDuckGo search** with background threading
- - **Configurable timeout** and result limits
- - **Automatic context injection** into system prompts
- - **Smart toggle** - search settings auto-hide when disabled
-
- ### 💡 Smart Features
- - **Thought vs. Answer streaming**: `<think>…</think>` blocks shown separately as "💭 Thought"
- - **Working cancel button** - immediately stops generation without errors
- - **Debug panel** for prompt engineering insights
- - **Duration estimates** based on model size and settings
- - **Example prompts** to help users get started
- - **Dynamic system prompts** with automatic date insertion
-
- ### 🎯 Model Variety
- - **30+ LLM options** from leading providers (Qwen, Microsoft, Meta, Mistral, etc.)
- - Models ranging from **135M to 32B+** parameters
- - Specialized models for **reasoning, coding, and general chat**
- - **Efficient model loading** - one at a time with automatic cache clearing
-
- ### ⚙️ Advanced Controls
- - **Generation parameters**: max tokens, temperature, top-k, top-p, repetition penalty
- - **Web search settings**: max results, chars per result, timeout
- - **Custom system prompts** with dynamic date insertion
- - **Organized in collapsible sections** to keep interface clean
-
- ## 🔄 Supported Models
-
- ### Compact Models (< 2B)
- - **SmolLM2-135M-Instruct** - Tiny but capable
- - **SmolLM2-360M-Instruct** - Lightweight conversation
- - **Taiwan-ELM-270M/1.1B** - Multilingual support
- - **Qwen3-0.6B/1.7B** - Fast inference
-
- ### Mid-Size Models (2B-8B)
- - **Qwen3-4B/8B** - Balanced performance
- - **Phi-4-mini** (4.3B) - Reasoning & Instruct variants
- - **MiniCPM3-4B** - Efficient mid-size
- - **Gemma-3-4B-IT** - Instruction-tuned
- - **Llama-3.2-Taiwan-3B** - Regional optimization
- - **Mistral-7B-Instruct** - Classic performer
- - **DeepSeek-R1-Distill-Llama-8B** - Reasoning specialist
-
- ### Large Models (14B+)
- - **Qwen3-14B** - Strong general purpose
- - **Apriel-1.5-15b-Thinker** - Multimodal reasoning
- - **gpt-oss-20b** - Open GPT-style
- - **Qwen3-32B** - Top-tier performance
-
- ## 🚀 How It Works
-
- 1. **Select Model** - Choose from 30+ pre-configured models
- 2. **Configure Settings** - Adjust generation parameters or use defaults
- 3. **Enable Web Search** (optional) - Get real-time information
- 4. **Start Chatting** - Type your message or use example prompts
- 5. **Stream Response** - Watch as tokens are generated in real-time
- 6. **Cancel Anytime** - Stop generation mid-stream if needed
-
- ### Technical Flow
-
- 1. User message enters chat history
- 2. If search enabled, background thread fetches DuckDuckGo results
- 3. Search snippets merge into system prompt (within timeout limit)
- 4. Selected model pipeline loads on ZeroGPU (bf16→f16→f32 fallback)
- 5. Prompt formatted with thinking mode detection
- 6. Tokens stream to UI with thought/answer separation
- 7. Cancel button available for immediate interruption
- 8. Memory cleared after generation for next request
-
- ## ⚙️ Generation Parameters
-
- | Parameter | Range | Default | Description |
- |-----------|-------|---------|-------------|
- | Max Tokens | 64-16384 | 1024 | Maximum response length |
- | Temperature | 0.1-2.0 | 0.7 | Creativity vs focus |
- | Top-K | 1-100 | 40 | Token sampling pool size |
- | Top-P | 0.1-1.0 | 0.9 | Nucleus sampling threshold |
- | Repetition Penalty | 1.0-2.0 | 1.2 | Reduce repetition |
-
- ## 🌐 Web Search Settings
-
- | Setting | Range | Default | Description |
- |---------|-------|---------|-------------|
- | Max Results | Integer | 4 | Number of search results |
- | Max Chars/Result | Integer | 50 | Character limit per result |
- | Search Timeout | 0-30s | 5s | Maximum wait time |
-
- ## 💻 Local Development
-
- ```bash
- # Clone the repository
- git clone https://huggingface.co/spaces/Luigi/ZeroGPU-LLM-Inference
- cd ZeroGPU-LLM-Inference
-
- # Install dependencies
- pip install -r requirements.txt
-
- # Run the app
- python app.py
- ```
-
- ## 🎨 UI Design Philosophy
-
- The interface follows these principles:
-
- 1. **Simplicity First** - Core features immediately visible
- 2. **Progressive Disclosure** - Advanced options hidden but accessible
- 3. **Visual Hierarchy** - Clear organization with groups and sections
- 4. **Feedback** - Status indicators and helpful messages
- 5. **Accessibility** - Responsive, keyboard-friendly, with tooltips
-
- ## 🔧 Customization
-
- ### Adding New Models
-
- Edit `MODELS` dictionary in `app.py`:
-
- ```python
- "Your-Model-Name": {
- "repo_id": "org/model-name",
- "description": "Model description",
- "params_b": 7.0 # Size in billions
- }
- ```
-
- ### Modifying UI Theme
-
- Adjust theme parameters in `gr.Blocks()`:
-
- ```python
- theme=gr.themes.Soft(
- primary_hue="indigo",
- secondary_hue="purple",
- # ... more options
- )
- ```
-
- ## 📊 Performance
-
- - **Token streaming** for responsive feel
- - **Background search** doesn't block UI
- - **Efficient memory** management with cache clearing
- - **ZeroGPU acceleration** for fast inference
- - **Optimized loading** with dtype fallbacks
-
- ## 🤝 Contributing
-
- Contributions welcome! Areas for improvement:
-
- - Additional model integrations
- - UI/UX enhancements
- - Performance optimizations
- - Bug fixes and testing
- - Documentation improvements
-
- ## 📝 License
-
- Apache 2.0 - See LICENSE file for details
-
- ## 🙏 Acknowledgments
-
- - Built with [Gradio](https://gradio.app)
- - Powered by [Hugging Face Transformers](https://huggingface.co/transformers)
- - Uses [ZeroGPU](https://huggingface.co/zero-gpu-explorers) for acceleration
- - Search via [DuckDuckGo](https://duckduckgo.com)
-
- ---
-
- **Made with ❤️ for the open source community**
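
Note on the removed text: the deleted README says that `<think>…</think>` blocks are streamed separately from the answer, but the diff does not show how `app.py` does this. The following is a minimal sketch of one possible approach, assuming the UI re-parses the full text accumulated so far on every streamed chunk. The helper name `split_thought` and the regex are hypothetical and are not taken from the repository.

```python
import re

# Matches a <think> block; an unclosed block (still streaming) runs to end of text.
THINK_RE = re.compile(r"<think>(.*?)(?:</think>|$)", re.DOTALL)


def split_thought(text: str) -> tuple[str, str]:
    """Return (thought, answer) for the text accumulated so far.

    Hypothetical helper: everything inside <think> blocks is treated as
    thought, everything outside them as the visible answer.
    """
    thoughts = [m.group(1).strip() for m in THINK_RE.finditer(text)]
    answer = THINK_RE.sub("", text).strip()
    return "\n".join(t for t in thoughts if t), answer
```

Calling the helper on each partial string lets a UI update the thought panel while the block is still open, then switch to the answer once `</think>` arrives.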