Trouter-Library committed · commit c11b4a4 (verified) · 1 parent: 80fdbda

Create USAGE_GUIDE.md

Files changed (1): USAGE_GUIDE.md (added, +326 -0)

# Trouter-20B Usage Guide

## Installation

```bash
pip install transformers torch accelerate bitsandbytes
```

## Quick Start

### Basic Text Generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "your-username/Trouter-20B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Generate text
prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.95,
    do_sample=True
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Chat Interface

```python
def chat(messages, max_new_tokens=512):
    """
    Chat with the model using a conversation history.

    Args:
        messages: List of dicts with 'role' and 'content' keys
        max_new_tokens: Maximum tokens to generate

    Example:
        messages = [
            {"role": "user", "content": "What is machine learning?"}
        ]
    """
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode only the newly generated tokens, not the prompt
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response

# Example usage
conversation = [
    {"role": "user", "content": "Hello! Can you help me with Python?"}
]

response = chat(conversation)
print(response)

# Continue conversation
conversation.append({"role": "assistant", "content": response})
conversation.append({"role": "user", "content": "Show me how to read a CSV file."})

response = chat(conversation)
print(response)
```

### Memory-Efficient Loading (8-bit Quantization)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "your-username/Trouter-20B"

# Load in 8-bit for reduced memory usage
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
    torch_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
```
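
Note: on recent `transformers` releases, passing `load_in_8bit` directly may emit a deprecation warning. A minimal equivalent sketch using an explicit `BitsAndBytesConfig` (same placeholder model name as above):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "your-username/Trouter-20B"

# 8-bit loading expressed through a quantization config
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
```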

### 4-bit Quantization (Even Lower Memory)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_name = "your-username/Trouter-20B"

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
```

## Advanced Usage

### Batch Generation

```python
prompts = [
    "Write a poem about AI:",
    "Explain neural networks:",
    "What is reinforcement learning?"
]

# Decoder-only models need a pad token and left padding for batched generation
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.8,
    top_p=0.95,
    num_return_sequences=1,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for prompt, response in zip(prompts, responses):
    print(f"Prompt: {prompt}")
    print(f"Response: {response}\n")
```

### Streaming Generation

```python
from transformers import TextIteratorStreamer
from threading import Thread

# skip_prompt=True so only newly generated text is streamed
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "Write a story about a robot:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

generation_kwargs = {
    **inputs,
    "max_new_tokens": 256,
    "temperature": 0.7,
    "do_sample": True,
    "streamer": streamer
}

# Run generation in a background thread and consume tokens as they arrive
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

print("Generated text: ", end="")
for new_text in streamer:
    print(new_text, end="", flush=True)
print()
```

### Custom Generation Parameters

```python
# Creative generation
creative_output = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=1.0,       # Higher = more creative
    top_p=0.95,
    top_k=50,
    repetition_penalty=1.2,
    do_sample=True
)

# Deterministic generation (sampling disabled, so temperature/top_p are ignored)
deterministic_output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
    num_beams=4            # Beam search for quality
)
```

## Fine-tuning

### Using PEFT (Parameter-Efficient Fine-Tuning)

```python
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments, Trainer

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training arguments
training_args = TrainingArguments(
    output_dir="./trouter-finetuned",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    logging_steps=10,
    save_steps=100,
    fp16=True
)

# Train (train_dataset is your own tokenized dataset)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

trainer.train()
```
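
After training, the small LoRA adapter can be saved and reattached to the base model for inference. A minimal sketch, assuming the objects defined above; the adapter directory name is illustrative:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch

# Save only the (small) LoRA adapter weights
model.save_pretrained("./trouter-finetuned/adapter")

# Later: reload the base model and attach the adapter for inference
base_model = AutoModelForCausalLM.from_pretrained(
    "your-username/Trouter-20B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./trouter-finetuned/adapter")

# Optionally merge the adapter into the base weights for standalone inference
model = model.merge_and_unload()
```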

## Performance Optimization

### GPU Memory Requirements

- **Full precision (bfloat16)**: ~40GB VRAM
- **8-bit quantization**: ~20GB VRAM
- **4-bit quantization**: ~10GB VRAM
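
To check which tier a machine can support, a quick sketch using PyTorch's CUDA utilities (standalone, no model loaded):

```python
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        total_gb = torch.cuda.get_device_properties(i).total_memory / 1024**3
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}, {total_gb:.1f} GB VRAM")
else:
    print("No CUDA device found; running a 20B model on CPU is impractical.")
```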

### Recommendations

- Use `device_map="auto"` for automatic multi-GPU distribution
- On PyTorch 2.0+, enable `torch.compile()` for faster inference (see the sketch after the Flash Attention example below)
- Use Flash Attention 2, if available, for better performance

```python
# Enable Flash Attention 2
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)
```
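
For `torch.compile()`, a minimal sketch assuming the model and tokenizer loaded earlier; compiling `model.forward` lets `generate` run through the compiled forward pass. The first generations are slower while compilation happens, and speedups vary by GPU and PyTorch version:

```python
import torch

# Compile the model's forward pass (requires PyTorch 2.0+)
model.forward = torch.compile(model.forward)

inputs = tokenizer("Explain quantum computing in simple terms:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```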

## Troubleshooting

### Out of Memory Errors

1. Use quantization (8-bit or 4-bit)
2. Reduce `max_new_tokens`
3. Decrease batch size
4. Enable gradient checkpointing for fine-tuning (see the sketch below)
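
For item 4, gradient checkpointing trades extra compute for lower activation memory during fine-tuning. A minimal sketch using the standard `transformers` hooks, applied to the model before training:

```python
# Recompute activations during the backward pass instead of storing them all
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is incompatible with checkpointing

# Alternatively, enable it through the training arguments:
# training_args = TrainingArguments(..., gradient_checkpointing=True)
```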

### Slow Generation

1. Use a smaller `max_new_tokens`
2. Disable `do_sample` for greedy decoding
3. Use Flash Attention 2
4. Consider model quantization

### Poor Quality Outputs

1. Adjust `temperature` (0.7-0.9 recommended)
2. Tune `top_p` and `top_k` values
3. Add `repetition_penalty` (1.1-1.3)
4. Ensure proper prompt formatting

## Community and Support

- **Issues**: [GitHub Issues](https://github.com/your-username/Trouter-20B/issues)
- **Discussions**: [Hugging Face Discussions](https://huggingface.co/your-username/Trouter-20B/discussions)
- **Discord**: [Community Discord](#)

## Citation

If you use Trouter-20B in your research, please cite:

```bibtex
@software{trouter20b2025,
  title={Trouter-20B: A 20 Billion Parameter Language Model},
  author={Your Name},
  year={2025},
  url={https://huggingface.co/your-username/Trouter-20B}
}
```