# Trouter-20B Quick Start Guide

Get up and running with Trouter-20B in minutes.

## Installation

```bash
pip install transformers torch accelerate bitsandbytes
```
## Basic Usage

### Option 1: Full Precision (Requires ~40GB VRAM)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")

prompt = "Explain machine learning:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Option 2: 4-bit Quantization (Requires ~10GB VRAM) ⭐ Recommended

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")

prompt = "Explain machine learning:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Chat Interface

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")

# Create conversation
messages = [
    {"role": "user", "content": "What is quantum computing?"}
]

# Apply chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    temperature=0.7,
    top_p=0.95,
    do_sample=True
)
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

# Continue conversation
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "Can you explain it more simply?"})
```
## Generation Parameters

Adjust these for different use cases:

### Creative Writing (More Random)

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    temperature=0.9,  # Higher = more creative
    top_p=0.95,
    top_k=50,
    do_sample=True
)
```
### Factual/Technical (More Deterministic)

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    temperature=0.3,  # Lower = more focused
    top_p=0.9,
    do_sample=True
)
```
### Code Generation (Precise)

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=400,
    temperature=0.2,
    top_p=0.95,
    repetition_penalty=1.1,
    do_sample=True
)
```
## Memory Requirements

| Configuration | VRAM Required | Setup |
|---------------|---------------|-------|
| **Full (BF16)** | ~40GB | `torch_dtype=torch.bfloat16` |
| **8-bit** | ~20GB | `load_in_8bit=True` |
| **4-bit** | ~10GB | 4-bit quantization config |
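
The 8-bit row is the only configuration not shown above; it uses the same `BitsAndBytesConfig` route as the 4-bit examples. A sketch:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit: roughly half the memory of BF16, typically with a smaller
# quality hit than 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
```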
## Common Issues

### Out of Memory

- Use 4-bit quantization
- Reduce `max_new_tokens`
- Clear the GPU cache with `torch.cuda.empty_cache()` (see the sketch below)
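
A typical cleanup pattern between runs, assuming `inputs` and `outputs` are the tensors from the examples above:

```python
import gc
import torch

# Release Python-level references first, then PyTorch's cached blocks
del outputs, inputs
gc.collect()
torch.cuda.empty_cache()

# Optional: see what is actually allocated vs. merely cached
print(torch.cuda.memory_allocated() / 1e9, "GB allocated")
print(torch.cuda.memory_reserved() / 1e9, "GB reserved")
```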
### Slow Generation

- Use a smaller `max_new_tokens`
- Set `do_sample=False` for greedy decoding (see the sketch below)
- Reduce the batch size
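
Greedy decoding drops the sampling step entirely and is deterministic, which also makes it a reasonable speed baseline; a minimal sketch:

```python
# Greedy decoding: deterministic and slightly faster; temperature
# and top_p are ignored once do_sample=False
outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    do_sample=False
)
```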
### Poor Quality

- Adjust temperature (0.7-0.9 suits most tasks; a quick sweep is sketched below)
- Increase `max_new_tokens`
- Try different prompts
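
Rather than adjusting blindly, it can help to compare the same prompt at a few temperatures; the values here are just illustrative starting points:

```python
# Compare the same prompt at several temperatures to find a sweet spot
for temp in (0.3, 0.7, 0.9):
    outputs = model.generate(
        **inputs, max_new_tokens=100, temperature=temp, do_sample=True
    )
    print(f"--- temperature={temp} ---")
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```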
## Next Steps

- See [USAGE_GUIDE.md](./USAGE_GUIDE.md) for advanced examples
- Check [examples.py](./examples.py) for code samples
- Read [EVALUATION.md](./EVALUATION.md) for benchmark results
## Simple Copy-Paste Example

```python
# Install first: pip install transformers torch accelerate bitsandbytes
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load model (4-bit for efficiency)
model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    ),
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")

# Generate text (do_sample=True so the temperature setting takes effect)
prompt = "Write a Python function to calculate factorial:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
| That's it! You're ready to use Trouter-20B. |