# Trouter-20B Quick Start Guide
Get up and running with Trouter-20B in minutes.
## Installation
```bash
pip install transformers torch accelerate bitsandbytes
```
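Before loading the model, it can help to confirm that a CUDA-capable GPU is visible to PyTorch, since the VRAM figures below assume GPU inference. A quick check (assumes a single-GPU setup):
```python
import torch

# Confirm a GPU is visible; Trouter-20B needs roughly 10-40GB VRAM
# depending on the precision you choose below.
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB VRAM")
else:
    print("No CUDA GPU detected - generation will be very slow on CPU.")
```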
## Basic Usage
### Option 1: Full Precision (BF16, Requires ~40GB VRAM)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")
prompt = "Explain machine learning:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Option 2: 4-bit Quantization (Requires ~10GB VRAM) ⭐ Recommended
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")
prompt = "Explain machine learning:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Chat Interface
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
# Load model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")
# Create conversation
messages = [
    {"role": "user", "content": "What is quantum computing?"}
]
# Apply chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    temperature=0.7,
    top_p=0.95,
    do_sample=True
)
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
# Continue conversation
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "Can you explain it more simply?"})
```
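To generate the assistant's next turn, reapply the chat template to the updated `messages` list and generate again. A minimal sketch, reusing the model and tokenizer loaded above:
```python
# Re-apply the chat template so the full conversation is in the prompt,
# then generate the reply to the follow-up question.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=300, temperature=0.7, top_p=0.95, do_sample=True)
follow_up = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(follow_up)
```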
## Generation Parameters
Adjust these for different use cases:
### Creative Writing (More Random)
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    temperature=0.9,  # Higher = more creative
    top_p=0.95,
    top_k=50,
    do_sample=True
)
```
### Factual/Technical (More Deterministic)
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    temperature=0.3,  # Lower = more focused
    top_p=0.9,
    do_sample=True
)
```
### Code Generation (Precise)
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=400,
    temperature=0.2,
    top_p=0.95,
    repetition_penalty=1.1,
    do_sample=True
)
```
## Memory Requirements
| Configuration | VRAM Required | Setup |
|--------------|---------------|-------|
| **Full (BF16)** | ~40GB | `torch_dtype=torch.bfloat16` |
| **8-bit** | ~20GB | `load_in_8bit=True` |
| **4-bit** | ~10GB | 4-bit quantization config |
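The 4-bit path is shown above. For the 8-bit row, a minimal sketch using `BitsAndBytesConfig` (recent transformers versions prefer this over passing `load_in_8bit=True` directly):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 8-bit loading: roughly half the memory of BF16 with a small quality trade-off.
model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")
```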
## Common Issues
### Out of Memory
- Use 4-bit quantization
- Reduce `max_new_tokens`
- Clear GPU cache: `torch.cuda.empty_cache()`
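If a previously loaded model is still in memory, drop it before clearing the cache so the call actually frees VRAM; a minimal sketch:
```python
import gc
import torch

# Release references to the old model, then free cached GPU memory.
del model
gc.collect()
torch.cuda.empty_cache()
```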
### Slow Generation
- Use smaller `max_new_tokens`
- Set `do_sample=False` for greedy decoding (see the sketch after this list)
- Reduce batch size
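Greedy decoding skips sampling entirely and produces deterministic output, which is often faster for short factual completions. For example:
```python
# Greedy decoding: picks the most likely token at each step (deterministic).
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```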
### Poor Quality
- Adjust `temperature` (0.7-0.9 works well for most tasks)
- Increase `max_new_tokens`
- Rephrase the prompt or add more context
## Next Steps
- See [USAGE_GUIDE.md](./USAGE_GUIDE.md) for advanced examples
- Check [examples.py](./examples.py) for code samples
- Read [EVALUATION.md](./EVALUATION.md) for benchmark results
## Simple Copy-Paste Example
```python
# Install first: pip install transformers torch accelerate bitsandbytes
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
# Load model (4-bit for efficiency)
model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    ),
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")
# Generate text
prompt = "Write a Python function to calculate factorial:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
That's it! You're ready to use Trouter-20B.