# Trouter-20B Usage Guide
## Installation
```bash
pip install transformers torch accelerate bitsandbytes
```
## Quick Start
### Basic Text Generation
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load model and tokenizer
model_name = "your-username/Trouter-20B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
# Generate text
prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.95,
    do_sample=True
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Chat Interface
```python
def chat(messages, max_new_tokens=512):
    """
    Chat with the model using a conversation history.

    Args:
        messages: List of dicts with 'role' and 'content' keys
        max_new_tokens: Maximum tokens to generate

    Example:
        messages = [
            {"role": "user", "content": "What is machine learning?"}
        ]
    """
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens, skipping the prompt
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response
# Example usage
conversation = [
    {"role": "user", "content": "Hello! Can you help me with Python?"}
]
response = chat(conversation)
print(response)
# Continue conversation
conversation.append({"role": "assistant", "content": response})
conversation.append({"role": "user", "content": "Show me how to read a CSV file."})
response = chat(conversation)
print(response)
```
### Memory-Efficient Loading (8-bit Quantization)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "your-username/Trouter-20B"
# Load in 8-bit for reduced memory usage
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
    torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
### 4-bit Quantization (Even Lower Memory)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_name = "your-username/Trouter-20B"
# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
## Advanced Usage
### Batch Generation
```python
# Decoder-only models need a pad token and left padding for batched generation
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = [
    "Write a poem about AI:",
    "Explain neural networks:",
    "What is reinforcement learning?"
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.8,
    top_p=0.95,
    num_return_sequences=1,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for prompt, response in zip(prompts, responses):
    print(f"Prompt: {prompt}")
    print(f"Response: {response}\n")
```
### Streaming Generation
```python
from transformers import TextIteratorStreamer
from threading import Thread
# skip_prompt=True streams only the newly generated text, not the echoed prompt
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "Write a story about a robot:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

generation_kwargs = {
    **inputs,
    "max_new_tokens": 256,
    "temperature": 0.7,
    "do_sample": True,
    "streamer": streamer
}

# Run generation in a background thread and consume tokens as they arrive
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

print("Generated text: ", end="")
for new_text in streamer:
    print(new_text, end="", flush=True)
print()
```
### Custom Generation Parameters
```python
# Creative generation
creative_output = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=1.0,        # Higher = more creative
    top_p=0.95,
    top_k=50,
    repetition_penalty=1.2,
    do_sample=True
)

# Deterministic generation (sampling disabled, so temperature/top_p are ignored)
deterministic_output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
    num_beams=4             # Beam search for quality
)
```
## Fine-tuning
### Using PEFT (Parameter-Efficient Fine-Tuning)
```python
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments, Trainer
# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
# Apply LoRA to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Training arguments
training_args = TrainingArguments(
    output_dir="./trouter-finetuned",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    logging_steps=10,
    save_steps=100,
    fp16=True
)
# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)
trainer.train()
```
## Performance Optimization
### GPU Memory Requirements
- **Full precision (bfloat16)**: ~40GB VRAM
- **8-bit quantization**: ~20GB VRAM
- **4-bit quantization**: ~10GB VRAM
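To see which loading mode fits your hardware, you can query the total memory of each visible GPU before loading. This is a minimal sketch using standard PyTorch CUDA utilities; it assumes a CUDA-capable machine.
```python
import torch

# Print the total memory of each visible GPU to decide between bf16, 8-bit, and 4-bit loading
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB total")
```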
### Recommendations
- Use `device_map="auto"` for automatic multi-GPU distribution
- Enable `torch.compile()` (PyTorch 2.0+) for faster inference; see the sketch after the Flash Attention example below
- Use Flash Attention 2 if available for better performance
```python
# Enable Flash Attention 2
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)
```
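For the `torch.compile()` recommendation above, a minimal sketch (PyTorch 2.0+). The first generation after compiling is slower while the forward pass is traced and compiled; subsequent calls reuse the compiled graph.
```python
import torch

# Compile the model's forward pass; generate() still works on the compiled module
model = torch.compile(model)

# First call pays the compilation cost; later calls are faster
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
```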
## Troubleshooting
### Out of Memory Errors
1. Use quantization (8-bit or 4-bit)
2. Reduce `max_new_tokens`
3. Decrease batch size
4. Enable gradient checkpointing for fine-tuning (see the sketch below)
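For item 4, gradient checkpointing trades extra compute for lower activation memory during training. A minimal sketch using the standard Transformers methods:
```python
# Recompute activations in the backward pass instead of storing them
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the generation KV cache is incompatible with checkpointing during training
```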
### Slow Generation
1. Use smaller `max_new_tokens`
2. Disable `do_sample` for greedy decoding (see the sketch below)
3. Use Flash Attention 2
4. Consider model quantization
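A minimal sketch combining items 1 and 2: a smaller token budget with greedy decoding, which avoids sampling overhead entirely.
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=64,   # smaller budget directly bounds generation time
    do_sample=False      # greedy decoding; sampling parameters are ignored
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```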
### Poor Quality Outputs
1. Adjust `temperature` (0.7-0.9 recommended)
2. Tune `top_p` and `top_k` values
3. Add `repetition_penalty` (1.1-1.3)
4. Ensure proper prompt formatting
## Community and Support
- **Issues**: [GitHub Issues](https://github.com/your-username/Trouter-20B/issues)
- **Discussions**: [Hugging Face Discussions](https://huggingface.co/your-username/Trouter-20B/discussions)
- **Discord**: [Community Discord](#)
## Citation
If you use Trouter-20B in your research, please cite:
```bibtex
@software{trouter20b2025,
  title={Trouter-20B: A 20 Billion Parameter Language Model},
  author={Your Name},
  year={2025},
  url={https://huggingface.co/your-username/Trouter-20B}
}
```