Trouter-20B Quick Start Guide

Get up and running with Trouter-20B in minutes.

Installation

pip install transformers torch accelerate bitsandbytes
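Before loading the model, it's worth a quick sanity check that PyTorch can see a CUDA device, since a 20B model on CPU alone will be extremely slow:

import torch

# Quick sanity check before downloading ~40GB of weights:
# confirm that PyTorch can see a CUDA device.
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected; generation will fall back to CPU and be very slow.")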

Basic Usage

Option 1: Full Precision (Requires ~40GB VRAM)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")

prompt = "Explain machine learning:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Option 2: 4-bit Quantization (Requires ~10GB VRAM) ⭐ Recommended

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")

prompt = "Explain machine learning:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Chat Interface

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")

# Create conversation
messages = [
    {"role": "user", "content": "What is quantum computing?"}
]

# Apply chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    temperature=0.7,
    top_p=0.95,
    do_sample=True
)

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

# Continue conversation
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "Can you explain it more simply?"})

Generation Parameters

Adjust these for different use cases:

Creative Writing (More Random)

outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    temperature=0.9,      # Higher = more creative
    top_p=0.95,
    top_k=50,
    do_sample=True
)

Factual/Technical (More Deterministic)

outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    temperature=0.3,      # Lower = more focused
    top_p=0.9,
    do_sample=True
)

Code Generation (Precise)

outputs = model.generate(
    **inputs,
    max_new_tokens=400,
    temperature=0.2,
    top_p=0.95,
    repetition_penalty=1.1,
    do_sample=True
)

Memory Requirements

| Configuration | VRAM Required | Setup |
|---------------|---------------|-------|
| Full (BF16)   | ~40GB         | torch_dtype=torch.bfloat16 |
| 8-bit         | ~20GB         | load_in_8bit=True |
| 4-bit         | ~10GB         | 4-bit quantization config (Option 2 above) |
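
The 8-bit path is loaded the same way as Option 2; a minimal sketch using BitsAndBytesConfig(load_in_8bit=True):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit loading: roughly half the memory of BF16, same from_pretrained call.
model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)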

Common Issues

Out of Memory

  • Use 4-bit quantization
  • Reduce max_new_tokens
  • Clear the GPU cache with torch.cuda.empty_cache() (see the snippet below)
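
If a previous load attempt left the GPU full, free the old objects before retrying; a minimal sketch:

import gc
import torch

# Drop Python references to the old model, then release cached CUDA memory.
del model
gc.collect()
torch.cuda.empty_cache()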

Slow Generation

  • Use smaller max_new_tokens
  • Set do_sample=False for greedy decoding (example below)
  • Reduce batch size
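
Greedy decoding skips the sampling machinery and is deterministic, which also makes it a useful speed baseline:

# Greedy decoding: fastest setting for a single deterministic completion.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))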

Poor Quality

  • Adjust temperature (0.7-0.9 for most tasks)
  • Increase max_new_tokens
  • Try different, more specific prompts (example below)
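
Prompt specificity often matters as much as the sampling settings. A sketch combining a more detailed request with a mid-range temperature (the prompt text here is just an illustration):

messages = [{"role": "user", "content": "Explain quantum computing to a high-school student in three short paragraphs."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# A mid-range temperature (0.7-0.9) suits most general tasks.
outputs = model.generate(**inputs, max_new_tokens=400, temperature=0.8, top_p=0.95, do_sample=True)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))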

Simple Copy-Paste Example

# Install first: pip install transformers torch accelerate bitsandbytes

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load model (4-bit for efficiency)
model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    ),
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")

# Generate text
prompt = "Write a Python function to calculate factorial:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

That's it! You're ready to use Trouter-20B.