Instructions for using amazon/MistralLite with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use amazon/MistralLite with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="amazon/MistralLite")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("amazon/MistralLite")
model = AutoModelForCausalLM.from_pretrained("amazon/MistralLite")
```

- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use amazon/MistralLite with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "amazon/MistralLite"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "amazon/MistralLite",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker
```shell
docker model run hf.co/amazon/MistralLite
```
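The completions endpoint shown in the curl example can also be called from Python. A minimal sketch using only the standard library (the URL and payload mirror the curl example above; it assumes the vLLM server is already running on port 8000):

```python
import json
import urllib.request

def build_completion_request(prompt: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build an OpenAI-compatible /v1/completions request for the local server."""
    payload = {
        "model": "amazon/MistralLite",
        "prompt": prompt,
        "max_tokens": 512,
        "temperature": 0.5,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def complete(prompt: str) -> str:
    """Send the request and return the first completion's text."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("Once upon a time,")` returns the generated continuation. The same sketch works for the SGLang server below by changing `base_url` to port 30000.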
- SGLang
How to use amazon/MistralLite with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "amazon/MistralLite" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "amazon/MistralLite",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "amazon/MistralLite" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "amazon/MistralLite",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

- Docker Model Runner
How to use amazon/MistralLite with Docker Model Runner:
```shell
docker model run hf.co/amazon/MistralLite
```
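Whichever route you serve it through, note that MistralLite was fine-tuned with its own prompt format, so instruction-style queries should use the template from the model card rather than a raw string. A minimal sketch (the `<|prompter|>`/`</s><|assistant|>` markers are taken from the model card; verify them against the current card before relying on them):

```python
def format_mistrallite_prompt(question: str) -> str:
    """Wrap a user question in MistralLite's documented prompt template."""
    return f"<|prompter|>{question}</s><|assistant|>"

prompt = format_mistrallite_prompt(
    "What are the main challenges to support a long context for LLM?"
)
print(prompt)
```

The resulting string can be passed directly to the Transformers pipeline, or used as the `prompt` field in the OpenAI-compatible requests shown above.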
Error during model loading: CUDA error: out of memory
Using GPU: NVIDIA GeForce RTX 4090. I am having difficulty running a simple fine-tuning test with the amazon/MistralLite model: it fails with an 'out of memory' error. Despite the clear error message, I have tried various ways to run the fine-tuning. Has anyone experienced this and can help?
My script:
```python
import torch
from transformers import AutoTokenizer

# Define the device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Function to print GPU memory usage
def print_memory_usage(description="Memory status"):
    print(description)
    print(f"Total memory: {torch.cuda.get_device_properties(0).total_memory / 1e9} GB")
    print(f"Used memory: {torch.cuda.memory_allocated(0) / 1e9} GB")
    print(f"Cached memory: {torch.cuda.memory_reserved(0) / 1e9} GB")

# Function to print GPU information
def print_gpu_info():
    if torch.cuda.is_available():
        gpu_name = torch.cuda.get_device_name(0)
        total_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"Using GPU: {gpu_name}")
        print(f"Total GPU memory: {total_memory:.2f} GB")
    else:
        print("No GPU available, using CPU.")

# Function to estimate the required memory
def estimate_memory_requirements(model, batch_size, seq_length):
    # Memory for the model parameters
    param_size = sum(p.numel() for p in model.parameters()) * 4  # 4 bytes per float32
    # Memory for the input and output data (batch size * seq length * 4 bytes per float32)
    data_size = batch_size * seq_length * 4
    # Memory for gradients (assuming gradients take the same space as the parameters)
    grad_size = param_size
    total_memory = param_size + data_size + grad_size
    print(f"Estimated memory requirements: {total_memory / 1e9} GB")

# Print GPU information
print_gpu_info()

# Model and data settings
model_name = "amazon/MistralLite"
tokenizer = AutoTokenizer.from_pretrained(model_name)
batch_size = 8  # Adjust as needed
seq_length = 512  # Maximum sequence length
```
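The script's own estimator already predicts the failure. MistralLite is based on Mistral-7B, roughly 7 billion parameters (an approximate figure used here for illustration), and full fine-tuning in float32 with Adam needs memory for the weights, the gradients, and two optimizer moments per parameter. A back-of-the-envelope sketch:

```python
# Rough memory budget for full float32 fine-tuning of a ~7B-parameter model
# with Adam, compared against a 24 GB RTX 4090. Activations are excluded,
# so the real requirement is even higher.
params = 7.0e9  # approximate parameter count (assumption, Mistral-7B based)

weights_gb = params * 4 / 1e9      # float32 weights: 4 bytes each
grads_gb = params * 4 / 1e9        # float32 gradients: 4 bytes each
adam_states_gb = params * 8 / 1e9  # Adam keeps two float32 moments: 8 bytes each

total_gb = weights_gb + grads_gb + adam_states_gb
print(f"Weights: {weights_gb:.0f} GB, gradients: {grads_gb:.0f} GB, "
      f"optimizer states: {adam_states_gb:.0f} GB, total: {total_gb:.0f} GB")

gpu_gb = 24  # RTX 4090 memory
print(f"Fits on a 24 GB GPU: {total_gb <= gpu_gb}")
```

The weights alone (~28 GB in float32) already exceed the 4090's 24 GB, which is why the error appears at load time before training even starts, and no batch size or sequence length will fix it. Common mitigations are loading the model in half precision or 4-bit quantization and using parameter-efficient fine-tuning (e.g. LoRA) instead of updating all weights.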