# Using SGLang as Backend

Lighteval can use SGLang as an inference backend, which can speed up model evaluation considerably.
To use it, run the `lighteval sglang` command and set `model_args` to the arguments you want to pass to SGLang.

## Basic Usage

```bash
lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
    truthfulqa:mc
```

## Parallelism Options

SGLang can use multiple GPUs in two ways: tensor parallelism shards the model's weights across GPUs, while data parallelism runs a full copy of the model on each GPU and splits the evaluation requests between the copies.
Select the method by setting the corresponding parameter in `model_args`.

### Tensor Parallelism

For example, with 4 GPUs you can shard the model across all of them by setting `tp_size`:

```bash
lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tp_size=4" \
    truthfulqa:mc
```

### Data Parallelism

If your model fits on a single GPU, you can use data parallelism with `dp_size` to speed up the evaluation:

```bash
lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,dp_size=4" \
    truthfulqa:mc
```
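
The two options can in principle be combined: SGLang accepts both `tp_size` and `dp_size`, and the number of GPUs used is then their product. The sketch below is an assumption rather than a documented Lighteval recipe. It derives `dp_size` from a GPU count and a chosen `tp_size`, builds the `model_args` string, and prints the resulting command instead of running it. The variable names (`NUM_GPUS`, `TP_SIZE`, `DP_SIZE`, `MODEL_ARGS`) are illustrative.

```bash
# Sketch: derive a tp_size/dp_size split for a given GPU count (illustrative names).
NUM_GPUS=4                        # GPUs available on the node (assumed)
TP_SIZE=2                         # GPUs each model replica is sharded across
DP_SIZE=$((NUM_GPUS / TP_SIZE))   # number of replicas that fit on those GPUs

# Guard against a split that does not use the GPUs evenly.
if [ $((TP_SIZE * DP_SIZE)) -ne "$NUM_GPUS" ]; then
    echo "tp_size (${TP_SIZE}) must divide the GPU count (${NUM_GPUS})" >&2
    exit 1
fi

MODEL_ARGS="model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tp_size=${TP_SIZE},dp_size=${DP_SIZE}"

# Print the command instead of running it; drop the echo to launch the evaluation.
echo lighteval sglang "\"${MODEL_ARGS}\"" truthfulqa:mc
```

With 4 GPUs and `tp_size=2`, this yields two replicas, each sharded across two GPUs.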

## Using a Configuration File

For more advanced configurations, you can describe the model in a YAML configuration file and pass its path in place of the `model_args` string.
An example file is available at `examples/model_configs/sglang_model_config.yaml`; its contents are reproduced after the command below.

```bash
lighteval sglang \
    "examples/model_configs/sglang_model_config.yaml" \
    truthfulqa:mc
```

> [!TIP]
> The SGLang server arguments are described in the [SGLang documentation](https://docs.sglang.ai/backend/server_arguments.html).

```yaml
model_parameters:
    model_name: "HuggingFaceTB/SmolLM-1.7B-Instruct"
    dtype: "auto"
    tp_size: 1
    dp_size: 1
    context_length: null
    random_seed: 1
    trust_remote_code: False
    device: "cuda"
    skip_tokenizer_init: False
    kv_cache_dtype: "auto"
    add_special_tokens: True
    pairwise_tokenization: False
    sampling_backend: null
    attention_backend: null
    mem_fraction_static: 0.8
    chunked_prefill_size: 4096
    generation_parameters:
      max_new_tokens: 1024
      min_new_tokens: 0
      temperature: 1.0
      top_k: 50
      min_p: 0.0
      top_p: 1.0
      presence_penalty: 0.0
      repetition_penalty: 1.0
      frequency_penalty: 0.0
```

> [!WARNING]
> If you run into out-of-memory (OOM) errors, try reducing the model's context size (`context_length`)
> as well as the `mem_fraction_static` and `chunked_prefill_size` parameters.
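
As a rough illustration of the advice above, the following sketch writes a smaller configuration file with all three values lowered and prints the command that would use it. The file name `sglang_low_mem.yaml` and the specific numbers are hypothetical starting points, not tuned recommendations, and the sketch assumes that keys omitted from the file fall back to their defaults.

```bash
# Sketch: write a reduced-memory config and point lighteval at it.
# The lowered values are illustrative, not tuned recommendations.
CONFIG_FILE="sglang_low_mem.yaml"   # hypothetical file name

cat > "$CONFIG_FILE" <<'EOF'
model_parameters:
    model_name: "HuggingFaceTB/SmolLM-1.7B-Instruct"
    dtype: "auto"
    context_length: 2048
    mem_fraction_static: 0.7
    chunked_prefill_size: 2048
EOF

# Print the command instead of running it; drop the echo to launch the evaluation.
echo lighteval sglang "\"${CONFIG_FILE}\"" truthfulqa:mc
```

If the run still fails, lower one value at a time so you can tell which setting made the difference.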

## Key SGLang Parameters

### Memory Management
- `mem_fraction_static`: Fraction of GPU memory reserved for static allocations, that is, the model weights and the KV cache pool (default: 0.8)
- `chunked_prefill_size`: Maximum number of tokens processed per chunk during prefill (default: 4096)
- `context_length`: Maximum context length for the model
- `kv_cache_dtype`: Data type for key-value cache

### Parallelism Settings
- `tp_size`: Number of GPUs each model copy is sharded across (tensor parallelism)
- `dp_size`: Number of model copies run in parallel (data parallelism)

### Model Configuration
- `dtype`: Data type for model weights ("auto", "float16", "bfloat16", etc.)
- `device`: Device to run the model on ("cuda", "cpu")
- `trust_remote_code`: Whether to trust remote code from the model
- `skip_tokenizer_init`: Skip tokenizer initialization for faster startup

### Generation Parameters
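- `temperature`: Controls randomness in generation (0.0 = deterministic, greedy decoding; 1.0 = sample from the model's unscaled distribution; higher values are more random)
- `top_p`: Nucleus sampling threshold; sampling is restricted to the smallest set of tokens whose cumulative probability reaches `top_p`
- `top_k`: Restricts sampling to the `top_k` most likely tokens
- `max_new_tokens`: Maximum number of tokens to generate
- `repetition_penalty`: Multiplicative penalty on tokens that have already appeared (1.0 = no penalty)
- `presence_penalty`: Fixed penalty on tokens that have appeared at least once (0.0 = no penalty)
- `frequency_penalty`: Penalty that grows with how often a token has appeared (0.0 = no penalty)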

