+++
disableToc = false
title = "Model Configuration"
weight = 23
url = '/advanced/model-configuration'
+++
LocalAI uses YAML configuration files to define model parameters, templates, and behavior. This page provides a complete reference for all available configuration options.
## Overview
Model configuration files allow you to:
- Define default parameters (temperature, top_p, etc.)
- Configure prompt templates
- Specify backend settings
- Set up function calling
- Configure GPU and memory options
- And much more
## Configuration File Locations
You can create model configuration files in several ways:
1. **Individual YAML files** in the models directory (e.g., `models/gpt-3.5-turbo.yaml`)
2. **Single config file** with multiple models using `--models-config-file` or `LOCALAI_MODELS_CONFIG_FILE`
3. **Remote URLs** - specify a URL to a YAML configuration file at startup
### Example: Basic Configuration
```yaml
name: gpt-3.5-turbo
parameters:
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  temperature: 0.3
context_size: 512
threads: 10
backend: llama-stable
template:
  completion: completion
  chat: chat
```
### Example: Multiple Models in One File
When using `--models-config-file`, you can define multiple models as a list:
```yaml
- name: model1
  parameters:
    model: model1.bin
  context_size: 512
  backend: llama-stable
- name: model2
  parameters:
    model: model2.bin
  context_size: 1024
  backend: llama-stable
```
## Core Configuration Fields
### Basic Model Settings
| Field | Type | Description | Example |
|-------|------|-------------|---------|
| `name` | string | Model name, used to identify the model in API calls | `gpt-3.5-turbo` |
| `backend` | string | Backend to use (e.g. `llama-cpp`, `vllm`, `diffusers`, `whisper`) | `llama-cpp` |
| `description` | string | Human-readable description of the model | `A conversational AI model` |
| `usage` | string | Usage instructions or notes | `Best for general conversation` |
### Model File and Downloads
| Field | Type | Description |
|-------|------|-------------|
| `parameters.model` | string | Path to the model file (relative to models directory) or URL |
| `download_files` | array | List of files to download. Each entry has `filename`, `uri`, and optional `sha256` |
**Example:**
```yaml
parameters:
  model: my-model.gguf
download_files:
  - filename: my-model.gguf
    uri: https://example.com/model.gguf
    sha256: abc123...
```
## Parameters Section
The `parameters` section contains all OpenAI-compatible request parameters and model-specific options.
### OpenAI-Compatible Parameters
These settings are used as defaults for all API calls to this model.
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `temperature` | float | `0.9` | Sampling temperature (0.0-2.0). Higher values make output more random |
| `top_p` | float | `0.95` | Nucleus sampling: consider tokens with top_p probability mass |
| `top_k` | int | `40` | Consider only the top K most likely tokens |
| `max_tokens` | int | `0` | Maximum number of tokens to generate (0 = unlimited) |
| `frequency_penalty` | float | `0.0` | Penalty for token frequency (-2.0 to 2.0) |
| `presence_penalty` | float | `0.0` | Penalty for token presence (-2.0 to 2.0) |
| `repeat_penalty` | float | `1.1` | Penalty for repeating tokens |
| `repeat_last_n` | int | `64` | Number of previous tokens to consider for repeat penalty |
| `seed` | int | `-1` | Random seed (`-1` = random) |
| `echo` | bool | `false` | Echo back the prompt in the response |
| `n` | int | `1` | Number of completions to generate |
| `logprobs` | bool/int | `false` | Return log probabilities of tokens |
| `top_logprobs` | int | `0` | Number of top logprobs to return per token (0-20) |
| `logit_bias` | map | `{}` | Map of token IDs to bias values (-100 to 100) |
| `typical_p` | float | `1.0` | Typical sampling parameter |
| `tfz` | float | `1.0` | Tail-free sampling (TFS) z parameter |
| `keep` | int | `0` | Number of tokens to keep from the prompt |
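For example, to set conservative sampling defaults for a model (the values below are purely illustrative, not recommendations):
```yaml
parameters:
  temperature: 0.2
  top_p: 0.9
  top_k: 40
  max_tokens: 1024
  repeat_penalty: 1.1
  seed: 42
```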
### Language and Translation
| Field | Type | Description |
|-------|------|-------------|
| `language` | string | Language code for transcription/translation |
| `translate` | bool | Whether to translate audio transcription |
### Custom Parameters
| Field | Type | Description |
|-------|------|-------------|
| `batch` | int | Batch size for processing |
| `ignore_eos` | bool | Ignore end-of-sequence tokens |
| `negative_prompt` | string | Negative prompt for image generation |
| `rope_freq_base` | float32 | RoPE frequency base |
| `rope_freq_scale` | float32 | RoPE frequency scale |
| `negative_prompt_scale` | float32 | Scale for negative prompt |
| `tokenizer` | string | Tokenizer to use (RWKV) |
## LLM Configuration
These settings apply to most LLM backends (llama.cpp, vLLM, etc.):
### Performance Settings
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `threads` | int | `processor count` | Number of threads for parallel computation |
| `context_size` | int | `512` | Maximum context size (number of tokens) |
| `f16` | bool | `false` | Enable 16-bit floating point precision (GPU acceleration) |
| `gpu_layers` | int | `0` | Number of layers to offload to GPU (0 = CPU only) |
### Memory Management
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `mmap` | bool | `true` | Use memory mapping for model loading (faster, less RAM) |
| `mmlock` | bool | `false` | Lock model in memory (prevents swapping) |
| `low_vram` | bool | `false` | Use minimal VRAM mode |
| `no_kv_offloading` | bool | `false` | Disable KV cache offloading |
### GPU Configuration
| Field | Type | Description |
|-------|------|-------------|
| `tensor_split` | string | Comma-separated GPU memory allocation (e.g., `"0.8,0.2"` for 80%/20%) |
| `main_gpu` | string | Main GPU identifier for multi-GPU setups |
| `cuda` | bool | Explicitly enable/disable CUDA |
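Putting the performance, memory, and GPU fields together, a sketch of a two-GPU offloading setup might look like this (the layer count and split are illustrative and depend on your hardware):
```yaml
name: my-gpu-model
backend: llama-cpp
parameters:
  model: my-model.gguf
threads: 8
context_size: 4096
f16: true
gpu_layers: 35           # number of layers to offload to the GPUs
mmap: true               # memory-map the model file
tensor_split: "0.8,0.2"  # 80% of layers on GPU 0, 20% on GPU 1
main_gpu: "0"
```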
### Sampling and Generation
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `mirostat` | int | `0` | Mirostat sampling mode (0=disabled, 1=Mirostat, 2=Mirostat 2.0) |
| `mirostat_tau` | float | `5.0` | Mirostat target entropy |
| `mirostat_eta` | float | `0.1` | Mirostat learning rate |
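For example, to enable Mirostat 2.0 sampling (the tuning values shown are the documented defaults):
```yaml
mirostat: 2        # 0=disabled, 1=Mirostat, 2=Mirostat 2.0
mirostat_tau: 5.0  # target entropy
mirostat_eta: 0.1  # learning rate
```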
### LoRA Configuration
| Field | Type | Description |
|-------|------|-------------|
| `lora_adapter` | string | Path to LoRA adapter file |
| `lora_base` | string | Base model for LoRA |
| `lora_scale` | float32 | LoRA scale factor |
| `lora_adapters` | array | Multiple LoRA adapters |
| `lora_scales` | array | Scales for multiple LoRA adapters |
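A minimal sketch of a single-adapter setup (the file paths are hypothetical):
```yaml
lora_adapter: loras/my-adapter.bin  # hypothetical adapter path
lora_base: my-base-model.gguf       # hypothetical base model
lora_scale: 0.8
```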
### Advanced Options
| Field | Type | Description |
|-------|------|-------------|
| `no_mulmatq` | bool | Disable the mul_mat_q quantized matrix multiplication kernels (llama.cpp) |
| `draft_model` | string | Draft model for speculative decoding |
| `n_draft` | int32 | Number of draft tokens |
| `quantization` | string | Quantization format |
| `load_format` | string | Model load format |
| `numa` | bool | Enable NUMA (Non-Uniform Memory Access) |
| `rms_norm_eps` | float32 | RMS normalization epsilon |
| `ngqa` | int32 | Number of grouped-query attention (GQA) groups |
| `rope_scaling` | string | RoPE scaling configuration |
| `type` | string | Model type/architecture |
| `grammar` | string | Grammar file path for constrained generation |
### YARN Configuration
YARN (Yet Another RoPE extensioN) settings for context extension:
| Field | Type | Description |
|-------|------|-------------|
| `yarn_ext_factor` | float32 | YARN extension factor |
| `yarn_attn_factor` | float32 | YARN attention factor |
| `yarn_beta_fast` | float32 | YARN beta fast parameter |
| `yarn_beta_slow` | float32 | YARN beta slow parameter |
### Prompt Caching
| Field | Type | Description |
|-------|------|-------------|
| `prompt_cache_path` | string | Path to store prompt cache (relative to models directory) |
| `prompt_cache_all` | bool | Cache all prompts automatically |
| `prompt_cache_ro` | bool | Read-only prompt cache |
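For example, to cache all prompt evaluations under a `cache/` directory inside the models directory (the path is illustrative):
```yaml
prompt_cache_path: "cache/my-model"
prompt_cache_all: true
```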
### Text Processing
| Field | Type | Description |
|-------|------|-------------|
| `stopwords` | array | Words or phrases that stop generation |
| `cutstrings` | array | Strings to cut from responses |
| `trimspace` | array | Strings to trim whitespace from |
| `trimsuffix` | array | Suffixes to trim from responses |
| `extract_regex` | array | Regular expressions to extract content |
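A sketch combining several of these fields (the patterns are illustrative):
```yaml
stopwords:
  - "\n\nUser:"
  - "<|im_end|>"
cutstrings:
  - "<|assistant|>"
trimsuffix:
  - "\n"
extract_regex:
  - "<answer>(.*?)</answer>"  # illustrative extraction pattern
```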
### System Prompt
| Field | Type | Description |
|-------|------|-------------|
| `system_prompt` | string | Default system prompt for the model |
## vLLM-Specific Configuration
These options apply when using the `vllm` backend:
| Field | Type | Description |
|-------|------|-------------|
| `gpu_memory_utilization` | float32 | GPU memory utilization (0.0-1.0, default 0.9) |
| `trust_remote_code` | bool | Trust and execute remote code |
| `enforce_eager` | bool | Force eager execution mode |
| `swap_space` | int | Swap space in GB |
| `max_model_len` | int | Maximum model length |
| `tensor_parallel_size` | int | Tensor parallelism size |
| `disable_log_stats` | bool | Disable logging statistics |
| `dtype` | string | Data type (e.g., `float16`, `bfloat16`) |
| `flash_attention` | string | Flash attention configuration |
| `cache_type_k` | string | Key cache type |
| `cache_type_v` | string | Value cache type |
| `limit_mm_per_prompt` | object | Limit multimodal content per prompt: `{image: int, video: int, audio: int}` |
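A hedged sketch of a vLLM model definition (the model name and values are illustrative):
```yaml
name: my-vllm-model
backend: vllm
parameters:
  model: "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative Hugging Face model ID
gpu_memory_utilization: 0.9
max_model_len: 8192
tensor_parallel_size: 2   # split across two GPUs
dtype: bfloat16
template:
  use_tokenizer_template: true
```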
## Template Configuration
Templates use Go templates with [Sprig functions](https://masterminds.github.io/sprig/).
| Field | Type | Description |
|-------|------|-------------|
| `template.chat` | string | Template for chat completion endpoint |
| `template.chat_message` | string | Template for individual chat messages |
| `template.completion` | string | Template for text completion |
| `template.edit` | string | Template for edit operations |
| `template.function` | string | Template for function/tool calls |
| `template.multimodal` | string | Template for multimodal interactions |
| `template.reply_prefix` | string | Prefix to add to model replies |
| `template.use_tokenizer_template` | bool | Use tokenizer's built-in template (vLLM/transformers) |
| `template.join_chat_messages_by_character` | string | Character to join chat messages (default: `\n`) |
### Template Variables
The following variables are commonly available in templates:
- `{{.Input}}` - User input
- `{{.Instruction}}` - Instruction for edit operations
- `{{.System}}` - System message
- `{{.Prompt}}` - Full prompt
- `{{.Functions}}` - Function definitions (for function calling)
- `{{.FunctionCall}}` - Function call result
### Example Template
```yaml
template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}{{end}}
    {{if eq .Role "assistant"}}Assistant: {{.Content}}{{end}}
    {{end}}
    Assistant:
```
## Function Calling Configuration
Configure how the model handles function/tool calls:
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `function.disable_no_action` | bool | `false` | Disable the no-action behavior |
| `function.no_action_function_name` | string | `answer` | Name of the no-action function |
| `function.no_action_description_name` | string | | Description for no-action function |
| `function.function_name_key` | string | `name` | JSON key for function name |
| `function.function_arguments_key` | string | `arguments` | JSON key for function arguments |
| `function.response_regex` | array | | Named regex patterns to extract function calls |
| `function.argument_regex` | array | | Named regex to extract function arguments |
| `function.argument_regex_key_name` | string | `key` | Named regex capture for argument key |
| `function.argument_regex_value_name` | string | `value` | Named regex capture for argument value |
| `function.json_regex_match` | array | | Regex patterns to match JSON in tool mode |
| `function.replace_function_results` | array | | Replace function call results with patterns |
| `function.replace_llm_results` | array | | Replace LLM results with patterns |
| `function.capture_llm_results` | array | | Capture LLM results as text (e.g., for "thinking" blocks) |
### Grammar Configuration
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `function.grammar.disable` | bool | `false` | Completely disable grammar enforcement |
| `function.grammar.parallel_calls` | bool | `false` | Allow parallel function calls |
| `function.grammar.mixed_mode` | bool | `false` | Allow mixed-mode grammar enforcement |
| `function.grammar.no_mixed_free_string` | bool | `false` | Disallow free strings in mixed mode |
| `function.grammar.disable_parallel_new_lines` | bool | `false` | Disable parallel processing for new lines |
| `function.grammar.prefix` | string | | Prefix to add before grammar rules |
| `function.grammar.expect_strings_after_json` | bool | `false` | Expect strings after JSON data |
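Putting the function-calling and grammar fields together, a minimal sketch might look like this:
```yaml
function:
  disable_no_action: false
  no_action_function_name: answer
  grammar:
    parallel_calls: true
    mixed_mode: false
```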
## Diffusers Configuration
For image generation models using the `diffusers` backend:
| Field | Type | Description |
|-------|------|-------------|
| `diffusers.cuda` | bool | Enable CUDA for diffusers |
| `diffusers.pipeline_type` | string | Pipeline type (e.g., `stable-diffusion`, `stable-diffusion-xl`) |
| `diffusers.scheduler_type` | string | Scheduler type (e.g., `euler`, `ddpm`) |
| `diffusers.enable_parameters` | string | Comma-separated parameters to enable |
| `diffusers.cfg_scale` | float32 | Classifier-free guidance scale |
| `diffusers.img2img` | bool | Enable image-to-image transformation |
| `diffusers.clip_skip` | int | Number of CLIP layers to skip |
| `diffusers.clip_model` | string | CLIP model to use |
| `diffusers.clip_subfolder` | string | CLIP model subfolder |
| `diffusers.control_net` | string | ControlNet model to use |
| `step` | int | Number of diffusion steps |
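A sketch of a Stable Diffusion XL model on the `diffusers` backend (the model reference and values are illustrative):
```yaml
name: my-image-model
backend: diffusers
parameters:
  model: "stabilityai/stable-diffusion-xl-base-1.0"  # illustrative model reference
step: 25
diffusers:
  cuda: true
  pipeline_type: stable-diffusion-xl
  scheduler_type: euler
  cfg_scale: 7.5
```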
## TTS Configuration
For text-to-speech models:
| Field | Type | Description |
|-------|------|-------------|
| `tts.voice` | string | Voice file path or voice ID |
| `tts.audio_path` | string | Path to audio files (for Vall-E) |
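A minimal sketch of a TTS model definition (the backend, file, and voice names are illustrative):
```yaml
name: my-tts-model
backend: piper             # illustrative; use the TTS backend you have installed
parameters:
  model: en-us-voice.onnx  # hypothetical voice model file
tts:
  voice: en-us-voice       # voice ID or file path, backend-dependent
```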
## Roles Configuration
Map conversation roles to specific strings:
```yaml
roles:
  user: "### Instruction:"
  assistant: "### Response:"
  system: "### System Instruction:"
```
## Feature Flags
Enable or disable experimental features:
```yaml
feature_flags:
  feature_name: true
  another_feature: false
```
## MCP Configuration
Model Context Protocol (MCP) configuration:
| Field | Type | Description |
|-------|------|-------------|
| `mcp.remote` | string | YAML string defining remote MCP servers |
| `mcp.stdio` | string | YAML string defining STDIO MCP servers |
## Agent Configuration
Agent/autonomous agent configuration:
| Field | Type | Description |
|-------|------|-------------|
| `agent.max_attempts` | int | Maximum number of attempts |
| `agent.max_iterations` | int | Maximum number of iterations |
| `agent.enable_reasoning` | bool | Enable reasoning capabilities |
| `agent.enable_planning` | bool | Enable planning capabilities |
| `agent.enable_mcp_prompts` | bool | Enable MCP prompts |
| `agent.enable_plan_re_evaluator` | bool | Enable plan re-evaluation |
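For example, a conservative agent setup (the limits are illustrative):
```yaml
agent:
  max_attempts: 3
  max_iterations: 10
  enable_reasoning: true
  enable_planning: false
```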
## Pipeline Configuration
Define pipelines for audio-to-audio processing:
| Field | Type | Description |
|-------|------|-------------|
| `pipeline.tts` | string | TTS model name |
| `pipeline.llm` | string | LLM model name |
| `pipeline.transcription` | string | Transcription model name |
| `pipeline.vad` | string | Voice activity detection model name |
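For example, to wire an audio-to-audio pipeline from models defined elsewhere in your configuration (the referenced model names are hypothetical and must exist in your setup):
```yaml
pipeline:
  vad: silero-vad
  transcription: whisper-1
  llm: gpt-3.5-turbo
  tts: my-tts-model
```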
## gRPC Configuration
Backend gRPC communication settings:
| Field | Type | Description |
|-------|------|-------------|
| `grpc.attempts` | int | Number of retry attempts |
| `grpc.attempts_sleep_time` | int | Sleep time between retries (seconds) |
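For example, to retry a failing backend a few times before giving up (the values are illustrative):
```yaml
grpc:
  attempts: 3
  attempts_sleep_time: 2  # seconds between retries
```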
## Overrides
Override model configuration values at runtime (llama.cpp):
```yaml
overrides:
  - "qwen3moe.expert_used_count=int:10"
  - "some_key=string:value"
```
Format: `KEY=TYPE:VALUE` where TYPE is `int`, `float`, `string`, or `bool`.
## Known Use Cases
Specify which endpoints this model supports:
```yaml
known_usecases:
  - chat
  - completion
  - embeddings
```
Available flags: `chat`, `completion`, `edit`, `embeddings`, `rerank`, `image`, `transcript`, `tts`, `sound_generation`, `tokenize`, `vad`, `video`, `detection`, and `llm` (a combination of `chat`, `completion`, and `edit`).
## Complete Example
Here's a comprehensive example combining many options:
```yaml
name: my-llm-model
description: A high-performance LLM model
backend: llama-stable
parameters:
  model: my-model.gguf
  temperature: 0.7
  top_p: 0.9
  top_k: 40
  max_tokens: 2048
context_size: 4096
threads: 8
f16: true
gpu_layers: 35
system_prompt: "You are a helpful AI assistant."
template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}
    {{else if eq .Role "assistant"}}Assistant: {{.Content}}
    {{end}}
    {{end}}
    Assistant:
roles:
  user: "User:"
  assistant: "Assistant:"
  system: "System:"
stopwords:
  - "\n\nUser:"
  - "\n\nHuman:"
prompt_cache_path: "cache/my-model"
prompt_cache_all: true
function:
  grammar:
    parallel_calls: true
    mixed_mode: false
feature_flags:
  experimental_feature: true
```
## Related Documentation
- See [Advanced Usage]({{%relref "advanced/advanced-usage" %}}) for other configuration options
- See [Prompt Templates]({{%relref "advanced/advanced-usage#prompt-templates" %}}) for template examples
- See [CLI Reference]({{%relref "reference/cli-reference" %}}) for command-line options