LocalAI / docs /content /advanced /model-configuration.md

Upload folder using huggingface_hub

0f07ba7 verified 20 days ago

17.7 kB

	+++
	disableToc = false
	title = "Model Configuration"
	weight = 23
	url = '/advanced/model-configuration'
	+++

	LocalAI uses YAML configuration files to define model parameters, templates, and behavior. This page provides a complete reference for all available configuration options.

	## Overview

	Model configuration files allow you to:
	- Define default parameters (temperature, top_p, etc.)
	- Configure prompt templates
	- Specify backend settings
	- Set up function calling
	- Configure GPU and memory options
	- And much more

	## Configuration File Locations

	You can create model configuration files in several ways:

	1. Individual YAML files in the models directory (e.g., `models/gpt-3.5-turbo.yaml`)
	2. Single config file with multiple models using `--models-config-file` or `LOCALAI_MODELS_CONFIG_FILE`
	3. Remote URLs - specify a URL to a YAML configuration file at startup

	### Example: Basic Configuration

	```yaml
	name: gpt-3.5-turbo
	parameters:
	model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
	temperature: 0.3

	context_size: 512
	threads: 10
	backend: llama-stable

	template:
	completion: completion
	chat: chat
	```

	### Example: Multiple Models in One File

	When using `--models-config-file`, you can define multiple models as a list:

	```yaml
	- name: model1
	parameters:
	model: model1.bin
	context_size: 512
	backend: llama-stable

	- name: model2
	parameters:
	model: model2.bin
	context_size: 1024
	backend: llama-stable
	```

	## Core Configuration Fields

	### Basic Model Settings

	\| Field \| Type \| Description \| Example \|
	\|-------\|------\|-------------\|---------\|
	\| `name` \| string \| Model name, used to identify the model in API calls \| `gpt-3.5-turbo` \|
	\| `backend` \| string \| Backend to use (e.g. `llama-cpp`, `vllm`, `diffusers`, `whisper`) \| `llama-cpp` \|
	\| `description` \| string \| Human-readable description of the model \| `A conversational AI model` \|
	\| `usage` \| string \| Usage instructions or notes \| `Best for general conversation` \|

	### Model File and Downloads

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `parameters.model` \| string \| Path to the model file (relative to models directory) or URL \|
	\| `download_files` \| array \| List of files to download. Each entry has `filename`, `uri`, and optional `sha256` \|

	Example:
	```yaml
	parameters:
	model: my-model.gguf

	download_files:
	- filename: my-model.gguf
	uri: https://example.com/model.gguf
	sha256: abc123...
	```

	## Parameters Section

	The `parameters` section contains all OpenAI-compatible request parameters and model-specific options.

	### OpenAI-Compatible Parameters

	These settings will be used as defaults for all the API calls to the model.

	\| Field \| Type \| Default \| Description \|
	\|-------\|------\|---------\|-------------\|
	\| `temperature` \| float \| `0.9` \| Sampling temperature (0.0-2.0). Higher values make output more random \|
	\| `top_p` \| float \| `0.95` \| Nucleus sampling: consider tokens with top_p probability mass \|
	\| `top_k` \| int \| `40` \| Consider only the top K most likely tokens \|
	\| `max_tokens` \| int \| `0` \| Maximum number of tokens to generate (0 = unlimited) \|
	\| `frequency_penalty` \| float \| `0.0` \| Penalty for token frequency (-2.0 to 2.0) \|
	\| `presence_penalty` \| float \| `0.0` \| Penalty for token presence (-2.0 to 2.0) \|
	\| `repeat_penalty` \| float \| `1.1` \| Penalty for repeating tokens \|
	\| `repeat_last_n` \| int \| `64` \| Number of previous tokens to consider for repeat penalty \|
	\| `seed` \| int \| `-1` \| Random seed (omit for random) \|
	\| `echo` \| bool \| `false` \| Echo back the prompt in the response \|
	\| `n` \| int \| `1` \| Number of completions to generate \|
	\| `logprobs` \| bool/int \| `false` \| Return log probabilities of tokens \|
	\| `top_logprobs` \| int \| `0` \| Number of top logprobs to return per token (0-20) \|
	\| `logit_bias` \| map \| `{}` \| Map of token IDs to bias values (-100 to 100) \|
	\| `typical_p` \| float \| `1.0` \| Typical sampling parameter \|
	\| `tfz` \| float \| `1.0` \| Tail free z parameter \|
	\| `keep` \| int \| `0` \| Number of tokens to keep from the prompt \|

	### Language and Translation

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `language` \| string \| Language code for transcription/translation \|
	\| `translate` \| bool \| Whether to translate audio transcription \|

	### Custom Parameters

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `batch` \| int \| Batch size for processing \|
	\| `ignore_eos` \| bool \| Ignore end-of-sequence tokens \|
	\| `negative_prompt` \| string \| Negative prompt for image generation \|
	\| `rope_freq_base` \| float32 \| RoPE frequency base \|
	\| `rope_freq_scale` \| float32 \| RoPE frequency scale \|
	\| `negative_prompt_scale` \| float32 \| Scale for negative prompt \|
	\| `tokenizer` \| string \| Tokenizer to use (RWKV) \|

	## LLM Configuration

	These settings apply to most LLM backends (llama.cpp, vLLM, etc.):

	### Performance Settings

	\| Field \| Type \| Default \| Description \|
	\|-------\|------\|---------\|-------------\|
	\| `threads` \| int \| `processor count` \| Number of threads for parallel computation \|
	\| `context_size` \| int \| `512` \| Maximum context size (number of tokens) \|
	\| `f16` \| bool \| `false` \| Enable 16-bit floating point precision (GPU acceleration) \|
	\| `gpu_layers` \| int \| `0` \| Number of layers to offload to GPU (0 = CPU only) \|

	### Memory Management

	\| Field \| Type \| Default \| Description \|
	\|-------\|------\|---------\|-------------\|
	\| `mmap` \| bool \| `true` \| Use memory mapping for model loading (faster, less RAM) \|
	\| `mmlock` \| bool \| `false` \| Lock model in memory (prevents swapping) \|
	\| `low_vram` \| bool \| `false` \| Use minimal VRAM mode \|
	\| `no_kv_offloading` \| bool \| `false` \| Disable KV cache offloading \|

	### GPU Configuration

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `tensor_split` \| string \| Comma-separated GPU memory allocation (e.g., `"0.8,0.2"` for 80%/20%) \|
	\| `main_gpu` \| string \| Main GPU identifier for multi-GPU setups \|
	\| `cuda` \| bool \| Explicitly enable/disable CUDA \|

	### Sampling and Generation

	\| Field \| Type \| Default \| Description \|
	\|-------\|------\|---------\|-------------\|
	\| `mirostat` \| int \| `0` \| Mirostat sampling mode (0=disabled, 1=Mirostat, 2=Mirostat 2.0) \|
	\| `mirostat_tau` \| float \| `5.0` \| Mirostat target entropy \|
	\| `mirostat_eta` \| float \| `0.1` \| Mirostat learning rate \|

	### LoRA Configuration

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `lora_adapter` \| string \| Path to LoRA adapter file \|
	\| `lora_base` \| string \| Base model for LoRA \|
	\| `lora_scale` \| float32 \| LoRA scale factor \|
	\| `lora_adapters` \| array \| Multiple LoRA adapters \|
	\| `lora_scales` \| array \| Scales for multiple LoRA adapters \|

	### Advanced Options

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `no_mulmatq` \| bool \| Disable matrix multiplication queuing \|
	\| `draft_model` \| string \| Draft model for speculative decoding \|
	\| `n_draft` \| int32 \| Number of draft tokens \|
	\| `quantization` \| string \| Quantization format \|
	\| `load_format` \| string \| Model load format \|
	\| `numa` \| bool \| Enable NUMA (Non-Uniform Memory Access) \|
	\| `rms_norm_eps` \| float32 \| RMS normalization epsilon \|
	\| `ngqa` \| int32 \| Natural question generation parameter \|
	\| `rope_scaling` \| string \| RoPE scaling configuration \|
	\| `type` \| string \| Model type/architecture \|
	\| `grammar` \| string \| Grammar file path for constrained generation \|

	### YARN Configuration

	YARN (Yet Another RoPE extensioN) settings for context extension:

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `yarn_ext_factor` \| float32 \| YARN extension factor \|
	\| `yarn_attn_factor` \| float32 \| YARN attention factor \|
	\| `yarn_beta_fast` \| float32 \| YARN beta fast parameter \|
	\| `yarn_beta_slow` \| float32 \| YARN beta slow parameter \|

	### Prompt Caching

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `prompt_cache_path` \| string \| Path to store prompt cache (relative to models directory) \|
	\| `prompt_cache_all` \| bool \| Cache all prompts automatically \|
	\| `prompt_cache_ro` \| bool \| Read-only prompt cache \|

	### Text Processing

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `stopwords` \| array \| Words or phrases that stop generation \|
	\| `cutstrings` \| array \| Strings to cut from responses \|
	\| `trimspace` \| array \| Strings to trim whitespace from \|
	\| `trimsuffix` \| array \| Suffixes to trim from responses \|
	\| `extract_regex` \| array \| Regular expressions to extract content \|

	### System Prompt

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `system_prompt` \| string \| Default system prompt for the model \|

	## vLLM-Specific Configuration

	These options apply when using the `vllm` backend:

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `gpu_memory_utilization` \| float32 \| GPU memory utilization (0.0-1.0, default 0.9) \|
	\| `trust_remote_code` \| bool \| Trust and execute remote code \|
	\| `enforce_eager` \| bool \| Force eager execution mode \|
	\| `swap_space` \| int \| Swap space in GB \|
	\| `max_model_len` \| int \| Maximum model length \|
	\| `tensor_parallel_size` \| int \| Tensor parallelism size \|
	\| `disable_log_stats` \| bool \| Disable logging statistics \|
	\| `dtype` \| string \| Data type (e.g., `float16`, `bfloat16`) \|
	\| `flash_attention` \| string \| Flash attention configuration \|
	\| `cache_type_k` \| string \| Key cache type \|
	\| `cache_type_v` \| string \| Value cache type \|
	\| `limit_mm_per_prompt` \| object \| Limit multimodal content per prompt: `{image: int, video: int, audio: int}` \|

	## Template Configuration

	Templates use Go templates with [Sprig functions](http://masterminds.github.io/sprig/).

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `template.chat` \| string \| Template for chat completion endpoint \|
	\| `template.chat_message` \| string \| Template for individual chat messages \|
	\| `template.completion` \| string \| Template for text completion \|
	\| `template.edit` \| string \| Template for edit operations \|
	\| `template.function` \| string \| Template for function/tool calls \|
	\| `template.multimodal` \| string \| Template for multimodal interactions \|
	\| `template.reply_prefix` \| string \| Prefix to add to model replies \|
	\| `template.use_tokenizer_template` \| bool \| Use tokenizer's built-in template (vLLM/transformers) \|
	\| `template.join_chat_messages_by_character` \| string \| Character to join chat messages (default: `\n`) \|

	### Template Variables

	Templating supports [sprig](https://masterminds.github.io/sprig/) functions.

	Following are common variables available in templates:
	- `{{.Input}}` - User input
	- `{{.Instruction}}` - Instruction for edit operations
	- `{{.System}}` - System message
	- `{{.Prompt}}` - Full prompt
	- `{{.Functions}}` - Function definitions (for function calling)
	- `{{.FunctionCall}}` - Function call result

	### Example Template

	```yaml
	template:
	chat: \|
	{{.System}}
	{{range .Messages}}
	{{if eq .Role "user"}}User: {{.Content}}{{end}}
	{{if eq .Role "assistant"}}Assistant: {{.Content}}{{end}}
	{{end}}
	Assistant:
	```

	## Function Calling Configuration

	Configure how the model handles function/tool calls:

	\| Field \| Type \| Default \| Description \|
	\|-------\|------\|---------\|-------------\|
	\| `function.disable_no_action` \| bool \| `false` \| Disable the no-action behavior \|
	\| `function.no_action_function_name` \| string \| `answer` \| Name of the no-action function \|
	\| `function.no_action_description_name` \| string \| \| Description for no-action function \|
	\| `function.function_name_key` \| string \| `name` \| JSON key for function name \|
	\| `function.function_arguments_key` \| string \| `arguments` \| JSON key for function arguments \|
	\| `function.response_regex` \| array \| \| Named regex patterns to extract function calls \|
	\| `function.argument_regex` \| array \| \| Named regex to extract function arguments \|
	\| `function.argument_regex_key_name` \| string \| `key` \| Named regex capture for argument key \|
	\| `function.argument_regex_value_name` \| string \| `value` \| Named regex capture for argument value \|
	\| `function.json_regex_match` \| array \| \| Regex patterns to match JSON in tool mode \|
	\| `function.replace_function_results` \| array \| \| Replace function call results with patterns \|
	\| `function.replace_llm_results` \| array \| \| Replace LLM results with patterns \|
	\| `function.capture_llm_results` \| array \| \| Capture LLM results as text (e.g., for "thinking" blocks) \|

	### Grammar Configuration

	\| Field \| Type \| Default \| Description \|
	\|-------\|------\|---------\|-------------\|
	\| `function.grammar.disable` \| bool \| `false` \| Completely disable grammar enforcement \|
	\| `function.grammar.parallel_calls` \| bool \| `false` \| Allow parallel function calls \|
	\| `function.grammar.mixed_mode` \| bool \| `false` \| Allow mixed-mode grammar enforcing \|
	\| `function.grammar.no_mixed_free_string` \| bool \| `false` \| Disallow free strings in mixed mode \|
	\| `function.grammar.disable_parallel_new_lines` \| bool \| `false` \| Disable parallel processing for new lines \|
	\| `function.grammar.prefix` \| string \| \| Prefix to add before grammar rules \|
	\| `function.grammar.expect_strings_after_json` \| bool \| `false` \| Expect strings after JSON data \|

	## Diffusers Configuration

	For image generation models using the `diffusers` backend:

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `diffusers.cuda` \| bool \| Enable CUDA for diffusers \|
	\| `diffusers.pipeline_type` \| string \| Pipeline type (e.g., `stable-diffusion`, `stable-diffusion-xl`) \|
	\| `diffusers.scheduler_type` \| string \| Scheduler type (e.g., `euler`, `ddpm`) \|
	\| `diffusers.enable_parameters` \| string \| Comma-separated parameters to enable \|
	\| `diffusers.cfg_scale` \| float32 \| Classifier-free guidance scale \|
	\| `diffusers.img2img` \| bool \| Enable image-to-image transformation \|
	\| `diffusers.clip_skip` \| int \| Number of CLIP layers to skip \|
	\| `diffusers.clip_model` \| string \| CLIP model to use \|
	\| `diffusers.clip_subfolder` \| string \| CLIP model subfolder \|
	\| `diffusers.control_net` \| string \| ControlNet model to use \|
	\| `step` \| int \| Number of diffusion steps \|

	## TTS Configuration

	For text-to-speech models:

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `tts.voice` \| string \| Voice file path or voice ID \|
	\| `tts.audio_path` \| string \| Path to audio files (for Vall-E) \|

	## Roles Configuration

	Map conversation roles to specific strings:

	```yaml
	roles:
	user: "### Instruction:"
	assistant: "### Response:"
	system: "### System Instruction:"
	```

	## Feature Flags

	Enable or disable experimental features:

	```yaml
	feature_flags:
	feature_name: true
	another_feature: false
	```

	## MCP Configuration

	Model Context Protocol (MCP) configuration:

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `mcp.remote` \| string \| YAML string defining remote MCP servers \|
	\| `mcp.stdio` \| string \| YAML string defining STDIO MCP servers \|

	## Agent Configuration

	Agent/autonomous agent configuration:

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `agent.max_attempts` \| int \| Maximum number of attempts \|
	\| `agent.max_iterations` \| int \| Maximum number of iterations \|
	\| `agent.enable_reasoning` \| bool \| Enable reasoning capabilities \|
	\| `agent.enable_planning` \| bool \| Enable planning capabilities \|
	\| `agent.enable_mcp_prompts` \| bool \| Enable MCP prompts \|
	\| `agent.enable_plan_re_evaluator` \| bool \| Enable plan re-evaluation \|

	## Pipeline Configuration

	Define pipelines for audio-to-audio processing:

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `pipeline.tts` \| string \| TTS model name \|
	\| `pipeline.llm` \| string \| LLM model name \|
	\| `pipeline.transcription` \| string \| Transcription model name \|
	\| `pipeline.vad` \| string \| Voice activity detection model name \|

	## gRPC Configuration

	Backend gRPC communication settings:

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `grpc.attempts` \| int \| Number of retry attempts \|
	\| `grpc.attempts_sleep_time` \| int \| Sleep time between retries (seconds) \|

	## Overrides

	Override model configuration values at runtime (llama.cpp):

	```yaml
	overrides:
	- "qwen3moe.expert_used_count=int:10"
	- "some_key=string:value"
	```

	Format: `KEY=TYPE:VALUE` where TYPE is `int`, `float`, `string`, or `bool`.

	## Known Use Cases

	Specify which endpoints this model supports:

	```yaml
	known_usecases:
	- chat
	- completion
	- embeddings
	```

	Available flags: `chat`, `completion`, `edit`, `embeddings`, `rerank`, `image`, `transcript`, `tts`, `sound_generation`, `tokenize`, `vad`, `video`, `detection`, `llm` (combination of CHAT, COMPLETION, EDIT).

	## Complete Example

	Here's a comprehensive example combining many options:

	```yaml
	name: my-llm-model
	description: A high-performance LLM model
	backend: llama-stable

	parameters:
	model: my-model.gguf
	temperature: 0.7
	top_p: 0.9
	top_k: 40
	max_tokens: 2048

	context_size: 4096
	threads: 8
	f16: true
	gpu_layers: 35

	system_prompt: "You are a helpful AI assistant."

	template:
	chat: \|
	{{.System}}
	{{range .Messages}}
	{{if eq .Role "user"}}User: {{.Content}}
	{{else if eq .Role "assistant"}}Assistant: {{.Content}}
	{{end}}
	{{end}}
	Assistant:

	roles:
	user: "User:"
	assistant: "Assistant:"
	system: "System:"

	stopwords:
	- "\n\nUser:"
	- "\n\nHuman:"

	prompt_cache_path: "cache/my-model"
	prompt_cache_all: true

	function:
	grammar:
	parallel_calls: true
	mixed_mode: false

	feature_flags:
	experimental_feature: true
	```

	## Related Documentation

	- See [Advanced Usage]({{%relref "advanced/advanced-usage" %}}) for other configuration options
	- See [Prompt Templates]({{%relref "advanced/advanced-usage#prompt-templates" %}}) for template examples
	- See [CLI Reference]({{%relref "reference/cli-reference" %}}) for command-line options