LocalAI-Amlan-Edition / docs /content /features /text-generation.md

Amlan-109

feat: Initial commit of LocalAI Amlan Edition with premium branding and personalization

750bbe6 5 days ago

	+++
	disableToc = false
	title = "📖 Text generation (GPT)"
	weight = 10
	url = "/features/text-generation/"
	+++

	LocalAI supports text generation with GPT-style models through `llama.cpp` and other backends (such as `rwkv.cpp`). See the [Model compatibility]({{%relref "reference/compatibility-table" %}}) table for an up-to-date list of the supported model families.

	Note:

	- You can also specify the model name as part of the OpenAI token.
	- If only one model is available, the API will use it for all the requests.

	## API Reference

	### Chat completions

	https://platform.openai.com/docs/api-reference/chat

	For example, to generate a chat completion, send a POST request to the `/v1/chat/completions` endpoint with the conversation messages in the request body:

	```bash
	curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
	"model": "ggml-koala-7b-model-q4_0-r2.bin",
	"messages": [{"role": "user", "content": "Say this is a test!"}],
	"temperature": 0.7
	}'
	```

	Available additional parameters: `top_p`, `top_k`, `max_tokens`
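
The same request can be issued from Python. The following is a minimal sketch using only the standard library, assuming LocalAI listens on `localhost:8080` as in the curl example. The helper names `build_chat_request` and `send` are illustrative and not part of LocalAI:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: same host/port as the curl example


def build_chat_request(model, prompt, **params):
    """Build a POST request for /v1/chat/completions (hypothetical helper)."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    # Optional sampling parameters such as temperature, top_p, top_k, max_tokens
    body.update(params)
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


req = build_chat_request(
    "ggml-koala-7b-model-q4_0-r2.bin", "Say this is a test!", temperature=0.7
)
# reply = send(req)  # uncomment when a LocalAI instance is running
```

Keeping request construction separate from sending makes it possible to inspect the JSON body before any network call is made.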

	### Edit completions

	https://platform.openai.com/docs/api-reference/edits

	To generate an edit completion, send a POST request to the `/v1/edits` endpoint with the instruction and input in the request body:

	```bash
	curl http://localhost:8080/v1/edits -H "Content-Type: application/json" -d '{
	"model": "ggml-koala-7b-model-q4_0-r2.bin",
	"instruction": "rephrase",
	"input": "Black cat jumped out of the window",
	"temperature": 0.7
	}'
	```

	Available additional parameters: `top_p`, `top_k`, `max_tokens`.

	### Completions

	https://platform.openai.com/docs/api-reference/completions

	To generate a completion, send a POST request to the `/v1/completions` endpoint with the prompt in the request body:

	```bash
	curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
	"model": "ggml-koala-7b-model-q4_0-r2.bin",
	"prompt": "A long time ago in a galaxy far, far away",
	"temperature": 0.7
	}'
	```

	Available additional parameters: `top_p`, `top_k`, `max_tokens`

	### List models

	You can list all the models available with:

	```bash
	curl http://localhost:8080/v1/models
	```

	### Anthropic Messages API

	LocalAI supports the Anthropic Messages API, which is compatible with Claude clients. This endpoint provides a structured way to send messages and receive responses, with support for tools, streaming, and multimodal content.

	Endpoint: `POST /v1/messages` or `POST /messages`

	Reference: https://docs.anthropic.com/claude/reference/messages_post

	#### Basic Usage

	```bash
	curl http://localhost:8080/v1/messages \
	-H "Content-Type: application/json" \
	-H "anthropic-version: 2023-06-01" \
	-d '{
	"model": "ggml-koala-7b-model-q4_0-r2.bin",
	"max_tokens": 1024,
	"messages": [
	{"role": "user", "content": "Say this is a test!"}
	]
	}'
	```

	#### Request Parameters

	| Parameter | Type | Required | Description |
	|-----------|------|----------|-------------|
	| `model` | string | Yes | The model identifier |
	| `messages` | array | Yes | Array of message objects with `role` and `content` |
	| `max_tokens` | integer | Yes | Maximum number of tokens to generate (must be > 0) |
	| `system` | string | No | System message to set the assistant's behavior |
	| `temperature` | float | No | Sampling temperature (0.0 to 1.0) |
	| `top_p` | float | No | Nucleus sampling parameter |
	| `top_k` | integer | No | Top-k sampling parameter |
	| `stop_sequences` | array | No | Array of strings that will stop generation |
	| `stream` | boolean | No | Enable streaming responses |
	| `tools` | array | No | Array of tool definitions for function calling |
	| `tool_choice` | string/object | No | Tool choice strategy: "auto", "any", "none", or specific tool |
	| `metadata` | object | No | Custom metadata to attach to the request |
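
The required fields in the table can be checked on the client before a request is sent. The following is a rough sketch of such a check. `validate_messages_request` is a hypothetical helper, and it only mirrors the constraints listed above, not whatever validation LocalAI performs on the server:

```python
def validate_messages_request(body):
    """Return a list of problems found in a /v1/messages request body."""
    problems = []
    # model, messages and max_tokens are marked as required in the table
    for field in ("model", "messages", "max_tokens"):
        if field not in body:
            problems.append(f"missing required field: {field}")
    max_tokens = body.get("max_tokens")
    if max_tokens is not None and (not isinstance(max_tokens, int) or max_tokens <= 0):
        problems.append("max_tokens must be an integer greater than 0")
    # Each message object needs both a role and a content entry
    for i, message in enumerate(body.get("messages", [])):
        if "role" not in message or "content" not in message:
            problems.append(f"messages[{i}] needs both role and content")
    return problems


ok_body = {
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Say this is a test!"}],
}
print(validate_messages_request(ok_body))
```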

	#### Message Format

	Messages can contain text or structured content blocks:

	```bash
	curl http://localhost:8080/v1/messages \
	-H "Content-Type: application/json" \
	-d '{
	"model": "ggml-koala-7b-model-q4_0-r2.bin",
	"max_tokens": 1024,
	"messages": [
	{
	"role": "user",
	"content": [
	{
	"type": "text",
	"text": "What is in this image?"
	},
	{
	"type": "image",
	"source": {
	"type": "base64",
	"media_type": "image/jpeg",
	"data": "base64_encoded_image_data"
	}
	}
	]
	}
	]
	}'
	```
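
The `data` field above is a placeholder for the base64-encoded image. As an illustration, the content blocks could be assembled in Python as follows. The helper names `image_block` and `user_message` are made up for this sketch, and the three placeholder bytes stand in for a real JPEG file:

```python
import base64


def image_block(image_bytes, media_type="image/jpeg"):
    """Wrap raw image bytes in a base64 `image` content block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(image_bytes).decode("ascii"),
        },
    }


def user_message(text, image_bytes):
    """Build a user message holding one text block and one image block."""
    return {
        "role": "user",
        "content": [{"type": "text", "text": text}, image_block(image_bytes)],
    }


# Placeholder bytes; in practice read them with open(path, "rb").read()
message = user_message("What is in this image?", b"\xff\xd8\xff")
```

The resulting `message` dictionary can be placed directly in the `messages` array of the request body.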

	#### Tool Calling

	The Anthropic API supports function calling through tools:

	```bash
	curl http://localhost:8080/v1/messages \
	-H "Content-Type: application/json" \
	-d '{
	"model": "ggml-koala-7b-model-q4_0-r2.bin",
	"max_tokens": 1024,
	"tools": [
	{
	"name": "get_weather",
	"description": "Get the current weather",
	"input_schema": {
	"type": "object",
	"properties": {
	"location": {
	"type": "string",
	"description": "The city and state"
	}
	},
	"required": ["location"]
	}
	}
	],
	"tool_choice": "auto",
	"messages": [
	{"role": "user", "content": "What is the weather in San Francisco?"}
	]
	}'
	```

	#### Streaming

	Enable streaming responses by setting `stream: true`:

	```bash
	curl http://localhost:8080/v1/messages \
	-H "Content-Type: application/json" \
	-d '{
	"model": "ggml-koala-7b-model-q4_0-r2.bin",
	"max_tokens": 1024,
	"stream": true,
	"messages": [
	{"role": "user", "content": "Tell me a story"}
	]
	}'
	```

	Streaming responses use Server-Sent Events (SSE) format with event types: `message_start`, `content_block_start`, `content_block_delta`, `content_block_stop`, `message_delta`, and `message_stop`.
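
On the client side, the stream can be consumed by splitting it into events and concatenating the text deltas. The sketch below parses SSE lines that have already been read from the connection. It assumes each `content_block_delta` payload carries its text under `delta.text`, as in Anthropic's published event format; the exact payloads LocalAI emits are not shown in this document, so treat the `sample` data as hypothetical:

```python
import json


def iter_sse_events(lines):
    """Yield (event_name, payload) pairs from an iterable of SSE text lines."""
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line terminates one event
            if data:
                yield event, json.loads("\n".join(data))
            event, data = None, []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
    if data:  # flush a final event that lacks a trailing blank line
        yield event, json.loads("\n".join(data))


def collect_text(lines):
    """Concatenate the text carried by content_block_delta events."""
    parts = []
    for event, payload in iter_sse_events(lines):
        if event == "content_block_delta":
            # Assumption: deltas look like {"delta": {"type": "text_delta", "text": ...}}
            parts.append(payload.get("delta", {}).get("text", ""))
    return "".join(parts)


sample = [
    "event: message_start\n",
    'data: {"type": "message_start"}\n',
    "\n",
    "event: content_block_delta\n",
    'data: {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "Once "}}\n',
    "\n",
    "event: content_block_delta\n",
    'data: {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "upon a time"}}\n',
    "\n",
    "event: message_stop\n",
    'data: {"type": "message_stop"}\n',
    "\n",
]
print(collect_text(sample))
```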

	#### Response Format

	```json
	{
	"id": "msg_abc123",
	"type": "message",
	"role": "assistant",
	"content": [
	{
	"type": "text",
	"text": "This is a test!"
	}
	],
	"model": "ggml-koala-7b-model-q4_0-r2.bin",
	"stop_reason": "end_turn",
	"usage": {
	"input_tokens": 10,
	"output_tokens": 5
	}
	}
	```
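
A response of this shape can be reduced to plain text by joining its `text` blocks. This is a small sketch working on the example response above; `response_text` is an illustrative name, not a LocalAI function:

```python
def response_text(message):
    """Join the text of every `text` content block in a Messages response."""
    return "".join(
        block.get("text", "")
        for block in message.get("content", [])
        if block.get("type") == "text"
    )


response = {
    "id": "msg_abc123",
    "type": "message",
    "role": "assistant",
    "content": [{"type": "text", "text": "This is a test!"}],
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "stop_reason": "end_turn",
    "usage": {"input_tokens": 10, "output_tokens": 5},
}

text = response_text(response)
total_tokens = response["usage"]["input_tokens"] + response["usage"]["output_tokens"]
```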

	### Open Responses API

	LocalAI supports the Open Responses API specification, which provides a standardized interface for AI model interactions with support for background processing, streaming, tool calling, and advanced features like reasoning.

	Endpoint: `POST /v1/responses` or `POST /responses`

	Reference: https://www.openresponses.org/specification

	#### Basic Usage

	```bash
	curl http://localhost:8080/v1/responses \
	-H "Content-Type: application/json" \
	-d '{
	"model": "ggml-koala-7b-model-q4_0-r2.bin",
	"input": "Say this is a test!",
	"max_output_tokens": 1024
	}'
	```

	#### Request Parameters

	| Parameter | Type | Required | Description |
	|-----------|------|----------|-------------|
	| `model` | string | Yes | The model identifier |
	| `input` | string/array | Yes | Input text or array of input items |
	| `max_output_tokens` | integer | No | Maximum number of tokens to generate |
	| `temperature` | float | No | Sampling temperature |
	| `top_p` | float | No | Nucleus sampling parameter |
	| `instructions` | string | No | System instructions |
	| `tools` | array | No | Array of tool definitions |
	| `tool_choice` | string/object | No | Tool choice: "auto", "required", "none", or specific tool |
	| `stream` | boolean | No | Enable streaming responses |
	| `background` | boolean | No | Run request in background (returns immediately) |
	| `store` | boolean | No | Whether to store the response |
	| `reasoning` | object | No | Reasoning configuration with `effort` and `summary` |
	| `parallel_tool_calls` | boolean | No | Allow parallel tool calls |
	| `max_tool_calls` | integer | No | Maximum number of tool calls |
	| `presence_penalty` | float | No | Presence penalty (-2.0 to 2.0) |
	| `frequency_penalty` | float | No | Frequency penalty (-2.0 to 2.0) |
	| `top_logprobs` | integer | No | Number of top logprobs to return |
	| `truncation` | string | No | Truncation mode: "auto" or "disabled" |
	| `text_format` | object | No | Text format configuration |
	| `metadata` | object | No | Custom metadata |

	#### Input Format

	Input can be a simple string or an array of structured items:

	```bash
	curl http://localhost:8080/v1/responses \
	-H "Content-Type: application/json" \
	-d '{
	"model": "ggml-koala-7b-model-q4_0-r2.bin",
	"input": [
	{
	"type": "message",
	"role": "user",
	"content": "What is the weather?"
	}
	],
	"max_output_tokens": 1024
	}'
	```

	#### Background Processing

	Run requests in the background for long-running tasks:

	```bash
	curl http://localhost:8080/v1/responses \
	-H "Content-Type: application/json" \
	-d '{
	"model": "ggml-koala-7b-model-q4_0-r2.bin",
	"input": "Generate a long story",
	"max_output_tokens": 4096,
	"background": true
	}'
	```

	The response will include a response ID that can be used to poll for completion:

	```json
	{
	"id": "resp_abc123",
	"object": "response",
	"status": "in_progress",
	"created_at": 1234567890
	}
	```

	#### Retrieving Background Responses

	Use the GET endpoint to retrieve background responses:

	```bash
	# Get response by ID
	curl http://localhost:8080/v1/responses/resp_abc123

	# Resume streaming with query parameters
	curl "http://localhost:8080/v1/responses/resp_abc123?stream=true&starting_after=10"
	```
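
Polling for completion can be wrapped in a small loop. The sketch below is one possible approach, assuming the server runs on `localhost:8080` and that a response is finished once its `status` is neither `in_progress` nor `queued`. Only `in_progress` and `completed` appear in this document; `queued` is an assumption. The function names are illustrative:

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: same host/port as the curl examples
PENDING = {"in_progress", "queued"}  # "queued" is an assumed extra pending state


def fetch_response(response_id):
    """GET /v1/responses/{id} and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(f"{BASE_URL}/v1/responses/{response_id}") as resp:
        return json.load(resp)


def wait_for_response(response_id, fetch=fetch_response, interval=1.0, timeout=60.0):
    """Poll until the response leaves a pending state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        response = fetch(response_id)
        if response.get("status") not in PENDING:
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{response_id} still pending after {timeout}s")
        time.sleep(interval)


# Usage against a live server:
# final = wait_for_response("resp_abc123")
```

The `fetch` parameter exists so the loop can be exercised with a stub instead of a live server.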

	#### Canceling Background Responses

	Cancel a background response that's still in progress:

	```bash
	curl -X POST http://localhost:8080/v1/responses/resp_abc123/cancel
	```

	#### Tool Calling

	Open Responses API supports function calling with tools:

	```bash
	curl http://localhost:8080/v1/responses \
	-H "Content-Type: application/json" \
	-d '{
	"model": "ggml-koala-7b-model-q4_0-r2.bin",
	"input": "What is the weather in San Francisco?",
	"tools": [
	{
	"type": "function",
	"name": "get_weather",
	"description": "Get the current weather",
	"parameters": {
	"type": "object",
	"properties": {
	"location": {
	"type": "string",
	"description": "The city and state"
	}
	},
	"required": ["location"]
	}
	}
	],
	"tool_choice": "auto",
	"max_output_tokens": 1024
	}'
	```

	#### Reasoning Configuration

	Configure reasoning effort and summary style:

	```bash
	curl http://localhost:8080/v1/responses \
	-H "Content-Type: application/json" \
	-d '{
	"model": "ggml-koala-7b-model-q4_0-r2.bin",
	"input": "Solve this complex problem step by step",
	"reasoning": {
	"effort": "high",
	"summary": "detailed"
	},
	"max_output_tokens": 2048
	}'
	```

	#### Response Format

	```json
	{
	"id": "resp_abc123",
	"object": "response",
	"created_at": 1234567890,
	"completed_at": 1234567895,
	"status": "completed",
	"model": "ggml-koala-7b-model-q4_0-r2.bin",
	"output": [
	{
	"type": "message",
	"id": "msg_001",
	"role": "assistant",
	"content": [
	{
	"type": "output_text",
	"text": "This is a test!",
	"annotations": [],
	"logprobs": []
	}
	],
	"status": "completed"
	}
	],
	"error": null,
	"incomplete_details": null,
	"temperature": 0.7,
	"top_p": 1.0,
	"presence_penalty": 0.0,
	"frequency_penalty": 0.0,
	"usage": {
	"input_tokens": 10,
	"output_tokens": 5,
	"total_tokens": 15,
	"input_tokens_details": {
	"cached_tokens": 0
	},
	"output_tokens_details": {
	"reasoning_tokens": 0
	}
	}
	}
	```
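
The generated text sits two levels down, inside `output[].content[]`. As a sketch, it can be collected like this. `output_text` is an illustrative helper name, and the sample is a trimmed copy of the response above:

```python
def output_text(response):
    """Concatenate every `output_text` block found in message output items."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip non-message items such as tool calls
        for block in item.get("content", []):
            if block.get("type") == "output_text":
                parts.append(block.get("text", ""))
    return "".join(parts)


sample_response = {
    "id": "resp_abc123",
    "object": "response",
    "status": "completed",
    "output": [
        {
            "type": "message",
            "id": "msg_001",
            "role": "assistant",
            "content": [
                {"type": "output_text", "text": "This is a test!", "annotations": [], "logprobs": []}
            ],
            "status": "completed",
        }
    ],
    "usage": {"input_tokens": 10, "output_tokens": 5, "total_tokens": 15},
}
print(output_text(sample_response))
```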

	## Backends

	### RWKV

	RWKV support is available through `llama.cpp` (see below).

	### llama.cpp

	[llama.cpp](https://github.com/ggerganov/llama.cpp) is a popular port of Facebook's LLaMA model in C/C++.

	{{% notice note %}}

	The `ggml` file format has been deprecated. If you are using `ggml` models and configuring your model with a YAML file, use a LocalAI version older than v2.25.0. For `gguf` models, use the `llama` backend. The Go backend is deprecated as well, but is still available as `go-llama`.

	{{% /notice %}}

	#### Features

	The `llama.cpp` model supports the following features:
	- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
	- [🧠 Embeddings]({{%relref "features/embeddings" %}})
	- [🔥 OpenAI functions]({{%relref "features/openai-functions" %}})
	- [✍️ Constrained grammars]({{%relref "features/constrained_grammars" %}})

	#### Setup

	LocalAI supports `llama.cpp` models out of the box. You can use the `llama.cpp` model in the same way as any other model.

	##### Manual setup

	It is sufficient to copy the `ggml` or `gguf` model files into the `models` folder. You can then refer to the model by its file name in the `model` parameter of the API calls.

	[You can optionally create an associated YAML]({{%relref "advanced" %}}) model config file to tune the model's parameters or apply a template to the prompt.

	Prompt templates are useful for models that are fine-tuned towards a specific prompt.

	##### Automatic setup

	LocalAI supports model galleries, which are indexes of models. For instance, the Hugging Face gallery contains a large curated index of `ggml` and `gguf` models from the Hugging Face model hub.

	If you have the galleries enabled and LocalAI is already running, you can start chatting with models from Hugging Face by running:

	```bash
	curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
	"model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
	"messages": [{"role": "user", "content": "Say this is a test!"}],
	"temperature": 0.1
	}'
	```

	LocalAI will automatically download and configure the model in the `models` directory.

	Models can also be preloaded or downloaded on demand. To learn about model galleries, check out the [model gallery documentation]({{%relref "features/model-gallery" %}}).

	#### YAML configuration

	To use the `llama.cpp` backend, specify `llama-cpp` as the backend in the YAML file:

	```yaml
	name: llama
	backend: llama-cpp
	parameters:
	  # Relative to the models path
	  model: file.gguf
	```

	#### Backend Options

	The `llama.cpp` backend supports additional configuration options that can be specified in the `options` field of your model YAML configuration. These options allow fine-tuning of the backend behavior:

	| Option | Type | Description | Example |
	|--------|------|-------------|---------|
	| `use_jinja` or `jinja` | boolean | Enable Jinja2 template processing for chat templates. When enabled, the backend uses Jinja2-based chat templates from the model for formatting messages. | `use_jinja:true` |
	| `context_shift` | boolean | Enable context shifting, which allows the model to dynamically adjust context window usage. | `context_shift:true` |
	| `cache_ram` | integer | Set the maximum RAM cache size in MiB for KV cache. Use `-1` for unlimited (default). | `cache_ram:2048` |
	| `parallel` or `n_parallel` | integer | Enable parallel request processing. When set to a value greater than 1, enables continuous batching for handling multiple requests concurrently. | `parallel:4` |
	| `grpc_servers` or `rpc_servers` | string | Comma-separated list of gRPC server addresses for distributed inference. Allows distributing workload across multiple llama.cpp workers. | `grpc_servers:localhost:50051,localhost:50052` |
	| `fit_params` or `fit` | boolean | Enable auto-adjustment of model/context parameters to fit available device memory. Default: `true`. | `fit_params:true` |
	| `fit_params_target` or `fit_target` | integer | Target margin per device in MiB when using fit_params. Default: `1024` (1GB). | `fit_target:2048` |
	| `fit_params_min_ctx` or `fit_ctx` | integer | Minimum context size that can be set by fit_params. Default: `4096`. | `fit_ctx:2048` |
	| `n_cache_reuse` or `cache_reuse` | integer | Minimum chunk size to attempt reusing from the cache via KV shifting. Default: `0` (disabled). | `cache_reuse:256` |
	| `slot_prompt_similarity` or `sps` | float | How much the prompt of a request must match the prompt of a slot to use that slot. Default: `0.1`. Set to `0` to disable. | `sps:0.5` |
	| `swa_full` | boolean | Use full-size SWA (Sliding Window Attention) cache. Default: `false`. | `swa_full:true` |
	| `cont_batching` or `continuous_batching` | boolean | Enable continuous batching for handling multiple sequences. Default: `true`. | `cont_batching:true` |
	| `check_tensors` | boolean | Validate tensor data for invalid values during model loading. Default: `false`. | `check_tensors:true` |
	| `warmup` | boolean | Enable warmup run after model loading. Default: `true`. | `warmup:false` |
	| `no_op_offload` | boolean | Disable offloading host tensor operations to device. Default: `false`. | `no_op_offload:true` |
	| `kv_unified` or `unified_kv` | boolean | Enable unified KV cache. Default: `false`. | `kv_unified:true` |
	| `n_ctx_checkpoints` or `ctx_checkpoints` | integer | Maximum number of context checkpoints per slot. Default: `8`. | `ctx_checkpoints:4` |

	Example configuration with options:

	```yaml
	name: llama-model
	backend: llama
	parameters:
	  model: model.gguf
	options:
	- use_jinja:true
	- context_shift:true
	- cache_ram:4096
	- parallel:2
	- fit_params:true
	- fit_target:1024
	- slot_prompt_similarity:0.5
	```

	Note: The `parallel` option can also be set via the `LLAMACPP_PARALLEL` environment variable, and `grpc_servers` can be set via the `LLAMACPP_GRPC_SERVERS` environment variable. Options specified in the YAML file take precedence over environment variables.
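
Each entry in `options` is a single `key:value` string, and the value may itself contain colons (as in the `grpc_servers` example). The following sketch illustrates how such entries read by splitting on the first colon only. It is an illustration of the format, not the parser LocalAI uses, and `parse_options` is a made-up name:

```python
def parse_options(options):
    """Split `key:value` option strings into a dict of raw string values."""
    parsed = {}
    for option in options:
        # Split on the first colon only, so values such as
        # "localhost:50051,localhost:50052" keep their own colons.
        key, _, value = option.partition(":")
        parsed[key.strip()] = value.strip()
    return parsed


options = parse_options([
    "use_jinja:true",
    "cache_ram:4096",
    "grpc_servers:localhost:50051,localhost:50052",
])
servers = options["grpc_servers"].split(",")
```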

	#### Reference

	- [llama](https://github.com/ggerganov/llama.cpp)


	### vLLM

	[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference.

	LocalAI has a built-in integration with vLLM, and it can be used to run models. You can check out `vllm` performance [here](https://github.com/vllm-project/vllm#performance).

	#### Setup

	Create a YAML file for the model you want to use with `vllm`.

	To set up a model, specify the model name in the YAML config file:
	```yaml
	name: vllm
	backend: vllm
	parameters:
	  model: "facebook/opt-125m"
	```

	The backend will automatically download the required files in order to run the model.


	#### Usage

	Use the `completions` endpoint and pass the name of the model configured above (`vllm`):
	```bash
	curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
	"model": "vllm",
	"prompt": "Hello, my name is",
	"temperature": 0.1, "top_p": 0.1
	}'
	```

	### Transformers

	[Transformers](https://huggingface.co/docs/transformers/index) is a state-of-the-art machine learning library for PyTorch, TensorFlow, and JAX.

	LocalAI has a built-in integration with Transformers, and it can be used to run models.

	This is an extra backend. It is already available in the container images (the `extra` images already contain the Python dependencies for Transformers), so no additional setup is needed.

	#### Setup

	Create a YAML file for the model you want to use with `transformers`.

	To set up a model, specify the model name in the YAML config file:
	```yaml
	name: transformers
	backend: transformers
	parameters:
	  model: "facebook/opt-125m"
	type: AutoModelForCausalLM
	quantization: bnb_4bit # One of: bnb_8bit, bnb_4bit, xpu_4bit, xpu_8bit (optional)
	```

	The backend will automatically download the required files in order to run the model.

	#### Parameters

	##### Type

	| Type | Description |
	| --- | --- |
	| `AutoModelForCausalLM` | A model that can be used to generate sequences. Use it for NVIDIA CUDA and for Intel GPU with Intel Extensions for PyTorch acceleration |
	| `OVModelForCausalLM` | For Intel CPU/GPU/NPU OpenVINO Text Generation models |
	| `OVModelForFeatureExtraction` | For Intel CPU/GPU/NPU OpenVINO Embedding acceleration |
	| N/A | Defaults to `AutoModel` |

	- `OVModelForCausalLM` requires OpenVINO IR [Text Generation](https://huggingface.co/models?library=openvino&pipeline_tag=text-generation) models from Hugging Face
	- `OVModelForFeatureExtraction` works with any Safetensors Transformer [Feature Extraction](https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers,safetensors) model from Hugging Face (Embedding Model)

	Please note that streaming is currently not implemented in `AutoModelForCausalLM` for Intel GPU.
	AMD GPU support is not implemented.
	Although AMD CPU is not officially supported by OpenVINO, there are reports that it works: YMMV.

	##### Embeddings
	Use `embeddings: true` if the model is an embedding model.

	##### Inference device selection
	The Transformers backend tries to automatically select the best device for inference. You can override this choice manually with the `main_gpu` parameter.

	| Inference Engine | Applicable Values |
	| --- | --- |
	| CUDA | `cuda`, `cuda.X` where X is the GPU device like in `nvidia-smi -L` output |
	| OpenVINO | Any applicable value from [Inference Modes](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html) like `AUTO`,`CPU`,`GPU`,`NPU`,`MULTI`,`HETERO` |

	Example for CUDA:
	`main_gpu: cuda.0`

	Example for OpenVINO:
	`main_gpu: AUTO:-CPU`

	This parameter applies to both Text Generation and Feature Extraction (i.e. Embeddings) models.

	##### Inference Precision
	The Transformers backend automatically selects the fastest applicable inference precision according to device support.
	With the CUDA backend, you can manually enable bfloat16, if your hardware supports it, with the following parameter:

	`f16: true`

	##### Quantization

	| Quantization | Description |
	| --- | --- |
	| `bnb_8bit` | 8-bit quantization |
	| `bnb_4bit` | 4-bit quantization |
	| `xpu_8bit` | 8-bit quantization for Intel XPUs |
	| `xpu_4bit` | 4-bit quantization for Intel XPUs |

	##### Trust Remote Code
	Some models, such as Microsoft Phi-3, require external code beyond what the Transformers library provides.
	This is disabled by default for security.
	It can be enabled manually with:
	`trust_remote_code: true`

	##### Maximum Context Size
	The maximum context size, in tokens, can be specified with the `context_size` parameter. Do not use values higher than what your model supports.

	Usage example:
	`context_size: 8192`

	##### Auto Prompt Template
	The chat template is usually defined by the model author in the `tokenizer_config.json` file.
	To enable it, set `use_tokenizer_template: true` in the `template` section.

	Usage example:
	```yaml
	template:
	  use_tokenizer_template: true
	```

	##### Custom Stop Words
	Stop words are usually defined in the `tokenizer_config.json` file.
	They can be overridden with the `stopwords` parameter when needed, as with the llama3-Instruct model.

	Usage example:
	```yaml
	stopwords:
	- "<|eot_id|>"
	- "<|end_of_text|>"
	```

	#### Usage

	Use the `completions` endpoint and pass the name of the model configured above (`transformers`):
	```bash
	curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
	"model": "transformers",
	"prompt": "Hello, my name is",
	"temperature": 0.1, "top_p": 0.1
	}'
	```

	#### Examples

	##### OpenVINO

	A model configuration file for OpenVINO and the Starling model:

	```yaml
	name: starling-openvino
	backend: transformers
	parameters:
	  model: fakezeta/Starling-LM-7B-beta-openvino-int8
	context_size: 8192
	threads: 6
	f16: true
	type: OVModelForCausalLM
	stopwords:
	- <|end_of_turn|>
	- <|endoftext|>
	prompt_cache_path: "cache"
	prompt_cache_all: true
	template:
	  chat_message: |
	    {{if eq .RoleName "system"}}{{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "assistant"}}<|end_of_turn|>GPT4 Correct Assistant: {{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "user"}}GPT4 Correct User: {{.Content}}{{end}}

	  chat: |
	    {{.Input}}<|end_of_turn|>GPT4 Correct Assistant:

	  completion: |
	    {{.Input}}
	```