Subsections of Advanced

Advanced usage

Model Configuration with YAML Files

LocalAI uses YAML configuration files to define model parameters, templates, and behavior. You can create individual YAML files in the models directory or use a single configuration file with multiple models.

Quick Example:

name: gpt-3.5-turbo
parameters:
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  temperature: 0.3

context_size: 512
threads: 10
backend: llama-stable

template:
  completion: completion
  chat: chat

For a complete reference of all available configuration options, see the Model Configuration page.

Configuration File Locations:

  1. Individual files: Create .yaml files in your models directory (e.g., models/gpt-3.5-turbo.yaml)
  2. Single config file: Use --models-config-file or LOCALAI_MODELS_CONFIG_FILE to specify a file containing multiple models
  3. Remote URLs: Specify a URL to a YAML configuration file at startup:
    local-ai run github://mudler/LocalAI/examples/configurations/phi-2.yaml@master

See also chatbot-ui as an example on how to use config files.

Prompt templates

The API doesn’t inject a default prompt for talking to the model. You have to use a prompt similar to what’s described in the stanford-alpaca docs: https://github.com/tatsu-lab/stanford_alpaca#data-release.

You can use a default template for every model present in your model path, by creating a corresponding file with the `.tmpl` suffix next to your model. For instance, if the model is called `foo.bin`, you can create a sibling file, `foo.bin.tmpl` which will be used as a default prompt and can be used with alpaca:
The below instruction describes a task. Write a response that appropriately completes the request.

### Instruction:
{{.Input}}

### Response:

See the prompt-templates directory in this repository for templates for some of the most popular models.

For the edit endpoint, an example template for alpaca-based models can be:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{{.Instruction}}

### Input:
{{.Input}}

### Response:
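
As a rough illustration of how this template is filled in, a call to the OpenAI-compatible edit endpoint could look like the sketch below (the /v1/edits path, field names, and the gpt-3.5-turbo model name follow the OpenAI edits API and the configuration examples on this page; adjust them to your setup):

curl http://localhost:8080/v1/edits \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "instruction": "Fix the spelling mistakes",
    "input": "What day of the wek is it?"
  }'

The instruction field is rendered into {{.Instruction}} and the input field into {{.Input}}.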

Install models using the API

Instead of installing models manually, you can use the LocalAI API endpoints and a model definition to install models programmatically at runtime.

A curated collection of model files is in the model-gallery. The model gallery files are different from the model files used to configure LocalAI models: they contain information about the model setup and the files necessary to run the model locally.

To install, for example, lunademo, you can send a POST call to the /models/apply endpoint with the model definition URL (url) or gallery identifier (id), and the name the model should have in LocalAI (name, optional):

curl --location 'http://localhost:8080/models/apply' \
--header 'Content-Type: application/json' \
--data-raw '{
    "id": "TheBloke/Luna-AI-Llama2-Uncensored-GGML/luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin",
    "name": "lunademo"
}'
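
The call returns a job UUID that can be used to track the installation progress. In recent LocalAI versions the status can be polled via the jobs endpoint, roughly as follows (replace the placeholder with the UUID returned by the apply call; the exact response fields may vary):

curl http://localhost:8080/models/jobs/<uuid-returned-by-the-apply-call>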

Preloading models during startup

In order to allow the API to start up with all the needed models on first start, the model gallery files can be used during startup.

PRELOAD_MODELS='[{"url": "https://raw.githubusercontent.com/go-skynet/model-gallery/main/gpt4all-j.yaml","name": "gpt4all-j"}]' local-ai

PRELOAD_MODELS (or --preload-models) takes a JSON list with the same parameters as the API calls to the /models/apply endpoint.

Similarly, a path to a YAML configuration file containing a list of models can be specified with PRELOAD_MODELS_CONFIG (or --preload-models-config):

- url: https://raw.githubusercontent.com/go-skynet/model-gallery/main/gpt4all-j.yaml
  name: gpt4all-j
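
Assuming the list above is saved as preload.yaml (an illustrative file name), it can be passed at startup with:

local-ai --preload-models-config preload.yaml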

Automatic prompt caching

LocalAI can automatically cache prompts for faster loading of the prompt. This can be useful if your model needs a prompt template with prefixed text in the prompt before the input.

To enable prompt caching, you can control the settings in the model config YAML file:

prompt_cache_path: "cache"
prompt_cache_all: true

prompt_cache_path is relative to the models folder. You can specify a name for the cache file; it will be created automatically during the first load if prompt_cache_all is set to true.

Configuring a specific backend for the model

By default, LocalAI will try to autoload the model by trying all the backends. This might work for most models, but some backends are NOT configured to autoload.

The available backends are listed in the model compatibility table.

In order to specify a backend for your models, create a model config file in your models directory specifying the backend:

name: gpt-3.5-turbo

parameters:
  # Relative to the models path
  model: ...

backend: llama-stable

Connect external backends

LocalAI backends are internally implemented using gRPC services. This also allows LocalAI to connect to external gRPC services on start and extend LocalAI functionalities via third-party binaries.

The --external-grpc-backends parameter in the CLI can be used either to specify a local backend (a file) or a remote URL. The syntax is <BACKEND_NAME>:<BACKEND_URI>. Once LocalAI is started with it, the new backend name will be available for all the API endpoints.

So for instance, to register a new backend which is a local file:

./local-ai --debug --external-grpc-backends "my-awesome-backend:/path/to/my/backend.py"

Or a remote URI:

./local-ai --debug --external-grpc-backends "my-awesome-backend:host:port"

For example, to start vllm manually after compiling LocalAI (assuming you run the command from the root of the repository):

./local-ai --external-grpc-backends "vllm:$PWD/backend/python/vllm/run.sh"

Note that it is first necessary to create the environment with:

make -C backend/python/vllm

Environment variables

When LocalAI runs in a container, there are additional environment variables available that modify the behavior of LocalAI on startup:

| Environment variable | Default | Description |
|---|---|---|
| REBUILD | false | Rebuild LocalAI on startup |
| BUILD_TYPE | | Build type. Available: cublas, openblas, clblas, intel (intel core), sycl_f16, sycl_f32 (intel backends) |
| GO_TAGS | | Go tags. Available: stablediffusion |
| HUGGINGFACEHUB_API_TOKEN | | Special token for interacting with HuggingFace Inference API, required only when using the langchain-huggingface backend |
| EXTRA_BACKENDS | | A space separated list of backends to prepare. For example EXTRA_BACKENDS="backend/python/diffusers backend/python/transformers" prepares the python environment on start |
| DISABLE_AUTODETECT | false | Disable autodetect of CPU flagset on start |
| LLAMACPP_GRPC_SERVERS | | A list of llama.cpp workers to distribute the workload. For example LLAMACPP_GRPC_SERVERS="address1:port,address2:port" |

Here is how to configure these variables:

docker run --env REBUILD=true localai
docker run --env-file .env localai

CLI Parameters

For a complete reference of all CLI parameters, environment variables, and command-line options, see the CLI Reference page.

You can control LocalAI with command line arguments to specify a binding address, number of threads, model paths, and many other options. Any command line parameter can be specified via an environment variable.

.env files

Any settings being provided by an Environment Variable can also be provided from within .env files. There are several locations that will be checked for relevant .env files. In order of precedence they are:

  • .env within the current directory
  • localai.env within the current directory
  • localai.env within the home directory
  • .config/localai.env within the home directory
  • /etc/localai.env

Environment variables within files earlier in the list will take precedence over environment variables defined in files later in the list.

An example .env file is:

LOCALAI_THREADS=10
LOCALAI_MODELS_PATH=/mnt/storage/localai/models
LOCALAI_F16=true

Request headers

You can set the ‘Extra-Usage’ request header (‘Extra-Usage: true’) to receive inference timings in milliseconds, extending the default OpenAI response model in the usage field:

...
{
  "id": "...",
  "created": ...,
  "model": "...",
  "choices": [
    {
      ...
    },
    ...
  ],
  "object": "...",
  "usage": {
    "prompt_tokens": ...,
    "completion_tokens": ...,
    "total_tokens": ...,
    // Extra-Usage header key will include these two float fields:
    "timing_prompt_processing: ...,
    "timing_token_generation": ...,
  },
}
...
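
A request that opts into the extended usage data is a plain OpenAI-style call with the extra header, for example (the model name is illustrative):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Extra-Usage: true" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello"}]
  }'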

Extra backends

LocalAI can be extended with extra backends. The backends are implemented as gRPC services and can be written in any language. See the backend section for more details on how to install and build new backends for LocalAI.

At runtime

When using the -core container image it is possible to prepare the Python backends you are interested in by using the EXTRA_BACKENDS variable, for instance:

docker run --env EXTRA_BACKENDS="backend/python/diffusers" quay.io/go-skynet/local-ai:master

Concurrent requests

LocalAI supports parallel requests for the backends that support it. For instance, vLLM and llama.cpp support parallel requests, and thus LocalAI allows running multiple requests in parallel.

In order to enable parallel requests, you have to pass --parallel-requests or set the PARALLEL_REQUEST environment variable to true.

The environment variables that tweak parallelism are the following:

### Python backends GRPC max workers
### Default number of workers for GRPC Python backends.
### This actually controls whether a backend can process multiple requests or not.
# PYTHON_GRPC_MAX_WORKERS=1

### Define the number of parallel LLAMA.cpp workers (Defaults to 1)
# LLAMACPP_PARALLEL=1

### Enable to run parallel requests
# LOCALAI_PARALLEL_REQUESTS=true

Note that for llama.cpp you need to set LLAMACPP_PARALLEL to the number of parallel processes your GPU/CPU can handle. For Python-based backends (like vLLM) you can set PYTHON_GRPC_MAX_WORKERS to the number of parallel requests.
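
Putting it together, a startup sketch that enables parallel requests for both llama.cpp and Python backends might look like the following (the worker counts are illustrative):

LLAMACPP_PARALLEL=4 PYTHON_GRPC_MAX_WORKERS=4 ./local-ai --parallel-requests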

VRAM and Memory Management

For detailed information on managing VRAM when running multiple models, see the dedicated VRAM and Memory Management page.

Disable CPU flagset auto detection in llama.cpp

LocalAI will automatically discover the CPU flagset available in your host and will use the most optimized version of the backends.

If you want to disable this behavior, you can set DISABLE_AUTODETECT to true in the environment variables.
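
For example:

DISABLE_AUTODETECT=true ./local-ai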

Fine-tuning LLMs for text generation

Note

Section under construction

This section covers how to fine-tune a language model for text generation and consume it in LocalAI.

Open In Colab

Requirements

For this example you will need a GPU with at least 12GB of VRAM and a Linux box.

Fine-tuning

Fine-tuning a language model is a process that requires a lot of computational power and time.

Currently LocalAI doesn’t support a fine-tuning endpoint, but there are plans to support that. For the time being, a guide is proposed here to give a simple starting point on how to fine-tune a model and use it with LocalAI (but also with llama.cpp).

There is an e2e example of fine-tuning an LLM model to use with LocalAI written by @mudler available here.

The steps involved are:

  • Preparing a dataset
  • Prepare the environment and install dependencies
  • Fine-tune the model
  • Merge the Lora base with the model
  • Convert the model to gguf
  • Use the model with LocalAI

Dataset preparation

We are going to need a dataset or a set of datasets.

Axolotl supports a variety of formats. In the notebook and in this example we aim for a very simple dataset built manually, so we are going to use the completion format, which requires the full text to be used for fine-tuning.

A dataset for an instruction-following model (like Alpaca) can look like the following:

[
 {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
 },
 {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
 }
]

Each text entry is the whole text that is used for fine-tuning. For example, for an instruction-following model it follows this format (more or less):

<System prompt>

## Instruction

<Question, instruction>

## Response

<Expected response from the LLM>

The instruction format works as follows: when we run inference with the model, we feed it only the first part, up to the ## Instruction block, and the model completes the text with the ## Response block.

Prepare a dataset, and upload it to your Google Drive if you are using the Google Colab. Otherwise place it next to the axolotl.yaml file as dataset.json.

Install dependencies

git clone https://github.com/OpenAccess-AI-Collective/axolotl && pushd axolotl && git checkout 797f3dd1de8fd8c0eafbd1c9fdb172abd9ff840a && popd #0.3.0
pip install packaging
pushd axolotl && pip install -e '.[flash-attn,deepspeed]' && popd

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.0/flash_attn-2.3.0+cu117torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Configure accelerate:

accelerate config default

Fine-tuning

We will need to configure axolotl. In this example an axolotl.yaml file that uses openllama-3b for fine-tuning is provided. Copy the axolotl.yaml file and edit it to your needs. The dataset needs to be next to it as dataset.json. You can find the axolotl.yaml file here.

If you have a big dataset, you can pre-tokenize it to speed up the fine-tuning process:

python -m axolotl.cli.preprocess axolotl.yaml

Now we are ready to start the fine-tuning process:

accelerate launch -m axolotl.cli.train axolotl.yaml

After we have finished the fine-tuning, we merge the Lora base with the model:

python3 -m axolotl.cli.merge_lora axolotl.yaml --lora_model_dir="./qlora-out" --load_in_8bit=False --load_in_4bit=False

And we convert it to the gguf format that LocalAI can consume:

git clone https://github.com/ggerganov/llama.cpp.git
pushd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release && popd

pushd llama.cpp && python3 convert_hf_to_gguf.py ../qlora-out/merged && popd

pushd llama.cpp/build/bin &&  ./llama-quantize ../../../qlora-out/merged/Merged-33B-F16.gguf \
    ../../../custom-model-q4_0.gguf q4_0

Now you should have ended up with a custom-model-q4_0.gguf file that you can copy into the LocalAI models directory and use with LocalAI.
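
As a minimal sketch of consuming the result (assuming a default setup where a model file can be referenced directly by its file name, without a dedicated YAML config):

cp custom-model-q4_0.gguf models/
./local-ai --models-path ./models
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "custom-model-q4_0.gguf", "prompt": "Write a poem about a tree."}'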

VRAM and Memory Management

When running multiple models in LocalAI, especially on systems with limited GPU memory (VRAM), you may encounter situations where loading a new model fails because there isn’t enough available VRAM. LocalAI provides two mechanisms to automatically manage model memory allocation and prevent VRAM exhaustion.

The Problem

By default, LocalAI keeps models loaded in memory once they’re first used. This means:

  • If you load a large model that uses most of your VRAM, subsequent requests for other models may fail
  • Models remain in memory even when not actively being used
  • There’s no automatic mechanism to unload models to make room for new ones, unless done manually via the web interface

This is a common issue when working with GPU-accelerated models, as VRAM is typically more limited than system RAM. For more context, see issues #6068, #7269, and #5352.

Solution 1: Single Active Backend

The simplest approach is to ensure only one model is loaded at a time. When a new model is requested, LocalAI will automatically unload the currently active model before loading the new one.

Configuration

./local-ai --single-active-backend

LOCALAI_SINGLE_ACTIVE_BACKEND=true ./local-ai

Use cases

  • Single GPU systems with limited VRAM
  • When you only need one model active at a time
  • Simple deployments where model switching is acceptable

Example

LOCALAI_SINGLE_ACTIVE_BACKEND=true ./local-ai

curl http://localhost:8080/v1/chat/completions -d '{"model": "model-a", ...}'

curl http://localhost:8080/v1/chat/completions -d '{"model": "model-b", ...}'

Solution 2: Watchdog Mechanisms

For more flexible memory management, LocalAI provides watchdog mechanisms that automatically unload models based on their activity state. This allows multiple models to be loaded simultaneously, but automatically frees memory when models become inactive or stuck.

Idle Watchdog

The idle watchdog monitors models that haven’t been used for a specified period and automatically unloads them to free VRAM.

Configuration

LOCALAI_WATCHDOG_IDLE=true ./local-ai

LOCALAI_WATCHDOG_IDLE=true LOCALAI_WATCHDOG_IDLE_TIMEOUT=10m ./local-ai

./local-ai --enable-watchdog-idle --watchdog-idle-timeout=10m

Busy Watchdog

The busy watchdog monitors models that have been processing requests for an unusually long time and terminates them if they exceed a threshold. This is useful for detecting and recovering from stuck or hung backends.

Configuration

LOCALAI_WATCHDOG_BUSY=true ./local-ai

LOCALAI_WATCHDOG_BUSY=true LOCALAI_WATCHDOG_BUSY_TIMEOUT=10m ./local-ai

./local-ai --enable-watchdog-busy --watchdog-busy-timeout=10m

Combined Configuration

You can enable both watchdogs simultaneously for comprehensive memory management:

LOCALAI_WATCHDOG_IDLE=true \
LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m \
LOCALAI_WATCHDOG_BUSY=true \
LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m \
./local-ai

Or using command line flags:

./local-ai \
  --enable-watchdog-idle --watchdog-idle-timeout=15m \
  --enable-watchdog-busy --watchdog-busy-timeout=5m

Use cases

  • Multi-model deployments where different models may be used intermittently
  • Systems where you want to keep frequently-used models loaded but free memory from unused ones
  • Recovery from stuck or hung backend processes
  • Production environments requiring automatic resource management

Example

LOCALAI_WATCHDOG_IDLE=true \
LOCALAI_WATCHDOG_IDLE_TIMEOUT=10m \
LOCALAI_WATCHDOG_BUSY=true \
LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m \
./local-ai

curl http://localhost:8080/v1/chat/completions -d '{"model": "model-a", ...}'
curl http://localhost:8080/v1/chat/completions -d '{"model": "model-b", ...}'

Timeout Format

Timeouts can be specified using Go’s duration format:

  • 15m - 15 minutes
  • 1h - 1 hour
  • 30s - 30 seconds
  • 2h30m - 2 hours and 30 minutes

Limitations and Considerations

VRAM Usage Estimation

LocalAI cannot reliably estimate the VRAM usage of a model before loading it, across the different backends (llama.cpp, vLLM, diffusers, etc.), because:

  • Different backends report memory usage differently
  • VRAM requirements vary based on model architecture, quantization, and configuration
  • Some backends may not expose memory usage information before loading the model

Manual Management

If automatic management doesn’t meet your needs, you can manually stop models using the LocalAI management API:

curl -X POST http://localhost:8080/backend/shutdown \
  -H "Content-Type: application/json" \
  -d '{"model": "model-name"}'

To stop all models, you’ll need to call the endpoint for each loaded model individually, or use the web UI to stop all models at once.
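
A small shell sketch for the "stop everything" case, assuming the OpenAI-compatible /v1/models listing is available and jq is installed (note that this iterates over all configured models, not only the currently loaded ones):

for m in $(curl -s http://localhost:8080/v1/models | jq -r '.data[].id'); do
  curl -s -X POST http://localhost:8080/backend/shutdown \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$m\"}"
done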

Best Practices

  1. Monitor VRAM usage: Use nvidia-smi (for NVIDIA GPUs) or similar tools to monitor actual VRAM usage
  2. Start with single active backend: For single-GPU systems, --single-active-backend is often the simplest solution
  3. Tune watchdog timeouts: Adjust timeouts based on your usage patterns - shorter timeouts free memory faster but may cause more frequent reloads
  4. Consider model size: Ensure your VRAM can accommodate at least one of your largest models
  5. Use quantization: Smaller quantized models use less VRAM and allow more flexibility

Model Configuration

LocalAI uses YAML configuration files to define model parameters, templates, and behavior. This page provides a complete reference for all available configuration options.

Overview

Model configuration files allow you to:

  • Define default parameters (temperature, top_p, etc.)
  • Configure prompt templates
  • Specify backend settings
  • Set up function calling
  • Configure GPU and memory options
  • And much more

Configuration File Locations

You can create model configuration files in several ways:

  1. Individual YAML files in the models directory (e.g., models/gpt-3.5-turbo.yaml)
  2. Single config file with multiple models using --models-config-file or LOCALAI_MODELS_CONFIG_FILE
  3. Remote URLs - specify a URL to a YAML configuration file at startup

Example: Basic Configuration

name: gpt-3.5-turbo
parameters:
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  temperature: 0.3

context_size: 512
threads: 10
backend: llama-stable

template:
  completion: completion
  chat: chat

Example: Multiple Models in One File

When using --models-config-file, you can define multiple models as a list:

- name: model1
  parameters:
    model: model1.bin
  context_size: 512
  backend: llama-stable

- name: model2
  parameters:
    model: model2.bin
  context_size: 1024
  backend: llama-stable
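
Assuming the list above is saved as models.yaml (an illustrative file name), LocalAI can be started against it with either the flag or the environment variable:

local-ai run --models-config-file models.yaml
LOCALAI_MODELS_CONFIG_FILE=models.yaml local-ai run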

Core Configuration Fields

Basic Model Settings

| Field | Type | Description | Example |
|---|---|---|---|
| name | string | Model name, used to identify the model in API calls | gpt-3.5-turbo |
| backend | string | Backend to use (e.g. llama-cpp, vllm, diffusers, whisper) | llama-cpp |
| description | string | Human-readable description of the model | A conversational AI model |
| usage | string | Usage instructions or notes | Best for general conversation |

Model File and Downloads

| Field | Type | Description |
|---|---|---|
| parameters.model | string | Path to the model file (relative to models directory) or URL |
| download_files | array | List of files to download. Each entry has filename, uri, and optional sha256 |

Example:

parameters:
  model: my-model.gguf

download_files:
  - filename: my-model.gguf
    uri: https://example.com/model.gguf
    sha256: abc123...

Parameters Section

The parameters section contains all OpenAI-compatible request parameters and model-specific options.

OpenAI-Compatible Parameters

These settings will be used as defaults for all the API calls to the model.

| Field | Type | Default | Description |
|---|---|---|---|
| temperature | float | 0.9 | Sampling temperature (0.0-2.0). Higher values make output more random |
| top_p | float | 0.95 | Nucleus sampling: consider tokens with top_p probability mass |
| top_k | int | 40 | Consider only the top K most likely tokens |
| max_tokens | int | 0 | Maximum number of tokens to generate (0 = unlimited) |
| frequency_penalty | float | 0.0 | Penalty for token frequency (-2.0 to 2.0) |
| presence_penalty | float | 0.0 | Penalty for token presence (-2.0 to 2.0) |
| repeat_penalty | float | 1.1 | Penalty for repeating tokens |
| repeat_last_n | int | 64 | Number of previous tokens to consider for repeat penalty |
| seed | int | -1 | Random seed (omit for random) |
| echo | bool | false | Echo back the prompt in the response |
| n | int | 1 | Number of completions to generate |
| logprobs | bool/int | false | Return log probabilities of tokens |
| top_logprobs | int | 0 | Number of top logprobs to return per token (0-20) |
| logit_bias | map | {} | Map of token IDs to bias values (-100 to 100) |
| typical_p | float | 1.0 | Typical sampling parameter |
| tfz | float | 1.0 | Tail free z parameter |
| keep | int | 0 | Number of tokens to keep from the prompt |
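
These are only defaults: the same fields can still be sent per request, and a value supplied in the request should take precedence over the configured default. A quick sketch overriding the configured temperature (the model name is illustrative):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.1
  }'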

Language and Translation

| Field | Type | Description |
|---|---|---|
| language | string | Language code for transcription/translation |
| translate | bool | Whether to translate audio transcription |

Custom Parameters

| Field | Type | Description |
|---|---|---|
| batch | int | Batch size for processing |
| ignore_eos | bool | Ignore end-of-sequence tokens |
| negative_prompt | string | Negative prompt for image generation |
| rope_freq_base | float32 | RoPE frequency base |
| rope_freq_scale | float32 | RoPE frequency scale |
| negative_prompt_scale | float32 | Scale for negative prompt |
| tokenizer | string | Tokenizer to use (RWKV) |

LLM Configuration

These settings apply to most LLM backends (llama.cpp, vLLM, etc.):

Performance Settings

| Field | Type | Default | Description |
|---|---|---|---|
| threads | int | processor count | Number of threads for parallel computation |
| context_size | int | 512 | Maximum context size (number of tokens) |
| f16 | bool | false | Enable 16-bit floating point precision (GPU acceleration) |
| gpu_layers | int | 0 | Number of layers to offload to GPU (0 = CPU only) |

Memory Management

| Field | Type | Default | Description |
|---|---|---|---|
| mmap | bool | true | Use memory mapping for model loading (faster, less RAM) |
| mmlock | bool | false | Lock model in memory (prevents swapping) |
| low_vram | bool | false | Use minimal VRAM mode |
| no_kv_offloading | bool | false | Disable KV cache offloading |

GPU Configuration

| Field | Type | Description |
|---|---|---|
| tensor_split | string | Comma-separated GPU memory allocation (e.g., "0.8,0.2" for 80%/20%) |
| main_gpu | string | Main GPU identifier for multi-GPU setups |
| cuda | bool | Explicitly enable/disable CUDA |

Sampling and Generation

| Field | Type | Default | Description |
|---|---|---|---|
| mirostat | int | 0 | Mirostat sampling mode (0=disabled, 1=Mirostat, 2=Mirostat 2.0) |
| mirostat_tau | float | 5.0 | Mirostat target entropy |
| mirostat_eta | float | 0.1 | Mirostat learning rate |

LoRA Configuration

| Field | Type | Description |
|---|---|---|
| lora_adapter | string | Path to LoRA adapter file |
| lora_base | string | Base model for LoRA |
| lora_scale | float32 | LoRA scale factor |
| lora_adapters | array | Multiple LoRA adapters |
| lora_scales | array | Scales for multiple LoRA adapters |

Advanced Options

| Field | Type | Description |
|---|---|---|
| no_mulmatq | bool | Disable matrix multiplication queuing |
| draft_model | string | Draft model for speculative decoding |
| n_draft | int32 | Number of draft tokens |
| quantization | string | Quantization format |
| load_format | string | Model load format |
| numa | bool | Enable NUMA (Non-Uniform Memory Access) |
| rms_norm_eps | float32 | RMS normalization epsilon |
| ngqa | int32 | Grouped-query attention (GQA) factor |
| rope_scaling | string | RoPE scaling configuration |
| type | string | Model type/architecture |
| grammar | string | Grammar file path for constrained generation |

YARN Configuration

YARN (Yet Another RoPE extensioN) settings for context extension:

| Field | Type | Description |
|---|---|---|
| yarn_ext_factor | float32 | YARN extension factor |
| yarn_attn_factor | float32 | YARN attention factor |
| yarn_beta_fast | float32 | YARN beta fast parameter |
| yarn_beta_slow | float32 | YARN beta slow parameter |

Prompt Caching

| Field | Type | Description |
|---|---|---|
| prompt_cache_path | string | Path to store prompt cache (relative to models directory) |
| prompt_cache_all | bool | Cache all prompts automatically |
| prompt_cache_ro | bool | Read-only prompt cache |

Text Processing

| Field | Type | Description |
|---|---|---|
| stopwords | array | Words or phrases that stop generation |
| cutstrings | array | Strings to cut from responses |
| trimspace | array | Strings to trim whitespace from |
| trimsuffix | array | Suffixes to trim from responses |
| extract_regex | array | Regular expressions to extract content |

System Prompt

| Field | Type | Description |
|---|---|---|
| system_prompt | string | Default system prompt for the model |

vLLM-Specific Configuration

These options apply when using the vllm backend:

| Field | Type | Description |
|---|---|---|
| gpu_memory_utilization | float32 | GPU memory utilization (0.0-1.0, default 0.9) |
| trust_remote_code | bool | Trust and execute remote code |
| enforce_eager | bool | Force eager execution mode |
| swap_space | int | Swap space in GB |
| max_model_len | int | Maximum model length |
| tensor_parallel_size | int | Tensor parallelism size |
| disable_log_stats | bool | Disable logging statistics |
| dtype | string | Data type (e.g., float16, bfloat16) |
| flash_attention | string | Flash attention configuration |
| cache_type_k | string | Key cache type |
| cache_type_v | string | Value cache type |
| limit_mm_per_prompt | object | Limit multimodal content per prompt: {image: int, video: int, audio: int} |

Template Configuration

Templates use Go templates with Sprig functions.

| Field | Type | Description |
|---|---|---|
| template.chat | string | Template for chat completion endpoint |
| template.chat_message | string | Template for individual chat messages |
| template.completion | string | Template for text completion |
| template.edit | string | Template for edit operations |
| template.function | string | Template for function/tool calls |
| template.multimodal | string | Template for multimodal interactions |
| template.reply_prefix | string | Prefix to add to model replies |
| template.use_tokenizer_template | bool | Use tokenizer’s built-in template (vLLM/transformers) |
| template.join_chat_messages_by_character | string | Character to join chat messages (default: \n) |

Template Variables

Templating supports sprig functions.

Following are common variables available in templates:

  • {{.Input}} - User input
  • {{.Instruction}} - Instruction for edit operations
  • {{.System}} - System message
  • {{.Prompt}} - Full prompt
  • {{.Functions}} - Function definitions (for function calling)
  • {{.FunctionCall}} - Function call result

Example Template

template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}{{end}}
    {{if eq .Role "assistant"}}Assistant: {{.Content}}{{end}}
    {{end}}
    Assistant:

Function Calling Configuration

Configure how the model handles function/tool calls:

| Field | Type | Default | Description |
|---|---|---|---|
| function.disable_no_action | bool | false | Disable the no-action behavior |
| function.no_action_function_name | string | answer | Name of the no-action function |
| function.no_action_description_name | string | | Description for no-action function |
| function.function_name_key | string | name | JSON key for function name |
| function.function_arguments_key | string | arguments | JSON key for function arguments |
| function.response_regex | array | | Named regex patterns to extract function calls |
| function.argument_regex | array | | Named regex to extract function arguments |
| function.argument_regex_key_name | string | key | Named regex capture for argument key |
| function.argument_regex_value_name | string | value | Named regex capture for argument value |
| function.json_regex_match | array | | Regex patterns to match JSON in tool mode |
| function.replace_function_results | array | | Replace function call results with patterns |
| function.replace_llm_results | array | | Replace LLM results with patterns |
| function.capture_llm_results | array | | Capture LLM results as text (e.g., for “thinking” blocks) |

Grammar Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| function.grammar.disable | bool | false | Completely disable grammar enforcement |
| function.grammar.parallel_calls | bool | false | Allow parallel function calls |
| function.grammar.mixed_mode | bool | false | Allow mixed-mode grammar enforcing |
| function.grammar.no_mixed_free_string | bool | false | Disallow free strings in mixed mode |
| function.grammar.disable_parallel_new_lines | bool | false | Disable parallel processing for new lines |
| function.grammar.prefix | string | | Prefix to add before grammar rules |
| function.grammar.expect_strings_after_json | bool | false | Expect strings after JSON data |

Diffusers Configuration

For image generation models using the diffusers backend:

| Field | Type | Description |
|---|---|---|
| diffusers.cuda | bool | Enable CUDA for diffusers |
| diffusers.pipeline_type | string | Pipeline type (e.g., stable-diffusion, stable-diffusion-xl) |
| diffusers.scheduler_type | string | Scheduler type (e.g., euler, ddpm) |
| diffusers.enable_parameters | string | Comma-separated parameters to enable |
| diffusers.cfg_scale | float32 | Classifier-free guidance scale |
| diffusers.img2img | bool | Enable image-to-image transformation |
| diffusers.clip_skip | int | Number of CLIP layers to skip |
| diffusers.clip_model | string | CLIP model to use |
| diffusers.clip_subfolder | string | CLIP model subfolder |
| diffusers.control_net | string | ControlNet model to use |
| step | int | Number of diffusion steps |

TTS Configuration

For text-to-speech models:

| Field | Type | Description |
|---|---|---|
| tts.voice | string | Voice file path or voice ID |
| tts.audio_path | string | Path to audio files (for Vall-E) |

Roles Configuration

Map conversation roles to specific strings:

roles:
  user: "### Instruction:"
  assistant: "### Response:"
  system: "### System Instruction:"

Feature Flags

Enable or disable experimental features:

feature_flags:
  feature_name: true
  another_feature: false

MCP Configuration

Model Context Protocol (MCP) configuration:

| Field | Type | Description |
|---|---|---|
| mcp.remote | string | YAML string defining remote MCP servers |
| mcp.stdio | string | YAML string defining STDIO MCP servers |

Agent Configuration

Agent/autonomous agent configuration:

| Field | Type | Description |
|---|---|---|
| agent.max_attempts | int | Maximum number of attempts |
| agent.max_iterations | int | Maximum number of iterations |
| agent.enable_reasoning | bool | Enable reasoning capabilities |
| agent.enable_planning | bool | Enable planning capabilities |
| agent.enable_mcp_prompts | bool | Enable MCP prompts |
| agent.enable_plan_re_evaluator | bool | Enable plan re-evaluation |

Pipeline Configuration

Define pipelines for audio-to-audio processing:

| Field | Type | Description |
|---|---|---|
| pipeline.tts | string | TTS model name |
| pipeline.llm | string | LLM model name |
| pipeline.transcription | string | Transcription model name |
| pipeline.vad | string | Voice activity detection model name |

gRPC Configuration

Backend gRPC communication settings:

| Field | Type | Description |
|---|---|---|
| grpc.attempts | int | Number of retry attempts |
| grpc.attempts_sleep_time | int | Sleep time between retries (seconds) |

Overrides

Override model configuration values at runtime (llama.cpp):

overrides:
  - "qwen3moe.expert_used_count=int:10"
  - "some_key=string:value"

Format: KEY=TYPE:VALUE where TYPE is int, float, string, or bool.

Known Use Cases

Specify which endpoints this model supports:

known_usecases:
  - chat
  - completion
  - embeddings

Available flags: chat, completion, edit, embeddings, rerank, image, transcript, tts, sound_generation, tokenize, vad, video, detection, llm (combination of CHAT, COMPLETION, EDIT).

Complete Example

Here’s a comprehensive example combining many options:

name: my-llm-model
description: A high-performance LLM model
backend: llama-stable

parameters:
  model: my-model.gguf
  temperature: 0.7
  top_p: 0.9
  top_k: 40
  max_tokens: 2048

context_size: 4096
threads: 8
f16: true
gpu_layers: 35

system_prompt: "You are a helpful AI assistant."

template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}
    {{else if eq .Role "assistant"}}Assistant: {{.Content}}
    {{end}}
    {{end}}
    Assistant:

roles:
  user: "User:"
  assistant: "Assistant:"
  system: "System:"

stopwords:
  - "\n\nUser:"
  - "\n\nHuman:"

prompt_cache_path: "cache/my-model"
prompt_cache_all: true

function:
  grammar:
    parallel_calls: true
    mixed_mode: false

feature_flags:
  experimental_feature: true