Subsections of Advanced

Advanced usage

Model Configuration with YAML Files

LocalAI uses YAML configuration files to define model parameters, templates, and behavior. You can create individual YAML files in the models directory or use a single configuration file with multiple models.

Quick Example:

name: gpt-3.5-turbo
parameters:
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  temperature: 0.3

context_size: 512
threads: 10
backend: llama-stable

template:
  completion: completion
  chat: chat

For a complete reference of all available configuration options, see the Model Configuration page.

Configuration File Locations:

  1. Individual files: Create .yaml files in your models directory (e.g., models/gpt-3.5-turbo.yaml)
  2. Single config file: Use --models-config-file or LOCALAI_MODELS_CONFIG_FILE to specify a file containing multiple models
  3. Remote URLs: Specify a URL to a YAML configuration file at startup:
    local-ai run github://mudler/LocalAI/examples/configurations/phi-2.yaml@master

See also chatbot-ui as an example on how to use config files.

Prompt templates

The API doesn’t inject a default prompt for talking to the model. You have to use a prompt similar to what’s described in the stanford-alpaca docs: https://github.com/tatsu-lab/stanford_alpaca#data-release.

You can use a default template for every model present in your model path, by creating a corresponding file with the `.tmpl` suffix next to your model. For instance, if the model is called `foo.bin`, you can create a sibling file, `foo.bin.tmpl` which will be used as a default prompt and can be used with alpaca:
The below instruction describes a task. Write a response that appropriately completes the request.

### Instruction:
{{.Input}}

### Response:

See the prompt-templates directory in this repository for templates for some of the most popular models.

For the edit endpoint, an example template for alpaca-based models can be:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{{.Instruction}}

### Input:
{{.Input}}

### Response:
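
As a rough illustration of how this template is filled in, a call to the OpenAI-compatible edit endpoint could look like the sketch below (the /v1/edits path, field names, and the gpt-3.5-turbo model name follow the OpenAI edits API and the configuration examples on this page; adjust them to your setup):

curl http://localhost:8080/v1/edits \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "instruction": "Fix the spelling mistakes",
    "input": "What day of the wek is it?"
  }'

The instruction field is rendered into {{.Instruction}} and the input field into {{.Input}}.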

Install models using the API

Instead of installing models manually, you can use the LocalAI API endpoints and a model definition to install models programmatically at runtime.

A curated collection of model files is in the model-gallery. The model gallery files are different from the model files used to configure LocalAI models: they contain information about the model setup and the files necessary to run the model locally.

To install, for example, lunademo, you can send a POST call to the /models/apply endpoint with the model definition URL (url) or gallery identifier (id), and the name the model should have in LocalAI (name, optional):

curl --location 'http://localhost:8080/models/apply' \
--header 'Content-Type: application/json' \
--data-raw '{
    "id": "TheBloke/Luna-AI-Llama2-Uncensored-GGML/luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin",
    "name": "lunademo"
}'
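
The call returns a job UUID that can be used to track the installation progress. In recent LocalAI versions the status can be polled via the jobs endpoint, roughly as follows (replace the placeholder with the UUID returned by the apply call; the exact response fields may vary):

curl http://localhost:8080/models/jobs/<uuid-returned-by-the-apply-call>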

Preloading models during startup

In order to allow the API to start up with all the needed models on first start, the model gallery files can be used during startup.

PRELOAD_MODELS='[{"url": "https://raw.githubusercontent.com/go-skynet/model-gallery/main/gpt4all-j.yaml","name": "gpt4all-j"}]' local-ai

PRELOAD_MODELS (or --preload-models) takes a JSON list with the same parameters as the API calls to the /models/apply endpoint.

Similarly, a path to a YAML configuration file containing a list of models can be specified with PRELOAD_MODELS_CONFIG (or --preload-models-config):

- url: https://raw.githubusercontent.com/go-skynet/model-gallery/main/gpt4all-j.yaml
  name: gpt4all-j
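
Assuming the list above is saved as preload.yaml (an illustrative file name), it can be passed at startup with:

local-ai --preload-models-config preload.yaml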

Automatic prompt caching

LocalAI can automatically cache prompts for faster loading of the prompt. This can be useful if your model needs a prompt template with prefixed text in the prompt before the input.

To enable prompt caching, you can control the settings in the model config YAML file:

prompt_cache_path: "cache"
prompt_cache_all: true

prompt_cache_path is relative to the models folder. You can specify a name for the cache file; it will be created automatically during the first load if prompt_cache_all is set to true.

Configuring a specific backend for the model

By default, LocalAI will try to autoload the model by trying all the backends. This might work for most models, but some backends are NOT configured to autoload.

The available backends are listed in the model compatibility table.

In order to specify a backend for your models, create a model config file in your models directory specifying the backend:

name: gpt-3.5-turbo

parameters:
  # Relative to the models path
  model: ...

backend: llama-stable

Connect external backends

LocalAI backends are internally implemented using gRPC services. This also allows LocalAI to connect to external gRPC services on start and extend LocalAI functionalities via third-party binaries.

The --external-grpc-backends parameter in the CLI can be used either to specify a local backend (a file) or a remote URL. The syntax is <BACKEND_NAME>:<BACKEND_URI>. Once LocalAI is started with it, the new backend name will be available for all the API endpoints.

So for instance, to register a new backend which is a local file:

./local-ai --debug --external-grpc-backends "my-awesome-backend:/path/to/my/backend.py"

Or a remote URI:

./local-ai --debug --external-grpc-backends "my-awesome-backend:host:port"

For example, to start vllm manually after compiling LocalAI (assuming you run the command from the root of the repository):

./local-ai --external-grpc-backends "vllm:$PWD/backend/python/vllm/run.sh"

Note that it is first necessary to create the environment with:

make -C backend/python/vllm

Environment variables

When LocalAI runs in a container, there are additional environment variables available that modify the behavior of LocalAI on startup:

| Environment variable | Default | Description |
|---|---|---|
| REBUILD | false | Rebuild LocalAI on startup |
| BUILD_TYPE | | Build type. Available: cublas, openblas, clblas, intel (intel core), sycl_f16, sycl_f32 (intel backends) |
| GO_TAGS | | Go tags. Available: stablediffusion |
| HUGGINGFACEHUB_API_TOKEN | | Special token for interacting with HuggingFace Inference API, required only when using the langchain-huggingface backend |
| EXTRA_BACKENDS | | A space separated list of backends to prepare. For example EXTRA_BACKENDS="backend/python/diffusers backend/python/transformers" prepares the python environment on start |
| DISABLE_AUTODETECT | false | Disable autodetect of CPU flagset on start |
| LLAMACPP_GRPC_SERVERS | | A list of llama.cpp workers to distribute the workload. For example LLAMACPP_GRPC_SERVERS="address1:port,address2:port" |

Here is how to configure these variables:

docker run --env REBUILD=true localai
docker run --env-file .env localai

CLI Parameters

For a complete reference of all CLI parameters, environment variables, and command-line options, see the CLI Reference page.

You can control LocalAI with command line arguments to specify a binding address, number of threads, model paths, and many other options. Any command line parameter can be specified via an environment variable.

.env files

Any settings being provided by an Environment Variable can also be provided from within .env files. There are several locations that will be checked for relevant .env files. In order of precedence they are:

  • .env within the current directory
  • localai.env within the current directory
  • localai.env within the home directory
  • .config/localai.env within the home directory
  • /etc/localai.env

Environment variables within files earlier in the list will take precedence over environment variables defined in files later in the list.

An example .env file is:

LOCALAI_THREADS=10
LOCALAI_MODELS_PATH=/mnt/storage/localai/models
LOCALAI_F16=true

Request headers

You can set the ‘Extra-Usage’ request header (‘Extra-Usage: true’) to receive inference timings in milliseconds, extending the default OpenAI response model in the usage field:

...
{
  "id": "...",
  "created": ...,
  "model": "...",
  "choices": [
    {
      ...
    },
    ...
  ],
  "object": "...",
  "usage": {
    "prompt_tokens": ...,
    "completion_tokens": ...,
    "total_tokens": ...,
    // Extra-Usage header key will include these two float fields:
    "timing_prompt_processing: ...,
    "timing_token_generation": ...,
  },
}
...
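
A request that opts into the extended usage data is a plain OpenAI-style call with the extra header, for example (the model name is illustrative):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Extra-Usage: true" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello"}]
  }'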

Extra backends

LocalAI can be extended with extra backends. The backends are implemented as gRPC services and can be written in any language. See the backend section for more details on how to install and build new backends for LocalAI.

At runtime

When using the -core container image it is possible to prepare the Python backends you are interested in by using the EXTRA_BACKENDS variable, for instance:

docker run --env EXTRA_BACKENDS="backend/python/diffusers" quay.io/go-skynet/local-ai:master

Concurrent requests

LocalAI supports parallel requests for the backends that support it. For instance, vLLM and llama.cpp support parallel requests, and thus LocalAI allows running multiple requests in parallel.

In order to enable parallel requests, you have to pass --parallel-requests or set the PARALLEL_REQUEST environment variable to true.

The environment variables that tweak parallelism are the following:

### Python backends GRPC max workers
### Default number of workers for GRPC Python backends.
### This actually controls whether a backend can process multiple requests or not.
# PYTHON_GRPC_MAX_WORKERS=1

### Define the number of parallel LLAMA.cpp workers (Defaults to 1)
# LLAMACPP_PARALLEL=1

### Enable to run parallel requests
# LOCALAI_PARALLEL_REQUESTS=true

Note that for llama.cpp you need to set LLAMACPP_PARALLEL to the number of parallel processes your GPU/CPU can handle. For Python-based backends (like vLLM) you can set PYTHON_GRPC_MAX_WORKERS to the number of parallel requests.
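
Putting it together, a startup sketch that enables parallel requests for both llama.cpp and Python backends might look like the following (the worker counts are illustrative):

LLAMACPP_PARALLEL=4 PYTHON_GRPC_MAX_WORKERS=4 ./local-ai --parallel-requests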

VRAM and Memory Management

For detailed information on managing VRAM when running multiple models, see the dedicated VRAM and Memory Management page.

Disable CPU flagset auto detection in llama.cpp

LocalAI will automatically discover the CPU flagset available in your host and will use the most optimized version of the backends.

If you want to disable this behavior, you can set DISABLE_AUTODETECT to true in the environment variables.
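
For example:

DISABLE_AUTODETECT=true ./local-ai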

Fine-tuning LLMs for text generation

Note

Section under construction

This section covers how to fine-tune a language model for text generation and consume it in LocalAI.

Open In Colab

Requirements

For this example you will need a GPU with at least 12GB of VRAM and a Linux box.

Fine-tuning

Fine-tuning a language model is a process that requires a lot of computational power and time.

Currently LocalAI doesn’t support a fine-tuning endpoint, but there are plans to support that. For the time being, a guide is proposed here to give a simple starting point on how to fine-tune a model and use it with LocalAI (but also with llama.cpp).

There is an e2e example of fine-tuning an LLM model to use with LocalAI written by @mudler available here.

The steps involved are:

  • Preparing a dataset
  • Prepare the environment and install dependencies
  • Fine-tune the model
  • Merge the Lora base with the model
  • Convert the model to gguf
  • Use the model with LocalAI

Dataset preparation

We are going to need a dataset or a set of datasets.

Axolotl supports a variety of formats. In the notebook and in this example we aim for a very simple dataset built manually, so we are going to use the completion format, which requires the full text to be used for fine-tuning.

A dataset for an instruction-following model (like Alpaca) can look like the following:

[
 {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
 },
 {
    "text": "As an AI language model you are trained to reply to an instruction. Try to be as much polite as possible\n\n## Instruction\n\nWrite a poem about a tree.\n\n## Response\n\nTrees are beautiful, ..."
 }
]

Each text entry is the whole text that is used for fine-tuning. For example, for an instruction-following model it follows this format (more or less):

<System prompt>

## Instruction

<Question, instruction>

## Response

<Expected response from the LLM>

The instruction format works as follows: when we run inference with the model, we feed it only the first part, up to the ## Instruction block, and the model completes the text with the ## Response block.

Prepare a dataset, and upload it to your Google Drive if you are using the Google Colab. Otherwise place it next to the axolotl.yaml file as dataset.json.

Install dependencies

git clone https://github.com/OpenAccess-AI-Collective/axolotl && pushd axolotl && git checkout 797f3dd1de8fd8c0eafbd1c9fdb172abd9ff840a && popd #0.3.0
pip install packaging
pushd axolotl && pip install -e '.[flash-attn,deepspeed]' && popd

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.0/flash_attn-2.3.0+cu117torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Configure accelerate:

accelerate config default

Fine-tuning

We will need to configure axolotl. In this example an axolotl.yaml file that uses openllama-3b for fine-tuning is provided. Copy the axolotl.yaml file and edit it to your needs. The dataset needs to be next to it as dataset.json. You can find the axolotl.yaml file here.

If you have a big dataset, you can pre-tokenize it to speed up the fine-tuning process:

python -m axolotl.cli.preprocess axolotl.yaml

Now we are ready to start the fine-tuning process:

accelerate launch -m axolotl.cli.train axolotl.yaml

After we have finished the fine-tuning, we merge the Lora base with the model:

python3 -m axolotl.cli.merge_lora axolotl.yaml --lora_model_dir="./qlora-out" --load_in_8bit=False --load_in_4bit=False

And we convert it to the gguf format that LocalAI can consume:

git clone https://github.com/ggerganov/llama.cpp.git
pushd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release && popd

pushd llama.cpp && python3 convert_hf_to_gguf.py ../qlora-out/merged && popd

pushd llama.cpp/build/bin &&  ./llama-quantize ../../../qlora-out/merged/Merged-33B-F16.gguf \
    ../../../custom-model-q4_0.gguf q4_0

Now you should have ended up with a custom-model-q4_0.gguf file that you can copy into the LocalAI models directory and use with LocalAI.
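
As a minimal sketch of consuming the result (assuming a default setup where a model file can be referenced directly by its file name, without a dedicated YAML config):

cp custom-model-q4_0.gguf models/
./local-ai --models-path ./models
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "custom-model-q4_0.gguf", "prompt": "Write a poem about a tree."}'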

VRAM and Memory Management

When running multiple models in LocalAI, especially on systems with limited GPU memory (VRAM), you may encounter situations where loading a new model fails because there isn’t enough available VRAM. LocalAI provides two mechanisms to automatically manage model memory allocation and prevent VRAM exhaustion.

The Problem

By default, LocalAI keeps models loaded in memory once they’re first used. This means:

  • If you load a large model that uses most of your VRAM, subsequent requests for other models may fail
  • Models remain in memory even when not actively being used
  • There’s no automatic mechanism to unload models to make room for new ones, unless done manually via the web interface

This is a common issue when working with GPU-accelerated models, as VRAM is typically more limited than system RAM. For more context, see issues #6068, #7269, and #5352.

Solution 1: Single Active Backend

The simplest approach is to ensure only one model is loaded at a time. When a new model is requested, LocalAI will automatically unload the currently active model before loading the new one.

Configuration

./local-ai --single-active-backend

LOCALAI_SINGLE_ACTIVE_BACKEND=true ./local-ai

Use cases

  • Single GPU systems with limited VRAM
  • When you only need one model active at a time
  • Simple deployments where model switching is acceptable

Example

LOCALAI_SINGLE_ACTIVE_BACKEND=true ./local-ai

curl http://localhost:8080/v1/chat/completions -d '{"model": "model-a", ...}'

curl http://localhost:8080/v1/chat/completions -d '{"model": "model-b", ...}'

Solution 2: Watchdog Mechanisms

For more flexible memory management, LocalAI provides watchdog mechanisms that automatically unload models based on their activity state. This allows multiple models to be loaded simultaneously, but automatically frees memory when models become inactive or stuck.

Idle Watchdog

The idle watchdog monitors models that haven’t been used for a specified period and automatically unloads them to free VRAM.

Configuration

LOCALAI_WATCHDOG_IDLE=true ./local-ai

LOCALAI_WATCHDOG_IDLE=true LOCALAI_WATCHDOG_IDLE_TIMEOUT=10m ./local-ai

./local-ai --enable-watchdog-idle --watchdog-idle-timeout=10m

Busy Watchdog

The busy watchdog monitors models that have been processing requests for an unusually long time and terminates them if they exceed a threshold. This is useful for detecting and recovering from stuck or hung backends.

Configuration

LOCALAI_WATCHDOG_BUSY=true ./local-ai

LOCALAI_WATCHDOG_BUSY=true LOCALAI_WATCHDOG_BUSY_TIMEOUT=10m ./local-ai

./local-ai --enable-watchdog-busy --watchdog-busy-timeout=10m

Combined Configuration

You can enable both watchdogs simultaneously for comprehensive memory management:

LOCALAI_WATCHDOG_IDLE=true \
LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m \
LOCALAI_WATCHDOG_BUSY=true \
LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m \
./local-ai

Or using command line flags:

./local-ai \
  --enable-watchdog-idle --watchdog-idle-timeout=15m \
  --enable-watchdog-busy --watchdog-busy-timeout=5m

Use cases

  • Multi-model deployments where different models may be used intermittently
  • Systems where you want to keep frequently-used models loaded but free memory from unused ones
  • Recovery from stuck or hung backend processes
  • Production environments requiring automatic resource management

Example

LOCALAI_WATCHDOG_IDLE=true \
LOCALAI_WATCHDOG_IDLE_TIMEOUT=10m \
LOCALAI_WATCHDOG_BUSY=true \
LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m \
./local-ai

curl http://localhost:8080/v1/chat/completions -d '{"model": "model-a", ...}'
curl http://localhost:8080/v1/chat/completions -d '{"model": "model-b", ...}'

Timeout Format

Timeouts can be specified using Go’s duration format:

  • 15m - 15 minutes
  • 1h - 1 hour
  • 30s - 30 seconds
  • 2h30m - 2 hours and 30 minutes

Limitations and Considerations

VRAM Usage Estimation

LocalAI cannot reliably estimate the VRAM usage of a model before loading it, across the different backends (llama.cpp, vLLM, diffusers, etc.), because:

  • Different backends report memory usage differently
  • VRAM requirements vary based on model architecture, quantization, and configuration
  • Some backends may not expose memory usage information before loading the model

Manual Management

If automatic management doesn’t meet your needs, you can manually stop models using the LocalAI management API:

curl -X POST http://localhost:8080/backend/shutdown \
  -H "Content-Type: application/json" \
  -d '{"model": "model-name"}'

To stop all models, you’ll need to call the endpoint for each loaded model individually, or use the web UI to stop all models at once.
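
A small shell sketch for the "stop everything" case, assuming the OpenAI-compatible /v1/models listing is available and jq is installed (note that this iterates over all configured models, not only the currently loaded ones):

for m in $(curl -s http://localhost:8080/v1/models | jq -r '.data[].id'); do
  curl -s -X POST http://localhost:8080/backend/shutdown \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$m\"}"
done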

Best Practices

  1. Monitor VRAM usage: Use nvidia-smi (for NVIDIA GPUs) or similar tools to monitor actual VRAM usage
  2. Start with single active backend: For single-GPU systems, --single-active-backend is often the simplest solution
  3. Tune watchdog timeouts: Adjust timeouts based on your usage patterns - shorter timeouts free memory faster but may cause more frequent reloads
  4. Consider model size: Ensure your VRAM can accommodate at least one of your largest models
  5. Use quantization: Smaller quantized models use less VRAM and allow more flexibility

Model Configuration

LocalAI uses YAML configuration files to define model parameters, templates, and behavior. This page provides a complete reference for all available configuration options.

Overview

Model configuration files allow you to:

  • Define default parameters (temperature, top_p, etc.)
  • Configure prompt templates
  • Specify backend settings
  • Set up function calling
  • Configure GPU and memory options
  • And much more

Configuration File Locations

You can create model configuration files in several ways:

  1. Individual YAML files in the models directory (e.g., models/gpt-3.5-turbo.yaml)
  2. Single config file with multiple models using --models-config-file or LOCALAI_MODELS_CONFIG_FILE
  3. Remote URLs - specify a URL to a YAML configuration file at startup

Example: Basic Configuration

name: gpt-3.5-turbo
parameters:
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  temperature: 0.3

context_size: 512
threads: 10
backend: llama-stable

template:
  completion: completion
  chat: chat

Example: Multiple Models in One File

When using --models-config-file, you can define multiple models as a list:

- name: model1
  parameters:
    model: model1.bin
  context_size: 512
  backend: llama-stable

- name: model2
  parameters:
    model: model2.bin
  context_size: 1024
  backend: llama-stable
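
Assuming the list above is saved as models.yaml (an illustrative file name), LocalAI can be started against it with either the flag or the environment variable:

local-ai run --models-config-file models.yaml
LOCALAI_MODELS_CONFIG_FILE=models.yaml local-ai run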

Core Configuration Fields

Basic Model Settings

| Field | Type | Description | Example |
|---|---|---|---|
| name | string | Model name, used to identify the model in API calls | gpt-3.5-turbo |
| backend | string | Backend to use (e.g. llama-cpp, vllm, diffusers, whisper) | llama-cpp |
| description | string | Human-readable description of the model | A conversational AI model |
| usage | string | Usage instructions or notes | Best for general conversation |

Model File and Downloads

| Field | Type | Description |
|---|---|---|
| parameters.model | string | Path to the model file (relative to models directory) or URL |
| download_files | array | List of files to download. Each entry has filename, uri, and optional sha256 |

Example:

parameters:
  model: my-model.gguf

download_files:
  - filename: my-model.gguf
    uri: https://example.com/model.gguf
    sha256: abc123...

Parameters Section

The parameters section contains all OpenAI-compatible request parameters and model-specific options.

OpenAI-Compatible Parameters

These settings will be used as defaults for all the API calls to the model.

| Field | Type | Default | Description |
|---|---|---|---|
| temperature | float | 0.9 | Sampling temperature (0.0-2.0). Higher values make output more random |
| top_p | float | 0.95 | Nucleus sampling: consider tokens with top_p probability mass |
| top_k | int | 40 | Consider only the top K most likely tokens |
| max_tokens | int | 0 | Maximum number of tokens to generate (0 = unlimited) |
| frequency_penalty | float | 0.0 | Penalty for token frequency (-2.0 to 2.0) |
| presence_penalty | float | 0.0 | Penalty for token presence (-2.0 to 2.0) |
| repeat_penalty | float | 1.1 | Penalty for repeating tokens |
| repeat_last_n | int | 64 | Number of previous tokens to consider for repeat penalty |
| seed | int | -1 | Random seed (omit for random) |
| echo | bool | false | Echo back the prompt in the response |
| n | int | 1 | Number of completions to generate |
| logprobs | bool/int | false | Return log probabilities of tokens |
| top_logprobs | int | 0 | Number of top logprobs to return per token (0-20) |
| logit_bias | map | {} | Map of token IDs to bias values (-100 to 100) |
| typical_p | float | 1.0 | Typical sampling parameter |
| tfz | float | 1.0 | Tail free z parameter |
| keep | int | 0 | Number of tokens to keep from the prompt |
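
These are only defaults: the same fields can still be sent per request, and a value supplied in the request should take precedence over the configured default. A quick sketch overriding the configured temperature (the model name is illustrative):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.1
  }'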

Language and Translation

| Field | Type | Description |
|---|---|---|
| language | string | Language code for transcription/translation |
| translate | bool | Whether to translate audio transcription |

Custom Parameters

| Field | Type | Description |
|---|---|---|
| batch | int | Batch size for processing |
| ignore_eos | bool | Ignore end-of-sequence tokens |
| negative_prompt | string | Negative prompt for image generation |
| rope_freq_base | float32 | RoPE frequency base |
| rope_freq_scale | float32 | RoPE frequency scale |
| negative_prompt_scale | float32 | Scale for negative prompt |
| tokenizer | string | Tokenizer to use (RWKV) |

LLM Configuration

These settings apply to most LLM backends (llama.cpp, vLLM, etc.):

Performance Settings

| Field | Type | Default | Description |
|---|---|---|---|
| threads | int | processor count | Number of threads for parallel computation |
| context_size | int | 512 | Maximum context size (number of tokens) |
| f16 | bool | false | Enable 16-bit floating point precision (GPU acceleration) |
| gpu_layers | int | 0 | Number of layers to offload to GPU (0 = CPU only) |

Memory Management

| Field | Type | Default | Description |
|---|---|---|---|
| mmap | bool | true | Use memory mapping for model loading (faster, less RAM) |
| mmlock | bool | false | Lock model in memory (prevents swapping) |
| low_vram | bool | false | Use minimal VRAM mode |
| no_kv_offloading | bool | false | Disable KV cache offloading |

GPU Configuration

| Field | Type | Description |
|---|---|---|
| tensor_split | string | Comma-separated GPU memory allocation (e.g., "0.8,0.2" for 80%/20%) |
| main_gpu | string | Main GPU identifier for multi-GPU setups |
| cuda | bool | Explicitly enable/disable CUDA |

Sampling and Generation

| Field | Type | Default | Description |
|---|---|---|---|
| mirostat | int | 0 | Mirostat sampling mode (0=disabled, 1=Mirostat, 2=Mirostat 2.0) |
| mirostat_tau | float | 5.0 | Mirostat target entropy |
| mirostat_eta | float | 0.1 | Mirostat learning rate |

LoRA Configuration

| Field | Type | Description |
|---|---|---|
| lora_adapter | string | Path to LoRA adapter file |
| lora_base | string | Base model for LoRA |
| lora_scale | float32 | LoRA scale factor |
| lora_adapters | array | Multiple LoRA adapters |
| lora_scales | array | Scales for multiple LoRA adapters |

Advanced Options

| Field | Type | Description |
|---|---|---|
| no_mulmatq | bool | Disable matrix multiplication queuing |
| draft_model | string | Draft model for speculative decoding |
| n_draft | int32 | Number of draft tokens |
| quantization | string | Quantization format |
| load_format | string | Model load format |
| numa | bool | Enable NUMA (Non-Uniform Memory Access) |
| rms_norm_eps | float32 | RMS normalization epsilon |
| ngqa | int32 | Grouped-query attention (GQA) factor |
| rope_scaling | string | RoPE scaling configuration |
| type | string | Model type/architecture |
| grammar | string | Grammar file path for constrained generation |

YARN Configuration

YARN (Yet Another RoPE extensioN) settings for context extension:

| Field | Type | Description |
|---|---|---|
| yarn_ext_factor | float32 | YARN extension factor |
| yarn_attn_factor | float32 | YARN attention factor |
| yarn_beta_fast | float32 | YARN beta fast parameter |
| yarn_beta_slow | float32 | YARN beta slow parameter |

Prompt Caching

| Field | Type | Description |
|---|---|---|
| prompt_cache_path | string | Path to store prompt cache (relative to models directory) |
| prompt_cache_all | bool | Cache all prompts automatically |
| prompt_cache_ro | bool | Read-only prompt cache |

Text Processing

| Field | Type | Description |
|---|---|---|
| stopwords | array | Words or phrases that stop generation |
| cutstrings | array | Strings to cut from responses |
| trimspace | array | Strings to trim whitespace from |
| trimsuffix | array | Suffixes to trim from responses |
| extract_regex | array | Regular expressions to extract content |

System Prompt

| Field | Type | Description |
|---|---|---|
| system_prompt | string | Default system prompt for the model |

vLLM-Specific Configuration

These options apply when using the vllm backend:

| Field | Type | Description |
|---|---|---|
| gpu_memory_utilization | float32 | GPU memory utilization (0.0-1.0, default 0.9) |
| trust_remote_code | bool | Trust and execute remote code |
| enforce_eager | bool | Force eager execution mode |
| swap_space | int | Swap space in GB |
| max_model_len | int | Maximum model length |
| tensor_parallel_size | int | Tensor parallelism size |
| disable_log_stats | bool | Disable logging statistics |
| dtype | string | Data type (e.g., float16, bfloat16) |
| flash_attention | string | Flash attention configuration |
| cache_type_k | string | Key cache type |
| cache_type_v | string | Value cache type |
| limit_mm_per_prompt | object | Limit multimodal content per prompt: {image: int, video: int, audio: int} |

Template Configuration

Templates use Go templates with Sprig functions.

| Field | Type | Description |
|---|---|---|
| template.chat | string | Template for chat completion endpoint |
| template.chat_message | string | Template for individual chat messages |
| template.completion | string | Template for text completion |
| template.edit | string | Template for edit operations |
| template.function | string | Template for function/tool calls |
| template.multimodal | string | Template for multimodal interactions |
| template.reply_prefix | string | Prefix to add to model replies |
| template.use_tokenizer_template | bool | Use tokenizer’s built-in template (vLLM/transformers) |
| template.join_chat_messages_by_character | string | Character to join chat messages (default: \n) |

Template Variables

Templating supports sprig functions.

Following are common variables available in templates:

  • {{.Input}} - User input
  • {{.Instruction}} - Instruction for edit operations
  • {{.System}} - System message
  • {{.Prompt}} - Full prompt
  • {{.Functions}} - Function definitions (for function calling)
  • {{.FunctionCall}} - Function call result

Example Template

template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}{{end}}
    {{if eq .Role "assistant"}}Assistant: {{.Content}}{{end}}
    {{end}}
    Assistant:

Function Calling Configuration

Configure how the model handles function/tool calls:

| Field | Type | Default | Description |
|---|---|---|---|
| function.disable_no_action | bool | false | Disable the no-action behavior |
| function.no_action_function_name | string | answer | Name of the no-action function |
| function.no_action_description_name | string | | Description for no-action function |
| function.function_name_key | string | name | JSON key for function name |
| function.function_arguments_key | string | arguments | JSON key for function arguments |
| function.response_regex | array | | Named regex patterns to extract function calls |
| function.argument_regex | array | | Named regex to extract function arguments |
| function.argument_regex_key_name | string | key | Named regex capture for argument key |
| function.argument_regex_value_name | string | value | Named regex capture for argument value |
| function.json_regex_match | array | | Regex patterns to match JSON in tool mode |
| function.replace_function_results | array | | Replace function call results with patterns |
| function.replace_llm_results | array | | Replace LLM results with patterns |
| function.capture_llm_results | array | | Capture LLM results as text (e.g., for “thinking” blocks) |

Grammar Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| function.grammar.disable | bool | false | Completely disable grammar enforcement |
| function.grammar.parallel_calls | bool | false | Allow parallel function calls |
| function.grammar.mixed_mode | bool | false | Allow mixed-mode grammar enforcing |
| function.grammar.no_mixed_free_string | bool | false | Disallow free strings in mixed mode |
| function.grammar.disable_parallel_new_lines | bool | false | Disable parallel processing for new lines |
| function.grammar.prefix | string | | Prefix to add before grammar rules |
| function.grammar.expect_strings_after_json | bool | false | Expect strings after JSON data |

Diffusers Configuration

For image generation models using the diffusers backend:

| Field | Type | Description |
|---|---|---|
| diffusers.cuda | bool | Enable CUDA for diffusers |
| diffusers.pipeline_type | string | Pipeline type (e.g., stable-diffusion, stable-diffusion-xl) |
| diffusers.scheduler_type | string | Scheduler type (e.g., euler, ddpm) |
| diffusers.enable_parameters | string | Comma-separated parameters to enable |
| diffusers.cfg_scale | float32 | Classifier-free guidance scale |
| diffusers.img2img | bool | Enable image-to-image transformation |
| diffusers.clip_skip | int | Number of CLIP layers to skip |
| diffusers.clip_model | string | CLIP model to use |
| diffusers.clip_subfolder | string | CLIP model subfolder |
| diffusers.control_net | string | ControlNet model to use |
| step | int | Number of diffusion steps |

TTS Configuration

For text-to-speech models:

| Field | Type | Description |
|---|---|---|
| tts.voice | string | Voice file path or voice ID |
| tts.audio_path | string | Path to audio files (for Vall-E) |

Roles Configuration

Map conversation roles to specific strings:

roles:
  user: "### Instruction:"
  assistant: "### Response:"
  system: "### System Instruction:"

Feature Flags

Enable or disable experimental features:

feature_flags:
  feature_name: true
  another_feature: false

MCP Configuration

Model Context Protocol (MCP) configuration:

| Field | Type | Description |
|---|---|---|
| mcp.remote | string | YAML string defining remote MCP servers |
| mcp.stdio | string | YAML string defining STDIO MCP servers |

Agent Configuration

Agent/autonomous agent configuration:

| Field | Type | Description |
|---|---|---|
| agent.max_attempts | int | Maximum number of attempts |
| agent.max_iterations | int | Maximum number of iterations |
| agent.enable_reasoning | bool | Enable reasoning capabilities |
| agent.enable_planning | bool | Enable planning capabilities |
| agent.enable_mcp_prompts | bool | Enable MCP prompts |
| agent.enable_plan_re_evaluator | bool | Enable plan re-evaluation |

Pipeline Configuration

Define pipelines for audio-to-audio processing:

| Field | Type | Description |
|---|---|---|
| pipeline.tts | string | TTS model name |
| pipeline.llm | string | LLM model name |
| pipeline.transcription | string | Transcription model name |
| pipeline.vad | string | Voice activity detection model name |

gRPC Configuration

Backend gRPC communication settings:

| Field | Type | Description |
|---|---|---|
| grpc.attempts | int | Number of retry attempts |
| grpc.attempts_sleep_time | int | Sleep time between retries (seconds) |

Overrides

Override model configuration values at runtime (llama.cpp):

overrides:
  - "qwen3moe.expert_used_count=int:10"
  - "some_key=string:value"

Format: KEY=TYPE:VALUE where TYPE is int, float, string, or bool.

Known Use Cases

Specify which endpoints this model supports:

known_usecases:
  - chat
  - completion
  - embeddings

Available flags: chat, completion, edit, embeddings, rerank, image, transcript, tts, sound_generation, tokenize, vad, video, detection, llm (combination of CHAT, COMPLETION, EDIT).

Complete Example

Here’s a comprehensive example combining many options:

name: my-llm-model
description: A high-performance LLM model
backend: llama-stable

parameters:
  model: my-model.gguf
  temperature: 0.7
  top_p: 0.9
  top_k: 40
  max_tokens: 2048

context_size: 4096
threads: 8
f16: true
gpu_layers: 35

system_prompt: "You are a helpful AI assistant."

template:
  chat: |
    {{.System}}
    {{range .Messages}}
    {{if eq .Role "user"}}User: {{.Content}}
    {{else if eq .Role "assistant"}}Assistant: {{.Content}}
    {{end}}
    {{end}}
    Assistant:

roles:
  user: "User:"
  assistant: "Assistant:"
  system: "System:"

stopwords:
  - "\n\nUser:"
  - "\n\nHuman:"

prompt_cache_path: "cache/my-model"
prompt_cache_all: true

function:
  grammar:
    parallel_calls: true
    mixed_mode: false

feature_flags:
  experimental_feature: true