Llama.cpp

llama.cpp is a tool for LLM (large language model) inference, written in C/C++.

Installation

llama.cpp is available in the AUR.

Note: Ensure you have the appropriate Vulkan driver installed.

Usage

The primary executables are llama-cli and llama-server.

llama-cli

llama-cli is the command-line interface for running models:

$ llama-cli --help
$ llama-cli -m model.gguf
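
For example, to run a single prompt and offload all model layers to the GPU (a minimal sketch; the prompt text, token count and layer count are placeholders to adjust):

$ llama-cli -m model.gguf -ngl 99 -p "Write a haiku about Arch Linux." -n 128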

llama-server

llama-server launches an HTTP server with an OpenAI-compatible API:

$ llama-server --help
$ llama-server -m model.gguf
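
A sketch of starting the server on an explicit address and querying the chat endpoint with curl (assuming the default 127.0.0.1:8080; host, port and message are placeholders):

$ llama-server -m model.gguf --host 127.0.0.1 --port 8080
$ curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello"}]}'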

Obtaining models

llama.cpp uses models in the GGUF format.

Download from Hugging Face

Download models from Hugging Face using the -hf flag:

$ llama-cli -hf org/model
Warning: This may overwrite an existing model file without prompting.
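
A particular quantization can often be selected by appending a tag to the model name (a sketch; org/model and the tag are placeholders, and the available tags depend on the repository):

$ llama-cli -hf org/model:Q4_K_M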

Manual download

Manually download models using wget or curl:

$ wget -c https://huggingface.co/org/model/resolve/main/model.gguf
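
An equivalent resumable download with curl (the URL is a placeholder for the model file's direct download link):

$ curl -L -C - -o model.gguf https://huggingface.co/org/model/resolve/main/model.gguf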

Model quantization

Quantization lowers model precision to reduce memory usage.

GGUF models use suffixes such as Q4_K_M or Q8_0 to indicate the quantization level. Generally, lower numbers (Q4) use less memory but may reduce output quality compared to higher numbers (Q8).

Unsloth provides a wide selection of quantized models on Hugging Face.
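
Alternatively, an existing higher-precision GGUF model can be quantized locally with the llama-quantize tool shipped with llama.cpp (a minimal sketch; file names are placeholders):

$ llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M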

KV cache quantization

For further memory efficiency, you can quantize the KV (key-value) cache.

$ llama-cli -ctk q4_0 -ctv q4_0 -m model.gguf

This can significantly reduce memory usage, particularly at large context sizes.
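
Note that quantizing the V cache generally requires flash attention to be enabled; a sketch assuming the build accepts the -fa flag:

$ llama-cli -fa -ctk q4_0 -ctv q4_0 -m model.gguf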

See also