Llama.cpp

llama.cpp is a tool for LLM (large language model) inference, written in C/C++.

Installation

llama.cpp is available in the AUR.

Note: Ensure you have the appropriate Vulkan driver installed.

Usage

The primary executables are llama-cli and llama-server.

llama-cli

llama-cli is the command-line interface for running models:

$ llama-cli --help
$ llama-cli -m model.gguf
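
For example, to run a single prompt and offload all model layers to the GPU (a minimal sketch; the prompt text, token count and layer count are placeholders to adjust):

$ llama-cli -m model.gguf -ngl 99 -p "Write a haiku about Arch Linux." -n 128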

llama-server

llama-server launches an HTTP server with an OpenAI-compatible API:

$ llama-server --help
$ llama-server -m model.gguf
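
A sketch of starting the server on an explicit address and querying the chat endpoint with curl (assuming the default 127.0.0.1:8080; host, port and message are placeholders):

$ llama-server -m model.gguf --host 127.0.0.1 --port 8080
$ curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello"}]}'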

Obtaining models

llama.cpp uses models in the GGUF format.

Download from Hugging Face

Download models from Hugging Face using the -hf flag:

$ llama-cli -hf org/model
Warning: This may overwrite an existing model file without prompting.
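
A particular quantization can often be selected by appending a tag to the model name (a sketch; org/model and the tag are placeholders, and the available tags depend on the repository):

$ llama-cli -hf org/model:Q4_K_M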

Manual download

Manually download models using wget or curl:

$ wget -c https://huggingface.co/org/model/resolve/main/model.gguf
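
An equivalent resumable download with curl (the URL is a placeholder for the model file's direct download link):

$ curl -L -C - -o model.gguf https://huggingface.co/org/model/resolve/main/model.gguf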

Model quantization

Quantization lowers model precision to reduce memory usage.

GGUF models use suffixes such as Q4_K_M or Q8_0 to indicate the quantization level. Generally, lower numbers (Q4) use less memory but may reduce output quality compared to higher numbers (Q8).

Unsloth provides a wide selection of quantized models on Hugging Face.
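
Alternatively, an existing higher-precision GGUF model can be quantized locally with the llama-quantize tool shipped with llama.cpp (a minimal sketch; file names are placeholders):

$ llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M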

KV cache quantization

For further memory efficiency, you can quantize the KV (key-value) cache.

$ llama-cli -ctk q4_0 -ctv q4_0 -m model.gguf

This can significantly reduce memory usage, particularly at large context sizes.
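
Note that quantizing the V cache generally requires flash attention to be enabled; a sketch assuming the build accepts the -fa flag:

$ llama-cli -fa -ctk q4_0 -ctv q4_0 -m model.gguf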

See also