Llama.cpp
LLM inference in C/C++
Installation
llama.cpp is available in the AUR:
- Install llama.cppAUR for CPU inference.
- Install llama.cpp-vulkanAUR for GPU inference.
Note: Ensure you have the appropriate Vulkan driver installed.
Usage
The primary executables are llama-cli and llama-server.
llama-cli
llama-cli runs models from the command line:
$ llama-cli --help
$ llama-cli -m model.gguf
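For a one-off generation, a model path, a prompt and a prediction limit can be combined; the model name and prompt below are illustrative:

$ llama-cli -m model.gguf -p "Explain the GGUF format in one sentence." -n 128

See llama-cli --help for the full list of sampling and performance options.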
llama-server
llama-server launches an HTTP server:
$ llama-server --help
$ llama-server -m model.gguf
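By default llama-server listens on 127.0.0.1:8080 and exposes an OpenAI-compatible API. A minimal sketch of starting the server and querying it with curl (model name and prompt are illustrative):

$ llama-server -m model.gguf --host 127.0.0.1 --port 8080
$ curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'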
Obtaining Models
llama.cpp uses models in the GGUF format.
Download from Hugging Face
Download models from Hugging Face using the -hf flag:
$ llama-cli -hf org/model
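The same flag is accepted by llama-server, and recent builds allow selecting a particular quantization by appending a tag after the repository name; the repository and tag below are placeholders:

$ llama-server -hf org/model:Q4_K_M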
Warning: This may overwrite an existing model file without prompting.
Manual Download
Manually download models using wget or curl:
$ wget -c model.gguf
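GGUF files on Hugging Face are served under a resolve/ path; the organisation, repository and file name below are placeholders:

$ wget -c https://huggingface.co/org/model-GGUF/resolve/main/model-Q4_K_M.gguf
$ curl -L -C - -O https://huggingface.co/org/model-GGUF/resolve/main/model-Q4_K_M.gguf

The -c (wget) and -C - (curl) options resume interrupted downloads, which is useful for multi-gigabyte files.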
Model Quantization
Quantization reduces the numerical precision of a model's weights to lower its memory usage.
GGUF file names use suffixes to indicate the quantization level. Generally, lower numbers (e.g. Q4) use less memory but may reduce output quality compared to higher numbers (e.g. Q8).
Unsloth provides a wide selection of quantized models on Hugging Face.
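If you already have a full-precision GGUF file, the llama-quantize tool built as part of llama.cpp can produce a lower-precision copy; the file names below are placeholders:

$ llama-quantize model-F16.gguf model-Q4_K_M.gguf Q4_K_M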
KV Cache Quantization
For further memory efficiency, you can quantize the KV (key-value) cache.
$ llama-cli -ctk q4_0 -ctv q4_0 -m model.gguf
Because the KV cache grows with context length, this can significantly reduce memory usage for large contexts, at a possible cost to output quality.
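The same cache-type flags are accepted by llama-server; for example, an 8-bit KV cache with a larger context window (values are illustrative):

$ llama-server -m model.gguf -c 8192 -ctk q8_0 -ctv q8_0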