Run LLMs locally with llama-cpp

Notes on running LLMs on a local machine with CPUs and GPUs.

All these commands are run on Ubuntu 22.04.2 LTS.

Installation

Install NVIDIA CUDA Toolkit

See: Install the NVIDIA driver and CUDA (NVCC) on Ubuntu
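
A quick sanity check after installing the driver and toolkit (a minimal sketch):

sh
# GPUs visible to the driver?
nvidia-smi
# CUDA compiler on PATH?
nvcc --version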

Install llama-cpp

sh
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
sh
mkdir build
cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release
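
If the build succeeds, the binaries land in build/bin; a quick smoke test (a minimal sketch):

sh
ls build/bin/
./build/bin/main --help | head -n 20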

See: CUBLAS compilation issue with make : "Unsupported gpu architecture 'compute_89'"

If the following error appears later at runtime:

txt
CUDA error: the provided PTX was compiled with an unsupported toolchain.

the build probably did not pick up the correct nvcc path; check the output of nvcc --version in the current environment.

  • For example, an nvcc installed via conda may not be in the currently active env (see the quick check below).
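
To see which nvcc the build would pick up (a minimal sketch; the conda path is only an example):

sh
which nvcc
# For a conda environment, the env's own compiler (if installed there) would be at:
ls "$CONDA_PREFIX/bin/nvcc"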

You can specify the nvcc path with the following command:

sh
cmake -B build -DLLAMA_CUDA=ON -DCMAKE_CUDA_COMPILER=/home/asimov/miniconda3/envs/ai/bin/nvcc

See: TheBloke/Llama-2-13B-chat-GGML · CUDA error - the provided PTX was compiled with an unsupported toolchain

Install llama-cpp-python (Deprecated) - [optional]

This package provides Python bindings for llama.cpp, including an OpenAI-compatible server.

sh
LLAMA_CUBLAS=1 CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python[server]

If you installed llama-cpp-python before setting up nvcc correctly, set up nvcc first and then reinstall llama-cpp-python:

sh
LLAMA_CUBLAS=1 CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python[server] --upgrade --force-reinstall --no-cache-dir

See: OpenAI Compatible Server of llama-cpp-python
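
A quick check that the bindings import cleanly in the current environment (a minimal sketch):

sh
python -c "import llama_cpp; print(llama_cpp.__version__)"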

Download models

Install huggingface_hub

This provides the huggingface-cli command used below to download models.

sh
pip install 'huggingface_hub[cli,torch]'
pip install hf_transfer

Download from huggingface

sh
# For PRC users
export HF_ENDPOINT=https://hf-mirror.com
export HF_HUB_ENABLE_HF_TRANSFER=1
sh
# [Recommend] qwen1.5-14b-chat (q5_k_m)
huggingface-cli download Qwen/Qwen1.5-14B-Chat-GGUF qwen1_5-14b-chat-q5_k_m.gguf --local-dir ./models/ --local-dir-use-symlinks False

# qwen1.5-1.8b-chat (q8_0)
huggingface-cli download Qwen/Qwen1.5-1.8B-Chat-GGUF qwen1_5-1_8b-chat-q8_0.gguf --local-dir ./models/ --local-dir-use-symlinks False

# qwen1.5-7b-chat (q2_k)
huggingface-cli download Qwen/Qwen1.5-7B-Chat-GGUF qwen1_5-7b-chat-q2_k.gguf --local-dir ./models/ --local-dir-use-symlinks False

# qwen1.5-72b-chat (q2_k)
huggingface-cli download Qwen/Qwen1.5-72B-Chat-GGUF qwen1_5-72b-chat-q2_k.gguf --local-dir ./models/ --local-dir-use-symlinks False

# mixtral-8x7b-instruct (Q5_K_M)
huggingface-cli download TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf --local-dir ./models/ --local-dir-use-symlinks False
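
After downloading, the GGUF files should sit under ./models/; a quick check (a minimal sketch):

sh
ls -lh ./models/*.gguf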

Run Chat server

llama-cpp

sh
# main:    ~/repos/llama.cpp/build/bin/main
# models:  ~/repos/local-llms/models/
cd build/bin

Interactive mode

sh
# qwen1.5-14b-chat (q5_k_m)
./main --model "$HOME/repos/local-llms/models/qwen1_5-14b-chat-q5_k_m.gguf" --n-gpu-layers 41 --ctx-size 8192 --interactive-first

# qwen1.5-72b-chat (q2_k)
./main --model "$HOME/repos/local-llms/models/qwen1_5-72b-chat-q2_k.gguf" --n-gpu-layers 41 --ctx-size 8192 --interactive-first
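
The right --n-gpu-layers value depends on the model size, the quantization, and the available VRAM. One way to tune it is to watch GPU memory from a second terminal while the model loads (a minimal sketch):

sh
# Lower --n-gpu-layers if you run out of memory;
# raise it (up to the model's layer count) to offload more work to the GPU.
watch -n 1 nvidia-smi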

Server mode

sh
# qwen1.5-14b-chat (q5_k_m)
./server --model "$HOME/repos/local-llms/models/qwen1_5-14b-chat-q5_k_m.gguf" --host 0.0.0.0 --port 13332 --n-gpu-layers 41 --ctx-size 8192

# qwen1.5-72b-chat (q2_k)
./server --model "$HOME/repos/local-llms/models/qwen1_5-72b-chat-q2_k.gguf" --host 0.0.0.0 --port 13332 --n-gpu-layers 81 --ctx-size 8192

api_like_OAI.py (Deprecated)

You can also use api_like_OAI.py for OpenAI format compatibility:

sh
# [./build/bin/]
wget https://raw.githubusercontent.com/ggerganov/llama.cpp/ea73dace986f05b6b35c799880c7eaea7ee578f4/examples/server/api_like_OAI.py
python api_like_OAI.py --host 0.0.0.0 --port 13333 --llama-api http://127.0.0.1:13332

./server now supports OpenAI format requests, so this method is no longer suggested.
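
For example, a chat request in OpenAI format can be sent directly to ./server (a minimal sketch, assuming the server started above is listening on port 13332):

sh
curl http://127.0.0.1:13332/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What model are you?"}]}'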

See: Short guide to hosting your own llama.cpp openAI compatible web-server

llama-cpp-python

This launches an LLM server that supports requests in the OpenAI API format.

sh
# If the machine is behind a proxy,
#   you might need to unset `http(s)_proxy` before running the service,
# or set `no_proxy` as below:
export no_proxy=localhost,127.0.0.1,127.0.0.0,127.0.1.1,local.home

# If you have multiple GPUs, you can specify which ones to use;
#   by default, llama-cpp uses all visible GPUs and splits the memory equally
#   (see the --tensor_split sketch below for uneven splits)
export CUDA_VISIBLE_DEVICES=0,1,2
sh
# [Recommend] qwen1.5-14b-chat (q5_k_m)
python -m llama_cpp.server --model "./models/qwen1_5-14b-chat-q5_k_m.gguf" --model_alias "qwen1.5-14b-chat" --host 0.0.0.0 --port 13333 --n_ctx 8192 --n_gpu_layers 41 --interrupt_requests True

# qwen1.5-7b-chat (q5_k_m)
python -m llama_cpp.server --model "./models/qwen1_5-7b-chat-q5_k_m.gguf" --model_alias "qwen-1.5-7b-chat" --host 0.0.0.0 --port 13333 --n_ctx 16384 --n_gpu_layers 33 --interrupt_requests True

# qwen1.5-72b-chat (q2_k)
python -m llama_cpp.server --model "./models/qwen1_5-72b-chat-q2_k.gguf" --model_alias "qwen-1.5-72b-chat" --host 0.0.0.0 --port 13333 --n_ctx 16384 --n_gpu_layers 81 --interrupt_requests True

# mixtral-8x7b (Q5_K_M)
python -m llama_cpp.server --model "./models/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf" --model_alias "mixtral-8x7b" --host 0.0.0.0 --port 13333 --n_ctx 16384 --n_gpu_layers 33 --interrupt_requests True
sh
# Inference on 3 * GTX 1080ti:
#   - (q5_k_m, n_ctx=8192):  [16GB VRAM, ~ 23 t/s]
#   - (q2_k,   n_ctx=1024):  [ 8GB VRAM, ~ 28 t/s]

# Inference on RTX Ada 6000:
#   - (q5_k_m, n_ctx=32768): [40GB VRAM, ~ 60 t/s]
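
If the GPUs have unequal VRAM, the default equal split may not fit on the smallest card; --tensor_split (see the options listing below) sets the proportions explicitly. A minimal sketch, assuming three GPUs where the first card should hold half of the layers:

sh
python -m llama_cpp.server --model "./models/qwen1_5-14b-chat-q5_k_m.gguf" \
  --model_alias "qwen1.5-14b-chat" --host 0.0.0.0 --port 13333 \
  --n_ctx 8192 --n_gpu_layers 41 --tensor_split 0.5 0.25 0.25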

You can also open the interactive API docs at http://127.0.0.1:13333/docs to test requests.
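
Before wiring up a client, a quick reachability check (a minimal sketch; /v1/models is part of the OpenAI-compatible API exposed by llama_cpp.server):

sh
curl http://127.0.0.1:13333/v1/models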

Requests to server

Once the server is running, you can chat with the API using the following code:

py
from openai import OpenAI

base_url = "http://127.0.0.1:13333/v1"
api_key = "sk-xxxxx"

client = OpenAI(base_url=base_url, api_key=api_key)
response = client.chat.completions.create(
    model="qwen1.5-14b-chat",
    messages=[
        {
            "role": "user",
            "content": "what is your model",
        }
    ],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
    elif chunk.choices[0].finish_reason == "stop":
        print()
    else:
        pass

Command line options

llama-cpp-python (Deprecated)

sh
python -m llama_cpp.server --help
txt
usage: __main__.py [-h] [--model MODEL] [--model_alias MODEL_ALIAS] [--n_gpu_layers N_GPU_LAYERS] [--main_gpu MAIN_GPU]
                   [--tensor_split [TENSOR_SPLIT ...]] [--vocab_only VOCAB_ONLY] [--use_mmap USE_MMAP] [--use_mlock USE_MLOCK] [--seed SEED]
                   [--n_ctx N_CTX] [--n_batch N_BATCH] [--n_threads N_THREADS] [--n_threads_batch N_THREADS_BATCH] [--rope_scaling_type ROPE_SCALING_TYPE]
                   [--rope_freq_base ROPE_FREQ_BASE] [--rope_freq_scale ROPE_FREQ_SCALE] [--yarn_ext_factor YARN_EXT_FACTOR]
                   [--yarn_attn_factor YARN_ATTN_FACTOR] [--yarn_beta_fast YARN_BETA_FAST] [--yarn_beta_slow YARN_BETA_SLOW]
                   [--yarn_orig_ctx YARN_ORIG_CTX] [--mul_mat_q MUL_MAT_Q] [--logits_all LOGITS_ALL] [--embedding EMBEDDING] [--offload_kqv OFFLOAD_KQV]
                   [--last_n_tokens_size LAST_N_TOKENS_SIZE] [--lora_base LORA_BASE] [--lora_path LORA_PATH] [--numa NUMA] [--chat_format CHAT_FORMAT]
                   [--clip_model_path CLIP_MODEL_PATH] [--cache CACHE] [--cache_type CACHE_TYPE] [--cache_size CACHE_SIZE] [--verbose VERBOSE]
                   [--host HOST] [--port PORT] [--ssl_keyfile SSL_KEYFILE] [--ssl_certfile SSL_CERTFILE] [--interrupt_requests INTERRUPT_REQUESTS]


options:
  -h, --help            show this help message and exit
  --model MODEL         The path to the model to use for generating completions. (default: PydanticUndefined)
  --model_alias MODEL_ALIAS
                        The alias of the model to use for generating completions.
  --n_gpu_layers N_GPU_LAYERS
                        The number of layers to put on the GPU. The rest will be on the CPU. Set -1 to move all to GPU. (default: 0)
  --main_gpu MAIN_GPU   Main GPU to use. (default: 0)
  --tensor_split [TENSOR_SPLIT ...]
                        Split layers across multiple GPUs in proportion.
  --vocab_only VOCAB_ONLY
                        Whether to only return the vocabulary. (default: False)
  --use_mmap USE_MMAP   Use mmap. (default: True)
  --use_mlock USE_MLOCK
                        Use mlock. (default: True)
  --seed SEED           Random seed. -1 for random. (default: 4294967295)
  --n_ctx N_CTX         The context size. (default: 2048)
  --n_batch N_BATCH     The batch size to use per eval. (default: 512)
  --n_threads N_THREADS
                        The number of threads to use. (default: 8)
  --n_threads_batch N_THREADS_BATCH
                        The number of threads to use when batch processing. (default: 8)
  --rope_scaling_type ROPE_SCALING_TYPE
  --rope_freq_base ROPE_FREQ_BASE
                        RoPE base frequency (default: 0.0)
  --rope_freq_scale ROPE_FREQ_SCALE
                        RoPE frequency scaling factor (default: 0.0)
  --yarn_ext_factor YARN_EXT_FACTOR
  --yarn_attn_factor YARN_ATTN_FACTOR
  --yarn_beta_fast YARN_BETA_FAST
  --yarn_beta_slow YARN_BETA_SLOW
  --yarn_orig_ctx YARN_ORIG_CTX
  --mul_mat_q MUL_MAT_Q
                        if true, use experimental mul_mat_q kernels (default: True)
  --logits_all LOGITS_ALL
                        Whether to return logits. (default: True)
  --embedding EMBEDDING
                        Whether to use embeddings. (default: True)
  --offload_kqv OFFLOAD_KQV
                        Whether to offload kqv to the GPU. (default: False)
  --last_n_tokens_size LAST_N_TOKENS_SIZE
                        Last n tokens to keep for repeat penalty calculation. (default: 64)
  --lora_base LORA_BASE
                        Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.
  --lora_path LORA_PATH
                        Path to a LoRA file to apply to the model.
  --numa NUMA           Enable NUMA support. (default: False)
  --chat_format CHAT_FORMAT
                        Chat format to use. (default: llama-2)
  --clip_model_path CLIP_MODEL_PATH
                        Path to a CLIP model to use for multi-modal chat completion.
  --cache CACHE         Use a cache to reduce processing times for evaluated prompts. (default: False)
  --cache_type CACHE_TYPE
                        The type of cache to use. Only used if cache is True. (default: ram)
  --cache_size CACHE_SIZE
                        The size of the cache in bytes. Only used if cache is True. (default: 2147483648)
  --verbose VERBOSE     Whether to print debug information. (default: True)
  --host HOST           Listen address (default: localhost)
  --port PORT           Listen port (default: 8000)
  --ssl_keyfile SSL_KEYFILE
                        SSL key file for HTTPS
  --ssl_certfile SSL_CERTFILE
                        SSL certificate file for HTTPS
  --interrupt_requests INTERRUPT_REQUESTS
                        Whether to interrupt requests when a new request is received. (default: True)

llama-cpp (server)

sh
# [build/bin]
./server --help
txt
usage: ./server [options]

options:
  -h, --help                show this help message and exit
  -v, --verbose             verbose output (default: disabled)
  -t N, --threads N         number of threads to use during computation (default: 48)
  -tb N, --threads-batch N  number of threads to use during batch and prompt processing (default: same as --threads)
  -c N, --ctx-size N        size of the prompt context (default: 512)
  --rope-scaling {none,linear,yarn}
                            RoPE frequency scaling method, defaults to linear unless specified by the model
  --rope-freq-base N        RoPE base frequency (default: loaded from model)
  --rope-freq-scale N       RoPE frequency scaling factor, expands context by a factor of 1/N
  --yarn-ext-factor N       YaRN: extrapolation mix factor (default: 1.0, 0.0 = full interpolation)
  --yarn-attn-factor N      YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
  --yarn-beta-slow N        YaRN: high correction dim or alpha (default: 1.0)
  --yarn-beta-fast N        YaRN: low correction dim or beta (default: 32.0)
  -b N, --batch-size N      batch size for prompt processing (default: 512)
  --memory-f32              use f32 instead of f16 for memory key+value (default: disabled)
                            not recommended: doubles context memory required and no measurable increase in quality
  --mlock                   force system to keep model in RAM rather than swapping or compressing
  --no-mmap                 do not memory-map model (slower load but may reduce pageouts if not using mlock)
  --numa TYPE               attempt optimizations that help on some NUMA systems
                              - distribute: spread execution evenly over all nodes
                              - isolate: only spawn threads on CPUs on the node that execution started on
                              - numactl: use the CPU map provided by numactl
  -ngl N, --n-gpu-layers N
                            number of layers to store in VRAM
  -sm SPLIT_MODE, --split-mode SPLIT_MODE
                            how to split the model across multiple GPUs, one of:
                              - none: use one GPU only
                              - layer (default): split layers and KV across GPUs
                              - row: split rows across GPUs
  -ts SPLIT --tensor-split SPLIT
                            fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1
  -mg i, --main-gpu i       the GPU to use for the model (with split-mode = none),
                            or for intermediate results and KV (with split-mode = row)
  -m FNAME, --model FNAME
                            model path (default: models/7B/ggml-model-f16.gguf)
  -a ALIAS, --alias ALIAS
                            set an alias for the model, will be added as `model` field in completion response
  --lora FNAME              apply LoRA adapter (implies --no-mmap)
  --lora-base FNAME         optional model to use as a base for the layers modified by the LoRA adapter
  --host                    ip address to listen (default: 127.0.0.1)
  --port PORT               port to listen (default: 8080)
  --path PUBLIC_PATH        path from which to serve static files (default examples/server/public)
  --api-key API_KEY         optional api key to enhance server security. If set, requests must include this key for access.
  --api-key-file FNAME      path to file containing api keys delimited by new lines. If set, requests must include one of the keys for access.
  -to N, --timeout N        server read/write timeout in seconds (default: 600)
  --embedding               enable embedding vector output (default: disabled)
  -np N, --parallel N       number of slots for process requests (default: 1)
  -cb, --cont-batching      enable continuous batching (a.k.a dynamic batching) (default: disabled)
  -spf FNAME, --system-prompt-file FNAME
                            set a file to load a system prompt (initial prompt of all slots), this is useful for chat applications.
  -ctk TYPE, --cache-type-k TYPE
                            KV cache data type for K (default: f16)
  -ctv TYPE, --cache-type-v TYPE
                            KV cache data type for V (default: f16)
  --mmproj MMPROJ_FILE      path to a multimodal projector file for LLaVA.
  --log-format              log output format: json or text (default: json)
  --log-disable             disables logging to a file.
  --slots-endpoint-disable  disables slots monitoring endpoint.
  --metrics                 enable prometheus compatible metrics endpoint (default: disabled).

  -n, --n-predict           maximum tokens to predict (default: -1)
  --override-kv KEY=TYPE:VALUE
                            advanced option to override model metadata by key. may be specified multiple times.
                            types: int, float, bool. example: --override-kv tokenizer.ggml.add_bos_token=bool:false
  -gan N, --grp-attn-n N    set the group attention factor to extend context size through self-extend (default: 1 = disabled), used together with group attention width `--grp-attn-w`
  -gaw N, --grp-attn-w N    set the group attention width to extend context size through self-extend (default: 512), used together with group attention factor `--grp-attn-n`
  --chat-template JINJA_TEMPLATE
                            set custom jinja chat template (default: template taken from model's metadata)
                            Note: only commonly used templates are accepted, since we don't have jinja parser

llama-cpp api_like_OAI (Deprecated)

sh
# [build/bin]
python api_like_OAI.py --help
txt
usage: api_like_OAI.py [-h] [--chat-prompt-model CHAT_PROMPT_MODEL] [--chat-prompt CHAT_PROMPT] [--user-name USER_NAME] [--ai-name AI_NAME]
                       [--system-name SYSTEM_NAME] [--stop STOP] [--llama-api LLAMA_API] [--api-key API_KEY] [--host HOST] [--port PORT]

An example of using server.cpp with a similar API to OAI. It must be used together with server.cpp.
options:
  -h, --help            show this help message and exit
  --chat-prompt-model CHAT_PROMPT_MODEL
                        Set the model name of conversation template
  --chat-prompt CHAT_PROMPT
                        the top prompt in chat completions(default: 'A chat between a curious user and an artificial intelligence assistant. The assistant
                        follows the given rules no matter what.\n')
  --user-name USER_NAME
                        USER name in chat completions(default: '\nUSER: ')
  --ai-name AI_NAME     ASSISTANT name in chat completions(default: '\nASSISTANT: ')
  --system-name SYSTEM_NAME
                        SYSTEM name in chat completions(default: '\nASSISTANT's RULE: ')
  --stop STOP           the end of response in chat completions(default: '</s>')
  --llama-api LLAMA_API
                        Set the address of server.cpp in llama.cpp(default: http://127.0.0.1:8080)
  --api-key API_KEY     Set the api key to allow only few user(default: NULL)
  --host HOST           Set the ip address to listen.(default: 127.0.0.1)
  --port PORT           Set the port to listen.(default: 8081)

Common issues

Extreme low performance of llama-cpp-python

txt
llama_print_timings:        load time =    3436.49 ms
llama_print_timings:      sample time =      30.06 ms /    12 runs   (    2.51 ms per token,   399.16 tokens per second)
llama_print_timings: prompt eval time =    3432.49 ms /  4472 tokens (    0.77 ms per token,  1302.84 tokens per second)
llama_print_timings:        eval time =     240.56 ms /    11 runs   (   21.87 ms per token,    45.73 tokens per second)
llama_print_timings:       total time =   57699.92 ms /  4483 tokens
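
Note that the per-token numbers above look healthy (prompt eval and eval account for only a few seconds), yet the total time is close to a minute, so most of the time is spent outside model evaluation; the linked issues below discuss this. As a first step it is still worth ruling out a CPU-only build (a minimal sketch; the reinstall flags are the ones from the install section):

sh
# GPU memory usage should grow while the model loads; if it stays flat, rebuild with CUDA:
nvidia-smi
LLAMA_CUBLAS=1 CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python[server] --upgrade --force-reinstall --no-cache-dir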

See: Incredibly slow response time · Issue #49 · abetlen/llama-cpp-python

See: Performance issues with high level API · Issue #232 · abetlen/llama-cpp-python

See: llama-cpp-python not using GPU on m1 · Issue #756 · abetlen/llama-cpp-python