Run LLMs locally with llama-cpp
Notes on running LLMs on a local machine with CPUs and GPUs.
All commands here were run on Ubuntu 22.04.2 LTS.
Installation
Install NVIDIA CUDA Toolkit
See: Install NVIDIA driver and CUDA (NVCC) on Ubuntu
Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release
See: Readme of llama.cpp
See: CUBLAS compilation issue with make : "Unsupported gpu architecture 'compute_89'"
- Building works with cmake, or with make after removing `-arch=native`
- https://github.com/ggerganov/llama.cpp/issues/1420
If you later hit the following error at runtime:
CUDA error: the provided PTX was compiled with an unsupported toolchain.
the build has probably picked up the wrong nvcc. Check the output of nvcc --version in the current environment; for example, an nvcc installed via conda may not be the one in the active env. You can point cmake at the correct nvcc explicitly:
cmake -B build -DLLAMA_CUDA=ON -DCMAKE_CUDA_COMPILER=/home/asimov/miniconda3/envs/ai/bin/nvcc
See: TheBloke/Llama-2-13B-chat-GGML · CUDA error - the provided PTX was compiled with an unsupported toolchain
Install llama-cpp-python [optional] (Deprecated)
This package provides Python bindings for llama.cpp, including an OpenAI-compatible server.
LLAMA_CUBLAS=1 CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python[server]
If you installed llama-cpp-python before setting up nvcc correctly, set up nvcc first and then reinstall llama-cpp-python:
LLAMA_CUBLAS=1 CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python[server] --upgrade --force-reinstall --no-cache-dir
See: README of llama-cpp-python
See: OpenAI Compatible Server of llama-cpp-python
Download models
Install huggingface_hub
This provides the huggingface-cli command, which is used below to download models.
pip install 'huggingface_hub[cli,torch]'
pip install hf_transfer
See: Installation of huggingface_hub
Download from huggingface
# For PRC users
export HF_ENDPOINT=https://hf-mirror.com
export HF_HUB_ENABLE_HF_TRANSFER=1
# [Recommended] qwen1.5-14b-chat (q5_k_m)
huggingface-cli download Qwen/Qwen1.5-14B-Chat-GGUF qwen1_5-14b-chat-q5_k_m.gguf --local-dir ./models/ --local-dir-use-symlinks False
# qwen1.5-1.8b-chat (q8_0)
huggingface-cli download Qwen/Qwen1.5-1.8B-Chat-GGUF qwen1_5-1_8b-chat-q8_0.gguf --local-dir ./models/ --local-dir-use-symlinks False
# qwen1.5-7b-chat (q2_k)
huggingface-cli download Qwen/Qwen1.5-7B-Chat-GGUF qwen1_5-7b-chat-q2_k.gguf --local-dir ./models/ --local-dir-use-symlinks False
# qwen1.5-72b-chat (q2_k)
huggingface-cli download Qwen/Qwen1.5-72B-Chat-GGUF qwen1_5-72b-chat-q2_k.gguf --local-dir ./models/ --local-dir-use-symlinks False
# mixtral-8x7b-instruct (Q5_K_M)
huggingface-cli download TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf --local-dir ./models/ --local-dir-use-symlinks False
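If you prefer to script downloads from Python instead of the CLI, here is a minimal sketch using hf_hub_download from huggingface_hub (same repo and filename as the recommended model above; the local directory is just an example):
from huggingface_hub import hf_hub_download

# Download one GGUF file into ./models/ (mirrors the huggingface-cli command above).
# HF_ENDPOINT and HF_HUB_ENABLE_HF_TRANSFER from the environment are respected here too.
path = hf_hub_download(
    repo_id="Qwen/Qwen1.5-14B-Chat-GGUF",
    filename="qwen1_5-14b-chat-q5_k_m.gguf",
    local_dir="./models",
    local_dir_use_symlinks=False,
)
print(path)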
See more GGUF formats of Qwen models on their Hugging Face repos.
Run Chat server
llama-cpp
# main: ~/repos/llama.cpp/build/bin/main
# models: ~/repos/local-llms/models/
cd build/bin
Interactive mode
# qwen1.5-14b-chat (q5_k_m)
./main --model "$HOME/repos/local-llms/models/qwen1_5-14b-chat-q5_k_m.gguf" --n-gpu-layers 41 --ctx-size 8192 --interactive-first
# qwen1.5-72b-chat (q2_k)
./main --model "$HOME/repos/local-llms/models/qwen1_5-72b-chat-q2_k.gguf" --n-gpu-layers 41 --ctx-size 8192 --interactive-first
See: Interactive mode of llama.cpp
Server mode
# qwen1.5-14b-chat (q5_k_m)
./server --model "$HOME/repos/local-llms/models/qwen1_5-14b-chat-q5_k_m.gguf" --host 0.0.0.0 --port 13332 --n-gpu-layers 41 --ctx-size 8192
# qwen1.5-72b-chat (q2_k)
./server --model "$HOME/repos/local-llms/models/qwen1_5-72b-chat-q2_k.gguf" --host 0.0.0.0 --port 13332 --n-gpu-layers 81 --ctx-size 8192
See: llama-cpp server README
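Once the server is up, you can sanity-check it from Python against its native /completion endpoint (port 13332 as above; the prompt and token count are just examples), assuming the requests package is installed:
import requests

# Plain completion request against the llama.cpp server started above.
resp = requests.post(
    "http://127.0.0.1:13332/completion",
    json={"prompt": "Hello, how are you?", "n_predict": 64},
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["content"])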
api_like_OAI.py (Deprecated)
You can also use api_like_OAI.py for OpenAI format compatibility:
# [./build/bin/]
wget https://raw.githubusercontent.com/ggerganov/llama.cpp/ea73dace986f05b6b35c799880c7eaea7ee578f4/examples/server/api_like_OAI.py
python api_like_OAI.py --host 0.0.0.0 --port 13333 --llama-api http://127.0.0.1:13332
./server now supports OpenAI format requests, so this method is no longer recommended.
See: Short guide to hosting your own llama.cpp openAI compatible web-server
llama-cpp-python
This launches an LLM server that accepts requests in the OpenAI API format.
# If the machine is behind a proxy,
# you might need to unset `http(s)_proxy` before running the service,
# or set `no_proxy` as below:
export no_proxy=localhost,127.0.0.1,127.0.0.0,127.0.1.1,local.home
# If you have multiple GPUs, you can specify which ones to use;
# by default, llama-cpp uses all GPUs and splits the memory evenly.
export CUDA_VISIBLE_DEVICES=0,1,2
# [Recommended] qwen1.5-14b-chat (q5_k_m)
python -m llama_cpp.server --model "./models/qwen1_5-14b-chat-q5_k_m.gguf" --model_alias "qwen1.5-14b-chat" --host 0.0.0.0 --port 13333 --n_ctx 8192 --n_gpu_layers 41 --interrupt_requests True
# qwen1.5-7b-chat (q5_k_m)
python -m llama_cpp.server --model "./models/qwen1_5-7b-chat-q5_k_m.gguf" --model_alias "qwen-1.5-7b-chat" --host 0.0.0.0 --port 13333 --n_ctx 16384 --n_gpu_layers 33 --interrupt_requests True
# qwen1.5-72b-chat (q2_k)
python -m llama_cpp.server --model "./models/qwen1_5-72b-chat-q2_k.gguf" --model_alias "qwen-1.5-72b-chat" --host 0.0.0.0 --port 13333 --n_ctx 16384 --n_gpu_layers 81 --interrupt_requests True
# mixtral-8x7b (Q5_K_M)
python -m llama_cpp.server --model "./models/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf" --model_alias "mixtral-8x7b" --host 0.0.0.0 --port 13333 --n_ctx 16384 --n_gpu_layers 33 --interrupt_requests True
# Inference on 3 * GTX 1080ti:
# - (q5_k_m, n_ctx=8192): [16GB VRAM, ~ 23 t/s]
# - (q2_k, n_ctx=1024): [ 8GB VRAM, ~ 28 t/s]
# Inference on RTX Ada 6000:
# - (q5_k_m, n_ctx=32768): [40GB VRAM, ~ 60 t/s]
You can also open the API docs to test requests interactively: http://127.0.0.1:13333/docs
See: OpenAI Compatible Server in llama-cpp-python
Requests to server
After the server is running, you can chat with the API using the following code:
from openai import OpenAI

# Point the OpenAI client at the local server; any non-empty api_key works.
base_url = "http://127.0.0.1:13333/v1"
api_key = "sk-xxxxx"
client = OpenAI(base_url=base_url, api_key=api_key)

response = client.chat.completions.create(
    model="qwen1.5-14b-chat",
    messages=[
        {
            "role": "user",
            "content": "what is your model",
        }
    ],
    stream=True,
)

# Print the streamed response chunk by chunk.
for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
    elif chunk.choices[0].finish_reason == "stop":
        print()
    else:
        pass
Command line options
llama-cpp-python (Deprecated)
python -m llama_cpp.server --help
usage: __main__.py [-h] [--model MODEL] [--model_alias MODEL_ALIAS] [--n_gpu_layers N_GPU_LAYERS] [--main_gpu MAIN_GPU]
[--tensor_split [TENSOR_SPLIT ...]] [--vocab_only VOCAB_ONLY] [--use_mmap USE_MMAP] [--use_mlock USE_MLOCK] [--seed SEED]
[--n_ctx N_CTX] [--n_batch N_BATCH] [--n_threads N_THREADS] [--n_threads_batch N_THREADS_BATCH] [--rope_scaling_type ROPE_SCALING_TYPE]
[--rope_freq_base ROPE_FREQ_BASE] [--rope_freq_scale ROPE_FREQ_SCALE] [--yarn_ext_factor YARN_EXT_FACTOR]
[--yarn_attn_factor YARN_ATTN_FACTOR] [--yarn_beta_fast YARN_BETA_FAST] [--yarn_beta_slow YARN_BETA_SLOW]
[--yarn_orig_ctx YARN_ORIG_CTX] [--mul_mat_q MUL_MAT_Q] [--logits_all LOGITS_ALL] [--embedding EMBEDDING] [--offload_kqv OFFLOAD_KQV]
[--last_n_tokens_size LAST_N_TOKENS_SIZE] [--lora_base LORA_BASE] [--lora_path LORA_PATH] [--numa NUMA] [--chat_format CHAT_FORMAT]
[--clip_model_path CLIP_MODEL_PATH] [--cache CACHE] [--cache_type CACHE_TYPE] [--cache_size CACHE_SIZE] [--verbose VERBOSE]
[--host HOST] [--port PORT] [--ssl_keyfile SSL_KEYFILE] [--ssl_certfile SSL_CERTFILE] [--interrupt_requests INTERRUPT_REQUESTS]
options:
-h, --help show this help message and exit
--model MODEL The path to the model to use for generating completions. (default: PydanticUndefined)
--model_alias MODEL_ALIAS
The alias of the model to use for generating completions.
--n_gpu_layers N_GPU_LAYERS
The number of layers to put on the GPU. The rest will be on the CPU. Set -1 to move all to GPU. (default: 0)
--main_gpu MAIN_GPU Main GPU to use. (default: 0)
--tensor_split [TENSOR_SPLIT ...]
Split layers across multiple GPUs in proportion.
--vocab_only VOCAB_ONLY
Whether to only return the vocabulary. (default: False)
--use_mmap USE_MMAP Use mmap. (default: True)
--use_mlock USE_MLOCK
Use mlock. (default: True)
--seed SEED Random seed. -1 for random. (default: 4294967295)
--n_ctx N_CTX The context size. (default: 2048)
--n_batch N_BATCH The batch size to use per eval. (default: 512)
--n_threads N_THREADS
The number of threads to use. (default: 8)
--n_threads_batch N_THREADS_BATCH
The number of threads to use when batch processing. (default: 8)
--rope_scaling_type ROPE_SCALING_TYPE
--rope_freq_base ROPE_FREQ_BASE
RoPE base frequency (default: 0.0)
--rope_freq_scale ROPE_FREQ_SCALE
RoPE frequency scaling factor (default: 0.0)
--yarn_ext_factor YARN_EXT_FACTOR
--yarn_attn_factor YARN_ATTN_FACTOR
--yarn_beta_fast YARN_BETA_FAST
--yarn_beta_slow YARN_BETA_SLOW
--yarn_orig_ctx YARN_ORIG_CTX
--mul_mat_q MUL_MAT_Q
if true, use experimental mul_mat_q kernels (default: True)
--logits_all LOGITS_ALL
Whether to return logits. (default: True)
--embedding EMBEDDING
Whether to use embeddings. (default: True)
--offload_kqv OFFLOAD_KQV
Whether to offload kqv to the GPU. (default: False)
--last_n_tokens_size LAST_N_TOKENS_SIZE
Last n tokens to keep for repeat penalty calculation. (default: 64)
--lora_base LORA_BASE
Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.
--lora_path LORA_PATH
Path to a LoRA file to apply to the model.
--numa NUMA Enable NUMA support. (default: False)
--chat_format CHAT_FORMAT
Chat format to use. (default: llama-2)
--clip_model_path CLIP_MODEL_PATH
Path to a CLIP model to use for multi-modal chat completion.
--cache CACHE Use a cache to reduce processing times for evaluated prompts. (default: False)
--cache_type CACHE_TYPE
The type of cache to use. Only used if cache is True. (default: ram)
--cache_size CACHE_SIZE
The size of the cache in bytes. Only used if cache is True. (default: 2147483648)
--verbose VERBOSE Whether to print debug information. (default: True)
--host HOST Listen address (default: localhost)
--port PORT Listen port (default: 8000)
--ssl_keyfile SSL_KEYFILE
SSL key file for HTTPS
--ssl_certfile SSL_CERTFILE
SSL certificate file for HTTPS
--interrupt_requests INTERRUPT_REQUESTS
Whether to interrupt requests when a new request is received. (default: True)
llama-cpp (server)
# [build/bin]
./server --help
usage: ./server [options]
options:
-h, --help show this help message and exit
-v, --verbose verbose output (default: disabled)
-t N, --threads N number of threads to use during computation (default: 48)
-tb N, --threads-batch N number of threads to use during batch and prompt processing (default: same as --threads)
-c N, --ctx-size N size of the prompt context (default: 512)
--rope-scaling {none,linear,yarn}
RoPE frequency scaling method, defaults to linear unless specified by the model
--rope-freq-base N RoPE base frequency (default: loaded from model)
--rope-freq-scale N RoPE frequency scaling factor, expands context by a factor of 1/N
--yarn-ext-factor N YaRN: extrapolation mix factor (default: 1.0, 0.0 = full interpolation)
--yarn-attn-factor N YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
--yarn-beta-slow N YaRN: high correction dim or alpha (default: 1.0)
--yarn-beta-fast N YaRN: low correction dim or beta (default: 32.0)
-b N, --batch-size N batch size for prompt processing (default: 512)
--memory-f32 use f32 instead of f16 for memory key+value (default: disabled)
not recommended: doubles context memory required and no measurable increase in quality
--mlock force system to keep model in RAM rather than swapping or compressing
--no-mmap do not memory-map model (slower load but may reduce pageouts if not using mlock)
--numa TYPE attempt optimizations that help on some NUMA systems
- distribute: spread execution evenly over all nodes
- isolate: only spawn threads on CPUs on the node that execution started on
- numactl: use the CPU map provided by numactl
-ngl N, --n-gpu-layers N
number of layers to store in VRAM
-sm SPLIT_MODE, --split-mode SPLIT_MODE
how to split the model across multiple GPUs, one of:
- none: use one GPU only
- layer (default): split layers and KV across GPUs
- row: split rows across GPUs
-ts SPLIT, --tensor-split SPLIT
fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1
-mg i, --main-gpu i the GPU to use for the model (with split-mode = none),
or for intermediate results and KV (with split-mode = row)
-m FNAME, --model FNAME
model path (default: models/7B/ggml-model-f16.gguf)
-a ALIAS, --alias ALIAS
set an alias for the model, will be added as `model` field in completion response
--lora FNAME apply LoRA adapter (implies --no-mmap)
--lora-base FNAME optional model to use as a base for the layers modified by the LoRA adapter
--host ip address to listen (default: 127.0.0.1)
--port PORT port to listen (default: 8080)
--path PUBLIC_PATH path from which to serve static files (default examples/server/public)
--api-key API_KEY optional api key to enhance server security. If set, requests must include this key for access.
--api-key-file FNAME path to file containing api keys delimited by new lines. If set, requests must include one of the keys for access.
-to N, --timeout N server read/write timeout in seconds (default: 600)
--embedding enable embedding vector output (default: disabled)
-np N, --parallel N number of slots for process requests (default: 1)
-cb, --cont-batching enable continuous batching (a.k.a dynamic batching) (default: disabled)
-spf FNAME, --system-prompt-file FNAME
set a file to load a system prompt (initial prompt of all slots), this is useful for chat applications.
-ctk TYPE, --cache-type-k TYPE
KV cache data type for K (default: f16)
-ctv TYPE, --cache-type-v TYPE
KV cache data type for V (default: f16)
--mmproj MMPROJ_FILE path to a multimodal projector file for LLaVA.
--log-format log output format: json or text (default: json)
--log-disable disables logging to a file.
--slots-endpoint-disable disables slots monitoring endpoint.
--metrics enable prometheus compatible metrics endpoint (default: disabled).
-n, --n-predict maximum tokens to predict (default: -1)
--override-kv KEY=TYPE:VALUE
advanced option to override model metadata by key. may be specified multiple times.
types: int, float, bool. example: --override-kv tokenizer.ggml.add_bos_token=bool:false
-gan N, --grp-attn-n N set the group attention factor to extend context size through self-extend (default: 1=disabled), used together with group attention width `--grp-attn-w`
-gaw N, --grp-attn-w N set the group attention width to extend context size through self-extend (default: 512), used together with group attention factor `--grp-attn-n`
--chat-template JINJA_TEMPLATE
set custom jinja chat template (default: template taken from model's metadata)
Note: only commonly used templates are accepted, since we don't have jinja parser
llama-cpp api_like_OAI (Deprecated)
# [build/bin]
python api_like_OAI.py --help
usage: api_like_OAI.py [-h] [--chat-prompt-model CHAT_PROMPT_MODEL] [--chat-prompt CHAT_PROMPT] [--user-name USER_NAME] [--ai-name AI_NAME]
[--system-name SYSTEM_NAME] [--stop STOP] [--llama-api LLAMA_API] [--api-key API_KEY] [--host HOST] [--port PORT]
An example of using server.cpp with a similar API to OAI. It must be used together with server.cpp.
options:
-h, --help show this help message and exit
--chat-prompt-model CHAT_PROMPT_MODEL
Set the model name of conversation template
--chat-prompt CHAT_PROMPT
the top prompt in chat completions(default: 'A chat between a curious user and an artificial intelligence assistant. The assistant
follows the given rules no matter what.\n')
--user-name USER_NAME
USER name in chat completions(default: '\nUSER: ')
--ai-name AI_NAME ASSISTANT name in chat completions(default: '\nASSISTANT: ')
--system-name SYSTEM_NAME
SYSTEM name in chat completions(default: '\nASSISTANT's RULE: ')
--stop STOP the end of response in chat completions(default: '</s>')
--llama-api LLAMA_API
Set the address of server.cpp in llama.cpp(default: http://127.0.0.1:8080)
--api-key API_KEY Set the api key to allow only few user(default: NULL)
--host HOST Set the ip address to listen.(default: 127.0.0.1)
--port PORT Set the port to listen.(default: 8081)
Common issues
Extreme low performance of llama-cpp-python
Example timings from a slow run (note that the total time is far larger than the prompt eval and eval times combined):
llama_print_timings: load time = 3436.49 ms
llama_print_timings: sample time = 30.06 ms / 12 runs ( 2.51 ms per token, 399.16 tokens per second)
llama_print_timings: prompt eval time = 3432.49 ms / 4472 tokens ( 0.77 ms per token, 1302.84 tokens per second)
llama_print_timings: eval time = 240.56 ms / 11 runs ( 21.87 ms per token, 45.73 tokens per second)
llama_print_timings: total time = 57699.92 ms / 4483 tokens
See: Incredibly slow response time · Issue #49 · abetlen/llama-cpp-python
See: Performance issues with high level API · Issue #232 · abetlen/llama-cpp-python
See: llama-cpp-python not using GPU on m1 · Issue #756 · abetlen/llama-cpp-python
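When performance is unexpectedly low, it is worth confirming that layers are actually being offloaded to the GPU. A minimal sketch using the llama_cpp Python API (the model path and settings are just examples); with verbose=True the load log should report how many layers were offloaded:
from llama_cpp import Llama

# Load the model with all layers offloaded to the GPU; the verbose load log
# prints how many layers actually landed on the GPU (look for the "offloaded ... layers" lines).
llm = Llama(
    model_path="./models/qwen1_5-14b-chat-q5_k_m.gguf",
    n_gpu_layers=-1,  # -1 offloads all layers
    n_ctx=2048,
    verbose=True,
)

out = llm("Hello", max_tokens=16)
print(out["choices"][0]["text"])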