# LLM library

* [LLM library](#llm-library)
  * [Prerequisites](#prerequisites)
  * [Configuration options](#configuration-options)
    * [Conditional options](#conditional-options)
      * [llama cpp options](#llama-cpp-options)
      * [onnxruntime genai options](#onnxruntime-genai-options)
  * [Quick start](#quick-start)
    * [Neural network](#neural-network)
      * [llama cpp model](#llama-cpp-model)
      * [onnxruntime genai model](#onnxruntime-genai-model)
    * [To build for Android](#to-build-for-android)
    * [To build for Linux](#to-build-for-linux)
      * [Generic aarch64 target](#generic-aarch64-target)
      * [Aarch64 target with SME](#aarch64-target-with-sme)
      * [Native host build](#native-host-build)
  * [Building and running tests](#building-and-running-tests)
  * [To build an executable](#to-build-an-executable)
    * [llama cpp](#llama-cpp)
    * [onnxruntime genai](#onnxruntime-genai)
  * [Trademarks](#trademarks)
  * [License](#license)

This repository builds an [Arm® KleidiAI™](https://www.arm.com/markets/artificial-intelligence/software/kleidi) enabled LLM library using the CMake build system. It provides an abstraction over the different Machine Learning frameworks/backends that Arm® KleidiAI™ kernels have been integrated into. Currently, it supports the [llama.cpp](https://github.com/ggml-org/llama.cpp) and [onnxruntime-genai](https://github.com/microsoft/onnxruntime-genai) backends, and we intend to add support for other backends, such as [mediapipe](https://github.com/google-ai-edge/mediapipe), soon.

The backend library (selected at the CMake configuration stage) is wrapped by this project's thin C++ layer, which can be used directly for testing and evaluation. JNI bindings are also provided for developers targeting Android™ based applications.

## Prerequisites

* A Linux®-based operating system is recommended (this repo is tested on Ubuntu® 22.04.4 LTS).
* An Android™ or Linux® device with an Arm® CPU is recommended as a deployment target, but this library can be built for any native machine.
* CMake 3.27 or above.
* Python 3.9 or above; Python is used to download test resources and models.
* Android™ NDK, if building for Android™. Version r27 or later is recommended and can be downloaded from [here](https://developer.android.com/ndk/downloads).
* An Aarch64 GNU toolchain (version 14.1 or later) if cross-compiling from a Linux®-based system, which can be downloaded from [here](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads).
* A Java Development Kit, required to build the JNI wrapper library needed to use this module in an Android™/Java application.

## Configuration options

The project downloads the required software sources based on user-provided configuration options. CMake presets are available that set the following variables:

- `LLM_FRAMEWORK`: Currently supports `llama.cpp` (the default framework) and `onnxruntime-genai`.
- `BUILD_JNI_LIB`: Build the JNI shared library that other projects can consume; enabled by default.
- `BUILD_UNIT_TESTS`: Build the C++ unit tests and add them to CTest; the JNI tests are also built. Enabled by default.
- `BUILD_EXECUTABLE`: Build standalone applications; disabled by default.
- `DOWNLOADS_LOCK_TIMEOUT`: A timeout, in seconds, for acquiring the lock used when downloading resources. The download happens only once, initiated by the CMake configuration unless it has already been run by the user directly or by a prior CMake configuration. The lock prevents multiple CMake configuration processes running in parallel from downloading files to the same location.

> **NOTE**: If you need a specific version of Java, set its path in the `JAVA_HOME` environment variable:
> ```shell
> export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
> ```
> A failure to locate `jni.h` occurs if a compatible JDK is not on the system path.
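These options are passed as ordinary CMake cache variables at configure time. As a minimal sketch (the option values are illustrative and assume `LLM_FRAMEWORK` accepts the framework names exactly as listed above; add your usual preset or toolchain arguments for your target):

```shell
# Example only: select the onnxruntime-genai backend and also build the standalone applications.
cmake -B build \
    -DLLM_FRAMEWORK=onnxruntime-genai \
    -DBUILD_EXECUTABLE=ON

cmake --build ./build
```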
> **NOTE**: If you want to experiment with the repository without the JNI libraries, turn the `BUILD_JNI_LIB` option off by configuring with `-DBUILD_JNI_LIB=OFF`.

### Conditional options

Each framework has its own conditional options.

#### llama cpp options

When `llama.cpp` is the selected framework, these configuration parameters can be set:

- `LLAMA_SRC_DIR`: Source directory path that will be populated by the CMake configuration.
- `LLAMA_GIT_URL`: Git URL to clone the sources from.
- `LLAMA_GIT_SHA`: Git SHA to check out.
- `LLAMA_BUILD_COMMON`: Build llama.cpp's `common` dependency; enabled by default.
- `BUILD_SHARED_LIBS`: Build shared instead of static dependency libraries (specifically `ggml` and `common`); disabled by default.
- `LLAMA_CURL`: Enable HTTP transport via libcurl for remote models or features requiring network communication; disabled by default.

#### onnxruntime genai options

When using `onnxruntime-genai`, the `onnxruntime` dependency will be built from source. To customize the versions of both `onnxruntime` and `onnxruntime-genai`, the following configuration parameters can be used:

onnxruntime:

- `ONNXRUNTIME_SRC_DIR`: Source directory path that will be populated by the CMake configuration.
- `ONNXRUNTIME_GIT_URL`: Git URL to clone the sources from.
- `ONNXRUNTIME_GIT_TAG`: Git SHA to check out.

onnxruntime-genai:

- `ONNXRT_GENAI_SRC_DIR`: Source directory path that will be populated by the CMake configuration.
- `ONNXRT_GENAI_GIT_URL`: Git URL to clone the sources from.
- `ONNXRT_GENAI_GIT_TAG`: Git SHA to check out.

> **NOTE**: This repository has been tested with `onnxruntime` version `v1.22.1` and `onnxruntime-genai` version `v0.8.3`.

## Quick start

By default, the JNI builds are enabled, and Arm® KleidiAI™ kernels are enabled on arm64/aarch64 targets. To disable the Arm® KleidiAI™ kernels, configure with `-DUSE_KLEIDIAI=OFF`.

### Neural network

Each framework has a different default model.

#### llama cpp model

This project uses the **phi-2 model** as its default network for the `llama.cpp` framework. The model is distributed in the **Q4_0 quantization format**, which is highly recommended as it delivers effective inference times by striking a balance between computational efficiency and model performance.

- You can access the model from [Hugging Face](https://huggingface.co/ggml-org/models/blob/main/phi-2/ggml-model-q4_0.gguf).
- The default model configuration is declared in the [`requirements.json`](scripts/py/requirements.json) file. However, any model supported by the backend library can be used.

> **NOTE**: Currently only Q4_0 models are accelerated by Arm® KleidiAI™ kernels in `llama.cpp`.

#### onnxruntime genai model

This project uses **Phi-4-mini-instruct-onnx** as its default network for the `onnxruntime-genai` framework. The model is distributed in the **int4 quantization format** with a **block size of 32**, which is highly recommended as it delivers effective inference times by striking a balance between computational efficiency and model performance.

- You can access the model from [Hugging Face](https://huggingface.co/microsoft/Phi-4-mini-instruct-onnx/tree/main/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4).
- The default model configuration is declared in the [`requirements.json`](scripts/py/requirements.json) file. However, any model supported by the backend library can be used.

To use an ONNX model with this framework, the following files are required:

- `genai_config.json`: configuration file
- `model_name.onnx`: ONNX model
- `model_name.onnx.data`: ONNX model data
- `tokenizer.json`: tokenizer file
- `tokenizer_config.json`: tokenizer configuration file

These files are essential for loading and running ONNX models effectively.

> **NOTE**: Currently only int4 models with block size 32 are accelerated by Arm® KleidiAI™ kernels in `onnxruntime-genai`.
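The files listed above are expected to sit together in a single model directory. As an illustration only (the directory path matches the download location used elsewhere in this README, while the actual file names depend on the model you download), the directory contents can be checked with:

```shell
# Illustrative check of an onnxruntime-genai model directory; file names will vary by model.
ls resources_downloaded/models/onnxruntime-genai/
# genai_config.json  model_name.onnx  model_name.onnx.data  tokenizer.json  tokenizer_config.json
```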
### To build for Android

For an Android™ build, ensure `NDK_PATH` is set to the installed Android™ NDK. Specify the Android™ ABI and platform if required, or use a default preset, e.g. `android-arm64-release-kleidi-on-v82a-dotprod-i8mm`:

```shell
cmake -B build \
    -DCMAKE_TOOLCHAIN_FILE=${NDK_PATH}/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-33 \
    -DCMAKE_C_FLAGS=-march=armv8.2-a+i8mm+dotprod \
    -DCMAKE_CXX_FLAGS=-march=armv8.2-a+i8mm+dotprod

cmake --build ./build
```

### To build for Linux

When building for Linux® targets with the `llama.cpp` backend, `GGML_CPU_ARM_ARCH` can be set to provide the architecture flags.

#### Generic aarch64 target

As an example, for a target with `FEAT_DOTPROD` and `FEAT_I8MM` available, the configuration command might be:

```shell
cmake -B build \
    --preset=elinux-aarch64-release-with-tests \
    -DGGML_CPU_ARM_ARCH=armv8.2-a+dotprod+i8mm \
    -DBUILD_EXECUTABLE=ON

cmake --build ./build
```

#### Aarch64 target with SME

To build for an aarch64 Linux® system with the [Scalable Matrix Extension](https://developer.arm.com/documentation/109246/0100/SME-Overview/SME-and-SME2), ensure that for `llama.cpp` the `GGML_CPU_ARM_ARCH` variable is set with the needed feature flags, as below:

```shell
cmake -B build \
    --preset=elinux-aarch64-release-with-tests \
    -DGGML_CPU_ARM_ARCH=armv8.2-a+dotprod+i8mm+sve+sme \
    -DBUILD_EXECUTABLE=ON

cmake --build ./build
```

Once built, the standalone application can be executed to measure performance. If `FEAT_SME` is available on the deployment target, the environment variable `GGML_KLEIDIAI_SME` can be used to toggle the use of SME kernels during execution for `llama.cpp`. For example:

```shell
GGML_KLEIDIAI_SME=1 ./build/bin/llama-cli -m resources_downloaded/models/llama.cpp/model.gguf -t 1 -p "What is a car?"
```

To run without invoking SME kernels, set `GGML_KLEIDIAI_SME=0` during execution:

```shell
GGML_KLEIDIAI_SME=0 ./build/bin/llama-cli -m resources_downloaded/models/llama.cpp/model.gguf -t 1 -p "What is a car?"
```

> **NOTE**: In some cases, it may be desirable to build a statically linked executable. For the llama.cpp backend,
> this can be done by adding these configuration parameters to the CMake command for Clang or GNU toolchains:
> ```shell
> -DCMAKE_EXE_LINKER_FLAGS="-static" \
> -DGGML_OPENMP=OFF
> ```

#### Native host build

```shell
cmake -B build --preset=native-release-with-tests

cmake --build ./build
```

## Building and running tests

To build and test for the native host machine:

```shell
cmake -B build --preset=native-release-with-tests

cmake --build ./build

ctest --test-dir ./build
```

> **NOTE**: For consistent and reliable test results, avoid using the `--parallel` option when running tests.
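If a test fails, the standard CTest options for output verbosity and test selection can help narrow the problem down. These flags are generic CTest features rather than anything specific to this project, and the name pattern is only an example:

```shell
# Re-run only the tests whose names match the pattern and print the output of any failures.
ctest --test-dir ./build -R llm-cpp --output-on-failure
```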
A full test run should produce output similar to the following:

```shell
Internal ctest changing into directory: /home/user/llm/build
Test project /home/user/llm/build
    Start 1: llm-cpp-ctest
1/2 Test #1: llm-cpp-ctest ....................   Passed    4.16 sec
    Start 2: llama-jni-ctest
2/2 Test #2: llama-jni-ctest ..................   Passed    3.25 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) =   7.41 sec
```

Even when cross-compiling, the test binaries can be copied to the target system and executed.

## To build an executable

To build a standalone application, add the configuration option `-DBUILD_EXECUTABLE=ON` to any of the build commands above. For example, on Aarch64:

```shell
cmake -B build \
    --preset=elinux-aarch64-release-with-tests \
    -DCMAKE_C_FLAGS=-march=armv8.2-a+dotprod+i8mm \
    -DCMAKE_CXX_FLAGS=-march=armv8.2-a+dotprod+i8mm \
    -DBUILD_EXECUTABLE=ON

cmake --build ./build
```

Or on x86 (no KleidiAI™ acceleration):

```shell
cmake -B build \
    --preset=native-release-with-tests \
    -DBUILD_EXECUTABLE=ON

cmake --build ./build
```

### llama cpp

You can run either executable from the command line and add your own prompt, for example:

```shell
./build/bin/llama-cli -m resources_downloaded/models/llama.cpp/model.gguf --prompt "What is the capital of France"
```

More information on how this executable can be run can be found in `llama.cpp/examples/main/README.md`.

### onnxruntime genai

You can run the `model_benchmark` executable from the command line:

```shell
./build/bin/model_benchmark -i resources_downloaded/models/onnxruntime-genai
```

More information on how this executable can be run can be found in `onnxruntime-genai/benchmark/c/readme.md`.

## Trademarks

* Arm® and KleidiAI™ are registered trademarks or trademarks of Arm® Limited (or its subsidiaries) in the US and/or elsewhere.
* Android™ is a trademark of Google LLC.

## License

This project is distributed under the software licenses in the [LICENSES](LICENSES) directory.