<!--
    SPDX-FileCopyrightText: Copyright 2024-2025 Arm Limited and/or its affiliates <open-source-office@arm.com>

    SPDX-License-Identifier: Apache-2.0
-->


# LLM library

<!-- TOC -->
* [LLM library](#llm-library)
  * [Prerequisites](#prerequisites)
  * [Configuration options](#configuration-options)
    * [Conditional options](#conditional-options)
      * [llama cpp options](#llama-cpp-options)
      * [onnxruntime genai options](#onnxruntime-genai-options)
  * [Quick start](#quick-start)
    * [Neural network](#neural-network)
      * [llama cpp model](#llama-cpp-model)
      * [onnxruntime genai model](#onnxruntime-genai-model)
    * [To build for Android](#to-build-for-android)
    * [To build for Linux](#to-build-for-linux)
      * [Generic aarch64 target](#generic-aarch64-target)
      * [Aarch64 target with SME](#aarch64-target-with-sme)
      * [Native host build](#native-host-build)
  * [Building and running tests](#building-and-running-tests)
  * [To build an executable](#to-build-an-executable)
    * [llama cpp](#llama-cpp)
    * [onnxruntime genai](#onnxruntime-genai)
  * [Trademarks](#trademarks)
  * [License](#license)
<!-- TOC -->

This repo is designed for building an
[Arm® KleidiAI™](https://www.arm.com/markets/artificial-intelligence/software/kleidi)
enabled LLM library using the CMake build system. It aims to provide an abstraction for the different Machine Learning
frameworks/backends that Arm® KleidiAI™ kernels have been integrated into.
Currently, it supports the [llama.cpp](https://github.com/ggml-org/llama.cpp) and
[onnxruntime-genai](https://github.com/microsoft/onnxruntime-genai) backends, but we intend to add
support for other backends, such as [mediapipe](https://github.com/google-ai-edge/mediapipe), soon.

The backend library (selected at the CMake configuration stage) is wrapped by this project's thin C++ layer, which can be
used directly for testing and evaluation. JNI bindings are also provided for developers targeting Android™-based
applications.

## Prerequisites

* A Linux®-based operating system is recommended (this repo is tested on Ubuntu® 22.04.4 LTS)
* An Android™ or Linux® device with an Arm® CPU is recommended as a deployment target, but this
  library can be built for any native machine.
* CMake 3.27 or above installed
* Python 3.9 or above installed; Python is used to download test resources and models
* Android™ NDK (if building for Android™); a minimum version of r27 is recommended and can be downloaded
  from [here](https://developer.android.com/ndk/downloads)
* Aarch64 GNU toolchain (version 14.1 or later) if cross-compiling from a Linux®-based system, which can be downloaded from [here](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads)
* Java Development Kit (JDK), required for building the JNI wrapper library necessary to use this module in an Android™/Java application.

## Configuration options

The project is designed to download the required software sources based on user-provided configuration options.
CMake presets are available to use and set the following variables (an example configuration follows this list):

- `LLM_FRAMEWORK`: Currently supports `llama.cpp` (default framework) and `onnxruntime-genai`.
- `BUILD_JNI_LIB`: Build the JNI shared library that other projects can consume, <b>enabled by default.</b>
- `BUILD_UNIT_TESTS`: Build C++ unit tests and add them to CTest; JNI tests will also be built, <b>enabled by default.</b>
- `BUILD_EXECUTABLE`: Build standalone applications, <b>disabled by default.</b>

> **NOTE**: If you need a specific version of Java, set its path in the `JAVA_HOME` environment variable.
> ```shell
> export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
> ```
> A failure to locate `jni.h` occurs if a compatible JDK is not on the system path.
> If you want to experiment with the repository without JNI libs, turn the `BUILD_JNI_LIB` option off by
> configuring with `-DBUILD_JNI_LIB=OFF`.

- `DOWNLOADS_LOCK_TIMEOUT`: A timeout value in seconds indicating how long to wait for a lock when
  downloading resources. The download is a one-time step that the CMake configuration will initiate unless it
  has already been run, either by the user directly or by a prior CMake configuration. The lock prevents multiple
  CMake configuration processes running in parallel from downloading files to the same location.
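
For example, a minimal configuration sketch that selects the `onnxruntime-genai` framework and disables the JNI
library (this assumes `onnxruntime-genai` is accepted verbatim as the `LLM_FRAMEWORK` value and uses the
`native-release-with-tests` preset described later in this README):

```shell
cmake -B build \
    --preset=native-release-with-tests \
    -DLLM_FRAMEWORK=onnxruntime-genai \
    -DBUILD_JNI_LIB=OFF

cmake --build ./build
```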

### Conditional options

There are different conditional options for different frameworks.

#### llama cpp options

When `llama.cpp` is selected as the framework, these configuration parameters can be set (see the example after this list):
- `LLAMA_SRC_DIR`: Source directory path that will be populated by CMake
  configuration.
- `LLAMA_GIT_URL`: Git URL to clone the sources from.
- `LLAMA_GIT_SHA`: Git SHA for checkout.
- `LLAMA_BUILD_COMMON`: Build llama.cpp's `common` dependency library, <b>enabled by default.</b>
- `BUILD_SHARED_LIBS`: Build shared instead of static dependency libraries, specifically - ggml and common, <b>disabled by default.</b>
- `LLAMA_CURL`: Enable HTTP transport via libcurl for remote models or features requiring network communication, <b>disabled by default.</b>
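
For instance, a sketch of a `llama.cpp` build with shared dependency libraries and libcurl support enabled
(option names as listed above; the preset is described later in this README):

```shell
cmake -B build \
    --preset=native-release-with-tests \
    -DBUILD_SHARED_LIBS=ON \
    -DLLAMA_CURL=ON

cmake --build ./build
```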

#### onnxruntime genai options

When using `onnxruntime-genai`, the `onnxruntime` dependency will be built from source. To customize
the versions of both `onnxruntime` and `onnxruntime-genai`, the following configuration parameters
can be used (a version-pinning example follows the note below):

onnxruntime:
- `ONNXRUNTIME_SRC_DIR`: Source directory path that will be populated by CMake
  configuration.
- `ONNXRUNTIME_GIT_URL`: Git URL to clone the sources from.
- `ONNXRUNTIME_GIT_TAG`: Git tag or SHA to check out.

onnxruntime-genai:
- `ONNXRT_GENAI_SRC_DIR`: Source directory path that will be populated by CMake
  configuration.
- `ONNXRT_GENAI_GIT_URL`: Git URL to clone the sources from.
- `ONNXRT_GENAI_GIT_TAG`: Git tag or SHA to check out.

> **NOTE**: This repository has been tested with `onnxruntime` version `v1.22.1` and
> `onnxruntime-genai` version `v0.8.3`.
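
For example, a sketch that pins both dependencies to the versions mentioned in the note above (assuming these
tags are valid refs in the default repositories; the `LLM_FRAMEWORK` value is used as in the earlier example):

```shell
cmake -B build \
    --preset=native-release-with-tests \
    -DLLM_FRAMEWORK=onnxruntime-genai \
    -DONNXRUNTIME_GIT_TAG=v1.22.1 \
    -DONNXRT_GENAI_GIT_TAG=v0.8.3

cmake --build ./build
```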

## Quick start

By default, JNI builds are enabled, and Arm® KleidiAI™ kernels are enabled on arm64/aarch64 targets.
To disable the KleidiAI™ kernels, configure with `-DUSE_KLEIDIAI=OFF`; to disable the JNI builds, configure with `-DBUILD_JNI_LIB=OFF`.

### Neural network

There is a different default model for each framework.

#### llama cpp model

This project uses the **phi-2 model** as its default network for the `llama.cpp` framework.
The model is distributed using the **Q4_0 quantization format**, which is highly recommended as it
delivers effective inference times by striking a balance between computational efficiency and model performance.

- You can access the model from [Hugging Face](https://huggingface.co/ggml-org/models/blob/main/phi-2/ggml-model-q4_0.gguf).
- The default model configuration is declared in the [`requirements.json`](scripts/py/requirements.json) file.

However, any model supported by the backend library could be used.

> **NOTE**: Currently only Q4_0 models are accelerated by Arm® KleidiAI™ kernels in `llama.cpp`.

#### onnxruntime genai model

This project uses the **Phi-4-mini-instruct-onnx** model as its default network for the `onnxruntime-genai` framework.
The model is distributed using an **int4 quantization format** with a **block size of 32**, which is highly recommended as it
delivers effective inference times by striking a balance between computational efficiency and model performance.

- You can access the model from [Hugging Face](https://huggingface.co/microsoft/Phi-4-mini-instruct-onnx/tree/main/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4).
- The default model configuration is declared in the [`requirements.json`](scripts/py/requirements.json) file.

However, any model supported by the backend library could be used.

To use an ONNX model with this framework, the following files are required:
- `genai_config.json`: Configuration file
- `model_name.onnx`: ONNX model
- `model_name.onnx.data`: ONNX model data
- `tokenizer.json`: Tokenizer file
- `tokenizer_config.json`: Tokenizer config file

These files are essential for loading and running ONNX models effectively.
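
An illustrative layout of such a model directory (the path matches the one used by the run example later in this
README; `model_name` is a placeholder):

```
resources_downloaded/models/onnxruntime-genai/
├── genai_config.json
├── model_name.onnx
├── model_name.onnx.data
├── tokenizer.json
└── tokenizer_config.json
```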

> **NOTE**: Currently, only int4 models with a block size of 32 are accelerated by Arm® KleidiAI™ kernels in `onnxruntime-genai`.

### To build for Android

For an Android™ build, ensure `NDK_PATH` is set to the installed Android™ NDK, and specify the Android™ ABI and platform if required, or use a default preset, e.g. `android-arm64-release-kleidi-on-v82a-dotprod-i8mm` (a preset-based example follows the command below):
```shell
cmake -B build \
    -DCMAKE_TOOLCHAIN_FILE=${NDK_PATH}/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-33 \
    -DCMAKE_C_FLAGS=-march=armv8.2-a+i8mm+dotprod \
    -DCMAKE_CXX_FLAGS=-march=armv8.2-a+i8mm+dotprod

cmake --build ./build
```
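
Alternatively, a sketch using the default preset mentioned above (assuming the preset resolves the NDK toolchain via
the `NDK_PATH` environment variable):

```shell
cmake -B build --preset=android-arm64-release-kleidi-on-v82a-dotprod-i8mm

cmake --build ./build
```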

### To build for Linux

When building for Linux targets with the `llama.cpp` backend, `GGML_CPU_ARM_ARCH` can be set to provide the architecture flags.

#### Generic aarch64 target

As an example, for a target with `FEAT_DOTPROD` and `FEAT_I8MM` available, the configuration command might be:

```shell
cmake -B build \
    --preset=elinux-aarch64-release-with-tests \
    -DGGML_CPU_ARM_ARCH=armv8.2-a+dotprod+i8mm \
    -DBUILD_EXECUTABLE=ON

cmake --build ./build
```

#### Aarch64 target with SME

To build for an aarch64 Linux system with the [Scalable Matrix Extension](https://developer.arm.com/documentation/109246/0100/SME-Overview/SME-and-SME2) (SME), ensure `GGML_CPU_ARM_ARCH` is set with the needed feature flags for `llama.cpp`, as below:

```shell
cmake -B build \
    --preset=elinux-aarch64-release-with-tests \
    -DGGML_CPU_ARM_ARCH=armv8.2-a+dotprod+i8mm+sve+sme \
    -DBUILD_EXECUTABLE=ON

cmake --build ./build
```

Once built, a standalone application can be executed to measure performance.

If `FEAT_SME` is available on the deployment target, the environment variable `GGML_KLEIDIAI_SME` can be used to
toggle the use of SME kernels during execution for `llama.cpp`. For example:

```shell
GGML_KLEIDIAI_SME=1 ./build/bin/llama-cli -m resources_downloaded/models/llama.cpp/model.gguf -t 1 -p "What is a car?"
```

To run without invoking SME kernels, set `GGML_KLEIDIAI_SME=0` during execution:

```shell
GGML_KLEIDIAI_SME=0 ./build/bin/llama-cli -m resources_downloaded/models/llama.cpp/model.gguf -t 1 -p "What is a car?"
```

> **NOTE**: In some cases, it may be desirable to build a statically linked executable. For the llama.cpp backend,
> this can be done by adding these configuration parameters to the CMake command for Clang or GNU toolchains:
> ```shell
>    -DCMAKE_EXE_LINKER_FLAGS="-static"   \
>    -DGGML_OPENMP=OFF
> ```
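
Put together, a statically linked aarch64 executable build might be configured as follows (a sketch combining the
preset and flags already shown above):

```shell
cmake -B build \
    --preset=elinux-aarch64-release-with-tests \
    -DGGML_CPU_ARM_ARCH=armv8.2-a+dotprod+i8mm \
    -DBUILD_EXECUTABLE=ON \
    -DCMAKE_EXE_LINKER_FLAGS="-static" \
    -DGGML_OPENMP=OFF

cmake --build ./build
```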

#### Native host build

```shell
cmake -B build --preset=native-release-with-tests
cmake --build ./build
```

## Building and running tests

To build and test for the native host machine:

```shell
cmake -B build --preset=native-release-with-tests
cmake --build ./build
ctest --test-dir ./build
```

> **NOTE**: For consistent and reliable test results, avoid using the `--parallel` option when running tests.

This should produce something like:
```shell
Internal ctest changing into directory: /home/user/llm/build
Test project /home/user/llm/build
    Start 1: llm-cpp-ctest
1/2 Test #1: llm-cpp-ctest ....................   Passed    4.16 sec
    Start 2: llama-jni-ctest
2/2 Test #2: llama-jni-ctest ..................   Passed    3.25 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) =   7.41 sec
```
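
To run a single test from the suite, standard CTest filtering can be used, for example (test names as in the output above):

```shell
ctest --test-dir ./build -R llm-cpp-ctest --output-on-failure
```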

Even when cross-compiling, the test binaries can be copied to the target system and executed.
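
For example, a sketch of copying the binaries to an aarch64 Linux target over SSH (the host name is illustrative,
`<test-binary>` is a placeholder for an executable produced under `build/bin`, and any required resources such as
downloaded models may also need to be copied):

```shell
# Copy the cross-compiled binaries to the target (illustrative destination)
scp -r build/bin user@target:/tmp/llm-tests

# Run a test executable on the target; <test-binary> is a placeholder
ssh user@target '/tmp/llm-tests/<test-binary>'
```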

## To build an executable

To build a standalone application, add the configuration option `-DBUILD_EXECUTABLE=ON` to any of the build
commands above. For example:

On aarch64:
```shell
cmake -B build \
    --preset=elinux-aarch64-release-with-tests \
    -DCMAKE_C_FLAGS=-march=armv8.2-a+dotprod+i8mm \
    -DCMAKE_CXX_FLAGS=-march=armv8.2-a+dotprod+i8mm \
    -DBUILD_EXECUTABLE=ON
cmake --build ./build
```

Or on x86 (no KleidiAI acceleration):

```shell
cmake -B build \
    --preset=native-release-with-tests \
    -DBUILD_EXECUTABLE=ON
cmake --build ./build
```

### llama cpp

You can run either executable from the command line and add your prompt, for example:
```shell
./build/bin/llama-cli -m resources_downloaded/models/llama.cpp/model.gguf --prompt "What is the capital of France"
```
More information on how this executable can be run can be found in `llama.cpp/examples/main/README.md`.

### onnxruntime genai

You can run the `model_benchmark` executable from the command line:
```shell
./build/bin/model_benchmark -i resources_downloaded/models/onnxruntime-genai
```
More information on how this executable can be run can be found in `onnxruntime-genai/benchmark/c/readme.md`.

## Trademarks

* Arm® and KleidiAI™ are registered trademarks or trademarks of Arm® Limited (or its subsidiaries) in the US and/or
  elsewhere.
* Android™ is a trademark of Google LLC.

## License

This project is distributed under the software licenses in [LICENSES](LICENSES) directory.