Commits · v1.11.0-rc2 · Kleidi / KleidiAI

Jul 01, 2025

Anton Bondarenko authored Jul 01, 2025



Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>

Approved-by: Dan Johansson <dan.johansson@arm.com>

f362d32f

Jun 26, 2025

Improve packing performance for quantized Int4 per-block · 23266090

Evie Wright authored Jun 26, 2025 and

Viet-Hoa Do committed Jun 26, 2025

Improves performance of `kai_rhs_pack_nxk_qsi4c32pnrx8_qsu4c32s1s0_neon` by vectorizing row summation.
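
To make "vectorizing row summation" concrete, here is a minimal sketch that sums the 4-bit values of one packed row. It is not the code from this commit: the helper name `sum_packed_u4_row` and the two-nibbles-per-byte layout are assumptions made for illustration. The Advanced SIMD path is used only on AArch64, and a scalar loop handles the tail and other targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: sums every unsigned 4-bit value in a row that stores
// two values per byte (low nibble and high nibble).
std::int32_t sum_packed_u4_row(const std::uint8_t* row, std::size_t num_bytes) {
    assert(row != nullptr || num_bytes == 0);
    std::int32_t sum = 0;
    std::size_t i = 0;
#if defined(__aarch64__) && defined(__ARM_NEON)
    const uint8x16_t low_mask = vdupq_n_u8(0x0F);
    for (; i + 16 <= num_bytes; i += 16) {
        const uint8x16_t bytes = vld1q_u8(row + i);
        const uint8x16_t low = vandq_u8(bytes, low_mask);
        const uint8x16_t high = vshrq_n_u8(bytes, 4);
        // Each lane holds at most 15 + 15 = 30, so the 8-bit add cannot overflow.
        // vaddlvq_u8 widens while reducing all 16 lanes to a single value.
        sum += vaddlvq_u8(vaddq_u8(low, high));
    }
#endif
    // Scalar tail (and the full loop on targets without Advanced SIMD).
    for (; i < num_bytes; ++i) {
        sum += (row[i] & 0x0F) + (row[i] >> 4);
    }
    return sum;
}
```

The vector path processes 32 values per iteration (16 bytes), which is where the speed-up over a nibble-by-nibble loop would come from.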

Signed-off-by: Evie Wright <evie.wright@arm.com>

Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

23266090

Jun 25, 2025

Optimize kai_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0 using advanced SIMD, for kr/sr = 4 · d18f620a

Evie Wright authored Jun 25, 2025 and

Viet-Hoa Do committed Jun 25, 2025

Signed-off-by: Evie Wright <evie.wright@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>

d18f620a

Jun 24, 2025

Update CHANGELOG with recent updates · 184e45c6

Anton Bondarenko authored Jun 24, 2025



Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>

Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>

184e45c6

Matmul Micro-kernel(1xN) F32/F16 <- (QSI8D32) LHS x (QAI4C32) RHS · a0afd5e1

Anitha Raj authored Jun 24, 2025 and

Anton Bondarenko committed Jun 24, 2025



* Matrix multiplication (1xN) micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a single-precision (F32) and half-precision (F16) output, optimized for FEAT_DotProd and packing parameter kr = 8.
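
The arithmetic behind one output element of such a kernel can be sketched as a scalar reference. This is an illustration under assumptions, not KleidiAI code: it assumes the usual conventions `real = scale * q` for the symmetric LHS and `real = scale * (q - zero_point)` for the asymmetric RHS, with one scale (and zero point) per block along K. The `QuantizedVector` type and `reference_block_dot` function are hypothetical, and values are stored unpacked, one per element, whereas the real micro-kernels work on packed buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical unpacked view of one quantized row (LHS) or column (RHS) along K.
struct QuantizedVector {
    std::vector<std::int8_t> values;        // one quantized value per K element
    std::vector<float> scales;              // one scale per block
    std::vector<std::int32_t> zero_points;  // one zero point per block (0 when symmetric)
};

// Computes one F32 output element as
//   sum over blocks of lhs_scale * rhs_scale * sum_k(lhs_q[k] * (rhs_q[k] - rhs_zero_point)).
float reference_block_dot(const QuantizedVector& lhs, const QuantizedVector& rhs,
                          std::size_t k, std::size_t block_len) {
    assert(block_len > 0 && k % block_len == 0);
    assert(lhs.values.size() == k && rhs.values.size() == k);
    const std::size_t num_blocks = k / block_len;
    assert(lhs.scales.size() == num_blocks && rhs.scales.size() == num_blocks);
    assert(rhs.zero_points.size() == num_blocks);

    float acc = 0.0F;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        std::int32_t block_acc = 0;  // integer accumulation inside a block
        for (std::size_t i = 0; i < block_len; ++i) {
            const std::size_t idx = b * block_len + i;
            block_acc += static_cast<std::int32_t>(lhs.values[idx]) *
                         (static_cast<std::int32_t>(rhs.values[idx]) - rhs.zero_points[b]);
        }
        // Apply both per-block scales once per block, in floating point.
        acc += lhs.scales[b] * rhs.scales[b] * static_cast<float>(block_acc);
    }
    return acc;
}
```

The "32" in QSI8D32 and QAI4C32 would correspond to `block_len = 32` here; the sketch leaves it as a parameter so that small inputs can be checked by hand.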

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

a0afd5e1

Jun 23, 2025

fix data type for values in kai_rhs_pack_nxk_qsi4cxps1s0_qsu4cxs1s0_neon · 5e8be458

Evie Wright authored Jun 23, 2025 and

Suhail M committed Jun 23, 2025



update unit test to test signed integer inputs
Resolves: KLEIDIAI-664

Signed-off-by: Evie Wright <evie.wright@arm.com>

Approved-by: Suhail M <mohammedsuhail.munshi@arm.com>

5e8be458

Jun 20, 2025

Back up floating point registers in SME1 kernels · c80d1883

Jakub Sujak authored Jun 20, 2025



Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>

Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>

c80d1883

Add SME1 F16 GEMM micro-kernel · f3da48be

Jakub Sujak authored Jun 20, 2025

Adds F16 GEMM micro-kernel using SME1 MOPA instruction and 2VL x 2VL block size. This SME1 kernel is compatible with existing SME F16 LHS and RHS packing functions.

Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>

Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>

f3da48be

Update SME1 F32 GEMM assembly · 395b3695

Jakub Sujak authored Jun 20, 2025



Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>

Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>

395b3695

Jun 19, 2025

Zero-initialize test buffers · 88ef5b98

Jakub Sujak authored Jun 19, 2025



Test buffers must be initialized to a default value of 0.

Report mismatches outside the portion-tested ROI to catch out-of-bound kernel writes.
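
The check described above can be sketched as follows. This is a minimal illustration, not the test framework's actual API: `Rect` and `count_writes_outside_roi` are hypothetical names, and the destination is assumed to be a row-major F32 matrix that was zero-initialized before the kernel ran.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical rectangle describing the tested portion of a row-major matrix.
struct Rect {
    std::size_t start_row;
    std::size_t start_col;
    std::size_t height;
    std::size_t width;
};

// Counts elements outside the ROI that are no longer zero. Because the buffer
// starts out zeroed, any such element indicates an out-of-bounds kernel write.
std::size_t count_writes_outside_roi(const std::vector<float>& dst, std::size_t rows,
                                     std::size_t cols, const Rect& roi) {
    assert(dst.size() == rows * cols);
    assert(roi.start_row + roi.height <= rows);
    assert(roi.start_col + roi.width <= cols);

    std::size_t mismatches = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            const bool in_roi = r >= roi.start_row && r < roi.start_row + roi.height &&
                                c >= roi.start_col && c < roi.start_col + roi.width;
            if (!in_roi && dst[r * cols + c] != 0.0F) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

A test would run the kernel on the ROI only and then expect this count to be zero.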

Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>

Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>

88ef5b98

Move GEMM F32 SME1 assembly kernel to separate file · bb852338

Viet-Hoa Do authored Jun 19, 2025



Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Approved-by: Jakub Sujak <jakub.sujak@arm.com>

bb852338

Add GEMM F32 using SME1 MOPA with block size 2VLx2VL · b3b24af1

Viet-Hoa Do authored Jun 19, 2025



* Add GEMM F32 kernel using SME1 MOPA with block size 2VLx2VL.
* Add tests for the newly added kernel.
* Add CI job to run the kernel on FVP with SME1 and without SME2.

Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

b3b24af1

Add SME/SME2 support for FVP image in CI · 8426f5df

Anton Bondarenko authored Jun 19, 2025

Update the Linux kernel and bootloader-wrapper to versions that support SME/SME2.
This is required to handle context switches correctly, with SME state preservation
and restoration. Also address a CPU boot issue by keeping the number of CPUs used
in the bootloader/DTS in sync with the FVP configuration. Two CPUs are used: the
first runs system services and the second is isolated to run test programs.

Linux Kernel: 6.16-rc2 (update to released version once available)
Bootloader wrapper: latest available

Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>

Approved-by: Jakub Sujak <jakub.sujak@arm.com>

8426f5df

Jun 16, 2025

Bump version to v1.10.0 · dc69e899

Emil Ohlsson authored Jun 16, 2025



Signed-off-by: Emil Ohlsson <emil.ohlsson@arm.com>

Approved-by: Jakub Sujak <jakub.sujak@arm.com>

dc69e899

Jun 13, 2025

Matmul Micro-kernel(MxN) F32 <- (QSI8D32) LHS x (QAI4C32) RHS · 4f5154e0

Anitha Raj authored Jun 13, 2025 and

Emil Ohlsson committed Jun 13, 2025



* Matrix multiplication (MxN) micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a single-precision (F32) output, optimized for FEAT_DotProd

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Reviewed-by: Anitha Raj <anitha.raj@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

4f5154e0

Jun 12, 2025

Split FP32 SME kernels into separate assembly file. · cdcff672

Jens Elofsson authored Jun 12, 2025



Move the assembly blocks of the following kernels into their own files:
- rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme
- lhs_pack_f32p2vlx1_f32_sme
- matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa

Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

cdcff672

Split FP16 SME kernels into separate assembly file. · f0454add

Jens Elofsson authored Jun 12, 2025



Move the assembly blocks of the following kernels into their own files:
- lhs_pack_x16p2vlx2_x16_sme
- rhs_pack_kxn_x16p2vlx2b_x16_x16_sme
- matmul_clamp_f16_f16p2vlx2_f16p2vlx2_2vlx2vl_sme2_mopa
- matmul_clamp_f16_f16_f16p2vlx2b_1x16vl_sme2_dot

Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

f0454add

Jun 11, 2025

Split INT8 SME kernels into separate assembly file. · f11792a5

Jens Elofsson authored Jun 11, 2025



Move the assembly blocks of the following kernels into their own files:
- lhs_pack_x8p2vlx4_x8_sme
- rhs_pack_kxn_qsi8cxp2vlx4sb_qs8cx_f32_i32_sme
- matmul_clamp_qai8_qai8_qsi8cxp2vlx4sb_1x16vl_sme2_dot
- matmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa

Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

f11792a5

Jun 10, 2025

Fix bug where kai_get_m_step returns the incorrect value · e4e3c549

Jens Elofsson authored Jun 10, 2025



Fix issue in kernels
- matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla

where kai_get_m_step returns the incorrect value.

Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

e4e3c549

Jun 05, 2025

Enable MSVC support for matmul_clamp_f32_qsi8d32p_qai4c32p micro-kernels · d08325fb

Anitha Raj authored Jun 05, 2025 and

Anton Bondarenko committed Jun 05, 2025

Update architectural feature guards and CMakeLists to enable MSVC build for matmul_clamp_f32_qsi8d32p_qai4c32p micro-kernels.

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

d08325fb

Optimize kai_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0 using advanced SIMD · 05ef512e

Evie Wright authored Jun 05, 2025 and

Anton Bondarenko committed Jun 05, 2025



Optimize the transposed RHS packing function for matmul_clamp_f32_qai8dxp_qsi4c32p using advanced SIMD, for kr / sr = 8

Signed-off-by: Evie Wright <evie.wright@arm.com>

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Reviewed-by: Anitha Raj <anitha.raj@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Evie Wright <evie.wright@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

05ef512e

Fix minor bug - Incorrect vector size in Conv2D example · 09771a98

Suhail M authored Jun 05, 2025



Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com>

Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>

09771a98

Enable MSVC support for imatmul kernels · b3699d63

Emil Ohlsson authored Jun 05, 2025



This change adds MSVC support for the imatmul kernels. As a result, they
are also aligned, which results in some additional diffs.

Signed-off-by: Emil Ohlsson <emil.ohlsson@arm.com>

Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

b3699d63

fix memory allocation for f16_f16_f16p example · febaaa75

Charlotte Chen authored Jun 05, 2025 and

Jakub Sujak committed Jun 05, 2025

Signed-off-by: Charlotte Chen <charlotte.chen@arm.com>

Approved-by: Jakub Sujak <jakub.sujak@arm.com>

febaaa75

Jun 04, 2025

Matmul Micro-kernel(MxN) F16 <- (QSI8D32) LHS x (QAI4C32) RHS · 8c63677f

Anitha Raj authored Jun 04, 2025 and

Jakub Sujak committed Jun 04, 2025



* Matrix multiplication (MxN) micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a half-precision (F16) output, optimized for FEAT_DotProd.

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Approved-by: Jakub Sujak <jakub.sujak@arm.com>

8c63677f

Update architecture check for wider compatibility · 3666557f

Emil Ohlsson authored Jun 04, 2025



Signed-off-by: Emil Ohlsson <emil.ohlsson@arm.com>

Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

3666557f

Jun 03, 2025

Add Conv2D example using FP16 IGEMM · d7f833a5

Suhail M authored Jun 03, 2025



- Example demonstrates creating an indirect buffer using a Conv2D input tensor
- Example demonstrates indirect buffer usage with imatmul kernels.
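
As a rough illustration of the first point, the sketch below builds a table of pointers for a Conv2D input, with one entry per output pixel and filter tap. It is not the example added by this commit. It assumes a single NHWC image, stride 1, and equal zero padding on every side. `Conv2dShape` and `make_indirection_buffer` are hypothetical names, and the pointer ordering expected by the imatmul LHS packing functions may differ from the one used here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of the convolution (NHWC input, batch of 1, stride 1).
struct Conv2dShape {
    std::size_t in_h, in_w, channels;
    std::size_t k_h, k_w;  // filter height and width
    std::size_t pad;       // zero padding applied on every side
};

// Returns one pointer per (output pixel, filter tap). Each pointer refers to the
// channel vector of an input pixel, or to `zero_row` when the tap lands in padding.
std::vector<const float*> make_indirection_buffer(const float* input, const float* zero_row,
                                                  const Conv2dShape& s) {
    assert(input != nullptr && zero_row != nullptr);
    assert(s.in_h + 2 * s.pad >= s.k_h && s.in_w + 2 * s.pad >= s.k_w);
    const std::size_t out_h = s.in_h + 2 * s.pad - s.k_h + 1;
    const std::size_t out_w = s.in_w + 2 * s.pad - s.k_w + 1;

    std::vector<const float*> table;
    table.reserve(out_h * out_w * s.k_h * s.k_w);
    for (std::size_t oy = 0; oy < out_h; ++oy) {
        for (std::size_t ox = 0; ox < out_w; ++ox) {
            for (std::size_t ky = 0; ky < s.k_h; ++ky) {
                for (std::size_t kx = 0; kx < s.k_w; ++kx) {
                    // Input coordinates, shifted back by the padding (may be negative).
                    const std::ptrdiff_t iy = static_cast<std::ptrdiff_t>(oy + ky) -
                                              static_cast<std::ptrdiff_t>(s.pad);
                    const std::ptrdiff_t ix = static_cast<std::ptrdiff_t>(ox + kx) -
                                              static_cast<std::ptrdiff_t>(s.pad);
                    const bool inside = iy >= 0 && iy < static_cast<std::ptrdiff_t>(s.in_h) &&
                                        ix >= 0 && ix < static_cast<std::ptrdiff_t>(s.in_w);
                    const std::size_t offset =
                        inside ? (static_cast<std::size_t>(iy) * s.in_w +
                                  static_cast<std::size_t>(ix)) * s.channels
                               : 0;
                    table.push_back(inside ? input + offset : zero_row);
                }
            }
        }
    }
    return table;
}
```

The point of the indirection is that the input tensor is never copied or rearranged: padding is handled by pointing at a shared row of zeros (`zero_row` must hold at least `channels` elements).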

Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com>

Reviewed-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Suhail M <mohammedsuhail.munshi@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

d7f833a5

Update the register back up for kai_kernel_matmul_clamp_f32_qai8dxp1x4_qsi8cxp4vlx4_1x4vl_sme2_dot · 24708820

Anitha Raj authored Jun 03, 2025 and

Emil Ohlsson committed Jun 03, 2025



- Following the procedure call standard, back up registers d8-d15 in the kai_kernel_matmul_clamp_f32_qai8dxp1x4_qsi8cxp4vlx4_1x4vl_sme2_dot asm micro-kernel.

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

24708820

Jun 02, 2025

Bump version and update changelog · a15b6725

Emil Ohlsson authored Jun 02, 2025



Signed-off-by: Emil Ohlsson <emil.ohlsson@arm.com>

Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>

a15b6725

May 30, 2025

Matmul Micro-kernels F32 <- QAI8DXP(LHS) x QSI8CXP(RHS) optimized for SME · 3d8217c2

Anitha Raj authored May 30, 2025 and

Felix Johnny Thomasmathibalan committed May 30, 2025



* Micro-kernels (1xN) to compute the matrix multiplication of dynamically quantized asymmetric 8-bit integer with per-channel quantization (QAI8DX) LHS matrix and quantized symmetric 8-bit integer with per-channel quantization (QSI8CX) RHS matrix and the accumulation of the result into a single-precision (F32) output, optimized for SME2 technology.
* Micro-kernels (MxN) to compute the matrix multiplication of dynamically quantized asymmetric 8-bit integer with per-channel quantization (QAI8DX) LHS matrix and quantized symmetric 8-bit integer with per-channel quantization (QSI8CX) RHS matrix and the accumulation of the result into a single-precision (F32) output, optimized for SME2 technology.

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>

3d8217c2

May 28, 2025

Extend support for signed 4-bit integer inputs in kai_rhs_pack_nxk_qsi4cxps1s0_qsu4cxs1s0_neon · bf64dadd

Evie Wright authored May 28, 2025 and

Felix Johnny Thomasmathibalan committed May 28, 2025

Signed-off-by: Evie Wright <evie.wright@arm.com>

Reviewed-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>

bf64dadd

May 27, 2025

Update build image components · 1bb52851

Anton Bondarenko authored May 27, 2025



Update missing or outdated components in build environment

Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>

Approved-by: Jakub Sujak <jakub.sujak@arm.com>

1bb52851

Address linter issues · c24de178

Emil Ohlsson authored May 27, 2025



Newer versions of the linter flag issues with parentheses in
expressions, as well as use of `size_t` without inclusion of `stddef.h`

Signed-off-by: Emil Ohlsson <emil.ohlsson@arm.com>

Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Jakub Sujak <jakub.sujak@arm.com>

c24de178

Replace templated clamp finding with dynamic · 6ef4683c

Emil Ohlsson authored May 27, 2025



Given that there is a non-templated version for doing clamp testing,
there is no need to toggle on type in matmul_test.cpp.

Signed-off-by: Emil Ohlsson <emil.ohlsson@arm.com>

Approved-by: Jens Elofsson <jens.elofsson@arm.com>

6ef4683c

May 26, 2025

Use new Buffer class for the entire test framework · 5cb98201

Viet-Hoa Do authored May 26, 2025 and

Anton Bondarenko committed May 26, 2025



* Replace `std::vector<uint8_t>` by `Buffer` class.
* Update `Buffer` class:
  - Add support for initial value of the buffer.
  - Always initialize the buffer with 0 by default.
* Add `pad_matrix` reference function to support extending
  the data buffer.
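
For illustration, a padding helper of this kind might look like the sketch below. The signature is an assumption: the real `pad_matrix` reference function operates on the test framework's `Buffer` and data-type descriptions, which are not shown in this log, so this version works on a plain row-major `std::vector<float>` and is named `pad_matrix_sketch` to avoid confusion.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a pad_matrix-style helper: surrounds a row-major
// matrix with `pad_value` on each side and returns the enlarged matrix.
std::vector<float> pad_matrix_sketch(const std::vector<float>& src, std::size_t rows,
                                     std::size_t cols, std::size_t pad_top,
                                     std::size_t pad_left, std::size_t pad_bottom,
                                     std::size_t pad_right, float pad_value) {
    assert(src.size() == rows * cols);
    const std::size_t dst_rows = pad_top + rows + pad_bottom;
    const std::size_t dst_cols = pad_left + cols + pad_right;

    // Start with a buffer filled with the padding value, then copy the payload in.
    std::vector<float> dst(dst_rows * dst_cols, pad_value);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            dst[(pad_top + r) * dst_cols + (pad_left + c)] = src[r * cols + c];
        }
    }
    return dst;
}
```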

Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

5cb98201

May 22, 2025

Fix clamping issue · 32546c02

Viet-Hoa Do authored May 22, 2025



* Numeric limits reported the lowest and highest finite values
  of F16 and BF16 as 0, which disabled testing of all F16
  and BF16 kernels with clamping.
* Update numeric limits to report the correct values.
* Update numeric limits to trigger a compilation error when
  a type is not supported.
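
One common way to get the "compilation error for unsupported types" behaviour is to leave the primary template undefined and provide explicit specializations only. The sketch below shows that pattern; it is an assumption about the approach, not the test framework's code. `test_numeric_limits`, `Float16Bits`, and `finite_half_to_float` are hypothetical names, and the half-precision type is modelled as raw IEEE 754 binary16 bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Primary template is declared but never defined, so asking for the limits of an
// unsupported type fails at compile time instead of silently yielding 0.
template <typename T>
struct test_numeric_limits;

template <>
struct test_numeric_limits<float> {
    static constexpr float lowest() { return std::numeric_limits<float>::lowest(); }
    static constexpr float highest() { return std::numeric_limits<float>::max(); }
};

// Hypothetical stand-in for a half-precision type stored as raw bits.
struct Float16Bits {
    std::uint16_t bits;
};

template <>
struct test_numeric_limits<Float16Bits> {
    // IEEE 754 binary16: the largest finite magnitude is 65504 (bit pattern 0x7BFF).
    static constexpr Float16Bits lowest() { return Float16Bits{0xFBFF}; }
    static constexpr Float16Bits highest() { return Float16Bits{0x7BFF}; }
};

// Decodes a finite binary16 value so the limits can be inspected as F32.
float finite_half_to_float(Float16Bits h) {
    const int sign = (h.bits >> 15) & 0x1;
    const int exponent = (h.bits >> 10) & 0x1F;
    const int mantissa = h.bits & 0x3FF;
    assert(exponent != 0x1F);  // infinities and NaNs are out of scope for this sketch
    const float magnitude =
        exponent == 0 ? std::ldexp(static_cast<float>(mantissa), -24)                   // subnormal
                      : std::ldexp(static_cast<float>(mantissa + 1024), exponent - 25);  // normal
    return sign != 0 ? -magnitude : magnitude;
}
```

With this layout, `test_numeric_limits<double>::highest()` would not compile, because no specialization for `double` exists.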

Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

32546c02

Avoid unnecessary FEAT_FP16 requirement · 695a717f

Viet-Hoa Do authored May 22, 2025



* The conversion between FP32 and FP16 is part of the base
  instruction set and does not require FEAT_FP16.
  The equivalent functions in the test framework need to change
  to avoid the need for FEAT_FP16.

Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>

Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

695a717f

May 13, 2025

Add imatmul documentation · 382c2b25

Emil Ohlsson authored May 13, 2025



This commit introduces documentation for the imatmul kernels, with extra
focus on the left hand side packing.

Signed-off-by: Emil Ohlsson <emil.ohlsson@arm.com>

Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

382c2b25

May 08, 2025

Streamline test functionality · bfc48f79

Anton Bondarenko authored May 08, 2025



* Remove duplicate data types and structures
* Clean up unused headers
* Move common CPU feature checkers to cpu_info

Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>

Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

bfc48f79

May 02, 2025

Smarter memory protection with Buffer class · 73d1e85d

Jakub Sujak authored May 02, 2025



Introduces a dedicated `Buffer` abstraction for managing blocks of memory. Buffer comes with protection mechanisms that can be enabled by setting the `KAI_TEST_BUFFER_POLICY` environment variable.

Example usage:

```
KAI_TEST_BUFFER_POLICY=PROTECT_OVERFLOW ./kleidiai_test
```

Available memory protection mechanisms:

- `KAI_TEST_BUFFER_POLICY=PROTECT_UNDERFLOW`
- `KAI_TEST_BUFFER_POLICY=PROTECT_OVERFLOW`

If `KAI_TEST_BUFFER_POLICY` is not set or is not one of the above values, then no memory protection mechanisms are enabled and Buffer performs naive malloc() allocation of memory.

When `KAI_TEST_BUFFER_POLICY` is set to one of the above values, the following protections are enabled:

- `PROTECT_UNDERFLOW`: Allocates memory equal to the size of the user buffer rounded up to a whole number of pages, plus adjacent guard pages. The start of the user buffer is aligned to the end of the head guard page, so any buffer underflow is detected.
- `PROTECT_OVERFLOW`: Same as above, but the end of the user buffer is aligned to the start of the tail guard page, so any buffer overflow is detected.

Buffer is only intended to opaquely allocate and manage memory. The underlying memory resource can be requested using the familiar `Buffer::data()` method and interacted with using `kai::test::read_array<T>()` and `kai::test::write_array<T>()` utilities.
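
The guard-page idea behind `PROTECT_OVERFLOW` can be sketched with POSIX `mmap` and `mprotect`. This is not the `Buffer` implementation from the commit: `GuardedBuffer` and its members are hypothetical, the sketch assumes a POSIX system with `MAP_ANONYMOUS`, and it covers only the overflow policy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical overflow-protected buffer: [head guard][payload pages][tail guard].
class GuardedBuffer {
public:
    explicit GuardedBuffer(std::size_t size) : m_size(size) {
        assert(size > 0);
        const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
        m_payload_size = (size + page - 1) / page * page;  // round up to whole pages
        m_mapping_size = m_payload_size + 2 * page;        // plus two guard pages

        void* mapping = mmap(nullptr, m_mapping_size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mapping == MAP_FAILED) {
            throw std::bad_alloc();
        }
        m_mapping = static_cast<std::uint8_t*>(mapping);
        m_tail_guard = m_mapping + page + m_payload_size;

        // Any access to a guard page now raises a fault instead of corrupting memory.
        int rc = mprotect(m_mapping, page, PROT_NONE);
        assert(rc == 0);
        rc = mprotect(m_tail_guard, page, PROT_NONE);
        assert(rc == 0);
        (void)rc;

        // Overflow policy: the end of the user buffer touches the tail guard page.
        m_data = m_tail_guard - size;
    }

    ~GuardedBuffer() { munmap(m_mapping, m_mapping_size); }
    GuardedBuffer(const GuardedBuffer&) = delete;
    GuardedBuffer& operator=(const GuardedBuffer&) = delete;

    std::uint8_t* data() { return m_data; }
    std::size_t size() const { return m_size; }
    const std::uint8_t* tail_guard() const { return m_tail_guard; }

private:
    std::size_t m_size;
    std::size_t m_payload_size = 0;
    std::size_t m_mapping_size = 0;
    std::uint8_t* m_mapping = nullptr;
    std::uint8_t* m_tail_guard = nullptr;
    std::uint8_t* m_data = nullptr;
};
```

An underflow variant would instead set the data pointer to the first byte after the head guard page. One consequence of the overflow layout is that the start of the user buffer is only as aligned as its size allows, which is a trade-off any real implementation has to handle.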

Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>

73d1e85d