Commits · v1.12.0-rc1 · Kleidi / KleidiAI

Jul 22, 2025

Correct CHANGELOG information · 8ca22671

Anton Bondarenko authored Jul 22, 2025 and

Felix Johnny Thomasmathibalan committed Jul 22, 2025



Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>

Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>

8ca22671

Jul 21, 2025

Bump version to v1.12.0 · c6b364b0

Anton Bondarenko authored Jul 21, 2025



Also update CHANGELOG to mention more

Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>

Approved-by: John McLoughlin <john.mcloughlin@arm.com>

c6b364b0

Matmul Micro-kernels BF16 <- (QAI8DXP) LHS x (QSI4C32P) RHS · ce70c631

Evie Wright authored Jul 21, 2025 and

Anton Bondarenko committed Jul 21, 2025



Signed-off-by: Evie Wright <evie.wright@arm.com>

Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

ce70c631

Refactor matmul test to reduce code duplication · 74c7d8fc

Evie Wright authored Jul 21, 2025 and

Anton Bondarenko committed Jul 21, 2025



Signed-off-by: Evie Wright <evie.wright@arm.com>

Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

74c7d8fc

Jul 17, 2025

Create separate .S-file for common support functions · 45bf0603

Jens Elofsson authored Jul 17, 2025 and

Viet-Hoa Do committed Jul 17, 2025



Create kai_common_sme_asm.S to hold support functions that use pure assembly
instead of inline assembly.

The function moved in this patch is kai_get_sme_vector_length_u8.

Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Approved-by: Jakub Sujak <jakub.sujak@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

45bf0603

Jul 16, 2025

Sort the list of files in KLEIDIAI_FILES_NEON_ASM · ed03892b

Viet-Hoa Do authored Jul 16, 2025 and

Anton Bondarenko committed Jul 16, 2025



Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>

Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>

ed03892b

Add more micro-kernels to MSVC build · 72f56edc

Viet-Hoa Do authored Jul 16, 2025



* List of micro-kernels added to the MSVC build:
  - kai_matmul_clamp_f32_f32_f32p8x1biasf32_6x8x4_neon_mla
  - kai_lhs_quant_pack_qsi8d32p_f32_neon
  - kai_rhs_pack_kxn_qsi8cxp_qsi8cx_neon
  - kai_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
  - kai_rhs_pack_nxk_qsi4cxps1s0_qsu4cxs1s0_neon
  - kai_rhs_pack_nxk_qsi8cxp_qsi8cx_neon

Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Jens Elofsson <jens.elofsson@arm.com>

72f56edc

Move transposed RHS packing kernels to separate assembly file · 9ca4b257

Viet-Hoa Do authored Jul 16, 2025



* Kernels:
  - kai_rhs_pack_nxk_f32p2vlx1biasf32_f32_f32_sme
  - kai_rhs_pack_nxk_x16p2vlx2b_x16_x16_sme

Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>

Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

9ca4b257

Jul 15, 2025

Update kai_kernel_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa · f147dadd

Anitha Raj authored Jul 15, 2025 and

Anton Bondarenko committed Jul 15, 2025



* Update the asm kernel to multiply the zero-points and sum as integers instead of float.
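
The idea of applying the zero-point correction in the integer domain can be sketched in scalar C. This is a hypothetical reference, not the SME2 kernel: the function name, the unpacked data layout, and the convention `real = scale * (q - zero_point)` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: asymmetric int8 LHS times symmetric int8 RHS.
 * Assumed convention: real_lhs = lhs_scale * (q_lhs - lhs_zero_point). */
float dot_qai8_qsi8_int_zp(const int8_t* lhs, const int8_t* rhs, size_t k,
                           int32_t lhs_zero_point, float lhs_scale, float rhs_scale) {
    assert(lhs != NULL && rhs != NULL);

    int32_t acc = 0;     /* sum of q_lhs * q_rhs */
    int32_t rhs_sum = 0; /* sum of q_rhs, needed for the zero-point term */
    for (size_t i = 0; i < k; ++i) {
        acc += (int32_t)lhs[i] * (int32_t)rhs[i];
        rhs_sum += (int32_t)rhs[i];
    }

    /* Zero-point correction as an integer multiply and subtract. */
    acc -= lhs_zero_point * rhs_sum;

    /* A single int-to-float conversion, followed by the scales. */
    return (float)acc * lhs_scale * rhs_scale;
}
```

Keeping the product of the zero-point and the RHS sum in `int32_t` means the correction is exact and only the final scaling happens in floating point.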

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Anitha Raj <anitha.raj@arm.com>
Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>

f147dadd

Jul 14, 2025

Add SME F16 GEMV kernel targeting FEAT_SME · a1dacc36

Jakub Sujak authored Jul 14, 2025



* Add SME F16 GEMV micro-kernel.

* The GEMV micro-kernel uses instructions compatible with FEAT_SME.

* The GEMV micro-kernel is designed to reuse the same RHS packing functions as the SME F16 GEMM.

This new GEMV micro-kernel is compatible with FEAT_SME and does not require FEAT_SME2. By using pairs of `FMLALB` and `FMLALT` instructions, we can reuse the existing RHS data format of the GEMM operation, where `kr=2`, thus eliminating the need for a specialized packing function for the GEMV operation.
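
A scalar sketch of why a bottom/top instruction pair fits a `kr=2` layout is given here. It is an emulation under stated assumptions, not the kernel: `float` stands in for F16, the function name is hypothetical, and the packed RHS is simplified to interleaved K pairs per column (the real packed format also carries bias and is organized in vector-length blocks).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed simplified layout: for every pair of K indices (kk, kk + 1) the
 * packed RHS stores, per output column j, the two values
 * { B[kk][j], B[kk + 1][j] } next to each other. */
void gemv_kr2_emulated(const float* lhs, const float* rhs_packed,
                       size_t k, size_t n, float* dst) {
    assert(k % 2 == 0);

    for (size_t j = 0; j < n; ++j) {
        dst[j] = 0.0f;
    }

    for (size_t kk = 0; kk < k; kk += 2) {
        const float* pair_row = rhs_packed + kk * n; /* n pairs = 2 * n values */
        for (size_t j = 0; j < n; ++j) {
            /* Even (bottom) elements, as an FMLALB-style step would use. */
            dst[j] += lhs[kk] * pair_row[2 * j];
            /* Odd (top) elements, as an FMLALT-style step would use. */
            dst[j] += lhs[kk + 1] * pair_row[2 * j + 1];
        }
    }
}
```

Each widened accumulator lane `j` receives one contribution from the even element and one from the odd element of its pair, so the interleaved GEMM layout can be consumed without repacking.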

Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>

Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>

a1dacc36

Jul 09, 2025

Add SME1 F32 GEMV kernel · d071a3a4

Jakub Sujak authored Jul 09, 2025

This SME1 GEMV kernel computes a 1x8VL block and is designed to work with the same RHS packing function as the SME1 GEMM.

Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>

Reviewed-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

d071a3a4

Reduce test name redundancy · bb2cdc16

Dan Johansson authored Jul 09, 2025

Long test names cause the JUnit report files to exceed the size limit. By
removing redundancies in the test names, the report size is reduced by 15%.

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

bb2cdc16

Jul 08, 2025

Matmul Micro-kernels BF16 <- (QAI8DXP) LHS x (QSI4CXP) RHS · 1a7a7700

Nikhil Gupta authored Jul 08, 2025 and

Anton Bondarenko committed Jul 08, 2025



- Matrix multiplication (MxN) Micro-kernels of QAI8DXP LHS and QSI4CXP
  RHS with BF16 output, optimized for FEAT_I8MM.
- Matrix multiplication (1xN) Micro-kernels of QAI8DXP LHS and QSI4CXP
  RHS with BF16 output, optimized for FEAT_DotProd.
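
For readers unfamiliar with BF16 output, the conversion from an F32 accumulator can be sketched as follows. This is a generic round-to-nearest-even conversion with hypothetical function names; whether these micro-kernels use this exact rounding mode is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert F32 to BF16 (the upper 16 bits of the F32 encoding) using
 * round-to-nearest-even. NaN inputs are returned as quiet NaNs. */
uint16_t f32_to_bf16_rne(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof(bits));

    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        return (uint16_t)((bits >> 16) | 0x0040u); /* keep NaN, set quiet bit */
    }

    /* Add 0x7FFF plus the lowest kept bit so that ties round to even. */
    const uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding_bias) >> 16);
}

/* Widen BF16 back to F32 by appending 16 zero bits. */
float bf16_to_f32(uint16_t h) {
    const uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof(x));
    return x;
}
```

BF16 keeps the F32 exponent range and drops 16 mantissa bits, which is why the conversion reduces to rounding and a shift.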

Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Signed-off-by: Evie Wright <evie.wright@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Evie Wright <evie.wright@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

1a7a7700

Jul 04, 2025

Added framework integration example for onnx rt that can be used for other frameworks · 7f93c5c5

Viet-Hoa Do authored Jul 04, 2025

Signed-off-by: Damien Dooley <damien.dooley@arm.com>

Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>

7f93c5c5

Jul 03, 2025

Run SME/SME2 examples in FVP · 5d3031fb

Anton Bondarenko authored Jul 03, 2025



Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>

Approved-by: Dan Johansson <dan.johansson@arm.com>

5d3031fb

Jul 01, 2025

Bump version to v1.11.0 · f362d32f

Anton Bondarenko authored Jul 01, 2025



Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>

Approved-by: Dan Johansson <dan.johansson@arm.com>

f362d32f

Jun 26, 2025

Improve packing performance for quantized Int4 per-block · 23266090

Evie Wright authored Jun 26, 2025 and

Viet-Hoa Do committed Jun 26, 2025

Improves performance of `kai_rhs_pack_nxk_qsi4c32pnrx8_qsu4c32s1s0_neon` by vectorizing row summation.
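
The general idea of summing many packed 4-bit values per step can be sketched in portable C. Note that this uses a SWAR (SIMD within a register) technique on 64-bit words as a stand-in for Advanced SIMD instructions; it is not the code from this commit, and the function name and the treatment of the values as plain unsigned nibbles are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum all unsigned 4-bit values stored two per byte. */
int32_t row_sum_u4(const uint8_t* packed, size_t num_bytes) {
    assert(packed != NULL || num_bytes == 0);

    const uint64_t nibble_mask = 0x0F0F0F0F0F0F0F0FULL;
    int32_t sum = 0;
    size_t i = 0;

    /* Process 8 bytes (16 nibbles) per iteration. */
    for (; i + 8 <= num_bytes; i += 8) {
        uint64_t w;
        memcpy(&w, packed + i, sizeof(w));

        /* Low plus high nibble of each byte; every byte lane is <= 30. */
        const uint64_t per_byte = (w & nibble_mask) + ((w >> 4) & nibble_mask);

        /* Horizontal add of the 8 byte lanes; the total is <= 240, so it
         * fits in the top byte of the product without overflow. */
        sum += (int32_t)((per_byte * 0x0101010101010101ULL) >> 56);
    }

    /* Scalar tail for the remaining bytes. */
    for (; i < num_bytes; ++i) {
        sum += (packed[i] & 0x0F) + (packed[i] >> 4);
    }
    return sum;
}
```

Because addition is order-independent, the result does not depend on byte order or on which nibble of a byte comes first.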

Signed-off-by: Evie Wright <evie.wright@arm.com>

Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

23266090

Jun 25, 2025

Optimize kai_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0 using advanced SIMD, for kr/sr = 4 · d18f620a

Evie Wright authored Jun 25, 2025 and

Viet-Hoa Do committed Jun 25, 2025

Signed-off-by: Evie Wright <evie.wright@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>

d18f620a

Jun 24, 2025

Update CHANGELOG with recent updates · 184e45c6

Anton Bondarenko authored Jun 24, 2025



Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>

Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>

184e45c6

Matmul Micro-kernel(1xN) F32/F16 <- (QSI8D32) LHS x (QAI4C32) RHS · a0afd5e1

Anitha Raj authored Jun 24, 2025 and

Anton Bondarenko committed Jun 24, 2025



* Matrix multiplication (1xN) micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a single-precision (F32) and half-precision (F16) output, optimized for FEAT_DotProd and packing parameter kr = 8.
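
The arithmetic behind a per-block quantized product of this kind can be sketched with a scalar reference for a single output element. Everything here is an assumption for illustration: the function name, the unpacked storage of 4-bit values in `int8_t`, the `block_len` parameter, and the convention `real_rhs = rhs_scale * (q_rhs - rhs_zero_point)`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference: symmetric per-block int8 LHS (scale only) times
 * asymmetric per-block 4-bit RHS (scale and zero point per block). */
float dot_qsi8_blk_qai4_blk(const int8_t* lhs_q, const float* lhs_scales,
                            const int8_t* rhs_q, const float* rhs_scales,
                            const int32_t* rhs_zero_points,
                            size_t k, size_t block_len) {
    assert(block_len != 0 && k % block_len == 0);

    float result = 0.0f;
    for (size_t b = 0; b < k / block_len; ++b) {
        /* Integer accumulation inside one quantization block. */
        int32_t acc = 0;
        for (size_t i = 0; i < block_len; ++i) {
            const size_t idx = b * block_len + i;
            acc += (int32_t)lhs_q[idx] * ((int32_t)rhs_q[idx] - rhs_zero_points[b]);
        }
        /* Apply both block scales once per block. */
        result += (float)acc * lhs_scales[b] * rhs_scales[b];
    }
    return result;
}
```

A 1xN micro-kernel computes N such dot products for one LHS row; the F16 variants would additionally narrow the final value.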

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

a0afd5e1

Jun 23, 2025

fix data type for values in kai_rhs_pack_nxk_qsi4cxps1s0_qsu4cxs1s0_neon · 5e8be458

Evie Wright authored Jun 23, 2025 and

Suhail M committed Jun 23, 2025



update unit test to test signed integer inputs
Resolves: KLEIDIAI-664

Signed-off-by: Evie Wright <evie.wright@arm.com>

Approved-by: Suhail M <mohammedsuhail.munshi@arm.com>

5e8be458

Jun 20, 2025

Back up floating point registers in SME1 kernels · c80d1883

Jakub Sujak authored Jun 20, 2025



Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>

Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>

c80d1883

Add SME1 F16 GEMM micro-kernel · f3da48be

Jakub Sujak authored Jun 20, 2025

Adds F16 GEMM micro-kernel using SME1 MOPA instruction and 2VL x 2VL block size. This SME1 kernel is compatible with existing SME F16 LHS and RHS packing functions.

Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>

Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>

f3da48be

Update SME1 F32 GEMM assembly · 395b3695

Jakub Sujak authored Jun 20, 2025



Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>

Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>

395b3695

Jun 19, 2025

Zero-initialize test buffers · 88ef5b98

Jakub Sujak authored Jun 19, 2025



Test buffers must be initialized to a default value of 0.

Report mismatches outside the portion-tested ROI to catch out-of-bound kernel writes.
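
The check described above can be sketched as follows: start from a zero-initialized buffer, let the kernel write only the region of interest (ROI), then verify that everything outside the ROI is still zero. The function name and the byte-wise row-major layout are assumptions for illustration, not the test framework's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if every byte outside the ROI is still zero.
 * The buffer is assumed to be row-major with `cols` bytes per row. */
bool outside_roi_is_zero(const uint8_t* buf, size_t rows, size_t cols,
                         size_t roi_row, size_t roi_col,
                         size_t roi_rows, size_t roi_cols) {
    assert(roi_row + roi_rows <= rows && roi_col + roi_cols <= cols);

    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            const bool in_roi = r >= roi_row && r < roi_row + roi_rows &&
                                c >= roi_col && c < roi_col + roi_cols;
            if (!in_roi && buf[r * cols + c] != 0) {
                return false; /* the kernel wrote outside its portion */
            }
        }
    }
    return true;
}
```

With an uninitialized buffer this check would be meaningless, which is why the zero default matters.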

Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>

Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>

88ef5b98

Move GEMM F32 SME1 assembly kernel to separate file · bb852338

Viet-Hoa Do authored Jun 19, 2025



Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Approved-by: Jakub Sujak <jakub.sujak@arm.com>

bb852338

Add GEMM F32 using SME1 MOPA with block size 2VLx2VL · b3b24af1

Viet-Hoa Do authored Jun 19, 2025



* Add GEMM F32 kernel using SME1 MOPA with block size 2VLx2VL.
* Add tests for the newly added kernel.
* Add CI job to run the kernel on FVP with SME1 and without SME2.

Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

b3b24af1

Add SME/SME2 support for FVP image in CI · 8426f5df

Anton Bondarenko authored Jun 19, 2025

Update the Linux kernel and bootloader-wrapper to versions that support SME/SME2.
This is required to handle context switches correctly, with SME state preserved
and restored. Also address a CPU boot issue by keeping the number of CPUs in sync
between the bootloader/DTS and the FVP configuration. Two CPUs are used: the first
runs system services and the second is isolated to run test programs.

Linux Kernel: 6.16-rc2 (update to released version once available)
Bootloader wrapper: latest available

Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>

Approved-by: Jakub Sujak <jakub.sujak@arm.com>

8426f5df

Jun 16, 2025

Bump version to v1.10.0 · dc69e899

Emil Ohlsson authored Jun 16, 2025



Signed-off-by: Emil Ohlsson <emil.ohlsson@arm.com>

Approved-by: Jakub Sujak <jakub.sujak@arm.com>

dc69e899

Jun 13, 2025

Matmul Micro-kernel(MxN) F32 <- (QSI8D32) LHS x (QAI4C32) RHS · 4f5154e0

Anitha Raj authored Jun 13, 2025 and

Emil Ohlsson committed Jun 13, 2025



* Matrix multiplication (MxN) micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a single-precision (F32) output, optimized for FEAT_DotProd.

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Reviewed-by: Anitha Raj <anitha.raj@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

4f5154e0

Jun 12, 2025

Split FP32 SME kernels into separate assembly files. · cdcff672

Jens Elofsson authored Jun 12, 2025



Move the assembly blocks of the following kernels into their own files:
- rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme
- lhs_pack_f32p2vlx1_f32_sme
- matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa

Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

cdcff672

Split FP16 SME kernels into separate assembly files. · f0454add

Jens Elofsson authored Jun 12, 2025



Move the assembly blocks of the following kernels into their own files:
- lhs_pack_x16p2vlx2_x16_sme
- rhs_pack_kxn_x16p2vlx2b_x16_x16_sme
- matmul_clamp_f16_f16p2vlx2_f16p2vlx2_2vlx2vl_sme2_mopa
- matmul_clamp_f16_f16_f16p2vlx2b_1x16vl_sme2_dot

Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

f0454add

Jun 11, 2025

Split INT8 SME kernels into separate assembly files. · f11792a5

Jens Elofsson authored Jun 11, 2025



Move the assembly blocks of the following kernels into their own files:
- lhs_pack_x8p2vlx4_x8_sme
- rhs_pack_kxn_qsi8cxp2vlx4sb_qs8cx_f32_i32_sme
- matmul_clamp_qai8_qai8_qsi8cxp2vlx4sb_1x16vl_sme2_dot
- matmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa

Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

f11792a5

Jun 10, 2025

Fix bug where kai_get_m_step returns the incorrect value · e4e3c549

Jens Elofsson authored Jun 10, 2025



Fix issue in kernels
- matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla

where kai_get_m_step returns the incorrect value.

Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

e4e3c549

Jun 05, 2025

Enable MSVC support for matmul_clamp_f32_qsi8d32p_qai4c32p micro-kernels · d08325fb

Anitha Raj authored Jun 05, 2025 and

Anton Bondarenko committed Jun 05, 2025

Update architectural feature guards and CMakeLists to enable MSVC build for matmul_clamp_f32_qsi8d32p_qai4c32p micro-kernels.
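
An architectural feature guard that also admits MSVC might look like the sketch below. The macro and function names prefixed with `EXAMPLE_`/`example_` are hypothetical, and the exact condition used in the library is an assumption; `__aarch64__`, `__ARM_FEATURE_DOTPROD`, and `_M_ARM64` are real predefined macros (the first two from GCC/Clang-style toolchains, the last from MSVC).

```c
#include <assert.h>

/* GCC and Clang advertise the target architecture and features through
 * ACLE macros. MSVC does not define __ARM_FEATURE_DOTPROD, so this sketch
 * accepts _M_ARM64 as a sufficient condition instead. */
#if (defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)) || defined(_M_ARM64)
#define EXAMPLE_HAS_DOTPROD_KERNELS 1
#else
#define EXAMPLE_HAS_DOTPROD_KERNELS 0
#endif

/* Reports whether the guarded kernels would be compiled into this build. */
int example_dotprod_kernels_built(void) {
    return EXAMPLE_HAS_DOTPROD_KERNELS;
}
```

A guard like this only controls what gets compiled; whether the CPU actually supports the feature still has to be checked at run time before calling such a kernel.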

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

d08325fb

Optimize kai_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0 using advanced SIMD · 05ef512e

Evie Wright authored Jun 05, 2025 and

Anton Bondarenko committed Jun 05, 2025



Optimize the transposed RHS packing function for matmul_clamp_f32_qai8dxp_qsi4c32p using advanced SIMD, for kr / sr = 8

Signed-off-by: Evie Wright <evie.wright@arm.com>

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Reviewed-by: Anitha Raj <anitha.raj@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Evie Wright <evie.wright@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

05ef512e

Fix minor bug - Incorrect vector size in Conv2D example · 09771a98

Suhail M authored Jun 05, 2025



Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com>

Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>

09771a98

Enable MSVC support for imatmul kernels · b3699d63

Emil Ohlsson authored Jun 05, 2025



This change adds MSVC support for the imatmul kernels. As a result, they
are also aligned, which results in some additional diffs.

Signed-off-by: Emil Ohlsson <emil.ohlsson@arm.com>

Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

b3699d63

fix memory allocation for f16_f16_f16p example · febaaa75

Charlotte Chen authored Jun 05, 2025 and

Jakub Sujak committed Jun 05, 2025

Signed-off-by: Charlotte Chen <charlotte.chen@arm.com>

Approved-by: Jakub Sujak <jakub.sujak@arm.com>

febaaa75

Jun 04, 2025

Matmul Micro-kernel(MxN) F16 <- (QSI8D32) LHS x (QAI4C32) RHS · 8c63677f

Anitha Raj authored Jun 04, 2025 and

Jakub Sujak committed Jun 04, 2025



* Matrix multiplication (MxN) micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a half-precision (F16) output, optimized for FEAT_DotProd.

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Approved-by: Jakub Sujak <jakub.sujak@arm.com>

8c63677f