Commits · main · Kleidi / KleidiAI

Jul 22, 2025

Correct CHANGELOG information · 8ca22671

Anton Bondarenko authored Jul 22, 2025 and

Felix Johnny Thomasmathibalan committed Jul 22, 2025



Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>

Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>

8ca22671

Jul 21, 2025

Bump version to v1.12.0 · c6b364b0

Anton Bondarenko authored Jul 21, 2025



Also update CHANGELOG to mention more

Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>

Approved-by: John McLoughlin <john.mcloughlin@arm.com>

c6b364b0

Matmul Micro-kernels BF16 <- (QAI8DXP) LHS x (QSI4C32P) RHS · ce70c631

Evie Wright authored Jul 21, 2025 and

Anton Bondarenko committed Jul 21, 2025



Signed-off-by: Evie Wright <evie.wright@arm.com>

Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

ce70c631

Jul 16, 2025

Add more micro-kernels to MSVC build · 72f56edc

Viet-Hoa Do authored Jul 16, 2025



* List of micro-kernels added to the MSVC build:
  - kai_matmul_clamp_f32_f32_f32p8x1biasf32_6x8x4_neon_mla
  - kai_lhs_quant_pack_qsi8d32p_f32_neon
  - kai_rhs_pack_kxn_qsi8cxp_qsi8cx_neon
  - kai_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
  - kai_rhs_pack_nxk_qsi4cxps1s0_qsu4cxs1s0_neon
  - kai_rhs_pack_nxk_qsi8cxp_qsi8cx_neon

Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Jens Elofsson <jens.elofsson@arm.com>

72f56edc

Move transposed RHS packing kernels to separate assembly file · 9ca4b257

Viet-Hoa Do authored Jul 16, 2025



* Kernels:
  - kai_rhs_pack_nxk_f32p2vlx1biasf32_f32_f32_sme
  - kai_rhs_pack_nxk_x16p2vlx2b_x16_x16_sme

Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>

Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

9ca4b257

Jul 15, 2025

Update kai_kernel_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa · f147dadd

Anitha Raj authored Jul 15, 2025 and

Anton Bondarenko committed Jul 15, 2025



* Update the asm kernel to multiply the zero-points and sum as integers instead of float.
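
The integer zero-point handling described above can be illustrated with a minimal sketch. It assumes the common convention `real = scale * (q - zero_point)` for the asymmetric LHS and a symmetric RHS; the function names are hypothetical and are not taken from the kernel.

```python
def dot_with_integer_zero_point(lhs_q, rhs_q, lhs_zero_point):
    """Dot product of an asymmetric int8 LHS row with a symmetric int8 RHS column.

    Assumed convention: real_lhs = lhs_scale * (q - lhs_zero_point).
    Everything here stays in integer arithmetic, so the correction is exact.
    """
    raw = sum(a * b for a, b in zip(lhs_q, rhs_q))  # plain integer accumulation
    rhs_sum = sum(rhs_q)                            # sum of the RHS column
    return raw - lhs_zero_point * rhs_sum           # zero-point correction as integers


def dequantize(acc, lhs_scale, rhs_scale):
    """Apply the floating-point scales once, after the integer correction."""
    return acc * lhs_scale * rhs_scale
```

Keeping the zero-point term in integers means the only floating-point rounding comes from the final scaling step.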

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Anitha Raj <anitha.raj@arm.com>
Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>

f147dadd

Jul 14, 2025

Add SME F16 GEMV kernel targeting FEAT_SME · a1dacc36

Jakub Sujak authored Jul 14, 2025



* Add SME F16 GEMV micro-kernel.

* The GEMV micro-kernel uses instructions compatible with FEAT_SME.

* The GEMV micro-kernel is designed to reuse the same RHS packing functions as the SME F16 GEMM.

This new GEMV micro-kernel requires only FEAT_SME, not FEAT_SME2. By using pairs of `FMLALB` and `FMLALT` instructions, it can reuse the existing RHS data format of the GEMM operation, where `kr=2`, thus eliminating the need for a specialized packing function for the GEMV operation.
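
To make the `FMLALB`/`FMLALT` pairing concrete, here is a rough Python model. It assumes a `kr=2` layout in which each output column stores two consecutive K values side by side, and it models the two instructions as "multiply even-indexed elements" and "multiply odd-indexed elements" with accumulation into a wider lane. The helper names and the exact layout are illustrative assumptions, not the library's packing code.

```python
def fmlalb(acc, zn, zm):
    """Model of FMLALB: accumulate products of even-indexed (bottom) elements."""
    return [d + zn[2 * i] * zm[2 * i] for i, d in enumerate(acc)]


def fmlalt(acc, zn, zm):
    """Model of FMLALT: accumulate products of odd-indexed (top) elements."""
    return [d + zn[2 * i + 1] * zm[2 * i + 1] for i, d in enumerate(acc)]


def pack_rhs_kr2(rhs, k0, n_len):
    """Assumed kr=2 layout: [B[k0][0], B[k0+1][0], B[k0][1], B[k0+1][1], ...]."""
    packed = []
    for n in range(n_len):
        packed += [rhs[k0][n], rhs[k0 + 1][n]]
    return packed


def gemv_kr2(lhs, rhs):
    """1xN GEMV over a KxN matrix, consuming two K values per step."""
    k_len, n_len = len(rhs), len(rhs[0])
    assert k_len % 2 == 0, "sketch assumes an even K"
    acc = [0.0] * n_len
    for k0 in range(0, k_len, 2):
        rhs_vec = pack_rhs_kr2(rhs, k0, n_len)
        lhs_vec = [lhs[k0], lhs[k0 + 1]] * n_len  # LHS pair replicated per lane
        acc = fmlalb(acc, lhs_vec, rhs_vec)       # contributes the k0 term
        acc = fmlalt(acc, lhs_vec, rhs_vec)       # contributes the k0 + 1 term
    return acc
```

Because the bottom and top instructions each consume one element of every pair, the interleaved GEMM layout can be read as-is.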

Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>

Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>

a1dacc36

Jul 09, 2025

Add SME1 F32 GEMV kernel · d071a3a4

Jakub Sujak authored Jul 09, 2025

This SME1 GEMV kernel computes a 1x8VL block and is designed to work with the same RHS packing function as the SME1 GEMM.

Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>

Reviewed-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

d071a3a4

Jul 08, 2025

Matmul Micro-kernels BF16 <- (QAI8DXP) LHS x (QSI4CXP) RHS · 1a7a7700

Nikhil Gupta authored Jul 08, 2025 and

Anton Bondarenko committed Jul 08, 2025



- Matrix multiplication (MxN) Micro-kernels of QAI8DXP LHS and QSI4CXP
  RHS with BF16 output, optimized for FEAT_I8MM.
- Matrix multiplication (1xN) Micro-kernels of QAI8DXP LHS and QSI4CXP
  RHS with BF16 output, optimized for FEAT_DotProd.

Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Signed-off-by: Evie Wright <evie.wright@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Evie Wright <evie.wright@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

1a7a7700

Jul 01, 2025

Bump version to v1.11.0 · f362d32f

Anton Bondarenko authored Jul 01, 2025



Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>

Approved-by: Dan Johansson <dan.johansson@arm.com>

f362d32f

Jun 26, 2025

Improve packing performance for quantized Int4 per-block · 23266090

Evie Wright authored Jun 26, 2025 and

Viet-Hoa Do committed Jun 26, 2025

Improves the performance of `kai_rhs_pack_nxk_qsi4c32pnrx8_qsu4c32s1s0_neon` by vectorizing row summation.

Signed-off-by: Evie Wright <evie.wright@arm.com>

Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

23266090

Jun 25, 2025

Optimize kai_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0 using advanced SIMD, for kr/sr = 4 · d18f620a

Evie Wright authored Jun 25, 2025 and

Viet-Hoa Do committed Jun 25, 2025

Signed-off-by: Evie Wright <evie.wright@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>

d18f620a

Jun 24, 2025

Update CHANGELOG with recent updates · 184e45c6

Anton Bondarenko authored Jun 24, 2025



Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>

Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>

184e45c6

Matmul Micro-kernel(1xN) F32/F16 <- (QSI8D32) LHS x (QAI4C32) RHS · a0afd5e1

Anitha Raj authored Jun 24, 2025 and

Anton Bondarenko committed Jun 24, 2025



* Matrix multiplication (1xN) micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a single-precision (F32) and half-precision (F16) output, optimized for FEAT_DotProd and packing parameter kr = 8.

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

a0afd5e1

Jun 20, 2025

Add SME1 F16 GEMM micro-kernel · f3da48be

Jakub Sujak authored Jun 20, 2025

Adds F16 GEMM micro-kernel using SME1 MOPA instruction and 2VL x 2VL block size. This SME1 kernel is compatible with existing SME F16 LHS and RHS packing functions.

Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>

Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>

f3da48be

Jun 19, 2025

Add GEMM F32 using SME1 MOPA with block size 2VLx2VL · b3b24af1

Viet-Hoa Do authored Jun 19, 2025



* Add GEMM F32 kernel using SME1 MOPA with block size 2VLx2VL.
* Add tests for the newly added kernel.
* Add CI job to run the kernel on FVP with SME1 and without SME2.
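
As background for the MOPA-based kernel, the sketch below shows how a GEMM can be built from outer-product accumulations, which is the operation a MOPA instruction performs on a tile. It is a plain-Python model with hypothetical names; it does not model the 2VLx2VL tiling, vector lengths, or packing.

```python
def mopa(tile, col, row):
    """Outer-product accumulate: tile[i][j] += col[i] * row[j]."""
    for i, a in enumerate(col):
        for j, b in enumerate(row):
            tile[i][j] += a * b


def gemm_by_outer_products(lhs, rhs):
    """Compute lhs (MxK) times rhs (KxN) as a sum of K outer products."""
    m_len, k_len, n_len = len(lhs), len(rhs), len(rhs[0])
    tile = [[0.0] * n_len for _ in range(m_len)]
    for k in range(k_len):
        lhs_column = [lhs[m][k] for m in range(m_len)]
        mopa(tile, lhs_column, rhs[k])  # one accumulation step per K index
    return tile
```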

Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

b3b24af1

Jun 16, 2025

Bump version to v1.10.0 · dc69e899

Emil Ohlsson authored Jun 16, 2025



Signed-off-by: Emil Ohlsson <emil.ohlsson@arm.com>

Approved-by: Jakub Sujak <jakub.sujak@arm.com>

dc69e899

Jun 13, 2025

Matmul Micro-kernel(MxN) F32 <- (QSI8D32) LHS x (QAI4C32) RHS · 4f5154e0

Anitha Raj authored Jun 13, 2025 and

Emil Ohlsson committed Jun 13, 2025



* Matrix multiplication (MxN) micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a single-precision (F32) output, optimized for FEAT_DotProd.

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Reviewed-by: Anitha Raj <anitha.raj@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

4f5154e0

Jun 12, 2025

Split FP32 SME kernels into separate assembly file. · cdcff672

Jens Elofsson authored Jun 12, 2025



Move the assembly blocks of the following kernels into their own files:
- rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme
- lhs_pack_f32p2vlx1_f32_sme
- matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa

Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

cdcff672

Split FP16 SME kernels into separate assembly file. · f0454add

Jens Elofsson authored Jun 12, 2025



Move the assembly blocks of the following kernels into their own files:
- lhs_pack_x16p2vlx2_x16_sme
- rhs_pack_kxn_x16p2vlx2b_x16_x16_sme
- matmul_clamp_f16_f16p2vlx2_f16p2vlx2_2vlx2vl_sme2_mopa
- matmul_clamp_f16_f16_f16p2vlx2b_1x16vl_sme2_dot

Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

f0454add

Jun 11, 2025

Split INT8 SME kernels into separate assembly file. · f11792a5

Jens Elofsson authored Jun 11, 2025



Move the assembly blocks of the following kernels into their own files:
- lhs_pack_x8p2vlx4_x8_sme
- rhs_pack_kxn_qsi8cxp2vlx4sb_qs8cx_f32_i32_sme
- matmul_clamp_qai8_qai8_qsi8cxp2vlx4sb_1x16vl_sme2_dot
- matmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa

Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

f11792a5

Jun 10, 2025

Fix bug where kai_get_m_step returns the incorrect value · e4e3c549

Jens Elofsson authored Jun 10, 2025



Fix issue in kernels
- matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla

where kai_get_m_step returns the incorrect value.

Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

e4e3c549

Jun 05, 2025

Optimize kai_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0 using advanced SIMD · 05ef512e

Evie Wright authored Jun 05, 2025 and

Anton Bondarenko committed Jun 05, 2025



Optimize the transposed RHS packing function for matmul_clamp_f32_qai8dxp_qsi4c32p using advanced SIMD, for kr / sr = 8

Signed-off-by: Evie Wright <evie.wright@arm.com>

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Reviewed-by: Anitha Raj <anitha.raj@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Evie Wright <evie.wright@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

05ef512e

Enable MSVC support for imatmul kernels · b3699d63

Emil Ohlsson authored Jun 05, 2025



This change adds MSVC support for the imatmul kernels. As a result, they
are also aligned, which results in some additional diffs.

Signed-off-by: Emil Ohlsson <emil.ohlsson@arm.com>

Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

b3699d63

Jun 04, 2025

Matmul Micro-kernel(MxN) F16 <- (QSI8D32) LHS x (QAI4C32) RHS · 8c63677f

Anitha Raj authored Jun 04, 2025 and

Jakub Sujak committed Jun 04, 2025



* Matrix multiplication (MxN) micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a half-precision (F16) output, optimized for FEAT_DotProd.

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Approved-by: Jakub Sujak <jakub.sujak@arm.com>

8c63677f

Jun 03, 2025

Add Conv2D example using FP16 IGEMM · d7f833a5

Suhail M authored Jun 03, 2025



- Example demonstrates creating an indirect buffer using a Conv2D input tensor
- Example demonstrates indirect buffer usage with imatmul kernels.
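
The idea of building an indirect buffer from a Conv2D input can be sketched as follows. This is not the example from the repository: it assumes stride 1, zero padding that keeps the output the same size as the input, and it uses flat pixel indices in place of real pointers, with `-1` standing in for a pointer to a padding buffer. All names are hypothetical.

```python
def conv2d_indirection(height, width, kernel_h, kernel_w, pad_index=-1):
    """List, per output pixel, the input pixel index read by each kernel tap.

    An entry of pad_index marks a tap that falls outside the input and
    would point at a zero-filled padding buffer in a real implementation.
    """
    pad_h, pad_w = kernel_h // 2, kernel_w // 2
    table = []
    for oy in range(height):
        for ox in range(width):
            taps = []
            for ky in range(kernel_h):
                for kx in range(kernel_w):
                    iy, ix = oy + ky - pad_h, ox + kx - pad_w
                    inside = 0 <= iy < height and 0 <= ix < width
                    taps.append(iy * width + ix if inside else pad_index)
            table.append(taps)
    return table
```

Each row of the table plays the role of one LHS row of the matrix multiplication, without the input ever being copied into an im2col buffer.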

Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com>

Reviewed-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Suhail M <mohammedsuhail.munshi@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

d7f833a5

Jun 02, 2025

Bump version and update changelog · a15b6725

Emil Ohlsson authored Jun 02, 2025



Signed-off-by: Emil Ohlsson <emil.ohlsson@arm.com>

Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>

a15b6725

May 30, 2025

Matmul Micro-kernels F32 <- QAI8DXP(LHS) x QSI8CXP(RHS) optimized for SME · 3d8217c2

Anitha Raj authored May 30, 2025 and

Felix Johnny Thomasmathibalan committed May 30, 2025



* Micro-kernels (1xN) to compute the matrix multiplication of dynamically quantized asymmetric 8-bit integer with per-channel quantization (QAI8DX) LHS matrix and quantized symmetric 8-bit integer with per-channel quantization (QSI8CX) RHS matrix and the accumulation of the result into a single-precision (F32) output, optimized for SME2 technology.
* Micro-kernels (MxN) to compute the matrix multiplication of dynamically quantized asymmetric 8-bit integer with per-channel quantization (QAI8DX) LHS matrix and quantized symmetric 8-bit integer with per-channel quantization (QSI8CX) RHS matrix and the accumulation of the result into a single-precision (F32) output, optimized for SME2 technology.

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>

3d8217c2

Apr 24, 2025

Set version to 1.8.0 · cca02c2f

Jens Elofsson authored Apr 24, 2025



Update all version indicators to 1.8.0.

Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Approved-by: Jakub Sujak <jakub.sujak@arm.com>

cca02c2f

Add support for FP32 Indirect GEMM with SME · 1a450344

Suhail M authored Apr 24, 2025 and

Jakub Sujak committed Apr 24, 2025



- Adds packing and matmul kernels for FP32 SME Indirect GEMM
- Adds tests for Indirect GEMM with FP32 inputs/outputs.

Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com>

Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Reviewed-by: Mohammed Suhail Munshi <mohammedsuhail.munshi@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

1a450344

Update the changelog for the 1.8.0 release. · a42ea653

Jens Elofsson authored Apr 24, 2025 and

Jakub Sujak committed Apr 24, 2025



Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Jakub Sujak <jakub.sujak@arm.com>

a42ea653

Apr 23, 2025

Add FP16 IGEMM support · d8642031

Emil Ohlsson authored Apr 23, 2025



This change introduces three new kernels:
* `kai_lhs_imatmul_pack_x16p2vlx2_x16p_sme`
* `kai_rhs_imatmul_pack_kxn_x16p2vlx2b_x16_x16_sme`
* `kai_imatmul_clamp_f16_f16p2vlx2_f16p2vlx2_2vlx2vl_sme2_mopa`

These are used for indirect matmul, for 16-bit floating point data.

This change also adds unit testing for the FP16 imatmul kernels. The
code is written in a type-agnostic manner, so that other data types can
be tested with very little effort. This required the addition of
non-templated `read`/`write` functions to allow runtime-generic
access.

Signed-off-by: Emil Ohlsson <emil.ohlsson@arm.com>

Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

d8642031

Apr 15, 2025

add support for f16 with asymmetric int8 LHS, symmetric int8 RHS · 8120ad23

Evie Wright authored Apr 15, 2025 and

Jakub Sujak committed Apr 15, 2025

Signed-off-by: Evie Wright <evie.wright@arm.com>

Approved-by: Jakub Sujak <jakub.sujak@arm.com>

8120ad23

Set version to 1.7.0 · 22053433

Jens Elofsson authored Apr 15, 2025



Update all version indicators to 1.7.0.

Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

22053433

Apr 11, 2025

Matmul Micro-kernels F16<-(QAI8DX) LHS x (QSI4CX) RHS · 315ed95c

Anitha Raj authored Apr 11, 2025 and

Viet-Hoa Do committed Apr 11, 2025

Micro-kernels to compute the matrix multiplication of dynamically quantized asymmetric signed 8-bit integer with per-channel quantization (QAI8DX) LHS matrix and quantized symmetric 4-bit signed integer with per-channel quantization (QSI4CX) RHS matrix and the accumulation of the result into a half-precision (F16):

- Matrix multiplication (MxN) Micro-kernels of QAI8DX LHS and QSI4CX RHS with F16 output, optimized for FEAT_I8MM and FEAT_DotProd.
- Matrix multiplication (1xN) Micro-kernels of QAI8DX LHS and QSI4CX RHS with F16 output, optimized for FEAT_DotProd.

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Signed-off-by: Evie Wright <evie.wright@arm.com>

Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>

315ed95c

Apr 09, 2025

Matmul Micro-kernels F32/F16 <- (QSI8D32) LHS x (QAI4C32) RHS · b27875d9

Anitha Raj authored Apr 09, 2025 and

Anton Bondarenko committed Apr 09, 2025



Micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a single-precision (F32) and half-precision (F16) output:

- Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_I8MM.
- Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_DotProd.
- Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_I8MM.
- Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_DotProd.
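
The quantization scheme named above can be sketched for a single dot product. The code assumes a block length of 32, a symmetric int8 LHS quantized on the fly per block, and an RHS stored as 4-bit values with a per-block scale and zero point, using the convention `real = scale * (q - zero_point)`. These conventions and all function names are assumptions for illustration only.

```python
BLOCK = 32  # assumed per-block quantization length


def quantize_symmetric_i8(block):
    """Dynamically quantize one block of floats to int8 with a single scale."""
    max_abs = max(abs(v) for v in block)
    scale = max_abs / 127.0 if max_abs else 1.0
    return [round(v / scale) for v in block], scale


def block_dot(lhs_q, lhs_scale, rhs_q, rhs_scale, rhs_zero_point):
    """Integer dot product of one block, followed by float scaling."""
    acc = sum(a * (b - rhs_zero_point) for a, b in zip(lhs_q, rhs_q))
    return acc * lhs_scale * rhs_scale


def quantized_dot(lhs, rhs_blocks):
    """Dot product of a float LHS row with an RHS column stored as blocks.

    rhs_blocks is a list of (quantized values, scale, zero_point) tuples.
    """
    total = 0.0
    for i, (rhs_q, rhs_scale, rhs_zp) in enumerate(rhs_blocks):
        lhs_block = lhs[i * BLOCK:(i + 1) * BLOCK]
        lhs_q, lhs_scale = quantize_symmetric_i8(lhs_block)
        total += block_dot(lhs_q, lhs_scale, rhs_q, rhs_scale, rhs_zp)
    return total
```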

Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Anitha Raj <anitha.raj@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>

b27875d9

Add QAI8 IGEMM kernels · 22c47616

Emil Ohlsson authored Apr 09, 2025



This change introduces three new kernels:

* kai_imatmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa
* kai_lhs_imatmul_pack_x8p2vlx4_x8p_sme
* kai_rhs_imatmul_pack_kxn_qsi8cxp2vlx4sb_qs8cx_f32_i32_sme

These kernels are used for _indirect matmul_. The big difference between
these kernels and matmul kernels is that the LHS packing kernel takes an
indirection buffer where each pointer refers to a chunk in the K dimension.
The pointers are laid out in a packed manner: instead of being in
row-major order, a column of `get_m_step` chunk pointers is placed
linearly in the indirection buffer.
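
One way to read the packed pointer layout described above is sketched below. Starting from a row-major table `row_major[m][k]` of chunk pointers, the packed order walks blocks of `m_step` rows, and within each block places the `m_step` pointers for one K chunk next to each other. The function name is hypothetical, and the handling of a partial final block (simply shortened here) is an assumption.

```python
def pack_indirection(row_major, m_step):
    """Reorder row_major[m][k] into (row block, K chunk, row in block) order."""
    m_len, k_chunks = len(row_major), len(row_major[0])
    packed = []
    for m0 in range(0, m_len, m_step):
        for k in range(k_chunks):
            # A "column" of up to m_step chunk pointers, stored contiguously.
            for m in range(m0, min(m0 + m_step, m_len)):
                packed.append(row_major[m][k])
    return packed
```

For example, with four rows, three chunks and `m_step = 2`, the first entries are the chunk-0 pointers of rows 0 and 1, followed by their chunk-1 pointers.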

In addition to the kernels themselves,
`matmul_clamp_qai8_qai8p_qsi8cxp_test.cpp` is extended to test these
new kernels. The testing flow for these new kernels is a bit different:
the packing kernels themselves are not tested directly; only the
end-to-end flow is tested.

Signed-off-by: Emil Ohlsson <emil.ohlsson@arm.com>
Signed-off-by: Felix Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com>

Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Jakub Sujak <jakub.sujak@arm.com>

22c47616

Mar 26, 2025

Set version to 1.6.0 · 9668db3d

Jens Elofsson authored Mar 26, 2025



Update all version indicators to 1.6.0.

Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Approved-by: Jakub Sujak <jakub.sujak@arm.com>

9668db3d

Update changelog with new changes. · 998c8683

Jens Elofsson authored Mar 26, 2025



Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>

Approved-by: Jakub Sujak <jakub.sujak@arm.com>

998c8683

Optimizes RHS packing qsu4c32s16s0->qsi4c32pscalef16 · 9186e07d

Dan Johansson authored Mar 26, 2025 and

Emil Ohlsson committed Mar 26, 2025



Optimizes this RHS packing by vectorizing the XOR operation. This is done
for segment lengths of 4 or 8 bytes. The unoptimized path is used for
any other segment length.
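
A small sketch of the idea follows: XOR a whole 4-byte or 8-byte segment as one wide word, and fall back to a byte-by-byte loop otherwise. The `0x88` mask is an assumption based on the kernel name (unsigned 4-bit to signed 4-bit, where flipping the top bit of each nibble performs the conversion); the commit itself only states that an XOR is vectorized. Function names are hypothetical.

```python
NIBBLE_SIGN_MASK = 0x88  # assumed mask: flips the top bit of both nibbles in a byte


def xor_bytewise(segment):
    """Unoptimized path: XOR one byte at a time."""
    return bytes(b ^ NIBBLE_SIGN_MASK for b in segment)


def xor_wide(segment):
    """XOR a 4- or 8-byte segment as a single wide integer."""
    if len(segment) not in (4, 8):
        return xor_bytewise(segment)  # any other length uses the slow path
    mask = int.from_bytes(bytes([NIBBLE_SIGN_MASK]) * len(segment), "little")
    word = int.from_bytes(segment, "little")
    return (word ^ mask).to_bytes(len(segment), "little")
```

Both paths produce identical bytes; the wide path simply handles the whole segment in one operation, which is the kind of saving a SIMD implementation aims for.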

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>

9186e07d