- Jul 22, 2025
-
-
Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
-
- Jul 21, 2025
-
-
Anton Bondarenko authored
Also update CHANGELOG to mention more
Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: John McLoughlin <john.mcloughlin@arm.com>
-
Signed-off-by: Evie Wright <evie.wright@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
-
Signed-off-by: Evie Wright <evie.wright@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jul 17, 2025
-
-
Create kai_common_sme_asm.S to hold support functions that use pure assembly instead of inline assembly. The function moved in this patch is kai_get_sme_vector_length_u8.
Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>
Approved-by: Jakub Sujak <jakub.sujak@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jul 16, 2025
-
-
Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
-
Viet-Hoa Do authored
* List of micro-kernels added to the MSVC build:
  - kai_matmul_clamp_f32_f32_f32p8x1biasf32_6x8x4_neon_mla
  - kai_lhs_quant_pack_qsi8d32p_f32_neon
  - kai_rhs_pack_kxn_qsi8cxp_qsi8cx_neon
  - kai_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
  - kai_rhs_pack_nxk_qsi4cxps1s0_qsu4cxs1s0_neon
  - kai_rhs_pack_nxk_qsi8cxp_qsi8cx_neon
Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Jens Elofsson <jens.elofsson@arm.com>
-
Viet-Hoa Do authored
* Kernels:
  - kai_rhs_pack_nxk_f32p2vlx1biasf32_f32_f32_sme
  - kai_rhs_pack_nxk_x16p2vlx2b_x16_x16_sme
Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jul 15, 2025
-
-
* Update the asm kernel to multiply the zero-points and sum as integers instead of floats.
Signed-off-by: Anitha Raj <anitha.raj@arm.com>
Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Anitha Raj <anitha.raj@arm.com>
Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
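The idea of applying a zero-point correction in the integer domain can be illustrated with a scalar model. This is only a sketch of the general technique, not the kernel's code: the function name `dot_int_zero_point_sketch`, its parameters, and the choice of a symmetric LHS with an asymmetric RHS are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model (not the actual asm kernel): dequantized dot
 * product of a symmetric int8 LHS row with an asymmetric int8 RHS column
 * whose zero-point is rhs_zp. */
float dot_int_zero_point_sketch(const int8_t* lhs, const int8_t* rhs, size_t k,
                                int32_t rhs_zp, float lhs_scale, float rhs_scale) {
    int32_t acc = 0;     /* sum of lhs[i] * rhs[i] */
    int32_t lhs_sum = 0; /* sum of lhs[i], needed for the zero-point term */
    for (size_t i = 0; i < k; ++i) {
        acc += (int32_t)lhs[i] * (int32_t)rhs[i];
        lhs_sum += (int32_t)lhs[i];
    }
    /* Zero-point correction kept in integers:
     * sum(lhs[i] * (rhs[i] - zp)) == acc - zp * lhs_sum. */
    const int32_t corrected = acc - rhs_zp * lhs_sum;
    /* Only the final result is converted to float and scaled. */
    return (float)corrected * lhs_scale * rhs_scale;
}
```

Keeping the product of zero-point and sum in 32-bit integers means a single int-to-float conversion per output value, and the correction term itself is exact as long as it fits in `int32_t`.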
-
- Jul 14, 2025
-
-
Jakub Sujak authored
* Add SME F16 GEMV micro-kernel.
* The GEMV micro-kernel uses instructions compatible with FEAT_SME.
* The GEMV micro-kernel is designed to reuse the same RHS packing functions as the SME F16 GEMM.
This new GEMV micro-kernel is compatible with FEAT_SME and does not require FEAT_SME2. By using pairs of `FMLALB` and `FMLALT` instructions, we can reuse the existing RHS data format of the GEMM operation where `kr=2`, thus eliminating the need for a specialized packing function for the GEMV operation.
Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>
Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
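A scalar sketch can show why a `kr=2` layout pairs naturally with `FMLALB`/`FMLALT`, which multiply the even-numbered ("bottom") and odd-numbered ("top") half-precision elements respectively and accumulate into single precision. The function name `gemv_kr2_sketch`, the exact packed layout, and the use of `float` in place of F16 are assumptions for illustration; the real micro-kernel and packing format may differ in detail.

```c
#include <stddef.h>

/* Hypothetical scalar model of a GEMV over an RHS packed with kr = 2.
 * Assumed layout: for each pair of K indices (kk, kk + 1) and each output
 * column j, the two values rhs[kk][j] and rhs[kk + 1][j] are adjacent.
 * k is assumed to be even; float stands in for F16 inputs. */
void gemv_kr2_sketch(const float* lhs, const float* rhs_packed, float* dst,
                     size_t k, size_t n) {
    for (size_t j = 0; j < n; ++j) {
        dst[j] = 0.0f;
    }
    for (size_t kk = 0; kk < k; kk += 2) {
        /* Each K pair occupies 2 * n packed values. */
        const float* block = rhs_packed + kk * n;
        for (size_t j = 0; j < n; ++j) {
            /* Even ("bottom") elements: the FMLALB-like step. */
            dst[j] += lhs[kk] * block[2 * j];
            /* Odd ("top") elements: the FMLALT-like step. */
            dst[j] += lhs[kk + 1] * block[2 * j + 1];
        }
    }
}
```

In this model each output column owns one pair of adjacent inputs, so the two multiply-accumulate steps together consume a whole K pair without any reshuffling of the packed data.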
-
- Jul 09, 2025
-
-
Jakub Sujak authored
This SME1 GEMV kernel computes a 1x8VL block and is designed to work with the same RHS packing function as the SME1 GEMM.
Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>
Reviewed-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
-
Dan Johansson authored
Long test names cause the JUnit report files to exceed the size limit. By removing redundancies in the test names, the report size is reduced by 15%.
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jul 08, 2025
-
-
- Matrix multiplication (MxN) micro-kernels of QAI8DXP LHS and QSI4CXP RHS with BF16 output, optimized for FEAT_I8MM.
- Matrix multiplication (1xN) micro-kernels of QAI8DXP LHS and QSI4CXP RHS with BF16 output, optimized for FEAT_DotProd.
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Signed-off-by: Evie Wright <evie.wright@arm.com>
Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Evie Wright <evie.wright@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jul 04, 2025
-
-
Viet-Hoa Do authored
Signed-off-by: Damien Dooley <damien.dooley@arm.com>
Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>
-
- Jul 03, 2025
-
-
Anton Bondarenko authored
Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Dan Johansson <dan.johansson@arm.com>
-
- Jul 01, 2025
-
-
Anton Bondarenko authored
Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Dan Johansson <dan.johansson@arm.com>
-
- Jun 26, 2025
-
-
Improves performance of `kai_rhs_pack_nxk_qsi4c32pnrx8_qsu4c32s1s0_neon` by vectorizing row summation.
Signed-off-by: Evie Wright <evie.wright@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jun 25, 2025
-
-
Signed-off-by: Evie Wright <evie.wright@arm.com>
Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>
-
- Jun 24, 2025
-
-
Anton Bondarenko authored
Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
-
* Matrix multiplication (1xN) micro-kernels that multiply a dynamically quantized symmetric signed 8-bit integer LHS matrix with per-block quantization (QSI8D32) by a quantized asymmetric signed 4-bit integer RHS matrix with per-block quantization (QAI4C32) and accumulate the result into single-precision (F32) and half-precision (F16) outputs, optimized for FEAT_DotProd and packing parameter kr = 8.
Signed-off-by: Anitha Raj <anitha.raj@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jun 23, 2025
-
-
Update unit test to test signed integer inputs.
Resolves: KLEIDIAI-664
Signed-off-by: Evie Wright <evie.wright@arm.com>
Approved-by: Suhail M <mohammedsuhail.munshi@arm.com>
-
- Jun 20, 2025
-
-
Jakub Sujak authored
Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>
Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>
-
Jakub Sujak authored
Adds an F16 GEMM micro-kernel using the SME1 MOPA instruction and a 2VL x 2VL block size. This SME1 kernel is compatible with existing SME F16 LHS and RHS packing functions.
Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>
Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>
-
Jakub Sujak authored
Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>
Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>
-
- Jun 19, 2025
-
-
Jakub Sujak authored
Test buffers must be initialized to a default value of 0. Report mismatches outside the portion-tested ROI to catch out-of-bounds kernel writes.
Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>
Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>
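The check described above can be sketched as a small helper that scans a zero-initialized output buffer after a kernel has run on one portion of it. The name `outside_roi_untouched` and its parameters are hypothetical and do not come from the test framework; this only illustrates the general idea.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if every element outside the tested
 * region of interest [row0, row0 + rows) x [col0, col0 + cols) still holds
 * the default value 0, i.e. the kernel wrote nothing out of bounds. */
bool outside_roi_untouched(const float* buf, size_t height, size_t width,
                           size_t row0, size_t rows, size_t col0, size_t cols) {
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            const bool in_roi = y >= row0 && y < row0 + rows &&
                                x >= col0 && x < col0 + cols;
            if (!in_roi && buf[y * width + x] != 0.0f) {
                return false; /* a write landed outside the tested portion */
            }
        }
    }
    return true;
}
```

This only works if the buffer really starts out as all zeros, which is why the default initialization matters: a stray write is detectable only when the expected background value is known.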
-
Viet-Hoa Do authored
Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Approved-by: Jakub Sujak <jakub.sujak@arm.com>
-
Viet-Hoa Do authored
* Add GEMM F32 kernel using SME1 MOPA with block size 2VLx2VL.
* Add tests for the newly added kernel.
* Add CI job to run the kernel on FVP with SME1 and without SME2.
Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
-
Anton Bondarenko authored
Update the Linux kernel and bootloader-wrapper to versions that support SME/SME2. This is required to properly handle context switches with SME state preservation and restoration. Also address a CPU boot issue by keeping the number of CPUs in sync between the bootloader/DTS and the FVP configuration. Two CPUs are used: the first runs system services and the second is isolated to run test programs.
Linux kernel: 6.16-rc2 (update to the released version once available)
Bootloader wrapper: latest available
Signed-off-by: Anton Bondarenko <anton.bondarenko@arm.com>
Approved-by: Jakub Sujak <jakub.sujak@arm.com>
-
- Jun 16, 2025
-
-
Emil Ohlsson authored
Signed-off-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Jakub Sujak <jakub.sujak@arm.com>
-
- Jun 13, 2025
-
-
* Matrix multiplication (MxN) micro-kernels that multiply a dynamically quantized symmetric signed 8-bit integer LHS matrix with per-block quantization (QSI8D32) by a quantized asymmetric signed 4-bit integer RHS matrix with per-block quantization (QAI4C32) and accumulate the result into a single-precision (F32) output, optimized for FEAT_DotProd.
Signed-off-by: Anitha Raj <anitha.raj@arm.com>
Reviewed-by: Anitha Raj <anitha.raj@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>
-
- Jun 12, 2025
-
-
Jens Elofsson authored
Move the assembly blocks of the following kernels into their own files:
- rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme
- lhs_pack_f32p2vlx1_f32_sme
- matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa
Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>
-
Jens Elofsson authored
Move the assembly blocks of the following kernels into their own files:
- lhs_pack_x16p2vlx2_x16_sme
- rhs_pack_kxn_x16p2vlx2b_x16_x16_sme
- matmul_clamp_f16_f16p2vlx2_f16p2vlx2_2vlx2vl_sme2_mopa
- matmul_clamp_f16_f16_f16p2vlx2b_1x16vl_sme2_dot
Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>
-
- Jun 11, 2025
-
-
Jens Elofsson authored
Move the assembly blocks of the following kernels into their own files:
- lhs_pack_x8p2vlx4_x8_sme
- rhs_pack_kxn_qsi8cxp2vlx4sb_qs8cx_f32_i32_sme
- matmul_clamp_qai8_qai8_qsi8cxp2vlx4sb_1x16vl_sme2_dot
- matmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa
Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>
-
- Jun 10, 2025
-
-
Jens Elofsson authored
Fix an issue where kai_get_m_step returns the incorrect value in the following kernels:
- matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
Signed-off-by: Jens Elofsson <jens.elofsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>
-
- Jun 05, 2025
-
-
Update architectural feature guards and CMakeLists to enable MSVC build for matmul_clamp_f32_qsi8d32p_qai4c32p micro-kernels.
Signed-off-by: Anitha Raj <anitha.raj@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
-
Optimize the transposed RHS packing function for matmul_clamp_f32_qai8dxp_qsi4c32p using Advanced SIMD, for kr / sr = 8.
Signed-off-by: Evie Wright <evie.wright@arm.com>
Signed-off-by: Anitha Raj <anitha.raj@arm.com>
Reviewed-by: Anitha Raj <anitha.raj@arm.com>
Reviewed-by: Anton Bondarenko <anton.bondarenko@arm.com>
Reviewed-by: Evie Wright <evie.wright@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
-
Suhail M authored
Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com>
Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
-
Emil Ohlsson authored
This change adds MSVC support for the imatmul kernels. As a result, they are also aligned, which results in some additional diffs.
Signed-off-by: Emil Ohlsson <emil.ohlsson@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Anton Bondarenko <anton.bondarenko@arm.com>
-
Signed-off-by: Charlotte Chen <charlotte.chen@arm.com>
Approved-by: Jakub Sujak <jakub.sujak@arm.com>
-
- Jun 04, 2025
-
-
* Matrix multiplication (MxN) micro-kernels that multiply a dynamically quantized symmetric signed 8-bit integer LHS matrix with per-block quantization (QSI8D32) by a quantized asymmetric signed 4-bit integer RHS matrix with per-block quantization (QAI4C32) and accumulate the result into a half-precision (F16) output, optimized for FEAT_DotProd.
Signed-off-by: Anitha Raj <anitha.raj@arm.com>
Approved-by: Jakub Sujak <jakub.sujak@arm.com>
-