- Jul 22, 2025
-
-
Signed-off-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
-
- Jul 21, 2025
-
-
Anton Bondarenko authored
Also update CHANGELOG to mention more Signed-off-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
John McLoughlin <john.mcloughlin@arm.com>
-
Signed-off-by:
Evie Wright <evie.wright@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Jakub Sujak <jakub.sujak@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jul 16, 2025
-
-
Viet-Hoa Do authored
* List of micro-kernels added to the MSVC build:
  - kai_matmul_clamp_f32_f32_f32p8x1biasf32_6x8x4_neon_mla
  - kai_lhs_quant_pack_qsi8d32p_f32_neon
  - kai_rhs_pack_kxn_qsi8cxp_qsi8cx_neon
  - kai_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
  - kai_rhs_pack_nxk_qsi4cxps1s0_qsu4cxs1s0_neon
  - kai_rhs_pack_nxk_qsi8cxp_qsi8cx_neon
Signed-off-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Jens Elofsson <jens.elofsson@arm.com>
-
Viet-Hoa Do authored
* Kernels:
  - kai_rhs_pack_nxk_f32p2vlx1biasf32_f32_f32_sme
  - kai_rhs_pack_nxk_x16p2vlx2b_x16_x16_sme
Signed-off-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jul 15, 2025
-
-
* Update the asm kernel to multiply the zero-points and sum as integers instead of float. Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Anitha Raj <anitha.raj@arm.com> Approved-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
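The idea of applying the zero-point correction in integer arithmetic can be sketched in scalar C. This is only an illustration under stated assumptions: the commit does not name the kernel or its data layout, so the function name `dot_with_int_zero_point`, the single per-block zero-point, and the `int8_t`/`uint8_t` operand types are all hypothetical, and the real change lives in an assembly kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration; not part of the KleidiAI API.
 * Computes scale * sum_i lhs[i] * (rhs[i] - zero_point), keeping the
 * zero-point correction in int32 until the final conversion. */
float dot_with_int_zero_point(const int8_t* lhs, const uint8_t* rhs, size_t k,
                              int32_t zero_point, float scale) {
    assert(lhs != NULL && rhs != NULL);
    int32_t acc = 0;     /* sum of lhs[i] * rhs[i] */
    int32_t lhs_sum = 0; /* sum of lhs[i] */
    for (size_t i = 0; i < k; ++i) {
        acc += (int32_t)lhs[i] * (int32_t)rhs[i];
        lhs_sum += (int32_t)lhs[i];
    }
    /* sum(lhs * (rhs - zp)) == sum(lhs * rhs) - zp * sum(lhs), all in int32. */
    acc -= zero_point * lhs_sum;
    return scale * (float)acc; /* single int-to-float conversion at the end */
}
```

Because the product of the zero-point and the sum stays in integer arithmetic, the correction is exact as long as the 32-bit accumulators do not overflow.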
-
- Jul 14, 2025
-
-
Jakub Sujak authored
* Add SME F16 GEMV micro-kernel.
* The GEMV micro-kernel uses instructions compatible with FEAT_SME.
* The GEMV micro-kernel is designed to reuse the same RHS packing functions as the SME F16 GEMM.
This new GEMV micro-kernel requires only FEAT_SME; it does not require FEAT_SME2. By using pairs of `FMLALB` and `FMLALT` instructions, it can reuse the existing RHS data format of the GEMM operation where `kr=2`, eliminating the need for a specialized packing function for the GEMV operation. Signed-off-by:
Jakub Sujak <jakub.sujak@arm.com> Approved-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
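A scalar model of why a `kr=2` RHS layout suits a bottom/top instruction pair is sketched below. `FMLALB` multiplies the even-numbered (bottom) half-precision elements and accumulates into single precision, while `FMLALT` does the same for the odd-numbered (top) elements. The layout used here is a simplification and an assumption: two consecutive K values per column stored adjacently, with no bias and no vector-length blocking. `float` stands in for F16, and `gemv_kr2` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model; not the KleidiAI micro-kernel.
 * rhs_packed holds, for each pair of K indices and each column j,
 * the two values (k, k + 1) next to each other. */
void gemv_kr2(const float* lhs, const float* rhs_packed, float* dst,
              size_t n, size_t k) {
    assert(k % 2 == 0); /* this sketch handles only whole kr=2 pairs */
    for (size_t j = 0; j < n; ++j) {
        dst[j] = 0.0f;
    }
    for (size_t kk = 0; kk < k; kk += 2) {
        const float* block = rhs_packed + (kk / 2) * n * 2;
        for (size_t j = 0; j < n; ++j) {
            /* Even ("bottom") element of the pair: the FMLALB role. */
            dst[j] += lhs[kk] * block[2 * j];
            /* Odd ("top") element of the pair: the FMLALT role. */
            dst[j] += lhs[kk + 1] * block[2 * j + 1];
        }
    }
}
```

Both halves of each pair accumulate into the same output column, which is why the GEMM packing can be consumed without repacking.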
-
- Jul 09, 2025
-
-
Jakub Sujak authored
This SME1 GEMV kernel computes a 1x8VL block and is designed to work with the same RHS packing function as the SME1 GEMM. Signed-off-by:
Jakub Sujak <jakub.sujak@arm.com> Reviewed-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Jakub Sujak <jakub.sujak@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jul 08, 2025
-
-
- Matrix multiplication (MxN) Micro-kernels of QAI8DXP LHS and QSI4CXP RHS with BF16 output, optimized for FEAT_I8MM. - Matrix multiplication (1xN) Micro-kernels of QAI8DXP LHS and QSI4CXP RHS with BF16 output, optimized for FEAT_DotProd. Signed-off-by:
Nikhil Gupta <nikhil.gupta2@arm.com> Signed-off-by:
Evie Wright <evie.wright@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Evie Wright <evie.wright@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jul 01, 2025
-
-
Anton Bondarenko authored
Signed-off-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Dan Johansson <dan.johansson@arm.com>
-
- Jun 26, 2025
-
-
Improves performance of `kai_rhs_pack_nxk_qsi4c32pnrx8_qsu4c32s1s0_neon` by vectorizing row summation. Signed-off-by:
Evie Wright <evie.wright@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jun 25, 2025
-
-
Signed-off-by:
Evie Wright <evie.wright@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Approved-by:
Viet-Hoa Do <viet-hoa.do@arm.com>
-
- Jun 24, 2025
-
-
Anton Bondarenko authored
Signed-off-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
-
* Matrix multiplication (1xN) micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a single-precision (F32) and half-precision (F16) output, optimized for FEAT_DotProd and packing parameter kr = 8. Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jun 20, 2025
-
-
Jakub Sujak authored
Adds F16 GEMM micro-kernel using SME1 MOPA instruction and 2VL x 2VL block size. This SME1 kernel is compatible with existing SME F16 LHS and RHS packing functions. Signed-off-by:
Jakub Sujak <jakub.sujak@arm.com> Approved-by:
Viet-Hoa Do <viet-hoa.do@arm.com>
-
- Jun 19, 2025
-
-
Viet-Hoa Do authored
* Add GEMM F32 kernel using SME1 MOPA with block size 2VLx2VL.
* Add tests for the newly added kernel.
* Add CI job to run the kernel on FVP with SME1 and without SME2.
Signed-off-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jun 16, 2025
-
-
Emil Ohlsson authored
Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
- Jun 13, 2025
-
-
* Matrix multiplication (MxN) micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a single-precision (F32) output, optimized for FEAT_DotProd Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Reviewed-by:
Anitha Raj <anitha.raj@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
- Jun 12, 2025
-
-
Jens Elofsson authored
Move the assembly blocks of the following kernels into their own files:
- rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme
- lhs_pack_f32p2vlx1_f32_sme
- matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa
Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
Jens Elofsson authored
Move the assembly blocks of the following kernels into their own files:
- lhs_pack_x16p2vlx2_x16_sme
- rhs_pack_kxn_x16p2vlx2b_x16_x16_sme
- matmul_clamp_f16_f16p2vlx2_f16p2vlx2_2vlx2vl_sme2_mopa
- matmul_clamp_f16_f16_f16p2vlx2b_1x16vl_sme2_dot
Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
- Jun 11, 2025
-
-
Jens Elofsson authored
Move the assembly blocks of the following kernels into their own files:
- lhs_pack_x8p2vlx4_x8_sme
- rhs_pack_kxn_qsi8cxp2vlx4sb_qs8cx_f32_i32_sme
- matmul_clamp_qai8_qai8_qsi8cxp2vlx4sb_1x16vl_sme2_dot
- matmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa
Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
- Jun 10, 2025
-
-
Jens Elofsson authored
Fix an issue where kai_get_m_step returns an incorrect value in the following kernels:
- matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
- Jun 05, 2025
-
-
Optimize the transposed RHS packing function for matmul_clamp_f32_qai8dxp_qsi4c32p using Advanced SIMD, for kr / sr = 8. Signed-off-by:
Evie Wright <evie.wright@arm.com> Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Reviewed-by:
Anitha Raj <anitha.raj@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Evie Wright <evie.wright@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
Emil Ohlsson authored
This change adds MSVC support for the imatmul kernels. As a result, they are also aligned, which results in some additional diffs. Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jun 04, 2025
-
-
* Matrix multiplication (MxN) micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a half-precision (F16) output, optimized for FEAT_DotProd. Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
- Jun 03, 2025
-
-
Suhail M authored
- Example demonstrates creating an indirect buffer using a Conv2D input tensor.
- Example demonstrates indirect buffer usage with imatmul kernels.
Signed-off-by:
Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com> Reviewed-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Suhail M <mohammedsuhail.munshi@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
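For readers unfamiliar with the technique, building an indirection table from a Conv2D input can be sketched as follows. This is not the example added by the commit. It assumes a single NHWC image, stride 1, no dilation, symmetric padding, and a plain row-major table (one pointer per output pixel and kernel tap) before any kernel-specific repacking. The function name `build_conv2d_indirection` and the `pad_pixel` parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch; not a KleidiAI function.
 * Fills `table` with one pointer per (output pixel, kernel tap). Each pointer
 * refers to the `channels` values of one input pixel, or to `pad_pixel` when
 * the tap falls into the padding region. Returns the number of entries. */
size_t build_conv2d_indirection(const float* input, const float* pad_pixel,
                                size_t in_h, size_t in_w, size_t channels,
                                size_t k_h, size_t k_w, size_t pad,
                                const float** table) {
    assert(in_h + 2 * pad >= k_h && in_w + 2 * pad >= k_w);
    const size_t out_h = in_h + 2 * pad - k_h + 1;
    const size_t out_w = in_w + 2 * pad - k_w + 1;
    size_t idx = 0;
    for (size_t oy = 0; oy < out_h; ++oy) {
        for (size_t ox = 0; ox < out_w; ++ox) {
            for (size_t ky = 0; ky < k_h; ++ky) {
                for (size_t kx = 0; kx < k_w; ++kx) {
                    /* Coordinates in the padded image; compare against pad
                     * before subtracting so the arithmetic stays unsigned. */
                    const size_t y = oy + ky;
                    const size_t x = ox + kx;
                    const int inside = y >= pad && y - pad < in_h &&
                                       x >= pad && x - pad < in_w;
                    table[idx++] = inside
                        ? input + ((y - pad) * in_w + (x - pad)) * channels
                        : pad_pixel;
                }
            }
        }
    }
    return idx;
}
```

Each table entry then plays the role of one K-dimension chunk of `channels` elements, so the convolution can be computed as a matrix multiplication without materializing an im2col copy of the input.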
-
- Jun 02, 2025
-
-
Emil Ohlsson authored
Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Viet-Hoa Do <viet-hoa.do@arm.com>
-
- May 30, 2025
-
-
* Micro-kernels (1xN) to compute the matrix multiplication of dynamically quantized asymmetric 8-bit integer with per-channel quantization (QAI8DX) LHS matrix and quantized symmetric 8-bit integer with per-channel quantization (QSI8CX) RHS matrix and the accumulation of the result into a single-precision (F32) output, optimized for SME2 technology.
* Micro-kernels (MxN) to compute the matrix multiplication of dynamically quantized asymmetric 8-bit integer with per-channel quantization (QAI8DX) LHS matrix and quantized symmetric 8-bit integer with per-channel quantization (QSI8CX) RHS matrix and the accumulation of the result into a single-precision (F32) output, optimized for SME2 technology.
Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
-
- Apr 24, 2025
-
-
Jens Elofsson authored
Update all version indicators to 1.8.0. Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
- Adds packing and matmul kernels for FP32 SME indirect GEMM.
- Adds tests for indirect GEMM with FP32 inputs/outputs.
Signed-off-by:
Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Jakub Sujak <jakub.sujak@arm.com> Reviewed-by:
Mohammed Suhail Munshi <mohammedsuhail.munshi@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
- Apr 23, 2025
-
-
Emil Ohlsson authored
This change introduces three new kernels:
* `kai_lhs_imatmul_pack_x16p2vlx2_x16p_sme`
* `kai_rhs_imatmul_pack_kxn_x16p2vlx2b_x16_x16_sme`
* `kai_imatmul_clamp_f16_f16p2vlx2_f16p2vlx2_2vlx2vl_sme2_mopa`
These are used for indirect matmul with 16-bit floating-point data. This change also adds unit testing for the FP16 imatmul kernels. The test code is written in a type-agnostic manner so that other data types can be tested with very little effort. This required the addition of non-templated `read`/`write` functions to allow runtime-generic access. Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Jakub Sujak <jakub.sujak@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- Apr 15, 2025
-
-
Signed-off-by:
Evie Wright <evie.wright@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
Jens Elofsson authored
Update all version indicators to 1.7.0. Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
- Apr 11, 2025
-
-
Micro-kernels to compute the matrix multiplication of dynamically quantized asymmetric signed 8-bit integer with per-channel quantization (QAI8DX) LHS matrix and quantized symmetric 4-bit signed integer with per-channel quantization (QSI4CX) RHS matrix and the accumulation of the result into a half-precision (F16) output:
- Matrix multiplication (MxN) Micro-kernels of QAI8DX LHS and QSI4CX RHS with F16 output, optimized for FEAT_I8MM and FEAT_DotProd.
- Matrix multiplication (1xN) Micro-kernels of QAI8DX LHS and QSI4CX RHS with F16 output, optimized for FEAT_DotProd.
Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Signed-off-by:
Evie Wright <evie.wright@arm.com> Approved-by:
Viet-Hoa Do <viet-hoa.do@arm.com>
-
- Apr 09, 2025
-
-
Micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a single-precision (F32) and half-precision (F16) output:
- Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_I8MM.
- Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_DotProd.
- Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_I8MM.
- Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_DotProd.
Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Anitha Raj <anitha.raj@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
Emil Ohlsson authored
This change introduces three new kernels:
* kai_imatmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa
* kai_lhs_imatmul_pack_x8p2vlx4_x8p_sme
* kai_rhs_imatmul_pack_kxn_qsi8cxp2vlx4sb_qs8cx_f32_i32_sme
These kernels are used for _indirect matmul_. The main difference between these kernels and the matmul kernels is that the LHS packing kernel takes an indirection buffer in which each pointer refers to a chunk in the K dimension. The pointers are laid out in a packed manner: instead of being in row-major order, each column of `get_m_step` chunk pointers is placed contiguously in the indirection buffer. In addition to the kernels themselves, `matmul_clamp_qai8_qai8p_qsi8cxp_test.cpp` is extended to test these new kernels. The testing flow for these new kernels is slightly different: the packing kernels are not tested directly; only the end-to-end flow is tested. Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Signed-off-by:
Felix Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com> Signed-off-by:
Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Jakub Sujak <jakub.sujak@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
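One reading of the packed indirection layout described in this commit message is sketched below: rows are grouped into blocks of `m_step`, and within a block the `m_step` pointers for one K chunk are stored next to each other before moving to the next chunk. The helper `indirection_index` and its parameter names are hypothetical, and the exact layout expected by the kernels should be taken from the library headers rather than from this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper; not part of the KleidiAI API.
 * Maps (row m_idx, K chunk chunk_idx) to a position in a packed indirection
 * buffer where each group of m_step rows stores its pointers chunk by chunk.
 * Row-major order would instead be m_idx * k_chunk_count + chunk_idx. */
size_t indirection_index(size_t m_idx, size_t chunk_idx,
                         size_t m_step, size_t k_chunk_count) {
    assert(m_step > 0 && chunk_idx < k_chunk_count);
    const size_t block = m_idx / m_step;        /* which group of m_step rows */
    const size_t row_in_block = m_idx % m_step; /* position inside the group */
    return block * k_chunk_count * m_step + chunk_idx * m_step + row_in_block;
}
```

With `m_step = 2` and three chunks, rows 0 and 1 occupy positions 0 and 1 for chunk 0, positions 2 and 3 for chunk 1, and so on, so a packing routine can read the `m_step` pointers for one chunk as a contiguous run.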
-
- Mar 26, 2025
-
-
Jens Elofsson authored
Update all version indicators to 1.6.0. Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
Jens Elofsson authored
Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
Optimizes this RHS packing by vectorizing the XOR operation. This is done for segment lengths of 4 or 8 bytes. The unoptimized path is used for any other segment length. Signed-off-by:
Dan Johansson <dan.johansson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
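The dispatch on segment length can be illustrated in portable C by applying the XOR to a whole 4-byte or 8-byte word at once and falling back to a byte loop otherwise. This is a stand-in technique, not the commit's code: the commit does not name the packing function or the mask, so the `0x88` mask (flipping the top bit of each 4-bit value) and the name `xor_segment` are assumptions, and the actual optimization presumably uses vector instructions rather than scalar words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch; not a KleidiAI function.
 * XORs every byte of a segment with 0x88. Segments of 8 or 4 bytes are
 * processed as a single word; any other length uses the byte-wise path. */
void xor_segment(uint8_t* dst, const uint8_t* src, size_t len) {
    assert(dst != NULL && src != NULL);
    if (len == 8) {
        uint64_t v;
        memcpy(&v, src, sizeof(v)); /* memcpy avoids unaligned access issues */
        v ^= UINT64_C(0x8888888888888888);
        memcpy(dst, &v, sizeof(v));
    } else if (len == 4) {
        uint32_t v;
        memcpy(&v, src, sizeof(v));
        v ^= UINT32_C(0x88888888);
        memcpy(dst, &v, sizeof(v));
    } else {
        for (size_t i = 0; i < len; ++i) {
            dst[i] = (uint8_t)(src[i] ^ 0x88); /* unoptimized fallback */
        }
    }
}
```

Since the mask is the same in every byte, the word-wide XOR gives the same result as the byte loop regardless of endianness.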
-