- Jul 01, 2025
-
-
Anton Bondarenko authored
Signed-off-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Dan Johansson <dan.johansson@arm.com>
-
- Jun 26, 2025
-
-
Improves performance of `kai_rhs_pack_nxk_qsi4c32pnrx8_qsu4c32s1s0_neon` by vectorizing row summation. Signed-off-by:
Evie Wright <evie.wright@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jun 25, 2025
-
-
Signed-off-by:
Evie Wright <evie.wright@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Approved-by:
Viet-Hoa Do <viet-hoa.do@arm.com>
-
- Jun 24, 2025
-
-
Anton Bondarenko authored
Signed-off-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
-
* Matrix multiplication (1xN) micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a single-precision (F32) and half-precision (F16) output, optimized for FEAT_DotProd and packing parameter kr = 8. Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jun 23, 2025
-
-
Update unit test to test signed integer inputs. Resolves: KLEIDIAI-664 Signed-off-by:
Evie Wright <evie.wright@arm.com> Approved-by:
Suhail M <mohammedsuhail.munshi@arm.com>
-
- Jun 20, 2025
-
-
Jakub Sujak authored
Signed-off-by:
Jakub Sujak <jakub.sujak@arm.com> Approved-by:
Viet-Hoa Do <viet-hoa.do@arm.com>
-
Jakub Sujak authored
Adds F16 GEMM micro-kernel using SME1 MOPA instruction and 2VL x 2VL block size. This SME1 kernel is compatible with existing SME F16 LHS and RHS packing functions. Signed-off-by:
Jakub Sujak <jakub.sujak@arm.com> Approved-by:
Viet-Hoa Do <viet-hoa.do@arm.com>
-
Jakub Sujak authored
Signed-off-by:
Jakub Sujak <jakub.sujak@arm.com> Approved-by:
Viet-Hoa Do <viet-hoa.do@arm.com>
-
- Jun 19, 2025
-
-
Jakub Sujak authored
Test buffers must be initialized to a default value of 0. Report mismatches outside the portion-tested ROI to catch out-of-bound kernel writes. Signed-off-by:
Jakub Sujak <jakub.sujak@arm.com> Approved-by:
Viet-Hoa Do <viet-hoa.do@arm.com>
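The out-of-bounds check described in this commit can be illustrated with a minimal sketch. It assumes a row-major F32 output buffer that was zero-initialized before the kernel ran; `Rect` and `count_writes_outside_roi` are hypothetical names chosen for illustration, not functions from the test framework.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of the tested portion (region of interest).
struct Rect {
    std::size_t row, col, height, width;
};

// Counts elements outside `roi` that no longer hold the initial value 0.
// Any such element suggests the kernel wrote outside its assigned portion.
std::size_t count_writes_outside_roi(
    const std::vector<float>& out, std::size_t rows, std::size_t cols, const Rect& roi) {
    assert(out.size() == rows * cols);
    std::size_t mismatches = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            const bool inside = r >= roi.row && r < roi.row + roi.height &&
                                c >= roi.col && c < roi.col + roi.width;
            if (!inside && out[r * cols + c] != 0.0f) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

The check only works because the buffer starts from a known value, which is why the default initialization to 0 matters.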
-
Viet-Hoa Do authored
Signed-off-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
Viet-Hoa Do authored
* Add GEMM F32 kernel using SME1 MOPA with block size 2VLx2VL. * Add tests for the newly added kernel. * Add CI job to run the kernel on FVP with SME1 and without SME2. Signed-off-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
Anton Bondarenko authored
Update Linux Kernel and bootloader-wrapper to versions supporting SME/SME2. This is required to properly handle context switches with SME state preservation and restoration. Also address a CPU boot issue by keeping the number of CPUs used in the bootloader/DTS in sync with the FVP configuration. Two CPUs are used: the first runs system services and the second is isolated to run test programs. Linux Kernel: 6.16-rc2 (update to released version once available) Bootloader wrapper: latest available Signed-off-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
- Jun 16, 2025
-
-
Emil Ohlsson authored
Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
- Jun 13, 2025
-
-
* Matrix multiplication (MxN) micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a single-precision (F32) output, optimized for FEAT_DotProd Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Reviewed-by:
Anitha Raj <anitha.raj@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
- Jun 12, 2025
-
-
Jens Elofsson authored
Move the assembly blocks of the following kernels into their own files:
- rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme
- lhs_pack_f32p2vlx1_f32_sme
- matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa
Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
Jens Elofsson authored
Move the assembly blocks of the following kernels into their own files:
- lhs_pack_x16p2vlx2_x16_sme
- rhs_pack_kxn_x16p2vlx2b_x16_x16_sme
- matmul_clamp_f16_f16p2vlx2_f16p2vlx2_2vlx2vl_sme2_mopa
- matmul_clamp_f16_f16_f16p2vlx2b_1x16vl_sme2_dot
Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
- Jun 11, 2025
-
-
Jens Elofsson authored
Move the assembly blocks of the following kernels into their own files:
- lhs_pack_x8p2vlx4_x8_sme
- rhs_pack_kxn_qsi8cxp2vlx4sb_qs8cx_f32_i32_sme
- matmul_clamp_qai8_qai8_qsi8cxp2vlx4sb_1x16vl_sme2_dot
- matmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa
Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
- Jun 10, 2025
-
-
Jens Elofsson authored
Fix an issue in the kernels matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla and matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla where kai_get_m_step returns an incorrect value. Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
- Jun 05, 2025
-
-
Update architectural feature guards and CMakeLists to enable MSVC build for matmul_clamp_f32_qsi8d32p_qai4c32p micro-kernels. Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
Optimize the transposed RHS packing function for matmul_clamp_f32_qai8dxp_qsi4c32p using advanced SIMD, for kr / sr = 8 Signed-off-by:
Evie Wright <evie.wright@arm.com> Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Reviewed-by:
Anitha Raj <anitha.raj@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Evie Wright <evie.wright@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
Suhail M authored
Signed-off-by:
Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com> Approved-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
-
Emil Ohlsson authored
This change adds MSVC support for the imatmul kernels. As a result, they are also aligned, which results in some additional diffs. Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
Signed-off-by:
Charlotte Chen <charlotte.chen@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
- Jun 04, 2025
-
-
* Matrix multiplication (MxN) micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a half-precision (F16) output, optimized for FEAT_DotProd. Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
Emil Ohlsson authored
Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- Jun 03, 2025
-
-
Suhail M authored
- Example demonstrates creating an indirect buffer using a Conv2D input tensor - Example demonstrates indirect buffer usage with imatmul kernels. Signed-off-by:
Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com> Reviewed-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Suhail M <mohammedsuhail.munshi@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
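As a rough illustration of the idea behind an indirect buffer, the sketch below builds a table of row pointers for a Conv2D input. It assumes an NHWC tensor with batch size 1, stride 1 and symmetric zero padding, and it uses a plain output-major ordering. The function name `make_indirection_table` and this layout are assumptions for illustration only; the imatmul kernels expect their own specific arrangement, which is described in the project documentation and the example itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For every output pixel and every kernel tap, stores a pointer to the
// `channels`-long input row that the tap reads. Taps that fall into the
// padding area point at `pad_row`, a caller-provided row of zeros.
std::vector<const float*> make_indirection_table(
    const float* input, std::size_t height, std::size_t width, std::size_t channels,
    std::size_t kernel_h, std::size_t kernel_w, std::size_t pad, const float* pad_row) {
    assert(height + 2 * pad >= kernel_h && width + 2 * pad >= kernel_w);
    const std::size_t out_h = height + 2 * pad - kernel_h + 1;
    const std::size_t out_w = width + 2 * pad - kernel_w + 1;

    std::vector<const float*> table;
    table.reserve(out_h * out_w * kernel_h * kernel_w);
    for (std::size_t oy = 0; oy < out_h; ++oy) {
        for (std::size_t ox = 0; ox < out_w; ++ox) {
            for (std::size_t ky = 0; ky < kernel_h; ++ky) {
                for (std::size_t kx = 0; kx < kernel_w; ++kx) {
                    // Signed arithmetic so that negative coordinates mark padding.
                    const auto iy = static_cast<std::ptrdiff_t>(oy + ky) - static_cast<std::ptrdiff_t>(pad);
                    const auto ix = static_cast<std::ptrdiff_t>(ox + kx) - static_cast<std::ptrdiff_t>(pad);
                    const bool inside = iy >= 0 && iy < static_cast<std::ptrdiff_t>(height) &&
                                        ix >= 0 && ix < static_cast<std::ptrdiff_t>(width);
                    table.push_back(
                        inside ? input + (static_cast<std::size_t>(iy) * width + static_cast<std::size_t>(ix)) * channels
                               : pad_row);
                }
            }
        }
    }
    return table;
}
```

The benefit of this indirection is that the convolution can read input rows through the table instead of first materializing an im2col copy of the tensor.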
-
- Following the procedure call standard, back up registers d8-d15 in the kai_kernel_matmul_clamp_f32_qai8dxp1x4_qsi8cxp4vlx4_1x4vl_sme2_dot asm micro-kernel. Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
- Jun 02, 2025
-
-
Emil Ohlsson authored
Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Viet-Hoa Do <viet-hoa.do@arm.com>
-
- May 30, 2025
-
-
* Micro-kernels (1xN) to compute the matrix multiplication of dynamically quantized asymmetric 8-bit integer with per-channel quantization (QAI8DX) LHS matrix and quantized symmetric 8-bit integer with per-channel quantization (QSI8CX) RHS matrix and the accumulation of the result into a single-precision (F32) output, optimized for SME2 technology. * Micro-kernels (MxN) to compute the matrix multiplication of dynamically quantized asymmetric 8-bit integer with per-channel quantization (QAI8DX) LHS matrix and quantized symmetric 8-bit integer with per-channel quantization (QSI8CX) RHS matrix and the accumulation of the result into a single-precision (F32) output, optimized for SME2 technology. Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
-
- May 28, 2025
-
-
Signed-off-by:
Evie Wright <evie.wright@arm.com> Reviewed-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
-
- May 27, 2025
-
-
Anton Bondarenko authored
Update missing or outdated components in build environment Signed-off-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
Emil Ohlsson authored
Newer versions of the linter flag issues with parentheses in expressions, as well as use of `size_t` without inclusion of `stddef.h`. Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
Emil Ohlsson authored
Given that there is a non-templated version for doing clamp testing, there is no need to toggle on type in matmul_test.cpp. Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Jens Elofsson <jens.elofsson@arm.com>
-
- May 26, 2025
-
-
* Replace `std::vector<uint8_t>` with the `Buffer` class. * Update `Buffer` class: - Add support for initial value of the buffer. - Always initialize the buffer with 0 by default. * Add `pad_matrix` reference function to support extending the data buffer. Signed-off-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
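To make the idea of extending a data buffer concrete, here is a minimal sketch of a padding helper for a row-major matrix. The name `pad_matrix_sketch`, its parameters, and the choice to pad only on the bottom and right are assumptions for illustration; the actual `pad_matrix` reference function may have a different signature and operate on `Buffer` rather than `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns a (height + pad_bottom) x (width + pad_right) row-major matrix
// whose top-left corner is a copy of `src` and whose remaining elements
// are set to `pad_value`.
template <typename T>
std::vector<T> pad_matrix_sketch(
    const std::vector<T>& src, std::size_t height, std::size_t width,
    std::size_t pad_bottom, std::size_t pad_right, T pad_value) {
    assert(src.size() == height * width);
    const std::size_t dst_height = height + pad_bottom;
    const std::size_t dst_width = width + pad_right;

    // Start from a buffer filled with the padding value, then copy rows in.
    std::vector<T> dst(dst_height * dst_width, pad_value);
    for (std::size_t r = 0; r < height; ++r) {
        for (std::size_t c = 0; c < width; ++c) {
            dst[r * dst_width + c] = src[r * width + c];
        }
    }
    return dst;
}
```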
-
- May 22, 2025
-
-
Viet-Hoa Do authored
* Numeric limits report the lowest and highest finite values of F16 and BF16 to be 0, which disables testing of all F16 and BF16 kernels with clamping. * Update numeric limits to have the correct limits. * Update numeric limits to ensure a compilation error when a type is not supported. Signed-off-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
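One way such a zero limit can arise in C++ is when `std::numeric_limits` is queried for a type without a specialization: the primary template returns a value-initialized `T`. The sketch below shows that behavior and a possible remedy, a trait whose primary template is left undefined so that unsupported types fail to compile. `Half` and `test_limits` are hypothetical names, and whether the test framework hit exactly this cause is an assumption.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-in for a half-precision type stored as raw bits.
struct Half {
    std::uint16_t bits;
};

// Primary template is declared but never defined: using test_limits<T>
// with an unsupported T is a compilation error rather than a silent 0.
template <typename T>
struct test_limits;

template <>
struct test_limits<float> {
    static constexpr float lowest() { return std::numeric_limits<float>::lowest(); }
    static constexpr float max() { return std::numeric_limits<float>::max(); }
};

template <>
struct test_limits<Half> {
    // IEEE 754 binary16: largest finite magnitude is 65504 (0x7BFF).
    static constexpr Half lowest() { return Half{0xFBFF}; }
    static constexpr Half max() { return Half{0x7BFF}; }
};

// The unspecialized standard template silently yields a zero value.
static_assert(std::numeric_limits<Half>::max().bits == 0, "primary template returns T()");
static_assert(test_limits<Half>::max().bits == 0x7BFF, "explicit specialization is correct");
```

With clamp bounds derived from limits that are both 0, every clamping test degenerates, which matches the symptom described in the commit message.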
-
Viet-Hoa Do authored
* The conversion between FP32 and FP16 is part of the base instruction set and does not require FEAT_FP16. The equivalent functions in the test framework need to change to avoid the need for FEAT_FP16. Signed-off-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Approved-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- May 13, 2025
-
-
Emil Ohlsson authored
This commit introduces documentation for the imatmul kernels, with extra focus on the left hand side packing. Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Reviewed-by:
Jakub Sujak <jakub.sujak@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- May 08, 2025
-
-
Anton Bondarenko authored
* Remove duplicate data types and structures * Clean up unused headers * Move common CPU feature checkers to cpu_info Signed-off-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
- May 02, 2025
-
-
Jakub Sujak authored
Introduces a dedicated `Buffer` abstraction for managing blocks of memory. Buffer comes with protection mechanisms that can be enabled by setting the `KAI_TEST_BUFFER_POLICY` environment variable.

Example usage:

```
KAI_TEST_BUFFER_POLICY=PROTECT_OVERFLOW ./kleidiai_test
```

Available memory protection mechanisms:
- `KAI_TEST_BUFFER_POLICY=PROTECT_UNDERFLOW`
- `KAI_TEST_BUFFER_POLICY=PROTECT_OVERFLOW`

If `KAI_TEST_BUFFER_POLICY` is not set or is not one of the above values, then no memory protection mechanisms are enabled and Buffer performs naive malloc() allocation of memory.

When `KAI_TEST_BUFFER_POLICY` is set to one of the above values, the following protections are enabled:
- `PROTECT_UNDERFLOW`: Memory equal to the size of the user buffer rounded to the nearest whole page plus adjacent guard pages is allocated, and the user buffer is aligned to the end of the head guard page thus detecting whenever a buffer underflow occurs.
- `PROTECT_OVERFLOW`: Same as above, but now the edge of the user buffer is aligned to the start of the tail guard page thus detecting whenever a buffer overflow occurs.

Buffer is only intended to opaquely allocate and manage memory. The underlying memory resource can be requested using the familiar `Buffer::data()` method and interacted with using `kai::test::read_array<T>()` and `kai::test::write_array<T>()` utilities.

Signed-off-by:
Jakub Sujak <jakub.sujak@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Jakub Sujak <jakub.sujak@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
-