- Jun 02, 2025
-
-
Emil Ohlsson authored
Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Viet-Hoa Do <viet-hoa.do@arm.com>
-
- May 30, 2025
-
-
* Micro-kernels (1xN) to compute the matrix multiplication of dynamically quantized asymmetric 8-bit integer with per-channel quantization (QAI8DX) LHS matrix and quantized symmetric 8-bit integer with per-channel quantization (QSI8CX) RHS matrix and the accumulation of the result into a single-precision (F32) output, optimized for SME2 technology.
* Micro-kernels (MxN) to compute the matrix multiplication of dynamically quantized asymmetric 8-bit integer with per-channel quantization (QAI8DX) LHS matrix and quantized symmetric 8-bit integer with per-channel quantization (QSI8CX) RHS matrix and the accumulation of the result into a single-precision (F32) output, optimized for SME2 technology.
Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
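The arithmetic behind the QAI8DX x QSI8CX micro-kernels above can be illustrated with a scalar reference. This is only a sketch under stated assumptions: the function name, the argument layout (RHS stored as N rows of K values) and the sign convention for the zero point are hypothetical and are not taken from the KleidiAI API. It also omits bias and clamping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference routine for illustration; not a KleidiAI function.
// lhs: m x k int8 values, row-major, with one scale and one zero point per row.
// rhs: n x k int8 values (one row per output channel), with one scale per row.
std::vector<float> ref_matmul_qai8dx_qsi8cx_f32(
    std::size_t m, std::size_t n, std::size_t k,
    const std::vector<std::int8_t>& lhs, const std::vector<float>& lhs_scale,
    const std::vector<std::int32_t>& lhs_zero_point,
    const std::vector<std::int8_t>& rhs, const std::vector<float>& rhs_scale) {
    assert(lhs.size() == m * k && rhs.size() == n * k);
    assert(lhs_scale.size() == m && lhs_zero_point.size() == m && rhs_scale.size() == n);

    std::vector<float> dst(m * n, 0.0f);
    for (std::size_t row = 0; row < m; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            std::int32_t acc = 0;  // integer accumulation along K
            for (std::size_t i = 0; i < k; ++i) {
                // Assumed convention: real_value = scale * (quantized - zero_point).
                const std::int32_t a = static_cast<std::int32_t>(lhs[row * k + i]) - lhs_zero_point[row];
                const std::int32_t b = static_cast<std::int32_t>(rhs[col * k + i]);
                acc += a * b;
            }
            // Dequantize once per output element into F32.
            dst[row * n + col] = static_cast<float>(acc) * lhs_scale[row] * rhs_scale[col];
        }
    }
    return dst;
}
```

The asymmetric LHS needs a zero point per row, while the symmetric RHS only needs a scale per output channel, which is why the two scales can be applied after the integer accumulation.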
-
- May 28, 2025
-
-
Signed-off-by:
Evie Wright <evie.wright@arm.com> Reviewed-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
-
- May 27, 2025
-
-
Anton Bondarenko authored
Update missing or outdated components in build environment Signed-off-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
Emil Ohlsson authored
Newer versions of the linter flag issues with parentheses in expressions, as well as use of `size_t` without inclusion of `stddef.h`. Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
Emil Ohlsson authored
Given that there is a non-templated version for doing clamp testing, there is no need to toggle on type in matmul_test.cpp. Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Jens Elofsson <jens.elofsson@arm.com>
-
- May 26, 2025
-
-
* Replace `std::vector<uint8_t>` by `Buffer` class.
* Update `Buffer` class:
  - Add support for initial value of the buffer.
  - Always initialize the buffer with 0 by default.
* Add `pad_matrix` reference function to support extending the data buffer.
Signed-off-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
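As a rough idea of what a matrix padding reference function does, the sketch below extends a row-major matrix to the bottom and to the right with a fill value. The name `pad_matrix_sketch`, its signature and the use of `std::vector` instead of `Buffer` are assumptions made for illustration. The actual `pad_matrix` in the test framework may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch; the real pad_matrix signature is not shown in this log.
// Copies a height x width row-major matrix into a larger one, filling the
// extra rows and columns with pad_value.
template <typename T>
std::vector<T> pad_matrix_sketch(
    const std::vector<T>& src, std::size_t height, std::size_t width,
    std::size_t pad_bottom, std::size_t pad_right, T pad_value) {
    assert(src.size() == height * width);

    const std::size_t dst_width = width + pad_right;
    std::vector<T> dst((height + pad_bottom) * dst_width, pad_value);

    for (std::size_t row = 0; row < height; ++row) {
        for (std::size_t col = 0; col < width; ++col) {
            dst[row * dst_width + col] = src[row * width + col];
        }
    }
    return dst;
}
```

Padding a 2x2 matrix `{1, 2, 3, 4}` by one row and one column with 0 gives the 3x3 matrix `{1, 2, 0, 3, 4, 0, 0, 0, 0}`.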
-
- May 22, 2025
-
-
Viet-Hoa Do authored
* Numeric limits report the lowest and highest finite values of F16 and BF16 as 0, which disables testing of all F16 and BF16 kernels with clamping.
* Update numeric limits to report the correct values.
* Update numeric limits so that using an unsupported type results in a compilation error.
Signed-off-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
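One way to get both properties described above (correct limits, and a compilation error for unsupported types) is a traits template whose primary definition cannot be instantiated. The following is a minimal sketch. The type and template names are hypothetical stand-ins for the test framework's own classes, and only the raw bit patterns are stored.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins for the framework's half-precision types.
struct Float16Bits { std::uint16_t bits; };
struct BFloat16Bits { std::uint16_t bits; };

template <typename T>
struct always_false : std::false_type {};

// Primary template: instantiating it for an unsupported type fails to compile.
template <typename T>
struct limits_sketch {
    static_assert(always_false<T>::value, "limits_sketch: unsupported type");
};

// IEEE 754 binary16: largest finite magnitude is 65504 (0x7BFF).
template <>
struct limits_sketch<Float16Bits> {
    static constexpr Float16Bits lowest() { return Float16Bits{0xFBFF}; }
    static constexpr Float16Bits max() { return Float16Bits{0x7BFF}; }
};

// bfloat16: exponent 0xFE with all mantissa bits set (0x7F7F).
template <>
struct limits_sketch<BFloat16Bits> {
    static constexpr BFloat16Bits lowest() { return BFloat16Bits{0xFF7F}; }
    static constexpr BFloat16Bits max() { return BFloat16Bits{0x7F7F}; }
};
```

With this shape, a missing specialization is caught at build time instead of silently yielding 0 for both bounds.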
-
Viet-Hoa Do authored
* The conversion between FP32 and FP16 is part of the base instruction set and does not require FEAT_FP16. The equivalent functions in the test framework need to change to avoid the need for FEAT_FP16. Signed-off-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Approved-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- May 13, 2025
-
-
Emil Ohlsson authored
This commit introduces documentation for the imatmul kernels, with extra focus on the left hand side packing. Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Reviewed-by:
Jakub Sujak <jakub.sujak@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- May 08, 2025
-
-
Anton Bondarenko authored
* Remove duplicate data types and structures
* Clean up unused headers
* Move common CPU feature checkers to cpu_info
Signed-off-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
- May 02, 2025
-
-
Jakub Sujak authored
Introduces a dedicated `Buffer` abstraction for managing blocks of memory. Buffer comes with protection mechanisms that can be enabled by setting the `KAI_TEST_BUFFER_POLICY` environment variable.

Example usage:

```
KAI_TEST_BUFFER_POLICY=PROTECT_OVERFLOW ./kleidiai_test
```

Available memory protection mechanisms:
- `KAI_TEST_BUFFER_POLICY=PROTECT_UNDERFLOW`
- `KAI_TEST_BUFFER_POLICY=PROTECT_OVERFLOW`

If `KAI_TEST_BUFFER_POLICY` is not set or is not one of the above values, then no memory protection mechanisms are enabled and Buffer performs naive malloc() allocation of memory.

When `KAI_TEST_BUFFER_POLICY` is set to one of the above values, the following protections are enabled:
- `PROTECT_UNDERFLOW`: Memory equal to the size of the user buffer rounded to the nearest whole page plus adjacent guard pages is allocated, and the user buffer is aligned to the end of the head guard page thus detecting whenever a buffer underflow occurs.
- `PROTECT_OVERFLOW`: Same as above, but now the edge of the user buffer is aligned to the start of the tail guard page thus detecting whenever a buffer overflow occurs.

Buffer is only intended to opaquely allocate and manage memory. The underlying memory resource can be requested using the familiar `Buffer::data()` method and interacted with using `kai::test::read_array<T>()` and `kai::test::write_array<T>()` utilities.
Signed-off-by:
Jakub Sujak <jakub.sujak@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Jakub Sujak <jakub.sujak@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
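The `PROTECT_OVERFLOW` layout described above can be sketched with POSIX `mmap` and `mprotect`. This is an illustration of the guard page technique only, assuming a POSIX system. The struct and function names are hypothetical, and the real `Buffer` implementation may be organized differently.

```cpp
#include <sys/mman.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>

// Hypothetical types and names, for illustration only.
struct GuardedBlock {
    void* mapping;             // start of the whole mapped region
    std::size_t mapping_size;  // head guard + user pages + tail guard
    std::uint8_t* user;        // start of the user buffer
};

// Places the end of the user buffer right before the tail guard page, so the
// first byte written past the end faults immediately.
inline GuardedBlock alloc_protect_overflow(std::size_t user_size) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t user_pages = (user_size + page - 1) / page * page;  // round up
    const std::size_t total = page + user_pages + page;

    void* mapping = mmap(nullptr, total, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mapping == MAP_FAILED) {
        return GuardedBlock{nullptr, 0, nullptr};
    }

    std::uint8_t* base = static_cast<std::uint8_t*>(mapping);
    // Make the head and tail guard pages inaccessible.
    if (mprotect(base, page, PROT_NONE) != 0 || mprotect(base + page + user_pages, page, PROT_NONE) != 0) {
        munmap(mapping, total);
        return GuardedBlock{nullptr, 0, nullptr};
    }

    return GuardedBlock{mapping, total, base + page + user_pages - user_size};
}

inline void free_guarded(GuardedBlock& block) {
    if (block.mapping != nullptr) {
        munmap(block.mapping, block.mapping_size);
    }
    block = GuardedBlock{nullptr, 0, nullptr};
}
```

An underflow variant would instead place the user buffer at `base + page`, directly after the head guard page.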
-
- Apr 29, 2025
-
-
Jakub Sujak authored
* Fix incorrect calculation of LHS matrix stride value

  For kernels that use the LHS matrix stride in their API, namely `kai_matmul_clamp_f32_f32_f32p8x1biasf32_6x8x4_neon_mla` and `kai_matmul_clamp_f16_f16_f16p16x1biasf16_6x16x8_neon_mla` kernels, the LHS stride value was calculated incorrectly by computing in terms of bits, not bytes.
* Fix insufficient allocation of memory for SME kernels

  For SME kernels, such as `kai_matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla`, the tensor sizes are in terms of the streaming SVE vector length. Thus, when running SME kernels we must scale the LHS/RHS/DST buffer sizes by the VL appropriately. The segmentation faults were discovered when running with address sanitizer enabled.
Signed-off-by:
Jakub Sujak <jakub.sujak@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
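The bits versus bytes mix-up in the stride fix above is easy to show with a small helper. This is a sketch with a hypothetical function name, and it is not the code that was changed in the commit.

```cpp
#include <cstddef>

// Hypothetical helper: row stride in bytes for a row of `row_length` elements,
// each `bits_per_element` wide. Rounds up so sub-byte types are covered too.
constexpr std::size_t row_stride_bytes(std::size_t row_length, std::size_t bits_per_element) {
    return (row_length * bits_per_element + 7) / 8;
}

// A row of 10 F32 values is 40 bytes. Passing 10 * 32 = 320 as a byte stride
// would make a kernel step far past the actual row.
static_assert(row_stride_bytes(10, 32) == 40, "stride must be expressed in bytes");
```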
-
- Apr 25, 2025
-
-
Signed-off-by:
Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- Apr 24, 2025
-
-
Jens Elofsson authored
Update all version indicators to 1.8.0. Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
Related to #9 - allows some tests to run when the CPU doesn't have bfloat16 support, by reducing need for bfloat16 support in the test framework. Use of bfcvt in the test framework is replaced by simple truncation - this will effectively round towards 0. Add a few tests for negative numbers in the test framework BFloat16 code, and allow running the unit tests for the framework code on CPUs without bfloat16 hardware support. Signed-off-by:
Matthew Bentham <Matthew.Bentham@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
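Converting FP32 to BF16 by truncation, as described above, keeps the upper 16 bits of the FP32 bit pattern and drops the rest, which rounds toward zero. A minimal portable sketch follows. The function names are hypothetical and are not the test framework's.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers, for illustration only.
// Keeps sign, exponent and the top 7 mantissa bits. Dropping the low 16 bits
// rounds toward zero for both positive and negative values.
inline std::uint16_t float_to_bf16_truncate(float value) {
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));
    return static_cast<std::uint16_t>(bits >> 16);
}

// Widening back to FP32 is exact: the dropped mantissa bits become zero.
inline float bf16_to_float(std::uint16_t value) {
    const std::uint32_t bits = static_cast<std::uint32_t>(value) << 16;
    float result = 0.0f;
    std::memcpy(&result, &bits, sizeof(result));
    return result;
}
```

For example, `1.00390625f` (1 + 2^-8) truncates to the BF16 pattern for 1.0, and `-1.00390625f` truncates to -1.0, so both move toward zero.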
-
- Adds packing and matmul kernels for FP32 SME Indirect GEMM
- Adds tests for Indirect GEMM with FP32 inputs/outputs.
Signed-off-by:
Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Jakub Sujak <jakub.sujak@arm.com> Reviewed-by:
Mohammed Suhail Munshi <mohammedsuhail.munshi@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
Emil Ohlsson authored
This change bundles some cleanup done during the FP16 IGEMM work. The main change is the reduced output from the comparison function: instead of printing each differing value on a new line, it groups them per matrix row, with some block description. Also, the output is only printed if there is a mismatch. Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
Emil Ohlsson authored
The QAI8 interface doesn't match the interface of the actual kernel. This change changes the interface to align with the kernel, and then enforces the interface alignment using unit testing. Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
- Apr 23, 2025
-
-
Emil Ohlsson authored
This change introduces three new kernels:
* `kai_lhs_imatmul_pack_x16p2vlx2_x16p_sme`
* `kai_rhs_imatmul_pack_kxn_x16p2vlx2b_x16_x16_sme`
* `kai_imatmul_clamp_f16_f16p2vlx2_f16p2vlx2_2vlx2vl_sme2_mopa`

These are used for indirect matmul, for 16-bit floating point data. This change also adds unit testing for the FP16 imatmul kernels. The code is written in a type-agnostic manner, so that other data types can be tested with very low effort. This required the addition of non-templated `read`/`write` functions to allow runtime-generic access.
Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Jakub Sujak <jakub.sujak@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
Emil Ohlsson authored
There are a few cleanups for QAI8 that were discovered while doing FP16 work. This change bundles these cleanups. Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
- Apr 22, 2025
-
-
Emil Ohlsson authored
This change moves the QAI8 static initializations to lazy, C++17-compliant initializations. This change also makes use of kernel interfaces, to make sure they're exercised. Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Jens Elofsson <jens.elofsson@arm.com>
-
- Apr 16, 2025
-
-
Viet-Hoa Do authored
* Our SME kernels are compiled with auto-vectorization disabled, but the linker is unable to respect this flag, which can cause miscompilation and result in illegal instructions. Signed-off-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
- Apr 15, 2025
-
-
Signed-off-by:
Evie Wright <evie.wright@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
Jens Elofsson authored
Update all version indicators to 1.7.0. Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
-
- Apr 14, 2025
-
-
Felix Johnny Thomasmathibalan authored
The get_lhs_offset function pointer type in f32_bf16p_bf16p is fixed to read get_lhs_packed_offset as the LHS is packed as well. Signed-off-by:
Felix Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
- Apr 11, 2025
-
-
Micro-kernels to compute the matrix multiplication of dynamically quantized asymmetric signed 8-bit integer with per-channel quantization (QAI8DX) LHS matrix and quantized symmetric 4-bit signed integer with per-channel quantization (QSI4CX) RHS matrix and the accumulation of the result into a half-precision (F16) output:
- Matrix multiplication (MxN) Micro-kernels of QAI8DX LHS and QSI4CX RHS with F16 output, optimized for FEAT_I8MM and FEAT_DotProd.
- Matrix multiplication (1xN) Micro-kernels of QAI8DX LHS and QSI4CX RHS with F16 output, optimized for FEAT_DotProd.
Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Signed-off-by:
Evie Wright <evie.wright@arm.com> Approved-by:
Viet-Hoa Do <viet-hoa.do@arm.com>
-
- Apr 10, 2025
-
-
Jens Elofsson authored
Static initialization has no guaranteed order, which may cause the test listing to be initialized before the list of kernels. This fixes the following unit tests:
- matmul_clamp_f32_bf16p_bf16p_test
- matmul_clamp_f16_bf16p_bf16p_test
Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
Viet-Hoa Do authored
* Writing to one union member and reading the other union member is considered undefined behavior, therefore we need to avoid it. Signed-off-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
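A common C++17 way to reinterpret bits without reading an inactive union member is to copy the object representation with `std::memcpy`. The helper below is a minimal sketch of that technique. The name `bit_cast_sketch` is hypothetical, and the commit may have used a different approach.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper for illustration. Copies the bytes of `from` into a new
// object of type To, which is well defined for trivially copyable types.
template <typename To, typename From>
To bit_cast_sketch(const From& from) {
    static_assert(sizeof(To) == sizeof(From), "source and destination sizes must match");
    static_assert(std::is_trivially_copyable<To>::value, "To must be trivially copyable");
    static_assert(std::is_trivially_copyable<From>::value, "From must be trivially copyable");

    To to;
    std::memcpy(&to, &from, sizeof(To));
    return to;
}
```

For example, `bit_cast_sketch<std::uint32_t>(1.0f)` yields `0x3F800000`, the IEEE 754 bit pattern of 1.0. Compilers typically turn the `memcpy` into a plain register move.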
-
- Apr 09, 2025
-
-
Micro-kernels to compute the matrix multiplication of dynamically quantized symmetric signed 8-bit integer with per-block quantization (QSI8D32) LHS matrix and quantized asymmetric 4-bit signed integer with per-block quantization (QAI4C32) RHS matrix and the accumulation of the result into a single-precision (F32) and half-precision (F16) output:
- Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_I8MM.
- Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_DotProd.
- Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_I8MM.
- Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_DotProd.
Signed-off-by:
Anitha Raj <anitha.raj@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Anitha Raj <anitha.raj@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
-
Emil Ohlsson authored
This change introduces three new kernels:
* kai_imatmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa
* kai_lhs_imatmul_pack_x8p2vlx4_x8p_sme
* kai_rhs_imatmul_pack_kxn_qsi8cxp2vlx4sb_qs8cx_f32_i32_sme

These kernels are used for _indirect matmul_. The big difference between these kernels and matmul kernels is that the LHS packing kernel takes an indirection buffer where each pointer refers to a chunk in the K dimension. The pointers are laid out in a packed manner: instead of being in row-major order, a column of `get_m_step` chunk pointers is placed linearly in the indirection buffer.

In addition to the kernels themselves, `matmul_clamp_qai8_qai8p_qsi8cxp_test.cpp` is extended to test these new kernels. The testing flow for these new kernels is a bit different, in that the packing kernels themselves are not directly tested. Instead, only the end-to-end flow is tested.
Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Signed-off-by:
Felix Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com> Signed-off-by:
Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com> Reviewed-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Jakub Sujak <jakub.sujak@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
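The packed indirection layout described above can be sketched as a reordering of a row-major pointer table. This is one interpretation of the description, not the kernel's documented contract. The function name, the `pad_ptr` handling for partial blocks and the parameter names are all assumptions, with `m_step` standing in for the value returned by `get_m_step`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration; not part of the KleidiAI API.
// row_major[row * k_chunk_count + chunk] points at chunk `chunk` of LHS row `row`.
// The result stores, for each block of m_step rows and each chunk, the m_step
// pointers of that chunk column next to each other.
std::vector<const void*> pack_indirection_sketch(
    const std::vector<const void*>& row_major, std::size_t m, std::size_t k_chunk_count,
    std::size_t m_step, const void* pad_ptr) {
    assert(m_step > 0 && row_major.size() == m * k_chunk_count);

    const std::size_t m_blocks = (m + m_step - 1) / m_step;
    std::vector<const void*> packed;
    packed.reserve(m_blocks * k_chunk_count * m_step);

    for (std::size_t block = 0; block < m_blocks; ++block) {
        for (std::size_t chunk = 0; chunk < k_chunk_count; ++chunk) {
            for (std::size_t r = 0; r < m_step; ++r) {
                const std::size_t row = block * m_step + r;
                // Assumption: rows past the end of the matrix use a padding pointer.
                packed.push_back(row < m ? row_major[row * k_chunk_count + chunk] : pad_ptr);
            }
        }
    }
    return packed;
}
```

With 3 rows, 2 chunks and `m_step` of 2, the first block holds rows 0 and 1 for chunk 0, then rows 0 and 1 for chunk 1. The second block holds row 2 followed by a padding entry for each chunk.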
-
- Apr 08, 2025
-
-
Emil Ohlsson authored
The order of static initializations is not guaranteed, which can cause the test listing to be initialized before the list of kernels. This can be solved by lazily initializing kernel lists on first use. This patch applies this fix to `matmul_test.cpp`. Signed-off-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Anton Bondarenko <anton.bondarenko@arm.com>
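Lazy initialization on first use is commonly written with a function-local static, which C++ initializes the first time control passes through its declaration. The sketch below uses hypothetical names and placeholder kernel names. It is not the code from `matmul_test.cpp`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical example. The list is built on the first call, so any caller,
// including another static initializer, always sees a fully constructed list.
inline const std::vector<std::string>& kernel_names() {
    static const std::vector<std::string> names = {"kernel_a", "kernel_b"};
    return names;
}

// Stand-in for test listing code that depends on the kernel list.
inline std::size_t count_test_cases() {
    return kernel_names().size() * 2;  // e.g. two shapes per kernel
}

// Safe even if this initializer runs before others in different translation
// units, because it goes through kernel_names() instead of a global object.
static const std::size_t test_case_count = count_test_cases();
```

Had `names` been a namespace-scope object in another translation unit, `test_case_count` could have been computed from a list that was not yet constructed.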
-
- Apr 03, 2025
-
-
Jens Elofsson authored
- Remove designated initializers for matmul_clamp_f32_qai8dxp_qsi4cxp_test to comply with C++17 standard. Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Approved-by:
Viet-Hoa Do <viet-hoa.do@arm.com>
-
- Apr 02, 2025
-
-
Viet-Hoa Do authored
* FP16 and BF16 classes are implemented in assembly so the rest of the test framework doesn't need to be compiled with FP16 and BF16 support anymore. This allows the tests to be run on systems with only the base architecture.
* Remove unnecessary feature guard in kernel header file. The user of our API must not need to compile their code with BF16 support.
Signed-off-by:
Viet-Hoa Do <viet-hoa.do@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
- Apr 01, 2025
-
-
Jens Elofsson authored
Removes designated initializers from:
- matmul_test.cpp
- matmul_clamp_f16_bf16p_bf16p_test.cpp
- matmul_clamp_f32_bf16p_bf16p_test.cpp

Following changes are made to the test framework:
- Added default value to data_type in DataFormats constructor
- Initialize members of struct MatMulMethod
- Add '-Wpedantic' as a build flag to the affected unit tests
Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Reviewed-by:
Anton Bondarenko <anton.bondarenko@arm.com> Reviewed-by:
Emil Ohlsson <emil.ohlsson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
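Designated initializers such as `{.name = "x"}` are a C++20 feature, so C++17 code needs another way to set only some members. One option, sketched here, is default member initializers plus plain member assignment. The struct and its fields are hypothetical and much smaller than the real `MatMulMethod`.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical struct for illustration. Every member has a default, so a
// default-constructed object is fully initialized.
struct MatMulMethodSketch {
    std::string_view name = "";
    std::size_t m0 = 0;
    std::size_t n0 = 0;
    bool lhs_transposed = false;
};

// C++17-compliant replacement for `MatMulMethodSketch{.name = ..., .m0 = ...}`.
inline MatMulMethodSketch make_example_method() {
    MatMulMethodSketch method;  // starts from the member defaults
    method.name = "example_6x8";
    method.m0 = 6;
    method.n0 = 8;
    return method;
}
```

Members that are not assigned keep their defaults, which is the behavior designated initializers would have given.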
-
Jakub Sujak authored
Although `hw.optional.AdvSIMD` is the replacement for `hw.optional.neon`, this parameter is not always present in different versions of the OS. This may lead to the test suite crashing or tests being erroneously skipped. Instead, we check if the machine supports `hw.optional.arm64` and, if true, we can assume Advanced SIMD support is always present. Signed-off-by:
Jakub Sujak <jakub.sujak@arm.com> Approved-by:
Viet-Hoa Do <viet-hoa.do@arm.com>
-
- Mar 26, 2025
-
-
Jens Elofsson authored
Update all version indicators to 1.6.0. Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
Jens Elofsson authored
Signed-off-by:
Jens Elofsson <jens.elofsson@arm.com> Approved-by:
Jakub Sujak <jakub.sujak@arm.com>
-
Optimizes this RHS packing by vectorizing the XOR operation. This is done for segment lengths of 4 or 8 bytes. The unoptimized path is used for any other segment length. Signed-off-by:
Dan Johansson <dan.johansson@arm.com> Approved-by:
Emil Ohlsson <emil.ohlsson@arm.com>
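The idea of applying XOR to several bytes at once, with a scalar fallback, can be sketched in portable C++ by operating on 8-byte words. This is an illustration only: the actual kernel presumably uses vector instructions, and the function name and the mask value are assumptions that do not come from the commit.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch. XORs every byte of `data` with `mask`, handling
// 8 bytes per step where possible and the remainder one byte at a time.
inline void xor_bytes_sketch(std::uint8_t* data, std::size_t length, std::uint8_t mask) {
    // Replicate the mask byte into all 8 bytes of a 64-bit word.
    const std::uint64_t wide_mask = 0x0101010101010101ULL * mask;

    std::size_t i = 0;
    for (; i + 8 <= length; i += 8) {
        std::uint64_t word = 0;
        std::memcpy(&word, data + i, sizeof(word));  // safe unaligned load
        word ^= wide_mask;
        std::memcpy(data + i, &word, sizeof(word));
    }
    for (; i < length; ++i) {
        data[i] ^= mask;  // scalar tail
    }
}
```

Because XOR acts on each bit independently, the wide path and the scalar path produce identical bytes, so the split is purely a performance choice.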
-