May 02, 2025
Smarter memory protection with Buffer class · 73d1e85d
Jakub Sujak authored
      
      
      Introduces a dedicated `Buffer` abstraction for managing blocks of memory. Buffer comes with protection mechanisms that can be enabled by setting the `KAI_TEST_BUFFER_POLICY` environment variable.
      
      Example usage:
      
```
KAI_TEST_BUFFER_POLICY=PROTECT_OVERFLOW ./kleidiai_test
```
      
      Available memory protection mechanisms:
      
      - `KAI_TEST_BUFFER_POLICY=PROTECT_UNDERFLOW`
      - `KAI_TEST_BUFFER_POLICY=PROTECT_OVERFLOW`
      
If `KAI_TEST_BUFFER_POLICY` is unset, or is set to any other value, no memory protection mechanism is enabled and `Buffer` falls back to a plain `malloc()` allocation.
      
      When `KAI_TEST_BUFFER_POLICY` is set to one of the above values, the following protections are enabled:
      
- `PROTECT_UNDERFLOW`: The allocation is the user buffer size rounded up to a whole number of pages, plus adjacent guard pages. The start of the user buffer is aligned to the end of the head guard page, so any buffer underflow touches the guard page and is detected.
- `PROTECT_OVERFLOW`: Same as above, except the end of the user buffer is aligned to the start of the tail guard page, so any buffer overflow is detected.
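As a rough illustration of how such guard pages can be set up, here is a minimal sketch of the `PROTECT_OVERFLOW` case using POSIX `mmap`/`mprotect`; it shows the technique only, not the actual `Buffer` implementation:

```
// Sketch of PROTECT_OVERFLOW-style allocation; illustrative only.
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

void* allocate_with_tail_guard(std::size_t user_size) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t rounded = (user_size + page - 1) / page * page;  // round up to whole pages
    // Reserve head guard page + user pages + tail guard page.
    auto* base = static_cast<std::uint8_t*>(
        mmap(nullptr, rounded + 2 * page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    if (base == MAP_FAILED) return nullptr;
    mprotect(base, page, PROT_NONE);                   // head guard page
    mprotect(base + page + rounded, page, PROT_NONE);  // tail guard page
    // Align the end of the user buffer with the start of the tail guard page,
    // so the first out-of-bounds write hits the guard page and faults.
    return base + page + (rounded - user_size);
}
```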
      
Buffer is only intended to opaquely allocate and manage memory. The underlying memory can be accessed through the familiar `Buffer::data()` method and read or written with the `kai::test::read_array<T>()` and `kai::test::write_array<T>()` utilities.
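A hedged usage sketch, assuming the API shapes described above; the exact header location and accessor signatures in the test framework may differ:

```
#include <cstddef>
// #include "test/common/buffer.hpp"  // assumed location of Buffer and the
//                                    // read_array/write_array helpers

void example(std::size_t m, std::size_t k) {
    kai::test::Buffer lhs(m * k * sizeof(float));                  // opaque allocation
    kai::test::write_array<float>(lhs.data(), /*index=*/0, 1.0f);  // typed write
    const float first =
        kai::test::read_array<float>(lhs.data(), /*index=*/0);     // typed read
    (void)first;
}
```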
      
Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>
      
Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>
Apr 29, 2025
Fix segmentation faults in benchmark tool · 49e0f869
Jakub Sujak authored
      
      
* Fix incorrect calculation of the LHS matrix stride value

For kernels that take the LHS matrix stride in their API, namely the
`kai_matmul_clamp_f32_f32_f32p8x1biasf32_6x8x4_neon_mla` and
`kai_matmul_clamp_f16_f16_f16p16x1biasf16_6x16x8_neon_mla` kernels, the
LHS stride value was incorrectly computed in bits rather than bytes, as
illustrated below.
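A hypothetical before/after showing the bug class (illustrative values, not the actual benchmark source):

```
#include <cstddef>

// For an f32 LHS matrix with k columns per row:
std::size_t lhs_stride_wrong(std::size_t k) {
    return k * sizeof(float) * 8;  // counts bits: 8x larger than the API expects
}
std::size_t lhs_stride_fixed(std::size_t k) {
    return k * sizeof(float);      // counts bytes: the row stride the kernels take
}
```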
      
* Fix insufficient allocation of memory for SME kernels

For SME kernels, such as
`kai_matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla`, the tensor sizes
are expressed in terms of the streaming SVE vector length (VL). When
running SME kernels, the LHS/RHS/DST buffer sizes must therefore be
scaled by the VL, as in the sketch below.
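A sketch of the fix pattern; `get_sme_vl_bytes()` is a hypothetical stand-in for however the benchmark queries the streaming vector length at runtime:

```
#include <cstddef>

std::size_t get_sme_vl_bytes();  // assumed helper: streaming SVE VL in bytes

// For a "1x16vl" f32 kernel, one output tile spans 16 vectors, so any size
// derived from the kernel's step values must be scaled by the runtime VL.
std::size_t dst_size_bytes(std::size_t m, std::size_t n) {
    const std::size_t lanes = get_sme_vl_bytes() / sizeof(float);     // f32 lanes per vector
    const std::size_t n_step = 16 * lanes;                            // tile width in elements
    const std::size_t n_padded = (n + n_step - 1) / n_step * n_step;  // round n up to tiles
    return m * n_padded * sizeof(float);
}
```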
      
      The segmentation faults were discovered when running with address
      sanitizer enabled.
      
Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>

Reviewed-by: Emil Ohlsson <emil.ohlsson@arm.com>
Approved-by: Emil Ohlsson <emil.ohlsson@arm.com>
Apr 11, 2025
      Matmul Micro-kernels F16<-(QAI8DX) LHS x (QSI4CX) RHS · 315ed95c
Anitha Raj authored and Viet-Hoa Do committed
      
      
Micro-kernels to compute the matrix multiplication of a dynamically quantized, asymmetric, signed 8-bit integer LHS matrix with per-channel quantization (QAI8DX) and a quantized, symmetric, signed 4-bit integer RHS matrix with per-channel quantization (QSI4CX), accumulating the result into a half-precision floating-point (F16) destination:
      
- Matrix multiplication (MxN) micro-kernels of QAI8DX LHS and QSI4CX RHS with F16 output, optimized for FEAT_I8MM and FEAT_DotProd.
- Matrix multiplication (1xN) micro-kernels of QAI8DX LHS and QSI4CX RHS with F16 output, optimized for FEAT_DotProd.
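As a plain scalar reference for what these micro-kernels compute, here is a sketch of the quantization semantics for one output element; this is illustrative only (names, the unpacked int4 storage, and the `_Float16` type are assumptions), not the optimized kernel code:

```
#include <cstddef>
#include <cstdint>

// One output element: a QAI8DX LHS row times a QSI4CX RHS column.
// QAI8DX carries a per-channel scale and zero point; QSI4CX is symmetric, so
// its per-channel scale has no zero point. The int4 RHS values are assumed
// here to be already unpacked and sign-extended into int8 storage.
_Float16 matmul_elem(const std::int8_t* lhs_row, float lhs_scale, std::int32_t lhs_zero_point,
                     const std::int8_t* rhs_col, float rhs_scale, std::size_t k) {
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < k; ++i) {
        acc += (static_cast<std::int32_t>(lhs_row[i]) - lhs_zero_point) *
               static_cast<std::int32_t>(rhs_col[i]);
    }
    // Dequantize with both scales, then narrow to the F16 destination type.
    return static_cast<_Float16>(static_cast<float>(acc) * lhs_scale * rhs_scale);
}
```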
      
Signed-off-by: Anitha Raj <anitha.raj@arm.com>

Signed-off-by: Evie Wright <evie.wright@arm.com>

Approved-by: Viet-Hoa Do <viet-hoa.do@arm.com>