CHANGELOG.md

# Changelog

All notable changes to the Arm RAN Acceleration Library (ArmRAL) project will be
documented in this file.

## [Unreleased]

### Added

### Changed

### Deprecated

### Removed

### Fixed

### Security

## [25.01] - 2025-01-23

### Added

- Added the functions `armral_turbo_decode_batch`, and
  `armral_turbo_decode_batch_noalloc`. These functions implement a maximum a
  posteriori (MAP) algorithm to decode the output of the LTE Turbo encoding
  scheme on a batch of encoded data.

- Added the function `armral_turbo_decode_batch_noalloc_buffer_size` which
  returns the size of buffer required for `armral_turbo_decode_batch_noalloc`.

### Changed

- Updated all copyright headers, and the text in
  [LICENSE.md](https://gitlab.arm.com/networking/ral/-/blob/main/LICENSE.md),
  to include the `BSD-3-Clause` SPDX License Identifier.

- Improved Neon and SVE performance of `armral_fft_execute_cf32` and
  `armral_fft_execute_cs16`.

- The LTE Turbo coding Additive White Gaussian Noise (AWGN) simulation now
  supports the decoding of batches of data, using `armral_turbo_decode_batch`.
  The number of batches is specified using the flag "`-b <n>`".

- FFT lengths up to 42012 are now supported, although lengths greater
  than 4096 are mostly untested.

### Removed

- Unused FFT kernels have been removed.

### Fixed

- Improved error correction of LDPC decoding (`armral_ldpc_decode_block`) in
  the presence of channel noise. The function now uses 16-bit signed integers
  internally rather than 8-bit signed integers. This may result in decreased
  performance.

- The arguments to the function `armral_turbo_decode_block_noalloc_buffer_size`
  have been changed to remove the unused second argument, `max_iter`.

- When planning FFTs with an unsupported length, `armral_fft_create_plan_cf32`
  and `armral_fft_create_plan_cs16` now return `ARMRAL_ARGUMENT_ERROR`.

## [24.10] - 2024-10-17

### Added

- Added the function `armral_turbo_perm_idx_init` which generates all
  permutation indices used in the permutation step of LTE Turbo decoding.

- Added the function `armral_cmplx_matmul_i16_noalloc` which multiplies two
  matrices of complex Q15 values using a 64-bit Q32.31 accumulator. This
  function does not call any system memory allocators, unlike the existing
  `armral_cmplx_matmul_i16` function.

### Changed

- The interfaces for `armral_turbo_decode_block` and
  `armral_turbo_decode_block_noalloc` now have an additional argument. They now
  include the option to supply a user-allocated buffer which, if used, must be
  initialized with permutation indices by calling
  `armral_turbo_perm_idx_init`. This buffer can then be reused in subsequent
  calls to the Turbo decoding functions and will improve their performance by
  removing the need to compute the indices on each call. If the buffer is not
  initialized and a null pointer is passed instead, the functions will recompute
  the permutation indices on every call.

- Improved performance of `armral_fft_execute_cf32` and
  `armral_fft_execute_cs16`. Cases which were calculated using recursive calls
  to Rader's algorithm are now calculated using Bluestein's algorithm.

### Fixed

- Fixed performance regressions in the SVE versions of the following routines:

  - `armral_cmplx_vecdot_f32`
  - `armral_cmplx_vecmul_f32_2`

## [24.07] - 2024-07-18

### Added

- CMake option `ARMRAL_ENABLE_WEXTRA` to add the compiler flag `-Wextra` when
  building the library and tests.

### Changed

- Documentation is now installed by the `make install` target, if it has been
  built.

- Improved performance of `armral_cmplx_matmul_f32`. For complex 32-bit floating
  point matrix multiplication, we recommend you use this function for all
  cases. This function calls existing optimized special cases with minimal
  overhead and has new optimizations for larger cases.

- Improved performance of `armral_turbo_decode_block` and
  `armral_turbo_decode_block_noalloc`. These functions now operate internally on
  16-bit integer values rather than 16-bit or 32-bit floating point values.

- The following functions now use unsigned integers in their interfaces to
  represent the lengths of vectors and the dimensions of matrices:

  - `armral_cmplx_vecdot_f32`
  - `armral_cmplx_vecdot_f32_2`
  - `armral_cmplx_vecdot_i16`
  - `armral_cmplx_vecdot_i16_2`
  - `armral_cmplx_vecdot_i16_32bit`
  - `armral_cmplx_vecdot_i16_2_32bit`
  - `armral_cmplx_vecmul_f32`
  - `armral_cmplx_vecmul_f32_2`
  - `armral_cmplx_vecmul_i16`
  - `armral_cmplx_vecmul_i16_2`
  - `armral_corr_coeff_i16`
  - `armral_svd_cf32`
  - `armral_svd_cf32_noalloc`
  - `armral_svd_cf32_noalloc_buffer_size`

- Renamed `armral_cmplx_mat_mult_aah_f32` to be `armral_cmplx_matmul_aah_f32`.
  All arguments are in the same order and have the same meaning.

- Replaced `armral_cmplx_mat_mult_ahb_f32` with `armral_cmplx_matmul_ahb_f32`.
  Note that the meanings of the parameters `m`, `n`, and `k` differ between the
  old function and the new; a call to the old function of the form

    `armral_cmplx_mat_mult_ahb_f32(dim1, dim2, dim3, a, b, c);`

    becomes

    `armral_cmplx_matmul_ahb_f32(dim2, dim3, dim1, a, b, c);`

- Replaced `armral_cmplx_mat_mult_i16` with `armral_cmplx_matmul_i16`.  Note
  that the meanings of the parameters `m`, `n`, and `k` differ between the old
  function and the new; a call to the old function of the form

    `armral_cmplx_mat_mult_i16(dim1, dim2, dim3, a, b, c);`

    becomes

    `armral_cmplx_matmul_i16(dim1, dim3, dim2, a, b, c);`

- Replaced `armral_cmplx_mat_mult_i16_32bit` with
  `armral_cmplx_matmul_i16_32bit`.  Note that the meanings of the parameters
  `m`, `n`, and `k` differ between the old function and the new; a call to the
  old function of the form

    `armral_cmplx_mat_mult_i16_32bit(dim1, dim2, dim3, a, b, c);`

    becomes

    `armral_cmplx_matmul_i16_32bit(dim1, dim3, dim2, a, b, c);`

- Replaced `armral_cmplx_matmul_f32` with `armral_cmplx_matmul_f32`.  Note that
  the meanings of the parameters `m`, `n`, and `k` differ between the old
  function and the new; a call to the old function of the form

    `armral_cmplx_mat_mult_f32(dim1, dim2, dim3, a, b, c);`

    becomes

    `armral_cmplx_matmul_f32(dim1, dim3, dim2, a, b, c);`

### Fixed

- Corrected documentation for `armral_cmplx_mat_inverse_batch_f32` and
  `armral_cmplx_mat_inverse_batch_f32_pa` to clarify that these functions have
  no restriction on batch sizes.

## [24.04] - 2024-04-19

### Added

- Makefile target `bench_excel_summary` to run the benchmarks and create an
  Excel spreadsheet containing the results.

### Changed

- Moved `license_terms/BSD-3-Clause.txt` and
  `license_terms/third_party_licenses.txt` to
  [LICENSE.md](https://gitlab.arm.com/networking/ral/-/blob/main/LICENSE.md) and
  [THIRD_PARTY_LICENSES.md](https://gitlab.arm.com/networking/ral/-/blob/main/THIRD_PARTY_LICENSES.md)
  respectively.

- Extended `armral_cmplx_pseudo_inverse_direct_f32` and
  `armral_cmplx_pseudo_inverse_direct_f32_noalloc` to compute the regularized
  pseudo-inverse of a single complex 32-bit matrix of size `M-by-N` for the case
  where `M` and/or `N` == 1.

- Improved SVE2 performance of `armral_turbo_decode_block` and
  `armral_turbo_decode_block_noalloc`.

- Improved SVE2 performance of `armral_ldpc_encode_block` and
  `armral_ldpc_encode_block_noalloc`.

## [24.01] - 2024-01-19

### Changed

- Extended `armral_cmplx_pseudo_inverse_direct_f32` and
  `armral_cmplx_pseudo_inverse_direct_f32_noalloc` to compute the regularized
  pseudo-inverse of a single complex 32-bit matrix of size `M-by-N` for cases
  where `M > N` in addition to the cases where `M <= N`.

- Improved performance of `armral_turbo_decode_block` and
  `armral_turbo_decode_block_noalloc`.

- Improved SVE2 performance of `armral_seq_generator`, for the cases when
  `sequence_len` is not a multiple of 64.

### Fixed

- LDPC block encoding (`armral_ldpc_encode_block`), rate matching
  (`armral_ldpc_rate_matching`) and rate recovery (`armral_ldpc_rate_recovery`),
  and the corresponding channel simulator, now support the insertion and removal
  of filler bits as described in the 3GPP Technical Specification (TS) 38.212.
  From [@Suraj4g5g](https://gitlab.arm.com/Suraj4g5g).

## [23.10] - 2023-10-06

### Changed

- Extended the `sequence_len` parameter of `armral_seq_generator` to `uint32_t`.
  From [@Suraj4g5g](https://gitlab.arm.com/Suraj4g5g).

- Added parameter `i_bil` to `armral_polar_rate_matching` and
  `armral_polar_rate_recovery` to enable or disable bit interleaving. From
  [@Suraj4g5g](https://gitlab.arm.com/Suraj4g5g).

- Added parameter `nref` to `armral_ldpc_rate_matching` and
  `armral_ldpc_rate_recovery` to enable the functions to be used with a soft
  buffer size. From [@Suraj4g5g](https://gitlab.arm.com/Suraj4g5g).

- Added parameter nref to `armral_ldpc_rate_matching` and
  `armral_ldpc_rate_recovery` to enable the functions to be used with a soft
  buffer size. From [@Suraj4g5g](https://gitlab.arm.com/Suraj4g5g).

- Improved Neon performance of Polar block decoding
  (`armral_polar_decode_block`) for list sizes 1, 2, 4 and 8.

- Improved Neon performance of LDPC block decoding (`armral_ldpc_decode_block`
  and `armral_ldpc_decode_block_noalloc`).

- Simulation programs are now built by default and are tested by the make check
  target.

## [23.07] - 2023-07-07

### Added

- New function to compute the regularized pseudo-inverse of a single complex
  32-bit floating-point matrix (`armral_cmplx_pseudo_inverse_direct_f32`).

- New function to compute the multiplication of a complex 32-bit floating-point
  matrix with its conjugate transpose (`armral_cmplx_mat_mult_aah_f32`).

- New function to compute the complex 32-bit floating-point multiplication of
  the conjugate transpose of a matrix with a matrix
  (`armral_cmplx_mat_mult_ahb_f32`).

- Variants of existing functions which take a pre-allocated buffer rather than
  performing memory allocations internally. For functions where the buffer size
  is not easily calculated from the input parameters, helper functions to
  calculate the required size have been provided.

- Neon-optimized implementation of batched complex 32-bit floating-point
  matrix-vector multiplication (`armral_cmplx_mat_vec_mult_batch_f32`).

- SVE2-optimized implementation of complex 32-bit floating-point general matrix
  inverse for matrices of size `2x2`, `3x3` and `4x4`
  (`armral_cmplx_mat_inverse_f32`).

### Changed

- Improved Neon and SVE2 performance of Mu Law compression
  (`armral_mu_law_compr_8bit`, `armral_mu_law_compr_9bit`, and
  `armral_mu_law_compr_14bit`).

- Improved Neon performance of 8-bit block float compression
  (`armral_block_float_compr_8bit`).

- Improved SVE2 performance of 9-bit block scaling decompression
  (`armral_block_scaling_decompr_9bit`).

- Improved SVE2 performance of 14-bit block scaling decompression
  (`armral_block_scaling_decompr_14bit`).

- Improved SVE2 performance of 8-bit and 12-bit block float compression
  (`armral_block_float_compr_8bit` and `armral_block_float_compr_12bit`).

- Moved the definition of the symbol rate out of the `ebn0_to_snr` function
  (`simulation/awgn/awgn.cpp`) so that it is now a parameter that gets passed in
  by each of the simulation programs.

- Updated the `convolutional_awgn` simulation program to use OpenMP
  (`simulation/convolutional_awgn/convolutional_awgn.cpp`).

- Updated simulation programs to accept a path to write graphs to, instead of
  auto-generating filenames.

- Added the maximum number of iterations to the output of the Turbo simulation
  program (`simulation/turbo_awgn/turbo_error_rate.py`).

- Updated formatting of labels in simulation graph legends.

### Fixed

- Removed bandwidth scaling in all simulation programs so that the maximum
  spectral efficiency does not exceed the number of bits per symbol.

- Convolutional decoding algorithm
  (`armral_tail_biting_convolutional_decode_block`) now returns correct results
  for input lengths greater than 255.

- Test file for convolutional decoding (`test/ConvCoding/decoding/main.cpp`) is
  updated so that the tests pass as expected for input lengths which are not a
  multiple of 4.

- Neon block float decompression functions (`armral_block_float_decompr_8bit`,
  `armral_block_float_decompr_9bit`, `armral_block_float_decompr_12bit`, and
  `armral_block_float_decompr_14bit`) now truncate values before storing rather
  than rounding them. This means the Neon implementations of these functions now
  have the same behavior as the SVE implementations.

- Neon block scaling decompression functions.
  (`armral_block_scaling_decompr_8bit`, `armral_block_scaling_decompr_9bit`, and
  `armral_block_scaling_decompr_14bit`) now truncate values before storing
  rather than rounding them. This means the Neon implementations of these
  functions now have the same behavior as the SVE implementations.

## [23.04] - 2023-04-21

### Added

- Cyclic Redundancy Check (CRC) attachment function
  (`armral_polar_crc_attachment`) for Polar codes, described in section 5.2.1 of
  the 3GPP Technical Specification (TS) 38.212.

- CRC function to check the validity of the output(s) of Polar decoding
  (`armral_check_crc_polar`).

- New simulation program `modulation_awgn` which plots the error rate versus
  Eb/N0 (or signal-to-noise ratio (SNR)) of taking a hard demodulation decision
  for data sent over a noisy channel with no forward error correction.

- Added a field called `snr` to the JSON output of all simulation programs,
  which stores the signal-to-noise ratio.

- Added a flag called `x-unit` to all plotting scripts which allows the user to
  choose whether Eb/N0 or SNR is plotted on the x-axis.

- Added CRC attachment and check in Polar codes simulation.

### Changed

- Updated [license terms]
  (https://gitlab.arm.com/networking/ral/-/blob/main/license_terms/BSD-3-Clause.txt)
  to BSD-3-Clause.

- Updated Polar decoding (`armral_polar_decode_block`) to accept a list size of
  8.

- LDPC decoding (`armral_ldpc_decode_block`) can optionally make use of attached
  CRC information to terminate iteration early in the case that a match is
  found.

- Improved Neon performance of tail biting convolutional encoder for LTE
  (`armral_tail_biting_convolutional_encode_block`).

- Improved Neon performance of tail biting convolutional decoder for LTE
  (`armral_tail_biting_convolutional_decode_block`).

### Fixed

- Calculation of the encoded data length in the LDPC simulation program
  (`armral/simulation/ldpc_awgn/ldpc_error_rate.py`) is updated to match that
  used in ArmRAL.

- Graphs generated from results of simulation programs in the simulation
  directory no longer plot Shannon limits and theoretical maxima versus block
  error rates. Shannon limits and theoretical maxima continue to be plotted for
  bit error rates.

## [23.01] - 2023-01-27

### Added

- Rate matching for Turbo coding (`armral_turbo_rate_matching`). This implements
  the operations in section 5.1.4.1 of the 3GPP Technical Specification (TS)
  36.212.

- Rate recovery for Turbo coding (`armral_turbo_rate_recovery`). This implements
  the inverse operations of rate matching. Rate matching is described in section
  5.1.4.1 of the 3GPP Technical Specification (TS) 36.212.

- Tail-biting convolutional encoder for LTE
  (`armral_tail_biting_convolutional_encode_block`).

- Tail-biting convolutional decoder for LTE
  (`armral_tail_biting_convolutional_decode_block`).

- Scrambling for Physical Uplink Control Channels (PUCCH) formats 2, 3 and 4,
  Physical Downlink Shared Channel (PDSCH), Physical Downlink Control Channel
  (PDCCH), and Physical Broadcast Channel (PBCH) (`armral_scramble_code_block`).
  This covers scrambling as described in 3GPP Technical Specification (TS)
  38.211, sections 6.3.2.5.1, 6.3.2.6.1, 7.3.1.1, 7.3.2.3, and 7.3.3.1.

- Simulation program for LTE tail-biting convolutional coding
  (`armral/simulation/convolutional_awgn`).

- Python script that allows users to draw the data rates of each modulation and
  compare them to the capacity of the AWGN channel
  (`armral/simulation/capacity/capacity.py`).

- SVE2-optimized implementation of complex 32-bit floating point matrix-vector
  multiplication (`armral_cmplx_mat_vec_mult_f32`).

- SVE2-optimized implementation of 14-bit block scaling decompression
  (`armral_block_scaling_decompr_14bit`).

### Changed

- Modified error rate Python scripts (under `armral/simulation`) to use Eb/N0 as
  x-axis (instead of the SNR) and to show the Shannon limits.

- Added Turbo rate matching and recovery to the Turbo simulation program
  (`armral/simulation/turbo_awgn/turbo_awgn.cpp`).

- Improved Neon performance of block-float decompression for 9-bit and 14-bit
  block-float representations. (`armral_block_float_decompr_9bit` and
  `armral_block_float_decompr_14bit`).

- Improved Neon performance of complex 32-bit floating point matrix-vector
  multiplication (`armral_cmplx_mat_vec_mult_f32`).

- Improved Neon performance of Gold sequence generator (`armral_seq_generator`).

- Improved Neon performance of general matrix inversion
  (`armral_cmplx_mat_inverse_f32`).

- Improved Neon performance of batched general matrix inversion
  (`armral_cmplx_mat_inverse_batch_f32`).

### Fixed

- Documentation of the interface for Polar rate recovery
  (`armral_polar_rate_recovery`) updated to reflect how the parameters are used
  in the implementation.

## [22.10] - 2022-10-07

### Added

- SVE2-optimized implementations of `2x2` and `4x4` matrix multiplication
  functions where in-phase and quadrature components are separated
  (`armral_cmplx_mat_mult_2x2_f32_iq` and `armral_cmplx_mat_mult_4x4_f32_iq`).

### Changed

- The program to evaluate the error-correction performance of Polar coding in
  the presence of additive white Gaussian noise (AWGN) located in
  `simulation/polar_awgn` is updated to no longer take the length of a code
  block as a parameter.

- Improved the Neon and SVE2 performance of LDPC encoding for a single code
  block (`armral_ldpc_encode_block`).

- Improved the Neon performance of Turbo decoding for a single code block
  (`armral_turbo_decode_block`).

- Improved the Neon performance of Turbo encoding for a single code block
  (`armral_turbo_encode_block`).

- Improved the Neon performance of 32-bit floating point general matrix
  inversion (`armral_cmplx_mat_inverse_f32`).

- Improved the Neon performance of 32-bit floating point batch general matrix
  inversion (`armral_cmplx_mat_inverse_batch_f32` and
  `armral_cmplx_mat_inverse_batch_f32_pa`).

### Fixed

- The Turbo coding simulation program now builds when performing an SVE build of
  the library.

## [22.07] - 2022-07-15

### Added

- SVE2-optimized implementation of equalization with four subcarriers
  (`armral_solve_*x*_4sc_f32`).

- Matrix-vector multiplication functions for batches of 32-bit complex
  floating-point matrices and vectors (`armral_cmplx_mat_vec_mult_batch_f32` and
  `armral_cmplx_mat_vec_mult_batch_f32_pa`).

- LTE Turbo encoding function (`armral_turbo_encode_block`) that implements the
  encoding scheme defined in section 5.1.3.2 of the 3GPP Technical Specification
  (TS) 36.212 "Multiplexing and channel coding".

- LTE Turbo decoding function (`armral_turbo_decode_block`) that implements a
  maximum a posteriori (MAP) algorithm to return a hard decision (either 0 or 1)
  for each output bit.

- Functions to perform rate matching and rate recovery for Polar coding. These
  implement the specification in section 5.4.1 of the 3GPP Technical Specification
  (TS) 38.212.

- Functions to perform rate matching and rate recovery for LDPC coding. This
  implements the specification in section 5.4.2 of the 3GPP Technical
  Specification (TS) 38.212.

- Utilities to simulate the error correction performance for Polar, LDPC and
  Turbo coding over a noisy channel.

### Changed

- Renamed the Polar encoding and decoding functions to
  `armral_polar_encode_block` and `armral_polar_decode_block`.

- Improved the Neon and SVE2 performance of 16-QAM modulation
  (`armral_modulation` with `armral_modulation_type` set to `ARMRAL_MOD_16QAM)`.

- Improved the SVE2 performance of Mu law compression and decompression
  (`armral_mu_law_compr_*` and `armral_mu_law_decompr_*`).

- Improved the SVE2 performance of block float compression and decompression
  (`armral_block_float_compr_*` and `armral_block_float_decompr_*`).

- Improved the SVE2 performance of 8-bit block scaling compression
  (`armral_block_scaling_compr_8bit`).

- Improved the performance of 32-bit floating-point and 16-bit fixed-point
  complex valued FFTs (`armral_fft_execute_cf32` and `armral_fft_execute_cs16`)
  with large prime factors.

## [22.04] - 2022-04-08

### Added

- SVE2-optimized implementations batched 16-bit fixed-point matrix-vector
  multiplication with 64-bit and 32-bit fixed-point accumulator
  (`armral_cmplx_mat_vec_mult_batch_i16`,
  `armral_cmplx_mat_vec_mult_batch_i16_pa`,
  `armral_cmplx_mat_vec_mult_batch_i16_32bit`,
  `armral_cmplx_mat_vec_mult_batch_i16_32bit_pa`).

- SVE2-optimized implementation of complex 32-bit floating-point singular value
  decomposition (`armral_svd_cf32`).

- SVE2-optimized implementations of complex 32-bit floating-point Hermitian
  matrix inversion for a single matrix or a batch of matrices of size `3x3`
  (`armral_cmplx_hermitian_mat_inverse_f32` and
  `armral_cmplx_hermitian_mat_inverse_batch_f32`).

- SVE2-optimized implementations of 9-bit and 14-bit Mu law compression
  (`armral_mu_law_compr_9bit` and `armral_mu_law_compr_14bit`).

- SVE2-optimized implementations of 9-bit and 14-bit Mu law decompression
  (`armral_mu_law_decompr_9bit` and `armral_mu_law_decompr_14bit`).

- Complex 32-bit floating-point general matrix inversion for matrices of size
  `2x2`, `3x3`, `4x4`, `8x8`, and `16x16` (`armral_cmplx_mat_inverse_f32`).

### Changed

- Improved the performance of batched 16-bit floating-point matrix-vector
  multiplication with 64-bit floating-point accumulator
  (`armral_cmplx_mat_vec_mult_batch_i16` and
  `armral_cmplx_mat_vec_mult_batch_i16_pa`).

- Improved the performance of batched 16-bit floating-point matrix-vector
  multiplication with 32-bit floating-point accumulator
  (`armral_cmplx_mat_vec_mult_batch_i16_32bit` and
  `armral_cmplx_mat_vec_mult_batch_i16_32bit_pa`).

- Improved the performance of 14-bit block float compression
  (`armral_block_float_compr_14bit`).

- Improved the performance of 14-bit block scaling compression
  (`armral_block_scaling_compr_14bit`).

- Improved the performance of 14-bit Mu law compression
  (`armral_mu_law_compr_14bit`).

- Improved the performance of complex 32-bit floating-point singular value
  decomposition (`armral_svd_cf32`). The input matrix now needs to be stored in
  column-major order. Output matrices are also returned in column-major order.

- Improved the performance of complex 32-bit floating-point Hermitian matrix
  inversion for a single matrix or a batch of matrices of size `3x3`
  (`armral_cmplx_hermitian_mat_inverse_f32` and
  `armral_cmplx_hermitian_mat_inverse_batch_f32`).

- Improved the performance of Polar list decoding (`armral_polar_decoder`) with
  list size 4. The performance for list size 1 is slightly reduced, but the list
  size 4 gives much better error correction.

- Added restrictions to the number of matrices and vectors in the batch for the
  functions that perform batched matrix-vector multiplications in fixed-point
  precision (`armral_cmplx_mat_vec_mult_batch_i16`,
  `armral_cmplx_mat_vec_mult_batch_i16_pa`,
  `armral_cmplx_mat_vec_mult_batch_i16_32bit`,
  `armral_cmplx_mat_vec_mult_batch_i16_32bit_pa`).

- The function to perform fixed-point complex matrix-matrix multiplication with
  a 64-bit accumulator (`armral_cmplx_mat_mult_i16`) now narrows from the 64-bit
  accumulator to a 32-bit intermediate value, and then to the 16-bit result
  using truncating narrowing operations instead of rounding operations. This
  matches the behavior in the fixed-point complex matrix-matrix multiplication
  with a 32-bit accumulator.

- The function to perform fixed-point complex matrix-vector multiplication with
  a 64-bit accumulator (`armral_cmplx_mat_vec_mult_i16`) now narrows from the
  64-bit accumulator to a 32-bit intermediate value, and then to the 16-bit
  result using truncating narrowing operations instead of rounding
  operations. This matches the behavior in the fixed-point complex matrix-vector
  multiplication with a 32-bit accumulator.