Optimize F32 <- QAI8DXP (LHS) x QSI4C32P (RHS) for 4x8 i8mm

- Add new assembly ukernel optimized with FEAT_I8MM for matrix
  multiplication with 4x8 block size.
- Update build script.
- Add to unit test.

Signed-off-by: Michael Kozlov <michael.kozlov@arm.com>
Approved-by: Felix Johnny Thomasmathibalan <felixjohnny.thomasmathibalan@arm.com>