Refactor RHS packing function for F32 <- QAI8DXP x QSU4C32
- Rename the packing function to include the bf16 scale factor
- Optimize the scalar variant. The new implementation is ~1.5x faster
  than the previous one
Signed-off-by: Gian Marco Iodice <gianmarco.iodice@arm.com>