> AVX2 level includes FMA (fast multiply-add)
The FMA acronym doesn't stand for fast multiply-add, it's *fused* multiply-add. Fused means the instruction computes the entire a * b + c expression with extra internal precision (roughly twice the mantissa bits for the product) and only then rounds the result to the precision of the arguments, i.e. there is a single rounding step instead of two.
It might be that the Prism emulator failed to translate a 256-bit FMA instruction into a pair of FMLA instructions (the equally fused ARM64 equivalent), and instead emulated the fused behaviour in software, which in turn is what degraded the performance of the AVX2 emulation.
Author here - thanks - my bad. Fixed 'fast' -> 'fused' :)
I don't have insight into how Prism works, but I've wondered whether the right debugger would show the translated ARM code and let us see exactly what was going on.
You’re welcome. Sadly, I don’t know how to observe ARM assembly produced by Prism.
And one more thing.
If you test on an AMD processor, you will probably see much less benefit from FMA. Not because FMA is slower there, but because the SSE4 version will run much faster.
On Intel processors like your Tiger Lake, all three operations, addition, multiplication, and FMA, compete for the same execution units. On AMD processors, multiplication and FMA still share units, but addition is independent: on Zen 4, multiplication and FMA run on execution units FP0 or FP1, while addition runs on FP2 or FP3. As a result, replacing a multiply/add combo with FMA on AMD doesn't substantially improve throughput in FLOPs. The only win is L1i cache footprint and instruction decode bandwidth.
You can ... to a degree - Google for "XtaCache"