Cool investigation. This part perplexes me, though:
> Games have apparently been using split locks for quite a while, and have not created issues even on AMD’s Zen 2 and Zen 5.
For the life of me I don't understand why you'd ever want to do an atomic operation that's not naturally aligned, let alone one split across cache lines....
> For the life of me I don't understand why you'd ever want to do an atomic operation that's not naturally aligned, let alone one split across cache lines....
I assume they force packed their structure and it's poorly aligned, but x86 doesn't fault on unaligned access and Windows doesn't detect and punish split locks, so while you probably would get better performance with proper alignment, it might not be a meaningful improvement on the majority of the machines running the program.
Ah, that's a great hypothesis. I wonder, then, how it works with x86 emulation on ARM. IIRC, atomic ops on ARM fault if the address isn't naturally aligned... but I guess the runtime could intercept that and handle it slowly.
ARM macs apparently have some kind of specific handling in place for this when a process is running with x86_64 compatibility, but it’s not publicly documented anywhere that I can see.
XNU has this oddity: https://github.com/apple-oss-distributions/xnu/blob/f6217f89...
Redacted from open source XNU, but exists in the closed source version
Is it actually redacted, or just a leftover stub from a feature implemented in silicon instead of software? Isn't the x86 memory order compatibility done at hardware level?
Redacted
An emulated x86 atomic instruction wouldn’t need to use atomic instructions on ARM.
Why not?
They don’t have to match.
As an example, what about a divide instruction. A machine without an FPU can emulate a machine that has one. It will legitimately have to run hundreds/thousands of instructions to emulate a single divide instruction, it will certainly take longer.
Thats OK, just means the emulation is slower doing that than something like add that the host has a native instruction for. In ‘emulator time’ you still only ran one instruction. That world is still consistent.
? That's not how Windows on ARM emulation works. It uses dynamic JIT translation from x86 to ARM. When the compiler sees, e.g., lock add [mem], reg presumably it'll emit a ldadd, but that will have different semantics if the operand is misaligned.
- [deleted]
You mean the locking would be done in software?
They don't do it on purpose.
It's just really easy to do accidentally with custom allocators, and games tend to use custom allocators for performance reasons.
The system malloc will return pointers aligned to the size of the largest Atomic operation by default (16 bytes on x86), and compilers depend on this automatic alignment for correctness. But it's real easy for a custom allocator use a smaller alignment. Maybe the author didn't know, maybe they assumed they would never need the full 16-byte atomics. Maybe the 16-byte atomics weren't added until well after the custom allocator.
Packing structures can improve performance and overall memory usage by reducing cache misses.