Investigating Split Locks on x86-64

chipsandcheese.com ・ 78 points ・ ingve ・ 6 days ago ・ 28 comments

anematode ・ 3 days ago

Cool investigation. This part perplexes me, though:

> Games have apparently been using split locks for quite a while, and have not created issues even on AMD’s Zen 2 and Zen 5.

For the life of me I don't understand why you'd ever want to do an atomic operation that's not naturally aligned, let alone one split across cache lines....

toast0 ・ 3 days ago

> For the life of me I don't understand why you'd ever want to do an atomic operation that's not naturally aligned, let alone one split across cache lines....
I assume they force-packed their structure and it's poorly aligned, but x86 doesn't fault on unaligned access and Windows doesn't detect and punish split locks. So while you would probably get better performance with proper alignment, it might not be a meaningful improvement on the majority of the machines running the program.
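To make the packed-structure hypothesis concrete, here is a minimal Python sketch of the layout arithmetic. The struct (a 4-byte tag, a 56-byte name buffer, an 8-byte refcount), the 64-byte line size, and the function names are all illustrative assumptions, not taken from any real game.

```python
CACHE_LINE = 64  # bytes; assumed cache line size


def layout(fields, packed):
    """Return {name: offset} for fields given as (name, size, natural_alignment).

    packed=True places fields back to back, as a packed struct would;
    packed=False pads each field up to its natural alignment.
    """
    offsets, offset = {}, 0
    for name, size, alignment in fields:
        if not packed:
            offset = (offset + alignment - 1) // alignment * alignment
        offsets[name] = offset
        offset += size
    return offsets


def crosses_cache_line(offset, size, line=CACHE_LINE):
    """True if bytes [offset, offset + size) span two cache lines."""
    return offset // line != (offset + size - 1) // line


# Hypothetical struct, assumed to start on a cache-line boundary.
FIELDS = [("tag", 4, 4), ("name", 56, 1), ("refcount", 8, 8)]

packed_offsets = layout(FIELDS, packed=True)
natural_offsets = layout(FIELDS, packed=False)
print(packed_offsets["refcount"], crosses_cache_line(packed_offsets["refcount"], 8))
print(natural_offsets["refcount"], crosses_cache_line(natural_offsets["refcount"], 8))
```

In the packed layout the refcount lands at offset 60, so a locked operation on it would span two lines; with natural alignment it moves to offset 64 and stays within one.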
- anematode ・ 3 days ago
  
  Ah, that's a great hypothesis. I wonder, then, how it works with x86 emulation on ARM. IIRC, atomic ops on ARM fault if the address isn't naturally aligned... but I guess the runtime could intercept that and handle it slowly.
  
  omcnoe ・ 3 days ago
  
  ARM Macs apparently have some kind of specific handling in place for this when a process is running with x86_64 compatibility, but it’s not publicly documented anywhere that I can see.
  
  my123 ・ 3 days ago
  
  XNU has this oddity: https://github.com/apple-oss-distributions/xnu/blob/f6217f89...
  Redacted from open source XNU, but exists in the closed source version
  
  omcnoe ・ 3 days ago
  
  Is it actually redacted, or just a leftover stub from a feature implemented in silicon instead of software? Isn't the x86 memory order compatibility done at hardware level?
  
  my123 ・ 3 days ago
  
  Redacted
  
  BobbyTables2 ・ 3 days ago
  
  An emulated x86 atomic instruction wouldn’t need to use atomic instructions on ARM.
  
  dooglius ・ 3 days ago
  
  Why not?
  
  MBCook ・ 3 days ago
  
  They don’t have to match.
  As an example, consider a divide instruction. A machine without an FPU can emulate a machine that has one. It will legitimately have to run hundreds or thousands of instructions to emulate a single divide instruction, so it will certainly take longer.
  That's OK; it just means the emulation is slower at that than at something like add, which the host has a native instruction for. In ‘emulator time’ you still only ran one instruction. That world is still consistent.
  
  anematode ・ 3 days ago
  
  ? That's not how Windows on ARM emulation works. It uses dynamic JIT translation from x86 to ARM. When the compiler sees, e.g., lock add [mem], reg, presumably it'll emit an ldadd, but that will have different semantics if the operand is misaligned.
  
  cylemons ・ 3 days ago
  
  You mean the locking would be done in software?
phire ・ 2 days ago

They don't do it on purpose.
It's just really easy to do accidentally with custom allocators, and games tend to use custom allocators for performance reasons.
The system malloc will return pointers aligned to the size of the largest atomic operation by default (16 bytes on x86), and compilers depend on this automatic alignment for correctness. But it's really easy for a custom allocator to use a smaller alignment. Maybe the author didn't know, or maybe they assumed they would never need the full 16-byte atomics. Maybe the 16-byte atomics weren't added until well after the custom allocator.
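As a rough illustration of how a custom allocator's alignment choice can produce a split access, here is a toy bump allocator in Python. `BumpArena`, `align_up`, and `spans_two_lines` are hypothetical names, the arena is assumed to start on a 64-byte boundary, and the 4-byte versus 16-byte alignments are picked only for contrast.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of a power-of-two alignment."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    return (offset + alignment - 1) & ~(alignment - 1)


class BumpArena:
    """Toy bump allocator that hands out offsets into a single buffer.

    The buffer itself is assumed to start on a 64-byte boundary.
    """

    def __init__(self, size, alignment):
        self.size = size
        self.alignment = alignment
        self.top = 0

    def alloc(self, nbytes):
        start = align_up(self.top, self.alignment)
        if start + nbytes > self.size:
            raise MemoryError("arena exhausted")
        self.top = start + nbytes
        return start


def spans_two_lines(offset, size, line=64):
    """True if bytes [offset, offset + size) touch two cache lines."""
    return offset // line != (offset + size - 1) // line


for alignment in (4, 16):
    arena = BumpArena(4096, alignment)
    arena.alloc(60)           # some earlier 60-byte object
    counter = arena.alloc(8)  # an 8-byte atomic counter
    print(alignment, counter, spans_two_lines(counter, 8))
```

With 4-byte alignment the counter is placed at offset 60 and straddles the 64-byte boundary; with 16-byte alignment it is bumped to offset 64 and does not.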
userbinator ・ 3 days ago

Packing structures can improve performance and overall memory usage by reducing cache misses.

lifis ・ 3 days ago

But why doesn't the CPU just lock two cachelines? Seems relatively easy to do in microcode, no? Just sort by physical address with a conditional swap and then run the "lock one cacheline algorithm" twice, no?

Perhaps the issue is that each core has a locked cacheline entry for each other core, but even then, given the size of current CPUs, doubling it shouldn't be that significant. And one could also add just a single extra entry and then have a global lock, but one that only locks the ability to lock a second cacheline.

trebligdivad ・ 3 days ago

I suspect it's the risk of deadlocks and perhaps they have no easy way to avoid it.
- zeusk ・ 3 days ago
  
  Ordering lock acquisition is a tested strategy for avoiding deadlocks, so wouldn't locking the cache lines sorted by PA cover that?
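The ordering idea can be sketched in software with ordinary mutexes. This is only an analogy under stated assumptions: `CacheLine`, `locked_update`, and `demo` are hypothetical names, and real cache-coherence hardware does not work with Python locks, so the sketch says nothing about whether the approach is practical in silicon.

```python
import threading


class CacheLine:
    """Stand-in for one cache line: an address plus a software lock."""

    def __init__(self, phys_addr):
        self.phys_addr = phys_addr
        self.lock = threading.Lock()


def locked_update(line_a, line_b, update):
    """Run update() while holding both line locks, taken in address order."""
    if line_a.phys_addr == line_b.phys_addr:
        raise ValueError("a split access touches two distinct lines")
    first, second = sorted((line_a, line_b), key=lambda line: line.phys_addr)
    with first.lock, second.lock:
        update()


def demo(iterations=1000):
    """Two threads name the same pair of lines in opposite order."""
    low, high = CacheLine(0x1000), CacheLine(0x1040)
    total = [0]

    def bump():
        total[0] += 1

    def worker(a, b):
        for _ in range(iterations):
            locked_update(a, b, bump)

    threads = [
        threading.Thread(target=worker, args=(low, high)),
        threading.Thread(target=worker, args=(high, low)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total[0]
```

Because both threads always take the lower-addressed lock first, neither can hold one lock while waiting on a thread that holds the other, which is the property the comment relies on.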
cylemons ・ 3 days ago

I assume it's to save on resources. Even if your algorithm is not much more taxing on silicon, maybe the designers at Intel and AMD just didn't think optimizing split locks was worth it.

strstr ・ 3 days ago

Split locks are weird. It’s never been obvious to me why you’d want to do them unless you are on a small core count system. When split lock detection rolled out for Linux, it massacred perf for some games (which were probably min-maxing single core perf and didn’t care about noisy neighbor effects).

Frankly, I’m surprised split lock detection is enabled anywhere outside of multi-tenant clouds.

adrian_b ・ 3 days ago

Split locks are never something anyone wants to do, unless they are morons.
Split locks are always bugs, but like many other bugs that are benign on Intel/AMD CPU-based computers (due to efforts made by the hardware designers to mitigate such bugs instead of crashing applications), they are not uncommon in commercial software written by careless programmers.
The Intel/AMD CPUs contain a lot of additional hardware whose only purpose is to enable them to run programs written by less competent programmers, who also use less sophisticated programming tools, so they are not able to deal with more subtle things, like data alignment or memory barriers.
This additional hardware greatly reduces the performance impact of non-optimized programs, but it cannot eliminate it entirely.
- trebligdivad ・ 3 days ago
  
  To me it seems an artefact of the x86 architecture; where any sane architecture just declared split locks illegal, x86 has some gently hidden away note about it being a bad thing.
gpderetta ・ 3 days ago

You don't want them. Except for bug-compatibility with old broken software which is something that Intel (and MS) care a lot about.
If you mean split-lock detection, it is because split locks are a massive DoS vulnerability on high core count CPUs.

sidkshatriya ・ 3 days ago

This article seems relevant to me for the following scenario:

- You have faulty software (e.g. games) that happens to have split locks

AND

- You have DISABLED split lock detection and "mitigation", which would otherwise have hugely penalised the thread in question (making the lock painfully evident to that program and forcing it to be fixed).

AND

- You want to see which CPU does best in this scenario

In other words, you just assume the CPU will take the bus lock penalty and continue WITHOUT the culprit thread being actively throttled by the OS.

In the normal case, IIUC Linux should helpfully throttle the thread so the rest of the system is not affected by the bus lock. In this benchmark here the assumption is the thread will NOT be throttled by Linux via appropriate setting.

So to be honest I don't see the merit of this study. This study is essentially how fast is your interconnect so it can survive bad software that is allowed to run untrammelled.

On aarch64 the thread would simply be killed. It's possible to do the same on modern AMD / Intel also OR simply throttle the thread so that it does not cause problems via bus locks that affect other threads -- none of these are done in this benchmark.

VorpalWay ・ 3 days ago

> So to be honest I don't see the merit of this study. This study is essentially how fast is your interconnect so it can survive bad software that is allowed to run untrammelled.
It seems like a worthwhile study if you want to know what CPU to buy to play specific old games that use bus locks. Games that will never be fixed.
- toast0 ・ 3 days ago
  
  It seemed to me that the issue with the games was that they did split locks at all, and when Linux detected that and descheduled the process, performance was trash. I didn't think they were doing frequent split locks that resulted in bad performance by themselves.
  You don't need to be a careful shopper for this; just turn off detection while you're playing these games, or tune the punishment algorithm, or patch the game. Just because the developer won't doesn't mean you can't; there's plenty of 3rd party binary patches for games.
- sidkshatriya ・ 3 days ago
  
  > It seems like a worthwhile study if you want to know what CPU to buy to play specific old games that use bus locks. Games that will never be fixed.
  Fair.
  > old games that use bus locks
  Yes, the bus locks here are unintentional: since a LOCK on a single cache line is not sufficient, the CPU falls back to locking the bus.

Cold_Miserable ・ 3 days ago

It went from ~30 ns to 2,000 ns, but mostly timed out, when I changed the alignment to +7.5 QWORDs on Golden Cove.
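For readers checking the arithmetic behind that offset, here is a small Python sketch. It assumes the test buffer started on a 64-byte boundary and that the locked operand was one 8-byte QWORD, neither of which the comment states; `split_of_access` is a hypothetical helper.

```python
QWORD = 8        # bytes
CACHE_LINE = 64  # bytes; assumed line size


def split_of_access(offset, size, line=CACHE_LINE):
    """Return (bytes in the first line, bytes spilling into the next line)."""
    in_first = min(size, line - offset % line)
    return in_first, size - in_first


offset = int(7.5 * QWORD)  # +7.5 QWORDs from an assumed line-aligned base
print(offset, split_of_access(offset, QWORD))
```

Under those assumptions the operand starts at byte 60, so four of its bytes sit in one cache line and four in the next, which is exactly the split-lock case.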