Linux 6.18 Fixes “Catastrophic” ARM Performance Bug

Linux 6.18 Fixes "Catastrophic" ARM Performance Bug - Professional coverage

According to Phoronix, Linux kernel 6.18 has merged a fix for what developers described as a “catastrophic performance issue” specifically affecting 64-bit ARM systems. The problem was discovered by Paul McKenney in the SRCU locking code: per-CPU atomic operations were taking around 50 nanoseconds each instead of the expected handful of nanoseconds. On ARM Neoverse V2 systems, that translated to about 100ns for a srcu_read_lock()/srcu_read_unlock() pair, creating massive overhead in synchronization operations. The issue stemmed from ARM’s STADD/STCLR/STSET instructions being executed “far” in the memory subsystem rather than locally. The fix changes these to value-returning load atomics like LDADD that execute “near”, with the data in L1 cache, resolving a penalty that particularly hurt high-core-count ARM servers.
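
For context, here is the read-side pattern whose cost ballooned. This is a minimal kernel-style sketch; the SRCU domain, struct config, and use_config() are hypothetical placeholders rather than anything taken from the actual report:

    #include <linux/srcu.h>

    struct config;                          /* hypothetical payload type */
    void use_config(struct config *cfg);    /* hypothetical consumer */

    /* Hypothetical SRCU domain protecting a config pointer. */
    DEFINE_SRCU(demo_srcu);
    static struct config __rcu *cur_config;

    static void reader(void)
    {
        int idx;

        /* srcu_read_lock()/srcu_read_unlock() each perform a per-CPU
         * atomic update; on Neoverse V2 this pair was costing roughly
         * 100ns before the 6.18 change. */
        idx = srcu_read_lock(&demo_srcu);
        use_config(srcu_dereference(cur_config, &demo_srcu));
        srcu_read_unlock(&demo_srcu, idx);
    }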

Why this matters

Here’s the thing – this wasn’t just some minor optimization. We’re talking about fundamental synchronization primitives that everything else builds on. When basic locking operations suddenly take 10-20 times longer than expected, that overhead ripples into every subsystem that depends on them. And it was happening on exactly the kind of high-performance ARM servers that are supposed to compete with x86 in data centers and cloud environments.

Think about it: companies running ARM Neoverse systems for web serving, database workloads, or real-time applications were essentially running with one hand tied behind their back. The performance hit was so significant that one developer who applied similar fixes to haproxy reported immediate 2-7% performance gains on 80-core Ampere Altra systems. That’s not trivial when you’re talking about infrastructure that costs millions.

The technical details

Basically, the problem came down to how ARM’s atomic instructions behave in different scenarios. Non-value-returning per-CPU operations like this_cpu_add() were compiled to STADD instructions, which get executed “far” out in the interconnect or memory subsystem. That behavior makes sense when there’s real contention – you want to avoid bouncing cache lines between CPUs. But for per-CPU data that is, by definition, only touched by its own CPU, there shouldn’t be any contention in the first place.
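
As a concrete illustration, here is the kind of update involved; the counter name is hypothetical, and on LSE-capable arm64 kernels this previously compiled down to a non-returning STADD:

    #include <linux/percpu.h>

    /* Hypothetical per-CPU statistics counter; no other CPU ever touches
     * this copy, so there is no real contention to avoid. */
    static DEFINE_PER_CPU(long, pkt_count);

    static inline void count_packet(void)
    {
        /* Non-value-returning per-CPU add: before the 6.18 change the
         * arm64 LSE backend emitted STADD here, which Neoverse cores
         * execute "far" out in the memory system. */
        this_cpu_add(pkt_count, 1);
    }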

The fix, committed by the ARM64 maintainers, switches to using load atomics like LDADD that execute “near” with data loaded into L1 cache. As Paul McKenney noted, this makes perfect sense for per-CPU operations where you’re not expecting concurrent access to the same memory location anyway.
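
To make the near/far distinction concrete, here is an illustrative pair of arm64 helpers, assuming an ARMv8.1+ toolchain with LSE atomics enabled; this is a sketch of the two instruction forms, not the kernel's actual patch:

    /* Build with something like -march=armv8.1-a so STADD/LDADD are available. */

    /* Non-returning form: STADD may be executed "far", out in the
     * interconnect or memory subsystem. */
    static inline void add_far(long *p, long v)
    {
        asm volatile("stadd %[v], %[p]"
                     : [p] "+Q" (*p)
                     : [v] "r" (v));
    }

    /* Value-returning form: LDADD brings the old value back into a
     * register, so it executes "near" with the cache line in L1. */
    static inline void add_near(long *p, long v)
    {
        long old;

        asm volatile("ldadd %[v], %[old], %[p]"
                     : [p] "+Q" (*p), [old] "=r" (old)
                     : [v] "r" (v));
        (void)old; /* scratch only: the goal is the near-executing form */
    }

The kernel change applies the same idea inside the per-CPU op implementations, spending a scratch register on an unused return value in exchange for keeping the operation local to the core.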

Broader implications

This fix matters way beyond just the Linux kernel. As that haproxy developer discovered, the same principle applies to user-space applications heavily using atomics. We’re seeing more and more ARM servers in production, especially in cloud and edge computing where reliable performance is non-negotiable.
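
In user space the same effect can be sketched with C11 atomics; the counter name is hypothetical, and the codegen described assumes GCC or Clang targeting ARMv8.1+ with LSE. Discarding the result of an atomic add typically lets the compiler emit the non-returning STADD, while consuming the returned value forces the value-returning LDADD, which is roughly the kind of change the haproxy anecdote above refers to:

    #include <stdatomic.h>

    static _Atomic long stats_counter; /* hypothetical hot counter */

    /* Result discarded: with LSE enabled this typically becomes STADD,
     * which may be executed "far" in the memory system. */
    static inline void bump_discard(void)
    {
        atomic_fetch_add_explicit(&stats_counter, 1, memory_order_relaxed);
    }

    /* Result consumed: the old value has to land in a register, so the
     * compiler emits LDADD, executed "near" with the line in L1. */
    static inline long bump_fetch(void)
    {
        return atomic_fetch_add_explicit(&stats_counter, 1,
                                         memory_order_relaxed);
    }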

What’s really interesting is how long this flew under the radar. These kinds of subtle architecture differences between x86 and ARM can create massive performance cliffs that only show up in specific workloads. It makes you wonder what other optimizations we’re missing because we’re so used to thinking in x86 terms. The good news is that with fixes like this landing in mainline kernels, ARM servers are getting closer to delivering on their performance promises across the board.
