This is an entirely uncontroversial take among experts in the space. x86 is an old CISC-y hot mess. RISC-V is a new-school hyper-academic hot mess. Recent ARM is actually pretty good. And none of it matters, because the uncore and the fabrication details (in particular, whether things have been tuned to run full speed demon or full power sipper) completely dominate the ISA.
In the past x86 didn't dominate in low power because Intel had the resources to care but never did, and AMD never had the resources to try. Other companies stepped in to fill that niche, and had to use other ISAs. (If they could have used x86 legally, they might well have done so. Oops?) That may well be changing. Or perhaps AMD will let x86 fade away.
I remember reading this Jim Keller interview (https://web.archive.org/web/20210622080634/https://www.anand...), and basically the gist of it is that the difference between ARM/x86 mostly boils down to instruction decode, and:
- Most instructions end up being simple load/store/conditional branch etc. on both architectures, where there's literally no difference in encoding efficiency
- Variable-length instruction decoding has pretty much been figured out on x86, to the point that it's no longer a bottleneck
Also my personal addendum is that today's Intel efficiency cores have more transistors and better perf than the big Intel cores of a decade ago
x86 decoding must be a pain - I vaguely remember that they have trace caches (a cache of decoded micro-operations) to skip decoding in some cases. You probably don't make such caches when decoding is easy.
Also, more complicated decoding and extra caches means longer pipeline, which means more price to pay when a branch is mispredicted (binary search is a festival of branch misprediction for example, and I got 3x acceleration of linear search on small arrays when I switched to the branchless algorithm).
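To make the branchless point concrete, here is a minimal sketch (my own illustration, not the parent commenter's code) of the kind of branchless linear search that avoids the data-dependent early-exit branch on small sorted arrays:

```c
#include <stddef.h>

// Minimal sketch of a branchless lower-bound search over a small sorted array.
// Instead of breaking out of the loop on the first match (a data-dependent,
// hard-to-predict branch), count how many elements are smaller than the key;
// the only remaining branch is the easily predicted loop exit.
size_t lower_bound_branchless(const int *a, size_t n, int key) {
    size_t idx = 0;
    for (size_t i = 0; i < n; i++)
        idx += (a[i] < key);   // compare + add, typically no conditional jump
    return idx;                // first position where a[i] >= key
}
```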
Also I am not a CPU designer, but branch prediction with wide decoder also must be a pain - imagine that while you are loading 16 or 32 bytes from instruction cache, you need to predict the address of next loaded chunk in the same cycle, before you even see what you got from cache.
As for encoding efficiency, I played with little algorithms (like binary search or slab allocator) on godbolt, and RISC-V with compressed instruction generates similar amount of code as x86 - in rare cases, even slightly smaller. So x86 has a complex decoding that doesn't give any noticeable advantages.
x86 also has flags, which add implicit dependencies between instructions, and must make designer's life harder.
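To illustrate the flags point, a hedged C sketch (my example): a multi-word add, where the carry is exactly the implicit state x86 keeps in FLAGS between instructions, while a flag-less ISA like RISC-V has to compute it explicitly.

```c
#include <stdint.h>

// 128-bit add built from two 64-bit adds. On x86 the carry typically flows
// through the FLAGS register (add + adc), an implicit dependency between the
// two instructions; RISC-V has no flags, so the carry is materialized
// explicitly (e.g. with sltu), as the C below does.
void add128(uint64_t a_lo, uint64_t a_hi, uint64_t b_lo, uint64_t b_hi,
            uint64_t *r_lo, uint64_t *r_hi) {
    uint64_t lo = a_lo + b_lo;
    uint64_t carry = lo < a_lo;   // explicit carry bit
    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;  // x86 can fold the carry in via adc
}
```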
I was an instruction fetch unit (IFU) architect on P6 from 1992-1995. And yes, it was a pain, and we had close to 100x the test vectors of all the other units, going back to the mid 1980's. Once we started going bonkers with the prefixes, we just left the pre-Pentium decoder alone and added new functional blocks to handle those. And it wasn't just branch prediction that sucked, like you called out! Filling the instruction cache was a nightmare, keeping track of head and tail markers, coalescing, rebuilding, ... lots of parallel decoding to deal with cache and branch-prediction improvements to meet timing as the P6 core evolved was the typical solution. We were the only block (well, minus IO) that had to deal with legacy compatibility. Fortunately I moved on after the launch of Pentium II and thankfully did not have to deal with Pentium4/Northwood.
So one of the projects I've been working on and off again is the World's Worst x86 Decoder, which takes a principled approach to x86 decoding by throwing out most of the manual and instead reverse-engineering semantics based on running the instructions themselves to figure out what they do. It's still far from finished, but I've gotten it to the point that I can spit out decoder rules.
As a result, I feel pretty confident in saying that x86 decoding isn't that insane. For example, here's the bitset for the first two opcode maps on whether or not opcodes have a ModR/M operand: ModRM=1111000011110000111100001111000011110000111100001111000011110000000000000000000000000000000000000011000001010000000000000000000011111111111111110000000000000000000000000000000000000000000000001100111100000000111100001111111100000000000000000000001100000011111100000000010011111111111111110000000011111111000000000000000011111111111111111111111111111111111111111111111111111110000011110000000000000000111111111111111100011100000111111111011110111111111111110000000011111111111111111111111111111111111111111111111
I haven't done a k-map on that, but... you can see that a boolean circuit isn't that complicated. Also, it turns out that this isn't dependent on presence or absence of any prefixes. While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle, which means the main limitation on the parallelism in the decoder is how wide you can build those muxes (which, to be fair, does have a cost).
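For a feel of what such a rule costs in software terms, here is a hedged sketch (the table values are placeholders, not a transcription of the bitset above): "does this one-byte opcode take a ModR/M byte" is just a 256-entry bitset lookup, i.e. a small boolean function of the opcode bits.

```c
#include <stdbool.h>
#include <stdint.h>

// 256-bit "has ModR/M" map for the one-byte opcode space, stored as four
// 64-bit words. The values below are placeholders for illustration only;
// the real contents would be the bitset quoted above.
static const uint64_t modrm_map[4] = {
    0x0F0F0F0F0F0F0F0FULL,  // opcodes 0x00-0x3F (placeholder)
    0x0000000000000000ULL,  // opcodes 0x40-0x7F (placeholder)
    0x00000000000000FFULL,  // opcodes 0x80-0xBF (placeholder)
    0x0000000000000303ULL,  // opcodes 0xC0-0xFF (placeholder)
};

static bool opcode_has_modrm(uint8_t opcode) {
    return (modrm_map[opcode >> 6] >> (opcode & 63)) & 1;
}
```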
That said, there is one instruction where I want to go back in time and beat up the x86 ISA designers. f6/0, f6/1, f7/0, and f7/1 [1] take in an extra immediate operand whereas f6/2 et al. do not. It's the sole case in the entire ISA where this happens.
[1] My notation for when x86 does its trick of using one of the register selector fields as extra bits for opcodes.
> While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle
That's some very faint praise there. Especially when you're trying to chop up several instructions every cycle. Meanwhile RISC-V is "count leading 1s. 0-1:16bit 2-4:32bit 5:48bit 6:64bit"
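For reference, here is the RISC-V rule being paraphrased, written out from the low-order bits of the first 16-bit parcel (my reading of the base encoding scheme plus the reserved 48/64-bit forms, so treat it as a sketch):

```c
#include <stdint.h>

// Instruction length from the first 16-bit parcel, per the RISC-V base
// encoding scheme (the 48/64-bit forms are reserved-but-defined layouts).
static unsigned riscv_insn_len(uint16_t p) {
    if ((p & 0x03) != 0x03) return 2;  // bits [1:0] != 11   -> 16-bit (compressed)
    if ((p & 0x1f) != 0x1f) return 4;  // bits [4:2] != 111  -> 32-bit
    if ((p & 0x3f) == 0x1f) return 6;  // low bits 011111    -> 48-bit
    if ((p & 0x7f) == 0x3f) return 8;  // low bits 0111111   -> 64-bit
    return 0;                          // longer/reserved encodings
}
```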
The chopping up can happen the next cycle, in parallel across all the instructions in the cache line(s) that were fetched, and it can be pipelined so there's no loss in throughput. Since x86 instructions can be as small as one byte, in principle the throughput-per-cache-line can be higher on x86 than on RISC-V (e.g. a single 32-byte x86 cache line could have up to 32 instructions where the original RISC-V ISA might only have 8). And in any case, there are RISC-V extensions that allow variable-length instructions now, so they have to deal with the problem too.
Intel’s E cores decode x86 without a trace cache (μop cache), and are very efficient. The latest (Skymont) can decode 9 x86 instructions per cycle, more than the P core (which can only decode 8)
AMD isn’t saying that decoding x86 is easy. They are just saying that decoding x86 doesn’t have a notable power impact.
> Why can't they decode 100 instructions per cycle?
Well, obviously because there aren't 100 individual parallel execution units to which those instructions could be issued. And lower down the stack because a 3000 bit[1] wide cache would be extremely difficult to manage. An instruction fetch would be six (!) cache lines wide, causing clear latency and bottleneck problems (or conversely would demand your icache be 6x wider, causing locality/granularity problems as many leaf functions are smaller than that).
But also because real world code just isn't that parallel. Even assuming perfect branch prediction the number of instructions between unpredictable things like function pointer calls or computed jumps is much less than 100 in most performance-sensitive algorithms.
And even if you could, the circuit complexity of decoding variable length instructions is superlinear. In x86, every byte can be an instruction boundary, but most aren't, and your decoder needs to be able to handle that.
[1] I have in my head somewhere that "the average x86_64 instruction is 3.75 bytes long", but that may be off by a bit. Somewhere around that range, anyway.
Variable length decoding is more or less figured out, but it takes more design effort, transistors and energy. They cost, but not a lot, relatively, in a current state of the art super wide out-of-order CPU.
Not a lot is not how I would describe it. Take a 64-bit piece of fetched data. On ARM64 you just push that into two decoder blocks and you're done with it. On x86 you've got what, a 1 to 15 byte range per instruction? I don't even want to think about the possible permutations, it's on the order of 10 ^ some two-digit number.
You don't need all the permutations. If there are 32 bytes in a cache line then each instruction can only start at one of 32 possible positions. Then if you want to decode N instructions per cycle you need N 32-to-1 muxes. You can reduce the number of inputs to the later muxes since instructions can't be zero size.
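As a hedged software model of that brute-force scheme (the names and structure are mine, not how any real decoder is built): speculatively compute a candidate length at every byte offset of the fetch window, which is embarrassingly parallel, then resolve the real instruction starts with a short serial chain; in hardware that chain is the cascade of muxes described above.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for a per-offset length decoder; a real one would
// look at prefixes, opcode maps and ModR/M. Stubbed so the sketch compiles.
static unsigned x86_len_at(const uint8_t *bytes, size_t off) {
    (void)bytes; (void)off;
    return 1;  // placeholder length
}

// Speculative parallel decode: every candidate length is independent (the
// parallel part), and the walk over the actual starts is just a few
// dependent adds (the serial part the mux cascade implements).
static size_t find_starts(const uint8_t *line, size_t window,
                          size_t starts[], size_t max_insns) {
    unsigned len_at[64];
    size_t n = 0, pos = 0;
    for (size_t i = 0; i < window && i < 64; i++)
        len_at[i] = x86_len_at(line, i);   // all offsets, in parallel in hardware
    while (pos < window && n < max_insns) {
        starts[n++] = pos;
        pos += len_at[pos];                // short dependent chain
    }
    return n;
}
```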
Yes, but you're not describing it from the right position. Is instruction decode hard? Yes, if you think about it in isolation (also, fwiw, it's not a permutation problem as you suggest). But the core has a bunch of other stuff it needs to do that is far harder. Even your lowliest Pentium from 2000 can do instruction decode.
It's a lot for a decoder, but not for a whole core. Citation needed, but I remember that the decoder is about 10% of a Ryzen core's power budget, and of course that is with a few techniques better than complete brute force.
Pure decoder width isn't enough to tell you everything. x86 has some commonly used, ridiculously compact instructions (e.g. lea) that would turn into 2-3 instructions on most other architectures.
The whole ModRM addressing encoding (to which LEA is basically a front end) is actually really compact, and compilers have gotten frighteningly good at exploiting it. Just look at the disassembly for some non-trivial code sometime and see what it's doing.
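As a hedged illustration of how much address arithmetic one ModRM/SIB encoding can absorb (my example; the codegen notes in the comments describe typical output, not verified disassembly):

```c
#include <stddef.h>
#include <stdint.h>

// base + index*4 + 8: on x86-64 this whole address computation usually fits
// in a single instruction's ModRM/SIB encoding (something like
// mov eax, [rdi + rsi*4 + 8]); a plain RISC encoding tends to split the
// shift, add and displacement into separate instructions.
int32_t load_at_offset(const int32_t *table, size_t i) {
    return table[i + 2];
}

// lea exposes the same machinery for pure arithmetic: a + b*8 + 24 is often
// a single lea on x86-64.
uintptr_t scaled_sum(uintptr_t a, uintptr_t b) {
    return a + b * 8 + 24;
}
```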
This matches my understanding as well, as someone who has a great deal of interest in the field but never worked in it professionally. CPUs all have a microarchitecture that doesn't look like the ISA at all, and they have an instruction decoder that translates one or more ISA instructions into zero or more microarchitectural instructions. There are some advantages to having a more regular ISA, such as the ability to more easily decode multiple instructions in parallel if they're all the same size or having to spend fewer transistors on the instruction decoder, but for the big superscalar chips we all have in our desktops and laptops and phones, the drawbacks are tiny.
I imagine that the difference is much greater for the tiny in-order CPUs we find in MCUs though, just because an amd64 decoder would be a comparatively much larger fraction of the transistor budget
Then there's mainframes, where you want code compiled in 1960 to run unmodified today. There was quite an advantage originally as well, since IBM was able to implement the same ISA with three different types and costs of computers.
uOps are kind of oversold in the CPU design mythos. They are not that different from the original ISA, and some x86 instructions (like lea) are both complex and natural fits for hardware so don't get microcoded.
Yeah... Previously I was a big fan of RISC-V, but after I had to dig slightly deeper into it as a software developer my enthusiasm for it has cooled down significantly.
It's still great that we got a mainstream open ISA, but now I view it as a Linux of the hardware world, i.e. a great achievement, with a big number of questionable choices baked in, which unfortunately stifles other open alternatives by virtue of being "good enough".
- Handling of misaligned loads/stores: RISC-V got itself into a weird middle ground, ops on misaligned pointers may work fine, may work "extremely slow", or cause fatal exceptions (yes, I know about Zicclsm, it's extremely new and only helps with the latter, also see https://github.com/llvm/llvm-project/issues/110454). Other platforms either guarantee "reasonable" performance for such operations, or forbid misaligned access with "aligned" loads/stores and provide separate misaligned instructions. Arguably, RISC-V should've done the latter (with misaligned instructions defined in a separate higher-end extension), since passing unaligned pointer into an aligned instruction signals correctness problems in software.
- The hardcoded page size. 4 KiB is a good default for RV32, but arguably a huge missed opportunity for RV64.
- The weird restriction in the forward progress guarantees for LR/SC sequences, which forces compilers to compile `compare_exchange` and `compare_exchange_weak` in the absolutely same way. See this issue for more information: https://github.com/riscv/riscv-isa-manual/issues/2047
- The `seed` CSR: it does not provide a good quality entropy (i.e. after you accumulated 256 bits of output, it may contain only 128 bits of randomness). You have to use a CSPRNG on top of it for any sensitive applications. Doing so may be inefficient and will bloat binary size (remember, the relaxed requirement was introduced for "low-powered" devices). Also, software developers may make mistake in this area (not everyone is a security expert). Similar alternatives like RDRAND (x86) and RNDR (ARM) guarantee proper randomness and we can use their output directly for cryptographic keys with very small code footprint.
- Extensions do not form hierarchies: it looks like the AVX-512 situation once again, but worse. Profiles help, but it's not a hierarchy, but a "packet". Also, there are annoyances like Zbkb not being a proper subset of Zbb.
- Detection of available extensions: we usually have to rely on the OS to query available extensions since the `misa` register is accessible only in machine mode. This makes detection quite annoying for "universal" libraries which intend to support various OSes and embedded targets. The CPUID instruction (x86) is ideal in this regard. I totally disagree with the virtualization argument against it, nothing prevents a VM from intercepting the read, and no one expects huge performance from such reads. (A quick sketch of the contrast is below.)
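A hedged sketch of that contrast on Linux (the function names are mine; the RISC-V side only covers single-letter base extensions, and multi-letter ones need newer interfaces such as the hwprobe syscall):

```c
#include <stdint.h>

#if defined(__riscv) && defined(__linux__)
#include <sys/auxv.h>
// RISC-V/Linux: userspace asks the kernel. AT_HWCAP carries one bit per
// single-letter base extension (bit 'V'-'A' for the vector extension).
static int have_vector(void) {
    unsigned long hwcap = getauxval(AT_HWCAP);
    return (hwcap >> ('V' - 'A')) & 1;
}
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
// x86: the unprivileged CPUID instruction answers directly, no OS involved.
static int have_avx2(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return (ebx >> 5) & 1;   // CPUID.(EAX=7,ECX=0):EBX bit 5 = AVX2
}
#endif
```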
And this list is compiled after a pretty surface-level dive into the RISC-V spec. I heard about other issues (e.g. being unable to port tricky SIMD code to the V extension or underspecification around memory coherence important for writing drivers), but I cannot confidently talk about those, so they are not part of my list.
P.S.: I would be interested to hear about other people gripes with RISC-V.
I'm pretty confident that this will get removed. It's an extension that made its way into RVA23, but once anyone has a design big enough for it to be a burden, it can be dropped.
By prioritizing efficiency, Apple also prioritizes integration. The PC ecosystem prefers less integration (separate RAM, GPU, OS, etc) even at the cost of efficiency.
> By prioritizing efficiency, Apple also prioritizes integration. The PC ecosystem prefers less integration (separate RAM, GPU, OS, etc) even at the cost of efficiency.
People always say this but "integration" has almost nothing to do with it.
How do you lower the power consumption of your wireless radio? You have a network stack that queues non-latency sensitive transmissions to minimize radio wake-ups. But that's true for radios in general, not something that requires integration with any particular wireless chip.
How do you lower the power consumption of your CPU? Remediate poorly written code that unnecessarily keeps the CPU in a high power state. Again not something that depends on a specific CPU.
How much power is saved by soldering the memory or CPU instead of using a socket? A negligible amount if any; the socket itself has no significant power draw.
What Apple does well isn't integration, it's choosing (or designing) components that are each independently power efficient, so that then the entire device is. Which you can perfectly well do in a market of fungible components simply by choosing the ones with high efficiency.
In fact, a major problem in the Android and PC laptop market is that the devices are insufficiently fungible. You find a laptop you like where all the components are efficient except that it uses an Intel processor instead of the more efficient ones from AMD, but those components are all soldered to a system board that only takes Intel processors. Another model has the AMD APU but the OEM there chose poorly for the screen.
It's a mess not because the integration is poor but because the integration exists instead of allowing you to easily swap out the part you don't like for a better one.
> How much power is saved by soldering the memory or CPU instead of using a socket? A negligible amount if any; the socket itself has no significant power draw.
This isn't quite true. When the whole chip is idling at 1-2W, 0.1W of socket power is 10%. Some of Apple's integration almost certainly save power (e.g. putting storage controllers for the SSD on the SOC, having tightly integrated display controllers, etc).
There's a critical instruction for Objective-C handling (I forget exactly which one) where Apple's chips are faster than Intel's even in Rosetta 2's x86 emulation.
Eh, probably the biggest difference is in the OS. The amount of time Linux or Windows will spend using a processor while completely idle can be a bit offensive.
It’s all of the above. One thing Apple excels at is actually using their hardware and software together whereas the PC world has a long history of one of the companies like Intel, Microsoft, or the actual manufacturer trying to make things better but failing to get the others on-board. You can in 2025 find people who disable power management because they were burned (hopefully not literally) by some combination of vendors slacking on QA!
One good example of this is RAM. Apple Silicon got some huge wins from lower latency and massive bandwidth, but that came at the cost of making RAM fixed and more expensive. A lot of PC users scoffed at the default RAM sizes until they actually used one and realized it was great at ~8GB less than the equivalent PC. That’s not magic or because Apple has some super elite programmers, it’s because they all work at the same company and nobody wants to go into Tim Cook’s office and say they blew the RAM budget and the new Macs need to cost $100 more. The hardware has compression support and the OS and app teams worked together to actually use it well, whereas it’s very easy to imagine Intel adding the feature but skimping on speed / driver stability, or Microsoft trying to implement it but delaying release for a couple years, or not working with third-party developers to optimize usage, etc. – nobody acting in bad faith but just what inevitably happens when everyone has different incentives.
In most cases, efficiency and performance are pretty synonymous for CPUs. The faster you can get work done (and turn off the silicon, which is admittedly a higher design priority for mobile CPUs) the more efficient you are.
The level of talent Apple has cannot be overstated, they have some true CPU design wizards. This level of efficiency cannot be achieved without making every aspect of the CPU as fast as possible; their implementation of the ARM ISA is incredible. Lots of companies make ARM chips, but none of them are at Apple's level of performance.
As a gross simplification, where the energy/performance tradeoff actually happens is after the design is basically baked. You crank up the voltage and clock speed to get more perf at the cost of efficiency.
> In most cases, efficiency and performance are pretty synonymous for CPUs. The faster you can get work done (and turn off the silicon, which is admittedly a higher design priority for mobile CPUs) the more efficient you are.
Somewhat yes, hurry up and wait can be more efficient than running slow the whole time. But at the top end of Intel/AMD performance, you pay a lot of watts to get a little performance. Apple doesn't offer that on their processors, and when they were using Intel processors, they didn't provide thermal support to run in that mode for very long either.
The M series bakes in a lower clockspeed cap than contemporary Intel/AMD chips; you can't run in the clock regime where you spend a lot of watts and get a little bit more performance.
Nitpick: uncore and the fabrication details dominate the ISA on high end/superscalar architectures (because modern superscalar basically abstract the ISA away at the frontend). On smaller (i. e. MCU) cores x86 will never stand any chance.
Almost the same in die shots except the K5 had more transistors for the x86 decoding. The AM29000's instruction set is actually very close to RISC-V too!
Very hard to find benchmarks comparing the two directly though.
Sure, but ARM has NEON/SVE which impose basically the same requirements for vector instructions, and most high-performance ARM implementations have a wide suite of crypto instructions (e.g. Apple's M series chips have AES, SHA1 and SHA256 instructions)
Fun fact. The idea of strong national security is the reason why there are three companies with access to the x86 ISA.
The DoD originally required all products to be sourced by at least three companies to prevent supply chain issues. This required Intel to allow AMD and VIA to produce products based on the ISA.
For me this is a good indicator of whether someone who talks about national security knows what they are talking about or is just spewing bullshit and playing national security theatre.
Intel didn't "allow" VIA anything :). VIA acquired x86 tech from IDT (the WinChip Centaur garbage) in a fire sale. IDT didn't ask anyone about any licenses, and neither did Cyrix, NextGen, Transmeta, Rise nor NEC.
AFAIK the DoD wasn't the reason behind the original AMD second-source license; it was IBM forcing Intel's hand on the chips that went into the first PC.
Transmeta wasn't x86 internally but decoded x86 instructions. Retrobytes did a history of transmeta not too long ago and the idea was essentially to be able to be compatible with any cpu uarch. Alas by the time it shipped only x86 was relevant. https://www.youtube.com/watch?v=U2aQTJDJwd8
Actually, the reason Transmeta CPUs were so slow was that they didn't have an x86 instruction hardware decoder. Every code cache (IIRC it was only 32 MB) miss resulted in a micro-architectural trap which translated x86 instructions to the underlying uops in software.
I have a ten-year old Lenovo Yoga Tab 2 8" Windows tablet, which I still use at least once every week. It is still useful. Who can say that they are still using a ten-year old Android tablet?
Yeah, I got to say in our sound company inventory I still use a dozen 6-10 year old iPads with all the mixers. They run the apps at 30fps and still hold a charge all day.
I have tried one before, and surprisingly, it did not suck as much as most people claimed. I could even do light gaming (Warframe) on it with a reasonable frame rate (it's from about the 2015-2020 era). So it probably depends on the manufacturer (or the use case, though).
(Also probably because it is a tablet, so it has reasonably fast storage instead of the HDDs that notebooks had in that era.)
I had a Atom-based netbook (in the early days when they were 32-bit-only and couldn’t run up-to-date Windows). It didn’t suck, as such, but it was definitely resource-starved.
However, what I meant is Atom-based Android tablets. At about the same time as the netbook craze (late 2000s to early 2010s) there was a non-negligible number of Android tablets, and a noticeable fraction of them was not ARM- but Atom-based. (The x86 target in the Android SDK wasn’t only there to support emulators, originally.) Yet that stopped pretty quickly, and my impression is that that happened because, while Intel would certainly have liked to hitch itself to the Android train, they just couldn’t get Atoms fast enough at equivalent power levels (either at all or quickly enough). Could have been something else, e.g. perhaps they didn’t have the expertise to build SoCs with radios?
Either way, it’s not that Intel didn’t want to get into consumer mobile devices, it’s that they tried and did not succeed.
Android x86 devices suffer when developers include binary libraries and don't add x86. At the time of Intel's x86 for Android push, Google didn't have good apk thinning options, so app developers had to decide if they wanted to add x86 libraries for everyone so that a handful of tablets/phones would work properly... for the most part, many developers said no; even though many/most apps are tested on the android emulator that runs on x86 and probably have binary libraries available to work in that case.
IMHO, If Intel had done another year or two of trying, it probably would have worked, but they gave up. They also canceled x86 for phone like the day before the Windows Mobile Continuum demo, which would have been a potentially much more compelling product with x86, especially if Microsoft allowed running win32 apps (which they probably wouldn't, but the potential would be interesting)
Atom used an in-order execution model so its performance was always going to be lacking. Because it was in-order it had a much simpler decoder and much smaller die size, which meant you could cram the chipset and CPU onto a single die.
Atom wasn't about power efficiency or performance, it was about cost optimization.
After playing around with some ARM hardware I have to say that I don't care whether ARM is more efficient or not as long as the boot process remains the clusterfuck that it is today.
IMHO the major win of the IBM PC platform is that it standardized the boot process from the very beginning, first with the BIOS and later with UEFI, so you can grab any random ISO for any random OS and it will work. Meanwhile in the ARM world it seems that every single CPU board requires its own drivers, device tree, and custom OS build. RISC-V seems to suffer from the same problem, and until this problem is solved, I will avoid them like toxic waste.
ARM systems that support UEFI are pretty fun to work with. Then there's everything else. Anytime I hear the phrase "vendor kernel" I know I'm in for an experience...
C# is faster than Java because Java has no way to define custom value types/structs (last time I checked; I know there was some experimental work on this)
I'd be interested to hear someone with more experience talk about this or if there's more recent research, but in school I read this paper: <https://research.cs.wisc.edu/vertical/papers/2013/hpca13-isa...> that seems to agree that x86 and ARM as instruction sets do not differ greatly in power consumption. They also found that GCC picks RISC-like instructions when compiling for x86 which meant the number of micro-ops was similar between ARM and x86, and that the x86 chips were optimized well for those RISC-like instructions and so were similarly efficient to ARM chips. They have a quote that "The microarchitecture, not the ISA, is responsible for performance differences."
Instruction decode for variable length ISAs is inherently going to be more complex, and thus require more transistors = more power, than fixed length instruction decode, especially parallel decode. AFAIK modern x86 cores have to speculatively decode instructions to achieve this, compared to RISC ISAs where you know where all the instruction boundaries are and decoding N in parallel is a matter of instantiating N decoders that work in parallel. How much this determines the x86 vs ARM power gap, I don’t know, what’s much more likely is x86 designs have not been hyper optimized for power as much ARM designs have been over the last two decades. Memory order is another non-negligible factor, but again the difference is probably more attributable to the difference in goals between the two architectures for the vast majority of their lifespan, and the expertise and knowledge of the engineers working at each company.
IIRC There was a Jim Keller interview a few years ago where he said basically the same thing (I think it was from right around when he joined Tenstorrent?). The ISA itself doesn't matter, it's just instructions. The way the chip interprets those instructions is what makes the difference. ARM was designed from the beginning for low powered devices whereas x86 wasn't. If x86 is gonna compete with ARM (and RISC-V) then the chips are gonna need to also be optimized for low powered devices, but that can break decades of compatibility with older software.
There are two entities allowed to make x86_64 chips (and that only because AMD won the 64 bit ISA competition, otherwise there'd be only Intel). They get to choose.
The rest will use arm because that's all they have access to.
Oh, and x86_64 will be as power efficient as arm when one of the two entities will stop competing on having larger numbers and actually worry about power management. Maybe provide a ?linux? optimized for power consumption.
Unless you badly need SSE4 or AVX (and can't get around the somewhat questionable patent situation) anyone can make an x86_64 chip. And those patents are running out soon.
> Oh, and x86_64 will be as power efficient as arm when one of the two entities will stop competing on having larger numbers and actually worry about power management.
Both Intel and AMD provide runtime power control so this is tunable. The last ~10% of performance requires far more than 10% of the power.
The last 20% of the performance takes like >75% of the power with Zen 4 systems XD.
A Ryzen 9 7945HX mini pc I have achieves like ~80% of the all-core performance at 55W of my Ryzen 9 7950X desktop, which uses 225W for the CPU (admittedly, the defaults).
I think limiting the desktop CPU to 105W only dropped the performance by 10%. I haven't done that test in awhile because I was having some stability problems I couldn't be bothered to diagnose.
If you're measuring the draw at the wall, AFAIK desktop Ryzen keeps the chipset running at full power all the time and so even if the CPU is idle, it's hard to drop below, say, ~70W at the wall (including peripherals, fans, PSU efficiency etc).
Apparently desktop Intel is able to drop all the way down to under 10W on idle.
Sounds like your M2 is hitting the TDP max and the Ryzen box isn't.
Keep in mind there are Nvidia-designed chips (eg. Switch 2) that use all of ten watts when playing Cyberpunk 2077. Manufactured on Samsung's 8nm node, no less. It's a bit of a pre-beaten horse, but people aren't joking when they say Apple's GPU and CPU designs leave a lot of efficiency on the table.
Indeed, the memory model has a decent impact.
Unfortunately it's difficult to isolate in measurement.
Only Apple has support for weak memory order and TSO in the same hardware.
You’ll pry the ARM M series chips of my Mac from my cold dead hands. They’re a game changer in the space and one of the best reasons to use a Mac.
I am not a chip expert; it's just so night-and-day different using a Mac with an ARM chip compared to an Intel one, from thermals to performance and battery life and everything in between. Intel isn't even in the same ballpark imo.
But competition is good, and let's hope they both do well -- Intel and AMD -- because the consumer wins.
I have absolutely no doubt in my mind that if Apple's CPU engineers got half a decade and a mandate from the higher ups, they could make an amazing amd64 chip too.
That's not mostly because of a better ISA. If Intel and Apple had a chummier relationship you could imagine Apple licensing the Intel x86 ISA and the M series chips would be just as good but running x86. However I suspect no matter how chummy that relationship was, business is business and it is highly unlikely that Intel would give Apple such a license.
Apple did a ton of work on the power efficiency of iOS on their own ARM chips for iPhone for a decade before introducing the M1.
Since iOS and macOS share the same code base (even when they were on different architectures) it makes much more sense to simplify to a single chip architecture that they already had major expertise with and total control over.
There would be little to no upside for cutting Intel in on it.
Intel and AMD both sell quite a lot of customized chips, at least in the server space. As one example, any EC2 R7i or R7a instance you have are not running on a Sapphire Rapids or EPYC processor that you could buy, but instead one customized for AWS. I would presume that other cloud providers have similar deals worked out.
Genuinely asking -- what is it due to? Because like the person you're replying to, the m* processors are simply better: desktop-class perf on battery that hangs with chips with 250 watt TDP. I have to assume that amd and intel would like similar chips, so why don't they have them if not due to the instruction set? And AMD is using TSMC, so that can't be the difference.
I think the fundamental difference between an Apple CPU and an Intel/AMD CPU is Apple does not play in the megahertz war. The Apple M1 chip, launched in 2020 clocks at 3.2GHz; Intel and AMD can't sell a flagship mobile processor that clocks that low. Zen+ mobile Ryzen 7s released Jan 2019 have a boost clock of 4 GHz (ex: 3750H, 3700U); mobile Zen2 from Mar 2020 clock even higher (ex: 4900H at 4.4, 4800H at 4.2). Intel Tiger Lake was hitting 4.7 Ghz in 2020 (ex: 1165G7).
If you don't care to clock that high, you can reduce space and power requirements at all clocks; AMD does that for the Zen4c and Zen5c cores, but they don't (currently) ship an all compact core mobile processor. Apple can sell a premium branded CPU where there's no option to burn a lot of power to get a little faster; but AMD and Intel just can't, people may say they want efficiency, but having higher clocks is what makes an x86 processor premium.
In addition to the basic efficiency improvements you get by having a clock limit, Apple also utilizes wider execution; they can run more things in parallel, this is enabled to some degree by the lower clock rates, but also by the commitment to higher memory bandwidth via on package memory; being able to count on higher bandwidth means you can expect to have more operations that are waiting on execution rather than waiting on memory, so wider execution has more benefits. IIRC, Intel released some chips with on package memory, but they can't easily just drop in a couple more integer units onto an existing core.
The weaker memory model of ARM does help as well. The M series chips have a much wider out of order window, because they don't need to spend as much effort on ordering constraints (except when running in the x86 support mode); this also helps justify wider execution, because they can keep those units busy.
I think these three things are listed in order of impact, but I'm just an armchair computer architecture philosopher.
Does anyone actually care at all about frequencies? I care if my task finishes quickly. If it can finish quickly at a low frequency, fine. If the clock runs fast but the task doesn't, how is that a benefit?
My understanding is that both Intel and AMD are pushing high clocks not because it's what consumers want, but because it's the only lever they have to pull to get more gains. If this year's CPU is 2% faster than your current CPU, why would you buy it? So after they have their design they cover the rest of the target performance gain by cranking the clock, and that's how you get 200 W desktop CPUs.
>the commitment to higher memory bandwidth via on package memory; being able to count on higher bandwidth means you can expect to have more operations that are waiting on execution rather than waiting on memory, so wider execution has more benefits.
I believe you could make a PC (compatible) with unified memory and a 256-bit memory bus, but then you'd have to make the whole thing. Soldered motherboard, CPU/GPU, and RAM. I think at the time the M1 came out there weren't any companies making hardware like that. Maybe now that x86 handhelds are starting to come out, we may see laptops like that.
It's only recently that consumer software has become truly multithreaded; historically there were major issues with that. Remember the Bulldozer fiasco? AMD bet on parallel execution more than Intel did at the time, e.g. the same-price Intel chip was 4 cores while AMD had 8 (in the consumer market). Single-thread performance had been the deciding factor for decades. Even today AMD's outlier SKUs with a lot of cores and slightly lower frequencies (like 500 MHz lower or so) are not the topic of the day in any media or forum community. People talk about either the top-of-the-line SKU or something with a low core count but clocking high enough to be reasonable for lighter use. Releasing a low-frequency, high-core-count part for consumers would be greeted with questions like "what is this CPU for?".
Are we just going to pretend that frequency = single-thread performance? I'm fine with making that replacement mentally, I just want to confirm we're all on the same page here.
>Releasing low frequency high core count part for consumers would be greeted with questions, like "what for is this CPU?".
It's for homelab and SOHO servers. It won't get the same attention as the sexy parts... because it's not a sexy part. It's something you put in a box and stuff in a corner to chug away for ten years without looking at it again.
That's not really what we're talking about. Apple's cores are faster yet lower clocked. (Not just faster per clock but absolutely faster.) So some people are wondering if Intel/AMD targeting 6 GHz actually reduced performance.
But the OS has been able to take advantage of it since Mountain Lion with Grand Central Dispatch (I could be wrong about the code name). This makes doing parallel things very easy.
Parallelism is actually very difficult and libdispatch is not at all perfect for it. Swift concurrency is a newer design and gets better performance by being /less/ parallel.
(This is mostly because resolving priority inversions turns out to be very important on a phone, and almost no one designs for this properly because it's not important on servers.)
> Apple can sell a premium branded CPU where there's no option to burn a lot of power to get a little faster; but AMD and Intel just can't, people may say they want efficiency, but having higher clocks is what makes an x86 processor premium.
I think this is very context dependent. Is this a big, heavy 15”+ desktop replacement notebook where battery life was never going to be a selling point in the first place? One of those with a power brick that could be used as a dumbbell? Sure, push those clocks.
In a machine that’s more balanced or focused on portability however, high clock speeds do nothing but increase the likelihood of my laptop sounding like a jet and chewing through battery. In that situation higher clocks makes a laptop feel less premium because it’s worse at its core use case for practically no gain in exchange.
> I have to assume that amd and intel would like similar chips
They historically haven't. They've wanted the higher single-core performance and frequency and they've pulled out all the stops to get it. Everything had been optimized for this. (Also, they underinvested in their uncores, the nastiest part of a modern processor. Part of the reason AMD is beating Intel right now despite being overall very similar is their more recent and more reliable uncore design.)
They are now realizing that this was, perhaps, a mistake.
AMD is only now in a position to afford to invest otherwise (they chose quite well among the options actually available to them, in my opinion), but Intel has no such excuse.
- more advanced silicon architecture. Apple spends billions to get access to the latest generation a couple of years before AMD.
- world class team, with ~25 years of experience building high speed low power chips. (Apple bought PA Semi to make these chips, which was originally the team that built the DEC StrongARM). And then paid & treated them properly, unlike Intel & AMD
- a die budget to spend transistors for performance: the M chips are generally quite large compared to the competition
- ARM's weak memory model also helps, but it's very minor IMO compared to the above 3.
Sure, but they were there long enough to train and instill culture into the others. And of course, since the acquisition in 2008 they've had access to the top new grads and experienced engineers. If you're coming out top of your class at an Ivy or similar you're going to choose Apple over Intel or AMD both because of rep and the fact that your offer salary is much better.
P.S. hearsay and speculation, not direct experience. I haven't worked at Apple and anybody who has is pretty closed lip. You have to read between the lines.
P.P.S. It's sort of a circular argument. I say Apple has the best team because they have the best chip && they have the best chip because they have the best team.
But having worked (briefly) in the field, I'm very confident that their success is much more likely due to having the best team rather than anything else.
Intel and AMD are after the very high profit margins of the enterprise server market. They have much less motivation to focus on power efficient mobile chips which are less profitable for them.
Apple's primary product is consumer smartphones and tablets so they are solely focused on power efficient mobile chips.
Apple was willing to spend a lot of transistors on cache because they were optimizing the chips purely for mobile and can bury the extra cost in their expensive end products.
You will note that after the initial wins from putting stonking amounts of cache and memory bandwidth in place, Apple has not had any significant performance jump beyond the technology node improvements.
They aren't aiming for performance in the first place. It's a coincidence that it has good performance. They're aiming for high performance/power ratios.
Your Intel Mac was stuck in the past while everyone paying attention on PCs was already enjoying TSMC 7nm silicon in the form of AMD Zen processors.
Apple Silicon macs are far less impressive if you came from an 8c/16t Ryzen 7 laptop. Especially if you consider the Apple parts are consistently enjoying the next best TSMC node vs. AMD (e.g. 5nm (M1) vs. 7nm (Zen2))
What's _really_ impressive is how badly Intel fell behind and TSMC has been absolutely killing it.
And 20% or so of that difference is purely the fab node difference, not anything to do with the chip design itself. Strix Halo is a much better comparison, though Apple's M4 models do very well against it often besting it at the most expensive end.
On the flip side, if you look at servers... Compare a 128+core AMD server CPU vs a large core ARM option and AMD perf/watt is much better.
Basically yeah, if you compare CPUs from the same fab then it's basically the same.
It's just that Apple buys next-gen fabs while AMD and Intel have to be on the last gen, so the M computers people compare are always one fab generation ahead. It has very little to do with CPU architecture.
They do have some cool stuff in their CPUs, but the thing most people laud them for has to do with fabs.
There's another difference -- willingness to actually pay for silicon. The M1 Max is a 432 mm^2 laptop chip built on a 5 nm process. Contrast that to AMD's "high end" Ryzen 7 8845HS at 178 mm^2 on a 4 nm process. Even the M1 Pro at 245 mm^2 is bigger than this. More area means not just more peak performance, but the ability to use wider paths at lower speeds to maintain performance at lower power. 432 mm^2 is friggin' huge for a laptop part, and it's really hard to compete with what that can do on any metric besides price.
Apple's SOC does a bit more than AMD's, such as including the ssd controller. I don't know if Apple is grafting different nodes together for chiplets, etc compared to AMD on desktop.
The area has nothing to do with peak performance... based on the node, it has to do with the amount of components you can cram into a given space. The CRAY-1 cpu was massive compared to both of your examples, but doesn't come close to either in terms of performance.
Also, Ryzen AI Max+ 395 is top dog on the AMD mobile CPU front and is around 308mm^2 combined.
> The area has nothing to do with peak performance... based on the node, it has to do with the amount of components you can cram into a given space.
Of course it does. For single-threaded performance, the knobs I can turn are clockspeed (minimal area impact for higher speed standard cells, large power impact), core width (significant area impact for decoder, execution resources, etc, smaller power impact), and cache (huge area impact, smaller power impact). So if I want higher single-threaded performance on a power budget, area helps. And of course for multi-threaded performance the knobs I have are number of cores, number of memory controllers, and last-level cache size, all of which drive area. There's a reason Moore's law was so often interpreted as talking about performance and not transistor count -- transistor count gives you performance. If you're willing to build a 432 mm^2 chip instead of a 308 mm^2 chip iso-process, you're basically gaining a half-node of performance right there.
Isn't it you who is hyping up Apple here, when you don't even compare the two on a similar node? Compare a 5nm AMD low-power laptop CPU to the Apple M1 and the M1 no longer looks that much better at all.
I wouldn't discount what Apple has done... they've created and integrated some really good niche stuff in their CPUs to do more than typical ARM designs. The graphics cores are pretty good in their own right even. Not to mention the OS/Software integration including accelerated x86 and unified memory usage in practice.
AMD has done a LOT for parallelization and their server options are impressive... I mean, you're still talking 500W+ in total load, but that's across 128+ cores. Strix Halo scaling goes down impressively to the ~10-15W range under common usage, not as low as Apple does under similar loads but impressive in its own way.
How much of the Mac's impressive battery life is due purely to CPU efficiency, and how much is due to great vertical integration and the OS being tuned for power efficiency?
It's a genuine question; I'm sure both factors make a difference but I don't know their relative importance.
I just searched for the asahi linux (Linux for M Series Macs) battery life, and found this blog post [0].
> During active development with virtual machines running, a few calls, and an external keyboard and mouse attached, my laptop running Asahi Linux lasts about 5 hours before the battery drops to 10%. Under the same usage, macOS lasts a little more than 6.5 hours. Asahi Linux reports my battery health at 94%.
The overwhelming majority is due to the power management software, yes. Other ARM laptops do not get anywhere close to the same battery life. The MNT Reform with 8x 18650s (24000mAh, 3x what you get in an MBP) gets about 5h of battery life with light usage.
I think it would only be fair to compare it when running some more resource efficient system.
The Steam Deck with Windows 11 vs. SteamOS is a whole different experience. When running SteamOS and doing web surfing, the fan doesn't even really spin at all. But when running Windows 11 and doing the exact same thing, it spins all the time and the device gets kinda hot.
Since newer CPUs have heterogeneous cores (high performance + low power), I'm wondering if it makes sense to drop legacy instructions from the low power cores, since legacy code can still be run on the other cores. Then e.g. an OS compiled the right way can take advantage of extra efficiency without the CPU losing backwards compatibility
Like o11c says, that's setting everyone up for a bad time. If the heterogenous cores are similar, but don't all support all the instructions, it's too hard to use. You can build legacy instructions in a space optimized way though, but there's no reason not to do that for the high performance cores too --- if they're legacy instructions, one expects them not to run often and perf doesn't matter that much.
Intel dropped their x86-S proposal; but I guess something like that could work for low power cores. If you provide a way for a 64-bit OS to start application processors directly in 64-bit mode, you could setup low power cores so that they could only run in 64-bit mode. I'd be surprised if the juice is worth the squeeze, but it'd be reasonable --- it's pretty rare to be outside 64-bit mode, and systems that do run outside 64-bit mode probably don't need all the cores on a modern processor. If you're running in a 64-bit OS, it knows which processes are running in 32-bit mode, and could avoid scheduling them on reduced functionality cores; If you're running a 32-bit OS, somehow or another the OS needs to not use those cores... either the ACPI tables are different and they don't show up for 32-bit, init fails and the OS moves on, or the there is a firmware flag to hide them that must be set before running a 32-bit OS.
I don't really understand why the OS can't just trap the invalid instruction exception and migrate it to the P-core. E.g. AVX-512 and similar. For very old and rare instructions they can emulate them. We used to do that with FPU instructions on non-FPU enabled CPUs way back in the 80s and 90s.
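A hedged userspace approximation of that idea (the core numbering is a made-up assumption, and a real implementation would live in the kernel's scheduler): on SIGILL, restrict the thread to the cores that implement the instruction and return, so the faulting instruction is retried there.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>

#define NUM_P_CORES 8   // assumption: cores 0..7 are P-cores with the full ISA

static volatile sig_atomic_t already_migrated;

// For SIGILL the saved PC still points at the faulting instruction, so simply
// returning from the handler retries it; after sched_setaffinity the kernel
// resumes the thread on an allowed (P) core. If we already migrated and still
// fault, no core supports the instruction, so give up instead of looping.
static void on_sigill(int sig) {
    (void)sig;
    if (already_migrated)
        _exit(1);
    already_migrated = 1;
    cpu_set_t pcores;
    CPU_ZERO(&pcores);
    for (int c = 0; c < NUM_P_CORES; c++)
        CPU_SET(c, &pcores);
    sched_setaffinity(0, sizeof pcores, &pcores);  // 0 = calling thread
}

static void install_sigill_migration(void) {
    struct sigaction sa = {0};
    sa.sa_handler = on_sigill;
    sigaction(SIGILL, &sa, NULL);
}
```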
It's slow and annoying. What would cpuid report? If it says "yes I do AVX-512" then any old code might try to use it and get stuck on the P-cores forever even if it was only using it sparingly. If you say no then the software might never use it, so what was the benefit?
It's not impossible, but it'd be a pain in the butt. If you occasionally use some avx-512 infrequently, no big deal (but also not a big deal to just not use it). But if you use it a lot, all of a sudden your core count shrinks; you might rather run on all cores with avx2. You might even prefer to run avx-512 for cores that can and avx2 for those that can't ... but you need to be able to gather information on what cores support what, and pin your threads so they don't move. If you pull in a library, who knows what they do... lots of libraries assume they can call cpuid at load time and adjust... but now you need that per-thread.
That seems like a lot of change for OS, application, etc. If you run commercial applications, maybe they don't update unless you pay them for an upgrade, and that's a pain, etc.
Interesting but it would be pretty rough to implement. If you take a binary now and run it on a core without the correct instructions, it will SIGILL and probably crash. So you have these options:
Create a new compilation target
- You'll probably just end up running a lot of current x86 code exclusively on performance cores to a net loss. This is how RISC-V deals with optional extensions.
Emulate
- This already happens for some instructions but, like above, could quickly negate the benefits
Ask for permission
- This is what AVX code does now, the onus is on the programmer to check if the optional instructions can be used (a sketch of this pattern follows below). But you can't have many dropped instructions and expect anybody to use it.
Ask for forgiveness
- Run the code anyway and catch illegal instruction exceptions/signals, then move to a performance core. This would take some deep kernel surgery for support. If this happens remotely often it will stall everything and make your system hate you.
The last one raises the question: which instructions are we considering 'legacy'? You won't get far in an x86 binary before running into an instruction operating on memory that, in a RISC ISA, would mean first a load instruction, then the operation, then a store. Surely we can't drop those.
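For reference, the "ask for permission" option above usually looks like this hedged sketch (my example, with a stub standing in for a real AVX-512 kernel). The trouble the following comments describe is that the answer is assumed to hold for the life of the process, which breaks if a thread can later land on a core without AVX-512.

```c
#include <stddef.h>

static void sum_scalar(const float *a, size_t n, float *out) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += a[i];
    *out = s;
}

// Stand-in for a real AVX-512 kernel (which would use a target attribute and
// intrinsics); kept as a stub so the sketch stays self-contained.
static void sum_avx512(const float *a, size_t n, float *out) {
    sum_scalar(a, n, out);
}

void sum(const float *a, size_t n, float *out) {
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx512f")) {  // checked per call here, but in
        sum_avx512(a, n, out);                // practice usually cached once
        return;                               // at startup, which is the problem
    }
#endif
    sum_scalar(a, n, out);
}
```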
The "ask for permission" approach doesn't work because programs don't expect the capability of a CPU to change. If a program checked a minute ago that AVX512 is available, it certainly expects AVX512 to be continually available for the lifetime of the process. That means chaos if the OS is moving processes between performance and efficiency cores.
IIRC, there were several smartphone SoCs that dropped 32-bit ARM support from most but not all of their CPU cores. That was straightforward to handle because the OS knows which instruction set a binary wants to use. Doing anything more fine-grained would be a nightmare, as Intel found out with Alder Lake.
This is the flip side of Intel trying to drop AVX512 on their E cores in the 12th generation processors. It didn't work. It requires the OS to know which processes need AVX512 before they get run. And processes themselves use cpuid to determine the capability of processors and they don't expect it to change. So you basically must determine in advance which processes can be run on E cores and never migrate between cores.
What if the kernel handled unimplemented instruction faults by migrating the process to a core that does implement the instruction and restarting the faulting instruction?
Would this be more or less costly than a page fault? It seems like it would be easy to arrange for it to happen very rarely unless none of your cores support all the instructions.
We've seen CPU-capability differences by accident a few times, and it's always a chaotic mess leading to SIGILL.
The kernel would need to have a scheduler that knows it can't use those cores for certain tasks. Think about how hard you would have to work to even identify such a task ...
Current Windows or Linux executable formats don't even list the instructions used, though. And even if they were listed, what about dynamic libraries? The program may decide to load a library at any time it wishes, and the OS is not going to know what instructions may be used this time.
I think it is not really the execution units for simple instructions that take up much chip area on application-class CPUs these days, but everything around them.
I think support in the OS/runtime environment* would be more interesting for chips where some cores have larger execution units such as those for vector and matmul units. Especially for embedded / low power systems.
Maybe x87/MMX could be dropped though.
*. BTW. If you want to find research papers on the topic, a good search term is "partial-ISA migration".
That is quite a confession from AMD.
It's not x86 at all, just every implementation.
It is not like the ARM processors in Macs are simple any more, thats for sure.
There are a lot of theoretical articles which claim similar things but on the other hand we have a lot of empirical evidence that ARM CPUs are significantly more power efficient.
I used laptops with both Intel and AMD CPUs, and I read/watch a lot of reviews in the thin-and-light laptop space. Although AMD became more power efficient than Intel in the last few years, the AMD alternative is only marginally more efficient (like 5-10%). And AMD is using TSMC fabs.
On the other hand Qualcomm's recent Snapdragon X series CPUs are significantly more efficient then both Intel and AMD in most tests while providing the same performance or sometimes even better performance.
Some people mention the efficiency gains on Intel Lunar Lake as evidence that x86 is just as efficient, but Lunar Lake was still slightly behind in battery life and performance, while using a newer TSMC process node compared to Snapdragon X series.
So, even though I see theoretical articles like this, the empirical evidence says otherwise. Qualcomm will release their second generation Snapdragon X series CPUs this month. My guess is that the performance/efficiency gap with Intel and AMD will get even bigger.
A client CPU spends most of its life idling. Thus, the key to good battery life in client computing is, generally, idle power consumption. That means low core power draw at idle, but it also means shutting off peripherals that aren't in use, turning off clock sources for said peripherals, etc.
ARM was built for low-power embedded applications from the start, and thus low-power idle states are integrated into the architecture quite elegantly. x86, on the other hand, has the SMM, which was an afterthought.
AFAICT the case for x86 ~ ARM perf equivalence is based on the argument that instruction decode, while empirically less efficient on x86, is such a small portion of a modern, high-performance pipeline that it doesn't matter. This reasoning checks out IMO. But this effect would only be visible while the CPU is under load.
I'm glad an authoritative source has stated this. It's been ongoing BS for years. I first got into ARM machines with the Acorn Archimedes and even back then, people were spouting some kind of intrinsic efficiency benefit to ARM that just didn't make any sense.
The ISA is the contract or boundary between software and hardware. While there is a hardware cost to decode instructions, the question is how much?
As all the fanbois in the thread have pointed out, Apple's M series is fast and efficient compared to x86 for desktop/server workloads. What no one seems to acknowledge is that Apple's A series is also fast and efficient compared to other ARM implementations in mobile workloads. Apple sees the need to maintain M and A series CPUs for different workloads, which indicates there's a benefit to both.
This tells me the ISA decode hardware isn't or isn't the only bottleneck.
And yet... the world keeps proving Intel and AMD wrong on this premise with highly efficient ARM parts. Sure, there are bound to be improvements to make on x86, but ultimately it's a variable-length opcode encoding with a complex decoder path. If nothing else, this is likely a significant issue compared to the nicely word-aligned opcode encoding ARM has, and surely, given apples-to-apples core designs, the opcode decoding would be a deciding factor.
> its a variable length opcode encoding with a complex decoder path
In practice, the performance impact of variable length encoding is largely kept in check using predictors. The extra complexity in terms of transistors is comparatively small in a large, high-performance design.
Jim Keller has a storied career in the x86 world, so it isn't surprising he speaks fondly of it. Regardless:
>So fixed-length instructions seem really nice when you're building little baby computers, but if you're building a really big computer, to predict or to figure out where all the instructions are, it isn't dominating the die. So it doesn't matter that much.
Well, efficiency advantages are the domain of little baby computers. Better predictors give you deeper pipelines without stalls, which give you higher clock speeds - and higher wattages.
Have they reached apple m level of performance/watt after half a decade of the apple m parts being out yet? Do either AMD or Intel beat Apple in any metric in mobile?
The M4 is not on the same fab technology, so it's not comparable. If you want to discuss the validity of some CPU architecture, it needs to be between comparable fab technologies; the M4 being a generation ahead makes the comparison unfair.
If you compare like to like the difference almost completely disappears.
You mean how they all use TSMC N3 and the two Arm parts still beat out Lunar Lake? It's not like it's a whole generational node behind here. I get they aren't the exact same process node, but still.
This is an entirely uncontroversial take among experts in the space. x86 is an old CISC-y hot mess. RISC-V is a new-school hyper-academic hot mess. Recent ARM is actually pretty good. And none of it matters, because the uncore and the fabrication details (in particular, whether things have been tuned to run full speed demon or full power sipper) completely dominate the ISA.
In the past x86 didn't dominate in low power because Intel had the resources to care but never did, and AMD never had the resources to try. Other companies stepped in to full that niche, and had to use other ISAs. (If they could have used x86 legally, they might well have done so. Oops?) That may well be changing. Or perhaps AMD will let x86 fade away.
I remember reading this Jim Keller interview:
https://web.archive.org/web/20210622080634/https://www.anand...
Basically the gist of it is that the difference between ARM/x86 mostly boils down to instruction decode, and:
- Most instructions end up being simple load/store/conditional branch etc. on both architectures, where there's literally no difference in encoding efficiency
- Variable length instruction has pretty much been figured out on x86 that it's no longer a bottleneck
Also my personal addendum is that today's Intel efficiency cores are have more transistors and better perf than the big Intel cores of a decade ago
x86 decoding must be a pain - I vaguely remember that they have trace caches (a cache of decoded micro-operations) to skip decoding in some cases. You probably don't make such caches when decoding is easy.
Also, more complicated decoding and extra caches means longer pipeline, which means more price to pay when a branch is mispredicted (binary search is a festival of branch misprediction for example, and I got 3x acceleration of linear search on small arrays when I switched to the branchless algorithm).
Also I am not a CPU designer, but branch prediction with wide decoder also must be a pain - imagine that while you are loading 16 or 32 bytes from instruction cache, you need to predict the address of next loaded chunk in the same cycle, before you even see what you got from cache.
As for encoding efficiency, I played with little algorithms (like binary search or slab allocator) on godbolt, and RISC-V with compressed instruction generates similar amount of code as x86 - in rare cases, even slightly smaller. So x86 has a complex decoding that doesn't give any noticeable advantages.
x86 also has flags, which add implicit dependencies between instructions, and must make designer's life harder.
I was an instruction fetch unit (IFU) architect on P6 from 1992-1995. And yes, it was a pain, and we had close to 100x the test vectors of all the other units, going back to the mid 1980's. Once we started going bonkers with the prefixes, we just left the pre-Pentium decoder alone and added new functional blocks to handle those. And it wasn't just branch prediction that sucked, like you called out! Filling the instruction cache was a nightmare, keeping track of head and tail markers, coalescing, rebuilding, ... lots of parallel decoding to deal with cache and branch-prediction improvements to meet timing as the P6 core evolved was the typical solution. We were the only block (well, minus IO) that had to deal with legacy compatibility. Fortunately I moved on after the launch of Pentium II and thankfully did not have to deal with Pentium4/Northwood.
> x86 decoding must be a pain
So one of the projects I've been working on and off again is the World's Worst x86 Decoder, which takes a principled approach to x86 decoding by throwing out most of the manual and instead reverse-engineering semantics based on running the instructions themselves to figure out what they do. It's still far from finished, but I've gotten it to the point that I can spit out decoder rules.
As a result, I feel pretty confident in saying that x86 decoding isn't that insane. For example, here's the bitset for the first two opcode maps on whether or not opcodes have a ModR/M operand: ModRM=1111000011110000111100001111000011110000111100001111000011110000000000000000000000000000000000000011000001010000000000000000000011111111111111110000000000000000000000000000000000000000000000001100111100000000111100001111111100000000000000000000001100000011111100000000010011111111111111110000000011111111000000000000000011111111111111111111111111111111111111111111111111111110000011110000000000000000111111111111111100011100000111111111011110111111111111110000000011111111111111111111111111111111111111111111111
I haven't done a k-map on that, but... you can see that a boolean circuit isn't that complicated. Also, it turns out that this isn't dependent on presence or absence of any prefixes. While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle, which means the main limitation on the parallelism in the decoder is how wide you can build those muxes (which, to be fair, does have a cost).
That said, there is one instruction where I want to go back in time and beat up the x86 ISA designers. f6/0, f6/1, f7/0, and f7/1 [1] take in an extra immediate operand whereas f6/2 et al. do not. It's the sole case in the entire ISA where this happens.
[1] My notation for when x86 does its trick of using one of the register selector fields as extra bits for opcodes.
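As a software-level illustration of what a bitset like that buys you (a sketch, not the actual decoder): "does this opcode take a ModR/M byte?" collapses into a one-bit table lookup per opcode. The bitmap values below are placeholders, not the derived table above.

    #include <stdbool.h>
    #include <stdint.h>

    /* One bit per opcode of the one-byte map (256 bits = 4x 64-bit words).
     * Placeholder values; a real table would be filled in from a derivation
     * like the bitset quoted above. */
    static const uint64_t has_modrm_map[4] = { 0, 0, 0, 0 };

    static bool opcode_has_modrm(uint8_t opcode) {
        /* bit i answers the question for opcode i */
        return (has_modrm_map[opcode >> 6] >> (opcode & 63)) & 1;
    }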
> While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle
That's some very faint praise there. Especially when you're trying to chop up several instructions every cycle. Meanwhile RISC-V is "count leading 1s. 0-1:16bit 2-4:32bit 5:48bit 6:64bit"
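For reference, a minimal sketch of the rule being paraphrased, following the instruction-length encoding described in the RISC-V spec (the >=48-bit formats are still essentially reserved); `first` is the first 16-bit parcel of the instruction, read little-endian:

    #include <stdint.h>

    /* Length in bytes of a RISC-V instruction, judged from its first 16-bit parcel. */
    static unsigned rv_insn_length(uint16_t first) {
        if ((first & 0x03) != 0x03) return 2;  /* bits [1:0] != 11  -> 16-bit (compressed) */
        if ((first & 0x1c) != 0x1c) return 4;  /* bits [4:2] != 111 -> 32-bit */
        if ((first & 0x20) == 0)    return 6;  /* bit 5 clear       -> 48-bit */
        if ((first & 0x40) == 0)    return 8;  /* bit 6 clear       -> 64-bit */
        return 0;                              /* longer/reserved encodings */
    }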
The chopping up can happen the next cycle, in parallel across all the instructions in the cache line(s) that were fetched, and it can be pipelined so there's no loss in throughput. Since x86 instructions can be as small as one byte, in principle the throughput-per-cache-line can be higher on x86 than on RISC-V (e.g. a single 32-byte x86 cache line could have up to 32 instructions where the original RISC-V ISA might only have 8). And in any case, there are RISC-V extensions that allow variable-length instructions now, so they have to deal with the problem too.
Intel’s E cores decode x86 without a trace cache (μop cache), and are very efficient. The latest (Skymont) can decode 9 x86 instructions per cycle, more than the P core (which can only decode 8)
AMD isn’t saying that decoding x86 is easy. They are just saying that decoding x86 doesn’t have a notable power impact.
Does that really say anything about efficiency? Why can't they decode 100 instructions per cycle?
> Why can't they decode 100 instructions per cycle?
Well, obviously because there aren't 100 individual parallel execution units to which those instructions could be issued. And lower down the stack because a 3000 bit[1] wide cache would be extremely difficult to manage. An instruction fetch would be six (!) cache lines wide, causing clear latency and bottleneck problems (or conversely would demand your icache be 6x wider, causing locality/granularity problems as many leaf functions are smaller than that).
But also because real world code just isn't that parallel. Even assuming perfect branch prediction the number of instructions between unpredictable things like function pointer calls or computed jumps is much less than 100 in most performance-sensitive algorithms.
And even if you could, the circuit complexity of decoding variable length instructions is superlinear. In x86, every byte can be an instruction boundary, but most aren't, and your decoder needs to be able to handle that.
[1] I have in my head somewhere that "the average x86_64 instruction is 3.75 bytes long", but that may be off by a bit. Somewhere around that range, anyway.
Variable length decoding is more or less figured out, but it takes more design effort, transistors and energy. They cost, but not a lot, relatively, in a current state of the art super wide out-of-order CPU.
"Transistors are free."
That was pretty much the uArch/design mantra at intel.
Not a lot is not how I would describe it. Take a 64-bit piece of fetched data. On ARM64 you will just push that into two decoder blocks and be done with it. On x86 you've got what, a 1 to 15 byte range per instruction? I don't even want to think about the possible permutations, it's in the 10 ^ some two-digit number order.
You don't need all the permutations. If there are 32 bytes in a cache line then each instruction can only start at one of 32 possible positions. Then if you want to decode N instructions per cycle you need N 32-to-1 muxes. You can reduce the number of inputs to the later muxes since instructions can't be zero size.
Yes, but you're not describing it from the right position. Is instruction decode hard? Yes, if you think about it in isolation (also, fwiw, it's not a permutation problem as you suggest). But the core has a bunch of other stuff it needs to do that is far harder. Even your lowliest Pentium from 2000 can do instruction decode.
It's a lot for a decoder, but not for a whole core. Citation needed, but I remember that the decoder is about 10% of a Ryzen core's power budget, and of course that is with a few techniques better than complete brute force.
Apple’s ARM cores have wider decode than x86
M1 - 8 wide
M4 - 10 wide
Zen 4 - 4 wide
Zen 5 - 8 wide
pure decoder width isn't enough to tell you everything. X86 has some commonly used ridiculously compact instructions (e.g. lea) that would turn into 2-3 instructions on most other architectures.
Also the op cache - if it hits, the decoder is completely skipped.
The whole ModRM addressing encoding (to which LEA is basically a front end) is actually really compact, and compilers have gotten frighteningly good at exploiting it. Just look at the disassembly for some non-trivial code sometime and see what it's doing.
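A toy example of that compactness (my sketch; exact codegen varies by compiler): for the function below, x86-64 compilers typically fold the whole address computation into a single `lea rax, [rdi + rsi*8 + 16]`, while a fixed-length RISC ISA generally spends a shift and two adds.

    /* address = base + 8*i + 16 on LP64 targets */
    long *pick(long *base, long i) {
        return &base[i + 2];
    }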
Skymont - 9 wide
Wow, I had no idea we were up to 8 wide decoders in amd64 CPUs.
For variable vs fixed width, I have heard that fixed width is part of Apple Silicon's performance. There are literally gains to be had here for sure imho.
It's easier but it's not that important. It's more important for security - you can reinterpret variable length instructions by jumping inside them.
This matches my understanding as well, as someone who has a great deal of interest in the field but never worked in it professionally. CPUs all have a microarchitecture that doesn't look like the ISA at all, and they have an instruction decoder that translates one or more ISA instructions into zero or more microarchitectural instructions. There are some advantages to having a more regular ISA, such as the ability to more easily decode multiple instructions in parallel if they're all the same size or having to spend fewer transistors on the instruction decoder, but for the big superscalar chips we all have in our desktops and laptops and phones, the drawbacks are tiny.
I imagine that the difference is much greater for the tiny in-order CPUs we find in MCUs though, just because an amd64 decoder would be a comparatively much larger fraction of the transistor budget
Then there's mainframes, where you want code compiled in 1960 to run unmodified today. There was quite an advantage originally as well, as IBM was able to implement the same ISA with three different types and costs of computers.
uOps are kind of oversold in the CPU design mythos. They are not that different from the original ISA, and some x86 instructions (like lea) are both complex and natural fits for hardware so don't get microcoded.
>RISC-V is a new-school hyper-academic hot mess.
Yeah... Previously I was a big fan of RISC-V, but after I had to dig slightly deeper into it as a software developer my enthusiasm for it has cooled down significantly.
It's still great that we got a mainstream open ISA, but now I view it as a Linux of the hardware world, i.e. a great achievement, with a number of questionable choices baked in, which unfortunately stifles other open alternatives by virtue of being "good enough".
What choices? The main thing that comes to mind is the lack of exceptions on integer overflow, but you're unlikely to mean that.
- Handling of misaligned loads/stores: RISC-V got itself into a weird middle ground, ops on misaligned pointers may work fine, may work "extremely slow", or cause fatal exceptions (yes, I know about Zicclsm, it's extremely new and only helps with the latter, also see https://github.com/llvm/llvm-project/issues/110454). Other platforms either guarantee "reasonable" performance for such operations, or forbid misaligned access with "aligned" loads/stores and provide separate misaligned instructions. Arguably, RISC-V should've done the latter (with misaligned instructions defined in a separate higher-end extension), since passing unaligned pointer into an aligned instruction signals correctness problems in software.
- The hardcoded page size. 4 KiB is a good default for RV32, but arguably a huge missed opportunity for RV64.
- The weird restriction in the forward progress guarantees for LR/SC sequences, which forces compilers to compile `compare_exchange` and `compare_exchange_weak` in the absolutely same way. See this issue for more information: https://github.com/riscv/riscv-isa-manual/issues/2047
- The `seed` CSR: it does not provide good-quality entropy (i.e. after you have accumulated 256 bits of output, it may contain only 128 bits of randomness). You have to use a CSPRNG on top of it for any sensitive applications. Doing so may be inefficient and will bloat binary size (remember, the relaxed requirement was introduced for "low-powered" devices). Also, software developers may make mistakes in this area (not everyone is a security expert). Similar alternatives like RDRAND (x86) and RNDR (ARM) guarantee proper randomness and we can use their output directly for cryptographic keys with a very small code footprint.
- Extensions do not form hierarchies: it looks like the AVX-512 situation once again, but worse. Profiles help, but it's not a hierarchy, but a "packet". Also, there are annoyances like Zbkb not being a proper subset of Zbb.
- Detection of available extensions: we usually have to rely on the OS to query available extensions, since the `misa` register is accessible only in machine mode. This makes detection quite annoying for "universal" libraries which intend to support various OSes and embedded targets. The CPUID instruction (x86) is ideal in this regard. I totally disagree with the virtualization argument against it: nothing prevents a VM from intercepting the read, and no one expects huge performance from such reads. (A rough sketch of the Linux-side detection path follows below.)
And this list is compiled after a pretty surface-level dive into the RISC-V spec. I have heard about other issues (e.g. being unable to port tricky SIMD code to the V extension, or underspecification around memory coherence that matters for writing drivers), but I cannot confidently talk about those, so they're not part of my list.
P.S.: I would be interested to hear about other people's gripes with RISC-V.
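On the extension-detection gripe above, a rough sketch of the OS-mediated path on Linux, which exposes the single-letter base extensions as bits in the ELF auxiliary vector (one bit per letter). Multi-letter extensions like Zbb need the newer hwprobe interface instead, so treat this as one OS-specific path rather than a general solution:

    #include <stdio.h>
    #include <sys/auxv.h>

    int main(void) {
        unsigned long hwcap = getauxval(AT_HWCAP);
        /* Bit ('x' - 'a') corresponds to single-letter extension X. */
        printf("C (compressed): %s\n", (hwcap & (1UL << ('c' - 'a'))) ? "yes" : "no");
        printf("V (vector):     %s\n", (hwcap & (1UL << ('v' - 'a'))) ? "yes" : "no");
        return 0;
    }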
> - The hardcoded page size.
I'm pretty confident that this will get removed. It's an extension that made its way into RVA23, but once anyone has a design big enough for it to be a burden, it can be dropped.
Sounds like a job for RISC-6, or VI.
An annoying thing people have done since Apple Silicon is claim that its advantages were due to Arm.
No, not really. The advantage is Apple prioritizing efficiency, something Intel never cared enough about.
By prioritizing efficiency, Apple also prioritizes integration. The PC ecosystem prefers less integration (separate RAM, GPU, OS, etc) even at the cost of efficiency.
> By prioritizing efficiency, Apple also prioritizes integration. The PC ecosystem prefers less integration (separate RAM, GPU, OS, etc) even at the cost of efficiency.
People always say this but "integration" has almost nothing to do with it.
How do you lower the power consumption of your wireless radio? You have a network stack that queues non-latency sensitive transmissions to minimize radio wake-ups. But that's true for radios in general, not something that requires integration with any particular wireless chip.
How do you lower the power consumption of your CPU? Remediate poorly written code that unnecessarily keeps the CPU in a high power state. Again not something that depends on a specific CPU.
How much power is saved by soldering the memory or CPU instead of using a socket? A negligible amount if any; the socket itself has no significant power draw.
What Apple does well isn't integration, it's choosing (or designing) components that are each independently power efficient, so that then the entire device is. Which you can perfectly well do in a market of fungible components simply by choosing the ones with high efficiency.
In fact, a major problem in the Android and PC laptop market is that the devices are insufficiently fungible. You find a laptop you like where all the components are efficient except that it uses an Intel processor instead of the more efficient ones from AMD, but those components are all soldered to a system board that only takes Intel processors. Another model has the AMD APU but the OEM there chose poorly for the screen.
It's a mess not because the integration is poor but because the integration exists instead of allowing you to easily swap out the part you don't like for a better one.
> How much power is saved by soldering the memory or CPU instead of using a socket? A negligible amount if any; the socket itself has no significant power draw.
This isn't quite true. When the whole chip is idling at 1-2W, 0.1W of socket power is 10%. Some of Apple's integration almost certainly saves power (e.g. putting storage controllers for the SSD on the SOC, having tightly integrated display controllers, etc).
There's a critical instruction for Objective-C handling (I forget exactly what it is), and it's faster on Apple's chips than on Intel's, even in Rosetta 2's x86 emulation.
I believe it's the `lock xadd` instruction. It's faster when combined with x86 Total Store Ordering mode that the Rosetta emulation runs under.
Looking at objc_retain apparently it's a lock cmpxchg these days
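For anyone wanting to see what these boil down to in portable code, a minimal sketch (not Apple's actual source): a fetch-and-add on a refcount whose old value is used typically compiles to `lock xadd` on x86-64, and to an LL/SC loop or an LSE `ldadd` on ARM64.

    #include <stdatomic.h>

    /* Returns the previous count; using the return value is what makes
     * x86-64 compilers pick xadd rather than a plain locked add. */
    long retain(_Atomic long *refcount) {
        return atomic_fetch_add_explicit(refcount, 1, memory_order_relaxed);
    }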
Eh, probably the biggest difference is in the OS. The amount of time Linux or Windows will spend using a processor while completely idle can be a bit offensive.
It’s all of the above. One thing Apple excels at is actually using their hardware and software together whereas the PC world has a long history of one of the companies like Intel, Microsoft, or the actual manufacturer trying to make things better but failing to get the others on-board. You can in 2025 find people who disable power management because they were burned (hopefully not literally) by some combination of vendors slacking on QA!
One good example of this is RAM. Apple Silicon got some huge wins from lower latency and massive bandwidth, but that came at the cost of making RAM fixed and more expensive. A lot of PC users scoffed at the default RAM sizes until they actually used one and realized it was great at ~8GB less than the equivalent PC. That’s not magic or because Apple has some super elite programmers, it’s because they all work at the same company and nobody wants to go into Tim Cook’s office and say they blew the RAM budget and the new Macs need to cost $100 more. The hardware has compression support and the OS and app teams worked together to actually use it well, whereas it’s very easy to imagine Intel adding the feature but skimping on speed / driver stability, or Microsoft trying to implement it but delaying release for a couple years, or not working with third-party developers to optimize usage, etc. – nobody acting in bad faith but just what inevitably happens when everyone has different incentives.
In most cases, efficiency and performance are pretty synonymous for CPUs. The faster you can get work done (and turn off the silicon, which is admittedly a higher design priority for mobile CPUs) the more efficient you are.
The level of talent Apple has cannot be overstated; they have some true CPU design wizards. This level of efficiency cannot be achieved without making every aspect of the CPU as fast as possible; their implementation of the ARM ISA is incredible. Lots of companies make ARM chips, but none of them are at Apple's level of performance.
As a gross simplification, where the energy/performance tradeoff actually happens is after the design is basically baked. You crank up the voltage and clock speed to get more perf at the cost of efficiency.
> In most cases, efficiency and performance are pretty synonymous for CPUs. The faster you can get work done (and turn off the silicon, which is admittedly a higher design priority for mobile CPUs) the more efficient you are.
Somewhat yes, hurry up and wait can be more efficient than running slow the whole time. But at the top end of Intel/AMD performance, you pay a lot of watts to get a little performance. Apple doesn't offer that on their processors, and when they were using Intel processors, they didn't provide thermal support to run in that mode for very long either.
The M series bakes in a lower clockspeed cap than contemporary Intel/AMD chips; you can't run in the clock regime where you spend a lot of watts and get a little bit more performance.
Nitpick: uncore and the fabrication details dominate the ISA on high end/superscalar architectures (because modern superscalar basically abstract the ISA away at the frontend). On smaller (i. e. MCU) cores x86 will never stand any chance.
Not that it stopped Intel trying - https://en.m.wikipedia.org/wiki/Intel_Quark
I'd love to see what would happen if AMD put out a chip with the instruction decoders swapped out for risc-v instruction decoders
Fwiw the https://en.wikipedia.org/wiki/AMD_Am29000 RISC CPU and the https://en.wikipedia.org/wiki/AMD_K5 are a good example of this. As in AMD took their existing RISC CPU to make the K5 x86 CPU.
Almost the same in die shots except the K5 had more transistors for the x86 decoding. The AM29000's instruction set is actually very close to RISC-V too!
Very hard to find benchmarks comparing the two directly though.
TIL the k5 was RISC. thank you
Indeed. We don't need it, but I want it for perfectionist aesthetic completion.
I have a hard time believing this fully: more custom instructions, more custom hardware, more heat.
How can you avoid it?
Since the Pentium Pro, the hardware hasn't implemented the ISA directly; instructions are converted into micro-ops.
Come on, you know what I meant :)
If you want to support AVX e.g. you need 512bit (or 256) wide registers, you need dedicated ALUs, dedicated mask registers etc.
Ice Lake has implemented SHA-specific hardware units in 2019.
ARM has instructions for SHA, AES, vectors, etc too. Pretty much have to pay the cost if you want the perf.
sure, but Arm has Neon/SVE which impose basically the same requirements for vector instructions, and most high performance Arm implementations have a wide suite of crypto instructions (e.g. Apple's M series chips have AES, SHA1 and SHA256 instructions)
The computation has to be done somehow, I don't know that it is a given that more available instructions means more heat.
VIA used to make low power x86 processors
Fun fact. The idea of strong national security is the reason why there are three companies with access to the x86 ISA.
DoD originally required all products to be sourced by at least three companies to prevent supply chain issues. This required Intel to allow AMD and VIA to produce products based on the ISA.
For me this is a good indicator of whether someone who talks about national security knows what they are talking about or is just spewing bullshit and playing national security theatre.
Intel didn't "allow" VIA anything :). VIA acquired x86 tech from IDT (WinChip Centaur garbage) in a fire sale. IDT didn't ask anyone about any licenses, and neither did Cyrix, NextGen, Transmeta, Rise nor NEC.
Afaik DoD wasn't the reason behind the original AMD second-source license; it was IBM forcing Intel's hand on the chips that went into the first PC.
And Transmeta…
Transmeta wasn't x86 internally but decoded x86 instructions. Retrobytes did a history of Transmeta not too long ago, and the idea was essentially to be able to be compatible with any CPU uarch. Alas, by the time it shipped only x86 was relevant. https://www.youtube.com/watch?v=U2aQTJDJwd8
Actually, the reason Transmeta CPUs were so slow was that they didn't have an x86 instruction hardware decoder. Every code cache (IIRC it was only 32 MB) miss resulted in a micro-architectural trap which translated x86 instructions to the underlying uops in software.
> x86 didn't dominate in low power because Intel had the resources to care but never did
Remember Atom tablets (and how they sucked)?
The Eee PC was a hit. Its successors still make excellent cheap long-life laptops, even if not as performant as Apple's.
That's the point. Early Atom wasn't designed with care but the newer E-cores are quite efficient because they put more effort in.
You mean Atom tablets running Android ?
I have a ten-year old Lenovo Yoga Tab 2 8" Windows tablet, which I still use at least once every week. It is still useful. Who can say that they are still using a ten-year old Android tablet?
I still use my 2015 Kindle Fire (which runs Android) for ebooks and light web browsing.
My iPad Mini 4 turns 10 in a month.
Yeah, I got to say in our sound company inventory I still use a dozen 6-10 year old iPads with all the mixers. They run the apps at 30fps and still hold a charge all day.
I have tried one before. And surprisingly, it did not suck as much as most people claimed. I could even do light gaming (Warframe) on it with a reasonable frame rate. (This was in roughly the 2015-2020 era.) So it probably depends on the manufacturer (or use case).
(Also probably because it is a tablet, so it has reasonably fast storage instead of the HDDs notebooks had in that era.)
They sucked because Intel didn't care.
> how they sucked
Care to elaborate? I had the 9" mini laptop kind of device based on Atom and don't remember Atom being the issue.
I had an Atom-based netbook (in the early days when they were 32-bit-only and couldn't run up-to-date Windows). It didn't suck, as such, but it was definitely resource-starved.
However, what I meant is Atom-based Android tablets. At about the same time as the netbook craze (late 2000s to early 2010s) there was a non-negligible number of Android tablets, and a noticeable fraction of them was not ARM- but Atom-based. (The x86 target in the Android SDK wasn’t only there to support emulators, originally.) Yet that stopped pretty quickly, and my impression is that that happened because, while Intel would certainly have liked to hitch itself to the Android train, they just couldn’t get Atoms fast enough at equivalent power levels (either at all or quickly enough). Could have been something else, e.g. perhaps they didn’t have the expertise to build SoCs with radios?
Either way, it’s not that Intel didn’t want to get into consumer mobile devices, it’s that they tried and did not succeed.
Android x86 devices suffer when developers include binary libraries and don't add x86. At the time of Intel's x86 for Android push, Google didn't have good apk thinning options, so app developers had to decide if they wanted to add x86 libraries for everyone so that a handful of tablets/phones would work properly... for the most part, many developers said no; even though many/most apps are tested on the android emulator that runs on x86 and probably have binary libraries available to work in that case.
IMHO, If Intel had done another year or two of trying, it probably would have worked, but they gave up. They also canceled x86 for phone like the day before the Windows Mobile Continuum demo, which would have been a potentially much more compelling product with x86, especially if Microsoft allowed running win32 apps (which they probably wouldn't, but the potential would be interesting)
It got a lot better. First few generations were dog-slow, although they did work.
Atom used an in-order execution model, so its performance was always going to be lacking. Because it was in-order it had a much simpler decoder and a much smaller die size, which meant you could cram the chipset and CPU onto a single die.
Atom wasn't about power efficiency or performance, it was about cost optimization.
I had an Atom-based Android phone (Razr-i) and it was fine.
Were they running windows or android?
After playing around with some ARM hardware I have to say that I don't care whether ARM is more efficient or not as long as the boot process remains the clusterfuck that it is today.
IMHO the major win of the IBM PC platform is that it standardized the boot process from the very beginning, first with the BIOS and later with UEFI, so you can grab any random ISO for any random OS and it will work. Meanwhile in the ARM world it seems that every single CPU board requires its own drivers, device tree, and custom OS build. RISC-V seems to suffer from the same problem, and until this problem is solved, I will avoid them like toxic waste.
ARM systems that support UEFI are pretty fun to work with. Then there's everything else. Anytime I hear the phrase "vendor kernel" I know I'm in for an experience...
Of course, because saying that X ISA is faster than Y ISA is like saying that Java syntax is faster than C# syntax
Everything is about the implementation: compiler, JIT, runtime/VM, stdlib, etc.
https://chipsandcheese.com/p/arm-or-x86-isa-doesnt-matter
C# syntax is faster than Java because Java has no way to define custom value types/structs (last time I checked, I know there was some experimental work on this)
and yet there's more Java in HFT than C#
And don't get me wrong, I'm a C# fanboi that'd never touch Java, but the JVM itself is impressive as hell,
so even despite not (yet) having value types/structs, Java is still very strong due to the JVM (the implementation). Valhalla should push it even further.
HFT code is unusual in the way it is used. A lot of work goes into avoiding the Garbage Collection and other JVM overheads.
I'd be interested to hear someone with more experience talk about this or if there's more recent research, but in school I read this paper: <https://research.cs.wisc.edu/vertical/papers/2013/hpca13-isa...> that seems to agree that x86 and ARM as instruction sets do not differ greatly in power consumption. They also found that GCC picks RISC-like instructions when compiling for x86 which meant the number of micro-ops was similar between ARM and x86, and that the x86 chips were optimized well for those RISC-like instructions and so were similarly efficient to ARM chips. They have a quote that "The microarchitecture, not the ISA, is responsible for performance differences."
I was just window-shopping laptops this morning, and realized ARM-based doesn't necessarily hold battery life advantages.
You mean Windows-based ARM laptops or Macbooks?
Plus they are not cheap either.
Instruction decode for variable-length ISAs is inherently going to be more complex, and thus require more transistors = more power, than fixed-length instruction decode, especially parallel decode. AFAIK modern x86 cores have to speculatively decode instructions to achieve this, compared to RISC ISAs where you know where all the instruction boundaries are and decoding N in parallel is a matter of instantiating N decoders that work in parallel. How much this determines the x86 vs ARM power gap, I don't know; what's much more likely is that x86 designs have not been hyper-optimized for power as much as ARM designs have been over the last two decades. Memory order is another non-negligible factor, but again the difference is probably more attributable to the difference in goals between the two architectures for the vast majority of their lifespan, and to the expertise and knowledge of the engineers working at each company.
IIRC There was a Jim Keller interview a few years ago where he said basically the same thing (I think it was from right around when he joined Tenstorrent?). The ISA itself doesn't matter, it's just instructions. The way the chip interprets those instructions is what makes the difference. ARM was designed from the beginning for low powered devices whereas x86 wasn't. If x86 is gonna compete with ARM (and RISC-V) then the chips are gonna need to also be optimized for low powered devices, but that can break decades of compatibility with older software.
https://chipsandcheese.com/p/arm-or-x86-isa-doesnt-matter
It's probably from the Lex Fridman podcast he did. And to be fair, he didn't say "it doesn't matter", he said "it's not that important".
Irrelevant.
There are two entities allowed to make x86_64 chips (and that only because AMD won the 64 bit ISA competition, otherwise there'd be only Intel). They get to choose.
The rest will use arm because that's all they have access to.
Oh, and x86_64 will be as power efficient as arm when one of the two entities will stop competing on having larger numbers and actually worry about power management. Maybe provide a ?linux? optimized for power consumption.
Unless you badly need SSE4 or AVX (and can't get around the somewhat questionable patent situation) anyone can make an x86_64 chip. And those patents are running out soon.
> Oh, and x86_64 will be as power efficient as arm when one of the two entities will stop competing on having larger numbers and actually worry about power management.
Both Intel and AMD provide runtime power control so this is tunable. The last ~10% of performance requires far more than 10% of the power.
AMD do handle power consumption well, at least if you run in eco mode instead of pushing the CPU to its limits. I always turn eco mode on on modern Ryzens.
I have a Ryzen box that I temperature limited to 65 C indeed. That was about 100 W in my office with just the graphics integrated into the Ryzen.
However, next to it there's a M2 mac mini that uses all of 37 W when I'm playing Cyberpunk 2077 so...
> Both Intel and AMD provide runtime power control so this is tunable. The last ~10% of performance requires far more than 10% of the power.
Yes but the defaults are insane.
The last 20% of the performance takes like >75% of the power with Zen 4 systems XD.
A Ryzen 9 7945HX mini pc I have achieves like ~80% of the all-core performance at 55W of my Ryzen 9 7950X desktop, which uses 225W for the CPU (admittedly, the defaults).
I think limiting the desktop CPU to 105W only dropped the performance by 10%. I haven't done that test in a while because I was having some stability problems I couldn't be bothered to diagnose.
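Back-of-the-envelope with the numbers above: ~0.8x the throughput at 55 W versus 1.0x at 225 W works out to roughly (0.8/55) / (1.0/225) ≈ 3.3x better performance per watt for the power-limited part, before any further eco-mode or undervolting tuning.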
If you're measuring the draw at the wall, AFAIK desktop Ryzen keeps the chipset running at full power all the time and so even if the CPU is idle, it's hard to drop below, say, ~70W at the wall (including peripherals, fans, PSU efficiency etc).
Apparently desktop Intel is able to drop all the way down to under 10W on idle.
Sounds like your M2 is hitting the TDP max and the Ryzen box isn't.
Keep in mind there are Nvidia-designed chips (eg. Switch 2) that use all of ten watts when playing Cyberpunk 2077. Manufactured on Samsung's 8nm node, no less. It's a bit of a pre-beaten horse, but people aren't joking when they say Apple's GPU and CPU designs leave a lot of efficiency on the table.
Cyberpunk 2077 is very GPU bound, so it's not really about CPU there. I'm playing it using 7900 XTX on Linux :)
But yeah, defaults are set to look better in benchmarks and they are not worth it. Eco mode should be the default.
From what I have heard it's not the RISCy ISA per se, it's largely arm's weaker memory model.
I'd be happy to be corrected, but the empirical core counts seem to agree.
Indeed, the memory model has a decent impact. Unfortunately it's difficult to isolate in measurement. Only Apple has support for weak memory order and TSO in the same hardware.
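To make the memory-model point concrete, here's the classic message-passing litmus test (a sketch, not from the article). x86-TSO never reorders the two stores or the two loads, so the compiled x86 code can't observe ready == 1 together with data == 0 even with relaxed atomics; ARM's weaker model can, which is why portable code needs release/acquire there and why Rosetta runs translated code in a TSO mode.

    #include <stdatomic.h>

    _Atomic int data = 0, ready = 0;

    void producer(void) {
        atomic_store_explicit(&data, 42, memory_order_relaxed);
        atomic_store_explicit(&ready, 1, memory_order_relaxed);
    }

    int consumer(void) {
        while (!atomic_load_explicit(&ready, memory_order_relaxed))
            ;
        /* The C standard allows 0 here for relaxed atomics on any target;
         * the point is what the hardware does with the straightforwardly
         * compiled code: TSO forbids the reordering, ARM does not. */
        return atomic_load_explicit(&data, memory_order_relaxed);
    }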
You’ll pry the ARM M series chips of my Mac from my cold dead hands. They’re a game changer in the space and one of the best reasons to use a Mac.
I am not a chip expert it’s just so night and day different using a Mac with an arm chip compared to an Intel one from thermals to performance and battery life and everything in between. Intel isn’t even in the same ballpark imo.
But competition is good, so let's hope they both do well -- Intel and AMD -- because the consumer wins.
I have absolutely no doubt in my mind that if Apple's CPU engineers got half a decade and a mandate from the higher ups, they could make an amazing amd64 chip too.
That's not mostly because of a better ISA. If Intel and Apple had a chummier relationship you could imagine Apple licensing the Intel x86 ISA and the M series chips would be just as good but running x86. However I suspect no matter how chummy that relationship was, business is business and it is highly unlikely that Intel would give Apple such a license.
It's pretty difficult to imagine.
Apple did a ton of work on the power efficiency of iOS on their own ARM chips for iPhone for a decade before introducing the M1.
Since iOS and macOS share the same code base (even when they were on different architectures) it makes much more sense to simplify to a single chip architecture that they already had major expertise with and total control over.
There would be little to no upside for cutting Intel in on it.
Isn't it also easier to license ARM, because that's the whole point of the ARM Corporation.
It's not like Intel or AMD are known for letting other customize their existing chip designs.
Apple was a very early investor in ARM and is one of the few with a perpetual license of ARM tech
And an architect license that lets them modify the ISA I believe
Intel and AMD both sell quite a lot of customized chips, at least in the server space. As one example, any EC2 R7i or R7a instance you have are not running on a Sapphire Rapids or EPYC processor that you could buy, but instead one customized for AWS. I would presume that other cloud providers have similar deals worked out.
> That's not mostly because of a better ISA
Genuinely asking -- what is it due to? Because like the person you're replying to, the m* processors are simply better: desktop-class perf on battery that hangs with chips with 250 watt TDP. I have to assume that amd and intel would like similar chips, so why don't they have them if not due to the instruction set? And AMD is using TSMC, so that can't be the difference.
I think the fundamental difference between an Apple CPU and an Intel/AMD CPU is Apple does not play in the megahertz war. The Apple M1 chip, launched in 2020 clocks at 3.2GHz; Intel and AMD can't sell a flagship mobile processor that clocks that low. Zen+ mobile Ryzen 7s released Jan 2019 have a boost clock of 4 GHz (ex: 3750H, 3700U); mobile Zen2 from Mar 2020 clock even higher (ex: 4900H at 4.4, 4800H at 4.2). Intel Tiger Lake was hitting 4.7 Ghz in 2020 (ex: 1165G7).
If you don't care to clock that high, you can reduce space and power requirements at all clocks; AMD does that for the Zen4c and Zen5c cores, but they don't (currently) ship an all compact core mobile processor. Apple can sell a premium branded CPU where there's no option to burn a lot of power to get a little faster; but AMD and Intel just can't, people may say they want efficiency, but having higher clocks is what makes an x86 processor premium.
In addition to the basic efficiency improvements you get by having a clock limit, Apple also utilizes wider execution; they can run more things in parallel, this is enabled to some degree by the lower clock rates, but also by the commitment to higher memory bandwidth via on package memory; being able to count on higher bandwidth means you can expect to have more operations that are waiting on execution rather than waiting on memory, so wider execution has more benefits. IIRC, Intel released some chips with on package memory, but they can't easily just drop in a couple more integer units onto an existing core.
The weaker memory model of ARM does help as well. The M series chips have a much wider out of order window, because they don't need to spend as much effort on ordering constraints (except when running in the x86 support mode); this also helps justify wider execution, because they can keep those units busy.
I think these three things are listed in order of impact, but I'm just an armchair computer architecture philosopher.
Does anyone actually care at all about frequencies? I care if my task finishes quickly. If it can finish quickly at a low frequency, fine. If the clock runs fast but the task doesn't, how is that a benefit?
My understanding is that both Intel and AMD are pushing high clocks not because it's what consumers want, but because it's the only lever they have to pull to get more gains. If this year's CPU is 2% faster than your current CPU, why would you buy it? So after they have their design they cover the rest of the target performance gain by cranking the clock, and that's how you get 200 W desktop CPUs.
>the commitment to higher memory bandwidth via on package memory; being able to count on higher bandwidth means you can expect to have more operations that are waiting on execution rather than waiting on memory, so wider execution has more benefits.
I believe you could make a PC (compatible) with unified memory and a 256-bit memory bus, but then you'd have to make the whole thing. Soldered motherboard, CPU/GPU, and RAM. I think at the time the M1 came out there weren't any companies making hardware like that. Maybe now that x86 handhelds are starting to come out, we may see laptops like that.
It's only recently that consumer software has become truly multithreaded; historically there were major issues with that. Remember the Bulldozer fiasco? They bet on parallel execution more than Intel did at the same time, e.g. the same-price Intel chip was 4-core while AMD had 8 cores (consumer market). Single-thread performance had been the deciding factor for decades. Even today AMD's outlier SKUs with a lot of cores and slightly lower frequencies (like 500 MHz lower or so) are not the topic of the day in any media or forum community. People talk about either the top-of-the-line SKU or something with a low core count but clocking high enough to be reasonable for lighter use. Releasing low frequency high core count part for consumers would be greeted with questions, like "what for is this CPU?".
Are we just going to pretend that frequency = single-thread performance? I'm fine with making that replacement mentally, I just want to confirm we're all on the same page here.
>Releasing low frequency high core count part for consumers would be greeted with questions, like "what for is this CPU?".
It's for homelab and SOHO servers. It won't get the same attention as the sexy parts... because it's not a sexy part. It's something put in a box and stuff in a corner to chug away for ten years without looking at it again.
> low frequency high core count part for consumers
That's not really what we're talking about. Apple's cores are faster yet lower clocked. (Not just faster per clock but absolutely faster.) So some people are wondering if Intel/AMD targeting 6 GHz actually reduced performance.
But the OS has been able to take advantage of it since Mountain Lion with Grand Central Dispatch (I could be wrong about the code name). This makes doing parallel things very easy.
But most every OS can.
Parallelism is actually very difficult and libdispatch is not at all perfect for it. Swift concurrency is a newer design and gets better performance by being /less/ parallel.
(This is mostly because resolving priority inversions turns out to be very important on a phone, and almost no one designs for this properly because it's not important on servers.)
> Apple can sell a premium branded CPU where there's no option to burn a lot of power to get a little faster; but AMD and Intel just can't, people may say they want efficiency, but having higher clocks is what makes an x86 processor premium.
I think this is very context dependent. Is this a big, heavy 15”+ desktop replacement notebook where battery life was never going to be a selling point in the first place? One of those with a power brick that could be used as a dumbbell? Sure, push those clocks.
In a machine that’s more balanced or focused on portability however, high clock speeds do nothing but increase the likelihood of my laptop sounding like a jet and chewing through battery. In that situation higher clocks makes a laptop feel less premium because it’s worse at its core use case for practically no gain in exchange.
> I have to assume that amd and intel would like similar chips
They historically haven't. They've wanted the higher single-core performance and frequency and they've pulled out all the stops to get it. Everything had been optimized for this. (Also, they underinvested in their uncores, the nastiest part of a modern processor. Part of the reason AMD is beating Intel right now despite being overall very similar is their more recent and more reliable uncore design.)
They are now realizing that this was, perhaps, a mistake.
AMD is only now in a position to afford to invest otherwise (they chose quite well among the options actually available to them, in my opinion), but Intel has no such excuse.
Not arguing, but I would think there is (and always has been) very wide demand for fastest single core perf. From all the usual suspects?
Thank you.
What's it due to? At least this, probably more.
- more advanced silicon architecture. Apple spends billions to get access to the latest generation a couple of years before AMD.
- world class team, with ~25 years of experience building high speed low power chips. (Apple bought PA Semi to make these chips; that was originally the team that built the DEC StrongARM.) And then paid & treated them properly, unlike Intel & AMD
- a die budget to spend transistors for performance: the M chips are generally quite large compared to the competition
- ARM's weak memory model also helps, but it's very minor IMO compared to the above 3.
> And then paid & treated them properly, unlike Intel & AMD
Relatively properly. Nothing like the pampering software people get. I've heard Mr. Srouji is very strict about approving promotions personally etc.
(…by heard I mean I read Blind posts)
How many of those engineers remain, didn't a lot go to Nuvia that was then bought by Qualcomm?
Sure, but they were there long enough to train and instill culture into the others. And of course, since the acquisition in 2008 they've had access to the top new grads and experienced engineers. If you're coming out top of your class at an Ivy or similar you're going to choose Apple over Intel or AMD both because of rep and the fact that your offer salary is much better.
P.S. hearsay and speculation, not direct experience. I haven't worked at Apple and anybody who has is pretty closed lip. You have to read between the lines.
P.P.S. It's sort of a circular argument. I say Apple has the best team because they have the best chip && they have the best chip because they have the best team.
But having worked (briefly) in the field, I'm very confident that their success is much more likely due to having the best team rather than anything else.
interesting, ty
re: apple getting exclusive access to the best fab stuff: https://appleinsider.com/articles/23/08/07/apple-has-sweethe... . Interesting.
Intel and AMD are after the very high profit margins of the enterprise server market. They have much less motivation to focus on power efficient mobile chips which are less profitable for them.
Apple's primary product is consumer smartphones and tablets so they are solely focused on power efficient mobile chips.
> Genuinely asking -- what is it due to?
Mostly memory/cache subsystem.
Apple was willing to spend a lot of transistors on cache because they were optimizing the chips purely for mobile and can bury the extra cost in their expensive end products.
You will note that after the initial wins from putting stonking amounts of cache and memory bandwidth in place, Apple has not had any significant performance jump beyond the technology node improvements.
They aren't aiming for performance in the first place. It's a coincidence that it has good performance. They're aiming for high performance/power ratios.
Your Intel mac was stuck in the past while everyone paying attention on PCs were already enjoying TSMC 7nm silicon in the form of AMD Zen processors.
Apple Silicon macs are far less impressive if you came from an 8c/16t Ryzen 7 laptop. Especially if you consider the Apple parts are consistently enjoying the next best TSMC node vs. AMD (e.g. 5nm (M1) vs. 7nm (Zen2))
What's _really_ impressive is how badly Intel fell behind and TSMC has been absolutely killing it.
Sure, but anything that Ryzen laptop chip can do, the Apple chip will just do it at a higher perf/watt... and on a laptop that's a key metric.
And 20% or so of that difference is purely the fab node difference, not anything to do with the chip design itself. Strix Halo is a much better comparison, though Apple's M4 models do very well against it often besting it at the most expensive end.
On the flip side, if you look at servers... Compare a 128+core AMD server CPU vs a large core ARM option and AMD perf/watt is much better.
Wait are you saying the diff in perf per watt from apple arm to x86 is purely on fab leading edge ness?
Basically yeah, if you compare CPUs from the same fab then it's basically the same.
It's just that Apple buys next-gen fabs while AMD and Intel have to be on the last gen, so the M computers people compare are always one fab generation ahead. It has very little to do with CPU architecture.
They do have some cool stuff in their CPUs, but the thing most people laud them for has to do with fabs.
There's another difference -- willingness to actually pay for silicon. The M1 Max is a 432 mm^2 laptop chip built on a 5 nm process. Contrast that to AMD's "high end" Ryzen 7 8845HS at 178 mm^2 on a 4 nm process. Even the M1 Pro at 245 mm^2 is bigger than this. More area means not just more peak performance, but the ability to use wider paths at lower speeds to maintain performance at lower power. 432 mm^2 is friggin' huge for a laptop part, and it's really hard to compete with what that can do on any metric besides price.
Apple's SOC does a bit more than AMD's, such as including the ssd controller. I don't know if Apple is grafting different nodes together for chiplets, etc compared to AMD on desktop.
The area has nothing to do with peak performance... based on the node, it has to do with the amount of components you can cram into a given space. The CRAY-1 cpu was massive compared to both of your examples, but doesn't come close to either in terms of performance.
Also, Ryzen AI Max+ 395 is top dog on the AMD mobile CPU front and is around 308mm^2 combined.
> The area has nothing to do with peak performance... based on the node, it has to do with the amount of components you can cram into a given space.
Of course it does. For single-threaded performance, the knobs I can turn are clockspeed (minimal area impact for higher speed standard cells, large power impact), core width (significant area impact for decoder, execution resources, etc, smaller power impact), and cache (huge area impact, smaller power impact). So if I want higher single-threaded performance on a power budget, area helps. And of course for multi-threaded performance the knobs I have are number of cores, number of memory controllers, and last-level cache size, all of which drive area. There's a reason Moore's law was so often interpreted as talking about performance and not transistor count -- transistor count gives you performance. If you're willing to build a 432 mm^2 chip instead of a 308 mm^2 chip iso-process, you're basically gaining a half-node of performance right there.
Transistor count does not equal performance. More transistors isn't necessarily going to speed up any random single-threaded bottleneck.
Again, the CRAY-1 CPU is around 42000 mm^2, so I'm guessing you'd rather run that today, right?
Man that either hella discounts all the amazing work Apple’s CPU engineers are doing or hyping up what AMD’s have done. Idk
Isn't it you who is hyping up Apple here when you don't even compare the two using similar architecture? Compare a 5nm AMD laptop low power cpu to Apple M1 and the M1 no longer looks that much better at all.
I wouldn't discount what Apple has done... they've created and integrated some really good niche stuff in their CPUs to do more than typical ARM designs. The graphics cores are pretty good in their own right even. Not to mention the OS/Software integration including accelerated x86 and unified memory usage in practice.
AMD has done a LOT for parallelization and their server options are impressive... I mean, you're still talking 500W+ in total load, but that's across 128+ cores. Strix Halo scaling goes down impressively to the ~10-15W range under common usage, not as low as Apple does under similar loads but impressive in its own way.
Ok. When will we get the laptop with AMD CPU that is on par with a Macbook regarding battery life?
How much of the Mac's impressive battery life is due purely to CPU efficiency, and how much is due to great vertical integration and the OS being tuned for power efficiency?
It's a genuine question; I'm sure both factors make a difference but I don't know their relative importance.
I just searched for the asahi linux (Linux for M Series Macs) battery life, and found this blog post [0].
> During active development with virtual machines running, a few calls, and an external keyboard and mouse attached, my laptop running Asahi Linux lasts about 5 hours before the battery drops to 10%. Under the same usage, macOS lasts a little more than 6.5 hours. Asahi Linux reports my battery health at 94%.
[0] https://blog.thecurlybraces.com/2024/10/running-fedora-asahi...
The overwhelming majority is due to the power management software, yes. Other ARM laptops do not get anywhere close to the same battery life. The MNT Reform with 8x 18650s (24000mAh, 3x what you get in an MBP) gets about 5h of battery life with light usage.
Tight integration matters.
Look at the difference in energy usage between safari and chrome on M4s.
How much instruction perf analysis do they do to save 1% (compounded) on the most common instructions?
it's less that and more all the peripheral things. USB, wifi, bluetooth, ram, random capacitors in the VRM etc.
I think it would only be fair to compare it when running some more resource efficient system.
The Steam Deck with Windows 11 vs SteamOS is a whole different experience. When running SteamOS and doing web surfing, the fan doesn't even really spin at all. But when running Windows 11 and doing the exact same thing, it just spins all the time and the machine gets kinda hot.
Since newer CPUs have heterogeneous cores (high performance + low power), I'm wondering if it makes sense to drop legacy instructions from the low power cores, since legacy code can still be run on the other cores. Then e.g. an OS compiled the right way can take advantage of extra efficiency without the CPU losing backwards compatibility
Like o11c says, that's setting everyone up for a bad time. If the heterogenous cores are similar, but don't all support all the instructions, it's too hard to use. You can build legacy instructions in a space optimized way though, but there's no reason not to do that for the high performance cores too --- if they're legacy instructions, one expects them not to run often and perf doesn't matter that much.
Intel dropped their x86-S proposal; but I guess something like that could work for low-power cores. If you provide a way for a 64-bit OS to start application processors directly in 64-bit mode, you could set up low-power cores so that they could only run in 64-bit mode. I'd be surprised if the juice is worth the squeeze, but it'd be reasonable --- it's pretty rare to be outside 64-bit mode, and systems that do run outside 64-bit mode probably don't need all the cores on a modern processor. If you're running a 64-bit OS, it knows which processes are running in 32-bit mode, and could avoid scheduling them on reduced-functionality cores. If you're running a 32-bit OS, somehow or another the OS needs to not use those cores... either the ACPI tables are different and they don't show up for 32-bit, init fails and the OS moves on, or there is a firmware flag to hide them that must be set before running a 32-bit OS.
I don't really understand why the OS can't just trap the invalid instruction exception and migrate the offending process to a P-core. E.g. AVX-512 and similar. For very old and rare instructions it could emulate them. We used to do that with FPU instructions on non-FPU CPUs back in the 80s and 90s.
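To make the mechanism concrete, here is a minimal user-space sketch of the trap-and-retry idea on Linux (a real version would live in the kernel's scheduler). Treating cores 0-3 as the "full ISA" P-cores is an assumption purely for illustration:

    /* Sketch only: catch SIGILL, restrict the thread to cores assumed to
     * implement the full ISA, and let the faulting instruction re-execute
     * there. Core numbers are made up; Linux-specific. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static cpu_set_t full_isa_cores;   /* assumed: cores 0-3 are P-cores */

    static void on_sigill(int sig) {
        (void)sig;
        /* sched_setaffinity is a plain syscall on Linux, so calling it from a
         * handler works in practice even if POSIX doesn't bless it. */
        if (sched_setaffinity(0, sizeof(full_isa_cores), &full_isa_cores) != 0)
            _exit(1);                  /* no capable core to migrate to */
        /* Returning from the handler re-executes the faulting instruction,
         * now on a core that should implement it. */
    }

    int main(void) {
        CPU_ZERO(&full_isa_cores);
        for (int c = 0; c < 4; c++)
            CPU_SET(c, &full_isa_cores);

        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sigill;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGILL, &sa, NULL);

        /* ... run code that may contain instructions the E-cores lack ... */
        return 0;
    }

Even this toy version shows the cost: every stray instruction means a fault, a handler, and a migration, which is roughly the objection raised in the replies below.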
It's slow and annoying. What would cpuid report? If it says "yes I do AVX-512" then any old code might try to use it and get stuck on the P-cores forever even if it was only using it sparingly. If you say no then the software might never use it, so what was the benefit?
It's not impossible, but it'd be a pain in the butt. If you only use AVX-512 occasionally, it's no big deal (but it's also no big deal to just not use it). But if you use it a lot, all of a sudden your core count shrinks; you might rather run on all cores with AVX2. You might even prefer to run AVX-512 on the cores that can and AVX2 on those that can't... but then you need to be able to gather information on which cores support what, and pin your threads so they don't move. If you pull in a library, who knows what it does... lots of libraries assume they can call cpuid at load time and adjust, but now you'd need that per-thread.
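Roughly, the per-thread version would have to look something like this sketch (this is not what existing libraries do; the split-ISA CPU, the core count, and the dispatched kernels are all assumptions). The key ordering is pin first, probe second:

    /* Sketch: per-thread dispatch on a hypothetical CPU whose cores differ
     * in AVX-512 support. Pin first, probe second, so the answer is stable. */
    #define _GNU_SOURCE
    #include <cpuid.h>
    #include <pthread.h>
    #include <sched.h>
    #include <stdbool.h>
    #include <stdio.h>

    static bool core_has_avx512f(void) {
        unsigned a, b, c, d;
        /* CPUID leaf 7, subleaf 0: AVX-512F is EBX bit 16. Real detection
         * would also check OSXSAVE/XGETBV state; omitted for brevity. */
        if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
            return false;
        return (b >> 16) & 1;
    }

    static void *worker(void *arg) {
        long core = (long)arg;

        /* Pin this thread to one core so its feature set can't change. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        /* Now the probe describes the core we will actually stay on. */
        printf("core %ld: %s path\n", core,
               core_has_avx512f() ? "AVX-512" : "AVX2");
        /* ... run the kernel chosen for this core ... */
        return NULL;
    }

    int main(void) {
        pthread_t t[4];                /* assumed 4 worker cores */
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

And every library loaded into the process would have to follow the same discipline, which is exactly the "who knows what they do" problem.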
That seems like a lot of change for OS, application, etc. If you run commercial applications, maybe they don't update unless you pay them for an upgrade, and that's a pain, etc.
We still do that with some microcontrollers! https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html#index-mf...
Interesting but it would be pretty rough to implement. If you take a binary now and run it on a core without the correct instructions, it will SIGILL and probably crash. So you have these options:
Create a new compilation target
- You'll probably just end up running a lot of current x86 code exclusively on performance cores to a net loss. This is how RISC-V deals with optional extensions.
Emulate
- This already happens for some instructions but, like above, could quickly negate the benefits
Ask for permission
- This is what AVX code does now; the onus is on the programmer to check whether the optional instructions can be used. But you can't drop many instructions and still expect anybody to opt in.
Ask for forgiveness
- Run the code anyway and catch illegal instruction exceptions/signals, then move to a performance core. This would take some deep kernel surgery for support. If this happens remotely often it will stall everything and make your system hate you.
The last one raises the question: which instructions are we considering 'legacy'? You won't get far in an x86 binary before running into an instruction operating on memory that, in a RISC ISA, would mean first a load instruction, then the operation, then a store. Surely we can't drop those.
The "ask for permission" approach doesn't work because programs don't expect the capability of a CPU to change. If a program checked a minute ago that AVX512 is available, it certainly expects AVX512 to be continually available for the lifetime of the process. That means chaos if the OS is moving processes between performance and efficiency cores.
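That probe-once assumption looks roughly like this (the kernels and names are purely illustrative; __builtin_cpu_supports is the GCC/Clang builtin, and it caches a CPUID probe done at program startup):

    /* The usual "ask for permission" flow: probe once, cache, dispatch forever.
     * If the process later lands on a core without AVX-512, the cached flag is
     * stale and the fast path dies with SIGILL. */
    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative kernels; real code would use intrinsics. */
    static void sum_avx512(const float *x, int n, float *out) {
        float s = 0;
        for (int i = 0; i < n; i++) s += x[i];
        *out = s;
    }
    static void sum_avx2(const float *x, int n, float *out) {
        float s = 0;
        for (int i = 0; i < n; i++) s += x[i];
        *out = s;
    }

    int main(void) {
        /* GCC/Clang builtin; the underlying CPUID probe ran once at startup. */
        bool have_avx512f = __builtin_cpu_supports("avx512f");

        float data[8] = {1, 2, 3, 4, 5, 6, 7, 8}, result;
        if (have_avx512f)
            sum_avx512(data, 8, &result);   /* assumes the feature never goes away */
        else
            sum_avx2(data, 8, &result);
        printf("%f\n", result);
        return 0;
    }

Nothing in that pattern re-checks, so an OS that silently moves the process to a weaker core breaks a contract the program reasonably assumed.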
IIRC, there were several smartphone SoCs that dropped 32-bit ARM support from most but not all of their CPU cores. That was straightforward to handle because the OS knows which instruction set a binary wants to use. Doing anything more fine-grained would be a nightmare, as Intel found out with Alder Lake.
This is the flip side of Intel trying to drop AVX512 on their E cores in the 12th generation processors. It didn't work. It requires the OS to know which processes need AVX512 before they get run. And processes themselves use cpuid to determine the capability of processors and they don't expect it to change. So you basically must determine in advance which processes can be run on E cores and never migrate between cores.
What if the kernel handled unimplemented instruction faults by migrating the process to a core that does implement the instruction and restarting the faulting instruction?
Sounds great for performance.
Would this be more or less costly than a page fault? It seems like it would be easy to arrange for it to happen very rarely unless none of your cores support all the instructions.
We've seen CPU-capability differences by accident a few times, and it's always a chaotic mess leading to SIGILL.
The kernel would need to have a scheduler that knows it can't use those cores for certain tasks. Think about how hard you would have to work to even identify such a task ...
Current Windows or Linux executable formats don't even list the instructions used, though. And even if they did, what about dynamic libraries? A program may decide to load a library at any time it wishes, and the OS has no way of knowing which instructions will be used this time.
I think it is not really the execution units for simple instructions that take up much chip area on application-class CPUs these days, but everything around them.
I think support in the OS/runtime environment* would be more interesting for chips where some cores have larger execution units, such as vector or matmul units. Especially for embedded / low-power systems.
Maybe x87/MMX could be dropped though.
* BTW, if you want to find research papers on the topic, a good search term is "partial-ISA migration".
This was a terrible idea when we tried it on ARM, and it'll remain a terrible idea on AMD64 as well.
There's just too many footguns for the OS running on such a SoC to be worth it.
That is quite a confession from AMD. It's not x86 itself that's the problem, just every implementation of it. It's not like the ARM processors in Macs are simple any more, that's for sure.
There are a lot of theoretical articles which claim similar things, but on the other hand we have a lot of empirical evidence that ARM CPUs are significantly more power efficient.
I've used laptops with both Intel and AMD CPUs, and I read/watch a lot of reviews in the thin-and-light laptop space. Although AMD became more power efficient than Intel in the last few years, the AMD alternative is only marginally more efficient (like 5-10%). And AMD is using TSMC fabs.
On the other hand, Qualcomm's recent Snapdragon X series CPUs are significantly more efficient than both Intel and AMD in most tests, while providing the same or sometimes even better performance.
Some people mention the efficiency gains on Intel Lunar Lake as evidence that x86 is just as efficient, but Lunar Lake was still slightly behind in battery life and performance, while using a newer TSMC process node compared to Snapdragon X series.
So, even though I see theoretical articles like this, the empirical evidence says otherwise. Qualcomm will release their second generation Snapdragon X series CPUs this month. My guess is that the performance/efficiency gap with Intel and AMD will get even bigger.
I think both can be true.
A client CPU spends most of its life idling. Thus, the key to good battery life in client computing is, generally, idle power consumption. That means low core power draw at idle, but it also means shutting off peripherals that aren't in use, turning off clock sources for said peripherals, etc.
ARM was built for low-power embedded applications from the start, and thus low-power idle states are integrated into the architecture quite elegantly. x86, on the other hand, has the SMM, which was an afterthought.
AFAICT the case for x86 ~ ARM perf equivalence is based on the argument that instruction decode, while empirically less efficient on x86, is such a small portion of a modern, high-performance pipeline that it doesn't matter. This reasoning checks out IMO. But this effect would only be visible while the CPU is under load.
I'm glad an authoritative source has stated this. It's been ongoing BS for years. I first got into ARM machines with the Acorn Archimedes and even back then, people were spouting some kind of intrinsic efficiency benefit to ARM that just didn't make any sense.
It is true for small cores, it's just not important for big cores compared to all the other things you have to solve.
The ISA is the contract or boundary between software and hardware. While there is a hardware cost to decode instructions, the question is how much?
As all the fanbois in the thread have pointed out, Apple's M series is fast and efficient compared to x86 for desktop/server workloads. What no one seems to acknowledge is that Apple's A series is also fast and efficient compared to other ARM implementations in mobile workloads. Apple sees the need to maintain both M and A series CPUs for different workloads, which indicates there's a benefit to both.
This tells me the ISA decode hardware isn't the bottleneck, or at least isn't the only one.
And yet... the world keeps proving Intel and AMD wrong on this premise with highly efficient ARM parts. Sure, there are bound to be improvements to make on x86, but ultimately it's a variable-length opcode encoding with a complex decoder path. If nothing else, this is likely a significant issue compared to the nicely word-aligned opcode encoding ARM has, and surely, given apples-to-apples core designs, the opcode decoding would be a deciding factor.
> it's a variable-length opcode encoding with a complex decoder path
In practice, the performance impact of variable length encoding is largely kept in check using predictors. The extra complexity in terms of transistors is comparatively small in a large, high-performance design.
Related reading:
https://patents.google.com/patent/US6041405A/en
https://web.archive.org/web/20210709071934/https://www.anand...
Jim Keller has had a storied career in the x86 world, so it isn't surprising he speaks fondly of it. Regardless:
>So fixed-length instructions seem really nice when you're building little baby computers, but if you're building a really big computer, to predict or to figure out where all the instructions are, it isn't dominating the die. So it doesn't matter that much.
Well, efficiency advantages are the domain of little baby computers. Better predictors give you deeper pipelines without stalls, which give you higher clock speeds, and with them higher wattages.
PPA (power, performance, area) results comparing x86 to ARM say otherwise; take a look at Ryzen's embedded series and Intel's latest embedded cores.
Have they reached Apple M-level performance/watt after half a decade of the Apple M parts being out? Do either AMD or Intel beat Apple in any mobile metric?
Yes and yes - again, go look at the published results.
I have; the M4 is significantly ahead by any measure I've seen, with Snapdragon just behind.
https://www.xda-developers.com/tested-apple-m4-vs-intel-luna...
The M4 is not on the same fab technology, so it's not comparable. If you want to discuss the validity of a CPU architecture, it needs to be between comparable fab technologies; the M4 being a generation ahead there makes the comparison unfair.
If you compare like to like, the difference almost completely disappears.
The X Elite uses an older node than Lunar Lake and does better.
You mean how they all use TSMC N3 and the two ARM parts still beat out Lunar Lake? It's not like it's a whole generational node behind here. I get they aren't the exact same process node, but still.
No they haven't.