I find it odd that given the billions of dollars involved, no competitor has managed to replicate the functions of CUDA.
Is it that hard to do, or is the software lock-in so great?
The problem is that CUDA is tightly integrated with NVIDIA hardware. You don't just have to replicate CUDA (which is a lot of tedious work at best), but you also need the hardware to run your "I can't believe it's not CUDA".
I'm pretty sure it's a political limitation, not a technical one. Implementing it is definitely a pain - it's a mix of hardcore backwards compatibility (i.e. cruft) and a rapidly moving target - but it's also obviously just a lot of carefully chosen ascii written down in text files.
The non-nvidia hardware vendors really don't want cuda to win. AMD went for open source + collaborative in a big way, opencl then hsa. Both broadly ignored. I'm not sure what Intel are playing at with spirv - that stack doesn't make any sense to me whatsoever.
Cuda is alright though, in a kind of crufty obfuscation over SSA sense. Way less annoying than opencl certainly. You can run it on amdgpu hardware if you want to - https://docs.scale-lang.com/stable/ and https://github.com/vosen/ZLUDA already exist. I'm hacking on scale these days.
The thing that's also worth saying is that everyone speaks vaguely about CUDA's "institutional memory" and investment and so forth.
But the concrete quality of CUDA and Nvidia's offerings generally is a move toward general-purpose parallel computing. Parallel processing is "the future", and the approach of just writing a loop and having each iteration run in parallel is dead simple.
Which is to say Nvidia has invested a lot in making "easy things easy along with hard things no harder".
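To make that concrete, here is a minimal sketch of the "parallel loop" model (a stock SAXPY, not taken from the article or from anything Nvidia ships): the body of what would be a CPU for-loop becomes a kernel, and each iteration becomes one GPU thread.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each GPU thread does the work of one iteration of the CPU loop.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's "loop index"
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));   // unified memory keeps the sketch short
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // launch one thread per "iteration"
        cudaDeviceSynchronize();

        printf("y[0] = %f\n", y[0]);  // expect 4.0
        cudaFree(x);
        cudaFree(y);
        return 0;
    }

The hard parts (occupancy, memory coalescing, tensor cores) are still there when you need them, but the entry point really is this simple.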
In contrast, other chip makers seem to be acculturated to the natural lock-in of having a dumb, convoluted interface to compensate for a given chip being high performance.
Because most fail to understand what makes CUDA great, and keep trying to replicate only the C++ API.
They overlook that CUDA is a polyglot ecosystem with C, C++ and Fortran as its main languages, plus a Python JIT DSL since this year, compiler infrastructure for any compiler backend that wishes to target it (of which there are a few, including strange stuff like Haskell), IDE integration with Eclipse and Visual Studio, and graphical debugging just like on the CPU.
It is like when Khronos puts out those spaghetti-riddled standards, expecting each vendor or open source community to create some kind of SDK, versus the vertical integration of console devkits and proprietary APIs, and then asking why professional studios have no qualms with proprietary tooling.
Slight correction: CUDA Python JIT has existed for a very long time. Warp is a latecomer.
Kind of, but none of those are at the integration level of CUTLASS 4 and the new cuTile architecture introduced at GTC 2025.
But you're right there was already something in place.
I took a closer look at some of that and it’s pretty cool. Definitely neat to have some good higher level abstractions than the old C-style CUDA syntax that Numba was built on.
Cuda is so many things I'm not sure it is even possible to replicate it.
It is hard to do in the sense that it requires a very good taste about programming languages, which in turn requires really listening to the customers, and that requires huge number of people who are skilled. And no one has really invested that much money into their software ecosystem yet.
CUDA does involve a massive investment for Nvidia. It's not that it's impossible to replicate the functionality. But once a company has replicated that functionality, that company basically is going to be selling at competitive prices, which isn't a formula for high profits.
Notably, AMD funded a CUDA clone, ZLUDA, and then quashed it[1]. Comments at the time here involved a lot of "they would always be playing catch up".
I think the mentality of chip makers generally is that they'd rather control a small slice of a market than fight competitively for a large slice. It makes sense in that they invest years in advance and expect those investments to pay high profits.
[1] https://www.tomshardware.com/pc-components/gpus/amd-asks-dev...
Cuda isn't a massive investment, it's 20 years worth of institutional knowledge with a stable external api. There are very few companies outside of 00s Microsoft who have managed to support 20 years worth of backward compatibility along with the bleeding edge.
> Cuda isn't a massive investment
> it's 20 years worth of institutional knowledge with a stable external api
> There are very few companies outside of 00s Microsoft who have managed to support 20 years worth of backward compatibility along with the bleeding edge.
To me that sounds like a massive investment.
ZLUDA was quashed due to concerns about infringement / violating terms of use.
That was the story, but the legality of cloning an API/ABI/etc. is well established by, for example, Google v. Oracle (though with gotchas that might let Nvidia put up a legal fight).
A better question is why there is no stronger push for a nicer GPU language that's not tied to any particular GPU and serves any purpose of GPU usage (whether it's compute or graphics).
I mean efforts like rust-gpu: https://github.com/Rust-GPU/rust-gpu/
Combine such language with Vulkan (using Rust as well) and why would you need CUDA?
I think Intel Fortran has some ability to offload to their GPUs now. And Nvidia has some stuff to run(?) CUDA from Fortran.
Probably just needs a couple short decades of refinement…
One of the reasons CUDA won over OpenCL was that NVidia, contrary to Khronos, saw value in helping those HPC researchers move their Fortran code onto the GPU.
Hence they bought PGI, and improved their compiler.
Intel eventually did the same with oneAPI (which isn't plain OpenCL, rather an extension with Intel goodies).
I was on a Khronos webinar where the panel showed disbelief that anyone would care about Fortran, oh well.
That’s actually pretty surprising to me. Of course, there are always jokes about Fortran being some language that people don’t realize is still kicking. But I’d expect a standards group that is at least parallel computing adjacent to know that it is still around.
Yet not only did they joke about Fortran, it took CUDA's adoption success for them to take C++ seriously and come up with SPIR as a counterpoint to PTX.
Which in the end was worthless because both Intel and AMD botched all OpenCL 2.x efforts.
Hence OpenCL 3.0 is basically OpenCL 1.0 rebranded, and SYCL went its own way.
It took a commercial company, Codeplay, a former compiler vendor for games consoles, to actually come up with good tooling for SYCL.
Which Intel, in the middle of extending SYCL with their Data Parallel C++, eventually acquired.
Those products are the foundation of oneAPI, and naturally go beyond what barebones OpenCL happens to be.
The mismanagement Khronos has done with OpenCL is one of the reasons Apple cut ties with Khronos.
It's insane how big the NVidia dev kit is. They've got a library for everything. It seems like they have as broad software support as possible.
Mojo might be what you are looking for: https://docs.modular.com/mojo/manual/gpu/intro-tutorial/
The language is general, but the current focus is really on programming GPUs.
I like Julia for this. Pretty language, layered on LLVM like most things. Modular are doing interesting things with Mojo too. People seem to like cuda though.
CUDA is just DOA as a nice language, being Nvidia-only (not counting efforts like ZLUDA).
That's a compiler problem. One could start from clang -xcuda and hack onwards. Or work in the intersection of CUDA and HIP, which is relatively broad if a bit of a porting nuisance.
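As a hedged illustration of that CUDA/HIP intersection (the gpu* aliases below are made up for this sketch, not any real library): the device-side language is spelled identically on both stacks, so a thin alias layer over the runtime prefixes lets one source file build with either nvcc or hipcc.

    #if defined(__HIPCC__)
      #include <hip/hip_runtime.h>
      #define gpuMalloc            hipMalloc
      #define gpuFree              hipFree
      #define gpuDeviceSynchronize hipDeviceSynchronize
    #else
      #include <cuda_runtime.h>
      #define gpuMalloc            cudaMalloc
      #define gpuFree              cudaFree
      #define gpuDeviceSynchronize cudaDeviceSynchronize
    #endif

    // __global__, blockIdx, threadIdx and the <<<...>>> launch syntax are
    // common to CUDA and HIP, so the kernel itself needs no changes.
    __global__ void scale(float* v, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= s;
    }

    int main() {
        const int n = 1024;
        float* v = nullptr;
        gpuMalloc((void**)&v, n * sizeof(float));      // same call shape on both runtimes
        scale<<<(n + 255) / 256, 256>>>(v, 2.0f, n);
        gpuDeviceSynchronize();
        gpuFree(v);
        return 0;
    }

The porting nuisance shows up in the bits outside the intersection: warp size (32 vs 64 on some AMD parts), inline PTX, and the library ecosystem (cuBLAS vs rocBLAS and so on).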
Maybe, but who is working on that compiler? And the whole ecosystem is controlled by a nasty company. You don't want to deal with that.
Besides, I'd say Rust is a nicer language than CUDA dialects.
Chris and Nick originally, a few more of us these days. Spectral compute. We might have a nicer world if people had backed opencl instead of cuda but whatever. Likewise rust has a serious edge over c++. But to the compiler hacker, this is all obfuscated SSA form anyway, it's hard to get too emotional about the variations.
Until Rust gets into any of industry compute standards, being a nicer language alone doesn't help.
Khronos standards, CUDA, ROCm, One API, Metal, none of them has Rust on their sights.
The world did not back OpenCL because it was stuck on primitive C99 text-based tooling, without an ecosystem.
Also Google decided to push their Renderscript C99 dialect instead, while Intel and AMD were busy delivering janky tools and broken drivers.
That's simply not true, because standards should operate at the IR level, not the language level. You have to generate some IR from your language, and at that level it makes sense to talk about standards. The only exception is probably WebGPU, where Apple pushed for using a fixed language instead of an IR, which was a limiting idea.
None of those standards are about IR.
Also SPIR worked so great for OpenCL 2.x, that Khronos rebooted the whole mess back to OpenCL 1.x with OpenCL 3.0 rebranding.
They are pretty much about IR when it comes to language interchange. SPIR-V is explicitly an IR that can be targeted from a lot of different languages.
And so far not much has been happening, hence Shader Languages at Vulkanised 2026.
https://www.khronos.org/events/shading-languages-symposium-2...
These kinds of projects are exactly where it's happening.
Language would matter more for those who actually would want to write some programs in it. So I'd say rust-gpu is something that should get more backing.
Tooling and ecosystem, that is why.
Rust has great tooling and ecosystem. The point here is more about the interest of those who want better alternatives to CUDA. AMD would be an obvious beneficiary of backing the above, so I'm surprised at the lack of interest from their likes.
It has zero CUDA tooling; that is what is relevant when positioning itself as an alternative to C, C++, Fortran, Python JIT, PTX-based compilers, compute libraries, Visual Studio and Eclipse integration, and a graphical debugger.
Cross compiling Rust into PTX is not enough to make researchers leave CUDA.
And CUDA has zero non CUDA tooling. That's a pointless circular argument which doesn't mean anything. Rust has Rust tooling and it's very good.
Being language agnostic is also not the task of the language, but of the IR. There is already a bunch of languages, such as Slang. The point is to use Rust itself for this.
Where is the graphical debugging experience for Rust, given that it has such great tooling?
Slang belongs to NVidia, and was nicely given to Khronos, because almost everyone started relying on HLSL, given that Khronos decided not to spend any additional resources on GLSL.
Just like with Mantle and Vulkan, it seems that without external help Khronos hasn't been able to produce anything meaningful since the Longs Peak days.
Vulkan is at 95% of CUDA performance already. The remaining 5% is CUDA's small dispatch logic.
The reason why people continue to use CUDA and Pytorch and so on is because they are literally too stupid and too lazy to do it any other way
With zero tooling, hence why no one cares about Vulkan, other than Valve and Google.
What tooling do you need? I'll make it for you for free
Great, lets start with a Fortran compiler like CUDA has.
When you're done, you can create IDE plugins, and a graphical debugger with feature parity to Nsight.
Ok, that's a good retort. How many months of work do those things save you, compared to actually solving the problem you want to solve without those tools?
The argument you are making sounds to me like, "well good luck making a Vulkan application without cmake, ninja, meson, git, visual studio, clion" etc, when in reality a 5 line bash script to gcc works just fine
Wrong analogy. You have no idea how wrong you are. Just look at the difference in performance analysis tools for AMD and Nvidia for GPUs. Nvidia makes it simple for people to write GPU programs.
I do have an idea of how wrong I am.
Nvidia's own people are the ones who have made Vulkan performance so close to CUDA's. AMD is behind, but the data shows that they're off in performance proportional to the cost of the device. If they implement coop mat 2, then they would bridge the gap.
99.9% of people who use Pytorch and so on could achieve good enough performance using a "simple vulkan backend" for whatever Python stuff they're used to writing. That would strip out millions of lines of code.
The reason nobody has done this, outside of a few GitHub projects that Nvidia themselves have contributed to, is that there isn't a whole lot of money in iterative performance gains when better algorithmic approaches are being invented nearly every month.
The first step is to understand why proprietary technology gets adoption.
Without that understanding, any alternative is doomed to failure.
This reminds me a lot of Seymour Cray's two maxims of supercomputing: get the data where it needs to be when it needs to be there, and get the heat out. Still seems to apply today!
Calls to mind his other famous quote, "Would you rather plow your field with two strong oxen or 1024 chickens?"
How about ten billion chickens?
Yeah. I feel like he's still partially vindicated with things like the dragonfly topology, as a lot of problems don't nicely map onto a 2D or 3D topology (so the longest distance is still the limiting factor). But the chicken approach certainly scales better, and I feel like since Cray's time there are more locality-aware algorithms around.
Can someone tell me if the challenges the article describes and indeed the frameworks they mention are mostly relevant for training or also for inference?
The fast interconnect between nodes has applications in inference at scale (big KV caches and other semi-durable state, multi-node tensor parallelism on mega models).
But this article in particular is emphasizing extreme performance ambitions for columnar data processing with hardware acceleration. Relevant to many ML training scenarios, but also to other kinds of massive MapReduce-style (or at least MapReduce-scale) workloads. There are lots of applications for a "magic massive petabyte-plus DataFrame" (which I don't think is solved in the general case).
The underlying problem here is real and legitimately difficult. Shunting data around a cluster (ideally as parts of it fall over) to minimise overall time, in an application independent fashion, is a definable dataflow problem and also a serious discrete optimisation challenge. The more compute you spend on trying to work out where to move the data around, the less you have left over for the application. Also tricky working out what the data access patterns seem to be. Very like choosing how much of the runtime budget to spend on a JIT compiler.
This _should_ break down as: running already-optimised programs on their runtime makes things worse, and running less-carefully-structured ones makes things better, where many programs out there turn out to be either quite naive or obsessively optimised for an architecture that hasn't existed for decades. I'd expect this runtime to be difficult to build but with high value on success. Interesting project, thanks for posting it.
One thing that's not addressed here is that the bigger you scale your shared-memory cluster, the closer you get to a 100% chance that one node fucks up and corrupts your global memory space.
Currently the fastest way to get data from node A to node B is to RDMA it, which means that any node can inject anything into your memory space.
I'm not really sure how Theseus guards against that.
I’m not sure any system prevents RDMA from ruining your day :(
Back in grad school I remember we did something fairly simple but clearly illegal and wedged the machine so bad the out of band management also went down!
> wedged the machine so bad the out of band management also went down!
Now that's living the dream of a shared cluster!
This is hazy now, but I do remember a massive outage of a Lustre cluster, which I think was because there was a dodgy node injecting crap into everyone's memory space via the old Lustre fast-filesystem kernel driver. I think they switched to NFS export nodes after that (for the render farm and desktops, at least).
I don't understand, doesn't kauldron[0] already exist?
[0] https://github.com/google-research/kauldron
As of today, GPUs are just too expensive for data processing. The direction they took makes it a very hard sell.
Lot of hype, but man does Voltron Data keep blowing me away with what they bring out. Mad respect.
> There’s a strong argument to be made that RAPIDS cuDF/RAPIDS libcudf drives NVIDIA’s CUDA-X Data Processing stack, from ETL (NVTabular) and SQL (BlazingSQL) to MLOps/security (Morpheus) and Spark acceleration (cuDF-Java).
Yeah this seems like the core indeed, libcudf.
The focus here is on TCP & GPUDirect (Nvidia's PCIe peer-to-peer, letting, for example, RDMA happen without CPU involvement across a full GPU -> NIC -> switch -> NIC -> GPU path).
Personally it feels super dangerous to just trust Nvidia on all of this, to just buy the solution available. Pytorch sees this somewhat; it nicely adopted & took over Facebook/Meta's gloo project, which wraps a lot of the RDMA efforts. But man, Theseus is just so, so many steps ahead here in figuring out & planning what to do with these capabilities, these ultra-efficient links, and figuring out how not to need to use them where possible! The coordination problems keep growing in computing. I think of RISC-V with its vector-based alternative to conventional x86 SIMD, going from a specific instruction for each particular operation to instructions parameterized over different data lengths & types. https://github.com/pytorch/gloo
I'd really like to see a concerted effort around Ultra Ethernet emerge, fast. Hardware isn't really available yet, and it's going to start out being absurdly expensive. But Ultra Ethernet looks like a lovely mix of collision-less, credit-based InfiniBand RDMA and Ethernet, with lots of other niceties (transport security). Deployments are just starting (AMD Pensando Pollara 400 at Oracle). I worry that without broader availability & interest, without mass saturation, AI is going to stay stuck on libcudf forever; getting hardware out there & getting software stacks using it is a chicken & egg problem that big players need to spend real effort on, accelerating UET or else. https://www.tomshardware.com/networking/amd-deploys-its-firs...
Our MI300x boxes have had 8x400G Thor2 RDMA working great for a year now.